Large Instances Meeting Notes 2025-09-02

Recording: https://drive.google.com/file/d/1Bg0dL1Q0Ak2iTOYYhaKkluX7noGt57Cc/view?usp=drive_web

Transcript: https://docs.google.com/document/d/1Zfzig3AgFQbHYm8iYG_Fh8jcHr2Hb7uVNT2mmJ8AyAU/edit?usp=drive_web

AI Summary

Attendees: Braden MacDonald, Felipe Montoya, Gábor Boros, Jhony Avella, Maksim Sokolskiy, Nihad Rahim

Key Updates & Discussions

  1. Spot vs. On-Demand Instances (Jhony)

  • Observed AWS clusters shifting workloads from spot to on-demand instances, despite fallback settings.

  • Implemented three capacity types: full on-demand, spot with fallback, and full spot.

  • Currently using Cluster Autoscaler, but considering testing Karpenter for better optimization.
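
As an illustration only, the intended behaviour of the three capacity types can be sketched in Python. The names below (`CapacityType`, `pick_purchase_option`, `spot_available`) are hypothetical and not taken from the team's configuration or from any autoscaler API. The sketch shows only the decision each type is meant to encode; the issue reported above would correspond to workloads landing on on-demand nodes more often than this policy intends.

```python
from enum import Enum
from typing import Optional


class CapacityType(Enum):
    # Hypothetical labels for the three capacity types named in the notes.
    ON_DEMAND = "full on-demand"
    SPOT_WITH_FALLBACK = "spot with fallback"
    SPOT = "full spot"


def pick_purchase_option(capacity_type: CapacityType, spot_available: bool) -> Optional[str]:
    """Return "spot", "on-demand", or None (no node can be provisioned)."""
    if capacity_type is CapacityType.ON_DEMAND:
        # Never uses spot, regardless of availability.
        return "on-demand"
    if capacity_type is CapacityType.SPOT:
        # Never falls back; the workload waits if spot is unavailable.
        return "spot" if spot_available else None
    # SPOT_WITH_FALLBACK: prefer spot, use on-demand only when spot is unavailable.
    return "spot" if spot_available else "on-demand"
```

Under this reading, a "spot with fallback" workload should only reach on-demand capacity when no spot capacity is available.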

  2. Cypress Enterprise Tests (Felipe & Maksim)

  • Some teams started using the upstream Open edX Cypress test repository.

  • Maksim to contribute PRs upstream and collaborate with Daniel Koga for test reviews.

  3. Picasso & Sumac Build Improvements (Felipe & Gábor)

  • SSH key requirement in Picasso builds now optional.

  • Builds on GitHub’s free runners fail for large MFE customizations (e.g., Sumac).

  • Larger runners (e.g., 16 GB) are sometimes necessary, though the goal is for users to be able to build on free runners where possible.

  4. OpenCraft Updates (Gábor & Braden)

  • Experimented with Picasso + Harmony-based clusters for automated deployments.

  • Successfully integrated ArgoCD + Argo Workflows for resource creation (MySQL, MongoDB, etc.).

  • New cluster being created with this stack; migrations and automation work underway.

  5. Search Benchmarking (Braden)

  • Team evaluating TypeSense vs. Meilisearch for large instances (Axim project).

  • Built a plugin to support TypeSense in courseware/forum search.

  • Meilisearch recently added sharding to their new “enterprise” offering but still lacks high availability; TypeSense offers clustering but requires more memory.

  6. Deployment & MFE Automation (Nihad, Maksim, Braden)

  • Nihad asked how teams automate MFE pipelines (currently manual with bastion server + shared images).

  • Felipe: standard approach uses Tutor plugins + Picasso (image builds) + Drydock (production config).

  • Maksim: Raccoon Gang’s automation uses Ansible playbooks + GitLab runners; the biggest challenges appear with multi-tenancy and per-client MFE builds.

  • Braden: different orgs use different methods depending on needs; upcoming frontend-base project will unify MFE builds into a single repo, simplifying workflows.


Takeaways

  • AWS scheduling bias toward on-demand instances remains unresolved but partly mitigated.

  • Stronger collaboration starting around Cypress testing.

  • Harmony + Argo integration shows promise for automated cluster resource creation.

  • Ongoing evaluation of search engines (Meilisearch vs. TypeSense).

  • Multiple MFE automation strategies exist; future consolidation efforts aim to simplify.