Large Instances Meeting Notes 2025-09-02
Recording: https://drive.google.com/file/d/1Bg0dL1Q0Ak2iTOYYhaKkluX7noGt57Cc/view?usp=drive_web
Transcript: https://docs.google.com/document/d/1Zfzig3AgFQbHYm8iYG_Fh8jcHr2Hb7uVNT2mmJ8AyAU/edit?usp=drive_web
AI Summary
Attendees: Braden MacDonald, Felipe Montoya, Gábor Boros, Jhony Avella, Maksim Sokolskiy, Nihad Rahim
Key Updates & Discussions
Spot vs. On-Demand Instances (Jhony)
Observed AWS clusters shifting workloads from spot to on-demand instances, despite fallback settings.
Implemented three capacity types: full on-demand, spot with fallback, and full spot.
Currently using Cluster Autoscaler, but considering testing Karpenter for better optimization.
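Illustration (not from the meeting): the three capacity types above can be read as a simple decision rule. The Python sketch below is a hypothetical model only. The names `CapacityType` and `pick_instance_kind` are assumptions made for illustration, and the sketch does not reflect the actual Cluster Autoscaler or Karpenter configuration used on the clusters.

```python
from enum import Enum


class CapacityType(Enum):
    # Hypothetical labels for the three capacity types in the notes.
    ON_DEMAND = "full on-demand"
    SPOT_WITH_FALLBACK = "spot with fallback"
    SPOT = "full spot"


def pick_instance_kind(capacity_type, spot_available):
    """Return "spot", "on-demand", or None if nothing can be scheduled."""
    if capacity_type is CapacityType.ON_DEMAND:
        # Full on-demand never uses spot capacity.
        return "on-demand"
    if spot_available:
        # Both spot-based types prefer spot when it can be obtained.
        return "spot"
    if capacity_type is CapacityType.SPOT_WITH_FALLBACK:
        # Fallback: use on-demand only when spot is unavailable.
        return "on-demand"
    # Full spot with no spot capacity: the workload waits.
    return None
```

Under this model, a "spot with fallback" workload should land on on-demand instances only while spot capacity is unavailable.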
Cypress Enterprise Tests (Felipe & Maksim)
Some teams started using the upstream Open edX Cypress test repository.
Maksim to contribute PRs upstream and collaborate with Daniel Koga for test reviews.
Picasso & Sumac Build Improvements (Felipe & Gábor)
SSH key requirement in Picasso builds now optional.
Builds on GitHub’s free runners fail for large MFE customizations (e.g., Sumac).
Larger runners (e.g., 16 GB) are sometimes necessary, though the goal is for users to be able to build on free runners where possible.
OpenCraft Updates (Gábor & Braden)
Experimented with Picasso + Harmony-based clusters for automated deployments.
Successfully integrated ArgoCD + Argo Workflows for resource creation (MySQL, MongoDB, etc.).
New cluster being created with this stack; migrations and automation work underway.
Search Benchmarking (Braden)
Team evaluating Typesense vs. Meilisearch for large instances (Axim project).
Built a plugin to support Typesense in courseware/forum search.
Meilisearch recently added sharding to its new “enterprise” offering but still lacks high availability; Typesense offers clustering but requires more memory.
Deployment & MFE Automation (Nihad, Maksim, Braden)
Nihad asked how teams automate MFE pipelines (currently manual with bastion server + shared images).
Felipe: standard approach uses Tutor plugins + Picasso (image builds) + Drydock (production config).
Maksim: Raccoon Gang’s automation uses Ansible playbooks + GitLab runners; the biggest challenges are multi-tenancy and per-client MFE builds.
Braden: different orgs use different methods depending on needs; upcoming frontend-base project will unify MFE builds into a single repo, simplifying workflows.
Takeaways
AWS scheduling bias toward on-demand instances remains unresolved but partly mitigated.
Stronger collaboration starting around Cypress testing.
Harmony + Argo integration shows promise for automated cluster resource creation.
Ongoing evaluation of search engines (Meilisearch vs. Typesense).
Multiple MFE automation strategies exist; the upcoming frontend-base project aims to simplify them by unifying MFE builds in a single repo.