Large Instances Meeting Notes 2026-02-03

Transcript | Recording

In attendance:

From eduNEXT: @Felipe Montoya, @Jhony Avella, @Moisés González

From OpenCraft: @Gábor Boros, @Braden MacDonald

Summary

Overall Context

The meeting focused on:

  • Operating Aspects at unprecedented traffic levels

  • Kubernetes infrastructure evolution (ingress controllers, tooling)

  • Ongoing Harmony / Drydock / Picasso deployments at OpenCraft

  • Operational tooling changes (dashboard, provisioning, automation)

Two key highlights:

  • eduNEXT’s experience running Aspects under record-breaking load and identifying what is actually required to make it work reliably.

  • OpenCraft’s launch of the new PR sandboxes system.



Aspects at Extremely High Traffic (Felipe’s Update)

During January, eduNEXT operated an Open edX instance with Aspects receiving ~7 million log inserts per day, pushing Aspects further than any previously recorded deployment.

  • This translated to hundreds of millions of rows in ClickHouse.

  • According to Axim / Data Working Group discussions, dashboard errors tend to appear around ~300 million records if not properly tuned.

  • A dedicated investigation call was held with Ty Hob, Sarah Burns, Dave Ormsbee, and eduNEXT Ops.

The system did work, but only after specific optimizations were put in place.


The Three Required Optimizations for High-Traffic Aspects (Key Takeaways)

Felipe emphasized that three specific changes were critical to successfully handling this volume of traffic:

1. High Disk I/O Performance (SSD-Class or Better)

  • ClickHouse performs inserts in a non-sequential (random I/O) pattern, not simple append-only writes.

  • Disk performance was explicitly tested using tools like dd configured to simulate random writes.

  • Slow disks were a hard blocker — HDD-level performance is insufficient.

  • Requirement: SSD-quality disks (or better) with strong random write performance.

➡️ Without this, ClickHouse becomes the bottleneck long before CPU or memory.
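The random-write requirement above can be sanity-checked with a small script. This is a rough sketch, not the test eduNEXT ran (the notes mention dd, which mainly measures sequential throughput); it writes 4 KiB blocks at random offsets in a scratch file and reports an approximate random-write rate.

```python
import os
import random
import tempfile
import time

# Rough random-write micro-benchmark: writes small blocks at random
# offsets, the access pattern ClickHouse inserts tend to produce.
BLOCK = 4096                   # 4 KiB blocks, typical for random-I/O tests
WRITES = 2000                  # number of random writes
FILE_SIZE = 64 * 1024 * 1024   # 64 MiB scratch file

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, FILE_SIZE)
    payload = b"\0" * BLOCK
    start = time.perf_counter()
    for _ in range(WRITES):
        # Pick a random aligned-ish offset and write one block there.
        offset = random.randrange(0, FILE_SIZE - BLOCK)
        os.pwrite(fd, payload, offset)
    os.fsync(fd)  # force the writes to actually reach the disk
    elapsed = time.perf_counter() - start
    iops = WRITES / elapsed
    print(f"~{iops:.0f} random {BLOCK // 1024} KiB writes/s")
finally:
    os.close(fd)
    os.remove(path)
```

For serious benchmarking a purpose-built tool such as fio (with a random-write job) gives more reliable numbers; the point here is only that the test must use random, not sequential, writes.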


2. ClickHouse “Fire-and-Forget” Insert Mode via Ralph

  • The Aspects pipeline uses Ralph to push data into ClickHouse.

  • By default, Ralph waits for ClickHouse to confirm inserts.

  • At very high throughput, this caused Ralph to block, delaying or halting ingestion of subsequent batches.

  • Switching to fire-and-forget insert mode meant:

    • Ralph sends inserts to ClickHouse

    • Does not wait for insert confirmation

    • Keeps ingestion flowing even under heavy load

➡️ This was described as critical to prevent ingestion back-pressure and pipeline stalls.
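The notes do not name the exact Ralph-side option. At the ClickHouse level, the closest built-in mechanism is asynchronous inserts, where the server buffers rows and the client does not wait for the flush; a sketch (table and values are illustrative, not the actual Aspects schema):

```sql
-- Fire-and-forget-style insert using ClickHouse's async-insert settings:
-- async_insert buffers rows server-side; wait_for_async_insert = 0 means
-- the client returns without waiting for the buffer to be flushed.
INSERT INTO xapi.xapi_events_all
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES ('...');
```

The trade-off is durability: with `wait_for_async_insert = 0`, an acknowledged insert can still be lost if the server fails before the buffer is flushed, which is acceptable for analytics ingestion but should be a deliberate choice.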


3. Dedicated Celery Queue for Aspects

  • Aspects traffic was competing with:

    • LMS/CMS async tasks

    • Studio operations

    • Report generation

  • Under load, this contention caused cascading slowdowns.

  • Solution:

    • Separate Celery queue exclusively for Aspects

    • Isolates analytics ingestion from core platform operations

    • Prevents Aspects load from degrading LMS/CMS responsiveness

➡️ This separation made a “big difference” in system stability.
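The dedicated-queue setup can be sketched as a Celery routing fragment in Django-style settings. The task-path pattern and queue name below are assumptions for illustration, not the actual Aspects task names:

```python
# Sketch of Celery task routing for a dedicated Aspects queue.
# "event_sink.tasks.*" and the queue name "aspects" are illustrative
# assumptions; the real task paths depend on the deployment.
CELERY_TASK_ROUTES = {
    # Analytics ingestion tasks go to their own queue,
    # while LMS/CMS tasks stay on the default queue.
    "event_sink.tasks.*": {"queue": "aspects"},
}

# A worker pinned to that queue keeps Aspects load off the core workers:
#   celery -A lms worker --queues=aspects --concurrency=4
```

With this split, a burst of analytics traffic can only saturate the `aspects` workers; report generation and other platform tasks keep their own capacity.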


Outcome

  • With these three changes in place, Aspects successfully handled ~7× the previously assumed traffic limit.

  • The traffic spike lasted about a week; the system remained stable throughout.

  • Conclusion: Aspects can scale to very high traffic if infrastructure and configuration are done correctly.


Other Notable Topics (Brief)

Ingress Controller Replacement

  • Traefik evaluated as the primary open-source replacement for ingress-nginx

  • AWS Load Balancer Controller considered as a complementary option (not standalone, to avoid AWS lock-in)

  • Traefik favored due to:

    • Ingress-nginx compatibility layer

    • Gateway API support

    • Cloud-agnostic deployment

  • Consensus: Traefik + AWS controller together works well on AWS, while remaining portable. Use Traefik alone on other cloud providers.

Kubernetes Tooling

  • Kubernetes Dashboard deprecated

  • Headlamp selected as replacement (recommended upstream)

  • Move away from centralized tooling (e.g., Rancher) toward fully independent clusters

Platform & Ops Updates

  • The first production Kubernetes cluster running Picasso + Harmony + Drydock at OpenCraft is live, hosting Axim’s auto-generated PR sandboxes. It will provide faster deployments and better operational visibility to everyone.

  • OpenCraft is planning to migrate all its customers to the new stack and hopes to perform live migrations between clusters with no downtime.

  • Deployment times for existing instances reduced to ~15 minutes

  • Increasing use of ArgoCD and Argo Workflows

  • Gradual move away from OpenFaaS toward Argo-based workflows