Large Instances Meeting Notes 2026-02-03
In attendance:
From eduNEXT: @Felipe Montoya, @Jhony Avella, @Moisés González
From OpenCraft: @Gábor Boros , @Braden MacDonald
Summary
Overall Context
The meeting focused on:
Operating Aspects at unprecedented traffic levels
Kubernetes infrastructure evolution (ingress controllers, tooling)
Ongoing Harmony / Drydock / Picasso deployments at OpenCraft
Operational tooling changes (dashboard, provisioning, automation)
Two key highlights:
eduNEXT’s experience running Aspects under record-breaking load and identifying what is actually required to make it work reliably.
OpenCraft’s launch of the new PR sandboxes system.
Aspects at Extremely High Traffic (Felipe’s Update)
During January, eduNEXT operated an Open edX instance with Aspects receiving ~7 million log inserts per day, pushing Aspects further than any previously recorded deployment.
This translated to hundreds of millions of rows in ClickHouse.
According to Axim / Data Working Group discussions, dashboard errors tend to appear around ~300 million records if the deployment is not properly tuned.
A dedicated investigation call was held with Ty Hob, Sarah Burns, Dave Ormsbee, and eduNEXT Ops.
The system did work, but only after specific optimizations were put in place.
The Three Required Optimizations for High-Traffic Aspects (Key Takeaways)
Felipe emphasized that three specific changes were critical to successfully handling this volume of traffic:
1. High Disk I/O Performance (SSD-Class or Better)
ClickHouse performs inserts in a non-sequential (random I/O) pattern, not simple append-only writes.
Disk performance was explicitly tested using tools such as dd, configured to simulate random writes.
Slow disks were a hard blocker: HDD-level performance is insufficient.
Requirement: SSD-quality disks (or better) with strong random write performance.
➡️ Without this, ClickHouse becomes the bottleneck long before CPU or memory.
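The notes mention probing disks with dd; purely as an illustration of the random-write access pattern being tested (not the team's actual procedure), a minimal Python sketch might look like the following. Block size, file size, and write count are arbitrary assumptions, and without O_DIRECT the page cache inflates the numbers, so dd with oflag=direct (or fio) remains the proper tool for a real benchmark.

```python
import os
import random
import tempfile
import time

BLOCK = 4096                   # 4 KiB blocks (illustrative choice)
FILE_SIZE = 64 * 1024 * 1024   # 64 MiB sparse test file (illustrative)
N_WRITES = 2000                # ~8 MiB of random writes in total

def random_write_probe(path):
    """Write N_WRITES blocks at random offsets and return MB/s."""
    buf = os.urandom(BLOCK)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, FILE_SIZE)
        start = time.perf_counter()
        for _ in range(N_WRITES):
            # Random offsets, not append-only: the pattern that hurts HDDs.
            offset = random.randrange(0, FILE_SIZE - BLOCK)
            os.pwrite(fd, buf, offset)
        os.fsync(fd)  # force the data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return (N_WRITES * BLOCK) / (1024 * 1024) / elapsed

with tempfile.TemporaryDirectory() as d:
    mb_per_s = random_write_probe(os.path.join(d, "probe.bin"))
    print(f"random 4K write throughput: {mb_per_s:.1f} MB/s")
```

An HDD typically collapses to single-digit MB/s under this pattern, while SSDs sustain far more, which is the gap the dd testing was meant to expose.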
2. ClickHouse “Fire-and-Forget” Insert Mode via Ralph
The Aspects pipeline uses Ralph to push data into ClickHouse.
By default, Ralph waits for ClickHouse to confirm inserts.
At very high throughput, this caused Ralph to block, delaying or halting ingestion of subsequent batches.
Switching to fire-and-forget insert mode meant:
Ralph sends inserts to ClickHouse
Does not wait for insert confirmation
Keeps ingestion flowing even under heavy load
➡️ This was described as critical to prevent ingestion back-pressure and pipeline stalls.
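For reference, ClickHouse itself exposes "fire-and-forget" behaviour through its asynchronous-insert settings; the notes do not specify whether Ralph toggles these or implements the behaviour client-side, so the fragment below is a hypothetical illustration of the ClickHouse-level equivalent (the table name is also an assumption):

```sql
-- Hypothetical sketch: server buffers the insert and returns immediately,
-- without waiting for the data to be flushed to a part on disk.
INSERT INTO xapi.xapi_events_all
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (...);
```

The trade-off is the usual one for fire-and-forget: ingestion keeps flowing under load, but a failed flush is not reported back to the sender.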
3. Dedicated Celery Queue for Aspects
Aspects traffic was competing with:
LMS/CMS async tasks
Studio operations
Report generation
Under load, this contention caused cascading slowdowns.
Solution:
Separate Celery queue exclusively for Aspects
Isolates analytics ingestion from core platform operations
Prevents Aspects load from degrading LMS/CMS responsiveness
➡️ This separation was said to make a “big difference” in system stability.
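A dedicated queue of this kind is typically wired up through Celery task routing. The sketch below is a hypothetical settings fragment; the queue name and task-name patterns are assumptions for illustration, not taken from the notes.

```python
# Hypothetical sketch: route analytics/event-sink tasks to their own queue so
# they cannot starve LMS/CMS tasks on the default queue. Names are illustrative.
CELERY_TASK_ROUTES = {
    "event_routing_backends.*": {"queue": "aspects"},
    "platform_plugin_aspects.*": {"queue": "aspects"},
}
CELERY_TASK_DEFAULT_QUEUE = "default"

# A separate worker pool then consumes only the Aspects queue, e.g.:
#   celery -A lms worker -Q aspects --concurrency=4
```

With this split, a backlog of analytics inserts queues up behind its own workers instead of delaying Studio operations or report generation.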
Outcome
With these three changes in place, Aspects successfully handled ~7× the previously assumed traffic limit.
The traffic spike lasted about a week; the system remained stable throughout.
Conclusion: Aspects can scale to very high traffic if infrastructure and configuration are done correctly.
Other Notable Topics (Brief)
Ingress Controller Replacement
Traefik evaluated as the primary open-source replacement for ingress-nginx
AWS Load Balancer Controller considered as a complementary option (not standalone, to avoid AWS lock-in)
Traefik favored due to:
Ingress-nginx compatibility layer
Gateway API support
Cloud-agnostic deployment
Consensus: Traefik + AWS controller together works well on AWS, while remaining portable. Use Traefik alone on other cloud providers.
Kubernetes Tooling
Kubernetes Dashboard deprecated
Headlamp selected as replacement (recommended upstream)
Move away from centralized tooling (e.g., Rancher) toward fully independent clusters
Platform & Ops Updates
First production Kubernetes cluster running Picasso + Harmony + Drydock at OpenCraft is live, hosting Axim’s auto-generated PR sandboxes; it should provide faster deployments and better operational visibility for everyone.
OpenCraft plans to migrate all of its customers to the new stack, and hopes to perform live migrations between clusters with no downtime.
Deployment times for existing instances reduced to ~15 minutes
Increasing use of ArgoCD and Argo Workflows
Gradual move away from OpenFaaS toward Argo-based workflows