Large Instances Meeting Notes 2024-06-11
Assign meeting lead and note taker. @Felipe Montoya and @Braden MacDonald respectively.
Greetings & introductions as needed. n/a
Updates from each org on the call - What's new with your deployment(s)?
eduNEXT - @Moisés González tested @Braden MacDonald's fixes to the new Meilisearch indexing command. He saw a number of XBlock errors during indexing, but the fix worked and solved the problem. However, it could only index 4,000 courses per day, and a scheduled database maintenance task interrupted the run before it finished, so the indexing had to be started again. Even so, with 800,000 blocks already in the index, results returned in < 5 ms - very fast.
eduNEXT - @Moisés González would like to discuss Reaching 50k concurrent users on Open edX with Oracle Cloud later.
Q from @Maksim Sokolskiy : have you finished the K6 tests? @Moisés González : No, it’s been on hold while someone was on leave, but the work should resume soon.
eduNEXT - @Jhony Avella : We’re experiencing 502 errors in one of our clusters. We’ve usually fixed this error by adding a grace period before pods are killed, and by making sure not to send traffic to pods before they’re ready to receive it. Now we’re thinking it’s an issue with the liveness or readiness probes. We had thought it might be the ingress controller, but we updated that and it wasn’t the problem.
@Maksim Sokolskiy : we completely changed the Open edX heartbeat endpoint. Currently, for example, if MySQL stops responding, the heartbeat will fail and the pod will be dropped out of service, [even though the pod itself is healthy].
@Jhony Avella : That doesn’t seem to be our issue, because the heartbeat is OK. It could be related to our HPA scaling pods in and out very aggressively.
@Felipe Montoya : Would it make sense to make the heartbeat endpoint customizable via plugins? @Maksim Sokolskiy : It can already be customized via config.
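To illustrate the point in the heartbeat discussion above, here is a minimal sketch of keeping a "the process is alive" signal separate from a "dependencies are reachable" signal. In Kubernetes, a failing liveness probe restarts the container, while a failing readiness probe only removes the pod from service endpoints. The function name `evaluate_probes` and the dependency names are hypothetical; this is not the Open edX heartbeat implementation, nor Raccoon Gang's change.

```python
from typing import Callable, Dict, Tuple


def evaluate_probes(
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, int, Dict[str, bool]]:
    """Return (liveness_status, readiness_status, per-dependency results).

    Hypothetical helper: the dependency checks are supplied by the caller.
    """
    results: Dict[str, bool] = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False

    # If this code runs at all, the process can answer requests, so the
    # liveness status does not depend on external services.
    liveness_status = 200
    # Readiness reflects whether every dependency check passed.
    readiness_status = 200 if all(results.values()) else 503
    return liveness_status, readiness_status, results
```

With a setup like this, a MySQL outage would mark pods as not ready without restarting otherwise healthy pods. Whether that is the desired behavior depends on the deployment.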
eduNEXT asks if anyone has experience using Percona Cluster to improve performance of Open edX. @Felipe Montoya mentioned that they had wanted to use ProxySQL at some point and ran into some bugs. They later fixed those bugs, but by the time the PR was done they had moved on from trying ProxySQL, so it should be possible to use now, although they haven’t tried it.
@Maksim Sokolskiy reminds us of Raccoon Gang’s work to batch writes to the student record. They collect updates into batches using the event batch, then write them in batched transactions and this greatly reduces the load compared to writing and rebuilding indexes for each student separately. Hoping to share more results soon.
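The batching idea described above can be sketched as follows: collect updates and write each batch in a single transaction instead of committing once per student. This is only an illustration using an in-memory SQLite database; the table `student_record`, its columns, and the helper names are hypothetical and do not reflect Raccoon Gang's actual implementation or the Open edX schema.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical record shape: (student_id, block_id, grade).
Update = Tuple[int, str, float]


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical student_record table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student_record ("
        "student_id INTEGER, block_id TEXT, grade REAL, "
        "PRIMARY KEY (student_id, block_id))"
    )
    conn.commit()
    return conn


def write_batched(
    conn: sqlite3.Connection, updates: Iterable[Update], batch_size: int = 500
) -> int:
    """Write updates in batches, one transaction per batch.

    Returns the number of transactions that were committed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    transactions = 0
    batch: List[Update] = []

    def flush() -> None:
        nonlocal transactions
        if not batch:
            return
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO student_record "
                "(student_id, block_id, grade) VALUES (?, ?, ?)",
                batch,
            )
        transactions += 1
        batch.clear()

    for update in updates:
        batch.append(update)
        if len(batch) >= batch_size:
            flush()
    flush()  # write any remaining partial batch
    return transactions
```

For example, 1,200 updates with `batch_size=500` are committed in three transactions rather than 1,200.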
Raccoon Gang - @Maksim Sokolskiy : We are currently running country-wide exams in Ukraine, with something like 20k simultaneous users answering exam questions. We’re using a bit of a hack to shard users across two separate clusters. We have also found that with AuroraDB, two medium clusters cost less than one big cluster. We’re also testing our new refactoring of search, trying to reproduce the behavior using OpenSearch. It’s been a bit of a struggle, but we’re hoping to finish it up tomorrow.
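The notes do not say how the users are sharded. As one possible illustration only, a deterministic hash of the user identifier can assign each user to one of two clusters so that the same user always lands on the same cluster. The function `pick_cluster` and the cluster names below are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_cluster(user_id: str, clusters: Sequence[str]) -> str:
    """Map a user to a cluster deterministically (hypothetical scheme).

    The same user_id always maps to the same cluster as long as the
    cluster list is unchanged.
    """
    if not clusters:
        raise ValueError("at least one cluster is required")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


# Hypothetical cluster names, for illustration only.
CLUSTERS = ("cluster-a", "cluster-b")
```

A scheme like this keeps routing stateless, but changing the number of clusters would remap many users, so it suits a fixed split such as the two-cluster setup mentioned above.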
OpenCraft - @Braden MacDonald thanks @Moisés González for the helpful testing of Meilisearch features.
@Gábor Boros - We’re continuing to split our “Grove” orchestration software into separate pieces, and to integrate more pieces of Harmony. We recently separated out our CloudFront plugin: open-craft/tutor-contrib-cloudfront (CloudFront plugin for Tutor). Our goal is to completely separate cluster-level infra provisioning from instance-level infra/resources like CloudFront.
@Moisés González : we once explored Crossplane, which is like Terraform but driven by k8s manifests. We experimented with a tutor plugin that writes the k8s manifest; Crossplane then sees it and provisions the actual resources, like S3 buckets.
Harmony project updates: Review list of PRs and issues, and assign anything un-assigned.
Only one open PR and it’s under active review as mentioned.
Open discussion/questions, if any.
@Moisés González : from the tutor users group discussion, we identified two major concerns: (1) the complexity of configuring the platform, which involves python code, yaml files, tutor config, etc. It’s hard to reason about and understand what the final config will be. I suggested removing the separate tutor defaults so that we have a consistent blank slate of the platform defaults. We didn’t reach any conclusions but will continue to discuss. (2) We need a clean interface that tutor plugin authors can provide so that users can configure and control plugins for complex services like autoscaling.
@Maksim Sokolskiy mentions the question of whether tutor should be configured using python or using yaml. @Moisés González Yes, and we evaluated that and saw that the vast majority of our config is static values, not using python code.
@Moisés González : Reaching 50k concurrent users on Open edX with Oracle Cloud : this was an interesting report from Edly. They used Locust to test the performance of Open edX. The most interesting part was that they found that horizontal scaling wasn’t very effective compared to vertical scaling - “just one pod per node, and each node had maximum resource”.
@Maksim Sokolskiy : maybe there’s a mistake in the interview; it doesn’t make sense to have only 5 workers in one pod on a node with “30 OCPUs and 128 GB”. A key figure that is missing is the requests per second. He also wonders whether there’s a reliability concern with this setup - if a single such pod fails, thousands of users could be dropped?
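To make the missing requests-per-second figure concrete, here is a back-of-the-envelope sketch based on Little's law (throughput = users / (think time + response time)). All input values in the usage note are placeholders, not numbers from the Edly report, and the model assumes synchronous workers that each handle one request at a time.

```python
import math


def required_throughput(
    concurrent_users: int, think_time_s: float, response_time_s: float
) -> float:
    """Requests per second generated by a closed population of users."""
    return concurrent_users / (think_time_s + response_time_s)


def max_throughput(workers: int, response_time_s: float) -> float:
    """Upper bound on requests per second for synchronous workers."""
    return workers / response_time_s


def workers_needed(requests_per_second: float, response_time_s: float) -> int:
    """Minimum number of synchronous workers to sustain the given load."""
    return math.ceil(requests_per_second * response_time_s)
```

For instance, under the placeholder assumption of a 9.5 s think time and a 0.5 s response time, `required_throughput(50000, 9.5, 0.5)` gives the load to plan for, and `max_throughput(5, 0.5)` shows how little 5 synchronous workers could serve by comparison.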