Large Instances Meeting Notes 2023-07-11

Meeting video recording: https://drive.google.com/file/d/10Fu646PITuoTNUdEZ3RCToRv-LoKacQ1/view?usp=drive_link

  1. Assign meeting lead and note taker.

  2. Greetings & introductions as needed.

  3. Updates from each org on the call - 2U, eduNEXT, OpenCraft, Raccoon Gang. What's new with your deployment(s)?

@Gábor Boros at OpenCraft was on holiday, is back now and is working on reviewing PRs where he is tagged. (For the Karpenter PR, @Jhony Avella has closed it and opened a new PR with rebased changes etc. It’s not quite ready for merge, and needs some adjustments for the new Helm chart release etc.)

@Moisés González for eduNEXT: the last two weeks have been fairly uneventful.

@Felipe Montoya The cluster that we’ve had Harmony running on for two weeks has been mostly stable, with some of the usual issues that we always see on our clusters.

We’ve been looking into the best ways to host production-grade MFEs in a cluster. So far, a CDN works far better than the alternatives we’ve tried, so we’re working on a Tutor plugin to make it easy to set up a CDN.

@Braden MacDonald : an MFE deployment just needs a few static files hosted on a CDN, doesn’t it?

OpenCraft

@Felipe Montoya : Currently, the tutor-mfe plugin builds a docker image, which is then downloaded onto the cluster and the files are served via Caddy. So without a CDN in place, Caddy can become a bottleneck. It’s true it would be nicer without the docker images as a middle step but that’s too complex for now, because it’s not how Tutor currently works.

Raccoon Gang

@Maksim Sokolskiy at Raccoon Gang: Mostly testing Harmony on my local machine, and it’s working well. Re our recent big migration to Kubernetes: everything was mostly fine, but at the end of the big exam period we encountered a huge degradation of RDS. The issue seemed to be related to the huge amount of data, including historical data, from the two-week period, so we’re looking at the retention policy. AuroraDB doesn’t provide the kind of sharding or horizontal scaling we need, and vertical scaling won’t help, so we’re trying to figure out a better solution, with something like multi-master sharding for MySQL.

@Felipe Montoya : we tried using Percona multi-master but found it was only using a single master node. Then we found a proxy SQL solution which gave way more control over how transactions were routed, but it failed spectacularly due to transaction atomicity issues.

@Jeremy Bowman If you’re encountering MySQL scaling issues, consider documenting them in the related discussion at https://discuss.openedx.org/t/evaluating-big-technology-changes-like-postgresql-support/10643.

eduNEXT

@Moisés González We have also been encountering some issues with Redis (ElastiCache Redis): over time it fills up and we have to purge it.

@Gábor Boros We saw a similar issue with Tutor-deployed clusters.

@Maksim Sokolskiy We saw a similar issue: RDS was locked, but Redis continued to fill up with grade tasks until it overflowed by ~8GB and eventually crashed.

@Felipe Montoya We saw heartbeat failures and at first it looked like MySQL was slow, but on a hunch I looked into Redis and found that it was the real culprit.

@Maksim Sokolskiy We enabled “frozen score” in the Django admin, because recalculating the grades on every score submission was putting a very heavy load on the system.
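Editor's sketch (not discussed in the meeting): one way to catch Redis filling up before it crashes is to compare `used_memory` against `maxmemory` from the output of the Redis `INFO memory` command. The Python below is a minimal sketch under stated assumptions: the helper names, the sample text, and the 0.8 alert threshold are hypothetical, and it only parses text you have already fetched from Redis by whatever means you use.

```python
def parse_redis_info(raw):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def memory_usage_ratio(info):
    """Return used_memory / maxmemory, or None when no memory limit is set."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", "0"))
    if limit == 0:  # in Redis, maxmemory 0 means "no limit"
        return None
    return used / limit


def should_alert(info, threshold=0.8):
    """True when memory use is at or above the (hypothetical) threshold."""
    ratio = memory_usage_ratio(info)
    return ratio is not None and ratio >= threshold


# Hypothetical sample resembling a fragment of `INFO memory` output.
SAMPLE = """# Memory
used_memory:900
maxmemory:1000
maxmemory_policy:noeviction
"""

if __name__ == "__main__":
    parsed = parse_redis_info(SAMPLE)
    print(memory_usage_ratio(parsed), should_alert(parsed))
```

A check like this could run on a schedule and feed whatever alerting is already in place. It does not address the root cause, such as grade tasks piling up in the queue.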

2U

As mentioned, we’re exploring how to evaluate large tech changes like PostgreSQL support: Big Technology Changes.

Arch-BOM, Arbi-BOM etc. teams now have a product manager, and we’re working on a roadmap. If there’s anything you want to nominate for prioritization, let me know. We’re considering dev environments that play better with k8s, and better development data.

  4. Harmony project updates: Review list of PRs and issues, and assign anything un-assigned.

  • Helm chart release is done

  • OpenSearch PR is approved and ready to merge

  • The eduNEXT autoscaling plugin is being upgraded for Tutor 16 compatibility.

What should we work on next? Monitoring, testing are good candidates.

  5. Open discussion/questions, if any.

@Moisés González If we’re all having issues with Redis, we should write it down and document it somewhere to share our findings.

@Felipe Montoya Has anyone upgraded from MySQL 5.7 to MySQL 8?

@Gábor Boros We did it on DigitalOcean managed MySQL. There’s a foreign key setting that we had to set, and then the upgrade went smoothly. On RDS it “just worked” smoothly, but there we set up a new MySQL 8 instance, then dumped the data from 5.7 and imported it.

Chat log

00:03:15 Braden MacDonald: https://openedx.atlassian.net/wiki/spaces/COMM/pages/edit-v2/3814457347

00:06:03 jhony: @Gábor Boros this is the new PR: https://github.com/openedx/openedx-k8s-harmony/pull/41

00:07:32 Gábor Boros: Thank you!

00:14:29 felipe: mysql or aurora?

00:18:03 jhony: Another option for scalability and sharding: https://vitess.io/

00:18:23 Jeremy Bowman: https://openedx.atlassian.net/wiki/spaces/AC/pages/3801743364/MySQL+vs+PostgreSQL

00:19:26 felipe: memory