Large Instances Meeting Notes 2023-09-19

Video recording: https://drive.google.com/file/d/1kDK_5QveO2Npqlz2ENH6P1iJNWtoAMD1/view?usp=sharing

Assign meeting lead and note taker.

Braden will lead the meeting.

Greetings & introductions as needed.

n/a

Updates from each org on the call - 2U, eduNEXT, OpenCraft, Raccoon Gang. What's new with your deployment(s)?

@Jeremy Bowman (Deactivated) - 2U - not much to say. Working on MySQL 8 upgrade at 2U.

@Gábor Boros - OpenCraft - not much either.

@Jhony Avella - eduNEXT - we introduced Harmony (nginx + cert-manager) in one of our production installations. @Moisés González : we chose one of our smaller installations. We deployed the Helm charts using ArgoCD. We used kustomize to deploy the CRDs since they’re managed outside of the Harmony Helm charts. So far it seems to be working well, but it’s new so we’ll see.

@Jhony Avella - we are exploring different options for running Open edX in lighter distributions of Kubernetes, such as k3s. We hope to have findings to share in two weeks.

@Maksim Sokolskiy - Raccoon Gang - no particular updates.

Harmony project updates: Review list of PRs and issues, and assign anything un-assigned.

https://github.com/openedx/openedx-k8s-harmony/issues/46 - eduNEXT is experimenting with hosting the database within the cluster.

https://github.com/openedx/openedx-k8s-harmony/issues/45 - want to know whether we can share components like ClickHouse among multiple Open edX instances in a cluster to reduce resource costs.

  • @Maksim Sokolskiy says Raccoon Gang may be investigating Helm deployment of ClickHouse, etc. in the near future.

https://github.com/openedx/openedx-k8s-harmony/pull/41 - still in progress. Testing is hitting some issues with Karpenter scaling. @Jhony Avella is hoping to work on this in the coming week.

Open discussion/questions, if any.

@Moisés González We are moving our old Ansible installations to k8s, and we’re finding the baseline costs are a bit higher than before. Previously you could deploy Open edX onto an xlarge instance and handle 30, 40, or 60 users. Now with the autoscaler we usually need a minimum of three nodes. We’re wondering if other providers have noticed an increase in their cloud costs for the same level of Open edX service? (Keep in mind we’re mostly using AWS + a little bit of Azure).

@Gábor Boros We’re currently working to compare the actual costs. We had done estimates before, but now we are preparing a report on actual costs.

@Maksim Sokolskiy From my side, it feels like we pay more for a k8s installation, for the same number of users. For, say, 1,000 users, we are seeing that the cost is higher. We want to figure out how to scale replication down when there is almost no usage, such as at night. We also want to figure out how the costs scale with the number of users. It’s important to choose the right minimum node size that we can shrink to. [some audio cut out] But overall I can confirm we are paying more. We want all our deployments to be consistent though, not some on k8s and others not.

@Braden MacDonald Has anyone tried Fargate? @Moisés González : I think @Lawrence McDaniel has and wrote a post about it - check it out.

Also, we planned to put a lot of instances on a few clusters, but as it has turned out so far, we needed to deploy quite a few instances on dedicated clusters, which has definitely increased the cost.

@Jeremy Bowman (Deactivated) not sure about costs at 2U. But we’re excited about moving from redundant servers for an individual service to a redundant cluster that provides redundancy for a whole set of services.


@Maksim Sokolskiy Curious to know how 2U handles their database scalability for so many millions of users? @Jeremy Bowman (Deactivated) we use separate databases for separate services, but the main answer is that we use Aurora which handles a lot of the scaling concerns for us.

@Maksim Sokolskiy We actually use Aurora too, but we still experienced deadlocks during exams, when the persistent grades for each user’s question were re-computed.

@Jeremy Bowman (Deactivated) We have some minor issues with row table locks, but not that bad. We’re working on identifying views where the whole request is in a transaction, since that’s the Django default that we had for a long time, and scoping the transaction down to a smaller part of the view code.
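
[Editor's sketch] The idea of scoping a transaction down to a smaller part of a view can be illustrated roughly as follows. This is not 2U's code: it uses Python's standard-library sqlite3 instead of Django and MySQL so that it runs on its own, and the `grades` table, `open_db`, and `handle_request` names are hypothetical. In Django the equivalent would be turning off `ATOMIC_REQUESTS` and wrapping only the writes in `transaction.atomic()`.

```python
import sqlite3


def open_db():
    """Open an in-memory database with a hypothetical grades table."""
    # isolation_level=None puts the connection in autocommit mode, so a
    # transaction exists only where we explicitly BEGIN one.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE grades (user_id INTEGER PRIMARY KEY, score REAL)")
    return conn


def handle_request(conn, user_id, score):
    """Hypothetical view body: only the write runs inside a transaction."""
    # Read-only work runs outside any explicit transaction.
    row = conn.execute(
        "SELECT score FROM grades WHERE user_id = ?", (user_id,)
    ).fetchone()
    previous = row[0] if row else None

    # Slow non-database work (rendering, API calls) would go here,
    # where it cannot extend the lifetime of a write lock.

    # Short, explicit transaction around just the write.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT OR REPLACE INTO grades (user_id, score) VALUES (?, ?)",
            (user_id, score),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return previous
```

The point of the narrower scope is that locks are held only for the duration of the write, not for the whole request.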

@Maksim Sokolskiy It seems like the platform code is not well designed to take advantage of read replicas, so we’d like to see improvements that let the web views read from replicas more often.
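
[Editor's sketch] One generic way a Django project can send reads to replicas is a database router listed in `DATABASE_ROUTERS`. The class below is a minimal sketch of that mechanism, not something the platform ships: the class name and the `replica1` / `replica2` aliases are hypothetical and would need to match entries in `settings.DATABASES`. Routers are plain Python classes, so the sketch needs no Django import.

```python
import random


class PrimaryReplicaRouter:
    """Route reads to a randomly chosen replica and writes to the primary."""

    # Hypothetical aliases; they must match keys in settings.DATABASES.
    primary = "default"
    replicas = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Spread read queries across the replicas.
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        # All writes go to the primary.
        return self.primary

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias holds the same data, so relations are always allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Run migrations only against the primary.
        return db == self.primary
```

A router like this does not account for replication lag, so a view that reads its own write immediately afterwards may still need to read from the primary explicitly.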

@Jeremy Bowman (Deactivated) We also add a lot of caching to cut down on the number of requests that actually hit the database. Also, one disadvantage of Aurora is that it made our MySQL 8 upgrade take much longer to plan, and it’s still in progress.


@Jhony Avella “It would be interesting to condense all those ideas and recommendations in a centralized source.” @Maksim Sokolskiy to create a Confluence page.

Operational tips & tricks for Large installations