Large Instances Meeting Notes 2023-05-30
Video recording: https://drive.google.com/file/d/1l65nNgMrnkrW1AML9au-wbUr085Xt1Qt/view?usp=sharing
New agenda for this meeting:
Assign meeting lead and note taker.
Greetings & introductions as needed.
Updates from each org on the call - 2U, eduNEXT, OpenCraft, Raccoon Gang. What's new with your deployment(s)?
Harmony project updates: Review the list of PRs and issues, and assign anything unassigned.
Open discussion/questions, if any.
Provider Updates
2U - no major updates.
eduNEXT - had an issue two weeks ago. We use Ubuntu images for the nodes, because that’s a requirement for AppArmor for codejail. We started having issues where the EBS CSI driver was making kubelet crash. We investigated many potential causes and found that a specific version of Ubuntu had the problem; deploying new nodes with a new image and a new kubelet version seemed to solve it. This was very hard to debug because the issue was so random, the images were installed by EKS rather than by us, and the whole node would crash, causing chaos on the cluster. See this link for the issue that we believe was the cause. The impact depended on the node: if the node contained the nginx ingress controller, it caused a big outage; if it didn’t, it wasn’t as big of a problem, since the other pods had some redundancy. The lessons learned were to make sure nginx ingress pods are distributed across different nodes, and to look into the PodDisruptionBudget setting (see the sketch below).
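To make those two lessons concrete, here is a minimal sketch, assuming a stock ingress-nginx deployment (the names, namespace, and image tag are placeholders, not eduNEXT’s actual config): a topologySpreadConstraints rule spreads controller replicas across nodes, and a PodDisruptionBudget keeps at least one replica up during voluntary disruptions such as node drains.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: ingress-nginx-controller      # placeholder name
    namespace: ingress-nginx
  spec:
    replicas: 3
    selector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx
    template:
      metadata:
        labels:
          app.kubernetes.io/name: ingress-nginx
      spec:
        # One replica per node, so a single node crash cannot take out
        # every ingress pod at once.
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: ingress-nginx
        containers:
          - name: controller
            image: registry.k8s.io/ingress-nginx/controller:v1.8.0   # placeholder tag
  ---
  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: ingress-nginx
    namespace: ingress-nginx
  spec:
    # Voluntary disruptions must leave at least one ingress pod running.
    minAvailable: 1
    selector:
      matchLabels:
        app.kubernetes.io/name: ingress-nginx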
eduNEXT also changed the uWSGI config and made changes to Tutor so that the uWSGI settings are more configurable. It would be good to share best practices for what those settings should be on large instances.
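As a starting point for that conversation, recent Tutor releases expose the uWSGI worker counts as plain config values; a sketch of what tuning them could look like in config.yml (the setting names and numbers are assumptions to verify against your Tutor version):

  # config.yml - hypothetical values; the names assume the
  # OPENEDX_*_UWSGI_WORKERS options added in recent Tutor releases.
  OPENEDX_LMS_UWSGI_WORKERS: 8   # Tutor's default is 2; scale with CPU and traffic
  OPENEDX_CMS_UWSGI_WORKERS: 4

Changes like these are applied with tutor config save followed by a redeploy.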
OpenCraft - we are working on deploying sandboxes on k8s, using Grove. We have some blockers around setting Tutor config from the PR description: e.g. if the PR is opened against master, it should use tutor-nightly, but if it’s against a Maple branch we need to use a specific version of Tutor. We also need to be able to run cron jobs in the cluster to deploy sandboxes on a schedule; we are using OpenFunctions for this (see the sketch below). Finally, we are working on a plan to remove functionality from Grove and replace those pieces with Harmony.
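For the scheduled-deployment piece, a plain Kubernetes CronJob is one way to express it; the sketch below is a hypothetical illustration (the name, image, and args are placeholders), not Grove’s actual OpenFunctions-based setup.

  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: sandbox-deployer          # hypothetical name
  spec:
    schedule: "0 * * * *"           # hourly
    concurrencyPolicy: Forbid       # don't start a new run while one is active
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: OnFailure
            containers:
              - name: deploy
                image: example.org/grove/sandbox-deployer:latest   # placeholder
                args: ["deploy-sandboxes"]                         # placeholder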
Raccoon Gang - We’re preparing for a really huge production deployment, with e.g. many students taking an exam simultaneously. What eduNEXT just shared could actually be related to what we’re experiencing, so thanks a lot for sharing. We also encountered an issue with scaling the CoreDNS service once there are more than 16 nodes. We’re using horizontal autoscaling for DNS to solve that (see Autoscale the DNS Service in a Cluster); we may want to think about adding that to Harmony. Also, we’ve found that it’s hard for k8s installations to handle courses with really huge content - e.g. 200 units and 9 MB per HTML page. We’re hoping that MFEs may help, but having such huge content in a course is problematic. Q: what node scaling mechanism are you using? A: not sure, some custom autoscaler. Not Karpenter, though we’re interested in it. (Comment from eduNEXT: check out overprovisioning, which can help.)
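For reference, the approach in that link relies on cluster-proportional-autoscaler, whose linear mode is driven by a ConfigMap along these lines (the parameter values are the ones suggested in the Kubernetes docs and should be tuned per cluster):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: dns-autoscaler
    namespace: kube-system
  data:
    # linear mode: replicas = max(ceil(cores / coresPerReplica),
    # ceil(nodes / nodesPerReplica)), clamped between min and max.
    linear: |-
      {"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"min":2,"max":10}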
Harmony project updates
We reviewed the open issues. @Felipe Montoya opened two new issues to track monitoring with Grafana, splitting up the previous issue as discussed in prior meetings.
@Maksim Sokolskiy asked how to set the OpenSearch admin user’s password in a Helm chart. @Gábor Boros provided an example (provider-modules/k8s-monitoring/main.tf · main · opencraft / dev / Grove · GitLab) and explained the steps:
We create a password with Terraform.
We hash it with htpasswd.
We create a Kubernetes secret containing the hashed credentials.
We reference the secret from the OpenSearch Helm values YAML (see the snippet and the secret sketch below):
securityConfig:
  enabled: true
  internalUsersSecret: internal-users-config-secret
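Putting the steps together: the referenced secret holds an internal_users.yml in the format the OpenSearch security plugin expects. A minimal sketch of such a secret (the bcrypt hash is a placeholder for the output of the htpasswd step):

  apiVersion: v1
  kind: Secret
  metadata:
    name: internal-users-config-secret
  type: Opaque
  stringData:
    internal_users.yml: |
      _meta:
        type: "internalusers"
        config_version: 2
      admin:
        hash: "$2y$12$REPLACE_WITH_BCRYPT_HASH"   # from the htpasswd step
        reserved: true
        backend_roles:
          - "admin"
        description: "OpenSearch admin user"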