Large Instances Meeting Notes 2023-06-13

Meeting recording

kubernetes_collaboration_20230614.mp4

Agenda

  1. Assign meeting lead (@Felipe Montoya) and note taker (@Braden MacDonald).

  2. Greetings & introductions as needed.

  3. Updates from each org on the call - 2U, eduNEXT, OpenCraft, Raccoon Gang, Lawrence. What's new with your deployment(s)?

  4. Harmony project updates: Review list of PRs and issues, and assign anything un-assigned.

  5. Open discussion/questions, if any.

Provider Updates

2U - @Adam Blackwell (Deactivated)

  • We’ve been experimenting with Karpenter and comparing Harmony to our internal helm chart. Our helm charts pull secrets from Vault, so we’re trying to decouple it before we open source it so that others can use it even if they don’t have Vault set up in the same way we do. We’re moving to using Argo CD’s Vault plugin instead.

    • Q from @Moisés González - could this use K8s secrets / External Secrets? A: we’ve historically avoided k8s secrets but today that may be a reasonable approach.

  • We’re containerizing Studio and not happy with how we’re dealing with tracking logs. Interested in what others are doing if there are nice solutions.

    • @Felipe Montoya : we’ve tried using Vector to put the logs into a bucket. We also tried sending the logs to CloudWatch.

  • Q from @Lawrence McDaniel : Back with the ansible charts maintained by 2U, I could just deploy things and it worked really well. Now with k8s, that’s turned upside down and everything is more complex. With all the containerization going on by 2U, is there any chance of a world where there’s a simple, well-maintained thing to use, like in the old ansible days?

    • A: I don’t expect it will ever be as simple as it was in the “honeymoon phase” that we had before. And today 2U has a lot of complex infra e.g. EMR jobs that you don’t want, trust me… I think Axim is going to go all in on Harmony and it will be the blessed way of doing things and trying to make it easier. Also Tutor.

  • @Jeremy Bowman (Deactivated) : check out GitHub - openobserve/openobserve: 🚀 10x easier, 🚀 140x lower storage cost, 🚀 high performance, 🚀 petabyte scale - Elasticsearch/Splunk/Datadog alternative for 🚀 (logs, metrics, traces, RUM, Error tracking, Session replay). It looks pretty cool for handling logs, is written in Rust, and is open source.

    • @Braden MacDonald : Ideally Harmony will include something like this by default for people who want a “batteries included” experience, but of course it can be disabled and replaced by something more custom for organizations that need to.

      • @Adam Blackwell (Deactivated) I’d be happy if the Helm chart included some mechanism for sending logs to an S3 bucket.

eduNEXT - @Jhony Avella

  • In the last meeting, we mentioned a problem with the connection of nodes to the cluster. Luckily it seems fixed with the new Ubuntu image, and we haven’t seen the problem occur since.

  • We’ve faced another issue related to limits for what can be scheduled on an EKS cluster. It scaled up to a limit and then couldn’t create more machines, because no more IPs could be assigned to the cluster. There is a limit to how many IPs the secondary network interface can allocate. So we’re planning to make that more configurable and plan it in advance better.
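
A rough sketch of the arithmetic behind this kind of IP limit, assuming the default AWS VPC CNI behaviour (one pod IP per secondary address, no prefix delegation). The instance figures (3 network interfaces, 10 IPv4 addresses each) and the subnet CIDR are illustrative assumptions, not eduNEXT's actual configuration.

```python
import ipaddress


def max_pods_per_node(enis: int, ipv4_per_eni: int) -> int:
    """Upper bound on pods per node under the default AWS VPC CNI."""
    # One address on each ENI is its own primary IP and is not given to pods.
    # The +2 covers host-network pods (aws-node, kube-proxy) that need no pod IP.
    return enis * (ipv4_per_eni - 1) + 2


def usable_subnet_ips(cidr: str) -> int:
    """Addresses available in an AWS subnet (AWS reserves 5 per subnet)."""
    return ipaddress.ip_network(cidr).num_addresses - 5


# Assumed example values: an instance type with 3 ENIs and 10 IPv4 per ENI.
print(max_pods_per_node(3, 10))          # 29
print(usable_subnet_ips("10.0.0.0/24"))  # 251
```

Comparing the two numbers gives a quick estimate of how many nodes a subnet can hold before address exhaustion, which may help when planning cluster networking in advance.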

OpenCraft - @Gábor Boros

  • Q: has anyone tried to upgrade a k8s cluster with running Open edX instances with no more than 2 replicas? We tried it last week and it turned out that all the Tutor k8s resources (LMS, CMS, etc.) are missing the liveness and readiness probes. So we tried to do a rolling upgrade but we still got a few minutes of outage while the new pods were booting up. (During an upgrade, nodes get drained and k8s will spawn new pods, but without liveness/readiness probes, k8s may start routing traffic to the new pods before the gunicorn workers have spun up.)

    • @Lawrence McDaniel : yeah, I did some k8s cluster version upgrades and don’t remember encountering the issue that you’re describing. I’ll have to remember if/how I avoided that issue.

    • @Jhony Avella : are you using init hooks or termination hooks? We ran into this problem in the past and found that the pods weren’t terminating properly. We had to stop the uWSGI process with the right signal. We also configured it to not receive traffic until it gets three successful calls to the heartbeat.

    • A couple of people suggested not using heartbeat as a probe, because then if the database is under heavy load, your LMS pods may be marked as unhealthy and stop getting traffic, even though they may still be able to serve some requests.

    • @Felipe Montoya : there should be a hook/filter in Tutor so that each plugin can configure or override the liveness/readiness probes.

    • @Adam Blackwell (Deactivated) : in some cases we just used a 30s or 60s delay at startup, which is a dumb but effective solution. Also posted in the chat: “For us, notes uses /heartbeat, most other services use /health, we have a `liveness_probe_initial_delay_seconds` and `liveness_probe_initial_delay_seconds` var exposed in the current django helm chart.”

    • Group consensus is that some fix for this should be upstreamed into Tutor core.
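
To make the probe discussion concrete, here is a minimal sketch of what a probe configuration along these lines might look like, written as a Python dict that serializes to a Kubernetes container spec fragment. The helper name `http_probe`, the port 8000, and the timing values are assumptions for illustration; the `/heartbeat` path and the "three successful calls" rule come from the discussion above, and, as noted there, a cheaper endpoint may be preferable where the database is often under load.

```python
import json


def http_probe(path, port, initial_delay=0, period=10, success=1, failure=3):
    """Build a Kubernetes HTTP probe definition (hypothetical helper)."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay,
        "periodSeconds": period,
        "successThreshold": success,  # must stay 1 for liveness probes
        "failureThreshold": failure,
    }


# Assumed values: port 8000 and the delays are placeholders, not Tutor defaults.
container_patch = {
    # No traffic is routed until three consecutive checks succeed.
    "readinessProbe": http_probe("/heartbeat", 8000, success=3),
    # The "dumb but effective" variant: wait 60s before the first liveness check.
    "livenessProbe": http_probe("/heartbeat", 8000, initial_delay=60),
}

print(json.dumps(container_patch, indent=2))
```

The printed JSON could be merged into a Deployment's container spec, for example through a Tutor plugin patch, if such a hook is available.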

Raccoon Gang - @Maksim Sokolskiy

  • We have a [client?] noticing a lot of the issues that have been discussed in this group. [Sorry missed the context here for the notes] GitHub - aulasneo/tutor-contrib-hpa

  • We have a huge installation and huge traffic load but the request response times vary dramatically - some are super tiny but others are up to 15s. We decided to separate the deployment into 2 deployments based on request headers - AJAX requests go to the small/fast deployment and huge page load requests go to the big/slow deployment. It’s helped with perceived user experience, and also observability.

  • Our current exam is going well. Our current record is 15k simultaneous students plus something like 3k instructors all in one session.

  • Performance testing is very important. We forgot to test one new small feature and it resulted in huge DB performance degradation. In our retro we identified that our definition of done for any project that can result in big load needs to include a testing strategy.

    • @Adam Blackwell (Deactivated) “I would love a better way to do performance testing on a Django service before we move things from EC2 to EKS, right now we have teams write smoke test runbooks, but they are incomplete.” A lot of manual testing is still used. “We at one point would scale up manually before major news announcements.”

      • @Moisés González We used k6, and found it helpful although it wasn’t fully automated into CI.
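
The header-based split that Raccoon Gang described could be sketched roughly as follows. This assumes AJAX requests are identified by the conventional `X-Requested-With: XMLHttpRequest` header, which the notes do not confirm; the deployment names `lms-fast` and `lms-slow` are hypothetical. In practice this decision would normally live in the ingress or load balancer configuration rather than in application code.

```python
def pick_deployment(headers: dict[str, str]) -> str:
    """Choose a backend deployment from request headers (illustrative only)."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-requested-with", "").lower() == "xmlhttprequest":
        return "lms-fast"  # hypothetical name: small, quick AJAX requests
    return "lms-slow"      # hypothetical name: heavy full page loads


print(pick_deployment({"X-Requested-With": "XMLHttpRequest"}))  # lms-fast
print(pick_deployment({"Accept": "text/html"}))                 # lms-slow
```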

Lawrence

  • Re Harmony, I’ve been working on reference infra for Karpenter. Last week I addressed all of @Jhony Avella’s review comments, which were mostly stylistic and chores like bumping versions. I see there are some merge conflicts so I’ll get them resolved in the next few days.

Harmony Task Updates

  • Releasing the chart - @Jhony Avella created a new PR that’s from the same repo rather than a fork, and now it seems to be working. You can now use Helm to install it directly for testing.

  • OpenSearch cluster - @Maksim Sokolskiy has resolved all the comments. He has mostly been testing on minikube, so would appreciate it if someone could test on EKS. Found a new issue with the “reindex_course” command that affects both OpenSearch and Elasticsearch. PR is ready for a second round of review.

Meeting chat log

00:06:33 Adam Blackwell (he/him/his): GitHub - argoproj-labs/argocd-vault-plugin: An Argo CD plugin to retrieve secrets from Secret Management tools and inject them into Kubernetes secrets
00:06:40 Adam Blackwell (he/him/his): Argo CD Vault Plugin
00:08:02 Felipe Montoya: @moises you are referring to Introduction - External Secrets Operator?
00:09:10 Moisés González: Yes that one
00:11:55 Adam Blackwell (he/him/his): Kubernetes logs | Vector documentation ?
00:15:10 Moisés González: Sorry I got a spotty connection. Can you write the last thing you said Adam?
00:15:20 Adam Blackwell (he/him/his): I appreciate your appreciation
00:15:37 Adam Blackwell (he/him/his): Axim is hoping to put their eggs in the harmony chart(s)
00:17:49 Jeremy Bowman: Is this relevant to the topic at hand? GitHub - openobserve/openobserve: 🚀 10x easier, 🚀 140x lower storage cost, 🚀 high performance, 🚀 petabyte scale - Elasticsearch/Splunk/Datadog alternative for 🚀 (logs, metrics, traces).
00:27:12 Adam Blackwell (he/him/his): Minimum 3, but the 1.25 EKS upgrades did break a bunch of v1beta things.
00:27:27 Adam Blackwell (he/him/his): I believe only for scheduled cronjobs though, so it didn’t impact learners.
00:28:29 Maksim Sokolskiy: GitHub - aulasneo/tutor-contrib-hpa
Did you use this?
00:30:04 jhony: or GitHub - eduNEXT/tutor-contrib-pod-autoscaling: This repository aims to provide support for HPA and VPA in tutor
00:30:49 Adam Blackwell (he/him/his): We use ls :facepalm:
00:31:24 Adam Blackwell (he/him/his): We want to use helm hooks for mysql migrations
00:32:42 Jeremy Bowman: Need to drop for another meeting
00:34:47 Adam Blackwell (he/him/his): +1
00:36:12 Adam Blackwell (he/him/his): We default our Django services to:
resources:
  limits:
    cpu: 100m
    memory: 512Mi
  requests:
    cpu: 25m
    memory: 512Mi
00:38:32 Adam Blackwell (he/him/his): Bad liveness probes can hurt people*
00:38:59 Moisés González: We can at least have the discussion
00:41:31 Adam Blackwell (he/him/his): For us, notes uses /heartbeat, most other services use /health, we have a `liveness_probe_initial_delay_seconds` and `liveness_probe_initial_delay_seconds` var exposed in the current django helm chart.
00:42:19 Adam Blackwell (he/him/his): (For the curious, GitHub - edx/portal-designer: A place to create and design new learner-portal instances. is public but not part of Open edX)
00:46:46 Adam Blackwell (he/him/his): I would love a better way to do performance testing on a Django service before we move things from EC2 to EKS, right now we have teams write smoke test runbooks, but they are incomplete.
00:48:42 Adam Blackwell (he/him/his): latency under load
00:49:09 Adam Blackwell (he/him/his): We at one point would scale up manually before major news announcements.
00:49:27 Maksim Sokolskiy: ++
00:49:40 Adam Blackwell (he/him/his): That also relates to bad liveness probes which slow down scaling
00:49:57 Felipe Montoya: We ask customers to let us know if they will do big announcements to be able to judge if a manual scale up is necessary
00:51:52 Felipe Montoya: k6
00:54:35 Moisés González: Load testing for engineering teams | Grafana k6
00:54:42 Lawrence McDaniel: I have a hard stop in 5 minutes
00:54:56 Adam Blackwell (he/him/his): Sorry for derailing the conversation.
00:55:18 Gábor Boros: Same here about the 5mins
01:03:31 Adam Blackwell (he/him/his): Have to drop off for another meeting, but wanted to thank all of you for all of your inspiring collaboration!
01:03:53 Braden MacDonald: Reacted to “Have to drop off for…” with