Large Instances Meeting Notes 2024-05-28

 

  1. Assign meeting lead and note taker. (@Felipe Montoya and @Braden MacDonald)

  2. Greetings & introductions as needed. n/a, the usual participants.

  3. Updates from each org on the call - 2U, eduNEXT, OpenCraft, Raccoon Gang. What's new with your deployment(s)?

2U:

  • @Felipe Montoya : Update from @Jeremy Ristau at 2U - they’ve been seeking a volunteer to join this meeting regularly, but no one has stepped forward yet.

eduNEXT:

  • @Moisés González spent some time researching the new course search feature and testing it on a large instance. However, he encountered a problem: with ~43,000 courses on the instance, the reindex_studio command crashes before it even starts indexing.

    • @Maksim Sokolskiy would also like to know what happens with a single very large course, because he experienced a failure when trying to index one with Elasticsearch.

    • @Braden MacDonald to follow up with @Moisés González to modify the command to better support instances with large numbers of courses.

  • @Jhony Avella worked on upgrading to Kubernetes 1.29 and so far hasn’t had any issues.

    • @Maksim Sokolskiy mentioned that the issue they had with a similar upgrade was related to Bottlerocket images.

  • @Jhony Avella is hoping people interested in Horizontal Pod Autoscaling can comment on this discussion thread.

  • @Felipe Montoya In anticipation of Aspects, we have started saving logs on the instances that will use it, so that when Aspects is officially installed/enabled, there will be existing data to populate it.

OpenCraft:

Raccoon Gang:

  • @Maksim Sokolskiy We are experimenting with Aspects on single-server installations. Today I encountered a huge performance issue caused by Docker logs. Anyone else using Aspects on non-Kubernetes installations may encounter this too. We were testing an instance with a huge number of users (1 million), and the instance ended up using all available resources (e.g. CPUs) just to handle the resulting volume of logs. The problem was that all stdout data was kept in Docker’s log store, and dockerd simply cannot cope with that volume of stored data. The solution was to clear the logs from the Docker store. (Aspects was still fine because the log data was sent to ClickHouse separately.)

    • @Cristhian Garcia : Once Aspects is enabled it can be very verbose with xAPI statements (which go to stdout by default, making this issue much worse). You can reduce it by disabling Caliper and xAPI logging. However, this prevents processing xAPI logs using Vector, so you would need to enable Ralph. The settings are: XAPI_EVENT_LOGGING_ENABLED and CALIPER_EVENT_LOGGING_ENABLED.

    • @Felipe Montoya We should update the Aspects documentation to mention this potential issue.

    • Does anyone know if Docker has a built-in setting to rotate logs automatically?

      • @Moisés González : Yes, you can configure dockerd to do log rotation depending on size or time. In the case of Tutor, by default it actually puts the logs in a directory inside the container. So if the container lives for a long period of time, this directory keeps growing until k8s deletes the pod. (In tutor local, though, it’s a mounted folder that can grow indefinitely.)
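
As a rough illustration of the dockerd log rotation mentioned above, here is a minimal Python sketch that builds a daemon.json fragment for Docker's json-file logging driver, which supports size-based rotation through its max-size and max-file options. The helper name build_daemon_config and the default values (50m, 5 files) are assumptions chosen for illustration, not values discussed in the meeting.

```python
import json


def build_daemon_config(max_size="50m", max_file=5):
    """Return a dockerd daemon.json fragment enabling size-based log rotation.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    return {
        "log-driver": "json-file",
        "log-opts": {
            # Rotate a container's log file once it reaches this size.
            "max-size": max_size,
            # Keep at most this many rotated files per container.
            # dockerd expects log-opts values to be strings.
            "max-file": str(max_file),
        },
    }


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before being merged into
    # the Docker daemon configuration file.
    print(json.dumps(build_daemon_config(), indent=2))
```

On Linux the daemon configuration usually lives at /etc/docker/daemon.json. dockerd has to be restarted for the change to take effect, and the new defaults only apply to containers created afterwards, so existing containers would need to be recreated.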

 

  4. Harmony project updates: Review list of PRs and issues, and assign anything un-assigned.

  5. Open discussion/questions, if any.

  • @Moisés González started a discussion around the complexity of Tutor configuration (Jinja2 template files, Tutor YAML files, Open edX YAML files, Open edX Python files). Previously, with Ansible, you could rely on a single .yaml file that held your entire config.