Large Instance Meeting Notes 18.04.2023

@Felipe Montoya will be leading the meeting, @Braden MacDonald taking notes

 

Notes:

  • Reviewing the issues on the DevOps Working group board

    • The issue to update the README for Harmony is now done.

    • Autoscaling PR: we had two options for how to handle a command (helm dependency update), and Jhony has addressed this, so the PR should be ready to go. @Braden MacDonald will take another look today and this should merge shortly.

    • Jhony is investigating how to allow people to use the helm chart without needing to clone the repo. He will make an issue.

    • Monitoring with Prometheus: should be unblocked once the autoscaling PR merges.

    • The Karpenter issue is also blocked on the autoscaling PR, which will merge soon.

    • OpenSearch support - we have a draft PR from @Maksim Sokolskiy. He anticipates it will be ready for review in a week or two.

    • Next steps before we can use this in production: thanks to everyone who commented with their points of view and lists. We’re still missing monitoring, a published helm chart, and a release process.

      • eduNEXT will be testing this on a prod-like sandbox environment soon.

      • Jhony: there is a helm chart we can use that provides all the monitoring tools. As mentioned, Jhony will create an issue about the release process.

      • Question about the release process: do we need to open an Axim engineering ticket so we can publish using an official Open edX account? Answer from Jhony: it depends. If we publish on GitHub Pages it won’t be necessary, but it would be needed if we wanted to publish on Artifact Hub etc. Jhony will post details on the issue that he’ll open.

    • Discussion: do we need log collection? Gabor: It’s nice to have. Felipe: maybe it’s better to handle log collection at a higher level, which would be out of scope here.

      • @Felipe Montoya will open an issue about this so we can have a technical discussion about our options.

    • Tutor plugin: we’ll add it to the index once we’re using it in prod. Jhony: the pod autoscaling plugin is already in the Tutor plugin index.

    • SSL cert for Elasticsearch:

      • @Felipe Montoya: I’ve seen so many issues with self-signed certs; they’re basically a planned outage on the date of their expiration. Can we use cert-manager/Let’s Encrypt to get a valid public cert for this instead? We pinged @Moisés González to get his input.

      • @Maksim Sokolskiy notes that the solution for OpenSearch may be different from the one for Elasticsearch, but we aren’t sure at this point. We’ll track it with the same issue for now, and create separate issues if needed.

  • Anything else that we aren’t tracking with an issue on the board?