Braindump on Configuration: Today and Future

Meeting Info

Date: Dec 4, 2020
Recording: https://discuss.openedx.org/t/lets-talk-about-the-native-installation/3269/14
Presenter: Cory Lee
Facilitator: (former user)

Notes

Rationale

  • Open edX Architecture moving from a centralized model to a decentralized model.

    • Similarly for the edX SRE team

  • Many of our teams own backend and frontend components, within and outside of the monolith

    • Each may want to do things differently - their pace, their testing, their definition of done, etc.

  • Hence, our centralized configuration repo has become problematic and a bottleneck

  • Changes to repo

    • Changes need to work with the last two Open edX releases, multiple OSes, and multiple environments (sandboxes, etc.).

      • Only the latest community release is supported, but some internal edX use cases rely on older releases.

    • In the past, we have accepted OSPR changes without testing them ourselves, which has added to our maintenance burden

    • We also realized that any significant change to the configuration repo required a lengthy deprecation process

  • Configuration state in production

    • Many teams just couldn’t figure out what the state of a configuration setting was in production or in other environments

    • It was difficult to parse this from the code since the settings were tangled across Ansible files, YAML files, etc.

    • So… we moved to detangling the settings (De-DRY: stop sharing values across apps so each app’s settings are explicit)

Current State

Future with Containers

  • Background

    • Remote config today

      • Remote config is Hermes plus a few lines of Python code (see the sketch in the speaking notes below)

      • Hermes is currently configured with Ansible code

    • edX Secrets today

      • Stored in Vault

      • Using open-source HashiCorp tech today

  • Moving to decentralized model

    • Defaults are in the app

    • Dockerfile in each repo

      • Open edX version of the application, without anything edX-specific

      • Example YAML config file as well

    • Dockerfile buildable by running docker build . → without any additional params, etc.

    • Just: build the image and mount the YAML (see the sketch after this list)

    • Clearer delineation between Open edX and edX.org

      • So edX.org can run at a different pace and a different scale

  • edX.org: Docker images → using Helm → k8s

    • Would continue to support Blue/Green Deployments

    • Helm charts are currently edX-specific in order to provide clearer directions to edX engineers

      • Could provide Open edX versions of the Helm charts and k8s manifests as examples, if they would be useful

    • Private requirements for edX.org → would like to move this out of the public repos

      • So edX.org can dogfood this design pattern

      • You may notice new files with -newrelic tags; these are the start of these edX-specific images
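
A minimal sketch of the “just add YAML” workflow above; the image name, mount path, and config environment variable are hypothetical placeholders, not names from the meeting:

    # Build the open image with no additional params, then run it with a mounted YAML config
    docker build -t openedx/my-ida .
    docker run \
      -v "$(pwd)/my-ida.yml:/edx/etc/my-ida.yml" \
      -e MY_IDA_CFG=/edx/etc/my-ida.yml \
      openedx/my-ida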

Discussion

  • OEP-45 follow-up

    • Nginx configuration - what are the plans for this?

      • Would this be with sample Helm charts?

      • edX would be using nginx ingress

      • Possible Options

        • A: Provide an example chart

        • B: Provide an open chart that edX also consumes

  • Myth: Entire configuration for an IDA can be stored in a given repo

    • Example: Authentication is coupled across repos

      • the operational setup for this lived in the Ansible playbooks

    • Note: decentralized devstack has a similar issue

  • In addition to Helm charts, you are also using consul-template to pull in secrets from Vault?

    • Yes, those are rendered in as values in the Helm charts.

  • Timeline for deprecation?

    • Let’s discuss this and work together on accelerating and converging on this effort

  • Currently evaluating Tutor as a community-supported deployment mechanism

    • Usual questions: Testing, Devstack, Docker files, etc.

    • It seems like there are similar problems underneath, no?

    • For example, should Tutor be the next Devstack?

      • Tutor - make it easy to install

        • Devstack is trying to address the same issue

    • Currently not a hard blocker, since we are already pushing images in our current Devstack.

    • edX Architecture team is the newly designated “owner” for Devstack and plans to look into addressing short-term issues as well as long-term plans. Nim will ask the team to watch the recording from today and follow up with this group in order to converge paths.

 

Cory’s speaking notes (rough transcript):

 

 

******************* 10m edx.org context *******************

As most of you already know, edX has over the past few years been transitioning from a monolithic architecture to a microservices architecture with separate micro-frontends. Something you may or may not know: we have also been transitioning internally from a centralized devops team, where most of the infrastructure work was done by a core team as a service to other teams, to a decentralized model where teams own their own applications and infrastructure as much as possible and a core team of SREs facilitates this ownership.

While the centralized model worked well for us early on, as the number of teams and projects grew it became overwhelming, since the devops team was frequently not the people closest to the problem with the most context. We found that cross-team dependencies were a frequent source of friction with the centralized model: team A would need to inject work into the devops backlog and wait for it to be completed before they could proceed, and it was difficult to plan for this type of work in advance because it wasn't always apparent at the outset whether you would need devops assistance or not.

In order to minimize these dependencies, the (now) SRE team is tasked with enabling each team to own their applications from top to bottom; this means frontends, backends, databases, alerting, passwords - everything. This is somewhat aspirational at the moment, but it is the guiding star we are driving towards.

Internally at edX, many of our teams own a frontend or two, a Django backend application or two, and some components of the edx-platform monolith. They work in these repos most days and are very familiar with them, but edX has something like 450 repos in our GitHub, so it is impossible for anyone to REALLY know how everything works. As you might imagine, these teams, now mostly decoupled, sometimes want to do very different things with their deployment processes, testing, automations, technologies, etc.

This is where our centralized configuration repo becomes problematic. Since all the applications are coupled together sharing code, changes to this repo need to work for all applications in all contexts, but the people making these changes, the developers on the application teams, don't necessarily have all this context, as they mostly work in silos on their own applications. We also accept open source pull requests to this repo, but any changes need to work for the past two Open edX releases and the current master branch, and they need to work in our sandboxes, devstacks, and the past few versions of Ubuntu. Ideally they would not make any breaking changes for the Open edX community. Some of the components are developed entirely by the Open edX community and we don't have the capacity to test changes to them (e.g., we might receive pull requests for Google Cloud services, but we don't use Google Cloud). Reviewing and testing changes to this repo, and having confidence that they won't break something, is therefore really hard.
******************* 10m simplifying configuration, asym-crypto-yaml, and remote config (Hermes) *******************

Some of our past attempts at solving this include simplifying configuration and adding asym-crypto-yaml, and remote config.

******************* simplifying configuration and adding asym-crypto-yaml *******************

A frequent complaint we received from our developers was that it was difficult to determine what their configuration settings were, as we used a complex multi-tiered merging of YAML configs in the configuration repo to build a single .yaml file, or multiple JSON files in the case of edx-platform, for all our apps. Additionally, applications had their own defaults, Ansible had its own per-app and global defaults, and then we layered multiple files on top of that to generate a single file (or multiple files for the LMS) for each application. This was to facilitate config sharing across applications.

We decided to simplify this to just every app having a single YAML file, period. No JSON, and all of the 'default' config values would be moved into the application instead of being in the Ansible repository. To do this we needed to be able to store our secrets in our config files, and to do this we developed asym-crypto-yaml, which let us do inline encryption in our YAML files.

We were successful in reducing all the apps to using a single YAML file, but we were not able to remove all of the old configurations from the configuration repo, nor were we able to remove the JSON configurations, because they were needed for old edx-platform named releases. This was where I really began to appreciate that any significant refactor of the configuration repo requires us to go through a lengthy deprecation process.

******************* Remote config (aka Hermes) *******************

Once we had all our applications pulling from a single YAML file, in an effort to bypass the configuration repo for most settings changes a developer might want to make, we developed remote config, AKA Hermes. This was intended to speed up config deployments, as running the entire Ansible play is slow for simple config changes. Hermes is a simple program, really: it monitors a remote file, and whenever its ETag changes it runs a shell command. In our case, we modified the configuration repo so that, when Hermes is enabled, it modifies sudoers to allow Hermes to restart the application. We then configure the shell command that Hermes runs to download the remote file, decrypt the inline secrets using asym-crypto-yaml, and restart gunicorn. This enables config changes to our app.yaml file to bypass our deployment pipelines (and therefore our Ansible runs and complex config merging) for most of the settings our developers care about.

However, again we weren't able to factor out all of the old config from configuration, and the nested configuration is still used for configuring things like nginx and the machine images that we are deploying, so this still leads to a lot of confusion, though the answers are sometimes much simpler.

******************* Retirement of the configuration repo *******************

This inability to make changes to the way we ship edx.org without undergoing a significantly time-consuming deprecation process to support Open edX is essentially what has caused us to begin undergoing an internal deprecation of the configuration repo, with the intent of *eventually* deprecating and ceasing to support it externally.
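
To make the mechanism concrete, here is a minimal shell sketch of the polling pattern described above. It is not the actual Hermes implementation (which is a small Python program); the URL, config path, and restart step are hypothetical placeholders:

    CONFIG_URL="https://config.example.com/app.yml"    # hypothetical remote config location
    last_etag=""
    while true; do
      # Fetch only the headers and extract the ETag
      etag=$(curl -fsSI "$CONFIG_URL" | awk 'tolower($1) == "etag:" {print $2}')
      if [ -n "$etag" ] && [ "$etag" != "$last_etag" ]; then
        last_etag="$etag"
        curl -fsS "$CONFIG_URL" -o /edx/etc/app.yml    # hypothetical path
        # ...decrypt inline secrets with asym-crypto-yaml, then restart gunicorn...
      fi
      sleep 30
    done
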
With that said, we are aware that we need to follow the formal deprecation process and cannot just pull the rug out from under the Open edX community.

******************* Not using asym-crypto-yaml & remote config *******************

It is important to note that remote config, or "Hermes", is configured by the Ansible code in configuration, which is being deprecated. Our new Kubernetes deployments don't actually leverage Hermes at all, so in all likelihood we will not be using it in the future either. In our Kubernetes deployments we are storing secrets in Vault and using consul-template to render secrets into our application YAML files, and thus we no longer have a need for either of these pieces of software, as we can get the same functionality from these free HashiCorp products.

******************* So what is next? *******************

We are moving to a decentralized model where each application mostly contains everything it needs in order to run; this means default configuration values, any libraries it needs to install, etc. The plan will probably be pretty familiar and boring to you if you have spent much time working with Docker.

We are containerizing our applications by putting a Dockerfile in each repo. That Dockerfile should be Open edX centric and should include nothing about edx.org. The repo will contain everything needed to run that microservice except the YAML config file. I would personally like all of the defaults to be production-centric, and for all dev and sandbox environments to configure everything by manipulating the YAML file. The Dockerfile in the repo will be buildable by simply doing 'docker build .' at the root of the directory, and will produce the Open edX image. I like to think of this as "Just add YAML": you do 'docker build .', mount your YAML file into the container, and you are good to go. This allows both our developers and open source developers to compartmentalize when working on a microservice, so they don't have to ingest the entire edX ecosystem to make a few changes to one microservice. Everything you need should be in that repo.

One thing that we really want to do is create a clear delineation between what is Open edX and what is edx.org, so that edx.org can move fast and break stuff, and so that Open edX can be a stable product for your consumption. The reality is that edx.org has very different needs from most operators of Open edX. We operate at a scale that isn't reasonable for most installs, and most organizations that operate Open edX at a similar scale have their own infrastructure. As such, we are deploying our Docker images to Kubernetes using Helm, with our config files rendered by consul-template, which loads secrets from Vault at deploy time. But Kubernetes is too large for many small installs of Open edX, so we don't want to force this technology choice on everyone.

Our Helm charts are currently very edx.org specific and make many assumptions about our underlying infrastructure, in order to provide a clean experience to our developers working at edx.org. There is currently something of an outstanding question of whether we should attempt to provide an open source Helm chart, or provide example manifests for running Open edX on Kubernetes, but our current approach is to keep our edx.org manifests private and to publish public Docker images. So for the parts of the configuration repo that matter to most installs (the installed libraries required to run the application, the default config values, etc.), those will be in the repo in question.
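
As a rough illustration of the consul-template step mentioned above, a deploy might render the application YAML once from a template whose values come from Vault. The template path, output path, Vault path, and key names are hypothetical, not edX's actual layout:

    # Render the app config once at deploy time, pulling secrets from Vault
    consul-template -once \
      -template "/etc/templates/app.yml.tpl:/edx/etc/app.yml"

    # app.yml.tpl might contain a line like (Vault KV v2 syntax):
    #   SECRET_KEY: {{ with secret "secret/data/my-ida" }}{{ .Data.data.secret_key }}{{ end }}
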
Then, for installing private requirements and edx.org-specific artifact customizations, we plan to become a consumer of the Open edX Docker images ourselves, by building private images that are derivative of the open source images. Currently many of our Dockerfiles build both a public image and a '-newrelic' image, which is essentially the beginnings of our private images, but some of our applications, like the platform, need to install many additional plugins for the edx.org install. The current images in the repos are also derivative of Ubuntu upstream images; we would eventually like these to be based on the Python base images. They are only Ubuntu-based now for parity with the configuration repo package names.
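
A rough sketch of the derivative-image pattern described above; the image names and installed packages are illustrative, not edX's actual build setup:

    # Build the public Open edX image from the repo's Dockerfile
    docker build -t openedx/my-ida .

    # A private Dockerfile can then layer edx.org-specific pieces on top, e.g.:
    #   FROM openedx/my-ida:latest
    #   RUN pip install newrelic some-private-plugin
    docker build -f Dockerfile.private -t edx/my-ida-private .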