
Meeting Info

Date:
Recording: TK
Presenter: Cory Lee
Facilitator: Former user

Notes

Rationale

  • Open edX Architecture moving from a centralized model to a decentralized model.

    • Similarly for the edX SRE team

  • Many of our teams own backend and frontend components, within and outside of the monolith

    • Each may want to do things differently - their pace, their testing, their definition of done, etc.

  • Hence, our centralized Configuration repo is problematic and bottlenecked

  • Changes to repo

    • Need to work with last 2 old releases, multiple OSes, multiple environments (sandboxes, etc).

      • Only the latest community release is supported, but some internal edX use cases rely on older releases.

    • In the past, we have accepted OSPR changes without testing them ourselves, which has added to our maintenance burden

    • Also realized that any change in configuration needed a long deprecation process

  • Configuration state in production

    • Many teams just couldn’t figure out what the state of a configuration was in production or in other environments

    • It was difficult to parse this from the code, since the settings were tangled across Ansible files, YAML files, etc.

    • So… moved to detangling the settings (De-DRY)

Current State

  • Moved to using a single YAML file for each App

  • Asym-Crypto-YAML: https://github.com/edx/asym-crypto-yaml

    • Inline encryption of our YAML - so secrets can be encrypted inline

    • Allows devs to see where secrets are set inline, while keeping the values themselves encrypted

  • Remote Config

    • Hermes: https://github.com/edx/hermes

      • Intended to speed up config deployments

      • Whenever the watched file’s ETag changes, it runs a configured command

    • Config files are completely separate from our IDA deployments

      • Downside today: can’t tell exactly when the config is updated; could be fixed with further tooling at some point

    • Have been doing this for about a year perhaps

    • Made refactoring configuration a lot easier

  • Config Flow changes today

    • Configuration repo runs once in the “beginning”

      • We run Ansible to create the initial AMI

      • After that, we no longer use the configuration repo

    • Dev

      • declares the Python configuration setting in the App

      • set the values in remote config

        • only set in the Configuration repo if the setting is also needed by Nginx

    • Remote config changes are merged

      • CI verifies decryption of secrets

      • CI verifies formatting

    • Hermes automatically deploys the config change
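The two CI checks on remote config merges (verify decryption of secrets, verify formatting) might look roughly like the sketch below. This is not edX’s actual tooling: base64 validation stands in for the real asymmetric decryption check, and the `!Encrypted ` marker is invented, not asym-crypto-yaml’s real syntax.

```python
import base64

ENC_PREFIX = "!Encrypted "  # invented marker, not the real inline tag


def check_secrets(text: str) -> list:
    """Return line numbers whose inline-'encrypted' payloads fail to decode.

    Real CI would attempt the actual asymmetric decryption with the private
    key; base64 validation merely stands in for that here.
    """
    failures = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ENC_PREFIX in line:
            payload = line.split(ENC_PREFIX, 1)[1].strip()
            try:
                base64.b64decode(payload, validate=True)
            except Exception:
                failures.append(lineno)
    return failures


good = "DB_PASSWORD: !Encrypted c2VjcmV0\n"
bad = "DB_PASSWORD: !Encrypted %%not-base64%%\n"
```

A failing check blocks the merge before Hermes ever sees the change, so a bad ciphertext never reaches production.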

Future with Containers

  • Background

    • Remote config today

      • Remote config is Hermes with a few lines of python code

      • Hermes is currently configured with Ansible code

    • edX Secrets today

      • Stored in Vault

      • Using open-source technology from HashiCorp

  • Moving to decentralized model

    • Defaults are in the app

    • Dockerfile in each repo

      • Open edX version of the application, without anything edX.org-specific

      • Example YAML config file as well

    • Dockerfile buildable by running docker build . → without any additional params, etc.

    • Just: Build the image and mount the YAML

    • Clearer delineation between Open edX and edX.org

      • So edX.org can run at a different pace and a different scale

  • edX.org: Docker images → using Helm → k8s

    • Would continue to support Blue/Green Deployments

    • Helm charts are currently edX-specific in order to provide clearer directions to edX engineers

      • Could provide Open edX versions of Helm charts and k8s manifests as examples, if they would be useful

    • Private requirements for edX.org → would like to move this out of the public repos

      • So edX.org can dogfood this design pattern

      • You may notice new files with -newrelic tags - these are the start of these edX-specific files

Discussion

  • OEP-45 follow-up

    • NGinx configuration - what are the plans for this?

      • Would this be with sample Helm charts?

      • edX would be using nginx ingress

      • Possible Options

        • A: Provide an example chart

        • B: Provide an open chart that edX also consumes

  • Myth: Entire configuration for an IDA can be stored in a given repo

    • Example: Authentication is coupled across repos

      • the operational pieces were in the Ansible playbooks

    • Note: decentralized devstack has a similar issue

  • In addition to Helmcharts, you are also using Consul Templates to pull in secrets from Vault?

    • Yes, those are passed in as values in the Helm charts.

  • Timeline for deprecation?

    • Let’s discuss this together and work together on accelerating and converging on this effort

  • Currently evaluating Tutor as a community-supported deployment mechanism

    • Usual questions: Testing, Devstack, Docker files, etc.

    • It seems to have similar problems underneath, no?

    • For example, should Tutor be the next Devstack?

      • Tutor - make it easy to install

        • Devstack is trying to address the same issue

    • Currently, not a hard blocker since we are pushing images in our current Devstack.

    • edX Architecture team is the newly designated “owner” for Devstack and plans to look into addressing short-term issues as well as long-term plans. Nim will ask the team to watch the recording from today and follow-up with this group in order to converge paths.

Cory’s speaking notes (rough transcript):




*******************
10m edx.org context
*******************

    As most of you already know, edx has over the past 
    few years been transitioning 
    from a monolithic architecture to a 
    microservices architecture
    with separate micro frontends.  

    Something you may or may not know.  We have also been transitioning 
    internally from a 
    centralized devops team where most of the infrastructure work was 
    done by a core team as a service to
    other teams, to a decentralized model where teams own their own 
    applications and infrastructure
    as much as possible and a core team of SREs facilitates this 
    ownership.

    While the centralized model worked well for us early on,
    it became overwhelming as the number of teams and projects grew,
    since the devops team was frequently not
    the people closest to the problem with the most context.

    We found that cross team dependencies were a frequent source of 
    friction with the centralized model
    as team A would need to inject work into the devops backlog, 
    wait for it to be completed before they could
    proceed, and it was difficult to plan for this type of work 
    in advance because it wasn't always
    apparent at the outset if you would need devops assistance
    or not.
    
    In order to minimize these dependencies, the (now) SRE 
    team is tasked with enabling
    each team to own their
    applications from top to bottom; 
    this means frontends, backends, databases, 
    alerting, passwords, everything.

    This is somewhat aspirational at the moment, 
    but it is the guiding star we are driving towards.

    Internally at edx, many of our teams own a frontend or two,
    a django backend application or two, and some components
    of the edx-platform monolith. They work in these repos
    most days and are very familiar with them, but edx
    has something like 450 repos in our github, so it is
    impossible for anyone to REALLY know how everything works.

    As you might imagine, these teams, now mostly decoupled,
    sometimes want to do very different things with their 
    deployment processes, testing, automations,
    use different technologies, etc.

    This is where our centralized configuration repo
    becomes problematic.  Since all the
    applications are coupled together sharing code,
    changes to this repo
    need to work for all applications in all
    contexts, but the people making these changes, 
    the developers on the application teams,
    don't necessarily have all this context as they mostly
    work in silos on their own applications.
    We also accept open source pull requests to this repo,
    which adds to the maintenance burden.
    Any changes to this repo need to work
    for the past two
    open edx releases, the current
    master branch, and they need to work in our sandboxes, devstacks,
    and the past few versions of ubuntu.
    Ideally they 
    would not make any breaking
    changes for the openedx community. 

    Some of the components are 
    developed entirely by the openedx community
    and we don't have the capacity to 
    test these changes 
    (e.g. we might receive 
    pull requests for
    google cloud services, 
    but don't use google cloud ourselves)

    Reviewing and testing changes 
    to this repo, and having confidence that they won't break something, 
    is therefore really hard.




*******************
10m simplifying configuration asym-crypto and remote config (Hermes)
*******************

    Some of our past attempts at solving this include:
    simplifying configuration and adding asym-crypto-yaml
    and remote config

    *******************
    simplifying configuration and adding asym-crypto-yaml
    *******************

    A frequent complaint we received from our developers was that it was
    difficult to determine what their
    configuration settings were, as we used a complex multi-tiered merging
    of YAML configs in the configuration repo
    to build a single .yaml file, or multiple JSON files in the case of edx-platform,
    for each of our apps. Additionally, applications had their own defaults,
    ansible had its own per-app and global defaults, and then we layered
    multiple files on top of those to generate a single file (or multiple files for the LMS)
    for each application. This was to facilitate config sharing across applications.

    We decided to simplify this to just every app having a single yaml file, period. 
    No JSON, and all of the 'default'
    config values would be moved into the application instead of being in the 
    ansible repository.

    To do this we needed to be able to store our secrets in our config files, 
    so we developed asym-crypto-yaml,
    which lets us do inline encryption in our yaml files.
    We were successful in reducing all the apps to using a single yaml file, 
    but were not able to remove all of the old
    configurations from the configuration repo, nor were we able to remove 
    the JSON configurations, because they were needed for old
    edx-platform named releases. 
    
    This was where I really began to appreciate 
    that any significant refactors of the configuration repo require us to
    go through a lengthy deprecation process.


    *******************
    Remote config (aka Hermes)
    *******************

    Once we had all our applications pulling from 
    a single yaml file, in an effort
    to bypass the configuration repo for
    most settings changes a developer
    might want to make, we developed remote config, AKA Hermes.

    This was intended to speed up config deployments 
    as running the entire ansible play
    is slow for simple config changes.

    Hermes is really a simple program: 
    it monitors a remote file, and whenever its ETag changes it 
    runs a shell command.
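That loop can be sketched in a few lines. This is not Hermes’ actual code; the URL and restart command below are made up, and the callables are injected so the loop can be exercised without a real server.

```python
import subprocess
import time
import urllib.request


def fetch_etag(url: str) -> str:
    """HEAD the remote config file and return its ETag header ('' if absent)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag", "")


def watch(get_etag, on_change, max_polls, interval=0.0):
    """Poll get_etag(); invoke on_change() each time the returned value changes.

    In production, get_etag would wrap fetch_etag and on_change would shell
    out to re-download the file, decrypt secrets, and restart the app.
    """
    last = get_etag()
    for _ in range(max_polls):
        time.sleep(interval)
        current = get_etag()
        if current != last:
            on_change()
            last = current


# Hypothetical production wiring (URL and command are invented):
# watch(lambda: fetch_etag("https://config.example.com/app.yml"),
#       lambda: subprocess.run("supervisorctl restart app", shell=True),
#       max_polls=10**9, interval=30)
```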

    In our case, we modified the configuration 
    repo so that, when Hermes is enabled, it modifies sudoers to allow Hermes to 
    restart the application.

    We then configure the bash command that Hermes 
    runs to download the remote file, decrypt the inline secrets
    using asym-crypto-yaml, and restart gunicorn.

    This enables our config changes to our app.yaml 
    file to bypass our deployment pipelines
    (and therefore our ansible runs and complex config merging) 
    for most of the settings our
    developers care about.

    However again we weren't able to factor out all of the 
    old config from configuration, and the nested configuration
    is still used for configuring things like nginx 
    and the machine images that we are deploying, so this still leads
    to a lot of confusion, though the answers are sometimes much simpler.


*******************
Retirement of the configuration repo
*******************

This inability to make changes to the way we ship edx.org without
undergoing a significantly time consuming deprecation process
to support openedx is essentially what has caused us to begin
undergoing an internal deprecation of the configuration repo,
with the intent of *eventually* 
deprecating and ceasing to support it externally.

With that said, we are aware that we need to follow 
the formal deprecation process and cannot just pull the 
rug out from under the openedx community.


*******************
Not using asym-crypto-yaml & remote config
*******************

It is important to note that remote config or 
"hermes" is configured by the ansible code
in configuration, which is being 
deprecated.
Our new kubernetes deployments don't 
actually leverage hermes at all,
and so in all likelihood, we will not 
be using it in the future either. 

In our kubernetes deployments we are storing 
secrets in vault and using consul-template
to render secrets into our application 
yaml files,
and thus no longer have a need to use 
either of these pieces of software as
we can get the same functionality from 
these free hashicorp products.

*******************
So what is next?
*******************

We are moving to a decentralized model where each 
application mostly contains everything it needs
in order to run, this means default configuration 
values, any libraries it needs to install etc.

The plan will probably be pretty familiar and 
boring to you if you have spent much time working with Docker.

We are containerizing our applications by 
putting a Dockerfile in each repo.

The Dockerfile should be openedx centric and should 
include nothing about edx.org.

The repo will contain everything needed to 
run that microservice except the YAML config file.

I would personally like all of the defaults 
to be production centric and for all dev and sandbox environments to
configure everything by manipulating the YAML file.
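That pattern (production-centric defaults baked into the app, overridden by the mounted YAML file) can be sketched as a simple recursive overlay. The setting names below are hypothetical, and a plain dict stands in for the parsed YAML file so the sketch has no YAML dependency.

```python
from copy import deepcopy

# Production-centric defaults shipped inside the application (names invented).
DEFAULTS = {
    "DEBUG": False,
    "DATABASE": {"HOST": "db.internal", "PORT": 5432},
}


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay override values onto the baked-in defaults."""
    result = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


# In a container these would come from the mounted file, e.g.
# yaml.safe_load(open("/edx/etc/app.yml")); inlined here for brevity.
dev_overrides = {"DEBUG": True, "DATABASE": {"HOST": "localhost"}}

settings = merge(DEFAULTS, dev_overrides)
```

With this shape, a dev or sandbox environment only states what differs from production; everything else falls through to the defaults in the app.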

The Dockerfile in the repo will be buildable by 
simply doing 'docker build .'
at the root of the directory, and will produce the open edx image.

I like to think of this as "Just add YAML": 
you will do `docker build .`,
mount your YAML file into the container, and you are good to go.

This allows both our developers and open 
source developers to compartmentalize
when working on a micro
service, so they don't have to ingest the entire edx
ecosystem to make a few changes to one microservice.
Everything you need should be in that repo.



One thing that we really want to do is create 
a clear delineation between what is openedx and edx.org,
so that edx.org can move fast and break stuff,
and so that openedx can be a stable product for your consumption.

The reality is that edx.org has very different 
needs from most operators of openedx.
We operate at a scale that isn't reasonable for most installs. 
And most organizations that operate openedx at a similar scale have
their own infrastructure.

As such, we are deploying our docker images to kubernetes using helm,  
with our config files rendered by using consul-template to load
secrets from vault at deploy time.
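A toy stand-in for that render step: in reality consul-template uses Go template syntax and talks to Vault, but the shape of the operation (substitute secrets into the app’s YAML at deploy time) looks roughly like this. The keys and values below are invented, with a plain dict playing the role of Vault and `string.Template` playing the role of the template language.

```python
from string import Template

# A dict stands in for Vault; a real deployment fetches these at deploy time.
fake_vault = {"DB_PASSWORD": "s3cret", "JWT_SECRET": "abc123"}

# string.Template stands in for consul-template's Go template language
# (the real syntax differs).
config_template = Template(
    "DATABASE:\n"
    "  PASSWORD: $DB_PASSWORD\n"
    "JWT_AUTH:\n"
    "  SECRET_KEY: $JWT_SECRET\n"
)

rendered = config_template.substitute(fake_vault)
```

The rendered file is what gets mounted into the container, so secrets never need to live in the image or the repo.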

But kubernetes is too large for many small installs 
of openedx, so we don't want to force this
technology choice on everyone.

Our helm charts are currently very edx.org specific
and make many assumptions about what our underlying 
infrastructure is in order
to provide a clean experience to our developers working at edx.org

There is currently something of an outstanding 
question of whether we should attempt 
to provide an open source helm chart, or provide 
example manifests for running openedx on kubernetes, but
our current approach is to keep our 
edx.org manifests private and to publish public docker images.

So for the parts of the configuration 
repo that matter to
most installs (the installed libraries 
required to run the
application, the default config values, etc.), 
those will be in the repo in question. 

Then for installing private requirements and 
edx.org-specific artifact customizations,
we plan to ourselves become a consumer of the 
openedx docker images, by building private images that are
derivative of the open source images.

Currently many of our dockerfiles build both a 
public and a '-newrelic' image, which is essentially the beginning of our
private images, but some of our applications, 
like the platform, need to install many plugins 
and additional plugins for the edx.org
install.

The current images in the repos are also derivative 
of upstream ubuntu images; we would eventually 
like these to be based on the python base images. 
They are only ubuntu 
now for parity with the configuration repo's 
package names.

