Meeting Info

Date:
Recording: https://discuss.openedx.org/t/lets-talk-about-the-native-installation/3269/14
Presenter: Cory Lee
Facilitator:

Notes

Rationale

Current State

Future with Containers

Discussion

Cory’s speaking notes (rough transcript):




*******************
10m edx.org context
*******************

    As most of you already know, edx has, over the past
    few years, been transitioning
    from a monolithic architecture to a
    microservices architecture
    with separate micro frontends.

    Something you may or may not know: we have also been transitioning
    internally from a
    centralized devops model, where most of the infrastructure work was
    done by a core team as a service to
    other teams, to a decentralized model where teams own their own
    applications and infrastructure
    as much as possible and a core team of SREs facilitates this
    ownership.

    While the centralized model worked well for us early on,
    it became overwhelming as the number of teams and projects grew,
    because the devops team was frequently not
    the people closest to the problem with the most context.

    We found that cross-team dependencies were a frequent source of
    friction with the centralized model:
    team A would need to inject work into the devops backlog and
    wait for it to be completed before they could
    proceed, and it was difficult to plan for this type of work
    in advance because it wasn't always
    apparent at the outset whether you would need devops assistance
    or not.
    
    In order to minimize these dependencies, the (now) SRE
    team is tasked with enabling
    each team to own their
    applications from top to bottom:
    frontends, backends, databases,
    alerting, passwords, everything.

    This is somewhat aspirational at the moment, 
    but it is the guiding star we are driving towards.

    Internally at edx, many of our teams own a frontend or two,
    a django backend application or two, and some components
    of the edx-platform monolith. They work in these repos
    most days and are very familiar with them, but edx
    has something like 450 repos in our github, so it is
    impossible for anyone to REALLY know how everything works.

    As you might imagine, these teams, now mostly decoupled,
    sometimes want to do very different things with their
    deployment processes, testing, and automations,
    and to use different technologies.

    This is where our centralized configuration repo
    becomes problematic.  Since all the
    applications are coupled together sharing code,
    changes to this repo
    need to work for all applications in all
    contexts, but the people making these changes, 
    the developers on the application teams,
    don't necessarily have all this context as they mostly
    work in silos on their own applications.
    We also accept open source pull requests to this repo.

    Any changes to this repo need to work
    for the past two
    open edx releases, the current
    master branch, our sandboxes, our devstacks,
    and the past few versions of ubuntu.
    Ideally they
    would not make any breaking
    changes for the openedx community.

    Some of the components are
    developed entirely by the openedx community
    and we don't have the capacity to
    test changes to them
    (for example, we might receive
    pull requests for
    google cloud services,
    but we don't use google cloud ourselves).

    Reviewing and testing changes
    to this repo and having confidence that they won't break something
    is therefore really hard.




*******************
10m simplifying configuration, asym-crypto-yaml, and remote config (Hermes)
*******************

    Some of our past attempts at solving this include:
    simplifying configuration and adding asym-crypto-yaml
    and remote config

    *******************
    simplifying configuration and adding asym-crypto-yaml
    *******************

    A frequent complaint we received from our developers was that it was
    difficult to determine what their
    configuration settings were, as we used a complex multi-tiered merging
    of YAML configs in the configuration repo
    to build a single .yaml file, or multiple json files in the case of edx-platform,
    for each of our apps. Additionally, applications had their own defaults,
    ansible had its own per-app and global defaults, and then we layered
    multiple files on top of those to generate a single file (or multiple files for the LMS)
    for each application. This was to facilitate config sharing across applications.

    We decided to simplify this to just every app having a single yaml file, period.
    No json, and all of the 'default'
    config values would be moved into the application instead of living in the
    ansible repository.

    To do this we needed to be able to store our secrets in our config files,
    and to that end we developed asym-crypto-yaml,
    which lets us do inline encryption of values in our yaml files.
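
    To make the idea concrete, here is a minimal sketch of what inline
    encryption in a yaml config can look like. This is an illustration of
    the concept only, not asym-crypto-yaml's actual API; the !Encrypted tag,
    key paths, and file names below are assumptions.

```python
# Minimal sketch (illustration only, not asym-crypto-yaml's real API):
# values tagged !Encrypted in the yaml are RSA-decrypted at load time.
import base64
import yaml
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

with open("private.pem", "rb") as f:  # hypothetical key location
    private_key = serialization.load_pem_private_key(f.read(), password=None)

def _decrypt(loader, node):
    ciphertext = base64.b64decode(loader.construct_scalar(node))
    plaintext = private_key.decrypt(
        ciphertext,
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )
    return plaintext.decode("utf-8")

yaml.SafeLoader.add_constructor("!Encrypted", _decrypt)

# app.yml might then contain:
#   DATABASE_PASSWORD: !Encrypted |
#     <base64 ciphertext>
with open("app.yml") as f:
    config = yaml.safe_load(f)  # secrets come out as plaintext strings
```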

    We were successful in reducing all the apps to using a single yaml file,
    but were not able to remove all of the old
    configurations from the configuration repo, nor were we able to remove
    the JSON configurations, because they were needed for older
    named releases of edx-platform.
    
    This was where I really began to appreciate
    that any significant refactor of the configuration repo requires us to
    go through a lengthy deprecation process.


    *******************
    Remote config (aka Hermes)
    *******************

    Once we had all our applications pulling from
    a single yaml file, we developed remote config, AKA Hermes,
    in an effort
    to bypass the configuration repo for
    most settings changes a developer
    might want to make.

    This was intended to speed up config deployments 
    as running the entire ansible play
    is slow for simple config changes.

    hermes is a simple program really:
    it monitors a remote file, and whenever its ETag changes it
    runs a shell command.

    In our case, we modified the configuration
    repo so that, when hermes is enabled, it modifies sudoers to allow
    hermes to restart the application.

    We then configure the bash command that hermes
    runs to download the remote file, decrypt the inline secrets
    using asym-crypto-yaml, and restart gunicorn.
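
    The watch loop is easy to picture. The sketch below is a rough
    approximation of that idea, not hermes itself; the URL, poll interval,
    and command are placeholders.

```python
# Rough approximation of the hermes idea (not the real implementation):
# poll a remote file's ETag and run a shell command whenever it changes.
import subprocess
import time

import requests

CONFIG_URL = "https://config.example.com/app.yml"  # placeholder URL
# Placeholder for the configured command, i.e. "download the file,
# decrypt the inline secrets, restart gunicorn" in our case.
ON_CHANGE_CMD = "refresh-config-and-restart-app"   # hypothetical script

last_etag = None
while True:
    etag = requests.head(CONFIG_URL, timeout=10).headers.get("ETag")
    if etag and etag != last_etag:
        subprocess.run(ON_CHANGE_CMD, shell=True, check=False)
        last_etag = etag
    time.sleep(30)
```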

    This enables config changes to our app.yaml
    file to bypass our deployment pipelines
    (and therefore our ansible runs and complex config merging)
    for most of the settings our
    developers care about.

    However, again we weren't able to factor out all of the
    old config from configuration, and the nested configuration
    is still used for configuring things like nginx
    and the machine images that we are deploying, so this still leads
    to a lot of confusion, though the answers are sometimes much simpler.


*******************
Retirement of the configuration repo
*******************

This inability to make changes to the way we ship edx.org without
undergoing a significantly time-consuming deprecation process
to support openedx is essentially what has caused us to begin
an internal deprecation of the configuration repo,
with the intent of *eventually*
deprecating and ceasing to support it externally.

With that said, we are aware that we need to follow 
the formal deprecation process and cannot just pull the 
rug out from under the openedx community.


*******************
Not using asym-crypto-yaml & remote config
*******************

It is important to note that remote config, or
"hermes", is configured by the ansible code
in configuration, which is being
deprecated.
Our new kubernetes deployments don't
actually leverage hermes at all,
and so in all likelihood we will not
be using
it in the future either.

In our kubernetes deployments we are storing
secrets in vault and using consul-template
to render secrets into our application
yaml files,
and thus we no longer have a need for
either of these pieces of software, as
we can get the same functionality from
these free hashicorp products.
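
To make that flow concrete, here is a rough sketch of the same idea in
Python using the hvac client. This is an illustrative stand-in only; we
actually use consul-template, and the vault address, secret path, and key
names below are placeholders.

```python
# Illustrative stand-in for the consul-template + vault flow (we actually
# use consul-template): read a secret from vault and render it into the
# application's yaml config before deployment.
import hvac
import yaml

client = hvac.Client(url="https://vault.example.com", token="...")  # placeholders
secret = client.secrets.kv.v2.read_secret_version(path="myapp")     # hypothetical path
db_password = secret["data"]["data"]["DATABASE_PASSWORD"]           # hypothetical key

with open("app.yml") as f:
    config = yaml.safe_load(f)

config["DATABASE_PASSWORD"] = db_password

with open("app.yml", "w") as f:
    yaml.safe_dump(config, f)
```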

*******************
So what is next?
*******************

We are moving to a decentralized model where each
application mostly contains everything it needs
in order to run: default configuration
values, any libraries it needs to install, etc.

The plan will probably be pretty familiar and
boring to you if you have spent much time working with Docker.

We are containerizing our applications by 
putting a Dockerfile in each repo.

That Dockerfile should be openedx-centric and should
include nothing about edx.org.

The repo will contain everything needed to 
run that microservice except the YAML config file.

I would personally like all of the defaults 
to be production centric and for all dev and sandbox environments to
configure everything by manipulating the YAML file.

The Dockerfile in the repo will be buildable by
simply doing 'docker build .'
at the root of the directory, and will produce the open edx image.

I like to think of this as "Just add YAML":
you will do `docker build .`,
mount your YAML file into the container, and you are good to go.
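
From the application's side, the contract is just "read one YAML file at
startup." Here is a minimal sketch of what that can look like in a
Django-style settings module, assuming a hypothetical CONFIG_FILE
environment variable and mount path (the real variable names differ per
service):

```python
# Minimal sketch of the "Just add YAML" contract from the app's side.
# The env var name and default path are assumptions for illustration.
import os

import yaml

CONFIG_FILE = os.environ.get("CONFIG_FILE", "/edx/etc/app.yml")

with open(CONFIG_FILE) as f:
    _config = yaml.safe_load(f)

# Overlay the mounted YAML values onto this module's production defaults.
vars().update(_config)
```

Running it would then be something along the lines of
`docker run -v $(pwd)/app.yml:/edx/etc/app.yml ...` (paths hypothetical).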

This allows both our developers and open
source developers to compartmentalize
when working on a microservice,
so they don't have to ingest the entire edx
ecosystem to make a few changes to one microservice.
Everything you need should be in that repo.



One thing that we really want to do is create
a clear delineation between what is openedx and what is edx.org,
so that edx.org can move fast and break stuff,
and so that openedx can be a stable product for your consumption.

The reality is that edx.org has very different 
needs from most operators of openedx.
We operate at a scale that isn't reasonable for most installs. 
And most organizations that operate openedx at a similar scale have
their own infrastructure.

As such we are deploying our docker images to kubernetes using helm,
with our config files rendered by consul-template, which loads
secrets from vault at deploy time.

But kubernetes is too large for many small installs
of openedx, so we don't want to force this
technology choice on everyone.

Our helm charts are currently very edx.org specific
and make many assumptions about what our underlying 
infrastructure is in order
to provide a clean experience to our developers working at edx.org

There is currently something of an outstanding
question of whether we should attempt
to provide an open source helm chart, or provide
example manifests for running openedx on kubernetes, but
our current approach is to keep our
edx.org manifests private and to publish public docker images.

So the parts of the configuration
repo that matter to
most installs (the installed libraries
required to run the
application, the default config values, etc.)
will be in the repo in question.

Then for installing private requirements and
edx.org-specific artifact customizations,
we plan to ourselves become a consumer of the
openedx docker images by building private images that are
derivative of the open source images.

Currently many of our dockerfiles build both a
public and a '-newrelic' image, which is essentially the beginning of our
private images, but some of our applications,
like the platform, need to install many plugins,
plus additional plugins for the edx.org
install.

The current images in the repos are also derivative
of upstream ubuntu images; we would eventually
like these to be based on the python base images.
They are only ubuntu
now for parity with the configuration repo
package names.