Decentralized DevStack Team Report - July 2020

We have decided not to go forward with decentralized devstack after doing further evaluation. This documentation is now out of date. See OEP-5 Decisions for more details.

 

Note: This is a copy of the Decentralized Devstack Spike Team Report. It is meant as a historical document and is no longer completely up to date with the current design.

 

Decentralized Devstack Spike Team Report
Team: Adam Blackwell, Bill DeRusha, Diana Huang, Kyle McCormick
Summary
Outcomes
85% Development Environment Setup Speed Improvement
Improved Troubleshooting Time
Things We Tried
Things We Didn't Try
Action Items
Summary
Over the week of July 20th 2020, the Decentralized Devstack Spike Team tested the viability of the technical approaches proposed in the Devstack 2.0 One-pager Memo. Decentralizing devstack aligns with our ownership strategy by allowing code and repo owners to have more control over their own development environments and development workflows. Allowing developers to develop in their own apps without having to maintain an edx-platform development environment means that the pain of dealing with edx-platform development is largely mitigated for much of the engineering organization. The steps required to make that possible also benefit those who are working directly in edx-platform.
As a proof of concept the team created a development environment for the enterprise catalog IDA that has no dependencies on our existing devstack and can be managed entirely within the enterprise-catalog repository. By the end of the week the team had an environment that could be spun up from scratch in less than 5 minutes (an 85 - 90 percent improvement)! The methods described in later sections prove out the approaches and provide performance wins today, but are just first passes that will benefit from further iterations to improve performance and maintainability.
The next steps for this work are to validate the setup with someone from the enterprise-catalog dev team and provide more detailed documentation/training for how to convert a repo to this new development method. Beyond that there are additional investments that can be made depending on the willingness of the business to commit staff to this endeavor.

Outcomes

85% Development Environment Setup Speed Improvement

The enterprise-catalog repo can be fully provisioned and started from scratch in less than 5 minutes. [Video] This was accomplished by improving data provisioning, removing our dependency on Ansible, and standardizing Docker containers for dependencies.

Data Provisioning

Common LMS Test Data and DB Schema Pattern

By using MySQL and Mongo dumps for LMS data, we are able to both speed up provisioning and ensure a more consistent data state across environments. Master and the dumps will be kept in lock step by automation, so developers should be able to rely on simply updating their dumps. There is also nothing preventing people from running the migrations manually, or adding the migration commands to their provision scripts, to catch the case where the dumps are briefly out of sync.
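The dump-first provisioning flow described above can be sketched in a few lines of shell. This is a hypothetical sketch, not the spike's actual script: the dump path, database name, and MySQL host are invented for illustration, and it is shown in dry-run form (it prints the command it would execute rather than running it):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of dump-first provisioning. DUMP_FILE, the database
# name (edxapp), and the mysql host are illustrative assumptions.
DUMP_FILE="${DUMP_FILE:-provision/lms_schema.sql}"

provision_db() {
  if [ -f "$DUMP_FILE" ]; then
    # Fast path: restore the committed schema dump (seconds, not minutes).
    echo "mysql -h mysql -u root edxapp < $DUMP_FILE"
  else
    # Fallback: run migrations from scratch, covering the window where
    # the dump briefly lags behind master's migrations.
    echo "python manage.py migrate"
  fi
}

# Dry run: print the command this environment would execute.
CMD="$(provision_db)"
echo "$CMD"
```

Because automation keeps master and the dumps in lock step, the fallback branch should rarely fire; it exists only for the short windows where they drift.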

Custom Enterprise Catalog Data Creation Pattern

For enterprise catalog specific data, there is a reusable pattern for provisioning consistent data across all development environments.
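One way such a pattern might look (hypothetical — the fixture name, model label, and field values below are invented for illustration, not taken from the spike code): commit a Django fixture to the repo and load it during provisioning. Because fixtures with fixed primary keys load idempotently, re-provisioning converges every environment to the same catalog data.

```shell
#!/usr/bin/env sh
# Hypothetical sketch: seed data lives in a committed fixture and is
# loaded on every provision. Model label and values are illustrative.
mkdir -p provision
cat > provision/catalog_seed.json <<'EOF'
[
  {
    "model": "catalog.enterprisecatalog",
    "pk": 1,
    "fields": {
      "title": "Dev Test Catalog",
      "enterprise_uuid": "00000000-0000-0000-0000-000000000001"
    }
  }
]
EOF

# In a real environment this would run inside the app container:
echo "python manage.py loaddata provision/catalog_seed.json"
```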

Container Image Best Practices

Discovery

We were able to leverage and extend existing work to create a Dockerfile that enabled us to:

  1. Run course-discovery as a service for enterprise-catalog to consume.

  2. Provision course-discovery with users quickly.

Platform

We were able to remove the dependency on ansible code from our development docker containers (though production instances still rely on it). This allowed us to optimize some of the layers to improve build times.
There are a few considerations that were made in this decision.

  1. The current Ansible Docker containers err on the side of minimizing configuration drift between environments at the expense of complexity and performance.

  2. The new Dockerfiles opt for improved maintainability and image performance by shedding Ansible, at the risk of needing to port changes between production Ansible and development containers.

  3. We shouldn't need to pay this tax for very long for IDAs, since we are already capable of running Django services using containers on our existing infrastructure (i.e., without Kubernetes). However, we may need to pay this tax for edx-platform for longer due to its complex dependencies (e.g. code-jail).
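As an illustration of the Ansible-free direction, a minimal development Dockerfile for an IDA might look like the following (written here as a shell heredoc; the base image, paths, and port are assumptions, not the actual spike Dockerfile). The key move for image performance, also used in the spike's edx-platform work, is copying requirements before the code so the expensive dependency-install layer is cached across code edits:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of an Ansible-free development Dockerfile for an
# IDA. Base image, file paths, and port are illustrative assumptions.
cat > Dockerfile.dev-sketch <<'EOF'
FROM python:3.8-slim
WORKDIR /app

# Install Python dependencies first: this layer is rebuilt only when
# requirements change, not on every code edit.
COPY requirements/dev.txt requirements/dev.txt
RUN pip install --no-cache-dir -r requirements/dev.txt

# Code changes invalidate only the layers from here down.
COPY . /app

EXPOSE 8000
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
EOF
echo "wrote Dockerfile.dev-sketch"
```

A .dockerignore file excluding test data and Git metadata would typically accompany this, keeping the `COPY . /app` layer small.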

Improved Troubleshooting Time

Much like with the current devstack, this team faced challenges getting the setup to work. Because of the data speed improvements, smaller dependency chain, and the isolated nature of this environment, they were able to iterate quickly to resolve issues without fear of impacting others.
Because the current devstack's Docker images are generated with Ansible, troubleshooting or improving it requires knowledge of two large mono-repos (devstack and configuration). In the new approach, everything lives within the target development repo, or in the Dockerfile of a dependent repo, which made understanding what was going on and how to fix it much easier.

Things We Tried

  • Investigated various ways to produce automated database dumps, including the Bok Choy database caching Jenkins job. In the end we opted to commit a dump directly rather than automate.

  • Looked at the possibility of stripping auth out of the LMS so that it could be used as a slim service that could be used in a distributed way for IDAs.

    • We discovered that there were too many dependencies to handle this within LMS.

    • We also discovered that many/most IDAs needed more LMS features than just auth, like enrollments and edx-enterprise.

    • We did look at creating a stub auth service, but we discovered that the LMS requirements were too broad.

  • Did some work to bring the size of the LMS container image down.

    • Added a .dockerignore file to keep test data, Git data, etc. out of the image.

    • Made the image more cache-friendly.

      • Much of the size comes from the Python and JavaScript packages that edx-platform requires.

      • By installing those packages before copying in code, we make it so the requirements layer needs to be pulled less frequently.

  • Cleaned up centralized devstack provisioning to strip out unnecessary steps.

    • Removed usages of Paver wherever possible.

      • When not possible to remove, optimized calls to Paver to be less time-consuming.

    • These optimizations were included in the LMS decentralized devstack provisioning.

Things We Didn't Try

  • We didn't do any work to get Elasticsearch/searching working for Discovery.

  • Having static assets in a separate container.

  • Rewriting demo data Ansible playbook as a bash script (demo data was brought in via the database dump).

  • Refresh Course Metadata job

    • This required the ecommerce service.

  • Kubernetes in any form.



Action Items
A plan to move forward on these action items is being assembled here: https://openedx.atlassian.net/wiki/spaces/ENG/pages/1685619678/Post-Spike+Action+Plan

We have gathered all of the recommendations from our notes and placed them into three categories based on relative effort.
Immediate Next Steps:
Clean and Test

  1. Clean up & merge edx-platform PR to improve Dockerfile.

  2. Clean up & merge course-discovery PR to improve Dockerfile.

  3. Collect developer feedback from Enterprise

    1. Clean up & merge enterprise catalog spike PR as an alternate dev environment.

    2. Test proof of concept decentralized devstack pattern with an enterprise developer


Adoption Rollout
Action steps for teams looking to adopt decentralized devstack. Teams should be able to use the enterprise catalog code as a template for converting their repo to a decentralized devstack:

  1. Create and build a local Dockerfile & GitHub Action

  2. Create and use a local docker-compose file & .env

  3. Create SQL dumps of the current master branch schema and commit them to the repo

  4. Create provisioning scripts to decentralize all steps required to set up the environment
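The dump-creation step in the rollout above might be sketched as follows (dry-run form: commands are echoed rather than executed, and the service, database, and path names are assumptions for illustration). The idea is to dump a freshly-migrated master database and commit the result:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of producing the committed schema dump from a
# freshly-migrated master database. Names and paths are illustrative.
DB_NAME="${DB_NAME:-enterprise_catalog}"
OUT="${OUT:-provision/${DB_NAME}_schema.sql}"

echo "git checkout master && git pull"
echo "python manage.py migrate"
echo "mysqldump -h mysql -u root --databases $DB_NAME > $OUT"
echo "git add $OUT && git commit -m 'Update schema dump for master'"
```

Automating this sequence (e.g., on a schedule or on merge to master) is what would keep the dumps in lock step with master, as described in the Data Provisioning section.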


Short/Medium Term Wins: Small/medium improvements for big gains

  1. Improve image performance

    1. Smarter caching for requirements in edx-platform Dockerfile [PR ready]

    2. Ensure GitHub Actions are building and publishing production and devstack Docker images for LMS, Discovery & Enterprise Catalog

  2. Create automatic MySQL dumps for all apps

    1. Discovery and enterprise catalog run migrations and would benefit from consistent data loading from dumps

    2. LMS data dumps would benefit from automation to prevent staleness

    3. Dumps can be better managed in S3 rather than committed to the repo

  3. Improve UX of decentralized devstack

    1. Make getting a devstack running a simple, no-nonsense process

  4. Deploy Enterprise Catalog as container to production to avoid drift between container and ansible deployment patterns

  5. Modify/enhance this decentralized pattern to support MFE / plugin development


Long-term Takeaways: Bigger pieces of work whose prioritization would benefit decentralized devstack

  1. OEP-45 adoption.

    1. The OEP recommends having a Dockerfile in each repository, a simple YAML-based configuration scheme, and more.

    2. A lot of time during the spike was spent working on Dockerfiles and fighting with the Python settings-file configuration scheme.

    3. Having Production and Devstack both follow OEP-45 will minimize risk of configuration drift over time.

  2. Decoupling of authentication from the monolith.

    1. Dependency on LMS is mandatory for almost all IDAs.

    2. Mandatory dependency on LMS is a burden for developers.

  3. Removal of old developer tooling from the monolith.

    1. Execution of Paver commands consumes a large percentage of LMS provision and start-up time. (Paver is a Python library for writing script-like tasks in Python; see https://github.com/paver/paver, or search for "paver" in edx-platform or devstack to see how we use it currently. Much as with removing the Ansible dependency, removing Paver and refactoring the scripts that use it should make local development workflows simpler to ramp up on and maintain, assuming Paver doesn't provide something we can't live without.)

    2. Reliance on Paver makes it more difficult to maintain a cache-performant LMS image.