Findings on Monolith Decomposition in the Modulestore Extraction

As we wrap up the attempt to break the modulestore out of edx-platform, this document is a place to record what we've learned about decomposition in general, along with some thoughts on a methodology for doing this kind of work better in the future. A description of the modulestore project is here and my notes about that project are at the beginning of my blog. Many people who are much better versed in the platform than I am are currently engaged in the discussion of how to tame the monolith. Hopefully what we've learned from the modulestore effort can add to that discussion and help inform processes and decisions going forward.

To that end, this doc lists a few items I suggest we think about for this kind of project. Each move is different, and some of these are fairly specific to the type of move I attempted with modulestore: a coder new to edX attempting to move a large, old, foundational platform component with unclear ownership or boundaries. I think we can assume that to be a worst-case scenario. :)

Identify stakeholders

If your team is not the sole consumer of the code, identify the other teams with a stake in it early. Make sure a representative from each is involved in the planning and, ideally, in daily communication as work progresses (attending standups, following daily project documentation updates, blog posts, etc.), so that everyone stays aligned as the project evolves and decisions are made.

Have clear goals

This is pretty obvious, and on one level it is something we did right on the modulestore extraction effort. We had a list of things we wanted to get out of the project, and when it became obvious that we weren't going to achieve many of those goals, we were able to pivot away. On another level, we failed to answer some fundamental questions early, and those answers could have led to different outcomes on the project or to an earlier termination of the work. Using the SMART criteria (Specific, Measurable, Achievable, Relevant, Time-bound) for goal setting could help at the outset of a project like this.

Know *exactly* what you're decoupling and why

Take the time early on to go through each file with the stakeholders and agree on a plan for its contents. This is the single biggest thing that could have changed the course of the modulestore work.

Think about whether or not you need to bring the git history

Carrying over git history can be a drain on time and resources. If all of the files are small, recent, rarely touched, or touched by only a small group of people, it is probably not worth the effort of using a solution like gitgraft. It may be enough to record the commit hashes in the original repository from which you pulled the code.

Plan to refactor, not just lift

What looks on the surface like a simple, clean move can quickly grow as you try to separate organically grown code into more formalized pieces. Leave time in your budget to refactor code to be more general or to break bad dependency chains. Make sure you understand how the code is being tested! Significant portions of the low-level platform test code rely on higher-level code (to load courses, for example) and may need to be substantially refactored or rewritten to break those dependency chains.

Take things in the smallest bites you can

It's tempting to lift a whole directory of files in one go, since that means fewer overall changes and seemingly less opportunity for error. However, my experience has been that in a large, foundational area, moving even one file can have huge follow-on consequences with wide-reaching impacts. At one point, moving about five files in modulestore resulted in a 700+ file commit. It is far easier to track down CI or production issues across many small changes than across a few big ones.
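
Before moving even a single file, it can help to estimate how far the change will ripple. The sketch below is a hypothetical helper, not something from edx-platform or the modulestore work. It uses Python's standard `ast` module to list which files statically import a given package. It assumes Python 3 source that parses cleanly, and it only sees absolute `import` and `from ... import` statements. Relative imports, string-based imports, signals, Celery task names, and pickled data are all invisible to it, so treat the result as a lower bound on the blast radius.

```python
import ast
from pathlib import Path


def imported_modules(source):
    """Return the dotted names a piece of Python source imports.

    For ``from a.b import c`` both ``a.b`` and ``a.b.c`` are recorded, since
    ``c`` may be a submodule; this deliberately over-approximates.
    Relative imports (``from . import x``) are skipped.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            found.add(node.module)
            found.update(f"{node.module}.{alias.name}" for alias in node.names)
    return found


def importers_of(target, sources):
    """Map each path in ``sources`` that imports ``target`` (or a submodule of it)
    to the sorted list of matching imports."""
    hits = {}
    for path, source in sources.items():
        matches = sorted(
            name
            for name in imported_modules(source)
            if name == target or name.startswith(target + ".")
        )
        if matches:
            hits[path] = matches
    return hits


def load_tree(root):
    """Read every .py file under ``root`` into a {path: source} dict."""
    return {str(p): p.read_text(encoding="utf-8") for p in Path(root).rglob("*.py")}


# Made-up file names and contents, for illustration only.
sample = {
    "lms/views.py": "from xmodule.modulestore.django import modulestore\n",
    "cms/tasks.py": "import xmodule.modulestore.mongo as mongo\n",
    "common/util.py": "import json\n",
}
blast_radius = importers_of("xmodule.modulestore", sample)
print(blast_radius)
```

You could run `importers_of` over `load_tree(...)` for a checkout before each small move. That would give a rough, reviewable list of the files a change is likely to touch. It does not replace the manual "out-of-band" checks listed below.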

Check in often

If at any point you don't know the answer to one of these questions, stop everything and figure it out:

  • Why is this chunk moving / staying / changing?
  • Who are the stakeholders?
    • If the stakeholders for a particular piece of the code are different from the overall project stakeholders, seek their input.
    • Have the overall stakeholders changed due to organizational or personnel changes?
  • What is the current technical specification of the chunk I'm moving?
    • If it's not documented, document it! Even a handful of bullet points and a list of inward and outward dependencies is better than nothing, and the process of writing and reviewing that documentation might uncover issues or inform other parts of your plan.
  • Are my changes impacting that specification?
    • If so, are the stakeholders aware of, and in agreement with, that change?
  • What out-of-band code might rely on this chunk?
    • Signals?
    • Cached data (especially pickle!)?
    • Celery tasks?
    • Serialized OLX?
  • What out-of-band code does this rely on that might not be available in the new location?
    • How will you test that code?
  • Is this project still desirable?
    • Have circumstances / architecture changes invalidated your assumptions?
    • Has the scope grown to outweigh the potential benefits?
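
The pickle question above is easy to miss in review, so a concrete illustration may help. Python's `pickle` records a class by its module path and qualified name rather than by value. Moving a class to a new package can therefore leave existing cache entries unloadable, even when every import in the codebase was updated correctly.

The sketch below simulates that with two hypothetical, dot-free module names registered directly in `sys.modules`. Dotted names would also require their parent packages to exist. It then shows one possible mitigation: a `pickle.Unpickler` subclass that overrides `find_class` to redirect the old path to the new one. Whether a shim like this, a cache flush, or a cache-key version bump is the right answer depends on the cache in question. This is an illustration only, not what the modulestore project did.

```python
import io
import pickle
import sys
import types

# Hypothetical module names standing in for "before" and "after" the move.
OLD_PATH = "legacy_modulestore_cache"
NEW_PATH = "extracted_modulestore_cache"


def register_module(name):
    """Create an empty module and register it so pickle can import it."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    return module


class CachedBlock:
    """Stand-in for any object that ends up pickled in a cache."""

    def __init__(self, block_id):
        self.block_id = block_id


# Before the move: the class lives at OLD_PATH and a cache entry is written.
old_module = register_module(OLD_PATH)
CachedBlock.__module__ = OLD_PATH
old_module.CachedBlock = CachedBlock
cached_bytes = pickle.dumps(CachedBlock("block-v1"))

# The move: OLD_PATH disappears and the class now lives at NEW_PATH.
del sys.modules[OLD_PATH]
new_module = register_module(NEW_PATH)
CachedBlock.__module__ = NEW_PATH
new_module.CachedBlock = CachedBlock

# Loading the old entry now fails, because the bytes still name OLD_PATH.
try:
    pickle.loads(cached_bytes)
    plain_load_failed = False
except ModuleNotFoundError as exc:
    plain_load_failed = True
    print(f"stale cache entry: {exc}")


class RelocatingUnpickler(pickle.Unpickler):
    """Unpickler that resolves classes pickled under old module paths."""

    RENAMES = {OLD_PATH: NEW_PATH}

    def find_class(self, module, name):
        # Swap in the new module path before the normal lookup runs.
        return super().find_class(self.RENAMES.get(module, module), name)


def load_relocated(data):
    """Unpickle ``data`` using the rename map above."""
    return RelocatingUnpickler(io.BytesIO(data)).load()


restored = load_relocated(cached_bytes)
print(type(restored).__module__, restored.block_id)
```

Overriding `find_class` is the same hook the `pickle` documentation uses for restricting which globals may be loaded, so it is a natural place for a rename map. The trade-off is that the shim must live somewhere every reader of the cache imports, which is itself a dependency to plan for.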

Communication is key

One place where the modulestore move fell down was my communication with the wider edX audience. Having stakeholders in daily contact would resolve some of that, but when moving foundational components there are likely to be a great many people with a vested interest in the move. Blogs and other mass-distribution channels are nice, but regular updates at something like the engineering all-hands would do more to keep interested parties up to date, give them a chance to ask questions, and help them plan around potentially disruptive deploys or incoming changes that may require mass rebasing. Part of why my communication failed in the modulestore move was that the frequent changes of direction gave me little confidence that anything I shared would stay correct or relevant for very long. I assumed that no information was better than bad or conflicting information, which may have created a feedback loop that prevented better solutions from being found sooner. Better planning should result in higher confidence and make more useful communication possible.