How Not To Move Modulestore

Julia Eskew and I are wrapping up an attempt to separate the modulestore and "closely related" code from edx-platform, and to learn what it takes to tease code out of the platform monolith. This document is one of two primary artifacts from that attempt and should shed some light on the process in case we wish to try again in the future. The shortest version of this tale is that we did not succeed in moving the modulestore code at this time, but we did learn a lot about extracting code from edx-platform and about how edX can better handle projects like this in the future.

Why did we choose to try to move the modulestore?

At the beginning of the extraction project I was still fairly new to edX and the platform, which put me in a good position to take on a fairly large project without many other responsibilities or distractions. The team was ramping down the continuous integration and asynchronous processing initiatives, so spending the time to bring me up to speed on those seemed like a waste. At the time there was a push to make headway against some of the platform's technical debt by forcing a separation of concerns between the large-scale platform components. A rough plan was to separate CMS, LMS, modulestore, and possibly more into their own Git repositories. This seemed like a logical way to break the ties between those components where they existed, and to produce smaller, more manageable code bases with faster test times, fewer circular dependencies, and a greater ability to do things like Django upgrades in smaller chunks.

I took on the project with an exploration into whether we could achieve the following goals:

Reduce platform test times
Provide a foundation for future-split CMS / LMS to use
Move any common dependencies of modulestore and edx-platform to a new platform-core repo
Learn how to detangle code from the monolith
Build on the work Nimisha Asthagiri and Julia Eskew did in previous hackathons to identify and clean up circular and upstream dependencies
Explicitly NOT refactor anything more than necessary to make the code extraction clean
(Optional) Carry Git history for existing files to a new repository

What did we try?

In my initial investigations I attempted to find "optimal seams" for extracting modulestore that would offer a move with the fewest potentially breaking changes. Part of this was figuring out what dependencies modulestore had on external and internal code. I set up some skeleton repositories for platform-core and modulestore in my own GitHub space and experimented with moving code from platform while keeping only the relevant history (to avoid starting off with a 2 GB repo and carrying history for things like numpy and scipy that may cause licensing issues). This experiment resulted in the gitgraft tool that now lives in repo-tools and can be a resource for future work of this type. Then I began trying to lift code into platform-core and modulestore.
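As an illustration of the dependency survey described above, here is a minimal sketch of one way to list what a directory of Python code imports. It is not the tooling we actually used; the function names imported_modules and dependency_counts are hypothetical, and it relies only on the standard library's ast module. Files that fail to parse (for example, Python 2 syntax under a Python 3 interpreter) are skipped rather than reported.

```python
import ast
from collections import Counter
from pathlib import Path


def imported_modules(source):
    """Return the dotted names of modules imported absolutely by one source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 is a relative import, which stays inside the package,
            # so it is not counted as an outside dependency here.
            names.add(node.module)
    return names


def dependency_counts(root):
    """Count how many files under root import each top-level package."""
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        try:
            modules = imported_modules(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files this interpreter cannot read or parse
        # Reduce "xmodule.modulestore.django" to "xmodule", once per file.
        counts.update({name.split(".")[0] for name in modules})
    return counts
```

Calling dependency_counts("common/lib/xmodule/xmodule/modulestore").most_common(20) from a checkout would give a rough ranking of the packages that directory leans on. It only sees static import statements, so anything loaded dynamically or through Django settings would need a separate look.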

Attempt 1 - Everything the light touches is our modulestore.

"What about that shadowy place?"
"That's xmodule. It's beyond our borders. You must never code there, Simba."

In my first attempt at the move I made a horrible error: I assumed that the large number of ties between common/lib/xmodule/xmodule/modulestore and its parent directory (especially in testing) made the entire /common/lib/xmodule/xmodule/ tree my target for moving. I was quickly set straight that the xmodules and xblocks in that directory should not be a part of the modulestore. I spent some time investigating how we might move those files along with modulestore at first and then move them to a separate repo of their own later on. In the end, having a circular dependency of modulestore ↔ xmodule / block repo, even just for testing, didn't seem right. It also increased the scope of work significantly, and several people had warned us (rightly so!) that this project could become a quagmire.

Attempt 2 - Move everything except the xblocks / xmodules to a new repo

I started with my previous branch and began removing anything that I could identify as an xblock or xmodule, greatly reducing the scope of my "closely related" code. Alas, this also quickly reduced my ability to run tests, since almost all of them rely explicitly on exported courses / course factories that use various block and module content. For a while I worked around this by adding back a few modules that seemed essential to testing, then rewriting all of the test data to use only those, as well as a new "just for testing" xblock that I cobbled together. Details of this phase are here, but at the end of it I had a version that would successfully run almost all of the existing unit tests inside the modulestore repo with fairly high coverage, something like 75-80%. There was a lot of great feedback on this path, largely that the scope of what I had pulled in to make the tests pass was still too large, along with some concerns (which I raised and shared) that static assets and front-end code were living in this repo to support those modules.

Attempt 3 - Move only the modulestore directory, strip out everything that's relied on elsewhere

This extremely short-lived attempt died before any real coding started. It focused on whittling down the definition of "modulestore" to pretty much a thin layer of abstraction around Mongo, with almost no higher-level concepts like "courses" involved. A cursory glance through the code's dependencies made it clear that even at the lowest levels modulestore relies on things like course keys, and it simply isn't abstract enough in its current state to support this kind of removal without a substantial, and probably unwarranted, rewrite. This did lead to lots of discussion about the definition of "modulestore" and to some written-down thoughts on the subject from Eskew and myself.

Attempt 4 - Move inside the platform repo, create a new "courselike store" to house only modulestore and things which are related to courses

Around this time Eskew came on full-time. In our soul-searching for an answer to the great question of "what actually is the modulestore?" we came up with a plan:

Leave the modulestore in edx-platform.
Clarify its role by cutting ties to as much upstream code as possible.
Re-brand it something like "coursestore" or "courselikestore" since in the end what modulestore does is store serialized course and library data.

In this plan we would keep or move some higher-level course-related code (e.g. CourseSummary), and move modulestore and that code to /openedx/core/lib/. Some progress was made in this direction, with the merging of PRs that decoupled some bits of xmodule from modulestore. However, subsequent discussions of this plan brought into sharp focus:

The fact that the Platform team simply did not know enough about how modulestore is being used to take on this work in the first place.
The fact that our ideas of sensible seams did not make sense to the folks who actually use modulestore.

At this point we decided to drop the project in favor of other high priority work and focus on writing up our learnings. This decision was informed by the fact that two of the original big goals of the project had grown stale:

Dave Ormsbee's work on Django signals had already recouped much of the test time we expected to save in the extraction.
The design for platform decomposition work had moved away from separate repositories in general, and specifically for CMS & LMS.

The remaining work of value (paying down tech debt by reducing circular dependencies and achieving better separation of concerns) can be done in small pieces, in place, by the stakeholders if desired.

Retrospective

We have not had a retrospective on the project yet; once we do I'll update this doc with a link to the results.