Modulestore Introduction
What is it?
People tend to use the term "modulestore" imprecisely. In various contexts, it can mean:
- The code that reads and writes course content to and from MongoDB. This definition includes the less well known Contentstore.
- The classes that implement the ModuleStore interfaces (Split, Draft, etc. – more on that later).
- The entire XModule/XBlock courseware system that renders course content to users.
Do I Really Need It?
If you're considering modulestore, you probably want to query, display, or edit course team authored content – problems, HTML, and videos that course teams have authored or imported into Studio. What is and is not "course content" isn't always clear, especially since other systems exist that duplicate (and sometimes conflict with) course data stored in the modulestore.
Here are some other systems that might get you that data in a more convenient or performant way.
Alternative System | Interface | Storage | Use Cases |
---|---|---|---|
Course Overviews | Use the class methods: | MySQL | Originally created as a read-through cache for commonly used metadata, the These fields are mostly related to displaying information about the course, its schedule, certificates, and marketing data. Access to this data for a single course or small set of courses is very cheap relative to the other options, so it should be the first thing you check. If you're creating a new table that needs a foreign key to a Do not do full table scans of Course Overviews to look for or collect certain values. This table has an entry for every course ever published, including CCX courses, and can be in the 10K row range. |
Course Blocks API | REST API | S3 | Originally intended for the mobile use case, this is the API you want to query for content fields within one course. It applies user access controls, and will properly show the content for a given cohort or release dates that apply to that user. Some example use cases: If you have an XBlock or XModule, it's also possible to get custom pre-computed data into the The Course Outline view in the LMS uses the Python interface of this API. |
Course Block Transformers | see | S3 | This is the lower level infrastructure that the Course Blocks API is built on. Use this when you want to collect and manipulate authored content data across an entire course. The idea is that you create a Transformer class that is invoked for two phases: an asynchronous "collect" phase triggered during course publish, and a synchronous "transform" phase. You do expensive data access and calculations during "collect", and then do fast, per-user manipulations of the course DAG and fields during "transform". See this page documenting current and future Transformers. A long-term goal is for XBlock/XModule field data for the LMS to be backed by this system, so that all inheritance computations can be done during the collect phase, and it would be possible to load the data for a given problem or video without having to load the entire set of ancestor nodes. |
Course Catalog API (a.k.a. Course Discovery) | https://prod-edx-discovery.edx.org | MySQL, Elasticsearch | This is the authoritative place for data relating to finding and enrolling in a course. It's what's queried when you go to the marketing site and search for a course or look at its "Course About" page details: who is teaching it, what the schedule is, what language it is given in, etc. Because it's backed by Elasticsearch, it is much more efficient and flexible when searching across the system for courses with particular attributes. Making synchronous calls to this service can be expensive, so if the data exists in Course Overviews and you can live with the delay, it's better to use that. |
CourseGraph | https://coursegraph.edx.org/browser/ (Neo4j DB, not part of edx-platform, requires VPN access) | Neo4j | This is used by support/sustaining teams and occasionally services staff to answer questions about course team authored content across edx.org. Think of this as the Modulestore content field data shoved into a Neo4j database on a periodic batch process. There are many example queries available, including: CourseGraph only sees what Modulestore sees, and some of that data is not canonical. The following are data for which the canonical answer is in the Course Discovery/Catalog API: |
Local App Models | Django ORM, django-storages | MySQL, S3 | A common pattern is to listen for Another scenario where you might use the "listen, extract, and store locally" pattern is search indexing. We used to do this for course content using edx-search, though I'm not sure if it's actually working these days. |
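
The collect/transform split described for Course Block Transformers in the table above can be illustrated with a toy, self-contained sketch. It does not use the real Transformer classes from edx-platform. The class name StaffOnlyTransformer, the dict-based course structure, and the field name visible_to_staff_only as used here are all hypothetical, chosen only to show where the expensive work and the cheap per-user work would go.

```python
# Toy illustration only: this is NOT the edx-platform Transformer API.
# StaffOnlyTransformer, COURSE, and the dict layout are hypothetical.

# A tiny course tree: block id -> authored fields and children.
COURSE = {
    "course": {"children": ["chapter1", "chapter2"], "visible_to_staff_only": False},
    "chapter1": {"children": ["problem1"], "visible_to_staff_only": False},
    "chapter2": {"children": ["problem2"], "visible_to_staff_only": True},
    "problem1": {"children": [], "visible_to_staff_only": False},
    "problem2": {"children": [], "visible_to_staff_only": False},
}


class StaffOnlyTransformer:
    @classmethod
    def collect(cls, course):
        """Expensive per-course phase (would run asynchronously on publish).

        Walks the whole tree once and records, for every block, whether it
        or any ancestor is staff-only, so later reads need no ancestors.
        """
        collected = {}

        def walk(block_id, inherited):
            hidden = inherited or course[block_id]["visible_to_staff_only"]
            collected[block_id] = {
                "children": list(course[block_id]["children"]),
                "staff_only": hidden,
            }
            for child_id in course[block_id]["children"]:
                walk(child_id, hidden)

        walk("course", False)
        return collected

    def transform(self, user_is_staff, collected):
        """Cheap per-user phase (would run synchronously per request)."""
        if user_is_staff:
            return set(collected)
        return {
            block_id
            for block_id, data in collected.items()
            if not data["staff_only"]
        }


# Collect once per publish, then transform once per user.
collected = StaffOnlyTransformer.collect(COURSE)
student_view = StaffOnlyTransformer().transform(False, collected)
staff_view = StaffOnlyTransformer().transform(True, collected)
```

In this sketch the inheritance computation happens entirely in collect, so transform can answer "can this user see problem2?" without touching chapter2 again, which is the property the long-term goal in the table is after.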
When Nothing Else Will Do
Situations that require you to use the Modulestore include:
- Getting a reference to an XBlock or XModule with the intention of rendering its views and invoking its handlers.
- Reading or writing course static assets, such as images or other file uploads. This is done via the contentstore interface.
- Changes to the import/export process.
- Adding a new type of Course (e.g. Content Libraries, CCX).
Hello ModuleStore!
    from xmodule.modulestore.django import modulestore
    from opaque_keys.edx.keys import CourseKey, UsageKey
    from opaque_keys import InvalidKeyError

    # Old style IDs -- note the lack of Course Run info in usage_id
    # course_id = "edX/DemoX.1/2014"
    # usage_id = "i4x://edX/DemoX.1/problem/466f474fa4d045a8b7bde1b911e095ca"

    # New style IDs -- usage_id has full Course Run info
    course_id = "course-v1:edX+DemoX+Demo_Course"
    usage_id = "block-v1:edX+DemoX+Demo_Course+type@problem+block@d2e35c1d294b4ba0b3b1048615605d2a"

    # Parse the Course ID.
    try:
        # This will return a SlashSeparatedCourseKey (old) or CourseLocator (new)
        # Always use CourseKey.from_string() when parsing Course IDs.
        course_key = CourseKey.from_string(course_id)
    except InvalidKeyError:
        # Do some error handling here -- this is just completely made up
        raise ValueError("Could not parse course_id {}".format(course_id))

    # Parse the Usage ID
    try:
        # This will return a Location (old) or BlockUsageLocator (new)
        # Always use UsageKey.from_string() when parsing Usage IDs.
        unmapped_usage_key = UsageKey.from_string(usage_id)

        # The map_into_course() call is not necessary for BlockUsageLocators, but
        # we do it to maintain compatibility with old style usage keys.
        usage_key = unmapped_usage_key.map_into_course(course_key)
    except InvalidKeyError:
        # Do some error handling here -- this is just completely made up
        raise ValueError("Could not parse usage_id {}".format(usage_id))

    # This initializes a process global -- future calls to modulestore() will
    # just return references to the same global.
    ms = modulestore()

    # Get a single CapaDescriptor (a Capa problem, like multiple choice).
    # This object has all its XModule content fields, but not the user ones.
    problem = ms.get_item(usage_key)

    # Query the Modulestore for all sequentials in the Course.
    sequences = ms.get_items(course_key, qualifiers={'category': 'sequential'})

    # Get the root CourseDescriptor
    course = ms.get_course(course_key)

    # List of child usage keys (Locations/BlockUsageLocators) for the chapters.
    course.children

    # Iterate through the descriptors for those children instead:
    for chapter in course.get_children():
        print(chapter.location, chapter.display_name)
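
To make the difference between the two ID styles in the snippet above concrete, here is a small string-splitting sketch. It is not how the opaque_keys library parses keys, and real code should always go through CourseKey.from_string() and UsageKey.from_string(). The function name split_usage_id and the returned dict layout are hypothetical and exist only for illustration.

```python
def split_usage_id(usage_id):
    """Split a usage ID string into its visible parts (illustration only)."""
    if usage_id.startswith("i4x://"):
        # Old style: i4x://org/course/block_type/block_id -- no course run.
        org, course, block_type, block_id = usage_id[len("i4x://"):].split("/")
        return {
            "org": org,
            "course": course,
            "run": None,
            "block_type": block_type,
            "block_id": block_id,
        }
    if usage_id.startswith("block-v1:"):
        # New style: block-v1:org+course+run+type@block_type+block@block_id
        parts = usage_id[len("block-v1:"):].split("+")
        org, course, run, type_part, block_part = parts
        return {
            "org": org,
            "course": course,
            "run": run,
            "block_type": type_part.split("@", 1)[1],
            "block_id": block_part.split("@", 1)[1],
        }
    raise ValueError("Unrecognized usage_id {}".format(usage_id))


old = split_usage_id(
    "i4x://edX/DemoX.1/problem/466f474fa4d045a8b7bde1b911e095ca"
)
new = split_usage_id(
    "block-v1:edX+DemoX+Demo_Course+type@problem+block@d2e35c1d294b4ba0b3b1048615605d2a"
)
```

The old-style result has no run, which is why the snippet above calls map_into_course() to attach the course run from the course key.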
Why Do People Fear It?
The XBlock runtime is conceptually pretty straightforward. An XBlock is instantiated. It has handlers, views, fields, and access to certain runtime services (field data is actually one of those services). However, a number of things have contributed to complexity in this system.
- There are multiple Modulestores, with an intimidating class diagram of relationships. Most prominently, we have both "Old Mongo" and "Split Mongo", which work on different opaque key types and store data very differently. Their performance characteristics are very different. Most code works with both (and should be tested against both), but some features only work for Split (e.g. CCX). In some cases Split actually maintains some undesired Old Mongo behavior for Studio UI compatibility, like auto-publishing changes. Just getting a single, consistent "published" signal out of both ModuleStores involved a lot of work.
- It supports both XBlocks and XModules, which work very differently. A certain amount of proxy glue magic was created to make the systems gel together, and certain things like asset management are just completely different between the two systems. This proxying logic not only makes the system harder to understand, but can also be a source of difficult-to-debug memory leaks.
- The XModule rendering system developed piecemeal over time, leading to cases of "let's just add one more arg". When module_render.py invokes the constructor for LmsModuleSystem, it calls it with 28 arguments.
- The XBlock runtimes dynamically mix in certain classes to enable things like field data inheritance. You may make MySpecialBlock, but the LMS is executing MySpecialBlockWithMixins, and it's not clear to most developers what that means.
- The data store handles both read and write use cases, and doesn't do a great job at either of them. There are optimizations such as Bulk Operations and the metadata inheritance cache, but it's not always clear when you need to use them, and it's easy to do something that accidentally tanks performance.
- Tests that use the ModuleStore tend to be slow, mostly because course creation is expensive, on the order of 0.25s to 0.5s even for an empty course. That may not seem like much, but multiplied across thousands of tests, it starts to add up. We've created ModuleStoreTestCase and SharedModuleStoreTestCase to address some of the isolation issues, and we suppress publish signals to help with performance, but this all adds up to testing complexity. Am I testing for both Old and Split Mongo modulestores? Can I reuse the same course, or do I have to recreate it each time? Do publish signals matter for my tests? Etc.
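
The dynamic mixing described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration of the general technique (building a new class at runtime with type()), not the XBlock runtime's actual code. The InheritanceMixin class, its get_inherited method, and the mix helper are hypothetical names invented for this example.

```python
class MySpecialBlock:
    """Stand-in for a block class written by a developer."""

    def __init__(self, fields, parent=None):
        self.fields = fields
        self.parent = parent


class InheritanceMixin:
    """Hypothetical mixin: look a field up on ancestors when unset locally."""

    def get_inherited(self, name, default=None):
        block = self
        while block is not None:
            if name in block.fields:
                return block.fields[name]
            block = block.parent
        return default


def mix(cls, mixins):
    """Build a new class at runtime that combines cls with the mixins."""
    return type(cls.__name__ + "WithMixins", (cls,) + tuple(mixins), {})


# The developer wrote MySpecialBlock, but the runtime instantiates this:
MixedBlock = mix(MySpecialBlock, [InheritanceMixin])

chapter = MixedBlock({"due": "2014-06-01"})
problem = MixedBlock({}, parent=chapter)
inherited_due = problem.get_inherited("due")
```

Here the problem instance answers get_inherited("due") with its parent's value even though MySpecialBlock itself defines no such method, which is the kind of behavior that surprises developers who only read their own class.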
The XBlock/XModule runtime is in many ways the core of our offering, since that's how students interact with instructional material. But it's accumulated a lot of compatibility hacks and incrementally tacked on complexity over the years, and teams have generally found it easier to move things out of the Modulestore rather than to fix systemic issues in it.