Modulestore Introduction

What is it?

People tend to use the term "modulestore" imprecisely. In various contexts, it can mean:

  • The code that reads and writes course content to and from MongoDB. This definition includes the lesser-known Contentstore.
  • The classes that implement the ModuleStore interfaces (Split, Draft, etc. – more on that later).
  • The entire XModule/XBlock courseware system that renders course content to users.

Do I Really Need It?

If you're considering modulestore, you probably want to query, display, or edit content that course teams have authored or imported into Studio: problems, HTML, and videos. What is and is not "course content" isn't always clear, especially since other systems exist that duplicate (and sometimes conflict with) course data stored in the modulestore.

Here are some other systems that might get you that data in a more convenient or performant way.

Alternative System | Interface | Storage | Use Cases
Course Overviews

CourseOverview Django Model

Use the class methods:

  • get_from_id
  • get_from_ids_if_exists

MySQL

Originally created as a read-through cache for commonly used metadata, the CourseOverview model is now the canonical source for that information from the point of view of the LMS. Much of this data is updated by a Celery task that queries the Modulestore after every publish. Some data is synced over from the Course Discovery/Catalog API on a nightly basis.

These fields are mostly related to displaying information about the course, its schedule, certificates, and marketing data. Access to this data for a single course or small set of courses is very cheap relative to the other options, so it should be the first thing you check.

If you're creating a new table that needs a foreign key to a course_id, this is the table to connect to.

Do not do full table scans of Course Overviews to look for or collect certain values. This table has an entry for every course ever published, including CCX courses, and can be in the 10K row range.

Course Blocks API

REST API

lms.djangoapps.course_api.blocks.api

S3

Originally intended for the mobile use case, this is the API you want to query for content fields within one course. It applies user access controls, so it returns only the content that the user's cohort and the applicable release dates allow them to see. Some example use cases:

  1. I want the "display_name" field for all content of types "video" and "problem".
  2. I want to find all sequences in the course.
  3. I want to display links to every gradable problem in the course.
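As a rough sketch of the first use case, the snippet below builds a REST query URL for the Course Blocks API. The endpoint path, the parameter names (`depth`, `block_types_filter`, `requested_fields`), the host, and the username are assumptions for illustration and should be checked against the API documentation for your release.

```python
from urllib.parse import urlencode

# Assumed endpoint path; confirm against your deployment's URL configuration.
BLOCKS_API_PATH = "/api/courses/v1/blocks/"


def build_blocks_query(lms_root, course_id, username, block_types, fields):
    """Build a Course Blocks API URL for the given block types and fields."""
    params = {
        "course_id": course_id,
        "username": username,  # access rules are applied for this user
        "depth": "all",        # walk the whole course tree
        "block_types_filter": ",".join(block_types),
        "requested_fields": ",".join(fields),
    }
    return lms_root.rstrip("/") + BLOCKS_API_PATH + "?" + urlencode(params)


# Hypothetical host and user, for illustration only.
url = build_blocks_query(
    "https://lms.example.com",
    "course-v1:edX+DemoX+Demo_Course",
    "staff",
    block_types=["video", "problem"],
    fields=["display_name"],
)
print(url)
```

The other two use cases would, under the same assumptions, only change the block type filter and the requested fields.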

If you have an XBlock or XModule, it's also possible to get custom pre-computed data into the student_view_data attribute, so long as that data can be computed at publish time (i.e. it can't have student state in it).

The Course Outline view in the LMS uses the Python interface of this API.

Course Block Transformers

BlockStructureTransformer (see VisibilityTransformer for a short example)

S3

This is the lower level infrastructure that the Course Blocks API is built on. Use this when you want to collect and manipulate authored content data across an entire course. The idea is that you create a Transformer class that is invoked for two phases: an asynchronous "collect" phase triggered during course publish, and a synchronous "transform" phase. You do expensive data access and calculations during "collect", and then do fast, per-user manipulations of the course DAG and fields during "transform".
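The two-phase idea can be illustrated with a toy model. This is not the real BlockStructureTransformer interface: `CollectedStructure`, `StaffOnlyTransformer`, and the `staff_only` field are hypothetical names, and the course is simplified to a tree rather than a DAG.

```python
from dataclasses import dataclass, field


@dataclass
class CollectedStructure:
    """Hypothetical stand-in for the per-course data cached at publish time."""
    children: dict  # block id -> list of child block ids
    collected: dict = field(default_factory=dict)  # block id -> precomputed data


class StaffOnlyTransformer:
    """Toy two-phase transformer; not the real edx-platform API."""

    @staticmethod
    def collect(structure, authored_fields):
        # Expensive phase, run once per publish: read authored fields and
        # push "staff only" down from each block to its descendants.
        def visit(block_id, inherited):
            own = authored_fields[block_id].get("staff_only", False)
            staff_only = inherited or own
            structure.collected[block_id] = {"staff_only": staff_only}
            for child in structure.children.get(block_id, []):
                visit(child, staff_only)

        visit("course", False)

    @staticmethod
    def transform(structure, user_is_staff):
        # Cheap phase, run per request: filter using only the collected data.
        return [
            block_id
            for block_id, data in structure.collected.items()
            if user_is_staff or not data["staff_only"]
        ]


children = {"course": ["chapter1", "chapter2"], "chapter2": ["problem1"]}
authored = {
    "course": {},
    "chapter1": {},
    "chapter2": {"staff_only": True},
    "problem1": {},
}

structure = CollectedStructure(children=children)
StaffOnlyTransformer.collect(structure, authored)  # at publish time
student_view = StaffOnlyTransformer.transform(structure, user_is_staff=False)
staff_view = StaffOnlyTransformer.transform(structure, user_is_staff=True)
print(student_view, staff_view)
```

The point of the split is that `transform` never touches the authored data, so its cost does not depend on how expensive `collect` was.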

See this page documenting current and future Transformers.

A long term goal is for XBlock/XModule field data for the LMS to be backed by this system, so that all inheritance computations can be done during the collect phase, and it would be possible to load the data for a given problem or video without having to load the entire set of ancestor nodes.

Course Catalog API (a.k.a. Course Discovery)

https://prod-edx-discovery.edx.org

MySQL, Elasticsearch

This is the authoritative place for data relating to finding and enrolling in a course. It's what's queried when you go to the marketing site and search for a course or look at its "Course About" page details – who is teaching it, what the schedule is, what language it is given in, etc. Because it's backed by Elasticsearch, it is much more efficient and flexible when searching across the system for courses with particular attributes.

Making synchronous calls to this service can be expensive, so if the data exists in Course Overviews and you can live with the delay, it's better to use that.

CourseGraph

https://coursegraph.edx.org/browser/

(Neo4j DB, not part of edx-platform, requires VPN access)

Neo4j

This is used by support/sustaining teams and occasionally services staff to answer questions about course team authored content across edx.org. Think of this as the Modulestore content field data shoved into a Neo4j database by a periodic batch process. There are many example queries available, including:

  1. How many courses are still using a particular XModule? Can we deprecate it?
  2. Are there any proctored exams coming up soon?
  3. What courses have a particular setting enabled?

CourseGraph only sees what Modulestore sees, and some of that data is not canonical. The following are data for which the canonical answer is in the Course Discovery/Catalog API:

  • Course start/end/enrollment dates.
  • Course language.

Local App Models

Django ORM, django-storages

MySQL, S3

A common pattern is to listen for SignalHandler.course_published, and then to access Modulestore data in an asynchronous process. The Blocks API and Transformers are extremely useful for grabbing a big chunk of the Course and manipulating it, but they still incur a relatively high overhead per query (100s of milliseconds). If you just want to derive one or two small bits of field data that are read all the time, you can copy those into your own model where the access time will be ~1ms.

Another scenario where the "listen, extract, and store locally" pattern fits is something like search indexing. We used to have this with course content using edx-search, though I'm not sure if it's actually working these days.
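A minimal, framework-free sketch of the "listen, extract, and store locally" pattern follows. `ToySignal` stands in for the Django signal, a dict stands in for the app's own model, and `read_fields_from_modulestore` and the `has_proctored_exams` field are hypothetical. Real code would connect to SignalHandler.course_published and hand the slow work to an asynchronous task.

```python
class ToySignal:
    """Minimal stand-in for a Django signal (connect and send only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, **kwargs):
        for receiver in self._receivers:
            receiver(**kwargs)


course_published = ToySignal()

# Stand-in for a small app-owned table: course id -> extracted field values.
LOCAL_COURSE_FACTS = {}

# Counts slow reads so the example can show they happen only at publish time.
slow_reads = 0


def read_fields_from_modulestore(course_key):
    """Hypothetical slow read; real code would query the Modulestore here."""
    global slow_reads
    slow_reads += 1
    return {"display_name": "Demo Course", "has_proctored_exams": False}


def on_course_published(course_key, **kwargs):
    # In production this body would be queued as an asynchronous task.
    fields = read_fields_from_modulestore(course_key)
    LOCAL_COURSE_FACTS[course_key] = {
        "has_proctored_exams": fields["has_proctored_exams"],
    }


def has_proctored_exams(course_key):
    # Fast path used on every request: local lookup only.
    return LOCAL_COURSE_FACTS[course_key]["has_proctored_exams"]


course_published.connect(on_course_published)
course_published.send(course_key="course-v1:edX+DemoX+Demo_Course")
print(has_proctored_exams("course-v1:edX+DemoX+Demo_Course"))
```

Reads after the publish never go back to the slow source, which is the whole benefit of keeping the derived value in a local model.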

When Nothing Else Will Do

Situations that require you to use the Modulestore include:

  • Getting a reference to an XBlock or XModule with the intention of rendering its views and invoking its handlers.
  • Reading or writing course static assets, such as images or other file uploads. This is done via the contentstore interface.
  • Changes to the import/export process.
  • Adding a new type of Course (e.g. Content Libraries, CCX).

Hello ModuleStore!

from xmodule.modulestore.django import modulestore
from opaque_keys.edx.keys import CourseKey, UsageKey
from opaque_keys import InvalidKeyError

# Old style IDs -- note the lack of Course Run info in usage_id
# course_id = "edX/DemoX.1/2014"
# usage_id = "i4x://edX/DemoX.1/problem/466f474fa4d045a8b7bde1b911e095ca"

# New style IDs -- usage_id has full Course Run info
course_id = "course-v1:edX+DemoX+Demo_Course"
usage_id = "block-v1:edX+DemoX+Demo_Course+type@problem+block@d2e35c1d294b4ba0b3b1048615605d2a"

# Parse the Course ID.
try:
    # This will return a SlashSeparatedCourseKey (old) or CourseLocator (new)
    # Always use CourseKey.from_string() when parsing Course IDs.
    course_key = CourseKey.from_string(course_id)
except InvalidKeyError:
    # Do some error handling here -- this is just completely made up
    raise ValueError("Could not parse course_id {}".format(course_id))

# Parse the Usage ID
try:
    # This will return a Location (old) or BlockUsageLocator (new)
    # Always use UsageKey.from_string() when parsing Usage IDs.
    unmapped_usage_key = UsageKey.from_string(usage_id)

    # The map_into_course() call is not necessary for BlockUsageLocators, but
    # we do it to maintain compatibility with old style usage keys.
    usage_key = unmapped_usage_key.map_into_course(course_key)
except InvalidKeyError:
    # Do some error handling here -- this is just completely made up
    raise ValueError("Could not parse usage_id {}".format(usage_id))

# This initializes a process global -- future calls to modulestore() will
# just return references to the same global.
ms = modulestore()

# Get a single CapaDescriptor (a Capa problem, like multiple choice).
# This object has all its XModule content fields, but not the user ones.
problem = ms.get_item(usage_key)

# Query the Modulestore for all sequentials in the Course.
sequences = ms.get_items(course_key, qualifiers={'category': 'sequential'})

# Get the root CourseDescriptor
course = ms.get_course(course_key)

# List of child usage keys (Locations/BlockUsageLocators) for the chapters.
course.children

# Iterate through the descriptors for those children instead:
for chapter in course.get_children():
    print(chapter.location, chapter.display_name)


Why Do People Fear It?

The XBlock runtime is conceptually pretty straightforward. An XBlock is instantiated. It has handlers, views, fields, and access to certain runtime services (field data is actually one of those services). However, a number of things have contributed to complexity in this system.

ModuleStore Class Diagram

  1. There are multiple Modulestores, with an intimidating class diagram of relationships. Most prominently, we have both "Old Mongo" and "Split Mongo", which work on different opaque key types and store data very differently. Their performance characteristics are also very different. Most code works with both (and should be tested for both), but some features only work for Split (e.g. CCX). In some cases Split actually maintains some undesired Old Mongo behavior for Studio UI compatibility, like auto-publishing changes. Just getting a single, consistent "published" signal out of both ModuleStores involved a lot of work.
  2. It supports both XBlocks and XModules, which work very differently. A certain amount of proxy glue magic was created to make the systems gel together, and certain things like asset management are just completely different between the two systems. This proxying logic not only makes the system harder to understand, but can also be a source of difficult-to-debug memory leaks.
  3. The XModule rendering system developed piecemeal over time, leading to cases of "let's just add one more arg". When module_render.py invokes the constructor for LmsModuleSystem, it calls it with 28 arguments.
  4. The XBlock runtimes dynamically mix in certain classes to enable things like field data inheritance. You may make MySpecialBlock, but the LMS is executing MySpecialBlockWithMixins, and it's not clear to most developers what that means.
  5. The data store handles both read and write use cases, and doesn't do a great job at either of them. There are optimizations such as Bulk Operations and the metadata inheritance cache – but it's not always clear when you need to use them, and it's easy to do something that accidentally tanks performance.
  6. Tests that use the ModuleStore tend to be slow, mostly because course creation is expensive – on the order of 0.25s - 0.5s, even for an empty course. That may not seem like much, but multiplied across thousands of tests, it starts to add up. We've created ModuleStoreTestCase and SharedModuleStoreTestCase to address some of the isolation issues, and we suppress publish signals to help with performance, but this all adds up to testing complexity. Am I testing for both Old and Split Mongo modulestores? Can I reuse the same course, or do I have to recreate it each time? Do publish signals matter for my tests? Etc.
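To make item 4 more concrete, here is a simplified sketch of how a runtime could build a mixed-in class at run time. `InheritanceMixin`, its `due` field, and the `mix_in` helper are hypothetical names chosen for illustration; the actual XBlock runtime code differs in detail.

```python
class InheritanceMixin:
    """Hypothetical mixin adding a field every block should carry."""

    due = None

    def describe(self):
        return "{} (due: {})".format(type(self).__name__, self.due)


class MySpecialBlock:
    """The class the developer actually wrote."""

    display_name = "My Special Block"


def mix_in(base_class, mixins):
    """Build a new subclass of base_class that also inherits from the mixins."""
    name = base_class.__name__ + "WithMixins"
    return type(name, (base_class,) + tuple(mixins), {})


MixedBlock = mix_in(MySpecialBlock, [InheritanceMixin])
block = MixedBlock()

print(type(block).__name__)
print(block.describe())
print([cls.__name__ for cls in type(block).__mro__])
```

The instance is still a `MySpecialBlock`, but it also carries attributes and methods its author never wrote, which is why behavior seen in the LMS can differ from what the block's own source suggests.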

The XBlock/XModule runtime is in many ways the core of our offering, since that's how students interact with instructional material. But it has accumulated a lot of compatibility hacks and incrementally tacked-on complexity over the years, and teams have generally found it easier to move things out of the Modulestore than to fix systemic issues in it.