Background

User XBlock state and most raw score information in edx-platform are currently stored in the StudentModule and StudentModuleHistory models, which correspond to the courseware_studentmodule and courseware_studentmodulehistory tables in the database. These are by far our largest tables, with hundreds of millions of rows. This has been a long-term scaling concern: we want to be able to add capacity easily and avoid the locking-related problems that have caused intermittent outages on edx.org. For a discussion on the eventual store, please see

While the state and scores are stored in the same rows, they are independent concepts with differing query patterns (scores are written much less frequently, scores are often desired for all items a user has in a course, etc.).

Requirements

Store XBlock student state in a scalable, redundant, low-latency data store.
Allow installations running at smaller scales to keep using MySQL.
Maintain backwards compatibility for existing features.
Better separate grade calculation from XBlock state.

The general approach will be to create new abstractions for XBlock user state and Scores, with pluggable backends. By default, both of these interfaces will point back to StudentModule. Sites with higher scaling requirements (like edx.org) will be able to configure a different backend. All direct calls to StudentModule in edx-platform will be replaced.
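As a rough illustration of the pluggable-backend idea, the sketch below picks a backend class from configuration and falls back to a StudentModule-backed default. The setting name `XBLOCK_USER_STATE_BACKEND`, the class names, and the registry are all hypothetical; nothing here is an existing edx-platform API.

```python
class StudentModuleBackend(object):
    """Stand-in for the default backend that would wrap StudentModule."""
    name = "student_module"


class ExternalStoreBackend(object):
    """Stand-in for a backend used by sites with higher scaling needs."""
    name = "external_store"


# Hypothetical registry of available backends, keyed by name.
BACKENDS = {cls.name: cls for cls in (StudentModuleBackend, ExternalStoreBackend)}


def load_user_state_backend(settings):
    """Return a backend instance chosen by a (hypothetical) settings key."""
    backend_name = settings.get("XBLOCK_USER_STATE_BACKEND", StudentModuleBackend.name)
    try:
        return BACKENDS[backend_name]()
    except KeyError:
        raise ValueError("Unknown user state backend: %r" % (backend_name,))


default_backend = load_user_state_backend({})
scaled_backend = load_user_state_backend({"XBLOCK_USER_STATE_BACKEND": "external_store"})
```

Small installations would leave the setting unset and keep using MySQL through the default; only the configuration changes for larger sites.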

Data Dimensions

Up to 10K unique XBlock state entries for a given user in a given course (our largest courses tend to be smaller, in the low thousands).
Largest cumulative state size for one (course, user) combination is ~5MB of uncompressed JSON on edx.org (this happens in RiceX/ELEC301x/T1_2014)
Some state will be extremely similar across students (e.g. multiple choice, sequences), while some will have a much higher variance (e.g. ORA2).
Largest known student state for a single module is ~1MB
- Occurred for a student against i4x://RiceX/ELEC301x/problem/02e84c4d4f5f4f0f86930e5d6e830a5a in course RiceX/ELEC301x/T1_2014
- Matlab problems return base64 encoded images.
- This module also had 9 state history entries.

Extra Considerations

Make the schema generic enough to handle other scopes?
Do we need good data locality across courses for the same user?
Append-only data structure?

Access Patterns

ID	Type	Access Pattern	R/W	Use Case	edx.org?	Port?	Code Example
001	State	(student, course, module)	RW	Read/write user state for a single XBlock (simple XB view).	Yes	Yes	`lms/djangoapps/courseware/model_data.py`
002	State	(student, course, (modules))	R(W)	Render part of a course tree (e.g. sequence)	Yes	Yes	`lms/djangoapps/courseware/model_data.py`
003	State History	(student, course, module)	RW	State history for a user + problem.	Yes	Yes	`lms/djangoapps/courseware/views.py`
004	State Bulk	(course, module)	RW	Reset/delete/rescore a problem for all students in a course.	Yes, async	Yes	`lms/djangoapps/instructor_task/tasks_helper.py`
005	State Report	score not null	R	Psychometrics, pull back every graded thing ever.	No	?	`lms/djangoapps/psychometrics/psychoanalyze.py`
006	State Report	(course, module)	R / R	Course dump of problem state across all students. / Answer distribution.	No / Yes, async	? / Analytics	`lms/djangoapps/instructor/views/legacy.py` / `lms/djangoapps/courseware/grades.py`
007	State Report	(course, module, (students))	R	ORA1 related reporting management command	No	No	`instructor/management/commands/openended_stats.py`
008	Score	(student, course, module)	RW / RW	Get or set score for a single XBlock. / Reset attempts/score in a problem for one student.	Yes / Yes	Yes / Yes	`lms/djangoapps/courseware/grades.py` / `lms/djangoapps/instructor/enrollment.py`
009	Score	(student, course, (modules))	R / R	Calculate entrance exam scores. / Determine whether to grade a section.	Yes / Yes	Yes / Yes	`common/djangoapps/util/milestones_helpers.py` / `lms/djangoapps/courseware/grades.py`
010	Score Report	(course, score not null, module)	R	List of users and their scores for a given problem.	No	Yes	`lms/djangoapps/class_dashboard/dashboard_data.py`
011	Score Stats	(course, type=problem, score not null)	R	Score distribution for all problems in a course.	No	Analytics?	`lms/djangoapps/class_dashboard/dashboard_data.py`
012	Score Stats	(course, score not null, (modules))	R	Score distribution for a particular problem set.	No	Analytics	`lms/djangoapps/class_dashboard/dashboard_data.py`
013	State Stats	(course, type=sequential, module)	R	How many students have opened this subsection.	No	Analytics	`lms/djangoapps/class_dashboard/dashboard_data.py`
014	State Stats	(course, type=sequential)	R	Open counts for all subsections in the course.	No	Analytics	`lms/djangoapps/class_dashboard/dashboard_data.py`
015	State	(course, type=chapter, sequential, problem)	R	Open/Completed State for Problems and Sections. Problem grades lookup planned.	No	?	https://github.com/Stanford-Online/edx-platform/blob/master/lms/djangoapps/instructor/tasks_helper.py context: emailDistributionTool.png
016	State	(student, course, type)	R	Not currently supported, but functionality has been suggested for a number of use cases (being able to get the user's last location more quickly, see all A/B test state at once, etc.)	-	-

Code Plan

from xblock.fields import Scope


class XBlockUserStateClient(object):
    """
    First stab at an interface for accessing XBlock User State. This will
    use StudentModule as a backing store in the default case.
    
    Scope/Goals:
    1. Mediate access to all student-specific state stored by XBlocks.
        a. This includes "preferences" and "user_info" (i.e. UserScope.ONE)
        b. This includes XBlock Asides.
        c. This may later include user_state_summary (i.e. UserScope.ALL).
           - Though I'm fuzzy on where we're going in general with this and how
             it relates to content.
        d. This may include group state in the future.
        e. This may include other key types + UserScope.ONE (e.g. Definition)
           - I think this implies a per-user partition scheme and not a 
             user+course partition scheme.
    2. Assume network service semantics.
        At some point, this will probably be calling out to an external service.
        Even if it doesn't, we want to be able to implement circuit breakers, so
        that a failure in StudentModule doesn't bring down the whole site.
        This also implies that the client is running as a user, and whatever is
        backing it is smart enough to do authorization checks.
    3. This does not yet cover export-related functionality.

    Open Questions:
    1. Is it sufficient to just send the block_key in and extract course + 
       version info from it?
    2. Do we want to use the username as the identifier? Privacy implications?
       Ease of debugging?
    3. Would a get_many_by_type() be useful?
    """

    class ServiceUnavailableError(Exception):
        pass

    class PermissionDeniedError(Exception):
        pass

    # 001
    def get(self, user_id, block_key, scope=Scope.user_state):
        pass

    # 001
    def set(self, user_id, block_key, state, scope=Scope.user_state):
        pass

    # 002
    def get_many(self, user_id, block_keys, scope=Scope.user_state):
        """Returns dict of block_key -> state."""
        pass

    # 002
    def set_many(self, user_id, block_keys_to_state, scope=Scope.user_state):
        pass

    # 003
    def get_history(self, user_id, block_key, scope=Scope.user_state):
        """We don't guarantee that history for many blocks will be fast."""
        pass

    # 004, 006
    def iter_all_for_block(self, block_key, scope=Scope.user_state, batch_size=None):
        """
        You get no ordering guarantees. Fetching will happen in batch_size
        increments. If you're using this method, you should be running in an
        async task.
        """
        pass

    # 005 if you want to push it...?
    def iter_all_for_course(self, course_key, block_type=None, scope=Scope.user_state, batch_size=None):
        """
        You get no ordering guarantees. Fetching will happen in batch_size
        increments. If you're using this method, you should be running in an
        async task.
        """
        pass
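To make the per-user part of the interface concrete, here is a minimal in-memory sketch of access patterns 001 through 003. It is not the planned implementation: the class name `InMemoryUserStateClient` is hypothetical, the `USER_STATE` string stands in for `Scope.user_state` so the example runs without XBlock installed, and two behaviors are assumptions (history is kept by appending every write, and `get_many` omits blocks that have no state).

```python
import copy
from collections import defaultdict

USER_STATE = "user_state"  # stand-in for Scope.user_state (assumption)


class InMemoryUserStateClient(object):
    """Dict-backed sketch of get/set/get_many/set_many/get_history."""

    def __init__(self):
        # (user_id, block_key, scope) -> list of states, oldest first
        self._history = defaultdict(list)

    def get(self, user_id, block_key, scope=USER_STATE):
        versions = self._history.get((user_id, block_key, scope))
        # Copy so callers cannot mutate stored state by accident.
        return copy.deepcopy(versions[-1]) if versions else None

    def set(self, user_id, block_key, state, scope=USER_STATE):
        # Append-only: every write becomes a new history entry.
        self._history[(user_id, block_key, scope)].append(copy.deepcopy(state))

    def get_many(self, user_id, block_keys, scope=USER_STATE):
        found = {}
        for block_key in block_keys:
            state = self.get(user_id, block_key, scope)
            if state is not None:
                found[block_key] = state
        return found

    def set_many(self, user_id, block_keys_to_state, scope=USER_STATE):
        for block_key, state in block_keys_to_state.items():
            self.set(user_id, block_key, state, scope)

    def get_history(self, user_id, block_key, scope=USER_STATE):
        return copy.deepcopy(self._history.get((user_id, block_key, scope), []))


client = InMemoryUserStateClient()
client.set(7, "block-a", {"attempts": 1})
client.set(7, "block-a", {"attempts": 2})
client.set_many(7, {"block-b": {"position": 3}})
```

A fake like this could also serve in unit tests for code that has been ported off direct StudentModule calls.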

Strawman Schema #1: Scope.user_state-specific, Partition by (user, course)

CREATE TABLE IF NOT EXISTS xblock_user_state (
    user varchar,     -- varchar so that we can support anonymous IDs if desired
    course_key varchar,
    block_type varchar,
    usage_key varchar,
    created timestamp,  -- append only, each update is a new record
    state blob,
    PRIMARY KEY ((user, course_key), block_type, usage_key, created)
);
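The read patterns this key layout is meant to serve can be modeled in a few lines of Python. The sketch below is only an illustration of the partition and clustering order, not a client for any real datastore: the class `UserStateTable` and its method names are hypothetical, and `created` is a plain integer rather than a timestamp.

```python
import bisect
from collections import defaultdict


class UserStateTable(object):
    """Toy model: partition by (user, course_key), cluster by the rest of the key."""

    def __init__(self):
        # (user, course_key) -> sorted list of (block_type, usage_key, created, state)
        self._partitions = defaultdict(list)

    def append(self, user, course_key, block_type, usage_key, created, state):
        # Append-only: an update is a new row with a later `created` value.
        row = (block_type, usage_key, created, state)
        bisect.insort(self._partitions[(user, course_key)], row)

    def scan_block_type(self, user, course_key, block_type):
        """Range scan over one block_type within a single partition."""
        rows = self._partitions.get((user, course_key), [])
        start = bisect.bisect_left(rows, (block_type,))
        result = []
        for row in rows[start:]:
            if row[0] != block_type:
                break
            result.append(row)
        return result

    def latest(self, user, course_key, block_type, usage_key):
        """Most recent state for one block, or None if nothing was written."""
        rows = [r for r in self.scan_block_type(user, course_key, block_type)
                if r[1] == usage_key]
        return rows[-1][3] if rows else None


table = UserStateTable()
table.append("u.7", "course-x", "problem", "p1", 1, b"first")
table.append("u.7", "course-x", "problem", "p1", 2, b"second")
table.append("u.7", "course-x", "sequential", "s1", 1, b"seq")
```

Because rows are ordered by `block_type` first, all of a user's state for one block type sits in a contiguous run, and the older rows for a block double as its state history.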

...

Note that some of these scenarios are highly speculative.

1. Student state

Partition: (user="u.{user_id}", grouping={course_id})

This is written like normal. The schema should give us fast writes, fast reads on individual items, and fast range scans for all of a particular user's student state by block_type. We would also get state history.

2. Published course content and settings.

Partition: (user=none, grouping={course_id@version})

This is written once at publish time. Old publishes are retained. Definitions are stored with the usages in a given partition, so there is duplication here. We may need to delete old partitions (e.g. previously published versions) if publishing happens too frequently.

3. CCX derived from base course.

Partition: (user=none, grouping={ccx_course_id})

This would store all the overrides in the settings scope that are necessary to go from the versioned course that this CCX was based on to the CCX itself.

4. Individual Due Date Extensions

Partition: (user="u.{user_id}", grouping={course_id})

It's worth noting that this would be stored in the same partition as #1 (user_state scope storage). The app and scope would just be different. This would allow a runtime to grab all student-specific state in a single query.

5. Small Group work.

Partition: (user="u.{user_id}", grouping={course_id})

...

If we had to model content differences associated with very large groups, we could model that in a separate partition under a user="g.{group_id}".
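The partition naming used across these scenarios can be summarized with a few helpers. This is a sketch only: the function names are hypothetical, and using the course ID as the grouping for a group partition is an assumption, since the text above only specifies the `g.{group_id}` prefix.

```python
def user_partition(user_id, course_id):
    """Partition for per-user data, e.g. student state or due date extensions."""
    return ("u.{}".format(user_id), course_id)


def group_partition(group_id, course_id):
    """Partition for very large groups; the grouping value is an assumption."""
    return ("g.{}".format(group_id), course_id)


def content_partition(course_id, version=None):
    """Partition with no user: published content (versioned) or CCX overrides."""
    grouping = course_id if version is None else "{}@{}".format(course_id, version)
    return (None, grouping)


student_key = user_partition(42, "course-x")
published_key = content_partition("course-x", version="v3")
ccx_key = content_partition("ccx-course-y")
```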

6. Notifications (highly speculative)

Partitions:

 (user="u.{user_id}", grouping={course_id}) # Single user XBlock
 (user=none, grouping={course_id}) # Course as a whole

Notifications would have an app name (e.g. "ntf"). They get stored in the same partitions as user state and course state, depending on the notification type. This assumes a new scope.

Drawbacks

Parent/child relationship storage might get clunky. It helps that this would be stored against the versioned course, though, since that is only ever written once.
Using a map column like this means that we would be limited to 64K of state per (user, b_key, scope).
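To see how this limit interacts with the sizes listed under Data Dimensions, a quick check along these lines could be run against sample state. It is a sketch under stated assumptions: "64K" is read as 64 * 1024 bytes, state is measured as UTF-8 encoded JSON, and the names `MAP_VALUE_LIMIT_BYTES` and `fits_in_map_column` are hypothetical.

```python
import json

MAP_VALUE_LIMIT_BYTES = 64 * 1024  # assumed reading of the "64K" limit


def fits_in_map_column(state):
    """Return True if the JSON-encoded state is within the assumed limit."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return len(encoded) <= MAP_VALUE_LIMIT_BYTES


small_state = {"attempts": 2, "done": True}
# Roughly the ~1MB case described above (e.g. base64-encoded images).
large_state = {"image": "A" * 1000000}
small_ok = fits_in_map_column(small_state)
large_ok = fits_in_map_column(large_state)
```

Under these assumptions, the largest known single-module state would not fit, so such entries would need different handling.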

Migration Plan - Transactional DB

Migration Plan - Analytics

The big concerns here are:

...
