Background

User XBlock state and most raw score information in edx-platform is currently stored in the StudentModule and StudentModuleHistory models, which correspond to the courseware_studentmodule and courseware_studentmodulehistory tables in the database. These are by far our largest tables, with hundreds of millions of rows. This has been a long-term scaling concern: we want to be able to add capacity easily and avoid the locking-related problems that have caused intermittent outages on edx.org. For a discussion on the eventual store, please see 

While the state and scores are stored in the same rows, they are independent concepts with differing query patterns (scores are written much less frequently, scores are often desired for all items a user has in a course, etc.).

Requirements

  1. Store XBlock student state in a scalable, redundant, low-latency data store.
  2. Allow installations running at smaller scales to keep using MySQL.
  3. Maintain backwards compatibility for existing features.
  4. Better separate grade calculation from XBlock state.

The general approach will be to create new abstractions for XBlock user state and Scores, with pluggable backends. By default, both of these interfaces will point back to StudentModule. Sites with higher scaling requirements (like edx.org) will be able to configure a different backend. All direct calls to StudentModule in edx-platform will be replaced.
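
As a rough illustration of the pluggable-backend idea, the sketch below picks a client class from a site setting and falls back to a StudentModule-backed default. Everything named here is hypothetical and only for illustration: the `XBLOCK_USER_STATE_BACKEND` setting, the `BACKENDS` registry, and both stand-in classes are assumptions, not existing edx-platform code.

```python
class StudentModuleClient(object):
    """Stand-in for the default backend that would wrap StudentModule (hypothetical)."""
    backend_name = "student_module"


class ExternalStoreClient(object):
    """Stand-in for a backend that a higher-scale site might configure (hypothetical)."""
    backend_name = "external_store"


# Registry of known backends, keyed by the name a site would put in its settings.
BACKENDS = {
    cls.backend_name: cls
    for cls in (StudentModuleClient, ExternalStoreClient)
}

# By default, the interface points back to StudentModule.
DEFAULT_BACKEND = "student_module"


def get_user_state_client(settings=None):
    """Return an instance of the configured backend, or the default one."""
    name = (settings or {}).get("XBLOCK_USER_STATE_BACKEND", DEFAULT_BACKEND)
    try:
        backend_cls = BACKENDS[name]
    except KeyError:
        raise ValueError("Unknown user state backend: %r" % name)
    return backend_cls()
```

With this shape, smaller installations would need no configuration at all, while a large site would only change one setting.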

Data Dimensions

  • Up to 10K unique XBlock state entries for a given user in a given course (our largest courses tend to be smaller, in the low thousands).
  • Largest cumulative state size for one (course, user) combination is ~5MB of uncompressed JSON on edx.org (this happens in RiceX/ELEC301x/T1_2014)
  • Some state will be extremely similar across students (e.g. multiple choice, sequences), while some will have a much higher variance (e.g. ORA2).
  • Largest known student state for a single module is ~1MB
    • Occurred for a student against i4x://RiceX/ELEC301x/problem/02e84c4d4f5f4f0f86930e5d6e830a5a in course RiceX/ELEC301x/T1_2014
    • Matlab problems return base64 encoded images.
    • This module also had 9 state history entries.

Extra Considerations

  • Make the schema generic enough to handle other scopes?
  • Do we need good data locality across courses for the same user?
  • Append-only data structure?

Access Patterns

| ID | Type | Access Pattern | R/W | Use Case | edx.org? | Port? | Code Example |
|---|---|---|---|---|---|---|---|
| 001 | State | (student, course, module) | RW | Read/write user state for a single XBlock (simple XB view). | Yes | Yes | lms/djangoapps/courseware/model_data.py |
| 002 | State | (student, course, (modules)) | R(W) | Render part of a course tree (e.g. sequence) | Yes | Yes | lms/djangoapps/courseware/model_data.py |
| 003 | State History | (student, course, module) | RW | State history for a user + problem. | Yes | Yes | lms/djangoapps/courseware/views.py |
| 004 | State Bulk | (course, module) | RW | Reset/delete/rescore a problem for all students in a course. | Yes, async | Yes | lms/djangoapps/instructor_task/tasks_helper.py |
| 005 | State Report | score not null | R | Psychometrics, pull back every graded thing ever. | No | ? | lms/djangoapps/psychometrics/psychoanalyze.py |
| 006 | State Report | (course, module) | R | Course dump of problem state across all students. | No | ? | lms/djangoapps/instructor/views/legacy.py |
|  |  |  | R | Answer distribution. | Yes, async | Analytics | lms/djangoapps/courseware/grades.py |
| 007 | State Report | (course, module, (students)) | R | ORA1 related reporting management command | No | No | instructor/management/commands/openended_stats.py |
| 008 | Score | (student, course, module) | RW | Get or set score for a single XBlock. | Yes | Yes | lms/djangoapps/courseware/grades.py |
|  |  |  | RW | Reset attempts/score in a problem for one student. | Yes | Yes | lms/djangoapps/instructor/enrollment.py |
| 009 | Score | (student, course, (modules)) | R | Calculate entrance exam scores. | Yes | Yes | common/djangoapps/util/milestones_helpers.py |
|  |  |  | R | Determine whether to grade a section. | Yes | Yes | lms/djangoapps/courseware/grades.py |
| 010 | Score Report | (course, score not null, module) | R | List of users and their scores for a given problem. | No | Yes | lms/djangoapps/class_dashboard/dashboard_data.py |
| 011 | Score Stats | (course, type=problem, score not null) | R | Score distribution for all problems in a course. | No | Analytics? | lms/djangoapps/class_dashboard/dashboard_data.py |
| 012 | Score Stats | (course, score not null, (modules)) | R | Score distribution for a particular problem set. | No | Analytics | lms/djangoapps/class_dashboard/dashboard_data.py |
| 013 | State Stats | (course, type=sequential, module) | R | How many students have opened this subsection. | No | Analytics | lms/djangoapps/class_dashboard/dashboard_data.py |
| 014 | State Stats | (course, type=sequential) | R | Open counts for all subsections in the course. | No | Analytics | lms/djangoapps/class_dashboard/dashboard_data.py |
| 015 | State | (course, type=chapter, sequential, problem) | R | Open/Completed State for Problems and Sections. Problem grades lookup planned. | No | ? | https://github.com/Stanford-Online/edx-platform/blob/master/lms/djangoapps/instructor/tasks_helper.py (context: emailDistributionTool.png) |
| 016 | State | (student, course, type) | R | Not currently supported, but functionality has been suggested for a number of use cases (being able to get the user's last location more quickly, see all A/B test state at once, etc.) | - | - |  |

Code Plan

Code Block
languagepy
from xblock.fields import Scope


class XBlockUserStateClient(object):
    """
    First stab at an interface for accessing XBlock User State. This will
    use StudentModule as a backing store in the default case.
    
    Scope/Goals:
    1. Mediate access to all student-specific state stored by XBlocks.
        a. This includes "preferences" and "user_info" (i.e. UserScope.ONE)
        b. This includes XBlock Asides.
        c. This may later include user_state_summary (i.e. UserScope.ALL).
           - Though I'm fuzzy on where we're going in general with this and how
             it relates to content.
        d. This may include group state in the future.
        e. This may include other key types + UserScope.ONE (e.g. Definition)
           - I think this implies a per-user partition scheme and not a 
             user+course partition scheme.
    2. Assume network service semantics.
        At some point, this will probably be calling out to an external service.
        Even if it doesn't, we want to be able to implement circuit breakers, so
        that a failure in StudentModule doesn't bring down the whole site.
        This also implies that the client is running as a user, and whatever is
        backing it is smart enough to do authorization checks.
    3. This does not yet cover export-related functionality.

    Open Questions:
    1. Is it sufficient to just send the block_key in and extract course + 
       version info from it?
    2. Do we want to use the username as the identifier? Privacy implications?
       Ease of debugging?
    3. Would a get_many_by_type() be useful?
    """

    class ServiceUnavailableError(Exception):
        pass

    class PermissionDeniedError(Exception):
        pass

    # 001
    def get(self, user_id, block_key, scope=Scope.user_state):
        pass

    # 001
    def set(self, user_id, block_key, state, scope=Scope.user_state):
        pass

    # 002
    def get_many(self, user_id, block_keys, scope=Scope.user_state):
        """Returns dict of block_key -> state."""
        pass

    # 002
    def set_many(self, user_id, block_keys_to_state, scope=Scope.user_state):
        pass

    # 003
    def get_history(self, user_id, block_key, scope=Scope.user_state):
        """We don't guarantee that history for many blocks will be fast."""
        pass

    # 004, 006
    def iter_all_for_block(self, block_key, scope=Scope.user_state, batch_size=None):
        """
        You get no ordering guarantees. Fetching will happen in batch_size
        increments. If you're using this method, you should be running in an
        async task.
        """
        pass

    # 005 if you want to push it...?
    def iter_all_for_course(self, course_key, block_type=None, scope=Scope.user_state, batch_size=None):
        """
        You get no ordering guarantees. Fetching will happen in batch_size
        increments. If you're using this method, you should be running in an
        async task.
        """
        pass
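
To make the intended semantics of this interface more concrete, here is a minimal in-memory sketch of a backend covering access patterns 001 to 004 and 006. It is a toy for discussion and tests, not a proposed implementation: the class name `InMemoryUserStateClient` is hypothetical, `scope` is a plain string standing in for `Scope.user_state`, `batch_size` is omitted, and the append-on-every-write behavior is an assumption borrowed from the append-only idea listed under Extra Considerations.

```python
import collections


class InMemoryUserStateClient(object):
    """Toy backend: every set() appends a new record, like an append-only store."""

    def __init__(self):
        # (user_id, block_key, scope) -> list of state dicts, oldest first
        self._history = collections.defaultdict(list)

    def get(self, user_id, block_key, scope="user_state"):
        """001: return the latest state, or None if nothing was stored."""
        records = self._history.get((user_id, block_key, scope))
        return records[-1] if records else None

    def set(self, user_id, block_key, state, scope="user_state"):
        """001: append a copy of the state as a new record."""
        self._history[(user_id, block_key, scope)].append(dict(state))

    def get_many(self, user_id, block_keys, scope="user_state"):
        """002: return {block_key: latest state} for keys that have state."""
        found = {}
        for block_key in block_keys:
            state = self.get(user_id, block_key, scope)
            if state is not None:
                found[block_key] = state
        return found

    def set_many(self, user_id, block_keys_to_state, scope="user_state"):
        """002: write several blocks for one user."""
        for block_key, state in block_keys_to_state.items():
            self.set(user_id, block_key, state, scope)

    def get_history(self, user_id, block_key, scope="user_state"):
        """003: return every recorded state for one block, oldest first."""
        return list(self._history.get((user_id, block_key, scope), []))

    def iter_all_for_block(self, block_key, scope="user_state"):
        """004, 006: yield (user_id, latest state) in no guaranteed order."""
        for (user_id, key, key_scope), records in list(self._history.items()):
            if key == block_key and key_scope == scope and records:
                yield user_id, records[-1]


client = InMemoryUserStateClient()
client.set(42, "block-a", {"attempts": 1})
client.set(42, "block-a", {"attempts": 2})
print(client.get(42, "block-a"))               # {'attempts': 2}
print(len(client.get_history(42, "block-a")))  # 2
```

Because writes only ever append, state history (003) falls out of the same structure that serves the latest-state reads.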

 

Strawman Schema #1: Scope.user_state-specific, Partition by (user, course)


Code Block
languagesql
CREATE TABLE IF NOT EXISTS xblock_user_state (
    user varchar,     -- varchar so that we can support anonymous IDs if desired
    course_key varchar,
    block_type varchar, 
    usage_key varchar,
    created timestamp,  -- append only, each update is a new record
    state binary,
    PRIMARY KEY ((user, course_key), block_type, usage_key, created)
)
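
The reason for ordering the key as (block_type, usage_key, created) inside a (user, course_key) partition can be shown with a small in-memory analogy. The sketch below keeps one partition's rows sorted by that clustering key, so that all rows for one block_type, or all history for one block, form a contiguous slice found by binary search. It is only a model of the layout under that assumption, not a client for any real data store; the `Partition` class and its method names are hypothetical.

```python
import bisect


class Partition(object):
    """Rows for one (user, course_key), kept sorted by clustering key."""

    def __init__(self):
        self._keys = []    # sorted list of (block_type, usage_key, created)
        self._states = {}  # clustering key -> serialized state

    def append(self, block_type, usage_key, created, state):
        """Append-only write: each update is a new row with a new timestamp."""
        key = (block_type, usage_key, created)
        if key in self._states:
            raise ValueError("duplicate clustering key: %r" % (key,))
        bisect.insort(self._keys, key)
        self._states[key] = state

    def _slice(self, low, high):
        """Return rows whose key sorts in the half-open range [low, high)."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(key, self._states[key]) for key in self._keys[start:end]]

    def scan_block_type(self, block_type):
        """All rows of one block_type, as one contiguous range scan."""
        # block_type + "\x00" is the smallest string that sorts after block_type.
        return self._slice((block_type,), (block_type + "\x00",))

    def history(self, block_type, usage_key):
        """All versions of one block's state, oldest first."""
        return self._slice(
            (block_type, usage_key), (block_type, usage_key + "\x00")
        )

    def latest(self, block_type, usage_key):
        """Most recent state for one block, or None if it has no rows."""
        rows = self.history(block_type, usage_key)
        return rows[-1][1] if rows else None
```

Reading the current state of a block means taking the last row of its history slice, which is the cost of keeping every update as a separate record.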

...

Note that some of these scenarios are highly speculative.

1. Student state

Partition: (user="u.{user_id}", grouping={course_id})

This is written like normal. The schema should give us fast writes, fast reads on individual items, and fast range scans for all of a particular user's student state by block_type. We would also get state history.

2. Published course content and settings.

Partition: (user=none, grouping={course_id@version})

This is written once at publish time. Old publishes are retained. Definitions are stored with the usages in a given partition, so there is duplication here. We may need to delete old partitions (e.g. previously published versions) if publishes happen too frequently.

3. CCX derived from base course.

Partition: (user=none, grouping={ccx_course_id})

This would store all the overrides in the settings scope that are necessary to derive this CCX from the versioned course it was based on.

4. Individual Due Date Extensions

Partition: (user="u.{user_id}", grouping={course_id})

It's worth noting that this would be stored in the same partition as #1 (user_state scope storage). The app and scope would just be different. This would allow a runtime to grab all student-specific state in a single query.

5. Small Group work.

Partition: (user="u.{user_id}", grouping={course_id})

...

If we had to model content differences associated with very large groups, we could model that in a separate partition under a user="g.{group_id}".
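
The partition conventions used in the scenarios above can be summarized with a few helper functions. This is only a sketch of the naming scheme as described: the `PartitionKey` tuple and the function names are hypothetical, `None` stands in for `user=none`, and using the course ID as the grouping for large-group partitions is an assumption, since the text only specifies the `g.{group_id}` user value.

```python
import collections

# (user, grouping) pair identifying one partition in the speculative schema.
PartitionKey = collections.namedtuple("PartitionKey", ["user", "grouping"])


def user_partition(user_id, course_id):
    """Scenarios 1, 4 and 5: per-user data within one course."""
    return PartitionKey(user="u.{}".format(user_id), grouping=str(course_id))


def published_partition(course_id, version):
    """Scenario 2: published content, one partition per course version."""
    return PartitionKey(user=None, grouping="{}@{}".format(course_id, version))


def ccx_partition(ccx_course_id):
    """Scenario 3: settings overrides for a CCX."""
    return PartitionKey(user=None, grouping=str(ccx_course_id))


def group_partition(group_id, course_id):
    """Very large groups: a separate partition under user="g.{group_id}"."""
    # Assumption: the grouping is still the course ID.
    return PartitionKey(user="g.{}".format(group_id), grouping=str(course_id))


# Student state (1) and due date extensions (4) land in the same partition,
# which is what would let a runtime fetch both in a single query.
state_key = user_partition(7, "RiceX/ELEC301x/T1_2014")
extension_key = user_partition(7, "RiceX/ELEC301x/T1_2014")
print(state_key == extension_key)  # True
```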

6. Notifications (*highly* speculative)

Partitions:
(user="u.{user_id}", grouping={course_id}) # Single user XBlock
(user=none, grouping={course_id}) # Course as a whole

Notifications would have an app name (e.g. "ntf"). They get stored in the same partitions as user state and course state, depending on the notification type. This assumes a new scope.

Drawbacks

  1. Parent/child relationship storage might get clunky. It helps that this would be stored against the versioned course, though, since that is only ever written once.
  2. Using a map column like this means that we would be limited to 64K of state per (user, b_key, scope).
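
To get a feel for how the 64K limit in drawback 2 compares with the sizes listed under Data Dimensions, a check along the following lines could be run against existing state. It is a sketch under stated assumptions: "64K" is read as 64 * 1024 bytes, state is measured as UTF-8 encoded JSON, and the helper names are hypothetical.

```python
import json

# Assumed reading of the "64K" limit mentioned in the drawback above.
MAX_STATE_BYTES = 64 * 1024


def serialized_size(state):
    """Size in bytes of the state when stored as UTF-8 encoded JSON."""
    return len(json.dumps(state, sort_keys=True).encode("utf-8"))


def fits_in_map_value(state, limit=MAX_STATE_BYTES):
    """True if the serialized state would fit under the assumed limit."""
    return serialized_size(state) <= limit


small_state = {"attempts": 2, "done": True}
# Roughly the ~1MB worst case noted for a single module (base64 image data).
large_state = {"image": "A" * (1024 * 1024)}

print(fits_in_map_value(small_state))  # True
print(fits_in_map_value(large_state))  # False
```

Under these assumptions the largest known single-module state would not fit, so such blocks would need either a different column type or a way to store oversized state elsewhere.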

 

Migration Plan - Transactional DB

Migration Plan - Analytics

The big concerns here are:

...