Contentstore

The "contentstore" is where all course assets are stored. It is a code wrapper around a GridFS (MongoDB) backend, and it stores binary files such as PDFs, WAVs, JPGs, and other formats. The contentstore code is mainly here:

https://github.com/edx/edx-platform/tree/master/cms/djangoapps/contentstore

The contentstore has some known technical problems, explained somewhat in this presentation:

http://doctoryes.github.io/mug_talk_modulestore/#1

An effort to move all course assets out of GridFS and into external storage was begun in 2014 and abandoned. The docs from that effort: GridFS Replacement

Since that effort, the performance team has implemented an optimization for non-locked course assets. See Dave Ormsbee or Toby Lawrence for further details.

Ways to Query the Content Store

  1. using PyMongo:
    • Run a PyMongo script from any machine that can connect to the prod mongo replica.
  2. using mongo itself:
    • Run a JS script from any machine that can connect via the "mongo" command prompt.
    • "mongo < ./myscript.js"
  3. using the read-replica mongo shell
    1. ssh to the tools-gp.edx.org machine
    2. navigate to /edx/bin and run the appropriate script to enter a mongo shell for the database.
      1. e.g. ./prod-edx-edxapp-mongo.sh
    3. run your query
    4. NOTE: The prod-edx/prod-edge read-replica secondary databases are independent of the primary databases, so you can execute intensive queries via these shells. However, the stage and loadtest read-replicas are in the same cluster as the primary database, so intensive queries may affect the stage or loadtest environments. 
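
As a rough illustration of option 2, a script piped to the shell might look like the sketch below. The file name myscript.js comes from the example above; the summarize helper and the choice to count fs.files and fs.chunks are assumptions made for illustration. The db global and printjson are only available inside the mongo shell, so the call is guarded.

```javascript
// myscript.js: a minimal, read-only sketch (helper name is hypothetical).
// Run with: mongo < ./myscript.js

// Count the GridFS metadata documents and chunks in the given database.
function summarize(database) {
    return {
        files: database.fs.files.count(),
        chunks: database.fs.chunks.count()
    };
}

// `db` and `printjson` exist only inside the mongo shell, so guard the call.
if (typeof db !== "undefined") {
    printjson(summarize(db));
}
```

Passing the database handle in as an argument keeps the helper easy to try against a stub before pointing it at a real replica.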

Counting Assets in Each Course

If you need to find out how many assets are contained in each course, the JS below will assist you.

/* The original "group" (note: db.collection.group() was deprecated in MongoDB 3.4 and removed in 4.2). */
db.fs.files.group( {
    key: {"_id.course": 1, "_id.org": 1, "_id.run": 1},
    reduce: function(cur, result) { result.count += 1 },
    initial: {count: 0}
} )

/* ...and with category included. */
db.fs.files.group( {
    key: {"_id.course": 1, "_id.org": 1, "_id.run": 1, "_id.category": 1},
    reduce: function(cur, result) { result.count += 1 },
    initial: {count: 0}
} )
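
On servers where group() is not available, the same counts can probably be obtained with the aggregation framework. The sketch below is an assumption about how that could look, not something taken from the original page; buildAssetCountPipeline is a hypothetical helper name, and the added $sort stage is optional.

```javascript
// Build an aggregation pipeline that counts assets per course.
// Pass includeCategory = true to also split the counts by "_id.category".
// (Hypothetical helper; field names are taken from the group() calls above.)
function buildAssetCountPipeline(includeCategory) {
    var groupKey = {
        course: "$_id.course",
        org: "$_id.org",
        run: "$_id.run"
    };
    if (includeCategory) {
        groupKey.category = "$_id.category";
    }
    return [
        { $group: { _id: groupKey, count: { $sum: 1 } } },
        { $sort: { count: -1 } }   // largest counts first
    ];
}

// Inside the mongo shell:
//   db.fs.files.aggregate(buildAssetCountPipeline(false))
//   db.fs.files.aggregate(buildAssetCountPipeline(true))
```

Each result document carries the grouping fields under _id and the number of matching assets under count.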

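/* Map-reduce variant: also handles assets whose _id is a string rather than a sub-document. */
var mapFunction = function() {
    // For string ids, keep everything before the last "+" as the grouping key.
    var slicer = function(x) { return x.slice(0, x.lastIndexOf("+")) };
    var split_id = null;
    if (typeof this._id === "string")
        split_id = slicer(this._id);
    // For string ids the four sub-document fields are undefined.
    var key = [ this._id.course, this._id.org, this._id.run, this._id.category, split_id ];
    emit( key, 1 );
};
/* Sum the 1s emitted for each key to get the asset count. */
var reduceFunction = function(key, values) {
    return Array.sum(values);
};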
db.fs.files.mapReduce(
    mapFunction,
    reduceFunction,
    {
        out: {inline: 1}
    }
)

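/* For debugging the mapper: define this stub emit() in the shell, then call the mapper on a single document, e.g. mapFunction.apply(db.fs.files.findOne()). */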
var emit = function(key, value) {
    print("emit");
    print("key: " + key + " value: " + tojson(value));
}
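
The mapper can also be dry-run on hand-written documents, without touching the database. The sketch below repeats the mapper and swaps the printing emit for one that collects its arguments. Both sample documents are hypothetical, and the string id format in particular is an assumption about what such ids look like.

```javascript
// Collect emitted pairs instead of printing them, so they can be inspected.
var emitted = [];
var emit = function(key, value) {
    emitted.push({ key: key, value: value });
};

// Same mapper as in the map-reduce example.
var mapFunction = function() {
    var slicer = function(x) { return x.slice(0, x.lastIndexOf("+")); };
    var split_id = null;
    if (typeof this._id === "string")
        split_id = slicer(this._id);
    var key = [ this._id.course, this._id.org, this._id.run, this._id.category, split_id ];
    emit(key, 1);
};

// Hypothetical sample documents: one sub-document id, one string id.
var dictStyle = { _id: { course: "DemoX", org: "edX", run: "2014", category: "asset" } };
var stringStyle = { _id: "asset-v1:edX+DemoX+2014+type@asset+block@logo.png" };

// call() binds `this` to the document, as the server does during mapReduce.
mapFunction.call(dictStyle);
mapFunction.call(stringStyle);
```

After this runs, emitted[0].key is ["DemoX", "edX", "2014", "asset", null]. For the string id, the first four entries of emitted[1].key are undefined and the last is "asset-v1:edX+DemoX+2014+type@asset".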

/* To find all asset files whose course is not one of the three specified. */
db.fs.files.find({"_id.course": {$nin : ["DemoX", "import_test", "LargeCourse101"]}})

Finding Distinct Values for Fields and Frequency of the Values

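/* Replace <field_name> with the field to inspect. */
/* On shells older than MongoDB 2.6, aggregate() returns a document and the rows are in its .result field; newer shells return a cursor. */
db.fs.files.aggregate([
   {$group : { _id : '$<field_name>', count : {$sum : 1}}}
])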
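
To avoid editing the placeholder by hand, the pipeline can be built from a field name. This is only a sketch: buildFieldFrequencyPipeline is a hypothetical helper, the $sort stage is an addition, and contentType is used in the usage comment on the assumption that the asset documents carry that optional GridFS metadata field.

```javascript
// Build a pipeline that lists each distinct value of `fieldName`
// together with how many documents have it, most frequent first.
// (Hypothetical helper, not part of the original page.)
function buildFieldFrequencyPipeline(fieldName) {
    return [
        { $group: { _id: "$" + fieldName, count: { $sum: 1 } } },
        { $sort: { count: -1 } }
    ];
}

// Inside the mongo shell (field name is an assumption):
//   db.fs.files.aggregate(buildFieldFrequencyPipeline("contentType"))
```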