Open edX Tagging Service Proposal

This is a proposal for an Open edX tagging service which allows Open edX administrators and content authors to define taxonomies, and then allows users to tag most entities with those tags.

This proposed tagging service has the following features:

  • Taxonomies can consist of unstructured keywords or structured data like learning outcomes
    • Example of a learning outcome: CCSS.Math.Content.HSS.ID.A.1 (Common Core State Standards > Math > High School Statistics > Interpreting Data > A1)
  • Various types of entities in Open edX can be tagged: XBlocks, sequentials, courses, CatalogCourses (all runs of a given course), programs, users, discussions, maybe other things in the future.
  • Taxonomies can be imported and exported
  • Taxonomies can be public or private
  • Tags can be public (all users will see that X is tagged with Y) or private (only I see that X is tagged with Z)

This proposal is designed with the following behavior in mind:

  • Need to support ~1B tags in the system
  • Reading tags is extremely common, but writes are rare (except in cases of high-volume automated content tagging)
  • Any user can create unstructured taxonomies (define a set of tags simply by tagging something with keywords), but only a limited set of users (admins and content authors) will create structured taxonomies

Implementation

The gist of the implementation is that we will create a “Tagging Service” as an independently deployable application which uses the Neo4j graph database to store taxonomies and the relationships between taxonomies and Open edX entities. The LMS, Studio, and other apps make API calls to the Tagging Service, which in turn reads from and writes to Neo4j.

Taxonomies and the tags they contain are represented in Neo4j as Nodes, with relationships among them as appropriate. A Taxonomy node CONTAINS many Tag nodes, and Tag nodes may optionally have relationships among each other, such as (LearningOutcomeMultiplication) NARROWS (LearningOutcomeBasicMath) which specifies that the “Multiplication” learning outcome(s) are a subset of the “Basic Math” learning outcomes.

Open edX entities (courses, users, etc.) are also represented in Neo4j as Nodes, with a TaggableEntity label and a type-specific label such as User, and an “externalId” property. Then any TaggableEntity can be TAGGEDWITH a Tag.
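
Because every API call looks entities up by “externalId”, the service would probably want uniqueness constraints and indexes in place before loading data. The following is a minimal sketch in the Neo4j 3.x syntax used elsewhere in this proposal (newer Neo4j releases use a different constraint syntax). It assumes that an externalId is unique within each type-specific label; that assumption is not stated in the proposal itself.

```cypher
// Assumption: externalId is unique per type-specific label (User, Course, Content).
// Run each statement separately if your client does not accept multiple statements.
CREATE CONSTRAINT ON (u:User) ASSERT u.externalId IS UNIQUE;
CREATE CONSTRAINT ON (c:Course) ASSERT c.externalId IS UNIQUE;
CREATE CONSTRAINT ON (b:Content) ASSERT b.externalId IS UNIQUE;

// Tag names may repeat across taxonomies, so use a plain index rather than a constraint.
CREATE INDEX ON :Tag(tag);
```

A uniqueness constraint also creates an index, so MERGE and MATCH on these labels by externalId can use an index lookup instead of a label scan.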

At large scales, the Tagging Service would likely require Neo4j Enterprise, which supports clustering/HA; that is fine since it is licensed under the AGPL, like Open edX itself.

Since the Tagging Service itself will do relatively little computation and is mostly focused on translating API requests to Neo4j queries, it should probably be written using an asynchronous framework like AIOHTTP or Node.js to allow a single instance of the Tagging Service to serve even large Open edX instances with many LMS nodes.

Separation of Concerns

LMS/Studio

  • Enforces permissions
  • All UI

Tagging Service

  • Enforces schema and constraints (A tag from a private taxonomy cannot be publicly applied, etc.)
  • Wraps Neo4j API in a (RESTful?) Open edX-specific API

Neo4j

  • Data storage layer, provides very fast and efficient querying of tag data

Example

Here is a full Neo4j Cypher (Cypher is the Neo4j query language) statement that will create examples of all of these nodes and relationships as would be used by the tagging service.

In the following example, there are:

  • Three taggable entities: the user Bob, the course “Math Course”, and a unit within that course.
  • A “Common Core” taxonomy, including a subset of the Common Core high school math learning outcomes
  • A private “Bob Private Tags” taxonomy (contains two tags, “Favorite” and “WIP”)

And the following tags have been applied:

  • The unit is publicly tagged with the “CCSS.MATH.CONTENT.HSS.ID.A.1” learning outcome and privately tagged by Bob with the tag “Favorite”
  • The course is tagged with the learning outcome group “CCSS.MATH.CONTENT.HSS”

Example Cypher statement:

CREATE
(bob:User:TaggableEntity {type: 'user', externalId: 327645, displayAs: 'bob'}),
(mathCourse:Course:TaggableEntity {type: 'course', externalId: 'course-v1:OpenCraft+math+course'}),
(mathUnit:Content:TaggableEntity {type: 'content', externalId: 'block-v1:OpenCraft+math+course+type@vertical+block@interpreting_data'}),

(cc:Taxonomy {name: 'Common Core State Standards', type: 'public'}),
(cc)-[:OWNEDBY {}]->(bob),
(a:Tag {tag: 'CCSS.MATH', description: 'Mathematics'}),
(aa:Tag {tag: 'CCSS.MATH.CONTENT.HSS', description: 'High School: Statistics & Probability'}),
(aa)-[:NARROWS {}]->(a),
(aaa:Tag {tag: 'CCSS.MATH.CONTENT.HSS.ID', description: 'Interpreting Categorical & Quantitative Data'}),
(aaa)-[:NARROWS {}]->(aa),
(aaaa:Tag {tag: 'CCSS.MATH.CONTENT.HSS.ID.A', description: 'Summarize, represent, and interpret data on a single count or measurement variable'}),
(aaaa)-[:NARROWS {}]->(aaa),
(aaaaa1:Tag {tag: 'CCSS.MATH.CONTENT.HSS.ID.A.1', description: 'Represent data with plots on the real number line (dot plots, histograms, and box plots).'}),
(aaaaa1)-[:NARROWS {}]->(aaaa),
(aaaaa2:Tag {tag: 'CCSS.MATH.CONTENT.HSS.ID.A.2', description: 'Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.'}),
(aaaaa2)-[:NARROWS {}]->(aaaa),
(cc)-[:CONTAINS {}]->(a),
(cc)-[:CONTAINS {}]->(aa),
(cc)-[:CONTAINS {}]->(aaa),
(cc)-[:CONTAINS {}]->(aaaa),
(cc)-[:CONTAINS {}]->(aaaaa1),
(cc)-[:CONTAINS {}]->(aaaaa2),

(mathUnit)-[:TAGGEDWITH {}]->(aaaaa1),
(mathCourse)-[:TAGGEDWITH {}]->(aa),

(bpt:Taxonomy {name: 'Bob Private Tags', type: 'private'}),
(bpt)-[:OWNEDBY {}]->(bob),
(bobFavorite:Tag {tag: 'Favorite', description: 'Favorite'}),
(bobWIP:Tag {tag: 'WIP', description: 'Work in Progress'}),
(bpt)-[:CONTAINS {}]->(bobFavorite),
(bpt)-[:CONTAINS {}]->(bobWIP),

(mathUnit)-[:TAGGEDWITH {}]->(x1:PrivateTag)-[:OWNEDBY {}]->(bob),
(x1)-[:TAGGEDWITH {}]->(bobFavorite)

Visualizing this in Neo4j gives:

[Figure: Neo4j graph visualization of the example nodes and relationships]
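
The NARROWS relationships make hierarchical lookups possible without any extra bookkeeping. As an illustrative sketch (not one of the API calls defined in this proposal), the following query finds every entity that is publicly tagged with “CCSS.MATH.CONTENT.HSS” or with any tag that narrows it, directly or transitively:

```cypher
// *0.. includes the starting tag itself as well as all of its descendants
MATCH (broad:Tag {tag: 'CCSS.MATH.CONTENT.HSS'})<-[:NARROWS*0..]-(narrow:Tag)
MATCH (entity:TaggableEntity)-[:TAGGEDWITH]->(narrow)
RETURN DISTINCT entity.externalId AS entity, narrow.tag AS tag
```

Run against the example graph above, this should match both the course (tagged with the group itself) and the unit (tagged with a learning outcome several levels below it). Private tags are not returned, because they are reached through an intermediate PrivateTag node rather than a direct TAGGEDWITH relationship.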

API Examples

Here are some examples of API calls that could be made to the Tagging Service, and the corresponding Neo4j queries it would run internally to serve them. As you can see, most API calls can be translated directly into a single Neo4j query, which is key to keeping the Tagging Service lightweight and performant.


Apply public tag “CCSS.MATH.CONTENT.HSS.ID.A.2” from Taxonomy “Common Core State Standards” to block “block-v1:a+b+c”

MATCH (taxonomy:Taxonomy {name: 'Common Core State Standards', type: 'public'})-[:CONTAINS]->(tag:Tag {tag: 'CCSS.MATH.CONTENT.HSS.ID.A.2'})
MERGE (block:Content:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
MERGE (block)-[:TAGGEDWITH]->(tag)
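
In practice the service would presumably pass request values as Cypher parameters rather than interpolating them into the query text. Below is a sketch of the same query in parameterized form; the parameter names $taxonomyName, $tag, and $blockId are hypothetical and would be supplied by whichever Neo4j driver the service uses.

```cypher
// $taxonomyName, $tag and $blockId are hypothetical parameter names
MATCH (taxonomy:Taxonomy {name: $taxonomyName, type: 'public'})-[:CONTAINS]->(tag:Tag {tag: $tag})
MERGE (block:Content:TaggableEntity {type: 'content', externalId: $blockId})
MERGE (block)-[:TAGGEDWITH]->(tag)
RETURN tag.tag AS applied
```

If the taxonomy is not public or does not contain the tag, the MATCH produces no rows, nothing is written, and the result is empty. The service could treat an empty result as a rejected request, which is one way to enforce the rule that a tag from a private taxonomy cannot be publicly applied.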


User 123 applies private tag “needswork” to block “block-v1:a+b+c”

This API call will automatically create a private taxonomy called “My Tags” for the user, to contain their unstructured keyword tags, if one doesn’t exist already.

MERGE (user:User:TaggableEntity {type: 'user', externalId: 123})
MERGE (block:Content:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
MERGE (pt:Taxonomy {name: 'My Tags', type: 'private'})-[:OWNEDBY]->(user)
MERGE (pt)-[:CONTAINS]->(tag:Tag {tag: 'needswork'})
CREATE
   (block)-[:TAGGEDWITH]->(x1:PrivateTag)-[:OWNEDBY]->(user),
   (x1)-[:TAGGEDWITH]->(tag)

TBD: how to change that last CREATE to a MERGE so it’s idempotent?
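
One possible answer, offered as a sketch rather than a settled design: MERGE accepts only a single path, and the PrivateTag node needs three relationships (from the block, to the tag, and to the user), so the path cannot cover all of them at once. A workaround is to store the owner on the PrivateTag node itself and merge in two steps. The “ownerId” property is hypothetical and is not part of the model described above.

```cypher
MERGE (user:User:TaggableEntity {type: 'user', externalId: 123})
MERGE (block:Content:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
MERGE (pt:Taxonomy {name: 'My Tags', type: 'private'})-[:OWNEDBY]->(user)
MERGE (pt)-[:CONTAINS]->(tag:Tag {tag: 'needswork'})
// Hypothetical ownerId property lets one path identify "this user's application of this tag to this block"
MERGE (block)-[:TAGGEDWITH]->(x1:PrivateTag {ownerId: 123})-[:TAGGEDWITH]->(tag)
// Ensure the ownership relationship exists without duplicating it
MERGE (x1)-[:OWNEDBY]->(user)
```

Running this twice should leave a single PrivateTag node in place. Note that MERGE alone does not guard against duplicates created by concurrent requests, so the service may still need to serialize writes or add constraints.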


Get all tags (public and private) on block “block-v1:a+b+c” visible to user 123

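// Public tags applied directly to the block
MATCH (te:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})-[:TAGGEDWITH]->(t:Tag)
RETURN t
UNION
// Private tags applied to the block by user 123
MATCH (te:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})-[:TAGGEDWITH]->(pt:PrivateTag)-[:OWNEDBY]->(:User:TaggableEntity {externalId: 123}), (pt)-[:TAGGEDWITH]->(t:Tag)
RETURN t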


Get all tags (public and private) on block “block-v1:a+b+c”

Only admins/servers could make this API call.

MATCH (te:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
MATCH (te)-[:TAGGEDWITH*1..2]->(t:Tag)
RETURN t


Integration with Studio

Initially, a Taxonomy section could be added to Studio, to allow authoring of Taxonomies as first-class entities, for tagging content or users. The Studio UI would use Neo4j’s existing taxonomy visualization code (D3-based) to display taxonomies, but editing could only be done by uploading a CSV that defines the taxonomy. At first, taxonomies would only support simple hierarchies or unstructured keyword sets.
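
As a rough sketch of how an uploaded CSV could be turned into a simple hierarchy, Neo4j's LOAD CSV clause can build the Taxonomy, Tag, CONTAINS, and NARROWS structure in one statement. The file name, the taxonomy name, and the column names (tag, description, parent) are all assumptions made for illustration; the proposal does not define a CSV format.

```cypher
// Assumed CSV columns: tag, description, parent (parent is empty for top-level tags)
MERGE (taxonomy:Taxonomy {name: 'AP Biology 12 Taxonomy', type: 'public'})
WITH taxonomy
LOAD CSV WITH HEADERS FROM 'file:///taxonomy.csv' AS row
MERGE (taxonomy)-[:CONTAINS]->(tag:Tag {tag: row.tag})
SET tag.description = row.description
WITH taxonomy, tag, row
WHERE row.parent IS NOT NULL AND row.parent <> ''
// MERGE the parent too, so rows do not have to be ordered parent-first
MERGE (taxonomy)-[:CONTAINS]->(parent:Tag {tag: row.parent})
MERGE (tag)-[:NARROWS]->(parent)
```

The reverse direction (export) could be a plain query whose rows the service writes out in the same assumed column layout:

```cypher
MATCH (:Taxonomy {name: 'AP Biology 12 Taxonomy'})-[:CONTAINS]->(t:Tag)
OPTIONAL MATCH (t)-[:NARROWS]->(parent:Tag)
RETURN t.tag AS tag, t.description AS description, parent.tag AS parent
ORDER BY tag
```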

In addition, Studio (and maybe the LMS, for discussion posts?) would allow users to publicly or privately tag content with free-form keywords. Taxonomies can be linked to a course. When typing out a keyword in the “tags” field of any block/unit/course, a dropdown would appear showing an autocomplete menu of matching tags from all taxonomies currently linked to the course, as well as options like “Create new tag ‘respiration’ (My Personal Tags) (Private)” or “Create new tag ‘respiration’ in (AP Biology 12 Taxonomy)”
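
The autocomplete lookup described above could also be a single query. The sketch below assumes a hypothetical LINKEDTO relationship from a Taxonomy to a Course (the proposal says taxonomies can be linked to a course but does not name the relationship) and hypothetical parameters $courseId and $prefix.

```cypher
// LINKEDTO, $courseId and $prefix are hypothetical names used for illustration
MATCH (course:Course {externalId: $courseId})<-[:LINKEDTO]-(taxonomy:Taxonomy)-[:CONTAINS]->(t:Tag)
WHERE toLower(t.tag) STARTS WITH toLower($prefix)
RETURN taxonomy.name AS taxonomy, t.tag AS tag
ORDER BY tag
LIMIT 10
```

The “Create new tag” options would be added by the UI when the typed keyword has no exact match in the returned rows.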

Try it Yourself

You can see examples of modelling tags and taxonomies in Neo4j easily by running:

docker run --publish=7474:7474 --publish=7687:7687 neo4j:3.4

Then, browse to http://localhost:7474/browser/ (the initial username and password are both "neo4j"; you must change the password upon login). Enter the example commands shown in this proposal, then run the command “MATCH (n) RETURN n LIMIT 100” to see the graph.

To reset this test database at any time, run this command:

MATCH (n)
DETACH DELETE n


Open Questions

  1. Should Taxonomies be typed (User hierarchy taxonomies, learning outcome taxonomies, etc.)? Alternately, should Taxonomies specify what types of TaggableEntity they apply to?
  2. Should Taxonomies be a Node that CONTAINS all their tags, or a Label in Neo4j applied to all of that Taxonomy’s tags? (Probably a Node)
  3. Should/can we support RDF import/export?
  4. Should we support an additional type of tag, which is a parametrized tag? So a taxonomy could contain things like “Author {applies_to: ‘User’}”, “Publisher {applies_to: ‘Group’}”, and then content could be tagged like:
    Block X has tag “Author:” with user “Braden MacDonald”

    This is useful for Open edX instances that want to have custom structured data about their content. For example, a particular Open edX instance may need to tag all of their content with “Author”, “Publisher”, “Copyright Owner”, and “Department”.

    In Neo4j syntax, a parametrized Author tag could look like:

    // MATCH comes first: Cypher requires a WITH between an updating clause (MERGE) and a later MATCH
    MATCH (taxonomy:Taxonomy {name: 'MyOrg Attribution Taxonomy', type: 'public'})-[:CONTAINS]->(tag:ParametrizedTag {tag: 'Author'})
    MERGE (user:User:TaggableEntity {type: 'user', externalId: 345})
    MERGE (block:Content:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
    CREATE
       (block)-[:TAGGEDWITH]->(x1:AppliedParametrizedTag)-[:TAGGEDWITH]->(tag),
       (x1)-[:PARAMETER]->(user)
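
    Reading such tags back would follow the same shape in reverse. As a sketch under the same assumed labels (ParametrizedTag, AppliedParametrizedTag) and the PARAMETER relationship used above, this query lists every parametrized tag on a block together with the entity supplied as its parameter:

    ```cypher
    MATCH (block:Content:TaggableEntity {type: 'content', externalId: 'block-v1:a+b+c'})
          -[:TAGGEDWITH]->(x1:AppliedParametrizedTag)-[:TAGGEDWITH]->(tag:ParametrizedTag),
          (x1)-[:PARAMETER]->(value:TaggableEntity)
    RETURN tag.tag AS tag, value.type AS valueType, value.externalId AS valueId
    ```

    For the block tagged above, this would be expected to return a row with tag “Author”, valueType “user”, and valueId 345.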