Content Tagging Strategy for the Open edX Platform

Project Leads: @Jenna Makowski @Brad Brown @Braden MacDonald @Ali Hugo @Dave Ormsbee (Axim) @Bryan Kersten (Deactivated)

Project Status: In Community Review

1 Background: Community Needs Driving The Project
2 Value and Impact
3 What We Propose To Build
4 How We Propose To Build It
5 Key Capabilities
6 Proposed Definitions
7 Projects for future product discovery

Background: Community Needs Driving The Project

We’ve heard many use cases that illustrate the need to align course content to taxonomies. These use cases are diverse and span a range of needs and outcomes. However, they are all underpinned by a common need: connecting course content to taxonomies, or controlled vocabularies. This is often expressed as the need to “tag content” or to “add tags to content”.

The outcomes of these use cases range from improving content authoring and content reuse workflows; to enabling content recommendation for learners; to supporting instructional design goals. Some examples include:

“As a content author, I want to attach topic tags to questions I create in Content Libraries so it's easy for me to find questions by topic and reuse them in different assessments.”

“As a content author, I want to search for all the videos we’ve created that are about a certain topic, like “soil health”, so that it’s easy to find the content I need in my Library and use it in my courses.”

“As an Instructor, I want to be able to search for all content tagged for a certain skill. For example, I may have a student who needs more practice with factoring binomial equations. I want to search for and find all the content tagged for factoring binomial equations. Even better, I want to refine the search for all formative assessments that cover factoring binomial equations. This will make it easier for me to connect my learners with the content they need to fill knowledge gaps.”

“As a learning designer, I want to tag units for competencies and learning objectives so that I can check for alignment between unit learning objectives and the overall course objectives.”

“As a marketing team lead, I want to be able to make content recommendations to learners based on prior content they have interacted with.”

“As an administrator, I want to integrate an adaptive engine to our platform that can make content recommendations based on learner profiles, interests and goals.”

Value and Impact

By building platform capacity to align content with taxonomies, or to “add tags to content”, we can deliver many benefits to all Open edX user personas. Some examples include:

Value for authors and instructional designers

Align any part of your course to a competency or skill
Organize content in your libraries around competencies or skills, such as problem banks for factoring binomial equations or videos aligned to factoring binomial equations
Organize content in your libraries by subject, such as videos about ethnology or assessments about constitutional law
Conduct targeted searches in your library, such as videos that cover the skill of factoring binomial equations
Analyze student understanding of/engagement with particular competencies or skills.
- For example, identify trends, such as high failure rates on assessments aligned to a certain skill, and assess the quality of the content used to teach that skill
- For example, identify knowledge gaps for individual learners or learner cohorts and customize content for them (or allow a third-party adaptive engine to do so)
Define prerequisite relationships between sets of content

Value for administrators

Standardize and control the taxonomies your authors use to align content to competencies, skills and subjects
Choose to author your own taxonomy, or ingest third-party taxonomies, such as the Open Skills Network taxonomy or Lightcast Skills taxonomy

Value for learners

Discover specific content in courses to fill competency or skill gaps
Access more robust data to inform decisions on which courses to pursue

Value for organizations

Teaching and Learning: Stepping stone toward unlocking the potential to integrate with adaptive learning services. Aligning course content to tags is a step toward enabling external services to build complex, adaptive experiences that are customized to individual learner profiles.
Teaching and Learning: Stepping stone toward unlocking other modular learning capabilities, such as enabling learners to build self-directed learning pathways, or enabling course teams to build more flexible and diverse types of learning presentations.
Marketing and Discovery: Stepping stone toward unlocking personalized recommendations.

What We Propose To Build

From the above use cases, we can distill a core set of generalized platform capabilities. By building these core capabilities, we will enable the Open edX community to unlock the value described in many of the above use cases.

The platform must support the capability to associate tags with content at all levels of content. This includes adding tags to individual components (text blocks, video blocks, question blocks), to all parts of the course (units, subsection and sections), and to full courses (and any new learning presentation types that might evolve in the future).

Tagging must be integrated into authoring workflows in Content Libraries and in the course outline.

Tags must be designed to align with content reuse workflows, i.e. integrated into Content Libraries and the course outline.

The platform cannot be prescriptive about all taxonomies organizations use. Rather, it must enable instances and organizations to customize taxonomies to meet individual needs.

Conversely, the platform must support a core set of standardized, platform-wide taxonomies to enable platform-wide use cases such as content search, content recommendations and more.

The platform must support use/ingestion of third-party taxonomies and/or the ability to create custom taxonomies from scratch.

The platform must support the capacity to create hierarchical and horizontal relationships between taxonomies.

Tagging infrastructure must be designed neutrally, in order to support multiple purposes, from instructional design needs to content management needs.

Tagging infrastructure must be designed with a unified UX/UI experience in mind, particularly between content authoring and reuse workflows between Content Libraries and the course outline.

Note: Adding content tagging capabilities will follow on the release of Content Libraries, V2 (H1 2023). Content Libraries, V2 enables authors to create videos, text and problems in Libraries and reuse them in any course. More information here.

How We Propose To Build It

Until now, tagging capabilities have been built to support the delivery of specific, narrowly scoped features. For example, the Discovery IDA has both a simple feature to provide tagging via the Django taggit library, and a custom set of models to allow the mapping of a skills taxonomy to courses. Additionally, a number of course-run-specific metadata fields are stored in the course_metadata_courserun table in the edx-platform database. These fields require database migrations to add and remove, and in many cases are specific to particular deployments of the platform. While many of the details of the technical implementation will not be understood until the technical design has been approved, there are important considerations that we can list regarding our expected approach.

First, the approach should be appropriate for a shared software platform. The addition of meta-data should be under the control of each deployment – with a set of sensible, platform-wide defaults – and should not require changes to the software core to add or remove meta-data.

Second, the approach should consider that the inability to flexibly apply meta-data to content has been an impediment to a number of efforts. However, the disaggregated user pain was not sufficient to inspire any team to build a general service to date. From a platform point of view, when we sum up user pain, it is obvious that a general capability for mapping meta-data to course content would unlock value across the platform.

Third, we believe that meta-data added by the owners of models should be authoritative. Yet we acknowledge that that data is valuable for other critical needs. For example, authors of content should own adding – or approving – meta-data about the content they create. Authors should define the content's level of difficulty, its expected time to complete, and the skills it teaches. However, this meta-data is also critical for marketing courses, computing learner analytics, and recommending content.

A general capability for classifying course content, with appropriate APIs, will have a number of benefits. First, it will allow us to deprecate limited implementations, reducing confusion and maintenance costs. Second, it will allow us to share metadata across domain boundaries while ensuring appropriate ownership of that data. Third, it will make the software a better platform, allowing specific instances of the platform to apply the appropriate metadata to course content without altering the platform core. Fourth, it will allow us to converge on a single vocabulary for platform metadata and to disambiguate terms like tag, taxonomy, metadata, etc.

Key Capabilities

Tagging Infrastructure

The platform will support tagging based primarily on name-value pair style tags. It will support the following four types:

Field Type	Description	Field (example names)	Tag (value)	Author Experience
Free-form	Open-ended option to allow any author to add any desired tags; these are all collected in a single (possibly invisible) field for good data management	Tags	Anything author wants to enter	Author sees a field to add as many tags as desired and can free-form add. Could include predictive suggestions of existing tags as they type.
System-defined	Core tags controlled at the platform level in order to keep consistency across search and discovery and for general messiness control. Admins would not be able to change these.	Language, format/content type, organization	Specific set against each of these.	Many of these would not be visible but some might be integrated into facet search (e.g., content format, language). Most would not be editable.
Admin-defined fields	Admins can set up specific fields that authors can free-form enter tags on. These would typically be used for instances in which there could be many possible tags but admin wants them organized separately from free-form tags	Outcomes, Learning Objectives	Anything author wants to enter	Author is presented with the field and can enter a single free-form value.
Admin-defined closed taxonomies	Admins can set up specific fields that require closed taxonomies (including selecting from existing, uploading, manually creating). Includes ability to create child and grandchild hierarchies	Lightcast skills, state standards	Biology, Microbiology	Author is presented with the field and a drop-down (or other UI selection element) to select the value.

Example of Name-Value Pairs might include:

Field	Tag
subject	e.g. Anthropology
competency	e.g. Conflict Resolution
skill	e.g. Define Stakeholder Roles
curriculum alignment	e.g. Operations and Algebraic Thinking
learning outcome	e.g. Factor Binomial Equations
level of difficulty	e.g. Medium
hours of effort	e.g. 4 - 6
prerequisites	e.g. Linear Algebra
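As a rough illustration of the name-value pair model and the four field types above, here is a minimal Python sketch. All class and function names (`FieldType`, `TagField`, `Tag`, `make_tag`) are hypothetical and are not part of any existing Open edX API; the actual technical design has not been decided.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class FieldType(Enum):
    """The four field types from the table above (names are illustrative)."""
    FREE_FORM = "free-form"
    SYSTEM_DEFINED = "system-defined"
    ADMIN_DEFINED = "admin-defined field"
    ADMIN_CLOSED_TAXONOMY = "admin-defined closed taxonomy"


@dataclass(frozen=True)
class TagField:
    """The "name" half of a name-value pair."""
    name: str
    field_type: FieldType
    # Assumption: only closed or system-controlled fields restrict their values.
    allowed_values: Optional[FrozenSet[str]] = None

    def accepts(self, value: str) -> bool:
        if self.allowed_values is None:
            return bool(value.strip())  # free-form: any non-empty text
        return value in self.allowed_values


@dataclass(frozen=True)
class Tag:
    """One name-value pair applied to one piece of content."""
    content_id: str
    field_name: str
    value: str


def make_tag(content_id: str, tag_field: TagField, value: str) -> Tag:
    """Create a tag, rejecting values a closed taxonomy does not contain."""
    if not tag_field.accepts(value):
        raise ValueError(f"{value!r} is not a valid value for field {tag_field.name!r}")
    return Tag(content_id, tag_field.name, value)


subject = TagField(
    "subject", FieldType.ADMIN_CLOSED_TAXONOMY, frozenset({"Anthropology", "Biology"})
)
free_form = TagField("Tags", FieldType.FREE_FORM)

video_tag = make_tag("video-1", subject, "Anthropology")
note_tag = make_tag("video-1", free_form, "soil health")
```

In this sketch a closed taxonomy is just a fixed set of values; hierarchy and localization would be layered on top of it.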

2. Taxonomies will support localization. This will hold true for platform-supported taxonomies and imported taxonomies.

2. This will require at least one additional piece of data for each tag: the field, the value (the same across all languages), and a translated value for each supported language. Note that the "tag" holds only the key-value pair for a particular piece of content. The translated values would live in the taxonomy and would not be considered part of the "tag", i.e. they are not stored in a place that is linked to a specific content object.
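The separation described in this note could look roughly like the following sketch, in which the tag stores only the canonical value and translations are looked up from the taxonomy at display time. The `Taxonomy` class and its methods are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, NamedTuple


class Tag(NamedTuple):
    """Stored per content object: field plus canonical value, no translations."""
    content_id: str
    field_name: str
    value: str


@dataclass
class Taxonomy:
    """Holds the translations, independent of any content object."""
    field_name: str
    # canonical value -> {language code -> translated value}
    translations: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_term(self, value: str, **by_language: str) -> None:
        self.translations[value] = dict(by_language)

    def display_value(self, tag: Tag, language: str) -> str:
        # Assumption: fall back to the canonical value when no translation exists.
        return self.translations.get(tag.value, {}).get(language, tag.value)


subjects = Taxonomy("subject")
subjects.add_term("Biology", es="Biología", fr="Biologie")
tag = Tag("video-1", "subject", "Biology")
```

Here `subjects.display_value(tag, "es")` returns the Spanish label, while an unsupported language falls back to the canonical value stored on the tag.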

Tag Behavior

All of the fields and associated tags will display on content whether it’s in a Library or in a course outline.
The fields and tags will display with a piece of content when it moves from a Library to one or more courses. The same holds true in reverse, when a tag is added to a course section in Studio and that content is exported to a Library.
The tags and taxonomies that an author sees are determined by the taxonomies chosen and configured for their Instance or their organization. So if Instance A uses the Lightcast Skills taxonomy and Instance B uses the Open Skills Network taxonomy, authors on each will only see the tags in the taxonomy configured for them.

Platform Taxonomies

The platform will identify a core set of fields and taxonomies that are established and standardized, such as “language”, “content format” and “organization”. The platform will control these taxonomies to mitigate messy data and so that these fields can be used for faceted search infrastructure in libraries.
1. These core fields may be auto-generated from core data models in content blocks.
2. The fields and taxonomies will be unchangeable by the user.
3. Many of these would not be visible but some might be integrated into facet search (e.g., content format, language). Most would not be editable.

2. The platform will offer a few recommended fields and a “menu” of optional closed taxonomies for those fields. For example, we may suggest a field for “skill” and offer three skills taxonomies for optional use. Administrators would have the option to choose one of the taxonomies, upload their own taxonomy, or not use that field at all.

Admin User Stories

Admins can set up as many specific fields as they’d like that authors can create free-form tags on.
Admins can set up as many specific fields as they’d like that are associated with closed taxonomies. Admins would have three pathways to associate closed taxonomies with fields:
1. Choosing from a menu of taxonomies that already exist in the platform
2. Ingesting a new external taxonomy
3. Creating a taxonomy from scratch
Admins can apply taxonomies:
1. Across courses and libraries in an Instance
2. Across courses and libraries in an organization
3. To specified organizations in an Instance
Admins can share taxonomies and tags across instances
Administrators can create hierarchical relationships (child and grandchild) between tags.
Administrators can create horizontal relationships between tags.
Admins can require that certain tags be filled in; for example, all competency tags must be filled in.
Admins can choose to add AI-generated tags from particular taxonomies to particular content sets in bulk.

3. Approach for multi-tenant sites vs instances with multiple orgs: From a technical perspective it's all possible; the question is just at what level taxonomies are scoped. We could scope them at the course, org, site, or instance level (in order from smallest to largest; e.g. one instance can have five sites, each with three orgs, each with a hundred courses). Or we could say that taxonomies can be defined at any of those levels. I think everything is pretty similar from a technical perspective, so product requirements should drive this, to reflect what the real needs are. Just be aware that configuring things at the intermediate levels (org and site) is unfortunately a little half-baked in the platform today, so there aren't as many tools for permissions, groups, and admin functionality at those levels. That is, configuration there has to use the Django admin rather than a nice admin portal, and the options for permissions are very basic.

4. Sharing tags across instances becomes very complex because of the need to share taxonomies and the problematic case where each instance is using a different version of the same taxonomy. But if we want to enable that (which would be nice), a nice simple approach to support it is to say that any tags which already exist in the taxonomy on the destination instance will be imported, as well as any free-form tags, and the rest (from taxonomies that aren't compatible or don't exist on the destination) will be dropped at import time. I think that covers a lot of use cases without adding any complexity.
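The import rule suggested in this note can be sketched as a simple filter. This is only an illustration under stated assumptions: tags are modeled as `(field, value)` pairs, the free-form field is assumed to be named "Tags" as in the field type table, and `import_tags` is a hypothetical helper, not an existing platform function.

```python
from typing import Dict, Iterable, List, Set, Tuple

TagPair = Tuple[str, str]  # (field name, tag value)


def import_tags(
    incoming: Iterable[TagPair],
    destination_taxonomies: Dict[str, Set[str]],
    free_form_field: str = "Tags",
) -> Tuple[List[TagPair], List[TagPair]]:
    """Split incoming tags into those kept and those dropped at import time."""
    kept: List[TagPair] = []
    dropped: List[TagPair] = []
    for field_name, value in incoming:
        is_free_form = field_name == free_form_field
        known = value in destination_taxonomies.get(field_name, set())
        if is_free_form or known:
            kept.append((field_name, value))
        else:
            # Taxonomy missing on the destination, or the value is not in it.
            dropped.append((field_name, value))
    return kept, dropped


destination = {"skill": {"Define Stakeholder Roles"}, "subject": {"Biology"}}
incoming = [
    ("skill", "Define Stakeholder Roles"),   # exists on destination: kept
    ("skill", "Negotiate Contracts"),        # unknown value: dropped
    ("Tags", "soil health"),                 # free-form: kept
    ("state standard", "Operations and Algebraic Thinking"),  # no such taxonomy: dropped
]
kept, dropped = import_tags(incoming, destination)
```

Returning the dropped tags alongside the kept ones would let an import report tell the author what was lost, though whether to surface that is a product decision.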

7. Approach: surface the "required tag is missing!" errors in the UI at different levels, but do not actually enforce the requirement, to avoid the complexities mentioned.
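A minimal sketch of "warn but do not enforce", assuming tags are `(field, value)` pairs and required fields are a plain list of field names. Both function names and the shape of the returned dictionary are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def missing_required_tags(
    content_tags: Iterable[Tuple[str, str]], required_fields: Iterable[str]
) -> List[str]:
    """Return the required field names that have no tag on this content."""
    present = {field_name for field_name, value in content_tags if value}
    return sorted(set(required_fields) - present)


def save_with_warnings(
    content_id: str,
    content_tags: List[Tuple[str, str]],
    required_fields: List[str],
) -> Dict[str, object]:
    """Always save; report missing required tags as warnings for the UI."""
    warnings = [
        f"Required tag is missing: {name}"
        for name in missing_required_tags(content_tags, required_fields)
    ]
    # The save is never blocked: enforcement is deliberately left out.
    return {"content_id": content_id, "saved": True, "warnings": warnings}


result = save_with_warnings(
    "unit-1",
    [("subject", "Biology")],
    required_fields=["competency", "subject"],
)
```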

Admin Experience: Taxonomy Management System

Environment for administrators to create a new taxonomy and edit it.
Mechanism to ingest third-party taxonomies and edit them.
Permission controls for editing, managing and finalizing taxonomies.
Option to enable bulk-add using AI generation.
Version control
1. Only one version of a taxonomy will be supported at a time

Approach for handling marketing tags unrelated to teaching and learning: This might be more manageable as a specific field type, so you could have a "Marketing" field with multiple values like "summer-2023-promotion,on-sale,labor-day-bundle" and keep them separate from the learning data. However, since marketing tags fall under a different permissions boundary, I think it would probably be best to keep them in a separate system if possible. Tagging is a fairly generic and simple thing, so there's no reason why we can't have several different systems that do similar things.

5. Support uploading a new version of a taxonomy, potentially with breaking changes, without breaking existing applications of tags to courses from a prior version.

Author User Stories - Creating Tags

Authors can see tags displayed in content blocks in Libraries and in each level of the course hierarchy in Studio.
1. When content is reused from Libraries in a course, authors will be able to add additional tags to the content, but will not be able to change any tags that were associated with the content in the Library.
Authors can add free-form tags to any piece of content or any part of the course outline.
1. UI where author just sees a field to add as many tags as desired, with predictive suggestions.
Authors can add free-form tags on admin-defined fields to any piece of content or any part of the course outline.
1. UI where author is presented with the field and can enter a single free-form tag.
Authors can choose a tag from an admin-defined closed taxonomy and apply it to any piece of content or any part of the course outline.
1. UI where author is presented with the field and a drop-down (or other UI selection element) to select the value.
Permissions for adding or editing tags follow the same logic and permissions structure as editing content.
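The rule above that Library tags cannot be changed once content is reused in a course, while additional tags can be added, might be modeled as in the following sketch. `AppliedTag` and `CourseBlockTags` are hypothetical names, and the choice of `PermissionError` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AppliedTag:
    field_name: str
    value: str
    from_library: bool = False  # True when inherited from the Library copy


class CourseBlockTags:
    """Tags on a course block that was reused from a Library."""

    def __init__(self, library_tags: Iterable[Tuple[str, str]]):
        self._tags: List[AppliedTag] = [
            AppliedTag(f, v, from_library=True) for f, v in library_tags
        ]

    def add(self, field_name: str, value: str) -> None:
        """Authors may always add course-level tags."""
        self._tags.append(AppliedTag(field_name, value))

    def remove(self, field_name: str, value: str) -> None:
        """Only tags added in the course can be removed."""
        for tag in self._tags:
            if (tag.field_name, tag.value) == (field_name, value):
                if tag.from_library:
                    raise PermissionError(
                        "Tags inherited from the Library cannot be changed in a course"
                    )
                self._tags.remove(tag)
                return
        raise KeyError((field_name, value))

    def pairs(self) -> List[Tuple[str, str]]:
        return [(t.field_name, t.value) for t in self._tags]


block = CourseBlockTags([("subject", "Biology")])
block.add("level of difficulty", "Medium")
```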

Author User Stories - Content Management & Search

Authors can conduct basic keyword searches on all content in libraries
Authors can utilize facet-style filtering on search results
Authors can conduct advanced searches such as “all videos covering topic X and competency Y”.
Authors can group content by tags, such as all videos tagged with Competency X.

3. Approach for Boolean searches: When we built https://www.labxchange.org/library, which took a lot of iteration, we found that for faceted search, people expect different options from the same field to be joined with OR, and different fields to be joined with AND. Try the link above and you'll see what I mean: you can search for "[Subject: Genetics OR Evolution] AND [video length: < 5 min]" very naturally and intuitively using the facets on the left panel. I strongly recommend building a UI like that (there are libraries like instantsearch that make it easy), rather than having learners type tag fields/values into a search box or worry about explicit boolean operators.
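The OR-within-a-field, AND-across-fields behavior described in this note can be expressed in a few lines. This is an in-memory sketch for illustration only, with hypothetical function names and sample data; a real implementation would most likely delegate the filtering to a search engine.

```python
from typing import Dict, List, Set

TagsByField = Dict[str, Set[str]]  # field name -> set of tag values


def matches(item_tags: TagsByField, selected: TagsByField) -> bool:
    """OR within a field (any chosen value matches), AND across fields."""
    return all(
        item_tags.get(field_name, set()) & chosen  # non-empty intersection
        for field_name, chosen in selected.items()
        if chosen  # a facet with nothing selected does not filter
    )


def faceted_search(items: Dict[str, TagsByField], selected: TagsByField) -> List[str]:
    return [item_id for item_id, tags in items.items() if matches(tags, selected)]


library = {
    "video-1": {"subject": {"Genetics"}, "video length": {"< 5 min"}},
    "video-2": {"subject": {"Evolution"}, "video length": {"5 - 10 min"}},
    "video-3": {"subject": {"Ecology"}, "video length": {"< 5 min"}},
}

# [Subject: Genetics OR Evolution] AND [video length: < 5 min]
selection = {"subject": {"Genetics", "Evolution"}, "video length": {"< 5 min"}}
results = faceted_search(library, selection)
```

With this sample data only `video-1` satisfies both facets; dropping the video length facet would return `video-1` and `video-2`.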

Proposed Definitions

Tag: Any application of metadata to an object. In the Open edX context, a user story may be, “I want to add subject tags, or skills tags, to videos, to units, to sections, or to a course.”

Taxonomy: A controlled vocabulary in which all the values belong to a single hierarchical structure and have parent/child relationships to other terms, or horizontal relationships to other terms. For example, the Core Subject Taxonomy for Mathematical Sciences Education or the Open Skills Network Taxonomies.

Name-Value Pair: The mechanism to relate tags to data sets, where name functions as the constant that defines the data set, and value functions as the variable tags that belong to the set. For example:

Field (name)	Tag (value)
Subject	Biology
	Anthropology
	History

Some tags may require hierarchies three or more layers deep. For example,

Field	Parent Tag	Child Tag	Grandchild tag
Subject	Biology	Genetics	Molecular Genetics
	Anthropology	Cultural Anthropology	Ethnology
	History	Political History	Constitutional History
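One simple way to represent such a hierarchy is a child-to-parent mapping, sketched below using part of the example table. The `PARENTS` mapping and both helper functions are hypothetical; whether a search for a parent tag should also match its descendants is an assumption, not a stated requirement.

```python
from typing import Dict, List, Optional

# child value -> parent value (None marks a top-level tag in the "Subject" field)
PARENTS: Dict[str, Optional[str]] = {
    "Biology": None,
    "Genetics": "Biology",
    "Molecular Genetics": "Genetics",
    "Anthropology": None,
    "Cultural Anthropology": "Anthropology",
    "Ethnology": "Cultural Anthropology",
}


def lineage(value: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the top-level tag down to the given tag."""
    path: List[str] = []
    current: Optional[str] = value
    while current is not None:
        path.append(current)
        current = parents[current]
    return list(reversed(path))


def descendants(value: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return every tag that sits below the given tag, at any depth."""
    return [
        candidate
        for candidate in parents
        if candidate != value and value in lineage(candidate, parents)
    ]
```

For example, `lineage("Ethnology", PARENTS)` walks up through Cultural Anthropology to Anthropology, and `descendants("Biology", PARENTS)` lists Genetics and Molecular Genetics.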

Tags require multi-select functionality (each field supports multiple values). For example,

Field	Tag
Subject	Biology, Chemistry
	Anthropology, Music
	History, Geography

Projects for future product discovery

Does tagging extend to other entities besides content, such as people?

Do tags display for learners in the LMS?

Do we extend content searching capabilities to the course outline in Studio?