State of edx-search 2023

 

edx-search Codebase

edx-search is an app that is installed into edx-platform, specifically the LMS. It is mostly a wrapper around a search engine providing an abstracted interface, but breaks that abstraction in several places. All users also break the abstraction by requiring the engine be ElasticSearch compatible.

edx-search is not installed into any other codebases in the openedx or edx github orgs (at least that I have permission to search).

The engine abstraction provides a call for indexing, a call for searching, and a few other helpers. These are not course specific, and details are left to the implementing engine. The only implementation is ElasticSearchEngine. Because the indexing operation is completely abstract, taking only unspecified “sources”, anyone using this code must know what engine they are using to provide indexable sources and to search them later.
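To make the cost of that abstraction concrete, here is a minimal sketch. The class names `AbstractEngine` and `InMemoryEngine` are hypothetical and are not part of edx-search; the point is only that the shape of "sources" is a private agreement between the caller and one particular engine.

```python
from abc import ABC, abstractmethod


class AbstractEngine(ABC):
    """Hypothetical stand-in for an engine abstraction: index and search only."""

    def __init__(self, index_name):
        self.index_name = index_name

    @abstractmethod
    def index(self, sources, **kwargs):
        """Add documents to the search index."""

    @abstractmethod
    def search(self, query_string=None, field_dictionary=None, **kwargs):
        """Return matching documents."""


class InMemoryEngine(AbstractEngine):
    """Toy engine: sources must be dicts with an 'id' key, a rule only this engine knows."""

    def __init__(self, index_name):
        super().__init__(index_name)
        self._docs = {}

    def index(self, sources, **kwargs):
        for doc in sources:
            self._docs[doc["id"]] = doc

    def search(self, query_string=None, field_dictionary=None, **kwargs):
        results = []
        for doc in self._docs.values():
            # Every requested field must match exactly.
            if any(doc.get(k) != v for k, v in (field_dictionary or {}).items()):
                continue
            # Naive case-insensitive substring match over all values.
            if query_string and not any(
                query_string.lower() in str(v).lower() for v in doc.values()
            ):
                continue
            results.append(doc)
        return {"total": len(results), "results": results}


engine = InMemoryEngine("courseware_content")
engine.index([
    {"id": "b1", "course": "course-v1:X+1+2023", "content": "Intro to searching"},
    {"id": "b2", "course": "course-v1:Y+2+2023", "content": "Advanced indexing"},
])
found = engine.search("search", field_dictionary={"course": "course-v1:X+1+2023"})
```

Nothing in the abstract signature tells a caller that documents need an `id` or a `course` field; that knowledge lives in the implementation.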

Search API and URLs

edx-search also provides a search API as a set of URLs, which is installed into the LMS. That API has an all-course search function and a course-limited search function, which channel to the same function under the hood. This API is only for course search; the course index is baked in. Users who want to use edx-search for other information build their own APIs, see the index table below. The API also has a course_discovery_search, which is more obviously not generic.
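The two-entry-points, one-implementation shape can be sketched as follows. All function names here are hypothetical, and the returned dict is just a stand-in for a real query; only the index name comes from the table later in this page.

```python
def _do_search(search_term, course_id=None):
    """Shared implementation (hypothetical); the course index name is baked in."""
    field_dictionary = {"course": course_id} if course_id else {}
    return {
        "index": "courseware_content",
        "query": search_term,
        "field_dictionary": field_dictionary,
    }


def search_all_courses(search_term):
    # All-course search: no course restriction.
    return _do_search(search_term)


def search_one_course(search_term, course_id):
    # Course-limited search: same code path, one extra field match.
    return _do_search(search_term, course_id=course_id)
```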

Both the course search and the more specific discovery search are exposed as Python methods. The course search method, perform_search, is only used in tests.

The course_api uses course_discovery_search directly via that API, but that use is gated on ENABLE_COURSEWARE_SEARCH which is on in devstack settings but otherwise off for the codebase and edx.org. It could of course be enabled by other openedx users.

There are also direct uses of the engine.search function. See the index table below; some of the classes that work this way are probably the best models for future work because they colocate search and indexing.

Extension Points

edx-search provides two extension points that apply to the API via URL or direct function call and one that applies to the URL endpoint only.

The search filter generator lets you specify which fields to search, which filters to apply, and which fields to exclude. The filters, however, are only there in theory for course discovery: they are not hooked up for the special course discovery endpoint.
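A rough sketch of what such a generator produces, and of the gap in the discovery path. `ToyFilterGenerator` and `discovery_style_search` are hypothetical names written for this page, not the edx-search classes, and the keyword arguments are invented for illustration.

```python
class ToyFilterGenerator:
    """Hypothetical generator producing the three dictionaries described above."""

    def field_dictionary(self, **kwargs):
        # Fields to search on.
        fields = {}
        if kwargs.get("course_id"):
            fields["course"] = kwargs["course_id"]
        return fields

    def filter_dictionary(self, **kwargs):
        # Filters to apply.
        return {"start_date": kwargs["as_of"]} if kwargs.get("as_of") else {}

    def exclude_dictionary(self, **kwargs):
        # Field values to exclude.
        return {"org": list(kwargs.get("hidden_orgs", []))}

    def generate(self, **kwargs):
        return (
            self.field_dictionary(**kwargs),
            self.filter_dictionary(**kwargs),
            self.exclude_dictionary(**kwargs),
        )


def discovery_style_search(generator, **kwargs):
    """Mimics the gap noted above: the filter dictionary is computed, then dropped."""
    fields, _unused_filters, excludes = generator.generate(**kwargs)
    return {"field_dictionary": fields, "exclude_dictionary": excludes}
```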

There is one SearchFilterGenerator in the entire openedx and edx github repositories:

The Search Result Processor is well named and lets you mangle search results all you want afterwards. Importantly the Search Result Processor expects to be passed a user.

There is one SearchResultProcessor in the entire openedx and edx github repositories:

  • LMSSearchResultsProcessor: since it has the user, it restricts access to blocks the user can see, and provides proper links. Not currently in use, so this may not work. For course staff users, it short circuits access checks which should help performance. For other users, it is based on a very standard get_course_blocks call, which could well be the only way to reliably guarantee that the restrictions on search match the restrictions on content in general. No matter how many matches are returned from a course only one such call will be made, but no matter how few matches are returned the call will always get all available blocks, so there might be room for optimization in the low match case.
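The access pattern described in that bullet can be sketched like this. Everything here is a toy: `fetch_visible_blocks` stands in for the expensive per-course block lookup, and the user is a plain dict rather than a real user object.

```python
class ToyResultProcessor:
    """Hypothetical processor: drops results the user cannot see."""

    def __init__(self, result, visible_blocks):
        self.result = result
        self._visible = visible_blocks

    def should_remove(self, user):
        if user.get("is_course_staff"):
            return False  # staff short-circuit the access check
        return self.result["id"] not in self._visible


def fetch_visible_blocks(user, course_id):
    """Stand-in for the single, expensive per-course access lookup."""
    fetch_visible_blocks.calls += 1
    return {"b1", "b3"}


fetch_visible_blocks.calls = 0


def process_results(results, user, course_id):
    # One lookup per course no matter how many results matched,
    # but always for the full set of blocks, even for a single match.
    if user.get("is_course_staff"):
        visible = set()
    else:
        visible = fetch_visible_blocks(user, course_id)
    return [
        r for r in results
        if not ToyResultProcessor(r, visible).should_remove(user)
    ]


results = [{"id": "b1"}, {"id": "b2"}, {"id": "b3"}]
learner_view = process_results(results, {"is_course_staff": False}, "course-1")
staff_view = process_results(results, {"is_course_staff": True}, "course-1")
```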

The Search Initializer works at a different level. It is not attached to the API functions but is attached to the do_search course search view. It “sets up the environment” which doesn’t help us understand it much.

As expected there is one SearchInitializer in edx and openedx:

Extension points are attached in the LMS’s common.py settings file.
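Attaching an extension point through settings generally means storing a dotted path and importing it at call time. The sketch below shows that mechanism in isolation; the `myapp.search.*` paths are placeholders, and `load_class` and `resolve_extension` are hypothetical helpers, not edx-search functions.

```python
from importlib import import_module

# Placeholder dotted paths; a real deployment would point at its own classes.
EXAMPLE_SETTINGS = {
    "SEARCH_RESULT_PROCESSOR": "myapp.search.MyResultProcessor",
    "SEARCH_INITIALIZER": "myapp.search.MyInitializer",
    "SEARCH_FILTER_GENERATOR": "myapp.search.MyFilterGenerator",
}


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), class_name)


def resolve_extension(setting_name, default, settings):
    """Return the configured class, or the default when the setting is absent."""
    path = settings.get(setting_name)
    return load_class(path) if path else default
```

Because resolution happens by name at runtime, nothing ties the configured class to the index it is meant to work with.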

In practice, because the extension points are only set up for course search, they have split course search across multiple codebases without enabling any actual search customization. The extension points are also far away from the indexer, which actually defines what is available to search and what the results look like; see the next section.

Indexing and the Other Search APIs

As mentioned above the indexing call is extremely abstract, so callers have to understand all of the internals:

def index(self, sources, **kwargs):
    """ Add documents to the search index. """

edx practice is to create an indexer object which obtains a concrete search engine from edx-search and properly builds documents for that call. Most are children of SearchIndexerBase that deal with course content and are contained in that same file:

  • CoursewareSearchIndexer

  • LibrarySearchIndexer

  • CourseAboutSearchIndexer

But there is another SearchIndexerBase in the library code, with its own children:

  • ContentLibraryIndexer

  • LibraryBlockIndexer

There is also one independent indexer (at least one that is named Indexer; hopefully there are no others with secret names):

  • CourseTeamIndexer

Each indexer defines an index name:

INDEX_NAME = "content_library_index"

The index name is a magic key shared across codebases used to create the search engine for indexing and then also to search on it. As you can see in this example one side allows that to change via setting, the other is hardcoded. Because both sides also need to completely understand the document structure stored in that index, one extra magic string is not much of a lift. For some uses there is only one side and thus no magic, see the table below.
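A small sketch of how that mismatch plays out. The function names are hypothetical; the setting name and default index name are the ones listed in the tables on this page.

```python
INDEXER_INDEX_NAME = "courseware_content"  # hardcoded on the indexing side


def searcher_index_name(settings):
    # The searching side allows an override via a setting.
    return settings.get("COURSEWARE_CONTENT_INDEX_NAME", "courseware_content")


def indices_agree(settings):
    """True only while both sides still point at the same index."""
    return searcher_index_name(settings) == INDEXER_INDEX_NAME
```

With default settings the two sides agree; as soon as someone overrides the setting, the searcher looks at an index the indexer never writes to.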

All also define some sort of key that tells the indexer whether it is on:

ENABLE_SEARCH_KEY = "ENABLE_TEAMS"

ENABLE_INDEXING_KEY = 'ENABLE_COURSEWARE_INDEX'

For the courseware index example it matters whether this is on in the CMS settings, not the LMS settings, since indexing is tied into signals that happen during course publish.
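Putting the pieces of this section together, an indexer in this style looks roughly like the sketch below. `ToyEngine` and `ToyTeamIndexer` are hypothetical, the document fields are invented, and the feature flags are passed in as a plain dict rather than read from Django settings.

```python
class ToyEngine:
    """Stand-in engine keyed by index name (hypothetical)."""

    _indices = {}

    def __init__(self, index_name):
        # Engines created with the same name share the same document list.
        self.docs = self._indices.setdefault(index_name, [])

    def index(self, sources, **kwargs):
        self.docs.extend(sources)


class ToyTeamIndexer:
    INDEX_NAME = "course_team_index"
    ENABLE_SEARCH_KEY = "ENABLE_TEAMS"

    def __init__(self, features):
        self.features = features

    def enabled(self):
        return bool(self.features.get(self.ENABLE_SEARCH_KEY, False))

    def build_document(self, team):
        # Only this indexer and its searcher know this document shape.
        return {
            "id": team["team_id"],
            "name": team["name"],
            "course_id": team["course_id"],
        }

    def index(self, teams):
        """Index the given teams; return how many documents were written."""
        if not self.enabled():
            return 0
        docs = [self.build_document(team) for team in teams]
        ToyEngine(self.INDEX_NAME).index(docs)
        return len(docs)
```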

edx-search Uses by Index

| index name | what is it? | indexer | searcher |
| --- | --- | --- | --- |
| courseware_content | course content | CoursewareSearchIndexer in platform/cms | /search and perform_search in edx-search |
| library_index | library content, similar to courseware_content | LibrarySearchIndexer in platform/cms | LibraryToolsService in platform/xmodule |
| course_info | course about info, maybe this object | CourseAboutSearchIndexer in platform/cms | course_discovery_search and its endpoint in edx-search |
| content_library_index | library metadata | ContentLibraryIndexer in platform/core/content libraries app | the same indexer, via its parent class get function which is exposed via an API |
| content_library_block_index | library content | LibraryBlockIndexer in platform/core/content libraries app | the same indexer, via its parent class get function which is exposed via an API |
| course_team_index | team info but not members | CourseTeamIndexer in platform/lms teams app | Team list view in the same app |

Course Content for Indices

The indexers which run over course content, CoursewareSearchIndexer and LibrarySearchIndexer, get the bulk of that content by calling index_dictionary provided by the xblock IndexInfoMixin. This allows xblocks to be completely flexible in their indexed content, at the price of scattering information about what that content is across all xblocks.
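The flexibility and the scattering both show up in a sketch like the following. The classes are toy stand-ins for the mixin and for xblocks, and the returned dictionary layout is illustrative only.

```python
class IndexInfoToyMixin:
    """Hypothetical mixin: by default a block contributes nothing to the index."""

    def index_dictionary(self):
        return {}


class ToyHtmlBlock(IndexInfoToyMixin):
    def __init__(self, display_name, html_text):
        self.display_name = display_name
        self.html_text = html_text

    def index_dictionary(self):
        # Each block type decides for itself what is searchable.
        return {
            "content": {
                "display_name": self.display_name,
                "html_content": self.html_text,
            },
            "content_type": "Text",
        }


class ToyOpaqueBlock(IndexInfoToyMixin):
    """Does not override index_dictionary, so it exposes nothing to search."""


def collect_documents(blocks):
    docs = []
    for block in blocks:
        info = block.index_dictionary()
        if info:  # skip blocks that expose nothing
            docs.append(info)
    return docs
```

To learn what fields exist in the index, you would have to read every block class that overrides the method.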

Best Indexer/Searcher Practices

The content_library_index and content_library_block_index are much newer than the original content indices. They keep their indexing and search close together to prevent drift. Aside from still calling the result an “indexer”, these are much cleaner search implementations. Despite that, they are still tied to ElasticSearch; a different engine would break them completely.

These two indexers also provide explicit and versioned schemas, which is a very nice practice.
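As an illustration of what an explicit, versioned schema buys you, here is a minimal sketch. The field names, the version number, and the helper functions are all assumptions made for this example and do not reproduce the actual library indexer schema.

```python
SCHEMA_VERSION = 1  # hypothetical; bump whenever the document shape changes

SCHEMA = {
    "schema_version": int,
    "id": str,
    "title": str,
    "description": str,
}


def validate(doc):
    """Reject documents whose fields or types differ from the schema."""
    if set(doc) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(set(doc) ^ set(SCHEMA))}")
    for field, expected in SCHEMA.items():
        if not isinstance(doc[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")


def build_document(library):
    doc = {
        "schema_version": SCHEMA_VERSION,
        "id": library["key"],
        "title": library["title"],
        "description": library.get("description", ""),
    }
    validate(doc)
    return doc


def is_current(doc):
    """A searcher can skip or reindex documents written under an old schema."""
    return doc.get("schema_version") == SCHEMA_VERSION
```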

Doing their own searching means that these indexes can’t use the extension points, but it also means they don’t have to because the filters and results are customized to their particular use case.

Two problems with calling this format a best practice:

  1. This search is not enabled by default or on edx.org, so unless it has other users it could be completely broken and we wouldn’t know.

  2. The distributed nature of xblock content makes a unified schema extremely difficult, so this might not be applicable for course content search.

So Many Settings

There are minor settings for specific search tweaks I’m going to skip to keep these tables compact.

Most of these settings have different values for test which are not listed here.

edx.org specific settings changes maintained by 2U are tracked in the companion “State of edx-search 2023” page in the 2U wiki.

edx-search settings

| Setting | What is it? | Set to |
| --- | --- | --- |
| SEARCH_ENGINE | which engine class to use | search.elastic.ElasticSearchEngine via cms & lms production.py |
| COURSEWARE_CONTENT_INDEX_NAME, COURSEWARE_INFO_INDEX_NAME | index names | left at code default, could not be changed since the indexer does not know the setting |
| ELASTIC_SEARCH_*, ELASTIC_FIELD_MAPPINGS | various elasticsearch specific settings | generally set to use version 7 values |
| SEARCH_RESULT_PROCESSOR, SEARCH_INITIALIZER, SEARCH_FILTER_GENERATOR | extension points, see above | set to courseware search, LMS specific values in common.py |

Special note for SEARCH_ENGINE

This setting is in lms and cms production.py, but it is only actually set if a selection of the other search settings are True. THAT SELECTION IS DIFFERENT FOR LMS AND CMS.
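The conditional pattern can be sketched as below. The two flag selections are illustrative assumptions only (the real lists live in each production.py), and treating the condition as "any flag is truthy" is also an assumption; only the engine path comes from the table above.

```python
ENGINE = "search.elastic.ElasticSearchEngine"

# Illustrative flag selections only, NOT the real lists from production.py.
CMS_GATES = ("ENABLE_COURSEWARE_INDEX", "ENABLE_LIBRARY_INDEX")
LMS_GATES = ("ENABLE_COURSEWARE_SEARCH", "ENABLE_TEAMS")


def resolve_search_engine(features, gates):
    """Return the engine path only if at least one gating flag is truthy."""
    if any(features.get(flag) for flag in gates):
        return ENGINE
    return None


features = {"ENABLE_COURSEWARE_INDEX": True, "ENABLE_TEAMS": False}
cms_engine = resolve_search_engine(features, CMS_GATES)
lms_engine = resolve_search_engine(features, LMS_GATES)
```

With the same feature flags, one service ends up with an engine and the other does not, which is the trap the note above is warning about.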

edx-platform settings

| Setting | What is it? | Set to |
| --- | --- | --- |
| ENABLE_COURSEWARE_INDEX | index course content at all, governs courseware_content and course_info indices | false in common.py, true in devstack.py |
| ENABLE_LIBRARY_INDEX | same for libraries | false in common.py |
| ENABLE_CONTENT_LIBRARY_INDEX | governs both content_library_index and content_library_block_index | false in common.py |
| ENABLE_TEAMS | index teams content, this is the general teams feature setting | true in common.py |
| ENABLE_COURSEWARE_SEARCH | expose courseware search in old UI, enable course discovery search in course_api | false in common.py |
| ENABLE_COURSEWARE_SEARCH_FOR_COURSE_STAFF | enable search for course staff in old UI | false in common.py |