State of edx-search 2023
edx-search Codebase
edx-search is an app that is installed into edx-platform, specifically the LMS. It is mostly a wrapper around a search engine providing an abstracted interface, but breaks that abstraction in several places. All users also break the abstraction by requiring the engine be ElasticSearch compatible.
edx-search is not installed into any other codebase in the openedx or edx GitHub orgs (at least none that I have permission to search).
The engine abstraction provides a call for indexing, a call for searching, and a few other helpers. These are not course specific, and details are left to the implementing engine. The only implementation is ElasticSearchEngine. Because the indexing operation is completely abstract, taking only unspecified “sources”, anyone using this code must know what engine they are using to provide indexable sources and to search them later.
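To make the shape of that abstraction concrete, here is a minimal sketch, not the edx-search code itself. `SearchEngineSketch` and `InMemoryEngine` are hypothetical stand-ins, and the document and result shapes are assumptions chosen for this toy engine. The point is that the caller has to know what the engine expects in `sources` and what it hands back from `search`.

```python
from abc import ABC, abstractmethod


class SearchEngineSketch(ABC):
    """Hypothetical stand-in for an abstract engine bound to one index."""

    def __init__(self, index_name):
        self.index_name = index_name

    @abstractmethod
    def index(self, sources, **kwargs):
        """Add documents to the index; the shape of `sources` is engine-defined."""

    @abstractmethod
    def search(self, query_string=None, field_dictionary=None, **kwargs):
        """Return matches in an engine-defined result structure."""


class InMemoryEngine(SearchEngineSketch):
    """Toy engine: each source must be a dict with 'id' and 'content' keys."""

    def __init__(self, index_name):
        super().__init__(index_name)
        self._docs = {}

    def index(self, sources, **kwargs):
        for source in sources:
            self._docs[source["id"]] = source

    def search(self, query_string=None, field_dictionary=None, **kwargs):
        field_dictionary = field_dictionary or {}
        hits = [
            doc for doc in self._docs.values()
            if (query_string is None or query_string.lower() in doc["content"].lower())
            and all(doc.get(key) == value for key, value in field_dictionary.items())
        ]
        return {"total": len(hits), "results": [{"data": doc} for doc in hits]}


engine = InMemoryEngine("demo_index")
engine.index([
    {"id": "a", "content": "Intro to Python", "course": "c1"},
    {"id": "b", "content": "Advanced Python", "course": "c2"},
])
found = engine.search("python", field_dictionary={"course": "c2"})
```

Nothing in the abstract class tells a caller that sources need `id` and `content`; that knowledge lives only in the concrete engine, which is the abstraction leak described above.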
Search API and URLs
edx-search also provides a search API as a set of URLs, which is installed into the LMS. That API has an all-course search function and a course-limited search function, which channel to the same function under the hood. This API is only for course search; the course index is baked in. Users who want to use edx-search for other information build their own APIs; see the index table below. The API also has a course_discovery_search, which is more obviously not generic.
Both the course search and the more specific discovery search are exposed as Python methods. The course search, perform_search, is only used in tests.
The course_api uses course_discovery_search directly via that API, but that use is gated on ENABLE_COURSEWARE_SEARCH which is on in devstack settings but otherwise off for the codebase and edx.org. It could of course be enabled by other openedx users.
There are also direct uses of the engine.search function. See the index table below; some of the classes which work this way are probably the best models for future work because they colocate search and indexing.
Extension Points
edx-search provides two extension points that apply to the API via URL or direct function call and one that applies to the URL endpoint only.
The search filter generator lets you specify which fields to search, which filters to apply, and which fields to exclude. The filters, however, are not hooked up for the special course discovery endpoint.
There is one SearchFilterGenerator in the entire openedx and edx github repositories:
LMSSearchFilterGenerator is built for cross course search and tries to limit course access by checking enrollments.
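A rough sketch of the filter generator idea follows. The class names, the `generate` method, and the toy enrollment lookup are all hypothetical and are not the edx-search or LMS implementations. It only shows the pattern: a base generator returns three dictionaries, and a subclass narrows cross-course search to the user's enrollments.

```python
class FilterGeneratorSketch:
    """Hypothetical base generator: no restrictions beyond an explicit course."""

    def field_dictionary(self, **kwargs):
        # Exact-match restrictions; a single course if one was requested.
        course_id = kwargs.get("course_id")
        return {"course": course_id} if course_id else {}

    def filter_dictionary(self, **kwargs):
        return {}

    def exclude_dictionary(self, **kwargs):
        return {}

    def generate(self, **kwargs):
        """Return (fields, filters, excludes) for one search request."""
        return (
            self.field_dictionary(**kwargs),
            self.filter_dictionary(**kwargs),
            self.exclude_dictionary(**kwargs),
        )


class EnrollmentFilterGeneratorSketch(FilterGeneratorSketch):
    """Limits cross-course search to the courses a user is enrolled in."""

    def __init__(self, enrollments):
        # Toy data source: user name -> list of course ids.
        self._enrollments = enrollments

    def field_dictionary(self, **kwargs):
        fields = super().field_dictionary(**kwargs)
        if "course" not in fields:
            # No explicit course: restrict to everything the user is enrolled in.
            enrolled = self._enrollments.get(kwargs.get("user"), [])
            fields["course"] = sorted(enrolled)
        return fields


generator = EnrollmentFilterGeneratorSketch({"alice": ["c2", "c1"]})
fields, filters, excludes = generator.generate(user="alice")
```

In this sketch an explicit `course_id` is passed straight through; whether the user may see that course is a separate access check and is out of scope here.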
The Search Result Processor is well named and lets you mangle search results all you want afterwards. Importantly the Search Result Processor expects to be passed a user.
There is one SearchResultProcessor in the entire openedx and edx github repositories:
LMSSearchResultsProcessor: since it has the user, it restricts access to blocks the user can see, and provides proper links. Not currently in use, so this may not work. For course staff users, it short circuits access checks which should help performance. For other users, it is based on a very standard get_course_blocks call, which could well be the only way to reliably guarantee that the restrictions on search match the restrictions on content in general. No matter how many matches are returned from a course only one such call will be made, but no matter how few matches are returned the call will always get all available blocks, so there might be room for optimization in the low match case.
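The access pattern described above can be sketched as follows. This is an illustration only: `filter_results` and `fetch_visible_blocks` are hypothetical names, and the real processor works on richer result objects. The sketch shows the staff short circuit and the one-lookup-per-course behavior, including the cost of fetching every visible block even for a single match.

```python
def filter_results(results, user, is_staff, fetch_visible_blocks):
    """Keep only the hits the user may see.

    results: list of dicts with "course" and "id" keys (assumed shape).
    fetch_visible_blocks(user, course) -> set of every block id the user
    can see in that course; assumed to be expensive.
    """
    if is_staff:
        # Staff short circuit: no access lookups at all.
        return list(results)

    visible_by_course = {}
    kept = []
    for hit in results:
        course = hit["course"]
        if course not in visible_by_course:
            # One lookup per course, no matter how many hits it has.
            visible_by_course[course] = fetch_visible_blocks(user, course)
        if hit["id"] in visible_by_course[course]:
            kept.append(hit)
    return kept


lookups = []


def fake_fetch(user, course):
    """Toy lookup that records each call so the call count can be inspected."""
    lookups.append(course)
    return {"b1", "b2"} if course == "c1" else set()


hits = [
    {"course": "c1", "id": "b1"},
    {"course": "c1", "id": "b3"},
    {"course": "c1", "id": "b2"},
    {"course": "c2", "id": "x"},
]
kept = filter_results(hits, "alice", False, fake_fetch)
```

An optimization for the low match case would replace the per-course lookup with a per-block check when a course has only a handful of hits, at the risk of drifting from the standard access logic.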
The Search Initializer works at a different level. It is not attached to the API functions but is attached to the do_search course search view. It “sets up the environment” which doesn’t help us understand it much.
As expected there is one SearchInitializer in edx and openedx:
LMSSearchInitializer seems entirely concerned with properly setting masquerading.
Extension points are attached in the LMS’s common.py settings file.
In practice because the extension points are only set up for course search, they have separated course search into multiple code bases without serving for actual search customization. The extension points are also far away from the indexer which actually defines what is available to search and what the results look like, see the next section.
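For readers unfamiliar with how a settings entry becomes an extension point, here is a small sketch of resolving a dotted-path setting into a class with a fallback default. `SETTINGS`, `load_class`, and the use of `collections.OrderedDict` as the configured class are stand-ins for illustration; this is not the edx-search loader, and the real settings point at LMS classes.

```python
from importlib import import_module

# Hypothetical settings; a real deployment would name an LMS class here.
SETTINGS = {"SEARCH_RESULT_PROCESSOR": "collections.OrderedDict"}


def load_class(setting_name, default):
    """Return the class named by a dotted-path setting, or the default."""
    path = SETTINGS.get(setting_name)
    if not path:
        return default
    module_path, _, class_name = path.rpartition(".")
    return getattr(import_module(module_path), class_name)


processor_class = load_class("SEARCH_RESULT_PROCESSOR", dict)
initializer_class = load_class("SEARCH_INITIALIZER", dict)  # unset, so default
```

Because the class is chosen once per setting, there is a single processor, generator, and initializer for the whole installation, which is part of why these hooks only serve course search.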
Indexing and the Other Search APIs
As mentioned above the indexing call is extremely abstract, so callers have to understand all of the internals:
def index(self, sources, **kwargs):
    """
    Add documents to the search index.
    """
edx practice is to create an indexer object which obtains a concrete search engine from edx-search and properly builds documents for that call. Most are children of SearchIndexerBase that deal with course content and are contained in that same file:
CoursewareSearchIndexer
LibrarySearchIndexer
CourseAboutSearchIndexer
But there is another SearchIndexerBase in the library code, with its own children:
ContentLibraryIndexer
LibraryBlockIndexer
There is also one independent indexer (at least one that is named Indexer; hopefully there are no others with secret names):
CourseTeamIndexer
Each indexer defines an index name:
INDEX_NAME = "content_library_index"
The index name is a magic key shared across codebases used to create the search engine for indexing and then also to search on it. As you can see in this example one side allows that to change via setting, the other is hardcoded. Because both sides also need to completely understand the document structure stored in that index, one extra magic string is not much of a lift. For some uses there is only one side and thus no magic, see the table below.
All also define some sort of key that tells the indexer whether it is on:
ENABLE_SEARCH_KEY = "ENABLE_TEAMS"
ENABLE_INDEXING_KEY = 'ENABLE_COURSEWARE_INDEX'
For the courseware index example it matters whether this is on in the CMS settings, not the LMS settings, since indexing is tied into signals that happen during course publish.
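Putting the pieces of this section together, an indexer roughly looks like the sketch below. Everything here is hypothetical: `FEATURES`, `get_engine`, `DemoIndexer`, and the document shape are toy stand-ins, not platform code. It shows the three moving parts: the index name as a shared key, the feature flag that gates indexing, and the indexer building engine-specific documents.

```python
FEATURES = {"ENABLE_COURSEWARE_INDEX": True}  # hypothetical feature flags
_ENGINES = {}  # toy registry: index name -> list of stored documents


def get_engine(index_name):
    """Toy replacement for asking edx-search for an engine bound to an index."""
    return _ENGINES.setdefault(index_name, [])


class DemoIndexer:
    INDEX_NAME = "demo_index"  # the magic key the searcher must also know
    ENABLE_INDEXING_KEY = "ENABLE_COURSEWARE_INDEX"

    @classmethod
    def indexing_is_enabled(cls):
        return FEATURES.get(cls.ENABLE_INDEXING_KEY, False)

    @classmethod
    def index(cls, items):
        """Build documents from items and store them; return how many were stored."""
        if not cls.indexing_is_enabled():
            return 0
        documents = [
            {"id": item["id"], "content": {"display_name": item["title"]}}
            for item in items
        ]
        get_engine(cls.INDEX_NAME).extend(documents)
        return len(documents)


indexed = DemoIndexer.index([{"id": "1", "title": "Week 1"}])
# The searching side has to repeat the same name to find anything.
stored = get_engine("demo_index")
```

A searcher that asks for a slightly different index name gets an empty index rather than an error, which is what makes the shared string fragile.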
edx-search Uses by Index
index name | what is it? | indexer | searcher |
---|---|---|---|
courseware_content | course content | CoursewareSearchIndexer in platform/cms | /search and perform_search in edx-search |
library_index | library content, similar to courseware_content | LibrarySearchIndexer in platform/cms | LibraryToolsService in platform/xmodule |
course_info | course about info, maybe this object | CourseAboutSearchIndexer in platform/cms | course_discovery_search and its endpoint in edx-search |
content_library_index | library metadata | ContentLibraryIndexer in platform/core/content libraries app | the same indexer, via its parent class get function which is exposed via an API |
content_library_block_index | library content | LibraryBlockIndexer in platform/core/content libraries app | the same indexer, via its parent class get function which is exposed via an API |
course_team_index | team info but not members | CourseTeamIndexer in platform/lms teams app | Team list view in the same app |
Course Content for Indices
The indexers which run over course content, CoursewareSearchIndexer and LibrarySearchIndexer, get the bulk of that content by calling index_dictionary provided by the xblock IndexInfoMixin. This allows xblocks to be completely flexible in their indexed content, at the price of scattering information about what that content is across all xblocks.
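A minimal sketch of that arrangement, assuming a simple dictionary shape: `BaseBlock`, `VideoBlockSketch`, and `build_documents` are hypothetical, and only the method name index_dictionary comes from the text above. Each block type decides what it contributes, and the indexer just collects whatever comes back.

```python
class BaseBlock:
    """Hypothetical block whose default indexed content is its display name."""

    def __init__(self, block_id, display_name):
        self.block_id = block_id
        self.display_name = display_name

    def index_dictionary(self):
        if self.display_name:
            return {"content": {"display_name": self.display_name}}
        return {}


class VideoBlockSketch(BaseBlock):
    """Hypothetical block type that adds its own fields to the index."""

    def __init__(self, block_id, display_name, transcript):
        super().__init__(block_id, display_name)
        self.transcript = transcript

    def index_dictionary(self):
        data = super().index_dictionary()
        data.setdefault("content", {})["transcript"] = self.transcript
        data["content_type"] = "Video"
        return data


def build_documents(blocks):
    """Collect one document per block that has anything to index."""
    documents = []
    for block in blocks:
        data = block.index_dictionary()
        if data:
            documents.append({"id": block.block_id, **data})
    return documents


docs = build_documents([
    BaseBlock("b1", "Intro"),
    VideoBlockSketch("v1", "Lecture", "hello world"),
    BaseBlock("b2", ""),
])
```

The indexer never sees a schema; to learn that videos carry a `transcript` field you have to read the video block, which is the scattering cost mentioned above.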
Best Indexer/Searcher Practices
The content_library_index and content_library_block_index are much newer than the original content indices. They keep their indexing and search close together to prevent drift. Aside from still calling the result an “indexer”, these are much cleaner search implementations. Despite that, they are still tied to ElasticSearch; a different engine would break them completely.
These two indexers also provide explicit and versioned schemas, which is a very nice practice.
Doing their own searching means that these indexes can’t use the extension points, but it also means they don’t have to because the filters and results are customized to their particular use case.
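The colocated, versioned pattern can be sketched like this. `ColocatedIndexerSketch`, its method names, and the document fields are hypothetical and use an in-memory dictionary instead of a real engine. The idea is that one class owns both the document definition and the query, and stamps every document with a schema version so stale documents are ignored after a schema change.

```python
class ColocatedIndexerSketch:
    INDEX_NAME = "demo_library_index"
    SCHEMA_VERSION = 1  # bump whenever the document definition changes

    def __init__(self):
        self._docs = {}  # toy in-memory index keyed by document id

    def index_items(self, items):
        """Write side: the only place that defines the document shape."""
        for item in items:
            self._docs[item["id"]] = {
                "schema_version": self.SCHEMA_VERSION,
                "id": item["id"],
                "title": item["title"],
            }

    def get_items(self, text=None):
        """Read side: lives next to the write side, so the two cannot drift."""
        return [
            doc for doc in self._docs.values()
            if doc["schema_version"] == self.SCHEMA_VERSION
            and (text is None or text.lower() in doc["title"].lower())
        ]


indexer = ColocatedIndexerSketch()
indexer.index_items([{"id": "lib1", "title": "Physics Library"}])
# Simulate a document left over from an older schema.
indexer._docs["old"] = {"schema_version": 0, "id": "old", "title": "Physics Old"}
matches = indexer.get_items("physics")
```

Filtering on the version at read time means a reindex can happen lazily after a schema change, at the cost of temporarily missing results.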
Two problems with calling this format a best practice:
This search is not enabled by default or on edx.org, so unless it has other users it could be completely broken and we wouldn’t know.
The distributed nature of xblock content makes a unified schema extremely difficult, so this might not be applicable for course content search.
So Many Settings
There are minor settings for specific search tweaks I’m going to skip to keep these tables compact.
Most of these settings have different values for test which are not listed here.
edx.org specific settings changes maintained by 2U are tracked in the companion “State of edx-search 2023” page in the 2U wiki.
edx-search settings
Setting | What is it? | Set to |
---|---|---|
SEARCH_ENGINE | which engine class to use | search.elastic.ElasticSearchEngine via cms & lms production.py |
COURSEWARE_CONTENT_INDEX_NAME, COURSEWARE_INFO_INDEX_NAME | index names | left at code default, could not be changed since the indexer does not know the setting |
ELASTIC_SEARCH_*, ELASTIC_FIELD_MAPPINGS | various elasticsearch specific settings | generally set to use version 7 values |
SEARCH_RESULT_PROCESSOR, SEARCH_INITIALIZER, SEARCH_FILTER_GENERATOR | extension points, see above | set to courseware search, LMS specific values in common.py |
Special note for SEARCH_ENGINE
This setting is in lms and cms production.py, but it is only actually set if a selection of the other search settings are True. THAT SELECTION IS DIFFERENT FOR LMS AND CMS.
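The consequence of that conditional can be sketched as below. The flag tuples are placeholders picked from the settings named on this page; which flags each production.py actually checks is not stated here, so treat both tuples and the `resolve_search_engine` helper as assumptions. Only the engine path comes from the table above.

```python
# Placeholder selections; the real lists live in the lms and cms production.py.
LMS_TRIGGER_FLAGS = ("ENABLE_COURSEWARE_SEARCH", "ENABLE_TEAMS")
CMS_TRIGGER_FLAGS = ("ENABLE_COURSEWARE_INDEX", "ENABLE_LIBRARY_INDEX")


def resolve_search_engine(features, trigger_flags):
    """Return the engine path only if at least one trigger flag is on."""
    if any(features.get(flag) for flag in trigger_flags):
        return "search.elastic.ElasticSearchEngine"
    return None


features = {"ENABLE_TEAMS": True}
lms_engine = resolve_search_engine(features, LMS_TRIGGER_FLAGS)
cms_engine = resolve_search_engine(features, CMS_TRIGGER_FLAGS)
```

With identical feature flags the two services can end up with different answers, so a flag that turns search on in the LMS does not by itself guarantee that the CMS has an engine to index with.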
edx-platform settings
Setting | What is it? | Set to |
---|---|---|
ENABLE_COURSEWARE_INDEX | index course content at all, governs courseware_content and course_info indices | false in common.py, true in devstack.py |
ENABLE_LIBRARY_INDEX | same for libraries | false in common.py |
ENABLE_CONTENT_LIBRARY_INDEX | governs both content_library_index and content_library_block_index | false in common.py |
ENABLE_TEAMS | index teams content, this is the general teams feature setting | true in common.py |
ENABLE_COURSEWARE_SEARCH | expose courseware search in old UI, enable course discovery search in course_api | false in common.py |
ENABLE_COURSEWARE_SEARCH_FOR_COURSE_STAFF | enable search for course staff in old UI | false in common.py |