Do We Need Elasticsearch in Analytics Data API and Insights?

JIRA: MST-1047

Table of Contents

TL;DR:

  • We can get the same things we are using from Elasticsearch from MySQL. We do not make full use of of the Elasticsearch product. We likely get a performance enhancement from being able to perform quick searches, but I do not see a strong argument for continuing the use of Elasticsearch given its limited use by Insights and the analytics data API.

  1. Parameter validation can be performed by Python Django code.
  2. Elasticsearch queries can be replaced by MySQL queries.

How Do We Use Elasticsearch in Insights?

Key Concepts

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure.

(source)

High Level

Insights does not make heavy use of Elasticsearch. Insights relies on Django views in the analytics-data-api that are backed by Elasticsearch. These views are LearnerView and LearnerListView. This views back the Learner view in Insights. See an example in Insights here for the demo course.

Low Level

The edx-analytics-data-api/analytics_data_api/v0/documents.py file defines two classes, RosterUpdate and RosterEntry, that inherit from the Document class provided by the elasticsearch-dsl library. The Document class is a model-like wrapper around the Elasticsearch document. It allows us to define Elasticsearch mappings, which are associations between a field in a document and the field's "type".

RosterUpdate

RosterUpdate is a Document that stores when the index was last updated.

RosterEntry

RosterEntry is a Document that stores information about a learner with respect to a course, including fields like course_id, user_id, problems_attempted, problems_completed, etc. RosterEntry has two class methods that implement Elasticsearch search queries, get_course_user and get_users_in_course . These will be discussed below.

The LearnerView and the LearnerListView Django views use these Document classes, particularly the above class methods, when fetching data from Elasticsearch.

Nitty Gritty Details

get_course_user does a query for a given course_id and username.

get_users_in_course does a query for a given course_id and returns information about learners in the course. You can provide a few parameters to customize the search. These parameters include ones to filter, ones to select a sorting policy, and one to perform a text search against a set of fields.

There is validation done on the arguments to the method. This can be replicated outside of the elasticsearch DSL.

The following options to tailor the elasticsearch query are supported.

  • segments and ignore_segments

    • What is a “segment”?

      • A “segment” is an attribute of a learner describing the learner’s position in a course, namely around their enrollment and performance. Valid segments are "highly_engaged", "disengaging", "struggling", "inactive", "unenrolled". They are defined here.

    • segments is a comma separated list of segments is used to filter which documents to include.

    • ignore_segments is a comma separate list of segments to filter which documents to exclude.

      • For example, ignore_segments=inactive will remove inactive learners from the response.

    • You cannot include both parameters.

    • Arguments for either parameter are validated against the segments constant.

  • cohort

    • cohort is a parameter that is used to create an exact match query on the cohort field. It can be any string.

  • enrollment_mode

    • cohort is a parameter that is used to create an exact match query on the cohort field. It can be any string.

  • text_search:

    • text_search is a parameter that is used to get a multi_match query on the name, username, and email fields. It essentially does a match query on any of the three fields. It does not support substring match.

  • sort_policies

    • sort_policies is a dictionary containing the keys sort_policy and sort_order, where the sort_policy is a string representing the field to sort by and sort_order is asc or desc.

    • In elasticsearch, there is a concept of missing in sorting. This tells elasticsearch what to use for the sort value for documents that do not contain the field that is being sorted by. elasticsearch can include them at the beginning (_first) or the end (_last), or you can have a custom value that is used as the missing value of the field. I do not expect us to have this issue, as name, username, and email are fields we will reliably have.

    • You can only sort by a select few fields, defined here.

How can we write equivalent MySQL?

Let us assume that we have a MySQL table learner_activity with the following fields, which are fields on the existing RosterEntry document.

1 2 3 4 5 6 7 course_id, user_id, username, name, email, language, location, year_of_birth, level_of_education, gender, mailing_address, city, country, goals, enrollment_mode, cohort, segments, problems_attempted, problems_completed, problem_attempts_per_completed, attempt_ratio_order, discussion_contributions, enrollment_date, videos_viewed, last_updated

Here is sample elasticsearch DSL for the types of queries we make currently.

get_course_user

 

1 http://localhost:19001/api/v0/learners/edx?course_id=course-v1%3AedX%2BDemoX%2BDemo_Course

 

1 2 3 4 5 6 7 8 9 10 { 'query': { 'bool': { 'must': [ {'term': {'course_id':'course-v1:edX+DemoX+Demo_Course'}}, {'term': {'username': 'edx'}} ] } } }

This can be implemented with the following SQL.

1 2 3 4 SELECT * FROM user_activity WHERE course_id='course-v1:edX+DemoX+Demo_Course' AND username='abigail123';

get_users_in_course

 

1 http://localhost:19001/api/v0/learners/?course_id=course-v1%3AedX%2BDemoX%2BDemo_Course&segments=disengaging,struggling&cohort=test&enrollment_mode=verified&text_search=abigail
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 { 'query': { 'bool': { 'must': [ {'term': {'course_id': 'course-v1:edX+DemoX+Demo_Course'}}, {'bool': { 'should': [ {'term': {'segments': 'disengaging'}}, {'term': {'segments': 'struggling'}} ] }} {'term': {'cohort': 'test'}}, {'term': {'enrollment_mode': 'verified'}}, {'multi_match': { 'query': 'abigal', 'fields': ['name', 'username', 'email'] }} ] } }, 'sort': [{'username': {'order': 'asc', 'missing': '_last'}}] }

The only difference for ignore_segments is must is replaced with must_not .

This can be implemented with the following SQL.

1 2 3 4 5 6 7 8 SELECT * FROM user_activity WHERE course_id='course-v1:edX+DemoX+Demo_Course' AND segments IN ('disengaging', 'struggling') AND cohort='test' AND enrollment_mode='verified' AND 'abigail' IN (name, username, email) SORT BY username;

 

  1. I have not looked into how to parameterize the SQL or make it dynamically generated given a set of parameters. I’m assuming Django has this functionality even without Django models.

Odds and Ends

The elasticsearch RosterEntry document contains a field attempt_ratio_order. It’s used to make ordering by the problem_attempts_per_completed more correct. problem_attempts_per_completed can be infinite if no attempts were completed. My understanding is this is stored in the database as null. The comments say the following.

Useful for ordering problem_attempts_per_completed (because results can include null, which is different from zero). attempt_ratio_order is equal to the number of problem attempts if problem_attempts_per_completed is > 1 and set to -problem_attempts if problem_attempts_per_completed = 1.

That seems reasonable. It seems like this is just a necessity for sorting because of the division by zero issue.

But in the data pipeline, it says this.

Used to sort learners by problem_attempts_per_completed in a meaningful way. When using problem_attempts_per_completed as your primary sort key, you can secondary sort by attempt_ratio_order to see struggling and high performing users. At one extreme this identifies users who have gotten many problems correct with the fewest number of attempts, at the other extreme it highlights users who have gotten very few (if any) problems correct with a very high number of attempts. The two extremes identify the highest performing and lowest performing learners according to this metric. To see high performing learners sort by (problem_attempts_per_completed ASC, attempt_ratio_order DESC). To see struggling learners sort by (problem_attempts_per_completed DESC, attempt_ratio_order ASC).

This sounds like it has a benefit beyond what’s described above. I just wanted to call out that this wasn’t immediately clear to me.