JIRA: MST-1047

Table of Contents

TL;DR:

  • Everything we currently get from Elasticsearch we can also get from MySQL. We do not make full use of the Elasticsearch product. We likely get a performance benefit from being able to perform quick searches, but I do not see a strong argument for continuing to use Elasticsearch given its limited use by Insights and the analytics data API.

  • Parameter validation can be performed by Python Django code.
  • Elasticsearch queries can be replaced by MySQL queries.

How Do We Use Elasticsearch in Insights?

Key Concepts

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure.

(source)
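
To make the document and field vocabulary concrete, here is a toy Python illustration. The learner documents are hypothetical, and the dictionary-based lookup is only a simplified mental model of per-field indexing, not a description of how Elasticsearch is implemented internally.

```python
import json
from collections import defaultdict

# Hypothetical learner documents; field names and values are illustrative only.
documents = {
    "doc-1": {"course_id": "course-v1:demo", "username": "alice", "cohort": "A"},
    "doc-2": {"course_id": "course-v1:demo", "username": "bob", "cohort": "B"},
    "doc-3": {"course_id": "course-v1:other", "username": "carol", "cohort": "A"},
}

# A document is stored as serialized JSON rather than as a row of columns.
serialized = {doc_id: json.dumps(doc) for doc_id, doc in documents.items()}


def build_field_index(docs, field):
    """Map each value of `field` to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        if field in doc:
            index[doc[field]].add(doc_id)
    return index


# One lookup structure per field, so filtering on a field does not scan every document.
cohort_index = build_field_index(documents, "cohort")
course_index = build_field_index(documents, "course_id")

# Learners in cohort "A" of the demo course: intersect the two per-field lookups.
matches = cohort_index["A"] & course_index["course-v1:demo"]
print(sorted(matches))
```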

High Level

Insights does not make heavy use of Elasticsearch. It relies on two Django views in the analytics-data-api that are backed by Elasticsearch: LearnerView and LearnerListView. These views back the Learner view in Insights. See an example in Insights here for the demo course.

Low Level

The edx-analytics-data-api/analytics_data_api/v0/documents.py file defines two classes, RosterUpdate and RosterEntry, that inherit from the Document class provided by the elasticsearch-dsl library. The Document class is a model-like wrapper around the Elasticsearch document. It allows us to define Elasticsearch mappings, which are associations between a field in a document and the field's "type".

RosterUpdate

RosterUpdate is a Document that stores when the index was last updated.

RosterEntry

RosterEntry is a Document that stores information about a learner with respect to a course, including fields like course_id, user_id, problems_attempted, problems_completed, etc. RosterEntry has two class methods that implement Elasticsearch search queries, get_course_user and get_users_in_course. Both are discussed below.

The LearnerView and the LearnerListView Django views use these Document classes, particularly the above class methods, when fetching data from Elasticsearch.

Nitty Gritty Details

get_course_user does a query for a given course_id and username.

...

options
  • segments and ignore_segments

    • What is a “segment”?

      • A “segment” is an attribute of a learner describing the learner’s position in a course, namely around their enrollment and performance. Valid segments are "highly_engaged", "disengaging", "struggling", "inactive", "unenrolled". They are defined here.

    • segments is a comma-separated list of segments used to filter which documents to include.

    • ignore_segments is a comma-separated list of segments used to filter which documents to exclude.

      • For example, ignore_segments=inactive will remove inactive learners from the response.

    • You cannot include both parameters in the same request.

    • Arguments for either parameter are validated against the segments constant.

  • cohort

    • cohort is a parameter that is used to create an exact match query on the cohort field. It can be any string.

  • enrollment_mode

    • enrollment_mode is a parameter that is used to create an exact match query on the enrollment_mode field. It can be any string.

  • text_search

    • text_search is a parameter that is used to create a multi_match query on the name, username, and email fields. It essentially does a match query on any of the three fields. It does not support substring match.

  • sort_policies

    • sort_policies is a dictionary containing the keys sort_policy and sort_order, where the sort_policy is a string representing the field to sort by and sort_order is asc or desc.

    • In Elasticsearch, sorting has a concept of missing. This tells Elasticsearch what sort value to use for documents that do not contain the field being sorted by. Elasticsearch can place them at the beginning (_first) or the end (_last), or you can supply a custom value to use in place of the missing field. I do not expect us to have this issue, as name, username, and email are fields we will reliably have.

    • You can only sort by a select few fields, defined here.
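
The validation rules described in the options above could be enforced in plain Python before any query is built. The following is a minimal sketch, not the existing API code: the function name, the ParameterError class, and the SORTABLE_FIELDS set are hypothetical (the real list of sortable fields is defined in the API), while the segment names are the ones listed above.

```python
# Segment names as listed in this document.
VALID_SEGMENTS = {"highly_engaged", "disengaging", "struggling", "inactive", "unenrolled"}
# Assumption: placeholder for the real list of sortable fields defined in the API.
SORTABLE_FIELDS = {"username", "name", "email", "problems_attempted", "problems_completed"}
SORT_ORDERS = {"asc", "desc"}


class ParameterError(ValueError):
    """Raised when a query parameter fails validation (hypothetical helper)."""


def split_csv(value):
    """Split a comma-separated string into a list of trimmed, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def validate_learner_params(params):
    """Validate raw query parameters and return them in cleaned form."""
    segments = split_csv(params.get("segments"))
    ignore_segments = split_csv(params.get("ignore_segments"))

    # The two segment filters are mutually exclusive.
    if segments and ignore_segments:
        raise ParameterError("segments and ignore_segments cannot both be supplied")

    # Every requested segment must be one of the known segment names.
    unknown = [s for s in segments + ignore_segments if s not in VALID_SEGMENTS]
    if unknown:
        raise ParameterError(f"unknown segments: {', '.join(unknown)}")

    sort_policy = params.get("sort_policy")
    if sort_policy is not None and sort_policy not in SORTABLE_FIELDS:
        raise ParameterError(f"cannot sort by {sort_policy!r}")

    sort_order = params.get("sort_order")
    if sort_order is not None and sort_order not in SORT_ORDERS:
        raise ParameterError("sort_order must be 'asc' or 'desc'")

    return {
        "segments": segments,
        "ignore_segments": ignore_segments,
        "cohort": params.get("cohort"),  # any string is accepted
        "enrollment_mode": params.get("enrollment_mode"),
        "sort_policy": sort_policy,
        "sort_order": sort_order,
    }
```

Because the function only calls .get() on its argument, a Django view could likely pass request.GET to it directly.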

How can we write equivalent MySQL?

Let us assume that we have a MySQL table learner_activity with the following fields, which are fields on the existing RosterEntry document.

...

  • I have not looked into how to parameterize the SQL or generate it dynamically from a set of parameters. I’m assuming Django has this functionality even without Django models.
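
As a starting point for the parameterization question, here is a minimal sketch of building the query dynamically while keeping all values as bound parameters. It rests on several assumptions: the function name and the sortable-column allowlist are hypothetical, only the exact-match filters and sorting are covered (segments and text_search are left out), and sqlite3 from the standard library stands in for MySQL so that the example runs on its own.

```python
import sqlite3

# Assumption: columns that may be sorted on. Identifiers cannot be bound as
# parameters, so they are checked against an allowlist instead.
SORTABLE_COLUMNS = {"username", "problems_attempted", "problems_completed"}


def build_learner_query(course_id, cohort=None, enrollment_mode=None,
                        sort_policy="username", sort_order="asc", placeholder="%s"):
    """Return (sql, params) for a filtered, sorted learner_activity query."""
    if sort_policy not in SORTABLE_COLUMNS:
        raise ValueError(f"cannot sort by {sort_policy!r}")
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")

    # Values are never interpolated into the SQL; only placeholders are.
    clauses = [f"course_id = {placeholder}"]
    params = [course_id]
    if cohort is not None:
        clauses.append(f"cohort = {placeholder}")
        params.append(cohort)
    if enrollment_mode is not None:
        clauses.append(f"enrollment_mode = {placeholder}")
        params.append(enrollment_mode)

    where = " AND ".join(clauses)
    sql = (
        "SELECT username, cohort, enrollment_mode FROM learner_activity "
        f"WHERE {where} ORDER BY {sort_policy} {sort_order.upper()}"
    )
    return sql, params


# Demo against an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE learner_activity (course_id TEXT, username TEXT, cohort TEXT, "
    "enrollment_mode TEXT, problems_attempted INTEGER, problems_completed INTEGER)"
)
conn.executemany(
    "INSERT INTO learner_activity VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("course-v1:demo", "alice", "A", "verified", 10, 8),
        ("course-v1:demo", "bob", "B", "audit", 4, 1),
        ("course-v1:demo", "carol", "A", "audit", 7, 7),
        ("course-v1:other", "dave", "A", "audit", 2, 2),
    ],
)

# SQLite uses "?" placeholders; MySQL drivers and Django use "%s".
sql, params = build_learner_query("course-v1:demo", cohort="A",
                                  sort_order="desc", placeholder="?")
rows = conn.execute(sql, params).fetchall()
print(rows)
```

With Django, the same (sql, params) pair, built with the default "%s" placeholder, could be passed to django.db.connection.cursor().execute(sql, params) without defining any models.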

Odds and Ends

The Elasticsearch RosterEntry document contains a field attempt_ratio_order. It is used to make ordering by problem_attempts_per_completed more correct. problem_attempts_per_completed can be infinite if no attempts were completed. My understanding is that this is stored in the database as null. The comments say the following.

...
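
If the data moves to MySQL, one possible way to get a sensible ordering is to sort on whether the ratio is NULL before sorting on the ratio itself. The sketch below assumes, per the understanding stated above, that NULL stands for an infinite ratio; it is not necessarily what attempt_ratio_order encodes. The helper name and the sample rows are hypothetical, and sqlite3 again stands in for MySQL (the IS NULL expression in ORDER BY is valid in both).

```python
import sqlite3


def ratio_order_clause(sort_order="asc"):
    """Build an ORDER BY clause that treats a NULL ratio as infinitely large."""
    direction = "DESC" if sort_order == "desc" else "ASC"
    # (column IS NULL) evaluates to 1 for NULL rows and 0 otherwise, so sorting
    # on it first puts NULL rows last in ASC order and first in DESC order.
    return (
        f"ORDER BY (problem_attempts_per_completed IS NULL) {direction}, "
        f"problem_attempts_per_completed {direction}"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE learner_activity (username TEXT, problem_attempts_per_completed REAL)"
)
conn.executemany(
    "INSERT INTO learner_activity VALUES (?, ?)",
    [("alice", 1.25), ("bob", None), ("carol", 3.0)],  # bob completed nothing
)

query = "SELECT username FROM learner_activity "
ascending = [r[0] for r in conn.execute(query + ratio_order_clause("asc"))]
descending = [r[0] for r in conn.execute(query + ratio_order_clause("desc"))]
print(ascending)
print(descending)
```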