JIRA: MST-1047
We can get the same things we are using from Elasticsearch from MySQL. We do not make full use of of the Elasticsearch product. We likely get a performance enhancement from being able to perform quick searches, but I do not see a strong argument for continuing the use of Elasticsearch given its limited use by Insights and the analytics data API.
Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.
An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure.
(source)
Insights does not make heavy use of Elasticsearch. Insights relies on Django views in the analytics-data-api that are backed by Elasticsearch. These views are LearnerView
and LearnerListView
. This views back the Learner view in Insights. See an example in Insights here for the demo course.
The edx-analytics-data-api/analytics_data_api/v0/documents.py
file defines two classes, RosterUpdate and RosterEntry, that inherit from the Document class provided by the elasticsearch-dsl library. The Document class is a model-like wrapper around the Elasticsearch document. It allows us to define Elasticsearch mappings, which are associations between a field in a document and the field's "type".
RosterUpdate is a Document that stores when the index was last updated.
RosterEntry is a Document that stores information about a learner with respect to a course, including fields like course_id
, user_id
, problems_attempted
, problems_completed
, etc. RosterEntry
has two class methods that implement Elasticsearch search queries, get_course_user
and get_users_in_course
. These will be discussed below.
The LearnerView and the LearnerListView Django views use these Document classes, particularly the above class methods, when fetching data from Elasticsearch.
get_course_user
does a query for a given course_id
and username
.
get_users_in_course
does a query for a given course_id
and returns information about learners in the course. You can provide a few parameters to customize the search. These parameters include ones to filter, ones to select a sorting policy, and one to perform a text search against a set of fields.
There is validation done on the arguments to the method. This can be replicated outside of the elasticsearch DSL.
The following options to tailor the elasticsearch query are supported.
|
Let us assume that we have a MySQL table learner_activity
with the following fields, which are fields on the existing RosterEntry
document.
course_id, user_id, username, name, email, language, location, year_of_birth, level_of_education, gender, mailing_address, city, country, goals, enrollment_mode, cohort, segments, problems_attempted, problems_completed, problem_attempts_per_completed, attempt_ratio_order, discussion_contributions, enrollment_date, videos_viewed, last_updated |
Here is sample elasticsearch DSL for the types of queries we make currently.
get_course_user
This can be implemented with the following SQL.
|
get_users_in_course
The only difference for This can be implemented with the following SQL.
|
The elasticsearch RosterEntry document contains a field attempt_ratio_order
. It’s used to make ordering by the problem_attempts_per_completed
more correct. problem_attempts_per_completed
can be infinite if no attempts were completed. My understanding is this is stored in the database as null
. The comments say the following.
Useful for ordering problem_attempts_per_completed (because results can include null, which is different from zero). attempt_ratio_order is equal to the number of problem attempts if problem_attempts_per_completed is > 1 and set to -problem_attempts if problem_attempts_per_completed = 1.
That seems reasonable. It seems like this is just a necessity for sorting because of the division by zero issue.
But in the data pipeline, it says this.
Used to sort learners by problem_attempts_per_completed in a meaningful way. When using problem_attempts_per_completed as your primary sort key, you can secondary sort by attempt_ratio_order to see struggling and high performing users. At one extreme this identifies users who have gotten many problems correct with the fewest number of attempts, at the other extreme it highlights users who have gotten very few (if any) problems correct with a very high number of attempts. The two extremes identify the highest performing and lowest performing learners according to this metric. To see high performing learners sort by (problem_attempts_per_completed ASC, attempt_ratio_order DESC). To see struggling learners sort by (problem_attempts_per_completed DESC, attempt_ratio_order ASC).
This sounds like it has a benefit beyond what’s described above. I just wanted to call out that this wasn’t immediately clear to me.