Student Notes Architecture

Components:

ElasticSearch (ES) for storing notes, the same instance that powers forums
edx-notes additional django app in edx-platform LMS

implements new views: student notes and search results
responsible for loading of annotator JS and CSS assets for pages that need them
use decorator to make components annotatable

edx-notes-api a standalone service
- supports CRUD operations on notes, serving as an interface to ES
- accepts authorized LMS users, passing username to ES
- Django app that reuses parts of annotator-store and authenticates users with OpenID connect ID token
- in the future, having standalone Notes API service will let students build their own mashups, pull archives etc.

protocol details

unit id is used as URI
django username is userd as user identifier

interaction

edx-notes-api accepts HTTP requests accompanied valid Oauth2 token
LMS backend creates OpenID connect ID token and passes it to the frontend
user's browser sends requests to edx-notes-api with token in headers
edx-notes-api validates the token signature

Data Store

NOTE this section is presently in-progress and subject to rapid change - owner Jim Abramson (Deactivated)

We plan to use ElasticSearch 1.4 as both a primary data store and for search capabilities. Motivating this decision:

Interface Fit: CRUD, simple filtering (by owner), and fast text search are the three primary access cases for notes data, all of which ES handles well.
Operational Fit: ElasticSearch is already deployed in production at edx.org and installed as a dependency of existing OpenedX installations. It is both highly scalable and also very accessible at small / development scale. Masterless clustering model.
Simplicity: using ES as both document store and search index means a single external service interface/dependency for persistence, and does not require a replication process or admin tooling to maintain a correct synchronization between multiple stores.
Expediency: we are leveraging a third-party reference implementation for the notes backend, which is itself implemented against ElasticSearch as sole data store, thus very little customization is needed.

This selection carries some risk, on account of the following:

ES's primary use case is for search indexing with an external source-of-truth database; durability is not an original design priority.
In a somewhat well-publicized analysis on aphyr.com, ES was shown to be prone to losing (acknowledged) writes under certain network partitioning conditions.
The list of companies presently using ES as a primary data store, at scale, is not long.

Toward addressing these concerns, below are spelled out risks and planned mitigations in more detail.

Writes and Durability

ElasticSearch defaults to synchronous replication of writes across index replicas in a cluster, and supports a per-operation consistency specification which may either be "one", "quorum" (majority), or "all". Under synchronous replication, a write request will block until the required number of replicas have confirmed a successful write. This blocking timeout is configurable from 100ms to a minute or more. In a defensive configuration (sync writes with 'quorum' or 'all'), ES will time out instead of writing data to an impaired cluster, which removes the risk of losing writes under a single node failure, and provides a response mechanism by which to alert the user, during an outage/incident, that their note could not be saved.

ES reports cluster health based on the number of shard replicas online, and mapping this reporting to some monitoring system would be straightforward. Alerts should trigger intervention when a shard's health has exceeded a certain amount of time during some window (minutes) in a "yellow" state, or immediately upon entering a "red" state. Shards and quorum should be configured such that quorum is never possible while the shard has a "red" state of health.

Query Patterns

CRUD: single document (note). ES supports a simple RESTful semantic for key-value access to individual documents. Reads and writes can be batched in single requests. Partial updates are supported, including in batched operations. Document changes are atomic using compare-and-swap; individual changes in a batched operation may succeed or fail, which is indicated to the client in the response to such a request (IOW batch operations are NOT atomic). Optimistic concurrency control is supported to ensure that changes are not applied to stale versions of documents.
Filtered retrieval: Fetch all notes by user, user and course component, etc. This is a core capability of ES' search APIs. Searching across multiple physical indexes is supported.
Search: obviously, this is ES' primary use case, and will support very flexible and responsive searching capabilities, in combination with any filters needed (user, course, course component).
Future filtering/searching cases: TBD (need a little more input from product). Obvious possibilities are: 1. search by user across courses (cross-course notes management), 2. search across all users in a course, where some tag has been set indicating that a note has been made "public" its owner (sharing).

Data management

Backups / snapshot + restore: TBD
Schema / index change management: TBD - how we will do this with zero downtime
Exporting notes - TBD: whether this is a use case, needed for analytics?
- If so, we would probably solve by writing a streaming query using scrolling APIs to dump documents as JSON, as there is no native export facility. It should be easy to do this by course, and with any necessary filter applied (e.g. public/private notes, if that is ever added)

Index analysis and internationalization

TBD - more work/testing to do here. Anecdotally, one-size-fits-all analysis generally works well for basic text matching (Omar from Edraak has confirmed Forum searches work fine with Arabic/RTL, using an english analyzer). Standard ES analysis should be sufficient for MVP / v1.0 but a little more investigation is needed to speak to envisioned future cases / details.

ES cluster resiliency

The case studied extensively on aphyr.com, often cited as an argument against using ES as a primary store, describes 1. the ability to confuse an ES cluster into split-brain mode under a combination configuration options and network partition scenarios; 2. ES' failure to converge out of such a cluster condition without losing previously-acknowledged writes.

The mitigation strategy is as follows:

use a newer version (1.4) of elasticsearch in which the issues leading to the potential split-brain condition are resolved, and where the cluster can be configured not to accept writes under an unhealthy cluster state
ensure that we immediately detect an unhealthy cluster state and quickly recover from it (automatically whenever possible)
ensure that user feedback is given if an outage/incident is occurring so that the inability to save notes is communicated, avoiding the perception (and reality!) of lost data

TBD: detailed strategy.

Testing / Proving Strategy

Plan B

Planned Production Configuration

ES as a primary database, in the wild

Testing:

edx-notes, being part of LMS is going to be tested with regular Jenkins
edx-notes-api, being standalone and fresh, can be tested with Travis

Todo

edx-notes-API needs ElasticSearch 1.0+, while we do not have it deployed yet
- need to upgrade ElasticSearch in production

Documents

Design proposal https://docs.google.com/document/d/14YNFgfEJAEkeAf-mouK4MYcna38oxMbu7yeJsgxMiDs