Components: 

protocol details

interaction




Data Store

NOTE: this section is in progress and subject to rapid change. Owner: Jim Abramson

We plan to use ElasticSearch 1.4 both as a primary data store and for search. Motivating this decision:

This selection carries some risk, on account of the following:


To address these concerns, the risks and planned mitigations are spelled out in more detail below.

Writes and Durability

ElasticSearch defaults to synchronous replication of writes across index replicas in a cluster, and supports a per-operation consistency level of "one", "quorum" (majority), or "all". Under synchronous replication, a write request blocks until the required number of replicas have confirmed a successful write. The timeout on this blocking is configurable, from 100ms to a minute or more. In a defensive configuration (synchronous writes with "quorum" or "all"), ES times out rather than writing data to an impaired cluster. This removes the risk of losing writes under a single node failure, and it provides a response with which to alert the user, during an outage or incident, that their note could not be saved.
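As a minimal sketch of the defensive write described above, the following builds an index request URL that asks for synchronous replication, a "quorum" consistency level, and a short timeout, and then maps the HTTP response to a message for the user. The `consistency`, `replication`, and `timeout` query parameters are those of the ES 1.x index API; the host, index name, type name, 1-second timeout, and message wording are illustrative assumptions, not part of the plan.

```python
from urllib.parse import quote, urlencode

# Hypothetical default for illustration only; the real value is still to be decided.
DEFAULT_TIMEOUT = "1s"


def build_index_url(base_url, index, doc_type, doc_id,
                    consistency="quorum", timeout=DEFAULT_TIMEOUT):
    """Build the URL for a defensive (sync + quorum) index request."""
    if consistency not in ("one", "quorum", "all"):
        raise ValueError("consistency must be 'one', 'quorum', or 'all'")
    # Escape each path segment so unusual ids cannot alter the path.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    query = urlencode({
        "consistency": consistency,  # shard copies required before writing
        "replication": "sync",       # block until replicas have responded
        "timeout": timeout,          # how long to wait before giving up
    })
    return "{}/{}?{}".format(base_url.rstrip("/"), path, query)


def save_outcome(http_status):
    """Map an HTTP status to user feedback; any non-2xx counts as 'not saved'."""
    if 200 <= http_status < 300:
        return "saved"
    return "Your note could not be saved. Please try again shortly."
```

Treating every non-2xx response as "not saved" is a deliberately conservative choice here: the caller never reports success unless ES has acknowledged the write.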

ES reports cluster health based on the number of shard replicas online, and mapping this reporting to a monitoring system would be straightforward. Alerts should trigger intervention when a shard has spent more than a certain amount of time, within some window (minutes), in a "yellow" state, or immediately upon entering a "red" state. Shards and quorum should be configured such that a quorum is never possible while the shard's health is "red".
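The alerting rule above could be sketched roughly as follows. The function takes a series of health samples (for example, the `status` field polled from the cluster health API) and decides whether to alert. The function name, the sample format, and the 10-minute window and 2-minute yellow budget are assumptions chosen for illustration; the actual thresholds are not yet decided.

```python
def should_alert(samples, now, window_s=600.0, max_yellow_s=120.0):
    """Decide whether health samples warrant an alert.

    samples: list of (timestamp_seconds, status) tuples sorted by timestamp,
             where status is "green", "yellow", or "red". Each status is
             assumed to hold until the next sample (or until `now`).
    """
    if not samples:
        return False

    # "red" triggers immediately, regardless of how long it has lasted.
    if samples[-1][1] == "red":
        return True

    # Otherwise, add up the time spent in "yellow" inside the window.
    window_start = now - window_s
    yellow_s = 0.0
    for i, (ts, status) in enumerate(samples):
        next_ts = samples[i + 1][0] if i + 1 < len(samples) else now
        start = max(ts, window_start)
        end = min(next_ts, now)
        if status == "yellow" and end > start:
            yellow_s += end - start
    return yellow_s > max_yellow_s
```

Keeping the decision in a pure function like this makes the thresholds easy to test independently of whichever monitoring system ends up polling ES.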

Query Patterns

Data management

Index analysis and internationalization

ES cluster resiliency

The case studied extensively on aphyr.com, often cited as an argument against using ES as a primary store, describes (1) the ability to confuse an ES cluster into split-brain mode under a combination of configuration options and network partition scenarios; and (2) ES's failure to converge out of such a cluster condition without losing previously acknowledged writes.

The mitigation strategy is as follows:

  1. use a newer version (1.4) of ElasticSearch in which the issues leading to the potential split-brain condition are resolved, and where the cluster can be configured not to accept writes under an unhealthy cluster state
  2. ensure that we immediately detect an unhealthy cluster state and quickly recover from it (automatically whenever possible)
  3. ensure that user feedback is given if an outage/incident is occurring so that the inability to save notes is communicated, avoiding the perception (and reality!) of lost data
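Pending the detailed strategy, one plausible reading of the first mitigation is the pair of zen discovery settings sketched below: `discovery.zen.minimum_master_nodes` set to a majority of master-eligible nodes, and `discovery.zen.no_master_block` controlling which operations are rejected while no master is elected. That these are the settings intended here is an assumption, as are the helper names and the node count in the usage note.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def discovery_settings(master_eligible_nodes, block_all_operations=False):
    """Render the two discovery settings as a flat dict.

    block_all_operations=False rejects only writes while there is no master;
    True rejects reads as well.
    """
    return {
        "discovery.zen.minimum_master_nodes":
            minimum_master_nodes(master_eligible_nodes),
        "discovery.zen.no_master_block":
            "all" if block_all_operations else "write",
    }
```

For a hypothetical cluster with three master-eligible nodes, `discovery_settings(3)` yields a minimum of 2 master nodes, so a single partitioned node cannot elect itself master and accept writes on its own.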

TBD: detailed strategy.

Testing / Proving Strategy

Plan B

Planned Production Configuration

 

ES as a primary database, in the wild


Testing:

Todo

 

Documents