Architecture Challenges (2017-2018)

Purpose

This is a live page to keep track of sources of drag faced by an edX engineer or team due to architectural or design issues in our system.

Goals

  1. Surface issues that we face when doing rapid experiments or iterative development and see whether they are challenges for other teams as well.
  2. See if any patterns emerge that call for integration points or other architectural features.
  3. Use anecdotal data on this page to prioritize initiatives on architectural runways.

Challenges

Top-level Categories (as of 10/15/18)


No. | Pain Category | # incidents | # votes in meeting | Status | Notes
1 | Monolith (mindset): Pluggability, Extensibility, Interdependency/coupling, Deprecation/removal | 14 | 2 | NOPE |
2 | Modernizing FED | 9 | 2 | PROGRESS | Arch-FED efforts: Paragon ownership and updates; Micro-frontend runways; Micro-frontend re-platforming
3 | Coupled services: Data synchronization and duplication | 5 | 2 | PARTIAL | Publisher efforts to fix data synch issues
4 | Environments / Testing - End-to-end tests, Stage environments, Sandboxes, etc. | 5 | - | NOPE | DevOps efforts: Kubernetes, Docker, and Config
5 | Configuration/Toggles: OEP-17 - Feature Toggles | 2 | - | PROGRESS | Arch-BOM & Tools efforts: Toggle OEP; Toggle reporting; Toggle annotations
6 | A/B Testing | 1 | 1 | |
7 | Release pipeline evolvability | 1 | 1 | |
8 | Authorization: product access, courseware, etc. (aspect-oriented design) | 1 | 1 | | Enterprise efforts: edx-rbac library
9 | Clear Best Practices - Debugging, API versioning | 1 | - | |

Pain Incidents

Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status

Monolith / New Features

14 (sad)



Requires updating many Django files in the existing IDA.
  • Adaptive Learning features from OpenCraft needed to update settings, urls, and installed_apps. (Nov 2017)
  • Enterprise app
  • Completion app (Jan 2018)

(sad)(sad)(sad)


(tick) Runway created for Django App Plugins.
Inline Discussions: A lot of our codebase depends on forums, so forum performance issues can cascade into the entire system.
(sad)
Needs to receive Django signals from apps in the existing IDA.
  • Completion app, in separate repo, wants to receive signals from Grades app. (Jan 2018)
(sad)
(tick) Runway created for Django Plugin Settings.
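
The settings/urls/installed_apps and signal issues above are the kind of thing a plugin mechanism addresses. What follows is a minimal sketch, in plain Python rather than Django, of a feature package declaring its own settings, URL prefix, and signal receivers for the host to collect. Every name here (`PluginConfig`, `HostApp`, `grade_changed`, the `completion` example values) is hypothetical and is not the edx-platform plugin API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PluginConfig:
    """What a feature package declares about itself (hypothetical shape)."""
    name: str
    url_prefix: str = ""
    settings: Dict[str, object] = field(default_factory=dict)
    # Maps a host signal name to the plugin's receiver function.
    receivers: Dict[str, Callable[..., None]] = field(default_factory=dict)


class HostApp:
    """Stand-in for the existing IDA; it never imports plugin code directly."""

    def __init__(self) -> None:
        self.installed_apps: List[str] = []
        self.urls: Dict[str, str] = {}
        self.settings: Dict[str, object] = {}
        self._receivers: Dict[str, List[Callable[..., None]]] = {}

    def register(self, plugin: PluginConfig) -> None:
        # One call replaces hand edits to settings, urls, and installed apps.
        self.installed_apps.append(plugin.name)
        if plugin.url_prefix:
            self.urls[plugin.url_prefix] = plugin.name
        self.settings.update(plugin.settings)
        for signal_name, receiver in plugin.receivers.items():
            self._receivers.setdefault(signal_name, []).append(receiver)

    def send(self, signal_name: str, **payload: object) -> int:
        """Deliver a signal to every registered receiver; return the count."""
        receivers = self._receivers.get(signal_name, [])
        for receiver in receivers:
            receiver(**payload)
        return len(receivers)


# Example: a completion-style plugin listening for a grades signal.
seen_grades: List[dict] = []


def on_grade_changed(**payload: object) -> None:
    seen_grades.append(payload)


host = HostApp()
host.register(
    PluginConfig(
        name="completion",
        url_prefix="api/completion/",
        settings={"COMPLETION_ENABLED": True},
        receivers={"grade_changed": on_grade_changed},
    )
)
delivered = host.send("grade_changed", user_id=1, grade=0.9)
```

The point of the sketch is the direction of the dependency: the host exposes `register` and `send`, and the feature package supplies everything else, so adding a feature does not require editing host files.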
Python API: Wants to make use of a python API defined in the existing IDA (to avoid unnecessary network roundtrips and code complexity involved with calling an HTTP API).


Testing: Wants feature package tests to be run when IDA tests are run, to ensure that changes made to the IDA code don't break the feature.


Difficult to add a new LMS Dashboard
  • Journals app needs to add new dashboard to list Journal purchases
(sad)
LMS Course listing not extensible
  • Journals app needs to display Journal related cards on LMS index listing pages
(sad)
New Product: Introducing a new product offering is difficult.
  • "edX Portfolio": The Differentiated Paid team pursues new paid experiences but ends up including them within the verified certificate experience, due in part to technical and business challenges. (Fall 2017)
  • Journals: The White Label team is looking into creating a new product line for Digital Books and requires larger development effort to make it available in all places.  (Jan 2018)
  • Certificate Product: Business rules baked into course mode string names prevent us from...  (Open edX ecosystem runs into this issue)
  • Enrollments vs Entitlements: Issues have arisen from the way these two products co-exist.
(sad)(sad)(sad)(sad)

Notes:

  • Milestones
  • Authorizations
  • Business Logic on top of Base Example
    • e.g. Programs type: XSeries / MicroMasters
    • e.g. Credential type: Credit / Business
  • MicroMasters vs Degrees 
  • Course Modes logic
Functionality stuck in the Monolith: Reusable building blocks are locked up in the openedx.core.lib.api in edx-platform, which makes it difficult to pull into other apps and/or projects.
  • Pagination should be common across edX. Currently we have a slightly modified DefaultPagination class in platform, and analytics-data-api implements pagination slightly differently. If pagination were moved to edx-drf-extensions, everyone could use it. Other API building blocks living in openedx.core.lib.api could likely also be moved and made reusable. (Feb 2018)
(sad)

Enterprise is actively working on moving pagination to edx-drf-extensions.

To avoid breaking changes, the analytics-data-api response has not been updated.

Evaluation of the openedx.core.lib.api classes is ongoing.
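
To illustrate what "common pagination" could mean in practice, here is a sketch of one shared response envelope that any service could import. It is plain Python, not the DefaultPagination class or any edx-drf-extensions code, and the field names (`count`, `num_pages`, `current_page`, `next`, `previous`, `results`) are assumptions chosen for the example.

```python
import math
from typing import Dict, Sequence


def paginate(items: Sequence[object], page: int = 1, page_size: int = 10) -> Dict[str, object]:
    """Return one page of items in a single, shared envelope shape."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    count = len(items)
    # An empty collection still has one (empty) page.
    num_pages = max(1, math.ceil(count / page_size))
    if not 1 <= page <= num_pages:
        raise ValueError(f"page must be between 1 and {num_pages}")
    start = (page - 1) * page_size
    return {
        "count": count,
        "num_pages": num_pages,
        "current_page": page,
        "next": page + 1 if page < num_pages else None,
        "previous": page - 1 if page > 1 else None,
        "results": list(items[start:start + page_size]),
    }


second_page = paginate(list(range(25)), page=2, page_size=10)
```

If every API returned this one shape, clients would not need per-service pagination handling, which is the reuse argument made above.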

XBlocks don't run on Block Transformers: The Courseware doesn't use Block Transformers as its source for XBlock field data, while the mobile API and Course Outline both do. This inconsistency means that we need to duplicate some features across both paths.
  • Content Type Gating is requiring a new BlockTransformer as well as a new FieldDataOverride in order to force graded content to have a specific group_access field
(sad)
API Pluggability - Course Blocks API: new feature wants to make use of a new Transformer and/or add new fields to the API response.
  • Adaptive Learning feature wants to add a new transformer. (Nov 2017)
  • Completion app wants to add a new transformer and add new fields. (Oct 2017)
  • ContentTypeGating required a new BlockTransformer (which required code changes, not just install+configuration)
  • ContentTypeGating needed code changes to add a new xblock wrapper to the LMS
  • FBE had to add code to access.py in order to further customize authorization.
(sad)(sad)(sad)(sad)
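
One way to get install-and-configure extensibility for an API like this is a named transformer registry. The sketch below is plain Python and far simpler than the real Block Transformers framework; the `register_transformer` decorator, the block dictionary shape, and the `content_type_gate` rule are all hypothetical.

```python
from typing import Callable, Dict, List

Block = Dict[str, object]
Transformer = Callable[[List[Block]], List[Block]]

_TRANSFORMERS: Dict[str, Transformer] = {}


def register_transformer(name: str) -> Callable[[Transformer], Transformer]:
    """Decorator a feature package uses to add a transformer by name."""
    def decorator(func: Transformer) -> Transformer:
        _TRANSFORMERS[name] = func
        return func
    return decorator


def get_blocks(blocks: List[Block], requested: List[str]) -> List[Block]:
    """Apply the requested transformers in order; unknown names are an error."""
    for name in requested:
        if name not in _TRANSFORMERS:
            raise KeyError(f"no transformer registered as {name!r}")
        blocks = _TRANSFORMERS[name](blocks)
    return blocks


@register_transformer("content_type_gate")
def content_type_gate(blocks: List[Block]) -> List[Block]:
    # Hypothetical rule: graded blocks get a restricted group_access value.
    return [
        {**block, "group_access": {"gated": True}} if block.get("graded") else block
        for block in blocks
    ]


course = [{"id": "a", "graded": True}, {"id": "b", "graded": False}]
result = get_blocks(course, ["content_type_gate"])
```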

Front End Development

(sad)

UI Pluggability: Adding UI to an existing IDA, especially into LMS/Studio.
  • Adaptive Learning feature wants to add a spaced-repetition review block on the Course outline. (Nov 2017)
  • Growth: Diff Paid wants to add courseware pages (Jan 2017)
    • Learner Analytics
    • Portfolio Builder
(sad)(sad)(sad)
Bootstrap / Pattern Library: CSS conflicts
  • Educator Dahlia team adding new components that use Bootstrap which conflicts with the Legacy CSS. See this presentation for more details. (Jan 2018)
(sad)
A hack POC exists, but it is not ideal.
Paragon: components are inconsistent and ongoing maintenance is unclear
  • Without a clear lead and/or team supporting the effort, there isn't clear stakeholder engagement between UX and development or even between development teams. Educator Dahlia has run into upgrade woes while upgrading studio-frontend, mostly because the modal does not meet our use cases.
  • The "asInput" component is the parent for a number of components and needs a refactor to support the numerous components that depend on it. It has evolved very piecemeal.
  • Props and types are inconsistent and unclear between components, e.g. classNames are passed in as a string as opposed to a list.
(sad)(sad)(sad)
Backbone: Integrating with Backbone
  • Educator Dahlia team and the FED Core team find that it is challenging to integrate React/Redux components with the existing Studio Backbone apps. (Jan 2018)
(sad)
Dahlia's next feature to implement will deal heavily with this.
Mako: Inserting the component into the page with Mako
  • The edx-platform frontend asset management needs a lot of work. The custom home-grown nature of it makes it hard to insert new technologies that are meant to be installed in fresh environments.
(sad)
A studiofrontend Mako definition has been created to ease some of this pain.

Changing or testing learner features

(sad)

TTV is affected when changes are required in edx-platform.

For experiments, although Optimizely allows plugging in FED code, back-end changes are sometimes still required. Possible solutions may include additional APIs or server-side hooks / code injection.

  • Growth: Diff Paid is slowed down by edx-platform releases (Jan 2017) 
    • Learner Analytics
    • Portfolio Builder
(sad)(sad)
Difficult to take advantage of our endpoint versioning which should normally let us support multiple external clients (mobile, etc) without having to sync our releases.
  • (heart): Bumping platform's user API version to accommodate mobile's schedule seemed like enough work (large code surface, no prior art) that I found a different way to do it (Mar 2018)
(sad)

Experimentation Tools: Need to duplicate configuration code for experimentation

NOTE - A workaround exists for now on this.

  • Growth: Diff Paid wants to configure a list of courses to use across Optimizely experiments (Jan 2017)
    • Learner Analytics 
    • Portfolio Builder
    • Support

(tick) Generic experiment key-value store can be used for storing experimental config/data.
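
As a sketch of how a generic key-value store removes the duplicated configuration, the example below keeps one course list per experiment and reads it from a single helper. The `ExperimentKeyValueStore` class is an in-memory stand-in written for this page; the experiment id, key name, and course key are hypothetical, and the real store's model and fields are not shown here.

```python
import json
from typing import Dict, Optional, Tuple


class ExperimentKeyValueStore:
    """In-memory stand-in for a generic experiment key-value store."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[int, str], str] = {}

    def set(self, experiment_id: int, key: str, value: object) -> None:
        # Values are stored as JSON text so any config shape fits one field.
        self._data[(experiment_id, key)] = json.dumps(value)

    def get(self, experiment_id: int, key: str, default: Optional[object] = None) -> object:
        raw = self._data.get((experiment_id, key))
        return default if raw is None else json.loads(raw)


store = ExperimentKeyValueStore()
# Hypothetical experiment id and course key, for illustration only.
store.set(7, "enabled_courses", ["course-v1:DemoX+Demo+2018"])


def is_course_in_experiment(experiment_id: int, course_key: str) -> bool:
    """Single lookup shared by every feature that runs under the experiment."""
    courses = store.get(experiment_id, "enabled_courses", default=[])
    return course_key in courses
```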

Releases / Pipeline

(sad)

Altering GoCD pipelines
  • Updating the GoCD pipelines for IDAs was very challenging due to the complex nature of the code. There are multiple competing abstractions and design approaches to creating a pipeline. (August 2018)
(sad)

Environment Sync / Management (Stage, Sandboxes, Production)

(sad)

Setting up communications between our different IDAs on a sandbox: Setting up sandboxes so that the different IDAs are configured to communicate with each other is a tedious and frustrating process.
  • There are many environment variables that need to be set to set up communications between ecommerce and LMS.
    • (LMS) EDX_API_KEY → (ecommerce) EDX_API_KEY
    • (LMS) JWT_SECRET_KEY → (ecommerce) ECOMMERCE_API_SIGNING_KEY
    • some things in (LMS) JWT_AUTH → some things in (ecommerce) JWT_AUTH
    • (LMS) OAUTH2 → (ecommerce) Site Configurations
  • It's not clear what each configuration will do without digging into the code.
  • We've had multiple team members working on trying to set up Publisher on a sandbox and currently have this large document on how to do so: Publisher Sandbox 

  • The docs for Enterprise also describe how to configure a sandbox, but the instructions seem out of date, and it's not clear where to go for authoritative information. Perhaps it's not reasonable to expect a wiki page full of screenshots and code snippets to be kept up to date. 
(sad)(sad)(sad)(sad)
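
The paired settings listed above could be checked mechanically instead of by reading a long document. The sketch below compares two settings dictionaries for the two direct pairs named in the list; the JWT_AUTH and OAUTH2 pairs are left out because this page does not say which fields must match. The function name and the dictionary inputs are assumptions; how the values would be read from each sandbox is not covered.

```python
from typing import Dict, List, Tuple

# (LMS setting, ecommerce setting) pairs that must hold the same value.
PAIRED_SETTINGS: List[Tuple[str, str]] = [
    ("EDX_API_KEY", "EDX_API_KEY"),
    ("JWT_SECRET_KEY", "ECOMMERCE_API_SIGNING_KEY"),
]


def find_mismatches(lms: Dict[str, str], ecommerce: Dict[str, str]) -> List[str]:
    """Return a readable problem for each pair that is missing or differs."""
    problems = []
    for lms_key, ecom_key in PAIRED_SETTINGS:
        if lms_key not in lms:
            problems.append(f"LMS is missing {lms_key}")
        elif ecom_key not in ecommerce:
            problems.append(f"ecommerce is missing {ecom_key}")
        elif lms[lms_key] != ecommerce[ecom_key]:
            # Report only the names; never print the secret values.
            problems.append(f"LMS {lms_key} != ecommerce {ecom_key}")
    return problems


problems = find_mismatches(
    {"EDX_API_KEY": "abc", "JWT_SECRET_KEY": "s3cret"},
    {"EDX_API_KEY": "abc", "ECOMMERCE_API_SIGNING_KEY": "different"},
)
```

Keeping the pairs in one data structure also gives the mapping a single authoritative home, which is the gap the anecdotes above describe.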
True / Complete Staging Environments: Prod data is rarely synchronized to stage or loadtest environments.
  • We need to scrub it of PII, and the people who could sync might fear that we haven't got an up-to-date list of where to scrub.
  • There is a fear around deleting some stage data that people have manually entered for various tests.
  • Loadtest servers that don't have representative amounts of data are not very useful for testing load.
  • Likewise, stage grows less and less representative over time and has fewer instances of newer features (like bundles/entitlements), because that data is all manually entered.
  • We have no standard way of scrubbing the data and no process in place to update tooling so that it stays up to date for what should be scrubbed.
(sad)

Debugging & Investigation

(sad)

There are no clear best practices or points-of-view on how we add logging to our code.  This results in noisy unhelpful logs in some places and a lack of valuable information in others.
  • Due to a fear that we might over-log, no logs are written in a grading signal. But in practice, the ability to see what was happening when we backpopulated grades would have saved days of investigation into grades that were being skipped. Clarity on how we use logging and what options we have for verbosity (adjusting log levels, adding verbose flags to methods, etc.) might have saved a lot of pain here. (August 2018)
(sad)
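
One of the options mentioned above, adjusting log levels, can be sketched with the standard `logging` module: detail is logged at DEBUG so it stays quiet by default, and the level is lowered only during an investigation or backfill. The logger name, the `recalculate_grade` function, and the `ListHandler` helper are hypothetical and exist only to make the example self-contained.

```python
import logging

log = logging.getLogger("grades.signals")  # hypothetical logger name


def recalculate_grade(user_id: int, course_id: str, skip: bool = False) -> bool:
    """Pretend signal handler: returns True if the grade was recalculated."""
    # DEBUG detail is suppressed unless the logger level is lowered.
    log.debug("recalculate requested user=%s course=%s", user_id, course_id)
    if skip:
        # Skips were the hard case to investigate, so record them at INFO.
        log.info("grade skipped user=%s course=%s", user_id, course_id)
        return False
    return True


class ListHandler(logging.Handler):
    """Collects messages so the example can show what was emitted."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


handler = ListHandler()
log.addHandler(handler)
log.propagate = False

log.setLevel(logging.INFO)   # normal operation: only the skip is recorded
recalculate_grade(1, "demo-course", skip=True)

log.setLevel(logging.DEBUG)  # during a backfill: turn on full detail
recalculate_grade(2, "demo-course")
```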

Feature Toggle reporting is not automated (see OEP-17)

(sad)

Testing: Feature toggles (often waffle switches and flags) have not matched between stage and prod, causing unexpected e2e failures or Production Outages/RCAs.
  • Several RCAs/e2e failures related to getting the bundling experiment on the dashboard out to Production. (Need to find links.)
(sad)
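
A mismatch like this could be caught before release by diffing toggle state between environments. The sketch below compares two plain dictionaries of toggle name to state; the toggle names are hypothetical, and how the states would be exported from stage and prod is outside the scope of the example.

```python
from typing import Dict, List


def diff_toggles(stage: Dict[str, bool], prod: Dict[str, bool]) -> List[str]:
    """List every toggle whose state differs or that exists in only one place."""
    report = []
    for name in sorted(set(stage) | set(prod)):
        if name not in prod:
            report.append(f"{name}: only in stage ({stage[name]})")
        elif name not in stage:
            report.append(f"{name}: only in prod ({prod[name]})")
        elif stage[name] != prod[name]:
            report.append(f"{name}: stage={stage[name]} prod={prod[name]}")
    return report


# Hypothetical toggle names, for illustration only.
report = diff_toggles(
    {"dashboard.bundles": True, "old.flag": False},
    {"dashboard.bundles": False, "new.flag": True},
)
```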
Documentation: It is difficult to know how to remediate issues related to feature toggles whose intended use and lifespan are undocumented.
(sad)
Debugging / Logging: Determining/debugging production outages because of toggle changes.


Deprecation: Not removing toggles after rollout.


End-to-end test failures discovered on a centralized Staging environment block releases.
  • Growth team's attempts to launch Bundles on Dashboard using feature flags (Jan 2017)
(sad)
OEP-17: Feature Toggles provides a strategy for e2e tests related to toggles.

Data Duplication

(sad)

We have an architecture philosophy of self-contained systems, which often involves systems having local copies or caches of data owned by other systems.
  • (Early 2018) The Learner team has had some ongoing pain with the data copying between LMS and Credentials (sending of certificate and grade information). A fair amount of effort to set up syncing and the devops resources to keep it going. A long tail of backpopulating.
  • GDPR / Platform team ran into an issue relating to this as well re: User / Segment IDs
  • Revenue team ran into this in cases where Ecommerce cached User data isn't properly synced. 
(sad)(sad)(sad)
Course language is not in synch across our services
  • Dynamic pacing wants to translate emails in the course language (Spanish) and can't rely on the language in LMS' CourseOverview (English). (Oct 2017)
(sad)
(tick) Language value in LMS' CourseOverview is now synched with the value in Catalog.
Course start date is not in synch across our services.
  • Dynamic pacing creates schedules within a course based on the course's start date. The team had to work around this issue by changing the business logic for how we update schedule dates. (Oct 2017)
(sad)