edX/2U Architecture Manifesto

Manifesto Overview

Manifesto Description and Examples

Description

Example(s)


Decentralized beats Centralized

We create a scalable system with decentralized points of autonomy. We prevent centralized points of failure and bottlenecks. We balance this with intentional architecture.

Autonomy beats Coordination

We minimize coordination across team boundaries and value team-level autonomy in order to scale our organization. Collaboration between teams, however, is still vital to align on high-level goals and ensure the convergence of architectural decisions.

Developer teams are responsible for the services and code they own; therefore, they have autonomy over local architectural decisions, data modeling, code quality, library usage, monitoring, etc. Although edX promotes standards for many such concerns and encourages teams to collaboratively seek advice, the final say on decisions about a technology lies with its owners. These decisions, however, need to be defensible and should be documented in ADRs.

Extensions beat Core Modifications

We follow SOLID design principles. We invest appropriate time in determining the Single Responsibility of functions, modules, and higher-order components. We build interfaces for components that have many dependents, following Dependency Inversion. We depend on interfaces and not on concrete implementations.

We create an extensible platform that allows developers to build extensions that integrate with the core of the platform. This allows the core to remain small, while volatile extensions remain in the periphery.

Slides: SOLID/Extensions (Apr 2020)

  • Django App Plugin framework is used by the Open edX community to make enhancements to the platform.

  • The XBlock framework is used by the Open edX community and course teams to build custom course content for authoring and learning.

Non-examples:

  • Discussion forums are coupled into the core of the platform.

  • Hard-coded vendor-specific implementations live in the core codebase (e.g., SoftwareSecure, ZenDesk) when they could have been implemented as extensions.

Clear Bounded Contexts beat Shared Business Logic (across Contexts)

We decompose large-scale systems into smaller, cognitively understandable, encapsulated and cohesive components called Bounded Contexts.

Defining Contexts
We define Contexts (and their boundaries) by determining their “Jobs to be done” as described in Lifecycle-based Services. We do NOT follow the anti-pattern of focusing on Data and defining Contexts as Entity-based Services. This allows us to minimize interdependencies between Contexts.

Within Contexts
We ensure that each individual Context has a singular purpose and a Job to do. The responsibility of each Context is well defined and communicated with a clear boundary (per SOLID’s Single Responsibility principle). Within a Context, we can share code, share a common Ubiquitous language, and actively refactor as needed.

Across Contexts
We ensure that communication across Contexts happens via asynchronous messages, with clearly defined interfaces and relationships. Since each Context minds its own business and focuses on its own Job, there is minimal sharing of business logic across Contexts.

  • Pulling data out of course discovery into a new enterprise catalog IDA.

Non-examples:

  • edx-platform, which is tightly coupled across its Django apps.

  • Creating a single library for storing/accessing/manipulating User data. Similarly, a single library for managing Course data. Rather, each Context has its own requirements for this and can evolve accordingly.

  • A past mindset of considering the discovery service as a central entity-based service to all other services.

  • GraphQL Gotchas

  • Enterprise Data aggregation layers

Data Redundancy beats Coupling across Contexts

We intentionally choose to duplicate data across Contexts when data from a source Context is needed for a 'core Job' of a dependent Context. The dependent Context then has the flexibility to store/transform/optimize its local copy of the data according to its own special needs. This produces highly independent self-contained systems.

We look at client-side integrations when a product feature requires time-precise data from the source Context and cannot tolerate delays in asynchronous data synchronization by the dependent Context.

Slides: Data Redundancy (Mar 2020)

  • Student Records (in the Credentials service) maintains its own data storage of learners' Grades.

  • LMS’s local storage of Course Content in MySQL (Course Overviews, edx-when, sequences) and Block Transformers.
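The Student Records example above can be sketched as follows. This is an illustrative sketch only (the class and event names are hypothetical, not actual edX APIs): a dependent Context keeps its own local, query-optimized copy of source data, updated by an asynchronous handler, so reads never block on the source Context.

```python
# Hypothetical sketch of the Data Redundancy pattern: a dependent context
# (e.g., Credentials) maintains its own local copy of grade data from a
# source context (e.g., the LMS), updated asynchronously.
# All names here are illustrative, not real edX code.

class LocalGradeStore:
    """The dependent context's local, query-optimized copy of grades."""

    def __init__(self):
        self._grades = {}  # (user_id, course_id) -> percent

    def handle_grade_updated(self, event):
        # Async handler: transform the source data into the shape this
        # context needs (here, just the final percent).
        key = (event["user_id"], event["course_id"])
        self._grades[key] = event["percent"]

    def get_percent(self, user_id, course_id):
        # Reads never call the source context; slightly stale data is
        # acceptable (see "Eventual Consistency beats ..." below).
        return self._grades.get((user_id, course_id))


store = LocalGradeStore()
store.handle_grade_updated({"user_id": 1, "course_id": "demo", "percent": 0.92})
assert store.get_percent(1, "demo") == 0.92
```

The key design choice is that the dependent Context owns the storage format of its copy and can evolve it independently of the source.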

Loosely Coupled Codebase beats Don't Repeat Yourself (across Contexts)

We prefer building a loosely coupled architecture that is not bottlenecked by centralized implementations (libraries, code, etc). While our engineering impulse may be to de-duplicate code, we do not (usually) apply DRY across Contexts. We understand that when multiple microservices depend on shared libraries, it can lead to unintended team bottlenecks.

We allow exceptions to this trade-off when architecturally significant concerns need to be consistent across services (e.g., API authentication and authorization frameworks in edx-drf-extensions).

  • The architecture of the new Remote Config system encourages IDA configuration to be simple and loosely coupled between services. (TODO: Let DevOps flesh this out a bit)

Independent Deployments beat Lockstep Releases

We keep the deployment pipelines of our microservices independent of each other. We never design a feature that requires multiple services to be deployed at the same time.

  • edX IDAs are released to production through pipelines that are owned by feature teams. When releasing IDA-specific features, teams do not have to be concerned about the edxapp release cadence. When issues arise, they are empowered to roll back or fix forward at their own discretion.

Non-examples:

  • Release of LMS/Studio features are often contingent upon the edxapp release going out that day, which may be blocked by an unrelated end-to-end test failure or production bug.

Asynchronous beats Synchronous

Asynchronous Messaging beats Synchronous Requests

Inspired by the Reactive Manifesto, we create a loosely coupled architecture that is responsive, resilient, and elastic by having asynchronous (non-blocking) communication channels between services wherever possible.

We keep our 99th-percentile web response times under 2 seconds unless an explicit exception is made (with Product). This means we do not make blocking synchronous remote requests (within a web request) to other microservices, especially those that are outside of our Context’s Subdomain.

  • Better to asynchronously trigger a report to be generated and have the client get back to it later, than to have the client make 100K paginated requests. (Your system can better control load.)

    • (Event queues might be an option later down the line.)
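The report example above can be sketched as follows. This is a minimal, hypothetical illustration (the function names and in-memory store are assumptions, not real edX code): the web endpoint returns a handle immediately and the heavy work runs out-of-band, so the web worker never blocks.

```python
# Illustrative sketch (not real edX code): instead of a client paginating
# through 100K records synchronously, the service enqueues a background
# report and the client polls for the result later.

import uuid

REPORTS = {}  # report_id -> {"status", "result"}; a real system would use a DB


def request_report(query):
    """Web endpoint: returns immediately with a handle, never blocks."""
    report_id = str(uuid.uuid4())
    REPORTS[report_id] = {"status": "pending", "result": None}
    run_report_task(report_id, query)  # stand-in for e.g. a celery .delay()
    return report_id


def run_report_task(report_id, query):
    # Runs out-of-band in production (task queue); inline here for brevity.
    REPORTS[report_id] = {"status": "done", "result": "rows for %s" % query}


def poll_report(report_id):
    """Clients check back later instead of holding a connection open."""
    return REPORTS[report_id]["status"]


rid = request_report("enrollments")
assert poll_report(rid) == "done"
```

In production the task runner and the report store would be a real queue and durable storage; the point is the shape of the API, not the plumbing.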

Non-examples:

  • Multiple endpoints making blocking connections to the discovery service.

  • Connecting to other services within a web-worker response to the end-user.

Eventual Consistency beats Distributed Transactions (across Contexts)

We prefer accessing slightly stale data rather than being dependent on the real-time availability of other services. We recognize that the CAP theorem holds in our distributed services. That is, for Partition-tolerance (the system continues to operate in spite of partial network failures) we choose Availability (find a local/nearby replica of data) over Consistency (all read the same data at the same time).

When we write data, we don’t expect it will be immediately available throughout the system and be read at the same time. We prefer each client to be resilient and handle any ambiguities.

  • Updates to course metadata in publisher aren’t visible in the marketing site immediately, because publisher doesn’t try to push that data to the marketing site. Instead, the marketing site reads the data on the schedule that it needs in order to maintain its own recency SLA.

  • Changes to course content may take tens of seconds to make it out to students because of all the asynchronously triggered tasks that have to complete (e.g. Block Transformer collect()), but this is better for scaling than having it synchronously block on everything during publish.
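A resilient client under this principle reads from a possibly-stale local replica and degrades gracefully when data has not propagated yet. A minimal sketch (all names are illustrative assumptions):

```python
# Hedged sketch: a read that tolerates eventual consistency. The client
# prefers a possibly-stale local replica and handles the ambiguity of
# not-yet-synced data, rather than blocking on the remote source of truth.

def get_course_title(local_replica, course_id, default="(title pending)"):
    """Read from the local copy; never block on the source context."""
    return local_replica.get(course_id, default)


replica = {"course-v1:demo": "Demo Course"}
assert get_course_title(replica, "course-v1:demo") == "Demo Course"
# A course just published elsewhere may not have synced yet; the client
# renders a placeholder instead of failing or making a blocking call:
assert get_course_title(replica, "course-v1:new") == "(title pending)"
```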

Client-side Integrations beat Monolithic Integrations

When the time-precision of reading up-to-date data is paramount such that delays from background data synchronization cannot be tolerated, we have client (e.g., frontend) code (asynchronously) access the data from its source of truth. We prefer clients to fetch data across contexts (when really necessary) rather than have synchronous remote calls within backend business logic.

We recognize that this design introduces dependencies between Contexts (even if through the client) and so we do not use this pattern when the data is necessary for a core Job of the client.

  • Micro-frontend coordinates responses across multiple backends (i.e., async integration at the UI)

Non-examples:

  • Old basket page calls discovery service (for course/programs metadata information) within the same process of a learner’s request. Rather, the frontend can progressively render this non-core information upon receiving an asynchronous response from the secondary discovery service.

Asynchronous Tasks beat Paginated Requests

We minimize the load on our webservers by exposing only APIs that are performant. We implement slow transactions as asynchronous tasks that run in the background and do not interfere with web responses.

We add pagination from the ground up through our backend APIs, not just (a posteriori) in frontend views.

For APIs that provide large data sets, we prefer asynchronously generating data reports and exposing them via separate APIs that do not load our servers.

Non-examples:

  • Many edX services paginate their endpoints, but this does not stop the service from being overloaded when a client (more often than not, another edX job/service) requests one page after another in order to fetch the entire dataset.
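By contrast, pagination designed into the backend API from the start might look like the following sketch (an illustration, not an edX API): cursor-based pages with a hard page-size cap, so no single request can load the server arbitrarily.

```python
# Illustrative sketch of pagination built into the backend API itself:
# cursor-based pages with a server-enforced page-size cap.

def list_enrollments(data, cursor=0, page_size=50):
    """Return one bounded page plus the cursor for the next page (or None)."""
    page_size = min(page_size, 100)  # cap protects the server from huge pages
    page = data[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(data) else None
    return {"results": page, "next": next_cursor}


data = list(range(120))
first = list_enrollments(data, page_size=500)  # oversized request is capped
assert len(first["results"]) == 100
assert first["next"] == 100
last = list_enrollments(data, cursor=first["next"])
assert last["next"] is None
```

Note the cap alone does not solve the "client fetches every page" problem described above; for full-dataset consumers, the asynchronous report pattern is still the preferred answer.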

Intentional Architecture beats Emergent Design

"Without guidance, evolutionary architecture becomes simply a reactionary architecture."

Work to Simplify beats Slippery Complexity

We understand that the first solution that comes to mind is not necessarily the best solution. We take some time to better understand the problem space before ideating on solutions. Solving a problem one bit at a time can lead to complicated solutions since each increment addresses symptoms instead of the underlying diagnosis.

Instead, we invest time to tease apart the problem, make relevant connections, and create meaningful abstractions in order to create simpler long-term solutions.

Additionally, long-lived code may age and accrue complexity over time as business needs evolve and features are added. So, we continuously refactor and improve the code, leaving it better than we found it (the “Scout Rule”).

  • LTI 1.0, 1.1 > LTI 1.2, 2.0

  • Certificates code, spread across multiple apps in the monolith, was refactored and simplified - finding and fixing user-impacting issues along the way.

  • Monolith’s GoCD pipeline was simplified by eliminating unneeded pipes.

Timeboxed Design beats You Ain't Gonna Need It

In following Agile practices and iterative processes, we recognize that (appropriately) timeboxed upfront design efforts are investments that can lead to better long-term technical outcomes. The design may contain abstractions or APIs that can save development time in the future.

We understand that the phrase ‘You Ain’t Gonna Need It (YAGNI)’ applies more to local code within a Context, since that code can be continuously refactored as the feature and code evolve. For a platform developed across multiple distributed services, features and code cannot be as easily refactored, so upfront design that results in additional generality pays off.

  • User Partitions

  • Course Blocks API

Non-examples:

  • Course Modes

  • Comprehensive theming

  • Initial White Label implementation

Decision Records beat Design Docs

As part of our regular development process, we document technical decisions in GitHub as described in OEP-19. We add the creation of decision records to acceptance criteria during our grooming processes. We remind our teammates to do so when we review PRs. We recognize that decision records serve to onboard new members, preserve historical traces, provide references to past decisions, and ensure the team is aligned.

In addition to decision records, we actively maintain READMEs and HowTos as described in the OEP.

  • Open edX Proposals (OEPs).

  • Architectural Decisions Records, like the ones in the edx-platform Grades app.

Non-examples:

  • Information on context and decisions being buried deep in forgotten Confluence spaces and Jira boards.

Fitness Functions beat Manual Verifications

We believe automated verifications direct technical changes more efficiently and precisely than manual verifications do. Technical direction includes both detection of errors (e.g., variable not defined) and guidance towards a future technical evolution (e.g., adoption of PII annotation).

Fitness function is the term used in Building Evolutionary Architectures to “communicate, validate and preserve architectural characteristics in an automated, continual manner, which is the key to building evolutionary architectures.”

  • pylint can find undefined names better than we can

  • XSS linter finds XSS problems

  • automated pull request builds

  • toggle annotation linter finds feature toggles without proper annotations (missing fields, etc)

Non-example:

  • openedx.yaml files
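A fitness function can be as small as a repeatable check that fails CI when an architectural rule is violated. The sketch below enforces a hypothetical rule (the `legacy_auth` package name is an assumption for illustration, not a real edX rule), in the same spirit as the linters listed above:

```python
# Minimal fitness-function sketch: an automated, repeatable check that a
# (hypothetical) architectural rule holds -- here, "no module may import
# the deprecated `legacy_auth` package". Run in CI; a True result fails
# the build.

import ast


def imports_forbidden(source, forbidden="legacy_auth"):
    """Return True if the given Python source imports the forbidden module."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Covers `import legacy_auth` and `import legacy_auth.sub`.
            if any(alias.name.split(".")[0] == forbidden for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Covers `from legacy_auth import x`; node.module is None for
            # purely relative imports, which cannot name the package.
            if node.module and node.module.split(".")[0] == forbidden:
                return True
    return False


assert imports_forbidden("import legacy_auth\n")
assert imports_forbidden("from legacy_auth.tokens import sign\n")
assert not imports_forbidden("import requests\n")
```

Because the check is automated and continual, the architectural constraint is preserved long after the engineers who decided it have moved on.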

 


Comments

Nimisha Asthagiri (Deactivated)
February 7, 2020

We can look at each of these examples and understand what the actual issues are. I believe we can still achieve these with thoughtful design.

Here are some observations from my limited POV.

For example, introducing Course Overviews in LMS (as a duplication of Course metadata from Studio's database) allowed us to provide the availability and performance guarantees for the mobile apps. Prior to that, retrieving this data from the modulestore was not performant enough for the mobile apps. We synchronize the data via celery tasks.

The design of having a Program cache in edx-platform is in itself fine. However, the choice of storing that data in Memcached is prone to issues. Rather, that data would be better served stored in a SQL table in edx-platform.

Past issues with Refresh Course Metadata were related to having multiple writers and multiple directions of data flow. And the design mindset of the Discovery service as the real-time source of truth of all course-meta-data was the wrong direction.

There are some infrastructural changes one can make with an eventing architecture to make data updates more real-time. I agree that will help teams quite a bit. But even without that, we need to improve our intentionality around data flows and data storage. In the meantime, without real-time eventing, one can have robust designs using web-hooks or business-SLA-driven refreshes via REST APIs.

Albert (AJ) St. Aubin (Deactivated)
February 7, 2020

I am aware and understand there were reasons behind all the decisions we have made. I think it is imperative that we own that not all of these are sustainable or desirable methods of data copying and manipulation. They open us up to data consistency issues as well as reliability and data-manipulation problems.

I have scheduled time to talk with you about this so we can move the conversation there if you like.

My main desire here is to request that we in some way add to this manifesto a sense of ownership of our past choices that we still have to deal with today.

You already mention options in your comment that I believe people will simply not consider because of what they see in prior art.

Kyle McCormick (Deactivated)
February 7, 2020

I agree with essentially everything being said here on both sides.

That being said:

  • CourseOverviews is pretty good, but it has had its share of pretty nasty bugs that I would hope not to duplicate in future SQL caching systems.

  • The Credentials service caches Catalog data in SQL via a system that seems to have been developed in a thought-out manner by devs cognizant of data source-of-truth concerns. Despite that, it has caused and continues to cause regular escalations for the Master's squad.

  • As a feature dev, I wouldn't mind a top-down recommendation from Platform/Arch on how to do caching and what tools to use. In fact, I'd appreciate it, because it's not what I want to be spending most of my time thinking about when making a new service.

Kyle McCormick (Deactivated)
February 7, 2020

So, in a nutshell, I agree with the premise in the manifesto, but I also agree with AJ that we need concrete recommendations on how to achieve data redundancy without re-introducing issues that the Pub squad has spent a long time untangling.

Nimisha Asthagiri (Deactivated)
February 7, 2020

Has the Master's squad had any RCAs regarding the Credentials service escalations with its Catalog data cache? If so, can you share links? If not, can we hold one so we can go through details and understand root causes?

Nimisha Asthagiri (Deactivated)
February 7, 2020

AJ, yes, let's follow up in our meeting so we can discuss in real-time.

Nimisha Asthagiri (Deactivated)
February 7, 2020

As a feature dev, I wouldn't mind a top-down recommendation from Platform/Arch on how to do caching and what tools to use.

Kyle, let's figure out how/where to do this.  I will also add that I believe there may be a terminology mismatch here with the words "caching" and "data redundancy" since I view them differently. So I'd like to understand the problems that the teams are facing more deeply from designs that abide by this principle.

Dave Ormsbee (Deactivated)
February 8, 2020

I'm also very interested in the Credentials issues. I'm also +1 for putting in more details about recommended data practices (e.g. eventing if/when that's available, unidirectional data flow, etc.) – though I don't know where exactly that goes. I assume a lot of it is going to be hashed out in the long-form discussions we're going to have in Arch lunches on the topics, and that we'll capture those notes. Any one of these manifesto points is easy to misinterpret in ways that will cause more harm than good.

But I especially want to echo and expand on Nimisha's concern about terminology with this item, especially when we say "caching". I think that we've actually been bitten a number of times (like CourseOverviews) because we've been thinking of these things as delayed/optimized copies of the same entity, when they're often subtly different concepts in different systems, and the copy of an entity from System A is actually an input to a similarly named concept in System B.

Take due dates for instance. You can currently set a due date in Studio. That lives in the modulestore and is the only source of truth for due dates as far as Studio is concerned. But when you hit the publish button these days, edx-when then copies that due date information into database tables for the LMS. Data redundancy, right? But a problem's "due date" in the LMS is not a simple copy of Studio data. Studio is just the baseline. The LMS applies individual due date extensions and will shortly start setting due dates on self-paced courses by combining enrollment date information with the spacing it sees in Studio data. Both of these are major transformations of the data it gets from Studio. Studio and LMS both work with something called "due dates", but they now represent substantially different concepts.

In the same way, CourseOverview was originally created as a runtime optimization, but conceptually it now fills a niche that's more about "this is the LMS's local source of truth for catalog related information". I haven't worked with this part of the code in some time, so my understanding is probably poor here. But I think if we're living in a world where we have competing sources of truth for this kind of stuff, then CourseOverviews should probably have explicit models for the data that comes from Studio, models for the data that comes from course-discovery, and some logic about which one wins for the purposes of providing data to the LMS. That's data redundancy, but it's also synthesis and independence of a concept in one domain/system from how it evolves in another. I think this item in the manifesto was created with that in mind ("store/transform/optimize... to its own special needs"), but some of that gets lost when we think of things in terms of just copying/caching data.

Nimisha Asthagiri (Deactivated)
December 8, 2020

Hey folks on this thread,

Are there any issues with marking this thread resolved at this point? Are there any changes you propose we make to the current text, given this convo?

Kyle McCormick (Deactivated)
December 9, 2020

It doesn’t look like we hit a resolution on Credentials, but admittedly that’s on me for never following up.

I remember hearing that @Albert (AJ) St. Aubin (Deactivated) and his team found some concerning data consistency issues with Credentials after taking ownership of it. If there are any tickets or notes related to that, it might be good to link to them for posterity.

Albert (AJ) St. Aubin (Deactivated)
December 10, 2020

Feel free to close this, but I believe this is still an issue. There is still not a clear pathway to having reliable data redundancy in our systems.

My recent experience with Credentials has emphasized the clear issues with data consistency lurking under the covers of all of our data copy jobs and celery tasks.

Kyle McCormick (Deactivated)
December 10, 2020

@Albert (AJ) St. Aubin (Deactivated) do you have any specific learnings from the Credentials issues? Are there any notes/readouts/RCAs we can take a look at?

I’m not disagreeing with your point, but knowing how we’ve burned ourselves in the past would help us make better recommendations about data replication.

Albert (AJ) St. Aubin (Deactivated)
December 10, 2020

I can dig them up but it is simple.

Case 1
We use celery tasks to send data with zero verification that the data was received.

As networks go, data is sometimes not received.

An inconsistency occurs and it causes issues in the learner record.

Case 2:
The same celery tasks go into a failing or unregistered state. We do not send X updates to Credentials. We keep no record of updates that should have been sent to Credentials.

Our only resolution is a replay of ALL events for a user or within a timeframe.

We have queries that we use in Snowflake to track the inconsistencies in Course and Program credentials.

There are more examples in

  • Refresh Course Metadata

  • Programs Cache

  • etc.

tl;dr Do we really need more evidence?

 

Albert (AJ) St. Aubin (Deactivated)
December 10, 2020

Anyway, as I mentioned I think this thread can close. If anyone would like to discuss it further outside this thread let me know.