We create a scalable system with decentralized points of autonomy. We prevent centralized points of failure and bottlenecks. We balance this with intentional architecture.
We minimize coordination across team boundaries and value team-level autonomy in order to scale our organization. Collaboration between teams, however, is still vital to align on high-level goals and ensure the convergence of architectural decisions.
Developer teams are responsible for the services and code they own; therefore, they have autonomy over local architectural decisions, data modeling, code quality, library usage, monitoring, etc. Although edX promotes standards for many such concerns and encourages teams to seek advice collaboratively, the final say on decisions about a technology lies with its owners. These decisions, however, need to be defensible and should be documented in ADRs.
beat Core Modifications
We follow SOLID design principles. We invest appropriate time in determining the Single Responsibility of functions, modules, and higher-order components. We build interfaces for components that have many dependents, following Dependency Inversion. We depend on interfaces and not on concrete implementations.
We create an extensible platform that allows developers to build extensions that integrate with the core of the platform. This allows the core to remain small, while volatile extensions remain in the periphery.
The Django App Plugin framework is used by the Open edX community to make enhancements to the platform.
The XBlock framework is used by the Open edX community and Course teams to create custom authoring/learning course content.
Discussion forums are coupled into the core of the platform.
Hard-coded vendor-specific implementations are in the core codebase (e.g., SoftwareSecure, ZenDesk), while they could have been implemented as extensions.
Clear Bounded Contexts beat Shared Business Logic (across Contexts)
We decompose large-scale systems into smaller, cognitively understandable, encapsulated and cohesive components called Bounded Contexts.
Defining Contexts: We define Contexts (and their boundaries) by determining their “Jobs to be done” as described in Lifecycle-based Services. We do NOT follow the anti-pattern of focusing on Data and defining Contexts as Entity-based Services. This allows us to minimize interdependencies between Contexts.
Within Contexts: We ensure that each individual Context has a singular purpose and a Job to do. The responsibility of each Context is well defined and communicated with a clear boundary (per SOLID’s Single Responsibility principle). Within a Context, we can share code, share a common Ubiquitous language, and actively refactor as needed.
Across Contexts: We ensure that communication across Contexts happens through asynchronous messages and through clearly defined interfaces and relationships. Since each Context minds its own business and focuses on its own Job, there is minimal sharing of business logic across Contexts.
Pulling data out of course discovery into a new enterprise catalog IDA.
edx-platform - tightly coupled across its Django apps.
Creating a single library for storing/accessing/manipulating User data. Similarly, a single library for managing Course data. Rather, each Context has its own requirements for this and can evolve accordingly.
A past mindset of considering the discovery service as a central entity-based service to all other services.
Enterprise Data aggregation layers
beats Coupling across Contexts
We intentionally choose to duplicate data across Contexts when data from a source Context is needed for a 'core Job' of a dependent Context. The dependent Context then has the flexibility to store/transform/optimize its local copy of the data according to its own special needs. This produces highly independent self-contained systems.
We look at client-side integrations when a product feature requires time-precise data from the source Context and cannot tolerate delays in asynchronous data synchronization by the dependent Context.
Student Records (in the Credentials service) maintains its own data storage of learners' Grades.
LMS’s local storage of Course Content in MySQL (Course Overviews, edx-when, sequences) and Block Transformers.
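As a rough sketch of the data-duplication idea above, a dependent Context can keep its own copy of grades and update it from messages published by the source Context. Everything here is an assumption made for illustration: the `GradeChanged` message shape, the `version` field used to discard out-of-order messages, the `LocalGradeStore` class, and the passing threshold. None of it describes the actual Credentials service.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class GradeChanged:
    """Hypothetical message published by the source Context."""

    learner_id: str
    course_id: str
    percent: float
    version: int  # assumed to increase with every change to the same grade


class LocalGradeStore:
    """Local copy owned by the dependent Context, shaped for its own Job."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], GradeChanged] = {}

    def apply(self, event: GradeChanged) -> bool:
        """Store the event unless a newer (or equal) version is already held."""
        key = (event.learner_id, event.course_id)
        current = self._rows.get(key)
        if current is not None and current.version >= event.version:
            return False  # duplicate or out-of-order delivery; ignore it
        self._rows[key] = event
        return True

    def passed(
        self, learner_id: str, course_id: str, threshold: float = 0.5
    ) -> Optional[bool]:
        """Answer from local data only; None means no grade has arrived yet."""
        row = self._rows.get((learner_id, course_id))
        if row is None:
            return None
        return row.percent >= threshold
```

Reads such as `passed()` never call the source Context, so the dependent Context keeps working when the source is slow or down, at the cost of possibly serving slightly stale data.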
Loosely Coupled Codebase beats Don't Repeat Yourself (across Contexts)
We prefer building a loosely coupled architecture that is not bottlenecked by centralized implementations (libraries, code, etc). While our engineering impulse may be to de-duplicate code, we do not (usually) apply DRY across Contexts. We understand that when multiple microservices depend on shared libraries, it can lead to unintended team bottlenecks.
We allow exceptions to this trade-off when architecturally significant concerns need to be consistent across services (e.g., API authentication and authorization frameworks in edx-drf-extensions).
The architecture of the new Remote Config system encourages IDA configuration to be simple and loosely coupled between services. (TODO: Let DevOps flesh this out a bit)
The https://github.com/edx/configuration repository, which is in the process of being supplanted by Remote Config, centralized logic to generate configuration settings between services, creating complex inter-dependencies between the configurations of what would ideally be completely independent applications. (TODO: Let DevOps flesh this out a bit).
A utility library of trivial functionality that is shared across micro-services:
beat Lockstep Releases
We keep the deployment pipelines of our microservices independent of each other. We never design a feature that requires multiple services to be deployed at the same time.
edX IDAs are released to production through pipelines that are owned by feature teams. When releasing IDA-specific features, teams do not have to be concerned about the edxapp release cadence. When issues arise, they are empowered to rollback or fix-forward at their own discretion.
Releases of LMS/Studio features are often contingent upon the edxapp release going out that day, which may be blocked by an unrelated end-to-end test failure or production bug.
Asynchronous beats Synchronous
beats Synchronous Requests
Inspired by the Reactive Manifesto, we create a loosely coupled architecture that is responsive, resilient, and elastic by having asynchronous (non-blocking) communication channels between services wherever possible.
We keep 99% of our web response times under 2 seconds unless an explicit exception is made (with Product). This means we do not make blocking synchronous remote requests (within a web request) to other microservices, especially those that are outside of our Context’s Subdomain.
It is better to asynchronously trigger the generation of a report and have the client come back for it later than to have the client make 100K requests to paginate through the data. (The system can then better control its own load.)
(Event queues might be an option later down the line.)
Multiple endpoints making blocking connections to the discovery service.
Connecting to other services within a web-worker response to the end-user.
beats Distributed Transactions (across Contexts)
We prefer accessing slightly stale data rather than being dependent on the real-time availability of other services. We recognize that the CAP theorem holds in our distributed services. That is, for Partition-tolerance (the system continues to operate in spite of partial network failures) we choose Availability (find a local/nearby replica of data) over Consistency (all read the same data at the same time).
When we write data, we do not expect it to be immediately available throughout the system, nor to be readable everywhere at the same time. We prefer each client to be resilient and to handle any resulting ambiguities.
Updates to course metadata in publisher aren’t visible in the marketing site immediately, because publisher doesn’t try to push that data to the marketing site. Instead, the marketing site reads the data on the schedule that it needs in order to maintain its own recency SLA.
Changes to course content may take tens of seconds to make it out to students because of all the asynchronously triggered tasks that have to complete (e.g. Block Transformer collect()), but this is better for scaling than having it synchronously block on everything during publish.
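The examples above share one shape: a reader serves its own local copy and refreshes it on its own schedule. A minimal sketch of that shape follows. The `LocalReplica` class and its `fetch` callable are hypothetical; in practice `refresh()` would be driven by a scheduled background job rather than called from a web request.

```python
from typing import Any, Callable, Optional


class LocalReplica:
    """Keeps a possibly stale local copy of data owned by another service."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch  # hypothetical callable that reads from the source
        self._value: Optional[Any] = None

    def refresh(self) -> bool:
        """Run from a background schedule; never from a web request."""
        try:
            self._value = self._fetch()
            return True
        except Exception:
            # Source unavailable: keep the stale copy (Availability over
            # Consistency) and let the next scheduled run try again.
            return False

    def read(self) -> Optional[Any]:
        """Web requests read only local data; None means nothing synced yet."""
        return self._value
```

The recency of the data is then governed by how often the reader schedules `refresh()`, which matches the idea of each consumer maintaining its own recency SLA.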
beat Monolithic Integrations
When the time-precision of reading up-to-date data is paramount such that delays from background data synchronization cannot be tolerated, we have client (e.g., frontend) code (asynchronously) access the data from its source of truth. We prefer clients to fetch data across contexts (when really necessary) rather than have synchronous remote calls within backend business logic.
We recognize that this design introduces dependencies between Contexts (even if through the client) and so we do not use this pattern when the data is necessary for a core Job of the client.
Micro-frontend coordinates responses across multiple backends (i.e., async integration at the UI)
The old basket page calls the discovery service (for course/program metadata) within the same process that handles a learner’s request. Instead, the frontend could progressively render this non-core information upon receiving an asynchronous response from the secondary discovery service.
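The client-side integration described above can be sketched with `asyncio` standing in for frontend code. The functions `fetch_basket` and `fetch_course_metadata` are hypothetical placeholders for HTTP calls to two backends, and the second one is hard-coded to fail so the degradation path is visible.

```python
import asyncio
from typing import Any, Dict


async def fetch_basket(basket_id: str) -> Dict[str, Any]:
    """Hypothetical call to the service that owns the core Job."""
    await asyncio.sleep(0)  # stands in for network latency
    return {"id": basket_id, "total": "49.00"}


async def fetch_course_metadata(course_key: str) -> Dict[str, Any]:
    """Hypothetical call to a secondary service; fails in this sketch."""
    await asyncio.sleep(0)
    raise ConnectionError("secondary service unavailable")


async def render_basket_page(basket_id: str, course_key: str) -> Dict[str, Any]:
    """Client-side composition: core data is required, metadata is optional."""
    basket, metadata = await asyncio.gather(
        fetch_basket(basket_id),
        fetch_course_metadata(course_key),
        return_exceptions=True,  # one failure must not cancel the other call
    )
    if isinstance(basket, Exception):
        raise basket  # the page cannot render without its core data
    course = None if isinstance(metadata, Exception) else metadata
    return {"basket": basket, "course": course}
```

Calling `asyncio.run(render_basket_page("b1", "course-key"))` still yields a renderable page when the secondary service is down, because the non-core section simply stays empty.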
beat Paginated Requests
We minimize the load on our webservers by exposing only APIs that are performant. We implement slow transactions as asynchronous tasks that run in the background and do not interfere with web responses.
We add pagination from the ground up through our backend APIs, not just (a posteriori) in frontend views.
For APIs that provide large data sets, we prefer asynchronously generating data reports and exposing them via separate APIs that do not load our servers.
Registrar implements read/write-heavy API endpoints with https://github.com/edx/django-user-tasks, which allows asynchronous processing of API requests using Celery. Clients are provided with a URL to check the status of their task and download the result when it’s ready.
Many edX services paginate their endpoints, but this does not stop the service from being overloaded when a client (more often than not, another edX job/service) requests one page after another in order to fetch the entire dataset.
Intentional Architecture beats Emergent Design
"Without guidance, evolutionary architecture becomes simply a reactionary architecture."
Work to Simplify beats Slippery Complexity
We understand that the first solution that comes to mind is not necessarily the best solution. We take some time to better understand the problem space before ideating on solutions. Solving a problem one bit at a time can lead to complicated solutions since each increment addresses symptoms instead of the underlying diagnosis.
Instead, we invest time to tease apart the problem, make relevant connections, and create meaningful abstractions in order to create simpler long-term solutions.
Additionally, long-lived code may age and accrue complexity over time as business needs evolve and features are added. So, we continuously refactor and improve the code, leaving it better than it was (“scouts rule”).
LTI 1.0, 1.1 > LTI 1.2, 2.0
Certificates code, spread across multiple apps in the monolith, was refactored and simplified - finding and fixing user-impacting issues along the way.
Monolith’s GoCD pipeline was simplified by eliminating unneeded pipes.
beats You Ain't Gonna Need It
In following Agile practices and iterative processes, we recognize that (appropriately) timeboxed upfront design efforts are investments that can lead to better long-term technical outcomes. The design may contain abstractions or APIs that can save development time in the future.
We understand that the phrase ‘You Ain’t Gonna Need It (YAGNI)’ applies more to local code within a Context, since that code can be continuously refactored as the feature and code evolve. For a platform developed across multiple distributed services, features and code cannot be refactored as easily, so upfront design that results in additional generality pays off.
Course Blocks API
Initial White Label implementation
beat Design Docs
As part of our regular development process, we document technical decisions in GitHub as described in OEP-19. We add the creation of decision records in acceptance criteria during our grooming processes. We remind our teammates to do so when we review PRs. We recognize decision records serve to onboard new members, to understand historical traces, to reference past decisions, and to ensure the team is aligned.
In addition to decision records, we actively maintain READMEs and HowTos as described in the OEP.
Open edX Proposals (OEPs).
Architectural Decision Records, like the ones in the edx-platform Grades app.
Information on context and decisions being buried deep in forgotten Confluence spaces and Jira boards.
beat Manual Verifications
We believe automated verifications will more efficiently and precisely direct technical changes than manual verifications. Technical direction includes both detection of errors (e.g., variable not defined) as well as guidance towards a future technical evolution (e.g., adoption of PII annotation).