Arch and Engineering Challenges (2020+)
Purpose
This is a live page to keep track of sources of drag faced by an edX engineer or team due to architectural, infrastructural, or technical issues in our system.
Goals
Surface issues that we face when doing rapid experiments or iterative development and see whether they are challenges for other teams as well.
See if any patterns emerge that call for technical or architectural investments.
Use anecdotal data on this page to prioritize platform and non-platform theme initiatives.
Challenges
Top-level Categories from Architecture Challenges (2017-2018)
Pain Incidents
Also see Observability Challenges for observability-specific pain incidents.
Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status |
|---|---|---|---|---|
Jenkins and Test Infrastructure Issues | Networking problems (affecting PR Builds, e2e, GoCD) | April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests. | 2 | |
Jenkins and Test Infrastructure Issues | edx-platform master flaky failures | March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that a failure on master is noticed by others who rebase, rather than by alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness. | 1 | |
Authentication Clean-up | Messy JWT-related and JWT_ISSUER config causing pain. | ~Aug, 2021: pyjwt 2 has breaking change on Note: This work may be simplified when and if ecommerce is moved externally. | 2 | https://openedx.atlassian.net/browse/ARCHBOM-1202 Note: This ticket was not actually completed. |
Authentication Clean-up | LMS missing JwtAuthentication as default. | Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed. May 5, 2020: TNL reached out with a production bug where verified learners could use LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via the default), but not JwtAuthentication. Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In LMS, this means explicitly adding it to endpoints where needed. For non-DRF endpoints, this may add complexity, depending on the endpoint. | 2 | https://openedx.atlassian.net/browse/ARCHBOM-107 |
Authentication Clean-up | JWT Cookie seems broken locally. | May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue. | 2 | https://openedx.atlassian.net/browse/ARCHBOM-1218 |
Authentication Clean-up | JWT Expired Signature | Nov 20, 2019: The Payment app was having issues with expired signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved. | 1 | https://openedx.atlassian.net/browse/ARCHBOM-1152 Note: This ticket was not actually completed. |
Authentication Clean-up | Authentication in pact Provider verification | The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during the provider verification. In edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in pact provider verification? See some more details on Contract Testing: Architecture/Integration Questions (Question 4). | | |
Authentication Clean-up | JWT Cookie Refresh sometimes not working | This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It looks like this is affecting roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only). The cause continues to be elusive. See ADR for more details. | | https://openedx.atlassian.net/browse/ARCHBOM-1150 Note: This ticket just checked a particular hypothesis which didn't pan out. |
Credentials for integrations | Getting Google, GitHub, etc. API keys or tokens can be slow | June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker) | 2 | |
Authorization | Managing admin access to Django services. | June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and permissions (VEM, LM, Ecommerce, etc). | 3 | |
Pipeline | Security Merge Conflicts are hard to detect/resolve | Sep 18, 2020: A security PR was causing a merge conflict in the pipeline; the error was hard to debug and fully stopped the pipeline. | 1 | |
Masquerading | Can be error-prone and can cause security issues | Because the masquerading code changes the request user, it caused false positives in various security checks: it does not look different from a malicious account takeover we've seen in the past (session user changes during a request). Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading. | 1 | |
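
For the "LMS missing JwtAuthentication as default" row, making JwtAuthentication a default class could look roughly like the settings fragment below. This is only a sketch: it assumes Django REST Framework's `DEFAULT_AUTHENTICATION_CLASSES` setting and the `JwtAuthentication` class from edx-drf-extensions, and the exact class list and ordering that LMS would need are assumptions, not the actual LMS configuration.

```python
# Sketch of a DRF settings fragment (assumed; not the actual LMS settings).
# Dotted paths are given as strings, as DRF settings expect.
JWT_AUTHENTICATION = (
    "edx_rest_framework_extensions.auth.jwt.authentication.JwtAuthentication"
)
SESSION_AUTHENTICATION = "rest_framework.authentication.SessionAuthentication"

REST_FRAMEWORK = {
    # With JwtAuthentication in the defaults, endpoints called from an MFE
    # would not need to add it explicitly, one view at a time.
    "DEFAULT_AUTHENTICATION_CLASSES": [
        JWT_AUTHENTICATION,
        SESSION_AUTHENTICATION,  # assumed to stay for server-rendered pages
    ],
}
```

Until a default like this exists, the short-term fix from the table still applies: each DRF endpoint called from an MFE lists JwtAuthentication in its own authentication classes.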
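
The Masquerading row describes security checks that flag a session user changing during a request. One way such a check might tell masquerading apart from an account takeover is an explicit marker set by the masquerading code. The sketch below is illustrative only: `RequestUsers`, `masquerading_as`, and `is_suspicious_user_change` are hypothetical names, not existing edx-platform code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestUsers:
    """Hypothetical snapshot of the user ids seen while handling one request."""

    at_start: int  # session user id when the request began
    at_end: int  # request user id when the response is built
    masquerading_as: Optional[int] = None  # explicit marker set by masquerade code


def is_suspicious_user_change(users: RequestUsers) -> bool:
    """Return True when the user changed without a matching masquerade marker."""
    if users.at_start == users.at_end:
        return False
    # A change is expected only if masquerade code declared this exact target.
    return users.masquerading_as != users.at_end
```

For example, `is_suspicious_user_change(RequestUsers(1, 2, masquerading_as=2))` returns `False`, while the same change without the marker returns `True`.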