Arch and Engineering Challenges (2020+)
Purpose
This is a live page to keep track of sources of drag faced by an edX engineer or team due to architectural, infrastructural, or technical issues in our system.
Goals
- Surface issues that we face when doing rapid experiments or iterative development and see whether they are challenges for other teams as well.
- See if any patterns emerge that call for technical or architectural investments.
- Use anecdotal data on this page to prioritize platform and non-platform theme initiatives.
Challenges
Top-level Categories from Architecture Challenges (2017-2018)
Pain Incidents
Also see Observability Challenges for Observability-specific pain incidents.
Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status |
---|---|---|---|---|
Jenkins and Test Infrastructure Issues | Networking problems (affecting PR Builds, e2e, GoCD) | April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests. | 2 | |
| | edx-platform master flaky failures | March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failures on master are noticed by others who rebase, rather than through alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness. | 1 | |
Authentication Clean-up | Messy JWT-related and JWT_ISSUER config causing pain. | ~Aug, 2021: pyjwt 2 has breaking change on Note: This work may be simplified when and if ecommerce is moved externally. | 2 | ARCHBOM-1202. Note: This ticket was not actually completed. |
| | LMS missing JwtAuthentication as default. | Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed. May 5, 2020: TNL reached out with a Production bug where verified learners could use LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via default), but not JwtAuthentication. Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In LMS, this means explicitly adding it to endpoints where needed. For non-DRF endpoints, this may add complexity depending on the endpoint. | 2 | |
| | JWT Cookie seems broken locally. | May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local only issue. | 2 | |
| | JWT Expired Signature | Nov 20, 2019: Payment app was having an issue with Expired Signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved. | 1 | ARCHBOM-1152. Note: This ticket was not actually completed. |
| | Authentication in pact Provider verification | The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during the provider verification. In edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in pact provider verification? See some more details on Contract Testing: Architecture/Integration Questions (Question 4) | | |
| | JWT Cookie Refresh sometimes not working | This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It appears to affect roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only). The cause continues to be elusive. See ADR for more details. | | ARCHBOM-1150. Note: This ticket just checked a particular hypothesis, which didn't pan out. |
Credentials for integrations | Getting Google, GitHub, etc. API keys or tokens can be slow | June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker) | 2 | |
Authorization | Managing admin access to Django services. | June 10, 2020 SRE-84 - Several services have required SRE to manually manage admins and manage permissions (VEM, LM, Ecommerce, etc). | 3 | |
Pipeline | Security Merge Conflicts are hard to detect/resolve | Sep 18, 2020 - A security PR caused a merge conflict in the pipeline; the error was hard to debug and fully stopped the pipeline. | 1 | |
Masquerading | Can be error-prone and can cause security issues | Because the masquerading code changes the request user, it caused false positives in various security checks: a masquerade does not look different from a malicious account takeover we've seen in the past (session user changes during a request). Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading. | 1 | |
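
The short-term fix noted under "LMS missing JwtAuthentication as default" (explicitly adding JwtAuthentication to any endpoint called from an MFE) could be backed by a simple audit of view classes. The following is a minimal sketch under stated assumptions, not existing edX code: the two authentication classes are stand-ins so the example runs without Django or DRF installed, and the view names and the `views_missing_jwt` helper are hypothetical. In a real service the classes would come from DRF and edx-drf-extensions, and views without an explicit setting would fall back to the configured defaults rather than being flagged.

```python
# Stand-ins so the sketch runs without Django/DRF. In a real service these
# would be the DRF session authentication class and the JwtAuthentication
# class provided by edx-drf-extensions.
class SessionAuthentication:
    """Stand-in for a session-cookie authentication class."""


class JwtAuthentication:
    """Stand-in for the JWT authentication class."""


class CourseHomeView:
    """Hypothetical endpoint called from an MFE; accepts JWT and session auth."""
    authentication_classes = (JwtAuthentication, SessionAuthentication)


class LegacyProgressView:
    """Hypothetical endpoint that only accepts session auth."""
    authentication_classes = (SessionAuthentication,)


def views_missing_jwt(views):
    """Return the names of views whose authentication_classes lack JwtAuthentication.

    Assumption: a view with no authentication_classes attribute is treated as
    missing JWT support. Real DRF views inherit defaults from settings instead.
    """
    missing = []
    for view in views:
        auth_classes = getattr(view, "authentication_classes", ())
        if not any(issubclass(cls, JwtAuthentication) for cls in auth_classes):
            missing.append(view.__name__)
    return missing


print(views_missing_jwt([CourseHomeView, LegacyProgressView]))
```

Run against the list of endpoints an MFE is known to call, a check like this would flag the session-only endpoint before the failure shows up in production.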