Arch and Engineering Challenges (2020+)

Purpose

This is a live page for tracking sources of drag faced by an edX engineer or team due to architectural, infrastructural, or technical issues in our system.

Goals

  1. Surface issues that we face when doing rapid experiments or iterative development and see whether they are challenges for other teams as well.
  2. See if any patterns emerge that call for technical or architectural investments.
  3. Use anecdotal data on this page to prioritize platform and non-platform theme initiatives.

Challenges

Top-level Categories from Architecture Challenges (2017-2018)



Pain Category | # incidents | Status | Notes

1. Monolith: Pluggability, Extensibility, Interdependency/coupling, Deprecation/removal
# incidents: 14
Status: PROGRESS
Notes: DEPR efforts continue

2. Modernizing FED
# incidents: 9
Status: INVESTED
Notes: Arch-FED efforts in FY19-FY20:

  • Paragon ownership and updates
  • Micro-frontend runways
  • Micro-frontend re-platforming

3. Environments / Testing - End-to-end tests, Stage environments, Sandboxes, etc.
# incidents: 5
Status: PROGRESS
Notes: DevOps efforts:

  • Kubernetes, Docker, and Config

Arch-BOM efforts:

  • Test low-hanging fruit in FY20-Q3
  • Test strategy in FY20-Q4

4. Coupled services: Data synchronization and duplication
# incidents: 5
Status: PARTIAL
Notes: Publisher efforts to fix data sync issues

5. Configuration/Toggles: OEP-17 - Feature Toggles
# incidents: 2
Status: PROGRESS
Notes: Arch-BOM efforts:

  • Toggle OEP in FY19-Q3
  • Toggle reporting/annotations in FY20-Q4
  • Toggle documentation (with Open edX)


Pain Incidents

Also see Observability Challenges for observability-specific pain incidents.


Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status
Category: Jenkins and Test Infrastructure Issues

Issue: Networking problems (affecting PR Builds, e2e, GoCD)

April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests.
March 12, 2020: Created DOS-701, which initiated a manual restart of Jenkins. Issues in the preceding days were seen on PRs, e2e, and GoCD by multiple teams. (Build Jenkins)

Number of occurrences: 2

Issue: edx-platform master flaky failures

March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failures on master are noticed by others who rebase, rather than through alerts, because master builds for edx-platform nearly always have a failed job due to flakiness.

Number of occurrences: 1
Category: Authentication Clean-up

Issue: Messy JWT-related and JWT_ISSUER config causing pain.

~Aug 2021: pyjwt 2 has a breaking change in jwt.decode that requires an algorithm, and it is challenging to determine how best to resolve this because the algorithm is configured both in and out of the issuer settings (see spreadsheet).
~April 16, 2020: Start of about 2 weeks of devstack issues caused by a JWT_ISSUER-related config change, with no clarity on how to configure it properly.
~May 1, 2020: Ned spent time researching configs and documenting them in a spreadsheet as part of the Juniper release, and reached out for support.

Note: This work may be simplified when and if ecommerce is moved externally.

Number of occurrences: 2

Status: ARCHBOM-1202

Note: This ticket was not actually completed.
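
The jwt.decode change mentioned in the Aug 2021 anecdote above can be sketched as follows. This is a minimal illustration that assumes PyJWT 2.x is installed; the secret, the claim, and the choice of HS256 are hypothetical and are not taken from any edX configuration.

```python
import jwt  # PyJWT 2.x (assumed to be installed)

# Hypothetical shared secret and algorithm, used only for this sketch.
SECRET = "hypothetical-shared-secret-used-only-for-this-sketch"
ALGORITHMS = ["HS256"]


def decode_token(token, secret=SECRET, algorithms=ALGORITHMS):
    """Decode a JWT, always naming the accepted algorithms explicitly."""
    return jwt.decode(token, secret, algorithms=algorithms)


token = jwt.encode({"preferred_username": "example-user"}, SECRET, algorithm="HS256")
claims = decode_token(token)

# Omitting `algorithms` while verifying the signature is rejected in PyJWT 2.x.
try:
    jwt.decode(token, SECRET)
    missing_algorithms_rejected = False
except jwt.DecodeError:
    missing_algorithms_rejected = True
```

Every caller of jwt.decode has to know which algorithms to accept, which is why the location of that setting matters.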

Issue: LMS missing JwtAuthentication as default.

Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed.

May 5, 2020: TNL reached out with a production bug where verified learners could use the LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious because the endpoint was using SessionAuthentication (via default), but not JwtAuthentication.

Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In LMS, this means explicitly adding it to each endpoint where needed. For non-DRF endpoints, this may add complexity depending on the endpoint.

Number of occurrences: 2

Status: ARCHBOM-107
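
One way to picture "JwtAuthentication as a default class" is a Django REST Framework settings fragment like the one below. This is a sketch, not edx-platform's actual settings: DEFAULT_AUTHENTICATION_CLASSES is DRF's standard setting, the JwtAuthentication path assumes the class provided by edx-drf-extensions, and the ordering shown is an assumption.

```python
# Hypothetical settings fragment; not copied from edx-platform.
REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        # Assumed import path for JwtAuthentication (edx-drf-extensions).
        "edx_rest_framework_extensions.auth.jwt.authentication.JwtAuthentication",
        # DRF's built-in session authentication.
        "rest_framework.authentication.SessionAuthentication",
    ],
}
```

Views that set authentication_classes explicitly override these defaults, so such endpoints would still need JwtAuthentication added by hand.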


Issue: JWT cookie seems broken locally.

May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue.
May 7, 2020: Alex reached out wondering how to get JWT roles from JWT cookies when calling endpoints from the browser. The problem/solution was non-obvious.

Number of occurrences: 2

Status: ARCHBOM-1218


Issue: JWT Expired Signature

Nov 20, 2019: The Payment app was having issues with expired signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved.

Number of occurrences: 1

Status: ARCHBOM-1152

Note: This ticket was not actually completed.

Issue: Authentication in Pact provider verification

The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during provider verification. At edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in Pact provider verification? See more details in Contract Testing: Architecture/Integration Questions (Question 4).
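
As one possible starting point for the question above, the three auth schemes could be funneled through a single helper that produces request headers. This is only a sketch: the helper names, the endpoint paths, the "JWT" header prefix, and the "sessionid" cookie name are all assumptions, and how the headers are handed to the Pact verifier is left out because it depends on the tooling chosen.

```python
def build_auth_headers(scheme, credential):
    """Return HTTP headers for one of the three auth schemes (hypothetical helper)."""
    scheme = scheme.lower()
    if scheme == "jwt":
        # Assumes the service expects a "JWT" prefix in the Authorization header.
        return {"Authorization": f"JWT {credential}"}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {credential}"}
    if scheme == "session":
        # Assumes Django's default session cookie name.
        return {"Cookie": f"sessionid={credential}"}
    raise ValueError(f"Unsupported auth scheme: {scheme}")


# Hypothetical mapping from provider endpoints to the scheme each one expects.
ENDPOINT_AUTH = {
    "/api/hypothetical/v1/jwt-endpoint/": "jwt",
    "/api/hypothetical/v1/bearer-endpoint/": "bearer",
    "/hypothetical/session-endpoint/": "session",
}


def headers_for(path, credentials):
    """Look up the scheme for a path and build headers from test credentials."""
    scheme = ENDPOINT_AUTH[path]
    return build_auth_headers(scheme, credentials[scheme])
```

A verification run could then call headers_for once per replayed interaction, using test-only credentials for each scheme.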

Issue: JWT Cookie Refresh sometimes not working

This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It appears to affect roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only).

The cause continues to be elusive. See ADR for more details.


Status: ARCHBOM-1150

Note: This ticket just checked a particular hypothesis which didn't pan out.

Category: Credentials for integrations

Issue: Getting Google, GitHub, etc. API keys or tokens can be slow

June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker).

Number of occurrences: 2

Category: Authorization

Issue: Managing admin access to Django services.

June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and permissions (VEM, LM, Ecommerce, etc.).

Number of occurrences: 3

Category: Pipeline

Issue: Security Merge Conflicts are hard to detect/resolve

Sep 18, 2020: A security PR caused a merge conflict in the pipeline. The error was hard to debug and fully stopped the pipeline.

Number of occurrences: 1

Category: Masquerading

Issue: Can be error-prone and can cause security issues

Because the masquerading code changes the request user, it caused false positives in various security checks: the change looks no different from a malicious account takeover we've seen in the past (the session user changes during a request).

Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading.

Number of occurrences: 1