Arch and Engineering Challenges (2020+)

Purpose

This is a live page for tracking sources of drag that an edX engineer or team faces due to architectural, infrastructural, or technical issues in our system.

Goals

  1. Surface issues that we face when doing rapid experiments or iterative development and see whether they are challenges for other teams as well.

  2. See if any patterns emerge that call for technical or architectural investments.

  3. Use anecdotal data on this page to prioritize platform and non-platform theme initiatives.

Challenges

Top-level Categories from Architecture Challenges (2017-2018)



Pain Category

# incidents

status

notes

1

Monolith: Pluggability, Extensibility, Interdependency/coupling, Deprecation/removal

14

Progress

DEPR efforts continue

2

Modernizing FED

9

Invested

Arch-FED efforts in FY19-FY20:

  • Paragon ownership and updates

  • Micro-frontend runways

  • Micro-frontend re-platforming

3

Environments / Testing - End-to-end tests, Stage environments, Sandboxes, etc.

5

PROGRESS

DevOps efforts:

  • Kubernetes, Docker, and Config

Arch-BOM efforts:

  • Test low-hanging fruit in FY20-Q3

  • Test strategy in FY20-Q4

4

Coupled services: Data synchronization and duplication

5

PARTIAL

Publisher efforts to fix data synchronization issues

5

Configuration/Toggles: OEP-17 - Feature Toggles

2

PROGRESS

Arch-BOM efforts:

  • Toggle OEP in FY19-Q3

  • Toggle reporting/annotations in FY20-Q4

  • With Open edX, Toggle documentation


Pain Incidents

Also see Observability Challenges for observability-specific pain incidents.



Category

Issue

Team anecdotes (with dates, with team)

Number of occurrences

Status

Jenkins and Test Infrastructure Issues

Networking problems (affecting PR Builds, e2e, GoCD)

April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests.
March 12, 2020: Created DOS-701, which initiated a manual restart of Jenkins. In the preceding days, multiple teams saw issues on PRs, e2e, and GoCD. (Build Jenkins)

2



edx-platform master flaky failures

March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failures on master are noticed by others who rebase, rather than through alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness.

1



Authentication Clean-up

Messy JWT-related config, including JWT_ISSUER settings, is causing pain.

~Aug, 2021: PyJWT 2 has a breaking change: jwt.decode now requires the algorithms argument. It is challenging to determine how best to resolve this because the algorithm setting exists both inside and outside of the issuer settings (see spreadsheet).
~April 16, 2020: Start of about 2 weeks of devstack issues caused by a JWT_ISSUER-related config change, with no clarity on how to configure it properly.
~May 1, 2020: Ned spent time researching configs and documenting them in a spreadsheet as part of the Juniper release, and reached out for support.

Note: This work may be simplified when and if ecommerce is moved externally.
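
To illustrate the PyJWT 2 change noted above, here is a minimal sketch of a decode call that passes the accepted algorithms explicitly. The key, algorithm, claim, and the decode_token helper are hypothetical values chosen for illustration; real values would come from the service's JWT settings, which is exactly where the ambiguity described above arises.

```python
import jwt  # PyJWT, a third-party package

# Hypothetical values for illustration only; a real service would read the
# key and algorithm from its JWT-related settings.
SECRET_KEY = "example-secret-key-of-at-least-32-bytes!!"
ALGORITHM = "HS256"


def decode_token(token, key=SECRET_KEY, algorithms=(ALGORITHM,)):
    """Decode a JWT, always passing the accepted algorithms explicitly."""
    # PyJWT 2 will not verify a signature unless `algorithms` is supplied.
    return jwt.decode(token, key, algorithms=list(algorithms))


# Round trip: sign a token, then decode it with the same key and algorithm.
token = jwt.encode({"preferred_username": "example"}, SECRET_KEY, algorithm=ALGORITHM)
claims = decode_token(token)
```

Centralizing the call in one helper is only one option; it does not by itself answer which setting should be the source of truth for the algorithm.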

2

https://openedx.atlassian.net/browse/ARCHBOM-1202

Note: This ticket was not actually completed.

LMS missing JwtAuthentication as default.

Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed.

May 5, 2020: TNL reached out with a production bug where verified learners could use the LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via default), but not JwtAuthentication.

Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In the LMS, this means explicitly adding it to endpoints where needed. For non-DRF endpoints, this may add complexity, depending on the endpoint.
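
For comparison with the per-endpoint fix, here is a sketch of what listing JwtAuthentication among the default classes could look like in Django REST Framework settings. This is an assumed fragment for illustration, not the actual LMS settings, and it does not address non-DRF endpoints.

```python
# Assumed Django settings fragment for illustration; not the actual LMS config.
# DRF applies these classes to every view that does not set its own
# `authentication_classes`.
REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        # Provided by edx-drf-extensions.
        "edx_rest_framework_extensions.auth.jwt.authentication.JwtAuthentication",
        # Provided by Django REST Framework.
        "rest_framework.authentication.SessionAuthentication",
    ],
}
```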

2

https://openedx.atlassian.net/browse/ARCHBOM-107



JWT Cookie seems broken locally.

May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue.
May 7, 2020: Alex reached out wondering how to get JWT roles from JWT cookies when calling endpoints from the browser. The problem/solution was non-obvious.

2

https://openedx.atlassian.net/browse/ARCHBOM-1218



JWT Expired Signature

Nov 20, 2019: The Payment app was having an issue with Expired Signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved.

1

https://openedx.atlassian.net/browse/ARCHBOM-1152

Note: This ticket was not actually completed.

Authentication in Pact provider verification

The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during provider verification. At edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in Pact provider verification? See more details in Contract Testing: Architecture/Integration Questions (Question 4).
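
One possible shape for such a mechanism is a single function that maps an auth scheme to the headers a replayed request would need. The sketch below is only an illustration: the auth_headers function, the scheme names, the "JWT" header prefix, and the "sessionid" cookie name are assumptions, and wiring this into a Pact verifier is not shown.

```python
from typing import Dict


def auth_headers(scheme: str, credential: str) -> Dict[str, str]:
    """Return the request headers for one of the three auth schemes.

    Hypothetical helper: the header prefixes and cookie name are assumptions
    and would need to match each service's actual configuration.
    """
    if scheme == "jwt":
        return {"Authorization": f"JWT {credential}"}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {credential}"}
    if scheme == "session":
        return {"Cookie": f"sessionid={credential}"}
    raise ValueError(f"Unsupported auth scheme: {scheme}")


# Example: headers for a replayed interaction against a JWT-protected endpoint.
jwt_headers = auth_headers("jwt", "example-token")
```

How test credentials are issued for each scheme remains an open question that this sketch does not resolve.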





JWT Cookie Refresh sometimes not working

This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It appears to affect roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only).

The cause continues to be elusive. See ADR for more details.



https://openedx.atlassian.net/browse/ARCHBOM-1150

Note: This ticket just checked a particular hypothesis which didn't pan out.

Credentials for integrations

Getting Google, GitHub, etc. API keys or tokens can be slow

June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker).

2



Authorization

Managing admin access to Django services.

June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and permissions (VEM, LM, Ecommerce, etc.).

3



Pipeline

Security Merge Conflicts are hard to detect/resolve

Sep 18, 2020: A security PR caused a merge conflict in the pipeline. The error was hard to debug and fully stopped the pipeline.

1



Masquerading

Can be error-prone and can cause security issues

Because the masquerading code changes the request user, it has caused false positives in various security checks: the change looks no different from a malicious account takeover we've seen in the past (the session user changes during a request).



Applying masquerading is complicated, and people often override the request user to get the desired effect.
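
The false positive described above can be sketched in a few lines. The Request class and both functions are hypothetical stand-ins for illustration and are not the actual platform or security-check code; the sketch only shows why a naive "did the user change mid-request" check cannot tell masquerading from a takeover.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for a web request; not Django's HttpRequest."""
    user_id: int


def apply_masquerade_by_override(request: Request, target_user_id: int) -> None:
    """Masquerade by overwriting the request user, as described above."""
    request.user_id = target_user_id


def user_changed_during_request(start_user_id: int, request: Request) -> bool:
    """Naive security check: flag any change of user within one request."""
    return request.user_id != start_user_id


# A staff user (id 1) masquerades as a learner (id 2) within a single request.
request = Request(user_id=1)
start_user_id = request.user_id
apply_masquerade_by_override(request, target_user_id=2)

# The check sees only that the user changed, so it flags the request.
flagged = user_changed_during_request(start_user_id, request)
```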

1