|Category||Issue||Team anecdotes (with dates, with team)||Number of occurrences||Status|
|Jenkins and Test Infrastructure Issues||Networking problems (affecting PR Builds, e2e, GoCD)|
April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests.
March 12, 2020: Created DOS-701, which initiated a manual restart of Jenkins. Issues in the preceding days were seen on PRs, e2e and GoCD by multiple teams. (Build Jenkins)
|edx-platform master flaky failures||March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failures on master are noticed by others who rebase, rather than through alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness.||1|
|Authentication Clean-up||Messy JWT-related config, including JWT_ISSUER, causing pain.|
~Aug 2021: PyJWT 2 has a breaking change: jwt.decode now requires the algorithms argument. Determining how best to resolve this is challenging because the algorithm config exists both inside and outside the issuer settings (see spreadsheet).
~April 16, 2020: Start of about 2 weeks of devstack issues caused by a JWT_ISSUER-related config change, with no clarity on how to configure it properly.
~May 1, 2020: Ned spent time researching configs and documenting them in a spreadsheet as part of the Juniper release, and reached out for support.
Note: This work may be simplified when and if ecommerce is moved externally.
ARCHBOM-1202
Note: This ticket was not actually completed.
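To illustrate the PyJWT 2 change noted above, here is a minimal sketch of choosing the `algorithms` value for `jwt.decode` when the algorithm may be configured either inside an issuer entry or at the top level. The settings layout (`EXAMPLE_JWT_AUTH`, the per-issuer `ALGORITHM` key), the issuer URLs, and the helper names are assumptions made for illustration; they are not the actual edX configuration.

```python
try:
    import jwt  # PyJWT; optional here so the resolver can be read and tested without it
except ImportError:
    jwt = None

# Hypothetical settings layout, for illustration only.
EXAMPLE_JWT_AUTH = {
    "JWT_ALGORITHM": "HS256",  # algorithm configured outside the issuer settings
    "JWT_ISSUERS": [
        {"ISSUER": "https://lms.example.com/oauth2", "SECRET_KEY": "lms-secret"},
        {
            "ISSUER": "https://ecommerce.example.com/oauth2",
            "SECRET_KEY": "ecommerce-secret",
            "ALGORITHM": "HS512",  # algorithm configured inside the issuer settings
        },
    ],
}


def find_issuer(jwt_auth, issuer):
    """Return the settings entry for the given issuer, or None if it is unknown."""
    for entry in jwt_auth.get("JWT_ISSUERS", []):
        if entry.get("ISSUER") == issuer:
            return entry
    return None


def resolve_algorithms(jwt_auth, issuer):
    """Prefer an issuer-level algorithm, then fall back to the top-level one."""
    entry = find_issuer(jwt_auth, issuer)
    if entry and entry.get("ALGORITHM"):
        return [entry["ALGORITHM"]]
    top_level = jwt_auth.get("JWT_ALGORITHM")
    if not top_level:
        raise ValueError("No JWT algorithm configured for issuer %r" % issuer)
    return [top_level]


def decode_for_issuer(token, jwt_auth, issuer):
    """Decode a token, always passing the explicit algorithms list that PyJWT 2 requires."""
    if jwt is None:
        raise RuntimeError("PyJWT is not installed")
    entry = find_issuer(jwt_auth, issuer)
    if entry is None:
        raise ValueError("Unknown issuer %r" % issuer)
    return jwt.decode(
        token,
        entry["SECRET_KEY"],
        algorithms=resolve_algorithms(jwt_auth, issuer),
    )
```

Resolving the list in one place means every `jwt.decode` call site passes the same explicit value, regardless of where the algorithm happens to be configured.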
|LMS missing JwtAuthentication as default.|
Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed.
May 5, 2020: TNL reached out with a production bug where verified learners could use the LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via the default), but not JwtAuthentication.
Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In LMS, this means explicitly adding it to each endpoint where needed. For non-DRF endpoints, this may add complexity depending on the endpoint.
ARCHBOM-107
|JWT Cookie seems broken locally.|
May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue.
May 7, 2020: Alex reached out wondering how to get JWT roles from JWT cookies when calling endpoints from the browser. The problem/solution was non-obvious.
ARCHBOM-1218
|JWT Expired Signature||Nov 20, 2019: Payment app was having issues with Expired Signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved.||1|
ARCHBOM-1152
Note: This ticket was not actually completed.
|Authentication in pact Provider verification||The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during the provider verification. At edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in pact provider verification? See more details on Contract Testing: Architecture/Integration Questions (Question 4)|
|JWT Cookie Refresh sometimes not working|
This appears as error: "[frontend-auth] Access token is still null after successful refresh." It looks like this is roughly affecting 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only).
The cause continues to be elusive. See ADR for more details.
ARCHBOM-1150
Note: This ticket just checked a particular hypothesis which didn't pan out.
|Credentials for integrations||Getting Google, Github, etc. API keys or tokens can be slow||June 9, 2020: Had to file a ticket to get a Github access token from SRE (could use personal token in meantime); previously had to get a Google API service account created (was a blocker)||2|
|Authorization||Managing admin access to Django services.||June 10, 2020 SRE-84 - Several services have required SRE to manually manage admins and manage permissions (VEM, LM, Ecommerce, etc).||3|
|Pipeline||Security Merge Conflicts are hard to detect/resolve||Sep 18, 2020 - A security PR caused a merge conflict in the pipeline. The error was hard to debug and fully stopped the pipeline.||1|
|Masquerading||Can be error-prone and can cause security issues|
Because the masquerading code changes the request user, it has caused false positives in various security checks: it looks no different from a malicious account takeover we've seen in the past (session user changes during a request).
Applying masquerading is complicated and often people override the request user to get the desired effects from masquerading.
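
The false positive described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual edx-platform masquerading or security-check code: the `Request` class, the `masquerading_as` marker, and both helper functions are hypothetical names. It shows a "user changed during the request" check that only avoids flagging masquerading if the override is recorded explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical stand-in for a web request; not Django's HttpRequest."""
    user_id: int
    masquerading_as: Optional[int] = None


def apply_masquerade(request: Request, target_user_id: int) -> None:
    """Override the request user, recording that the change was intentional."""
    request.masquerading_as = target_user_id
    request.user_id = target_user_id


def user_changed_unexpectedly(user_id_at_start: int, request: Request) -> bool:
    """Flag a mid-request user change unless a recorded masquerade explains it."""
    if request.user_id == user_id_at_start:
        return False
    return request.masquerading_as != request.user_id


# A staff user (id 1) masquerades as a learner (id 2): the change is explained.
staff_request = Request(user_id=1)
apply_masquerade(staff_request, 2)
masquerade_flagged = user_changed_unexpectedly(1, staff_request)

# The request user is overridden directly, with no marker: the check cannot
# tell this apart from an account takeover, so it is flagged.
direct_override = Request(user_id=1)
direct_override.user_id = 3
override_flagged = user_changed_unexpectedly(1, direct_override)
```

In this sketch, a check that compared only `user_id` values would flag both cases, which mirrors the false positives described above. Overriding the request user directly, without any marker, is indistinguishable from the takeover pattern.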