...
Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status
---|---|---|---|---
Jenkins and Test Infrastructure Issues | Networking problems (affecting PR Builds, e2e, GoCD) | April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests. | 2 |
Jenkins and Test Infrastructure Issues | edx-platform master flaky failures | March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failures on master are noticed by others who rebase, rather than by alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness. | 1 |
Authentication Clean-up | Messy JWT related and JWT_ISSUER config causing pain. | ~Aug, 2021: pyjwt 2 has breaking change on Note: This work may be simplified when and if ecommerce is moved externally. | 2 | Note: This ticket was not actually completed.
Authentication Clean-up | LMS missing JwtAuthentication as default. | Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed. May 5, 2020: TNL reached out about a production bug where verified learners could use the LMS with their initial session before verifying their email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via default), but not JwtAuthentication. Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In the LMS, this means explicitly adding it to each endpoint where needed. For non-DRF endpoints, this may add complexity depending on the endpoint. | 2 |
Authentication Clean-up | JWT Cookie seems broken locally. | May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue. | 2 |
Authentication Clean-up | JWT Expired Signature | Nov 20, 2019: The Payment app was having an issue with Expired Signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved. | 1 | Note: This ticket was not actually completed.
Authentication Clean-up | Authentication in pact Provider verification | The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during the provider verification. In edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in pact provider verification? See more details in Contract Testing: Architecture/Integration Questions (Question 4). | |
Authentication Clean-up | JWT Cookie Refresh sometimes not working | This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It affects roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only). The cause remains elusive. See ADR for more details. | | Note: This ticket just checked a particular hypothesis which didn't pan out.
Credentials for integrations | Getting Google, GitHub, etc. API keys or tokens can be slow | June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker). | 2 |
Authorization | Managing admin access to Django services. | June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and manage permissions (VEM, LM, Ecommerce, etc.). | 3 |
Pipeline | Security merge conflicts are hard to detect/resolve | Sep 18, 2020: A security PR caused a merge conflict in the pipeline. The error was hard to debug and fully stopped the pipeline. | 1 |
Masquerading | Can be error-prone and can cause security issues | Because the masquerading code changes the request user, it caused false positives in various security checks: the change does not look different from a malicious account takeover we've seen in the past (session user changes during a request). Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading. | 1 |
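
The short-term fix in the "LMS missing JwtAuthentication as default" row adds JwtAuthentication endpoint by endpoint. For comparison, here is a minimal sketch of what a service-wide default could look like in Django REST Framework settings. It is an illustration only, not the LMS's actual configuration. It assumes the `edx-drf-extensions` package is installed and that the dotted paths below match the installed versions; the LMS may also need a different session class than the stock DRF one used here.

```python
# Sketch of a Django settings fragment (hypothetical, not the real LMS settings).
# Assumption: these dotted paths match the installed versions of
# Django REST Framework and edx-drf-extensions; verify them before use.
REST_FRAMEWORK = {
    # DRF tries these classes in order for every view that does not
    # declare its own authentication_classes.
    "DEFAULT_AUTHENTICATION_CLASSES": (
        # JWT first, so calls from MFEs that send a JWT are recognized.
        "edx_rest_framework_extensions.auth.jwt.authentication.JwtAuthentication",
        # Session auth kept so that existing browser sessions keep working.
        "rest_framework.authentication.SessionAuthentication",
    ),
}
```

Views that set `authentication_classes` explicitly would still override this default, and non-DRF endpoints are not covered by it.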