April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests.
March 12, 2020: Created DOS-701, which initiated a manual restart of Jenkins. In the preceding days, multiple teams saw issues on PRs, e2e, and GoCD. (Build Jenkins)
Messy JWT-related and JWT_ISSUER config causing pain.
~Aug, 2021: pyjwt 2 has a breaking change in jwt.decode, which now requires the algorithms argument. Determining how best to resolve this is challenging because the algorithm is configured both inside and outside the issuer settings (see spreadsheet).
~April 16, 2020: Start of about 2 weeks of devstack issues caused by a JWT_ISSUER-related config change, with no clarity on how to configure it properly.
~May 1, 2020: Ned spent time researching the configs and documenting them in a spreadsheet as part of the Juniper release, and reached out for support.
Note: This work may be simplified when and if ecommerce is moved externally.
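For the pyjwt 2 change noted above, call sites need an explicit allow-list, along the lines of jwt.decode(token, key, algorithms=["HS256"]). The sketch below is not pyjwt itself: it is a standard-library-only illustration, under the assumption of an HS256-signed token, of what a mandatory algorithms allow-list enforces. The helper names (encode_hs256, decode) and the example issuer value are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def encode_hs256(payload: dict, key: bytes) -> str:
    """Hypothetical helper: build an HS256-signed token for the demo."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def decode(token: str, key: bytes, algorithms: list) -> dict:
    """Verify a token; `algorithms` is a mandatory allow-list (no default)."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    # Reject any algorithm the caller did not explicitly allow.
    if header.get("alg") not in algorithms:
        raise ValueError("algorithm not allowed")
    # This sketch only implements HS256 verification.
    if header["alg"] != "HS256":
        raise ValueError("only HS256 is implemented in this sketch")
    expected = hmac.new(
        key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    return json.loads(_b64url_decode(payload_b64))


token = encode_hs256({"iss": "example-issuer"}, b"secret")
print(decode(token, b"secret", algorithms=["HS256"]))
```

Because the allow-list has no default, every caller has to state which algorithm it expects, which is why the setting needs a single, well-understood home in the configuration.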
May 5, 2020: TNL reached out with a production bug where verified learners could use the LMS with their initial session before verifying their email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (by default), but not JwtAuthentication.
Short-term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In the LMS, this means explicitly adding it to each endpoint where needed. For non-DRF endpoints, this may add complexity, depending on the endpoint.
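One way to keep the short-term fix from regressing is to audit which views declare JWT authentication. The following is a minimal, framework-free sketch: the two authentication classes are stand-ins so the example runs without Django (in a real service they would be DRF's SessionAuthentication and the JwtAuthentication class from edx-drf-extensions), and the view names and the views_missing_jwt_auth helper are hypothetical.

```python
# Stand-ins so the sketch runs without Django or DRF installed.
class SessionAuthentication:
    pass


class JwtAuthentication:
    pass


class EnrollmentStatusView:  # hypothetical view called from an MFE
    # JWT auth is listed explicitly alongside session auth.
    authentication_classes = (JwtAuthentication, SessionAuthentication)


class LegacyStatusView:  # hypothetical view that only declares session auth
    authentication_classes = (SessionAuthentication,)


def views_missing_jwt_auth(views):
    """Return the names of views whose authentication_classes lack JwtAuthentication."""
    return [
        view.__name__
        for view in views
        if JwtAuthentication not in getattr(view, "authentication_classes", ())
    ]


print(views_missing_jwt_auth([EnrollmentStatusView, LegacyStatusView]))
# prints ['LegacyStatusView']
```

A check like this only covers views that declare authentication_classes on the class; non-DRF endpoints would need their own handling.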
May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue.
May 7, 2020: Alex reached out wondering how to get JWT roles from JWT cookies when calling endpoints from the browser. The problem/solution was non-obvious.
The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during provider verification. At edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in Pact provider verification? See more details in Contract Testing: Architecture/Integration Questions (Question 4).
JWT Cookie Refresh sometimes not working
This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It appears to affect roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only).
Note: This ticket just checked a particular hypothesis which didn't pan out.
Credentials for integrations
Getting Google, GitHub, etc. API keys or tokens can be slow
June 9, 2020: Had to file a ticket to get a GitHub access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker).
Managing admin access to Django services.
June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and permissions (VEM, LM, Ecommerce, etc.).
Security Merge Conflicts are hard to detect/resolve
Sep 18, 2020 - A security PR caused a merge conflict in the pipeline. The error was hard to debug and fully stopped the pipeline.
Masquerading can be error-prone and can cause security issues
Because the masquerading code changes the request user, it caused false positives in various security checks: it looks no different from a malicious account takeover we've seen in the past (the session user changes during a request).
Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading.