Category | Issue | Team anecdotes (with dates, with team) | Number of occurrences | Status
Jenkins and Test Infrastructure Issues | Networking problems (affecting PR Builds, e2e, GoCD)

April 6, 2020: Created INFRA-39 to restart Tools Jenkins because ecommerce e2e was seeing flaky timeout failures on different tests.
March 12, 2020: Created DOS-701, which initiated a manual restart of Jenkins. Issues in the preceding days were seen on PRs, e2e and GoCD by multiple teams. (Build Jenkins)

2
edx-platform master flaky failures | March 3, 2020: RCA: BOM-1161 - DOPrecation fallout and remediation mentions that failure on master is noticed by others who rebase, rather than by alerts, because master builds for edx-platform nearly always have a failed job due to some flakiness. | 1
Authentication Clean-up | Messy JWT-related and JWT_ISSUER config causing pain.

~Aug, 2021: pyjwt 2 has a breaking change in jwt.decode, which now requires the algorithms argument, and it is challenging to determine how best to resolve this because it exists in and out of the issuer settings (see spreadsheet).
~April 16, 2020: Start of about 2 weeks of devstack issues caused by a JWT_ISSUER config change, with no clarity on how to configure it properly.
~May 1, 2020: Ned spent time researching configs and documenting them in a spreadsheet as part of the Juniper release, and reached out for support.
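
As an illustration of the pyjwt 2 change noted above: with PyJWT 2 the call takes the form `jwt.decode(token, key, algorithms=[...])`, and the caller must name the algorithms it accepts. The sketch below uses only the Python standard library (not PyJWT itself) to show what such an allowlist guards against. The function names `encode_hs256` and `decode_with_allowlist` are hypothetical, and only HS256 is handled.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(raw: bytes) -> str:
    # JWTs use URL-safe base64 without trailing padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def encode_hs256(payload: dict, secret: bytes) -> str:
    """Build a minimal HS256-signed token (hypothetical helper, illustration only)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def decode_with_allowlist(token: str, secret: bytes, algorithms: list) -> dict:
    """Verify a token only if its header algorithm is explicitly allowed."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    # The token's own header must never pick the algorithm unchecked.
    if header.get("alg") not in algorithms:
        raise ValueError("algorithm not in the caller's allowlist")
    if header["alg"] != "HS256":
        raise ValueError("this sketch only verifies HS256")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    return json.loads(_b64url_decode(payload_b64))
```

Because the allowlist comes from the caller, every call site has to know which algorithm its issuer uses, which is why scattered or duplicated algorithm settings make the upgrade harder.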

Note: This work may be simplified when and if ecommerce is moved externally.

2

Jira: ARCHBOM-1202

Note: This ticket was not actually completed.

LMS missing JwtAuthentication as default.

Feb 16, 2021: AA-664 (see https://github.com/edx/edx-platform/pull/26582/files) would have been at least partially solved by having JwtAuthentication as a default class. It is unclear if the mail activation issue is a separate challenge that should be listed.

May 5, 2020: TNL reached out with a production bug where verified learners could use the LMS with their initial session before verifying email, but the same user/session could not make MFE calls to an LMS endpoint. The problem/solution was non-obvious, because the endpoint was using SessionAuthentication (via the default), but not JwtAuthentication.

Short term fix: Ensure JwtAuthentication is included on any endpoint called from an MFE. In LMS, this means explicitly adding it to each endpoint where needed. For non-DRF endpoints, this may add complexity depending on the endpoint.
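
As a rough sketch of what "JwtAuthentication as a default class" could look like, the fragment below sets Django REST Framework's `DEFAULT_AUTHENTICATION_CLASSES` setting. The dotted path for JwtAuthentication is assumed to be the edx-drf-extensions location, and the actual LMS defaults may use different session classes, so treat both entries as illustrative rather than as the real LMS configuration.

```python
# Hypothetical settings fragment: lists JwtAuthentication ahead of session auth
# so that MFE calls carrying a JWT cookie are authenticated by default.
REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": (
        # Assumed path (edx-drf-extensions); verify against the installed package.
        "edx_rest_framework_extensions.auth.jwt.authentication.JwtAuthentication",
        # DRF's stock session class; the LMS may substitute its own variant.
        "rest_framework.authentication.SessionAuthentication",
    ),
}
```

Views that set their own `authentication_classes` override this default, so any such endpoint called from an MFE would still need JwtAuthentication added explicitly.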

2

Jira: ARCHBOM-107


JWT Cookie seems broken locally.

May 13, 2020: Brandon and enterprise-titans reached out because JWT cookies weren't being used when calling endpoints from the browser outside of an MFE, and the problem was non-obvious. At first, it was also unclear that this was a local-only issue.
May 7, 2020: Alex reached out wondering how to get JWT roles from JWT cookies when calling endpoints from the browser. The problem/solution was non-obvious.

2

Jira: ARCHBOM-1218


JWT Expired Signature | Nov 20, 2019: Payment app was having an issue with Expired Signatures. We needed to add a hack to fix it, but all MFEs could run into this issue until it is resolved. | 1

Jira: ARCHBOM-1152

Note: This ticket was not actually completed.

Authentication in pact Provider verification

The second part of contract testing involves replaying the interactions in the contract against the provider. If the API endpoints are authenticated, a proper mechanism is needed to add authentication information during the provider verification. In edX, the APIs use a mix of JWT, Bearer, and Session auth. How can we have a one-size-fits-all way to add auth in pact provider verification? See more details in Contract Testing: Architecture/Integration Questions (Question 4).

JWT Cookie Refresh sometimes not working

This appears as the error: "[frontend-auth] Access token is still null after successful refresh." It appears to affect roughly 0.3% of requests. See https://one.newrelic.com/-/0YBR6m5v2wO (edX only).

The cause continues to be elusive. See ADR for more details.


Jira: ARCHBOM-1150

Note: This ticket just checked a particular hypothesis which didn't pan out.

Credentials for integrations | Getting Google, Github, etc. API keys or tokens can be slow | June 9, 2020: Had to file a ticket to get a Github access token from SRE (could use a personal token in the meantime); previously had to get a Google API service account created (was a blocker) | 2
Authorization | Managing admin access to Django services. | June 10, 2020: SRE-84 - Several services have required SRE to manually manage admins and manage permissions (VEM, LM, Ecommerce, etc). | 3
Pipeline | Security Merge Conflicts are hard to detect/resolve | Sep 18, 2020: There was a security PR that was causing a merge conflict in the pipeline; the error was hard to debug and fully stopped the pipeline. | 1
Masquerading | Can be error-prone and can cause security issues

Because the masquerading code changes the request user, it caused false positives in various security checks: the change looks no different from a malicious account takeover we've seen in the past (session user changes during a request).

Applying masquerading is complicated, and people often override the request user to get the desired effects from masquerading.

1