RCA: NPM secret exposed in GitHub PR

Summary

Date

Feb 17, 2023

Team(s)

Enterprise Titans

Related Issues

https://github.com/openedx/tcril-engineering/issues/674

Summary

While fixing frontend-enterprise CI to allow Lerna to publish a PR to update versions, the NPM token used by Open edX to publish NPM package versions was exposed in a pull request.

Impact

NPM automatically revoked the token. Until it was rotated, pull requests that needed to publish changes to NPM were unable to do so.

Incident Duration

Duration was about 2 hours. However, because this happened on a Friday afternoon before a holiday weekend, no one was merging changes that needed to be published to NPM at that time. No actual work was affected.

Lesson Learned (internal)

Publicly-shareable lesson learned

What lessons would be relevant to the Open edX community? These are shared out in Public RCA blurbs from edx.org

Intro

  • An RCA is a way for us to reflect on what happened as a way of learning and improving.

  • Focus on processes not people: How can we design systems so it is harder to make mistakes? And when mistakes happen anyway, how can we make them less costly?

    • Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, the resources available, and the situation at hand.

Technical Cause and Resolution

While attempting to make frontend-enterprise NPM releases work with the required openedx/cla status check, the token used for publishing to NPM was leaked in an automatically generated PR. NPM automatically detected the exposed token and revoked it. Working across 2U and tCRIL, we rotated the NPM token and updated the GitHub organization secrets to reflect the new token.
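
One way to catch this class of leak before a pull request is opened is to scan any .npmrc content for a literal credential. The following is a minimal Python sketch, not the workflow used in frontend-enterprise: the function name find_literal_npm_tokens and both regular expressions are illustrative assumptions. It relies on npm's documented behavior of expanding ${VAR} placeholders in .npmrc files, so a placeholder is treated as safe to commit and any other value after _authToken= is flagged.

```python
import re

# Assumed shape of an npm auth line, e.g. //registry.npmjs.org/:_authToken=VALUE
AUTH_LINE = re.compile(r"_authToken\s*=\s*(?P<value>\S+)")
# A ${VAR} placeholder, which npm expands from the environment at run time.
ENV_PLACEHOLDER = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")


def find_literal_npm_tokens(text):
    """Return 1-based line numbers whose _authToken value is a literal secret."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = AUTH_LINE.search(line)
        if match is None:
            continue
        value = match.group("value").strip("\"'")
        # Placeholders carry no secret; anything else is treated as a leak.
        if not ENV_PLACEHOLDER.match(value):
            hits.append(number)
    return hits
```

A release job could run a check like this against every file staged for an automatically generated PR and abort when the result is non-empty.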

Timeline of Events

When adding events, please include a timestamp and a link out to relevant content if appropriate. Please also try to avoid making any individual feel blamed or shamed.

  • 2023/01/24

    • Bilal merges a PR into frontend-enterprise and the release pipeline fails due to a failing openedx/cla status check.

    • Adam helps triage and confirms that Bilal’s changes did ultimately publish to NPM at the expected versions.

      • Git tags were created.

      • No CHANGELOG entry or package.json files were committed to master.

  • 2023/02/13

    • Brian becomes the on-call engineer for enterprise.

    • Brian is informed that CI is broken in frontend-enterprise and is blocking the release of new code.

    • Investigation begins.

  • 2023/02/13 - 2023/02/17

    • In his spare moments, Brian makes a few attempts to fix CI in this repository.

    • Comes up with a mostly working solution.

      • Attempted approach is to have Lerna generate its changes on a temp branch, automatically open a PR with the changes and wait for CI to pass (including openedx/cla), before auto-merging into master (as briefly described here).

  • 2023/02/17

    • 2:56pm: Release automation opens PR exposing NPM token secret in .npmrc file.

    • 3:02pm: An issue is observed in the Paragon repo, automatically created by the edx-semantic-release bot to report an “Invalid NPM token.”

    • 3:15pm: Subsequent release attempts in frontend-enterprise fail with an NPM authentication error (status code 401).

    • 3:17pm: Adam realizes the NPM secret was leaked and escalates the issue as a security risk.

      • Adam explains how the NPM secret got leaked to Brian.

    • 3:22pm: Brian raises the issue in a Slack thread in #incident-warroom (2U Slack workspace).

    • 3:27pm: Adam suggests we may need to escalate to tCRIL since the token is defined as a GitHub organization secret for all repos within the Open edX GitHub org.

    • 3:27pm: Joe Mulloy searches through Keeper to find a matching NPM token.

      • Initially, it looks like the NPM token is a key per repo.

    • 3:35pm: Adam first raises the issue in #openedx-ask-tcril (link).

    • 3:55pm: Sarina responds to #openedx-ask-tcril Slack thread to suggest filing a “Systems Request” with tCRIL engineering.

    • 4:01pm: Adam files the “Systems Request” with tCRIL engineering:

      • Adam expands on the issue description to indicate urgency/severity and list (most) impacted JS repos.

    • 4:06pm: Sarina verifies she can edit the GitHub organization secrets, but can’t view them. Next steps would be to properly generate a new NPM token.

    • 4:32pm: Sarina drops an @-here in the tCRIL engineering Slack to draw the attention of others at tCRIL who might be more familiar with NPM; Feanil jumps on shortly thereafter and starts a DM with Adam.

    • 4:40pm: Feanil asks Adam whether he has access to the edx-semantic-release@edx.org Google group (he does not) and suggests paging 2U’s SRE instead. Adam shares a related message/screenshot from Joe in #incident-warroom.

    • 4:44pm: Feanil starts a new Slack DM with both Adam and Joe and asks again about access to the edx-semantic-release@edx.org Google group.

    • 4:48pm: David Joy tells the #incident-warroom Slack thread that he has access to the edx-semantic-release@edx.org Google group (he is one of 3 people with access: Adam Blackwell, David Joy, and Muhammad Nadeem Shahrad). David sees a few emails from NPM:

      • OTP for logging in

      • Helpful security alert about NPM token found in public repo (sent at 2:56pm).

    • 4:52pm: David reports that he can’t add anyone else to the edx-semantic-release@edx.org Google group; it seems that only Adam Blackwell can manage the group members.

    • 4:59pm: In a Slack DM with Adam, Joe, and David, Feanil asks if we can send him the OTP email received from NPM.

      • Feanil had the username and password for the NPM account, but needed the OTP to access it.

    • 5:35pm: David sends Feanil a valid OTP to use to authenticate into NPM account.

    • 5:39pm: Feanil reports that he set up 2FA on the NPM account “so it won't need the e-mail for login in the future.”

    • 5:39pm (cont.): Feanil updates the NPM token in the GitHub organization secret and communicates the issue and resolution to the Open edX community.

  • 2023/03/03

    • 11:46am: Leangseu realizes the frontend-app-learning release is blocked due to a mismatch in exports between @edx/frontend-component-footer and @edx/frontend-component-footer-edx, requiring a fix to be published to NPM for edx/frontend-component-header-edx.

    • 12:25pm: Once the PR for the fix was approved/merged, it failed to release to NPM due to an error indicating the token needs an OTP (one-time password) to authenticate.

    • 12:43pm: Adam realizes we would likely need to rotate our NPM token in the edx GitHub organization as well, in addition to the openedx GitHub organization token already rotated through tCRIL. He escalates in the #incident-warroom thread from earlier, tagging Joe, who helped previously. Discussion ensues.

    • 3:21pm: Joe tries rotating the edx NPM token in the GitHub settings and Adam retries the release (requires deleting a Git tag from the previous release attempt), but we see the same OTP error.

    • 3:34pm: Adam starts a group DM with Feanil at tCRIL and Joe at 2U to ask if Feanil could send us the new NPM token to add to the edx GitHub settings.

    • 3:43pm: Feanil realizes that the old edx token we’re using may be a publish token, not an automation token, so it would still require 2FA. Feanil generates a new automation token and shares it with Joe.

    • 3:48pm: Adam retries the NPM release once Joe updates the GitHub secret for the rotated NPM token and it succeeds. Adam informs stakeholders in Slack.
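
The 3:43pm observation about publish tokens versus automation tokens can be summarized as a small decision rule. The Python sketch below reflects one reading of npm's token and 2FA documentation and is an assumption, not a statement of how the edx account was actually configured: automation tokens skip the OTP prompt on publish, while publish tokens need an OTP when the account's 2FA mode covers writes. The function name publish_requires_otp and the string labels are hypothetical, and package-level 2FA settings are deliberately ignored.

```python
def publish_requires_otp(token_type, two_factor_mode):
    """Return True if `npm publish` is expected to prompt for an OTP.

    token_type: "automation" or "publish" (assumed labels).
    two_factor_mode: "disabled", "auth-only", or "auth-and-writes" (assumed labels).
    """
    if two_factor_mode not in ("disabled", "auth-only", "auth-and-writes"):
        raise ValueError(f"unknown 2FA mode: {two_factor_mode}")
    if token_type == "automation":
        # Automation tokens are intended for CI and bypass the OTP prompt.
        return False
    if token_type == "publish":
        # Publish tokens are subject to 2FA when it applies to writes.
        return two_factor_mode == "auth-and-writes"
    # Other token types (for example read-only) cannot publish at all.
    raise ValueError(f"token type cannot publish: {token_type}")
```

Under these assumptions, a CI secret holding a publish token starts failing as soon as 2FA for writes is enabled on the account, which matches the OTP error seen at 12:25pm.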

What factors contributed to the incident?

  • Why was the openedx/cla status check made required for all Open edX repos without advance warning?

  • Why was the release workflow in frontend-enterprise doing NPM authentication by writing the NPM token secret to a local .npmrc file?

  • Why/how did the leaked NPM token automatically get revoked within minutes?

  • Why was there uncertainty around which NPM account is associated with the leaked token?

    • Context: There are 4 users associated with the @edx NPM organization.

      • edx-old-org

      • edx-semantic-release

      • 2 personal users (Robert & Simon)

  • Why aren’t the best practices for escalating a security incident directly with tCRIL clear?

    • The documented security escalation path recommends raising an issue with the 2U/edX security team:

      • 2U/edX security wouldn’t really have been able to help here.

  • Why didn’t we proactively consider rotating the NPM token in the edX GitHub organization as well before it became a blocker for a release?
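
As an illustration of one alternative to writing the secret itself into a local .npmrc file, the file can hold an environment-variable placeholder that npm expands at run time. This is a minimal Python sketch under stated assumptions: the helper name write_placeholder_npmrc is hypothetical, the registry is assumed to be registry.npmjs.org, and the token is assumed to reach the job as an NPM_TOKEN environment variable.

```python
from pathlib import Path

# Assumption: the CI job exports NPM_TOKEN; npm substitutes ${NPM_TOKEN}
# when it reads the file, so the secret never touches the disk.
NPMRC_LINE = "//registry.npmjs.org/:_authToken=${NPM_TOKEN}\n"


def write_placeholder_npmrc(directory):
    """Write an .npmrc containing only the placeholder and return its path."""
    path = Path(directory) / ".npmrc"
    path.write_text(NPMRC_LINE, encoding="utf-8")
    return path
```

If a file written this way were swept into an automated commit, only the placeholder would be exposed, not the token value.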

What did we learn?

  • .

  • .

  • .

Action Items (optional)

  • .

  • .

  • .

Feedback

Really liked/disliked the RCA process? Share your feedback here!