For those not familiar with the RCA format: The point is to find systematic flaws that led to the incident being discussed. Blame should be assigned to processes, not people.

...

  • 2024-11-11

    • Dave tags Kelly to give a heads up that the v2 forums code PR is getting close to merging in edx-platform.

    • Dave starts a Slack thread in #ask-2u to ask if 2U has switched over to utf8mb4 for database connections. This was motivated by the fact that the forums PR would create database tables for the new MySQL storage backend. If the config was not changed before deployment, the tables would not support storing emojis.

  • 2024-11-13

    • After some investigation, Robert verifies that the encoding of the database connection for the LMS and Studio is still utf8 (which means it's utf8mb3).

  • 2024-11-18

    • Forums PR gets final approvals.

  • 2024-11-19

    • 2U's staging environment Elasticsearch is not working, making it impossible to properly test forums in that environment. Asad informs the channel that SRE has reached out to AWS support.

  • 2024-11-21

    • Robert files a new ticket with 2U SRE. Merging the PR to master is delayed.

      • The rationale was that the configuration change would be straightforward (it requires no data migration, Tutor has connected this way by default since Redwood, and most sites that we know of also connect in this manner), and that making the change now would save 2U from having to spend more time modifying those tables in the future.

  • 2024-11-22

    • Alex notifies the Slack thread that SRE is investigating the connection encoding issue.

    • The 2U staging instance Elasticsearch is fixed.

  • 2024-11-25

    • More discussion of the scope of required changes, and whether changing this configuration falls under SRE or app owners.

  • 2024-11-26

    • Ahtisham and Diana pick up the configuration work, but we agree that deployment shouldn't happen until after the long U.S. Thanksgiving holiday.

  • 2024-12-03

    • 2U infrastructure was updated to use utf8mb4, removing the last blocker to merge.

    • Diana pauses the release pipeline. The Infinity will test.

    • 11:01 EST: Dave merges the initial PR into master.

  • 2024-12-04

    • 08:43 EST: Ahtisham reports that the staging env is broken. 2U decides that they can keep their release pipeline paused, and we agree to fix forward.

    • 09:42 EST: Regis’s fix for the above issue is merged.

  • 2024-12-05

    • 10:34 EST: Diana reports that they needed to add the Django settings FORUM_MONGODB_DATABASE and FORUM_MONGODB_CLIENT_PARAMETERS to get things running on stage. This was not originally intended. There is some issue with getting MongoDB authentication working properly. Diana launches a conversation with SRE to figure this out.

  • 2024-12-06

    • 09:09 EST: Diana reports that “some of the experienced SRE engineers over in Lahore seem to have resolved our issue overnight, so i think we might be able to get this out to edge and prod today”.

    • 10:40 EST: Diana reports new error on production after this rollout: pymongo.errors.OperationFailure: An existing index has the same name as the requested index. When index names are not specified, they are auto generated and can cause conflicts. Please refer to our documentation. Requested index: { v: 2, key: { comment_thread_id: 1, sk: 1 }, name: "comment_thread_id_1_sk_1", sparse: true }, existing index: { v: 1, key: { comment_thread_id: 1.0, sk: 1.0 }, name: "comment_thread_id_1_sk_1", background: true }, full error: {'ok': 0.0, 'errmsg': 'An existing index has the same name as the requested index. When index names are not specified, they are auto generated and can cause conflicts. Please refer to our documentation. Requested index: { v: 2, key: { comment_thread_id: 1, sk: 1 }, name: "comment_thread_id_1_sk_1", sparse: true }, existing index: { v: 1, key: { comment_thread_id: 1.0, sk: 1.0 }, name: "comment_thread_id_1_sk_1", background: true }', 'code': 86, 'codeName': 'IndexKeySpecsConflict', '$clusterTime': {'clusterTime': Timestamp(1733499539, 5), 'signature': {'hash': b'g6\x9d(\x9a6\xe6R\xde\x12\xcc\xd9)\xc4\xef!\x03\xc5SH', 'keyId': 7405177900736970758}}, 'operationTime': Timestamp(1733499539, 5)}

      • (the suspicion is that this didn’t happen in stage because the indexes were never properly created there, see more in “How did this happen?”)

    • 10:52 EST: Diana, Ahtisham, and Dave all agree that we should roll back the prod environment, but the rollback had already been done.

    • 11:12 EST: Diana creates a revert PR to remove v2 forums integration from edx-platform. Dave reviews and Diana merges at 11:19 EST.

  • 2024-12-09

    • 10:19 EST: Regis creates a PR to allow the v2 forum functionality to be completely turned off with an additional Django setting. Dave reviews and merges at 10:48 EST.

  • 2024-12-11

    • 05:27 EST: Asad signs off on the approach of the above PR. Ahtisham agrees to move forward with it the following day.

  • 2024-12-12

    • 02:18 EST: Ahtisham merges the PR to reapply v2 forums (i.e. undo the 2024-12-06 revert, but with the new PR from 2024-12-09 that has a global DISABLE_FORUM_V2 Django setting).

    • 03:55 EST: Ahtisham reports an error around endorsing a response. Regis creates a PR to fix.

    • 08:59 EST: Ahtisham reports that v2 forums code is live on edX prod, with DISABLE_FORUM_V2 = True.
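
The utf8 versus utf8mb4 concern raised on 2024-11-11 comes down to bytes per character: MySQL's utf8 (utf8mb3) stores at most 3 bytes per character, while emojis need 4 bytes in UTF-8. The sketch below illustrates that limit in plain Python. The helper names, the sample post bodies, and the DATABASES fragment (including the database name) are hypothetical and only show the general shape of such a setting in Django; this is not 2U's actual configuration, which is not reproduced in this document.

```python
def max_utf8_bytes_per_char(text: str) -> int:
    """Return the largest number of UTF-8 bytes used by any single character."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes (the utf8mb3 limit)."""
    return max_utf8_bytes_per_char(text) <= 3


# Hypothetical forum post bodies, for illustration only.
plain_post = "Great answer, thanks!"
emoji_post = "Great answer, thanks! \U0001F389"  # ends with a party popper emoji

print(fits_in_utf8mb3(plain_post))  # True
print(fits_in_utf8mb3(emoji_post))  # False

# Assumed shape of a Django MySQL connection that requests utf8mb4.
# The database name is a placeholder.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```

A check like this only shows which text would not fit. Whether a given table can actually store it depends on the character set the table was created with, which is why the connection setting had to change before the new forum tables were created.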

How did this happen?

 

Note
  • We are filling this out async.

    • If there are any invalid assumptions or statements in any of the bullets below, simply add your own bullet to help clarify.

    • Please add “[NAME]” as a prefix to your bullet.

...

  • [Robert] Why was Forums v2 included in Sumac so long after the branches were cut?

    • [Robert] See https://discuss.openedx.org/t/btr-forum-sumac-backport-discussion/14332 (needs more notes). This decision certainly impacted this situation.

    • [Dave] The thinking was that it would be relatively safe to do, and that it was a worthwhile tradeoff of stability vs. the increased maintenance burden of keeping the old code around for another release cycle. The decision happened on 11/13, and I believe it was being cherry-picked onto the Sumac sandbox even before that for testing purposes. At the time, we didn’t imagine it would take another month to land on master.

    • [Dave] Another aspect was that there were opposing views on how this should be rolled out. Both Edly and Axim agreed that the Build Test Working Group should make the final call with regards to Sumac and how it would be configured for rollout to the community.

    • [Dave] At the same time, the code would have merged to master relatively soon either way (that is not under BTR control). It is true that wanting to have it in Sumac did push us to want to merge to master sooner, but if that code had merged three weeks later, it would likely have still run into the same issues. As it was, the PR to master was approved on 11/18 and merged on 12/3.

    • [Régis] Because it is the Build/Test/Release working group policy not to include features in a release which are not merged first in master.

  • [Robert] Why wasn't Forums v2 merged into master earlier?

    • [Robert] It would impact 2U as soon as it was merged.

      • [Dave] Why would it impact 2U as soon as it was merged?

        • [Dave] 2U deploys directly off of master, meaning that there is no way for 2U to decouple their deployment from the latest CI state of master, except to pause the pipeline altogether. I think this point is important because 2U would probably have opted to skip this change altogether, or at least rescheduled it to a more opportune time.

          • [Dave] Why does 2U deploy off of master?

            • [Dave] It’s easy to deploy the latest changes.

            • [Dave] Changing to a different process that decouples 2U deployment from master would take time to implement, and 2U is short on staffing.

            • [Robert] These points are true. However, this topic is also complicated by the fact that there are benefits to the current process that are not obvious how to replicate. This can be discussed outside the context of this RCA at some point.

    • [Robert] It was waiting on 2U to fully test the disabled state.

      • [Robert] Why wasn't Forums v2 better tested in a disabled state before reaching 2U? (Or was it, and were these 2U-unique issues?)

        • [Dave] We had both types of issues:

          • [Dave] The forums v2 code connected to MongoDB directly to determine which course certain API endpoints affected, because there is not enough information in the request itself to make that determination (e.g. “what course does this thread belong to, so I can decide whether to use the old API or the new one?”). This was an issue that we ran into partway through and didn’t think through the implications of. The intent had always been to make it so that it would be off by default for site operators running on the latest master, and not require new configuration, but this forced edx-platform to at least be aware of the MongoDB credentials, even if it was mostly routing to the Ruby service. This was later addressed with a blanket DISABLE_FORUM_V2 Django setting that would short circuit anything but the legacy interface.

            • [Régis] The disabled state was tested before, in a Tutor context. In such a Tutor installation, the MongoDB connection defaults to the “mongodb” host, which is correct for a locally running MongoDB instance inside a Docker container. Hence we did not detect any connection error.

            • [Robert] Thanks for this additional context.

          • [Dave] The v2 forum MongoDB code tried to auto-generate indexes when indexes already existed in the prod environment. (This was before the DISABLE_FORUM_V2 setting existed.) This issue did not come up in testing, either on other MongoDB instances during Edly’s testing or in 2U’s staging environment. There’s speculation that it went through the staging environment without problems because those indexes were never properly created there in the first place. Index creation has since been moved to a management command.

            • It should be noted that the creation of MongoDB indexes at runtime (instead of, say, in a Django migration) was ported from cs_comments_service: the goal was to preserve the existing behaviour. We still do not know which difference between the Ruby mongo library and pymongo is causing this difference in behaviour.

          • (TODO: there was something else, need to look up what)

        • [Régis] The issues that were detected were specific to the 2U environment (see my comments above).

        • [Régis] Given the importance of the change, why wasn’t it tested locally in an environment close to 2U’s prod before going to staging or production?

          • [Robert] Possibly the sense of urgency, and the hope that disabling would just work. Not sure.

      • [Robert] Why wasn't Forums v2 (under development) merged behind a release toggle months ago in a disabled state?

        • [Robert] Was this to preserve a default state of enabled, or for some other reason?

          • [Dave] The default state of edx-platform was/is to put it into legacy mode (i.e. still calling the Ruby service). Tutor overrides this in its configuration, so operators running Tutor will get a default Sumac experience that does not call the Ruby service.

          • [Robert] This is helpful context, thanks.

        • [Dave] My understanding was that 2U did not have time to be operationally involved at all. My initial inclination was to do a smaller, iterative rollout that went endpoint by endpoint (similar to other refactoring work we’ve done in the past), but the fear was that this would require too much back and forth support from 2U with respect to monitoring and debugging. The hope was that the validation process would be simpler for 2U with this approach, i.e. turning on something that had already gone through the debugging process and hopefully doing one validation.

        • [Régis] Would it have helped? My understanding is that if we had pushed an unfinished (i.e., broken) forum v2 to master, then it wouldn’t have been tested.

          • [Robert] The main point was around timing. If we had caught issues around the toggle long before there was this Sumac deadline, there would have been more breathing room for resolving it. This is all theory of course, since maybe there would have been other issues.

            • [Régis] Was timing an issue? The migration was announced in April and the DEPR was made in October. What’s the process to follow up on such announcements?

        • [Robert] Thank you Dave and Régis. These are helpful comments and questions for developing further Why-and-How questions for learning and improving from this, especially given that this was the easy part of an imminent migration. I believe 2U needs to and will discuss this internally (post-holidays). In the meantime, I personally need to withdraw myself. I appreciate your efforts and concern. Thank you. FYI: Asad Azam, Kelly Buchanan.
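
The short-circuit behaviour described above for DISABLE_FORUM_V2 can be sketched roughly as follows. This is a minimal illustration, not the edx-platform implementation: only the DISABLE_FORUM_V2 name comes from this document, and ForumSettings, choose_backend, and the lookup callables are hypothetical. The point it shows is that when the global switch is on, the per-course lookup (which in the real system needed MongoDB) is never called, so missing or incorrect MongoDB credentials cannot break the legacy path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ForumSettings:
    # Mirrors the DISABLE_FORUM_V2 Django setting named in this RCA.
    DISABLE_FORUM_V2: bool


def choose_backend(
    settings: ForumSettings,
    thread_id: str,
    lookup_course_uses_v2: Callable[[str], bool],
) -> str:
    """Return "legacy" or "v2" for a request about the given thread."""
    if settings.DISABLE_FORUM_V2:
        # Short circuit: skip the lookup entirely, so no datastore is touched.
        return "legacy"
    # Only when v2 is allowed do we ask which backend this thread's course uses.
    return "v2" if lookup_course_uses_v2(thread_id) else "legacy"


def unreachable_mongo_lookup(thread_id: str) -> bool:
    """Hypothetical lookup standing in for a MongoDB that cannot be reached."""
    raise ConnectionError("MongoDB credentials not configured")


# With the switch on, the failing lookup is never invoked.
print(choose_backend(ForumSettings(DISABLE_FORUM_V2=True), "thread-1", unreachable_mongo_lookup))
```

With DISABLE_FORUM_V2 set to False, the same call would raise ConnectionError, which is the kind of hidden dependency an environment with a default local MongoDB host would not surface.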

How could we have prevented it?

...