For those not familiar with the RCA format: The point is to find systematic flaws that led to the incident being discussed. Blame should be assigned to processes, not people. It’s us against the mistakes that got us here.
Summary of Issue
There were a number of issues we encountered while trying to merge the new forums backend into edx-platform.
The process was disruptive to 2U because of the operational effort needed to deploy, test, and debug the new forums code (running in legacy-mode) on their infrastructure (it had been tested and deployed using Tutor prior to this point).
The process was disruptive to the Sumac release process, because of the coupling of the master branch and 2U's release pipeline. This put us in the position where we were cherry-picking the forums code onto the Sumac test sandbox for weeks, without landing it on either master or the sumac release branch.
...
2024-11-11
Dave tags Kelly to give a heads up that the v2 forums code PR is getting close to merging in edx-platform.
Dave starts a Slack thread in #ask-2u to ask if 2U has switched over to utf8mb4 for database connections. This was motivated by the fact that the forums PR would create database tables for the new MySQL storage backend. If the config was not changed before deployment, the tables would not support storing emojis.
2024-11-13
After some investigation, Robert verifies that the encoding of the database connection for the LMS and Studio is still utf8 (which means it's utf8mb3).
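To illustrate why the connection encoding matters: MySQL's utf8 (utf8mb3) stores at most three bytes per character, while emojis take four bytes in UTF-8. The sketch below is only an illustration. The helper names and the settings values are assumptions, not code or configuration from edx-platform or 2U. The settings fragment shows the Django-style `OPTIONS["charset"]` key that selects the connection character set for the MySQL backend.

```python
def max_utf8_bytes_per_char(text: str) -> int:
    """Return the largest UTF-8 byte length of any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character fits in MySQL's 3-byte utf8 (utf8mb3) charset."""
    return max_utf8_bytes_per_char(text) <= 3


# Illustrative Django-style settings fragment. All values are hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",  # hypothetical database name
        # Connect with the 4-byte charset so emoji round-trip correctly.
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```

A forum post containing only accented or CJK characters passes `fits_in_utf8mb3`, while one containing an emoji does not, which is the case the new MySQL tables needed to handle.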
2024-11-18
Forums PR gets final approvals.
2024-11-19
2U's staging environment Elasticsearch is not working, making it impossible to properly test forums in that environment. Asad informs the channel that SRE has reached out to AWS support.
2024-11-21
Robert makes a new ticket to 2U SRE. Merging the PR to master is delayed.
The rationale for this was that the configuration change would be straightforward (it requires no data migration, Tutor has connected this way by default since Redwood, and most sites that we know of also connect in this manner), and that making the configuration change first would save 2U from having to spend more time modifying those tables in the future.
2024-11-22
Alex notifies the Slack thread that SRE is investigating the connection encoding issue.
The 2U staging instance Elasticsearch is fixed.
2024-11-25
More discussion of the scope of required changes, and whether changing this configuration falls under SRE or app owners.
2024-11-26
Ahtisham and Diana pick up the configuration work, but we agree that deployment shouldn't happen until after the long U.S. Thanksgiving holiday.
2024-12-03
2U infrastructure was updated to use utf8mb4, removing the last blocker to merge.
Diana pauses the release pipeline. The Infinity will test.
11:01 EST: Dave merges the initial PR into master.
2024-12-04
08:43 EST: Ahtisham reports that the staging env is broken. We decide to fix forward.
How did this happen?
...
[Robert] Why was Forums v2 included in Sumac so long after the branches were cut?
[Robert] See https://discuss.openedx.org/t/btr-forum-sumac-backport-discussion/14332 (needs more notes). This decision certainly impacted this situation.
[Dave] The thinking was that it would be relatively safe to do, and that it was a worthwhile tradeoff of stability vs. the increased maintenance burden of keeping the old code around for another release cycle. The decision happened on 11/13, and I believe it was being cherry-picked onto the Sumac sandbox even before that for testing purposes. At the time, we didn’t imagine it would take another month to land on master.
[Dave] Another aspect was that there were opposing views on how this should be rolled out. Both Edly and Axim agreed that the BTR should make the final call with regards to Sumac and how it would be configured for rollout.
[Dave] At the same time, the code would have merged to master relatively soon either way (that is not under BTR control). It is true that wanting to have it in Sumac did push us to want to merge to master sooner, but if that code had merged three weeks later, it would likely have still run into the same issues. As it was, the PR to master was approved on 11/18 and merged on 12/3.
[Robert] Why wasn't Forums v2 merged into master earlier?
[Robert] It would impact 2U as soon as it was merged.
[Dave] Why would it impact 2U as soon as it was merged?
[Dave] 2U deploys directly off of master, meaning that there is no way for 2U to decouple their deployment from the latest CI state of master, except to pause the pipeline altogether. I think this point is important because 2U would probably have opted to skip this change altogether, or at least rescheduled it to a more opportune time.
[Dave] Why does 2U deploy off of master?
[Dave] It’s easy to deploy the latest changes.
[Dave] Changing to a different process that decouples 2U deployment from master would take time to implement, and 2U is short on staffing.
[Robert] It was waiting on 2U to fully test the disabled state.
[Robert] Why wasn't Forums v2 better tested in a disabled state before reaching 2U? (Or was it, and were these 2U-unique issues?)
[Robert] Why wasn't Forums v2 (under development) merged behind a release toggle months ago in a disabled state?
[Robert] Was this to save the default-state of enabled? Other?
[Dave] My understanding was that 2U did not have time to be operationally involved at all. My initial inclination was to do a smaller, iterative rollout that went endpoint by endpoint (similar to other refactoring work we’ve done in the past), but the fear was that this would require too much back-and-forth support from 2U with respect to monitoring and debugging. The hope was that the validation process would be simpler for 2U with this approach, i.e., turning on something that had already gone through the debugging process and hopefully doing one validation.
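The release-toggle idea raised in the questions above can be sketched as follows. This is a minimal illustration under stated assumptions: the toggle name, the in-memory toggle store, and the two backend functions are all hypothetical, and this is not the toggle mechanism edx-platform actually uses.

```python
from typing import Dict

# Hypothetical toggle store. A real deployment would read this from settings
# or a feature-flag service rather than a module-level dict.
TOGGLES: Dict[str, bool] = {"forum_v2.enabled": False}  # merged in a disabled state


def legacy_get_thread(thread_id: str) -> dict:
    """Stand-in for the existing forums code path."""
    return {"id": thread_id, "backend": "legacy"}


def v2_get_thread(thread_id: str) -> dict:
    """Stand-in for the new forums code path."""
    return {"id": thread_id, "backend": "v2"}


def get_thread(thread_id: str) -> dict:
    """Dispatch to the new code path only when the toggle is on."""
    if TOGGLES.get("forum_v2.enabled", False):
        return v2_get_thread(thread_id)
    return legacy_get_thread(thread_id)
```

With the toggle off, merging the new code changes nothing for operators who have not opted in. An endpoint-by-endpoint rollout would use one such toggle per endpoint instead of a single switch.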
How could we have prevented it?
...