For those not familiar with the RCA format: The point is to find systematic flaws that led to the incident being discussed. Blame should be assigned to processes, not people.
...
2024-11-11
Dave tags Kelly to give a heads up that the v2 forums code PR is getting close to merging in edx-platform.
Dave starts a Slack thread in #ask-2u to ask if 2U has switched over to utf8mb4 for database connections. This was motivated by the fact that the forums PR would create database tables for the new MySQL storage backend. If the config was not changed before deployment, the tables would not support storing emojis.
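To make the emoji concern concrete: MySQL's `utf8` is an alias for `utf8mb3`, which stores at most three bytes per character, while most emojis need four bytes in UTF-8. The sketch below checks this with the standard library and shows the kind of Django `DATABASES` option that selects `utf8mb4` for connections. The helper name `fits_in_utf8mb3` and the database name are illustrative assumptions, not code from edx-platform or 2U's configuration.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character encodes to at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# Hypothetical Django settings fragment: the "charset" option asks the
# MySQL driver to use utf8mb4 for the connection.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",  # placeholder name, not a real database
        "OPTIONS": {"charset": "utf8mb4"},
    }
}

print(fits_in_utf8mb3("Great post!"))            # plain ASCII text
print(fits_in_utf8mb3("Great post! \U0001F389"))  # text ending in an emoji
```

Only the connection setting is shown here. Whether existing tables also need conversion depends on how they were created.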
2024-11-13
After some investigation, Robert verifies that the encoding of the database connection for the LMS and Studio is still utf8 (which means it's utf8mb3).
2024-11-18
Forums PR gets final approvals.
2024-11-19
Elasticsearch in 2U's staging environment is not working, making it impossible to properly test forums in that environment. Asad informs the channel that SRE has reached out to AWS support.
2024-11-21
Robert files a new ticket with 2U SRE. Merging the PR to master is delayed.
The rationale for the delay was that the configuration change would be straightforward (it requires no data migration, Tutor has connected this way by default since Redwood, and most sites that we know of also connect in this manner), and that making the change first would save 2U from having to spend more time modifying those tables in the future.
2024-11-22
Alex notifies the Slack thread that SRE is investigating the connection encoding issue.
The 2U staging instance Elasticsearch is fixed.
2024-11-25
More discussion of the scope of required changes, and whether changing this configuration falls under SRE or app owners.
2024-11-26
Ahtisham and Diana pick up the configuration work, but we agree that deployment shouldn't happen until after the long U.S. Thanksgiving holiday.
2024-12-03
2U infrastructure was updated to use utf8mb4, removing the last blocker to merge. Diana pauses the release pipeline. The Infinity will test.
11:01 EST: Dave merges the initial PR into master.
2024-12-04
08:43 EST: Ahtisham reports that the staging env is broken. We decide to fix forward. 2U decides that they can keep their release pipeline paused while we do so.
09:42 EST: Regis’s fix for the above issue is merged.
2024-12-05
10:34 EST: Diana reports that they needed to add the Django settings FORUM_MONGODB_DATABASE and FORUM_MONGODB_CLIENT_PARAMETERS to get things running on stage. This was not originally intended. There is some issue with getting MongoDB authentication working properly. Diana launches a conversation with SRE to figure this out.
2024-12-06
09:09 EST: Diana reports that “some of the experienced SRE engineers over in Lahore seem to have resolved our issue overnight, so i think we might be able to get this out to edge and prod today”.
10:40 EST: Diana reports a new error on production after this rollout:
pymongo.errors.OperationFailure: An existing index has the same name as the requested index. When index names are not specified, they are auto generated and can cause conflicts. Please refer to our documentation. Requested index: { v: 2, key: { comment_thread_id: 1, sk: 1 }, name: "comment_thread_id_1_sk_1", sparse: true }, existing index: { v: 1, key: { comment_thread_id: 1.0, sk: 1.0 }, name: "comment_thread_id_1_sk_1", background: true }, full error: {'ok': 0.0, 'errmsg': 'An existing index has the same name as the requested index. When index names are not specified, they are auto generated and can cause conflicts. Please refer to our documentation. Requested index: { v: 2, key: { comment_thread_id: 1, sk: 1 }, name: "comment_thread_id_1_sk_1", sparse: true }, existing index: { v: 1, key: { comment_thread_id: 1.0, sk: 1.0 }, name: "comment_thread_id_1_sk_1", background: true }', 'code': 86, 'codeName': 'IndexKeySpecsConflict', '$clusterTime': {'clusterTime': Timestamp(1733499539, 5), 'signature': {'hash': b'g6\x9d(\x9a6\xe6R\xde\x12\xcc\xd9)\xc4\xef!\x03\xc5SH', 'keyId': 7405177900736970758}}, 'operationTime': Timestamp(1733499539, 5)}
(the suspicion is that this didn’t happen in stage because the indexes were never properly created there, see more in “How did this happen?”)
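The error above arises when a requested index gets the same auto-generated name as an existing index but with different options. Here the existing index has `background: true` and no `sparse`, while the requested one has `sparse: true`. A minimal sketch of that comparison follows, in plain Python with no MongoDB connection. The function names and the set of compared options are assumptions for illustration, not the server's actual rules or the forum code's logic.

```python
# Options compared here are an illustrative subset, chosen as an assumption.
COMPARED_OPTIONS = ("sparse", "unique")


def default_index_name(keys):
    """Build the auto-generated name style seen in the error, e.g. 'a_1_b_1'."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)


def find_index_conflict(existing, requested_keys, **requested_options):
    """Return a description of the first conflicting option, or None.

    `existing` maps an index name to a dict of that index's options.
    """
    name = requested_options.get("name") or default_index_name(requested_keys)
    current = existing.get(name)
    if current is None:
        return None  # no index with this name yet, so nothing to conflict with
    for option in COMPARED_OPTIONS:
        have = bool(current.get(option, False))
        want = bool(requested_options.get(option, False))
        if have != want:
            return f"index {name!r} exists with {option}={have}, requested {option}={want}"
    return None


# Index state modelled on the error message above.
existing = {"comment_thread_id_1_sk_1": {"background": True}}
keys = [("comment_thread_id", 1), ("sk", 1)]
print(find_index_conflict(existing, keys, sparse=True))
```

A check like this, run against a copy of production index metadata before rollout, is one way such a mismatch could surface earlier than the first production deploy.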
10:52 EST: Diana, Ahtisham, and Dave all agree that we should roll back the prod environment, but this had already been done.
11:12 EST: Diana creates a revert PR to remove v2 forums integration from edx-platform. Dave reviews and Diana merges at 11:19 EST.
2024-12-09
10:19 EST: Regis creates a PR to allow the v2 forum functionality to be completely turned off with an additional Django setting. Dave reviews and merges at 10:48 EST.
2024-12-11
05:27 EST: Asad signs off on the approach of the above PR. Ahtisham agrees to move forward with it the following day.
2024-12-12
02:18 EST: Ahtisham merges the PR to reapply v2 forums (i.e. undo the 2024-12-06 revert, but with the new PR from 2024-12-09 that has a global DISABLE_FORUM_V2 Django setting).
03:55 EST: Ahtisham reports an error around endorsing a response. Regis creates a PR to fix.
08:59 EST: Ahtisham reports that v2 forums code is live on edX prod, with DISABLE_FORUM_V2 = True.
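A global kill switch like this can be read defensively, so that deployments which never define the setting keep the default behaviour. The following is a minimal sketch, assuming the setting is a plain boolean. The helper names, the `course_flag_enabled` parameter, and the `SimpleNamespace` stand-in for `django.conf.settings` are all hypothetical and not taken from the PR.

```python
from types import SimpleNamespace


def is_forum_v2_disabled_globally(settings) -> bool:
    """Treat a missing DISABLE_FORUM_V2 setting as 'not disabled'."""
    return bool(getattr(settings, "DISABLE_FORUM_V2", False))


def should_use_forum_v2(settings, course_flag_enabled: bool) -> bool:
    """The global switch wins over any per-course opt-in (an assumption)."""
    if is_forum_v2_disabled_globally(settings):
        return False
    return course_flag_enabled


# Stand-in for the production settings described above.
prod_settings = SimpleNamespace(DISABLE_FORUM_V2=True)
print(should_use_forum_v2(prod_settings, course_flag_enabled=True))
```

With the switch set to `True`, the v2 code can ship to production without any request taking the new code path.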
How did this happen?
...