Feanil Patel has a ticket to enable weekly cron CI runs against master, so we know when external changes might have broken repos that don't usually get updates.
Discuss access for teams that maintain many repos across the org
Introduction of a new CC role: maintainer-at-large
Will be posted to the forums shortly
We can follow up in the discussion on that post for feedback.
Continued discussion on whether we should change the DEPR 6-month window approach. Should we have one big ticket for something like Python 3.8 or Node 18 and just start the 6-month clock once all the maintained repos have been updated?
Proposal to shorten the DEPR simultaneous support window to 4 months for future upgrade DEPRs that have operator impact and need a support window.
We originally chose 6 months to guarantee the window would fall within a single release.
Alternate Proposal:
Provide a predictable time at which the fix is guaranteed to be available within the next six months.
Announce the DEPR as early as possible (six months is ideal); at the end of the DEPR there must be a one-month period of simultaneous support.
The plan is announced early, and the completion date is as predictable as possible.
If the work is done early, we keep the original date, though this could be negotiated with agreement from the people running master.
If the work is completed late, we provide a one-month simultaneous support window from the time of completion.
We give at least a six-month announcement window, but the work does not need to have started or finished when we make the announcement.
I think we’ve proven empirically that the issue is as follows (this is not captured well by our docs yet, so that could cause some confusion w.r.t. state of actual resolution):
We were running with celery mingle enabled (because it's enabled by default). Mingle means that, on worker startup (including restarts), each worker asks about the state of every other worker bound to the broker (Redis).
Every edX Python IDA that uses celery shared a single broker (the legacy Redis cluster).
edxapp was running 30 worker instances, each of which runs around 14 parent celery worker processes.
The confluence of these three things kicked off a “connection storm” in Redis: massive amounts of (duplicated) task data were sent out over the network to every worker, which pinned the Redis engine CPU at 100% and blocked all workers from processing tasks from any queue.
The way we proved this empirically: during deploys (i.e., when we bring up a large number of new worker processes), look at the following:
The number of “sync with” celery logs emanating from the celery workers.
The total network out from redis to the workers.
Redis engine CPU utilization.
Redis new and current network connection counts.
In the bad state, all of these metrics spiked and stayed elevated for quite some time. When mingle was disabled (on stage; see the sketch below), none of them spiked.
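For reference, a minimal sketch of how mingle can be turned off at worker startup. The app name, broker URL, and module layout here are illustrative only, not our actual IDA configuration:

```python
# Minimal sketch of disabling mingle at worker startup (illustrative app
# name and broker URL, not our real config).
from celery import Celery

app = Celery("my_ida", broker="redis://legacy-redis-cluster:6379/0")

if __name__ == "__main__":
    # Equivalent to running: celery -A my_ida worker --without-mingle
    # --without-mingle skips the startup step where each new worker asks
    # every other worker on the broker for its state (the "sync with" logs).
    app.worker_main(argv=[
        "worker",
        "--without-mingle",
        "--loglevel=INFO",
    ])
```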
Config overrides and YAML
Old Conversation: You should have your own settings files.
New Conversation about Devstack config being dropped:
The new development.py settings file should not include YAML support but will allow downstream settings files to add YAML support if they want it.
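As a rough sketch of what a downstream settings file could look like under that approach (the module layout, environment variable name, and override mechanism below are hypothetical examples, not the agreed design):

```python
# Sketch: a downstream settings module that layers YAML support on top of
# a YAML-free development.py. MY_IDA_CFG and the file layout are made up.
import os

import yaml

from .development import *  # noqa: F401,F403  (base dev settings, no YAML)

_yaml_path = os.environ.get("MY_IDA_CFG")
if _yaml_path:
    with open(_yaml_path) as config_file:
        _overrides = yaml.safe_load(config_file) or {}
    # Apply each top-level YAML key as a Django settings override.
    globals().update(_overrides)
```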
Toggle annotations and DEPR
Can we use the removal dates in toggle annotations as the deadlines after which it's safe to remove a toggle?
The goal of the annotation was always documentation, to make it easier to understand the age of toggles. The annotation predates the creation of the 6-month window.
Proposal: Drop the removal date and just use the DEPR process, because the dates were only ever aspirational and will mislead folks.
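For context, an example of the annotation fields in question. This is a sketch: field names follow the toggle annotation convention as best I recall, and the flag name, dates, and description are made up.

```python
# Illustrative sketch of a toggle annotation with a target removal date
# (hypothetical flag; dates and description are placeholders).
from edx_toggles.toggles import WaffleFlag

# .. toggle_name: my_app.enable_new_dashboard
# .. toggle_implementation: WaffleFlag
# .. toggle_default: False
# .. toggle_description: Roll out the new dashboard UI to learners.
# .. toggle_use_cases: temporary
# .. toggle_creation_date: 2023-01-15
# .. toggle_target_removal_date: 2023-07-15
ENABLE_NEW_DASHBOARD = WaffleFlag("my_app.enable_new_dashboard", __name__)
```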
✅ Action items
Kyle McCormick will update the DEPR Pilot ticket with the new suggestion for planning major maintenance DEPRs