Discovery: Recent Deployers (WIP)
To start, this is just information copied directly from Slack. It is not yet ready for consumption. Confluence is simply a better place to collect this data than Slack. The following needs to be converted to an actual document.
Slack Thread 1
Julia Eskew [she/her] Apr 27th at 5:55 PM
Arch Team,
During last week's edxapp deployment freeze, teams still merged edx-platform PRs which were then deployed to stage and queued up for production release. This queueing resulted in a larger-than-usual number of commits being released at once on Monday 4/26. Fortunately, our e2e-tests had failures that were easily mapped to a PR that was reverted - and the remaining release was deployed without issue.
But conversations occurred last week and the general opinion was that we should adopt a merge freeze instead of a deployment freeze on future holiday/gratitude weeks. A merge freeze would prevent a large number of commits being released at once - and avoid the increased risk of a rollback needed because of release issues. The merge freeze would not be enforced via Github - instead it would be communicated to all development teams as a request. High-priority fixes would still be expedited/merged as needed. The edxapp deployment pipelines would still be paused in case someone mistakenly forgets the merge freeze.
How does this plan sound to you? Please add your feedback in the thread. Thanks!
18 replies
Steven Burch 2 months ago
one thing to consider is the Core Contributors program. some non-edx employees already have merge access and this is expected to grow.
Feanil Patel [he/him] 2 months ago
I don't see how this actually changes the outcome, my expectation would be that if we made a merge freeze, it would result in lots of PRs merging on the first day after the freeze and we would be in the same boat but with less forewarning?
Feanil Patel [he/him] 2 months ago
The other option is to not adopt a merge freeze at all and expect that if you're merging you're around to monitor and revert as well. This puts a slightly higher burden on the on-call people when the impact is cross-team though, so I understand why people are reluctant to do it.
Kyle McCormick 2 months ago
@feanil the benefits I see of merge freezing over deployment freezing are:
During the freeze, the pipeline is left clear for high-priority fixes. We had to tell AJ "sorry, you can't release your bugfix because there is so much unreleased in the pipeline" last week.
Yes, there will always be a deluge of merges when the freeze lifts, but at least that big Monday@10am release will be comprised of folks who knowingly merged Monday@9am, instead of a collection of people who merged at any point the previous week, potentially in a different timezone.
Nimisha (PTO 6/16 - 6/18)
2 months ago
From my POV, a merge freeze seems to be a regressive step in our pipeline flow.
What would need to happen for us to “not adopt a freeze at all and expect that if you’re merging you’re around to monitor and revert”? For example:
Would having a single-click “revert” button enable this?
What additional ownership routing capabilities are needed?
Feanil Patel [he/him] 2 months ago
or making it easier to take on-call for all of edx-platform temporarily
Kyle McCormick 2 months ago
@nasthagiri do you feel that deployment freezes are regressive as well?
Nimisha (PTO 6/16 - 6/18)
2 months ago
(yes)
Nimisha (PTO 6/16 - 6/18)
2 months ago
But I understood the need for deployment freeze earlier in the winter - when we had just rolled out distributed ownership of edx-platform and we were still establishing ourselves with it.
Julia Eskew [she/her] 2 months ago
Question about taking on-call: I assume the non-edX employees who have merge access aren't expected to take edx-platform on-call when they merge? That we'd instead rely on our internal on-call schedule to support those merges/deployments?
Nimisha (PTO 6/16 - 6/18)
2 months ago
Right now, it feels kind of a bummer that our Arbisoft colleagues have a long lead time on their daytime efforts since they need to wait until their own night-time to see their changes go to production.
Nimisha (PTO 6/16 - 6/18)
2 months ago
“non-edX employees who have merge access”
Are you speaking about the Open edX Core Committers?
Robert Raposa
2 months ago
I think the issue is on-call vs the one who merges (like @jeskew and others said). If it is late at night, or a weekend, or a holiday - I don’t want to be disturbed because others feel like taking a risk.
Arbisoft does not need to wait according to our current practices. Are you just referring to when we decide to freeze?
Nimisha (PTO 6/16 - 6/18)
2 months ago
Right now, the state is what you said: the Core Committers (CCs) rely on their designated internal “edX champions” to merge and deploy their changes. So those CCs don’t always merge even though they have access. But this is an agreement that the champions have established with their CCs.
Nimisha (PTO 6/16 - 6/18)
2 months ago
As of now, I wouldn’t over-index on the needs of CCs as we may evolve how we work with them in the next phase of the program.
Julia Eskew [she/her] 2 months ago
Ah - that sounds ok then. I didn't know that we were the champions (my friend...)
Nimisha (PTO 6/16 - 6/18)
2 months ago
lol. Check out the soundtrack: https://www.youtube.com/watch?v=jqKXxlFO5F0
Kyle McCormick 2 months ago
Personally I have no opposition to establishing a system of "merge whenever you want, but if it's outside standard hours, you're on call now".
I think the blockers to that would be:
deciding what taking on call means (is it for all of edxapp? a certain slice? this could evolve with our capabilities)
configuring opsgenie to make it maximally obvious how to take on call and to check who is on call
communicating the new process
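A rough sketch of the "merge outside standard hours means you take on call" rule could look like the following. Nothing here was decided in the thread: the standard hours, the fixed UTC-5 offset, and the function name `merger_takes_on_call` are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical policy values; the thread does not define "standard hours".
STANDARD_TZ = timezone(timedelta(hours=-5))  # fixed UTC-5 offset, ignores DST
STANDARD_START = time(9, 0)
STANDARD_END = time(17, 0)


def merger_takes_on_call(merged_at: datetime) -> bool:
    """Return True if a merge at `merged_at` falls outside standard hours."""
    if merged_at.tzinfo is None:
        raise ValueError("merged_at must be timezone-aware")
    # Evaluate the merge time in the (assumed) reference timezone.
    local = merged_at.astimezone(STANDARD_TZ)
    is_weekend = local.weekday() >= 5  # Saturday is 5, Sunday is 6
    in_hours = STANDARD_START <= local.time() < STANDARD_END
    return is_weekend or not in_hours
```

A check like this only answers "does this merge trigger the rule?". What taking on call covers, and how it is set up in OpsGenie, are the open questions listed above.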
Slack Thread 2
Robert Raposa
Jun 2nd at 3:17 PM
@ablackwell @kmccormick: When you get the chance, in this thread (unless it is simpler to discuss in-person), can you provide some background on the types of problems you think you want to solve with “recent deployer” notifications? I’d like to capture the problems first, before we decide on a solution. Thanks.
8 replies
Adam Blackwell
15 days ago
My ultimate goal would be to stop pausing pipelines on weekends. My shorter term goal would be to notify authors of PRs that need to be reverted so they can revert or fix forward without requiring the service owners team on-call rotation to spend as much time diagnosing the cause of the alert. I think this would be helpful for when enterprise PRs land and cause issues in edxapp or ecommerce.
Adam Blackwell
15 days ago
(Kyle may have a slightly different objective in mind though)
Adam Blackwell
15 days ago
This is a similar goal to when we first experimented with recent deployers, but a key difference is that we now continuously deploy edxapp (for the most part) and have better on call docs which would enable other engineers to better troubleshoot errors.
Robert Raposa
15 days ago
Thanks. I think there were a lot of issues with the recent deployers alerting, and it is helpful to hear your issues, so we can brainstorm the best solution.
Kyle McCormick 14 days ago
My goals are similar to Adam's. Mainly, I'd like a resolution to the Apr 27th thread in #arch-n-tnl (Slack Thread 1 above), in which there was disagreement on how exactly we should protect rotating on-call engineers from weekend/holiday/hackathon merges, whether from not-on-call edX employees or from community core committers.
If the answer is "automating this is too flaky, we should just have a manual process that asks weekend mergers to take the pager" then that is OK with me too.
Robert Raposa
14 days ago
Thanks for all this. Some additional notes to be discussed in the future:
Are there differences between weekends, holidays, and events like Cyber Monday where we were asked to provide greater stability by not deploying?
What can Core Committers do, and does that affect solutions and how they are made available?
Is there a solution that can both minimize changes (via messaging, or other), with a clear path for making changes? Is this necessary?
How do we deal with the various code_owner alert policies and OpsGenie configs and wishes? I’ve heard some code owners who want to know about all issues, and some who do not.
While the pipeline itself is flaky, are people free to call in SRE or others on holidays, etc., to unblock the pipeline?
What would our dream automation look like? Does it change based on time of day/week, or always the same? If I merge, when am I actually on-call? How am I notified that my on-call is starting/ending? Etc.
Robert Raposa
13 days ago
Some additional thoughts:
Ensuring the pipeline is not flaky has many benefits in its own right, and it is also a prerequisite for this work. I wouldn’t want to merge a change, be on-call for it, but have no idea if it might be hours or days later before the pipeline gets it out.
These notes should all be moved to a page (I may do that) that documents this as a potential project.
I think @nasthagiri and @jbowman will need to review this and help determine the resourcing and priority of this work, even if it is just prioritizing discovery work.
Kyle McCormick 13 days ago
I want to be clear that I personally am not trying to push for this work to happen now or even soon. I fully agree that pipeline reliability is more urgent, and is a prereq to this.
The incident that we were mini-retro'ing in eSRE guild was actually a success story of the current process, in that the dev who merged the broken PR was paying attention, and joined the incident response immediately.
I remarked that it'd also be cool to automatically page recent mergers, but as evidenced by the incident, a manual "if you merge, you're on call" system is also totally viable.
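To make the "recent deployer" idea from this thread a bit more concrete, here is a minimal sketch of the lookup that a notification could be built on: given the commits that went out and the time of an alert, list the authors whose changes were deployed shortly before it. The `DeployedCommit` record, the `recent_deployers` function, and the two-hour window are hypothetical; the thread does not specify any of them, and no real pipeline or OpsGenie API is used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class DeployedCommit:
    """Hypothetical record of a commit that reached production."""
    sha: str
    author: str
    deployed_at: datetime


def recent_deployers(
    commits: Iterable[DeployedCommit],
    alert_at: datetime,
    window: timedelta = timedelta(hours=2),  # assumed look-back window
) -> List[str]:
    """Return unique authors deployed within `window` before the alert,
    most recently deployed first."""
    cutoff = alert_at - window
    authors: List[str] = []
    for commit in sorted(commits, key=lambda c: c.deployed_at, reverse=True):
        in_window = cutoff <= commit.deployed_at <= alert_at
        if in_window and commit.author not in authors:
            authors.append(commit.author)
    return authors
```

Whether the result is used to page people automatically or just to tell the on-call engineer who to contact is the policy question discussed above; the lookup is the same either way.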
Slack Thread 3
Adam Blackwell
Yesterday at 9:41 AM
Hey Rob, I am catching up on things since I was out and am rereading the thread behind
, I'm curious if you have an idea of how hard it would be to add recent deployers logic to the notes deployment pipeline, or if you'd be open to taking 30 minutes to either move the notes from that thread to a page together or scope some requirements/a POC.
SRE was also talking about an MVP canary build process where we auto rollback a service if an Apdex/error rate metric drops/jumps, which could be related if this PR that allows Slack notifications on rollbacks lands: https://github.com/argoproj/argo-rollouts/pull/1175
2 replies
Robert Raposa
23 hours ago
I’m out today, but happy to discuss when I return.
First step would be to come up with a plan before implementing. Would be great to be iterative and experiment before over investing.
Same for automating rollbacks. Maybe start with detecting situations and notifying, so you can tune and determine pros/cons of automated decision on this.
Also, will be important to clarify goals and measurements. Is this to reduce MTTR, or reduce reliance on SRE, or other?
We’ll talk more. Look forward to it.
Adam Blackwell
21 hours ago
I love this list. I'd like to frame the work around reducing MTTR.
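As an illustration of the "detect and notify first" step suggested in this thread, a notify-only canary check might compare a canary's Apdex and error rate against a baseline and report regressions without rolling anything back. This is only a sketch under assumptions: the `Metrics` type, both thresholds, and the function names are hypothetical, and it does not call Argo Rollouts or any monitoring API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Metrics:
    apdex: float       # 0.0 to 1.0, higher is better
    error_rate: float  # fraction of failed requests, lower is better


# Hypothetical thresholds; real values would need tuning against past releases.
MAX_APDEX_DROP = 0.05
MAX_ERROR_RATE_JUMP = 0.02


def detect_regressions(baseline: Metrics, canary: Metrics) -> List[str]:
    """Describe each metric where the canary is worse than allowed."""
    findings: List[str] = []
    apdex_drop = baseline.apdex - canary.apdex
    if apdex_drop > MAX_APDEX_DROP:
        findings.append(f"Apdex dropped by {apdex_drop:.2f}")
    error_jump = canary.error_rate - baseline.error_rate
    if error_jump > MAX_ERROR_RATE_JUMP:
        findings.append(f"error rate rose by {error_jump:.2%}")
    return findings


def notify_only(findings: List[str], send: Callable[[str], None] = print) -> int:
    """Send one message per finding; deliberately never triggers a rollback."""
    for finding in findings:
        send(f"[canary] {finding} (notify-only, no rollback triggered)")
    return len(findings)
```

Running in notify-only mode for a while would give data on how often the check fires and how often a rollback would have been the right call, which feeds directly into the MTTR framing.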