This plan was prepared by the Spartan team members for a 3-person on-call rotation cycle. Our expectations, in order of importance, are:
- Have 24-hour primary on-call coverage on weekdays.
- Skip weekend coverage for now and measure the impact.
  - Most Spartan engineers at Arbisoft are from other cities (with intermittent internet access), and weekend coverage considerably limits their ability to visit home.
  - The Cambridge team will cover primary on-call on weekends (documented here) and secondary on-call during the week.
- Keep it completely fair for all resources.
- Give particular consideration to night shifts.
- Accommodate response level between shifts.
- Preferably, no individual resource should be continuously on-call for 24 hours.
Plan
Rotation Schedule
Given three resources (Resource A, Resource B and Resource C), we will follow this on-call rotation schedule:
...
This schedule applies to weekdays only. On weekends, the Spartan team will not provide on-call coverage. The aim is to keep energy levels higher for on-call during the weekdays.
Rotation Switch
Since the rotation switches between resources twice a day, each on-call resource will send out a brief email summarizing the status of their shift and providing context (if any) to the next on-call engineer. The emphasis is on sharing only the most relevant information, not on creating a cumbersome process.
To streamline this, each on-call resource only has to check out here: https://goo.gl/forms/lMWtCWdmW3ah0Yrv2. Submitting this form sends a notification email with the shift checkout status to all Spartans.
...
Field | Description |
---|---|
Shift | Day (5:00 to 21:00) or Night (21:00 to 5:00), Pakistan Standard Time |
Shift Date | The date on which the shift ends. |
# of Alerts Handled | The number of New Relic alerts that were investigated and closed. |
Ongoing Alerts Investigation | Status of any alerts still under investigation at the end of the shift, and whether the current engineer will continue investigating or needs support from the incoming on-call engineer. |
Ongoing CAT-1/CAT-2 Status | Similarly, the status of any ongoing CAT-1/CAT-2 investigation and resolution during the shift. |
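
The Shift and Shift Date fields can be derived from a timestamp. The following is a minimal sketch, not part of the form itself: the function name `shift_for` is hypothetical, and it assumes that 5:00 belongs to the Day shift and 21:00 to the Night shift, which the table does not state explicitly. Pakistan Standard Time is modeled as a fixed UTC+5 offset.

```python
from datetime import date, datetime, time, timedelta, timezone

# Pakistan Standard Time as a fixed UTC+5 offset.
PKT = timezone(timedelta(hours=5), "PKT")

# Shift boundaries taken from the table above (local PKT times).
DAY_START = time(5, 0)
NIGHT_START = time(21, 0)


def shift_for(moment: datetime) -> tuple[str, date]:
    """Return (shift name, shift date) for a timezone-aware datetime.

    The shift date is the date on which the shift ends, so a Night
    shift that starts at 21:00 is dated with the following day.
    Assumption: boundary instants belong to the shift that starts then.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(PKT)
    current = local.time()
    if DAY_START <= current < NIGHT_START:
        return "Day", local.date()
    if current >= NIGHT_START:
        # Evening part of the Night shift: it ends the next morning.
        return "Night", local.date() + timedelta(days=1)
    # Early-morning part of the Night shift: it ends later today.
    return "Night", local.date()
```

For example, `shift_for(datetime(2024, 1, 1, 22, 0, tzinfo=PKT))` yields the Night shift dated 2 January 2024, because that shift ends at 5:00 the next morning.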
Response Level
Since we have two shifts per day, the response levels for each are as follows:
Shift | Response Level |
---|---|
Day Shift | Response to Splunk, OpsGenie, HipChat and JIRA filter. Active triaging. Immediate deep-dive into CAT-1 and CAT-2 bug resolution. Look into edx-status emails. |
Night Shift | Response to Splunk, OpsGenie, HipChat and JIRA filter. Immediate deep-dive into CAT-1 and CAT-2 bug resolution. |
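
The difference between the two response levels can be made explicit by treating each as a set of duties. This is only an illustrative sketch: the names `BASE_DUTIES`, `DAY_ONLY_DUTIES`, `RESPONSE_LEVELS` and `extra_day_duties` are hypothetical, and the duty strings are paraphrased from the table above.

```python
# Duties shared by both shifts, paraphrased from the table above.
BASE_DUTIES = frozenset({
    "respond to Splunk, OpsGenie, HipChat and JIRA filter",
    "immediate deep-dive into CAT-1 and CAT-2 bug resolution",
})

# Duties that only the Day shift carries.
DAY_ONLY_DUTIES = frozenset({
    "active triaging",
    "look into edx-status emails",
})

RESPONSE_LEVELS = {
    "Day": BASE_DUTIES | DAY_ONLY_DUTIES,
    "Night": BASE_DUTIES,
}


def extra_day_duties() -> list[str]:
    """Return the duties expected on the Day shift but not at night."""
    return sorted(RESPONSE_LEVELS["Day"] - RESPONSE_LEVELS["Night"])
```

Under this reading, the Night shift is a strict subset of the Day shift: it drops active triaging and the edx-status emails but keeps alert response and CAT-1/CAT-2 work.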
Categories
The Spartan team plans to introduce new category definitions to allow finer granularity in response levels, but only after gathering and integrating feedback from all stakeholders. Until then, we will use the category definitions we have inherited.
For reference
Source: https://openedx.atlassian.net/wiki/display/ENG/Bug+Fix+Policy
CAT-1 (Catastrophic or Critical)
...