
This plan was drawn up by the Spartan team members for a 3-person on-call rotation cycle. Our expectations, in order of importance, are:

  1. Have 24-hour primary on-call coverage on weekdays.
  2. Leave out weekend coverage for now and measure the impact.
    • Most Spartan engineers in Arbisoft are from other cities (with intermittent internet access) and weekend coverage considerably limits their ability to visit home.
  3. The Cambridge team will cover primary on-call on weekends (documented here) and secondary on-call (/wiki/spaces/LEARNER/pages/162121684) during the week.
  4. Keep the rotation completely fair for all resources.
  5. Give particular consideration to night shifts.
  6. Accommodate different response levels between shifts.
    • Preferably, no individual resource should be on call continuously for 24 hours.

...


Plan

Rotation Schedule

Given, for example, resources A, B and C, we will have the following schedule for on-call rotation:


           Day Shift (20:00-12:00 EST)    Night Shift (12:00-20:00 EST)
Week 1     Resource A                     Resource B
Week 2     Resource C                     Resource A
Week 3     Resource B                     Resource C
Week 4*    Resource A                     Resource B

...

This schedule will be followed on weekdays only. On weekends, the Spartan team will not provide on-call coverage. The aim is to have higher energy levels for on-call during the weekdays.
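The rotation in the table above can be generated mechanically. The following is a minimal sketch, assuming the pattern continues as shown (each week the day shift moves back one resource and the night shift goes to the next resource in order). The function name `rotation` is hypothetical and not part of any team tooling.

```python
def rotation(resources, weeks):
    """Return a list of (week, day_shift, night_shift) tuples.

    Assumption: the pattern from the schedule table repeats, so with
    three resources the cycle restarts every three weeks.
    """
    n = len(resources)
    schedule = []
    for week in range(1, weeks + 1):
        # Day shift steps backwards through the list: A, C, B, A, ...
        day_idx = -(week - 1) % n
        # Night shift is the next resource after the day-shift resource.
        night_idx = (day_idx + 1) % n
        schedule.append((week, resources[day_idx], resources[night_idx]))
    return schedule


for week, day, night in rotation(["A", "B", "C"], 4):
    print(f"Week {week}: day={day}, night={night}")
```

Under this pattern, every three-week cycle gives each resource one week of day shift, one week of night shift and one week off the rotation.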

Rotation Switch

Since the on-call shift changes hands between resources twice a day, each on-call resource will send out a brief email summarizing the status of their shift and providing context (if any) to the subsequent on-call engineer. The emphasis is on passing along only the most relevant information, not on creating a cumbersome process.

To streamline this, each on-call resource only has to check out using this form: https://goo.gl/forms/lMWtCWdmW3ah0Yrv2. Submitting the form sends a notification email with the shift checkout status to all Spartans.

...

Shift: Day (5:00 to 21:00) / Night (21:00 to 5:00), Pakistan Standard Time.

Shift Date: The date on which the shift ends.

# of Alerts Handled: The number of New Relic alerts that were investigated and closed.

Ongoing Alerts Investigation: Status of any alerts still being investigated at the end of the shift, and whether the current engineer will keep investigating or needs support from the incoming on-call engineer.

Ongoing CAT-1/CAT-2 Status: Similarly, the status of any CAT-1/CAT-2 investigation and resolution ongoing during the shift.
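For illustration, the checkout fields above could be modelled as a small record that renders the hand-off summary. This is only a sketch: the class name `ShiftCheckout`, its attribute names and the "None" defaults are assumptions, and it does not reflect how the actual form or its notification email is implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ShiftCheckout:
    """Hypothetical record mirroring the checkout form fields."""

    shift: str                         # "Day" or "Night"
    shift_date: date                   # date on which the shift ends
    alerts_handled: int                # alerts investigated and closed
    ongoing_alerts: str = "None"       # assumed default when nothing is open
    ongoing_cat_status: str = "None"   # assumed default when nothing is open

    def summary(self) -> str:
        """Render the fields as a short plain-text hand-off note."""
        return "\n".join([
            f"Shift: {self.shift}",
            f"Shift Date: {self.shift_date.isoformat()}",
            f"# of Alerts Handled: {self.alerts_handled}",
            f"Ongoing Alerts Investigation: {self.ongoing_alerts}",
            f"Ongoing CAT-1/CAT-2 Status: {self.ongoing_cat_status}",
        ])


checkout = ShiftCheckout("Night", date(2024, 1, 2), alerts_handled=3)
print(checkout.summary())
```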

Response Level

Given that we have two shifts during the day, we will have the following response levels for each:

Day Shift: Respond to Splunk, OpsGenie, HipChat and the JIRA filter. Actively triage. Immediately deep-dive into CAT-1 and CAT-2 bugs for resolution. Look into edx-status emails.

Night Shift: Respond to high-frequency Splunk alerts, OpsGenie and HipChat and, once alerted, immediately deep-dive into CAT-1 and CAT-2 bugs for resolution.
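As a rough sketch, the shift in effect at a given moment, and the channels to watch during it, can be looked up from the Pakistan Standard Time shift hours given in the checkout form (Day 5:00 to 21:00, Night 21:00 to 5:00). The names `shift_for` and `RESPONSE_CHANNELS` are hypothetical, and assigning 5:00 to the day shift and 21:00 to the night shift at the boundaries is an assumption.

```python
from datetime import datetime, time, timedelta, timezone

# Pakistan Standard Time is UTC+5 and observes no daylight saving time.
PKT = timezone(timedelta(hours=5), "PKT")
DAY_START = time(5, 0)
DAY_END = time(21, 0)

# Channels per shift, taken from the response levels described above.
RESPONSE_CHANNELS = {
    "Day": ["Splunk", "OpsGenie", "HipChat", "JIRA filter", "edx-status emails"],
    "Night": ["high-frequency Splunk alerts", "OpsGenie", "HipChat"],
}


def shift_for(moment: datetime) -> str:
    """Return "Day" or "Night" for a timezone-aware datetime.

    Assumption: the day shift covers [5:00, 21:00) PKT and the night
    shift covers the rest. Weekends are not handled here.
    """
    local = moment.astimezone(PKT).time()
    return "Day" if DAY_START <= local < DAY_END else "Night"


now = datetime.now(timezone.utc)
print(shift_for(now), RESPONSE_CHANNELS[shift_for(now)])
```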


Categories

The Spartan team plans to introduce new category definitions for better granularity in response levels, but only after gathering and integrating feedback from all stakeholders. Until then, we will use the category definitions we have inherited.

 

For reference

Source: https://openedx.atlassian.net/wiki/display/ENG/Bug+Fix+Policy 

CAT-1 (Catastrophic or Critical)

...