Learner Triage: On-Call Rotation Plan

This plan was prepared by the Spartan team for a three-person on-call rotation cycle. Our expectations, in order of importance, are:

  1. Provide 24-hour primary on-call coverage on weekdays.
  2. Leave out weekend coverage for now and measure the impact.
    • Most Spartan engineers at Arbisoft are from other cities (with intermittent internet access), and weekend coverage considerably limits their ability to visit home.
  3. The Cambridge team will cover primary on-call on weekends (documented here), and /wiki/spaces/LEARNER/pages/162121684 during the week. 
  4. Keep the rotation completely fair for all resources.
  5. Give particular consideration to night shifts.
  6. Accommodate different response levels between shifts.
    • Preferably, no individual resource should be continuously on call for 24 hours.

Plan

Rotation Schedule

Taking resources A, B, and C as an example, the on-call rotation schedule is as follows:


          Day Shift (20:00-12:00 EST)    Night Shift (12:00-20:00 EST)
Week 1    Resource A                     Resource B
Week 2    Resource C                     Resource A
Week 3    Resource B                     Resource C
Week 4*   Resource A                     Resource B

* The rotation configuration repeats itself every fourth week.

This schedule applies to weekdays only. The Spartan team will not provide on-call coverage on weekends; the aim is to keep energy levels higher for on-call during the weekdays.
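
As a rough illustration, the weekly rotation in the table above can be sketched in a few lines of Python. The function name `on_call_for_week` and the `RESOURCES` list are hypothetical and only mirror the example resources A, B, and C; weekends are not modeled, since the schedule covers weekdays only.

```python
# Hypothetical sketch of the three-week rotation shown in the table above.
RESOURCES = ["Resource A", "Resource B", "Resource C"]


def on_call_for_week(week, resources=RESOURCES):
    """Return (day_shift, night_shift) for a 1-based week number."""
    if week < 1:
        raise ValueError("week numbers start at 1")
    n = len(resources)
    # The day-shift slot steps backwards through the list each week,
    # so week 1 -> A, week 2 -> C, week 3 -> B, week 4 -> A again.
    day = (-(week - 1)) % n
    # The night shift is held by whoever had the day shift the week before.
    night = (day + 1) % n
    return resources[day], resources[night]


for week in range(1, 5):
    day, night = on_call_for_week(week)
    print(f"Week {week}: day={day}, night={night}")
```

Under this pattern each resource moves from the day shift to the night shift to a week off, which is one way to read the fairness expectation listed at the top of the plan.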

Rotation Switch

Because on-call duty changes hands between resources twice a day, each on-call resource will send out a brief email at the end of their shift, summarizing its status and providing context (if any) to the subsequent on-call engineer. The emphasis is on sharing only the most relevant information, not on creating a cumbersome process.

To streamline this, each on-call resource only has to check out using this form: https://goo.gl/forms/lMWtCWdmW3ah0Yrv2. Submitting the form sends a notification email with the shift checkout status to all Spartans.

The form has the following fields:

Shift                           Day (5:00 to 21:00) or Night (21:00 to 5:00), Pakistan Standard Time.
Shift Date                      The date on which the shift ends.
# of Alerts Handled             The number of New Relic alerts that were investigated and closed.
Ongoing Alerts Investigation    Status of any alerts still being investigated at the end of the shift, and whether the current engineer will keep investigating or needs support from the incoming on-call engineer.
Ongoing CAT-1/CAT-2 Status      Similarly, the status of any CAT-1/CAT-2 investigation and resolution ongoing during the shift.
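
For readers who want to see the checkout fields as a data structure, here is a minimal sketch. The class `ShiftCheckout`, its attribute names, and the `summarize` helper are all hypothetical; the actual checkout and notification email are handled by the form linked above, not by this code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ShiftCheckout:
    """Hypothetical mirror of the checkout form fields described above."""
    shift: str                    # "Day" or "Night"
    shift_date: date              # the date on which the shift ends
    alerts_handled: int           # alerts investigated and closed
    ongoing_alerts: str = ""      # status of alerts still being investigated
    ongoing_cat_status: str = ""  # status of ongoing CAT-1/CAT-2 work

    def __post_init__(self):
        if self.shift not in ("Day", "Night"):
            raise ValueError("shift must be 'Day' or 'Night'")
        if self.alerts_handled < 0:
            raise ValueError("alerts_handled cannot be negative")


def summarize(checkout):
    """Render a short plain-text handover summary (assumed layout)."""
    lines = [
        f"Shift: {checkout.shift} (ending {checkout.shift_date.isoformat()})",
        f"Alerts handled: {checkout.alerts_handled}",
        f"Ongoing alerts: {checkout.ongoing_alerts or 'none'}",
        f"Ongoing CAT-1/CAT-2: {checkout.ongoing_cat_status or 'none'}",
    ]
    return "\n".join(lines)
```

Keeping the two free-text fields optional reflects the stated goal of passing on only the most relevant information.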

Response Level

Because there are two shifts per day, the response level for each is as follows:

Day Shift      Respond to Splunk, OpsGenie, HipChat, and the JIRA filter. Triage actively. Deep-dive immediately into Cat-1 and Cat-2 bugs for resolution. Look into edx-status emails.
Night Shift    Respond to high-frequency Splunk alerts, OpsGenie, and HipChat; once alerted, deep-dive immediately into Cat-1 and Cat-2 bugs for resolution.
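
The mapping from time of day to shift and response level could be sketched as below. This assumes the shift boundaries given in the checkout form (Day is 5:00 to 21:00 Pakistan Standard Time, taken here as a fixed UTC+5 offset); the names `shift_at` and `RESPONSE_SOURCES` are hypothetical, and weekend handling is deliberately left out.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: Pakistan Standard Time as a fixed UTC+5 offset.
PKT = timezone(timedelta(hours=5), "PKT")
DAY_START = time(5, 0)
DAY_END = time(21, 0)

# Hypothetical lookup of what each shift is expected to watch.
RESPONSE_SOURCES = {
    "Day": ["Splunk", "OpsGenie", "HipChat", "JIRA filter", "edx-status emails"],
    "Night": ["high-frequency Splunk alerts", "OpsGenie", "HipChat"],
}


def shift_at(moment):
    """Return "Day" or "Night" for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(PKT).time()
    return "Day" if DAY_START <= local < DAY_END else "Night"


now = datetime.now(timezone.utc)
current = shift_at(now)
print(f"Current shift: {current}; watch: {', '.join(RESPONSE_SOURCES[current])}")
```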


Categories

The Spartan team plans to introduce new category definitions for better granularity in response levels, but only after gathering and integrating feedback from all stakeholders. Until then, we will use the category definitions we have inherited.

 

For reference

Source: https://openedx.atlassian.net/wiki/display/ENG/Bug+Fix+Policy 

CAT-1 (Catastrophic or Critical)

Major loss of functionality, possibly major data loss as well. Fix should be made ASAP.

General Example(s): Loss of data; users are unable to complete workflow(s); system outage

Specific Example(s): Users are prevented from registering for courses

 

CAT-2 (Severe or Major)

These bugs should also be resolved in a tight time range, but are not absolutely critical. Functionality will be lost, and there may be perceived (but no actual) data loss. User experience may be significantly hampered for smaller segments of the platform. While not necessarily release-blocking, the responsible team should prioritize these bugs accordingly.

General Example(s): Without relying on support, users can complete workflows, but they must use alternative methods; user-facing 500 errors

Specific Example(s): To advance to the next unit ("unit 7"), the "next" button doesn't work (e.g., "next"), but using the next sequence link does work (e.g., specifically using "unit 7").

 

CAT-3 (Minor)

These bugs include: serious but infrequent bugs, pervasive UX bugs, and bugs that hamper monitoring. They do not have the same time sensitivity as CAT-2 bugs. There is no timeline for a fix; a team should prioritize on a case-by-case basis.

 

CAT-4 (Cosmetic or Enhancement)

These are less severe bugs or might better be classified as enhancements or general improvements. Since many of them are simple, they should be great bugs for someone new to edX platform development. They are also great candidates for bug-bashing sessions. We've started labeling good introductory bugs as "byte-sized": check them out! These bugs may never be fixed, and that's OK. (In other words, if it should be fixed someday, then it should be CAT-3).

Accessibility Bug Prioritization

Accessibility Bugs MUST be remedied with the same level of priority (e.g., speed, resources used to remediate) as any other equivalent loss of function for individuals without disabilities. For example, an unlabeled button that prevents blind individuals from registering for a course is remedied with the same level of priority as would a bug that prevents an individual without a disability from registering for a course (see CAT-1 examples above).

Any accessibility bug that qualifies as a conformance failure of the Web Content Accessibility Guidelines 2.0 Level AA MUST NOT be categorized as a CAT-4.