...

Priority of Ticket | Communication | Stakeholders
CAT-1
  Communication:
  • Primary communication should be the JIRA ticket.
  • HipChat: immediate message in the War Room channel and the Escalations channel, with:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Immediate email with the following information:
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Updates to both of the above channels as details unfold
    • Email sent at least every 24 hours
  • Final email and message once the issue is resolved
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-2
  Communication:
  • Primary communication should be the JIRA ticket.
  • HipChat: message only if the issue impacts a large number of learners or educators, with:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Follow-up email once the issue is triaged and isolated. This can also wait until after the issue is resolved.
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Final email and message once the issue is resolved
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-3+
  • All communication should take place in the JIRA tickets.
  • Limit email to communicating potential engineering improvements that would avoid repeat issues.
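
The communication rules above can be read as a small lookup from ticket priority to required actions. The following is only a minimal sketch of that reading, not an existing edX tool: the names `CommunicationPlan`, `communication_plan`, and `wide_impact`, and the labels "immediate", "follow-up", and "limited", are hypothetical and chosen for illustration. In every case the JIRA ticket remains the primary channel.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Stakeholders listed for CAT-1 and CAT-2 in the table above.
STAKEHOLDERS: Tuple[str, ...] = (
    "Escalations Team Lead",
    "Escalations Product",
    "DevOps",
    "Learner and Educator Leads",
)


@dataclass(frozen=True)
class CommunicationPlan:
    """Hypothetical summary of what one priority level requires."""
    hipchat_message: bool                     # post in HipChat right away?
    email: str                                # "immediate", "follow-up", or "limited"
    max_hours_between_emails: Optional[int]   # None when no cadence is stated
    final_notice: bool                        # final email and message on resolution?
    stakeholders: Tuple[str, ...]


def communication_plan(priority: int, wide_impact: bool = False) -> CommunicationPlan:
    """Return the plan for a CAT level (1, 2, 3, ...).

    `wide_impact` is an assumed flag standing in for "a large number of
    learners or educators are affected"; it only matters for CAT-2.
    """
    if priority < 1:
        raise ValueError("priority must be 1 or greater")
    if priority == 1:
        # CAT-1: immediate HipChat message and email, email at least every 24 hours.
        return CommunicationPlan(True, "immediate", 24, True, STAKEHOLDERS)
    if priority == 2:
        # CAT-2: HipChat only on wide impact; email can follow triage or resolution.
        return CommunicationPlan(wide_impact, "follow-up", None, True, STAKEHOLDERS)
    # CAT-3+: everything stays in the JIRA ticket; email is kept to a minimum.
    return CommunicationPlan(False, "limited", None, False, ())
```

For example, `communication_plan(2, wide_impact=True).hipchat_message` is `True`, while the same call without the flag gives `False`.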

Alerts:

Setup for

...

On-Call Rotation and Triage:

Set up services locally:

If there is a CAT-2 bug, the triage team should start looking into it, either fixing it themselves or finding the right person to fix it. Make sure you have the following services set up locally:

  • edX-platform
  • Ecommerce
  • Course discovery/catalog
  • edX-mktg

This will also help you reproduce and debug bugs locally.

We should use Docker for the devstack setup of these local environments. Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Having a local environment that is up to date will make the triage process easier.

Operational OpsGenie Steps:

  • Once you are alerted by OpsGenie, please acknowledge the alert to confirm that you received it.
  • If the issue is impacting a lot of learners, please:
    • Create a learner JIRA ticket to track it, if one does not already exist
    • Triage the JIRA ticket immediately and follow the triage process defined above
    • Contact the Escalations Lead using the "Customer Requests / Support / Escalation" channel in HipChat.
  • Monitor and investigate Splunk alerts. The documentation on the Splunk alerts we monitor is at

    https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts
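
As a memory aid, the steps above can be written as a small checklist function. This is only a sketch of the decision flow as described in this section; it does not call OpsGenie, JIRA, or HipChat, and the function name `on_alert_actions` and its two flags are hypothetical.

```python
from typing import List


def on_alert_actions(wide_learner_impact: bool, ticket_exists: bool) -> List[str]:
    """Return the ordered checklist for an on-call engineer who was just alerted.

    Both arguments are assumed inputs: whether the issue impacts a lot of
    learners, and whether a JIRA ticket already tracks it.
    """
    actions = ["Acknowledge the alert in OpsGenie"]
    if wide_learner_impact:
        if not ticket_exists:
            actions.append("Create a learner JIRA ticket to track the issue")
        actions.append("Triage the JIRA ticket immediately")
        actions.append(
            'Contact the Escalations Lead in the '
            '"Customer Requests / Support / Escalation" HipChat channel'
        )
    # Splunk monitoring applies regardless of impact.
    actions.append("Monitor and investigate Splunk alerts")
    return actions
```

Calling `on_alert_actions(True, False)` returns all five steps, while `on_alert_actions(False, False)` returns only the acknowledgement and the Splunk step.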


...

  • Where is the schedule?
    • OpsGenie
    • On Call: Learner Team and Discovery Squad
  • What are the hours? 
  • What do I do if there are certain times I am not available during my rotation?
    • You should arrange this beforehand with your team, your manager, and the Escalations Team Lead.
    • If you miss the alert, and you are secondary, OpsGenie will start pinging the next person scheduled in the rotation, and then the next, and then the next ...
  • What do I do if I don't know how to help directly?
    • You should still acknowledge the alert, and try to figure out whether the resolution can wait.
    • If it can wait, just wait until business hours.
    • If it cannot wait, go online on HipChat and seek help.
    • Keep in mind, we should have a member in Lahore trying to investigate.
    • If worst comes to worst, learn as much as you can. Reach out up the chain of leadership within the Learner team. Escalate through your manager and the Escalations Team Lead.
  • On-call Frequently Asked Questions
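
The missed-alert behaviour described in the FAQ above (the next person scheduled in the rotation, and then the next) can be pictured with a short sketch. This illustrates the idea only and is not how OpsGenie is actually configured: the function `escalation_order`, the example names, and the assumption that the rotation wraps around once are all hypothetical.

```python
from typing import Iterator, List


def escalation_order(rotation: List[str], missed_by: str) -> Iterator[str]:
    """Yield who gets pinged after `missed_by`, in rotation order.

    Assumes the rotation wraps around and that each person is pinged at
    most once before the alert has gone through everyone.
    """
    start = rotation.index(missed_by)  # raises ValueError if not in the rotation
    for offset in range(1, len(rotation)):
        yield rotation[(start + offset) % len(rotation)]


# Example rotation with placeholder names.
rotation = ["alice", "bob", "carol", "dave"]
print(list(escalation_order(rotation, "bob")))
```

With the placeholder rotation above, an alert missed by "bob" goes to "carol", then "dave", then "alice".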