...

Setup for On-Call Rotation and Triage:

...

Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Having a local environment that is up to date will make the triage process easier.

OpsGenie Next Steps:

  • Once you are alerted by OpsGenie, acknowledge the alert to confirm that you received it.
  • Review the alert and try to determine the impact.
    • If the impact is high (e.g. an IDA is down and many users are impacted), notify the Learner-All channel, your manager, and the Escalations Engineering Lead.
      • Review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for next steps.
      • If there is no Run-book:
        • Create a Learner JIRA ticket to track the issue, if one does not already exist.
        • Triage the JIRA ticket immediately and follow the triage process defined above.
        • Contact the Escalations Engineering Lead using the "Customer Requests / Support / Escalation" Channel in HipChat.
    • If the impact is low, close the alert and begin looking at possible resolution steps in the Run-books here: /wiki/spaces/LEARNER/pages/789970979
    • If the impact is unknown, notify the Learner-All channel of the alert and review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for possible resolution steps. Follow the steps above if there is no Run-book available.
  • Monitor and investigate Splunk alerts. The documentation on the Splunk alerts we monitor is at https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts
  • After determining the impact and whether a Run-book is available, make sure to close the alert. The alert will message you again if it is not closed.
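
The decision flow above can be summarized as a small Python sketch. It is only an illustration of the steps as written on this page, not a tool the team runs: the names Impact, RUNBOOKS, and next_steps are hypothetical, and applying the "no Run-book" fallback to low-impact alerts is an assumption, since the page only states it for high and unknown impact.

```python
from __future__ import annotations

from enum import Enum


class Impact(Enum):
    """Impact levels named in the steps above."""
    HIGH = "high"
    LOW = "low"
    UNKNOWN = "unknown"


# Run-book location quoted from this page.
RUNBOOKS = "/wiki/spaces/LEARNER/pages/789970979"


def next_steps(impact: Impact, runbook_exists: bool) -> list[str]:
    """Return the ordered checklist for one alert (hypothetical helper)."""
    steps = ["Acknowledge the alert in OpsGenie"]

    if impact is Impact.HIGH:
        steps.append(
            "Notify the Learner-All channel, your manager, "
            "and the Escalations Engineering Lead"
        )
    elif impact is Impact.UNKNOWN:
        steps.append("Notify the Learner-All channel of the alert")
    else:
        # Low impact: the alert is closed before looking for a resolution.
        steps.append("Close the alert")

    if runbook_exists:
        steps.append(f"Follow the Run-book at {RUNBOOKS}")
    else:
        # Assumption: the same fallback is used for every impact level.
        steps.append("Create a Learner JIRA ticket if one does not already exist")
        steps.append("Triage the JIRA ticket immediately")
        steps.append("Contact the Escalations Engineering Lead in HipChat")

    if impact is not Impact.LOW:
        # An alert that is left open will message the on-call person again.
        steps.append("Close the alert")
    return steps


if __name__ == "__main__":
    for step in next_steps(Impact.HIGH, runbook_exists=False):
        print("-", step)
```

In every branch the checklist begins with acknowledging the alert and includes exactly one step that closes it, which matches the reminder that an unclosed alert will keep messaging you.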

On-call FAQ

  • Where is the schedule?
    • OpsGenie, On Call: Learner Team and Discovery Squad
    • /wiki/spaces/LEARNER/pages/789970979
  • What are the hours? 
  • What do I do if there are certain times I am not available during my rotation?
    • You should arrange coverage beforehand with your team, manager, and the Escalations Team Lead.
    • If you miss the alert and you are secondary, it will start pinging the next person scheduled in the rotation, then the next, and so on.
  • What do I do if I don't know how to help directly?
    • You should still acknowledge the alert, and try to figure out whether the resolution can wait.
    • If it can wait, just wait until business hours.
    • If it cannot wait, go online on HipChat and seek help.
    • Reach out up the chain of leadership within the Learner team.  Escalate through your manager and the Escalations Team Lead.