...
Priority of Ticket | Communication | Stakeholders |
---|---|---|
CAT-1 | | |
CAT-2 | | |
CAT-3+ | | |
Alerts:
Setup for
...
On-Call Rotation and Triage:
- Use the On Call: Learner Team and Discovery Squad page for contact information and details.
- Request an OpsGenie (https://www.opsgenie.com/) account.
- Contact the Escalations team lead or your manager and have them send you an invite.
- Alternatively, file a DevOps ticket to request an account.
- Log in to OpsGenie using SSO: on the login page, choose "Using Single Sign-On? Login via your Identity Provider".
- Download the OpsGenie mobile app for Android (https://play.google.com/store/apps/details?id=com.ifountain.opsgenie&hl=en) or iOS (https://itunes.apple.com/us/app/opsgenie/id528590328?mt=8).
- Configure your alert settings in the mobile app on your phone.
- Contact the Escalations team lead to add you to the rotation schedule.
- Ensure you have access to Splunk (https://splunk.edx.org).
- Review "How to use Splunk" for our various services.
- Get GoCD pipeline access at https://gocd.tools.edx.org/go/auth/login
- After you log in to GoCD, confirm that you have access to the learner pipelines, such as "E-Commerce" (marketing site), ecommerce, credentials, and so on.
Set up services locally:
If there is a CAT-2 bug, the triage team should start looking into it, either by fixing it themselves or by finding the right person to fix it. Make sure you have the following services set up locally.
- edX-platform
- Ecommerce
- Course discovery/catalog
- edX-mktg
This will also help to reproduce/debug the bugs locally.
We should use the Docker-based Devstack for these local environments. Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Keeping your local environment up to date will make the triage process easier.
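As a rough illustration of the setup flow, the sketch below lists the commands a first-time Devstack setup typically involves and prints them without running anything by default. The `make` targets (`dev.clone`, `dev.provision`, `dev.up`) are assumed from the Devstack README and may have changed, so treat the README as the source of truth. The helper names and the `devstack` checkout directory are hypothetical.

```python
import subprocess

DEVSTACK_REPO = "https://github.com/edx/devstack"


def devstack_commands(checkout_dir="devstack"):
    """Return the setup commands in order (make targets are assumptions; verify in the README)."""
    return [
        ["git", "clone", DEVSTACK_REPO, checkout_dir],
        ["make", "-C", checkout_dir, "dev.clone"],      # clone the service repositories
        ["make", "-C", checkout_dir, "dev.provision"],  # provision databases and services
        ["make", "-C", checkout_dir, "dev.up"],         # start the containers
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(devstack_commands())  # dry run: prints the plan only
```

Running the script as-is only prints the plan, which makes it safe to review before executing anything.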
Operational OpsGenie Steps:
- Once OpsGenie alerts you, acknowledge the alert to confirm that you received it.
- If the issue is impacting a lot of learners, please:
- Create a learner JIRA ticket to track it, if one does not already exist.
- Triage the JIRA ticket immediately and follow the triage process defined above.
- Contact the Escalations Lead using the "Customer Requests / Support / Escalation" channel in HipChat.
- Monitor and investigate Splunk alerts. The Splunk alerts we monitor are documented at
https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts
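Alerts are normally acknowledged from the OpsGenie mobile app or web UI. For anyone who wants to script it, here is a minimal sketch that builds (but does not send) an acknowledge request against the OpsGenie Alert REST API. It assumes the default `api.opsgenie.com` host and an API key with alert access. The function name, alert ID, and key below are placeholders, not values from this page.

```python
import json
import urllib.request

# Assumption: default OpsGenie API host; accounts hosted elsewhere use a different host.
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"


def build_ack_request(alert_id, api_key, note="Acknowledged by on-call"):
    """Build a POST request that acknowledges the given alert. Nothing is sent here."""
    body = json.dumps({"note": note}).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPSGENIE_ALERTS_URL}/{alert_id}/acknowledge",
        data=body,
        method="POST",
        headers={
            "Authorization": f"GenieKey {api_key}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values; substitute a real alert ID and API key.
    req = build_ack_request("example-alert-id", "example-api-key")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch safe to run and easy to inspect.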
...
- Where is the schedule?
- OpsGenie
- On Call: Learner Team and Discovery Squad
- What are the hours?
- For production alerts raised by OpsGenie, the primary is on call 24/7, and the secondary is technically also on call 24/7.
- On Call: Learner Team and Discovery Squad
- What do I do if there are certain times I am not available during my rotation?
- Arrange coverage beforehand with your team, your manager, and the Escalations Team Lead.
- If you are secondary and you miss an alert, OpsGenie will start pinging the next person scheduled in the rotation, then the next, and so on.
- What do I do if I don't know how to help directly?
- You should still acknowledge the alert, and try to figure out whether the resolution can wait
- If it can wait, just wait until business hours
- If it cannot wait, go online and seek help in HipChat.
- Keep in mind that a team member in Lahore should also be investigating.
- If worst comes to worst, learn as much as you can about the issue and reach out up the chain of leadership within the Learner team. Escalate through your manager and the Escalations Team Lead.
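The answers above amount to a small decision procedure. The sketch below encodes it as an illustration only. The function names, parameters, and returned strings are hypothetical, and the paging order is a simplified model of what the rotation answer describes, not OpsGenie's actual escalation policy.

```python
def paging_order(rotation, missed_by):
    """People paged, in order, after `missed_by` misses an alert (simplified model, no wrap-around)."""
    position = rotation.index(missed_by)
    return rotation[position + 1:]


def next_action(can_help_directly, can_wait, help_found_in_chat=False):
    """Return the suggested next step. Every path starts by acknowledging the alert."""
    if can_help_directly:
        return "acknowledge, then resolve"
    if can_wait:
        return "acknowledge, then wait until business hours"
    if help_found_in_chat:
        return "acknowledge, then work with whoever responded in HipChat"
    return "acknowledge, then escalate through your manager and the Escalations Team Lead"


if __name__ == "__main__":
    # Hypothetical rotation members.
    print(paging_order(["primary", "secondary", "third"], "secondary"))
    print(next_action(can_help_directly=False, can_wait=False))
```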
- On-call Frequently Asked Questions