...
Setup for On-Call Rotation and Triage:
- Use the "On Call: Learner Team and Discovery Squad" page (/wiki/spaces/LEARNER/pages/789970979) for contact information and details.
- Request an OpsGenie (https://www.opsgenie.com/) account.
- Contact the Escalations team lead or your manager and have them send you an invite.
- Alternatively, file a devops ticket to request an invite.
- Log in to OpsGenie using SSO: on the login page, choose "Using Single Sign-On? Login via your Identity Provider".
- Download the OpsGenie mobile app for Android (https://play.google.com/store/apps/details?id=com.ifountain.opsgenie&hl=en) or iOS (https://itunes.apple.com/us/app/opsgenie/id528590328?mt=8).
- In the mobile app, set up your alert settings for your phone.
- Contact the Escalations team lead to add you to the rotation schedule.
- Ensure you have access to Splunk (https://splunk.edx.org).
- Review How to use Splunk for our various services.
- Get GoCD pipeline access at https://gocd.tools.edx.org/go/auth/login
- After you log in to GoCD, ensure you have access to the learner pipelines such as "E-Commerce" (marketing site), ecommerce, credentials, and so on.
...
Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Having a local environment that is up to date will make the triage process easier.
Operational OpsGenie Steps
- Once you are alerted by OpsGenie, confirm that you received the alert. If the issue is impacting a lot of learners, Acknowledge the alert.
- Review the alert and try to determine the impact (see https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts).
- If the impact is high (e.g. an IDA is down and many users are impacted) notify the Learner-All channel, your Manager, and the Escalations Engineering Lead.
- Review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for next steps.
- If there is no Run-book:
- Create a learner JIRA ticket to track the issue, if one does not already exist.
- Triage the JIRA ticket immediately and follow the triage process defined above
- Contact the Escalations Engineering Lead using the "Customer Requests / Support / Escalation" Channel in HipChat.
- If the impact is low, Close the alert and begin looking at possible resolution steps in the Run-books here: /wiki/spaces/LEARNER/pages/789970979
- If the impact is unknown, notify the Learner-All channel of the alert and review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for possible resolution steps. Follow the steps above if there is no Run-book available.
- After determining the impact and whether a Run-book is available, make sure to Close the alert. The alert will message you again if it is not closed.
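The steps above can be read as a small decision flow. The following is a minimal sketch of one such reading in Python, not an official tool: the names `Impact` and `triage_steps` are hypothetical. The sketch places the Close step right after the notifications, and it assumes the "no Run-book" fallback applies at every impact level. Both are interpretations of the list above.

```python
from enum import Enum
from typing import List


class Impact(Enum):
    HIGH = "high"
    LOW = "low"
    UNKNOWN = "unknown"


def triage_steps(impact: Impact, runbook_available: bool) -> List[str]:
    """Return the ordered actions for one alert (hypothetical helper)."""
    steps = ["Acknowledge the alert"]

    # Notification depends on the impact determined from the alert.
    if impact is Impact.HIGH:
        steps.append(
            "Notify the Learner-All channel, your manager, "
            "and the Escalations Engineering Lead"
        )
    elif impact is Impact.UNKNOWN:
        steps.append("Notify the Learner-All channel of the alert")

    # Assumption: close once impact and Run-book availability are known,
    # so that OpsGenie stops re-sending the alert.
    steps.append("Close the alert")

    if runbook_available:
        steps.append("Follow the resolution steps in the Run-book")
    else:
        steps.extend([
            "Create a learner JIRA ticket if one does not already exist",
            "Triage the JIRA ticket immediately",
            "Contact the Escalations Engineering Lead in HipChat",
        ])
    return steps
```

For example, `triage_steps(Impact.LOW, True)` yields three actions: acknowledge, close, then follow the Run-book.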
On-call FAQ
- Where is the schedule?
- OpsGenie
- On Call: Learner Team and Discovery Squad: /wiki/spaces/LEARNER/pages/789970979
- What are the hours?
- For production alerts raised by OpsGenie, primary is 24/7; secondary is technically also 24/7.
- On Call: Learner Team and Discovery Squad: /wiki/spaces/LEARNER/pages/789970979
- What do I do if there are certain times I am not available during my rotation?
- Arrange this beforehand with your team, your manager, and the Escalations Team Lead.
- If you miss an alert and you are secondary, OpsGenie will start pinging the next person scheduled in the rotation, then the next, and so on.
- What do I do if I don't know how to help directly?
- You should still acknowledge the alert and try to figure out whether the resolution can wait.
- If it can wait, wait until business hours.
- If it cannot wait, go online on HipChat and seek help.
- Reach out up the chain of leadership within the Learner team. Escalate through your manager and the Escalations Team Lead.
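The FAQ notes that a missed alert moves on to the next person in the rotation, then the next. As an illustration only, that ordering can be sketched as a rotation of a list. The function name `notification_order` and the example names are hypothetical, and wrapping around to the start of the list is an assumption. The actual order is whatever the team's OpsGenie escalation policy defines.

```python
from typing import List


def notification_order(rotation: List[str], first: str) -> List[str]:
    """Return the order in which people would be pinged, starting at `first`.

    Hypothetical illustration: assumes the rotation wraps around once.
    """
    start = rotation.index(first)  # raises ValueError if `first` is not scheduled
    return rotation[start:] + rotation[:start]


# Example with made-up names: if "carol" misses the alert, "dave" is next.
order = notification_order(["alice", "bob", "carol", "dave"], "carol")
```

Here `order` is `["carol", "dave", "alice", "bob"]`.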