Triage Process:
Spartans Triage Board https://openedx.atlassian.net/secure/RapidBoard.jspa?rapidView=419&projectKey=LEARNER
We are updating the following flow with a flow chart and greater detail.
For each ticket in the Needs Triage queue:
- Read the ticket and determine if there is enough information from the reporter to begin an investigation.
- If there is not enough information, contact the reporter and move the ticket to the "Blocked Waiting on Reporter" queue.
- The reporter is expected to provide steps to reproduce the issue.
- Attempt to reproduce the issue
- If the issue can be reproduced, document the steps to reproduce if they were not provided by the reporter
- If the issue cannot be reproduced, message the reporter and move the ticket to the "Blocked Waiting on Reporter" queue until more information is available.
- Evaluate the issue's impact on our learners and the edX organization.
- Use Splunk, New Relic, and other tools to evaluate the frequency and the user experience impact.
- Use this impact to assist with defining a Priority (e.g. CAT-1→CAT-5)
- Identify and communicate a potential workaround to use until the issue is resolved.
- Identify if the issue is related to ongoing feature work and if the ticket should be assigned to a member of another engineering team
- Assign a priority to the issue. See the edX Bug Fix Policy for what each CAT number means.
- Use the following Communication Plan for actions related to each CAT number.
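The triage steps above can be read as a small decision procedure. The sketch below is only an illustration of that reading, not an official tool: the `Ticket` class, its fields, and `next_triage_step` are hypothetical names, and only the queue name is taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Queue name as used in the triage steps on this page.
BLOCKED_QUEUE = "Blocked Waiting on Reporter"


@dataclass
class Ticket:
    """Hypothetical, minimal view of a ticket in the Needs Triage queue."""
    has_enough_info: bool
    reproducible: Optional[bool] = None  # None until reproduction is attempted
    priority: Optional[str] = None       # e.g. "CAT-3", set after impact evaluation


def next_triage_step(ticket: Ticket) -> str:
    """Return the next action for a ticket, following the steps in order."""
    if not ticket.has_enough_info:
        return f"Contact the reporter and move the ticket to '{BLOCKED_QUEUE}'"
    if ticket.reproducible is None:
        return "Attempt to reproduce the issue"
    if not ticket.reproducible:
        return f"Message the reporter and move the ticket to '{BLOCKED_QUEUE}'"
    if ticket.priority is None:
        return "Evaluate impact, look for a workaround, and assign a CAT priority"
    return f"Follow the Communication Plan for {ticket.priority}"


if __name__ == "__main__":
    print(next_triage_step(Ticket(has_enough_info=True, reproducible=True)))
```

The function deliberately stops short of choosing a CAT number, since that depends on the impact evaluation and the edX Bug Fix Policy rather than on any rule stated here.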
Escalations Actions After Triage
Priority | Description |
---|---|
CAT-1 and CAT-2 | |
CAT-3 | |
CAT-4 and CAT-5 | |
Communication Plan
Priority of Ticket | Communication | Stakeholders |
---|---|---|
CAT-1 | | |
CAT-2 | | |
CAT-3+ | | |
Alerts:
Setup for On-Call Rotation and Triage:
- Use the /wiki/spaces/LEARNER/pages/789970979 page for contact information and details.
- Request an OpsGenie (https://www.opsgenie.com/) account.
- Contact the Escalations team lead or your manager and have them send you an invite.
- Alternatively, file a DevOps ticket to request an account.
- Log in to OpsGenie using SSO: on the login page, choose "Using Single Sign-On? Login via your Identity Provider".
- Download the OpsGenie mobile app for Android (https://play.google.com/store/apps/details?id=com.ifountain.opsgenie&hl=en) or iOS (https://itunes.apple.com/us/app/opsgenie/id528590328?mt=8).
- Configure your alert settings in the mobile app on your phone.
- Contact the Escalations team lead to add you to the rotation schedule.
- Ensure you have access to Splunk (https://splunk.edx.org).
- Review How to use Splunk for our various services.
- Get GoCD pipeline access at https://gocd.tools.edx.org/go/auth/login
- After you log in to GoCD, ensure you have access to the learner pipelines such as "E-Commerce" (marketing site), ecommerce, credentials, and so on.
Setup services locally:
Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Having a local environment that is up to date will make the triage process easier.
Operational OpsGenie Steps:
- Once you are alerted by OpsGenie, please Acknowledge the alert.
- Review the alert and try to determine the impact
- If the impact is high (e.g. an IDA is down and many users are impacted) notify the Learner-All channel, your Manager, and the Escalations Engineering Lead.
- Review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for next steps
- If there is no Run-book
- Create a Learner JIRA ticket to track the alert, if one does not already exist
- Triage the JIRA ticket immediately and follow the triage process defined above
- Contact the Escalations Engineering Lead using the "Customer Requests / Support / Escalation" Channel in HipChat.
- If the impact is low, Close the alert and begin looking at possible resolution steps in the Run-books here: /wiki/spaces/LEARNER/pages/789970979
- If the impact is unknown, notify the Learner-All channel of the alert and review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for possible resolution steps. Follow the steps above if there is no Run-book available.
- After determining the impact and whether a Run-book is available, make sure to Close the alert. The alert will message you again if it is not closed.
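As a quick reference, the alert steps above can be sketched as a function from impact level to an ordered checklist. This is one possible reading of the list, not an official procedure: `Impact`, `NO_RUNBOOK_STEPS`, and `alert_actions` are hypothetical names, and the low-impact branch assumes no extra steps when a Run-book is missing, because the page does not state any.

```python
from enum import Enum
from typing import List


class Impact(Enum):
    HIGH = "high"
    LOW = "low"
    UNKNOWN = "unknown"


# Steps listed on this page for the case where no Run-book exists.
NO_RUNBOOK_STEPS = [
    "Create a Learner JIRA ticket if one does not already exist",
    "Triage the JIRA ticket immediately using the triage process",
    "Contact the Escalations Engineering Lead in the escalation HipChat channel",
]


def alert_actions(impact: Impact, has_runbook: bool) -> List[str]:
    """Return an ordered checklist for one OpsGenie alert (illustrative only)."""
    actions = ["Acknowledge the alert"]
    if impact is Impact.LOW:
        # Low impact: close first, then look for resolution steps.
        actions.append("Close the alert")
        actions.append("Review the Run-books for possible resolution steps")
        return actions
    if impact is Impact.HIGH:
        actions.append(
            "Notify the Learner-All channel, your manager, "
            "and the Escalations Engineering Lead"
        )
    else:
        actions.append("Notify the Learner-All channel of the alert")
    if has_runbook:
        actions.append("Follow the Run-book")
    else:
        actions.extend(NO_RUNBOOK_STEPS)
    # The alert keeps messaging until it is closed.
    actions.append("Close the alert")
    return actions


if __name__ == "__main__":
    for step in alert_actions(Impact.UNKNOWN, has_runbook=False):
        print("-", step)
```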
On-call FAQ
- Where is the schedule?
- OpsGenie
- /wiki/spaces/LEARNER/pages/789970979
- What are the hours?
- For production alerts raised by OpsGenie, primary is 24/7; secondary is technically also 24/7.
- /wiki/spaces/LEARNER/pages/789970979
- What do I do if there are certain times I am not available during my rotation?
- Arrange this beforehand with your team, your manager, and the Escalations Team Lead.
- If you are secondary and miss the alert, it will start pinging the next person scheduled in the rotation, then the next, and so on.
- What do I do if I don't know how to help directly?
- You should still acknowledge the alert, and try to figure out whether the resolution can wait
- If it can wait, just wait until business hours
- If it cannot wait, go online on HipChat and seek help.
- Reach out up the chain of leadership within the Learner team. Escalate through your manager and the Escalations Team Lead.
- On-call Frequently Asked Questions