Triage Process:
Spartans Triage Board https://openedx.atlassian.net/secure/RapidBoard.jspa?rapidView=419&projectKey=LEARNER
We are in the process of updating the following flow with a flowchart and greater detail.
For each ticket in the Needs Triage queue:
- Read the ticket and determine if there is enough information from the reporter to begin an investigation.
- If there is not enough information, contact the reporter and move the ticket to the "Blocked Waiting on Reporter" queue.
- The reporter is expected to be able to provide steps to reproduce the issue.
- Attempt to reproduce the issue.
- If the issue can be reproduced, document the steps to reproduce if the reporter did not provide them.
- If the issue cannot be reproduced, message the reporter and move the ticket to the "Blocked Waiting on Reporter" queue until more information is available.
- Evaluate the impact of the issue on our learners and the edX organization.
- Use Splunk, New Relic, and other tools to evaluate the frequency and user-experience impact.
- Use this impact to help define a priority (e.g. CAT-1→CAT-5).
- Identify and communicate a potential workaround to use until the issue is resolved.
- Identify whether the issue is related to ongoing feature work and whether the ticket should be assigned to a member of another engineering team.
- Assign a priority to the issue. See the edX Bug Fix Policy for information on what each CAT number means.
- Use the following Communication Plan for actions related to each CAT number.
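The queue decisions in the steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of any edX tooling: the `Ticket` fields and the `READY_FOR_PRIORITIZATION` queue name are hypothetical, and only the "Needs Triage" and "Blocked Waiting on Reporter" queue names come from this page.

```python
from dataclasses import dataclass
from enum import Enum


class Queue(Enum):
    NEEDS_TRIAGE = "Needs Triage"
    BLOCKED_ON_REPORTER = "Blocked Waiting on Reporter"
    # Hypothetical name: this page does not say where a ticket goes next.
    READY_FOR_PRIORITIZATION = "Ready for Prioritization"


@dataclass
class Ticket:
    has_enough_info: bool  # the reporter gave enough detail to investigate
    reproducible: bool     # the triager could reproduce the issue


def next_queue(ticket: Ticket) -> Queue:
    """Return the queue a ticket should move to after the first triage pass."""
    if not ticket.has_enough_info:
        # Contact the reporter and wait for more detail.
        return Queue.BLOCKED_ON_REPORTER
    if not ticket.reproducible:
        # Message the reporter and wait until more information is available.
        return Queue.BLOCKED_ON_REPORTER
    # Reproduced: go on to impact evaluation and assigning a CAT priority.
    return Queue.READY_FOR_PRIORITIZATION
```

For example, `next_queue(Ticket(has_enough_info=True, reproducible=False))` returns `Queue.BLOCKED_ON_REPORTER`, which matches the "cannot be reproduced" step.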
Escalation Actions After Triage
Priority | Description |
---|---|
CAT-1 and CAT-2 | |
CAT-3 | |
CAT-4 and CAT-5 | |
Communication Plan
Priority of Ticket | Communication | Stakeholders |
---|---|---|
CAT-1 | | |
CAT-2 | | |
CAT-3+ | | |
Alerts:
Setup for triage:
- Get an OpsGenie (https://www.opsgenie.com/) account.
- Have Albert (AJ) St. Aubin (Deactivated), Bill DeRusha (Deactivated) or Mike Dikan (Deactivated) send you an invite.
- Alternatively, you can file a DevOps ticket to get an account.
- Log in to OpsGenie using SSO: on the login page, choose "Using Single Sign-On? Login via your Identity Provider".
- Download the OpsGenie mobile app for Android (https://play.google.com/store/apps/details?id=com.ifountain.opsgenie&hl=en) or iOS (https://itunes.apple.com/us/app/opsgenie/id528590328?mt=8).
- Set up your alert settings in the mobile app on your phone.
- Have Albert (AJ) St. Aubin (Deactivated) set you up in the rotation.
- Ensure you have access to Splunk (https://splunk.edx.org).
- Review How to use Splunk for our various services.
- Get GoCD pipeline access at https://gocd.tools.edx.org/go/auth/login
- After you log in to GoCD, ensure you have access to the learner pipelines, such as "E-Commerce" (marketing site), ecommerce, credentials, and so on.
Set up services locally:
If there is a CAT-2 bug, the triage team should start looking into it, either fixing it themselves or finding the right person to fix it. Make sure you have the following services set up locally.
- edX-platform
- Ecommerce
- Course discovery/catalog
- edX-mktg
This will also help to reproduce/debug the bugs locally.
We should use the Docker-based devstack to set up these local environments. See https://www.github.com/edx/devstack
Operational:
- Once you are alerted by OpsGenie, please acknowledge the alert to confirm that you received it.
- If the issue is impacting a lot of learners, please:
- Create a learner JIRA ticket to track it, if one does not already exist.
- Triage the JIRA ticket immediately and follow the triage process defined above.
- Monitor and investigate Splunk alerts. The documentation on the Splunk alerts we monitor is at
https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts
On-call FAQ
- Where is the schedule?
- OpsGenie
- What are the hours?
- For production alerts raised by OpsGenie, primary is 24/7; secondary is technically also 24/7.
- What do I do if there are certain times I am not available during my rotation?
- You should make arrangements beforehand.
- If you miss an alert and you are secondary, the alert will start pinging the next person scheduled in the rotation, then the next, and so on.
- What do I do if I don't know how to help directly?
- You should still acknowledge the alert and try to figure out whether the resolution can wait.
- If it can wait, wait until business hours.
- If it cannot wait, go online and seek help on HipChat.
- Keep in mind that we should have a member in Lahore trying to investigate.
- If worst comes to worst, learn as much as you can.
- On-call Frequently Asked Questions
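The FAQ guidance on what to do when you don't know how to help directly can be sketched as a small decision function. This is a hypothetical sketch for illustration only: the function name and action strings are made up, and the "resolve the issue" branch is an assumption about the case the FAQ does not spell out.

```python
from typing import List


def on_call_actions(knows_how_to_help: bool, can_wait: bool) -> List[str]:
    """List the actions an on-call engineer takes for one alert."""
    # Every alert is acknowledged first, whether or not you can fix it.
    actions = ["acknowledge the alert"]
    if knows_how_to_help:
        # Assumption: not stated in the FAQ, included for completeness.
        actions.append("resolve the issue")
    elif can_wait:
        actions.append("wait until business hours")
    else:
        actions.append("go online and seek help on HipChat")
    return actions
```

Calling `on_call_actions(knows_how_to_help=False, can_wait=True)` returns `["acknowledge the alert", "wait until business hours"]`.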