Triage Process:

Spartans Triage Board https://openedx.atlassian.net/secure/RapidBoard.jspa?rapidView=419&projectKey=LEARNER

We are updating the following flow with a flowchart and more detail.


For each ticket in the Needs Triage queue:

  • Read the ticket and determine whether the reporter has provided enough information to begin an investigation.
    • If there is not enough information, contact the reporter and move the ticket to the "Blocked Waiting on Reporter" queue.
    • The reporter is expected to provide steps to reproduce the issue.
  • Attempt to reproduce the issue.
    • If the issue can be reproduced, document the steps to reproduce if the reporter did not provide them.
    • If the issue cannot be reproduced, message the reporter and move the ticket to the "Blocked Waiting on Reporter" queue until more information is available.
  • Evaluate the impact of the issue on our learners and the edX organization.
    • Use Splunk, New Relic, and other tools to evaluate the frequency and the user-experience impact.
    • Use this impact to help define a priority (e.g. CAT-1→CAT-5).
  • Identify and communicate a potential workaround to use until the issue is resolved.
  • Determine whether the issue is related to ongoing feature work and whether the ticket should be assigned to a member of another engineering team.
  • Assign a priority to the issue. See the edX Bug Fix Policy for what each CAT number means.
  • Follow the Communication Plan below for the actions related to each CAT number.
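
The first two steps above (check the information, then try to reproduce) can be sketched as a small decision function. This is only an illustration of the flow: the `Ticket` fields, the `Outcome` names, and `first_triage_pass` are hypothetical and do not correspond to any JIRA API.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    BLOCKED_WAITING_ON_REPORTER = "Blocked Waiting on Reporter"
    READY_FOR_PRIORITY = "Ready for impact evaluation and CAT assignment"


@dataclass
class Ticket:
    # Hypothetical fields, filled in by the person doing triage.
    has_enough_info: bool
    can_reproduce: bool
    has_repro_steps: bool = False


def first_triage_pass(ticket):
    """Return (outcome, follow-up actions) for the first two triage steps."""
    if not ticket.has_enough_info:
        return Outcome.BLOCKED_WAITING_ON_REPORTER, ["contact the reporter"]
    if not ticket.can_reproduce:
        return Outcome.BLOCKED_WAITING_ON_REPORTER, ["message the reporter"]
    actions = []
    if not ticket.has_repro_steps:
        # The reporter did not supply steps, so triage documents them.
        actions.append("document the steps to reproduce")
    return Outcome.READY_FOR_PRIORITY, actions


if __name__ == "__main__":
    outcome, actions = first_triage_pass(
        Ticket(has_enough_info=True, can_reproduce=True)
    )
    print(outcome.value, actions)
```

Tickets that come back as ready then continue with the impact evaluation, workaround, and priority steps in the list.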

Escalations Actions After Triage

Priority | Description
CAT-1 and CAT-2
  • Move the ticket immediately to the Prioritized queue.
CAT-3
  • Move to the Grooming queue to be groomed and then prioritized.
CAT-4 and CAT-5
  • Move to the Grooming queue to be groomed and then prioritized.
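
As a quick reference, the routing above reduces to a single threshold on the CAT number. The function name `queue_after_triage` is made up for this sketch. The queue names are taken from the table.

```python
def queue_after_triage(cat):
    """Map a CAT number (1 to 5) to the queue the ticket moves to after triage."""
    if cat not in range(1, 6):
        raise ValueError("CAT number must be between 1 and 5, got %r" % (cat,))
    # CAT-1 and CAT-2 skip grooming; everything else is groomed first.
    return "Prioritized" if cat <= 2 else "Grooming"


if __name__ == "__main__":
    for cat in range(1, 6):
        print("CAT-%d -> %s" % (cat, queue_after_triage(cat)))
```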

Communication Plan

Priority of Ticket | Communication | Stakeholders
CAT-1
  Communication:
  • The JIRA ticket should be the primary communication channel.
  • HipChat: Post an immediate message in the War Room channel and the Escalations channel, including:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Send an immediate email with the following information:
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Post updates to the above two channels as details unfold.
    • Send an email at least every 24 hours.
  • Send a final email and message once the issue is resolved.
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-2
  Communication:
  • The JIRA ticket should be the primary communication channel.
  • HipChat: Post a message only if the issue impacts a large number of learners or educators, including:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Send a follow-up email once the issue is triaged and isolated. This can also wait until after the issue is resolved.
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Send a final email and message once the issue is resolved.
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-3+
  • All communication should take place in the JIRA ticket.
  • Limit email to communicating potential engineering improvements that would avoid repeat issues.
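
The plan above can be summarized as a lookup from CAT number to channels. This is a rough sketch only: `communication_channels` and its `wide_impact` flag are hypothetical names, and the returned strings are shorthand for the bullets in the plan, not identifiers in any tool.

```python
def communication_channels(cat, wide_impact=False):
    """Return the channels to use for a ticket with the given CAT number.

    wide_impact is an assumed flag meaning "a large number of learners
    or educators are affected"; it only matters for CAT-2.
    """
    if cat not in range(1, 6):
        raise ValueError("CAT number must be between 1 and 5, got %r" % (cat,))
    channels = ["JIRA ticket"]  # always the primary channel
    if cat == 1:
        channels.append("HipChat: War Room and Escalations channels")
        channels.append("immediate email to status@edx.org")
    elif cat == 2:
        if wide_impact:
            channels.append("HipChat")
        channels.append("follow-up email to status@edx.org")
    # CAT-3 and lower: JIRA only; email is reserved for engineering improvements.
    return channels


if __name__ == "__main__":
    print(communication_channels(2, wide_impact=True))
```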

Alerts:

Setup for triage:

Setup services locally:

If there is a CAT-2 bug, the triage team should start looking into it, either fixing it themselves or finding the right person for it. Make sure you have the following services set up locally:

  • edX-platform
  • Ecommerce
  • Course discovery/catalog
  • edX-mktg

This also helps you reproduce and debug bugs locally.

Use the Docker-based devstack to set up these local environments. See https://www.github.com/edx/devstack

Operational:

  • Once you are alerted by OpsGenie, acknowledge the alert to confirm that you received it.
  • If the issue is impacting a lot of learners:
    • Create a learner JIRA ticket to track it, if one does not already exist.
    • Triage the JIRA ticket immediately and follow the triage process defined above.
  • Monitor and investigate Splunk alerts. The Splunk alerts we monitor are documented at https://openedx.atlassian.net/wiki/display/ECOM/Splunk+Alerts


On-call FAQ

  • Where is the schedule?
    • OpsGenie
  • What are the hours?
    • For production alerts raised by OpsGenie, primary is 24/7; secondary is technically also 24/7.
  • What do I do if there are certain times I am not available during my rotation?
    • Arrange coverage beforehand.
    • If you miss an alert and you are secondary, OpsGenie starts pinging the next person scheduled in the rotation, then the next, and so on.
  • What do I do if I don't know how to help directly?
    • You should still acknowledge the alert and try to figure out whether the resolution can wait.
    • If it can wait, wait until business hours.
    • If it cannot wait, go online and seek help on HipChat.
    • Keep in mind that a team member in Lahore should also be trying to investigate.
    • If worst comes to worst, learn as much as you can.