Triage Process:

Spartans Triage Board https://openedx.atlassian.net/secure/RapidBoard.jspa?rapidView=419&projectKey=LEARNER

We are updating the following flow with a flowchart and greater detail.


For each ticket in the Needs Triage queue:

  • Read the ticket and determine if there is enough information from the reporter to begin an investigation.
    • If there is not enough information, contact the reporter and move the ticket to the "Blocked Waiting on Reporter" queue.
    • The reporter is expected to provide steps to reproduce the issue.
  • Attempt to reproduce the issue.
    • If the issue can be reproduced, document the steps to reproduce if they were not provided by the reporter.
    • If the issue cannot be reproduced, message the reporter and move the ticket to the "Blocked Waiting on Reporter" queue until more information is available.
  • Evaluate the issue's impact on our learners and the edX organization.
    • Use Splunk, New Relic, and other tools to evaluate the frequency and the user-experience impact.
    • Use this impact to help define a priority (CAT-1 through CAT-5).
  • Identify and communicate a potential workaround to use until the issue is resolved.
  • Identify whether the issue is related to ongoing feature work and whether the ticket should be assigned to a member of another engineering team.
  • Assign a priority to the issue. See the edX Bug Fix Policy for what each CAT number means.
  • Use the following Communication Plan for actions related to each CAT number.
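
The first two checks in the flow above can be sketched as a small decision function. This is only an illustration, not part of any tooling: the `Ticket` fields and the function name are hypothetical, and the queue name is copied from this page rather than read from JIRA.

```python
from dataclasses import dataclass

# Queue name as written on this page.
BLOCKED_QUEUE = "Blocked Waiting on Reporter"


@dataclass
class Ticket:
    # Hypothetical fields, for illustration only.
    has_enough_info: bool
    reproduced: bool


def next_triage_step(ticket: Ticket) -> str:
    """Return the next action for a ticket in the Needs Triage queue."""
    if not ticket.has_enough_info:
        return f"contact reporter; move to {BLOCKED_QUEUE}"
    if not ticket.reproduced:
        return f"message reporter; move to {BLOCKED_QUEUE}"
    return "evaluate impact and assign a CAT priority"
```

For example, `next_triage_step(Ticket(has_enough_info=True, reproduced=False))` sends the ticket back to the reporter instead of on to impact evaluation.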

Escalations Actions After Triage

Priority | Description
CAT-1 and CAT-2
  • Move the ticket immediately to the Prioritized queue.
CAT-3
  • Move to the Grooming queue to be groomed and then prioritized.
CAT-4 and CAT-5
  • Move to the Grooming queue to be groomed and then prioritized.
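
The routing in this table reduces to a single lookup. The sketch below assumes the priority is held as an integer from 1 to 5; the function name is hypothetical and the queue names are the ones used on this page.

```python
def queue_after_triage(cat: int) -> str:
    """Return the queue a triaged ticket moves to, given its CAT number."""
    if cat not in range(1, 6):
        raise ValueError(f"CAT number must be 1 through 5, got {cat}")
    # CAT-1 and CAT-2 skip grooming; CAT-3 through CAT-5 are groomed first.
    return "Prioritized" if cat <= 2 else "Grooming"
```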

Communication Plan

Priority of Ticket | Communication | Stakeholders
CAT-1
  Communication:
  • The primary communication channel should be the JIRA ticket.
  • HipChat: Immediately post a message in the War Room channel and the Escalations channel with:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Immediately send an email with the following information:
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Post updates to the above two channels as details unfold.
    • Send an email at least every 24 hours.
  • Send a final email and message once the issue is resolved.
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-2
  Communication:
  • The primary communication channel should be the JIRA ticket.
  • HipChat: Post a message only if a large number of learners or educators are impacted. Include:
    • Description of the issue
    • Ticket that was created to track the issue
    • Current impact of the issue
  • Send a follow-up email once the ticket is triaged and the issue is isolated. This can also wait until after the issue is resolved.
    • Summary
    • Ticket
    • Current impact
    • Next expected status
    • Recipients: status@edx.org, CC stakeholders
  • Send a final email and message once the issue is resolved.
  Stakeholders:
  • Escalations Team Lead
  • Escalations Product
  • DevOps
  • Learner and Educator Leads
CAT-3+
  Communication:
  • All communication should take place in the JIRA ticket.
  • Limit email to communicating potential engineering improvements that would prevent repeat issues.
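
The status email fields listed in the plan above can be assembled with the standard library. This is a minimal sketch, not an existing team tool: the function name, the subject format, and the example ticket key are assumptions; only the four body fields and the status@edx.org recipient come from this page.

```python
from email.message import EmailMessage

STATUS_ADDRESS = "status@edx.org"  # recipient named in the communication plan


def build_status_email(summary, ticket, impact, next_status, cc=()):
    """Build (but do not send) a status email with the four required fields."""
    msg = EmailMessage()
    msg["To"] = STATUS_ADDRESS
    if cc:
        msg["Cc"] = ", ".join(cc)  # stakeholders are CC'd
    # Subject format is an assumption; the page does not prescribe one.
    msg["Subject"] = f"[{ticket}] {summary}"
    msg.set_content(
        f"Summary: {summary}\n"
        f"Ticket: {ticket}\n"
        f"Current impact: {impact}\n"
        f"Next expected status: {next_status}\n"
    )
    return msg


# Hypothetical values, for illustration only.
example = build_status_email(
    summary="Example outage",
    ticket="LEARNER-0000",
    impact="Example impact statement",
    next_status="Next update within 24 hours",
    cc=["lead@example.com"],
)
```

Building the message separately from sending it keeps the required fields in one place, so the same function can serve the initial email, the periodic updates, and the final resolution email.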

Alerts:

Setup for On-Call Rotation and Triage:

Set up services locally:

Setting up the local Devstack will help you prepare for triage work you may need to do. Follow the Devstack setup instructions (https://github.com/edx/devstack). Having a local environment that is up to date will make the triage process easier.

Operational OpsGenie Steps:

  • Once you are alerted by OpsGenie, please Acknowledge the alert.
  • Review the alert and try to determine the impact.
    • If the impact is high (e.g. an IDA is down and many users are impacted), notify the Learner-All channel, your manager, and the Escalations Engineering Lead.
      • Review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for next steps.
      • If there is no Run-book:
        • Create a Learner JIRA ticket to track the issue, if one does not already exist.
        • Triage the JIRA ticket immediately and follow the triage process defined above.
        • Contact the Escalations Engineering Lead using the "Customer Requests / Support / Escalation" channel in HipChat.
    • If the impact is low, Close the alert and begin looking at possible resolution steps in the Run-books here: /wiki/spaces/LEARNER/pages/789970979
    • If the impact is unknown, notify the Learner-All channel of the alert and review the Run-books here: /wiki/spaces/LEARNER/pages/789970979 for possible resolution steps. Follow the steps above if there is no Run-book available.
  • After determining the impact and whether a Run-book is available, please make sure to Close the alert. The alert will message you again if it is not closed.
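
The branching in the steps above can be written out as a checklist generator. This is a sketch only: the `Impact` enum, the function name, and the step wording are hypothetical, and nothing here calls OpsGenie or HipChat. It simply returns the ordered actions this page describes for each impact level.

```python
from enum import Enum
from typing import List


class Impact(Enum):
    HIGH = "high"
    LOW = "low"
    UNKNOWN = "unknown"


# Steps this page lists for the case where no run-book exists.
NO_RUNBOOK_STEPS = [
    "create a Learner JIRA ticket if none exists",
    "triage the ticket immediately",
    "contact the Escalations Engineering Lead in HipChat",
]


def alert_steps(impact: Impact, has_runbook: bool) -> List[str]:
    """Return the ordered on-call actions for an OpsGenie alert."""
    steps = ["acknowledge the alert"]
    if impact is Impact.LOW:
        # Low impact: close first, then look for resolution steps.
        steps.append("close the alert")
        steps.append("review the run-books for possible resolution steps")
        return steps
    if impact is Impact.HIGH:
        steps.append(
            "notify Learner-All, your manager, and the Escalations Engineering Lead"
        )
    else:  # Impact.UNKNOWN
        steps.append("notify Learner-All")
    if has_runbook:
        steps.append("follow the run-book")
    else:
        steps.extend(NO_RUNBOOK_STEPS)
    steps.append("close the alert")
    return steps
```

Every path ends with the alert closed, which matches the note above that an unclosed alert will message you again.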

On-call FAQ

  • Where is the schedule?
    • OpsGenie
    • /wiki/spaces/LEARNER/pages/789970979
  • What are the hours? 
  • What do I do if there are certain times I am not available during my rotation?
    • You should arrange this beforehand with your team, your manager, and the Escalations Team Lead.
    • If you miss the alert and you are the secondary, the alert will start pinging the next person scheduled in the rotation, then the next, and so on.
  • What do I do if I don't know how to help directly?
    • You should still acknowledge the alert and try to figure out whether the resolution can wait.
    • If it can wait, wait until business hours.
    • If it cannot wait, go online and seek help in HipChat.
    • Reach out up the chain of leadership within the Learner team. Escalate through your manager and the Escalations Team Lead.