Number of occurrences
Undetected failures (originally “Anomaly detection of common failures”)
Feb 5, 2020: RCA: PROD-1195 - edx-rest-api-client updates breaking proctoring - Error anomaly detection would have found this much faster and spared the team from having to build exactly the right monitoring in advance.
June 1, 2020: RCA: CR-2252 - Personalized Learner Schedules Outage - We would have found this problem much faster with anomaly detection on per-view error rates.
Note: New Relic AI anomaly detection works only at the app level, not the view level. To catch this without Key Transactions and custom alerts, we would need to design our own custom anomaly detection.
(closed as “unfinished”)
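As a starting point for the custom anomaly detection mentioned in the note above, here is a minimal sketch of a per-view check. It assumes error rates per view are already being collected somewhere; the function name, the z-score approach, the threshold, and the sample view names are all illustrative assumptions, not an existing implementation.

```python
from statistics import mean, stdev


def anomalous_views(history, current, threshold=3.0, min_samples=5):
    """Return views whose current error rate is far above their own baseline.

    history: dict mapping view name -> list of past error rates (0.0 to 1.0)
    current: dict mapping view name -> error rate in the latest window
    All names and defaults here are hypothetical.
    """
    flagged = {}
    for view, rate in current.items():
        past = history.get(view, [])
        if len(past) < min_samples:
            continue  # not enough baseline data to judge this view
        baseline = mean(past)
        # Floor the spread so a perfectly flat baseline does not divide by zero.
        spread = max(stdev(past), 0.001)
        score = (rate - baseline) / spread
        if score > threshold:
            flagged[view] = round(score, 1)
    return flagged


# Made-up sample data: view_a spikes, view_b stays near its baseline.
history = {
    "view_a": [0.010, 0.012, 0.009, 0.011, 0.010, 0.008],
    "view_b": [0.020, 0.021, 0.019, 0.020, 0.022, 0.018],
}
current = {"view_a": 0.25, "view_b": 0.021}
print(anomalous_views(history, current))
```

Comparing each view against its own history means a noisy view does not need a hand-tuned threshold, which is the burden the RCAs above point to. A real version would likely need to account for low-traffic views and time-of-day patterns.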
New Relic errors need to be temporarily or permanently ignored
Dec 17, 2020: Dave O. manually disabled a shared edx-platform alert and edited a TNL alert to ignore a PermissionDenied error. The error was ultimately ignored for the entire app. See Ignored Errors for LMS in New Relic: /wiki/spaces/AT/pages/2175434915
Oct 9, 2020: Jason wrote about alerting related to:
Oct/Nov 2020: Robert had to temporarily adjust a TNL error alert condition to ignore
The initial implementation is complete, but there is no confirmation yet that the “expected error” functionality solves this problem.
New Relic alerting with spotty metrics (sparse data)
Apr 28, 2021: Discussion with Alex D. See https://openedx.atlassian.net/wiki/spaces/SOL/pages/2693628290/Chatting+about+New+Relic+with+Robert?focusedTaskId=28 Also, in an earlier Slack conversation about the Metrics Aggregator (New Relic app), Emma G had written:
Oct 7, 2020: Revenue created this New Relic post asking whether the “aggregation window” setting could be made larger than 15 minutes: https://discuss.newrelic.com/t/alerting-on-spotty-metrics-possible-to-increase-aggregation-window-times/117331
Dec 22, 2020: Matt Hughes asked about handling spotty data for proctoring. In addition to the above, the “Sum of query results is” option for totaling across time was also discussed.
Note: At this time, it is unclear if we still have unsolved issues, or if we simply need better docs to help people when they run into these issues.
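To illustrate why totaling across time helps with sparse data, here is a small sketch of the idea behind summing results over a longer window. This is not New Relic's implementation; the function names, the one-hour window, and the threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def windowed_sum(points, now, window):
    """Total the counts of all data points within `window` before `now`.

    points: iterable of (timestamp, count) pairs; gaps between points are expected.
    """
    start = now - window
    return sum(count for ts, count in points if start < ts <= now)


def should_alert(points, now, window=timedelta(hours=1), threshold=5):
    # Summing over the window treats missing minutes as zero instead of "no data".
    return windowed_sum(points, now, window) >= threshold


# Made-up sparse data: only a few minutes in the last hour report anything.
now = datetime(2020, 12, 22, 12, 0)
points = [
    (now - timedelta(minutes=50), 2),
    (now - timedelta(minutes=31), 1),
    (now - timedelta(minutes=4), 3),
    (now - timedelta(hours=3), 9),  # outside the window, so not counted
]
print(windowed_sum(points, now, timedelta(hours=1)))
print(should_alert(points, now))
```

No single minute in this data crosses the threshold, but the total over the hour does, which is the case a short per-minute evaluation would miss.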
Stacktrace code_owner would be better than view code_owner
Nov 2020: TNL was alerted when Stacktrace contained
https://openedx.atlassian.net/browse/ARCHBOM-1545 (closed as “unfinished”)
Dec 2020: Issues around unknown tasks were monitored via Splunk to Opsgenie.
There are a variety of Celery signals like
https://openedx.atlassian.net/browse/ARCHBOM-1581 (closed as “unfinished”)
Can’t see deployments on New Relic dashboards
Add deployment data to events
When looking at errors, it is sometimes unclear whether they are coming from old or new boxes, or from which release. The events could carry custom attributes with the git hash and deployment time.
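A rough sketch of how the git hash and deployment time could be attached to events is below. The environment variable names (GIT_COMMIT_HASH, DEPLOYED_AT) and attribute keys are hypothetical and would need to be set by the deployment pipeline. The attribute setter is passed in as a callable so the sketch runs without any agent installed; in practice it might be something like the New Relic Python agent's `add_custom_parameter`, assuming the agent version in use provides it.

```python
import os
from datetime import datetime, timezone


def deployment_attributes(environ=os.environ):
    """Build the custom attributes to attach to each event.

    GIT_COMMIT_HASH and DEPLOYED_AT are hypothetical variable names,
    not existing settings.
    """
    return {
        "deploy.git_hash": environ.get("GIT_COMMIT_HASH", "unknown"),
        "deploy.time": environ.get("DEPLOYED_AT", "unknown"),
    }


def tag_event(add_attribute, environ=os.environ):
    """Attach each deployment attribute using the supplied callable."""
    for key, value in deployment_attributes(environ).items():
        add_attribute(key, value)


# Demo with a fake environment and a dict standing in for the monitoring agent.
fake_env = {
    "GIT_COMMIT_HASH": "abc1234",
    "DEPLOYED_AT": datetime(2021, 4, 28, tzinfo=timezone.utc).isoformat(),
}
recorded = {}
tag_event(recorded.__setitem__, fake_env)
print(recorded)
```

With attributes like these on every event, errors could be grouped or filtered by release, which would show whether they come from old or new boxes.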