Page Comparison

...

Category	Issue	Team anecdotes (with dates, with team)	Number of occurrences	Status
Alerting	Undetected failures (originally “Anomaly detection of common failures”)	Feb 5, 2020: RCA: PROD-1195 - edx-rest-api-client updates breaking proctoring - Error anomaly detection would have found this much faster and reduced the burden on the team to build exactly the right monitoring. June 1, 2020: RCA: CR-2252 - Personalized Learner Schedules Outage - We would have found this problem much faster with anomaly detection on per-view error rates. Note: New Relic AI anomaly detection works only at the app level, not the view level. We'd need to design our own custom anomaly detection if we wanted to catch this without using Key Transactions and custom alerts.	2	https://openedx.atlassian.net/browse/ARCHBOM-1282 (closed as “unfinished”)
Alerting	New Relic errors need to be temporarily or permanently ignored	Dec 17, 2020: Dave O. manually disabled the shared edx-platform alert and edited a TNL alert to ignore a PermissionDenied error. The error was ultimately ignored for the entire app. See Ignored Errors for LMS in New Relic (/wiki/spaces/AT/pages/2175434915). Oct 9, 2020: Jason wrote about alerting related to `rest_framework.exceptions:AuthenticationFailed`. He is unsure whether he wants to ignore it permanently. Oct/Nov 2020: Robert had to temporarily adjust a TNL error alert condition to ignore `openedx.core.lib.blockstore_api.exceptions:BundleStorageError`, a known error that was supposed to be fixed, but the fix was taking days to weeks to get out.	3	https://openedx.atlassian.net/browse/ARCHBOM-1555 Initial implementation is complete, but there is no confirmation yet that the “expected error” functionality solves this problem.
Alerting	New Relic alerting with spotty metrics (sparse data)	Apr 28, 2021: Discussion with Alex D. See https://openedx.atlassian.net/wiki/spaces/SOL/pages/2693628290/Chatting+about+New+Relic+with+Robert?focusedTaskId=28 Also, in an earlier Slack conversation about the Metrics Aggregator (New Relic app), Emma G had written that the metric aggregator app was unhelpful for aggregating over time and had bugs, and that a New Relic representative had suggested we use the API instead. Oct 7, 2020: Revenue created this New Relic post to ask whether the “aggregation window” setting could be made larger than 15 minutes: https://discuss.newrelic.com/t/alerting-on-spotty-metrics-possible-to-increase-aggregation-window-times/117331 Dec 22, 2020: Matt Hughes was asking about handling spotty data for proctoring. In addition to the above, we also discussed the “Sum of query results is” option for totaling across time. Note: At this time, it is unclear whether we still have unsolved issues, or whether we simply need better docs to help people when they run into these issues.	3
Alerting	Stacktrace code_owner would be better than view code_owner	Nov 2020: TNL was alerted when a stack trace contained `edx_proctoring.exceptions:ProctoredExamIllegalStatusTransition`, which is owned by another team.	1	https://openedx.atlassian.net/browse/ARCHBOM-1545 (closed as “unfinished”)
Alerting	Celery issues	Dec 2020: Issues around unknown tasks were monitored via Splunk to Opsgenie. There are a variety of Celery signals, such as `task_unknown` or `task_internal_error`, that could be helpful for monitoring Celery by `code_owner` in New Relic.		https://openedx.atlassian.net/browse/ARCHBOM-1581 (closed as “unfinished”)
Dashboards	Can’t see deployments on New Relic dashboards			https://openedx.atlassian.net/browse/ARCHBOM-1287
Custom Attributes	Add deployment data to events	When seeing errors, it is sometimes unclear whether they are coming from old or new boxes, or from which release. The events could have custom attributes with the git hash and deployment time.
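
For the “Add deployment data to events” row, a minimal sketch of what tagging events with deployment metadata might look like. The environment variable names (`DEPLOY_GIT_HASH`, `DEPLOY_TIME`), the attribute names, and both helper functions are hypothetical and not taken from any existing edX code. The attribute-recording function is passed in as a parameter so the sketch does not depend on a particular monitoring library.

```python
import os
from typing import Callable, Dict, Mapping


def deployment_attributes(environ: Mapping[str, str]) -> Dict[str, str]:
    """Collect deployment metadata from (hypothetical) environment variables."""
    attrs: Dict[str, str] = {}
    git_hash = environ.get("DEPLOY_GIT_HASH")  # hypothetical variable name
    deployed_at = environ.get("DEPLOY_TIME")  # hypothetical variable name
    if git_hash:
        attrs["deploy_git_hash"] = git_hash
    if deployed_at:
        attrs["deploy_time"] = deployed_at
    return attrs


def tag_current_transaction(
    add_attribute: Callable[[str, str], object],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Record each deployment attribute through the supplied callable.

    Returns the attributes that were recorded, which may be empty when the
    deployment did not set the expected environment variables.
    """
    attrs = deployment_attributes(environ)
    for key, value in attrs.items():
        add_attribute(key, value)
    return attrs
```

With the New Relic Python agent, `newrelic.agent.add_custom_attribute` could be passed as `add_attribute`, for example from a Django middleware, so that each transaction and error event carries the git hash and deployment time.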

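For the “Celery issues” row, one possible shape for attributing unknown-task failures to an owner is sketched below. The prefix-to-owner mapping, the team names, the attribute names, and both helper functions are hypothetical. The handler accepts the keyword arguments that Celery documents for the `task_unknown` signal (`name`, `id`, `message`, `exc`), and the recording function is injected so the sketch runs without Celery or New Relic installed.

```python
from typing import Callable, Dict, Optional

# Hypothetical mapping of module-path prefixes to owning teams.
OWNER_BY_PREFIX: Dict[str, str] = {
    "edx_proctoring": "team-a",
    "openedx.core.djangoapps.schedules": "team-b",
}


def owner_for_task(
    task_name: str, owner_by_prefix: Dict[str, str] = OWNER_BY_PREFIX
) -> Optional[str]:
    """Return the owner whose module prefix is the longest match for task_name."""
    best: Optional[str] = None
    for prefix in owner_by_prefix:
        # Match whole dotted components only, so "edx_proctoring_extra"
        # does not match the "edx_proctoring" prefix.
        if task_name == prefix or task_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    return owner_by_prefix[best] if best is not None else None


def make_task_unknown_handler(record: Callable[[str, str], object]):
    """Build a handler suitable for connecting to Celery's task_unknown signal."""

    def handler(sender=None, name=None, id=None, message=None, exc=None, **kwargs):
        task_name = name or ""
        record("celery_unknown_task_name", task_name)  # illustrative attribute name
        record("code_owner", owner_for_task(task_name) or "unknown")

    return handler


# Wiring, left as comments because it assumes Celery and the New Relic agent:
#   from celery.signals import task_unknown
#   import newrelic.agent
#   handler = make_task_unknown_handler(newrelic.agent.add_custom_attribute)
#   task_unknown.connect(handler, weak=False)
```

A similar handler could be connected to `task_internal_error`, where the signal sender is the task itself, so the owner could be derived from the task's module rather than from a name string.
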
...
