Observability Challenges
This page captures Arch and Engineering Challenges (2020+) that are specifically about Observability.
Please add new issues or anecdotes as they arise. Thank you!
Category | Issue | Team anecdotes | Number of occurrences | Status |
---|---|---|---|---|
Alerting | Undetected failures (originally “Anomaly detection of common failures”) | Feb 5, 2020: RCA: PROD-1195 - edx-rest-api-client updates breaking proctoring. Error anomaly detection would have found this much faster and reduced the burden on the team to build exactly the right monitoring. June 1, 2020: RCA: CR-2252 - Personalized Learner Schedules Outage. We would have found this problem much faster with anomaly detection on per-view error rates. Note: New Relic AI anomaly detection works only at the app level, not the view level. To catch this without Key Transactions and custom alerts, we would need to design our own custom anomaly detection. | 2 | https://openedx.atlassian.net/browse/ARCHBOM-1282 (closed as “unfinished”) |
Alerting | New Relic errors need to be temporarily or permanently ignored | Dec 17, 2020: Dave O. manually disabled the shared edx-platform alert and edited a TNL alert to ignore a PermissionDenied error. The error was ultimately ignored for the entire app. See https://openedx.atlassian.net/wiki/spaces/AT/pages/2175434915 Oct 9, 2020: Jason wrote about alerting related to: Oct/Nov 2020: Robert had to temporarily adjust a TNL error alert condition to ignore | 3 | https://openedx.atlassian.net/browse/ARCHBOM-1555 The initial implementation is complete, but there is no confirmation yet that the “expected error” functionality solves this problem. |
Alerting | New Relic alerting with spotty metrics (sparse data) | Apr 28, 2021: Discussion with Alex D. See https://openedx.atlassian.net/wiki/spaces/SOL/pages/2693628290/Chatting+about+New+Relic+with+Robert?focusedTaskId=28 Also, in an earlier Slack conversation about the Metrics Aggregator (New Relic app), Emma G had written: Oct 7, 2020: Revenue created this New Relic post to ask whether the “aggregation window” setting can be increased beyond 15 minutes: https://discuss.newrelic.com/t/alerting-on-spotty-metrics-possible-to-increase-aggregation-window-times/117331 Dec 22, 2020: Matt Hughes asked about handling spotty data for proctoring. In addition to the above, we also discussed the “Sum of query results is” option for totaling across time. Note: At this time, it is unclear whether we still have unsolved issues or simply need better docs to help people when they run into these issues. | 3 | |
Alerting | Stacktrace code_owner would be better than view code_owner | Nov 2020: TNL was alerted when Stacktrace contained | 1 | https://openedx.atlassian.net/browse/ARCHBOM-1545 (closed as “unfinished”) |
Alerting | Celery issues | Dec 2020: Issues around unknown tasks were monitored via Splunk to Opsgenie. There are a variety of Celery signals like | | https://openedx.atlassian.net/browse/ARCHBOM-1581 (closed as “unfinished”) |
Dashboards | Can’t see deployments on New Relic dashboards | | | |
Custom Attributes | Add deployment data to events | When errors appear, it is sometimes unclear whether they come from old or new boxes, or from which release. The events could carry custom attributes with the git hash and deployment time. | | |
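As a rough illustration of the “Add deployment data to events” idea in the last table row, the sketch below builds deployment attributes and attaches them with the New Relic Python agent's `add_custom_attribute` call. The environment variable names `DEPLOY_GIT_HASH` and `DEPLOY_TIME`, the attribute names, and both helper functions are hypothetical; how the deployment process would actually expose these values is still an open question.

```python
import os

try:
    # New Relic Python agent; optional so this sketch also runs without it.
    import newrelic.agent as newrelic_agent
except ImportError:
    newrelic_agent = None


def deployment_attributes(environ=None):
    """Build deployment custom attributes from environment variables.

    DEPLOY_GIT_HASH and DEPLOY_TIME are hypothetical variable names that a
    deployment pipeline would need to set on each box.
    """
    environ = os.environ if environ is None else environ
    attributes = {}
    git_hash = environ.get("DEPLOY_GIT_HASH")
    if git_hash:
        attributes["deploy_git_hash"] = git_hash
    deploy_time = environ.get("DEPLOY_TIME")
    if deploy_time:
        attributes["deploy_time"] = deploy_time
    return attributes


def record_deployment_attributes(environ=None):
    """Attach the deployment attributes to the current transaction, if any."""
    attributes = deployment_attributes(environ)
    if newrelic_agent is not None:
        for key, value in attributes.items():
            # Only has an effect while a New Relic transaction is active.
            newrelic_agent.add_custom_attribute(key, value)
    return attributes
```

Because the attributes are attached to the active transaction, a helper like this would most likely be called once per request, for example from middleware.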
Additional Items
This is a place to capture new items without having to add to the table or create tasks for work that may never be implemented.
Fix New Relic logs
Some log lines are not forwarded, including stack traces for errors.
The recommendation is to try adding the following settings to https://github.com/openedx/configuration/blob/master/playbooks/roles/edxapp/templates/newrelic.ini.j2, as noted in https://docs.newrelic.com/docs/logs/logs-context/configure-logs-context-python/#1-agent :
application_logging.enabled=true
application_logging.forwarding.enabled=true
This change requires monitoring the impact on log ingest size and cost.
Potentially switch to DISTRIBUTED_TRACING for all services through the config file.
Switch to (or add) a userId-like custom attribute, as described in https://docs.newrelic.com/docs/errors-inbox/error-users-impacted/, to get additional functionality out of New Relic.
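A minimal sketch of the userId-like attribute, assuming the installed New Relic Python agent is recent enough to provide `newrelic.agent.set_user_id`. The helper name `tag_current_user` is hypothetical, and which identifier to send (and whether it is acceptable from a privacy standpoint) is not decided here.

```python
try:
    # New Relic Python agent; optional so this sketch also runs without it.
    import newrelic.agent as newrelic_agent
except ImportError:
    newrelic_agent = None


def tag_current_user(user_id):
    """Report the user id for the current transaction, if one is active.

    Returns the string value that was passed to the agent, or None when
    there is no user id to report (for example, anonymous requests).
    """
    if user_id is None:
        return None
    value = str(user_id)
    # Older agent versions may not have set_user_id, so check before calling.
    if newrelic_agent is not None and hasattr(newrelic_agent, "set_user_id"):
        newrelic_agent.set_user_id(value)
    return value
```

Like other transaction attributes, this would need to be called during request handling, after the user has been authenticated.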