Observability Challenges

This page captures Architecture and Engineering challenges (2020 onward) that specifically concern observability.

Please add new issues or anecdotes as they arise. Thank you!

Category

Issue

Team anecdotes
(with dates, with team)

Number of occurrences

Status

Alerting

Undetected failures (originally “Anomaly detection of common failures”)

Feb 5, 2020: RCA: PROD-1195 - edx-rest-api-client updates breaking proctoring. Error anomaly detection would have found this much faster and reduced the burden on the team to build exactly the right monitoring.

June 1, 2020: RCA: CR-2252 - Personalized Learner Schedules Outage. We would have found this problem much faster with anomaly detection on per-view error rates.

Note: New Relic AI anomaly detection works only at the app level, not the view level. To catch this without using Key Transactions and custom alerts, we would need to design our own custom anomaly detection.

2

https://openedx.atlassian.net/browse/ARCHBOM-1282

(closed as “unfinished”)
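The per-view anomaly detection idea noted in this row could be sketched roughly as follows. This is a minimal illustration, not an existing implementation: the function name, the z-score approach, the threshold, and the shape of the input data are all assumptions, and the error rates would have to be exported from the monitoring system by some other means.

```python
from statistics import mean, stdev


def find_anomalous_views(history, current, threshold=3.0, min_samples=5):
    """Return views whose current error rate is far above their own baseline.

    history: dict mapping view name -> list of past error rates (0.0 to 1.0)
    current: dict mapping view name -> latest error rate
    All names and the z-score rule are illustrative assumptions.
    """
    anomalies = []
    for view, rate in current.items():
        past = history.get(view, [])
        if len(past) < min_samples:
            continue  # not enough data to form a baseline for this view
        baseline = mean(past)
        # Floor the spread so a perfectly flat baseline cannot divide by zero.
        spread = max(stdev(past), 0.001)
        z_score = (rate - baseline) / spread
        if z_score > threshold:
            anomalies.append((view, z_score))
    # Most anomalous view first.
    return sorted(anomalies, key=lambda item: item[1], reverse=True)
```

Comparing each view against its own history is what lets a single broken view stand out even when the app-level error rate barely moves.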

Alerting

New Relic errors need to be temporarily or permanently ignored

Dec 17, 2020: Dave O. manually disabled a shared edx-platform alert and edited a TNL alert to ignore a PermissionDenied error. The error was ultimately ignored for the entire app. See https://openedx.atlassian.net/wiki/spaces/AT/pages/2175434915

Oct 9, 2020: Jason wrote about alerting related to rest_framework.exceptions:AuthenticationFailed. He is unsure whether he wants to ignore it permanently.

Oct/Nov 2020: Robert had to temporarily adjust a TNL error alert condition to ignore openedx.core.lib.blockstore_api.exceptions:BundleStorageError. This was a known error that was supposed to be fixed, but the fix was taking days to weeks to get out.

3

https://openedx.atlassian.net/browse/ARCHBOM-1555

The initial implementation is complete, but there is no confirmation yet that the “expected error” functionality is solving this problem.
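To make the temporary versus permanent distinction in this row concrete, here is a small sketch of an expected-error registry with optional expiry dates. It is not the ARCHBOM-1555 implementation; the registry, the function names, and the expiry date are hypothetical. Only the `module:Class` naming mirrors how errors are written in the anecdotes above.

```python
from datetime import date

# Hypothetical registry: error name -> date the ignore expires (None = permanent).
# The expiry date below is a made-up placeholder for illustration only.
EXPECTED_ERRORS = {
    "rest_framework.exceptions:AuthenticationFailed": None,
    "openedx.core.lib.blockstore_api.exceptions:BundleStorageError": date(2020, 12, 1),
}


def error_name(exc):
    """Format an exception as module:Class, matching the names used above."""
    cls = type(exc)
    return f"{cls.__module__}:{cls.__qualname__}"


def is_expected_error(exc, today, registry=EXPECTED_ERRORS):
    """Return True if the error is registered and its ignore has not expired."""
    name = error_name(exc)
    if name not in registry:
        return False
    expires = registry[name]
    return expires is None or today <= expires
```

An expiry date keeps a temporary ignore from silently becoming permanent: once the date passes, the error counts toward alerts again unless someone renews the entry.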

Alerting

New Relic alerting with spotty metrics (sparse data)

Apr 28, 2021: Discussion with Alex D. See https://openedx.atlassian.net/wiki/spaces/SOL/pages/2693628290/Chatting+about+New+Relic+with+Robert?focusedTaskId=28. Also, in an earlier Slack conversation about the Metrics Aggregator (New Relic app), Emma G had written:

The metric aggregator app was unhelpful for aggregating over time -- and when speaking to a New Relic representative, had bugs, it was suggested that we use the api instead.

Oct 7, 2020: Revenue created this New Relic post to ask whether the “aggregation window” setting can be larger than 15 minutes: https://discuss.newrelic.com/t/alerting-on-spotty-metrics-possible-to-increase-aggregation-window-times/117331

Dec 22, 2020: Matt Hughes was asking about handling spotty data for proctoring. In addition to the above, the “Sum of query results is” option for totaling across time was also discussed.

Note: At this time, it is unclear if we still have unsolved issues, or if we simply need better docs to help people when they run into these issues.

3
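The general idea of totaling sparse data across time, as discussed in this row, can be illustrated with a trailing-window sum. This sketch is independent of New Relic and does not claim to reproduce how its alert conditions are evaluated; the function names, sample data, and thresholds are assumptions.

```python
def windowed_sums(counts, window):
    """Return the trailing sum over `window` samples at each position."""
    if window < 1:
        raise ValueError("window must be at least 1")
    sums = []
    running = 0
    for i, value in enumerate(counts):
        running += value
        if i >= window:
            running -= counts[i - window]  # drop the sample that left the window
        sums.append(running)
    return sums


def should_alert(counts, window, threshold):
    """Alert when the most recent windowed total reaches the threshold."""
    sums = windowed_sums(counts, window)
    return bool(sums) and sums[-1] >= threshold
```

With per-minute counts such as `[0, 0, 1, 0, 0, 0, 2, 0, 0, 1]`, no single minute reaches a threshold of 3, but the total over the last five minutes does.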


Alerting

Stacktrace code_owner would be better than view code_owner

Nov 2020: TNL was alerted when Stacktrace contained edx_proctoring.exceptions:ProctoredExamIllegalStatusTransition, which is owned by another team.

1

https://openedx.atlassian.net/browse/ARCHBOM-1545 (closed as “unfinished”)
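One way to picture a stack-trace-based code_owner is to walk the traceback from the innermost frame outward and take the first module with a known owner. The sketch below is only an illustration of that idea: the prefix-to-owner mapping, the owner names, and both function names are hypothetical, and this is not how code_owner is currently computed.

```python
# Hypothetical mapping of module prefixes to owning teams.
CODE_OWNERS = {
    "edx_proctoring": "hypothetical-proctoring-owner",
    "lms.djangoapps.courseware": "hypothetical-courseware-owner",
}


def owner_for_module(module_name, owners=CODE_OWNERS):
    """Return the owner of the longest matching module prefix, or None."""
    best = None
    for prefix, owner in owners.items():
        if module_name == prefix or module_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, owner)
    return best[1] if best else None


def owner_for_exception(exc, owners=CODE_OWNERS):
    """Prefer the innermost stack frame with a known owner, then fall back
    to the module that defines the exception class."""
    modules = []
    tb = exc.__traceback__
    while tb is not None:
        modules.append(tb.tb_frame.f_globals.get("__name__", ""))
        tb = tb.tb_next
    modules.reverse()  # innermost frame first
    modules.append(type(exc).__module__)
    for name in modules:
        owner = owner_for_module(name, owners)
        if owner:
            return owner
    return None
```

Under this scheme an exception raised inside `edx_proctoring` would be attributed to that package's owner, regardless of which team owns the view that handled the request.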

Alerting

Celery issues

Dec 2020: Issues around unknown tasks were monitored via Splunk to Opsgenie.

There are a variety of Celery signals, such as task_unknown and task_internal_error, that could be helpful for monitoring Celery by code_owner in New Relic.


https://openedx.atlassian.net/browse/ARCHBOM-1581 (closed as “unfinished”)
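As a rough sketch of the signal-based monitoring mentioned in this row, handlers could be connected to Celery's `task_unknown` and `task_internal_error` signals. The handler names, the recorded attribute names, and `record_monitoring_event` are hypothetical; the latter stands in for whatever would report to New Relic. The handler arguments follow my understanding of the Celery signal documentation and should be verified against the Celery version in use. The import is guarded only so the sketch can run without Celery installed.

```python
RECORDED_EVENTS = []


def record_monitoring_event(attributes):
    """Hypothetical stand-in for reporting attributes to a monitoring backend."""
    RECORDED_EVENTS.append(attributes)


def on_task_unknown(sender=None, name=None, id=None, **kwargs):
    """Record a message received for a task name the worker does not know.

    The `id` parameter shadows the builtin because it must match the
    keyword the signal is assumed to send.
    """
    record_monitoring_event(
        {"celery_signal": "task_unknown", "task_name": name, "task_id": id}
    )


def on_task_internal_error(sender=None, task_id=None, exception=None, **kwargs):
    """Record an internal Celery error raised while executing a task."""
    record_monitoring_event(
        {
            "celery_signal": "task_internal_error",
            # The sender is assumed to be the task object; fall back to None.
            "task_name": getattr(sender, "name", None),
            "task_id": task_id,
            "error": type(exception).__name__ if exception is not None else None,
        }
    )


try:
    from celery.signals import task_internal_error, task_unknown
except ImportError:
    pass  # Celery not installed; the handlers can still be called directly.
else:
    task_unknown.connect(on_task_unknown)
    task_internal_error.connect(on_task_internal_error)
```

The recorded task name could then be mapped to a code_owner so that alerts reach the owning team.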

Dashboards

Can’t see deployments on New Relic dashboards


https://openedx.atlassian.net/browse/ARCHBOM-1287

Custom Attributes

Add deployment data to events

When errors appear, it is sometimes unclear whether they are coming from old or new boxes, or from which release. The events could carry custom attributes with the git hash and deployment time.
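A minimal sketch of how such deployment attributes might be assembled is shown below. It assumes the deploy process exposes the git hash and deployment time through environment variables; the variable names `DEPLOY_GIT_HASH` and `DEPLOY_TIME`, the attribute names, and the function name are all hypothetical.

```python
import os


def deployment_attributes(environ=None):
    """Build custom attributes describing the running deployment.

    Reads hypothetical environment variables that a deploy script would set.
    Missing values are skipped rather than reported as empty strings.
    """
    if environ is None:
        environ = os.environ
    attributes = {}
    git_hash = environ.get("DEPLOY_GIT_HASH")
    deployed_at = environ.get("DEPLOY_TIME")
    if git_hash:
        attributes["deploy_git_hash"] = git_hash
    if deployed_at:
        attributes["deploy_time"] = deployed_at
    return attributes


# Each key/value pair would then be attached to the current transaction or
# event using the monitoring agent's custom attribute call (not shown here).
```

With both attributes on every event, errors could be grouped by release to tell old boxes from new ones.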


Additional Items

This is a place to capture new items without having to add to the table or create tasks for work that may never be implemented.
