Observability Backlog and Notes
SRE generally owns Observability, but I, @Robert Raposa, have been championing this effort. This page is a home for my backlog until (and if) the time comes to roll this work into SRE, eSREs, or other groups.
Backlog
Review these tickets to decide which should be closed as Won’t Do, even if only for now.
Backlog Ideas
Tasks marked with “SRE Support” come from SRE Support (Data Analysis). This list needs prioritization!
Make better use of New Relic’s new expected error functionality.
Maybe replace/remove our custom functionality for expected errors?
Our custom expected-errors functionality misses some errors.
Use _nr_exc_info in new middleware. See New Relic code.
Test on DRF permission failures locally.
Also, Agent v7.2.1.168, released on Oct 12, presumably fixes the Ignored Error problem.
How much of this functionality do we want to keep in place?
New Relic policy that sends notifications to warroom for rare events that many might care about:
memcached lost all its data?
ignored errors warning
Add user_id to error message in frontend-platform for login_refresh:
User id was added to backend in https://github.com/edx/edx-platform/pull/28905
Adding to frontend would be similar to: https://github.com/edx/frontend-platform/pull/207/files
Are there other log messages worth adding this to?
Fix/Rethink RequestCustomAttributesMiddleware
Note: currently this observability data goes missing for certain exceptions.
Most of it applies to any request, so the middleware (or part of it) probably belongs in edx-django-utils monitoring.
Splitting out user and authentication monitoring into separate middleware would enable the existing middleware to move higher in the list, so we don’t lose good info during exceptions in other middleware.
Note: would auth exception monitoring middleware need to be higher than auth middleware? Needs thought.
Deployment metadata in New Relic ideas:
Custom attribute(s) for Python version and Django version.
Allows for New Relic querying with historical information.
Admin API with Python version and all libraries and versions. Maybe call pip freeze from Python.
Only provides current information, but provides more details.
See https://courses.edx.org/api/toggles/v0/state/ as an example of an admin API with runtime metadata.
Note: if we had the Deployment ID (from GoCD), that should also be included.
Custom attribute with Deployment ID (from GoCD).
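The version-related ideas above could be prototyped with the standard library alone. The sketch below is a minimal illustration, not existing edX code: the function name collect_runtime_metadata and the dictionary keys are hypothetical, and it reads installed package versions through importlib.metadata rather than shelling out to pip freeze. The Deployment ID is left out because how GoCD would expose it is not known here.

```python
import platform
from importlib import metadata


def collect_runtime_metadata(include_packages=False):
    """Return runtime version info as a plain dict (hypothetical helper)."""
    info = {"python_version": platform.python_version()}

    # Django may not be installed where this runs, so look it up defensively.
    try:
        info["django_version"] = metadata.version("Django")
    except metadata.PackageNotFoundError:
        info["django_version"] = None

    if include_packages:
        # Roughly the information pip freeze reports: package name -> version.
        packages = {}
        for dist in metadata.distributions():
            name = dist.metadata.get("Name")
            if name:  # skip distributions with broken metadata
                packages[name] = dist.version
        info["packages"] = packages

    return info
```

The short form (Python and Django versions) is the kind of data that could be attached as custom attributes for historical querying, while the include_packages=True form is closer to what an admin API would return.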
Exclude the edx-platform healthcheck from Apdex.
Proposal: OpsGenie team template with example configurations
Improve and track GoCD failures
See https://openedx.atlassian.net/wiki/spaces/AT/pages/2600896071
First, get Accelerate metrics in place so we can watch improvements.
Simplify New Relic onboarding and OneLogin SSO (SRE Support)
Simplify alert creation for new services. (SRE Support)
Simplify how someone can determine what version of a service is deployed where. (SRE Support)
Docs answering common questions in New Relic. (SRE Support)
How do I find x in New Relic?
Note: the GitHub repo is frontend-app-admin-portal, but the app is named prod-edx-portal in New Relic. Can this be fixed?
Why is this red in New Relic?
How do I do x in New Relic?
Ensure all applications follow best practices and have New Relic configured from the start, rather than waiting until there is a fire. (SRE Support)
Organize Hnycon (Honeycomb) Notes
Discuss with New Relic
Demo of trace of deployment
Demo of SLO burn and error budgets
Discussed overall on-call vs on-call for change (i.e. different notification channels).
HAVING clause (new feature)
Book recommendations
Multipliers
Team Topologies (Nimisha may have page)
Other notes
Observability vs APM
Honeycomb is clear on this. Are we?
Recommendation to fix missing spans. Traces should be complete.
How and when are we using spans?
End-to-end user SLOs
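As a reference for the SLO burn and error budget item above, the usual arithmetic can be sketched as follows. This is a generic illustration, not taken from the demo; the function names are hypothetical and the numbers in the note after the code are made up.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to fail in the window under the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target, failed_requests, total_requests):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget lasts exactly the SLO window;
    above 1.0 means it runs out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / allowed_error_rate
```

With a 99.9% target and 1,000,000 requests, the budget is about 1,000 failed requests; 5,000 failures over those same requests would be a burn rate of about 5.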
Additional Resources