[Spike] Identify options for handling Enterprise overnight search traffic spikes
Description
Every night between roughly 4:00 and 4:30 am EST, an Enterprise Catalog service sends requests to search/all at a high rate. This results in an extended period of low Apdex for the discovery service and usually triggers an OpsGenie alert. Identify options for handling this traffic.
Some ideas:
The upcoming upgrade of Elasticsearch to version 7 may give us an across-the-board improvement in response time for search queries.
Add rate limiting for this endpoint or the requesting user (or adjust it if rate limiting is already configured).
Add an alert policy in OpsGenie for the short time window during which this alert keeps firing. This would be a temporary measure only.
Profile the endpoint to understand whether there is egregious slowness that can be fixed in code.
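To make the rate-limiting option concrete, here is a minimal sketch of a per-user token bucket in plain Python. It is not the discovery service's actual configuration. The class names, the user name "enterprise_worker", and the rate and capacity values are all hypothetical and chosen only for illustration. A real deployment would more likely use the web framework's built-in throttling with a shared cache.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerUserLimiter:
    """Keep one bucket per requesting user (hypothetical helper)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, user: str) -> bool:
        bucket = self.buckets.get(user)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[user] = bucket
        return bucket.allow()


if __name__ == "__main__":
    # Assumed values: 2 requests per second sustained, bursts of up to 3.
    limiter = PerUserLimiter(rate=2.0, capacity=3.0)
    for i in range(5):
        print(i, limiter.allow("enterprise_worker"))
```

Because each user gets a separate bucket, throttling the overnight caller would not affect other clients of search/all. Rejected requests would normally be answered with HTTP 429 so the caller can back off and retry.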
Distributed trace link:
Throughput images:
Steps to Reproduce
Activity
A performance analysis write-up might be helpful:
