[Spike] Identify options for handling Enterprise overnight search traffic spikes

Description

An Enterprise Catalog service sends requests to search/all at a high rate every night, from around 4:00 to 4:30am EST. This causes an extended period of low Apdex for the discovery service and usually triggers an OpsGenie alert. Identify options for dealing with this traffic.
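
For reference, Apdex is computed from response times against a target threshold T: requests at or under T count as satisfied, requests between T and 4T count as tolerating (half weight), and anything slower counts as frustrated. A minimal sketch follows; the function name and the sample values are illustrative only and are not taken from the discovery service's monitoring configuration.

```python
def apdex(response_times, threshold):
    """Return the Apdex score for a list of response times.

    response_times and threshold must use the same unit (e.g. seconds).
    threshold is the target time T; the tolerating band is (T, 4T].
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Frustrated requests (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical sample: two fast requests, one slow, one very slow, with T = 0.5s.
score = apdex([0.1, 0.2, 0.6, 3.0], threshold=0.5)
```

A burst of slow search/all responses during the nightly window pushes requests out of the satisfied band, which is what drags the score down for the whole service.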

Some ideas:

  • the upcoming upgrade of Elasticsearch to version 7 may give us an across-the-board improvement in response time for search queries

  • add rate limiting to this endpoint / requesting user (or adjust it if rate limiting is already configured)

  • add an alert policy in OpsGenie covering the short time window in which this alert keeps firing. This would be a temporary measure only

  • profile the endpoint to understand whether there is egregious slowness that can be fixed in code
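
To make the rate-limiting idea concrete, here is a minimal, framework-agnostic token-bucket sketch. The class name, parameters, and the idea of keeping one bucket per requesting user are all assumptions for illustration; the discovery service may already have throttling available through its web framework, in which case adjusting that configuration would be preferable to custom code.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: allows bursts up to `capacity`
    and a sustained rate of `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage sketch: one bucket per requesting user (the key is an assumption).
buckets = {}

def should_serve(user_id, rate=2, capacity=3):
    bucket = buckets.setdefault(user_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

A throttled request would typically receive an HTTP 429 response, so the Enterprise Catalog client would need to back off and retry for this option to reduce load without losing data.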

Distributed trace link:

https://one.nr/0PLREAr1rRa

Throughput images:

Steps to Reproduce

None

Activity

Jason Myatt
January 7, 2021, 2:24 PM

performance analysis write-up might be helpful:

Assignee

Unassigned

Reporter

Jason Myatt
