Spike: Issues with spinning up Jenkins workers should not impact developers

Description

Current situation:

  • Developer pushes a change on a PR branch (or creates a new PR)

  • Jenkins doesn't have enough workers to service the request

  • The EC2 plugin tries to spin up more workers but cannot, because it has determined that the instance cap has been reached

  • We see this because the Splunk alert is triggered: "The alert condition for 'Jenkins instance cap reached' was triggered."

  • Only the "a11y" status context job is run

  • Commenting "jenkins run all" on the PR does not retrigger the other jobs

  • The current workaround is to close and reopen the PR (lame)
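Until a real fix exists, the close-and-reopen workaround could be scripted. The following is a minimal sketch, assuming the repository is hosted on GitHub and that a personal access token with write access to the repo is available. It uses the GitHub REST endpoint for updating a pull request; the function names, the `send` parameter, and the example org/repo values are hypothetical and not part of any existing tooling.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_state_request(owner, repo, pr_number, state, token):
    """Build a PATCH request that sets a pull request's state ("open" or "closed")."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{pr_number}"
    body = json.dumps({"state": state}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def close_and_reopen(owner, repo, pr_number, token, send=urllib.request.urlopen):
    """Close and then reopen a PR, mimicking the manual workaround.

    `send` is injectable (an assumption of this sketch) so the logic can be
    exercised without network access.
    """
    for state in ("closed", "open"):
        request = build_state_request(owner, repo, pr_number, state, token)
        response = send(request)
        # urlopen returns a closable response; a fake sender may return None.
        if hasattr(response, "close"):
            response.close()
```

A call such as `close_and_reopen("example-org", "example-repo", 123, token)` would issue the two PATCH requests in order. This only automates the workaround; it does not address why the jobs were dropped in the first place.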

When we dug into this earlier, the root cause seemed to be related to the ghprb plugin's cache of PR branches and its record of what should be, or already was, run on them.

  • might remember more details.

  • There are probably past Jira tickets that also have more details.
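During discovery it may help to identify which PRs are affected, that is, which head commits report only the "a11y" context. A rough sketch, assuming the JSON shape returned by GitHub's combined status endpoint (`GET /repos/{owner}/{repo}/commits/{ref}/status`, which includes a `statuses` list whose entries carry a `context` field). The expected context names other than "a11y" are placeholders, not the real job names.

```python
def missing_contexts(combined_status, expected_contexts):
    """Return the expected status contexts that never reported on a commit.

    `combined_status` is the parsed JSON body of GitHub's combined status
    response for the PR's head commit.
    """
    reported = {status["context"] for status in combined_status.get("statuses", [])}
    return sorted(set(expected_contexts) - reported)


# Example: a commit where only the "a11y" job reported.
# "python-unit" and "js-unit" are hypothetical context names.
example_status = {"statuses": [{"context": "a11y", "state": "pending"}]}
print(missing_contexts(example_status, ["a11y", "python-unit", "js-unit"]))
```

A non-empty result shortly after an "instance cap reached" alert would suggest the PR hit this problem.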

Acceptance Criteria:

  • Timeboxed to 2 points

  • Discovery for root cause and possible fixes/workarounds

  • Recommendation for approach

  • If it fits in the timebox, implement

  • If it does NOT fit in the timebox, create stories for implementing the recommended solution

Steps to Reproduce

None

Current Behavior

None

Expected Behavior

None

Reason for Variance

None

Release Notes

None

User Impact Summary

None

Assignee

Unassigned

Reporter

JesseZ

Labels

Reach

None

Impact

None

Platform Area

None

Customer

None

Partner Manager

None

URL

None

Contributor Name

None

Groups with Read-Only Access

None

Actual Points

None

Category of Work

None

Platform Map Area (Levels 1 & 2)

None

Platform Map Area (Levels 3 & 4)

None

Story Points

2

Epic Link

Priority

Unset