asg_lifcycle_watcher.py WAIT_TIMEOUT_SECONDS must be less than the boto http_socket_timeout

Description

Jenkins runs of the olive minos `terminate-instances` job were failing whenever the SQS queue was empty (e.g. [job#449675](https://admin.edx-flatu.org:8080/job/terminate-instances-that-have-been-verified-for-retirement-prod-olivex/449675/console)), but succeeding whenever the queue contained messages (e.g. [job#449676](https://admin.edx-flatu.org:8080/job/terminate-instances-that-have-been-verified-for-retirement-prod-olivex/449676/console)).

From the errors reported in the failed job, we determined that, when the queue is empty, the [boto `http_socket_timeout` of 3 sec](https://github.com/edx/configuration/blob/cf4d221d384ed396a3008c31487f385544ef08e7/playbooks/roles/aws/templates/boto.cfg.j2#L2) was being hit before the [configured queue wait timeout of 10 sec](https://github.com/edx-olive/configuration/blob/a16cd2c9c6fd20fc604259ca4c89e1dda0fa0f9d/util/vpc-tools/asg_lifcycle_watcher.py#L38). With no messages to return, the long-poll request stays open for the full queue wait, so the socket times out first; when messages are available, the request returns right away and the socket timeout is never reached.

Because the `boto.cfg` file is part of the `aws` role, and so is shared by many services, we chose to decrease the queue wait to match the upstream [1 second timeout](https://github.com/edx/configuration/blob/cf4d221d384ed396a3008c31487f385544ef08e7/util/vpc-tools/asg_lifcycle_watcher.py#L35) rather than increase the boto timeout.
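The constraint in the title can be illustrated with a small check. This is only a sketch, not code from `asg_lifcycle_watcher.py`: the helper name `wait_fits_socket_timeout` and the inline `BOTO_CFG_TEXT` are hypothetical, the `[Boto]` section layout is assumed to match the rendered `boto.cfg`, and it uses Python 3's `configparser` rather than boto itself.

```python
import configparser

# Hypothetical stand-in for the rendered boto.cfg (layout assumed);
# the 3-second value is the one cited in this description.
BOTO_CFG_TEXT = """
[Boto]
http_socket_timeout = 3
"""


def wait_fits_socket_timeout(boto_cfg_text, wait_timeout_seconds):
    """Return True if the queue wait is strictly shorter than the socket timeout."""
    parser = configparser.ConfigParser()
    parser.read_string(boto_cfg_text)
    socket_timeout = parser.getint("Boto", "http_socket_timeout")
    return wait_timeout_seconds < socket_timeout


# 10 sec: the olive queue wait before this change.
print(wait_fits_socket_timeout(BOTO_CFG_TEXT, 10))  # prints False
# 1 sec: the upstream queue wait adopted by this change.
print(wait_fits_socket_timeout(BOTO_CFG_TEXT, 1))   # prints True
```

A check along these lines could run at watcher startup, so that a future mismatch fails loudly instead of only when the queue happens to be empty.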

*Testing instructions*

To verify this fix, I:

*Author Notes & Concerns*

  • The upstream reduction of the queue wait timeout was made as part of upgrading minos to boto3; however, issues with using this upgrade on olive caused that change to be reverted.
    We'll need to investigate these issues more thoroughly to maintain this configuration repo moving forward.

  • We are also working on changes to remove the need for OpenCraft to merge configuration changes to the `edx:configuration/olive` fork/branch, so we don't have to pester you about this stuff in the future.

*Reviewers*

  • [ ] @itsjeyd

  • [ ] @coryleeio

CC @natabene

Configuration Pull Request

Make sure that the following steps are done before merging:

  • [ ] A DevOps team member has approved the PR if it is code shared across multiple services and you don't own all of the services.

  • [ ] Are you adding any new default values that need to be overridden when this change goes live? If so:

      • [ ] Update the appropriate internal repo (be sure to update for all our environments)

      • [ ] If you are updating a secure value rather than an internal one, file a DEVOPS ticket with details.

  • [ ] Add an entry to the CHANGELOG.

  • [ ] If you are making a complicated change, have you performed the proper testing specified on the [Ops Ansible Testing Checklist](https://openedx.atlassian.net/wiki/display/EdxOps/Ops+Ansible+Testing+Checklist)? Adding a new variable does not require the full list (although testing on a sandbox is a great idea to ensure it links with your downstream code changes).

Status

Assignee: Unassigned

Reporter: Open Source Pull Request Bot

Contributor Name: Jillian Vogel

Repo: edx/configuration

Customer:

Epic Link: None

OSCM Assignee: None

Priority: Unset