Cypress Retrospective
Rough timeline of events:
- June 9: meeting to discuss ops workload for Cypress, distribute upgrade PR (#8161) merged
- June 18 to 29: DB on vacation
- June 30: debugging Vagrant boxes from configuration master (OPS-805)
- July 13 to 17: DB tied up with edx.org weekly release (problems with distribute upgrade), distribute upgrade reverted
- July 14: Cypress RC1 cut (never released)
- July 15: Joel's first day
- July 17: Cypress RC2 released, YouTube API issue discovered
- July 21 to 23: working on migration script
- July 24: YouTube API fix merged to master
- July 25: Cypress RC3 released
- July 27: Birch.1 released
- July 31: distribute upgrade merged back into master
- Aug 7: Cypress RC4 released
- Aug 12: marketing planning for Cypress final
- Aug 13: Cypress final released
Getting the Vagrant boxes to build off of master before Cypress was released took roughly two weeks, with Feanil Patel doing most of that work. The YouTube API issue was a major unplanned effort for this release and necessitated the creation of Birch.1 and Birch.2. Between Cypress RC3 (which we specifically delayed for the YouTube API fix) and the final Cypress release, we spent about 2.5 weeks fixing issues in the configuration repo that were causing problems with installing Cypress: missing /edx/bin/update, improperly installing insights, etc.
Things That Went Well
- Spent less time getting boxes made in the first place (better than for Birch)
- Ed put together automation for building nightly boxes (need to revisit it)
- Marketing came out well
- Feanil is getting ready to give people more information about how to run in production
- Marketing calendars were synced; marketing was ready to go on the day of release
- Release went out in a good state
Things That Didn't
- People say the wrong name
- Incorrectly believed that automation/testing would catch errors due to distribute/setuptools upgrade (feanil, db, sarina)
- Unexpected behaviors due to setuptools upgrade, AND due to distribute downgrade
- DB figured something like this might happen – didn't realize the extent
- Process requires too much manual effort; needs more automation
  - manually tag all of the repos
  - manually test whether or not images are working
- Missed multiple dates
- Had to backport a lot of changes to Birch, unexpectedly (partially because we missed dates)
- Lack of clarity around support plans for named releases (db)
- Lack of capacity to support more than one named release at a time
- Planning to provide extended-support releases in the future
- Didn't pull in marketing resources early enough
- Lack of clarity about what would be included in the Cypress box until very late in the process
- Last-minute discovery that devstack and sandbox.sh installed different things (sarina, feanil)
- people treat it as production, even though it isn't
- This was the first time we were doing security fixes: no time built in to the schedule to deal with it (ned)
- Lack of metrics around the Open edX release process (joel, ed)
- Should have made marketing tickets for assets and set deadlines for them
- Marketing (db)
  - Learning curve with WordPress
  - Marketing doesn't know the response to their efforts; should keep them more in the loop
  - Should target completion substantially before doing a marketing push (maybe a week before)
  - Should have tested blog post sharing
  - Other partners jumped the gun in marketing Cypress; lack of collaboration with partners
- Developmental/holistic fuck-up with YouTube API
- Ambiguity of what goes into a release (joel, ed, feanil, sarina, ned)
- Lahore developer didn't communicate to Open Source team about critical bugfix that needed to go into Cypress (and production)
- We've been taking a feature-based approach to releases; it felt like there were a lot of arguments about which features should go in (possible misperception?)
Biggest problems: ambiguity and lack of automation
Action Items
- Call out specific marketing resources (images, emails, tweets, etc.) in planning for Dogwood (David Baumgold)
- Create automation for tagging repos and PR into configuration repo for cutting RC (David Baumgold)
- Create automation for testing Vagrant images (Edward Zarecor)
- Explore the idea of doing Vagrant nightly builds, and shipping one of them as a release (Edward Zarecor)
- Figure out how to set expectations for communication around bugfixes (Joel Barciauskas)
- YouTube API RCA part 2 (Joel Barciauskas)
- Further timeline/planning investigation. Compare expected and real timeline. (David Baumgold)
- Determine feasibility of making single devstack for Dogwood: complex microservice architecture may not work for Vagrant (Edward Zarecor)