Cypress Retrospective
Rough timeline of events:
June 9: meeting to discuss ops workload for Cypress; distribute upgrade PR (#8161) merged
June 18 to 29: DB on vacation
June 30: debugging Vagrant boxes from configuration master OPS-805
July 13 to 17: DB tied up with edx.org weekly release (problems with distribute upgrade); distribute upgrade reverted
July 14: Cypress RC1 cut (never released)
July 15: Joel's first day
July 17: Cypress RC2 released, YouTube API issue discovered
July 21 to 23: working on migration script
July 24: YouTube API fix merged to master
July 25: Cypress RC3 released
July 27: Birch.1 released
July 31: distribute upgrade merged back into master
Aug 7: Cypress RC4 released
Aug 12: marketing planning for Cypress final
Aug 13: Cypress final released
Getting the Vagrant boxes to build off of master before Cypress was released took roughly two weeks, with Feanil Patel doing most of that work. The YouTube API issue was a major unplanned effort for this release, and necessitated the creation of Birch.1 and Birch.2. Between Cypress RC3 (which we specifically delayed for the YouTube API fix) and the final Cypress release, we spent about 2.5 weeks fixing issues in the configuration repo that were causing problems with installing Cypress: missing /edx/bin/update, improperly installing insights, etc.
Things That Went Well
Spent less time getting boxes made in the first place (better than for Birch)
Ed put together automation for building nightly boxes (need to revisit it)
Marketing came out well
Feanil is getting ready to give people more information about how to run in production
Marketing calendars were synced; marketing was ready to go on the day of release
Release went out in a good state
Things That Didn't
People say the wrong name
Incorrectly believed that automation/testing would catch errors due to distribute/setuptools upgrade (feanil, db, sarina)
Unexpected behaviors due to setuptools upgrade, AND due to distribute downgrade
DB figured something like this might happen – didn't realize the extent
Process requires too much manual effort; needs more automation
Manually tagging all of the repos
Manually testing whether or not images are working
Missed multiple dates
Had to backport a lot of changes to Birch, unexpectedly (partially because we missed dates)
Lack of clarity around support plans for named releases (db)
Lack of capacity to support more than one named release at a time
Planning to provide extended-support releases in the future
Didn't pull in marketing resources early enough
Lack of clarity of what would be included in Cypress box until very late in the process
Last-minute discovery that devstack and sandbox.sh installed different things (sarina, feanil)
people treat it as production, even though it isn't
This was the first time we did security fixes: no time was built into the schedule to deal with them (ned)
Lack of metrics around the Open edX release process (joel, ed)
Should have made marketing tickets for assets and set deadlines
Marketing (db)
Learning curve with WordPress
Marketing doesn't know the response to their efforts – should keep them more in the loop
Should target completion substantially before doing a marketing push (maybe a week before)
Should have done testing of blog post sharing
Other partners jumping the gun in terms of marketing Cypress, lack of collaboration with partners
Developmental/holistic fuck-up with YouTube API
Ambiguity of what goes into a release (joel, ed, feanil, sarina, ned)
Lahore developer didn't communicate with the Open Source team about a critical bugfix that needed to go into Cypress (and production)
We've been taking a feature-based approach – felt like there were a lot of arguments around which features should go in (possible misperception?)
Biggest problems: ambiguity and lack of automation
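As one illustration of the automation gap noted above (manually tagging all of the repos), here is a minimal sketch of what a scripted tagging step could look like. It is not the process the team used: the repo directory names, tag name, and function names are all hypothetical, and it assumes each repo is already cloned locally with a remote named origin.

```python
import subprocess


def tag_commands(repo_dirs, tag, message):
    """Build the git commands that create and push one annotated tag per repo.

    repo_dirs: paths to local clones (hypothetical names in the example below).
    Returns a list of argument lists; nothing is executed here.
    """
    commands = []
    for repo in repo_dirs:
        # "git -C <dir>" runs the command as if started inside <dir>.
        commands.append(["git", "-C", str(repo), "tag", "-a", tag, "-m", message])
        # Assumes the remote is called "origin".
        commands.append(["git", "-C", str(repo), "push", "origin", tag])
    return commands


def tag_all(repo_dirs, tag, message, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for cmd in tag_commands(repo_dirs, tag, message):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical repo directories and tag name, for illustration only.
    tag_all(["repo-a", "repo-b"], "example-release-tag", "Example release")
```

Keeping command construction separate from execution means the list can be reviewed in a dry run before anything is pushed, which matters when one tag has to land consistently across several repos.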
Action Items