Cypress Retrospective

Rough timeline of events:

  • June 9: meeting to discuss ops workload for Cypress, distribute upgrade PR (#8161) merged
  • June 18 to 29: DB on vacation
  • June 30: debugging Vagrant boxes from configuration master OPS-805
  • July 13 to 17: DB tied up with edx.org weekly release (problems with distribute upgrade), distribute upgrade reverted
  • July 14: Cypress RC1 cut (never released)
  • July 15: Joel's first day
  • July 17: Cypress RC2 released, YouTube API issue discovered
  • July 21 to 23: working on migration script
  • July 24: YouTube API fix merged to master
  • July 25: Cypress RC3 released
  • July 27: Birch.1 released
  • July 31: distribute upgrade merged back into master
  • Aug 7: Cypress RC4 released
  • Aug 12: marketing planning for Cypress final
  • Aug 13: Cypress final released

Getting the Vagrant boxes to build off of master before Cypress was released took maybe two weeks, with Feanil Patel doing most of that work. The YouTube API issue was a major unplanned effort for this release, and necessitated the creation of Birch.1 and Birch.2. Between Cypress RC3 (which we specifically delayed for the YouTube API fix) and the final Cypress release, about 2.5 weeks went to fixing issues in the configuration repo that were causing problems with installing Cypress: a missing /edx/bin/update, insights being installed improperly, etc.

Things That Went Well

  • Spent less time getting boxes made in the first place (better than for Birch)
  • Ed put together automation for building nightly boxes (need to revisit it)
  • Marketing came out well
  • Feanil is getting ready to give people more information about how to run in production
  • Marketing calendars were synced; marketing was ready to go on the day of release
  • Release went out in a good state

 

Things That Didn't Go Well

  1. People say the wrong name (sad)
  2. Incorrectly believed that automation/testing would catch errors due to distribute/setuptools upgrade (feanil, db, sarina)
    1. Unexpected behaviors due to setuptools upgrade, AND due to distribute downgrade
    2. DB figured something like this might happen – didn't realize the extent
    3. Process requires too much manual effort; needs more automation
      1. manually tag all of the repos
      2. manually test whether or not images are working
  3. Missed multiple dates
  4. Had to backport a lot of changes to Birch, unexpectedly (partially because we missed dates)
  5. Lack of clarity around support plans for named releases (db)
    1. Lack of capacity to support more than one named release at a time
    2. Planning to provide extended-support releases in the future
  6. Didn't pull in marketing resources early enough
  7. Lack of clarity about what would be included in the Cypress box until very late in the process
  8. Last-minute discovery that devstack and sandbox.sh installed different things (sarina, feanil)
    1. people treat it as production, even though it isn't
  9. This was the first time we were doing security fixes: no time was built into the schedule to deal with them (ned)
  10. Lack of metrics around the Open edX release process (joel, ed)
  11. Should have made marketing tickets for assets & made deadlines
  12. Marketing (db)
    1. Learning curve with WordPress
    2. Marketing doesn't know the response to their efforts – should keep them more in the loop
  13. Should target completion substantially before doing a marketing push (maybe a week before)
  14. Should have done testing of blog post sharing
  15. Other partners jumping the gun in terms of marketing Cypress, lack of collaboration with partners
  16. Developmental/holistic fuck-up with YouTube API
  17. Ambiguity of what goes into a release (joel, ed, feanil, sarina, ned)
    1. Lahore developer didn't tell the Open Source team about a critical bugfix that needed to go into Cypress (and production)
    2. We've been taking a feature-based approach – felt like there were a lot of arguments about which features should go in (possible misperception?)

 

Biggest problems: ambiguity and lack of automation

Action Items