Arch Tea Time: 2020-04-02 (Tiger Team Findings)

Main Topic: Tiger Team Findings

  • Growing Pains Tiger Team Report - Mar 2020

    • It’s still under discussion which of these findings will be tackled, and when.

    • Document is focused on short-term wins. Separately, we need to look into longer-term wins.

      • We’re getting to the point where the “weird stuff we used to be able to ignore” (spikes at 4am, etc.) is no longer safe to ignore.

    • Three phases:

      • Prevent the building from burning down

      • Know better when the building is on fire

      • Put the fire out faster

Questions on Tiger Team report

  • Was that last PR that removed an edx-when index the biggest win on the edx-when front?

    • There was also a PR that lowered the number of writes to edx-when; that was the other big win.

    • Alex: There was recently pain with edx-when contributing to downtime.

    • edx-when was intended as a canary - are there things about it that we should note so that future canary-esque projects don’t cause such issues?

    • edx-when blew up because of a query that performed well enough until it broke through a certain level of SQL cache space; at that point, it became noticeably problematic.

      • Looking at SQL EXPLAIN output won’t always make these things obvious (see the sketch at the end of this question’s notes).

    • How might we use learnings from here to teach others and predict other future failures?

      • For example:

        • How to use and interpret EXPLAIN output?

      • One recommendation is having more training around diagnosing performance issues and designing to prevent them.
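
      • For illustration, a minimal sketch of asking the database for a query plan from Django; the model and filter in the usage comment are hypothetical stand-ins, not necessarily edx-when’s actual API:

```python
# Minimal sketch: print the SQL Django will generate and the database's
# EXPLAIN output for it. QuerySet.explain() has been available since Django 2.1.
def show_query_plan(queryset):
    print(queryset.query)      # the SQL Django will run
    print(queryset.explain())  # the database's plan for that SQL

# Hypothetical usage against an edx-when-style model:
#   from edx_when.models import ContentDate
#   show_query_plan(ContentDate.objects.filter(course_id=course_key))
```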

  • Since we can’t predict for sure exactly which features will break down at scale, should we invest time in adding Ops Kill Switch toggles as a preventive measure?

    • The group didn’t focus on this, because it’s hard to know which features to wrap in kill switches when we can’t predict where things will go wrong.

    • Kill switches in edx-when?

      • This would be tough, because features are built on top of edx-when.

      • Cale: We ought to think of kill switches at the feature level, not the library level (see the sketch after this list).

      • Feanil: Using more kill switches seems worth investing in, but would take time to arrive at a place where we could do this as an org.

      • Nimisha: We can try to write features that don’t depend on certain things (e.g., programs) being available, allowing use of kill switches without cascading feature failure.

      • Alex: We did talk about manual interventions / circuit breakers, e.g., being able to clear the Celery queue.

      • Feanil: A conversation worth having is around what features are worth wrapping in kill switches, which is a longer-term dialog that could happen between engineering and product.
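
    • For illustration, a minimal sketch of a feature-level kill switch using django-waffle’s switch_is_active(); the switch name and helper function below are hypothetical, not an existing edX API:

```python
# Minimal sketch of a feature-level kill switch using django-waffle.
# The switch name and fetch_programs() helper are hypothetical.
from waffle import switch_is_active

def fetch_programs(user, course_key):
    """Placeholder for the real (potentially expensive) downstream call."""
    ...

def get_program_context(user, course_key):
    """Return program data, or a safe empty value when the feature is killed."""
    if not switch_is_active("programs.enable_program_listing"):
        # Kill switch off: degrade gracefully instead of calling a
        # struggling downstream service.
        return {"programs": []}
    return {"programs": fetch_programs(user, course_key)}
```

      • The point of the pattern is that the fallback path must be cheap and independent of the failing dependency, so flipping the switch actually sheds load rather than cascading the failure.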

  • How costly to edX are the hardware improvements? How soon will we need to remove those expansions (for $$ reasons) once COVID has subsided?

    • Feanil: The monthly increase in cost is estimated to be around $1,700/month in permanent changes, plus about $30,000/month in temporary changes to the LMS database.

    • Gabe: Note that the increase in traffic has resulted in an incremental ~$1.8 million/month in edX revenue (after rev share)

    • Aside: It’d be great if more of us could do the “napkin math” on how much we make per month after rev share, etc.
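
    • As a rough example of that napkin math, using only the approximate figures above:

```python
# Rough napkin math using the approximate figures quoted above.
permanent_cost = 1_700           # $/month, permanent infrastructure changes
temporary_cost = 30_000          # $/month, temporary LMS database changes
incremental_revenue = 1_800_000  # $/month, incremental revenue after rev share

total_cost = permanent_cost + temporary_cost   # ~$31,700/month
print(total_cost / incremental_revenue)        # ~0.018, i.e. under 2% of the incremental revenue
```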

  • This seems like a good opportunity to “teach people to fish.” There could be value in the tiger team talking through the steps they took to identify the issues they brought up.

    • +1 - How might we upskill other engineers in this type of discovery as they become self-owners of their services?

    • Feanil: One of the recommendations is to do exactly this: set up repeatable performance-related training, similar to a11y or security training.

  • For MTTR, how long do rollbacks take today? What do we anticipate this time to be after the investments suggested in the report?

    • Feanil: One of the recommendations addresses the fact that this information is not easy to collect right now; we should invest in gathering data on the full deploy timeline to make better investment decisions in the future.

    • Another thing we noticed is that not everybody knows how to roll back, when to do it, or whether they are empowered to. A recommendation is to address that.

  • Alex: Another thing we didn’t even glance at: disaster recovery drills - have we done these in the past, should we do them in the future?

  • Should we spend time improving observability and our use of monitoring tools?

    • Recommendation: Adding some specific alerts that would have been helpful.

    • Recommendation: Reviewing infrastructure alerts and making sure they are signalling on things that the team cares about.

      • Would be good if teams have the knowledge and power to decide what should be monitored and alerted upon.

  • Are there recommendations around having a “pane of glass” (i.e., single dashboard for system monitoring)?

    • Recommendation: In NewRelic, track when management commands are run, marketing pushes happen, courses are opened, and other events that could impact performance (a sketch follows this list).

      • Would help us understand when “we did something that affected performance” or “someone else did something that affected performance”.

    • Recommendation: Improve existing observability dashboards.

    • Dave has some great NewRelic dashboards. We would be happy to spread knowledge on how to build dashboards.
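
    • For illustration, a minimal sketch of recording such an event with the New Relic Python agent’s custom-event API; the event type, attribute names, and usage are hypothetical, not an existing edX convention:

```python
# Minimal sketch: emit a custom event so dashboards can overlay
# "we did something" markers (management command, marketing push,
# course open) on performance graphs.
import newrelic.agent

def record_operational_event(kind, detail):
    newrelic.agent.record_custom_event(
        "OperationalEvent",               # hypothetical event type
        {"kind": kind, "detail": detail},
    )

# Hypothetical usage from a management command's handle():
#   record_operational_event("management_command", "backfill_course_dates")
```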

  • Configuration complexity?

    • Arch team plans to deprecate ConfigModel and move completely towards remote config

  • Monitoring business metrics vs. hardware performance: is one less noisy / more useful?

    • Context-dependent; both have ups and downs.

    • Hardware performance

      • Identifying causes of things like memory spikes takes investigation work.

Misc.

@Dave Ormsbee wants the programs cache to die