Disable CSM Writes for known crawlers loading courseware

Description

This would have two parts:

1. A ConfigModel to specify a set of known crawlers (a minimal sketch follows this list).
2. Adding a flag to the DjangoKeyValueStore constructor to disable writes, and passing that flag through courseware when we get crawler traffic (sketched after the paragraph below).
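
For the first part, here is a minimal sketch assuming edx-platform's config_models app. The class name (CrawlersConfig), the known_user_agents field, and the is_crawler() helper are illustrative placeholders, not the actual implementation:

from django.db import models
from config_models.models import ConfigurationModel


class CrawlersConfig(ConfigurationModel):
    """Admin-editable list of user agent substrings treated as crawlers."""

    # Comma-separated user agent fragments, e.g. "Googlebot,Bingbot".
    known_user_agents = models.TextField(
        blank=True,
        default="",
        help_text="Comma-separated user agent substrings that identify crawlers.",
    )

    @classmethod
    def is_crawler(cls, request):
        """Return True if the request's user agent matches a known crawler."""
        config = cls.current()
        if not config.enabled:
            return False
        user_agent = request.META.get("HTTP_USER_AGENT", "")
        fragments = [f.strip() for f in config.known_user_agents.split(",")]
        return any(fragment and fragment in user_agent for fragment in fragments)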

This is to help address observed latency spikes in which courseware index transactions block on lock contention because they're all trying to update state for the same sequential. It also keeps crawler traffic from having CSM side effects.
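
For the second part, a rough sketch of the kind of flag described above, written against the generic xblock KeyValueStore interface. The read_only parameter, the no-op write behavior, and the field_data_cache interface are assumptions for illustration; the real DjangoKeyValueStore constructor and wiring may differ:

from xblock.runtime import KeyValueStore


class DjangoKeyValueStore(KeyValueStore):
    """Sketch: a key-value store whose writes can be disabled per request."""

    def __init__(self, field_data_cache, read_only=False):
        self._field_data_cache = field_data_cache
        # When True (e.g. for known crawler traffic), all writes become no-ops.
        self._read_only = read_only

    def get(self, key):
        return self._field_data_cache.get(key)

    def has(self, key):
        return self._field_data_cache.has(key)

    def set(self, key, value):
        if self._read_only:
            return  # Skip the CSM write entirely for crawler requests.
        self._field_data_cache.set(key, value)

    def set_many(self, kv_dict):
        if self._read_only:
            return
        self._field_data_cache.set_many(kv_dict)

    def delete(self, key):
        if self._read_only:
            return
        self._field_data_cache.delete(key)

Courseware could then construct the store with something like DjangoKeyValueStore(cache, read_only=CrawlersConfig.is_crawler(request)), so crawler requests still render the page but never write to CSM.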

Risks:

  • Analytics discrepancies between emitted events and CSM state.

  • XBlocks that rely on write-then-read within the same request to function correctly.

Steps to Reproduce

None

Current Behavior

None

Expected Behavior

None

Reason for Variance

None

Release Notes

None

User Impact Summary

None

Activity

David Ormsbee
February 9, 2017, 10:52 PM

Update and follow-up on this: We've turned on the flag to disable CSM writes for known crawlers in production and seen a significant decrease in CSM write latency spikes. This is the 99th percentile chart for CSM writes, comparing a day with this feature turned on to the same period a week ago:

Blue is the last day, dashed yellow is last week, and the units are milliseconds. The time interval was 5 minutes. Our peaks have been cut down by about an order of magnitude, from 2-6 second spikes to much rarer ~200-500 ms peaks. I've also confirmed in Splunk logs that crawlers were running against our site today, so it's not just that we had a lucky day.

I'll keep an eye on this over the next couple of weeks, but our team isn't planning any more work on CSM health in the near term if this pattern holds. There is still the issue of pursuing limits on field data size (PLAT-529), but that's been a much rarer problem and I'm explicitly prioritizing other work in the short term. If you have any concerns, please let me know.

David Ormsbee
February 9, 2017, 10:56 PM

BTW, now that this noise has been cleared out of the courseware index request, we're seeing traces that may indicate other performance issues in courseware rendering. Some of these are absurdly large sequences (e.g. hundreds of problems), but some look like they may point to separate issues with Old Mongo and CCX.

EdwardF
February 9, 2017, 11:24 PM

That's great news, Dave! Seems like we're now getting to the more interesting challenges.

Nimisha Asthagiri
February 10, 2017, 12:18 AM

Excellent!

David Ormsbee
February 14, 2017, 3:47 PM

I posted this in the perf channel, but here's one last graph for completeness on this ticket: the effect on 99th percentile courseware index views.

Assignee

David Ormsbee

Reporter

David Ormsbee

Labels

None

Reach

None

Impact

None

Platform Area

None

Customer

None

Partner Manager

None

URL

None

Contributor Name

None

Groups with Read-Only Access

None

Actual Points

None

Category of Work

None

Platform Map Area (Levels 1 & 2)

None

Platform Map Area (Levels 3 & 4)

None

Story Points

3

Sprint

None

Priority

Unset