Move Course Structure Document Caching to Local Filesystem

Description

Goal: Reduce inbound network traffic to the app servers enough to lower the number of instances.

Acceptance criteria:

  • Shift caching of structure documents to be local, not in the shared memcached.

  • Don't compress structure docs when storing locally.

  • Deploy and test on a sandbox.

In scope:

  • Verification on sandbox

  • Verification that Django does not intercede and do its own cache cleanup

Out of scope:

  • Deploying to Prod (This will require coordination with devops)

  • Temp / file cleanup mechanism

Background:
We're significantly over-provisioned on the LMS app server side for edx.org. We're using maybe 2/16 cores on those machines, and a third of the memory. Our capacity as measured by actively busy gunicorn workers typically hovers around 8%, with occasional spikes to 20+%. It seems like we should be able to drop half the cluster, move to cheaper machines, up the number of workers on each, and still have headroom. The only thing holding us back these days is that there is too much inbound network traffic per machine for us to reduce the count.

I believe that the bulk of that inbound network traffic is from the app servers grabbing course structures out of memcached, a step we took to prevent flooding the NAT between us and Compose. We could switch the course structures to use a FileBasedCache stored on each app server's local SSD. The nice thing is that there are very few active keys at any given time (basically the number of courses), and we'll almost always be hitting the page cache. We could also remove the compression piece while we're at it, and potentially see an improvement in server response times.
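For concreteness, here is a minimal sketch of what that settings change might look like, assuming Django's built-in FileBasedCache backend. The cache alias name, filesystem path, TTL, and culling numbers below are illustrative assumptions, not the actual edx-platform configuration:

    # Hypothetical sketch only: the alias, path, timeout, and culling values
    # are assumptions for illustration, not the real edx-platform settings.
    CACHES = {
        # ... the default/general caches stay on the shared memcached ...
        "course_structure_cache": {
            "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
            # Keep entries on the app server's local SSD instead of memcached,
            # so structure reads stop generating inbound network traffic.
            "LOCATION": "/edx/var/lms/course_structure_cache",
            # Active keys are roughly one per course, so a long TTL is cheap.
            "TIMEOUT": 60 * 60 * 24,
            "OPTIONS": {
                # FileBasedCache culls entries on its own once MAX_ENTRIES is
                # exceeded; this is the "Django cleanup" behavior the
                # acceptance criteria call out for verification.
                "MAX_ENTRIES": 10000,
                "CULL_FREQUENCY": 4,
            },
        },
    }

Note that the acceptance criterion about not compressing structure docs would be a change in the application code that serializes and stores the documents, not something this settings block controls.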

Steps to Reproduce

None

Current Behavior

None

Expected Behavior

None

Reason for Variance

None

Release Notes

None

User Impact Summary

None

Activity

Feanil Patel
May 11, 2016, 1:35 PM

My initial testing was done quite a while ago, when we first turned off the XML courses (about 10 months ago). I didn't get a chance to do a very deep investigation, so I would trust 's results more than what I saw in my initial testing. Here is what I had originally done:

After having significantly reduced the memory footprint of the gunicorn process, I updated the machines to run more workers (CPU and load were both quite low already). The new workers made better use of the memory and did not really change the load average or CPU usage on the machines as we reduced the number of machines to keep the overall gunicorn worker count the same. However, the overall latency of requests from the machines did increase. When I added more machines to bring the cluster back up to its previous size, the server-side latency returned to normal. I had concluded that we probably need to be on different-sized machines, and that the latency was likely due to network I/O.

David Ormsbee
May 11, 2016, 2:27 PM

Am I right to think that we should just file an OPS ticket to retry that experiment now then?

Kevin Falcone
May 11, 2016, 3:05 PM

I think bumping the gunicorn workers is a reasonable next step.

We can try doing it as a DEVOPS ticket tomorrow during business hours (build AMIs tonight, deploy in the morning, monitor), or we can spread it out into an OPS ticket if we need more input from ops (I don't think it'll be a high burden as long as I'm not dealing with prod issues). Thoughts?

David Ormsbee
May 11, 2016, 6:18 PM

Created

Toby Lawrence
May 11, 2016, 6:20 PM

Blocking this ticket on since that will determine whether this work needs to happen.

Assignee

Unassigned

Reporter

David Ormsbee

Labels

Reach

None

Impact

None

Platform Area

None

Customer

None

Partner Manager

None

URL

None

Contributor Name

None

Groups with Read-Only Access

None

Actual Points

None

Category of Work

None

Platform Map Area (Levels 1 & 2)

None

Platform Map Area (Levels 3 & 4)

None

Story Points

3

Epic Link

Priority

Unset