Goal: Reduce inbound network traffic to the app servers enough to lower the number of instances.
Shift caching of structure documents to be local, not in the shared memcached.
Don't compress structure docs when storing locally.
Deploy and test on a sandbox.
Verification on sandbox
Verification that Django does not intercede and do cleanup itself
Out of scope:
Deploying to prod (this will require coordination with devops)
Temp / file cleanup mechanism
We're significantly over-provisioned on the LMS app server side for edx.org. We're using maybe 2/16 cores on those machines, and a third of the memory. Our capacity as measured by actively busy gunicorn workers typically hovers around 8%, with occasional spikes to 20+%. It seems like we should be able to drop half the cluster, move to cheaper machines, up the number of workers on each, and still have headroom. The only thing holding us back these days is that we have too much network traffic going to the machines to reduce the count.
I believe that the bulk of that inbound network traffic is from the app servers grabbing course structures out of memcached, a step we took to prevent flooding the NAT between us and Compose. We could switch the course structures to use a FileBasedCache and store it locally on the app server's local SSD. The nice thing is that there are very few active keys at any given time (basically the number of courses), and we'll almost always be hitting the page cache. We could also remove the compression piece while we're at it, and potentially see an improvement to server response times.
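As a rough sketch of the proposed change, assuming a standard Django `CACHES` setup (the setting names and paths below are illustrative, not the actual edx-platform configuration), the course structure cache could be pointed at `FileBasedCache` on the local SSD while the shared memcached stays in place for everything else:

```python
# Hypothetical Django settings fragment. "course_structure_cache" and the
# LOCATION paths are assumptions for illustration, not real edx-platform values.
CACHES = {
    # Shared memcached remains the default for other cached data.
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "cache.example.internal:11211",
    },
    # Course structures move to a per-machine file-based cache. With very few
    # active keys (roughly one per course), reads should almost always hit the
    # OS page cache rather than the disk itself.
    "course_structure_cache": {
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": "/edx/var/cache/course_structures",
        "TIMEOUT": None,  # no expiry; cleanup is a separate mechanism (out of scope)
    },
}
```

Note that dropping the compression step would be a change in the application code that writes the structures, not in this cache configuration, since the compression is applied before the value is stored.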
My initial testing was about 10 months ago, when we first turned off the XML courses. I didn't get a chance to do a very deep investigation, so I would trust 's results more than what I saw in my initial testing. Here is what I had originally done:
After having significantly reduced the memory footprint of the gunicorn processes, I updated the machines to run more workers (CPU and load were both quite low already). The new workers made better use of the memory, and load average and CPU usage did not really change as we reduced the number of machines to keep the overall gunicorn worker count the same. However, the overall latency of requests from the machines did increase. When I added machines to get back to the previous count, server-side latency returned to normal. I concluded that we probably need differently sized machines, and that the latency increase was likely due to network I/O.
Am I right to think that we should just file an OPS ticket now to retry that experiment?
I think bumping the gunicorn workers is a reasonable next step.
We can try doing it as a DEVOPS ticket tomorrow during business hours (build AMIs tonight, deploy in the morning, monitor), or we can spread it out into an OPS ticket if we need more input from ops (I don't think it'll be a high burden as long as I'm not dealing with prod issues). Thoughts?
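For reference, the worker bump itself is a one-line config change. A minimal sketch, assuming a gunicorn.conf.py-style config file (the specific count is illustrative; the real number would come from the capacity measurements above, not this heuristic):

```python
# Hypothetical gunicorn.conf.py fragment. The (2 * cores) + 1 formula is the
# common heuristic from the gunicorn docs, used here only as a placeholder.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
```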
Blocking this ticket on since that will determine whether this work needs to happen.