Orders Endpoint
These are notes on the load testing process for the LMS commerce order creation endpoint. The LMS endpoint calls an ecommerce order creation and fulfillment endpoint, which calls back to the LMS enrollment API during fulfillment.
TL;DR
When running at 78 RPM (on two m3.medium instances), 95% of requests to the LMS commerce endpoint complete in 460 ms.
When running at 1000 RPM (on two m3.xlarge instances), 95% of requests to the LMS commerce endpoint complete in 460 ms.
Order fulfillment is bottlenecked by the enrollment API.
Performance Precedents
For reference, the statistics below represent throughput and response times, averaged over the seven-day period between March 5 and March 12, 2015, for the change_enrollment (iframe) and EnrollmentListView (enrollment API) processes on the LMS. These serve as targets for our new endpoint.
Process | RPM | Response time (ms) |
---|---|---|
change_enrollment (iframe) | 22.5 | 244 |
EnrollmentListView (enrollment API) | 5.75 | 273 |
Load Testing, Round 0
Throughput and Response Times
These are load test results for the LMS order creation endpoint, run on March 20, 2015 with autocommit enabled and atomic requests disabled. All response times are measured in milliseconds.
Users | RPS | RPM | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.20 | 12 | 580 | 580 | 600 | 600 | 600 | 610 | 610 | 610 | 609 |
2 | 0.60 | 36 | 580 | 590 | 610 | 610 | 630 | 690 | 710 | 710 | 713 |
3 | 0.70 | 42 | 580 | 600 | 610 | 620 | 630 | 670 | 2600 | 2600 | 2556 |
4 | 0.90 | 54 | 580 | 590 | 610 | 620 | 640 | 650 | 720 | 720 | 717 |
5 | 1.00 | 60 | 590 | 620 | 630 | 650 | 730 | 810 | 860 | 860 | 860 |
6 | 1.60 | 96 | 590 | 600 | 610 | 620 | 640 | 670 | 1100 | 1100 | 1112 |
7 | 1.70 | 102 | 620 | 660 | 670 | 670 | 690 | 730 | 860 | 860 | 861 |
8 | 2.10 | 126 | 600 | 620 | 630 | 640 | 680 | 700 | 750 | 750 | 746 |
9 | 2.60 | 156 | 600 | 650 | 660 | 690 | 700 | 730 | 780 | 780 | 776 |
10 | 2.60 | 156 | 580 | 610 | 630 | 650 | 670 | 720 | 770 | 770 | 774 |
The service started to fail at 900 RPM.
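These notes don't name the load generation tool, but the percentile breakdown above matches Locust's output format. As an illustration only, a minimal script of that shape might look like the following; the endpoint path, payload, and (omitted) authentication are placeholders rather than the real API contract.

```python
# locustfile.py -- illustrative sketch only; the real test requires an
# authenticated LMS session and the actual commerce API path and payload.
from locust import HttpUser, between, task


class OrderCreationUser(HttpUser):
    wait_time = between(1, 2)  # seconds between requests per simulated user

    @task
    def create_order(self):
        # POST to a placeholder order-creation path on the LMS.
        self.client.post(
            "/api/commerce/v0/orders/",
            json={"course_id": "course-v1:Demo+Demo+Demo"},
        )
```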
New Relic Readings
loadtest-edx-ecommerce
New Relic readings for the ecommerce service under stress. The ecommerce service was running on two m3.medium (single-core) instances with three Gunicorn workers each.
loadtest-edx-edxapp-lms
New Relic readings for the LMS.
"MySQL other" Investigation
New Relic indicated that the ecommerce service spent about 50% of its time (261 ms per request, on average) performing an operation referred to as "MySQL other" during requests to the OrdersView.post endpoint:
After a cursory inspection, we were under the impression that Oscar performed its own transaction management, which would conflict with the autocommit behavior enabled by default in Django 1.7. As a result, we thought Django was performing two commits for most database queries, doubling the amount of work done by the endpoint. Disabling autocommit (the `AUTOCOMMIT` key in the `DATABASES` settings dictionary) prevents Django from committing SQL queries to the database, resulting in a "50% improvement" in the endpoint's performance:
The effect of the change is very visible in the screenshot above. Between 1:50 and 2:00 PM, autocommit was enabled, allowing Django to perform many database commits, as evidenced by the "MySQL other" layer in the area graph. At 2:00 PM, autocommit was disabled, preventing Django from performing any commits and cutting the endpoint's response time in half.
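For reference, the change amounts to flipping one flag in the Django database configuration; a minimal sketch (engine and connection details are placeholders, not the service's actual settings):

```python
# settings.py -- minimal sketch; connection details are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'ecommerce',
        'USER': 'ecommerce001',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '3306',
        # False stops Django from committing each query in its own
        # transaction -- which is also why nothing was persisted (below).
        'AUTOCOMMIT': False,
    }
}
```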
However, it quickly became clear that the ecommerce service was not persisting anything to the database. The Oscar management dashboard appeared as follows:
Note the last order at 5:58 PM UTC, or 1:58 PM EDT. New Relic appeared as follows:
Load Testing, Round 1
In addition to reducing the number of commits made to the database, we're now making use of connection pooling on the ecommerce side to reduce the number of new MySQL connections created by Django.
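The notes don't record the exact pooling mechanism. Django has no built-in pool for MySQL; the closest built-in equivalent is persistent connections via `CONN_MAX_AGE`, sketched below purely as an illustration (the service may be using a dedicated pooling library instead):

```python
# settings.py -- illustration only; the actual mechanism may differ.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'ecommerce',   # placeholder
        # Reuse connections across requests for up to 60 seconds instead of
        # opening a new MySQL connection for every request.
        'CONN_MAX_AGE': 60,
    }
}
```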
One m3.medium, three synchronous workers
These statistics serve as a baseline for a single m3.medium instance running three synchronous workers. The number of concurrent users matches the number of available workers, so no request queues behind a busy worker and the response times represent the best case.
Users | RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|---|
3 | 0.39 | 23.4 | 570 | 0.00% |
While running this test, we observed CPU utilization of only 1% on the server. Since we're bottlenecked by the number of workers rather than by CPU, Ed suggested increasing the worker count until we hit 60-80% CPU utilization so that we can use our allotted resources more effectively.
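As a rough sketch, the worker count is a single Gunicorn setting; the values below are placeholders to be tuned against the 60-80% CPU target, not the service's actual configuration:

```python
# gunicorn.conf.py -- sketch only; raise `workers` until CPU utilization
# sits in the 60-80% range.
import multiprocessing

bind = "0.0.0.0:8002"                          # placeholder address/port
workers = 2 * multiprocessing.cpu_count() + 1  # common starting heuristic
timeout = 30                                   # seconds before a hung worker is killed
```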
One m3.medium, three asynchronous workers
The `analytics_productrecord update` query locks up the database and causes workers to almost immediately pile up and time out.
Bottleneck: Oscar Analytics
Now that the issue with overly frequent committing (the "MySQL other" problem) has been resolved, the `analytics_productrecord` UPDATE query is by far the slowest query in our system and the dominant bottleneck:
The query itself increments a counter on a row in the database indicating the number of times a product was purchased, locking the row and preventing other worker processes from writing to it, even at moderate load.
```sql
UPDATE `analytics_productrecord` SET `num_purchases` = `analytics_productrecord`.`num_purchases` + %s WHERE `analytics_productrecord`.`product_id` = %s
```
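For context, this is the SQL an atomic counter increment through the Django ORM produces; the sketch below shows the pattern (Oscar's actual analytics receiver may be structured differently, and `record_purchase` is a hypothetical helper name):

```python
# Sketch of the ORM pattern behind the query above; not Oscar's actual code.
from django.db.models import F
from oscar.apps.analytics.models import ProductRecord  # stock Oscar app path


def record_purchase(product):
    # Emits roughly:
    #   UPDATE analytics_productrecord
    #      SET num_purchases = num_purchases + 1
    #    WHERE product_id = %s
    # Under InnoDB this takes a row-level lock on the product's record, so
    # concurrent workers buying the same product serialize on this statement.
    ProductRecord.objects.filter(product=product).update(
        num_purchases=F('num_purchases') + 1
    )
```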
At approximately 120 RPM, we begin hitting this bottleneck.
Load Testing, Round 2 (Soak Test)
The Oscar Analytics bottleneck has been removed, allowing us to use async workers. We're now ready to try a third round of testing in an attempt to characterize our performance profile. This run will be a soak test, run over a 64-hour period from Friday, March 27, into Monday, March 30.
Note: Since we're using async workers in production but not for local development, we need to be careful when introducing new libraries that make assumptions about thread safety.
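The notes don't say which async worker class is in use; purely as an illustration, switching Gunicorn from its default sync workers to gevent workers is a one-line change:

```python
# gunicorn.conf.py -- illustration; the production worker class is an
# assumption, not something recorded in these notes.
workers = 3
worker_class = "gevent"  # cooperative, greenlet-based workers ("sync" is the default)
# Libraries that assume one request per thread/process need vetting before
# they're introduced under this model.
```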
Throughput and Response Times
m3.xlarge instances are the cheapest machines Amazon provides with a gigabit connection. We'd hoped using them would reduce time spent in the network, but upon closer inspection, we saw that we weren't even close to saturating the connection. Statistics below are the result of running on two m3.xlarge instances.
RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|
17.30 | 1038 | 460 | 0.01% |
The commerce endpoint on the LMS serviced 3,747,227 requests over a 64-hour period. The service was able to handle 2,334 RPM before failure.
Performance appears constant over time as the size of the database grows. The New Relic screenshot below is for the ecommerce order creation and fulfillment endpoint (excludes the initial LMS call).
m3.medium instances are more affordable than m3.xlarge instances, and are capable of handling the ecommerce service's production performance requirements. Statistics below are the result of running on two m3.medium instances.
RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|
1.3 | 78 | 460 | 0.00% |
The service was able to handle 660 RPM before failure.