Orders Endpoint
These are notes on the load testing process for the LMS commerce order creation endpoint. The LMS endpoint calls an ecommerce order creation and fulfillment endpoint, which calls back to the LMS enrollment API during fulfillment.
TL;DR
When running at 78 RPM (on two m3.medium instances), 95% of requests to the LMS commerce endpoint complete in 460 ms.
When running at 1000 RPM (on two m3.xlarge instances), 95% of requests to the LMS commerce endpoint complete in 460 ms.
Order fulfillment is bottlenecked by the enrollment API.
Performance Precedents
For reference, the statistics below represent throughput and response times, averaged over the seven-day period between March 5 and March 12, 2015, for the change_enrollment (iframe) and EnrollmentListView (enrollment API) processes on the LMS. These serve as targets for our new endpoint.
Process | RPM | Response time (ms) |
---|---|---|
change_enrollment (iframe) | 22.5 | 244 |
EnrollmentListView (enrollment API) | 5.75 | 273 |
Load Testing, Round 0
Throughput and Response Times
These are load test results for the LMS order creation endpoint, run on March 20, 2015 with autocommit enabled and atomic requests disabled. All response times are measured in milliseconds.
Users | RPS | RPM | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.20 | 12 | 580 | 580 | 600 | 600 | 600 | 610 | 610 | 610 | 609 |
2 | 0.60 | 36 | 580 | 590 | 610 | 610 | 630 | 690 | 710 | 710 | 713 |
3 | 0.70 | 42 | 580 | 600 | 610 | 620 | 630 | 670 | 2600 | 2600 | 2556 |
4 | 0.90 | 54 | 580 | 590 | 610 | 620 | 640 | 650 | 720 | 720 | 717 |
5 | 1.00 | 60 | 590 | 620 | 630 | 650 | 730 | 810 | 860 | 860 | 860 |
6 | 1.60 | 96 | 590 | 600 | 610 | 620 | 640 | 670 | 1100 | 1100 | 1112 |
7 | 1.70 | 102 | 620 | 660 | 670 | 670 | 690 | 730 | 860 | 860 | 861 |
8 | 2.10 | 126 | 600 | 620 | 630 | 640 | 680 | 700 | 750 | 750 | 746 |
9 | 2.60 | 156 | 600 | 650 | 660 | 690 | 700 | 730 | 780 | 780 | 776 |
10 | 2.60 | 156 | 580 | 610 | 630 | 650 | 670 | 720 | 770 | 770 | 774 |
The service started to fail at 900 RPM.
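These notes don't name the load generation tool, but the percentile breakdown above matches Locust's output format. As an illustration only, a minimal script of that shape might look like the following; the endpoint path, payload, and (omitted) authentication are placeholders rather than the real API contract.

```python
# locustfile.py -- illustrative sketch only; the real test requires an
# authenticated LMS session and the actual commerce API path and payload.
from locust import HttpUser, between, task


class OrderCreationUser(HttpUser):
    wait_time = between(1, 2)  # seconds between requests per simulated user

    @task
    def create_order(self):
        # POST to a placeholder order-creation path on the LMS.
        self.client.post(
            "/api/commerce/v0/orders/",
            json={"course_id": "course-v1:Demo+Demo+Demo"},
        )
```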
New Relic Readings
loadtest-edx-ecommerce
New Relic readings for the ecommerce service under stress. The ecommerce service was running on two m3.medium (single-core) instances with three Gunicorn workers each.
loadtest-edx-edxapp-lms
New Relic readings for the LMS.
"MySQL other" Investigation
New Relic indicated that the ecommerce service spent about 50% of its time (261 ms per request, on average) performing an operation referred to as "MySQL other" during requests to the OrdersView.post endpoint:
After a cursory inspection, we were under the impression that Oscar performed its own transaction management, which would conflict with the autocommit behavior enabled by default in Django 1.7. As a result, we thought Django was performing two commits for most database queries, doubling the amount of work done by the endpoint. Disabling autocommit (the `AUTOCOMMIT` key in the `DATABASES` settings dictionary) prevents Django from committing SQL queries to the database, resulting in a "50% improvement" in the endpoint's performance:
The effect of the change is very visible in the screenshot above. Between 1:50 and 2:00 PM, autocommit was enabled, allowing Django to perform many database commits, as evidenced by the "MySQL other" layer in the area graph. At 2:00 PM, autocommit was disabled, preventing Django from performing any commits and cutting the endpoint's response time in half.
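For reference, the change amounts to flipping one flag in the Django database configuration; a minimal sketch (engine and connection details are placeholders, not the service's actual settings):

```python
# settings.py -- minimal sketch; connection details are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'ecommerce',
        'USER': 'ecommerce001',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '3306',
        # False stops Django from committing each query in its own
        # transaction -- which is also why nothing was persisted (below).
        'AUTOCOMMIT': False,
    }
}
```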
However, it quickly became clear that the ecommerce service was not persisting anything to the database. The Oscar management dashboard appeared as follows:
Note the last order at 5:58 PM UTC, or 1:58 PM EDT. New Relic appeared as follows:
Load Testing, Round 1
In addition to reducing the number of commits made to the database, we're now making use of connection pooling on the ecommerce side to reduce the number of new MySQL connections created by Django.
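The notes don't record the exact pooling mechanism. Django has no built-in pool for MySQL; the closest built-in equivalent is persistent connections via `CONN_MAX_AGE`, sketched below purely as an illustration (the service may be using a dedicated pooling library instead):

```python
# settings.py -- illustration only; the actual mechanism may differ.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'ecommerce',   # placeholder
        # Reuse connections across requests for up to 60 seconds instead of
        # opening a new MySQL connection for every request.
        'CONN_MAX_AGE': 60,
    }
}
```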
One m3.medium, three synchronous workers
These statistics serve as a baseline for a single m3.medium instance running three synchronous workers. The number of concurrent users matches the number of available workers, so no request queues behind a busy worker and the response times represent the best case.
Users | RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|---|
3 | 0.39 | 23.4 | 570 | 0.00% |
While running this test, we observed CPU utilization of only 1% on the server. Since we're bottlenecked by the number of workers rather than by CPU, Ed suggested increasing the worker count until we hit 60-80% CPU utilization so that we can use our allotted resources more effectively.
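As a rough sketch, the worker count is a single Gunicorn setting; the values below are placeholders to be tuned against the 60-80% CPU target, not the service's actual configuration:

```python
# gunicorn.conf.py -- sketch only; raise `workers` until CPU utilization
# sits in the 60-80% range.
import multiprocessing

bind = "0.0.0.0:8002"                          # placeholder address/port
workers = 2 * multiprocessing.cpu_count() + 1  # common starting heuristic
timeout = 30                                   # seconds before a hung worker is killed
```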
One m3.medium, three asynchronous workers
The `analytics_productrecord update` query locks up the database and causes workers to almost immediately pile up and time out.
Bottleneck: Oscar Analytics
Now that the issue with overly frequent committing (the "MySQL other" problem) has been resolved, the `analytics_productrecord` UPDATE query is by far the slowest query in our system and the dominant bottleneck:
The query itself increments a counter on a row in the database indicating the number of times a product was purchased, locking the row and preventing other worker processes from writing to it, even at moderate load.
```sql
UPDATE `analytics_productrecord` SET `num_purchases` = `analytics_productrecord`.`num_purchases` + %s WHERE `analytics_productrecord`.`product_id` = %s
```
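For context, this is the SQL an atomic counter increment through the Django ORM produces; the sketch below shows the pattern (Oscar's actual analytics receiver may be structured differently, and `record_purchase` is a hypothetical helper name):

```python
# Sketch of the ORM pattern behind the query above; not Oscar's actual code.
from django.db.models import F
from oscar.apps.analytics.models import ProductRecord  # stock Oscar app path


def record_purchase(product):
    # Emits roughly:
    #   UPDATE analytics_productrecord
    #      SET num_purchases = num_purchases + 1
    #    WHERE product_id = %s
    # Under InnoDB this takes a row-level lock on the product's record, so
    # concurrent workers buying the same product serialize on this statement.
    ProductRecord.objects.filter(product=product).update(
        num_purchases=F('num_purchases') + 1
    )
```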
At approximately 120 RPM, we begin hitting this bottleneck.
Load Testing, Round 2 (Soak Test)
The Oscar Analytics bottleneck has been removed, allowing us to use async workers. We're now ready to try a third round of testing in an attempt to characterize our performance profile. This run will be a soak test, run over a 64-hour period from Friday, March 27, into Monday, March 30.
Note: Since we're using async workers in production but not for local development, we need to be careful when introducing new libraries that make assumptions about thread safety.
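The notes don't say which async worker class is in use; purely as an illustration, switching Gunicorn from its default sync workers to gevent workers is a one-line change:

```python
# gunicorn.conf.py -- illustration; the production worker class is an
# assumption, not something recorded in these notes.
workers = 3
worker_class = "gevent"  # cooperative, greenlet-based workers ("sync" is the default)
# Libraries that assume one request per thread/process need vetting before
# they're introduced under this model.
```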
Throughput and Response Times
m3.xlarge instances are the cheapest machines Amazon provides with a gigabit connection. We'd hoped using them would reduce time spent in the network, but upon closer inspection, we saw that we weren't even close to saturating the connection. Statistics below are the result of running on two m3.xlarge instances.
RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|
17.30 | 1038 | 460 | 0.01% |
The commerce endpoint on the LMS serviced 3,747,227 requests over a 64-hour period. The service was able to handle 2,334 RPM before failure.
Performance appears constant over time as the size of the database grows. The New Relic screenshot below is for the ecommerce order creation and fulfillment endpoint (excludes the initial LMS call).
m3.medium instances are more affordable than m3.xlarge instances, and are capable of handling the ecommerce service's production performance requirements. Statistics below are the result of running on two m3.medium instances.
RPS | RPM | 95% (ms) | Error rate |
---|---|---|---|
1.3 | 78 | 460 | 0.00% |
The service was able to handle 660 RPM before failure.