Initial Tests:

Note: Locust user wait times were not correctly set here. Ignore user number and wait time and assume requests were being sent as soon as one of the users are ready.

In the initial testing phase of the discussion api, the wait time between each top-level task was 500ms. If a task has multiple calls, such as PATCH comment, the wait time between each of those calls were 1000ms. These waits are very short and represent an accelerated usage of the discussions api. The first test that was run was against BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) with 10 locust users. A high percentage of the responses were failures and the response times were in the 1000-3000ms range. Under the assumption that a course with more posts would have worse performance, it would be expected that this course which is the 5th biggest course in our system, would not perform. To get a better baseline, we tried a smaller course.

Initial Flowtest against very large course

This was tested against the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) against courses-loadtest.edx.org.

Name	reqs	#	fails	Avg	Min	Max	Median	95%
GET	GET_comment_list	213	190(47.15%)	1635	189	5574	960	4500
GET	GET_thread	100935	3035(2.92%)	1722	9	10003	1400	4400
GET	GET_thread_list	7788	51851(86.94%)	2186	163	9313	2000	4900
PATCH	PATCH_comment	207	7(3.27%)	1622	227	9537	1200	4000
PATCH	PATCH_thread	191	87(31.29%)	1210	147	6729	600	400
POST	POST_comment_comment	157	2(1.26%)	1973	363	8839	1500	5300
POST	POST_comment_response	444	216(32.73%)	2416	316	10080	1600	6800
POST	POST_thread	4449	26(0.58%)	1200	197	6663	900	3200

The errors that came up were all 500 server errors. When looking at NewRelic, these errors were all timeout errors.

#	occurences	Error
3035	GET_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
7	PATCH_comment:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
51851	GET_thread_list:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
83	PATCH_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
216	POST_comment_response:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
2	POST_comment_comment:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
4	PATCH_thread:	HTTPError('404 Client Error: NOT FOUND',)
26	POST_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
190	GET_comment_list:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)

The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) which is about 2000 more posts than the median. In this test, we set up 10 locust users (same as the previous) for 1 hour and yielded better results. It was noted that during the 1 hour, the response times gradually got worse and less consistent as more spikes showed up on the charts. To see if the issue was too much load, another test was run against the "same" course (~5000 threads, ~10,000 commentS) with 5 locust users. This test showed that the response times were actually much worse than the previous test. After analyzing both sets of data, it was realized that the creation of posts in the course had affected the performance of that course. It was found that the number of posts in the course had increase about 2-3 times in an hour. During a week, about 50,000 posts are made against ALL courses. In this 1 hour, a fifth of that was added to a SINGLE COURSE. This significant increase in posts lead to the slower response times in the second test.

Another test was to be run against a small course, without any thread creation.

Tests for smaller course which gradually grew

The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time). Unlike the loadtest with the very large course "BerkeleyX/ColWri2.2x/1T2014", the server was able to handle the requests without returning 500s. These numbers seem more reasonable. The test ran for 1 hour.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	1718	0(0.00%)	459	163	1783	410	0.1	950
GET	GET_thread	106253	0(0.00%)	195	153	2302	180	10.3	270
GET	GET_thread_list	61009	0(0.00%)	595	155	4064	460	6.8	1700
PATCH	PATCH_comment	1719	0(0.00%)	593	208	2122	520	0.1	1100
PATCH	PATCH_thread	1883	19(1.00%)	403	166	2283	360	0.3	810
POST	POST_comment_comment	1373	0(0.00%)	966	591	4261	900	0.1	1600
POST	POST_comment_response	4135	0(0.00%)	928	546	4194	860	0.5	1600
POST	POST_thread	4688	0(0.00%)	492	424	4102	460	0.4	610

As the loadtest continued, the 95th and 50th percentile gradually became slower and slower. The throughput was very high compared to that of production (~100rpm for a similar call /django_comment_client.forum.views:single_thread) so the test will be rerun with less load.

After analyzing this data, it was realized that the number of posts in this course had actually altered the response times of the http requests. The course had increased about 2-3 times. In a week, about 20,000 threads, 20,000 responses, and 10,000 comments are created in ALL COURSES. In this 1 hour test, we have created a fifth of that value against a single course. Take note that the following tests were against the same course, but with the new increased post count.

The test below was against the same course but at half the throughput. As shown on the graph, the response times were slower. This test ran for 30 minutes. Again, we see that the response times increases slowly as the RPMs decreases appropriately.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	306	0(0.00%)	517	179	1227	490	0.1	910
GET	GET_thread	19602	0(0.00%)	198	150	2403	180	4.2	320
GET	GET_thread_list	11150	1(0.01%)	915	156	4026	770	3.8	1900
PATCH	PATCH_comment	306	0(0.00%)	768	423	2007	690	0	1200
PATCH	PATCH_thread	370	2(0.54%)	508	253	2043	480	0	890
POST	POST_comment_comment	268	0(0.00%)	1139	775	2556	1100	0.1	1500
POST	POST_comment_response	735	0(0.00%)	1108	715	2573	1000	0.1	1500
POST	POST_thread	833	0(0.00%)	478	416	1925	450	0	600

In this test, the course HarvardX/1368.2x/2T2015 (92 threads, 1697 comments) was used with 10 locust users for 30 minutes without an post creation. As expected, we observed very consistent and lower response times.

Initial test for a course smaller than the last while removing post creation

Name	reqs	#	fails	Avg	Min	Max	Median	95%
GET	GET_comment_list	476	0(0.00%)	271	136	555	280	440
GET	GET_thread	28637	0(0.00%)	176	127	2177	150	280
GET	GET_thread_list	15319	0(0.00%)	229	124	1365	170	510
PATCH	PATCH_comment	476	0(0.00%)	348	175	1013	340	560
PATCH	PATCH_thread	527	6(1.13%)	261	142	1347	240	380

In this test, we would like to compare a course that was similar to the size of the initial BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) and one that is bigger than HarvardX/1368.2x/2T2015 (92 threads, 1697 comments). The course used SMES/ASLCx/1T2015 (1700 Threads, 3047 comments). As expected, the response times were comparable to BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) and slower than HarvardX/1368.2x/2T2015 (92 threads, 1697 comments). An additional test was run against a tiny course TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments). As expected, the response times for TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments) were the best by far.

Test comparing course sizes

On the left is the initial portion of the BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) loadtest. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) loadtest.

On the left is the HarvardX/1368.2x/2T2015 (92 threads, 1697 comments) load test. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) load test.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	324	0(0.00%)	221	139	1036	190	0.1	460
GET	GET_thread	20485	0(0.00%)	168	124	2079	170	12.7	220
GET	GET_thread_list	10919	0(0.00%)	461	132	1677	450	5.6	840
PATCH	PATCH_comment	325	0(0.00%)	348	176	1354	300	0.1	720
PATCH	PATCH_thread	356	3(0.84%)	272	134	895	230	0.2	500

On the left is the TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments) load test. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) load test.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	474	0(0.00%)	188	139	433	180	0.3	250
GET	GET_thread	29492	0(0.00%)	169	129	1408	160	10.3	220
GET	GET_thread_list	15707	0(0.00%)	198	127	1322	160	4.6	340
PATCH	PATCH_comment	474	0(0.00%)	314	177	1042	300	0.3	510
PATCH	PATCH_thread	590	7(1.17%)	246	148	1315	230	0.2	380

Tests while Mixing Courses:

In this series of tests, we will be testing against different sized courses with different ratios. There are spikes in response times in production and one possibility may be that the few very large courses may be the reason. In this initial test, we ran it for 3 hours, with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments). We can see that there are spikes in the response times. There have been many 500 responses from the large course which needs to be addressed. There were many unexpected 500 errors during this test. Upon further investigation, it was found that there was a memory issue due to missing indexes.

Initial mixed course test

Course Size	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
Large	GET_comment_list	106	0(0.00%)	864	263	4680	680	0	2700
	GET_thread	6053	0(0.00%)	213	156	1271	200	0.3	280
	GET_thread_list	2928	240(7.58%)	1012	175	5465	1000	0.2	1800
	PATCH_comment	106	0(0.00%)	1017	305	4738	790	0	2700
	PATCH_thread	81	3(3.57%)	681	167	4083	420	0	2000
Small	GET_comment_list	3054	0(0.00%)	247	143	1484	210	0.2	550
	GET_thread	183499	1(0.00%)	187	0	6247	180	28.1	250
	GET_thread_list	97488	21(0.02%)	501	141	6685	420	12.7	1100
	PATCH_comment	3054	1(0.03%)	368	173	4345	330	0.1	750
	PATCH_thread	3305	43(1.28%)	287	140	2584	250	0.2	570

The chart below is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) loadtest by itself for reference. These numbers are not comparable since the test in the chart below was at 10 users vs 20 users in the above test.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	324	0(0.00%)	221	139	1036	190	0.1	460
GET	GET_thread	20485	0(0.00%)	168	124	2079	170	12.7	220
GET	GET_thread_list	10919	0(0.00%)	461	132	1677	450	5.6	840
PATCH	PATCH_comment	325	0(0.00%)	348	176	1354	300	0.1	720
PATCH	PATCH_thread	356	3(0.84%)	272	134	895	230	0.2	500

Addressing the 500s:

Although only present on the larger courses, a significant amount of 500s were being returned. A solution has been proposed to include an index to account for this but for now in order to continue the load tests, the offending parameters "last_activity_at" and "asc" will be removed from out random selection. This also means that our thread retrieval for PATCHing (and DELETEing which has been disabled) will be less random, but sufficiently random. By removing these parameters, we hope to get rid of these 500s and see a performance improvement for large and small courses.

Test reruns:

These tests have the non-indexed fields removed just to verify that those were causing the 500s. All tests were run with 10 locust users for 30 minutes. Response times were all better, no spikes, and response times were better the smaller the course.

SMES/ASLCx/1T2015 (1700 Threads, 3047 comments)

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	262	0(0.00%)	188	148	644	180	0.2	260
GET	GET_thread	16778	0(0.00%)	185	137	1201	180	17.6	240
GET	GET_thread_list	9125	0(0.00%)	417	140	1426	410	10	750
PATCH	PATCH_comment	260	1(0.38%)	348	179	1681	300	0	660
PATCH	PATCH_thread	326	3(0.91%)	248	146	713	220	0.1	440

BerkeleyX/ColWri2.2x/1T2014 (~25,000 threads, 30,000 comments)

Name	reqs	#	fails	Avg	Min	Max	Median	req/s
GET	GET_comment_list	229	0(0.00%)	346	156	2430	290	0.2
GET	GET_thread	14933	0(0.00%)	221	151	1847	210	14.8
GET	GET_thread_list	8005	0(0.00%)	493	168	1906	490	6.4
PATCH	PATCH_comment	226	0(0.00%)	477	217	4568	400	0.2
PATCH	PATCH_thread	259	2(0.77%)	393	188	2312	330	0.2

TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments)

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	514	0(0.00%)	176	133	514	170	0.2	240
GET	GET_thread	29734	0(0.00%)	164	126	2162	160	19.2	220
GET	GET_thread_list	15738	0(0.00%)	195	123	1427	150	10.1	340
PATCH	PATCH_comment	512	0(0.00%)	310	167	2377	290	0.4	520
PATCH	PATCH_thread	506	8(1.56%)	239	136	3293	220	0.3	380

After adding indices:

Rerun of mixed courses test after first index was added

This is before adding the "asc" index and after adding the "last_activity_at" index. The major difference is the lack of spikes, consistent RPM, better response times especially for the large course and no 500s.

This test was run for 3 minutes with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments).

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
Large	GET_comment_list	46	0(0.00%)	443	207	1155	370	0	840
	GET_thread	2404	0(0.00%)	225	168	1275	210	0.7	290
	GET_thread_list	1296	0(0.00%)	489	181	1104	480	0.6	830
	PATCH_comment	45	0(0.00%)	490	221	2060	450	0	780
	PATCH_thread	53	0(0.00%)	489	174	1912	410	0.1	1100
Small	GET_comment_list	877	0(0.00%)	211	154	1072	200	0.3	300
	GET_thread	55710	2(0.00%)	198	141	1562	190	28.2	260
	GET_thread_list	29694	0(0.00%)	449	147	3919	430	13.2	810
	PATCH_comment	875	4(0.46%)	340	184	1584	320	0.2	560
	PATCH_thread	1038	9(0.86%)	259	150	1055	240	0.5	420

Page size vs. Response time:

In a new test, it was found that SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) was slowing down over time. To identify what was happening, a new course was seeded which we will call DAPI (1000 Threads, 500 comments). In the analysis of the forums, we saw that the median body size was 250 characters. Our PATCH operations can be 4, 250, 1000, 5000, or 10000 characters. This test was run over 10 hours. The decrease in response time needed to be addressed.

page size vs response time

Over time DAPI (1000 threads, 500 comments) slowed down over time.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	15793	0(0.00%)	220	162	3116	200	0.2	340
GET	GET_thread	955097	1(0.00%)	214	-8	4509	200	23.8	290
GET	GET_thread_list	510921	0(0.00%)	445	15	5595	420	11.6	810
PATCH	PATCH_comment	15792	0(0.00%)	337	187	3851	320	0.2	490
PATCH	PATCH_thread	17333	164(0.94%)	273	161	3019	250	0.5	400

When running a similar test with a more verbose output, we clearly saw that the page size had an effect on the response time. Although not shown here, larger content size was somewhat correlated to slower response times, but page_size had the better correlation.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	page=1&page_size=100	238	0(0.00%)	791	655	1990	750	0.4	960
GET	page=2&page_size=100	282	0(0.00%)	762	659	1526	730	0.4	910
GET	page=1&page_size=75	240	0(0.00%)	651	544	1038	620	0.2	830
GET	page=2&page_size=75	255	0(0.00%)	677	573	1628	650	0	820
GET	page=7&page_size=100	260	0(0.00%)	568	461	2637	520	0.4	710
GET	page=6&page_size=100	232	0(0.00%)	559	466	843	520	0.1	710
GET	page=9&page_size=100	222	0(0.00%)	568	460	1907	530	0.5	700
GET	page=3&page_size=100	262	0(0.00%)	556	462	1807	520	0.2	700
GET	page=5&page_size=100	259	0(0.00%)	556	463	1375	510	0.3	700
GET	page=4&page_size=100	276	0(0.00%)	555	456	840	520	0.2	690
GET	page=8&page_size=100	236	0(0.00%)	548	468	903	520	0.1	680
GET	page=3&page_size=75	251	0(0.00%)	543	445	1348	510	0.3	670
GET	page=2&page_size=50	278	0(0.00%)	507	415	1141	480	0.5	670
GET	page=1&page_size=50	251	0(0.00%)	512	425	1043	490	0.1	660
GET	page=8&page_size=75	237	0(0.00%)	481	397	1644	440	0.1	650
GET	page=3&page_size=50	247	0(0.00%)	523	451	903	500	0.4	640
GET	page=6&page_size=75	238	0(0.00%)	492	398	1634	450	0.3	640
GET	page=4&page_size=75	218	0(0.00%)	480	395	886	440	0.2	630
GET	page=7&page_size=75	227	0(0.00%)	474	391	795	440	0.5	620
GET	page=5&page_size=75	239	0(0.00%)	466	388	1394	440	0.2	610
GET	page=4&page_size=50	237	0(0.00%)	453	371	1010	420	0.2	610
GET	page=9&page_size=75	216	0(0.00%)	467	397	996	430	0.4	600
GET	page=6&page_size=50	252	0(0.00%)	391	321	1065	360	0.4	540
GET	page=5&page_size=25	249	0(0.00%)	392	323	1367	370	0.4	530
GET	page=7&page_size=50	243	0(0.00%)	392	319	997	360	0.4	530
GET	page=9&page_size=50	221	0(0.00%)	391	330	658	360	0	520
GET	page=8&page_size=50	241	0(0.00%)	388	317	1300	360	0.4	520
GET	page=5&page_size=50	204	0(0.00%)	386	322	723	360	0.1	520
GET	page=2&page_size=25	259	0(0.00%)	363	299	1124	340	0.4	520
GET	page=1&page_size=25	242	0(0.00%)	373	296	1142	350	0.4	510
GET	page=4&page_size=25	238	0(0.00%)	361	292	1171	340	0.3	500
GET	page=6&page_size=25	256	0(0.00%)	365	312	687	350	0.3	480
GET	page=3&page_size=25	232	0(0.00%)	356	302	752	340	0.9	480
GET	page=7&page_size=25	261	0(0.00%)	362	310	590	350	0.1	470
GET	page=9&page_size=25	274	0(0.00%)	304	254	797	280	0.3	430
GET	page=8&page_size=25	232	0(0.00%)	300	253	644	280	0	420
GET	page=8&page_size=1	235	0(0.00%)	228	191	1025	210	0	320
GET	page=3&page_size=1	255	0(0.00%)	235	185	1106	220	0.2	310
GET	page=9&page_size=1	248	0(0.00%)	234	184	1147	210	0.3	310
GET	page=5&page_size=1	247	0(0.00%)	233	186	851	220	0.5	300
GET	page=7&page_size=1	231	0(0.00%)	234	188	2104	220	0.3	290
GET	page=2&page_size=1	251	0(0.00%)	226	187	972	220	0.4	290
GET	page=1&page_size=1	255	0(0.00%)	223	186	461	210	0.3	290
GET	page=4&page_size=1	260	0(0.00%)	222	188	400	210	0.3	290
GET	page=6&page_size=1	266	0(0.00%)	221	187	373	220	0.1	280

GETting a single threads vs comment_count

This tests takes the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) and then retrieves a single thread. This thread is then grouped by its comment_count. This test was run for 1 hour. As expected, because of pagination issues, threads with more comments are taking longer to return.

GET thread vs comment_count

Note that the slow transactions are the same thread.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	Comment_Range:0-25	21567	0(0.00%)	330	263	2050	320	5.7	420
GET	Comment_Range:25-50	3648	0(0.00%)	358	280	932	340	1.1	460
GET	Comment_Range:50-75	802	0(0.00%)	391	308	919	370	0.7	510
GET	Comment_Range:75-100	446	0(0.00%)	424	325	859	400	0.1	570
GET	Comment_Range:100-125	1114	0(0.00%)	444	349	1613	420	0.2	590
GET	Comment_Range:125-150	56	0(0.00%)	486	391	762	470	0	700
GET	Comment_Range:150-175	1745	0(0.00%)	518	403	1029	500	0.3	720
GET	Comment_Range:175-200	1610	0(0.00%)	532	400	1487	510	0.5	730
GET	Comment_Range:225-250	853	0(0.00%)	595	456	1389	570	0.2	830
GET	Comment_Range:350-375	850	0(0.00%)	740	560	1853	700	0.1	1000
GET	Comment_Range:375-400	828	0(0.00%)	772	573	1926	720	0.3	1100
GET	Comment_Range:1575-1600	879	0(0.00%)	2194	1740	4731	2100	0.2	3100

Initial Loadtesting