...

In the initial testing phase of the discussion api, the wait time between each top-level task was 500ms. If a task has multiple calls, such as PATCH comment, the wait time between each of those calls were 1000ms. These waits are very short and represent an accelerated usage of the discussions api. The first test that was run was against BerkeleyXagainst BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) with 10 locust users. A high percentage of the responses were failures and the response times were in the 1000-3000ms range. Under the assumption that a course with more posts would have worse performance, it would be expected that this course which is the 5th biggest course in our system, would not perform. To get a better baseline, we tried a smaller course.

Expand

title	Initial Flowtest against very large course

This was tested against the course BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) against courses-loadtest.edx.org.

Name	reqs	#	fails	Avg	Min	Max	Median	95%
GET	GET_comment_list	213	190(47.15%)	1635	189	5574	960	4500
GET	GET_thread	100935	3035(2.92%)	1722	9	10003	1400	4400
GET	GET_thread_list	7788	51851(86.94%)	2186	163	9313	2000	4900
PATCH	PATCH_comment	207	7(3.27%)	1622	227	9537	1200	4000
PATCH	PATCH_thread	191	87(31.29%)	1210	147	6729	600	400
POST	POST_comment_comment	157	2(1.26%)	1973	363	8839	1500	5300
POST	POST_comment_response	444	216(32.73%)	2416	316	10080	1600	6800
POST	POST_thread	4449	26(0.58%)	1200	197	6663	900	3200

The errors that came up were all 500 server errors. When looking at NewRelic, these errors were all timeout errors.

#	occurences	Error
3035	GET_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
7	PATCH_comment:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
51851	GET_thread_list:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
83	PATCH_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
216	POST_comment_response:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
2	POST_comment_comment:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
4	PATCH_thread:	HTTPError('404 Client Error: NOT FOUND',)
26	POST_thread:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
190	GET_comment_list:	HTTPError('500 Server Error: INTERNAL SERVER ERROR',)

...

Another test was to be run against a small course, without any thread creation.

Anchor

	ramping_course
	ramping_course

Expand

title	Tests for smaller course which gradually grew

The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time). Unlike the loadtest with the very large course "BerkeleyX/ColWri2.2x/1T2014", the server was able to handle the requests without returning 500s. These numbers seem more reasonable. The test ran for 1 hour.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	1718	0(0.00%)	459	163	1783	410	0.1	950
GET	GET_thread	106253	0(0.00%)	195	153	2302	180	10.3	270
GET	GET_thread_list	61009	0(0.00%)	595	155	4064	460	6.8	1700
PATCH	PATCH_comment	1719	0(0.00%)	593	208	2122	520	0.1	1100
PATCH	PATCH_thread	1883	19(1.00%)	403	166	2283	360	0.3	810
POST	POST_comment_comment	1373	0(0.00%)	966	591	4261	900	0.1	1600
POST	POST_comment_response	4135	0(0.00%)	928	546	4194	860	0.5	1600
POST	POST_thread	4688	0(0.00%)	492	424	4102	460	0.4	610

As the loadtest continued, the 95th and 50th percentile gradually became slower and slower. The throughput was very high compared to that of production (~100rpm for a similar call /django_comment_client.forum.views:single_thread) so the test will be rerun with less load.

After analyzing this data, it was realized that the number of posts in this course had actually altered the response times of the http requests. The course had increased about 2-3 times. In a week, about 20,000 threads, 20,000 responses, and 10,000 comments are created in ALL COURSES. In this 1 hour test, we have created a fifth of that value against a single course. Take note that the following tests were against the same course, but with the new increased post count.

The test below was against the same course but at half the throughput. As shown on the graph, the response times were slower. This test ran for 30 minutes. Again, we see that the response times increases slowly as the RPMs decreases appropriately.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	306	0(0.00%)	517	179	1227	490	0.1	910
GET	GET_thread	19602	0(0.00%)	198	150	2403	180	4.2	320
GET	GET_thread_list	11150	1(0.01%)	915	156	4026	770	3.8	1900
PATCH	PATCH_comment	306	0(0.00%)	768	423	2007	690	0	1200
PATCH	PATCH_thread	370	2(0.54%)	508	253	2043	480	0	890
POST	POST_comment_comment	268	0(0.00%)	1139	775	2556	1100	0.1	1500
POST	POST_comment_response	735	0(0.00%)	1108	715	2573	1000	0.1	1500
POST	POST_thread	833	0(0.00%)	478	416	1925	450	0	600

...

In this series of tests, we will be testing against different sized courses with different ratios. There are spikes in response times in production and one possibility may be that the few very large courses may be the reason. In this initial test, we ran it for 3 hours, with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments). We can see that there are spikes in the response times. There have been many 500 responses from the large course which needs to be addressed. There were many unexpected 500 errors during this test. Upon further investigation, it was found that there was a memory issue due to missing indexes.

...

Expand

title	Rerun of mixed courses test after first index was added

This is before adding the "asc" index and after adding the "last_activity_at" index. The major difference is the lack of spikes, consistent RPM, better response times especially for the large course and no 500s.

This test was run for 3 minutes with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments).

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
Large	GET_comment_list	46	0(0.00%)	443	207	1155	370	0	840
	GET_thread	2404	0(0.00%)	225	168	1275	210	0.7	290
	GET_thread_list	1296	0(0.00%)	489	181	1104	480	0.6	830
	PATCH_comment	45	0(0.00%)	490	221	2060	450	0	780
	PATCH_thread	53	0(0.00%)	489	174	1912	410	0.1	1100
Small	GET_comment_list	877	0(0.00%)	211	154	1072	200	0.3	300
	GET_thread	55710	2(0.00%)	198	141	1562	190	28.2	260
	GET_thread_list	29694	0(0.00%)	449	147	3919	430	13.2	810
	PATCH_comment	875	4(0.46%)	340	184	1584	320	0.2	560
	PATCH_thread	1038	9(0.86%)	259	150	1055	240	0.5	420

Page size vs. Response time:
Anchor
psvrt
psvrt

In a new test, it was found that SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) was slowing down over time. To identify what was happening, a new course was seeded which we will call DAPI (1000 Threads, 500 comments). In the analysis of the forums, we saw that the median body size was 250 characters. Our PATCH operations can be 4, 250, 1000, 5000, or 10000 characters. This test was run over 10 hours. The decrease in response time needed to be addressed.

Expand

title	page size vs response time

Over time DAPI (1000 threads, 500 comments) slowed down over time.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	GET_comment_list	15793	0(0.00%)	220	162	3116	200	0.2	340
GET	GET_thread	955097	1(0.00%)	214	-8	4509	200	23.8	290
GET	GET_thread_list	510921	0(0.00%)	445	15	5595	420	11.6	810
PATCH	PATCH_comment	15792	0(0.00%)	337	187	3851	320	0.2	490
PATCH	PATCH_thread	17333	164(0.94%)	273	161	3019	250	0.5	400

When running a similar test with a more verbose output, we clearly saw that the page size had an effect on the response time. Although not shown here, larger content size was somewhat correlated to slower response times, but page_size had the better correlation.

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	page=1&page_size=100	238	0(0.00%)	791	655	1990	750	0.4	960
GET	page=2&page_size=100	282	0(0.00%)	762	659	1526	730	0.4	910
GET	page=1&page_size=75	240	0(0.00%)	651	544	1038	620	0.2	830
GET	page=2&page_size=75	255	0(0.00%)	677	573	1628	650	0	820
GET	page=7&page_size=100	260	0(0.00%)	568	461	2637	520	0.4	710
GET	page=6&page_size=100	232	0(0.00%)	559	466	843	520	0.1	710
GET	page=9&page_size=100	222	0(0.00%)	568	460	1907	530	0.5	700
GET	page=3&page_size=100	262	0(0.00%)	556	462	1807	520	0.2	700
GET	page=5&page_size=100	259	0(0.00%)	556	463	1375	510	0.3	700
GET	page=4&page_size=100	276	0(0.00%)	555	456	840	520	0.2	690
GET	page=8&page_size=100	236	0(0.00%)	548	468	903	520	0.1	680
GET	page=3&page_size=75	251	0(0.00%)	543	445	1348	510	0.3	670
GET	page=2&page_size=50	278	0(0.00%)	507	415	1141	480	0.5	670
GET	page=1&page_size=50	251	0(0.00%)	512	425	1043	490	0.1	660
GET	page=8&page_size=75	237	0(0.00%)	481	397	1644	440	0.1	650
GET	page=3&page_size=50	247	0(0.00%)	523	451	903	500	0.4	640
GET	page=6&page_size=75	238	0(0.00%)	492	398	1634	450	0.3	640
GET	page=4&page_size=75	218	0(0.00%)	480	395	886	440	0.2	630
GET	page=7&page_size=75	227	0(0.00%)	474	391	795	440	0.5	620
GET	page=5&page_size=75	239	0(0.00%)	466	388	1394	440	0.2	610
GET	page=4&page_size=50	237	0(0.00%)	453	371	1010	420	0.2	610
GET	page=9&page_size=75	216	0(0.00%)	467	397	996	430	0.4	600
GET	page=6&page_size=50	252	0(0.00%)	391	321	1065	360	0.4	540
GET	page=5&page_size=25	249	0(0.00%)	392	323	1367	370	0.4	530
GET	page=7&page_size=50	243	0(0.00%)	392	319	997	360	0.4	530
GET	page=9&page_size=50	221	0(0.00%)	391	330	658	360	0	520
GET	page=8&page_size=50	241	0(0.00%)	388	317	1300	360	0.4	520
GET	page=5&page_size=50	204	0(0.00%)	386	322	723	360	0.1	520
GET	page=2&page_size=25	259	0(0.00%)	363	299	1124	340	0.4	520
GET	page=1&page_size=25	242	0(0.00%)	373	296	1142	350	0.4	510
GET	page=4&page_size=25	238	0(0.00%)	361	292	1171	340	0.3	500
GET	page=6&page_size=25	256	0(0.00%)	365	312	687	350	0.3	480
GET	page=3&page_size=25	232	0(0.00%)	356	302	752	340	0.9	480
GET	page=7&page_size=25	261	0(0.00%)	362	310	590	350	0.1	470
GET	page=9&page_size=25	274	0(0.00%)	304	254	797	280	0.3	430
GET	page=8&page_size=25	232	0(0.00%)	300	253	644	280	0	420
GET	page=8&page_size=1	235	0(0.00%)	228	191	1025	210	0	320
GET	page=3&page_size=1	255	0(0.00%)	235	185	1106	220	0.2	310
GET	page=9&page_size=1	248	0(0.00%)	234	184	1147	210	0.3	310
GET	page=5&page_size=1	247	0(0.00%)	233	186	851	220	0.5	300
GET	page=7&page_size=1	231	0(0.00%)	234	188	2104	220	0.3	290
GET	page=2&page_size=1	251	0(0.00%)	226	187	972	220	0.4	290
GET	page=1&page_size=1	255	0(0.00%)	223	186	461	210	0.3	290
GET	page=4&page_size=1	260	0(0.00%)	222	188	400	210	0.3	290
GET	page=6&page_size=1	266	0(0.00%)	221	187	373	220	0.1	280

This tests takes the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) and then retrieves a single thread. This thread is then grouped by its comment_count. This test was run for 1 hour. As expected, because of pagination issues, threads with more comments are taking longer to return.

Expand

title	GET thread vs comment_count

Note that the slow transactions are the same thread. Comment_count is also linearly correlated. There is a case where the comment_count was ~4000. The approximate formula to calculate the response time for this context is (comment_count*1.4+360). A thread with ~4000 comments would take approximately 8200ms.

Image Added

Name	reqs	#	fails	Avg	Min	Max	Median	req/s	95%
GET	Comment_Range:0-25	21567	0(0.00%)	330	263	2050	320	5.7	420
GET	Comment_Range:25-50	3648	0(0.00%)	358	280	932	340	1.1	460
GET	Comment_Range:50-75	802	0(0.00%)	391	308	919	370	0.7	510
GET	Comment_Range:75-100	446	0(0.00%)	424	325	859	400	0.1	570
GET	Comment_Range:100-125	1114	0(0.00%)	444	349	1613	420	0.2	590
GET	Comment_Range:125-150	56	0(0.00%)	486	391	762	470	0	700
GET	Comment_Range:150-175	1745	0(0.00%)	518	403	1029	500	0.3	720
GET	Comment_Range:175-200	1610	0(0.00%)	532	400	1487	510	0.5	730
GET	Comment_Range:225-250	853	0(0.00%)	595	456	1389	570	0.2	830
GET	Comment_Range:350-375	850	0(0.00%)	740	560	1853	700	0.1	1000
GET	Comment_Range:375-400	828	0(0.00%)	772	573	1926	720	0.3	1100
GET	Comment_Range:1575-1600	879	0(0.00%)	2194	1740	4731	2100	0.2	3100

Versions Compared

Old Version 42

New Version Current

Key

Page size vs. Response time:
Anchor
psvrt
psvrt

Page Comparison

Versions Compared

Old Version 42

New Version Current

Key

Page size vs. Response time: Anchorpsvrtpsvrt

Pagination: GETting a single threads vs comment_count Anchorpaginationpagination

Page size vs. Response time:
Anchor
psvrt
psvrt

Pagination: GETting a single threads vs comment_count
Anchor
pagination
pagination