
Initial Tests:

Note: Locust user wait times were not set correctly here. Ignore the user count and wait times, and assume requests were sent as soon as each user was ready.

    In the initial testing phase of the discussion API, the wait time between each top-level task was 500ms. If a task consisted of multiple calls, such as PATCH comment, the wait time between each of those calls was 1000ms. These waits are very short and represent accelerated usage of the discussion API. The first test was run against BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) with 10 Locust users. A high percentage of the responses were failures, and response times were in the 1000-3000ms range. Under the assumption that a course with more posts performs worse, it was expected that this course, the 5th biggest in our system, would perform poorly. To get a better baseline, we tried a smaller course.
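For reference, below is a minimal Locust sketch of the wait configuration described above. The endpoint paths, payloads, and comment-selection logic are illustrative assumptions, not the actual test plan.

    import time
    from locust import HttpLocust, TaskSet, task

    class DiscussionTasks(TaskSet):
        @task
        def get_thread_list(self):
            # Single-call task: only the 500ms top-level wait applies.
            self.client.get("/api/discussion/v1/threads/?course_id=x/y/z",
                            name="GET_thread_list")

        @task
        def patch_comment(self):
            # Multi-call task: sleep 1000ms between the calls within the task.
            resp = self.client.get("/api/discussion/v1/comments/?thread_id=abc",
                                   name="GET_comment_list")
            time.sleep(1.0)
            comment_id = resp.json()["results"][0]["id"]
            self.client.patch("/api/discussion/v1/comments/%s/" % comment_id,
                              data={"raw_body": "edited body"},
                              name="PATCH_comment")

    class DiscussionUser(HttpLocust):
        task_set = DiscussionTasks
        min_wait = 500  # ms between top-level tasks
        max_wait = 500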

 Initial Flowtest against very large course

This was tested with the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) on courses-loadtest.edx.org.

 

Name | # reqs | # fails | Avg | Min | Max | Median | 95%
GET_comment_list | 213 | 190 (47.15%) | 1635 | 189 | 5574 | 960 | 4500
GET_thread | 100935 | 3035 (2.92%) | 1722 | 9 | 10003 | 1400 | 4400
GET_thread_list | 7788 | 51851 (86.94%) | 2186 | 163 | 9313 | 2000 | 4900
PATCH_comment | 207 | 7 (3.27%) | 1622 | 227 | 9537 | 1200 | 4000
PATCH_thread | 191 | 87 (31.29%) | 1210 | 147 | 6729 | 600 | 400
POST_comment_comment | 157 | 2 (1.26%) | 1973 | 363 | 8839 | 1500 | 5300
POST_comment_response | 444 | 216 (32.73%) | 2416 | 316 | 10080 | 1600 | 6800
POST_thread | 4449 | 26 (0.58%) | 1200 | 197 | 6663 | 900 | 3200

 

The errors that came up were almost all 500 server errors (plus a handful of 404s on PATCH_thread). Looking at New Relic, the 500s were all timeout errors.

# occurrences | Error
3035 | GET_thread: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
7 | PATCH_comment: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
51851 | GET_thread_list: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
83 | PATCH_thread: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
216 | POST_comment_response: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
2 | POST_comment_comment: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
4 | PATCH_thread: HTTPError('404 Client Error: NOT FOUND',)
26 | POST_thread: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)
190 | GET_comment_list: HTTPError('500 Server Error: INTERNAL SERVER ERROR',)

    The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time), which is about 2000 more posts than the median course. This test ran with 10 Locust users (same as the previous test) for 1 hour and yielded better results. During the hour, the response times gradually got worse and less consistent as more spikes showed up on the charts. To see if the issue was too much load, another test was run against the "same" course (now ~5000 threads, ~10,000 comments) with 5 Locust users. The response times in this test were actually much worse than in the previous one. After analyzing both sets of data, it was realized that the creation of posts during the tests had affected the performance of the course: the number of posts had increased about 2-3x in an hour. In a normal week, about 50,000 posts are made across ALL courses; in this 1 hour, a fifth of that was added to a SINGLE COURSE. This significant increase in posts led to the slower response times in the second test.

Another test was to be run against a small course, without any thread creation.

 Tests for smaller course which gradually grew

The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time). Unlike the load test against the very large course BerkeleyX/ColWri2.2x/1T2014, the server was able to handle the requests without returning 500s, and these numbers look much more reasonable. The test ran for 1 hour.

 

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 1718 | 0 (0.00%) | 459 | 163 | 1783 | 410 | 0.1 | 950
GET_thread | 106253 | 0 (0.00%) | 195 | 153 | 2302 | 180 | 10.3 | 270
GET_thread_list | 61009 | 0 (0.00%) | 595 | 155 | 4064 | 460 | 6.8 | 1700
PATCH_comment | 1719 | 0 (0.00%) | 593 | 208 | 2122 | 520 | 0.1 | 1100
PATCH_thread | 1883 | 19 (1.00%) | 403 | 166 | 2283 | 360 | 0.3 | 810
POST_comment_comment | 1373 | 0 (0.00%) | 966 | 591 | 4261 | 900 | 0.1 | 1600
POST_comment_response | 4135 | 0 (0.00%) | 928 | 546 | 4194 | 860 | 0.5 | 1600
POST_thread | 4688 | 0 (0.00%) | 492 | 424 | 4102 | 460 | 0.4 | 610

As the load test continued, the 95th and 50th percentiles gradually became slower. The throughput was also very high compared to production (~100 rpm for the comparable call django_comment_client.forum.views:single_thread), so the test will be rerun with less load.

After analyzing this data, it was realized that the posts created during the test had altered the response times: the course had grown about 2-3x. In a week, about 20,000 threads, 20,000 responses, and 10,000 comments are created across ALL COURSES; in this 1 hour test, we created a fifth of that volume in a single course. Note that the following tests were run against the same course, but with the new, increased post count.

 

The test below was run against the same course but at half the throughput. As shown on the graph, the response times were slower. This test ran for 30 minutes. Again, we see that the response times increase slowly while the RPM decreases accordingly.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 306 | 0 (0.00%) | 517 | 179 | 1227 | 490 | 0.1 | 910
GET_thread | 19602 | 0 (0.00%) | 198 | 150 | 2403 | 180 | 4.2 | 320
GET_thread_list | 11150 | 1 (0.01%) | 915 | 156 | 4026 | 770 | 3.8 | 1900
PATCH_comment | 306 | 0 (0.00%) | 768 | 423 | 2007 | 690 | 0 | 1200
PATCH_thread | 370 | 2 (0.54%) | 508 | 253 | 2043 | 480 | 0 | 890
POST_comment_comment | 268 | 0 (0.00%) | 1139 | 775 | 2556 | 1100 | 0.1 | 1500
POST_comment_response | 735 | 0 (0.00%) | 1108 | 715 | 2573 | 1000 | 0.1 | 1500
POST_thread | 833 | 0 (0.00%) | 478 | 416 | 1925 | 450 | 0 | 600

    In this test, the course HarvardX/1368.2x/2T2015 (92 threads, 1697 comments) was used with 10 Locust users for 30 minutes, without any post creation. As expected, we observed lower and very consistent response times.

 Initial test for a course smaller than the last while removing post creation
Name | # reqs | # fails | Avg | Min | Max | Median | 95%
GET_comment_list | 476 | 0 (0.00%) | 271 | 136 | 555 | 280 | 440
GET_thread | 28637 | 0 (0.00%) | 176 | 127 | 2177 | 150 | 280
GET_thread_list | 15319 | 0 (0.00%) | 229 | 124 | 1365 | 170 | 510
PATCH_comment | 476 | 0 (0.00%) | 348 | 175 | 1013 | 340 | 560
PATCH_thread | 527 | 6 (1.13%) | 261 | 142 | 1347 | 240 | 380

    In this test, we wanted a course similar in size to the initial BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) and bigger than HarvardX/1368.2x/2T2015 (92 threads, 1697 comments). The course used was SMES/ASLCx/1T2015 (1700 threads, 3047 comments). As expected, its response times were comparable to BerkeleyX/CS.CS169.1x/3T2013 and slower than HarvardX/1368.2x/2T2015. An additional test was run against a tiny course, TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments); as expected, its response times were the best by far.

 Test comparing course sizes

On the left is the initial portion of the BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time) loadtest. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) loadtest.

On the left is the HarvardX/1368.2x/2T2015 (92 threads, 1697 comments) load test. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) load test.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 324 | 0 (0.00%) | 221 | 139 | 1036 | 190 | 0.1 | 460
GET_thread | 20485 | 0 (0.00%) | 168 | 124 | 2079 | 170 | 12.7 | 220
GET_thread_list | 10919 | 0 (0.00%) | 461 | 132 | 1677 | 450 | 5.6 | 840
PATCH_comment | 325 | 0 (0.00%) | 348 | 176 | 1354 | 300 | 0.1 | 720
PATCH_thread | 356 | 3 (0.84%) | 272 | 134 | 895 | 230 | 0.2 | 500

On the left is the TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments) load test. On the right is the SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) load test.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 474 | 0 (0.00%) | 188 | 139 | 433 | 180 | 0.3 | 250
GET_thread | 29492 | 0 (0.00%) | 169 | 129 | 1408 | 160 | 10.3 | 220
GET_thread_list | 15707 | 0 (0.00%) | 198 | 127 | 1322 | 160 | 4.6 | 340
PATCH_comment | 474 | 0 (0.00%) | 314 | 177 | 1042 | 300 | 0.3 | 510
PATCH_thread | 590 | 7 (1.17%) | 246 | 148 | 1315 | 230 | 0.2 | 380

Tests while Mixing Courses:

    In this series of tests, we test against courses of different sizes mixed in different ratios. Production shows spikes in response times, and one possibility is that the few very large courses are the cause. This initial test ran for 3 hours, with a small course vs. a large course at a 20:1 ratio. The large course was BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments), while the small course was SMES/ASLCx/1T2015 (1700 threads, 3047 comments). There are visible spikes in the response times, and the large course returned many unexpected 500 responses that needed to be addressed. Upon further investigation, these turned out to be a memory issue caused by missing indexes.
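For reference, a sketch of how a 20:1 course mix can be implemented with weighted random selection; the weights come from the ratio above, and the helper name is illustrative.

    import random

    # 20:1 ratio of small-course to large-course traffic, as described above.
    WEIGHTED_COURSES = [
        ("SMES/ASLCx/1T2015", 20),           # small course
        ("BerkeleyX/ColWri2.2x/1T2014", 1),  # large course
    ]

    def pick_course():
        # Return a course id with probability proportional to its weight.
        total = sum(weight for _, weight in WEIGHTED_COURSES)
        r = random.uniform(0, total)
        for course_id, weight in WEIGHTED_COURSES:
            r -= weight
            if r <= 0:
                return course_id
        return WEIGHTED_COURSES[-1][0]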

 Initial mixed course test

Course Size | Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
Large | GET_comment_list | 106 | 0 (0.00%) | 864 | 263 | 4680 | 680 | 0 | 2700
 | GET_thread | 6053 | 0 (0.00%) | 213 | 156 | 1271 | 200 | 0.3 | 280
 | GET_thread_list | 2928 | 240 (7.58%) | 1012 | 175 | 5465 | 1000 | 0.2 | 1800
 | PATCH_comment | 106 | 0 (0.00%) | 1017 | 305 | 4738 | 790 | 0 | 2700
 | PATCH_thread | 81 | 3 (3.57%) | 681 | 167 | 4083 | 420 | 0 | 2000
Small | GET_comment_list | 3054 | 0 (0.00%) | 247 | 143 | 1484 | 210 | 0.2 | 550
 | GET_thread | 183499 | 1 (0.00%) | 187 | 0 | 6247 | 180 | 28.1 | 250
 | GET_thread_list | 97488 | 21 (0.02%) | 501 | 141 | 6685 | 420 | 12.7 | 1100
 | PATCH_comment | 3054 | 1 (0.03%) | 368 | 173 | 4345 | 330 | 0.1 | 750
 | PATCH_thread | 3305 | 43 (1.28%) | 287 | 140 | 2584 | 250 | 0.2 | 570

The chart below is the SMES/ASLCx/1T2015 (1700 threads, 3047 comments) load test by itself, for reference. These numbers are not directly comparable, since the test below ran with 10 users vs. 20 users in the test above.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 324 | 0 (0.00%) | 221 | 139 | 1036 | 190 | 0.1 | 460
GET_thread | 20485 | 0 (0.00%) | 168 | 124 | 2079 | 170 | 12.7 | 220
GET_thread_list | 10919 | 0 (0.00%) | 461 | 132 | 1677 | 450 | 5.6 | 840
PATCH_comment | 325 | 0 (0.00%) | 348 | 176 | 1354 | 300 | 0.1 | 720
PATCH_thread | 356 | 3 (0.84%) | 272 | 134 | 895 | 230 | 0.2 | 500

Addressing the 500s:

    Although only present for the larger courses, a significant number of 500s were being returned. A fix has been proposed to add an index, but for now, in order to continue the load tests, the offending parameters "last_activity_at" and "asc" will be removed from our random selection. This also means that our thread retrieval for PATCHing (and DELETEing, which has been disabled) will be less random, but still sufficiently random. By removing these parameters, we hope to get rid of the 500s and see a performance improvement for both large and small courses.
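A sketch of what the change to the random parameter selection looks like, assuming the thread-list endpoint's order_by/order_direction query parameters; the constant names are illustrative.

    import random

    # Before: "last_activity_at" and "asc" could be chosen, triggering
    # unindexed sorts that timed out on the large courses.
    # After: restrict the random choice to indexed parameters only.
    ORDER_BY_CHOICES = ["comment_count", "vote_count"]  # "last_activity_at" removed
    ORDER_DIRECTION_CHOICES = ["desc"]                  # "asc" removed

    def random_thread_list_params(course_id):
        return {
            "course_id": course_id,
            "order_by": random.choice(ORDER_BY_CHOICES),
            "order_direction": random.choice(ORDER_DIRECTION_CHOICES),
        }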

Test reruns:

    These tests have the non-indexed fields removed, to verify that those were causing the 500s. All tests were run with 10 Locust users for 30 minutes. Response times improved across the board, there were no spikes, and the smaller the course, the better the response times.

 SMES/ASLCx/1T2015 (1700 Threads, 3047 comments)

 

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 262 | 0 (0.00%) | 188 | 148 | 644 | 180 | 0.2 | 260
GET_thread | 16778 | 0 (0.00%) | 185 | 137 | 1201 | 180 | 17.6 | 240
GET_thread_list | 9125 | 0 (0.00%) | 417 | 140 | 1426 | 410 | 10 | 750
PATCH_comment | 260 | 1 (0.38%) | 348 | 179 | 1681 | 300 | 0 | 660
PATCH_thread | 326 | 3 (0.91%) | 248 | 146 | 713 | 220 | 0.1 | 440
  BerkeleyX/ColWri2.2x/1T2014 (~25,000 threads, 30,000 comments)

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 229 | 0 (0.00%) | 346 | 156 | 2430 | 290 | 0.2 | n/a
GET_thread | 14933 | 0 (0.00%) | 221 | 151 | 1847 | 210 | 14.8 | n/a
GET_thread_list | 8005 | 0 (0.00%) | 493 | 168 | 1906 | 490 | 6.4 | n/a
PATCH_comment | 226 | 0 (0.00%) | 477 | 217 | 4568 | 400 | 0.2 | n/a
PATCH_thread | 259 | 2 (0.77%) | 393 | 188 | 2312 | 330 | 0.2 | n/a
 TsinghuaX/00690242_2x/3T2014 (29 threads, 155 comments)

 

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 514 | 0 (0.00%) | 176 | 133 | 514 | 170 | 0.2 | 240
GET_thread | 29734 | 0 (0.00%) | 164 | 126 | 2162 | 160 | 19.2 | 220
GET_thread_list | 15738 | 0 (0.00%) | 195 | 123 | 1427 | 150 | 10.1 | 340
PATCH_comment | 512 | 0 (0.00%) | 310 | 167 | 2377 | 290 | 0.4 | 520
PATCH_thread | 506 | 8 (1.56%) | 239 | 136 | 3293 | 220 | 0.3 | 380

After adding indices:

 Rerun of mixed courses test after first index was added

This is after adding the "last_activity_at" index but before adding the "asc" index. The major differences are the lack of spikes, consistent RPM, better response times (especially for the large course), and no 500s.
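For illustration, the index addition looks something like the pymongo sketch below. The forums service (cs_comments_service) actually defines its indexes in Ruby/Mongoid, and the exact key set here is an assumption based on the parameters named above.

    from pymongo import ASCENDING, DESCENDING, MongoClient

    # Hypothetical compound index covering the thread-list query that timed out.
    db = MongoClient("localhost", 27017)["cs_comments_service"]
    db.contents.create_index(
        [("course_id", ASCENDING),
         ("_type", ASCENDING),
         ("last_activity_at", DESCENDING)],
        background=True,  # build in the background to avoid blocking the service
    )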

This test was run for 3 minutes with a small course vs. a large course at a 20:1 ratio. The large course was BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments), while the small course was SMES/ASLCx/1T2015 (1700 threads, 3047 comments).

Course Size | Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
Large | GET_comment_list | 46 | 0 (0.00%) | 443 | 207 | 1155 | 370 | 0 | 840
 | GET_thread | 2404 | 0 (0.00%) | 225 | 168 | 1275 | 210 | 0.7 | 290
 | GET_thread_list | 1296 | 0 (0.00%) | 489 | 181 | 1104 | 480 | 0.6 | 830
 | PATCH_comment | 45 | 0 (0.00%) | 490 | 221 | 2060 | 450 | 0 | 780
 | PATCH_thread | 53 | 0 (0.00%) | 489 | 174 | 1912 | 410 | 0.1 | 1100
Small | GET_comment_list | 877 | 0 (0.00%) | 211 | 154 | 1072 | 200 | 0.3 | 300
 | GET_thread | 55710 | 2 (0.00%) | 198 | 141 | 1562 | 190 | 28.2 | 260
 | GET_thread_list | 29694 | 0 (0.00%) | 449 | 147 | 3919 | 430 | 13.2 | 810
 | PATCH_comment | 875 | 4 (0.46%) | 340 | 184 | 1584 | 320 | 0.2 | 560
 | PATCH_thread | 1038 | 9 (0.86%) | 259 | 150 | 1055 | 240 | 0.5 | 420

Page size vs. Response time: 

    In a new test, it was found that SMES/ASLCx/1T2015 (1700 threads, 3047 comments) was slowing down over time. To identify what was happening, a new course was seeded, which we will call DAPI (1000 threads, 500 comments). In our analysis of the production forums, the median body size was 250 characters; our PATCH operations use bodies of 4, 250, 1000, 5000, or 10000 characters. This test was run over 10 hours. The gradual slowdown in response times needed to be addressed.
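A sketch of the PATCH body generation described above; the size list comes from this analysis, while the helper name is illustrative.

    import random

    # Fixed body sizes for PATCH operations; 250 characters is the observed
    # median body size in the production forums.
    PATCH_BODY_SIZES = [4, 250, 1000, 5000, 10000]

    def random_patch_body():
        # Build a dummy body of one of the fixed sizes.
        return "x" * random.choice(PATCH_BODY_SIZES)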

 page size vs response time

DAPI (1000 threads, 500 comments) slowed down over time.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET_comment_list | 15793 | 0 (0.00%) | 220 | 162 | 3116 | 200 | 0.2 | 340
GET_thread | 955097 | 1 (0.00%) | 214 | -8 | 4509 | 200 | 23.8 | 290
GET_thread_list | 510921 | 0 (0.00%) | 445 | 155 | 5954 | 420 | 11.6 | 810
PATCH_comment | 15792 | 0 (0.00%) | 337 | 187 | 3851 | 320 | 0.2 | 490
PATCH_thread | 17333 | 164 (0.94%) | 273 | 161 | 3019 | 250 | 0.5 | 400

When running a similar test with more verbose output, we clearly saw that page size had an effect on response time. Although not shown here, larger content size was somewhat correlated with slower response times, but page_size had the stronger correlation.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET page=1&page_size=100 | 238 | 0 (0.00%) | 791 | 655 | 1990 | 750 | 0.4 | 960
GET page=2&page_size=100 | 282 | 0 (0.00%) | 762 | 659 | 1526 | 730 | 0.4 | 910
GET page=1&page_size=75 | 240 | 0 (0.00%) | 651 | 544 | 1038 | 620 | 0.2 | 830
GET page=2&page_size=75 | 255 | 0 (0.00%) | 677 | 573 | 1628 | 650 | 0 | 820
GET page=7&page_size=100 | 260 | 0 (0.00%) | 568 | 461 | 2637 | 520 | 0.4 | 710
GET page=6&page_size=100 | 232 | 0 (0.00%) | 559 | 466 | 843 | 520 | 0.1 | 710
GET page=9&page_size=100 | 222 | 0 (0.00%) | 568 | 460 | 1907 | 530 | 0.5 | 700
GET page=3&page_size=100 | 262 | 0 (0.00%) | 556 | 462 | 1807 | 520 | 0.2 | 700
GET page=5&page_size=100 | 259 | 0 (0.00%) | 556 | 463 | 1375 | 510 | 0.3 | 700
GET page=4&page_size=100 | 276 | 0 (0.00%) | 555 | 456 | 840 | 520 | 0.2 | 690
GET page=8&page_size=100 | 236 | 0 (0.00%) | 548 | 468 | 903 | 520 | 0.1 | 680
GET page=3&page_size=75 | 251 | 0 (0.00%) | 543 | 445 | 1348 | 510 | 0.3 | 670
GET page=2&page_size=50 | 278 | 0 (0.00%) | 507 | 415 | 1141 | 480 | 0.5 | 670
GET page=1&page_size=50 | 251 | 0 (0.00%) | 512 | 425 | 1043 | 490 | 0.1 | 660
GET page=8&page_size=75 | 237 | 0 (0.00%) | 481 | 397 | 1644 | 440 | 0.1 | 650
GET page=3&page_size=50 | 247 | 0 (0.00%) | 523 | 451 | 903 | 500 | 0.4 | 640
GET page=6&page_size=75 | 238 | 0 (0.00%) | 492 | 398 | 1634 | 450 | 0.3 | 640
GET page=4&page_size=75 | 218 | 0 (0.00%) | 480 | 395 | 886 | 440 | 0.2 | 630
GET page=7&page_size=75 | 227 | 0 (0.00%) | 474 | 391 | 795 | 440 | 0.5 | 620
GET page=5&page_size=75 | 239 | 0 (0.00%) | 466 | 388 | 1394 | 440 | 0.2 | 610
GET page=4&page_size=50 | 237 | 0 (0.00%) | 453 | 371 | 1010 | 420 | 0.2 | 610
GET page=9&page_size=75 | 216 | 0 (0.00%) | 467 | 397 | 996 | 430 | 0.4 | 600
GET page=6&page_size=50 | 252 | 0 (0.00%) | 391 | 321 | 1065 | 360 | 0.4 | 540
GET page=5&page_size=25 | 249 | 0 (0.00%) | 392 | 323 | 1367 | 370 | 0.4 | 530
GET page=7&page_size=50 | 243 | 0 (0.00%) | 392 | 319 | 997 | 360 | 0.4 | 530
GET page=9&page_size=50 | 221 | 0 (0.00%) | 391 | 330 | 658 | 360 | 0 | 520
GET page=8&page_size=50 | 241 | 0 (0.00%) | 388 | 317 | 1300 | 360 | 0.4 | 520
GET page=5&page_size=50 | 204 | 0 (0.00%) | 386 | 322 | 723 | 360 | 0.1 | 520
GET page=2&page_size=25 | 259 | 0 (0.00%) | 363 | 299 | 1124 | 340 | 0.4 | 520
GET page=1&page_size=25 | 242 | 0 (0.00%) | 373 | 296 | 1142 | 350 | 0.4 | 510
GET page=4&page_size=25 | 238 | 0 (0.00%) | 361 | 292 | 1171 | 340 | 0.3 | 500
GET page=6&page_size=25 | 256 | 0 (0.00%) | 365 | 312 | 687 | 350 | 0.3 | 480
GET page=3&page_size=25 | 232 | 0 (0.00%) | 356 | 302 | 752 | 340 | 0.9 | 480
GET page=7&page_size=25 | 261 | 0 (0.00%) | 362 | 310 | 590 | 350 | 0.1 | 470
GET page=9&page_size=25 | 274 | 0 (0.00%) | 304 | 254 | 797 | 280 | 0.3 | 430
GET page=8&page_size=25 | 232 | 0 (0.00%) | 300 | 253 | 644 | 280 | 0 | 420
GET page=8&page_size=1 | 235 | 0 (0.00%) | 228 | 191 | 1025 | 210 | 0 | 320
GET page=3&page_size=1 | 255 | 0 (0.00%) | 235 | 185 | 1106 | 220 | 0.2 | 310
GET page=9&page_size=1 | 248 | 0 (0.00%) | 234 | 184 | 1147 | 210 | 0.3 | 310
GET page=5&page_size=1 | 247 | 0 (0.00%) | 233 | 186 | 851 | 220 | 0.5 | 300
GET page=7&page_size=1 | 231 | 0 (0.00%) | 234 | 188 | 2104 | 220 | 0.3 | 290
GET page=2&page_size=1 | 251 | 0 (0.00%) | 226 | 187 | 972 | 220 | 0.4 | 290
GET page=1&page_size=1 | 255 | 0 (0.00%) | 223 | 186 | 461 | 210 | 0.3 | 290
GET page=4&page_size=1 | 260 | 0 (0.00%) | 222 | 188 | 400 | 210 | 0.3 | 290
GET page=6&page_size=1 | 266 | 0 (0.00%) | 221 | 187 | 373 | 220 | 0.1 | 280
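The per-page rows above can be produced by naming each request after its query parameters, so that Locust aggregates each (page, page_size) combination separately. A sketch, with the endpoint path as an assumption:

    import random

    def get_thread_list_by_page(client, course_id):
        # Name the request after its parameters so Locust reports one row
        # per (page, page_size) combination, as in the table above.
        page = random.randint(1, 9)
        page_size = random.choice([1, 25, 50, 75, 100])
        client.get("/api/discussion/v1/threads/",
                   params={"course_id": course_id,
                           "page": page,
                           "page_size": page_size},
                   name="page=%d&page_size=%d" % (page, page_size))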

GETting a single thread vs. comment_count:

    This test takes the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) and retrieves single threads, grouping each request by the thread's comment_count. The test was run for 1 hour. As expected, because of pagination issues, threads with more comments take longer to return.

 GET thread vs comment_count

Note that the slowest transactions all come from the same thread. Response time is also linearly correlated with comment_count; the approximate formula for this context is response_time ≈ comment_count * 1.4 + 360 (in ms). There is a case where the comment_count was ~4000; by this formula, that thread would take approximately 5960ms to return.

Name | # reqs | # fails | Avg | Min | Max | Median | req/s | 95%
GET Comment_Range:0-25 | 21567 | 0 (0.00%) | 330 | 263 | 2050 | 320 | 5.7 | 420
GET Comment_Range:25-50 | 3648 | 0 (0.00%) | 358 | 280 | 932 | 340 | 1.1 | 460
GET Comment_Range:50-75 | 802 | 0 (0.00%) | 391 | 308 | 919 | 370 | 0.7 | 510
GET Comment_Range:75-100 | 446 | 0 (0.00%) | 424 | 325 | 859 | 400 | 0.1 | 570
GET Comment_Range:100-125 | 1114 | 0 (0.00%) | 444 | 349 | 1613 | 420 | 0.2 | 590
GET Comment_Range:125-150 | 56 | 0 (0.00%) | 486 | 391 | 762 | 470 | 0 | 700
GET Comment_Range:150-175 | 1745 | 0 (0.00%) | 518 | 403 | 1029 | 500 | 0.3 | 720
GET Comment_Range:175-200 | 1610 | 0 (0.00%) | 532 | 400 | 1487 | 510 | 0.5 | 730
GET Comment_Range:225-250 | 853 | 0 (0.00%) | 595 | 456 | 1389 | 570 | 0.2 | 830
GET Comment_Range:350-375 | 850 | 0 (0.00%) | 740 | 560 | 1853 | 700 | 0.1 | 1000
GET Comment_Range:375-400 | 828 | 0 (0.00%) | 772 | 573 | 1926 | 720 | 0.3 | 1100
GET Comment_Range:1575-1600 | 879 | 0 (0.00%) | 2194 | 1740 | 4731 | 2100 | 0.2 | 3100
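The Comment_Range buckets above can be produced by re-reporting each GET under a name derived from the thread's comment_count. A sketch against the pre-1.0 Locust events API, with the endpoint path as an assumption:

    import time
    from locust import events

    def get_thread_bucketed(client, thread_id):
        start = time.time()
        resp = client.get("/api/discussion/v1/threads/%s/" % thread_id,
                          name="GET_thread")
        elapsed_ms = int((time.time() - start) * 1000)
        # Bucket the thread by comment_count in ranges of 25 and report the
        # timing under the bucket name so Locust groups it as shown above.
        count = resp.json().get("comment_count", 0)
        low = (count // 25) * 25
        events.request_success.fire(
            request_type="GET",
            name="Comment_Range:%d-%d" % (low, low + 25),
            response_time=elapsed_ms,
            response_length=len(resp.content),
        )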

 

 
