...
In the initial testing phase of the discussion api, the wait time between each top-level task was 500ms. If a task has multiple calls, such as PATCH comment, the wait time between each of those calls were 1000ms. These waits are very short and represent an accelerated usage of the discussions api. The first test that was run was against BerkeleyXagainst BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) with 10 locust users. A high percentage of the responses were failures and the response times were in the 1000-3000ms range. Under the assumption that a course with more posts would have worse performance, it would be expected that this course which is the 5th biggest course in our system, would not perform. To get a better baseline, we tried a smaller course.
Expand |
---|
title | Initial Flowtest against very large course |
---|
|
This was tested against the course BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) against courses-loadtest.edx.org. Name | reqs | # | fails | Avg | Min | Max | Median | 95% |
---|
GET | GET_comment_list | 213 | 190(47.15%) | 1635 | 189 | 5574 | 960 | 4500 | GET | GET_thread | 100935 | 3035(2.92%) | 1722 | 9 | 10003 | 1400 | 4400 | GET | GET_thread_list | 7788 | 51851(86.94%) | 2186 | 163 | 9313 | 2000 | 4900 | PATCH | PATCH_comment | 207 | 7(3.27%) | 1622 | 227 | 9537 | 1200 | 4000 | PATCH | PATCH_thread | 191 | 87(31.29%) | 1210 | 147 | 6729 | 600 | 400 | POST | POST_comment_comment | 157 | 2(1.26%) | 1973 | 363 | 8839 | 1500 | 5300 | POST | POST_comment_response | 444 | 216(32.73%) | 2416 | 316 | 10080 | 1600 | 6800 | POST | POST_thread | 4449 | 26(0.58%) | 1200 | 197 | 6663 | 900 | 3200 |
The errors that came up were all 500 server errors. When looking at NewRelic, these errors were all timeout errors. # | occurences | Error |
---|
3035 | GET_thread: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 7 | PATCH_comment: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 51851 | GET_thread_list: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 83 | PATCH_thread: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 216 | POST_comment_response: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 2 | POST_comment_comment: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 4 | PATCH_thread: | HTTPError('404 Client Error: NOT FOUND',) | 26 | POST_thread: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) | 190 | GET_comment_list: | HTTPError('500 Server Error: INTERNAL SERVER ERROR',) |
|
...
Another test was to be run against a small course, without any thread creation. Anchor |
---|
| ramping_course |
---|
| ramping_course |
---|
|
Expand |
---|
title | Tests for smaller course which gradually grew |
---|
|
The next course used was BerkeleyX/CS.CS169.1x/3T2013 (783 threads, 3385 comments at the time). Unlike the loadtest with the very large course "BerkeleyX/ColWri2.2x/1T2014", the server was able to handle the requests without returning 500s. These numbers seem more reasonable. The test ran for 1 hour. | | | | | | | | | 95% |
---|
GET | GET_comment_list | 1718 | 0(0.00%) | 459 | 163 | 1783 | 410 | 0.1 | 950 | GET | GET_thread | 106253 | 0(0.00%) | 195 | 153 | 2302 | 180 | 10.3 | 270 | GET | GET_thread_list | 61009 | 0(0.00%) | 595 | 155 | 4064 | 460 | 6.8 | 1700 | PATCH | PATCH_comment | 1719 | 0(0.00%) | 593 | 208 | 2122 | 520 | 0.1 | 1100 | PATCH | PATCH_thread | 1883 | 19(1.00%) | 403 | 166 | 2283 | 360 | 0.3 | 810 | POST | POST_comment_comment | 1373 | 0(0.00%) | 966 | 591 | 4261 | 900 | 0.1 | 1600 | POST | POST_comment_response | 4135 | 0(0.00%) | 928 | 546 | 4194 | 860 | 0.5 | 1600 | POST | POST_thread | 4688 | 0(0.00%) | 492 | 424 | 4102 | 460 | 0.4 | 610 |
As the loadtest continued, the 95th and 50th percentile gradually became slower and slower. The throughput was very high compared to that of production (~100rpm for a similar call /django_comment_client.forum.views:single_thread) so the test will be rerun with less load. After analyzing this data, it was realized that the number of posts in this course had actually altered the response times of the http requests. The course had increased about 2-3 times. In a week, about 20,000 threads, 20,000 responses, and 10,000 comments are created in ALL COURSES. In this 1 hour test, we have created a fifth of that value against a single course. Take note that the following tests were against the same course, but with the new increased post count.
The test below was against the same course but at half the throughput. As shown on the graph, the response times were slower. This test ran for 30 minutes. Again, we see that the response times increases slowly as the RPMs decreases appropriately. Name | reqs | # | fails | Avg | Min | Max | Median | req/s | 95% |
---|
GET | GET_comment_list | 306 | 0(0.00%) | 517 | 179 | 1227 | 490 | 0.1 | 910 | GET | GET_thread | 19602 | 0(0.00%) | 198 | 150 | 2403 | 180 | 4.2 | 320 | GET | GET_thread_list | 11150 | 1(0.01%) | 915 | 156 | 4026 | 770 | 3.8 | 1900 | PATCH | PATCH_comment | 306 | 0(0.00%) | 768 | 423 | 2007 | 690 | 0 | 1200 | PATCH | PATCH_thread | 370 | 2(0.54%) | 508 | 253 | 2043 | 480 | 0 | 890 | POST | POST_comment_comment | 268 | 0(0.00%) | 1139 | 775 | 2556 | 1100 | 0.1 | 1500 | POST | POST_comment_response | 735 | 0(0.00%) | 1108 | 715 | 2573 | 1000 | 0.1 | 1500 | POST | POST_thread | 833 | 0(0.00%) | 478 | 416 | 1925 | 450 | 0 | 600 |
|
...
In this series of tests, we will be testing against different sized courses with different ratios. There are spikes in response times in production and one possibility may be that the few very large courses may be the reason. In this initial test, we ran it for 3 hours, with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments). We can see that there are spikes in the response times. There have been many 500 responses from the large course which needs to be addressed. There were many unexpected 500 errors during this test. Upon further investigation, it was found that there was a memory issue due to missing indexes.
...
Expand |
---|
title | Rerun of mixed courses test after first index was added |
---|
|
This is before adding the "asc" index and after adding the "last_activity_at" index. The major difference is the lack of spikes, consistent RPM, better response times especially for the large course and no 500s. This test was run for 3 minutes with a small course vs large course on a 20:1 ratio. The large course used is BerkeleyX/ColWri2.2x/1T2014 (~25~38,000 threads, 30~40,000 comments) , while the small course is SMES/ASLCx/1T2015 (1700 Threads, 3047 comments).
Name | reqs | # | fails | Avg | Min | Max | Median | req/s | 95% |
---|
Large | GET_comment_list | 46 | 0(0.00%) | 443 | 207 | 1155 | 370 | 0 | 840 | | GET_thread | 2404 | 0(0.00%) | 225 | 168 | 1275 | 210 | 0.7 | 290 | | GET_thread_list | 1296 | 0(0.00%) | 489 | 181 | 1104 | 480 | 0.6 | 830 | | PATCH_comment | 45 | 0(0.00%) | 490 | 221 | 2060 | 450 | 0 | 780 | | PATCH_thread | 53 | 0(0.00%) | 489 | 174 | 1912 | 410 | 0.1 | 1100 | Small | GET_comment_list | 877 | 0(0.00%) | 211 | 154 | 1072 | 200 | 0.3 | 300 | | GET_thread | 55710 | 2(0.00%) | 198 | 141 | 1562 | 190 | 28.2 | 260 | | GET_thread_list | 29694 | 0(0.00%) | 449 | 147 | 3919 | 430 | 13.2 | 810 | | PATCH_comment | 875 | 4(0.46%) | 340 | 184 | 1584 | 320 | 0.2 | 560 | | PATCH_thread | 1038 | 9(0.86%) | 259 | 150 | 1055 | 240 | 0.5 | 420 |
|
In a new test, it was found that SMES/ASLCx/1T2015 (1700 Threads, 3047 comments) was slowing down over time. To identify what was happening, a new course was seeded which we will call DAPI (1000 Threads, 500 comments). In the analysis of the forums, we saw that the median body size was 250 characters. Our PATCH operations can be 4, 250, 1000, 5000, or 10000 characters. This test was run over 10 hours. The decrease in response time needed to be addressed.
Expand |
---|
title | page size vs response time |
---|
|
Over time DAPI (1000 threads, 500 comments) slowed down over time.
Name | reqs | # | fails | Avg | Min | Max | Median | req/s | 95% |
---|
GET | GET_comment_list | 15793 | 0(0.00%) | 220 | 162 | 3116 | 200 | 0.2 | 340 | GET | GET_thread | 955097 | 1(0.00%) | 214 | -8 | 4509 | 200 | 23.8 | 290 | GET | GET_thread_list | 510921 | 0(0.00%) | 445 | 15 | 5595 | 420 | 11.6 | 810 | PATCH | PATCH_comment | 15792 | 0(0.00%) | 337 | 187 | 3851 | 320 | 0.2 | 490 | PATCH | PATCH_thread | 17333 | 164(0.94%) | 273 | 161 | 3019 | 250 | 0.5 | 400 |
When running a similar test with a more verbose output, we clearly saw that the page size had an effect on the response time. Although not shown here, larger content size was somewhat correlated to slower response times, but page_size had the better correlation. Name | reqs | # | fails | Avg | Min | Max | Median | req/s | 95% |
---|
GET | page=1&page_size=100 | 238 | 0(0.00%) | 791 | 655 | 1990 | 750 | 0.4 | 960 | GET | page=2&page_size=100 | 282 | 0(0.00%) | 762 | 659 | 1526 | 730 | 0.4 | 910 | GET | page=1&page_size=75 | 240 | 0(0.00%) | 651 | 544 | 1038 | 620 | 0.2 | 830 | GET | page=2&page_size=75 | 255 | 0(0.00%) | 677 | 573 | 1628 | 650 | 0 | 820 | GET | page=7&page_size=100 | 260 | 0(0.00%) | 568 | 461 | 2637 | 520 | 0.4 | 710 | GET | page=6&page_size=100 | 232 | 0(0.00%) | 559 | 466 | 843 | 520 | 0.1 | 710 | GET | page=9&page_size=100 | 222 | 0(0.00%) | 568 | 460 | 1907 | 530 | 0.5 | 700 | GET | page=3&page_size=100 | 262 | 0(0.00%) | 556 | 462 | 1807 | 520 | 0.2 | 700 | GET | page=5&page_size=100 | 259 | 0(0.00%) | 556 | 463 | 1375 | 510 | 0.3 | 700 | GET | page=4&page_size=100 | 276 | 0(0.00%) | 555 | 456 | 840 | 520 | 0.2 | 690 | GET | page=8&page_size=100 | 236 | 0(0.00%) | 548 | 468 | 903 | 520 | 0.1 | 680 | GET | page=3&page_size=75 | 251 | 0(0.00%) | 543 | 445 | 1348 | 510 | 0.3 | 670 | GET | page=2&page_size=50 | 278 | 0(0.00%) | 507 | 415 | 1141 | 480 | 0.5 | 670 | GET | page=1&page_size=50 | 251 | 0(0.00%) | 512 | 425 | 1043 | 490 | 0.1 | 660 | GET | page=8&page_size=75 | 237 | 0(0.00%) | 481 | 397 | 1644 | 440 | 0.1 | 650 | GET | page=3&page_size=50 | 247 | 0(0.00%) | 523 | 451 | 903 | 500 | 0.4 | 640 | GET | page=6&page_size=75 | 238 | 0(0.00%) | 492 | 398 | 1634 | 450 | 0.3 | 640 | GET | page=4&page_size=75 | 218 | 0(0.00%) | 480 | 395 | 886 | 440 | 0.2 | 630 | GET | page=7&page_size=75 | 227 | 0(0.00%) | 474 | 391 | 795 | 440 | 0.5 | 620 | GET | page=5&page_size=75 | 239 | 0(0.00%) | 466 | 388 | 1394 | 440 | 0.2 | 610 | GET | page=4&page_size=50 | 237 | 0(0.00%) | 453 | 371 | 1010 | 420 | 0.2 | 610 | GET | page=9&page_size=75 | 216 | 0(0.00%) | 467 | 397 | 996 | 430 | 0.4 | 600 | GET | page=6&page_size=50 | 252 | 0(0.00%) | 391 | 321 | 1065 | 360 | 0.4 | 540 | GET | page=5&page_size=25 | 249 | 0(0.00%) | 392 | 323 | 1367 | 370 | 0.4 | 530 | GET | page=7&page_size=50 | 243 | 0(0.00%) | 392 | 319 | 997 | 360 | 0.4 | 530 | GET | page=9&page_size=50 | 221 | 0(0.00%) | 391 | 330 | 658 | 360 | 0 | 520 | GET | page=8&page_size=50 | 241 | 0(0.00%) | 388 | 317 | 1300 | 360 | 0.4 | 520 | GET | page=5&page_size=50 | 204 | 0(0.00%) | 386 | 322 | 723 | 360 | 0.1 | 520 | GET | page=2&page_size=25 | 259 | 0(0.00%) | 363 | 299 | 1124 | 340 | 0.4 | 520 | GET | page=1&page_size=25 | 242 | 0(0.00%) | 373 | 296 | 1142 | 350 | 0.4 | 510 | GET | page=4&page_size=25 | 238 | 0(0.00%) | 361 | 292 | 1171 | 340 | 0.3 | 500 | GET | page=6&page_size=25 | 256 | 0(0.00%) | 365 | 312 | 687 | 350 | 0.3 | 480 | GET | page=3&page_size=25 | 232 | 0(0.00%) | 356 | 302 | 752 | 340 | 0.9 | 480 | GET | page=7&page_size=25 | 261 | 0(0.00%) | 362 | 310 | 590 | 350 | 0.1 | 470 | GET | page=9&page_size=25 | 274 | 0(0.00%) | 304 | 254 | 797 | 280 | 0.3 | 430 | GET | page=8&page_size=25 | 232 | 0(0.00%) | 300 | 253 | 644 | 280 | 0 | 420 | GET | page=8&page_size=1 | 235 | 0(0.00%) | 228 | 191 | 1025 | 210 | 0 | 320 | GET | page=3&page_size=1 | 255 | 0(0.00%) | 235 | 185 | 1106 | 220 | 0.2 | 310 | GET | page=9&page_size=1 | 248 | 0(0.00%) | 234 | 184 | 1147 | 210 | 0.3 | 310 | GET | page=5&page_size=1 | 247 | 0(0.00%) | 233 | 186 | 851 | 220 | 0.5 | 300 | GET | page=7&page_size=1 | 231 | 0(0.00%) | 234 | 188 | 2104 | 220 | 0.3 | 290 | GET | page=2&page_size=1 | 251 | 0(0.00%) | 226 | 187 | 972 | 220 | 0.4 | 290 | GET | page=1&page_size=1 | 255 | 0(0.00%) | 223 | 186 | 461 | 210 | 0.3 | 290 | GET | page=4&page_size=1 | 260 | 0(0.00%) | 222 | 188 | 400 | 210 | 0.3 | 290 | GET | page=6&page_size=1 | 266 | 0(0.00%) | 221 | 187 | 373 | 220 | 0.1 | 280 |
|
This tests takes the course BerkeleyX/ColWri2.2x/1T2014 (~38,000 threads, ~40,000 comments) and then retrieves a single thread. This thread is then grouped by its comment_count. This test was run for 1 hour. As expected, because of pagination issues, threads with more comments are taking longer to return.
Expand |
---|
title | GET thread vs comment_count |
---|
|
Note that the slow transactions are the same thread. Comment_count is also linearly correlated. There is a case where the comment_count was ~4000. The approximate formula to calculate the response time for this context is (comment_count*1.4+360). A thread with ~4000 comments would take approximately 8200ms. Image Added Image Added Name | reqs | # | fails | Avg | Min | Max | Median | req/s | 95% |
---|
GET | Comment_Range:0-25 | 21567 | 0(0.00%) | 330 | 263 | 2050 | 320 | 5.7 | 420 | GET | Comment_Range:25-50 | 3648 | 0(0.00%) | 358 | 280 | 932 | 340 | 1.1 | 460 | GET | Comment_Range:50-75 | 802 | 0(0.00%) | 391 | 308 | 919 | 370 | 0.7 | 510 | GET | Comment_Range:75-100 | 446 | 0(0.00%) | 424 | 325 | 859 | 400 | 0.1 | 570 | GET | Comment_Range:100-125 | 1114 | 0(0.00%) | 444 | 349 | 1613 | 420 | 0.2 | 590 | GET | Comment_Range:125-150 | 56 | 0(0.00%) | 486 | 391 | 762 | 470 | 0 | 700 | GET | Comment_Range:150-175 | 1745 | 0(0.00%) | 518 | 403 | 1029 | 500 | 0.3 | 720 | GET | Comment_Range:175-200 | 1610 | 0(0.00%) | 532 | 400 | 1487 | 510 | 0.5 | 730 | GET | Comment_Range:225-250 | 853 | 0(0.00%) | 595 | 456 | 1389 | 570 | 0.2 | 830 | GET | Comment_Range:350-375 | 850 | 0(0.00%) | 740 | 560 | 1853 | 700 | 0.1 | 1000 | GET | Comment_Range:375-400 | 828 | 0(0.00%) | 772 | 573 | 1926 | 720 | 0.3 | 1100 | GET | Comment_Range:1575-1600 | 879 | 0(0.00%) | 2194 | 1740 | 4731 | 2100 | 0.2 | 3100 |
|