Post index update
Notes:
- It would be suggested to have a usable index for every query that we have. Slow queries can reduce the overall performance.
- Obtaining a list of queries would be ideal for index creation. The current indexes have not been designed with index intersection in mind.
- Common trend amongst slow queries is `count`
- All errors rates observed are due to the decrease in timeouts. https://github.com/edx/cs_comments_service/pull/146
- Index selectivity still needs some work.
Improvements:
After applying the index on Nov 12, ~2pm, we can see a significant improvement as expected.
Regressions:
After applying the index on Nov 11, 11:05, we can see that the 99th percentile began to timeout for the users endpoint. After inspection of the index on the read replica, it's not clear to why this could of happened. The slow query that occurs in these timeouts is
COMMAND database=comments-prod command={:count=>"contents", :query=>{"context"=>"course", "author_id"=>"8567599", "course_id"=>"course-v1:KULeuvenX+EUHURIx+3T2015", "anonymous"=>false, "anonymous_to_peers"=>false, "_type"=>{"$in"=>["CommentThread"]}}}
When running the query against the read replica, the expected index is being used. Removing `author_id_1` should not have affected this because the compound index `author_id_1_course_id_1` would be used instead. The median remains the same and the 95th percentile did not spike like the 99th. It seems like the proper indexes are in place, otherwise, there would be a regression for the median as well. One possible explanation could be the increase of index size explained in this wiki below.
UPDATE:
TLDR: Regression has been addressed. After The index "author_id_1_course_id_1" may have been corrupted. It was removed and then re-added. This endpoint now looks the same as it way before the index update.
Compose reported that there was a slow query where the suggested index was author_id_1, an index that had been removed. When running the example slow query with an explain plan, it was shown that the correct index "author_id_1_course_id_1" was being used as expected. After some back and forth with Compose, it was found in the logs that sometimes the username endpoint will have a long read lock time (in the order of 10s). One suggestion was to reapply "author_id_1_course_id_1" in the case it may have been a corrupted index. After reapplying this index, compose no longer reported this as a slow query and new relic seems to be reporting normal numbers.
Index size:
The indexes that we added takes up more RAM than the ones removed. To add to that, the `delete_spam` index that was removed had a size of 0 which does not really improve performance at all. Overall it seems like another 1GB of indexes was added.
Observations:
Caching in effect. Below are sequential attempts at accessing the default view of a forums heavy course course-v1:IDBx+IDB6x+2015_T2 where the comment count of the pinned posts are ~4000. Each of the 20 threads it is trying to load (limit=20), is taking too long for a query. After each refresh, more data gets cached and then the comment count for the next threads are fetched. After a few refreshes, this course will no longer show 500s and load.
Looks like it is trying to retrieve the count of all comments with with the matching comment_thread_id in order to get the comment_count. This is horribly inefficient.