Action items:
- Discuss with DEVOPS to increase RAM.
- Wait until Mongo for forums is in house.
- Get the list of most-used queries.
Notes:
- Every query we run should have a usable index. Slow queries can reduce overall performance.
- Obtaining a list of queries would be ideal for index creation. The current indexes have not been designed with index intersection in mind.
- A common trend among the slow queries is the use of `count`.
- All error rates observed are due to the decrease in timeouts. https://github.com/edx/cs_comments_service/pull/146
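One way to build the list of most-used queries is to reduce each logged query to its "shape" (field names and operators, with values dropped) and count how often each shape occurs. The sketch below assumes the queries have already been extracted from the logs into Python dicts; the helper names `query_shape` and `most_used_shapes` and the sample queries are hypothetical and only illustrate the idea.

```python
from collections import Counter


def query_shape(query):
    """Reduce a query document to its shape: field names plus operators, no values."""
    shape = []
    for field, value in sorted(query.items()):
        if isinstance(value, dict):
            # Operator document such as {"$in": [...]}: keep only the operator names.
            ops = ",".join(sorted(value))
            shape.append(f"{field}:{ops}")
        else:
            # A plain value is an equality match.
            shape.append(f"{field}:eq")
    return tuple(shape)


def most_used_shapes(queries, top=10):
    """Count query shapes and return the most frequent ones first."""
    return Counter(query_shape(q) for q in queries).most_common(top)


# Hypothetical sample; real input would come from the service logs.
sample = [
    {"author_id": "1", "course_id": "course-a", "_type": {"$in": ["CommentThread"]}},
    {"author_id": "2", "course_id": "course-b", "_type": {"$in": ["Comment"]}},
    {"course_id": "course-a"},
]
ranked = most_used_shapes(sample)
for shape, count in ranked:
    print(count, shape)
```

Grouping by shape rather than by exact query keeps the list short enough to design indexes against, since two queries that differ only in their values would be served by the same index.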
Improvements:
After applying the index on Nov 11, 11:05, we can see a significant improvement as expected.
Regressions:
After applying the index on Nov 11, 11:05, we can see that the 99th percentile began to time out for the users endpoint. After inspecting the index on the read replica, it is not clear why this happened. The slow query that occurs in these timeouts is
COMMAND database=comments-prod command={:count=>"contents", :query=>{"context"=>"course", "author_id"=>"8567599", "course_id"=>"course-v1:KULeuvenX+EUHURIx+3T2015", "anonymous"=>false, "anonymous_to_peers"=>false, "_type"=>{"$in"=>["CommentThread"]}}}
When the query is run against the read replica, the expected index is used. Removing `author_id_1` should not have affected this, because the compound index `author_id_1_course_id_1` would be used instead. The median remains the same, and the 95th percentile did not spike like the 99th. The proper indexes appear to be in place; otherwise, the median would have regressed as well. One possible explanation is the increase in index size described under "Index size" below.
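The reasoning about `author_id_1_course_id_1` can be illustrated with a simplified prefix check: a compound index can serve a query on its leading keys, so it still covers lookups by `author_id` once `author_id_1` is gone. The sketch below is only a rough model and not MongoDB's query planner; the helper name `index_prefix_used` is hypothetical, and the key order is assumed from the index name.

```python
def index_prefix_used(index_keys, query):
    """Return the leading index keys that the query constrains (simplified model)."""
    used = []
    for key in index_keys:
        if key not in query:
            break  # A gap in the prefix stops further keys from being used.
        used.append(key)
    return used


# The slow count query from the log line above, as a Python dict.
query = {
    "context": "course",
    "author_id": "8567599",
    "course_id": "course-v1:KULeuvenX+EUHURIx+3T2015",
    "anonymous": False,
    "anonymous_to_peers": False,
    "_type": {"$in": ["CommentThread"]},
}

# Key order assumed from the index name author_id_1_course_id_1.
compound = ["author_id", "course_id"]

used = index_prefix_used(compound, query)
not_in_index = [field for field in query if field not in used]
print("index keys used:", used)
print("fields filtered outside the index:", not_in_index)
```

Under this model both index keys are used, while `context`, `anonymous`, `anonymous_to_peers`, and `_type` are not part of the index. Those predicates would have to be checked against the fetched documents, which may be one reason a `count` for an author with many posts stays slow even though the expected index is chosen.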
Index size:
The indexes that we added take up more RAM than the ones we removed. In addition, the `delete_spam` index that was removed had a size of 0, so removing it does not really improve performance. Overall, it seems that another 1GB of indexes was added.
Observations:
Below is one of our compound indexes that is not selective enough. Whether this is due to poor indexes or to our queries is a separate question.