Forums Pagination

Forums currently does some of it's pagination in memory. I would like to optimize some of this code and push the pagination down to the database level. This writeup was done to see if this work could be used as a hackathon project.

Overview

Diving in to the code, the project does use a pagination gem:

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/Gemfile#L35 (docs: https://github.com/mislav/will_paginate)

However this isn’t used to to paginate everything. Here is the only occurrence that I have found: https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/lib/helpers.rb#L219

 

Most of the pagination that is hurting us is based in the following locations:

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/presenters/thread.rb#L26-L110

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/api/users.rb#L31-L55 (this calls the above code, we need to figure out exactly what is happening here to see if we could cut down on any queries done here as well)

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/lib/helpers.rb#L121-L230

  

I see 2 major uses cases that we need to take care of:

  • pagination of thread responses
  • pagination of threads that are unread.

 

Thread Responses:

This may not be so bad actually, if you look at these lines only the object_id is being returned:

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/presenters/thread.rb#L74-L76

 

I created a PR to fix this here:

https://github.com/edx/cs_comments_service/pull/140

 

Even this could have a mixed result as far as performance goes. It could make large threads return faster, but smaller threads make take more time, see the PR.

Even with the fixes in PR 140 we could probably experience good performance gains if you move that logic to a stored javascript function as well and that may also be the way to go. My concern is how to we test it? But at the same time we do have specs for this functionality already, it would just run the slimed down ruby code that calls the JS. Though the stored javascript approach could make it harder to find errors. FWIW I have not looked in to how people really use and test these kind of stored javascript procedures.

 

(quick aside:

Going through this some more I also notice that we are doing a tree traversal here:

https://github.com/edx/cs_comments_service/blob/bbeggs/mongodb-3.0-PLAT-730/presenters/thread.rb#L85-L110

This quite possibly could also be big performance hit on large threads, we should investigate further down the line.

)

 

Unread threads:

After review it looks like the main issue here is filtering for pagination when attempting to view only unread items. The comment in the code:

# Filter and paginate based on user read state.  Requires joining a subdocument of the

# user object with documents in the contents collection, which has to be done in memory.

 

The problem here is that mongo can’t join between the users collection for their threads last read date and the content collection. Since we can not do this we query for the comment threads (with the filtering criteria that is directly above it in the code then it iterates in batches over those threads and picks out the unread threads. 

 

You might be able to write a better query or use the aggression framework, but it could be difficult. The aggregation framework might be the better choice using match and skip, but the code would need to start tracking the number of documents skipped pre-filter (as opposed to post-filter as it does now). 

 

The other option to consider here would be to re-write this logic in javascript and insert it as stored javascript on the server (almost analogous to a stored procedure). You could possibly even move the filtering logic there as well. If you did this you could just pass in the user id then leave the JS on the server to take care of doing the manual join on the users read states and items in the content table. TBH this is probably your best choice to improve performance. 

I think you have a reasonably good chance at succesfully moving this logic to javascript, though I don’t think it’s more than a 1 person project.