Investigate / Fix Recent RCM issues

Description

RCM began having numerous issues and was not passing for several weeks. These were caused in part by some Ecommerce Data Loader issues where some course runs had the Empty type. These were resolved by fixing the type to what it should be based on seats, but there is still an unknown about how these course runs had the Empty type in the first place. (One hypothesis is that the course runs were created directly in Studio and then assigned the Empty type through RCM, but the PC for the org (Purdue) said the runs were created in Publisher so not sure how it happened.)

Then there is the real issue. The CoursesApiDataLoader is very consistently (1 out of every 7 or 8 seems to not run into this issue, but it happens the vast majority of the time) failing due to the following exception:

This is a new error that has been happening starting around the middle of December 2019 (sometime after 12/16/2019). The cause of the error is that the courses API is reporting it has more results than it actually has. For example, the count of results on prod is 11926 and when the page_size is set to 50, that results in 239 pages. Except starting at page 235, there are no more results (https://courses.edx.org/api/courses/v1/courses/?page=235&page_size=50 )*. According to Cale, this is the result of the count we display is based on the overall QuerySet size which we then apply some filtering to in the results. However, since RCM is running as a staff user, we should have access to all results.

In my research into this, I was able to find a handful of the ~250 course runs that are no longer showing up. They are

  • BerkeleyX/LUCS.1x/3T2015

  • BerkeleyX/LUCS.2x/3T2015

  • BerkeleyX/LUCS.3x/1T2016

  • BerkeleyX/LUCS.4x/1T2016

All of these course runs are in course-discovery, are in Studio, but are not accessible in the API. They should be at the top of page 2 at https://courses.edx.org/api/courses/v1/courses/?page=2&page_size=50 , but are clearly not and if you try to hit the detail endpoint (https://courses.edx.org/api/courses/v1/courses/BerkeleyX/LUCS.1x/3T2015), you receive a 404 with the response "developer_message":"Course not found.".

Work should be done to figure out which course runs have stopped being returned and why and fixing it so they go back to being returned. Until that happens, RCM will likely continue to fail very often and we won't be updating a subset of our courseware.

* at the time of writing, it was page 235 that began resulting in no results, but this appears to be shifting. When I first was looking into this, it was page 234, but page 234 is now returning 19 results.

Steps to Reproduce

None

Activity

Show:
Dillon Dumesnil
February 25, 2020, 7:28 PM

Boy howdy. This one is a doozy right here.

So recently I added in some logging to RCM to try and get some more details about what's happening with the list endpoint. It appears that when RCM runs into a KeyError, it is because we are receiving the response

From a bit of googling, I found that this is the default message coming from Django Rest Framework. I believe we set our rates here (https://github.com/edx/edx-platform/blob/5ea4d8f50304bd7885d0469f5e0e7aea9faf341c/lms/envs/common.py#L2545). I believe the endpoint is not returning the standard 429 http status code (Too many requests) though since we have code that specifically handles that case and will backoff on our exceptions.

We also started seeing a second new issue which was a JSONDecodeError. From our logging, we noticed two separate scenarios. The first was a 504 http status code (Gateway timeout) that would return an empty response. I'm not sure what might be causing that, and it is likely a hard one to reproduce, but you can find some examples by doing a search for "Status code" inside of a run of RCM.

*Probably extra relevant part for this ticket:
The second scenario is closely related to the original issue described above. Before, when you went to the end of the page_count, you would receive a response that had no results. Now it simply returns a 500 server error. At this point, it might be helpful to point out that someone recently did work to update the reported page count and number of results after consecutive requests. So in your first request (assuming page_size = 50 for this example), it might say 241 pages and 12,015 results (or something like that), but then in a later request, those numbers will be updated to 239 pages and 11,930 results. So in this scenario, page 239 will return the final 30 results, page 240 will return an empty results array, and page 241 will return a 500 http status code server error. The odd part is that if you hit a crazy high page (like 687), you will receive a 404 (instead of a 500). You will also get this behavior on page (original_page_size + 1) (242 in this example).

So what does this mean for us? We should continue to wait for the CR ticket linked (and the PROD ticket linked to that) to be complete as they look into the endpoint and hopefully resolving the missing results issue and the DRF throttling issue. Then we can probably update RCM to better handle some of these issues (like the new api_client has a function called raise_for_status that we would probably be interested in).

Dillon Dumesnil
April 1, 2020, 8:48 PM

Relevant PR for the courses endpoint returning all results to staff users: https://github.com/edx/edx-platform/pull/23574

Dillon Dumesnil
April 6, 2020, 5:54 PM

The above PR was reverted and replaced by https://github.com/edx/course-discovery/pull/2519

Dillon Dumesnil
April 7, 2020, 2:07 PM

The fixes for this included fixing our backoff code and passing in a username to the LMS courses API. In a little more detail:

  • The backoff fix was required because we switched our API client a number of months back, but forgot to update the exception the backoff code was looking for so it wouldn't end up running at all. The exception was switched from a Slumber exception to the more default RequestException.

  • By default, the LMS courses API treats all users as Anonymous Users. In order for the API to treat a specific user as staff, you must pass in the username of the requesting staff user and then it will. Because of the differences in return content between Anonymous and staff users, we wanted RCM to be running as a staff user.

    • As part of the work to pass in the username, it involved making a change to the OAuthApiClient that we use to publicly retrieve the JWT token. This JWT token is then decoded in RCM in order to retrieve the username of the service user.

Done

Story Points

None

Assignee

Dillon Dumesnil

Reporter

Jason Myatt

Labels

Reach

None

Impact

None

Platform Area

None

Customer

None

Partner Manager

None

URL

None

Contributor Name

None

Groups with Read-Only Access

None

Actual Points

None

Category of Work

None

Platform Map Area (Levels 1 & 2)

None

Platform Map Area (Levels 3 & 4)

None

Sprint

None

Priority

Unset
Configure