Grades backfill execution plan
Pre-execution
Land Cliff's PR: https://github.com/edx/edx-platform/pull/14925
Ops – Upgrade rabbit servers. https://openedx.atlassian.net/browse/OPS-1837 https://openedx.atlassian.net/browse/OPS-1838 https://openedx.atlassian.net/browse/OPS-1835
Create new celery queue / routing key in edx-internal https://openedx.atlassian.net/browse/TNL-6875
Update jenkins job to reference --routing_key
Enable persistent grades on edge (https://openedx.atlassian.net/browse/TNL-6626).
Backfill Grades
Enable write_only_if_engaged waffle switch on all environments (prod, stage, edge, load test). - ticket
Slow delete persistent grades tables (34 million rows in prod). ~5-8 hours expected - Devops ticket
Export list of courses (per environment) from course overviews table on read replica or call Courses API. (Note: Do not start this until tables have been deleted.) - ticket
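The slow delete above could be done in small chunks with a pause between them, so that a single huge DELETE does not hold locks or swamp replication. The estimate of 5-8 hours for 34 million rows works out to roughly 1,200 to 1,900 rows per second. The sketch below is only an illustration: it uses Python's built-in sqlite3 as a stand-in database, and the function name, table name, chunk size, and pause are all assumptions rather than anything defined by this plan.

```python
import sqlite3
import time


def slow_delete(conn, table, chunk_size=1000, pause_seconds=0.5):
    """Delete every row from `table` in chunks; return the total rows deleted.

    `table` must be a trusted identifier (it is interpolated into the SQL).
    The rowid subquery is SQLite syntax; on MySQL the equivalent would be
    `DELETE FROM <table> LIMIT <n>` (assumption: adapt to the real database).
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} LIMIT ?)",
            (chunk_size,),
        )
        conn.commit()  # commit each chunk so locks are released promptly
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
        time.sleep(pause_seconds)  # give the database room between chunks
    return total
```

Chunk size and pause would be tuned per environment; larger chunks finish sooner but put more load on the database for each statement.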
Stage https://openedx.atlassian.net/browse/TNL-6874
Delete persistent grades tables.
Run against demo course, with 100 users per batch.
Delete persistent grades tables again.
Run against demo course, with 1000 users per batch.
Run against all other courses on stage, using 1000 users per batch.
Decide if we're happy with batching and worker count.
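To make the batch size comparison in the stage steps concrete, the sketch below shows how a course's enrolled users might be split into batches, one task per batch. It is a minimal illustration and not the management command's actual implementation; the function name and the (offset, size) representation are assumptions.

```python
def iter_batches(total_users, batch_size):
    """Yield (offset, size) pairs covering `total_users` in order.

    Each pair would correspond to one queued task (assumption); the last
    batch may be smaller than `batch_size`.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total_users, batch_size):
        yield offset, min(batch_size, total_users - offset)


def task_count(total_users, batch_size):
    """Number of tasks a course with `total_users` users would enqueue."""
    return sum(1 for _ in iter_batches(total_users, batch_size))
```

For a course with 2,500 users, a batch size of 100 gives 25 tasks while a batch size of 1000 gives 3, which is the trade-off between many short tasks and a few long ones that the two demo course runs are meant to compare.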
Edge https://openedx.atlassian.net/browse/TNL-6876
Redo batch size testing, if stage didn't have enough data for useful metrics.
Using the batch size from the batch size test here or on stage, run against 10 courses
Run against 100 courses
Run against all courses.
Production https://openedx.atlassian.net/browse/TNL-6877
Use batch size from edge
Run against demo course
Run against 10 courses
Add extra celery worker
Run against 10 courses
Run against 100 courses
Run against 200 more courses (optional)
Run against all courses
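The staged production rollout above (demo course, 10, 10, 100, an optional 200, then everything else) amounts to slicing the exported course list into consecutive groups. A rough sketch follows, assuming the export is a plain Python list of course id strings with the demo course first; the function name and that ordering are hypothetical.

```python
def rollout_stages(course_ids, stage_sizes):
    """Split `course_ids` into consecutive stages of the given sizes.

    Any courses left over after the listed sizes form one final stage,
    matching the "run against all courses" step.
    """
    stages = []
    start = 0
    for size in stage_sizes:
        if start >= len(course_ids):
            break  # fewer courses than the plan's stages require
        stages.append(course_ids[start:start + size])
        start += size
    if start < len(course_ids):
        stages.append(course_ids[start:])  # remainder: all other courses
    return stages
```

Calling `rollout_stages(course_ids, [1, 10, 10, 100, 200])` would give the course list for each production run in order, so that each stage's ids can be pasted into the config model before the corresponding Jenkins run.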
Load test
Can we set up a jenkins job against this environment, or should we just run from command line? @Kevin Falcone?
Run against all courses when environment is idle.
Notes:
Use Jenkins management command for all test runs, setting course ids and batch size in config model.
Watch celery logs to ensure tasks are being picked up and run smoothly.
Watch new relic to make sure we aren't negatively affecting site performance.