View Source

Background

We've recently run into some issues with migrations on some large tables that were endlessly growing with data we don't need or want. The effort to remove the unwanted rows proved more challenging than expected so we're capturing the lessons learned here.

Best Practices

There are a few different classes of problems that can each be addressed differently. This document is a starting point, as new realms pop up please add them!

Considerations

Does the table need to exist at all?
1. If not, consider using the usual migrations path to drop the table.
2. If the table is large, you may need to slow-prune the table before dropping it (DevOps has run into issues truncate'ing large mysql tables causing EBS queuing impacting the rest of the RDS).
Is there any data in the table that must remain during the delete operation?
1. If so you will probably need to use the batch delete process below.
Are there any foreign keys pointing to the data (in platform or other external dependencies), or important Django signals listening to delete events on the models?
1. If so you may need to use the ORM Delete Process below.

Avoid - Raw SQL

In general we should avoid writing raw SQL for these purposes if we can. It skips some important Django checks and can increase problems by having a disconnect between where we test the code, generally SQLite, and where it gets run in production on MySQL that the ORM helps smooth out (a little).

Batch Delete Process

If you need to keep some data in the table, but clear a lot and keep your auto increment key set this might work for you. MySQL has some issues with large deletes. The Django ORM also has some limitations with batch deletes (especially on MySQL). One we ran across was that any sliced queryset will result in Django creating SQL that works on SQLite (and thus unit tests), but errors on MySQL due to creating a DELETE with a subselect in the WHERE clause. This seems to happen any time you pass an list to a filter "__in", even if it's just a list of integer ids. To work around these issues we tried a few things before coming up with the solution here.

General best practices for deleting in batches:

Never slice your queryset to get the rows to delete
- ex: Grant.objects.filter(expires__lte=Now())[:batch_size]
Create a management command that can be run again later if need be
- Management command should have command line options to configure the batch size and sleep time between batches so DevOps can tune resource usage to play nice with the production traffic
Delete by an indexed column if you can
- id is usually best if you have contiguous ids
- Get the max id you want to delete, the current min id in the table, delete by id chunks by batch size until you reach max id.
If you have to filter by a non-indexed column, try to keep to your batch size by also limiting by an indexed column
- Example: edx-django-oauth2-provider (this deleted ~1.6M production rows in 46 mins w/ chunk size of 3k and 5 secs between batches)
NOTE: Django has a delete optimization that makes this performant. But if the table has foreign keys or connected pre/post delete signals, that optimization WILL NOT be used and this will effectively turn into the ORM Delete Process below.

ORM Delete Process

Sometimes you need Django to cascade your delete to other tables that have foreign keys on the data you are deleting, or important pre/post delete signals need to be run. In this case your life is simple. Just use the usual Django filter() to get your target rows and delete() them one at a time. This will be much slower, but will perform all of the necessary checks. Note that the same code that runs a fast delete on one model can be slow on another due to those FKs or signals. See here for details.

Resources

Django object deletion docs
Information on big MySQL deletes
Example management command from edx-django-oauth2-provider