Django's delete() is harmful

Update 21st December 2021 (9 days after the original post)

After digging into the Django code for the Collector and the delete() method, I found out that this behavior, even though it is not ideal and performs poorly when working with a lot of data, is actually the intended behavior.

In the Django deletion code, there are two ways of deleting models: the fast way and the slow way. Models can be deleted the fast way when they have no delete signals connected, trigger no further CASCADE actions and have no parent links (with an exception that allows skipping a parent -> child -> parent chain).

If any of these conditions is not met, the fast deletion is skipped and the models will be fetched whole.
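If you are curious which path a deletion will take, you can poke at the Collector directly. This is an internal API, so treat the following as a sketch that may change between versions; B and a are the model and instance from the original post below.

from django.db.models.deletion import Collector

collector = Collector(using='default')
# True only when there are no delete signals, no cascades and no parent links
print(collector.can_fast_delete(B.objects.filter(a_id=a.id)))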

Fetching the whole instances actually makes sense: for example, when we have pre_delete or post_delete signals connected to the targeted model, they receive the affected instance as an argument, and it would be inconvenient and troublesome to get an incomplete model instance when in fact we expect the full one.
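For illustration (this example is mine, not from the original post), a receiver like the following needs the whole row; archive_somewhere is a hypothetical helper, and B and its data field come from the models shown further down.

from django.db.models.signals import post_delete
from django.dispatch import receiver

@receiver(post_delete, sender=B)
def archive_deleted_b(sender, instance, **kwargs):
    # If only the primary keys had been fetched, instance.data would be deferred,
    # and since the row is already gone it couldn't be loaded lazily either.
    archive_somewhere(instance.data)  # hypothetical helper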

To work around this you can:

  • manually do the cascading delete yourself, with more efficient querying and batching, going from the child models towards the root
  • temporarily disable the signals for that delete, so the fast deletion of those models can kick in, with something similar to FactoryBoy's mute_signals context manager (a sketch follows after this list)
  • replace the default manager's QuerySet with a custom QuerySet that defer()s the problematic fields.
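As a sketch of the second option, my own simplified take on a context manager similar in spirit to FactoryBoy's mute_signals could look like this (assuming nothing reconnects receivers while it is active):

from contextlib import contextmanager

from django.db.models.signals import post_delete, pre_delete

@contextmanager
def signals_muted(*signals):
    # Temporarily detach every receiver from the given signals, then restore them.
    paused = {}
    try:
        for signal in signals:
            paused[signal] = signal.receivers
            signal.receivers = []
            signal.sender_receivers_cache.clear()
        yield
    finally:
        for signal, receivers in paused.items():
            signal.receivers = receivers
            signal.sender_receivers_cache.clear()

# with no receivers attached, Django can pick the fast deletion path
with signals_muted(pre_delete, post_delete):
    instance_with_big_cascade_tree.delete()  # hypothetical instance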

Of course, all these methods only work if you don't run heavy business logic in signals that expect to be called with the full instance. If you do, then you are in trouble.

And there is one more issue: we were still running Django 2.x.x, while this fast deletion optimization was only added in Django 3.x.x. I suppose it was a tough lesson learned the hard way: strive to keep your libraries as up to date as possible.

Original post

After a one-month hiatus from writing, I decided to come back, this time with a more technical post. It is about some surprising findings regarding Django's Model.delete() method, which can cause high memory usage when dealing with a lot of instances, related models and on_delete=CASCADE, a fatal combination.

What happened

Let's say we have some models, A, B and C, defined as:

from django.db import models

class A(models.Model):
    pass

class B(models.Model):
    a = models.ForeignKey(A, on_delete=models.CASCADE)

class C(models.Model):
    b = models.ForeignKey(B, on_delete=models.CASCADE)

And in a production scenario we would have a lot of Bs tied to a certain A, and Cs tied to those Bs.

If you want to delete that A, hell will ensue if there is too much data involved. And even if there isn't that much data, it will still be painfully slow.

What did I expect to happen?

Normally, delete() operations translate into SQL DELETE operations. That's expected and that's what is happening. But what about those CASCADEs? That's where the fun begins!

Ideally, the deletion could be done with a SELECT and a DELETE for each affected model, or even better, the database could be allowed to handle the CASCADE mechanism (PostgreSQL supports it natively). Judging from this Django ticket discussion on the topic, CASCADEing is handled this way partly for historic reasons (the Django internal API doesn't work like that) and partly because of database capabilities (the ORM strives to work seamlessly on a number of databases, but this feature is implemented differently across the popular ones).
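For reference, and not something we did here, one way to push the cascading down to PostgreSQL is to stop Django from cascading in Python and add the constraint yourself in a hand-written migration. The app, table and constraint names below are made up for the sketch:

from django.db import migrations, models

# models.py: a variant of the B model with no Python-level cascading
class B(models.Model):
    # DO_NOTHING keeps Django's Collector from touching the related rows;
    # db_constraint=False stops Django from creating its own FK constraint.
    a = models.ForeignKey('A', on_delete=models.DO_NOTHING, db_constraint=False)

# a hand-written migration: let the database do the cascading instead
class Migration(migrations.Migration):
    dependencies = [('myapp', '0001_initial')]  # illustrative dependency
    operations = [
        migrations.RunSQL(
            sql=(
                "ALTER TABLE myapp_b ADD CONSTRAINT myapp_b_a_id_cascade_fk "
                "FOREIGN KEY (a_id) REFERENCES myapp_a (id) ON DELETE CASCADE"
            ),
            reverse_sql="ALTER TABLE myapp_b DROP CONSTRAINT myapp_b_a_id_cascade_fk",
        ),
    ]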

What is actually happening?

Due to how Django's ORM works and the reasons above, CASCADE is implemented at the framework level rather than at the database level, meaning that deleting related entities is managed and performed in Python code.

How does that work?

  • For the models you are calling delete() on, check whether there are CASCADE dependencies.
  • If there are, for each model instance in the QuerySet targeted for deletion, collect all dependencies that also need to be deleted.
  • For the collected dependencies, repeat recursively until you reach the "leaves": the models that can be safely deleted without triggering any more CASCADEs.

As you can see, the queryset whose instances are deleted is collected together with its dependencies, which resolves that queryset. Then its related models' querysets are resolved, and then another, and so on, until we hit the "leaf" querysets which have no further dependencies, and the actual deletion can commence.

Where this happens is in QuerySet.delete() (Line 756, when invoking the Collector), in combination with iterating over the original queryset, which evaluates and executes that queryset, bringing all its instances into memory, plus the way the related querysets are created: because of the way QuerySets work, all the model fields are fetched by default, meaning that ALL intermediary related models will be fetched into memory during the deletion.
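To make that concrete, here is a heavily simplified sketch of what that collection phase does. This is my paraphrase, not the actual Collector code, and cascading_relations is a made-up helper standing in for Django's model introspection:

def collect(queryset, collected):
    # Evaluating the queryset brings every field of every instance into memory.
    instances = list(queryset)
    collected.append(instances)
    for relation in cascading_relations(queryset.model):  # made-up helper
        # The related queryset is built without only()/defer(), so the children
        # are fetched whole as well, and so on, recursively down to the leaves.
        children = relation.related_model.objects.filter(
            **{f"{relation.field.name}__in": instances}
        )
        collect(children, collected)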

Imagine having some more "bulky" data attached to the B model and about 100 000 such instances linked to a specific A we want to delete.

from django.contrib.postgres.fields import JSONField  # on Django 3.1+ use django.db.models.JSONField
from django.db import models

class B(models.Model):
    a = models.ForeignKey(A, on_delete=models.CASCADE)
    data = JSONField(default=dict)  # usually holds about 4 kB of data, which is a fair-sized JSON blob

A simple a.delete() will cause 4 kB * 100 000 = 400 000 kB ≈ 400 MB of extra data to be fetched into memory. In more extreme cases with more instances or bulkier data, or when running in a resource-constrained environment (e.g. in a memory-capped container in a Kubernetes cluster), the infamous Out-Of-Memory Killer will step in and kill the process, rolling back all the deletion progress (because the whole deletion is performed in a database transaction).

How to handle this mess?

There is a nice and simple solution to this problem, although an imperfect one: since it lives outside the framework, in your own code, you can only craft a solution that works for your specific use case.

You will need to identify the reverse order of the model dependencies and delete the instances yourself, in a more efficient manner, by leveraging only the foreign keys and ids of those models.

Here is a quick and dirty workaround using Django's built-in Paginator class.

from django.core.paginator import Paginator

a = A.objects.get_instance_i_want_to_delete()
BATCH_SIZE = 500

def delete_efficiently(queryset, model, batch_size=BATCH_SIZE):
    batches = []
    # fetch only the ids, ordered so the pagination is stable
    paginator = Paginator(queryset.only('id').order_by('id'), batch_size)

    # build the batches of ids
    for page_num in paginator.page_range:
        batches.append([obj.id for obj in paginator.page(page_num).object_list])

    # delete the batches afterwards, because we can't delete while performing the pagination
    for batch in batches:
        model.objects.filter(id__in=batch).delete()

delete_efficiently(C.objects.filter(b__a_id=a.id), C)  # delete all Cs that are tied to A through B
delete_efficiently(B.objects.filter(a_id=a.id), B)  # delete all Bs that are tied to A
a.delete()  # delete A last, when it has no more dependencies left

By paginating and fetching into memory only the IDs of the instances we want to delete, we manage to greatly reduce the memory consumption of the delete operation.

Even though this could be done even more efficiently, other solutions would most likely require writing raw SQL queries, which defeats the purpose of using the ORM.
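For completeness, the raw SQL route alluded to above would look something like this (illustrative only, assuming the default myapp_a/myapp_b/myapp_c table names that Django would generate):

from django.db import connection, transaction

with transaction.atomic(), connection.cursor() as cursor:
    # delete from the leaves towards the root, without fetching anything into Python
    cursor.execute(
        "DELETE FROM myapp_c WHERE b_id IN (SELECT id FROM myapp_b WHERE a_id = %s)",
        [a.id],
    )
    cursor.execute("DELETE FROM myapp_b WHERE a_id = %s", [a.id])
    cursor.execute("DELETE FROM myapp_a WHERE id = %s", [a.id])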

Conclusion

Some tools, such as ORMs, are made to make our lives as developers easier. But sometimes they manage to surprise us in weird ways, raising our heart rate when processes start getting OOM-killed in production.

Some digging into Django's code revealed one such case: using the delete() method on an instance with too many direct and indirect (somewhat bulky) dependencies.

After detecting the root cause of the issue, designing a good-enough solution to bypass this limitation is rather easy, once again proving Pareto's principle: 80% of the work is spent in 20% of the code; in this case, the bulk of the performance loss was caused by those few lines I highlighted above.