When starting a new Django project often my first step is to create a model and bulk load in some existing data. As I’ve learned more about Django and databases, I’ve learned a few ways to speed up the data loading process. In this post I’ll walk through progressively more efficient ways of loading data, and at the end of the post measure the performance of the methods using fake data generated by the wonderful Faker library.
Setup: Some Fake Data
We start with a Django model composed of several different types of data:
class Thing(models.Model): char = models.CharField(max_length=100) text = models.TextField() integer = models.IntegerField() float = models.FloatField() boolean = models.BooleanField()
Faker, we’re going to generate some fake data to populate our database of Things.
fake = Faker() def fake_thing(): return dict( char=fake.name(), text=fake.text(), integer=fake.pyint(), float=fake.pyfloat(), boolean=fake.pybool(), ) n_things = 100 things = [fake_thing() for _ in range(n_things)]
This data generation process is run once at the beginning of the script, so the same data is being inserted by each method for a fair comparison.
.save() on everything
One of the first things I learned in Django is how to save to the database using
.save() on an object. With this initial knowledge, I used to save a bunch of things in the database with a loop, like this:
for thing in things: t = Thing(**thing) t.save()
This way totally works! But it’s slow. The issue here is that Django runs in autocommit mode by default, so it’s completing a transaction every time an object is saved.
An analogy might help. If we were going to deposit $100 into the bank, autocommit is like forcing the bank teller to deposit our money dollar by dollar, printing a confirmation receipt after each $1 deposit and doing this 100 times. At a large enough scale, this behavior could be really helpful in preventing issues if thousands of people are making deposits of different amounts all at once. In our case we’re the only user of the bank and we want to make one large deposit, so it’s slowing us down.
Method 2: Using a Transaction
We want all of the objects to be saved at once, or not at all if there’s an error, which means we want a single database transaction to occur. With Django, our code doesn’t need to change much to use a transaction, we just wrap our same loop in the atomic transaction context manager:
from django.db import transaction with transaction.atomic(): for thing in things: t = Thing(**thing) t.save()
Learning how to control transactions is a really powerful tool, so head to the Django docs to learn more.
Wrapping it in a transaction means that if there’s an error in saving any of the things in our list, none of them will be saved to the database.
Back to the analogy, this is like if we gave the teller instructions to deposit the money dollar by dollar, but only give us a single receipt at the end if the whole transaction succeeds. It’s more efficient, but we still want a single $100 deposit not 100 individual $1 deposits.
Wrapping everything in a single transaction helps, but there’s still more room for improvement. With Django models
bulk_create is an efficient way to create a bunch of objects all at once in a single query. To use it with Django, your code will look something like this:
thing_objects =  for thing in things: t = Thing(**thing) thing_objects.append(t) Thing.objects.bulk_create(thing_objects)
The workflow here is slightly different: you don’t call save on each object, but instead store all the objects you want to create in a list, then pass that list to the
bulk_create method on the model manager.
There are some caveats to
bulk_create to consider. The one I encounter most is that it doesn’t call the save method on your model or use any signals. If you already have an application with existing data, or some complex business logic that uses custom
save methods or signals, you probably want to think a little more about how your data will be created.
At the Bank of Django,
bulk_create is probably the closest metal model to how I imagine a deposit working. We give the teller $100 and the teller adds $100 to our account and gives us a receipt indicating the full amount has been deposited all at once.
Method 4: PostgreSQL’s COPY command with
If you aren’t using a PostgreSQL database,
bulk_create is usually the best you can get. If you are using PostgreSQL, it has a
COPY command, which is described well in the docs:
COPYmoves data between PostgreSQL tables and standard file-system files.
Thankfully, some amazing people wrote
django-postgres-copy to make using
COPY with Django super easy. Using this library requires a bit of setup and some helper functions to get started, but the import speedup is worth it.
After installing, we’ll need to add the
CopyManager to our
Thing model. This means our model now looks like this:
from postgres_copy import CopyManager class Thing(models.Model): char = models.CharField(max_length=100) text = models.TextField() integer = models.IntegerField() float = models.FloatField() boolean = models.BooleanField() objects = CopyManager()
This allows us to use a
from_csv model manager method to load data from a csv. But
things isn’t a csv a csv, it’s a list of dictionaries. I don’t want to clog up my filesystem with a bunch of csv files just to import, so we’ll write a helper function to create an in-memory csv in order to use this method.
from io import StringIO import pandas as pd def in_memory_csv(data): """Creates an in-memory csv. Assumes `data` is a list of dicts with native python types.""" mem_csv = StringIO() pd.DataFrame(data).to_csv(mem_csv, index=False) mem_csv.seek(0) return mem_csv
The project I was using already had
pandas as a dependency, so I used it for ease. If you’re thinking that using pandas to write a csv is overkill, it’s also easy to write to a csv using the built-in csv module.
Now that we have this helper function to create an in-memory csv of our things, let’s write the code to insert this data into the database.
from contextlib import closing mem_csv = in_memory_csv(things) with closing(mem_csv) as csv_io: Thing.objects.from_csv(csv_io)
Here we used the closing context manager that closes our in-memory csv when we’re done with it. If we don’t close the object and call this multiple times, we’ll run out of memory.
This speedup does come at a cost, we’re skipping some validation (database constraints, indexing) that’s happening behind the scenes when we use
bulk_create. The docs for
django-postgres-copy describe options that you can turn on and off when you import.
Our bank analogy breaks down a bit here, but I’ll give it a shot. Using
COPY in this way is like avoiding the teller altogether and telling your bank: “I am depositing $100. Here’s a copy of what my account will look like after I deposit my money. Make it look like this.”
How much faster is using
django-postgres-copy? Anywhere from 34-77x faster than
save(), and 3.4-5.8x faster than
Detailed results are below.
THINGS: 1000 Method 01: 0.88s. Speedup: 1.00 Method 02: 0.56s. Speedup: 1.56 Method 03: 0.09s. Speedup: 10.03 Method 04: 0.03s. Speedup: 34.51
THINGS: 10000 Method 01: 8.56s. Speedup: 1.00 Method 02: 5.32s. Speedup: 1.61 Method 03: 0.66s. Speedup: 13.07 Method 04: 0.15s. Speedup: 58.11
THINGS: 50000 Method 01: 42.88s. Speedup: 1.00 Method 02: 27.09s. Speedup: 1.58 Method 03: 3.48s. Speedup: 12.31 Method 04: 0.65s. Speedup: 65.91
THINGS: 100000 Method 01: 97.13s. Speedup: 1.00 Method 02: 61.38s. Speedup: 1.58 Method 03: 7.21s. Speedup: 13.48 Method 04: 1.26s. Speedup: 77.15
THINGS: 500000 Method 01: 490.28s. Speedup: 1.00 Method 02: 293.63s. Speedup: 1.67 Method 03: 46.82s. Speedup: 10.47 Method 04: 8.03s. Speedup: 61.03
This was tested locally on my 2018 MacBook Pro using docker with the
postgres:11.4 image for the database and the following
requirements.txt for the Django app:
Django==2.2.5 django-postgres-copy==2.4.2 Faker==2.0.2 pandas==0.25.1
A gist of the management command used for the comparison can be found here.