Fixing Django Fixtures

July 2, 2019 by Chris Hranj

Here at GIPHY we have a lot of services. These services vary in language, visibility, and scale. 

Since I started a few months ago, I’ve been working on a newer internal service called Bouncer. Bouncer is a CMS built on top of the Django framework and primarily talks to another internal Scala service. When I started working on Bouncer, my team and I ran into a recurring issue with generating sandbox data that consumed a lot of our time. After researching multiple approaches I was able to solve this problem by combining a number of tools/libraries into one clean implementation.

This post assumes some working knowledge of the typical Django app structure. If a refresher is needed, the Django Tutorial (which I admittedly referenced while writing this post) should cover it.

The Problem

Spinning up a local instance of Bouncer was simple thanks to the wonders of Docker. However, Bouncer was difficult to develop and test locally because new containers would connect to an empty database and present a sparse user interface.

We needed a bunch of sandbox data to work with, but generating useful, human-readable data was difficult and time-consuming without a solid understanding of Bouncer’s complicated data models. Additionally, I wanted the output to be replicable and idempotent, so new engineers could run the same script and share the same starting point.

Note: If you want to skip the journey and just see the code, head to the ‘Implementation’ section below, or check out some example code on GitHub.

Attempted/Possible Solutions

SQL Dumps

My initial approach was to load a SQL dump from Bouncer’s staging environment into my local database container.

The SQL dump loaded successfully but also produced a bunch of errors (e.g., relation already exists, random sequence issues). A lot of the data in Bouncer’s staging environment had survived a ton of migrations and manual manipulation, so my gut feeling was that tracking down these tiny issues would not be worth the time and effort.

Additionally, most of the data in the staging environment looked like nonsensical garbage, as it was quickly and manually created to test one specific feature of Bouncer (it is staging, after all).

Perhaps it was my OCD, but this approach left me with a lot of insecurity about the accuracy and usefulness of my data, so I searched for another solution.

A Pre-Loaded Docker Image

Since Bouncer runs on Docker locally, another possible solution was to create a Docker image with the necessary data already loaded into the database. Developers could then simply pull down the pre-loaded database image and be ready to go. However, this approach was not ideal for the following reasons:

  1. What if a new use case required additional data for testing? We would then face the same inconvenience of manually creating data with no standardized way to do so.
  2. Any time the data model changes, a new Docker image would need to be built, pushed, and pulled down by all the other Bouncer devs. Bouncer is a newer service undergoing refactors, so the data model is constantly in flux.
  3. What if Bouncer wasn’t Docker-ized? If we ever decided to move away from Docker we would be forced to revisit this problem.
  4. Selfishly, I’d prefer to write Python scripts over Docker commands any day. 😉

Fixtures

A fixture (in the Python/Django world) is a file of serialized data that can be loaded into the database, typically for testing. Fixtures are most commonly written in JSON, although they can also be XML, YAML, etc. (see more info on fixtures here).

I was convinced fixtures would be the solution to my problem. The Django documentation even recommends them. Basic use cases, like a single unit test with a simple data model, can and should be dealt with using fixtures.

Fixtures become troublesome as soon as you start dealing with increasingly complex data models. “Complex” in this case refers to models containing datetimes, UUIDs, and especially foreign keys. Bouncer’s data model contains all of these, including several one-to-many and many-to-many relationships.

After I tried to populate my database using fixtures, I came across the following issues:

  1. Each time the underlying data models change, all of the JSON fixtures need to change as well to stay consistent with the new data models.
  2. Since fixtures deal with JSON and not Python, they are limited to a few simple primitives to represent many types of data. For example, primary and foreign keys are typically hard-coded values (usually integers). As soon as the fixtures require more than a few objects, it becomes difficult to maintain these complicated nests of foreign keys.
  3. I’ve found that when working with ORMs and foreign keys, it’s dangerous to assume the initial state of the database. Fixtures rely on a clean and consistent starting state of the database. If there is stale data lingering in the database (e.g., from a unit test that didn’t clean up), the fixtures will probably fail to load.
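
To make the second point concrete, here is a rough sketch of what a JSON fixture for a small books app might look like, built with Python's json module. The layout follows Django's serialization format, and the books app with its Publisher, Author, and Book models is the example used in the Implementation section of this post. Every value (names, title, date) is made up for illustration, and find_dangling_publishers is a hypothetical helper that only exists to show how easily a hard-coded key can drift.

```python
import json

# A fixture is a list of serialized objects. Every relationship is expressed
# through hard-coded primary keys that have to be kept in sync by hand.
fixture = [
    {"model": "books.publisher", "pk": 1, "fields": {"name": "Example Press"}},
    {
        "model": "books.author",
        "pk": 1,
        "fields": {"first_name": "Ada", "last_name": "Example"},
    },
    {
        "model": "books.book",
        "pk": 1,
        "fields": {
            "title": "An Example Title",
            "authors": [1],   # many-to-many: list of Author primary keys
            "publisher": 1,   # foreign key: Publisher primary key
            "publication_date": "2019-07-02",
            "isbn": "978-0-273-85745-7",
        },
    },
]


def find_dangling_publishers(objects):
    """Return pks of books whose publisher key matches no publisher object."""
    publisher_pks = {o["pk"] for o in objects if o["model"] == "books.publisher"}
    return [
        o["pk"]
        for o in objects
        if o["model"] == "books.book"
        and o["fields"]["publisher"] not in publisher_pks
    ]


# This string is what would be saved to a .json fixture file.
fixture_json = json.dumps(fixture, indent=2)

# Renumbering the publisher without touching the book breaks the reference.
broken = json.loads(fixture_json)
broken[0]["pk"] = 2
```

With only three objects the keys are easy to track by eye. The bookkeeping grows quickly once dozens of objects reference each other.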


A Better Solution

Helper Libraries to the Rescue


After further research, I discovered that the problem can be solved 100% programmatically with the help of two great libraries: factory_boy and faker.

– factory_boy is for fixture replacement/generation

– faker is for generating fake, human-readable data.

factory_boy depends on faker already, so technically only one package is needed. This package and all its dependencies can be installed via the following command:

$ pip install factory_boy

These are some of the benefits of this approach:

  1. Since we’re populating the database programmatically (i.e. using Python), the data can be sourced from anywhere. Future iterations could make network calls or read files to gather data.
  2. The factories can be organized in a framework-agnostic manner so they can be used elsewhere, while still integrating easily into Django through a custom django-admin command, as demonstrated below.


Implementation

Let’s see some code already. We’ll use a simple data model borrowed from the Django Book to make it easier to explain our approach:

from django.db import models

class Publisher(models.Model):
    name = models.TextField()

class Author(models.Model):
    first_name = models.TextField()
    last_name = models.TextField()

class Book(models.Model):
    title = models.TextField()
    authors = models.ManyToManyField(Author)
    publisher = models.ForeignKey(Publisher, on_delete=models.PROTECT)
    publication_date = models.DateField()
    isbn = models.TextField()

Note that the code above is using Django’s built-in ORM to represent its data models. If you’re using a different ORM like SQLAlchemy your models will differ slightly.

Working with factory_boy

The first step is to re-create the models above as factories using the factory_boy library. Note that a factory in the factory_boy sense is distinct from the general design pattern of encapsulating object creation, which also goes by the name ‘factory.’

Create a new file called factories.py in the same directory as your models (mysite/books/factories.py in this example) and add the following code:

import factory
from books.models import Author, Book, Publisher

class PublisherFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Publisher

    name = factory.Faker('company')

class AuthorFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Author

    first_name = factory.Faker('first_name_female')
    last_name = factory.Faker('last_name_female')

class BookFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Book

    title = factory.Faker('sentence', nb_words=4)
    publisher = factory.SubFactory(PublisherFactory)
    publication_date = factory.Faker('date_object')  # DateField expects a date, not a datetime
    isbn = factory.Faker('isbn13', separator="-")

    @factory.post_generation
    def authors(self, create, extracted, **kwargs):
        if not create:
            return

        if extracted:
            for author in extracted:
                self.authors.add(author)

Each factory above defines the Django model it is based off of, as well as the fields in that model that need to be populated. The calls to factory.Faker tell factory_boy to use the Faker library to generate fake data for that field (discussed more below). 

The most difficult piece of code above to understand is the authors function in the BookFactory class. Since authors is a ManyToManyField on the Book model, it requires special functionality in order to be represented inside a factory. You can find more details on how that works here.


Generating Fake Data with faker

The factories above make use of factory.Faker, a class in factory_boy that wraps the faker library. Faker is conceptually broken up into a number of “providers”. A provider is essentially the “type” of data that is being faked.


Example providers include addresses, credit cards, phone numbers, etc. (full list of faker providers here).

The factories in the code above use a number of different providers to match the type of data the model would contain. For example, the isbn field in the BookFactory above uses the isbn13 method of the isbn provider because it most closely matches the data being modeled. You can play around with these factories in the Django shell:

chranj@~/django-fixture-example/mysite$ ./manage.py shell
>>> from books.factories import BookFactory
>>> BookFactory.create()
<Book: Book object (1)>

As seen above, calling create() on a factory builds a model instance and saves it. Now we can use the ORM to query for books:

>>> from books.models import Book
>>> Book.objects.all()
<QuerySet [<Book: Book object (1)>]>

Notice that a Book was returned even though we never called the Book model directly. factory_boy is actually creating and saving this object to the database under the hood. How cool is that!? If you inspect this object you’ll see factory_boy also created a Publisher (via the SubFactory) and populated all of the fields on both models with random data:

>>> book = Book.objects.first()
>>> book.publisher.name
'Lowe-Curtis'
>>> book.isbn
'978-0-273-85745-7'

Since authors is a ManyToManyField we need to manually create an Author first using the AuthorFactory and pass it into the BookFactory. Create another Book to test this:

>>> from books.factories import AuthorFactory
>>> book = BookFactory.create(authors=[AuthorFactory.create()])
>>> book.authors.first().first_name
'Cynthia'

(More details on Faker can be found here).

Custom django-admin Commands

Now that the necessary models and factories exist, it’s time to automate the use of them in a reusable fashion. The best way to do this is with a django-admin command. We’ll start by integrating factory_boy into django-admin.

Implementing django-admin Commands

django-admin looks for custom commands in the management/commands/ directory inside a Django app. Each command should go in its own module, and the module name (which becomes the command name) should be one word in all lowercase.

For example, if your app’s name is books and your command is populatebooks, you would create a new file in books/management/commands/populatebooks.py. Create this file (change the naming to fit your needs) and insert the following:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Populate mysite with sample data. Generates books, authors, and publishers."

    def handle(self, *args, **options):
        self.stdout.write("Inside populatebooks!")

Custom django-admin commands must inherit from the BaseCommand class (or one of its subclasses) and must override the handle method, as seen above.

Running Custom Commands

Before building out populatebooks, let’s take a second to test that the django-admin command runs. If Django is running locally, it can be tested using the following:

$ ./manage.py populatebooks

In a Docker-ized environment, the command would look more like the following:

$ docker exec -it CONTAINER_NAME sh -c './manage.py populatebooks'

If everything is configured successfully, "Inside populatebooks!" should be printed to the terminal.

Building A Robust populatebooks

Now let’s integrate the factories we created earlier into this new populatebooks command. Start by importing the factories at the top of populatebooks.py and defining a _load_fixtures method on the Command class:

# At the top of populatebooks.py
from books.factories import AuthorFactory, BookFactory

# Inside the Command class
def _load_fixtures(self):
    author1 = AuthorFactory.create()
    BookFactory.create(authors=[author1])

Then update the existing handle function to call _load_fixtures instead of just printing to the terminal.

# At the top of populatebooks.py (CommandError must be imported as well)
from django.core.management.base import BaseCommand, CommandError
from django.db import transaction

# Inside the Command class
def handle(self, *args, **options):
    try:
        with transaction.atomic():
            self._load_fixtures()
    except Exception as e:
        raise CommandError(f"{e}\n\nTransaction was not committed due to the above exception.")

We are wrapping the call to _load_fixtures in a try/except block as well as an atomic transaction, so if anything goes wrong during its execution, we can roll back and won’t be left with a database in a partially populated state.
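
The all-or-nothing behavior described above can be seen in miniature without Django. The sketch below swaps in the standard library's sqlite3 module as a stand-in for transaction.atomic(): its connection context manager commits on success and rolls back on an exception. The book table and the load_all_or_nothing helper are invented for the example.

```python
import sqlite3


def load_all_or_nothing(conn, titles):
    """Insert every title, or none of them if any single insert fails."""
    try:
        # Used as a context manager, the connection commits when the block
        # finishes normally and rolls back when the block raises.
        with conn:
            for title in titles:
                conn.execute("INSERT INTO book (title) VALUES (?)", (title,))
    except sqlite3.IntegrityError as exc:
        return f"rolled back: {exc}"
    return "committed"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT NOT NULL UNIQUE)")
conn.commit()

first_result = load_all_or_nothing(conn, ["A", "B"])
# "A" already exists, so the second insert fails and "C" is discarded too.
second_result = load_all_or_nothing(conn, ["C", "A"])
row_count = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]
```

After both calls the table still holds only the two rows from the first batch, which is the same guarantee the atomic block gives populatebooks.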

Save these changes and run the django-admin command as many times as you want, just as you did above. Once it finishes, you can check the database and marvel at the new, randomly generated books:

chranj@~/django-fixture-example/mysite$ ./manage.py populatebooks
chranj@~/django-fixture-example/mysite$ ./manage.py shell
>>> from books.models import Book
>>> Book.objects.all()
<QuerySet [<Book: Book object (7)>, <Book: Book object (8)>, <Book: Book object (9)>]>
>>> Book.objects.get(id=7).title
'Blood value minute.'

Congratulations! You’ve just populated your database with randomly generated data using a custom django-admin command! 

Additional Steps

The following sections are optional but recommended, and shouldn’t take too long to implement.

Wiping the Database Before Insertion

It’s more than likely that you’ll want to wipe the database before running a command like populatebooks, but for safety reasons this functionality should be optional. Note that django-admin actually has a flush command that does this, but using it can have unintended consequences as it will wipe everything. Implementing your own wipe logic gives you complete control over what data gets dropped. Start by overriding add_arguments on the Command class to accept an optional --clean flag:

def add_arguments(self, parser):
    parser.add_argument(
        '--clean',
        help='Wipe existing data from the database before loading fixtures.',
        action='store_true',
        default=False,
    )

Next add a _clean_db function to take care of deleting objects from the desired models. In this case we want to delete all Author, Book, and Publisher objects:

# At the top of populatebooks.py
from books.models import Author, Book, Publisher

# Inside the Command class
def _clean_db(self):
    # Book comes before Publisher because Book.publisher uses on_delete=PROTECT.
    for model in [Author, Book, Publisher]:
        model.objects.all().delete()

This approach can get a little hairy depending on the complexity of relationships between models and the database backend you are using. For the example data model used throughout this post, simply importing the models and deleting them in a loop was sufficient, but you might need to experiment.

Then update the handle function to check for the --clean argument and call the _clean_db function if passed:

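def handle(self, *args, **options):
    try:
        with transaction.atomic():
            if options['clean']:
                self._clean_db()
            self._load_fixtures()
    except Exception as e:
        raise CommandError(f"{e}\n\nTransaction was not committed due to the above exception.")

Once all the code above is added, it can be tested by running populatebooks with the --clean flag. Keep in mind this will wipe any existing books in your database: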

$ ./manage.py populatebooks --clean

If you jump into the database you should now see only the data created during the most recent execution of populatebooks.


Seeding Fake Data

One of the original requirements of this work was to make Bouncer more portable and its data replicable. Because Faker relies on Python’s random module, our populatebooks command is not idempotent, meaning that if it runs multiple times in a row (with the --clean flag), the resulting data will look different each time.

We can fix this issue by explicitly setting the random module’s seed.
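
The principle is easy to check with nothing but the standard library. The sketch below does not call factory_boy or faker at all; fake_titles and its word list are made up for the example, and a private random.Random instance plays the role of the shared generator.

```python
import random

WORDS = ["blood", "value", "minute", "river", "signal", "garden"]


def fake_titles(seed, count=3):
    """Generate pseudo-random three-word titles from a seeded generator."""
    rng = random.Random(seed)  # string seeds are accepted
    return [
        " ".join(rng.sample(WORDS, 3)).capitalize() + "."
        for _ in range(count)
    ]


first_run = fake_titles("mysite")
second_run = fake_titles("mysite")           # same seed, same titles
different_seed = fake_titles("this is a seed")
```

Two runs with the same seed walk through the same sequence of pseudo-random choices, so they yield identical titles. Changing the seed starts a different sequence.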

Add another argument inside add_arguments, just as we did above:

parser.add_argument(
    '--seed',
    help='The initial seed to use when generating random data.',
    default='mysite',
    type=str,
)
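
Django hands add_arguments a parser built on Python's argparse, so these options behave like ordinary argparse arguments. The standalone sketch below uses a plain argparse.ArgumentParser (standing in for the parser Django supplies) with the --clean and --seed definitions from this post, to show roughly what ends up in the options dictionary. In a real command, Django adds its own entries such as verbosity alongside these.

```python
import argparse

# Stand-in for the parser that Django passes to add_arguments.
parser = argparse.ArgumentParser(prog="populatebooks")
parser.add_argument(
    "--clean",
    help="Wipe existing data from the database before loading fixtures.",
    action="store_true",  # flag takes no value; present means True
    default=False,
)
parser.add_argument(
    "--seed",
    help="The initial seed to use when generating random data.",
    default="mysite",
    type=str,
)

# No flags given: both options fall back to their defaults.
default_options = vars(parser.parse_args([]))

# Equivalent to: ./manage.py populatebooks --clean --seed "this is a seed"
custom_options = vars(parser.parse_args(["--clean", "--seed", "this is a seed"]))
```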

Next create a _set_seed function which will seed the random engine used by both factory_boy and faker (read more about this here):

import factory.random

def _set_seed(self, seed):
    self.stdout.write(f"Using seed '{seed}' for randomization.")
    factory.random.reseed_random(seed)

Then update handle again to grab the --seed option and pass it into _set_seed as such:

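def handle(self, *args, **options):
    try:
        with transaction.atomic():
            if options['clean']:
                self._clean_db()

            seed = options.get('seed')
            self._set_seed(seed)

            self._load_fixtures()
    except Exception as e:
        raise CommandError(f"{e}\n\nTransaction was not committed due to the above exception.")

Now populatebooks can be called as follows: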

$ ./manage.py populatebooks --clean --seed "this is a seed"

If you call populatebooks multiple times or even from a completely different machine with the same seed (and same dependency versions), you should end up with the same exact data in the database. The only thing to keep in mind is that if you pass a seed but don’t also pass the --clean flag, you may run into database errors depending on the unique constraints on the data model.


Other Ideas

This blog post could go on forever with ideas on how to expand the populatebooks command. Some other features that I’ve implemented personally since starting this post include:
    – automating the creation of an admin user so there’s no need to run Django’s createsuperuser
    – checking the current environment that populatebooks is running in so it never runs in production
    – dumping the contents of the database to a file when the --clean flag is used in case the wipe needs to be reverted
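
The environment check could be sketched roughly as follows. This is not the check Bouncer uses: the APP_ENV variable name, the set of blocked values, and the ensure_safe_environment helper are all assumptions made for illustration.

```python
import os

# Hypothetical names: adjust to however your deployment labels environments.
BLOCKED_ENVIRONMENTS = {"production", "prod"}


class UnsafeEnvironmentError(RuntimeError):
    """Raised when a destructive command is attempted in a blocked environment."""


def ensure_safe_environment(environ=None):
    """Return the current environment name, or raise if it is blocked."""
    environ = os.environ if environ is None else environ
    # An unset variable is treated as a development machine in this sketch.
    name = environ.get("APP_ENV", "development").strip().lower()
    if name in BLOCKED_ENVIRONMENTS:
        raise UnsafeEnvironmentError(
            f"Refusing to populate the database in '{name}'."
        )
    return name
```

A command could call this at the top of handle and convert the exception into a CommandError. Treating an unset variable as development is a convenience. A stricter variant would refuse to run unless the environment is explicitly marked as safe.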

See what you can come up with!


Concluding Thoughts

After adding factories and a custom django-admin command to Bouncer, it has become a more portable service and its on-boarding time has decreased significantly.

One could argue that this solution was heavily over-engineered (and I would probably agree), but this was a great exercise for learning about custom django-admin commands, working with the ORM, and using some awesome third-party Python libraries.

Hopefully the information in this post will be useful in helping you build out your own data generation utilities.

If you have feedback or run into errors, typos, etc. in the code, feel free to reach out to me via email (chranj@giphy.com) or on Twitter.

Thanks for reading!

– Chris “Brodan” Hranj, Ad Products Engineer



Resources

Thanks to the following blog posts, StackOverflow answers, and documentation which I learned from and referenced while writing this post:

Factory Boy as an Alternative to Django Testing Fixtures
Populate Django database on StackOverflow
Random.seed(): What does it do? on StackOverflow
The Django Book