This article will deliver its points from the point of view of the Django web framework, drawing on real cases from the development of Crowd+ by the PPL Nice PeoPLe team.
Automatic data seeding is the activity of providing "seed data", or initial data, to an application when it is first launched. This enables the application to be launched without beginning from scratch every time, which saves considerable deployment time. It is commonly associated with deployment processes, in which it has a role in:
- providing constant data which is needed by other data models (for example: list of banks, list of countries, etc.);
- providing default users either as a sample or fulfilling specific roles, such as default administrator users, initial sample users, etc.; and
- providing initial data to validate that the application can run correctly (these are usually deleted after the validation process is complete).
These activities usually involve importing initial data stored in various formats (JSON, YAML, and XML are some examples), loading it into memory (or into the database, if one is used), and then making it available to the application.
However, this article will not go through automatic data seeding in the context of deployment, but rather in the context of testing, where its advantages are often underrated yet important.
Why Does Testing Benefit From Automatic Data Seeding?
The use of automatic data seeding in the deployment context is well known, and most programming languages provide thorough documentation on the topic. However, applying the same technique to improve testing is often obscure and lacks proper exposure. This is why we must first understand the actual benefits of data seeding in a testing context.
Ensuring Data Dependencies for Testing
A large-scale application may depend on a complex web of data models that correlate to one another, and these relations are often what tests (both unit tests and integration tests) must cover. This is especially important when testing "Read" methods (one of the components of CRUD), where the actual data must be made available before it can be read, so we can test whether our implementation can actually, well, read it.
Let’s take the following entities as an example:
As shown in the above diagram, most data models depend on other data models, which means that testing a particular data model (excluding tests targeting only the User model) also requires initializing other models. This is especially complex when we want to test the relationship in which annotators can participate in projects, where we must fulfill the following dependencies:
- There should be testing data for Annotator and Project models to be able to actually test the weak entity that represents the relationship between the two models.
- To be able to create Project data, we must also provide testing data for the Project Supplier model.
- Since we need to use Annotator and Project Supplier models, we eventually need to also provide testing data for the User model as well.
In short, testing the relationship between projects and annotators requires building at least four dependent data models (five including the relationship itself). Now, let's see how we might implement these kinds of tests without using any data seeds:
Even when our test uses only one record for each model, we still need a rather lengthy procedure just to initialize the data. What if we want multiple records for each model? What if there are more correlated models? This is where automatic data seeding shines! Instead of creating the actual objects ourselves, we can create a bulk data format that stores the data's attributes and ask our framework to load them for us (more on that in a moment…).
Reducing Duplication for Multiple Test Suites
In most software architectures, users intending to access or modify data are served by multiple layers, each serving its own function: from providing views to the user, to processing incoming/outgoing data, to communicating with the actual database. This is in the spirit of the SOLID principles, especially single responsibility. It also poses a special challenge in testing.
To demonstrate this challenge, let’s see how the project-annotator participation relationship functions are implemented inside Crowd+ software architecture, which uses some form of MVC (Model-View-Controller) architecture:
In the spirit of test-driven development (TDD), every layer between the user and the database must be rigorously tested, which means there are three test suites, each of which must fulfill the data dependencies shown previously. This presents some problems if data initialization is coded manually (even with copy-pasting):
- In the case of a change in the underlying data model, that change will need to be propagated to at least three test suites (each of which may create multiple records from the data model, necessitating even more changes).
- The data initialization can violate the "Don't Repeat Yourself" principle, since the data is initialized with the same code in multiple test suites (especially if we resort to copy-pasting).
In most frameworks, automatic data seeding in tests can make use of a centralized data file, which the framework can be asked to bulk-load with a single, simpler instruction. If properly implemented, the initialized data can be fetched on demand using primary keys or other attributes if such needs arise.
As a simple additional motivation before we begin, here are some statistics showing the difference in code length for some test suites used by Crowd+ before and after adopting automatic data seeding:
Integrating Automatic Data Seeding to Django Tests
Automatic data seeding is implemented inside Django itself, so no additional requirements are needed. Its implementation is based on how Django handles migration and data seeding for application deployment, which uses fixtures. Fixtures are data files formatted so that Django can read them and insert their contents into the database in bulk. The default serialization format is JSON; XML is also supported. YAML is supported as well but requires an additional dependency (PyYAML):
pip install pyyaml
Most of the steps here will make use of Django's built-in tools through the manage.py file. This guide assumes that you already have an existing Django application and would like to integrate automatic data seeding into it.
Step 1. Migrate the Models to the Database
Since we will be using Django's own system to generate the fixtures, we need to migrate the models so that they show up in the database and we can add some data to them. In most cases, we would have done this routinely while developing a Django application, but it is wise to run the migrations once more to ensure that all models are loaded according to your specifications.
From your project’s root directory (where manage.py is), migrating can be done through the following commands:
python manage.py makemigrations
python manage.py migrate
The first command creates a "migrations file", which specifies what should be done to the project's database (creating, updating, or deleting tables). That file is used by the second command to actually commit the changes to the database. Now we are ready to create some seed data.
Step 2. Creating Seed Data
There are many ways seed data may be created. In its documentation on object serialization, Django specifies how fixture files should be formatted, so it is possible to create the seed data by hand. However, it is easier to use Django's own ORM to create the objects in a Pythonic way and then export them as fixtures; this is the method Crowd+ uses. It also helps ensure that all attributes are included, including automatic foreign-key translation and derived attributes.
Django provides direct access to the ORM through the Django shell, which can be accessed via the manage.py file using the following command:
python manage.py shell
The shell provides a CLI that can run arbitrary Python/Django code, including ORM access.
We could use the data that is currently loaded in the database, presumably accumulated through rigorous manual testing. However, if we want to create a fresh set of data, using a temporary database is preferred.
To create the objects, we can simply use the ORM's creation methods until our data needs are satisfied. For example, to create a User object (including saving it to the database), we can type the following into the shell:
Here are some tips on creating the seed data:
- If you use Django's default primary key system, try to make one or more attributes other than the primary key unique in your seed data. They do not need to be unique everywhere, just within your seed data. This will ease referencing later on. Some possible attributes include name or email.
- If your objects have foreign keys, satisfy the dependencies first before creating the object. For example, in the case of Annotator, User objects are created first and then associated during Annotator creation.
- Use the ORM's get methods to fetch objects when satisfying foreign keys. For example, when creating an Annotator object, the User model's ORM getter can be used to fetch the user to be associated.
- [Recommended!] Provide a lookup table for your seed data to be shared with the other developers. In the case of Crowd+, Google Sheets is used. This way, all developers know what seed data is available to use.
- Seed data creation can be semi-automated by exporting the lookup table as a CSV file, which can then be processed inside the shell as a "bulk object creator".
Step 3. Dumping / Exporting Data
Django has a built-in method to dump all data currently stored in the database. This method shares the same serialization/deserialization machinery as automatic data seeding, so its output can be used directly as a fixture. Dumping can be done as a whole (the entire database is exported) or targeted at a specific model. Dumping each model separately is recommended so that we can selectively use only the fixtures needed by each test suite.
Exporting data from a model can be done using the following command:
python manage.py dumpdata --format <preferred file format> --indent <preferred indentation level> --output <output file> <application name>.<model name>

# Example:
python manage.py dumpdata --format json --indent 4 --output user.json repository.user
Here are some explanations regarding the included options:
- --format json enables you to configure the file type of the fixtures. It actually defaults to json, but I recommend passing it explicitly to make the intended format clear.
- --indent 4 enables you to configure the number of spaces in one indentation level. It defaults to None, which disables indentation entirely and produces a single-line output. If you intend to read the file with your own eyes, I recommend setting this to your preferred indentation amount (the Crowd+ project uses 4 spaces).
- --output user.json enables you to export the data to a file of your choosing instead of the standard output.
- <application name> is the name of the application (not the project!) in which your model resides. This is used to find the model among your installed Django applications.
- <model name> is the model whose data you intend to export.
After Django is done exporting, you should see the file in your project's root directory (or the current working directory).
And here is a sample of the resulting fixture file for the User model (remaining fields and entries elided):

[
    {
        "model": "repository.user",
        "pk": 1,
        "fields": {
            "name": "Valid Active Annotator 1",
            ...
        }
    },
    ...
]
This step can be repeated for every model you intend to make fixtures out of.
Step 4. Configuring Fixtures Directory
By default, Django searches for fixtures in the fixtures folder of each of your applications. While this can suffice for small projects, it is cumbersome to duplicate fixture files across all applications that need them, negating the benefits of using fixtures in the first place. Therefore, it is wise to put the fixtures in a centralized folder, especially if many applications are going to use them.
Similar to the static and template files configuration, Django settings can include a list of folders in which Django should look for fixture files. This is achieved through the FIXTURE_DIRS variable. I recommend putting the fixtures in a fixtures folder that resides under an assets folder in the project's root directory (create it if it does not exist yet), like so:
<project root>
├───assets
│   └───fixtures    # We will use this folder
├───manage.py
└───...
To let Django know about this folder, add the following lines to the project's settings.py file:

# Test Fixtures
FIXTURE_DIRS = [
    BASE_DIR / "assets" / "fixtures",
]
Now that we have a folder to store our fixtures, we should move our fixture files into it.
Step 5. Loading Fixtures in a Test Suite
Django test case classes that inherit from TransactionTestCase (including the ubiquitous TestCase class) include support for fixtures by default, but will not use it until we specifically ask for fixtures. To use fixtures in our test suites, we need to add fixtures as a class variable. The content of this variable is a list of strings corresponding to the names of the fixtures to be used. The file extension is not required; Django will search for the files automatically.
For example, if we have project_supplier.json among our fixtures, we need to add a variable like the following:
Note: The order of the fixtures in the list should match the dependency flow (i.e. models that are dependent on other models should be put after their dependencies).
fixtures = ["user", "annotator", "project_supplier"]
# Some tests
This is why it is recommended to break the fixtures into multiple files: every test suite may need a different set of models. For example, tests for the User repository may only need the User model's fixtures, while testing the project-annotator relationship may use multiple fixtures.
Step 6. Using the Data in Test
Assuming there are no problems in loading the fixtures, all data contained in them will be available throughout the execution of the test suite, just as if we had created it manually. You can validate this by calling the ORM's getter methods during the test and inspecting the contents. In some cases, we do not need to address every object individually, so fetching them from the database inside the test suite is unnecessary. In other cases, however, we do need to address a specific object, such as:
- for testing creational methods for models that depend on a foreign model;
- for testing update/delete methods for a targeted object; and
- for an in-depth matching of an object returned by the code that we are testing.
To do this, we can use the ORM's getter methods to fetch the specific objects we need through one of two approaches:
- Referring to one or more objects through their primary keys, which can be looked up in the fixtures file (in the case of generated PKs) or in your lookup table.
- Referring to them through our designated "unique" attribute, provided that you have planned ahead.
Even though we still need to store the objects in variables, it can now be done in simpler terms and without initializing the data from scratch for every test suite.
The following is a code excerpt from the Crowd+ backend testing kit, showing the difference between initializing data manually and using automatic data seeding in a test suite's setUp method.
Using Automatic Data Seeding
When using fixtures, we do not need to provide arguments for creating the User and Annotator objects we intend to use. We simply include the appropriate fixtures and fetch the appropriate objects. In this case, we used the user's name as a "referral attribute" for the User model and the associated user for the Annotator model.
Conclusion and Key Takeaways
We finally arrive at the end of this article. Here are some key takeaways:
- Automatic data seeding, if implemented properly, improves scalability and maintainability of test suites by centralizing object attributes in a reusable file.
- Automatic data seeding comes with a noticeable "warm-up" cost to prepare the data, but yields significant improvements in test suite creation, especially for large projects.