I’m a big fan of test-driven development for lots of reasons. In addition to helping you clearly define your goals in advance and encouraging you to write as little code as possible, I love the documenting nature of tests. If your project is well tested, you don’t need documentation (at least for other developers) — they can simply look at the tests and observe the enforced behavior of the code.
Unfortunately, testing SQL (or any logic that works on data in a tabular format) is very difficult. There are a lot of reasons for this, but the two I think about most are:
- because tabular data tends to represent lots of complicated concepts bound together, it takes a lot of work to generate realistic test-cases
- the logic expressed in SQL tends to be much more complicated than what is normally expressed in a small function amenable to unit-tests, so every test ends up being an “integration” test.
I work a lot in (and am a huge proponent of) the extract-load-transform (ELT) framework which involves writing complex data transformations in SQL that are then run against raw data stored in the data warehouse (generally via tool like dbt). dbt really improved the ability to test ELT pipelines by allowing developers to specify assertions about the transformed data. Most commonly things like “uniqueness” and “not-nullness” of certain columns, but dbt also allows the developer to specify arbitrarily complex assertions about the transformed data. This is a huge boon, but in almost all implementations I’ve seen, the tests are run against production data after the transformations have been applied rather than against a dedicated test database.
In this paradigm, checking to see if a new transformation obeys the tests often means running the transformation against the full production database in order to validate it. This can be slow and tedious if your transformations take a long time to run. Even on small projects, executing a pipeline and testing it can take upwards of 10 minutes — this is way too slow for doing real test-driven development where ideally you’d like feedback on the order of seconds.
In order to support a more test-driven development workflow for data warehouse development, I believe we need a tool for better generating useful test data. These test data should have the following properties:
- Match the production database schema exactly
- Contain the minimum volume of data required for the tests (so that everything runs as fast as possible)
- Not contain sensitive production data
Most importantly, it should be easy to generate useful “test cases” for doing test-driven development — having to write tedious insert statements to generate the test data for a database just isn’t going to cut it. (If you’ve ever tried doing this, you know that it is extremely tedious and error-prone and not conducive to an efficient developer workflow.)
So, how should we generate these test cases? In my experience, when I’m working on an ELT pipeline, my workflow looks something like:
- Identify a bug because some report “looks wrong”.
- Track down a certain user that exhibits some weird pathological data problem (“oh dang I didn’t know you could have one order shipped to two different addresses”)
- Write ELT code, run it against the production database, repeatedly query the tables filtered to that pathological user to see if the code actually fixes the issue.
- Finally deploy the code, ideally with at least a commit message that references the pathological user id.
Of course, this workflow is not the best! Ideally, we’d have a test case for that pathological data case, but because the overhead of generating data that matches the case is so high, in general, it doesn’t happen.
In my ideal tool, we’d be able to quickly generate test data based on a pathological example from the production database. So the workflow would look something like:
- Identify the pathological user/order
- Tell the tool to copy and anonymize all of the relevant production data for that user into the test database
- Write an assertion that specifies the expected behavior of the data for that user post-transformation
- Run proposed ELT code against the minimal test database, and run all assertions
- Once all of those tests pass, we can push the code to our CI tool which can run the assertions against the full production database
The critical feature here is how we identify and copy pathological test data from the production database in an anonymized way that will allow for using it as a test case. This hooks directly into the current developer workflow, adds the benefits of a true test-driven development workflow, and also builds up the test suite so that it more fully covers all of the potential logical regressions we want to guard against.
Are you doing TDD in ELT? Know of a tool that can generate test data in the way I’ve outlined? I’d love to chat.