Data catalogs are the hot new thing in the data space. Lots of people are starting companies to solve the problem of “data discoverability,” and plenty of investor money is flowing into those efforts.
While I’m all for more people innovating on tooling in the space, I’ve been a bit skeptical of a lot of these tools and so I want to use this blog post to try to put into words why that’s the case.
The first time that I spoke with someone working on one of these tools, they led off their pitch by asking me “have you ever had the experience where you are trying to start on a data science problem and you just can’t locate the data that you need to get started?”. I answered honestly:
“No, I’ve never really had that problem.”
This really threw my interlocutor for a loop — they weren’t sure how to continue the conversation. They had come from a large, data-first organization whose data was messy and stored in lots of different places (Hadoop, multiple data warehouses, etc.), and so they were building a tool to help data teams document all of the data in all of those different locations so that data scientists could make use of it. In my experience at a (big!) e-commerce company, however, we just didn’t have that problem. Because we rigorously maintained all of our data in a central data warehouse and were diligent about pruning away old relations that were no longer useful or maintained, we didn’t have much use for a “data catalog”.
Helping people find the data they need is a noble goal! But my view is that putting a data catalog on top of a disjointed and poorly-maintained data system is just putting a Band-Aid on a bullet wound. The real problem in that context is the poor organization of the data itself! We shouldn’t invest effort in documenting that bad data organization; rather, we should focus on the foundation-level work of cleaning and organizing all of the data into one coherent system (why on earth do you have multiple data warehouses??).
In my experience, if the warehouse is built correctly, most businesses can do 80% of their analysis on a handful of “core” tables — roughly ten of them. Even Spotify, a big business with lots of data (and well-loved data products), says as much in their blog post about their data quality journey:
Our first analytics engineering project was to review the landscape of projects and business needs, then consolidate the data into a handful of tables that would cover 80% of the data scientists’ needs.
If 80% of data scientists’ needs are covered by a handful of well-maintained and well-documented tables, then the need for a “data catalog” solution seems much less pressing.
My worry is that practitioners who are facing this problem today — they see that their analysts or data scientists are having a hard time finding the data that they need — will turn to a data catalog as the solution to their problems. However, I believe this problem indicates that they’ve probably done a bad job of building and maintaining their core data model, and so that’s where they should focus their effort.
Slapping a data catalog on top of their data might help alleviate some of the pain, but what it’s actually doing is papering over the real structural problem that the data team will have to deal with at some point. I worry that by delaying the real foundational work that needs to happen, the team only makes that work more difficult and expensive in the long run.