Because I’m a nerd, I end up frequently talking with people about what the biggest opportunity in the data space is. While there are tons of people working on the next git-for-data and different flavors of data catalogs, the problem that I actually feel is most pressing — but that very few people seem to be working on — is a consistent means of publishing, reproducing, and iterating on knowledge within an organization.
The problem is simple, and I’ve seen it recur at every organization I’ve worked with. Here’s roughly what the research analytics life-cycle looks like:
- Someone poses an interesting question
- Some smart person goes off to answer it
- They collect data, do some type of analysis (maybe statistical, maybe not)
- They write up their results in some sort of document (code notebook, memo, PowerPoint)
- They share their results and decision-makers nod along pleasantly
- That report disappears into the dark void of internal documentation, never to be seen or heard from again
This is a bummer! It’s especially a bummer because, generally, the most common types of analyses I’ve seen done in this context actually do have repeat-use value. Whether it’s “refreshing” an analysis to see if some predictions actually came true or updating some beliefs after important changes to the underlying business — these analyses have the potential to be useful well into the future, but today they simply aren’t, because we don’t have good tools for accessing, reviewing, and searching them.
When I think of the biggest hole draining the data-value bucket, this is the one I think most needs to be plugged. However, likely because it’s more challenging than building an nth version of a data-dictionary, I haven’t seen much exciting progress in the space.
The obvious exception, surely familiar to those paying attention to such things in the data space, is the open-source project built and maintained by Airbnb, the Knowledge Repo. Before we move ahead, I’ll admit that I believe the Knowledge Repo is a huge step forward in the space, and it tackles a lot of the hard problems that I’m going to lay out. However, I find its usability quite a bit lacking. I really believe that a decently funded startup could borrow (cough cough) a lot of the ideas they’ve developed and turn them into a product that’s seamless to use and delivers on the original promise of the project.
So what are the core problems that need to be addressed?
- Findability — the tool we’re looking for needs to be easily searchable, such that people in the future interested in reviewing the state of the art of internal analysis can easily find what they’re looking for. You could imagine searching for methods (“kaplan meier”) or software libraries (“bsts”) or topics (“customer churn”) — a researcher starting on a project or a decision-maker looking for information should be able to quickly browse through historic research efforts related to their topic of interest.
- Versioning — the tool should prominently display how recently an analysis was conducted and any other historic versions of an analysis. The nature of this sort of research work within a company (as opposed to in academia) is that we often see research that is “refreshed” at some regular cadence — a great example is something like a “brand awareness survey” that’s conducted once every N months. In this tool, we want to be able to see the most recent results as well as browse through older versions of this same analysis.
- Inspectability — the tool should make it possible for other researchers to see how the research was performed — ideally showing the code that was used, so that it’s possible to view directly what assumptions were made.
- Reproducibility — finally, the tool should empower reproducibility as much as possible. If we can package up an analysis with its transformation and analysis code and a data as-of date, we can make the work reproducible for future investigators (or, at least, point them in the right direction).
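To make the four requirements above a bit more concrete, here’s a minimal sketch of what an “analysis manifest” might look like — a hypothetical data structure (every name here is mine, not from any real tool) that captures just enough metadata to support findability, versioning, and reproducibility:

```python
# Hypothetical sketch: a minimal "analysis manifest" capturing the metadata
# needed to find, version, and reproduce an internal analysis.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class AnalysisManifest:
    title: str
    tags: List[str]                   # searchable methods/topics, e.g. ["bsts", "customer churn"]
    data_as_of: date                  # the snapshot date the analysis ran against
    code_entrypoint: str              # script or notebook that reproduces the result
    version: int = 1
    supersedes: Optional[str] = None  # id of the prior version, if this is a refresh

    def matches(self, query: str) -> bool:
        """Naive findability: case-insensitive match against title and tags."""
        q = query.lower()
        return q in self.title.lower() or any(q in t.lower() for t in self.tags)

# A third "refresh" of a recurring brand-awareness survey:
m = AnalysisManifest(
    title="Brand awareness survey, Q2",
    tags=["survey", "brand awareness"],
    data_as_of=date(2020, 6, 30),
    code_entrypoint="notebooks/brand_awareness.ipynb",
    version=3,
    supersedes="brand-awareness-q1",
)
print(m.matches("brand awareness"))  # True
```

The point of the sketch is that even a tiny, structured layer like this — rather than a free-floating PowerPoint — is what makes searching, linking versions, and rerunning an analysis tractable.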
While these are tricky problems, the Knowledge Repo effort takes a great stab at these issues by combining notebooks with a metadata layer that allows for linking different “versions” of an analysis and including additional metadata that could be useful in the future.
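For a sense of what that metadata layer looks like in practice, Knowledge Repo posts lead with a YAML header roughly along these lines (the field names and values here are illustrative, not an exact copy of its schema):

```yaml
---
title: Brand awareness survey, Q2
authors:
- jane.analyst
tags:
- survey
- brand-awareness
created_at: 2020-07-02
updated_at: 2020-07-10
tldr: Illustrative one-line summary of the result goes here.
---
```

Everything the tool needs for search, attribution, and version history rides along in that header, while the body of the post remains an ordinary notebook or markdown document.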
I would love to see a company tackle this and put real resources into building a better version of the tool — I think the value it could create should be sufficient to support a venture-backed enterprise, and it bums me out that I keep not seeing it happen.
If you’re working in this space (or are just as enthusiastic as I am), please reach out! I’d love to chat and see what you’re working on.