Code Documentation

A few years ago I had the pleasure of working with an extremely talented software engineer who would frequently post the same two-word comment at different points in my pull requests. Those two words: “basmati rice”.

This engineer was referring to this timeless tweet:

And he was noting that I had made a change to some code that invalidated a code comment without updating the comment.

I think about basmati rice often, generally as I’m thinking about the Problem of Documentation. The Problem of Documentation is that most documentation is bad. For a long time I believed that working on internal documentation was critical to having a high-functioning software engineering or data science team. I no longer believe that’s the case. In fact, I believe that you (yes you) are probably spending too much time on documentation.

I believe that documentation naturally entropies. We should accept that good human-generated documentation is very difficult and expensive to maintain, and that attempting to fight the natural entropy of documentation will be a losing proposition for most organizations. We should not treat documentation as an unalloyed good and instead think critically about which types of documentation make sense in a given context and which documents we actually need to work most effectively.

Important note: in this blog post I’m only talking about internal documentation, that is, documentation not designed to be published and shared with external parties. If you run an API-based software platform like Stripe or dbt, or if you maintain an open-source package, you need to have good docs in order to serve your customers. That’s a very different use case than internal documentation (and, generally, companies that do great external-facing documentation have teams of people working full-time on managing and maintaining the docs).

There are two paradigms for documentation, and if we want to think critically about the purpose that documentation serves and how we want to make use of the documentation, it’s critical that we differentiate between the two.

The two paradigms are:

  1. always-up-to-date description of how a system functions
  2. point-in-time explanations of why things work the way they do or why a decision was made

I contend that docs of type one should not be maintained by humans — the natural entropy of documentation destines all such documentation to become basmati rice. Rather, for docs of this type, we should rely on systems for automated documentation. 

My favorite flavor of automatically maintained documentation is tests or assertions. In a dbt project, I’d rather see an assertion for uniqueness and not-nullness than any description of what it means to be a primary key for a table. In a Python or R project, I’d rather look at the test cases to see the expected uses of a function than read the potentially-out-of-date documentation about that function.
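To make the idea concrete, here is a minimal Python sketch of tests acting as documentation. Everything in it is hypothetical and exists only for illustration: `normalize_email`, `check_primary_key`, and the sample `users` rows are not from any real project, and the primary-key check is a plain-Python stand-in for the kind of uniqueness and not-null assertion a dbt project would declare.

```python
def normalize_email(raw: str) -> str:
    """Hypothetical helper, used only for this illustration."""
    return raw.strip().lower()


def check_primary_key(rows: list[dict], key: str) -> None:
    """Assert that `key` is non-null and unique across all rows."""
    values = [row.get(key) for row in rows]
    assert all(value is not None for value in values), f"{key} contains nulls"
    assert len(values) == len(set(values)), f"{key} contains duplicates"


def test_normalize_email_expected_uses() -> None:
    # These cases show how the function is meant to be called and what it
    # returns; if the behavior changes, the test fails instead of going stale.
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"
    assert normalize_email("bob@example.com") == "bob@example.com"


def test_users_primary_key() -> None:
    # Hypothetical sample rows standing in for a real table.
    users = [
        {"user_id": 1, "email": "ada@example.com"},
        {"user_id": 2, "email": "bob@example.com"},
    ]
    check_primary_key(users, "user_id")


if __name__ == "__main__":
    test_normalize_email_expected_uses()
    test_users_primary_key()
    print("all assertions passed")
```

Unlike a prose description, these assertions cannot silently drift away from the code: a change that invalidates them breaks the build.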

However, there are other types of automated documentation systems, like Swagger / OpenAPI, which can generate documentation for a REST API. Developing systems and tools that allow for automated documentation generation is a much better use of time than hand-maintaining internal documentation.
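The same principle can be sketched in a few lines of Python. This is not Swagger or OpenAPI; it is a standard-library analogue that derives a reference entry from a function's live signature, so the description cannot fall out of step with the code. The `describe` helper and the `get_user` function are hypothetical names made up for this example.

```python
import inspect


def describe(func) -> str:
    """Build a short reference entry from a function's signature and docstring."""
    signature = inspect.signature(func)
    # Use only the first docstring line as the summary, if there is one.
    doc = inspect.getdoc(func) or "(no docstring)"
    summary = doc.splitlines()[0]
    return f"{func.__name__}{signature}\n    {summary}"


def get_user(user_id: int, include_orders: bool = False) -> dict:
    """Fetch one user record by id."""
    # Hypothetical body; a real implementation would query a data store.
    return {"user_id": user_id, "orders": [] if include_orders else None}


if __name__ == "__main__":
    print(describe(get_user))
```

If someone renames a parameter or changes a default, the generated entry changes with it, which is exactly the property hand-written docs lack.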

Docs of type two are much more suited to being created by humans, largely because there isn’t a “maintenance” component to those documents. That holds as long as we explicitly recognize that such documentation is designed to reflect the state of the world at the time the doc was written and is not guaranteed to apply as circumstances change in the future. The most common example of this type of documentation is just a corporate memo, but in software we also find it in request-for-comment or proposed-architecture-design docs. The entire git ecosystem (commit messages, pull request notes, etc.) recognizes that these documents are not meant to be eternal but rather apply to exactly one state of the codebase at one point in time.

This second type of docs shouldn’t encapsulate functionality, but rather logic: why was a certain design decision made, or a given architecture proposed and rejected? Those are helpful pieces of information but are necessarily “dated” insofar as the reasoning depends upon a certain set of circumstances that could change in the future.

I believe that we should recognize that entropy is a natural law of documentation, and we should stop chastising humans for not maintaining documentation. We should definitely not build documentation systems (e.g., data dictionaries) that assume that humans will maintain documents over time. Instead, we should focus our efforts on building tools that are self-documenting, either through tests and assertions or via standardization that allows for Swagger-style documentation generation.

We should encourage the writing of point-in-time documentation that we assume will not be maintained, but rather will continue to exist as a point-in-time artifact. We do need better tools for exposing and exploring those documents and linking them to the relevant sections of the code (a la git history), yet I see very few people talking or thinking about that problem (a more technical version of the knowledge-sharing problem).

I’ve come full circle on the documentation problem. I used to encourage everyone on my team to write code comments and to update the documentation with every pull request so that we could keep the docs up to date. I no longer think that effort has a positive ROI. I believe that we should use automated documentation as much as possible and then focus the rest of our efforts on capturing the intended functionality of a system at a given point in time.

About the author

By michael