Defensive Programming
It’s easy to imagine that the key to writing code free from bugs is to just be careful. And indeed, in many circles — such as the social sciences — this is still how most people approach their programming.
The problem, however, is that the “just be careful” mindset doesn’t work. The empirical reality — discovered by computer scientists and professional programmers decades ago — is that people are just bad at writing code. Humans are accustomed to natural languages, which are flexible and forgiving. In conversation, even if you misspeak, people can generally figure out what you mean from context clues and the redundancy inherent in speech. But code — and computers — don’t work like that. Every bit of code is taken literally and interpreted in an unforgiving manner. And that means that — no matter how careful you are — you’ll end up with mistakes in your code.
It’s not hard to see the folly of “just be careful” thinking in fields like the social sciences. In recent years, a wave of social science papers has turned out to have problems, not because of flawed theory or poor choices of statistical model, but because of simple programming errors. In perhaps the most embarrassing case, Steven Levitt (co-author of the acclaimed Freakonomics and winner of one of the most prestigious awards in economics) had a paper on the politically explosive (at least in the US) topic of abortion and crime turn out to be wrong because he hadn’t put a set of controls into a regression he thought he had (the story may require sign-in, though I think you can read it without paying). Moreover, as some of my own work has shown, the replication packages that accompany a substantial number of political science papers have problems, often generating results that do not match those in the published paper.
So if we are so inherently and unavoidably error-prone, what can we do to avoid publishing incorrect results or advising stakeholders to do the wrong thing?
In short, we have to learn to program defensively, which means adopting a set of practices designed to:
Minimize the likelihood you will make an error in the first place, and
Maximize the likelihood that you will catch any error you make.
What are these practices? In this reading, we’ll discuss four practices that are of particular interest to data scientists who focus on doing “data analysis” data science — that is, data scientists who are trying to answer a specific question about the world and who are writing code designed to answer that question (but which will not necessarily be shared or deployed at scale as a neatly-bundled package). If you’re unclear on the distinction between “data analysis” data science and “software engineering” data science, I’ve written more about that distinction here.
Those practices are:
Writing tests that are appropriate for data analysis workflows (a brief example follows this list),
Writing readable, well-formatted code,
Avoiding duplication and manual transcription, and
Collaborating effectively.
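To give a flavor of the first practice, here’s a minimal sketch of the kind of inline test that suits a data analysis workflow: plain assert statements that check your assumptions about the data at the point where they matter. (This is purely illustrative; the file name and column names are hypothetical.)

```python
import pandas as pd

# Hypothetical example data: one row per survey respondent.
df = pd.read_csv("survey_responses.csv")

# If respondent IDs were silently duplicated (e.g., by a bad merge),
# every statistic computed downstream would be wrong.
assert df["respondent_id"].is_unique, "duplicate respondent IDs"

# A variable we believe is always recorded should have no missing values.
assert df["age"].notnull().all(), "unexpected missing values in age"

# A substantive assumption: respondents are adults with plausible ages.
assert df["age"].between(18, 110).all(), "implausible age values"
```

Unlike the test suites software engineers maintain for reusable packages, checks like these live in the analysis script itself and fail loudly the moment the data violate an assumption.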