Readability
When writing code, it can often feel like you’re in a conversation (or fight) with your computer. But if you take nothing else away from this reading, let it be this: we write code for people, not computers.
Your computer doesn’t need all the neat words you get in a language like Python: print, in, or list. It’s just moving electrons through trillions of little circuits. Programming languages are there to make what a computer does legible to people. And yet, despite that, we so often find ways to make the code we write utterly impossible for people to read.
That’s a problem, because if you can’t easily understand what’s going on when you look at code, there’s no chance you’ll be able to identify mistakes. That’s why writing readable code isn’t just about aesthetics: it’s key to avoiding mistakes.
So what makes code readable?
Use informative names.
Don’t call something var212 if you can call it unemployment_percentage. Informative names require more typing, but they make your code so much easier to read. And even then, you only have to type out the variable name the first time; after that, tab-completion limits the number of keystrokes required for even the longest names. Moreover, including units in your variable names (percentage, km, etc.) can also help avoid confusion.
# Bad
def foo(a, b):
    return a * 12 + b
# Good
def convert_ft_and_inches_to_inches(feet, inches):
    return feet * 12 + inches
The two functions above do precisely the same thing, but when you see the second function, it’s immediately clear what it does and how to use it. The first function’s name gives no hint of what it does. Moreover, even if you know its purpose, it can be easy to forget how to use it correctly: which argument is for feet and which is for inches? The names of the arguments (a and b) don’t tell you!
Code with really clear variable names is sometimes called “self-documenting” because a reader can understand its purpose without having to reference comments or additional documentation.
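As a small illustration of putting units in names, here is a minimal sketch. The quantities and variable names are hypothetical, chosen only to show how the units in each name make the arithmetic easy to check by eye.

```python
# Hypothetical values, used only to illustrate naming.
distance_km = 42.195
duration_minutes = 150

# Because each name carries its unit, the conversion below is easy to verify:
# minutes / 60 gives hours, and km / hours gives km per hour.
duration_hours = duration_minutes / 60
speed_km_per_hour = distance_km / duration_hours

print(f"Average speed: {speed_km_per_hour:.1f} km/h")
```

Had these been called d, t, and s, a reader would have to trace the whole calculation to learn whether the result was in km/h or km/min.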
Format Your Code
Humans are visual creatures with superb visual pattern recognition skills (our ancestors wouldn’t have survived if they couldn’t find camouflaged lions in the grass!). Consistently formatting your code allows you to leverage those innate abilities to notice when things don’t look right.
Once upon a time, I’d point you to a document about principles of good formatting. Today, though, I don’t have to. Instead, you can just install a program to format your code automatically.
In the world of Python, nearly everyone formats with a program called black or similar formatters (like Ruff) that… do exactly the same thing as Black, but often faster. I strongly recommend setting up “format on save” in your editor. For directions on how to do that in VS Code, go here.
Note
What is “Black”? Black is an “opinionated code formatter.” As they say on their website, “By using Black, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.” It’s become the standard for formatting Python code, not necessarily because everyone thinks the way it styles code is “the best,” but rather because it’s put an end to fights over code format that used to arise in every open-source project.
The package’s name comes from a quote from Henry Ford, who once said of the first mass-produced automobile, the Model T, “Any customer can have a car painted any color that he wants so long as it is black.”
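To make the payoff of consistent formatting concrete, here is a rough before-and-after sketch. The second function was tidied by hand in the general style an auto-formatter aims for; it is not the output of actually running Black, and both function names are made up for this illustration.

```python
# Before: valid Python, but the irregular spacing makes the structure hard to scan.
def total_inches_unformatted( feet,inches ):
    return feet*12+inches


# After: the same logic with consistent spacing (hand-formatted for illustration,
# not produced by running a formatter).
def total_inches_formatted(feet, inches):
    return feet * 12 + inches


# Formatting changes only how the code looks, never what it computes.
print(total_inches_unformatted(5, 11) == total_inches_formatted(5, 11))
```

The point of handing this job to a tool is that every file ends up looking like the second version without anyone having to think about it.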
Interpret Your Numbers
When you print out numbers in a data analysis workflow, don’t just throw them out there without context! Data science is about interpreting data to help improve our understanding of the world around us. To that end, any time you share a result, you absolutely must interpret it for the reader. That means no answering a question with Python printing out a number—you should also tell the reader what, in substantive terms, that number means. Honestly, even if you aren’t sharing a result — you’re just trying to print out an important quantity for yourself — making yourself interpret it explicitly will force you to think about what you’re doing in substantive terms.
To illustrate, suppose you asked someone how many households live below the poverty line, and they answered “0.183”. You’d have so many questions! For example, is that a share or proportion (meaning 18.3%), or is it already a percentage (meaning 0.183%)?
I recognize that when an exercise prompt is clear, it may feel unnecessary/pedantic, but it’s a good practice to get used to in the real world. Get your units wrong and… bad things can happen.
My suggestion for how to do this is to present answers with f-strings (not familiar with f-strings? Best Python feature in years. Learn all the tricks here!). These will allow you to combine interpretation/units with your result and also make it easy to format your answer (so it doesn’t have 20 decimal places). For example:
below_poverty = 0.183
print(
    "The share of US Households earning less than "
    f"$20,000 in 2008 was {below_poverty:.2f}"
)
The share of US Households earning less than $20,000 in 2008 was 0.18
And for money values, you can easily add commas and round off to two decimals:
amount_of_money = 30_021_032.2398823
print(f"The price paid for the company was ${amount_of_money:,.2f}")
The price paid for the company was $30,021,032.24
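The share-versus-percentage ambiguity discussed earlier can also be handled inside the format spec. The sketch below reuses the 0.183 figure from the example above and assumes it is a proportion between 0 and 1; the variable names are illustrative.

```python
# Assumed to be a proportion between 0 and 1, as in the earlier example.
below_poverty_share = 0.183

# The "%" format type multiplies the value by 100 and appends a percent sign,
# so the reader never has to guess which scale the number is on.
share_as_percentage = f"{below_poverty_share:.1%}"

print(f"{share_as_percentage} of US households earned less than $20,000 in 2008")
```

This prints the value as 18.3%, which leaves no doubt about the units.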
Comments
Comment your code! Comments help in two ways.
First, and most obviously, they make it easy to figure out what’s going on when you come back to code days, weeks, or months after it was originally written.
Second, writing them forces you to think about what you’re doing in substantive terms (“This section calculates the share of people within each occupation who have college degrees”) rather than just in programming logic, which can help you catch substantive problems with code that may run without problems but will not actually generate the quantity of interest.
To be clear, not everything requires a comment, and the more informative your variable and function names, the fewer comments you need to put around your code to make it legible. In industry, it is increasingly common to limit comments in programs, relying instead on informative variable and function names (self-documenting code) or docstrings (the information about a function you can get with function_name?).
In the world of data analysis, however, I’m still a big advocate for commenting. The reason is that much of what we do in our data analysis scripts is motivated by a non-obvious property of the underlying data (maybe a certain set of observations that shouldn’t be duplicated are, and you’ve figured out that you can drop half of them safely but the other half need new identifiers). Comments — in my view — remain the best way to keep a record of that type of information about the data.
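Here is a minimal sketch of how a docstring and a comment can divide the work. The dataset, the duplication problem, and the function name are all hypothetical, invented for illustration: the docstring says what the function does, while the comment records the non-obvious fact about the data that explains why.

```python
def drop_repeated_2008_rows(records):
    """Return records with repeated 2008 rows removed, keeping the first copy.

    Each record is a (household_id, year, income_usd) tuple.
    """
    # Hypothetical data quirk: suppose the 2008 file was appended twice, so
    # every affected 2008 row appears again with identical values. Other years
    # are assumed clean, so only 2008 rows are de-duplicated.
    seen_2008_rows = set()
    cleaned = []
    for record in records:
        _, year, _ = record
        if year == 2008:
            if record in seen_2008_rows:
                continue  # an exact repeat of a 2008 row we already kept
            seen_2008_rows.add(record)
        cleaned.append(record)
    return cleaned


# Made-up records for demonstration only.
records = [
    (1, 2007, 18_500.0),
    (1, 2008, 19_200.0),
    (2, 2008, 41_000.0),
    (1, 2008, 19_200.0),  # exact repeat of an earlier 2008 row
]
cleaned_records = drop_repeated_2008_rows(records)
print(f"Kept {len(cleaned_records)} of {len(records)} rows after removing repeated 2008 rows")
```

Calling help(drop_repeated_2008_rows) shows the docstring, but only the comment explains why 2008 is treated differently from every other year.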