Don’t Duplicate Information#
Information should only be expressed once in a file. For example, say you want to drop observations from a dataset if a person’s income has ever dropped below a poverty threshold of $20,000. You could do this like this:
import pandas as pd
df = pd.DataFrame(
{
"income_2019": [10_000, 20_000, 30_000, 40_000, 50_000],
"income_2018": [50_000, 40_000, 30_000, 20_000, 10_000],
"income_2017": [50_000, 20_000, 30_000, 40_000, 50_000],
}
)
df = df[
(df["income_2019"] < 20_000)
| (df["income_2018"] < 20_000)
| (df["income_2017"] < 20_000)
]
df
income_2019 | income_2018 | income_2017 | |
---|---|---|---|
0 | 10000 | 50000 | 50000 |
4 | 50000 | 10000 | 50000 |
And indeed, this would work. But suppose you decided to change that cutoff from 20,000 to 30,000. The way this is written, you’ve opened yourself up to the possibility that in trying to change these cutoffs, you may change two of these but forget the third (something especially likely if the uses of the cutoff aren’t all in exactly the same place in your code). A better way of expressing this that avoids this possibility is:
df = pd.DataFrame(
{
"income_2019": [10_000, 20_000, 30_000, 40_000, 50_000],
"income_2018": [50_000, 40_000, 30_000, 20_000, 10_000],
"income_2017": [50_000, 20_000, 30_000, 40_000, 50_000],
}
)
tax_income_threshold = 20_000
df = df[
(df["income_2019"] < tax_income_threshold)
| (df["income_2018"] < tax_income_threshold)
| (df["income_2017"] < tax_income_threshold)
]
df
income_2019 | income_2018 | income_2017 | |
---|---|---|---|
0 | 10000 | 50000 | 50000 |
4 | 50000 | 10000 | 50000 |
Written like this, if you ever decide to go back and change the common cutoff, you only have to make in one place, and there’s no way to make the change in some cases but forget others.
No Magic Numbers#
The example above also obeys another good rule for programming: no magic numbers, meaning don’t write code that has constants (like 20_000
) that appear without explanation. Assigning them to a variable — especially a well-named, self-documenting variable — first improves code legibility.
Never Transcribe#
This one is more relevant for academics, but it’s important anyway: never transcribe numbers into a report or paper if you can possibly avoid it.
We’ve already covered tricks to maximize the probability we catch our mistakes, but how do we minimize the likelihood they will occur? If there is anything I learned working as the Replication Assistant at the Quarterly Journal of Political Science testing people’s replication packages, it is that authors should never transcribe numbers from their statistical software into their papers by hand. This was easily the largest source of replication issues we encountered, as doing so introduced two types of errors:
Mis-transcriptions: Humans just aren’t built to reliably transcribe dozens of numbers by hand. If the error is in the last decimal place, it doesn’t mean much, but when a decimal point drifts, or a negative sign is dropped, the results are often quite substantively important.
Failures to Update: We are constantly updating our code, and authors who hand transcribe their results often update their code and forget to update all of their results, leaving old results in their paper.
So, how do you avoid this problem? Use tools that will directly export your results into plain text or formatted tables you can use in the program where you are working. For example, statsmodels
in Python can export regression tables to lots of formats, R has stargazer, and Stata has estout.
I suggest users not only do this for tables - which is increasingly common - but also for statistics that appear in text. For example, to put a single number into \(\LaTeX\), you just generate the number you want to put in your paper, convert it to a string, and save it to disk as a .tex
file (e.g., exported_statistic.tex
). Then, in your paper, simply add a \input{exported_statistic.tex}
call, and LaTeX will insert the contents of that .tex
file verbatim into your paper.
For example, here’s a way to save a single number to put into LaTeX:
# Here's a number I want in a paper
x = 1 / 3
# Format it...
x = f"{x:.2f}"
# Now write to disk!
import os
os.chdir("/users/nce8/desktop")
with open("test_file.tex", "w") as text_file:
text_file.write(x)
# Now I'm gonna erase
# the file so I don't clutter my desktop. :)
os.remove("test_file.tex")
While this type of integration works best for LaTeX, it can still be accomplished with many other programs like Word. For example, most packages that generate .tex
files that can be easily integrated into LaTeX also often have options to export to .txt
or .rtf
files that you can easily use in Word. These tools can be used to generate tables that can either be (a) copied whole-cloth into Word by hand (minimizing the risk of mis-transcriptions that may occur when typing individual values) or (b) using Word’s Link to Existing File feature to connect your Word document to the output of your code in a way that ensures the Word doc loads the most recent version of the table every time Word is opened. Some great tips for combining R with Word can be found here.