Pulling Back the Curtain: Pandas to Numpy with patsy
In our last reading, we learned about the ols function in statsmodels.formula.api and the ease with which it allows data scientists to fit a linear regression directly on data stored in a pandas DataFrame. We saw how easy it was to work with the regression results, to change features like how standard errors were calculated, to evaluate model performance, and more.
The one thing we deliberately did not go into in the last reading was the details of the model specification. The formula we used — "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region" — isn’t too hard to interpret, and if you’re coming from the R programming language it likely looked very familiar.
In this reading, we’ll pull back the curtain a little on how statsmodels used data from a pandas DataFrame to fit a linear regression, even though some of the contents of that DataFrame were string variables (like region). In doing so, we’ll not only learn the details of how to specify models using these formula strings, but also the nuts and bolts of model fitting — an understanding that will be critical for successfully navigating the Python Data Science ecosystem beyond statsmodels.
patsy Formula Syntax
To move between pandas and numpy, we will use the patsy.dmatrices() function. It takes a pandas DataFrame and a formula string and returns a tuple containing an outcome vector and a matrix of explanatory variables.
To illustrate how patsy works, let’s create a small toy dataset to help us visualize what is going on.
import pandas as pd
import numpy as np
pd.set_option("mode.copy_on_write", True)
toy_df = pd.DataFrame(
{
"y": [10, 11, 9, 8, 12],
"a": [1, 2, 3, 1, 3],
"b": [3.14, np.nan, 2.5, 0, 3],
"names": ["Nick", "Lyra", "Nick", "Lyra", "Lyra"],
}
)
toy_df
   | y  | a | b    | names
---|----|---|------|------
 0 | 10 | 1 | 3.14 | Nick
 1 | 11 | 2 | NaN  | Lyra
 2 | 9  | 3 | 2.50 | Nick
 3 | 8  | 1 | 0.00 | Lyra
 4 | 12 | 3 | 3.00 | Lyra
Now let’s suppose we want to regress y on a and b. Let’s use dmatrices to create our matrices. Note that patsy.dmatrices returns two objects (our outcome vector and our matrix of explanatory variables), so if you assign the result to a single variable you will get a tuple with the two items, or you can assign the result to two variables (separated by a comma) and the results will be distributed to those two variables:
import patsy
y, X = patsy.dmatrices("y ~ a + b", data=toy_df)
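If you do assign the result to a single name instead, you can pull out the same two objects by indexing into the returned tuple. Here is a quick sketch of that equivalent pattern:
# Equivalent pattern: capture the tuple, then index into it.
matrices = patsy.dmatrices("y ~ a + b", data=toy_df)
y = matrices[0]  # the outcome vector
X = matrices[1]  # the matrix of explanatory variables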
Let’s start by looking at our X matrix:
X
DesignMatrix with shape (4, 3)
  Intercept  a     b
          1  1  3.14
          1  3  2.50
          1  1  0.00
          1  3  3.00
  Terms:
    'Intercept' (column 0)
    'a' (column 1)
    'b' (column 2)
As you can see, there are a few differences between our pandas DataFrame and this DesignMatrix (these are basically numpy arrays with some extra metadata, like labels. You can use them with any function that expects numpy arrays):
- The DesignMatrix X includes a column of 1s labelled Intercept. You may have never thought about this before, but the intercept in a regression doesn’t appear by magic — it’s actually a coefficient created by adding a column that has the same value for every observation (1 makes interpretation easiest). By default, patsy will always add an intercept, but if you would prefer it did not, just add - 1 or + 0 to the formula (there’s an example of this after the y matrix below).
- Our pandas DataFrame had 5 rows, but X only has 4. Why? Well, the second row of our DataFrame included a missing value (np.nan) in the b column. As noted above, we can’t do matrix algebra with missing values, so any row that includes a missing value in one of the variables in the formula is automatically dropped. And yes, it drops that row from y too.
y
DesignMatrix with shape (4, 1)
y
10
9
8
12
Terms:
'y' (column 0)
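As promised above, here is a minimal sketch of suppressing the intercept by adding - 1 to the formula (+ 0 works the same way). It also shows that, because a DesignMatrix is just a numpy array with extra metadata, np.asarray() gives you the plain array whenever a function insists on one:
# Sketch: "- 1" in the formula removes the automatic Intercept column.
y_no_int, X_no_int = patsy.dmatrices("y ~ a + b - 1", data=toy_df)
print(X_no_int.design_info.column_names)  # just 'a' and 'b', no 'Intercept'

# DesignMatrix objects work anywhere a numpy array is expected; np.asarray()
# converts one to a plain ndarray explicitly.
print(np.asarray(X_no_int))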
Categorical Variables
Great, but this is just the start of things. Now let’s add names — a string variable — to the model.
y, X = patsy.dmatrices("y ~ a + b + names", data=toy_df)
X
DesignMatrix with shape (4, 4)
Intercept names[T.Nick] a b
1 1 1 3.14
1 1 3 2.50
1 0 1 0.00
1 0 3 3.00
Terms:
'Intercept' (column 0)
'names' (column 1)
'a' (column 2)
'b' (column 3)
Now our X matrix has a new column called names[T.Nick] that contains 0s and 1s. Basically, if the variable takes on the value after T. (here, Nick), then that column will be 1; otherwise it will be 0. If our names variable took on more than two values (say, "Lyra", "Nick", and "Trillian"), X would also have had a column called names[T.Trillian].
By default, patsy will always omit one category (the “omitted category” or “reference category”). As a result, for a variable that takes on N unique values, patsy will always add N-1 variables when doing one-hot encoding.
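To see that N-1 rule in action, here is a small sketch that swaps one of the names in our toy data for a third value (Trillian again); with three unique names in the column, patsy should build two indicator columns:
# Sketch: with a third unique name, patsy creates N-1 = 2 indicator columns.
toy_df_three = toy_df.copy()
toy_df_three.loc[4, "names"] = "Trillian"
y3, X3 = patsy.dmatrices("y ~ a + b + names", data=toy_df_three)
print(X3.design_info.column_names)
# Expect something like:
# ['Intercept', 'names[T.Nick]', 'names[T.Trillian]', 'a', 'b']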
patsy will do this automatically when it encounters a Categorical or object dtype column, but you can also use C() to force it to treat any variable as categorical. For example, if I put C(a) in the formula, it will treat a as a categorical variable and apply one-hot encoding to all the unique values of the variable:
y, X = patsy.dmatrices("y ~ C(a) + b", data=toy_df)
X
DesignMatrix with shape (4, 4)
Intercept C(a)[T.2] C(a)[T.3] b
1 0 0 3.14
1 0 1 2.50
1 0 0 0.00
1 0 1 3.00
Terms:
'Intercept' (column 0)
'C(a)' (columns 1:3)
'b' (column 3)
Arguments can also be passed to C() to modify patsy’s one-hot encoding behavior. For example, if you pass C(names, Treatment('Nick')), "Nick" will become the omitted category (instead of the default, which is the first value alphabetically):
y, X = patsy.dmatrices("y ~ a + b + C(names, Treatment('Nick'))", data=toy_df)
X
DesignMatrix with shape (4, 4)
Intercept C(names, Treatment('Nick'))[T.Lyra] a b
1 0 1 3.14
1 0 3 2.50
1 1 1 0.00
1 1 3 3.00
Terms:
'Intercept' (column 0)
"C(names, Treatment('Nick'))" (column 1)
'a' (column 2)
'b' (column 3)
Interaction Terms
Interaction terms appear when a regression includes not only two variables on their own, but also the two variables multiplied by one another as a third variable. Interaction terms can be tricky to understand — especially if you haven’t taken a linear regression course before — but they’re used when we think that two variables don’t only matter on their own, but also that the impact of one variable depends on the value of the other variable.
For example, suppose you are trying to estimate a patient’s risk of a heart attack. You may think that heart attack risk is related to both the patient’s blood pressure and the patient’s age. Generally speaking, patients with high blood pressure are at greater risk of heart attacks, and older patients are at greater risk of heart attacks.
But suppose you think that blood pressure only matters for older patients, and that as patients get older, the amount that high blood pressure matters also increases. In other words, suppose you suspect that a 40 year old patient with a systolic blood pressure of 150 would have 2 times the risk of a heart attack that they would have with a blood pressure of 120, but that a 70 year old patient with a blood pressure of 150 would have 5 times the risk they would have with a blood pressure of 120.
To capture this, we would put age, blood pressure, and age times blood pressure in our model. And this third term — blood pressure times age — is what we call an interaction term.
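To make that concrete with some invented numbers (this little patients table is purely hypothetical), the interaction term is just a new column equal to the product of the other two:
# Hypothetical illustration: an interaction term is literally the product of
# the two original variables, added to the data as a third column.
patients = pd.DataFrame(
    {
        "age": [40, 70, 40, 70],
        "systolic_bp": [120, 120, 150, 150],
    }
)
patients["age_x_bp"] = patients["age"] * patients["systolic_bp"]
patients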
Patsy can also create interaction terms on the fly. For example, if you wished to add the multiplicative interaction of a and b, you can either put a * b in place of a + b to have patsy add a, b, and their product a:b to the model:
y, X = patsy.dmatrices("y ~ a * b + names", data=toy_df)
X
DesignMatrix with shape (4, 5)
Intercept names[T.Nick] a b a:b
1 1 1 3.14 3.14
1 1 3 2.50 7.50
1 0 1 0.00 0.00
1 0 3 3.00 9.00
Terms:
'Intercept' (column 0)
'names' (column 1)
'a' (column 2)
'b' (column 3)
'a:b' (column 4)
Or you can leave a + b in the formula and add a:b to include just the multiplicative interaction term:
y, X = patsy.dmatrices("y ~ a + b + a:b + names", data=toy_df)
X
DesignMatrix with shape (4, 5)
Intercept names[T.Nick] a b a:b
1 1 1 3.14 3.14
1 1 3 2.50 7.50
1 0 1 0.00 0.00
1 0 3 3.00 9.00
Terms:
'Intercept' (column 0)
'names' (column 1)
'a' (column 2)
'b' (column 3)
'a:b' (column 4)
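The two specifications are equivalent. As a quick sanity check (just a sketch), you can confirm that both formulas produce the same columns and the same values:
# Sanity check: "a * b" expands to "a + b + a:b", so both formulas should
# produce identical design matrices.
_, X_star = patsy.dmatrices("y ~ a * b + names", data=toy_df)
_, X_explicit = patsy.dmatrices("y ~ a + b + a:b + names", data=toy_df)
print(X_star.design_info.column_names == X_explicit.design_info.column_names)  # should print True
print(np.allclose(np.asarray(X_star), np.asarray(X_explicit)))  # should print True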
Isn’t this formula syntax the same as R?
Pretty close! Here’s a list of all known differences, but as you’ll see, there aren’t too many.