Pulling Back the Curtain: Pandas to Numpy with patsy#

In our last reading, we learned about the ols function in statsmodels.formula.api and the ease with which it allows data scientists to fit a linear regression on data stored in a pandas DataFrame. We saw how easy it was to work with the regression results, to change features like how standard errors were calculated, to evaluate model performance, and more.

The one thing we deliberately did not go into in the last reading was the details of the model specification. The formula we used — "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region" — isn’t too hard to interpret, and if you’re coming from the R programming language it likely looked very familiar.

In this reading, we’ll pull back the curtain a little on how statsmodels used data from a pandas DataFrame to fit a linear regression, even though some of the contents of that DataFrame were string variables (like region). In doing so, we’ll not only learn the details of how to specify models using these formula strings, but also the nuts and bolts of model fitting — an understanding that will be critical for successfully navigating the Python Data Science ecosystem beyond statsmodels.

The Hidden Step Between Pandas and Your Results#

At the level of the underlying math, a linear regression is fit using matrix calculations. Imagine putting your outcome variable in a vector called \(y\) and your explanatory variables in a matrix \(X\), where each column is one of your explanatory variables and each row is an observation. To get your vector of coefficients \(\beta\), your computer uses the following matrix formula (don’t worry if you don’t know what all the terms mean; I just want to make this concrete by showing the formula):

\[\beta = (X^TX)^{-1} X^Ty\]

In other words, we get \(\beta\) by doing some matrix multiplication and calculating matrix inverses on our data. Indeed, matrix math (also known as matrix algebra) underlies a huge proportion of statistics and machine learning tools, which is one of the reasons that a linear algebra course (the branch of math focused on matrices) is often recommended for data science students. (Don’t worry: you don’t need to know linear algebra to start as a data scientist, and you don’t need to understand the formula above; I’m just telling you this to motivate what comes next.)
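If you're curious, here is a minimal numpy sketch of that formula in action (the data values are made up purely for illustration):

import numpy as np

# Made-up toy data: a column of 1s (the intercept) plus one explanatory variable
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 4.0]])
y = np.array([1.0, 2.0, 4.0, 3.0])

# beta = (X'X)^-1 X'y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # [intercept, slope]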

But you can’t always do matrix math on a pandas DataFrame. pandas DataFrames are great because of their flexibility, but you can’t multiply two pandas DataFrames that contain some integer columns, some floating point columns, and some string columns. Moreover, you also can’t do matrix calculations with missing values: the inverse of a matrix with missing values is not a defined concept.

So how was smf.ols() able to fit that regression? patsy. patsy is a package that takes as input a pandas DataFrame and a formula (like the "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region" formula we used above), and returns numpy arrays of floating point numbers without missing data (and with labels!) on which matrix algebra can be performed. All smf.ols() is doing is:

  • calling patsy,

  • giving patsy your DataFrame and formula,

  • taking the numpy arrays it returns, and

  • passing those arrays to its more traditional OLS function, OLS() from statsmodels.api.

That function then does the actual matrix algebra described above and gives you back a result!
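To make this concrete, here is a sketch of roughly what happens under the hood (the tiny DataFrame here is invented just for illustration):

import pandas as pd
import patsy
import statsmodels.api as sm

df = pd.DataFrame({"y": [1.0, 2.0, 4.0, 3.0], "x": [2.0, 3.0, 5.0, 4.0]})

# Steps 1-3: patsy turns the DataFrame and formula into numpy arrays
y, X = patsy.dmatrices("y ~ x", data=df)

# Step 4: the arrays are handed to the matrix-algebra OLS function
results = sm.OLS(y, X).fit()
print(results.params)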

Why do I care?#

“OK,” I hear you yell, “but if statsmodels does this for me what do I care how it does it?”

You need to care because, outside a handful of functions for which statsmodels offers this convenience (you can find the full list here), nearly every modeling tool you encounter in the Python ecosystem will expect numpy arrays of floating point numbers as inputs. That means that it’s important to (a) understand why they need numpy arrays and (b) be able to use patsy directly to move from pandas DataFrames to numpy arrays yourself.

patsy Formula Syntax#

To move between pandas and numpy, we will use the patsy.dmatrices() function. It takes a pandas DataFrame and a formula string and returns a tuple containing an outcome vector and a matrix of explanatory variables.

To illustrate how patsy works, let’s create a small toy dataset to help us visualize what is going on.

import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

toy_df = pd.DataFrame(
    {
        "y": [10, 11, 9, 8, 12],
        "a": [1, 2, 3, 1, 3],
        "b": [3.14, np.nan, 2.5, 0, 3],
        "names": ["Nick", "Lyra", "Nick", "Lyra", "Lyra"],
    }
)
toy_df
    y  a     b names
0  10  1  3.14  Nick
1  11  2   NaN  Lyra
2   9  3  2.50  Nick
3   8  1  0.00  Lyra
4  12  3  3.00  Lyra

Now let’s suppose we want to regress y on a and b, and let’s use dmatrices to create our matrices. Note that patsy.dmatrices returns two objects (our outcome vector and our matrix of explanatory variables), so if you assign the result to a single variable you will get a tuple with the two items; if you instead assign the result to two variables separated by a comma, the two objects will be distributed to those variables:

import patsy

y, X = patsy.dmatrices("y ~ a + b", data=toy_df)

Let’s start by looking at our X matrix:

X
DesignMatrix with shape (4, 3)
  Intercept  a     b
          1  1  3.14
          1  3  2.50
          1  1  0.00
          1  3  3.00
  Terms:
    'Intercept' (column 0)
    'a' (column 1)
    'b' (column 2)

As you can see, there are a few differences between our pandas DataFrame and this DesignMatrix (DesignMatrix objects are basically numpy arrays with some extra metadata, like labels; you can use them with any function that expects numpy arrays):

  • The DesignMatrix X includes a column of 1s labelled Intercept. You may have never thought about this before, but the intercept in a regression doesn’t appear by magic: it’s actually a coefficient created by adding a column that has the same value for every observation (1 makes interpretation easiest). By default, patsy will always add an intercept, but if you would prefer it did not, just add - 1 or + 0 to the formula (a short sketch of this appears after the y output below).

  • Our pandas DataFrame had 5 rows, but X only has 4. Why? Well, the second row of our DataFrame included a missing value (np.nan) in the b column. As noted above, we can’t do matrix algebra with missing values, so any row that includes a missing value in one of the variables in the formula is automatically dropped. And yes, it drops that row from y too.

y
DesignMatrix with shape (4, 1)
   y
  10
   9
   8
  12
  Terms:
    'y' (column 0)
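Two quick things worth verifying yourself: a DesignMatrix really is a numpy array under the hood, and the intercept can be suppressed as described above. A minimal sketch using the same toy_df:

import numpy as np

# DesignMatrix is an ndarray subclass, so numpy-expecting functions will accept it
print(isinstance(X, np.ndarray))  # True

# Suppress the intercept by adding "+ 0" (equivalently, "- 1") to the formula
y_ni, X_ni = patsy.dmatrices("y ~ a + b + 0", data=toy_df)
print(X_ni.design_info.column_names)  # ['a', 'b'] -- no Intercept column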

Categorical Variables#

Great, but this is just the start of things. Now let’s add names — a string variable — to the model.

y, X = patsy.dmatrices("y ~ a + b + names", data=toy_df)
X
DesignMatrix with shape (4, 4)
  Intercept  names[T.Nick]  a     b
          1              1  1  3.14
          1              1  3  2.50
          1              0  1  0.00
          1              0  3  3.00
  Terms:
    'Intercept' (column 0)
    'names' (column 1)
    'a' (column 2)
    'b' (column 3)

Now our X matrix has a new column called names[T.Nick] that contains 0s and 1s. Basically, if the variable takes on the value after T. (here, Nick), that column will be 1; otherwise it will be 0. If our names variable took on more than two values (say, "Lyra", "Nick", "Trillian"), X would also have had a column called names[T.Trillian].

By default, patsy will always omit one category (the “omitted category” or “reference category”). This is necessary because dummy columns for all N categories plus the intercept column would be perfectly collinear, and the matrix math described above would break down. As a result, for a variable that takes on N unique values, patsy will add N-1 columns when doing one-hot encoding.
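To see the N-1 rule in action, here is a hypothetical sketch in which one row of toy_df is given a third name (the name "Trillian" is invented for illustration):

# Hypothetical: a copy of toy_df with three distinct names
toy3 = toy_df.copy()
toy3.loc[4, "names"] = "Trillian"

y3, X3 = patsy.dmatrices("y ~ names", data=toy3)
print(X3.design_info.column_names)
# ['Intercept', 'names[T.Nick]', 'names[T.Trillian]'] -- Lyra is the omitted category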

patsy will do this automatically when it encounters a Categorical or object dtype column, but you can also use C() to force it to treat any variable as categorical. For example, if I put C(a) in the formula, it will treat a as a categorical variable and apply one-hot encoding to all the unique values of the variable:

y, X = patsy.dmatrices("y ~ C(a) + b", data=toy_df)
X
DesignMatrix with shape (4, 4)
  Intercept  C(a)[T.2]  C(a)[T.3]     b
          1          0          0  3.14
          1          0          1  2.50
          1          0          0  0.00
          1          0          1  3.00
  Terms:
    'Intercept' (column 0)
    'C(a)' (columns 1:3)
    'b' (column 3)

Arguments can also be passed to C() to modify patsy’s one-hot encoding behavior. For example, if you pass C(names, Treatment('Nick')), "Nick" will become the omitted category (instead of the default, which is the first value alphabetically):

y, X = patsy.dmatrices("y ~ a + b + C(names, Treatment('Nick'))", data=toy_df)
X
DesignMatrix with shape (4, 4)
  Intercept  C(names, Treatment('Nick'))[T.Lyra]  a     b
          1                                    0  1  3.14
          1                                    0  3  2.50
          1                                    1  1  0.00
          1                                    1  3  3.00
  Terms:
    'Intercept' (column 0)
    "C(names, Treatment('Nick'))" (column 1)
    'a' (column 2)
    'b' (column 3)

Interaction Terms#

An interaction term is created when a regression includes not only two variables on their own, but also the product of those two variables as a third variable. Interaction terms can be tricky to understand, especially if you haven’t taken a linear regression course before, but they’re used when we think that two variables don’t only matter on their own; rather, the impact of one variable depends on the value of the other variable.

For example, suppose you are trying to estimate a patient’s risk of a heart attack. You may think that heart attack risk is related to both the patient’s blood pressure and the patient’s age. Generally speaking, patients with high blood pressure are at greater risk of heart attacks, and older patients are at greater risk of heart attacks.

But suppose you think that blood pressure only matters for older patients, and that as patients get older, the amount that high blood pressure matters also increases. In other words, suppose you suspect that a 40 year old patient with a systolic blood pressure of 150 would have 2 times the heart attack risk they would have with a blood pressure of 120, but that a 70 year old patient with a blood pressure of 150 would have 5 times the heart attack risk they would have with a blood pressure of 120.

To capture this, we would put age, blood pressure, and age times blood pressure in our model. And this third term — blood pressure times age — is what we call an interaction term.
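In regression notation, that model would look something like this (a sketch of the specification, with heart attack risk as the outcome):

\[\text{risk} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{bp} + \beta_3 \cdot (\text{age} \times \text{bp})\]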

patsy can also create interaction terms on the fly. For example, if you wish to add the multiplicative interaction of a and b, you can either put a * b in place of a + b to have patsy add a, b, and a:b (the product of a and b) to the model:

y, X = patsy.dmatrices("y ~ a * b + names", data=toy_df)
X
DesignMatrix with shape (4, 5)
  Intercept  names[T.Nick]  a     b   a:b
          1              1  1  3.14  3.14
          1              1  3  2.50  7.50
          1              0  1  0.00  0.00
          1              0  3  3.00  9.00
  Terms:
    'Intercept' (column 0)
    'names' (column 1)
    'a' (column 2)
    'b' (column 3)
    'a:b' (column 4)

Or you can leave in a + b and add a:b to just add the multiplicative interaction term:

y, X = patsy.dmatrices("y ~ a + b + a:b + names", data=toy_df)
X
DesignMatrix with shape (4, 5)
  Intercept  names[T.Nick]  a     b   a:b
          1              1  1  3.14  3.14
          1              1  3  2.50  7.50
          1              0  1  0.00  0.00
          1              0  3  3.00  9.00
  Terms:
    'Intercept' (column 0)
    'names' (column 1)
    'a' (column 2)
    'b' (column 3)
    'a:b' (column 4)
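One more formula feature worth noting explicitly: as in the mortality formula at the top of this reading, you can apply Python functions like np.log to variables directly inside a formula. A minimal sketch with toy_df:

# patsy evaluates Python expressions like np.log(a) when building the matrix
y_l, X_l = patsy.dmatrices("y ~ np.log(a) + b", data=toy_df)
print(X_l.design_info.column_names)  # ['Intercept', 'np.log(a)', 'b']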

Isn’t this formula syntax the same as R?#

Pretty close! Here’s a list of all known differences, but as you’ll see, there aren’t too many.