OPTIONAL: Beyond The Basic Model#

It’s our hope that the last two readings will be accessible to anyone who has gotten this far in our specialization, regardless of your prior familiarity with linear regression.

In this reading, however, we will provide an overview of some of the more advanced functionality provided by statsmodels. The purpose of this is to provide readers who are used to working with linear regressions in another programming language (like R or Stata) with a quick introduction to the syntax for doing tasks that are commonly used in practice but which we do not have the space to explain in this course.

In particular, in this reading we will discuss different types of standard errors (e.g., clustered and heteroskedastic robust standard errors)

Heteroskedastic Robust and Clustered Standard Errors#

One of the most common modifications to a standard linear regression is the use of heteroskedastic robust and clustered standard errors, and these are easy to use in statsmodels.

To illustrate, let’s begin with a simple regression:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

pd.set_option("mode.copy_on_write", True)

# Load data on infant mortality, gdp per capita, and
# World Bank CPIA public sector transparency, accountability,
# and corruption in the public sector scores
# (1 = low transparency and accountability, 6 = high transparency and accountability).

wdi = pd.read_csv("data/wdi_corruption.csv")

# Check one observation to get a feel for things.
wdi.sample().T
40
country_name Mauritania
gdp_per_capita_ppp 372.270362
CPIA_public_sector_rating 3.0
mortality_rate_under5_per_1000 84.1
Mortality rate, under-5, female (per 1,000 live births) 77.8
Mortality rate, under-5, male (per 1,000 live births) 90.2
Population, total 4046301.0
region Sub-Saharan Africa
# Fit model
corruption_model = smf.ols(
    "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
    data=wdi,
).fit()

# Get regression result
corruption_model.summary()
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 13.12
Date: Sun, 21 Jul 2024 Prob (F-statistic): 2.11e-10
Time: 16:39:55 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 36.430 4.665 0.000 97.183 242.696
region[T.Europe and Central Asia] -15.9265 12.304 -1.294 0.200 -40.499 8.646
region[T.Latin America and Caribbean] 1.9023 9.226 0.206 0.837 -16.523 20.327
region[T.Middle East and North Africa] 3.7668 23.057 0.163 0.871 -42.280 49.814
region[T.South Asia] 4.9372 9.818 0.503 0.617 -14.671 24.545
region[T.Sub-Saharan Africa] 27.8448 7.360 3.783 0.000 13.145 42.544
np.log(gdp_per_capita_ppp) -13.3790 4.547 -2.942 0.005 -22.461 -4.297
CPIA_public_sector_rating -7.1417 4.387 -1.628 0.108 -15.902 1.619
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To change how standard errors are calculated, we use the .get_robustcov_results() method. For heteroskedastic robust standard errors, for example, we simply use the cov_type keyword argument and pass our preferred method for calculating the errors. Here’s a code snipped for HC2, for example:

model_w_robust_ses = corruption_model.get_robustcov_results(cov_type="HC2")
model_w_robust_ses.summary()
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 48.92
Date: Wed, 05 Jun 2024 Prob (F-statistic): 1.68e-23
Time: 13:55:46 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: HC2
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 37.846 4.490 0.000 94.357 245.522
region[T.Europe and Central Asia] -15.9265 5.763 -2.764 0.007 -27.436 -4.417
region[T.Latin America and Caribbean] 1.9023 6.687 0.284 0.777 -11.453 15.257
region[T.Middle East and North Africa] 3.7668 8.304 0.454 0.652 -12.817 20.351
region[T.South Asia] 4.9372 9.361 0.527 0.600 -13.759 23.633
region[T.Sub-Saharan Africa] 27.8448 7.238 3.847 0.000 13.389 42.300
np.log(gdp_per_capita_ppp) -13.3790 4.550 -2.941 0.005 -22.465 -4.293
CPIA_public_sector_rating -7.1417 3.966 -1.801 0.076 -15.063 0.779
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)

Clustering standard errors is accomplished by similar means, although one must pass a vector of group identifiers on which to cluster.

(Make sure to drop any rows from the original data that have missing observations that would have been dropped from the original regression before passing a single variable as group identifiers).

model_w_clusters = corruption_model.get_robustcov_results(
    cov_type="cluster", groups=wdi.dropna().region
)
model_w_clusters.summary()
/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2
  warnings.warn('covariance of constraints does not have full '
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 4.404
Date: Sun, 21 Jul 2024 Prob (F-statistic): 0.0789
Time: 16:45:56 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: cluster
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 16.975 10.011 0.000 126.304 213.575
region[T.Europe and Central Asia] -15.9265 1.241 -12.829 0.000 -19.118 -12.735
region[T.Latin America and Caribbean] 1.9023 0.694 2.739 0.041 0.117 3.688
region[T.Middle East and North Africa] 3.7668 2.644 1.425 0.214 -3.030 10.564
region[T.South Asia] 4.9372 0.719 6.869 0.001 3.090 6.785
region[T.Sub-Saharan Africa] 27.8448 1.385 20.098 0.000 24.283 31.406
np.log(gdp_per_capita_ppp) -13.3790 2.727 -4.906 0.004 -20.389 -6.369
CPIA_public_sector_rating -7.1417 2.067 -3.455 0.018 -12.455 -1.829
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors are robust to cluster correlation (cluster)

Weighted Regression#

Weighted least squares is also available in statsmodels (wls is a little finicky and wants na values dropped prior to model fitting):

weighted_model = smf.wls(
    "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
    data=wdi.dropna(),
    weights=wdi.dropna()["Population, total"],
)
weighted_model.fit().summary()
WLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.394
Model: WLS Adj. R-squared: 0.329
Method: Least Squares F-statistic: 6.040
Date: Sun, 21 Jul 2024 Prob (F-statistic): 1.91e-05
Time: 16:52:11 Log-Likelihood: -382.90
No. Observations: 73 AIC: 781.8
Df Residuals: 65 BIC: 800.1
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.0137 44.372 -0.023 0.982 -89.630 87.602
region[T.Europe and Central Asia] -7.5259 18.115 -0.415 0.679 -43.705 28.653
region[T.Latin America and Caribbean] 6.5553 19.604 0.334 0.739 -32.597 45.707
region[T.Middle East and North Africa] 23.9896 24.400 0.983 0.329 -24.741 72.720
region[T.South Asia] 24.0141 9.828 2.443 0.017 4.386 43.643
region[T.Sub-Saharan Africa] 48.9199 9.844 4.970 0.000 29.261 68.579
np.log(gdp_per_capita_ppp) 3.8200 5.471 0.698 0.488 -7.106 14.746
CPIA_public_sector_rating 1.1358 6.601 0.172 0.864 -12.048 14.320
Omnibus: 5.883 Durbin-Watson: 1.809
Prob(Omnibus): 0.053 Jarque-Bera (JB): 6.686
Skew: 0.337 Prob(JB): 0.0353
Kurtosis: 4.320 Cond. No. 143.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Post-Regression Testing#

statsmodels also provides a flexible syntax for post-regression testing. To test whether the FE for South Asia and the Middle East and North Africa are the same, we just run:

hypotheses = "region[T.South Asia] = region[T.Middle East and North Africa]"
corruption_model.t_test(hypotheses)
<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             1.1703     23.625      0.050      0.961     -46.013      48.354
==============================================================================

Generalized Linear Models and Generalized Additive Models#

Finally, for those interested in going beyond standard linear regression, statsmodels supports both Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs).

You can read the documentation for Generalized Linear Models, including logits, probits, poisson, binomial, etc., here.

Documentation for Generalized Additive Models can be found here, although users interested in GAMs may also wish to look into pyGAM.