OPTIONAL: Beyond The Basic Model#
It’s our hope that the last two readings will be accessible to anyone who has gotten this far in our specialization, regardless of your prior familiarity with linear regression.
In this reading, however, we will provide an overview of some of the more advanced functionality provided by statsmodels
. The purpose of this is to provide readers who are used to working with linear regressions in another programming language (like R or Stata) with a quick introduction to the syntax for doing tasks that are commonly used in practice but which we do not have the space to explain in this course.
In particular, in this reading we will discuss different types of standard errors (e.g., clustered and heteroskedastic robust standard errors)
Heteroskedastic Robust and Clustered Standard Errors#
One of the most common modifications to a standard linear regression is the use of heteroskedastic robust and clustered standard errors, and these are easy to use in statsmodels
.
To illustrate, let’s begin with a simple regression:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
pd.set_option("mode.copy_on_write", True)
# Load data on infant mortality, gdp per capita, and
# World Bank CPIA public sector transparency, accountability,
# and corruption in the public sector scores
# (1 = low transparency and accountability, 6 = high transparency and accountability).
wdi = pd.read_csv("data/wdi_corruption.csv")
# Check one observation to get a feel for things.
wdi.sample().T
40 | |
---|---|
country_name | Mauritania |
gdp_per_capita_ppp | 372.270362 |
CPIA_public_sector_rating | 3.0 |
mortality_rate_under5_per_1000 | 84.1 |
Mortality rate, under-5, female (per 1,000 live births) | 77.8 |
Mortality rate, under-5, male (per 1,000 live births) | 90.2 |
Population, total | 4046301.0 |
region | Sub-Saharan Africa |
# Fit model
corruption_model = smf.ols(
"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
data=wdi,
).fit()
# Get regression result
corruption_model.summary()
Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.541 |
Method: | Least Squares | F-statistic: | 13.12 |
Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 2.11e-10 |
Time: | 16:39:55 | Log-Likelihood: | -322.68 |
No. Observations: | 73 | AIC: | 661.4 |
Df Residuals: | 65 | BIC: | 679.7 |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 169.9397 | 36.430 | 4.665 | 0.000 | 97.183 | 242.696 |
region[T.Europe and Central Asia] | -15.9265 | 12.304 | -1.294 | 0.200 | -40.499 | 8.646 |
region[T.Latin America and Caribbean] | 1.9023 | 9.226 | 0.206 | 0.837 | -16.523 | 20.327 |
region[T.Middle East and North Africa] | 3.7668 | 23.057 | 0.163 | 0.871 | -42.280 | 49.814 |
region[T.South Asia] | 4.9372 | 9.818 | 0.503 | 0.617 | -14.671 | 24.545 |
region[T.Sub-Saharan Africa] | 27.8448 | 7.360 | 3.783 | 0.000 | 13.145 | 42.544 |
np.log(gdp_per_capita_ppp) | -13.3790 | 4.547 | -2.942 | 0.005 | -22.461 | -4.297 |
CPIA_public_sector_rating | -7.1417 | 4.387 | -1.628 | 0.108 | -15.902 | 1.619 |
Omnibus: | 4.467 | Durbin-Watson: | 1.617 |
---|---|---|---|
Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 |
Skew: | 0.592 | Prob(JB): | 0.112 |
Kurtosis: | 2.813 | Cond. No. | 128. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To change how standard errors are calculated, we use the .get_robustcov_results()
method. For heteroskedastic robust standard errors, for example, we simply use the cov_type
keyword argument and pass our preferred method for calculating the errors. Here’s a code snipped for HC2, for example:
model_w_robust_ses = corruption_model.get_robustcov_results(cov_type="HC2")
model_w_robust_ses.summary()
Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.541 |
Method: | Least Squares | F-statistic: | 48.92 |
Date: | Wed, 05 Jun 2024 | Prob (F-statistic): | 1.68e-23 |
Time: | 13:55:46 | Log-Likelihood: | -322.68 |
No. Observations: | 73 | AIC: | 661.4 |
Df Residuals: | 65 | BIC: | 679.7 |
Df Model: | 7 | ||
Covariance Type: | HC2 |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 169.9397 | 37.846 | 4.490 | 0.000 | 94.357 | 245.522 |
region[T.Europe and Central Asia] | -15.9265 | 5.763 | -2.764 | 0.007 | -27.436 | -4.417 |
region[T.Latin America and Caribbean] | 1.9023 | 6.687 | 0.284 | 0.777 | -11.453 | 15.257 |
region[T.Middle East and North Africa] | 3.7668 | 8.304 | 0.454 | 0.652 | -12.817 | 20.351 |
region[T.South Asia] | 4.9372 | 9.361 | 0.527 | 0.600 | -13.759 | 23.633 |
region[T.Sub-Saharan Africa] | 27.8448 | 7.238 | 3.847 | 0.000 | 13.389 | 42.300 |
np.log(gdp_per_capita_ppp) | -13.3790 | 4.550 | -2.941 | 0.005 | -22.465 | -4.293 |
CPIA_public_sector_rating | -7.1417 | 3.966 | -1.801 | 0.076 | -15.063 | 0.779 |
Omnibus: | 4.467 | Durbin-Watson: | 1.617 |
---|---|---|---|
Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 |
Skew: | 0.592 | Prob(JB): | 0.112 |
Kurtosis: | 2.813 | Cond. No. | 128. |
Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)
Clustering standard errors is accomplished by similar means, although one must pass a vector of group identifiers on which to cluster.
(Make sure to drop any rows from the original data that have missing observations that would have been dropped from the original regression before passing a single variable as group identifiers).
model_w_clusters = corruption_model.get_robustcov_results(
cov_type="cluster", groups=wdi.dropna().region
)
model_w_clusters.summary()
/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2
warnings.warn('covariance of constraints does not have full '
Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.541 |
Method: | Least Squares | F-statistic: | 4.404 |
Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 0.0789 |
Time: | 16:45:56 | Log-Likelihood: | -322.68 |
No. Observations: | 73 | AIC: | 661.4 |
Df Residuals: | 65 | BIC: | 679.7 |
Df Model: | 7 | ||
Covariance Type: | cluster |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 169.9397 | 16.975 | 10.011 | 0.000 | 126.304 | 213.575 |
region[T.Europe and Central Asia] | -15.9265 | 1.241 | -12.829 | 0.000 | -19.118 | -12.735 |
region[T.Latin America and Caribbean] | 1.9023 | 0.694 | 2.739 | 0.041 | 0.117 | 3.688 |
region[T.Middle East and North Africa] | 3.7668 | 2.644 | 1.425 | 0.214 | -3.030 | 10.564 |
region[T.South Asia] | 4.9372 | 0.719 | 6.869 | 0.001 | 3.090 | 6.785 |
region[T.Sub-Saharan Africa] | 27.8448 | 1.385 | 20.098 | 0.000 | 24.283 | 31.406 |
np.log(gdp_per_capita_ppp) | -13.3790 | 2.727 | -4.906 | 0.004 | -20.389 | -6.369 |
CPIA_public_sector_rating | -7.1417 | 2.067 | -3.455 | 0.018 | -12.455 | -1.829 |
Omnibus: | 4.467 | Durbin-Watson: | 1.617 |
---|---|---|---|
Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 |
Skew: | 0.592 | Prob(JB): | 0.112 |
Kurtosis: | 2.813 | Cond. No. | 128. |
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
Weighted Regression#
Weighted least squares is also available in statsmodels
(wls is a little finicky and wants na
values dropped prior to model fitting):
weighted_model = smf.wls(
"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
data=wdi.dropna(),
weights=wdi.dropna()["Population, total"],
)
weighted_model.fit().summary()
Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.394 |
---|---|---|---|
Model: | WLS | Adj. R-squared: | 0.329 |
Method: | Least Squares | F-statistic: | 6.040 |
Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 1.91e-05 |
Time: | 16:52:11 | Log-Likelihood: | -382.90 |
No. Observations: | 73 | AIC: | 781.8 |
Df Residuals: | 65 | BIC: | 800.1 |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -1.0137 | 44.372 | -0.023 | 0.982 | -89.630 | 87.602 |
region[T.Europe and Central Asia] | -7.5259 | 18.115 | -0.415 | 0.679 | -43.705 | 28.653 |
region[T.Latin America and Caribbean] | 6.5553 | 19.604 | 0.334 | 0.739 | -32.597 | 45.707 |
region[T.Middle East and North Africa] | 23.9896 | 24.400 | 0.983 | 0.329 | -24.741 | 72.720 |
region[T.South Asia] | 24.0141 | 9.828 | 2.443 | 0.017 | 4.386 | 43.643 |
region[T.Sub-Saharan Africa] | 48.9199 | 9.844 | 4.970 | 0.000 | 29.261 | 68.579 |
np.log(gdp_per_capita_ppp) | 3.8200 | 5.471 | 0.698 | 0.488 | -7.106 | 14.746 |
CPIA_public_sector_rating | 1.1358 | 6.601 | 0.172 | 0.864 | -12.048 | 14.320 |
Omnibus: | 5.883 | Durbin-Watson: | 1.809 |
---|---|---|---|
Prob(Omnibus): | 0.053 | Jarque-Bera (JB): | 6.686 |
Skew: | 0.337 | Prob(JB): | 0.0353 |
Kurtosis: | 4.320 | Cond. No. | 143. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Post-Regression Testing#
statsmodels
also provides a flexible syntax for post-regression testing. To test whether the FE for South Asia and the Middle East and North Africa are the same, we just run:
hypotheses = "region[T.South Asia] = region[T.Middle East and North Africa]"
corruption_model.t_test(hypotheses)
<class 'statsmodels.stats.contrast.ContrastResults'>
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 1.1703 23.625 0.050 0.961 -46.013 48.354
==============================================================================
Generalized Linear Models and Generalized Additive Models#
Finally, for those interested in going beyond standard linear regression, statsmodels
supports both Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs).
You can read the documentation for Generalized Linear Models, including logits, probits, poisson, binomial, etc., here.
Documentation for Generalized Additive Models can be found here, although users interested in GAMs may also wish to look into pyGAM.