OPTIONAL: Beyond The Basic Model

OPTIONAL: Beyond The Basic Model#

It’s our hope that the last two readings will be accessible to anyone who has gotten this far in our specialization, regardless of your prior familiarity with linear regression.

In this reading, however, we will provide an overview of some of the more advanced functionality provided by statsmodels. The purpose of this is to provide readers who are used to working with linear regressions in another programming language (like R or Stata) with a quick introduction to the syntax for doing tasks that are commonly used in practice but which we do not have the space to explain in this course.

In particular, in this reading we will discuss different types of standard errors (e.g., clustered and heteroskedastic robust standard errors)

Heteroskedastic Robust and Clustered Standard Errors#

One of the most common modifications to a standard linear regression is the use of heteroskedastic robust and clustered standard errors, and these are easy to use in statsmodels.

To illustrate, let’s begin with a simple regression:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

pd.set_option("mode.copy_on_write", True)

# Load data on infant mortality, gdp per capita, and
# World Bank CPIA public sector transparency, accountability,
# and corruption in the public sector scores
# (1 = low transparency and accountability, 6 = high transparency and accountability).

wdi = pd.read_csv("data/wdi_corruption.csv")

# Check one observation to get a feel for things.
wdi.sample().T

	40
country_name	Mauritania
gdp_per_capita_ppp	372.270362
CPIA_public_sector_rating	3.0
mortality_rate_under5_per_1000	84.1
Mortality rate, under-5, female (per 1,000 live births)	77.8
Mortality rate, under-5, male (per 1,000 live births)	90.2
Population, total	4046301.0
region	Sub-Saharan Africa

# Fit model
corruption_model = smf.ols(
    "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
    data=wdi,
).fit()

# Get regression result
corruption_model.summary()

OLS Regression Results
Dep. Variable:	mortality_rate_under5_per_1000	R-squared:	0.586
Model:	OLS	Adj. R-squared:	0.541
Method:	Least Squares	F-statistic:	13.12
Date:	Sun, 21 Jul 2024	Prob (F-statistic):	2.11e-10
Time:	16:39:55	Log-Likelihood:	-322.68
No. Observations:	73	AIC:	661.4
Df Residuals:	65	BIC:	679.7
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	169.9397	36.430	4.665	0.000	97.183	242.696
region[T.Europe and Central Asia]	-15.9265	12.304	-1.294	0.200	-40.499	8.646
region[T.Latin America and Caribbean]	1.9023	9.226	0.206	0.837	-16.523	20.327
region[T.Middle East and North Africa]	3.7668	23.057	0.163	0.871	-42.280	49.814
region[T.South Asia]	4.9372	9.818	0.503	0.617	-14.671	24.545
region[T.Sub-Saharan Africa]	27.8448	7.360	3.783	0.000	13.145	42.544
np.log(gdp_per_capita_ppp)	-13.3790	4.547	-2.942	0.005	-22.461	-4.297
CPIA_public_sector_rating	-7.1417	4.387	-1.628	0.108	-15.902	1.619

Omnibus:	4.467	Durbin-Watson:	1.617
Prob(Omnibus):	0.107	Jarque-Bera (JB):	4.375
Skew:	0.592	Prob(JB):	0.112
Kurtosis:	2.813	Cond. No.	128.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To change how standard errors are calculated, we use the .get_robustcov_results() method. For heteroskedastic robust standard errors, for example, we simply use the cov_type keyword argument and pass our preferred method for calculating the errors. Here’s a code snipped for HC2, for example:

model_w_robust_ses = corruption_model.get_robustcov_results(cov_type="HC2")
model_w_robust_ses.summary()

OLS Regression Results
Dep. Variable:	mortality_rate_under5_per_1000	R-squared:	0.586
Model:	OLS	Adj. R-squared:	0.541
Method:	Least Squares	F-statistic:	48.92
Date:	Wed, 05 Jun 2024	Prob (F-statistic):	1.68e-23
Time:	13:55:46	Log-Likelihood:	-322.68
No. Observations:	73	AIC:	661.4
Df Residuals:	65	BIC:	679.7
Df Model:	7
Covariance Type:	HC2

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	169.9397	37.846	4.490	0.000	94.357	245.522
region[T.Europe and Central Asia]	-15.9265	5.763	-2.764	0.007	-27.436	-4.417
region[T.Latin America and Caribbean]	1.9023	6.687	0.284	0.777	-11.453	15.257
region[T.Middle East and North Africa]	3.7668	8.304	0.454	0.652	-12.817	20.351
region[T.South Asia]	4.9372	9.361	0.527	0.600	-13.759	23.633
region[T.Sub-Saharan Africa]	27.8448	7.238	3.847	0.000	13.389	42.300
np.log(gdp_per_capita_ppp)	-13.3790	4.550	-2.941	0.005	-22.465	-4.293
CPIA_public_sector_rating	-7.1417	3.966	-1.801	0.076	-15.063	0.779

Omnibus:	4.467	Durbin-Watson:	1.617
Prob(Omnibus):	0.107	Jarque-Bera (JB):	4.375
Skew:	0.592	Prob(JB):	0.112
Kurtosis:	2.813	Cond. No.	128.

Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)

Clustering standard errors is accomplished by similar means, although one must pass a vector of group identifiers on which to cluster.

(Make sure to drop any rows from the original data that have missing observations that would have been dropped from the original regression before passing a single variable as group identifiers).

model_w_clusters = corruption_model.get_robustcov_results(
    cov_type="cluster", groups=wdi.dropna().region
)
model_w_clusters.summary()

/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2
  warnings.warn('covariance of constraints does not have full '

OLS Regression Results
Dep. Variable:	mortality_rate_under5_per_1000	R-squared:	0.586
Model:	OLS	Adj. R-squared:	0.541
Method:	Least Squares	F-statistic:	4.404
Date:	Sun, 21 Jul 2024	Prob (F-statistic):	0.0789
Time:	16:45:56	Log-Likelihood:	-322.68
No. Observations:	73	AIC:	661.4
Df Residuals:	65	BIC:	679.7
Df Model:	7
Covariance Type:	cluster

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	169.9397	16.975	10.011	0.000	126.304	213.575
region[T.Europe and Central Asia]	-15.9265	1.241	-12.829	0.000	-19.118	-12.735
region[T.Latin America and Caribbean]	1.9023	0.694	2.739	0.041	0.117	3.688
region[T.Middle East and North Africa]	3.7668	2.644	1.425	0.214	-3.030	10.564
region[T.South Asia]	4.9372	0.719	6.869	0.001	3.090	6.785
region[T.Sub-Saharan Africa]	27.8448	1.385	20.098	0.000	24.283	31.406
np.log(gdp_per_capita_ppp)	-13.3790	2.727	-4.906	0.004	-20.389	-6.369
CPIA_public_sector_rating	-7.1417	2.067	-3.455	0.018	-12.455	-1.829

Omnibus:	4.467	Durbin-Watson:	1.617
Prob(Omnibus):	0.107	Jarque-Bera (JB):	4.375
Skew:	0.592	Prob(JB):	0.112
Kurtosis:	2.813	Cond. No.	128.

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)

Weighted Regression#

Weighted least squares is also available in statsmodels (wls is a little finicky and wants na values dropped prior to model fitting):

weighted_model = smf.wls(
    "mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region",
    data=wdi.dropna(),
    weights=wdi.dropna()["Population, total"],
)
weighted_model.fit().summary()

WLS Regression Results
Dep. Variable:	mortality_rate_under5_per_1000	R-squared:	0.394
Model:	WLS	Adj. R-squared:	0.329
Method:	Least Squares	F-statistic:	6.040
Date:	Sun, 21 Jul 2024	Prob (F-statistic):	1.91e-05
Time:	16:52:11	Log-Likelihood:	-382.90
No. Observations:	73	AIC:	781.8
Df Residuals:	65	BIC:	800.1
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-1.0137	44.372	-0.023	0.982	-89.630	87.602
region[T.Europe and Central Asia]	-7.5259	18.115	-0.415	0.679	-43.705	28.653
region[T.Latin America and Caribbean]	6.5553	19.604	0.334	0.739	-32.597	45.707
region[T.Middle East and North Africa]	23.9896	24.400	0.983	0.329	-24.741	72.720
region[T.South Asia]	24.0141	9.828	2.443	0.017	4.386	43.643
region[T.Sub-Saharan Africa]	48.9199	9.844	4.970	0.000	29.261	68.579
np.log(gdp_per_capita_ppp)	3.8200	5.471	0.698	0.488	-7.106	14.746
CPIA_public_sector_rating	1.1358	6.601	0.172	0.864	-12.048	14.320

Omnibus:	5.883	Durbin-Watson:	1.809
Prob(Omnibus):	0.053	Jarque-Bera (JB):	6.686
Skew:	0.337	Prob(JB):	0.0353
Kurtosis:	4.320	Cond. No.	143.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Post-Regression Testing#

statsmodels also provides a flexible syntax for post-regression testing. To test whether the FE for South Asia and the Middle East and North Africa are the same, we just run:

hypotheses = "region[T.South Asia] = region[T.Middle East and North Africa]"
corruption_model.t_test(hypotheses)

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             1.1703     23.625      0.050      0.961     -46.013      48.354
==============================================================================

Generalized Linear Models and Generalized Additive Models#

Finally, for those interested in going beyond standard linear regression, statsmodels supports both Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs).

You can read the documentation for Generalized Linear Models, including logits, probits, poisson, binomial, etc., here.

Documentation for Generalized Additive Models can be found here, although users interested in GAMs may also wish to look into pyGAM.