{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OPTIONAL: Beyond The Basic Model\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's our hope that the last two readings will be accessible to anyone who has gotten this far in our specialization, regardless of your prior familiarity with linear regression.\n",
"\n",
"In this reading, however, we will provide an overview of some of the more advanced functionality provided by `statsmodels`. The purpose of this is to provide readers who are used to working with linear regressions in another programming language (like R or Stata) with a quick introduction to the syntax for doing tasks that are commonly used in practice but which we do not have the space to explain in this course.\n",
"\n",
"In particular, in this reading we will discuss different types of standard errors (e.g., clustered and heteroskedastic robust standard errors)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Heteroskedastic Robust and Clustered Standard Errors\n",
"\n",
"One of the most common modifications to a standard linear regression is the use of heteroskedastic robust and clustered standard errors, and these are easy to use in `statsmodels`.\n",
"\n",
"To illustrate, let's begin with a simple regression:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 40 | \n",
"
\n",
" \n",
" \n",
" \n",
" country_name | \n",
" Mauritania | \n",
"
\n",
" \n",
" gdp_per_capita_ppp | \n",
" 372.270362 | \n",
"
\n",
" \n",
" CPIA_public_sector_rating | \n",
" 3.0 | \n",
"
\n",
" \n",
" mortality_rate_under5_per_1000 | \n",
" 84.1 | \n",
"
\n",
" \n",
" Mortality rate, under-5, female (per 1,000 live births) | \n",
" 77.8 | \n",
"
\n",
" \n",
" Mortality rate, under-5, male (per 1,000 live births) | \n",
" 90.2 | \n",
"
\n",
" \n",
" Population, total | \n",
" 4046301.0 | \n",
"
\n",
" \n",
" region | \n",
" Sub-Saharan Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 40\n",
"country_name Mauritania\n",
"gdp_per_capita_ppp 372.270362\n",
"CPIA_public_sector_rating 3.0\n",
"mortality_rate_under5_per_1000 84.1\n",
"Mortality rate, under-5, female (per 1,000 live... 77.8\n",
"Mortality rate, under-5, male (per 1,000 live b... 90.2\n",
"Population, total 4046301.0\n",
"region Sub-Saharan Africa"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import statsmodels.formula.api as smf\n",
"\n",
"pd.set_option(\"mode.copy_on_write\", True)\n",
"\n",
"# Load data on infant mortality, gdp per capita, and\n",
"# World Bank CPIA public sector transparency, accountability,\n",
"# and corruption in the public sector scores\n",
"# (1 = low transparency and accountability, 6 = high transparency and accountability).\n",
"\n",
"wdi = pd.read_csv(\"data/wdi_corruption.csv\")\n",
"\n",
"# Check one observation to get a feel for things.\n",
"wdi.sample().T"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"OLS Regression Results\n",
"\n",
" Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 | \n",
"
\n",
"\n",
" Model: | OLS | Adj. R-squared: | 0.541 | \n",
"
\n",
"\n",
" Method: | Least Squares | F-statistic: | 13.12 | \n",
"
\n",
"\n",
" Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 2.11e-10 | \n",
"
\n",
"\n",
" Time: | 16:39:55 | Log-Likelihood: | -322.68 | \n",
"
\n",
"\n",
" No. Observations: | 73 | AIC: | 661.4 | \n",
"
\n",
"\n",
" Df Residuals: | 65 | BIC: | 679.7 | \n",
"
\n",
"\n",
" Df Model: | 7 | | | \n",
"
\n",
"\n",
" Covariance Type: | nonrobust | | | \n",
"
\n",
"
\n",
"\n",
"\n",
" | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
"
\n",
"\n",
" Intercept | 169.9397 | 36.430 | 4.665 | 0.000 | 97.183 | 242.696 | \n",
"
\n",
"\n",
" region[T.Europe and Central Asia] | -15.9265 | 12.304 | -1.294 | 0.200 | -40.499 | 8.646 | \n",
"
\n",
"\n",
" region[T.Latin America and Caribbean] | 1.9023 | 9.226 | 0.206 | 0.837 | -16.523 | 20.327 | \n",
"
\n",
"\n",
" region[T.Middle East and North Africa] | 3.7668 | 23.057 | 0.163 | 0.871 | -42.280 | 49.814 | \n",
"
\n",
"\n",
" region[T.South Asia] | 4.9372 | 9.818 | 0.503 | 0.617 | -14.671 | 24.545 | \n",
"
\n",
"\n",
" region[T.Sub-Saharan Africa] | 27.8448 | 7.360 | 3.783 | 0.000 | 13.145 | 42.544 | \n",
"
\n",
"\n",
" np.log(gdp_per_capita_ppp) | -13.3790 | 4.547 | -2.942 | 0.005 | -22.461 | -4.297 | \n",
"
\n",
"\n",
" CPIA_public_sector_rating | -7.1417 | 4.387 | -1.628 | 0.108 | -15.902 | 1.619 | \n",
"
\n",
"
\n",
"\n",
"\n",
" Omnibus: | 4.467 | Durbin-Watson: | 1.617 | \n",
"
\n",
"\n",
" Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 | \n",
"
\n",
"\n",
" Skew: | 0.592 | Prob(JB): | 0.112 | \n",
"
\n",
"\n",
" Kurtosis: | 2.813 | Cond. No. | 128. | \n",
"
\n",
"
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 13.12 \\\\\n",
"\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 2.11e-10 \\\\\n",
"\\textbf{Time:} & 16:39:55 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n",
"\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n",
"\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n",
"\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 169.9397 & 36.430 & 4.665 & 0.000 & 97.183 & 242.696 \\\\\n",
"\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 12.304 & -1.294 & 0.200 & -40.499 & 8.646 \\\\\n",
"\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 9.226 & 0.206 & 0.837 & -16.523 & 20.327 \\\\\n",
"\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 23.057 & 0.163 & 0.871 & -42.280 & 49.814 \\\\\n",
"\\textbf{region[T.South Asia]} & 4.9372 & 9.818 & 0.503 & 0.617 & -14.671 & 24.545 \\\\\n",
"\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.360 & 3.783 & 0.000 & 13.145 & 42.544 \\\\\n",
"\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.547 & -2.942 & 0.005 & -22.461 & -4.297 \\\\\n",
"\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 4.387 & -1.628 & 0.108 & -15.902 & 1.619 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n",
"\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n",
"\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n",
"\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==========================================================================================\n",
"Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n",
"Model: OLS Adj. R-squared: 0.541\n",
"Method: Least Squares F-statistic: 13.12\n",
"Date: Sun, 21 Jul 2024 Prob (F-statistic): 2.11e-10\n",
"Time: 16:39:55 Log-Likelihood: -322.68\n",
"No. Observations: 73 AIC: 661.4\n",
"Df Residuals: 65 BIC: 679.7\n",
"Df Model: 7 \n",
"Covariance Type: nonrobust \n",
"==========================================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"----------------------------------------------------------------------------------------------------------\n",
"Intercept 169.9397 36.430 4.665 0.000 97.183 242.696\n",
"region[T.Europe and Central Asia] -15.9265 12.304 -1.294 0.200 -40.499 8.646\n",
"region[T.Latin America and Caribbean] 1.9023 9.226 0.206 0.837 -16.523 20.327\n",
"region[T.Middle East and North Africa] 3.7668 23.057 0.163 0.871 -42.280 49.814\n",
"region[T.South Asia] 4.9372 9.818 0.503 0.617 -14.671 24.545\n",
"region[T.Sub-Saharan Africa] 27.8448 7.360 3.783 0.000 13.145 42.544\n",
"np.log(gdp_per_capita_ppp) -13.3790 4.547 -2.942 0.005 -22.461 -4.297\n",
"CPIA_public_sector_rating -7.1417 4.387 -1.628 0.108 -15.902 1.619\n",
"==============================================================================\n",
"Omnibus: 4.467 Durbin-Watson: 1.617\n",
"Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n",
"Skew: 0.592 Prob(JB): 0.112\n",
"Kurtosis: 2.813 Cond. No. 128.\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit model\n",
"corruption_model = smf.ols(\n",
" \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region\",\n",
" data=wdi,\n",
").fit()\n",
"\n",
"# Get regression result\n",
"corruption_model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To change how standard errors are calculated, we use the `.get_robustcov_results()` method. For heteroskedastic robust standard errors, for example, we simply use the `cov_type` keyword argument and pass our preferred method for calculating the errors. Here's a code snipped for HC2, for example:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"OLS Regression Results\n",
"\n",
" Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 | \n",
"
\n",
"\n",
" Model: | OLS | Adj. R-squared: | 0.541 | \n",
"
\n",
"\n",
" Method: | Least Squares | F-statistic: | 48.92 | \n",
"
\n",
"\n",
" Date: | Wed, 05 Jun 2024 | Prob (F-statistic): | 1.68e-23 | \n",
"
\n",
"\n",
" Time: | 13:55:46 | Log-Likelihood: | -322.68 | \n",
"
\n",
"\n",
" No. Observations: | 73 | AIC: | 661.4 | \n",
"
\n",
"\n",
" Df Residuals: | 65 | BIC: | 679.7 | \n",
"
\n",
"\n",
" Df Model: | 7 | | | \n",
"
\n",
"\n",
" Covariance Type: | HC2 | | | \n",
"
\n",
"
\n",
"\n",
"\n",
" | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
"
\n",
"\n",
" Intercept | 169.9397 | 37.846 | 4.490 | 0.000 | 94.357 | 245.522 | \n",
"
\n",
"\n",
" region[T.Europe and Central Asia] | -15.9265 | 5.763 | -2.764 | 0.007 | -27.436 | -4.417 | \n",
"
\n",
"\n",
" region[T.Latin America and Caribbean] | 1.9023 | 6.687 | 0.284 | 0.777 | -11.453 | 15.257 | \n",
"
\n",
"\n",
" region[T.Middle East and North Africa] | 3.7668 | 8.304 | 0.454 | 0.652 | -12.817 | 20.351 | \n",
"
\n",
"\n",
" region[T.South Asia] | 4.9372 | 9.361 | 0.527 | 0.600 | -13.759 | 23.633 | \n",
"
\n",
"\n",
" region[T.Sub-Saharan Africa] | 27.8448 | 7.238 | 3.847 | 0.000 | 13.389 | 42.300 | \n",
"
\n",
"\n",
" np.log(gdp_per_capita_ppp) | -13.3790 | 4.550 | -2.941 | 0.005 | -22.465 | -4.293 | \n",
"
\n",
"\n",
" CPIA_public_sector_rating | -7.1417 | 3.966 | -1.801 | 0.076 | -15.063 | 0.779 | \n",
"
\n",
"
\n",
"\n",
"\n",
" Omnibus: | 4.467 | Durbin-Watson: | 1.617 | \n",
"
\n",
"\n",
" Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 | \n",
"
\n",
"\n",
" Skew: | 0.592 | Prob(JB): | 0.112 | \n",
"
\n",
"\n",
" Kurtosis: | 2.813 | Cond. No. | 128. | \n",
"
\n",
"
Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)"
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 48.92 \\\\\n",
"\\textbf{Date:} & Wed, 05 Jun 2024 & \\textbf{ Prob (F-statistic):} & 1.68e-23 \\\\\n",
"\\textbf{Time:} & 13:55:46 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n",
"\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n",
"\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n",
"\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & HC2 & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 169.9397 & 37.846 & 4.490 & 0.000 & 94.357 & 245.522 \\\\\n",
"\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 5.763 & -2.764 & 0.007 & -27.436 & -4.417 \\\\\n",
"\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 6.687 & 0.284 & 0.777 & -11.453 & 15.257 \\\\\n",
"\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 8.304 & 0.454 & 0.652 & -12.817 & 20.351 \\\\\n",
"\\textbf{region[T.South Asia]} & 4.9372 & 9.361 & 0.527 & 0.600 & -13.759 & 23.633 \\\\\n",
"\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.238 & 3.847 & 0.000 & 13.389 & 42.300 \\\\\n",
"\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.550 & -2.941 & 0.005 & -22.465 & -4.293 \\\\\n",
"\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 3.966 & -1.801 & 0.076 & -15.063 & 0.779 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n",
"\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n",
"\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n",
"\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors are heteroscedasticity robust (HC2)"
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==========================================================================================\n",
"Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n",
"Model: OLS Adj. R-squared: 0.541\n",
"Method: Least Squares F-statistic: 48.92\n",
"Date: Wed, 05 Jun 2024 Prob (F-statistic): 1.68e-23\n",
"Time: 13:55:46 Log-Likelihood: -322.68\n",
"No. Observations: 73 AIC: 661.4\n",
"Df Residuals: 65 BIC: 679.7\n",
"Df Model: 7 \n",
"Covariance Type: HC2 \n",
"==========================================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"----------------------------------------------------------------------------------------------------------\n",
"Intercept 169.9397 37.846 4.490 0.000 94.357 245.522\n",
"region[T.Europe and Central Asia] -15.9265 5.763 -2.764 0.007 -27.436 -4.417\n",
"region[T.Latin America and Caribbean] 1.9023 6.687 0.284 0.777 -11.453 15.257\n",
"region[T.Middle East and North Africa] 3.7668 8.304 0.454 0.652 -12.817 20.351\n",
"region[T.South Asia] 4.9372 9.361 0.527 0.600 -13.759 23.633\n",
"region[T.Sub-Saharan Africa] 27.8448 7.238 3.847 0.000 13.389 42.300\n",
"np.log(gdp_per_capita_ppp) -13.3790 4.550 -2.941 0.005 -22.465 -4.293\n",
"CPIA_public_sector_rating -7.1417 3.966 -1.801 0.076 -15.063 0.779\n",
"==============================================================================\n",
"Omnibus: 4.467 Durbin-Watson: 1.617\n",
"Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n",
"Skew: 0.592 Prob(JB): 0.112\n",
"Kurtosis: 2.813 Cond. No. 128.\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors are heteroscedasticity robust (HC2)\n",
"\"\"\""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_w_robust_ses = corruption_model.get_robustcov_results(cov_type=\"HC2\")\n",
"model_w_robust_ses.summary()"
]
},
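{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what HC2 standard errors compute, using simulated data (the data, seed, and variable names below are assumptions made up for illustration and are not part of the WDI example). It builds the HC2 sandwich estimate by hand with `numpy`, compares it with the `HC2_se` attribute of the fitted results, and also requests the same standard errors directly through `.fit(cov_type=\"HC2\")`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import statsmodels.api as sm\n",
"\n",
"# Simulated (hypothetical) data whose error variance grows with |x|.\n",
"rng = np.random.default_rng(0)\n",
"n = 200\n",
"x = rng.normal(size=n)\n",
"y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))\n",
"\n",
"X = sm.add_constant(x)\n",
"fit = sm.OLS(y, X).fit()\n",
"\n",
"# HC2 by hand: (X'X)^-1 X' diag(e_i^2 / (1 - h_ii)) X (X'X)^-1\n",
"XtX_inv = np.linalg.inv(X.T @ X)\n",
"leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # h_ii for each row\n",
"omega = fit.resid**2 / (1.0 - leverage)\n",
"meat = X.T @ (X * omega[:, None])\n",
"manual_hc2_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))\n",
"\n",
"# The same standard errors, requested when fitting.\n",
"robust_fit = sm.OLS(y, X).fit(cov_type='HC2')\n",
"\n",
"print(manual_hc2_se)\n",
"print(fit.HC2_se)\n",
"print(robust_fit.bse)\n",
"```\n",
"\n",
"The three sets of standard errors should agree up to floating-point error. Asking for the covariance type in `.fit()` is a convenient alternative when you know in advance which standard errors you want."
]
},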
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clustering standard errors is accomplished by similar means, although one must pass a vector of group identifiers on which to cluster. \n",
"\n",
"(Make sure to drop any rows from the original data that have missing observations that would have been dropped from the original regression before passing a single variable as group identifiers). "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2\n",
" warnings.warn('covariance of constraints does not have full '\n"
]
},
{
"data": {
"text/html": [
"\n",
"OLS Regression Results\n",
"\n",
" Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.586 | \n",
"
\n",
"\n",
" Model: | OLS | Adj. R-squared: | 0.541 | \n",
"
\n",
"\n",
" Method: | Least Squares | F-statistic: | 4.404 | \n",
"
\n",
"\n",
" Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 0.0789 | \n",
"
\n",
"\n",
" Time: | 16:45:56 | Log-Likelihood: | -322.68 | \n",
"
\n",
"\n",
" No. Observations: | 73 | AIC: | 661.4 | \n",
"
\n",
"\n",
" Df Residuals: | 65 | BIC: | 679.7 | \n",
"
\n",
"\n",
" Df Model: | 7 | | | \n",
"
\n",
"\n",
" Covariance Type: | cluster | | | \n",
"
\n",
"
\n",
"\n",
"\n",
" | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
"
\n",
"\n",
" Intercept | 169.9397 | 16.975 | 10.011 | 0.000 | 126.304 | 213.575 | \n",
"
\n",
"\n",
" region[T.Europe and Central Asia] | -15.9265 | 1.241 | -12.829 | 0.000 | -19.118 | -12.735 | \n",
"
\n",
"\n",
" region[T.Latin America and Caribbean] | 1.9023 | 0.694 | 2.739 | 0.041 | 0.117 | 3.688 | \n",
"
\n",
"\n",
" region[T.Middle East and North Africa] | 3.7668 | 2.644 | 1.425 | 0.214 | -3.030 | 10.564 | \n",
"
\n",
"\n",
" region[T.South Asia] | 4.9372 | 0.719 | 6.869 | 0.001 | 3.090 | 6.785 | \n",
"
\n",
"\n",
" region[T.Sub-Saharan Africa] | 27.8448 | 1.385 | 20.098 | 0.000 | 24.283 | 31.406 | \n",
"
\n",
"\n",
" np.log(gdp_per_capita_ppp) | -13.3790 | 2.727 | -4.906 | 0.004 | -20.389 | -6.369 | \n",
"
\n",
"\n",
" CPIA_public_sector_rating | -7.1417 | 2.067 | -3.455 | 0.018 | -12.455 | -1.829 | \n",
"
\n",
"
\n",
"\n",
"\n",
" Omnibus: | 4.467 | Durbin-Watson: | 1.617 | \n",
"
\n",
"\n",
" Prob(Omnibus): | 0.107 | Jarque-Bera (JB): | 4.375 | \n",
"
\n",
"\n",
" Skew: | 0.592 | Prob(JB): | 0.112 | \n",
"
\n",
"\n",
" Kurtosis: | 2.813 | Cond. No. | 128. | \n",
"
\n",
"
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)"
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n",
"\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 4.404 \\\\\n",
"\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 0.0789 \\\\\n",
"\\textbf{Time:} & 16:45:56 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n",
"\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n",
"\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n",
"\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & cluster & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & 169.9397 & 16.975 & 10.011 & 0.000 & 126.304 & 213.575 \\\\\n",
"\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 1.241 & -12.829 & 0.000 & -19.118 & -12.735 \\\\\n",
"\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 0.694 & 2.739 & 0.041 & 0.117 & 3.688 \\\\\n",
"\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 2.644 & 1.425 & 0.214 & -3.030 & 10.564 \\\\\n",
"\\textbf{region[T.South Asia]} & 4.9372 & 0.719 & 6.869 & 0.001 & 3.090 & 6.785 \\\\\n",
"\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 1.385 & 20.098 & 0.000 & 24.283 & 31.406 \\\\\n",
"\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 2.727 & -4.906 & 0.004 & -20.389 & -6.369 \\\\\n",
"\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 2.067 & -3.455 & 0.018 & -12.455 & -1.829 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n",
"\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n",
"\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n",
"\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{OLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors are robust to cluster correlation (cluster)"
],
"text/plain": [
"\n",
"\"\"\"\n",
" OLS Regression Results \n",
"==========================================================================================\n",
"Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n",
"Model: OLS Adj. R-squared: 0.541\n",
"Method: Least Squares F-statistic: 4.404\n",
"Date: Sun, 21 Jul 2024 Prob (F-statistic): 0.0789\n",
"Time: 16:45:56 Log-Likelihood: -322.68\n",
"No. Observations: 73 AIC: 661.4\n",
"Df Residuals: 65 BIC: 679.7\n",
"Df Model: 7 \n",
"Covariance Type: cluster \n",
"==========================================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"----------------------------------------------------------------------------------------------------------\n",
"Intercept 169.9397 16.975 10.011 0.000 126.304 213.575\n",
"region[T.Europe and Central Asia] -15.9265 1.241 -12.829 0.000 -19.118 -12.735\n",
"region[T.Latin America and Caribbean] 1.9023 0.694 2.739 0.041 0.117 3.688\n",
"region[T.Middle East and North Africa] 3.7668 2.644 1.425 0.214 -3.030 10.564\n",
"region[T.South Asia] 4.9372 0.719 6.869 0.001 3.090 6.785\n",
"region[T.Sub-Saharan Africa] 27.8448 1.385 20.098 0.000 24.283 31.406\n",
"np.log(gdp_per_capita_ppp) -13.3790 2.727 -4.906 0.004 -20.389 -6.369\n",
"CPIA_public_sector_rating -7.1417 2.067 -3.455 0.018 -12.455 -1.829\n",
"==============================================================================\n",
"Omnibus: 4.467 Durbin-Watson: 1.617\n",
"Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n",
"Skew: 0.592 Prob(JB): 0.112\n",
"Kurtosis: 2.813 Cond. No. 128.\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors are robust to cluster correlation (cluster)\n",
"\"\"\""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_w_clusters = corruption_model.get_robustcov_results(\n",
" cov_type=\"cluster\", groups=wdi.dropna().region\n",
")\n",
"model_w_clusters.summary()"
]
},
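{
"cell_type": "markdown",
"metadata": {},
"source": [
"The alignment issue can be seen in a small simulated example (all names and data here are hypothetical, made up for illustration). `dropna()` with no arguments drops a row if *any* column is missing, which can remove more rows than the regression did, so this sketch restricts the check to the columns the model uses via `subset=` and confirms the row count against `nobs`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Simulated (hypothetical) data: 30 clusters of 10 observations each.\n",
"rng = np.random.default_rng(1)\n",
"n_clusters, per_cluster = 30, 10\n",
"df = pd.DataFrame(\n",
"    {\n",
"        'cluster_id': np.repeat(np.arange(n_clusters), per_cluster),\n",
"        'x': rng.normal(size=n_clusters * per_cluster),\n",
"        'unused': np.nan,  # entirely missing, but not part of the model\n",
"    }\n",
")\n",
"cluster_shock = rng.normal(size=n_clusters)\n",
"df['y'] = (\n",
"    1.0\n",
"    + 0.5 * df['x']\n",
"    + cluster_shock[df['cluster_id'].to_numpy()]\n",
"    + rng.normal(size=len(df))\n",
")\n",
"# Make a few values of x missing so the regression drops those rows.\n",
"df.loc[df.index % 25 == 0, 'x'] = np.nan\n",
"\n",
"fit = smf.ols('y ~ x', data=df).fit()\n",
"\n",
"# Keep only rows that are complete in the columns the model needs.\n",
"sample = df.dropna(subset=['y', 'x', 'cluster_id'])\n",
"print(len(df.dropna()), len(sample), int(fit.nobs))\n",
"\n",
"clustered = fit.get_robustcov_results(\n",
"    cov_type='cluster', groups=sample['cluster_id'].to_numpy()\n",
")\n",
"print(clustered.bse)\n",
"```\n",
"\n",
"Comparing the length of the group vector with `nobs` before clustering is a cheap way to catch a mismatch between the rows in the regression and the rows in the grouping variable."
]
},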
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Weighted Regression\n",
"\n",
"Weighted least squares is also available in `statsmodels` (wls is a little finicky and wants `na` values dropped prior to model fitting):"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"WLS Regression Results\n",
"\n",
" Dep. Variable: | mortality_rate_under5_per_1000 | R-squared: | 0.394 | \n",
"
\n",
"\n",
" Model: | WLS | Adj. R-squared: | 0.329 | \n",
"
\n",
"\n",
" Method: | Least Squares | F-statistic: | 6.040 | \n",
"
\n",
"\n",
" Date: | Sun, 21 Jul 2024 | Prob (F-statistic): | 1.91e-05 | \n",
"
\n",
"\n",
" Time: | 16:52:11 | Log-Likelihood: | -382.90 | \n",
"
\n",
"\n",
" No. Observations: | 73 | AIC: | 781.8 | \n",
"
\n",
"\n",
" Df Residuals: | 65 | BIC: | 800.1 | \n",
"
\n",
"\n",
" Df Model: | 7 | | | \n",
"
\n",
"\n",
" Covariance Type: | nonrobust | | | \n",
"
\n",
"
\n",
"\n",
"\n",
" | coef | std err | t | P>|t| | [0.025 | 0.975] | \n",
"
\n",
"\n",
" Intercept | -1.0137 | 44.372 | -0.023 | 0.982 | -89.630 | 87.602 | \n",
"
\n",
"\n",
" region[T.Europe and Central Asia] | -7.5259 | 18.115 | -0.415 | 0.679 | -43.705 | 28.653 | \n",
"
\n",
"\n",
" region[T.Latin America and Caribbean] | 6.5553 | 19.604 | 0.334 | 0.739 | -32.597 | 45.707 | \n",
"
\n",
"\n",
" region[T.Middle East and North Africa] | 23.9896 | 24.400 | 0.983 | 0.329 | -24.741 | 72.720 | \n",
"
\n",
"\n",
" region[T.South Asia] | 24.0141 | 9.828 | 2.443 | 0.017 | 4.386 | 43.643 | \n",
"
\n",
"\n",
" region[T.Sub-Saharan Africa] | 48.9199 | 9.844 | 4.970 | 0.000 | 29.261 | 68.579 | \n",
"
\n",
"\n",
" np.log(gdp_per_capita_ppp) | 3.8200 | 5.471 | 0.698 | 0.488 | -7.106 | 14.746 | \n",
"
\n",
"\n",
" CPIA_public_sector_rating | 1.1358 | 6.601 | 0.172 | 0.864 | -12.048 | 14.320 | \n",
"
\n",
"
\n",
"\n",
"\n",
" Omnibus: | 5.883 | Durbin-Watson: | 1.809 | \n",
"
\n",
"\n",
" Prob(Omnibus): | 0.053 | Jarque-Bera (JB): | 6.686 | \n",
"
\n",
"\n",
" Skew: | 0.337 | Prob(JB): | 0.0353 | \n",
"
\n",
"\n",
" Kurtosis: | 4.320 | Cond. No. | 143. | \n",
"
\n",
"
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/latex": [
"\\begin{center}\n",
"\\begin{tabular}{lclc}\n",
"\\toprule\n",
"\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.394 \\\\\n",
"\\textbf{Model:} & WLS & \\textbf{ Adj. R-squared: } & 0.329 \\\\\n",
"\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 6.040 \\\\\n",
"\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 1.91e-05 \\\\\n",
"\\textbf{Time:} & 16:52:11 & \\textbf{ Log-Likelihood: } & -382.90 \\\\\n",
"\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 781.8 \\\\\n",
"\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 800.1 \\\\\n",
"\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n",
"\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lcccccc}\n",
" & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n",
"\\midrule\n",
"\\textbf{Intercept} & -1.0137 & 44.372 & -0.023 & 0.982 & -89.630 & 87.602 \\\\\n",
"\\textbf{region[T.Europe and Central Asia]} & -7.5259 & 18.115 & -0.415 & 0.679 & -43.705 & 28.653 \\\\\n",
"\\textbf{region[T.Latin America and Caribbean]} & 6.5553 & 19.604 & 0.334 & 0.739 & -32.597 & 45.707 \\\\\n",
"\\textbf{region[T.Middle East and North Africa]} & 23.9896 & 24.400 & 0.983 & 0.329 & -24.741 & 72.720 \\\\\n",
"\\textbf{region[T.South Asia]} & 24.0141 & 9.828 & 2.443 & 0.017 & 4.386 & 43.643 \\\\\n",
"\\textbf{region[T.Sub-Saharan Africa]} & 48.9199 & 9.844 & 4.970 & 0.000 & 29.261 & 68.579 \\\\\n",
"\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & 3.8200 & 5.471 & 0.698 & 0.488 & -7.106 & 14.746 \\\\\n",
"\\textbf{CPIA\\_public\\_sector\\_rating} & 1.1358 & 6.601 & 0.172 & 0.864 & -12.048 & 14.320 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\begin{tabular}{lclc}\n",
"\\textbf{Omnibus:} & 5.883 & \\textbf{ Durbin-Watson: } & 1.809 \\\\\n",
"\\textbf{Prob(Omnibus):} & 0.053 & \\textbf{ Jarque-Bera (JB): } & 6.686 \\\\\n",
"\\textbf{Skew:} & 0.337 & \\textbf{ Prob(JB): } & 0.0353 \\\\\n",
"\\textbf{Kurtosis:} & 4.320 & \\textbf{ Cond. No. } & 143. \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"%\\caption{WLS Regression Results}\n",
"\\end{center}\n",
"\n",
"Notes: \\newline\n",
" [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
],
"text/plain": [
"\n",
"\"\"\"\n",
" WLS Regression Results \n",
"==========================================================================================\n",
"Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.394\n",
"Model: WLS Adj. R-squared: 0.329\n",
"Method: Least Squares F-statistic: 6.040\n",
"Date: Sun, 21 Jul 2024 Prob (F-statistic): 1.91e-05\n",
"Time: 16:52:11 Log-Likelihood: -382.90\n",
"No. Observations: 73 AIC: 781.8\n",
"Df Residuals: 65 BIC: 800.1\n",
"Df Model: 7 \n",
"Covariance Type: nonrobust \n",
"==========================================================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"----------------------------------------------------------------------------------------------------------\n",
"Intercept -1.0137 44.372 -0.023 0.982 -89.630 87.602\n",
"region[T.Europe and Central Asia] -7.5259 18.115 -0.415 0.679 -43.705 28.653\n",
"region[T.Latin America and Caribbean] 6.5553 19.604 0.334 0.739 -32.597 45.707\n",
"region[T.Middle East and North Africa] 23.9896 24.400 0.983 0.329 -24.741 72.720\n",
"region[T.South Asia] 24.0141 9.828 2.443 0.017 4.386 43.643\n",
"region[T.Sub-Saharan Africa] 48.9199 9.844 4.970 0.000 29.261 68.579\n",
"np.log(gdp_per_capita_ppp) 3.8200 5.471 0.698 0.488 -7.106 14.746\n",
"CPIA_public_sector_rating 1.1358 6.601 0.172 0.864 -12.048 14.320\n",
"==============================================================================\n",
"Omnibus: 5.883 Durbin-Watson: 1.809\n",
"Prob(Omnibus): 0.053 Jarque-Bera (JB): 6.686\n",
"Skew: 0.337 Prob(JB): 0.0353\n",
"Kurtosis: 4.320 Cond. No. 143.\n",
"==============================================================================\n",
"\n",
"Notes:\n",
"[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
"\"\"\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weighted_model = smf.wls(\n",
" \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region\",\n",
" data=wdi.dropna(),\n",
" weights=wdi.dropna()[\"Population, total\"],\n",
")\n",
"weighted_model.fit().summary()"
]
},
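{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for what the weights do, here is a small sketch on simulated data (the data and names are assumptions for illustration only). Weighted least squares with weights `w` gives the same estimates as ordinary least squares after multiplying the outcome and every regressor, including the constant, by `sqrt(w)`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import statsmodels.api as sm\n",
"\n",
"# Simulated (hypothetical) data and weights.\n",
"rng = np.random.default_rng(4)\n",
"n = 300\n",
"x = rng.normal(size=n)\n",
"weights = rng.uniform(0.5, 5.0, size=n)\n",
"y = 2.0 - 1.0 * x + rng.normal(size=n)\n",
"X = sm.add_constant(x)\n",
"\n",
"wls_fit = sm.WLS(y, X, weights=weights).fit()\n",
"\n",
"# OLS on data rescaled by the square root of the weights.\n",
"root_w = np.sqrt(weights)\n",
"ols_on_scaled = sm.OLS(y * root_w, X * root_w[:, None]).fit()\n",
"\n",
"print(wls_fit.params)\n",
"print(ols_on_scaled.params)\n",
"```\n",
"\n",
"The two sets of coefficients should agree up to floating-point error, which is one way to check that weights are being interpreted the way you expect."
]
},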
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Post-Regression Testing\n",
"\n",
"`statsmodels` also provides a flexible syntax for post-regression testing. To test whether the FE for South Asia and the Middle East and North Africa are the same, we just run:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
" Test for Constraints \n",
"==============================================================================\n",
" coef std err t P>|t| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"c0 1.1703 23.625 0.050 0.961 -46.013 48.354\n",
"=============================================================================="
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hypotheses = \"region[T.South Asia] = region[T.Middle East and North Africa]\"\n",
"corruption_model.t_test(hypotheses)"
]
},
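{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of a test like this can be reproduced by hand, which helps clarify what is being reported. The sketch below uses simulated data with hypothetical variables `x1` and `x2` (not the WDI data): the estimate is the difference between the two coefficients, and its standard error comes from their variances and covariance in `cov_params()`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Simulated (hypothetical) data with two regressors.\n",
"rng = np.random.default_rng(2)\n",
"n = 500\n",
"sim = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})\n",
"sim['y'] = 1.0 + 0.8 * sim['x1'] + 0.5 * sim['x2'] + rng.normal(size=n)\n",
"\n",
"fit = smf.ols('y ~ x1 + x2', data=sim).fit()\n",
"test = fit.t_test('x1 = x2')\n",
"\n",
"# The same quantities by hand.\n",
"diff = fit.params['x1'] - fit.params['x2']\n",
"cov = fit.cov_params()\n",
"se = np.sqrt(\n",
"    cov.loc['x1', 'x1'] + cov.loc['x2', 'x2'] - 2 * cov.loc['x1', 'x2']\n",
")\n",
"t_stat = diff / se\n",
"\n",
"print(test)\n",
"print(diff, se, t_stat)\n",
"```\n",
"\n",
"The `coef`, `std err`, and `t` columns printed by `t_test` should match `diff`, `se`, and `t_stat`."
]
},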
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generalized Linear Models and Generalized Additive Models\n",
"\n",
"Finally, for those interested in going beyond standard linear regression, `statsmodels` supports both Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs).\n",
"\n",
"You can read the documentation for [Generalized Linear Models](https://www.statsmodels.org/stable/glm.html), including logits, probits, poisson, binomial, etc., here.\n",
"\n",
"Documentation for [Generalized Additive Models can be found here](https://www.statsmodels.org/stable/gam.html), although users interested in GAMs may also wish to look into [pyGAM](https://pygam.readthedocs.io/en/latest/)."
]
}
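,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a taste of the GLM syntax, here is a minimal sketch of a Poisson regression through the formula interface. The data are simulated and the variable names (`visits`, `x`) are hypothetical; see the documentation linked above for the full set of families and options:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Simulated (hypothetical) counts whose log-mean is linear in x (true slope 0.7).\n",
"rng = np.random.default_rng(3)\n",
"n = 2000\n",
"sim = pd.DataFrame({'x': rng.normal(size=n)})\n",
"sim['visits'] = rng.poisson(np.exp(0.3 + 0.7 * sim['x'].to_numpy()))\n",
"\n",
"poisson_fit = smf.glm(\n",
"    'visits ~ x', data=sim, family=sm.families.Poisson()\n",
").fit()\n",
"print(poisson_fit.params)\n",
"```\n",
"\n",
"The formula works the same way as in `smf.ols`; the `family` argument is what selects the type of model."
]
}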
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}