{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OPTIONAL: Beyond The Basic Model\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's our hope that the last two readings will be accessible to anyone who has gotten this far in our specialization, regardless of your prior familiarity with linear regression.\n", "\n", "In this reading, however, we will provide an overview of some of the more advanced functionality provided by `statsmodels`. The goal is to give readers who are used to working with linear regressions in another programming language (like R or Stata) a quick introduction to the syntax for tasks that are common in practice but which we do not have the space to explain in this course.\n", "\n", "In particular, in this reading we will discuss different types of standard errors (e.g., clustered and heteroskedastic robust standard errors), weighted regression, post-regression hypothesis tests, and where to find documentation for generalized linear and additive models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Heteroskedastic Robust and Clustered Standard Errors\n", "\n", "One of the most common modifications to a standard linear regression is the use of heteroskedastic robust and clustered standard errors, and these are easy to use in `statsmodels`.\n", "\n", "To illustrate, let's begin with a simple regression:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " 40\n", "country_name Mauritania\n", "gdp_per_capita_ppp 372.270362\n", "CPIA_public_sector_rating 3.0\n", "mortality_rate_under5_per_1000 84.1\n", "Mortality rate, under-5, female (per 1,000 live... 77.8\n", "Mortality rate, under-5, male (per 1,000 live b... 90.2\n", "Population, total 4046301.0\n", "region Sub-Saharan Africa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import statsmodels.formula.api as smf\n", "\n", "pd.set_option(\"mode.copy_on_write\", True)\n", "\n", "# Load data on infant mortality, gdp per capita, and\n", "# World Bank CPIA public sector transparency, accountability,\n", "# and corruption in the public sector scores\n", "# (1 = low transparency and accountability, 6 = high transparency and accountability).\n", "\n", "wdi = pd.read_csv(\"data/wdi_corruption.csv\")\n", "\n", "# Check one observation to get a feel for things.\n", "wdi.sample().T" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 13.12 \\\\\n", "\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 2.11e-10 \\\\\n", "\\textbf{Time:} & 16:39:55 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 36.430 & 4.665 & 0.000 & 97.183 & 242.696 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 12.304 & -1.294 & 0.200 & -40.499 & 8.646 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 9.226 & 0.206 & 0.837 & -16.523 & 20.327 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 23.057 & 0.163 & 0.871 & -42.280 & 49.814 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 9.818 & 0.503 & 0.617 & -14.671 & 24.545 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.360 & 3.783 & 0.000 & 13.145 & 42.544 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.547 & -2.942 & 0.005 & -22.461 & -4.297 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 4.387 & -1.628 & 0.108 & -15.902 & 1.619 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ 
Durbin-Watson: } & 1.617 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 13.12\n", "Date: Sun, 21 Jul 2024 Prob (F-statistic): 2.11e-10\n", "Time: 16:39:55 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 36.430 4.665 0.000 97.183 242.696\n", "region[T.Europe and Central Asia] -15.9265 12.304 -1.294 0.200 -40.499 8.646\n", "region[T.Latin America and Caribbean] 1.9023 9.226 0.206 0.837 -16.523 20.327\n", "region[T.Middle East and North Africa] 3.7668 23.057 0.163 0.871 -42.280 49.814\n", "region[T.South Asia] 4.9372 9.818 0.503 0.617 -14.671 24.545\n", "region[T.Sub-Saharan Africa] 27.8448 7.360 3.783 0.000 13.145 42.544\n", "np.log(gdp_per_capita_ppp) -13.3790 4.547 -2.942 0.005 -22.461 -4.297\n", "CPIA_public_sector_rating -7.1417 4.387 -1.628 0.108 -15.902 1.619\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", 
"Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 0.112\n", "Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit model\n", "corruption_model = smf.ols(\n", " \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region\",\n", " data=wdi,\n", ").fit()\n", "\n", "# Get regression result\n", "corruption_model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To change how standard errors are calculated, we call the `.get_robustcov_results()` method on the fitted model. For heteroskedastic robust standard errors, we pass our preferred estimator to the `cov_type` keyword argument. The coefficient estimates are unchanged; only the standard errors, and the statistics derived from them, differ. Here's a code snippet for HC2:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are heteroscedasticity robust (HC2)" ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 48.92 \\\\\n", "\\textbf{Date:} & Wed, 05 Jun 2024 & \\textbf{ Prob (F-statistic):} & 1.68e-23 \\\\\n", "\\textbf{Time:} & 13:55:46 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & HC2 & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 37.846 & 4.490 & 0.000 & 94.357 & 245.522 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 5.763 & -2.764 & 0.007 & -27.436 & -4.417 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 6.687 & 0.284 & 0.777 & -11.453 & 15.257 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 8.304 & 0.454 & 0.652 & -12.817 & 20.351 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 9.361 & 0.527 & 0.600 & -13.759 & 23.633 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.238 & 3.847 & 0.000 & 13.389 & 42.300 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.550 & -2.941 & 0.005 & -22.465 & -4.293 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 3.966 & -1.801 & 0.076 & -15.063 & 0.779 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n", 
"\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors are heteroscedasticity robust (HC2)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 48.92\n", "Date: Wed, 05 Jun 2024 Prob (F-statistic): 1.68e-23\n", "Time: 13:55:46 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: HC2 \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 37.846 4.490 0.000 94.357 245.522\n", "region[T.Europe and Central Asia] -15.9265 5.763 -2.764 0.007 -27.436 -4.417\n", "region[T.Latin America and Caribbean] 1.9023 6.687 0.284 0.777 -11.453 15.257\n", "region[T.Middle East and North Africa] 3.7668 8.304 0.454 0.652 -12.817 20.351\n", "region[T.South Asia] 4.9372 9.361 0.527 0.600 -13.759 23.633\n", "region[T.Sub-Saharan Africa] 27.8448 7.238 3.847 0.000 13.389 42.300\n", "np.log(gdp_per_capita_ppp) -13.3790 4.550 -2.941 0.005 -22.465 -4.293\n", "CPIA_public_sector_rating -7.1417 3.966 -1.801 0.076 -15.063 0.779\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", "Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 0.112\n", 
"Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors are heteroscedasticity robust (HC2)\n", "\"\"\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_w_robust_ses = corruption_model.get_robustcov_results(cov_type=\"HC2\")\n", "model_w_robust_ses.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustered standard errors work the same way, except that we must also pass a vector of group identifiers on which to cluster via the `groups` argument.\n", "\n", "(The group vector must have one entry per observation used in the regression. Rows with missing values are dropped when the original regression is fit, so drop those same rows from the original data before extracting the variable you pass as group identifiers.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2\n", " warnings.warn('covariance of constraints does not have full '\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors are robust to cluster correlation (cluster)" ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 4.404 \\\\\n", "\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 0.0789 \\\\\n", "\\textbf{Time:} & 16:45:56 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & cluster & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 16.975 & 10.011 & 0.000 & 126.304 & 213.575 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 1.241 & -12.829 & 0.000 & -19.118 & -12.735 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 0.694 & 2.739 & 0.041 & 0.117 & 3.688 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 2.644 & 1.425 & 0.214 & -3.030 & 10.564 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 0.719 & 6.869 & 0.001 & 3.090 & 6.785 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 1.385 & 20.098 & 0.000 & 24.283 & 31.406 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 2.727 & -4.906 & 0.004 & -20.389 & -6.369 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 2.067 & -3.455 & 0.018 & -12.455 & -1.829 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n", 
"\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors are robust to cluster correlation (cluster)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 4.404\n", "Date: Sun, 21 Jul 2024 Prob (F-statistic): 0.0789\n", "Time: 16:45:56 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: cluster \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 16.975 10.011 0.000 126.304 213.575\n", "region[T.Europe and Central Asia] -15.9265 1.241 -12.829 0.000 -19.118 -12.735\n", "region[T.Latin America and Caribbean] 1.9023 0.694 2.739 0.041 0.117 3.688\n", "region[T.Middle East and North Africa] 3.7668 2.644 1.425 0.214 -3.030 10.564\n", "region[T.South Asia] 4.9372 0.719 6.869 0.001 3.090 6.785\n", "region[T.Sub-Saharan Africa] 27.8448 1.385 20.098 0.000 24.283 31.406\n", "np.log(gdp_per_capita_ppp) -13.3790 2.727 -4.906 0.004 -20.389 -6.369\n", "CPIA_public_sector_rating -7.1417 2.067 -3.455 0.018 -12.455 -1.829\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", "Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 
0.112\n", "Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors are robust to cluster correlation (cluster)\n", "\"\"\"" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_w_clusters = corruption_model.get_robustcov_results(\n", " cov_type=\"cluster\", groups=wdi.dropna().region\n", ")\n", "model_w_clusters.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weighted Regression\n", "\n", "Weighted least squares is also available in `statsmodels` through `smf.wls()`. It is a little finicky: missing values should be dropped before fitting so that the `weights` vector lines up with the rows used in the model:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.394 \\\\\n", "\\textbf{Model:} & WLS & \\textbf{ Adj. R-squared: } & 0.329 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 6.040 \\\\\n", "\\textbf{Date:} & Sun, 21 Jul 2024 & \\textbf{ Prob (F-statistic):} & 1.91e-05 \\\\\n", "\\textbf{Time:} & 16:52:11 & \\textbf{ Log-Likelihood: } & -382.90 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 781.8 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 800.1 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & -1.0137 & 44.372 & -0.023 & 0.982 & -89.630 & 87.602 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -7.5259 & 18.115 & -0.415 & 0.679 & -43.705 & 28.653 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 6.5553 & 19.604 & 0.334 & 0.739 & -32.597 & 45.707 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 23.9896 & 24.400 & 0.983 & 0.329 & -24.741 & 72.720 \\\\\n", "\\textbf{region[T.South Asia]} & 24.0141 & 9.828 & 2.443 & 0.017 & 4.386 & 43.643 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 48.9199 & 9.844 & 4.970 & 0.000 & 29.261 & 68.579 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & 3.8200 & 5.471 & 0.698 & 0.488 & -7.106 & 14.746 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & 1.1358 & 6.601 & 0.172 & 0.864 & -12.048 & 14.320 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 5.883 & \\textbf{ Durbin-Watson: 
} & 1.809 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.053 & \\textbf{ Jarque-Bera (JB): } & 6.686 \\\\\n", "\\textbf{Skew:} & 0.337 & \\textbf{ Prob(JB): } & 0.0353 \\\\\n", "\\textbf{Kurtosis:} & 4.320 & \\textbf{ Cond. No. } & 143. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{WLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " WLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.394\n", "Model: WLS Adj. R-squared: 0.329\n", "Method: Least Squares F-statistic: 6.040\n", "Date: Sun, 21 Jul 2024 Prob (F-statistic): 1.91e-05\n", "Time: 16:52:11 Log-Likelihood: -382.90\n", "No. Observations: 73 AIC: 781.8\n", "Df Residuals: 65 BIC: 800.1\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept -1.0137 44.372 -0.023 0.982 -89.630 87.602\n", "region[T.Europe and Central Asia] -7.5259 18.115 -0.415 0.679 -43.705 28.653\n", "region[T.Latin America and Caribbean] 6.5553 19.604 0.334 0.739 -32.597 45.707\n", "region[T.Middle East and North Africa] 23.9896 24.400 0.983 0.329 -24.741 72.720\n", "region[T.South Asia] 24.0141 9.828 2.443 0.017 4.386 43.643\n", "region[T.Sub-Saharan Africa] 48.9199 9.844 4.970 0.000 29.261 68.579\n", "np.log(gdp_per_capita_ppp) 3.8200 5.471 0.698 0.488 -7.106 14.746\n", "CPIA_public_sector_rating 1.1358 6.601 0.172 0.864 -12.048 14.320\n", "==============================================================================\n", "Omnibus: 5.883 Durbin-Watson: 1.809\n", "Prob(Omnibus): 0.053 
Jarque-Bera (JB): 6.686\n", "Skew: 0.337 Prob(JB): 0.0353\n", "Kurtosis: 4.320 Cond. No. 143.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weighted_model = smf.wls(\n", " \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) + CPIA_public_sector_rating + region\",\n", " data=wdi.dropna(),\n", " weights=wdi.dropna()[\"Population, total\"],\n", ")\n", "weighted_model.fit().summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Post-Regression Testing\n", "\n", "`statsmodels` also provides a flexible syntax for post-regression testing. To test whether the region fixed effects (FE) for South Asia and for the Middle East and North Africa are equal, we pass the hypothesis as a string to `.t_test()`:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " Test for Constraints \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "c0 1.1703 23.625 0.050 0.961 -46.013 48.354\n", "==============================================================================" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hypotheses = \"region[T.South Asia] = region[T.Middle East and North Africa]\"\n", "corruption_model.t_test(hypotheses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized Linear Models and Generalized Additive Models\n", "\n", "Finally, for those interested in going beyond standard linear regression, `statsmodels` supports both Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs).\n", "\n", "You can read the documentation for [Generalized 
Linear Models](https://www.statsmodels.org/stable/glm.html), which cover logit, probit, Poisson, and binomial models, among others.\n", "\n", "Documentation for [Generalized Additive Models](https://www.statsmodels.org/stable/gam.html) is also available, although users interested in GAMs may also wish to look into [pyGAM](https://pygam.readthedocs.io/en/latest/)." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }