{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Regression in Python\n", "\n", "There are many packages that implement linear regression in Python. As detailed in our last reading, however, depending on whether they are intended for use in prediction or inference, the way these packages operate can vary substantially. \n", "\n", "In this reading, we will look at how linear regression has been implemented in two major packages — [statsmodels](http://statsmodels.org) and [scikit-learn](http://scikit-learn.org). Both of these packages can fit a wide range of linear models, but the way they are organized and the results they report reflect the different audiences for whom they were designed. \n", "\n", "*Broadly speaking*, statsmodels is a library written by statisticians for statisticians, biostatisticians, social scientists, and natural scientists. It can do prediction, but its focus is *inference*, and as we will see that is reflected throughout the package. \n", "\n", "scikit-learn, by contrast, was written by and for computer scientists interested in machine learning. Its focus is on prediction, and while it includes a far more diverse collection of machine learning models than statsmodels, it does not include all the features someone doing inference might expect for evaluating model performance or doing things like calculating different types of standard errors.\n", "\n", "## Regression in statsmodels\n", "\n", "Because it is the more feature-rich library when it comes to regression, we will start our exploration of linear regression in Python with statsmodels. If you have any interest in inference, are coming from a programming language like R or Stata, and/or have a background in statistics, social science, or the natural sciences, then statsmodels is the package that will feel most familiar and useful. \n", "\n", "While our focus will be on linear regression, the statsmodels package includes a wide range of tools for inference and modeling, from simple models like linear and logistic regression to generalized linear models (GLMs), non-parametric regression, robust linear models, time series models, survival analysis, multiple imputation, generalized additive models (GAMs), and more. (Curious if it includes that *one* model that's near and dear to your heart? [Feel free to go check and come back](https://www.statsmodels.org/stable/user-guide.html)).\n", "\n", "Moreover, it provides by far the easiest interface for moving from a pandas DataFrame to a regression in the Python ecosystem. To illustrate, let's fit a quick regression looking at countries' under-5 mortality rates as a function of GDP per capita and Work Bank ratings of each countries' public sector in terms of transparency, accountability, and levels of corruption:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
73
country_nameYemen, Rep.
gdp_per_capita_ppp3108.764217
CPIA_public_sector_rating1.5
mortality_rate_under5_per_100055.4
Mortality rate, under-5, female (per 1,000 live births)51.3
Mortality rate, under-5, male (per 1,000 live births)59.4
Population, total26497889.0
regionMiddle East and North Africa
\n", "
" ], "text/plain": [ " 73\n", "country_name Yemen, Rep.\n", "gdp_per_capita_ppp 3108.764217\n", "CPIA_public_sector_rating 1.5\n", "mortality_rate_under5_per_1000 55.4\n", "Mortality rate, under-5, female (per 1,000 live... 51.3\n", "Mortality rate, under-5, male (per 1,000 live b... 59.4\n", "Population, total 26497889.0\n", "region Middle East and North Africa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import statsmodels.formula.api as smf\n", "\n", "pd.set_option(\"mode.copy_on_write\", True)\n", "\n", "# Load data on infant mortality, gdp per capita, and\n", "# World Bank CPIA public sector transparency, accountability,\n", "# and corruption in the public sector scores\n", "# (1 = low transparency and accountability, 6 = high transparency and accountability).\n", "\n", "wdi = pd.read_csv(\"data/wdi_corruption.csv\")\n", "\n", "# Check one observation to get a feel for things.\n", "wdi.sample().T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 13.12
Date: Sat, 20 Jul 2024 Prob (F-statistic): 2.11e-10
Time: 19:04:19 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 36.430 4.665 0.000 97.183 242.696
region[T.Europe and Central Asia] -15.9265 12.304 -1.294 0.200 -40.499 8.646
region[T.Latin America and Caribbean] 1.9023 9.226 0.206 0.837 -16.523 20.327
region[T.Middle East and North Africa] 3.7668 23.057 0.163 0.871 -42.280 49.814
region[T.South Asia] 4.9372 9.818 0.503 0.617 -14.671 24.545
region[T.Sub-Saharan Africa] 27.8448 7.360 3.783 0.000 13.145 42.544
np.log(gdp_per_capita_ppp) -13.3790 4.547 -2.942 0.005 -22.461 -4.297
CPIA_public_sector_rating -7.1417 4.387 -1.628 0.108 -15.902 1.619
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 13.12 \\\\\n", "\\textbf{Date:} & Sat, 20 Jul 2024 & \\textbf{ Prob (F-statistic):} & 2.11e-10 \\\\\n", "\\textbf{Time:} & 19:04:19 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & nonrobust & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 36.430 & 4.665 & 0.000 & 97.183 & 242.696 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 12.304 & -1.294 & 0.200 & -40.499 & 8.646 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 9.226 & 0.206 & 0.837 & -16.523 & 20.327 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 23.057 & 0.163 & 0.871 & -42.280 & 49.814 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 9.818 & 0.503 & 0.617 & -14.671 & 24.545 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.360 & 3.783 & 0.000 & 13.145 & 42.544 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.547 & -2.942 & 0.005 & -22.461 & -4.297 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 4.387 & -1.628 & 0.108 & -15.902 & 1.619 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 13.12\n", "Date: Sat, 20 Jul 2024 Prob (F-statistic): 2.11e-10\n", "Time: 19:04:19 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 36.430 4.665 0.000 97.183 242.696\n", "region[T.Europe and Central Asia] -15.9265 12.304 -1.294 0.200 -40.499 8.646\n", "region[T.Latin America and Caribbean] 1.9023 9.226 0.206 0.837 -16.523 20.327\n", "region[T.Middle East and North Africa] 3.7668 23.057 0.163 0.871 -42.280 49.814\n", "region[T.South Asia] 4.9372 9.818 0.503 0.617 -14.671 24.545\n", "region[T.Sub-Saharan Africa] 27.8448 7.360 3.783 0.000 13.145 42.544\n", "np.log(gdp_per_capita_ppp) -13.3790 4.547 -2.942 0.005 -22.461 -4.297\n", "CPIA_public_sector_rating -7.1417 4.387 -1.628 0.108 -15.902 1.619\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", "Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 0.112\n", "Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit model\n", "corruption_model = smf.ols(\n", " \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) +\"\n", " \" CPIA_public_sector_rating + region\",\n", " data=wdi,\n", ").fit()\n", "\n", "# Get regression result\n", "corruption_model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(why `smf.ols`? Recall that one name for linear regression is \"Ordinary Least Squares,\" which is often shortened to \"OLS.\")\n", "\n", "Not bad, huh? Not only was our regression a one-liner, but the `.summary()` method has provided us with all our coefficients with labels, standard errors, t-statistics, p-values, 95% confidence intervals, and a range of additional statistics about the regression. \n", "\n", "We were also able to apply come manipulations right in the regression specification — note the use of `np.log()` to apply the log function to `gdp_per_capita_ppp` without having to create a new variable in our DataFrame.\n", "\n", "Finally (if this doesn't mean much to you, don't worry about it), statsmodels recognized that `region` contained strings rather than numbers, so it dynamically created a set of indicator variables (i.e., it created one-hot encodings).\n", "\n", "If you come from R or Stata, none of that is likely to seem notable to you, but as we'll discuss in the next section (and as we'll see when we get to `scikit-learn`), those are conveniences that should not be taken for granted." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Results\n", "\n", "While `.summary()` is great for printing out results, this is a class about *programming* for data science, and so, of course, we also need to know how to access these results programatically. \n", "\n", "All results — along with a large number of other useful statistics about model performance — are accessible through the fit model (in this example, `corruption_model`). This fit model is a `RegressionResult` object. You can find a full list `RegressionResult` attributes and methods [here](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html#statsmodels.regression.linear_model.RegressionResults). Two of the most important attributes of a `RegressionResult` object are: `.params` (a pandas Series of regression coefficients) and `.df_resid` (a DataFrame of residuals). \n", "\n", "To illustrate:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Intercept 169.939656\n", "region[T.Europe and Central Asia] -15.926485\n", "region[T.Latin America and Caribbean] 1.902313\n", "region[T.Middle East and North Africa] 3.766843\n", "region[T.South Asia] 4.937173\n", "region[T.Sub-Saharan Africa] 27.844846\n", "np.log(gdp_per_capita_ppp) -13.379033\n", "CPIA_public_sector_rating -7.141716\n", "dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corruption_model.params" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The partial correlation of CPIA rating with under five mortality is -7.14\n" ] } ], "source": [ "print(\n", " \"The partial correlation of CPIA rating with under \"\n", " f\"five mortality is {corruption_model.params['CPIA_public_sector_rating']:.2f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For those who are familiar with these concepts, a RegressionResult object also has methods for post-regression testing (e.g., f-tests, Lagrange Multiplier (LM) tests of linear restrictions, likelihood ratio tests). \n", "\n", "For example, to test whether the coefficient for the South Asia region is statistically significantly different from the coefficient for the Middle East and North Africa region, you would:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " Test for Constraints \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "c0 1.1703 23.625 0.050 0.961 -46.013 48.354\n", "==============================================================================" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corruption_model.t_test(\"region[T.South Asia] = region[T.Middle East and North Africa]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alternative Standard Error Calculations\n", "\n", "`RegressionResult` objects also have a method for different standard error calculations (again, don't worry if this doesn't mean anything to you!). statsmodels supports `HC1`, `HC2`, and `HC3` heteroskedastic robust standard errors, as well as heteroskedasticity-autocorrelation robust standard errors:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 48.92
Date: Sat, 20 Jul 2024 Prob (F-statistic): 1.68e-23
Time: 19:04:19 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: HC2
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 37.846 4.490 0.000 94.357 245.522
region[T.Europe and Central Asia] -15.9265 5.763 -2.764 0.007 -27.436 -4.417
region[T.Latin America and Caribbean] 1.9023 6.687 0.284 0.777 -11.453 15.257
region[T.Middle East and North Africa] 3.7668 8.304 0.454 0.652 -12.817 20.351
region[T.South Asia] 4.9372 9.361 0.527 0.600 -13.759 23.633
region[T.Sub-Saharan Africa] 27.8448 7.238 3.847 0.000 13.389 42.300
np.log(gdp_per_capita_ppp) -13.3790 4.550 -2.941 0.005 -22.465 -4.293
CPIA_public_sector_rating -7.1417 3.966 -1.801 0.076 -15.063 0.779
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)" ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 48.92 \\\\\n", "\\textbf{Date:} & Sat, 20 Jul 2024 & \\textbf{ Prob (F-statistic):} & 1.68e-23 \\\\\n", "\\textbf{Time:} & 19:04:19 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & HC2 & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 37.846 & 4.490 & 0.000 & 94.357 & 245.522 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 5.763 & -2.764 & 0.007 & -27.436 & -4.417 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 6.687 & 0.284 & 0.777 & -11.453 & 15.257 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 8.304 & 0.454 & 0.652 & -12.817 & 20.351 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 9.361 & 0.527 & 0.600 & -13.759 & 23.633 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 7.238 & 3.847 & 0.000 & 13.389 & 42.300 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 4.550 & -2.941 & 0.005 & -22.465 & -4.293 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 3.966 & -1.801 & 0.076 & -15.063 & 0.779 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors are heteroscedasticity robust (HC2)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 48.92\n", "Date: Sat, 20 Jul 2024 Prob (F-statistic): 1.68e-23\n", "Time: 19:04:19 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: HC2 \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 37.846 4.490 0.000 94.357 245.522\n", "region[T.Europe and Central Asia] -15.9265 5.763 -2.764 0.007 -27.436 -4.417\n", "region[T.Latin America and Caribbean] 1.9023 6.687 0.284 0.777 -11.453 15.257\n", "region[T.Middle East and North Africa] 3.7668 8.304 0.454 0.652 -12.817 20.351\n", "region[T.South Asia] 4.9372 9.361 0.527 0.600 -13.759 23.633\n", "region[T.Sub-Saharan Africa] 27.8448 7.238 3.847 0.000 13.389 42.300\n", "np.log(gdp_per_capita_ppp) -13.3790 4.550 -2.941 0.005 -22.465 -4.293\n", "CPIA_public_sector_rating -7.1417 3.966 -1.801 0.076 -15.063 0.779\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", "Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 0.112\n", "Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors are heteroscedasticity robust (HC2)\n", "\"\"\"" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corruption_model.get_robustcov_results(cov_type=\"HC2\").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`RegressionResults` also support clustered standard errors using `corruption_model.get_robustcov_results(cov_type=\"cluster\", groups=)`. However, `groups` has to be passed a vector of integer group " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2\n", " warnings.warn('covariance of constraints does not have full '\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586
Model: OLS Adj. R-squared: 0.541
Method: Least Squares F-statistic: 4.404
Date: Sat, 20 Jul 2024 Prob (F-statistic): 0.0789
Time: 19:04:20 Log-Likelihood: -322.68
No. Observations: 73 AIC: 661.4
Df Residuals: 65 BIC: 679.7
Df Model: 7
Covariance Type: cluster
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 169.9397 16.975 10.011 0.000 126.304 213.575
region[T.Europe and Central Asia] -15.9265 1.241 -12.829 0.000 -19.118 -12.735
region[T.Latin America and Caribbean] 1.9023 0.694 2.739 0.041 0.117 3.688
region[T.Middle East and North Africa] 3.7668 2.644 1.425 0.214 -3.030 10.564
region[T.South Asia] 4.9372 0.719 6.869 0.001 3.090 6.785
region[T.Sub-Saharan Africa] 27.8448 1.385 20.098 0.000 24.283 31.406
np.log(gdp_per_capita_ppp) -13.3790 2.727 -4.906 0.004 -20.389 -6.369
CPIA_public_sector_rating -7.1417 2.067 -3.455 0.018 -12.455 -1.829
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 4.467 Durbin-Watson: 1.617
Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375
Skew: 0.592 Prob(JB): 0.112
Kurtosis: 2.813 Cond. No. 128.


Notes:
[1] Standard Errors are robust to cluster correlation (cluster)" ], "text/latex": [ "\\begin{center}\n", "\\begin{tabular}{lclc}\n", "\\toprule\n", "\\textbf{Dep. Variable:} & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{ R-squared: } & 0.586 \\\\\n", "\\textbf{Model:} & OLS & \\textbf{ Adj. R-squared: } & 0.541 \\\\\n", "\\textbf{Method:} & Least Squares & \\textbf{ F-statistic: } & 4.404 \\\\\n", "\\textbf{Date:} & Sat, 20 Jul 2024 & \\textbf{ Prob (F-statistic):} & 0.0789 \\\\\n", "\\textbf{Time:} & 19:04:20 & \\textbf{ Log-Likelihood: } & -322.68 \\\\\n", "\\textbf{No. Observations:} & 73 & \\textbf{ AIC: } & 661.4 \\\\\n", "\\textbf{Df Residuals:} & 65 & \\textbf{ BIC: } & 679.7 \\\\\n", "\\textbf{Df Model:} & 7 & \\textbf{ } & \\\\\n", "\\textbf{Covariance Type:} & cluster & \\textbf{ } & \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lcccccc}\n", " & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]} \\\\\n", "\\midrule\n", "\\textbf{Intercept} & 169.9397 & 16.975 & 10.011 & 0.000 & 126.304 & 213.575 \\\\\n", "\\textbf{region[T.Europe and Central Asia]} & -15.9265 & 1.241 & -12.829 & 0.000 & -19.118 & -12.735 \\\\\n", "\\textbf{region[T.Latin America and Caribbean]} & 1.9023 & 0.694 & 2.739 & 0.041 & 0.117 & 3.688 \\\\\n", "\\textbf{region[T.Middle East and North Africa]} & 3.7668 & 2.644 & 1.425 & 0.214 & -3.030 & 10.564 \\\\\n", "\\textbf{region[T.South Asia]} & 4.9372 & 0.719 & 6.869 & 0.001 & 3.090 & 6.785 \\\\\n", "\\textbf{region[T.Sub-Saharan Africa]} & 27.8448 & 1.385 & 20.098 & 0.000 & 24.283 & 31.406 \\\\\n", "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)} & -13.3790 & 2.727 & -4.906 & 0.004 & -20.389 & -6.369 \\\\\n", "\\textbf{CPIA\\_public\\_sector\\_rating} & -7.1417 & 2.067 & -3.455 & 0.018 & -12.455 & -1.829 \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "\\begin{tabular}{lclc}\n", "\\textbf{Omnibus:} & 4.467 & \\textbf{ Durbin-Watson: } & 1.617 \\\\\n", "\\textbf{Prob(Omnibus):} & 0.107 & \\textbf{ Jarque-Bera (JB): } & 4.375 \\\\\n", "\\textbf{Skew:} & 0.592 & \\textbf{ Prob(JB): } & 0.112 \\\\\n", "\\textbf{Kurtosis:} & 2.813 & \\textbf{ Cond. No. } & 128. \\\\\n", "\\bottomrule\n", "\\end{tabular}\n", "%\\caption{OLS Regression Results}\n", "\\end{center}\n", "\n", "Notes: \\newline\n", " [1] Standard Errors are robust to cluster correlation (cluster)" ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==========================================================================================\n", "Dep. Variable: mortality_rate_under5_per_1000 R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.541\n", "Method: Least Squares F-statistic: 4.404\n", "Date: Sat, 20 Jul 2024 Prob (F-statistic): 0.0789\n", "Time: 19:04:20 Log-Likelihood: -322.68\n", "No. Observations: 73 AIC: 661.4\n", "Df Residuals: 65 BIC: 679.7\n", "Df Model: 7 \n", "Covariance Type: cluster \n", "==========================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------------------\n", "Intercept 169.9397 16.975 10.011 0.000 126.304 213.575\n", "region[T.Europe and Central Asia] -15.9265 1.241 -12.829 0.000 -19.118 -12.735\n", "region[T.Latin America and Caribbean] 1.9023 0.694 2.739 0.041 0.117 3.688\n", "region[T.Middle East and North Africa] 3.7668 2.644 1.425 0.214 -3.030 10.564\n", "region[T.South Asia] 4.9372 0.719 6.869 0.001 3.090 6.785\n", "region[T.Sub-Saharan Africa] 27.8448 1.385 20.098 0.000 24.283 31.406\n", "np.log(gdp_per_capita_ppp) -13.3790 2.727 -4.906 0.004 -20.389 -6.369\n", "CPIA_public_sector_rating -7.1417 2.067 -3.455 0.018 -12.455 -1.829\n", "==============================================================================\n", "Omnibus: 4.467 Durbin-Watson: 1.617\n", "Prob(Omnibus): 0.107 Jarque-Bera (JB): 4.375\n", "Skew: 0.592 Prob(JB): 0.112\n", "Kurtosis: 2.813 Cond. No. 128.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors are robust to cluster correlation (cluster)\n", "\"\"\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corruption_model.get_robustcov_results(\n", " cov_type=\"cluster\", groups=wdi.dropna().region\n", ").summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression scikit-learn\n", "\n", "What about `scikit-learn`? `scikit-learn` is probably the most popular library for machine learning, and like `statsmodels` it is also able to fit linear regressions. Because `scikit-learn` is a library written for prediction rather than inference, however, the way it has been implemented is very, *very* different from `statsmodels`. To illustrate, let's fit the same regression we did at the top of this reading using scikit-learn. \n", "\n", "(Note that for reasons we'll cover in the following reading, I have to do some extra data wrangling first)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "wdi[\"log_gdp_per_cap\"] = np.log(wdi[\"gdp_per_capita_ppp\"])\n", "wdi_w_onehots = pd.concat(\n", " [wdi, pd.get_dummies(wdi[\"region\"], drop_first=True, prefix=\"reg\").astype(\"int\")],\n", " axis=\"columns\",\n", ")\n", "subset = wdi_w_onehots[\n", " [\n", " \"mortality_rate_under5_per_1000\",\n", " \"log_gdp_per_cap\",\n", " \"CPIA_public_sector_rating\",\n", " \"reg_Europe and Central Asia\",\n", " \"reg_Latin America and Caribbean\",\n", " \"reg_Middle East and North Africa\",\n", " \"reg_South Asia\",\n", " \"reg_Sub-Saharan Africa\",\n", " ]\n", "].dropna()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-13.37903311, -7.14171606, -15.9264855 , 1.9023125 ,\n", " 3.76684256, 4.9371726 , 27.84484623])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "# Fit linear regression\n", "my_model = LinearRegression(fit_intercept=True)\n", "my_model.fit(subset.iloc[:, 1:].values, subset.iloc[:, 0].values)\n", "\n", "# Get coefficients\n", "my_model.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And there it is! If you look back to the first regression we fit, you will see that the coefficients calculated by `scikit-learn` are identical to those computed by `statsmodels`. \n", "\n", "However, as you can also see by the fact that the only method for presenting the regression results in `scikit-learn` is to spit out the coefficients as a simple numpy array, `scikit-learn` was never designed for users interested in interpreting the coefficients of a regression. There's no `.summary()` method that presents all the statistics provided by `statsmodels`, and none of the additional functionality for doing things like calculating different types of standard errors (if you're new to linear regression, those are just different ways for calculating the statistical properties of estimated coefficients). \n", "\n", "Yes, `scikit-learn` can fit the same models, but the user experience is *radically* different." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next: Pulling Back the Curtain\n", "\n", "This reading has hopefully given you a sense of how easy statsmodels makes it to fit and analyze a linear regression. Some of the extensions to basic modelling discussed here may not be familiar to everyone, but even so, hopefully this gives you a sense of how things are organized so when you *do* learn more about linear regression, you'll know how to use that knowledge. \n", "\n", "In our next reading, we'll put ourselves squarely back in the land of data science programming as we pull back the curtain and take a look at what's going on behind the scenes of `statsmodels.formula.api.ols`.\n" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }