{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression in Python\n",
    "\n",
    "There are many packages that implement linear regression in Python. As detailed in our last reading, however, depending on whether they are intended for use in prediction or inference, the way these packages operate can vary substantially. \n",
    "\n",
    "In this reading, we will look at how linear regression has been implemented in two major packages — [statsmodels](http://statsmodels.org) and [scikit-learn](http://scikit-learn.org). Both of these packages can fit a wide range of linear models, but the way they are organized and the results they report reflect the different audiences for whom they were designed. \n",
    "\n",
    "*Broadly speaking*, statsmodels is a library written by statisticians for statisticians, biostatisticians, social scientists, and natural scientists. It can do prediction, but its focus is *inference*, and as we will see that is reflected throughout the package. \n",
    "\n",
    "scikit-learn, by contrast, was written by and for computer scientists interested in machine learning. Its focus is on prediction, and while it includes a far more diverse collection of machine learning models than statsmodels, it does not include all the features someone doing inference might expect for evaluating model performance or doing things like calculating different types of standard errors.\n",
    "\n",
    "## Regression in statsmodels\n",
    "\n",
    "Because it is the more feature-rich library when it comes to regression, we will start our exploration of linear regression in Python with statsmodels. If you have any interest in inference, are coming from a programming language like R or Stata, and/or have a background in statistics, social science, or the natural sciences, then statsmodels is the package that will feel most familiar and useful. \n",
    "\n",
    "While our focus will be on linear regression, the statsmodels package includes a wide range of tools for inference and modeling, from simple models like linear and logistic regression to generalized linear models (GLMs), non-parametric regression, robust linear models, time series models, survival analysis, multiple imputation, generalized additive models (GAMs), and more. (Curious if it includes that *one* model that's near and dear to your heart? [Feel free to go check and come back](https://www.statsmodels.org/stable/user-guide.html)).\n",
    "\n",
    "Moreover, it provides by far the easiest interface for moving from a pandas DataFrame to a regression in the Python ecosystem. To illustrate, let's fit a quick regression looking at countries' under-5 mortality rates as a function of GDP per capita and Work Bank ratings of each countries' public sector in terms of transparency, accountability, and levels of corruption:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>73</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>country_name</th>\n",
       "      <td>Yemen, Rep.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gdp_per_capita_ppp</th>\n",
       "      <td>3108.764217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CPIA_public_sector_rating</th>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mortality_rate_under5_per_1000</th>\n",
       "      <td>55.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mortality rate, under-5, female (per 1,000 live births)</th>\n",
       "      <td>51.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mortality rate, under-5, male (per 1,000 live births)</th>\n",
       "      <td>59.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Population, total</th>\n",
       "      <td>26497889.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>region</th>\n",
       "      <td>Middle East and North Africa</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                              73\n",
       "country_name                                                         Yemen, Rep.\n",
       "gdp_per_capita_ppp                                                   3108.764217\n",
       "CPIA_public_sector_rating                                                    1.5\n",
       "mortality_rate_under5_per_1000                                              55.4\n",
       "Mortality rate, under-5, female (per 1,000 live...                          51.3\n",
       "Mortality rate, under-5, male (per 1,000 live b...                          59.4\n",
       "Population, total                                                     26497889.0\n",
       "region                                              Middle East and North Africa"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import statsmodels.formula.api as smf\n",
    "\n",
    "pd.set_option(\"mode.copy_on_write\", True)\n",
    "\n",
    "# Load data on infant mortality, gdp per capita, and\n",
    "# World Bank CPIA public sector transparency, accountability,\n",
    "# and corruption in the public sector scores\n",
    "# (1 = low transparency and accountability, 6 = high transparency and accountability).\n",
    "\n",
    "wdi = pd.read_csv(\"data/wdi_corruption.csv\")\n",
    "\n",
    "# Check one observation to get a feel for things.\n",
    "wdi.sample().T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>    <td>mortality_rate_under5_per_1000</td> <th>  R-squared:         </th> <td>   0.586</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                          <td>OLS</td>              <th>  Adj. R-squared:    </th> <td>   0.541</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>                    <td>Least Squares</td>         <th>  F-statistic:       </th> <td>   13.12</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>                    <td>Sat, 20 Jul 2024</td>        <th>  Prob (F-statistic):</th> <td>2.11e-10</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                        <td>19:04:19</td>            <th>  Log-Likelihood:    </th> <td> -322.68</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>             <td>    73</td>             <th>  AIC:               </th> <td>   661.4</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>                 <td>    65</td>             <th>  BIC:               </th> <td>   679.7</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>                     <td>     7</td>             <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>             <td>nonrobust</td>           <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "                     <td></td>                       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>                              <td>  169.9397</td> <td>   36.430</td> <td>    4.665</td> <td> 0.000</td> <td>   97.183</td> <td>  242.696</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Europe and Central Asia]</th>      <td>  -15.9265</td> <td>   12.304</td> <td>   -1.294</td> <td> 0.200</td> <td>  -40.499</td> <td>    8.646</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Latin America and Caribbean]</th>  <td>    1.9023</td> <td>    9.226</td> <td>    0.206</td> <td> 0.837</td> <td>  -16.523</td> <td>   20.327</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Middle East and North Africa]</th> <td>    3.7668</td> <td>   23.057</td> <td>    0.163</td> <td> 0.871</td> <td>  -42.280</td> <td>   49.814</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.South Asia]</th>                   <td>    4.9372</td> <td>    9.818</td> <td>    0.503</td> <td> 0.617</td> <td>  -14.671</td> <td>   24.545</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Sub-Saharan Africa]</th>           <td>   27.8448</td> <td>    7.360</td> <td>    3.783</td> <td> 0.000</td> <td>   13.145</td> <td>   42.544</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.log(gdp_per_capita_ppp)</th>             <td>  -13.3790</td> <td>    4.547</td> <td>   -2.942</td> <td> 0.005</td> <td>  -22.461</td> <td>   -4.297</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>CPIA_public_sector_rating</th>              <td>   -7.1417</td> <td>    4.387</td> <td>   -1.628</td> <td> 0.108</td> <td>  -15.902</td> <td>    1.619</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 4.467</td> <th>  Durbin-Watson:     </th> <td>   1.617</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.107</td> <th>  Jarque-Bera (JB):  </th> <td>   4.375</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.592</td> <th>  Prob(JB):          </th> <td>   0.112</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.813</td> <th>  Cond. No.          </th> <td>    128.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/latex": [
       "\\begin{center}\n",
       "\\begin{tabular}{lclc}\n",
       "\\toprule\n",
       "\\textbf{Dep. Variable:}                         & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{  R-squared:         } &     0.586   \\\\\n",
       "\\textbf{Model:}                                 &                OLS                 & \\textbf{  Adj. R-squared:    } &     0.541   \\\\\n",
       "\\textbf{Method:}                                &           Least Squares            & \\textbf{  F-statistic:       } &     13.12   \\\\\n",
       "\\textbf{Date:}                                  &          Sat, 20 Jul 2024          & \\textbf{  Prob (F-statistic):} &  2.11e-10   \\\\\n",
       "\\textbf{Time:}                                  &              19:04:19              & \\textbf{  Log-Likelihood:    } &   -322.68   \\\\\n",
       "\\textbf{No. Observations:}                      &                   73               & \\textbf{  AIC:               } &     661.4   \\\\\n",
       "\\textbf{Df Residuals:}                          &                   65               & \\textbf{  BIC:               } &     679.7   \\\\\n",
       "\\textbf{Df Model:}                              &                    7               & \\textbf{                     } &             \\\\\n",
       "\\textbf{Covariance Type:}                       &             nonrobust              & \\textbf{                     } &             \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lcccccc}\n",
       "                                                & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]}  \\\\\n",
       "\\midrule\n",
       "\\textbf{Intercept}                              &     169.9397  &       36.430     &     4.665  &         0.000        &       97.183    &      242.696     \\\\\n",
       "\\textbf{region[T.Europe and Central Asia]}      &     -15.9265  &       12.304     &    -1.294  &         0.200        &      -40.499    &        8.646     \\\\\n",
       "\\textbf{region[T.Latin America and Caribbean]}  &       1.9023  &        9.226     &     0.206  &         0.837        &      -16.523    &       20.327     \\\\\n",
       "\\textbf{region[T.Middle East and North Africa]} &       3.7668  &       23.057     &     0.163  &         0.871        &      -42.280    &       49.814     \\\\\n",
       "\\textbf{region[T.South Asia]}                   &       4.9372  &        9.818     &     0.503  &         0.617        &      -14.671    &       24.545     \\\\\n",
       "\\textbf{region[T.Sub-Saharan Africa]}           &      27.8448  &        7.360     &     3.783  &         0.000        &       13.145    &       42.544     \\\\\n",
       "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)}          &     -13.3790  &        4.547     &    -2.942  &         0.005        &      -22.461    &       -4.297     \\\\\n",
       "\\textbf{CPIA\\_public\\_sector\\_rating}           &      -7.1417  &        4.387     &    -1.628  &         0.108        &      -15.902    &        1.619     \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lclc}\n",
       "\\textbf{Omnibus:}       &  4.467 & \\textbf{  Durbin-Watson:     } &    1.617  \\\\\n",
       "\\textbf{Prob(Omnibus):} &  0.107 & \\textbf{  Jarque-Bera (JB):  } &    4.375  \\\\\n",
       "\\textbf{Skew:}          &  0.592 & \\textbf{  Prob(JB):          } &    0.112  \\\\\n",
       "\\textbf{Kurtosis:}      &  2.813 & \\textbf{  Cond. No.          } &     128.  \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "%\\caption{OLS Regression Results}\n",
       "\\end{center}\n",
       "\n",
       "Notes: \\newline\n",
       " [1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                                  OLS Regression Results                                  \n",
       "==========================================================================================\n",
       "Dep. Variable:     mortality_rate_under5_per_1000   R-squared:                       0.586\n",
       "Model:                                        OLS   Adj. R-squared:                  0.541\n",
       "Method:                             Least Squares   F-statistic:                     13.12\n",
       "Date:                            Sat, 20 Jul 2024   Prob (F-statistic):           2.11e-10\n",
       "Time:                                    19:04:19   Log-Likelihood:                -322.68\n",
       "No. Observations:                              73   AIC:                             661.4\n",
       "Df Residuals:                                  65   BIC:                             679.7\n",
       "Df Model:                                       7                                         \n",
       "Covariance Type:                        nonrobust                                         \n",
       "==========================================================================================================\n",
       "                                             coef    std err          t      P>|t|      [0.025      0.975]\n",
       "----------------------------------------------------------------------------------------------------------\n",
       "Intercept                                169.9397     36.430      4.665      0.000      97.183     242.696\n",
       "region[T.Europe and Central Asia]        -15.9265     12.304     -1.294      0.200     -40.499       8.646\n",
       "region[T.Latin America and Caribbean]      1.9023      9.226      0.206      0.837     -16.523      20.327\n",
       "region[T.Middle East and North Africa]     3.7668     23.057      0.163      0.871     -42.280      49.814\n",
       "region[T.South Asia]                       4.9372      9.818      0.503      0.617     -14.671      24.545\n",
       "region[T.Sub-Saharan Africa]              27.8448      7.360      3.783      0.000      13.145      42.544\n",
       "np.log(gdp_per_capita_ppp)               -13.3790      4.547     -2.942      0.005     -22.461      -4.297\n",
       "CPIA_public_sector_rating                 -7.1417      4.387     -1.628      0.108     -15.902       1.619\n",
       "==============================================================================\n",
       "Omnibus:                        4.467   Durbin-Watson:                   1.617\n",
       "Prob(Omnibus):                  0.107   Jarque-Bera (JB):                4.375\n",
       "Skew:                           0.592   Prob(JB):                        0.112\n",
       "Kurtosis:                       2.813   Cond. No.                         128.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fit model\n",
    "corruption_model = smf.ols(\n",
    "    \"mortality_rate_under5_per_1000 ~ np.log(gdp_per_capita_ppp) +\"\n",
    "    \" CPIA_public_sector_rating + region\",\n",
    "    data=wdi,\n",
    ").fit()\n",
    "\n",
    "# Get regression result\n",
    "corruption_model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(why `smf.ols`? Recall that one name for linear regression is \"Ordinary Least Squares,\" which is often shortened to \"OLS.\")\n",
    "\n",
    "Not bad, huh? Not only was our regression a one-liner, but the `.summary()` method has provided us with all our coefficients with labels, standard errors, t-statistics, p-values, 95% confidence intervals, and a range of additional statistics about the regression. \n",
    "\n",
    "We were also able to apply come manipulations right in the regression specification — note the use of `np.log()` to apply the log function to `gdp_per_capita_ppp` without having to create a new variable in our DataFrame.\n",
    "\n",
    "Finally (if this doesn't mean much to you, don't worry about it), statsmodels recognized that `region` contained strings rather than numbers, so it dynamically created a set of indicator variables (i.e., it created one-hot encodings).\n",
    "\n",
    "If you come from R or Stata, none of that is likely to seem notable to you, but as we'll discuss in the next section (and as we'll see when we get to `scikit-learn`), those are conveniences that should not be taken for granted."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accessing Results\n",
    "\n",
    "While `.summary()` is great for printing out results, this is a class about *programming* for data science, and so, of course, we also need to know how to access these results programatically. \n",
    "\n",
    "All results — along with a large number of other useful statistics about model performance — are accessible through the fit model (in this example, `corruption_model`). This fit model is a `RegressionResult` object. You can find a full list `RegressionResult` attributes and methods [here](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.RegressionResults.html#statsmodels.regression.linear_model.RegressionResults). Two of the most important attributes of a `RegressionResult` object are: `.params` (a pandas Series of regression coefficients) and `.df_resid` (a DataFrame of residuals). \n",
    "\n",
    "To illustrate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Intercept                                 169.939656\n",
       "region[T.Europe and Central Asia]         -15.926485\n",
       "region[T.Latin America and Caribbean]       1.902313\n",
       "region[T.Middle East and North Africa]      3.766843\n",
       "region[T.South Asia]                        4.937173\n",
       "region[T.Sub-Saharan Africa]               27.844846\n",
       "np.log(gdp_per_capita_ppp)                -13.379033\n",
       "CPIA_public_sector_rating                  -7.141716\n",
       "dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corruption_model.params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The partial correlation of CPIA rating with under five mortality is -7.14\n"
     ]
    }
   ],
   "source": [
    "print(\n",
    "    \"The partial correlation of CPIA rating with under \"\n",
    "    f\"five mortality is {corruption_model.params['CPIA_public_sector_rating']:.2f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For those who are familiar with these concepts, a RegressionResult object also has methods for post-regression testing (e.g., f-tests, Lagrange Multiplier (LM) tests of linear restrictions, likelihood ratio tests). \n",
    "\n",
    "For example, to test whether the coefficient for the South Asia region is statistically significantly different from the coefficient for the Middle East and North Africa region, you would:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<class 'statsmodels.stats.contrast.ContrastResults'>\n",
       "                             Test for Constraints                             \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "c0             1.1703     23.625      0.050      0.961     -46.013      48.354\n",
       "=============================================================================="
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corruption_model.t_test(\"region[T.South Asia] = region[T.Middle East and North Africa]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alternative Standard Error Calculations\n",
    "\n",
    "`RegressionResult` objects also have a method for different standard error calculations (again, don't worry if this doesn't mean anything to you!). statsmodels supports `HC1`, `HC2`, and `HC3` heteroskedastic robust standard errors, as well as heteroskedasticity-autocorrelation robust standard errors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>    <td>mortality_rate_under5_per_1000</td> <th>  R-squared:         </th> <td>   0.586</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                          <td>OLS</td>              <th>  Adj. R-squared:    </th> <td>   0.541</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>                    <td>Least Squares</td>         <th>  F-statistic:       </th> <td>   48.92</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>                    <td>Sat, 20 Jul 2024</td>        <th>  Prob (F-statistic):</th> <td>1.68e-23</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                        <td>19:04:19</td>            <th>  Log-Likelihood:    </th> <td> -322.68</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>             <td>    73</td>             <th>  AIC:               </th> <td>   661.4</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>                 <td>    65</td>             <th>  BIC:               </th> <td>   679.7</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>                     <td>     7</td>             <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>                <td>HC2</td>              <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "                     <td></td>                       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>                              <td>  169.9397</td> <td>   37.846</td> <td>    4.490</td> <td> 0.000</td> <td>   94.357</td> <td>  245.522</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Europe and Central Asia]</th>      <td>  -15.9265</td> <td>    5.763</td> <td>   -2.764</td> <td> 0.007</td> <td>  -27.436</td> <td>   -4.417</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Latin America and Caribbean]</th>  <td>    1.9023</td> <td>    6.687</td> <td>    0.284</td> <td> 0.777</td> <td>  -11.453</td> <td>   15.257</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Middle East and North Africa]</th> <td>    3.7668</td> <td>    8.304</td> <td>    0.454</td> <td> 0.652</td> <td>  -12.817</td> <td>   20.351</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.South Asia]</th>                   <td>    4.9372</td> <td>    9.361</td> <td>    0.527</td> <td> 0.600</td> <td>  -13.759</td> <td>   23.633</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Sub-Saharan Africa]</th>           <td>   27.8448</td> <td>    7.238</td> <td>    3.847</td> <td> 0.000</td> <td>   13.389</td> <td>   42.300</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.log(gdp_per_capita_ppp)</th>             <td>  -13.3790</td> <td>    4.550</td> <td>   -2.941</td> <td> 0.005</td> <td>  -22.465</td> <td>   -4.293</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>CPIA_public_sector_rating</th>              <td>   -7.1417</td> <td>    3.966</td> <td>   -1.801</td> <td> 0.076</td> <td>  -15.063</td> <td>    0.779</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 4.467</td> <th>  Durbin-Watson:     </th> <td>   1.617</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.107</td> <th>  Jarque-Bera (JB):  </th> <td>   4.375</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.592</td> <th>  Prob(JB):          </th> <td>   0.112</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.813</td> <th>  Cond. No.          </th> <td>    128.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors are heteroscedasticity robust (HC2)"
      ],
      "text/latex": [
       "\\begin{center}\n",
       "\\begin{tabular}{lclc}\n",
       "\\toprule\n",
       "\\textbf{Dep. Variable:}                         & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{  R-squared:         } &     0.586   \\\\\n",
       "\\textbf{Model:}                                 &                OLS                 & \\textbf{  Adj. R-squared:    } &     0.541   \\\\\n",
       "\\textbf{Method:}                                &           Least Squares            & \\textbf{  F-statistic:       } &     48.92   \\\\\n",
       "\\textbf{Date:}                                  &          Sat, 20 Jul 2024          & \\textbf{  Prob (F-statistic):} &  1.68e-23   \\\\\n",
       "\\textbf{Time:}                                  &              19:04:19              & \\textbf{  Log-Likelihood:    } &   -322.68   \\\\\n",
       "\\textbf{No. Observations:}                      &                   73               & \\textbf{  AIC:               } &     661.4   \\\\\n",
       "\\textbf{Df Residuals:}                          &                   65               & \\textbf{  BIC:               } &     679.7   \\\\\n",
       "\\textbf{Df Model:}                              &                    7               & \\textbf{                     } &             \\\\\n",
       "\\textbf{Covariance Type:}                       &                HC2                 & \\textbf{                     } &             \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lcccccc}\n",
       "                                                & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]}  \\\\\n",
       "\\midrule\n",
       "\\textbf{Intercept}                              &     169.9397  &       37.846     &     4.490  &         0.000        &       94.357    &      245.522     \\\\\n",
       "\\textbf{region[T.Europe and Central Asia]}      &     -15.9265  &        5.763     &    -2.764  &         0.007        &      -27.436    &       -4.417     \\\\\n",
       "\\textbf{region[T.Latin America and Caribbean]}  &       1.9023  &        6.687     &     0.284  &         0.777        &      -11.453    &       15.257     \\\\\n",
       "\\textbf{region[T.Middle East and North Africa]} &       3.7668  &        8.304     &     0.454  &         0.652        &      -12.817    &       20.351     \\\\\n",
       "\\textbf{region[T.South Asia]}                   &       4.9372  &        9.361     &     0.527  &         0.600        &      -13.759    &       23.633     \\\\\n",
       "\\textbf{region[T.Sub-Saharan Africa]}           &      27.8448  &        7.238     &     3.847  &         0.000        &       13.389    &       42.300     \\\\\n",
       "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)}          &     -13.3790  &        4.550     &    -2.941  &         0.005        &      -22.465    &       -4.293     \\\\\n",
       "\\textbf{CPIA\\_public\\_sector\\_rating}           &      -7.1417  &        3.966     &    -1.801  &         0.076        &      -15.063    &        0.779     \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lclc}\n",
       "\\textbf{Omnibus:}       &  4.467 & \\textbf{  Durbin-Watson:     } &    1.617  \\\\\n",
       "\\textbf{Prob(Omnibus):} &  0.107 & \\textbf{  Jarque-Bera (JB):  } &    4.375  \\\\\n",
       "\\textbf{Skew:}          &  0.592 & \\textbf{  Prob(JB):          } &    0.112  \\\\\n",
       "\\textbf{Kurtosis:}      &  2.813 & \\textbf{  Cond. No.          } &     128.  \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "%\\caption{OLS Regression Results}\n",
       "\\end{center}\n",
       "\n",
       "Notes: \\newline\n",
       " [1] Standard Errors are heteroscedasticity robust (HC2)"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                                  OLS Regression Results                                  \n",
       "==========================================================================================\n",
       "Dep. Variable:     mortality_rate_under5_per_1000   R-squared:                       0.586\n",
       "Model:                                        OLS   Adj. R-squared:                  0.541\n",
       "Method:                             Least Squares   F-statistic:                     48.92\n",
       "Date:                            Sat, 20 Jul 2024   Prob (F-statistic):           1.68e-23\n",
       "Time:                                    19:04:19   Log-Likelihood:                -322.68\n",
       "No. Observations:                              73   AIC:                             661.4\n",
       "Df Residuals:                                  65   BIC:                             679.7\n",
       "Df Model:                                       7                                         \n",
       "Covariance Type:                              HC2                                         \n",
       "==========================================================================================================\n",
       "                                             coef    std err          t      P>|t|      [0.025      0.975]\n",
       "----------------------------------------------------------------------------------------------------------\n",
       "Intercept                                169.9397     37.846      4.490      0.000      94.357     245.522\n",
       "region[T.Europe and Central Asia]        -15.9265      5.763     -2.764      0.007     -27.436      -4.417\n",
       "region[T.Latin America and Caribbean]      1.9023      6.687      0.284      0.777     -11.453      15.257\n",
       "region[T.Middle East and North Africa]     3.7668      8.304      0.454      0.652     -12.817      20.351\n",
       "region[T.South Asia]                       4.9372      9.361      0.527      0.600     -13.759      23.633\n",
       "region[T.Sub-Saharan Africa]              27.8448      7.238      3.847      0.000      13.389      42.300\n",
       "np.log(gdp_per_capita_ppp)               -13.3790      4.550     -2.941      0.005     -22.465      -4.293\n",
       "CPIA_public_sector_rating                 -7.1417      3.966     -1.801      0.076     -15.063       0.779\n",
       "==============================================================================\n",
       "Omnibus:                        4.467   Durbin-Watson:                   1.617\n",
       "Prob(Omnibus):                  0.107   Jarque-Bera (JB):                4.375\n",
       "Skew:                           0.592   Prob(JB):                        0.112\n",
       "Kurtosis:                       2.813   Cond. No.                         128.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors are heteroscedasticity robust (HC2)\n",
       "\"\"\""
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corruption_model.get_robustcov_results(cov_type=\"HC2\").summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`RegressionResults` also support clustered standard errors using `corruption_model.get_robustcov_results(cov_type=\"cluster\", groups=)`. However, `groups` has to be passed a vector of integer group "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/nce8/opt/miniconda3/lib/python3.11/site-packages/statsmodels/base/model.py:1896: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 7, but rank is 2\n",
      "  warnings.warn('covariance of constraints does not have full '\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>    <td>mortality_rate_under5_per_1000</td> <th>  R-squared:         </th> <td>   0.586</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                          <td>OLS</td>              <th>  Adj. R-squared:    </th> <td>   0.541</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>                    <td>Least Squares</td>         <th>  F-statistic:       </th> <td>   4.404</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>                    <td>Sat, 20 Jul 2024</td>        <th>  Prob (F-statistic):</th>  <td>0.0789</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                        <td>19:04:20</td>            <th>  Log-Likelihood:    </th> <td> -322.68</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>             <td>    73</td>             <th>  AIC:               </th> <td>   661.4</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>                 <td>    65</td>             <th>  BIC:               </th> <td>   679.7</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>                     <td>     7</td>             <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>              <td>cluster</td>            <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "                     <td></td>                       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>                              <td>  169.9397</td> <td>   16.975</td> <td>   10.011</td> <td> 0.000</td> <td>  126.304</td> <td>  213.575</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Europe and Central Asia]</th>      <td>  -15.9265</td> <td>    1.241</td> <td>  -12.829</td> <td> 0.000</td> <td>  -19.118</td> <td>  -12.735</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Latin America and Caribbean]</th>  <td>    1.9023</td> <td>    0.694</td> <td>    2.739</td> <td> 0.041</td> <td>    0.117</td> <td>    3.688</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Middle East and North Africa]</th> <td>    3.7668</td> <td>    2.644</td> <td>    1.425</td> <td> 0.214</td> <td>   -3.030</td> <td>   10.564</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.South Asia]</th>                   <td>    4.9372</td> <td>    0.719</td> <td>    6.869</td> <td> 0.001</td> <td>    3.090</td> <td>    6.785</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>region[T.Sub-Saharan Africa]</th>           <td>   27.8448</td> <td>    1.385</td> <td>   20.098</td> <td> 0.000</td> <td>   24.283</td> <td>   31.406</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>np.log(gdp_per_capita_ppp)</th>             <td>  -13.3790</td> <td>    2.727</td> <td>   -4.906</td> <td> 0.004</td> <td>  -20.389</td> <td>   -6.369</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>CPIA_public_sector_rating</th>              <td>   -7.1417</td> <td>    2.067</td> <td>   -3.455</td> <td> 0.018</td> <td>  -12.455</td> <td>   -1.829</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 4.467</td> <th>  Durbin-Watson:     </th> <td>   1.617</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.107</td> <th>  Jarque-Bera (JB):  </th> <td>   4.375</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.592</td> <th>  Prob(JB):          </th> <td>   0.112</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.813</td> <th>  Cond. No.          </th> <td>    128.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors are robust to cluster correlation (cluster)"
      ],
      "text/latex": [
       "\\begin{center}\n",
       "\\begin{tabular}{lclc}\n",
       "\\toprule\n",
       "\\textbf{Dep. Variable:}                         & mortality\\_rate\\_under5\\_per\\_1000 & \\textbf{  R-squared:         } &     0.586   \\\\\n",
       "\\textbf{Model:}                                 &                OLS                 & \\textbf{  Adj. R-squared:    } &     0.541   \\\\\n",
       "\\textbf{Method:}                                &           Least Squares            & \\textbf{  F-statistic:       } &     4.404   \\\\\n",
       "\\textbf{Date:}                                  &          Sat, 20 Jul 2024          & \\textbf{  Prob (F-statistic):} &   0.0789    \\\\\n",
       "\\textbf{Time:}                                  &              19:04:20              & \\textbf{  Log-Likelihood:    } &   -322.68   \\\\\n",
       "\\textbf{No. Observations:}                      &                   73               & \\textbf{  AIC:               } &     661.4   \\\\\n",
       "\\textbf{Df Residuals:}                          &                   65               & \\textbf{  BIC:               } &     679.7   \\\\\n",
       "\\textbf{Df Model:}                              &                    7               & \\textbf{                     } &             \\\\\n",
       "\\textbf{Covariance Type:}                       &              cluster               & \\textbf{                     } &             \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lcccccc}\n",
       "                                                & \\textbf{coef} & \\textbf{std err} & \\textbf{t} & \\textbf{P$> |$t$|$} & \\textbf{[0.025} & \\textbf{0.975]}  \\\\\n",
       "\\midrule\n",
       "\\textbf{Intercept}                              &     169.9397  &       16.975     &    10.011  &         0.000        &      126.304    &      213.575     \\\\\n",
       "\\textbf{region[T.Europe and Central Asia]}      &     -15.9265  &        1.241     &   -12.829  &         0.000        &      -19.118    &      -12.735     \\\\\n",
       "\\textbf{region[T.Latin America and Caribbean]}  &       1.9023  &        0.694     &     2.739  &         0.041        &        0.117    &        3.688     \\\\\n",
       "\\textbf{region[T.Middle East and North Africa]} &       3.7668  &        2.644     &     1.425  &         0.214        &       -3.030    &       10.564     \\\\\n",
       "\\textbf{region[T.South Asia]}                   &       4.9372  &        0.719     &     6.869  &         0.001        &        3.090    &        6.785     \\\\\n",
       "\\textbf{region[T.Sub-Saharan Africa]}           &      27.8448  &        1.385     &    20.098  &         0.000        &       24.283    &       31.406     \\\\\n",
       "\\textbf{np.log(gdp\\_per\\_capita\\_ppp)}          &     -13.3790  &        2.727     &    -4.906  &         0.004        &      -20.389    &       -6.369     \\\\\n",
       "\\textbf{CPIA\\_public\\_sector\\_rating}           &      -7.1417  &        2.067     &    -3.455  &         0.018        &      -12.455    &       -1.829     \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "\\begin{tabular}{lclc}\n",
       "\\textbf{Omnibus:}       &  4.467 & \\textbf{  Durbin-Watson:     } &    1.617  \\\\\n",
       "\\textbf{Prob(Omnibus):} &  0.107 & \\textbf{  Jarque-Bera (JB):  } &    4.375  \\\\\n",
       "\\textbf{Skew:}          &  0.592 & \\textbf{  Prob(JB):          } &    0.112  \\\\\n",
       "\\textbf{Kurtosis:}      &  2.813 & \\textbf{  Cond. No.          } &     128.  \\\\\n",
       "\\bottomrule\n",
       "\\end{tabular}\n",
       "%\\caption{OLS Regression Results}\n",
       "\\end{center}\n",
       "\n",
       "Notes: \\newline\n",
       " [1] Standard Errors are robust to cluster correlation (cluster)"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                                  OLS Regression Results                                  \n",
       "==========================================================================================\n",
       "Dep. Variable:     mortality_rate_under5_per_1000   R-squared:                       0.586\n",
       "Model:                                        OLS   Adj. R-squared:                  0.541\n",
       "Method:                             Least Squares   F-statistic:                     4.404\n",
       "Date:                            Sat, 20 Jul 2024   Prob (F-statistic):             0.0789\n",
       "Time:                                    19:04:20   Log-Likelihood:                -322.68\n",
       "No. Observations:                              73   AIC:                             661.4\n",
       "Df Residuals:                                  65   BIC:                             679.7\n",
       "Df Model:                                       7                                         \n",
       "Covariance Type:                          cluster                                         \n",
       "==========================================================================================================\n",
       "                                             coef    std err          t      P>|t|      [0.025      0.975]\n",
       "----------------------------------------------------------------------------------------------------------\n",
       "Intercept                                169.9397     16.975     10.011      0.000     126.304     213.575\n",
       "region[T.Europe and Central Asia]        -15.9265      1.241    -12.829      0.000     -19.118     -12.735\n",
       "region[T.Latin America and Caribbean]      1.9023      0.694      2.739      0.041       0.117       3.688\n",
       "region[T.Middle East and North Africa]     3.7668      2.644      1.425      0.214      -3.030      10.564\n",
       "region[T.South Asia]                       4.9372      0.719      6.869      0.001       3.090       6.785\n",
       "region[T.Sub-Saharan Africa]              27.8448      1.385     20.098      0.000      24.283      31.406\n",
       "np.log(gdp_per_capita_ppp)               -13.3790      2.727     -4.906      0.004     -20.389      -6.369\n",
       "CPIA_public_sector_rating                 -7.1417      2.067     -3.455      0.018     -12.455      -1.829\n",
       "==============================================================================\n",
       "Omnibus:                        4.467   Durbin-Watson:                   1.617\n",
       "Prob(Omnibus):                  0.107   Jarque-Bera (JB):                4.375\n",
       "Skew:                           0.592   Prob(JB):                        0.112\n",
       "Kurtosis:                       2.813   Cond. No.                         128.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors are robust to cluster correlation (cluster)\n",
       "\"\"\""
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corruption_model.get_robustcov_results(\n",
    "    cov_type=\"cluster\", groups=wdi.dropna().region\n",
    ").summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Regression scikit-learn\n",
    "\n",
    "What about `scikit-learn`? `scikit-learn` is probably the most popular library for machine learning, and like `statsmodels` it is also able to fit linear regressions. Because `scikit-learn` is a library written for prediction rather than inference, however, the way it has been implemented is very, *very* different from `statsmodels`. To illustrate, let's fit the same regression we did at the top of this reading using scikit-learn. \n",
    "\n",
    "(Note that for reasons we'll cover in the following reading, I have to do some extra data wrangling first)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "wdi[\"log_gdp_per_cap\"] = np.log(wdi[\"gdp_per_capita_ppp\"])\n",
    "wdi_w_onehots = pd.concat(\n",
    "    [wdi, pd.get_dummies(wdi[\"region\"], drop_first=True, prefix=\"reg\").astype(\"int\")],\n",
    "    axis=\"columns\",\n",
    ")\n",
    "subset = wdi_w_onehots[\n",
    "    [\n",
    "        \"mortality_rate_under5_per_1000\",\n",
    "        \"log_gdp_per_cap\",\n",
    "        \"CPIA_public_sector_rating\",\n",
    "        \"reg_Europe and Central Asia\",\n",
    "        \"reg_Latin America and Caribbean\",\n",
    "        \"reg_Middle East and North Africa\",\n",
    "        \"reg_South Asia\",\n",
    "        \"reg_Sub-Saharan Africa\",\n",
    "    ]\n",
    "].dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-13.37903311,  -7.14171606, -15.9264855 ,   1.9023125 ,\n",
       "         3.76684256,   4.9371726 ,  27.84484623])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Fit linear regression\n",
    "my_model = LinearRegression(fit_intercept=True)\n",
    "my_model.fit(subset.iloc[:, 1:].values, subset.iloc[:, 0].values)\n",
    "\n",
    "# Get coefficients\n",
    "my_model.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And there it is! If you look back to the first regression we fit, you will see that the coefficients calculated by `scikit-learn` are identical to those computed by `statsmodels`. \n",
    "\n",
    "However, as you can also see by the fact that the only method for presenting the regression results in `scikit-learn` is to spit out the coefficients as a simple numpy array, `scikit-learn` was never designed for users interested in interpreting the coefficients of a regression. There's no `.summary()` method that presents all the statistics provided by `statsmodels`, and none of the additional functionality for doing things like calculating different types of standard errors (if you're new to linear regression, those are just different ways for calculating the statistical properties of estimated coefficients). \n",
    "\n",
    "Yes, `scikit-learn` can fit the same models, but the user experience is *radically* different."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next: Pulling Back the Curtain\n",
    "\n",
    "This reading has hopefully given you a sense of how easy statsmodels makes it to fit and analyze a linear regression. Some of the extensions to basic modelling discussed here may not be familiar to everyone, but even so, hopefully this gives you a sense of how things are organized so when you *do* learn more about linear regression, you'll know how to use that knowledge. \n",
    "\n",
    "In our next reading, we'll put ourselves squarely back in the land of data science programming as we pull back the curtain and take a look at what's going on behind the scenes of `statsmodels.formula.api.ols`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}