
Inference vs. Prediction Data Science#

A key concept in data science is the distinction between prediction and inference. Prediction and inference are two different goals that a data scientist may have in mind when they sit down to analyze their data. Like linear regression itself, these concepts would normally be a little outside the scope of a specialization on programming for data science, but the data science software ecosystem is organized around which of these goals you seek to accomplish, so understanding this distinction in goals will help you know which Python package to reach for in a given situation.

Prediction#

Prediction is the practice of using data to build models that can be used to predict the value of an unobserved variable for a given observation. A data scientist interested in predicting the total amount a customer may spend on a website over their lifetime based on their activity on the site during the first visit is engaged in prediction. Similarly, a data scientist who uses a database of mammograms that have already been reviewed by human radiologists to train a model that predicts whether a new mammogram (one not reviewed by a human) would have been flagged as abnormal had a human reviewed it is also doing prediction.

If you are familiar with terms like “Supervised Machine Learning” or “Deep Learning,” those refer to the practice of prediction.

Prediction is usually used to achieve one of two goals. The first is answering questions about what is likely to occur in the future to a specific individual. Answering these questions is useful for identifying individuals for additional care or attention. For example, a hospital might want to know, “How likely is Patient A to experience complications after surgery?” so they can decide whether the patient should receive extra nursing attention during recovery, or a factory owner might ask, “How likely is this machine to break down in the next month?” to help them determine when to take the machine offline for maintenance.

The second common use of prediction is for automation, which can be accomplished by building a model that predicts what a human would do in a given circumstance. In the example of a model designed to screen mammograms for cancer in place of human radiologists, what that model is effectively doing is trying to predict what a human radiologist would say if they reviewed the mammogram.[1]
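To make the prediction workflow concrete, here is a minimal sketch using only NumPy and synthetic data: we fit a linear model on customers whose lifetime spend we already know, then use it to predict spend for a new visitor we have not observed. The features (minutes on site, pages viewed) and all numbers are invented for illustration, not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: hypothetical first-visit features (minutes on
# site, pages viewed) versus each customer's eventual lifetime spend.
n = 200
minutes = rng.uniform(1, 60, n)
pages = rng.integers(1, 20, n).astype(float)
spend = 5.0 + 2.0 * minutes + 8.0 * pages + rng.normal(0, 10, n)

# Fit a linear model by least squares; design matrix is [1, minutes, pages].
X = np.column_stack([np.ones(n), minutes, pages])
coef, *_ = np.linalg.lstsq(X, spend, rcond=None)

# Predict lifetime spend for a brand-new visitor: the model's whole job
# is to produce a value for an outcome we have not yet observed.
new_visitor = np.array([1.0, 30.0, 5.0])  # intercept, 30 minutes, 5 pages
predicted_spend = new_visitor @ coef
print(f"predicted lifetime spend: ${predicted_spend:.2f}")
```

Note that nothing here asks *why* time on site relates to spending; the model is judged only by how close its predictions land to the truth.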

Inference#

Inference is the practice of analyzing data we can observe to help us better understand processes and mechanisms that we cannot see directly.

Perhaps the most intuitive examples of inference in data science are clinical trials and A/B testing. In both cases, a data scientist is interested in the effect of some action — giving patients a new drug in a clinical trial or changing the layout of a website in an A/B test. This effect is not something we can observe directly, but we can infer the effect by randomly assigning some people to the treatment (taking the drug or seeing the new layout) and some people to a control (no drug or the old website layout) and comparing outcomes between these two groups. If patients who get the new drug get better but those who don’t do not, or if users who see the new website spend more than the users still seeing the old website, we can infer that the action had an effect.
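The A/B-testing logic above can be sketched with simulated data: we randomly assign users to the new or old layout, build in a true effect of $3 on spending, and then infer that effect from the difference in group means. All numbers are invented for illustration; a real analysis would use recorded outcomes rather than simulated ones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate an A/B test: 1,000 users randomly assigned to the new layout
# (treatment) or the old one (control). For the sketch, assume the new
# layout truly raises average spend by $3.
n = 1000
treated = rng.integers(0, 2, n).astype(bool)
spend = rng.normal(20, 5, n) + 3.0 * treated

# Because assignment was random, the two groups are comparable, so the
# difference in group means is our inferred estimate of the effect.
effect = spend[treated].mean() - spend[~treated].mean()

# A rough standard error for the difference in means, to gauge how
# precisely the effect is estimated.
se = np.sqrt(spend[treated].var(ddof=1) / treated.sum()
             + spend[~treated].var(ddof=1) / (~treated).sum())
print(f"estimated effect: {effect:.2f} (SE {se:.2f})")
```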

While using data science for prediction gets all the headlines (in part because we’ve only recently developed tools that are really good at prediction), statistical inference has been with us for a long time. We cannot directly observe gravity, for example. Still, we infer its existence and specific properties by fitting models to everything from falling objects to the observed movement of planets and stars. The existence of dark matter, similarly, was first inferred by Vera Rubin. She observed how the speed of distant stars varied with their distance from the galactic center. She fit a model to this relationship and showed the resulting rotational speed curve could not be explained purely by our standard models of gravity and the matter we can see. Indeed, inference is used for everything from modeling public opinion before elections to understanding population dynamics in wildlife management and everything in between.

A Distinction In Purpose#

The distinction between prediction and inference is subtle because it is a distinction in the goals of a data scientist, not in the tools they are using. Linear and logistic regression, for example, are commonly used for both prediction and inference (we haven’t touched on logistic regression, so if you aren’t familiar with it yet, don’t worry about it).

But the way these tools are used — especially how a data scientist would interpret and evaluate the performance of a linear regression — will differ radically depending on whether they are interested in prediction or inference.

A data scientist interested in prediction focuses almost entirely on whether their model predicts outcomes for individual observations that are close to the true outcomes. To illustrate, consider a scientist trying to predict patient blood pressure using patient height, weight, age, and exercise. If the scientist only cared about prediction, all they would care about is how close the blood pressure values the model predicts for patients are to those patients’ real blood pressures. If you have heard of terms like R-squared, Area Under the Curve (AUC), Accuracy, or Precision, those are the types of metrics someone doing prediction cares about most. If the model can’t explain the vast majority of variation in blood pressure, it has no value for prediction.

A data scientist interested in inference, by contrast, might not actually care at all about whether the model can explain much of the variation in the outcome variable. Instead, they care about whether the model gives them good estimates of the parameters of the model (the regression coefficients), helping them to understand how different factors impact outcomes. For example, suppose someone is running a clinical trial in which some patients are randomly assigned to take a new blood pressure drug. They may use linear regression to compare outcomes between patients in the treatment group (those who took the drug) and those in the control group (those who did not). A model with only a single indicator variable for whether the patient was assigned to the treatment or control condition will not be any good at predicting the blood pressure of any particular patient (the R-squared will be close to zero, if you are familiar with that measure), but that’s ok; what the data scientist cares about is the coefficient on the variable indicating whether a patient was given the new drug. If that coefficient is large and statistically significant, the data scientist will have learned that the trial was successful and that the drug caused a reduction in blood pressure.
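A small simulation (synthetic data, NumPy only) illustrates this point: a regression with nothing but a treatment indicator recovers the drug's built-in effect quite well even though its R-squared is tiny, because patient-to-patient variation swamps the treatment effect. The effect size and noise level here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated trial: 500 patients, roughly half randomly assigned the new
# drug. Assume the drug truly lowers systolic blood pressure by 8 mmHg,
# while patient-to-patient variation is large (sd 15 mmHg).
n = 500
drug = rng.integers(0, 2, n).astype(float)
bp = 140 - 8.0 * drug + rng.normal(0, 15, n)

# OLS with a single treatment indicator: bp = alpha + beta * drug + eps.
X = np.column_stack([np.ones(n), drug])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)

# R-squared: the share of variation in bp the model explains.
resid = bp - X @ coef
r_squared = 1 - resid.var() / bp.var()

print(f"treatment coefficient: {coef[1]:.1f} mmHg")  # near the true -8
print(f"R-squared: {r_squared:.2f}")                 # small
```

The model is nearly useless for predicting any one patient's blood pressure, yet the coefficient on `drug` answers exactly the question the trial was run to answer.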

This is a little simplistic, but broadly speaking, data scientists interested in prediction only care about whether the model predicts the outcome variable well (what is sometimes called the “left-hand side” of a regression, since we usually write our regression \(Y = \alpha + X\beta + \epsilon\) and \(Y\) is the outcome variable), while data scientists interested in inference care most about the coefficients and standard errors (what is sometimes called the “right-hand side” of a regression). Indeed, as we’ll see in some of our following readings, this difference is evident in how linear regression has been implemented in packages designed for prediction as opposed to packages designed for inference.