{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Validating a merge"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you merge data, you have to make some assumptions about the nature of the data you're working with. For example, you have to assume it's OK to connect variables that share the same value of your merging variables. Similarly, you have to make assumptions about whether your merge a one-to-one merge (meaning there is only one observation of the variable you're merging on in both datasets), a one-to-many merge (meaning there is only one observation of the variable you're merging on in the first dataset, but multiple observations in the second). So before running a merge, answer the following questions:\n",
"\n",
"1. What variable do you think will be consistent across these two datasets you can use for merging?\n",
"2. Do you think there will be exactly 1 observation for each value in your first dataset?\n",
"3. Do you think there will be exactly 1 observation for each value in your second dataset?\n",
"\n",
"Being correct in your assumptions about these things is *very* important. If you think there's only one observation per value of your merging variable in each dataset, but there are in fact 2, you'll end up with two observations for each value after the merge. If that were to happen, the structure of your data would be damaged and it would indicate that you didn't understand something about your data. \n",
"\n",
"### Did the type of merge you were expecting happen? The `validate` keyword\n",
"\n",
"Because of the importance of this, `pandas` provides a utility for *testing* these assumptions when you do a merge: the `validate` keyword! Validate will accept `\"1:1\"`, `\"1:m\"`, `\"m:1\"`, and `\"m:m\"` where the 'm' stands for \"many\". It will then check to make sure your merge matches the type of merge you think it is. I *highly* recommend always using this option.\n",
"\n",
"Checking whether you are doing a one-to-one, many-to-one, one-to-many, or many-to-many merge is only the first type of diagnostic test you should run on *every* merge you conduct."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Identifying merge issues: the `indicator` keyword\n",
"The second test is to see if you data merged successfully!\n",
"\n",
"To help with this, the `merge` function in pandas offers a keyword option called `indicator`. If you set `indicator` to `True`, then pandas will add a column to the result of your merge called `_merge`. This variable will tell you, for each observation in your merged data, whether: (a) that observation came from a successful merge of both datasets, (b) if that observation was in the left dataset (the first one you passed) but not the right dataset, or (c) if that observation was in the right dataset but not the left). This allows you to quickly identify failed merges!\n",
"\n",
"For example, suppose you had the following data:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" df1_var | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" key1 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" key2 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key df1_var\n",
"0 key1 1\n",
"1 key2 2"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"pd.set_option(\"mode.copy_on_write\", True)\n",
"\n",
"df1 = pd.DataFrame({\"key\": [\"key1\", \"key2\"], \"df1_var\": [1, 2]})\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" df2_var | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" key1 | \n",
" a | \n",
"
\n",
" \n",
" 1 | \n",
" Key2 | \n",
" b | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key df2_var\n",
"0 key1 a\n",
"1 Key2 b"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = pd.DataFrame({\"key\": [\"key1\", \"Key2\"], \"df2_var\": [\"a\", \"b\"]})\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now suppose you *expected* that all observations should merge when you merge these datasets (because you hadn't noticed the typo in `df2` where `key2` has a capital `Key2`. If you just run a merge, it works without any problems:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And so you might carry on in life unaware your data is now corrupted: instead of two merged rows, you now have 3, only 1 of which merged correctly!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" df1_var | \n",
" df2_var | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" key1 | \n",
" 1.0 | \n",
" a | \n",
"
\n",
" \n",
" 1 | \n",
" key2 | \n",
" 2.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" Key2 | \n",
" NaN | \n",
" b | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key df1_var df2_var\n",
"0 key1 1.0 a\n",
"1 key2 2.0 NaN\n",
"2 Key2 NaN b"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When what you really wanted was: "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" df1_var | \n",
" df2_var | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" key1 | \n",
" 1 | \n",
" a | \n",
"
\n",
" \n",
" 1 | \n",
" key2 | \n",
" 2 | \n",
" b | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key df1_var df2_var\n",
"0 key1 1 a\n",
"1 key2 2 b"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2_correct = df2.copy()\n",
"df2_correct.loc[df2.key == \"Key2\", \"key\"] = \"key2\"\n",
"pd.merge(df1, df2_correct, on=\"key\", how=\"outer\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(in a small dataset, you'd quickly see you have 1 row instead of 2, but if you have millions of rows, a couple missing won't be evident). \n",
"\n",
"But now suppose we use the `indicator` function:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"_merge\n",
"left_only 1\n",
"right_only 1\n",
"both 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data = pd.merge(df1, df2, on=\"key\", how=\"outer\", indicator=True)\n",
"new_data._merge.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could immediately see that only one observation merged correct, and that one row from each dataset failed to merge!\n",
"\n",
"Moreover, we can look at the failed merges:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key | \n",
" df1_var | \n",
" df2_var | \n",
" _merge | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" key2 | \n",
" 2.0 | \n",
" NaN | \n",
" left_only | \n",
"
\n",
" \n",
" 2 | \n",
" Key2 | \n",
" NaN | \n",
" b | \n",
" right_only | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key df1_var df2_var _merge\n",
"1 key2 2.0 NaN left_only\n",
"2 Key2 NaN b right_only"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_data[new_data._merge != \"both\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Allowing us to easily diagnose the problem. \n",
"\n",
"**Note:** The `pandas` merge function allows users to decide whether to keep only observations that merge (`how='inner'`), all the observations from the first dataset pasted to merge (`how='left'`), all the observations from the second dataset passed to merge (`how='right'`), or all observations (`how='outer'`):\n",
"\n",
"But one danger to using the more restrictive options (like the default, `how='inner'`) is that the merge throws away all the observations that fail to merge, and while this may be the *eventual* goal of your analysis, it means that you don't get to see all the observations that failed to merge that maybe you thought *would* merge. In other words, it throws away the errors so you can't look at them! \n",
"\n",
"So to use `indicator` effectively, you have to:\n",
"\n",
"- Not use `how=\"inner\"`, and\n",
"- Check the values of `_merge` after your merge. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a good practice, use *both* the `validate` keyword *and* the `indicator` keyword (with `how='outer'` when possible). "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}