Validating a merge#

When you merge data, you have to make some assumptions about the nature of the data you’re working with. For example, you have to assume it’s OK to connect variables that share the same value of your merging variables. Similarly, you have to make assumptions about whether your merge a one-to-one merge (meaning there is only one observation of the variable you’re merging on in both datasets), a one-to-many merge (meaning there is only one observation of the variable you’re merging on in the first dataset, but multiple observations in the second). So before running a merge, answer the following questions:

  1. What variable do you think will be consistent across these two datasets you can use for merging?

  2. Do you think there will be exactly 1 observation for each value in your first dataset?

  3. Do you think there will be exactly 1 observation for each value in your second dataset?

Being correct in your assumptions about these things is very important. If you think there’s only one observation per value of your merging variable in each dataset, but there are in fact 2, you’ll end up with two observations for each value after the merge. If that were to happen, the structure of your data would be damaged and it would indicate that you didn’t understand something about your data.

Did the type of merge you were expecting happen? The validate keyword#

Because of the importance of this, pandas provides a utility for testing these assumptions when you do a merge: the validate keyword! Validate will accept "1:1", "1:m", "m:1", and "m:m" where the ‘m’ stands for “many”. It will then check to make sure your merge matches the type of merge you think it is. I highly recommend always using this option.

Checking whether you are doing a one-to-one, many-to-one, one-to-many, or many-to-many merge is only the first type of diagnostic test you should run on every merge you conduct.

Identifying merge issues: the indicator keyword#

The second test is to see if you data merged successfully!

To help with this, the merge function in pandas offers a keyword option called indicator. If you set indicator to True, then pandas will add a column to the result of your merge called _merge. This variable will tell you, for each observation in your merged data, whether: (a) that observation came from a successful merge of both datasets, (b) if that observation was in the left dataset (the first one you passed) but not the right dataset, or (c) if that observation was in the right dataset but not the left). This allows you to quickly identify failed merges!

For example, suppose you had the following data:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df1 = pd.DataFrame({"key": ["key1", "key2"], "df1_var": [1, 2]})
df1
key df1_var
0 key1 1
1 key2 2
df2 = pd.DataFrame({"key": ["key1", "Key2"], "df2_var": ["a", "b"]})
df2
key df2_var
0 key1 a
1 Key2 b

Now suppose you expected that all observations should merge when you merge these datasets (because you hadn’t noticed the typo in df2 where key2 has a capital Key2. If you just run a merge, it works without any problems:

new_data = pd.merge(df1, df2, on="key", how="outer")

And so you might carry on in life unaware your data is now corrupted: instead of two merged rows, you now have 3, only 1 of which merged correctly!

new_data
key df1_var df2_var
0 key1 1.0 a
1 key2 2.0 NaN
2 Key2 NaN b

When what you really wanted was:

df2_correct = df2.copy()
df2_correct.loc[df2.key == "Key2", "key"] = "key2"
pd.merge(df1, df2_correct, on="key", how="outer")
key df1_var df2_var
0 key1 1 a
1 key2 2 b

(in a small dataset, you’d quickly see you have 1 row instead of 2, but if you have millions of rows, a couple missing won’t be evident).

But now suppose we use the indicator function:

new_data = pd.merge(df1, df2, on="key", how="outer", indicator=True)
new_data._merge.value_counts()
_merge
left_only     1
right_only    1
both          1
Name: count, dtype: int64

We could immediately see that only one observation merged correct, and that one row from each dataset failed to merge!

Moreover, we can look at the failed merges:

new_data[new_data._merge != "both"]
key df1_var df2_var _merge
1 key2 2.0 NaN left_only
2 Key2 NaN b right_only

Allowing us to easily diagnose the problem.

Note: The pandas merge function allows users to decide whether to keep only observations that merge (how='inner'), all the observations from the first dataset pasted to merge (how='left'), all the observations from the second dataset passed to merge (how='right'), or all observations (how='outer'):

But one danger to using the more restrictive options (like the default, how='inner') is that the merge throws away all the observations that fail to merge, and while this may be the eventual goal of your analysis, it means that you don’t get to see all the observations that failed to merge that maybe you thought would merge. In other words, it throws away the errors so you can’t look at them!

So to use indicator effectively, you have to:

  • Not use how="inner", and

  • Check the values of _merge after your merge.

As a good practice, use both the validate keyword and the indicator keyword (with how='outer' when possible).