Combining datasets: merging

Combining datasets: merging#

In the previous clean example, we demonstrated the three types of joins. Incomplete data are common, however. When either table is missing some of the content that would otherwise allow for a match, we need to consider how best to combine the datasets. The merge method can help us here.

Types of merging#

The merge method for pandas DataFrames has numerous parameters, but to accomplish the majority of common merges, there are four types of merges to consider: left, right, inner, and outer, which are each illustrated in the figure below, and through which we specify using the “how” parameter.

Here, the order of entering the parameters is very important. We’ll assume that we’re using the command pd.merge(a, b, ...) with a coming before b and that we’re merging on the data in column “C1”.

  • A left merge starts with the contents of a (the DataFrame on the “left”) and joins the rows of “C1” in b that match something in a. In the example below, we can see that the entries A and B in C1 match, but there is no C in b, therefore, there is a NaN in its place.

  • A right merge performs the same operation, but starts with the contents of b (the DataFrame on the “right”); in this case we see that there is no match for the value D in column C1 of DataFrame a, so that entry has a NaN value inserted.

  • An inner merge only keeps rows that have content that matches from both a and b.

  • An outer merge keeps all value and all rows from both a and b.

Types of Merges

Try it yourself#

Go ahead and try the above examples - create the DataFrames and perform the above merges. Make sure your results look the same as the above.

import pandas as pd

pd.set_option("mode.copy_on_write", True)

a = pd.DataFrame(data={"C1": ["A", "B", "C"], "C2": ["x", "y", "z"]})
a
C1 C2
0 A x
1 B y
2 C z
b = pd.DataFrame(data={"C1": ["A", "B", "D"], "C3": [11, 12, 13]})
b
C1 C3
0 A 11
1 B 12
2 D 13
merged = pd.merge(a, b, how="left", on="C1")
merged
C1 C2 C3
0 A x 11.0
1 B y 12.0
2 C z NaN
merged = pd.merge(a, b, how="right", on="C1")
merged
C1 C2 C3
0 A x 11
1 B y 12
2 D NaN 13
merged = pd.merge(a, b, how="inner", on="C1")
merged
C1 C2 C3
0 A x 11
1 B y 12
merged = pd.merge(a, b, how="outer", on="C1")
merged
C1 C2 C3
0 A x 11.0
1 B y 12.0
2 C z NaN
3 D NaN 13.0

Now that we have an understanding of the types of joins and the methods by which we can merge datasets (left, right, inner, and outer), in the next lesson, we’ll walk through a practical example where we put these tools into action.