Combining datasets: merging#
In the previous clean example, we demonstrated the three types of joins. Incomplete data are common, however. When either table is missing some of the content that would otherwise allow for a match, we need to consider how best to combine the datasets. The merge
method can help us here.
Types of merging#
The merge
method for pandas
DataFrames has numerous parameters, but to accomplish the majority of common merges, there are four types of merges to consider: left, right, inner, and outer, which are each illustrated in the figure below, and through which we specify using the “how” parameter.
Here, the order of entering the parameters is very important. We’ll assume that we’re using the command pd.merge(a, b, ...)
with a
coming before b
and that we’re merging on the data in column “C1”.
A left merge starts with the contents of
a
(the DataFrame on the “left”) and joins the rows of “C1” inb
that match something ina
. In the example below, we can see that the entries A and B in C1 match, but there is no C inb
, therefore, there is aNaN
in its place.A right merge performs the same operation, but starts with the contents of
b
(the DataFrame on the “right”); in this case we see that there is no match for the value D in column C1 of DataFramea
, so that entry has aNaN
value inserted.An inner merge only keeps rows that have content that matches from both
a
andb
.An outer merge keeps all value and all rows from both
a
andb
.
Try it yourself#
Go ahead and try the above examples - create the DataFrames and perform the above merges. Make sure your results look the same as the above.
import pandas as pd
pd.set_option("mode.copy_on_write", True)
a = pd.DataFrame(data={"C1": ["A", "B", "C"], "C2": ["x", "y", "z"]})
a
C1 | C2 | |
---|---|---|
0 | A | x |
1 | B | y |
2 | C | z |
b = pd.DataFrame(data={"C1": ["A", "B", "D"], "C3": [11, 12, 13]})
b
C1 | C3 | |
---|---|---|
0 | A | 11 |
1 | B | 12 |
2 | D | 13 |
merged = pd.merge(a, b, how="left", on="C1")
merged
C1 | C2 | C3 | |
---|---|---|---|
0 | A | x | 11.0 |
1 | B | y | 12.0 |
2 | C | z | NaN |
merged = pd.merge(a, b, how="right", on="C1")
merged
C1 | C2 | C3 | |
---|---|---|---|
0 | A | x | 11 |
1 | B | y | 12 |
2 | D | NaN | 13 |
merged = pd.merge(a, b, how="inner", on="C1")
merged
C1 | C2 | C3 | |
---|---|---|---|
0 | A | x | 11 |
1 | B | y | 12 |
merged = pd.merge(a, b, how="outer", on="C1")
merged
C1 | C2 | C3 | |
---|---|---|---|
0 | A | x | 11.0 |
1 | B | y | 12.0 |
2 | C | z | NaN |
3 | D | NaN | 13.0 |
Now that we have an understanding of the types of joins and the methods by which we can merge datasets (left, right, inner, and outer), in the next lesson, we’ll walk through a practical example where we put these tools into action.