Index Alignment Exercises, Discussion#
If you didn’t see the problem coming and change your code, you probably did the following:
import pandas as pd
attendees = pd.DataFrame(
{
"names": ["Jill", "Kumar", "Zaira"],
"prizes": [0, 0, 0],
"arrival_order": [2, 1, 3],
}
)
arrival_prizes = pd.Series([20, 10, 0])
arrival_prizes
0 20
1 10
2 0
dtype: int64
attendees = attendees.sort_values("arrival_order")
attendees
names | prizes | arrival_order | |
---|---|---|---|
1 | Kumar | 0 | 1 |
0 | Jill | 0 | 2 |
2 | Zaira | 0 | 3 |
attendees["prizes"] = attendees["prizes"] + arrival_prizes
attendees
names | prizes | arrival_order | |
---|---|---|---|
1 | Kumar | 10 | 1 |
0 | Jill | 20 | 2 |
2 | Zaira | 0 | 3 |
The Problem#
Uh oh… as you can see, 20 dollars went to the person who arrived second, and 10 dollars go to the person who arrived first… Why did that happen?
The answer is index alignment.
In numpy
or R
, when you try and add two arrays of the same length, the first entry of the first array is added to the first entry of the second array to create the first entry of the result; the second entry is added to the second entry, etc. For example:
import numpy as np
np.array([1, 2, 3]) + np.array([1, 2, 3])
array([2, 4, 6])
But that is NOT how pandas
operates. Instead, pandas
will always align data based on index values. And when you sort data, the index value associated with each row doesn’t change. Take a look at attendees
: when we sorted the data, the data was re-ordered, but so were the index values: Jill is now the second row, but her index value is still 0, Kumar is now the first row, but his index value is still 1.
attendees
names | prizes | arrival_order | |
---|---|---|---|
1 | Kumar | 10 | 1 |
0 | Jill | 20 | 2 |
2 | Zaira | 0 | 3 |
The result is that when you combine attendees['arrival_order']
with prizes
, the entry of prizes
with index value 0
(20 dollars) is added to Jill’s row, and the entry with index value 1
(10 dollars) is added to Kumar’s row.
Forcing row-by-row alignment#
Thankfully, it’s not too hard to avoid index alignment. When you reset an index, by default your data gets a new index where each row’s index value is it’s row number. To see this in action, let’s start our exercise over by creating our original data structures again:
import pandas as pd
attendees = pd.DataFrame(
{
"names": ["Jill", "Kumar", "Zaira"],
"prizes": [0, 0, 0],
"arrival_order": [2, 1, 3],
}
)
arrival_prizes = pd.Series([20, 10, 0])
Now let’s sort attendees
by arrival_order
just like last time:
attendees = attendees.sort_values("arrival_order")
attendees
names | prizes | arrival_order | |
---|---|---|---|
1 | Kumar | 0 | 1 |
0 | Jill | 0 | 2 |
2 | Zaira | 0 | 3 |
But now before we add arrival_prizes
to attendees
, let’s reset the index of arrival_prizes
:
attendees = attendees.reset_index()
attendees
index | names | prizes | arrival_order | |
---|---|---|---|---|
0 | 1 | Kumar | 0 | 1 |
1 | 0 | Jill | 0 | 2 |
2 | 2 | Zaira | 0 | 3 |
As you can see, the new index (numbers in bold on left side) is now just the row numbers.
However, as you’ll see the old index has now been moved over to create a new column. Confusingly, pandas
likes to call that new column… index
. Yeah, I know. It’s not the index, it’s just a column named index. 😫.
(To avoid this problem, you can use the drop=True
option (reset_index(drop=True)
). But I wanted to show you the behavior if you don’t specify that so you aren’t confused when you see this for the first time. )
OK, so now that the index for attendees
is just row numbers, and that’s also the organization we have for arrival_prizes
:
arrival_prizes
0 20
1 10
2 0
dtype: int64
NOW we can add them together and they will add up row-by-row:
attendees["prizes"] = attendees["prizes"] + arrival_prizes
attendees
index | names | prizes | arrival_order | |
---|---|---|---|---|
0 | 1 | Kumar | 20 | 1 |
1 | 0 | Jill | 10 | 2 |
2 | 2 | Zaira | 0 | 3 |
And we get the result we expected!
When Index Alignment Comes Up#
The other thing about index alignment is that it thankfully doesn’t come up all that often. Indeed, that’s why it’s often not emphasized in intro exercises. That’s because different columns in the same DataFrame always share the same index, so when you execute operations using columns from the same DataFrame, index alignment looks like order alignment. This issue only comes up with you are doing an operation on Series
that are not from the same DataFrame
.