Index alignment#
To illustrate how index alignment can sometimes lead to problems, let’s consider the following example: suppose we’re throwing a party, and we plan to give people prizes based on the order in which they arrive. The first person to arrive at the party will get 20 dollars, the second will get 10 dollars, and the third person doesn’t get anything.
To keep track of how many prizes everyone gets, we build a DataFrame with all the party attendees, their arrival order, and a column for keeping track of how much they’ve received in prizes.
We can also build a Series with the prize amounts we plan to give people.
Exercise 1#
Use the code below to get started:
import pandas as pd
attendees = pd.DataFrame(
{
"names": ["Jill", "Kumar", "Zaira"],
"prizes": [0, 0, 0],
"arrival_order": [2, 1, 3],
}
)
arrival_prizes = pd.Series([20, 10, 0])
Exercise 2#
Now let’s sort our attendees
list by arrival_order
so that the first row is the person who arrived first, the second is the person who arrived second, etc. to match how we’ve organized arrival_prizes
.
Exercise 3#
Now let’s “give” everyone their arrival prizes by adding arrival prizes to people’s prize column:
attendees["prizes"] = attendees["prizes"] + arrival_prizes
Exercise 4#
Now let’s look at the result. Does it look like what you expected? Do you know what went wrong?
After you’ve formulated your thoughts, continue to Discussion.
Exercise 5#
If you ever want to get alignment on row numbers, the easiest way to achieve that is to reset the indices on both objects you want to merge. When one re-sets indices without specifying a column to become the new index, the new index will just be row-numbers.
So reset prizes to 0, do what you need to do to get the order right, reset the index, and try again.
Note: When you reset the index on a Series, the Series is converted to a DataFrame, and the old index is added as a column. To avoid this behavior and just drop the old index when re-setting indices (in either a Series or a DataFrame), use the drop=True
argument when resetting the index.
Exercise 6#
OK, so besides doing automatic alignment, is there a reason to use indices?
Let’s find out. Create the following fake dataset of social security numbers and some “names” (random strings). Warning: this will take a little time to run.
import numpy.random as npr
import string
import random
npr.seed(42)
random.seed(42)
size = 1000000 # 1,000,000
people = pd.DataFrame(
{
"social_security_numbers": npr.randint(low=10000000, high=99999999, size=size),
"names": [
"".join(random.choices(string.ascii_uppercase, k=10)) for i in range(size)
],
}
)
Exercise 7#
Now subset your data to get the social security number associated with the name of “TPKSMSLREI”. (Yes, there are ways to get real random names, but they take a while to run because they query websites that generate fake names, so we’re just doing this!).
Exercise 8#
Now time your operation using the %timeit
ipython magic function.
Exercise 9#
Now make names
your index for this data. Then try subsetting using loc[]
to get all the observations with the name of “TPKSMSLREI” and time the operation.
Take-away#
So in short: indices can be nice in that they do automatic alignment, provided you’re expecting it. Moreover, if you want to pull random rows out of your dataset, it is often much faster than a regular query!