Subsetting DataFrame Tricks and Gotchas

Subsetting DataFrame Tricks and Gotchas#

Two features of subsetting DataFrames are worth special attention: subsetting with simple square brackets ([]), and subsetting columns with dot-notation.

import pandas as pd

world = pd.read_csv(
    "https://raw.githubusercontent.com/nickeubank/"
    "practicaldatascience/master/Example_Data/world-small.csv"
)
world

	country	region	gdppcap08	polityIV
0	Albania	C&E Europe	7715	17.8
1	Algeria	Africa	8033	10.0
2	Angola	Africa	5899	8.0
3	Argentina	S. America	14333	18.0
4	Armenia	C&E Europe	6070	15.0
...	...	...	...	...
140	Venezuela	S. America	12804	16.0
141	Vietnam	Asia-Pacific	2785	3.0
142	Yemen	Middle East	2400	8.0
143	Zambia	Africa	1356	15.0
144	Zimbabwe	Africa	188	6.0

145 rows × 4 columns

`[]` Square brackets#

As with Series, single square brackets in pandas change their behavior depending on the values you pass them. Again, it is worth emphasizing that there is nothing that one can do with square brackets that you can’t do with .loc and .iloc, so if they seem too strange, you don’t have to use them.

With that said, as summarized below, [] is actually much safer on DataFrames than on Series.

The rules of [] in DataFrames are:

If your entry is a single column name, or a list of column names, it will return those columns.
If your entry is a slice, it will work like iloc and select rows based on row order.
If your entry is a Boolean array, and of exactly the same length as the number of rows in your data, it will subset rows.

# Select one column
world["country"].head()

    Albania
    Algeria
     Angola
  Argentina
    Armenia
Name: country, dtype: object

# Select multiple columns
world[["country", "gdppcap08"]].head()

	country	gdppcap08
0	Albania	7715
1	Algeria	8033
2	Angola	5899
3	Argentina	14333
4	Armenia	6070

# Boolean test
world[world["gdppcap08"] > 10000].head()

	country	region	gdppcap08	polityIV
3	Argentina	S. America	14333	18.0
5	Australia	Asia-Pacific	35677	20.0
6	Austria	W. Europe	38152	20.0
8	Bahrain	Middle East	34605	3.0
10	Belarus	C&E Europe	12261	3.0

# Slice of rows
world[0:3]

	country	region	gdppcap08	polityIV
0	Albania	C&E Europe	7715	17.8
1	Algeria	Africa	8033	10.0
2	Angola	Africa	5899	8.0

My advice on using `[]` on DataFrames#

[] is much safer on DataFrames because the situation we saw with Series where [] might subset on index labels (if your index labels are integers) or it might subset on row order (if your index labels are not integers) doesn’t exist. Moreover, selecting a single column is extremely common, and this is a case where I use single square brackets all the time.

In a Series, if I pass 0, it’s always unclear whether that’s going to get me the first row (row-order-based) or the row with index value 0 (if I have integer index values).

On a DataFrame, a single entry or list of entries will only attempt to match columns based on index values, and if that fails, it throws an exception rather than defaulting to acting like .iloc:

world[0]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
   3651 try:
-> 3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:

File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

Similarly, Boolean subsetting always acts like you’re using .loc (aligning on index values where it can, row order if it can’t), and slices in [] always get behavior like .iloc, making behavior much more predictable.

Getting Columns with Dot-Notation#

In addition to passing the name of a column into .loc or to [], columns can also sometimes be access using dot-notation:

world.country.head()

    Albania
    Algeria
     Angola
  Argentina
    Armenia
Name: country, dtype: object

This method of getting columns is very easy and intuitive (given how often we use dot-notation in Python more broadly), but it has a couple significant pit-falls:

Only works for column names without spaces or punctuation
You can’t pass a variable to dot-notation, you have to write out the column explicitly (so you can’t write generalized code).
Only works if the column name isn’t the same as an existing method (e.g., df.count will call the count method, even if you have a column named "count")
Can cause problems if you try to put it on the left side of the equals sign.

Of these, the reasons for the first and second aren’t complicated, but the third and fourth concerns bear exploring.

Suppose we added a column to our data called rank that gave each country’s GDP rank (this code is a little convoluted because there is an easier way to do this, but this works):

world = world.sort_values("gdppcap08")
world["rank"] = range(0, len(world))
world.head()

	country	region	gdppcap08	polityIV	rank
144	Zimbabwe	Africa	188	6.0	0
29	Congo Kinshasa	Africa	321	15.0	1
76	Liberia	Africa	388	10.0	2
53	Guinea-Bissau	Africa	538	11.0	3
40	Eritrea	Africa	632	3.0	4

Now watch what happens if we try to access the rank column with dot-notation—we don’t get the column, we get the method rank (that’s what bound method NDFrame.rank at the top of this output means — it’s returning the method, then telling us about the object—our DataFrame—the method is associated with):

world.rank

<bound method NDFrame.rank of             country        region  gdppcap08  polityIV  rank
      Zimbabwe        Africa        188       6.0     0
 Congo Kinshasa        Africa        321      15.0     1
        Liberia        Africa        388      10.0     2
  Guinea-Bissau        Africa        538      11.0     3
        Eritrea        Africa        632       3.0     4
..              ...           ...        ...       ...   ...
        Ireland     W. Europe      44200      20.0   140
 United States    N. America      46716      20.0   141
     Singapore  Asia-Pacific      49284       8.0   142
         Norway   Scandinavia      58138      20.0   143
         Qatar   Middle East      85868       0.0   144

[145 rows x 5 columns]>

Similar issues arise if you try to assign a column using dot-notation on the left side of the assignment operator. For example, suppose we want to shift everyone’s rank up by 1:

# We make an assignment to what we *think* is
# the `rank` column
world.rank = world["rank"] + 1

world.head()

	country	region	gdppcap08	polityIV	rank
144	Zimbabwe	Africa	188	6.0	0
29	Congo Kinshasa	Africa	321	15.0	1
76	Liberia	Africa	388	10.0	2
53	Guinea-Bissau	Africa	538	11.0	3
40	Eritrea	Africa	632	3.0	4

It fails silently because what you’ve actually done is over-written the method rank with the column rank plus 1. Now now only has your rank column not changed (see it still starts with 0), but now you’ve broken the rank method:

world.rank()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 world.rank()

TypeError: 'Series' object is not callable

And finally, if you try to create a column using dot-notation on the left-hand side of the assignment operator, you will also get into trouble:

world.rank_doubled = range(0, 2 * len(world), 2)
world.head()

/var/folders/fs/h_8_rwsn5hvg9mhp0txgc_s9v6191b/T/ipykernel_22305/4041630850.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  world.rank_doubled = range(0, 2 * len(world), 2)

	country	region	gdppcap08	polityIV	rank
144	Zimbabwe	Africa	188	6.0	0
29	Congo Kinshasa	Africa	321	15.0	1
76	Liberia	Africa	388	10.0	2
53	Guinea-Bissau	Africa	538	11.0	3
40	Eritrea	Africa	632	3.0	4

See now rank_doubled wasn’t added to your DataFrame? It just disappears. pandas does now raise a warning, but warnings don’t stop your code from running, so if you don’t see it, you can corrupt your data.

My advice on dot-notation:

Just don’t use dot-notation on the left-hand side of the assignment operator. You’re begging for trouble, and just making that a rule means you don’t have to worry about when you might be causing a conflict with a method.
Try not to use it on the right side of the assignment operator. It’s safer than using it on the left side of the assignment operator, but none of us will ever memorize all the names of methods in pandas, and if your column happens to have the same name as a method, you may not notice the error.

Subsetting DataFrame Tricks and Gotchas

Contents

Subsetting DataFrame Tricks and Gotchas#

[] Square brackets#

My advice on using [] on DataFrames#

Getting Columns with Dot-Notation#

`[]` Square brackets#

My advice on using `[]` on DataFrames#