Queries reprised: asking your data questions#
Much of this week has focused on data preparation: combining datasets together, reconfiguring and aggregating them into groups, all for the purpose of preparing the data to help us develop insights and ask questions. In this lesson, we’ll be returning to how we can programmatically ask questions of our datasets using queries. We have seen all of this before when we discussed subsetting our data and filtering based on different logical expressions. Here we will introduce a compact way to execute such queries to add another tool into your programmatic toolbox for working with data. A few of you may have encountered structured query language (SQL) before, and if so, there will be many connections to what we discuss today.
Let’s start with our sales data from a previous lesson:
import pandas as pd
pd.set_option("mode.copy_on_write", True)
sales = pd.DataFrame(
data={
"employee": [
"Katrina",
"Guanyu",
"Jan",
"Roman",
"Jacqueline",
"Paola",
"Esperanza",
"Alaina",
"Egweyn",
],
"sales": [14, 17, 6, 12, 8, 3, 7, 15, 5],
"year": [2018, 2019, 2020, 2018, 2020, 2019, 2019, 2020, 2020],
}
)
sales
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
1 | Guanyu | 17 | 2019 |
2 | Jan | 6 | 2020 |
3 | Roman | 12 | 2018 |
4 | Jacqueline | 8 | 2020 |
5 | Paola | 3 | 2019 |
6 | Esperanza | 7 | 2019 |
7 | Alaina | 15 | 2020 |
8 | Egweyn | 5 | 2020 |
To query a pandas
DataFrame, we need to use the query
method which takes a query string. This string allows you to ask a world of questions of your data for example, see the table below for a set of examples of quieries and their corresponding query strings for our sales
dataset:
Query |
Query string |
---|---|
Show me sales greater than 10 |
“sales > 10” |
Show me data from 2018 |
“year == 2018” |
Show me sales are greater than 13 and the year is 2018 |
“sales > 13 and year == 2018” |
Show me everything EXCEPT for when sales are greater than 13 and the year is 2018 |
“not (sales > 13 and year == 2018)” |
Show me data where sales divided by 3 are greater than 3 |
“sales/3 > 3” |
Show me employees whose names are alphabetically after J |
“employee > ‘J’” |
Let’s show each of these queries in action below:
sales.query("sales > 10")
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
1 | Guanyu | 17 | 2019 |
3 | Roman | 12 | 2018 |
7 | Alaina | 15 | 2020 |
sales.query("year == 2018")
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
3 | Roman | 12 | 2018 |
sales.query("sales > 13 and year == 2018")
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
sales.query("not (sales > 13 and year == 2018)")
employee | sales | year | |
---|---|---|---|
1 | Guanyu | 17 | 2019 |
2 | Jan | 6 | 2020 |
3 | Roman | 12 | 2018 |
4 | Jacqueline | 8 | 2020 |
5 | Paola | 3 | 2019 |
6 | Esperanza | 7 | 2019 |
7 | Alaina | 15 | 2020 |
8 | Egweyn | 5 | 2020 |
sales.query("sales/3 > 3")
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
1 | Guanyu | 17 | 2019 |
3 | Roman | 12 | 2018 |
7 | Alaina | 15 | 2020 |
sales.query('employee > "J"')
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
2 | Jan | 6 | 2020 |
3 | Roman | 12 | 2018 |
4 | Jacqueline | 8 | 2020 |
5 | Paola | 3 | 2019 |
Now, these same queries could have been accomplished using selection and filtering concept we discussed earlier, as shown below, but requiring more sytax (and the query
method is often more computationally efficient). Let’s try rewriting sales.query('sales > 13 and year == 2018')
using the techniques we described earlier:
sales[(sales["sales"] > 13) & (sales["year"] == 2018)]
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
The query
method makes our code far easier to follow.
Another nice feature of query
is that we can bring in variables from our local workspace, but outside of the DataFrame by prefixing the @
symbol to the variable name. Let’s say we knew that each sale was worth $1,000. Then we could query based on the total dollar amount of sales as follows:
revenue_per_sale = 1000
sales.query("sales * @revenue_per_sale > 10000")
employee | sales | year | |
---|---|---|---|
0 | Katrina | 14 | 2018 |
1 | Guanyu | 17 | 2019 |
3 | Roman | 12 | 2018 |
7 | Alaina | 15 | 2020 |
Computation with columns#
In a similar way that we were able to create more readable and efficient queries with the query
method, pandas
also has a faster, more readable method for performing a number of computations on columns. let’s create a DataFrame with some data to demonstrate:
import numpy as np
rand_matrix = np.random.rand(5, 4)
data = pd.DataFrame(
rand_matrix, columns=["speed_initial", "speed_final", "time_initial", "time_final"]
)
data_copy = data.copy() # Make a copy to use later
data
speed_initial | speed_final | time_initial | time_final | |
---|---|---|---|---|
0 | 0.271065 | 0.720802 | 0.796269 | 0.433519 |
1 | 0.312085 | 0.349255 | 0.258158 | 0.833784 |
2 | 0.986399 | 0.129630 | 0.059127 | 0.246308 |
3 | 0.542444 | 0.295594 | 0.250011 | 0.102992 |
4 | 0.226410 | 0.670819 | 0.388360 | 0.600059 |
In physics, we can calculate acceleration of an object by it’s change in velocity divided by the time it took to change that velocity. Let’s calculate that using the tools we’ve discussed alread:
data["acceleration"] = (data["speed_final"] - data["speed_initial"]) / (
data["time_final"] - data["time_initial"]
)
data
speed_initial | speed_final | time_initial | time_final | acceleration | |
---|---|---|---|---|---|
0 | 0.271065 | 0.720802 | 0.796269 | 0.433519 | -1.239797 |
1 | 0.312085 | 0.349255 | 0.258158 | 0.833784 | 0.064573 |
2 | 0.986399 | 0.129630 | 0.059127 | 0.246308 | -4.577205 |
3 | 0.542444 | 0.295594 | 0.250011 | 0.102992 | 1.679045 |
4 | 0.226410 | 0.670819 | 0.388360 | 0.600059 | 2.099245 |
Instead we can use the eval
function, which has a syntax similar to that of query
, but lets us do a number of common computations more clearly (and more quickly, since the code will run more quickly as well):
data_copy.eval(
"acceleration = (speed_final - speed_initial) / (time_final - time_initial)",
inplace=True,
)
data_copy
speed_initial | speed_final | time_initial | time_final | acceleration | |
---|---|---|---|---|---|
0 | 0.271065 | 0.720802 | 0.796269 | 0.433519 | -1.239797 |
1 | 0.312085 | 0.349255 | 0.258158 | 0.833784 | 0.064573 |
2 | 0.986399 | 0.129630 | 0.059127 | 0.246308 | -4.577205 |
3 | 0.542444 | 0.295594 | 0.250011 | 0.102992 | 1.679045 |
4 | 0.226410 | 0.670819 | 0.388360 | 0.600059 | 2.099245 |
Similarly, we could prefix a local variable name with the @
symbol to incorporate the variable into the eval
expression. While not appropriate for cases which use complex functions, eval
is a helpful tool for simple expressions.
Summary#
While none of the ends that were achieved by the tools in this section were unable to be reached using the tools from earlier parts of this course, the query
and eval
tools discussed here help to make that process easier to understand, easier to implement, and more efficient for processing.