Parquet#
When it comes to storing tabular data in Python, there are a lot of choices, many of which we’ve talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2023) is parquet.

parquet has a number of strengths:

- It preserves type information: Unlike a CSV, parquet files remember which columns are numeric, which are categorical, etc., so when you re-load your data you can be assured it will look the same as it did when you saved it (see the quick sketch after this list).
- It’s fast: That type information means that when loading, Python doesn’t have to try to figure out the types of each column.
- It’s small: parquet compresses your data automatically (and no, that doesn’t slow it down – in fact it makes it faster. The reason is that reading data from storage is such a comparatively slow operation, it’s faster to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).
- It’s portable: parquet is not a Python-specific format – it’s an Apache Software Foundation standard.
- It’s built for distributed computing: parquet was actually invented to support Hadoop distributed computing.

And as of pandas 2.1, one of the libraries that powers it (pyarrow) comes bundled with pandas!
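If you want to check the size and type-preservation claims yourself, here is a minimal, self-contained sketch (the toy DataFrame and the demo.* file names are just illustrative, not part of the examples that follow):
# Illustrative comparison of CSV vs parquet round-trips.
import os
import pandas as pd
demo = pd.DataFrame(
    {
        "x": [i % 100 for i in range(100_000)],
        "y": pd.Categorical(["low", "medium", "high", "high"] * 25_000),
    }
)
demo.to_csv("demo.csv", index=False)
demo.to_parquet("demo.parquet")
# On repetitive data like this, the parquet file should be much smaller.
print(os.path.getsize("demo.csv"), os.path.getsize("demo.parquet"))
# The CSV round-trip forgets that "y" was categorical; parquet remembers.
print(pd.read_csv("demo.csv").dtypes)
print(pd.read_parquet("demo.parquet").dtypes)
# Clean up the illustration files
os.remove("demo.csv")
os.remove("demo.parquet")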
Using parquet#
Here are some code examples for saving and loading parquet files:
# Make a toy dataset
import pandas as pd
import numpy.random as npr
import random
import string
size = 100000
df = pd.DataFrame(
    {
        "a": npr.randint(0, 10000, size=size),
        "b": [
            "".join(random.choices(string.ascii_uppercase, k=10)) for i in range(size)
        ],
    }
)
df.head()
|   | a    | b          |
|---|------|------------|
| 0 | 3576 | CYPFOXEUOK |
| 1 | 8934 | MTBQBRTSNG |
| 2 | 7428 | HDOTAHEISG |
| 3 | 2046 | DZVSMYDWBP |
| 4 | 7052 | TBXYGKMYIP |
# We can store "b" as a categorical (analogous to Factors in R).
# This is helpful, for reasons we'll discuss on Tuesday,
# and (as we'll see below) the categorical type survives
# the parquet round-trip.
df["b"] = pd.Categorical(df["b"])
df.dtypes
a int64
b category
dtype: object
# Save
df.to_parquet("test.parquet")
# Reload
df2 = pd.read_parquet("test.parquet")
df2.head()
|   | a    | b          |
|---|------|------------|
| 0 | 3576 | CYPFOXEUOK |
| 1 | 8934 | MTBQBRTSNG |
| 2 | 7428 | HDOTAHEISG |
| 3 | 2046 | DZVSMYDWBP |
| 4 | 7052 | TBXYGKMYIP |
# Note that column b is still a categorical.
df2.dtypes
a int64
b category
dtype: object
You can also easily load subsets of columns:
df2 = pd.read_parquet("test.parquet", columns=["b"])
df2.head()
|   | b          |
|---|------------|
| 0 | CYPFOXEUOK |
| 1 | MTBQBRTSNG |
| 2 | HDOTAHEISG |
| 3 | DZVSMYDWBP |
| 4 | TBXYGKMYIP |
Chunking with parquet#
Parquet allows chunking, but not quite as easily as you can chunk a CSV.
First, you have to save your file in chunks of the size you want. Parquet always saves rows in chunks (called row groups), but by default each chunk has 50,000,000 rows, which (if you’re chunking) may be too many:
# Save into 50,000 row chunks,
# so we should get file saved into two chunks.
df.to_parquet("test.parquet", row_group_size=50_000)
# Then we have to read it in using the `fastparquet`
# library itself (there's no way to do this directly from
# pandas I'm afraid):
from fastparquet import ParquetFile
pf = ParquetFile("test.parquet")
# Iterates over row groups
for rg in pf.iter_row_groups():
    print(rg)
a b
0 3576 CYPFOXEUOK
1 8934 MTBQBRTSNG
2 7428 HDOTAHEISG
3 2046 DZVSMYDWBP
4 7052 TBXYGKMYIP
... ... ...
49995 2048 TQAKZQYDAW
49996 2461 LAQXYKGFDK
49997 9509 BHVESYGUML
49998 3626 WGONPSEPYT
49999 681 HKIDHJBLZQ
[50000 rows x 2 columns]
a b
0 6435 XFOYVSEBHR
1 3175 UNHTHMZUNA
2 1083 EPVGFGZTSE
3 7424 KWAHGCUPYJ
4 296 RONMJTKGEA
... ... ...
49995 8489 BDWRLTZLSM
49996 3943 WPRFOWTWLQ
49997 9784 LUBQICKYQA
49998 6256 EUITUAVZQE
49999 9411 RPGOEHYMLW
[50000 rows x 2 columns]
# So you can filter and work with each group if you want:
all_rows_w_even_a = []
for rg in pf.iter_row_groups():
    rg = rg.loc[(rg["a"] % 2) == 0]
    all_rows_w_even_a.append(rg)
even_rows = pd.concat(all_rows_w_even_a)
even_rows.sample(10)
|       | a    | b          |
|-------|------|------------|
| 643   | 6558 | EZLQUOMYBN |
| 31920 | 1278 | CBGUFGTRKI |
| 44151 | 7382 | QEEXGJDDHX |
| 35662 | 5396 | JPSGFWQHTU |
| 15697 | 7468 | GTTSBHPKLM |
| 25991 | 6310 | GLTBRLPEKU |
| 7456  | 5502 | KNBMNVMWWA |
| 49984 | 5758 | BNTBTKXBDQ |
| 41328 | 8714 | AVCGRMQVIN |
| 43772 | 8390 | WPQYXPULAH |
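As an aside, if you would rather stay with the pyarrow engine mentioned above instead of fastparquet, its own ParquetFile class can also read a file in pieces. Here is a minimal sketch of the same idea (the batch_size value is just illustrative):
# Alternative sketch using pyarrow rather than fastparquet.
# iter_batches() yields pyarrow RecordBatch objects that we convert
# back to pandas DataFrames one piece at a time.
import pyarrow.parquet as pq
arrow_pf = pq.ParquetFile("test.parquet")
for batch in arrow_pf.iter_batches(batch_size=50_000):
    chunk = batch.to_pandas()
    print(chunk.shape)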
# Clean up
import os
os.remove("test.parquet")