Parquet#
When it comes to storing tabular data in Python, there are a lot of choices, many of which we’ve talked about before (HDF5, CSV, dta, etc.). However, the storage format I think is best today (October 2023) is parquet.

parquet has a number of strengths:

- It preserves type information: Unlike a CSV, parquet files remember which columns are numeric, which are categorical, etc., so when you re-load your data you can be assured it will look the same as it did when you saved it (see the quick sketch after this list).
- It’s fast: That type information means that when loading, Python doesn’t have to try to figure out the types of each column.
- It’s small: parquet compresses your data automatically (and no, that doesn’t slow it down – in fact it makes it faster. The reason is that reading data from storage is such a comparatively slow operation, it’s faster to load compressed data into RAM and then decompress it than to transfer larger uncompressed files).
- It’s portable: parquet is not a Python-specific format – it’s an Apache Software Foundation standard.
- It’s built for distributed computing: parquet was actually invented to support Hadoop distributed computing.

And as of pandas 2.1, one of the libraries that powers it (pyarrow) comes bundled with pandas!
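If you want to check the size and type-preservation claims yourself, here is a minimal, self-contained sketch (the toy DataFrame and the demo.* file names are just illustrative, not part of the examples that follow):
# Illustrative comparison of CSV vs parquet round-trips.
import os
import pandas as pd
demo = pd.DataFrame(
    {
        "x": [i % 100 for i in range(100_000)],
        "y": pd.Categorical(["low", "medium", "high", "high"] * 25_000),
    }
)
demo.to_csv("demo.csv", index=False)
demo.to_parquet("demo.parquet")
# On repetitive data like this, the parquet file should be much smaller.
print(os.path.getsize("demo.csv"), os.path.getsize("demo.parquet"))
# The CSV round-trip forgets that "y" was categorical; parquet remembers.
print(pd.read_csv("demo.csv").dtypes)
print(pd.read_parquet("demo.parquet").dtypes)
# Clean up the illustration files
os.remove("demo.csv")
os.remove("demo.parquet")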
Using parquet#
Here are some code examples for saving and loading parquet files:
# Make a toy dataset
import pandas as pd
import numpy.random as npr
import random
import string
size = 100000
df = pd.DataFrame(
    {
        "a": npr.randint(0, 10000, size=size),
        "b": [
            "".join(random.choices(string.ascii_uppercase, k=10)) for i in range(size)
        ],
    }
)
df.head()
|   | a    | b          |
|---|------|------------|
| 0 | 3576 | CYPFOXEUOK |
| 1 | 8934 | MTBQBRTSNG |
| 2 | 7428 | HDOTAHEISG |
| 3 | 2046 | DZVSMYDWBP |
| 4 | 7052 | TBXYGKMYIP |
# We can store "b" as a categorical (analogous to Factors in R).
# This is helpful, for reasons we'll discuss on Tuesday,
# and (as we'll see below) the categorical type survives
# the parquet round-trip.
df["b"] = pd.Categorical(df["b"])
df.dtypes
a int64
b category
dtype: object
# Save
df.to_parquet("test.parquet")
# Reload
df2 = pd.read_parquet("test.parquet")
df2.head()
|   | a    | b          |
|---|------|------------|
| 0 | 3576 | CYPFOXEUOK |
| 1 | 8934 | MTBQBRTSNG |
| 2 | 7428 | HDOTAHEISG |
| 3 | 2046 | DZVSMYDWBP |
| 4 | 7052 | TBXYGKMYIP |
# Note that column b is still a categorical.
df2.dtypes
a int64
b category
dtype: object
You can also easily load subsets of columns:
df2 = pd.read_parquet("test.parquet", columns=["b"])
df2.head()
|   | b          |
|---|------------|
| 0 | CYPFOXEUOK |
| 1 | MTBQBRTSNG |
| 2 | HDOTAHEISG |
| 3 | DZVSMYDWBP |
| 4 | TBXYGKMYIP |
Chunking with parquet#
Parquet allows chunking, but not quite as easily as you can chunk a CSV.
First, you have to save your file in chunks of the size you want. Parquet always saves rows in chunks (called row groups), but by default each chunk has 50,000,000 rows, which (if you’re chunking) may be too many:
# Save into 50,000 row chunks,
# so we should get file saved into two chunks.
df.to_parquet("test.parquet", row_group_size=50_000)
# Then we have to read it in using the `fastparquet`
# library itself (there's no way to do this directly from
# pandas I'm afraid):
from fastparquet import ParquetFile
pf = ParquetFile("test.parquet")
# Iterates over row groups
for rg in pf.iter_row_groups():
    print(rg)
a b
0 3576 CYPFOXEUOK
1 8934 MTBQBRTSNG
2 7428 HDOTAHEISG
3 2046 DZVSMYDWBP
4 7052 TBXYGKMYIP
... ... ...
49995 2048 TQAKZQYDAW
49996 2461 LAQXYKGFDK
49997 9509 BHVESYGUML
49998 3626 WGONPSEPYT
49999 681 HKIDHJBLZQ
[50000 rows x 2 columns]
a b
0 6435 XFOYVSEBHR
1 3175 UNHTHMZUNA
2 1083 EPVGFGZTSE
3 7424 KWAHGCUPYJ
4 296 RONMJTKGEA
... ... ...
49995 8489 BDWRLTZLSM
49996 3943 WPRFOWTWLQ
49997 9784 LUBQICKYQA
49998 6256 EUITUAVZQE
49999 9411 RPGOEHYMLW
[50000 rows x 2 columns]
# So you can filter and work with each group if you want:
all_rows_w_even_a = []
for rg in pf.iter_row_groups():
    rg = rg.loc[(rg["a"] % 2) == 0]
    all_rows_w_even_a.append(rg)
even_rows = pd.concat(all_rows_w_even_a)
even_rows.sample(10)
|       | a    | b          |
|-------|------|------------|
| 643   | 6558 | EZLQUOMYBN |
| 31920 | 1278 | CBGUFGTRKI |
| 44151 | 7382 | QEEXGJDDHX |
| 35662 | 5396 | JPSGFWQHTU |
| 15697 | 7468 | GTTSBHPKLM |
| 25991 | 6310 | GLTBRLPEKU |
| 7456  | 5502 | KNBMNVMWWA |
| 49984 | 5758 | BNTBTKXBDQ |
| 41328 | 8714 | AVCGRMQVIN |
| 43772 | 8390 | WPQYXPULAH |
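As an aside, if you would rather stay with the pyarrow engine mentioned above instead of fastparquet, its own ParquetFile class can also read a file in pieces. Here is a minimal sketch of the same idea (the batch_size value is just illustrative):
# Alternative sketch using pyarrow rather than fastparquet.
# iter_batches() yields pyarrow RecordBatch objects that we convert
# back to pandas DataFrames one piece at a time.
import pyarrow.parquet as pq
arrow_pf = pq.ParquetFile("test.parquet")
for batch in arrow_pf.iter_batches(batch_size=50_000):
    chunk = batch.to_pandas()
    print(chunk.shape)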
# Clean up
import os
os.remove("test.parquet")