Exploring the Polars Dataframe Library
Polars is a relatively new data frame library that I don't yet have experience with, but I do want to use it for storing data related to this library, so I'm recording my experiments here.
import polars as pl
Quick Start
The syntax looks very much like {dplyr}, following simple verbs that can be chained together via method chaining. You can use a backslash for line continuation, but I prefer using parentheses (both styles are shown below).
df = pl.read_csv("https://j.mp/iriscsv")
(
    df.filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_length").sum())
)
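For comparison, here is the same query written with backslash continuation:
# Same query, using backslashes for line continuation instead of parentheses
df.filter(pl.col("sepal_length") > 5) \
    .groupby("species") \
    .agg(pl.col("sepal_length").sum())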
Polars Expressions
An expression is essentially a definition of a data transformation on a Polars data frame. Because Polars can optimize and parallelize the query that an expression describes, it achieves strong performance.
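My understanding is that the optimization comes in through the lazy API: you build the whole query first, and Polars only executes it (optimized and parallelized) when you call collect(). A minimal sketch, reusing the iris df from the quick start:
# Lazy version of the quick-start query: nothing runs until collect()
(
    df.lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_length").sum())
    .collect()  # the optimized query executes here
)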
# Select the column foo, sort it ascending, and take the first 2 elements
pl.col("foo").sort().head(2)
# You can pipe expressions together
df.select(
    [pl.col("foo").sort().head(2), pl.col("bar").filter(pl.col("foo") == 1).sum()]
)
import polars as pl
import numpy as np
np.random.seed(12)
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": list(np.random.rand(5)),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
df
# Take the names column, count the unique values two ways, and alias the results
df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
# Sum, minimum, maximum, standard deviation, and variance
df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
# Count the names that end in "am"
df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
# When random > 0.5 use 0, otherwise keep random, then multiply by the sum of nrs
df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random"))
        * pl.sum("nrs"),
    ]
)
# Window functions, like in SQL: the sum of random over groups,
# and the list of random values over names
df.select(
    [
        pl.col("*"),  # select all columns
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
Expression Contexts
Expression contexts define how an expression is evaluated.
Select Context
- Operates over columns
- Must produce Series of the same length, or of length 1 (which is broadcast)
- You can add pl.all() to make it behave more like {dplyr}'s mutate(); otherwise it behaves like transmute().
df.select(
    [
        pl.all(),
        pl.sum("nrs").alias("nrs2"),
        pl.col("names").sort().alias("names2"),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
Add Columns
df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
- Behaves like dplyr::mutate() (see the sketch below).
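A quick sketch of the transmute()/mutate() contrast, using the df defined above:
# select() keeps only the listed expressions, like transmute()
df.select([pl.sum("nrs").alias("nrs_sum")])

# with_columns() appends to the existing columns, like mutate()
df.with_columns([pl.sum("nrs").alias("nrs_sum")])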
GroupBy Context
df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where names is not null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)
- Expressions work on groups
- You may return results of any length (see the sketch below)
- You use agg() after a groupby()
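A minimal sketch of the any-length point: an aggregation can return one value per group or several.
# Aggregations are not limited to one value per group
df.groupby("groups").agg(
    [
        pl.col("random").sum().alias("random_sum"),  # one value per group
        pl.col("random").head(2).alias("first_two"),  # up to two values per group, as a list
    ]
)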
Thoughts
I do think this API is much more user-friendly, tractable, and less error-prone than the pandas API. I would like, as much as possible, for my workflow to use polars whenever I start something new, and to avoid pandas unless there is some package compatibility quirk that needs to be resolved.