Exploring the Polars Dataframe Library
Polars is a relatively new data frame library that I don't yet have experience with, but I do want to use it for storing data related to this library, so I'm recording my experiments here.
import polars as pl
Quick Start
The syntax looks very much like {dplyr}, following simple verbs that can be chained together via method chaining. You can use a backslash for line continuation, but I prefer using parentheses (both styles are shown below).
df = pl.read_csv("https://j.mp/iriscsv")
(
    df.filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_length").sum())
)
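For comparison, here is the same query written with backslash continuation:
# Same query, using backslashes for line continuation instead of parentheses
df.filter(pl.col("sepal_length") > 5) \
    .groupby("species") \
    .agg(pl.col("sepal_length").sum())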
Polars Expressions
An expression is essentially a definition of a data transformation on a Polars data frame. Because Polars can optimize and parallelize the query that an expression describes, it achieves strong performance.
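My understanding is that the optimization comes in through the lazy API: you build the whole query first, and Polars only executes it (optimized and parallelized) when you call collect(). A minimal sketch, reusing the iris df from the quick start:
# Lazy version of the quick-start query: nothing runs until collect()
(
    df.lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_length").sum())
    .collect()  # the optimized query executes here
)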
# Select the column foo, sort it ascending, and take the first 2 elements
pl.col("foo").sort().head(2)
# You can pipe expressions together
df.select(
    [pl.col("foo").sort().head(2), pl.col("bar").filter(pl.col("foo") == 1).sum()]
)
import polars as pl
import numpy as np
np.random.seed(12)
df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": list(np.random.rand(5)),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
df
# Take the names column, count the unique values two ways, and alias the results
df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
# Sum, minimum, maximum, standard deviation, and variance
df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
# Count the names that end in "am"
df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
# When random > 0.5 use 0, otherwise keep random, then multiply by the sum of nrs
df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random"))
        * pl.sum("nrs"),
    ]
)
# Window functions, like in SQL: the sum of random over groups,
# and the list of random values over names
df.select(
    [
        pl.col("*"),  # select all columns
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
Expression Contexts
Expression contexts define how an expression is evaluated.
Select Context
- Operates over columns
- Must produce Series of the same length, or of length 1 (which is broadcast)
- You can add pl.all() to make it behave more like {dplyr}'s mutate(); otherwise it behaves like transmute().
df.select(
    [
        pl.all(),
        pl.sum("nrs").alias("nrs2"),
        pl.col("names").sort().alias("names2"),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
Add Columns
df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
- Behaves like dplyr::mutate() (see the sketch below).
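A quick sketch of the transmute()/mutate() contrast, using the df defined above:
# select() keeps only the listed expressions, like transmute()
df.select([pl.sum("nrs").alias("nrs_sum")])

# with_columns() appends to the existing columns, like mutate()
df.with_columns([pl.sum("nrs").alias("nrs_sum")])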
GroupBy Context
df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where names is not null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)
- Expressions work on groups
- You may return results of any length (see the sketch below)
- You use agg() after a groupby()
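A minimal sketch of the any-length point: an aggregation can return one value per group or several.
# Aggregations are not limited to one value per group
df.groupby("groups").agg(
    [
        pl.col("random").sum().alias("random_sum"),  # one value per group
        pl.col("random").head(2).alias("first_two"),  # up to two values per group, as a list
    ]
)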
Thoughts
I do think this API is much more user-friendly, tractable, and less error-prone than the pandas API. I would like, as much as possible, for my workflow to use polars whenever I start something new, and to avoid pandas unless there is some package compatibility quirk that needs to be resolved.