Welcome to my field notes!

Field notes are notes I leave myself as I go through my day-to-day work. The hope is that other people will also find these notes useful. Note that these notes are unfiltered and unverified.

Polars

Polars is a relatively new data frame library that I don’t yet have experience with, but I do want to use it for storing data related to this library, so I’m recording my experiments here.

Code
import polars as pl

Quick Start

The syntax looks very much like {dplyr}: simple verbs chained together via method chaining. You can use a backslash for line continuation, but I prefer wrapping the chain in parentheses.

Code
df = pl.read_csv("https://j.mp/iriscsv")
(
    df.filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_length").sum())
)
shape: (3, 2)
species sepal_length
str f64
"versicolor" 281.9
"virginica" 324.5
"setosa" 116.9

Polars Expressions

An expression is essentially a definition of a data transformation on a Polars data frame. Because Polars can optimize the query built from an expression and parallelize its execution, it achieves strong performance.

Code
# Select the column foo, sort it ascending, and get first 2 elements
pl.col("foo").sort().head(2)
col("foo") ASC.slice(offset=0i32, length=2u64)
Code
# You can pipe expressions together
df.select(
    [pl.col("foo").sort().head(2), pl.col("bar").filter(pl.col("foo") == 1).sum()]
)
NotFoundError: foo
Code
import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": list(np.random.rand(5)),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
df
shape: (5, 4)
nrs names random groups
i64 str f64 str
1 "foo" 0.154163 "A"
2 "ham" 0.74005 "A"
3 "spam" 0.263315 "B"
null "egg" 0.533739 "C"
5 null 0.014575 "B"
Code
# Take the names column, count its unique values two ways, and alias each result
df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
shape: (1, 2)
unique_names_1 unique_names_2
u32 u32
5 5
Code
# Summing, minimum, maximum, std dev, and variance
df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
shape: (1, 6)
sum min max other_max std dev variance
f64 f64 f64 f64 f64 f64
1.705842 0.014575 0.74005 0.74005 0.293209 0.085971
Code
# Count the names that end in "am"
df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
shape: (1, 1)
names
u32
2
Code
# When random > 0.5 use 0, otherwise use random; then multiply by the sum of nrs
df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random"))
        * pl.sum("nrs"),
    ]
)
shape: (5, 1)
literal
f64
1.695791
0.0
2.896465
0.0
0.160325
Code
# Window functions, as in SQL: the sum of random over groups, and the list of random over names
df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
shape: (5, 6)
nrs names random groups sum[random]/groups random/name
i64 str f64 str f64 list[f64]
1 "foo" 0.154163 "A" 0.894213 [0.154163]
2 "ham" 0.74005 "A" 0.894213 [0.74005]
3 "spam" 0.263315 "B" 0.27789 [0.263315]
null "egg" 0.533739 "C" 0.533739 [0.533739]
5 null 0.014575 "B" 0.27789 [0.014575]

Expression Contexts

Expression contexts define how the expression is evaluated.

Select Context

  • Does operations over columns
  • Must produce Series of same length or length 1 (broadcasted)
  • You can add pl.all() to make it more like {dplyr}’s mutate(), otherwise it behaves like transmute().
Code
df.select(
    [
        pl.all(),
        pl.sum("nrs").alias("nrs2"),
        pl.col("names").sort().alias("names2"),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
shape: (5, 8)
nrs names random groups nrs2 names2 first name 10xnrs
i64 str f64 str i64 str str f64
1 "foo" 0.154163 "A" 11 null "foo" 27.5
2 "ham" 0.74005 "A" 11 "egg" "foo" 27.5
3 "spam" 0.263315 "B" 11 "foo" "foo" 27.5
null "egg" 0.533739 "C" 11 "ham" "foo" 27.5
5 null 0.014575 "B" 11 "spam" "foo" 27.5

Add Columns

  • Behaves like dplyr::mutate().
Code
df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
shape: (5, 6)
nrs names random groups nrs_sum count
i64 str f64 str i64 u32
1 "foo" 0.154163 "A" 11 5
2 "ham" 0.74005 "A" 11 5
3 "spam" 0.263315 "B" 11 5
null "egg" 0.533739 "C" 11 5
5 null 0.014575 "B" 11 5

GroupBy Context

  • Expressions work on groups
  • You may return any length
  • You use agg() after a groupby()
Code
df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)
shape: (3, 5)
groups nrs count random_sum reversed names
str i64 u32 f64 list[str]
"B" 8 2 0.263315 [null, "spam"]
"A" 3 2 0.894213 ["ham", "foo"]
"C" null 1 0.533739 ["egg"]