Field notes are notes I leave myself as I go through my day-to-day work. The hope is that other people will also find these notes useful. Note that these notes are unfiltered and unverified.
Polars
Polars is a relatively new data frame library that I don’t yet have experience with, but I do want to use it for storing data related to this library, so I’m recording my experiments here.
Code
import polars as pl
Quick Start
The syntax looks very much like {dplyr}: simple verbs that are chained together as method calls. A long chain can be continued across lines with a backslash, but I prefer wrapping it in parentheses.
An expression is essentially a definition of a data transformation on a Polars data frame. Because Polars can optimize the query built from those expressions and run it in parallel, it is able to achieve good performance.
Code
# Select the column foo, sort it ascending, and get the first 2 elements
pl.col("foo").sort().head(2)
col("foo") ASC.slice(offset=0i32, length=2u64)
Code
# You can pipe expressions together
df.select(
    [
        pl.col("foo").sort().head(2),
        pl.col("bar").filter(pl.col("foo") == 1).sum(),
    ]
)

# Take the names column, count the unique values in two ways, and name each result
df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)

# Count the names that end in "am"
df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘
Code
# When random is > 0.5 use 0, otherwise use random; then multiply by the sum of nrs
df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
│ 0.0      │
│ 2.896465 │
│ 0.0      │
│ 0.160325 │
└──────────┘
Code
# Window functions, as in SQL: get the sum of random over groups, and the list of random over names
df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
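shape: (5, 6)
┌──────┬────────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs  │ names  │ random   │ groups │ sum[random]/groups │ random/name │
│ ---  │ ---    │ ---      │ ---    │ ---                │ ---         │
│ i64  │ str    │ f64      │ str    │ f64                │ list[f64]   │
╞══════╪════════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1    │ "foo"  │ 0.154163 │ "A"    │ 0.894213           │ [0.154163]  │
│ 2    │ "ham"  │ 0.74005  │ "A"    │ 0.894213           │ [0.74005]   │
│ 3    │ "spam" │ 0.263315 │ "B"    │ 0.27789            │ [0.263315]  │
│ null │ "egg"  │ 0.533739 │ "C"    │ 0.533739           │ [0.533739]  │
│ 5    │ null   │ 0.014575 │ "B"    │ 0.27789            │ [0.014575]  │
└──────┴────────┴──────────┴────────┴────────────────────┴─────────────┘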
Expression Contexts
Expression contexts define how the expression is evaluated.
Select Context
Applies expressions over columns
Each expression must produce a Series of the same length, or of length 1, which is then broadcast
You can add pl.all() to make it more like {dplyr}’s mutate(), otherwise it behaves like transmute().
df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)