1 - Introducción a Polars

Polars es una librería de Python orientada a la manipulación y análisis de datos tabulares. Su API está pensada para trabajar de manera clara y eficiente con tablas, selecciones, filtros, agregaciones y transformaciones de columnas.

Las dos estructuras más importantes son:

import polars as pl

import pyprojroot

ROOT = pyprojroot.here()

A lo largo de este apunte vamos a trabajar con datos de los juegos olímpicos que se encuentran en olympics.csv. Este conjunto de datos, que contiene datos básicos de los atletas que participaron en los juegos olímpicos entre Atenas 1896 y Río 2016, fue publicado en la semana 32 del año 2024 proyecto TidyTuesday y contiene las siguientes variables:

Columna Descripción
id Identificador del atleta.
name Nombre del atleta.
sex Sexo del atleta.
age Edad del atleta.
height Altura del atleta en centímetros.
weight Peso del atleta en kilogramos.
team País o equipo por el que compite.
noc Código NOC asociado a la región o delegación.
games Nombre de la edición de los Juegos Olímpicos.
year Año de los Juegos Olímpicos.
season Temporada de los Juegos: invierno o verano.
city Ciudad anfitriona de los Juegos Olímpicos.
sport Deporte.
event Prueba o evento específico.
medal Medalla obtenida: oro, plata, bronce o valor faltante si no obtuvo medalla.

Lectura de datos tabulares

La lectura de datos es una de las tareas fundamentales en el análisis de datos. En Polars, muchas fuciones para importar datos comienzan con read_ y devuelven un DataFrame. Algunos ejemplos son:

  • pl.read_csv()
  • pl.read_parquet()
  • pl.read_excel()
  • pl.read_json()
  • pl.read_database()

Los archivos CSV (valores separados por comas) son uno de los formatos más comunes para almacenar datos tabulares. En este caso, los datos están en olympics.csv. Vamos a usar pl.read_csv() para leerlos.

datos = pl.read_csv(ROOT / "datos" / "olympics.csv")

Inspección de datos tabulares

La representación de un DataFrame en una Jupyter Notebook permite echar un primer vistazo a los datos:

datos
shape: (271_116, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
135569 "Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" 1976 "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135571 "Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" 1998 "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null
135571 "Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" 2002 "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null

No solo es posible ver las primeras y últimas filas de la tabla, sino que también es posible ver la cantidad de filas y columnas (shape: (271_116, 15)) y el tipo de datos de cada una de las variables. Con el objetivo de hacer un uso eficiente del almacenamiento y poder de procesamiento disponible, Polars ofrece una gran variedad de tipos de datos. A lo largo del curso los iremos conociendo.

type(datos)
polars.dataframe.frame.DataFrame

Si quisieramos ver solamente las primeras filas o columnas, podemos utilizar los métodos .head() y .tail() de DataFrame. Estos cuentan con un parámetro opcional n que permite controlar la cantidad de filas obtenidas.

datos.head()
shape: (5, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
datos.tail()
shape: (5, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
135569 "Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" 1976 "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135571 "Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" 1998 "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null
135571 "Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" 2002 "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null
datos.head(9)
shape: (9, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 27 185 82.0 "Netherlands" "NED" "1994 Winter" 1994 "Winter" "Lillehammer" "Speed Skating" "Speed Skating Women's 500 metr… null

Para inspeccionar la estructura del DataFrame de manera programática se puede consultar su esquema (schema), los tipos de datos (dtypes) o la la cantidad de filas y columnas (shape).

datos.schema
Schema([('id', Int64),
        ('name', String),
        ('sex', String),
        ('age', Int64),
        ('height', Int64),
        ('weight', Float64),
        ('team', String),
        ('noc', String),
        ('games', String),
        ('year', Int64),
        ('season', String),
        ('city', String),
        ('sport', String),
        ('event', String),
        ('medal', String)])
datos.dtypes
[Int64,
 String,
 String,
 Int64,
 Int64,
 Float64,
 String,
 String,
 String,
 Int64,
 String,
 String,
 String,
 String,
 String]
datos.shape
(271116, 15)

También es posible obtener una tabla con el conteo de valores nulos por columna usando el método null_count.

datos.null_count()
shape: (1, 15)
id name sex age height weight team noc games year season city sport event medal
u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
0 0 0 9474 60171 62875 0 0 0 0 0 0 0 0 231333

En el caso de necesitar un resumen rápido y más completo, podemos hacer uso del método describe.

datos.describe()
shape: (9, 16)
statistic id name sex age height weight team noc games year season city sport event medal
str f64 str str f64 f64 f64 str str str f64 str str str str str
"count" 271116.0 "271116" "271116" 261642.0 210945.0 208241.0 "271116" "271116" "271116" 271116.0 "271116" "271116" "271116" "271116" "39783"
"null_count" 0.0 "0" "0" 9474.0 60171.0 62875.0 "0" "0" "0" 0.0 "0" "0" "0" "0" "231333"
"mean" 68248.954396 null null 25.556898 175.33897 70.702393 null null null 1978.37848 null null null null null
"std" 39022.286345 null null 6.393561 10.518462 14.34802 null null null 29.877632 null null null null null
"min" 1.0 "A Dijiang" "F" 10.0 127.0 25.0 "30. Februar" "AFG" "1896 Summer" 1896.0 "Summer" "Albertville" "Aeronautics" "Aeronautics Mixed Aeronautics" "Bronze"
"25%" 34643.0 null null 21.0 168.0 60.0 null null null 1960.0 null null null null null
"50%" 68205.0 null null 24.0 175.0 70.0 null null null 1988.0 null null null null null
"75%" 102097.0 null null 28.0 183.0 79.0 null null null 2002.0 null null null null null
"max" 135571.0 "zzet nce" "M" 97.0 226.0 214.0 "rn-2" "ZIM" "2016 Summer" 2016.0 "Winter" "Vancouver" "Wrestling" "Wrestling Women's Middleweight… "Silver"

Si se quieren descartar rápidamente filas con valores nulos en algunas de sus columnas, se usa el método drop_nulls.

datos.drop_nulls()
shape: (30_181, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
16 "Juhamatti Tapio Aaltonen" "M" 28 184 85.0 "Finland" "FIN" "2014 Winter" 2014 "Winter" "Sochi" "Ice Hockey" "Ice Hockey Men's Ice Hockey" "Bronze"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Individual Al… "Bronze"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Team All-Arou… "Gold"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Horse Vault" "Gold"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Pommelled Hor… "Gold"
135553 "Galina Ivanovna Zybina (-Fyodo… "F" 25 168 80.0 "Soviet Union" "URS" "1956 Summer" 1956 "Summer" "Melbourne" "Athletics" "Athletics Women's Shot Put" "Silver"
135553 "Galina Ivanovna Zybina (-Fyodo… "F" 33 168 80.0 "Soviet Union" "URS" "1964 Summer" 1964 "Summer" "Tokyo" "Athletics" "Athletics Women's Shot Put" "Bronze"
135554 "Bogusaw Zych" "M" 28 182 82.0 "Poland" "POL" "1980 Summer" 1980 "Summer" "Moskva" "Fencing" "Fencing Men's Foil, Team" "Bronze"
135563 "Olesya Nikolayevna Zykina" "F" 19 171 64.0 "Russia" "RUS" "2000 Summer" 2000 "Summer" "Sydney" "Athletics" "Athletics Women's 4 x 400 metr… "Bronze"
135563 "Olesya Nikolayevna Zykina" "F" 23 171 64.0 "Russia" "RUS" "2004 Summer" 2004 "Summer" "Athina" "Athletics" "Athletics Women's 4 x 400 metr… "Silver"

Vale la pena notar que los DataFrame también cuentan con el método drop_nans, pero este descarta filas con valores numéricos que no son válidos.

Cálculo de resumenes

A la hora de trabajar con datos, suele ser de interés procesarlos de alguna manera para obtener información respecto de algunas variables relevantes. Por ejemplo, se pueden calcular valores medios, mínimos, máximos, medianas, diferencias, razones, etc. Las cuentas que resumen muchos datos en un único valor se llaman resumenes o agregaciones (porque se están resumiendo agregando varios valores de alguna forma en un único valor).

Resumenes simples

Como ya vimos, una forma muy rápida de calcular estadísticas básicas con un DataFrame es utilizar el método .describe, que genera estadísticas básicas para las columnas numéricas del conjunto de datos. Veamos que también nos permite especificar el listado de percentiles que queremos utilizar.

datos.describe(percentiles=[0.1, 0.4, 0.6, 0.9])
shape: (10, 16)
statistic id name sex age height weight team noc games year season city sport event medal
str f64 str str f64 f64 f64 str str str f64 str str str str str
"count" 271116.0 "271116" "271116" 261642.0 210945.0 208241.0 "271116" "271116" "271116" 271116.0 "271116" "271116" "271116" "271116" "39783"
"null_count" 0.0 "0" "0" 9474.0 60171.0 62875.0 "0" "0" "0" 0.0 "0" "0" "0" "0" "231333"
"mean" 68248.954396 null null 25.556898 175.33897 70.702393 null null null 1978.37848 null null null null null
"std" 39022.286345 null null 6.393561 10.518462 14.34802 null null null 29.877632 null null null null null
"min" 1.0 "A Dijiang" "F" 10.0 127.0 25.0 "30. Februar" "AFG" "1896 Summer" 1896.0 "Summer" "Albertville" "Aeronautics" "Aeronautics Mixed Aeronautics" "Bronze"
"10%" 14102.0 null null 19.0 162.0 54.0 null null null 1928.0 null null null null null
"40%" 54826.0 null null 23.0 173.0 66.0 null null null 1976.0 null null null null null
"60%" 81686.0 null null 26.0 178.0 73.0 null null null 1992.0 null null null null null
"90%" 122194.0 null null 33.0 188.0 89.0 null null null 2012.0 null null null null null
"max" 135571.0 "zzet nce" "M" 97.0 226.0 214.0 "rn-2" "ZIM" "2016 Summer" 2016.0 "Winter" "Vancouver" "Wrestling" "Wrestling Women's Middleweight… "Silver"

A nivel columna, si queremos calcular la cantidad de ocurrencias de una columna categórica o discreta, podemos seleccionar la columna y luego usar value_counts().

datos["sex"].value_counts()
shape: (2, 2)
sex count
str u32
"F" 74522
"M" 196594

Notemos que el método value_counts pertenece a la Series y no al DataFrame.

type(datos["sex"])
polars.series.series.Series

Este método también recibe los parámetros sort que ordena las filas de mayor a menor cantidad de ocurrencias, name que permite renombrar la columna de conteo y normalize que divide los conteos por su suma de modo tal que representan una proporción.

datos["noc"].value_counts(sort=True)
shape: (230, 2)
noc count
str u32
"USA" 18853
"FRA" 12758
"GBR" 12256
"ITA" 10715
"GER" 9830
"YMD" 5
"SSD" 3
"NBO" 2
"UNK" 2
"NFL" 1
datos["noc"].value_counts(sort=True, name="n")
shape: (230, 2)
noc n
str u32
"USA" 18853
"FRA" 12758
"GBR" 12256
"ITA" 10715
"GER" 9830
"YMD" 5
"SSD" 3
"NBO" 2
"UNK" 2
"NFL" 1
datos["noc"].value_counts(sort=True, name="p", normalize=True)
shape: (230, 2)
noc p
str f64
"USA" 0.069539
"FRA" 0.047057
"GBR" 0.045206
"ITA" 0.039522
"GER" 0.036258
"YMD" 0.000018
"SSD" 0.000011
"NBO" 0.000007
"UNK" 0.000007
"NFL" 0.000004

Otros métodos interesantes son unique y .n_unique, que devuelven los valores únicos de una columna y su cantidad.

datos["noc"].unique()
shape: (230,)
noc
str
"TLS"
"GUM"
"PAK"
"NOR"
"GBS"
"KAZ"
"MDV"
"SAM"
"STP"
"VIN"
datos["noc"].n_unique()
230
datos["games"].unique()
shape: (51,)
games
str
"1988 Winter"
"1980 Winter"
"1980 Summer"
"2014 Winter"
"1976 Winter"
"1984 Winter"
"1924 Summer"
"1896 Summer"
"1998 Winter"
"1932 Summer"

En cuanto a las variables numéricas, las Series de Polars ofrecen una gran variedad de resumenes estadísticos:

datos["age"].mean()
25.556898357297374
print("Media:", datos["age"].mean())
print("Mediana:", datos["age"].median())
print("Desvío estándar:", datos["age"].std())
print("Mínimo:", datos["age"].min())
print("Máximo:", datos["age"].max())
print("Percentil 10:", datos["age"].quantile(0.1))
Media: 25.556898357297374
Mediana: 24.0
Desvío estándar: 6.393560847035814
Mínimo: 10
Máximo: 97
Percentil 10: 19.0

Expresiones

En Polars, es posible construir expresiones que representan transformaciones sobre los datos.

Para hacerlo, primero se indican las columnas con pl.col() y luego se las combina usando operadores aritméticos habituales o métodos provistos por Polars.

Por ejemplo, la siguiente expresión combina las columnas "height" y "weight" de modo tal que se calcule el índice de masa corporal.

pl.col("weight") / (pl.col("height") / 100) ** 2
[(col("weight")) / ([(col("height")) / (dyn int: 100)].pow([dyn int: 2]))]

Como se puede ver, el bloque anterior todavía no realiza ningún cálculo sobre el conjunto de datos. Lo que hace es definir, de forma abstracta, una transformación que queremos aplicar.

Para evaluar esa expresión, en este caso para calcular el índice de masa corporal, necesitamos usarla dentro de un contexto de ejecución. Polars ofrece varios, pero por ahora nos interesa select. Este contexto permite aplicar expresiones sobre las columnas y producir nuevas columnas, ya sea como resúmenes, combinaciones de columnas existentes o incluso valores literales.

Veamos que, en principio, select permite seleccionar columnas:

datos.select("name", "age", "sex")
shape: (271_116, 3)
name age sex
str i64 str
"A Dijiang" 24 "M"
"A Lamusi" 23 "M"
"Gunnar Nielsen Aaby" 24 "M"
"Edgar Lindenau Aabye" 34 "M"
"Christine Jacoba Aaftink" 21 "F"
"Andrzej ya" 29 "M"
"Piotr ya" 27 "M"
"Piotr ya" 27 "M"
"Tomasz Ireneusz ya" 30 "M"
"Tomasz Ireneusz ya" 34 "M"

Sin embargo, como acabamos de mencionar, también permite ejecutar expresiones.

datos.select("name", pl.col("age") * 12)
shape: (271_116, 2)
name age
str i64
"A Dijiang" 288
"A Lamusi" 276
"Gunnar Nielsen Aaby" 288
"Edgar Lindenau Aabye" 408
"Christine Jacoba Aaftink" 252
"Andrzej ya" 348
"Piotr ya" 324
"Piotr ya" 324
"Tomasz Ireneusz ya" 360
"Tomasz Ireneusz ya" 408

Si queremos guardar el resultado de la expresión en una nueva columna, tenemos que pasarla como parámetro:

datos.select(
    "name", 
    "height", 
    "weight", 
    bmi=pl.col("weight") / (pl.col("height") / 100) ** 2
)
shape: (271_116, 4)
name height weight bmi
str i64 f64 f64
"A Dijiang" 180 80.0 24.691358
"A Lamusi" 170 60.0 20.761246
"Gunnar Nielsen Aaby" null null null
"Edgar Lindenau Aabye" null null null
"Christine Jacoba Aaftink" 185 82.0 23.959094
"Andrzej ya" 179 89.0 27.776911
"Piotr ya" 176 59.0 19.047004
"Piotr ya" 176 59.0 19.047004
"Tomasz Ireneusz ya" 185 96.0 28.049671
"Tomasz Ireneusz ya" 185 96.0 28.049671

O utilizar el método alias.

datos.select(
    "name", 
    "height", 
    "weight", 
    (pl.col("weight") / (pl.col("height") / 100) ** 2).alias("bmi")
)
shape: (271_116, 4)
name height weight bmi
str i64 f64 f64
"A Dijiang" 180 80.0 24.691358
"A Lamusi" 170 60.0 20.761246
"Gunnar Nielsen Aaby" null null null
"Edgar Lindenau Aabye" null null null
"Christine Jacoba Aaftink" 185 82.0 23.959094
"Andrzej ya" 179 89.0 27.776911
"Piotr ya" 176 59.0 19.047004
"Piotr ya" 176 59.0 19.047004
"Tomasz Ireneusz ya" 185 96.0 28.049671
"Tomasz Ireneusz ya" 185 96.0 28.049671

Retemando el cálculo de resumenes, es posible utilizar un método de agregación, por ejempĺo:

datos.select(pl.col("age").mean())
shape: (1, 1)
age
f64
25.556898
datos.select(
    pl.col("age").mean().alias("edad_promedio"),
    pl.col("height").mean().alias("altura_promedio"),
)
shape: (1, 2)
edad_promedio altura_promedio
f64 f64
25.556898 175.33897

O bien:

datos.select(
    pl.col("age").mean().alias("edad_media"),
    pl.col("age").std().alias("edad_std"),
)
shape: (1, 2)
edad_media edad_std
f64 f64
25.556898 6.393561

Como iremos viendo a lo largo del curso, las expresiones de Polars son muy potentes y se pueden usar en varios contextos, no solo en select.

Resumenes agrupados

El siguiente paso es calcular resumenes para cada nivel de otra variable (o combinaciones de niveles de múltiples variables).

Por ejemplo:

  • ¿Cuál es la edad promedio de los atletas en cada juego olímpico?
  • ¿Cuántas medallas de oro ganó cada país en cada edición?
  • ¿Qué país estuvo más represenado por mujeres?

Lo primero que tenemos que conocer para realizar esta tarea es método group_by, que es un contexto de ejecución de expresiones.

Este método agrupa el DataFrame según una o más columnas para que luego podamos aplicar agregaciones sobre cada grupo.

datos.group_by("year")
<polars.dataframe.group_by.GroupBy at 0x7e8aac1be3c0>

La representación del objeto agrupado es bastante austera, pero una vez agrupados los datos podemos aplicar agregaciones con .agg().

datos.group_by("games").agg(pl.col("age").mean())
shape: (51, 2)
games age
str f64
"1906 Summer" 27.125253
"1984 Summer" 24.002811
"1928 Winter" 26.581301
"1976 Winter" 23.739459
"1924 Summer" 28.449373
"1956 Summer" 26.033673
"1952 Summer" 26.258851
"1980 Winter" 23.65616
"1932 Summer" 33.45154
"1960 Winter" 24.241877

Utilizando el método sort() se puede obtener una tabla ordenada según variables de interés, por ejemplo games.

datos.group_by("games").agg(pl.col("age").mean()).sort("games")
shape: (51, 2)
games age
str f64
"1896 Summer" 23.580645
"1900 Summer" 29.034031
"1904 Summer" 26.69815
"1906 Summer" 27.125253
"1908 Summer" 26.970228
"2008 Summer" 25.734118
"2010 Winter" 26.124262
"2012 Summer" 25.961378
"2014 Winter" 25.987324
"2016 Summer" 26.207919

Y con un poco de creatividad podemos obtener la edición donde se presentó el atleta mas longevo:

datos.group_by("games").agg(pl.col("age").max()).sort("age").tail(1)
shape: (1, 2)
games age
str i64
"1928 Summer" 97

Otro ejemplo:

(
    datos
    .group_by("games", "noc")
    .agg(
        (pl.col("sex") == "F").mean().alias("p_mujer"),
        pl.len().alias("n")
    )
    .sort("p_mujer", "n")
    .tail(10)
)
shape: (10, 4)
games noc p_mujer n
str str f64 u32
"2014 Winter" "TOG" 1.0 3
"1992 Winter" "HON" 1.0 3
"2006 Winter" "HKG" 1.0 3
"1988 Summer" "MYA" 1.0 3
"2006 Winter" "DEN" 1.0 4
"1932 Winter" "GBR" 1.0 4
"2002 Winter" "HKG" 1.0 5
"1960 Summer" "ZIM" 1.0 10
"1972 Winter" "PRK" 1.0 10
"1976 Summer" "PER" 1.0 14

Una forma de obtener la cantidad de observaciones por grupo es:

datos.group_by("games", "noc").agg(pl.len())
shape: (3_837, 3)
games noc len
str str u32
"1988 Summer" "ASA" 6
"1956 Summer" "PHI" 48
"1960 Summer" "HAI" 1
"2002 Winter" "EST" 51
"2016 Summer" "SEN" 22
"1988 Summer" "ANG" 33
"1984 Summer" "FRA" 354
"1996 Summer" "BOL" 9
"2010 Winter" "SLO" 107
"1900 Summer" "GRE" 4

Manipulación de columnas

Seleccionar y transformar

Podemos seleccionar una columna de un DataFrame pasando el nombre de la misma dentro de corchetes (de la misma forma que haríamos con un diccionario).

datos["noc"]
shape: (271_116,)
noc
str
"CHN"
"CHN"
"DEN"
"DEN"
"NED"
"POL"
"POL"
"POL"
"POL"
"POL"

El resultado es una Series, no un DataFrame compuesto por una sola columna.

type(datos["noc"])
polars.series.series.Series

Si queremos obtener un DataFrame, tenemos que usar el método select.

datos.select("noc")
shape: (271_116, 1)
noc
str
"CHN"
"CHN"
"DEN"
"DEN"
"NED"
"POL"
"POL"
"POL"
"POL"
"POL"
type(datos.select("noc"))
polars.dataframe.frame.DataFrame
datos.select(pl.col("noc").alias("pais"))
shape: (271_116, 1)
pais
str
"CHN"
"CHN"
"DEN"
"DEN"
"NED"
"POL"
"POL"
"POL"
"POL"
"POL"
datos.select(pl.col("noc").alias("pais"), pl.col("age").mean())
shape: (271_116, 2)
pais age
str f64
"CHN" 25.556898
"CHN" 25.556898
"DEN" 25.556898
"DEN" 25.556898
"NED" 25.556898
"POL" 25.556898
"POL" 25.556898
"POL" 25.556898
"POL" 25.556898
"POL" 25.556898
datos.select((pl.col("weight") - pl.col("weight").mean()) / pl.col("weight").std())
shape: (271_116, 1)
weight
f64
0.648006
-0.745914
null
null
0.787398
1.275271
-0.81561
-0.81561
1.763143
1.763143

Si se desea modificar y/o agregar columnas a un DataFrame, conviene utilizar el contexto with_columns, que devuelve una tabla con todas las columnas del DataFrame original junto con las nuevas columnas generadas a partir de las expresiones.

datos.with_columns(
    weight_z=(pl.col("weight") - pl.col("weight").mean()) / pl.col("weight").std()
)
shape: (271_116, 16)
id name sex age height weight team noc games year season city sport event medal weight_z
i64 str str i64 i64 f64 str str str i64 str str str str str f64
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null 0.648006
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null -0.745914
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold" null
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null 0.787398
135569 "Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" 1976 "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null 1.275271
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null -0.81561
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null -0.81561
135571 "Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" 1998 "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null 1.763143
135571 "Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" 2002 "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null 1.763143

Debido a esta diferencia entre select y with_columns, las expresiones usadas dentro dewith_columnsdeben producir series que tengan la misma longitud que las columnas originales delDataFrame. En cambio, en el contextoselect` alcanza con que las expresiones produzcan series de la misma longitud entre sí.

Descartar

Para descartar columnas tenemos el método .drop. Simplemente le pasamos el nombre de las columnas que queremos descartar.

datos.drop("id", "year", "season")
shape: (271_116, 12)
name sex age height weight team noc games city sport event medal
str str i64 i64 f64 str str str str str str str
"A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
"A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
"Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" "Antwerpen" "Football" "Football Men's Football" null
"Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
"Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
"Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null
"Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
"Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
"Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null
"Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null

Renombrar

datos.select("name", "sex", "age").rename({"name": "nombre", "sex": "sexo", "age": "edad"})
shape: (271_116, 3)
nombre sexo edad
str str i64
"A Dijiang" "M" 24
"A Lamusi" "M" 23
"Gunnar Nielsen Aaby" "M" 24
"Edgar Lindenau Aabye" "M" 34
"Christine Jacoba Aaftink" "F" 21
"Andrzej ya" "M" 29
"Piotr ya" "M" 27
"Piotr ya" "M" 27
"Tomasz Ireneusz ya" "M" 30
"Tomasz Ireneusz ya" "M" 34

Manipulación de filas

Ordenar

Para ordenar las observaciones en función de una o más variables usamos .sort(). Si pasamos una sola columna alcanza con una cadena; si queremos ordenar por varias, conviene pasar una lista.

El argumento descending permite invertir el orden. Como Polars no usa un índice persistente, ordenar las filas no requiere ningún paso extra para reenumerarlas.

datos.sort("year")
shape: (271_116, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1724 "Aristidis Akratopoulos" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Tennis" "Tennis Men's Singles" null
1724 "Aristidis Akratopoulos" "M" null null null "Greece-3" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Tennis" "Tennis Men's Doubles" null
1725 "Konstantinos "Kostas" Akratopo… "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Tennis" "Tennis Men's Singles" null
1725 "Konstantinos "Kostas" Akratopo… "M" null null null "Greece-3" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Tennis" "Tennis Men's Doubles" null
4113 "Anastasios Andreou" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Athletics" "Athletics Men's 110 metres Hur… null
135528 "Marc Zwiebler" "M" 32 181 75.0 "Germany" "GER" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Badminton" "Badminton Men's Singles" null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 100 metres" null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 200 metres" null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 4 x 100 metr… null
135568 "Olga Igorevna Zyuzkova" "F" 33 171 69.0 "Belarus" "BLR" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Basketball" "Basketball Women's Basketball" null
datos.sort("year", descending=True)
shape: (271_116, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
22 "Andreea Aanei" "F" 22 170 125.0 "Romania" "ROU" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Weightlifting" "Weightlifting Women's Super-He… null
51 "Nstor Abad Sanjun" "M" 23 167 64.0 "Spain" "ESP" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Gymnastics" "Gymnastics Men's Individual Al… null
51 "Nstor Abad Sanjun" "M" 23 167 64.0 "Spain" "ESP" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Gymnastics" "Gymnastics Men's Floor Exercis… null
51 "Nstor Abad Sanjun" "M" 23 167 64.0 "Spain" "ESP" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Gymnastics" "Gymnastics Men's Parallel Bars" null
51 "Nstor Abad Sanjun" "M" 23 167 64.0 "Spain" "ESP" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Gymnastics" "Gymnastics Men's Horizontal Ba… null
135285 "Khristos Zoumis" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Athletics" "Athletics Men's Triple Jump" null
135481 "Jules Alexis "Louis" Zutter" "M" 30 null null "Switzerland" "SUI" "1896 Summer" 1896 "Summer" "Athina" "Gymnastics" "Gymnastics Men's Horse Vault" "Silver"
135481 "Jules Alexis "Louis" Zutter" "M" 30 null null "Switzerland" "SUI" "1896 Summer" 1896 "Summer" "Athina" "Gymnastics" "Gymnastics Men's Parallel Bars" "Silver"
135481 "Jules Alexis "Louis" Zutter" "M" 30 null null "Switzerland" "SUI" "1896 Summer" 1896 "Summer" "Athina" "Gymnastics" "Gymnastics Men's Horizontal Ba… null
135481 "Jules Alexis "Louis" Zutter" "M" 30 null null "Switzerland" "SUI" "1896 Summer" 1896 "Summer" "Athina" "Gymnastics" "Gymnastics Men's Pommelled Hor… "Gold"
datos.sort(["season", "year"])
shape: (271_116, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
57440 "Karakatsanis" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Shooting" "Shooting Men's Free Rifle, Thr… null
122329 "Georgios Tsitas" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Wrestling" "Wrestling Men's Unlimited Clas… "Silver"
122303 "Leonidas Tsiklitiras" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Gymnastics" "Gymnastics Men's Horizontal Ba… null
38123 "Carl Galle" "M" 23 154 45.0 "Germany" "GER" "1896 Summer" 1896 "Summer" "Athina" "Athletics" "Athletics Men's 1,500 metres" null
122152 "A. Tryfiatis-Trypiapis" "M" null null null "Greece" "GRE" "1896 Summer" 1896 "Summer" "Athina" "Cycling" "Cycling Men's 12-Hours Race" null
23464 "Charle Cournoyer" "M" 22 173 68.0 "Canada" "CAN" "2014 Winter" 2014 "Winter" "Sochi" "Short Track Speed Skating" "Short Track Speed Skating Men'… null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
31342 "Tiril Kampenhaug Eckhoff" "F" 23 163 59.0 "Norway" "NOR" "2014 Winter" 2014 "Winter" "Sochi" "Biathlon" "Biathlon Mixed 2 x 6 kilometre… "Gold"
128915 "Yuta Watase" "M" 31 178 63.0 "Japan" "JPN" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Normal Hill,… null

Filtrar

Es posible seleccionar filas por posición usando slices. Esto resulta útil para inspeccionar rápidamente una parte de la tabla.

datos[0:10]
shape: (10, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 27 185 82.0 "Netherlands" "NED" "1994 Winter" 1994 "Winter" "Lillehammer" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 27 185 82.0 "Netherlands" "NED" "1994 Winter" 1994 "Winter" "Lillehammer" "Speed Skating" "Speed Skating Women's 1,000 me… null
datos[0:10:2]
shape: (5, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 27 185 82.0 "Netherlands" "NED" "1994 Winter" 1994 "Winter" "Lillehammer" "Speed Skating" "Speed Skating Women's 500 metr… null
datos[0:100:10, :5]
shape: (10, 5)
id name sex age height
i64 str str i64 i64
1 "A Dijiang" "M" 24 180
6 "Per Knut Aaland" "M" 31 188
7 "John Aalberg" "M" 31 183
11 "Jorma Ilmari Aalto" "M" 22 182
16 "Juhamatti Tapio Aaltonen" "M" 28 184
17 "Paavo Johannes Aaltonen" "M" 32 175
20 "Kjetil Andr Aamodt" "M" 20 176
20 "Kjetil Andr Aamodt" "M" 26 176
22 "Andreea Aanei" "F" 22 170
28 "Jan-Erik Aarberg" "M" 47 170

Cuando lo que queremos es conservar filas según una condición sobre sus valores, en cambio, conviene usar el método filter. Este método acepta condiciones en forma de expresiones booleanas, es decir, filter también es un contexto.

Por ejemplo, podemos quedarnos solo con los Juegos Olímpicos de invierno, excluirlos o combinar condiciones sobre season, year, sport o medal.

datos.filter(pl.col("season") == "Winter")
shape: (48_564, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 500 metr… null
5 "Christine Jacoba Aaftink" "F" 25 185 82.0 "Netherlands" "NED" "1992 Winter" 1992 "Winter" "Albertville" "Speed Skating" "Speed Skating Women's 1,000 me… null
5 "Christine Jacoba Aaftink" "F" 27 185 82.0 "Netherlands" "NED" "1994 Winter" 1994 "Winter" "Lillehammer" "Speed Skating" "Speed Skating Women's 500 metr… null
135569 "Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" 1976 "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135571 "Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" 1998 "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null
135571 "Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" 2002 "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null
datos.filter(pl.col("season") != "Winter")
shape: (222_552, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
8 "Cornelia "Cor" Aalten (-Strann… "F" 18 168 null "Netherlands" "NED" "1932 Summer" 1932 "Summer" "Los Angeles" "Athletics" "Athletics Women's 100 metres" null
135565 "Fernando scar Zylberberg" "M" 27 168 76.0 "Argentina" "ARG" "2004 Summer" 2004 "Summer" "Athina" "Hockey" "Hockey Men's Hockey" null
135566 "James Francis "Jim" Zylker" "M" 21 175 75.0 "United States" "USA" "1972 Summer" 1972 "Summer" "Munich" "Football" "Football Men's Football" null
135567 "Aleksandr Viktorovich Zyuzin" "M" 24 183 72.0 "Russia" "RUS" "2000 Summer" 2000 "Summer" "Sydney" "Rowing" "Rowing Men's Lightweight Coxle… null
135567 "Aleksandr Viktorovich Zyuzin" "M" 28 183 72.0 "Russia" "RUS" "2004 Summer" 2004 "Summer" "Athina" "Rowing" "Rowing Men's Lightweight Coxle… null
135568 "Olga Igorevna Zyuzkova" "F" 33 171 69.0 "Belarus" "BLR" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Basketball" "Basketball Women's Basketball" null
datos.filter(pl.col("season") == "Summer", pl.col("year") >= 2000)
shape: (67_474, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
12 "Jyri Tapani Aalto" "M" 31 172 70.0 "Finland" "FIN" "2000 Summer" 2000 "Summer" "Sydney" "Badminton" "Badminton Men's Singles" null
13 "Minna Maarit Aalto" "F" 34 159 55.5 "Finland" "FIN" "2000 Summer" 2000 "Summer" "Sydney" "Sailing" "Sailing Women's Windsurfer" null
18 "Timo Antero Aaltonen" "M" 31 189 130.0 "Finland" "FIN" "2000 Summer" 2000 "Summer" "Sydney" "Athletics" "Athletics Men's Shot Put" null
21 "Ragnhild Margrethe Aamodt" "F" 27 163 null "Norway" "NOR" "2008 Summer" 2008 "Summer" "Beijing" "Handball" "Handball Women's Handball" "Gold"
135565 "Fernando scar Zylberberg" "M" 23 168 76.0 "Argentina" "ARG" "2000 Summer" 2000 "Summer" "Sydney" "Hockey" "Hockey Men's Hockey" null
135565 "Fernando scar Zylberberg" "M" 27 168 76.0 "Argentina" "ARG" "2004 Summer" 2004 "Summer" "Athina" "Hockey" "Hockey Men's Hockey" null
135567 "Aleksandr Viktorovich Zyuzin" "M" 24 183 72.0 "Russia" "RUS" "2000 Summer" 2000 "Summer" "Sydney" "Rowing" "Rowing Men's Lightweight Coxle… null
135567 "Aleksandr Viktorovich Zyuzin" "M" 28 183 72.0 "Russia" "RUS" "2004 Summer" 2004 "Summer" "Athina" "Rowing" "Rowing Men's Lightweight Coxle… null
135568 "Olga Igorevna Zyuzkova" "F" 33 171 69.0 "Belarus" "BLR" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Basketball" "Basketball Women's Basketball" null
datos.filter(pl.col("medal").is_in(["Gold", "Silver"]))
shape: (26_488, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Team All-Arou… "Gold"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Horse Vault" "Gold"
17 "Paavo Johannes Aaltonen" "M" 28 175 64.0 "Finland" "FIN" "1948 Summer" 1948 "Summer" "London" "Gymnastics" "Gymnastics Men's Pommelled Hor… "Gold"
20 "Kjetil Andr Aamodt" "M" 20 176 85.0 "Norway" "NOR" "1992 Winter" 1992 "Winter" "Albertville" "Alpine Skiing" "Alpine Skiing Men's Super G" "Gold"
135544 "Krzysztof Zwoliski" "M" 21 175 70.0 "Poland" "POL" "1980 Summer" 1980 "Summer" "Moskva" "Athletics" "Athletics Men's 4 x 100 metres… "Silver"
135545 "Henk Jan Zwolle" "M" 31 197 93.0 "Netherlands" "NED" "1996 Summer" 1996 "Summer" "Atlanta" "Rowing" "Rowing Men's Coxed Eights" "Gold"
135553 "Galina Ivanovna Zybina (-Fyodo… "F" 21 168 80.0 "Soviet Union" "URS" "1952 Summer" 1952 "Summer" "Helsinki" "Athletics" "Athletics Women's Shot Put" "Gold"
135553 "Galina Ivanovna Zybina (-Fyodo… "F" 25 168 80.0 "Soviet Union" "URS" "1956 Summer" 1956 "Summer" "Melbourne" "Athletics" "Athletics Women's Shot Put" "Silver"
135563 "Olesya Nikolayevna Zykina" "F" 23 171 64.0 "Russia" "RUS" "2004 Summer" 2004 "Summer" "Athina" "Athletics" "Athletics Women's 4 x 400 metr… "Silver"
datos.filter(
    pl.col("sport").is_in(["Athletics", "Swimming"]),
    pl.col("year").is_between(2000, 2016)
)
shape: (19_887, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
18 "Timo Antero Aaltonen" "M" 31 189 130.0 "Finland" "FIN" "2000 Summer" 2000 "Summer" "Sydney" "Athletics" "Athletics Men's Shot Put" null
34 "Jamale (Djamel-) Aarrass (Ahra… "M" 30 187 76.0 "France" "FRA" "2012 Summer" 2012 "Summer" "London" "Athletics" "Athletics Men's 1,500 metres" null
36 "Stefan Remco Aartsen" "M" 25 194 78.0 "Netherlands" "NED" "2000 Summer" 2000 "Summer" "Sydney" "Swimming" "Swimming Men's 100 metres Butt… null
36 "Stefan Remco Aartsen" "M" 25 194 78.0 "Netherlands" "NED" "2000 Summer" 2000 "Summer" "Sydney" "Swimming" "Swimming Men's 200 metres Butt… null
49 "Moonika Aava" "F" 24 168 65.0 "Estonia" "EST" "2004 Summer" 2004 "Summer" "Athina" "Athletics" "Athletics Women's Javelin Thro… null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 100 metres" null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 200 metres" null
135547 "Viktoriya Viktorovna Zyabkina" "F" 23 174 62.0 "Kazakhstan" "KAZ" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Athletics" "Athletics Women's 4 x 100 metr… null
135563 "Olesya Nikolayevna Zykina" "F" 19 171 64.0 "Russia" "RUS" "2000 Summer" 2000 "Summer" "Sydney" "Athletics" "Athletics Women's 4 x 400 metr… "Bronze"
135563 "Olesya Nikolayevna Zykina" "F" 23 171 64.0 "Russia" "RUS" "2004 Summer" 2004 "Summer" "Athina" "Athletics" "Athletics Women's 4 x 400 metr… "Silver"
datos.filter((pl.col("year") < 1950) | (pl.col("year") > 2010))
shape: (77_751, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
8 "Cornelia "Cor" Aalten (-Strann… "F" 18 168 null "Netherlands" "NED" "1932 Summer" 1932 "Summer" "Los Angeles" "Athletics" "Athletics Women's 100 metres" null
8 "Cornelia "Cor" Aalten (-Strann… "F" 18 168 null "Netherlands" "NED" "1932 Summer" 1932 "Summer" "Los Angeles" "Athletics" "Athletics Women's 4 x 100 metr… null
135558 "ukasz Tomasz ygado" "M" 32 200 89.0 "Poland" "POL" "2012 Summer" 2012 "Summer" "London" "Volleyball" "Volleyball Men's Volleyball" null
135561 "Frantiek Zyka" "M" 26 null null "Czechoslovakia" "TCH" "1928 Summer" 1928 "Summer" "Amsterdam" "Athletics" "Athletics Men's Marathon" null
135568 "Olga Igorevna Zyuzkova" "F" 33 171 69.0 "Belarus" "BLR" "2016 Summer" 2016 "Summer" "Rio de Janeiro" "Basketball" "Basketball Women's Basketball" null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null

Cuando usamos operadores como | conviene encerrar cada comparación entre paréntesis para que la expresión se evalúe como esperamos.

Descartar duplicados

En Polars podemos usar .unique() para quedarnos con combinaciones únicas de una o más columnas. Esto sirve tanto para explorar valores posibles como para descartar filas duplicadas según un subconjunto de variables.

Si usamos el parámetro subset, Polars decide si dos filas son duplicadas mirando solo esas columnas.

datos.select("season", "sport").unique()
shape: (69, 2)
season sport
str str
"Summer" "Croquet"
"Summer" "Gymnastics"
"Summer" "Lacrosse"
"Summer" "Diving"
"Winter" "Ice Hockey"
"Summer" "Roque"
"Summer" "Wrestling"
"Summer" "Badminton"
"Winter" "Alpinism"
"Summer" "Figure Skating"
datos.unique(subset=["season", "sport"])
shape: (69, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
23 "Fritz Aanes" "M" 22 187 89.0 "Norway" "NOR" "2000 Summer" 2000 "Summer" "Sydney" "Wrestling" "Wrestling Men's Light-Heavywei… null
43 "Morten Gjerdrum Aasen" "M" 34 185 75.0 "Norway" "NOR" "1992 Summer" 1992 "Summer" "Barcelona" "Equestrianism" "Equestrianism Mixed Jumping, I… null
154 "Khadija Abbouda" "F" 40 165 57.0 "Morocco" "MAR" "2008 Summer" 2008 "Summer" "Beijing" "Archery" "Archery Women's Individual" null
82 "Enrique Abaroa Martnez" "M" 26 183 75.0 "Mexico" "MEX" "2000 Summer" 2000 "Summer" "Sydney" "Tennis" "Tennis Men's Doubles" null
8 "Cornelia "Cor" Aalten (-Strann… "F" 18 168 null "Netherlands" "NED" "1932 Summer" 1932 "Summer" "Los Angeles" "Athletics" "Athletics Women's 100 metres" null
5716 "John Jacob Astor" "M" 21 null null "Great Britain" "GBR" "1908 Summer" 1908 "Summer" "London" "Racquets" "Racquets Men's Singles" "Bronze"
387 "Sonia Abejn Esteban" "F" 19 169 48.0 "Spain" "ESP" "2004 Summer" 2004 "Summer" "Athina" "Rhythmic Gymnastics" "Rhythmic Gymnastics Women's Gr… null
20 "Kjetil Andr Aamodt" "M" 20 176 85.0 "Norway" "NOR" "1992 Winter" 1992 "Winter" "Albertville" "Alpine Skiing" "Alpine Skiing Men's Downhill" null
21 "Ragnhild Margrethe Aamodt" "F" 27 163 null "Norway" "NOR" "2008 Summer" 2008 "Summer" "Beijing" "Handball" "Handball Women's Handball" "Gold"
5822 "George H. Atkinson" "M" null null null "Wolseley-Siddeley-1" "GBR" "1908 Summer" 1908 "Summer" "London" "Motorboating" "Motorboating Mixed A-Class (Op… null

Combinación de tablas

Para combinar tablas en Polars hay dos operaciones muy comunes: la concatenación y la fusión. La concatenación pega tablas por posición, mientras que la fusión relaciona filas a partir de una o más claves.

Concatenación

La concatenación combina tablas que ya tienen una estructura compatible. Con pl.concat() podemos apilarlas verticalmente o pegarlas horizontalmente.

En una concatenación vertical, las filas de una tabla se agregan debajo de las de otra. En una concatenación horizontal, en cambio, las columnas se agregan lado a lado y las filas se alinean por posición.

ventas_2023 = pl.DataFrame({
    "id_venta": [1, 2, 3],
    "producto": ["Cuaderno", "Lapicera", "Regla"],
    "unidades": [5, 8, 3],
})

ventas_2024 = pl.DataFrame({
    "id_venta": [4, 5],
    "producto": ["Cuaderno", "Mochila"],
    "unidades": [6, 2],
})

detalle_ventas = pl.DataFrame({
    "id_venta": [1, 2, 3],
    "producto": ["Cuaderno", "Lapicera", "Regla"],
})

precios = pl.DataFrame({
    "precio_unitario": [1200, 800, 1500],
    "descuento": [0.10, 0.00, 0.05],
})
pl.concat([ventas_2023, ventas_2024], how="vertical")
shape: (5, 3)
id_venta producto unidades
i64 str i64
1 "Cuaderno" 5
2 "Lapicera" 8
3 "Regla" 3
4 "Cuaderno" 6
5 "Mochila" 2
pl.concat([detalle_ventas, precios], how="horizontal")
shape: (3, 4)
id_venta producto precio_unitario descuento
i64 str i64 f64
1 "Cuaderno" 1200 0.1
2 "Lapicera" 800 0.0
3 "Regla" 1500 0.05

Fusión

Cuando queremos combinar tablas a partir de una clave compartida, usamos .join(). Los tipos de fusión más comunes son inner, left, right y full.

  • inner: conserva solo las filas con coincidencia en ambas tablas.
  • left: conserva todas las filas de la tabla izquierda y agrega coincidencias de la derecha.
  • right: conserva todas las filas de la tabla derecha y agrega coincidencias de la izquierda.
  • full: conserva todas las filas de ambas tablas, aunque no tengan coincidencia.

La documentación oficial de Polars muestra más variantes y detalles sobre joins: https://docs.pola.rs/user-guide/transformations/joins/

clientes = pl.DataFrame({
    "id_cliente": [101, 102, 103, 105],
    "nombre": ["Ana", "Bruno", "Carla", "Diego"],
})

segmentos = pl.DataFrame({
    "id_cliente": [102, 103, 104],
    "segmento": ["Mayorista", "Minorista", "Mayorista"],
})
clientes.join(segmentos, on="id_cliente", how="inner")
shape: (2, 3)
id_cliente nombre segmento
i64 str str
102 "Bruno" "Mayorista"
103 "Carla" "Minorista"
clientes.join(segmentos, on="id_cliente")
shape: (2, 3)
id_cliente nombre segmento
i64 str str
102 "Bruno" "Mayorista"
103 "Carla" "Minorista"

El resultado es el mismo porque how es por defecto igual a "inner".

clientes.join(segmentos, on="id_cliente", how="left")
shape: (4, 3)
id_cliente nombre segmento
i64 str str
101 "Ana" null
102 "Bruno" "Mayorista"
103 "Carla" "Minorista"
105 "Diego" null
clientes.join(segmentos, on="id_cliente", how="right")
shape: (3, 3)
nombre id_cliente segmento
str i64 str
"Bruno" 102 "Mayorista"
"Carla" 103 "Minorista"
null 104 "Mayorista"
clientes.join(segmentos, on="id_cliente", how="full")
shape: (5, 4)
id_cliente nombre id_cliente_right segmento
i64 str i64 str
101 "Ana" null null
102 "Bruno" 102 "Mayorista"
103 "Carla" 103 "Minorista"
105 "Diego" null null
null null 104 "Mayorista"

También es posible unir usando más de una columna. En ese caso, le pasamos a on una lista con los nombres de las claves que deben coincidir simultáneamente.

ventas_sucursal = pl.DataFrame({
    "sucursal": ["Centro", "Centro", "Norte", "Norte"],
    "producto": ["Cuaderno", "Lapicera", "Cuaderno", "Mochila"],
    "unidades": [20, 35, 18, 7],
})

precios_sucursal = pl.DataFrame({
    "sucursal": ["Centro", "Centro", "Norte", "Sur"],
    "producto": ["Cuaderno", "Lapicera", "Cuaderno", "Mochila"],
    "precio_unitario": [1200, 800, 1250, 32000],
})

ventas_sucursal.join(precios_sucursal, on=["sucursal", "producto"], how="left")
shape: (4, 4)
sucursal producto unidades precio_unitario
str str i64 i64
"Centro" "Cuaderno" 20 1200
"Centro" "Lapicera" 35 800
"Norte" "Cuaderno" 18 1250
"Norte" "Mochila" 7 null

Escritura de de datos

Así como Polars ofrece varias funciones para leer datos, también incluye métodos para exportarlos. Uno de los más usados es .write_csv, aunque también existen .write_parquet y .write_excel (estos últimos pueden requerir dependencias adicionales).

El primer argumento de .write_csv es la ruta donde queremos guardar el archivo.

datos.head(10).write_csv("primeras_10_filas.csv")
datos
shape: (271_116, 15)
id name sex age height weight team noc games year season city sport event medal
i64 str str i64 i64 f64 str str str i64 str str str str str
1 "A Dijiang" "M" 24 180 80.0 "China" "CHN" "1992 Summer" 1992 "Summer" "Barcelona" "Basketball" "Basketball Men's Basketball" null
2 "A Lamusi" "M" 23 170 60.0 "China" "CHN" "2012 Summer" 2012 "Summer" "London" "Judo" "Judo Men's Extra-Lightweight" null
3 "Gunnar Nielsen Aaby" "M" 24 null null "Denmark" "DEN" "1920 Summer" 1920 "Summer" "Antwerpen" "Football" "Football Men's Football" null
4 "Edgar Lindenau Aabye" "M" 34 null null "Denmark/Sweden" "DEN" "1900 Summer" 1900 "Summer" "Paris" "Tug-Of-War" "Tug-Of-War Men's Tug-Of-War" "Gold"
5 "Christine Jacoba Aaftink" "F" 21 185 82.0 "Netherlands" "NED" "1988 Winter" 1988 "Winter" "Calgary" "Speed Skating" "Speed Skating Women's 500 metr… null
135569 "Andrzej ya" "M" 29 179 89.0 "Poland-1" "POL" "1976 Winter" 1976 "Winter" "Innsbruck" "Luge" "Luge Mixed (Men)'s Doubles" null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135570 "Piotr ya" "M" 27 176 59.0 "Poland" "POL" "2014 Winter" 2014 "Winter" "Sochi" "Ski Jumping" "Ski Jumping Men's Large Hill, … null
135571 "Tomasz Ireneusz ya" "M" 30 185 96.0 "Poland" "POL" "1998 Winter" 1998 "Winter" "Nagano" "Bobsleigh" "Bobsleigh Men's Four" null
135571 "Tomasz Ireneusz ya" "M" 34 185 96.0 "Poland" "POL" "2002 Winter" 2002 "Winter" "Salt Lake City" "Bobsleigh" "Bobsleigh Men's Four" null
df_equipos = (
    datos
    .filter(pl.col("season") == "Summer")
    .group_by("year")
    .agg(pl.col("noc").n_unique())
    .sort("year")
)
df_equipos
shape: (29, 2)
year noc
i64 u32
1896 12
1900 31
1904 15
1906 21
1908 22
2000 200
2004 201
2008 204
2012 205
2016 207
df_equipos.write_csv("equipos_por_año.csv")

Enlaces útiles