import polars as pl
import pyprojroot
ROOT = pyprojroot.here()1 - Introducción a Polars
Polars es una librería de Python orientada a la manipulación y análisis de datos tabulares. Su API está pensada para trabajar de manera clara y eficiente con tablas, selecciones, filtros, agregaciones y transformaciones de columnas.
Las dos estructuras más importantes son:
DataFrame: una tabla bidimensional con filas y columnas.Series: una estructura unidimensional que representa una columna de la tabla.
A lo largo de este apunte vamos a trabajar con datos de los juegos olímpicos que se encuentran en olympics.csv. Este conjunto de datos, que contiene datos básicos de los atletas que participaron en los juegos olímpicos entre Atenas 1896 y Río 2016, fue publicado en la semana 32 del año 2024 proyecto TidyTuesday y contiene las siguientes variables:
| Columna | Descripción |
|---|---|
id |
Identificador del atleta. |
name |
Nombre del atleta. |
sex |
Sexo del atleta. |
age |
Edad del atleta. |
height |
Altura del atleta en centímetros. |
weight |
Peso del atleta en kilogramos. |
team |
País o equipo por el que compite. |
noc |
Código NOC asociado a la región o delegación. |
games |
Nombre de la edición de los Juegos Olímpicos. |
year |
Año de los Juegos Olímpicos. |
season |
Temporada de los Juegos: invierno o verano. |
city |
Ciudad anfitriona de los Juegos Olímpicos. |
sport |
Deporte. |
event |
Prueba o evento específico. |
medal |
Medalla obtenida: oro, plata, bronce o valor faltante si no obtuvo medalla. |
Lectura de datos tabulares
La lectura de datos es una de las tareas fundamentales en el análisis de datos. En Polars, muchas fuciones para importar datos comienzan con read_ y devuelven un DataFrame. Algunos ejemplos son:
pl.read_csv()pl.read_parquet()pl.read_excel()pl.read_json()pl.read_database()
Los archivos CSV (valores separados por comas) son uno de los formatos más comunes para almacenar datos tabulares. En este caso, los datos están en olympics.csv. Vamos a usar pl.read_csv() para leerlos.
datos = pl.read_csv(ROOT / "datos" / "olympics.csv")Inspección de datos tabulares
La representación de un DataFrame en una Jupyter Notebook permite echar un primer vistazo a los datos:
datos| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135569 | "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | 1976 | "Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | 1998 | "Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | 2002 | "Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
No solo es posible ver las primeras y últimas filas de la tabla, sino que también es posible ver la cantidad de filas y columnas (shape: (271_116, 15)) y el tipo de datos de cada una de las variables. Con el objetivo de hacer un uso eficiente del almacenamiento y poder de procesamiento disponible, Polars ofrece una gran variedad de tipos de datos. A lo largo del curso los iremos conociendo.
type(datos)polars.dataframe.frame.DataFrame
Si quisieramos ver solamente las primeras filas o columnas, podemos utilizar los métodos .head() y .tail() de DataFrame. Estos cuentan con un parámetro opcional n que permite controlar la cantidad de filas obtenidas.
datos.head()| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
datos.tail()| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 135569 | "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | 1976 | "Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | 1998 | "Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | 2002 | "Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
datos.head(9)| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 27 | 185 | 82.0 | "Netherlands" | "NED" | "1994 Winter" | 1994 | "Winter" | "Lillehammer" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
Para inspeccionar la estructura del DataFrame de manera programática se puede consultar su esquema (schema), los tipos de datos (dtypes) o la la cantidad de filas y columnas (shape).
datos.schemaSchema([('id', Int64),
('name', String),
('sex', String),
('age', Int64),
('height', Int64),
('weight', Float64),
('team', String),
('noc', String),
('games', String),
('year', Int64),
('season', String),
('city', String),
('sport', String),
('event', String),
('medal', String)])
datos.dtypes[Int64,
String,
String,
Int64,
Int64,
Float64,
String,
String,
String,
Int64,
String,
String,
String,
String,
String]
datos.shape(271116, 15)
También es posible obtener una tabla con el conteo de valores nulos por columna usando el método null_count.
datos.null_count()| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 9474 | 60171 | 62875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 231333 |
En el caso de necesitar un resumen rápido y más completo, podemos hacer uso del método describe.
datos.describe()| statistic | id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | str | str | f64 | f64 | f64 | str | str | str | f64 | str | str | str | str | str |
| "count" | 271116.0 | "271116" | "271116" | 261642.0 | 210945.0 | 208241.0 | "271116" | "271116" | "271116" | 271116.0 | "271116" | "271116" | "271116" | "271116" | "39783" |
| "null_count" | 0.0 | "0" | "0" | 9474.0 | 60171.0 | 62875.0 | "0" | "0" | "0" | 0.0 | "0" | "0" | "0" | "0" | "231333" |
| "mean" | 68248.954396 | null | null | 25.556898 | 175.33897 | 70.702393 | null | null | null | 1978.37848 | null | null | null | null | null |
| "std" | 39022.286345 | null | null | 6.393561 | 10.518462 | 14.34802 | null | null | null | 29.877632 | null | null | null | null | null |
| "min" | 1.0 | "A Dijiang" | "F" | 10.0 | 127.0 | 25.0 | "30. Februar" | "AFG" | "1896 Summer" | 1896.0 | "Summer" | "Albertville" | "Aeronautics" | "Aeronautics Mixed Aeronautics" | "Bronze" |
| "25%" | 34643.0 | null | null | 21.0 | 168.0 | 60.0 | null | null | null | 1960.0 | null | null | null | null | null |
| "50%" | 68205.0 | null | null | 24.0 | 175.0 | 70.0 | null | null | null | 1988.0 | null | null | null | null | null |
| "75%" | 102097.0 | null | null | 28.0 | 183.0 | 79.0 | null | null | null | 2002.0 | null | null | null | null | null |
| "max" | 135571.0 | "zzet nce" | "M" | 97.0 | 226.0 | 214.0 | "rn-2" | "ZIM" | "2016 Summer" | 2016.0 | "Winter" | "Vancouver" | "Wrestling" | "Wrestling Women's Middleweight… | "Silver" |
Si se quieren descartar rápidamente filas con valores nulos en algunas de sus columnas, se usa el método drop_nulls.
datos.drop_nulls()| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 16 | "Juhamatti Tapio Aaltonen" | "M" | 28 | 184 | 85.0 | "Finland" | "FIN" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ice Hockey" | "Ice Hockey Men's Ice Hockey" | "Bronze" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Individual Al… | "Bronze" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Team All-Arou… | "Gold" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Horse Vault" | "Gold" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Pommelled Hor… | "Gold" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135553 | "Galina Ivanovna Zybina (-Fyodo… | "F" | 25 | 168 | 80.0 | "Soviet Union" | "URS" | "1956 Summer" | 1956 | "Summer" | "Melbourne" | "Athletics" | "Athletics Women's Shot Put" | "Silver" |
| 135553 | "Galina Ivanovna Zybina (-Fyodo… | "F" | 33 | 168 | 80.0 | "Soviet Union" | "URS" | "1964 Summer" | 1964 | "Summer" | "Tokyo" | "Athletics" | "Athletics Women's Shot Put" | "Bronze" |
| 135554 | "Bogusaw Zych" | "M" | 28 | 182 | 82.0 | "Poland" | "POL" | "1980 Summer" | 1980 | "Summer" | "Moskva" | "Fencing" | "Fencing Men's Foil, Team" | "Bronze" |
| 135563 | "Olesya Nikolayevna Zykina" | "F" | 19 | 171 | 64.0 | "Russia" | "RUS" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Athletics" | "Athletics Women's 4 x 400 metr… | "Bronze" |
| 135563 | "Olesya Nikolayevna Zykina" | "F" | 23 | 171 | 64.0 | "Russia" | "RUS" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Athletics" | "Athletics Women's 4 x 400 metr… | "Silver" |
Vale la pena notar que los DataFrame también cuentan con el método drop_nans, pero este descarta filas con valores numéricos que no son válidos.
Cálculo de resumenes
A la hora de trabajar con datos, suele ser de interés procesarlos de alguna manera para obtener información respecto de algunas variables relevantes. Por ejemplo, se pueden calcular valores medios, mínimos, máximos, medianas, diferencias, razones, etc. Las cuentas que resumen muchos datos en un único valor se llaman resumenes o agregaciones (porque se están resumiendo agregando varios valores de alguna forma en un único valor).
Resumenes simples
Como ya vimos, una forma muy rápida de calcular estadísticas básicas con un DataFrame es utilizar el método .describe, que genera estadísticas básicas para las columnas numéricas del conjunto de datos. Veamos que también nos permite especificar el listado de percentiles que queremos utilizar.
datos.describe(percentiles=[0.1, 0.4, 0.6, 0.9])| statistic | id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | str | str | f64 | f64 | f64 | str | str | str | f64 | str | str | str | str | str |
| "count" | 271116.0 | "271116" | "271116" | 261642.0 | 210945.0 | 208241.0 | "271116" | "271116" | "271116" | 271116.0 | "271116" | "271116" | "271116" | "271116" | "39783" |
| "null_count" | 0.0 | "0" | "0" | 9474.0 | 60171.0 | 62875.0 | "0" | "0" | "0" | 0.0 | "0" | "0" | "0" | "0" | "231333" |
| "mean" | 68248.954396 | null | null | 25.556898 | 175.33897 | 70.702393 | null | null | null | 1978.37848 | null | null | null | null | null |
| "std" | 39022.286345 | null | null | 6.393561 | 10.518462 | 14.34802 | null | null | null | 29.877632 | null | null | null | null | null |
| "min" | 1.0 | "A Dijiang" | "F" | 10.0 | 127.0 | 25.0 | "30. Februar" | "AFG" | "1896 Summer" | 1896.0 | "Summer" | "Albertville" | "Aeronautics" | "Aeronautics Mixed Aeronautics" | "Bronze" |
| "10%" | 14102.0 | null | null | 19.0 | 162.0 | 54.0 | null | null | null | 1928.0 | null | null | null | null | null |
| "40%" | 54826.0 | null | null | 23.0 | 173.0 | 66.0 | null | null | null | 1976.0 | null | null | null | null | null |
| "60%" | 81686.0 | null | null | 26.0 | 178.0 | 73.0 | null | null | null | 1992.0 | null | null | null | null | null |
| "90%" | 122194.0 | null | null | 33.0 | 188.0 | 89.0 | null | null | null | 2012.0 | null | null | null | null | null |
| "max" | 135571.0 | "zzet nce" | "M" | 97.0 | 226.0 | 214.0 | "rn-2" | "ZIM" | "2016 Summer" | 2016.0 | "Winter" | "Vancouver" | "Wrestling" | "Wrestling Women's Middleweight… | "Silver" |
A nivel columna, si queremos calcular la cantidad de ocurrencias de una columna categórica o discreta, podemos seleccionar la columna y luego usar value_counts().
datos["sex"].value_counts()| sex | count |
|---|---|
| str | u32 |
| "F" | 74522 |
| "M" | 196594 |
Notemos que el método value_counts pertenece a la Series y no al DataFrame.
type(datos["sex"])polars.series.series.Series
Este método también recibe los parámetros sort que ordena las filas de mayor a menor cantidad de ocurrencias, name que permite renombrar la columna de conteo y normalize que divide los conteos por su suma de modo tal que representan una proporción.
datos["noc"].value_counts(sort=True)| noc | count |
|---|---|
| str | u32 |
| "USA" | 18853 |
| "FRA" | 12758 |
| "GBR" | 12256 |
| "ITA" | 10715 |
| "GER" | 9830 |
| … | … |
| "YMD" | 5 |
| "SSD" | 3 |
| "NBO" | 2 |
| "UNK" | 2 |
| "NFL" | 1 |
datos["noc"].value_counts(sort=True, name="n")| noc | n |
|---|---|
| str | u32 |
| "USA" | 18853 |
| "FRA" | 12758 |
| "GBR" | 12256 |
| "ITA" | 10715 |
| "GER" | 9830 |
| … | … |
| "YMD" | 5 |
| "SSD" | 3 |
| "NBO" | 2 |
| "UNK" | 2 |
| "NFL" | 1 |
datos["noc"].value_counts(sort=True, name="p", normalize=True)| noc | p |
|---|---|
| str | f64 |
| "USA" | 0.069539 |
| "FRA" | 0.047057 |
| "GBR" | 0.045206 |
| "ITA" | 0.039522 |
| "GER" | 0.036258 |
| … | … |
| "YMD" | 0.000018 |
| "SSD" | 0.000011 |
| "NBO" | 0.000007 |
| "UNK" | 0.000007 |
| "NFL" | 0.000004 |
Otros métodos interesantes son unique y .n_unique, que devuelven los valores únicos de una columna y su cantidad.
datos["noc"].unique()| noc |
|---|
| str |
| "TLS" |
| "GUM" |
| "PAK" |
| "NOR" |
| "GBS" |
| … |
| "KAZ" |
| "MDV" |
| "SAM" |
| "STP" |
| "VIN" |
datos["noc"].n_unique()230
datos["games"].unique()| games |
|---|
| str |
| "1988 Winter" |
| "1980 Winter" |
| "1980 Summer" |
| "2014 Winter" |
| "1976 Winter" |
| … |
| "1984 Winter" |
| "1924 Summer" |
| "1896 Summer" |
| "1998 Winter" |
| "1932 Summer" |
En cuanto a las variables numéricas, las Series de Polars ofrecen una gran variedad de resumenes estadísticos:
datos["age"].mean()25.556898357297374
print("Media:", datos["age"].mean())
print("Mediana:", datos["age"].median())
print("Desvío estándar:", datos["age"].std())
print("Mínimo:", datos["age"].min())
print("Máximo:", datos["age"].max())
print("Percentil 10:", datos["age"].quantile(0.1))Media: 25.556898357297374
Mediana: 24.0
Desvío estándar: 6.393560847035814
Mínimo: 10
Máximo: 97
Percentil 10: 19.0
Expresiones
En Polars, es posible construir expresiones que representan transformaciones sobre los datos.
Para hacerlo, primero se indican las columnas con pl.col() y luego se las combina usando operadores aritméticos habituales o métodos provistos por Polars.
Por ejemplo, la siguiente expresión combina las columnas "height" y "weight" de modo tal que se calcule el índice de masa corporal.
pl.col("weight") / (pl.col("height") / 100) ** 2Como se puede ver, el bloque anterior todavía no realiza ningún cálculo sobre el conjunto de datos. Lo que hace es definir, de forma abstracta, una transformación que queremos aplicar.
Para evaluar esa expresión, en este caso para calcular el índice de masa corporal, necesitamos usarla dentro de un contexto de ejecución. Polars ofrece varios, pero por ahora nos interesa select. Este contexto permite aplicar expresiones sobre las columnas y producir nuevas columnas, ya sea como resúmenes, combinaciones de columnas existentes o incluso valores literales.
Veamos que, en principio, select permite seleccionar columnas:
datos.select("name", "age", "sex")| name | age | sex |
|---|---|---|
| str | i64 | str |
| "A Dijiang" | 24 | "M" |
| "A Lamusi" | 23 | "M" |
| "Gunnar Nielsen Aaby" | 24 | "M" |
| "Edgar Lindenau Aabye" | 34 | "M" |
| "Christine Jacoba Aaftink" | 21 | "F" |
| … | … | … |
| "Andrzej ya" | 29 | "M" |
| "Piotr ya" | 27 | "M" |
| "Piotr ya" | 27 | "M" |
| "Tomasz Ireneusz ya" | 30 | "M" |
| "Tomasz Ireneusz ya" | 34 | "M" |
Sin embargo, como acabamos de mencionar, también permite ejecutar expresiones.
datos.select("name", pl.col("age") * 12)| name | age |
|---|---|
| str | i64 |
| "A Dijiang" | 288 |
| "A Lamusi" | 276 |
| "Gunnar Nielsen Aaby" | 288 |
| "Edgar Lindenau Aabye" | 408 |
| "Christine Jacoba Aaftink" | 252 |
| … | … |
| "Andrzej ya" | 348 |
| "Piotr ya" | 324 |
| "Piotr ya" | 324 |
| "Tomasz Ireneusz ya" | 360 |
| "Tomasz Ireneusz ya" | 408 |
Si queremos guardar el resultado de la expresión en una nueva columna, tenemos que pasarla como parámetro:
datos.select(
"name",
"height",
"weight",
bmi=pl.col("weight") / (pl.col("height") / 100) ** 2
)| name | height | weight | bmi |
|---|---|---|---|
| str | i64 | f64 | f64 |
| "A Dijiang" | 180 | 80.0 | 24.691358 |
| "A Lamusi" | 170 | 60.0 | 20.761246 |
| "Gunnar Nielsen Aaby" | null | null | null |
| "Edgar Lindenau Aabye" | null | null | null |
| "Christine Jacoba Aaftink" | 185 | 82.0 | 23.959094 |
| … | … | … | … |
| "Andrzej ya" | 179 | 89.0 | 27.776911 |
| "Piotr ya" | 176 | 59.0 | 19.047004 |
| "Piotr ya" | 176 | 59.0 | 19.047004 |
| "Tomasz Ireneusz ya" | 185 | 96.0 | 28.049671 |
| "Tomasz Ireneusz ya" | 185 | 96.0 | 28.049671 |
O utilizar el método alias.
datos.select(
"name",
"height",
"weight",
(pl.col("weight") / (pl.col("height") / 100) ** 2).alias("bmi")
)| name | height | weight | bmi |
|---|---|---|---|
| str | i64 | f64 | f64 |
| "A Dijiang" | 180 | 80.0 | 24.691358 |
| "A Lamusi" | 170 | 60.0 | 20.761246 |
| "Gunnar Nielsen Aaby" | null | null | null |
| "Edgar Lindenau Aabye" | null | null | null |
| "Christine Jacoba Aaftink" | 185 | 82.0 | 23.959094 |
| … | … | … | … |
| "Andrzej ya" | 179 | 89.0 | 27.776911 |
| "Piotr ya" | 176 | 59.0 | 19.047004 |
| "Piotr ya" | 176 | 59.0 | 19.047004 |
| "Tomasz Ireneusz ya" | 185 | 96.0 | 28.049671 |
| "Tomasz Ireneusz ya" | 185 | 96.0 | 28.049671 |
Retemando el cálculo de resumenes, es posible utilizar un método de agregación, por ejempĺo:
datos.select(pl.col("age").mean())| age |
|---|
| f64 |
| 25.556898 |
datos.select(
pl.col("age").mean().alias("edad_promedio"),
pl.col("height").mean().alias("altura_promedio"),
)| edad_promedio | altura_promedio |
|---|---|
| f64 | f64 |
| 25.556898 | 175.33897 |
O bien:
datos.select(
pl.col("age").mean().alias("edad_media"),
pl.col("age").std().alias("edad_std"),
)| edad_media | edad_std |
|---|---|
| f64 | f64 |
| 25.556898 | 6.393561 |
Como iremos viendo a lo largo del curso, las expresiones de Polars son muy potentes y se pueden usar en varios contextos, no solo en select.
Resumenes agrupados
El siguiente paso es calcular resumenes para cada nivel de otra variable (o combinaciones de niveles de múltiples variables).
Por ejemplo:
- ¿Cuál es la edad promedio de los atletas en cada juego olímpico?
- ¿Cuántas medallas de oro ganó cada país en cada edición?
- ¿Qué país estuvo más represenado por mujeres?
Lo primero que tenemos que conocer para realizar esta tarea es método group_by, que es un contexto de ejecución de expresiones.
Este método agrupa el DataFrame según una o más columnas para que luego podamos aplicar agregaciones sobre cada grupo.
datos.group_by("year")<polars.dataframe.group_by.GroupBy at 0x7e8aac1be3c0>
La representación del objeto agrupado es bastante austera, pero una vez agrupados los datos podemos aplicar agregaciones con .agg().
datos.group_by("games").agg(pl.col("age").mean())| games | age |
|---|---|
| str | f64 |
| "1906 Summer" | 27.125253 |
| "1984 Summer" | 24.002811 |
| "1928 Winter" | 26.581301 |
| "1976 Winter" | 23.739459 |
| "1924 Summer" | 28.449373 |
| … | … |
| "1956 Summer" | 26.033673 |
| "1952 Summer" | 26.258851 |
| "1980 Winter" | 23.65616 |
| "1932 Summer" | 33.45154 |
| "1960 Winter" | 24.241877 |
Utilizando el método sort() se puede obtener una tabla ordenada según variables de interés, por ejemplo games.
datos.group_by("games").agg(pl.col("age").mean()).sort("games")| games | age |
|---|---|
| str | f64 |
| "1896 Summer" | 23.580645 |
| "1900 Summer" | 29.034031 |
| "1904 Summer" | 26.69815 |
| "1906 Summer" | 27.125253 |
| "1908 Summer" | 26.970228 |
| … | … |
| "2008 Summer" | 25.734118 |
| "2010 Winter" | 26.124262 |
| "2012 Summer" | 25.961378 |
| "2014 Winter" | 25.987324 |
| "2016 Summer" | 26.207919 |
Y con un poco de creatividad podemos obtener la edición donde se presentó el atleta mas longevo:
datos.group_by("games").agg(pl.col("age").max()).sort("age").tail(1)| games | age |
|---|---|
| str | i64 |
| "1928 Summer" | 97 |
Otro ejemplo:
(
datos
.group_by("games", "noc")
.agg(
(pl.col("sex") == "F").mean().alias("p_mujer"),
pl.len().alias("n")
)
.sort("p_mujer", "n")
.tail(10)
)| games | noc | p_mujer | n |
|---|---|---|---|
| str | str | f64 | u32 |
| "2014 Winter" | "TOG" | 1.0 | 3 |
| "1992 Winter" | "HON" | 1.0 | 3 |
| "2006 Winter" | "HKG" | 1.0 | 3 |
| "1988 Summer" | "MYA" | 1.0 | 3 |
| "2006 Winter" | "DEN" | 1.0 | 4 |
| "1932 Winter" | "GBR" | 1.0 | 4 |
| "2002 Winter" | "HKG" | 1.0 | 5 |
| "1960 Summer" | "ZIM" | 1.0 | 10 |
| "1972 Winter" | "PRK" | 1.0 | 10 |
| "1976 Summer" | "PER" | 1.0 | 14 |
Una forma de obtener la cantidad de observaciones por grupo es:
datos.group_by("games", "noc").agg(pl.len())| games | noc | len |
|---|---|---|
| str | str | u32 |
| "1988 Summer" | "ASA" | 6 |
| "1956 Summer" | "PHI" | 48 |
| "1960 Summer" | "HAI" | 1 |
| "2002 Winter" | "EST" | 51 |
| "2016 Summer" | "SEN" | 22 |
| … | … | … |
| "1988 Summer" | "ANG" | 33 |
| "1984 Summer" | "FRA" | 354 |
| "1996 Summer" | "BOL" | 9 |
| "2010 Winter" | "SLO" | 107 |
| "1900 Summer" | "GRE" | 4 |
Manipulación de columnas
Seleccionar y transformar
Podemos seleccionar una columna de un DataFrame pasando el nombre de la misma dentro de corchetes (de la misma forma que haríamos con un diccionario).
datos["noc"]| noc |
|---|
| str |
| "CHN" |
| "CHN" |
| "DEN" |
| "DEN" |
| "NED" |
| … |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
El resultado es una Series, no un DataFrame compuesto por una sola columna.
type(datos["noc"])polars.series.series.Series
Si queremos obtener un DataFrame, tenemos que usar el método select.
datos.select("noc")| noc |
|---|
| str |
| "CHN" |
| "CHN" |
| "DEN" |
| "DEN" |
| "NED" |
| … |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
type(datos.select("noc"))polars.dataframe.frame.DataFrame
datos.select(pl.col("noc").alias("pais"))| pais |
|---|
| str |
| "CHN" |
| "CHN" |
| "DEN" |
| "DEN" |
| "NED" |
| … |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
| "POL" |
datos.select(pl.col("noc").alias("pais"), pl.col("age").mean())| pais | age |
|---|---|
| str | f64 |
| "CHN" | 25.556898 |
| "CHN" | 25.556898 |
| "DEN" | 25.556898 |
| "DEN" | 25.556898 |
| "NED" | 25.556898 |
| … | … |
| "POL" | 25.556898 |
| "POL" | 25.556898 |
| "POL" | 25.556898 |
| "POL" | 25.556898 |
| "POL" | 25.556898 |
datos.select((pl.col("weight") - pl.col("weight").mean()) / pl.col("weight").std())| weight |
|---|
| f64 |
| 0.648006 |
| -0.745914 |
| null |
| null |
| 0.787398 |
| … |
| 1.275271 |
| -0.81561 |
| -0.81561 |
| 1.763143 |
| 1.763143 |
Si se desea modificar y/o agregar columnas a un DataFrame, conviene utilizar el contexto with_columns, que devuelve una tabla con todas las columnas del DataFrame original junto con las nuevas columnas generadas a partir de las expresiones.
datos.with_columns(
weight_z=(pl.col("weight") - pl.col("weight").mean()) / pl.col("weight").std()
)| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal | weight_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str | f64 |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null | 0.648006 |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null | -0.745914 |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null | 0.787398 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135569 | "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | 1976 | "Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null | 1.275271 |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null | -0.81561 |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null | -0.81561 |
| 135571 | "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | 1998 | "Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null | 1.763143 |
| 135571 | "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | 2002 | "Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null | 1.763143 |
Debido a esta diferencia entre select y with_columns, las expresiones usadas dentro dewith_columnsdeben producir series que tengan la misma longitud que las columnas originales delDataFrame. En cambio, en el contextoselect` alcanza con que las expresiones produzcan series de la misma longitud entre sí.
Descartar
Para descartar columnas tenemos el método .drop. Simplemente le pasamos el nombre de las columnas que queremos descartar.
datos.drop("id", "year", "season")| name | sex | age | height | weight | team | noc | games | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | f64 | str | str | str | str | str | str | str |
| "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| … | … | … | … | … | … | … | … | … | … | … | … |
| "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null |
| "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
| "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
Renombrar
datos.select("name", "sex", "age").rename({"name": "nombre", "sex": "sexo", "age": "edad"})| nombre | sexo | edad |
|---|---|---|
| str | str | i64 |
| "A Dijiang" | "M" | 24 |
| "A Lamusi" | "M" | 23 |
| "Gunnar Nielsen Aaby" | "M" | 24 |
| "Edgar Lindenau Aabye" | "M" | 34 |
| "Christine Jacoba Aaftink" | "F" | 21 |
| … | … | … |
| "Andrzej ya" | "M" | 29 |
| "Piotr ya" | "M" | 27 |
| "Piotr ya" | "M" | 27 |
| "Tomasz Ireneusz ya" | "M" | 30 |
| "Tomasz Ireneusz ya" | "M" | 34 |
Manipulación de filas
Ordenar
Para ordenar las observaciones en función de una o más variables usamos .sort(). Si pasamos una sola columna alcanza con una cadena; si queremos ordenar por varias, conviene pasar una lista.
El argumento descending permite invertir el orden. Como Polars no usa un índice persistente, ordenar las filas no requiere ningún paso extra para reenumerarlas.
datos.sort("year")| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1724 | "Aristidis Akratopoulos" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Tennis" | "Tennis Men's Singles" | null |
| 1724 | "Aristidis Akratopoulos" | "M" | null | null | null | "Greece-3" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Tennis" | "Tennis Men's Doubles" | null |
| 1725 | "Konstantinos "Kostas" Akratopo… | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Tennis" | "Tennis Men's Singles" | null |
| 1725 | "Konstantinos "Kostas" Akratopo… | "M" | null | null | null | "Greece-3" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Tennis" | "Tennis Men's Doubles" | null |
| 4113 | "Anastasios Andreou" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Athletics" | "Athletics Men's 110 metres Hur… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135528 | "Marc Zwiebler" | "M" | 32 | 181 | 75.0 | "Germany" | "GER" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Badminton" | "Badminton Men's Singles" | null |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 100 metres" | null |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 200 metres" | null |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 4 x 100 metr… | null |
| 135568 | "Olga Igorevna Zyuzkova" | "F" | 33 | 171 | 69.0 | "Belarus" | "BLR" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Basketball" | "Basketball Women's Basketball" | null |
datos.sort("year", descending=True)| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 22 | "Andreea Aanei" | "F" | 22 | 170 | 125.0 | "Romania" | "ROU" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Weightlifting" | "Weightlifting Women's Super-He… | null |
| 51 | "Nstor Abad Sanjun" | "M" | 23 | 167 | 64.0 | "Spain" | "ESP" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Gymnastics" | "Gymnastics Men's Individual Al… | null |
| 51 | "Nstor Abad Sanjun" | "M" | 23 | 167 | 64.0 | "Spain" | "ESP" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Gymnastics" | "Gymnastics Men's Floor Exercis… | null |
| 51 | "Nstor Abad Sanjun" | "M" | 23 | 167 | 64.0 | "Spain" | "ESP" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Gymnastics" | "Gymnastics Men's Parallel Bars" | null |
| 51 | "Nstor Abad Sanjun" | "M" | 23 | 167 | 64.0 | "Spain" | "ESP" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Gymnastics" | "Gymnastics Men's Horizontal Ba… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135285 | "Khristos Zoumis" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Athletics" | "Athletics Men's Triple Jump" | null |
| 135481 | "Jules Alexis "Louis" Zutter" | "M" | 30 | null | null | "Switzerland" | "SUI" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Gymnastics" | "Gymnastics Men's Horse Vault" | "Silver" |
| 135481 | "Jules Alexis "Louis" Zutter" | "M" | 30 | null | null | "Switzerland" | "SUI" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Gymnastics" | "Gymnastics Men's Parallel Bars" | "Silver" |
| 135481 | "Jules Alexis "Louis" Zutter" | "M" | 30 | null | null | "Switzerland" | "SUI" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Gymnastics" | "Gymnastics Men's Horizontal Ba… | null |
| 135481 | "Jules Alexis "Louis" Zutter" | "M" | 30 | null | null | "Switzerland" | "SUI" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Gymnastics" | "Gymnastics Men's Pommelled Hor… | "Gold" |
datos.sort(["season", "year"])| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 57440 | "Karakatsanis" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Shooting" | "Shooting Men's Free Rifle, Thr… | null |
| 122329 | "Georgios Tsitas" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Wrestling" | "Wrestling Men's Unlimited Clas… | "Silver" |
| 122303 | "Leonidas Tsiklitiras" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Gymnastics" | "Gymnastics Men's Horizontal Ba… | null |
| 38123 | "Carl Galle" | "M" | 23 | 154 | 45.0 | "Germany" | "GER" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Athletics" | "Athletics Men's 1,500 metres" | null |
| 122152 | "A. Tryfiatis-Trypiapis" | "M" | null | null | null | "Greece" | "GRE" | "1896 Summer" | 1896 | "Summer" | "Athina" | "Cycling" | "Cycling Men's 12-Hours Race" | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 23464 | "Charle Cournoyer" | "M" | 22 | 173 | 68.0 | "Canada" | "CAN" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Short Track Speed Skating" | "Short Track Speed Skating Men'… | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 31342 | "Tiril Kampenhaug Eckhoff" | "F" | 23 | 163 | 59.0 | "Norway" | "NOR" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Biathlon" | "Biathlon Mixed 2 x 6 kilometre… | "Gold" |
| 128915 | "Yuta Watase" | "M" | 31 | 178 | 63.0 | "Japan" | "JPN" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Normal Hill,… | null |
Filtrar
Es posible seleccionar filas por posición usando slices. Esto resulta útil para inspeccionar rápidamente una parte de la tabla.
datos[0:10]| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 27 | 185 | 82.0 | "Netherlands" | "NED" | "1994 Winter" | 1994 | "Winter" | "Lillehammer" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 27 | 185 | 82.0 | "Netherlands" | "NED" | "1994 Winter" | 1994 | "Winter" | "Lillehammer" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
datos[0:10:2]| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 27 | 185 | 82.0 | "Netherlands" | "NED" | "1994 Winter" | 1994 | "Winter" | "Lillehammer" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
datos[0:100:10, :5]| id | name | sex | age | height |
|---|---|---|---|---|
| i64 | str | str | i64 | i64 |
| 1 | "A Dijiang" | "M" | 24 | 180 |
| 6 | "Per Knut Aaland" | "M" | 31 | 188 |
| 7 | "John Aalberg" | "M" | 31 | 183 |
| 11 | "Jorma Ilmari Aalto" | "M" | 22 | 182 |
| 16 | "Juhamatti Tapio Aaltonen" | "M" | 28 | 184 |
| 17 | "Paavo Johannes Aaltonen" | "M" | 32 | 175 |
| 20 | "Kjetil Andr Aamodt" | "M" | 20 | 176 |
| 20 | "Kjetil Andr Aamodt" | "M" | 26 | 176 |
| 22 | "Andreea Aanei" | "F" | 22 | 170 |
| 28 | "Jan-Erik Aarberg" | "M" | 47 | 170 |
Cuando lo que queremos es conservar filas según una condición sobre sus valores, en cambio, conviene usar el método filter. Este método acepta condiciones en forma de expresiones booleanas, es decir, filter también es un contexto.
Por ejemplo, podemos quedarnos solo con los Juegos Olímpicos de invierno, excluirlos o combinar condiciones sobre season, year, sport o medal.
datos.filter(pl.col("season") == "Winter")| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 25 | 185 | 82.0 | "Netherlands" | "NED" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Speed Skating" | "Speed Skating Women's 1,000 me… | null |
| 5 | "Christine Jacoba Aaftink" | "F" | 27 | 185 | 82.0 | "Netherlands" | "NED" | "1994 Winter" | 1994 | "Winter" | "Lillehammer" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135569 | "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | 1976 | "Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | 1998 | "Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | 2002 | "Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
datos.filter(pl.col("season") != "Winter")| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 8 | "Cornelia "Cor" Aalten (-Strann… | "F" | 18 | 168 | null | "Netherlands" | "NED" | "1932 Summer" | 1932 | "Summer" | "Los Angeles" | "Athletics" | "Athletics Women's 100 metres" | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135565 | "Fernando scar Zylberberg" | "M" | 27 | 168 | 76.0 | "Argentina" | "ARG" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Hockey" | "Hockey Men's Hockey" | null |
| 135566 | "James Francis "Jim" Zylker" | "M" | 21 | 175 | 75.0 | "United States" | "USA" | "1972 Summer" | 1972 | "Summer" | "Munich" | "Football" | "Football Men's Football" | null |
| 135567 | "Aleksandr Viktorovich Zyuzin" | "M" | 24 | 183 | 72.0 | "Russia" | "RUS" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Rowing" | "Rowing Men's Lightweight Coxle… | null |
| 135567 | "Aleksandr Viktorovich Zyuzin" | "M" | 28 | 183 | 72.0 | "Russia" | "RUS" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Rowing" | "Rowing Men's Lightweight Coxle… | null |
| 135568 | "Olga Igorevna Zyuzkova" | "F" | 33 | 171 | 69.0 | "Belarus" | "BLR" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Basketball" | "Basketball Women's Basketball" | null |
datos.filter(pl.col("season") == "Summer", pl.col("year") >= 2000)| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 12 | "Jyri Tapani Aalto" | "M" | 31 | 172 | 70.0 | "Finland" | "FIN" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Badminton" | "Badminton Men's Singles" | null |
| 13 | "Minna Maarit Aalto" | "F" | 34 | 159 | 55.5 | "Finland" | "FIN" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Sailing" | "Sailing Women's Windsurfer" | null |
| 18 | "Timo Antero Aaltonen" | "M" | 31 | 189 | 130.0 | "Finland" | "FIN" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Athletics" | "Athletics Men's Shot Put" | null |
| 21 | "Ragnhild Margrethe Aamodt" | "F" | 27 | 163 | null | "Norway" | "NOR" | "2008 Summer" | 2008 | "Summer" | "Beijing" | "Handball" | "Handball Women's Handball" | "Gold" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135565 | "Fernando scar Zylberberg" | "M" | 23 | 168 | 76.0 | "Argentina" | "ARG" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Hockey" | "Hockey Men's Hockey" | null |
| 135565 | "Fernando scar Zylberberg" | "M" | 27 | 168 | 76.0 | "Argentina" | "ARG" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Hockey" | "Hockey Men's Hockey" | null |
| 135567 | "Aleksandr Viktorovich Zyuzin" | "M" | 24 | 183 | 72.0 | "Russia" | "RUS" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Rowing" | "Rowing Men's Lightweight Coxle… | null |
| 135567 | "Aleksandr Viktorovich Zyuzin" | "M" | 28 | 183 | 72.0 | "Russia" | "RUS" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Rowing" | "Rowing Men's Lightweight Coxle… | null |
| 135568 | "Olga Igorevna Zyuzkova" | "F" | 33 | 171 | 69.0 | "Belarus" | "BLR" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Basketball" | "Basketball Women's Basketball" | null |
datos.filter(pl.col("medal").is_in(["Gold", "Silver"]))| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Team All-Arou… | "Gold" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Horse Vault" | "Gold" |
| 17 | "Paavo Johannes Aaltonen" | "M" | 28 | 175 | 64.0 | "Finland" | "FIN" | "1948 Summer" | 1948 | "Summer" | "London" | "Gymnastics" | "Gymnastics Men's Pommelled Hor… | "Gold" |
| 20 | "Kjetil Andr Aamodt" | "M" | 20 | 176 | 85.0 | "Norway" | "NOR" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Alpine Skiing" | "Alpine Skiing Men's Super G" | "Gold" |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135544 | "Krzysztof Zwoliski" | "M" | 21 | 175 | 70.0 | "Poland" | "POL" | "1980 Summer" | 1980 | "Summer" | "Moskva" | "Athletics" | "Athletics Men's 4 x 100 metres… | "Silver" |
| 135545 | "Henk Jan Zwolle" | "M" | 31 | 197 | 93.0 | "Netherlands" | "NED" | "1996 Summer" | 1996 | "Summer" | "Atlanta" | "Rowing" | "Rowing Men's Coxed Eights" | "Gold" |
| 135553 | "Galina Ivanovna Zybina (-Fyodo… | "F" | 21 | 168 | 80.0 | "Soviet Union" | "URS" | "1952 Summer" | 1952 | "Summer" | "Helsinki" | "Athletics" | "Athletics Women's Shot Put" | "Gold" |
| 135553 | "Galina Ivanovna Zybina (-Fyodo… | "F" | 25 | 168 | 80.0 | "Soviet Union" | "URS" | "1956 Summer" | 1956 | "Summer" | "Melbourne" | "Athletics" | "Athletics Women's Shot Put" | "Silver" |
| 135563 | "Olesya Nikolayevna Zykina" | "F" | 23 | 171 | 64.0 | "Russia" | "RUS" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Athletics" | "Athletics Women's 4 x 400 metr… | "Silver" |
datos.filter(
pl.col("sport").is_in(["Athletics", "Swimming"]),
pl.col("year").is_between(2000, 2016)
)| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 18 | "Timo Antero Aaltonen" | "M" | 31 | 189 | 130.0 | "Finland" | "FIN" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Athletics" | "Athletics Men's Shot Put" | null |
| 34 | "Jamale (Djamel-) Aarrass (Ahra… | "M" | 30 | 187 | 76.0 | "France" | "FRA" | "2012 Summer" | 2012 | "Summer" | "London" | "Athletics" | "Athletics Men's 1,500 metres" | null |
| 36 | "Stefan Remco Aartsen" | "M" | 25 | 194 | 78.0 | "Netherlands" | "NED" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Swimming" | "Swimming Men's 100 metres Butt… | null |
| 36 | "Stefan Remco Aartsen" | "M" | 25 | 194 | 78.0 | "Netherlands" | "NED" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Swimming" | "Swimming Men's 200 metres Butt… | null |
| 49 | "Moonika Aava" | "F" | 24 | 168 | 65.0 | "Estonia" | "EST" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Athletics" | "Athletics Women's Javelin Thro… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 100 metres" | null |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 200 metres" | null |
| 135547 | "Viktoriya Viktorovna Zyabkina" | "F" | 23 | 174 | 62.0 | "Kazakhstan" | "KAZ" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Athletics" | "Athletics Women's 4 x 100 metr… | null |
| 135563 | "Olesya Nikolayevna Zykina" | "F" | 19 | 171 | 64.0 | "Russia" | "RUS" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Athletics" | "Athletics Women's 4 x 400 metr… | "Bronze" |
| 135563 | "Olesya Nikolayevna Zykina" | "F" | 23 | 171 | 64.0 | "Russia" | "RUS" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Athletics" | "Athletics Women's 4 x 400 metr… | "Silver" |
datos.filter((pl.col("year") < 1950) | (pl.col("year") > 2010))| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 8 | "Cornelia "Cor" Aalten (-Strann… | "F" | 18 | 168 | null | "Netherlands" | "NED" | "1932 Summer" | 1932 | "Summer" | "Los Angeles" | "Athletics" | "Athletics Women's 100 metres" | null |
| 8 | "Cornelia "Cor" Aalten (-Strann… | "F" | 18 | 168 | null | "Netherlands" | "NED" | "1932 Summer" | 1932 | "Summer" | "Los Angeles" | "Athletics" | "Athletics Women's 4 x 100 metr… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135558 | "ukasz Tomasz ygado" | "M" | 32 | 200 | 89.0 | "Poland" | "POL" | "2012 Summer" | 2012 | "Summer" | "London" | "Volleyball" | "Volleyball Men's Volleyball" | null |
| 135561 | "Frantiek Zyka" | "M" | 26 | null | null | "Czechoslovakia" | "TCH" | "1928 Summer" | 1928 | "Summer" | "Amsterdam" | "Athletics" | "Athletics Men's Marathon" | null |
| 135568 | "Olga Igorevna Zyuzkova" | "F" | 33 | 171 | 69.0 | "Belarus" | "BLR" | "2016 Summer" | 2016 | "Summer" | "Rio de Janeiro" | "Basketball" | "Basketball Women's Basketball" | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
Cuando usamos operadores como | conviene encerrar cada comparación entre paréntesis para que la expresión se evalúe como esperamos.
Descartar duplicados
En Polars podemos usar .unique() para quedarnos con combinaciones únicas de una o más columnas. Esto sirve tanto para explorar valores posibles como para descartar filas duplicadas según un subconjunto de variables.
Si usamos el parámetro subset, Polars decide si dos filas son duplicadas mirando solo esas columnas.
datos.select("season", "sport").unique()| season | sport |
|---|---|
| str | str |
| "Summer" | "Croquet" |
| "Summer" | "Gymnastics" |
| "Summer" | "Lacrosse" |
| "Summer" | "Diving" |
| "Winter" | "Ice Hockey" |
| … | … |
| "Summer" | "Roque" |
| "Summer" | "Wrestling" |
| "Summer" | "Badminton" |
| "Winter" | "Alpinism" |
| "Summer" | "Figure Skating" |
datos.unique(subset=["season", "sport"])| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 23 | "Fritz Aanes" | "M" | 22 | 187 | 89.0 | "Norway" | "NOR" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Wrestling" | "Wrestling Men's Light-Heavywei… | null |
| 43 | "Morten Gjerdrum Aasen" | "M" | 34 | 185 | 75.0 | "Norway" | "NOR" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Equestrianism" | "Equestrianism Mixed Jumping, I… | null |
| 154 | "Khadija Abbouda" | "F" | 40 | 165 | 57.0 | "Morocco" | "MAR" | "2008 Summer" | 2008 | "Summer" | "Beijing" | "Archery" | "Archery Women's Individual" | null |
| 82 | "Enrique Abaroa Martnez" | "M" | 26 | 183 | 75.0 | "Mexico" | "MEX" | "2000 Summer" | 2000 | "Summer" | "Sydney" | "Tennis" | "Tennis Men's Doubles" | null |
| 8 | "Cornelia "Cor" Aalten (-Strann… | "F" | 18 | 168 | null | "Netherlands" | "NED" | "1932 Summer" | 1932 | "Summer" | "Los Angeles" | "Athletics" | "Athletics Women's 100 metres" | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5716 | "John Jacob Astor" | "M" | 21 | null | null | "Great Britain" | "GBR" | "1908 Summer" | 1908 | "Summer" | "London" | "Racquets" | "Racquets Men's Singles" | "Bronze" |
| 387 | "Sonia Abejn Esteban" | "F" | 19 | 169 | 48.0 | "Spain" | "ESP" | "2004 Summer" | 2004 | "Summer" | "Athina" | "Rhythmic Gymnastics" | "Rhythmic Gymnastics Women's Gr… | null |
| 20 | "Kjetil Andr Aamodt" | "M" | 20 | 176 | 85.0 | "Norway" | "NOR" | "1992 Winter" | 1992 | "Winter" | "Albertville" | "Alpine Skiing" | "Alpine Skiing Men's Downhill" | null |
| 21 | "Ragnhild Margrethe Aamodt" | "F" | 27 | 163 | null | "Norway" | "NOR" | "2008 Summer" | 2008 | "Summer" | "Beijing" | "Handball" | "Handball Women's Handball" | "Gold" |
| 5822 | "George H. Atkinson" | "M" | null | null | null | "Wolseley-Siddeley-1" | "GBR" | "1908 Summer" | 1908 | "Summer" | "London" | "Motorboating" | "Motorboating Mixed A-Class (Op… | null |
Combinación de tablas
Para combinar tablas en Polars hay dos operaciones muy comunes: la concatenación y la fusión. La concatenación pega tablas por posición, mientras que la fusión relaciona filas a partir de una o más claves.
Concatenación
La concatenación combina tablas que ya tienen una estructura compatible. Con pl.concat() podemos apilarlas verticalmente o pegarlas horizontalmente.
En una concatenación vertical, las filas de una tabla se agregan debajo de las de otra. En una concatenación horizontal, en cambio, las columnas se agregan lado a lado y las filas se alinean por posición.
ventas_2023 = pl.DataFrame({
"id_venta": [1, 2, 3],
"producto": ["Cuaderno", "Lapicera", "Regla"],
"unidades": [5, 8, 3],
})
ventas_2024 = pl.DataFrame({
"id_venta": [4, 5],
"producto": ["Cuaderno", "Mochila"],
"unidades": [6, 2],
})
detalle_ventas = pl.DataFrame({
"id_venta": [1, 2, 3],
"producto": ["Cuaderno", "Lapicera", "Regla"],
})
precios = pl.DataFrame({
"precio_unitario": [1200, 800, 1500],
"descuento": [0.10, 0.00, 0.05],
})pl.concat([ventas_2023, ventas_2024], how="vertical")| id_venta | producto | unidades |
|---|---|---|
| i64 | str | i64 |
| 1 | "Cuaderno" | 5 |
| 2 | "Lapicera" | 8 |
| 3 | "Regla" | 3 |
| 4 | "Cuaderno" | 6 |
| 5 | "Mochila" | 2 |
pl.concat([detalle_ventas, precios], how="horizontal")| id_venta | producto | precio_unitario | descuento |
|---|---|---|---|
| i64 | str | i64 | f64 |
| 1 | "Cuaderno" | 1200 | 0.1 |
| 2 | "Lapicera" | 800 | 0.0 |
| 3 | "Regla" | 1500 | 0.05 |
Fusión
Cuando queremos combinar tablas a partir de una clave compartida, usamos .join(). Los tipos de fusión más comunes son inner, left, right y full.
inner: conserva solo las filas con coincidencia en ambas tablas.left: conserva todas las filas de la tabla izquierda y agrega coincidencias de la derecha.right: conserva todas las filas de la tabla derecha y agrega coincidencias de la izquierda.full: conserva todas las filas de ambas tablas, aunque no tengan coincidencia.
La documentación oficial de Polars muestra más variantes y detalles sobre joins: https://docs.pola.rs/user-guide/transformations/joins/
clientes = pl.DataFrame({
"id_cliente": [101, 102, 103, 105],
"nombre": ["Ana", "Bruno", "Carla", "Diego"],
})
segmentos = pl.DataFrame({
"id_cliente": [102, 103, 104],
"segmento": ["Mayorista", "Minorista", "Mayorista"],
})clientes.join(segmentos, on="id_cliente", how="inner")| id_cliente | nombre | segmento |
|---|---|---|
| i64 | str | str |
| 102 | "Bruno" | "Mayorista" |
| 103 | "Carla" | "Minorista" |
clientes.join(segmentos, on="id_cliente")| id_cliente | nombre | segmento |
|---|---|---|
| i64 | str | str |
| 102 | "Bruno" | "Mayorista" |
| 103 | "Carla" | "Minorista" |
El resultado es el mismo porque how es por defecto igual a "inner".
clientes.join(segmentos, on="id_cliente", how="left")| id_cliente | nombre | segmento |
|---|---|---|
| i64 | str | str |
| 101 | "Ana" | null |
| 102 | "Bruno" | "Mayorista" |
| 103 | "Carla" | "Minorista" |
| 105 | "Diego" | null |
clientes.join(segmentos, on="id_cliente", how="right")| nombre | id_cliente | segmento |
|---|---|---|
| str | i64 | str |
| "Bruno" | 102 | "Mayorista" |
| "Carla" | 103 | "Minorista" |
| null | 104 | "Mayorista" |
clientes.join(segmentos, on="id_cliente", how="full")| id_cliente | nombre | id_cliente_right | segmento |
|---|---|---|---|
| i64 | str | i64 | str |
| 101 | "Ana" | null | null |
| 102 | "Bruno" | 102 | "Mayorista" |
| 103 | "Carla" | 103 | "Minorista" |
| 105 | "Diego" | null | null |
| null | null | 104 | "Mayorista" |
También es posible unir usando más de una columna. En ese caso, le pasamos a on una lista con los nombres de las claves que deben coincidir simultáneamente.
ventas_sucursal = pl.DataFrame({
"sucursal": ["Centro", "Centro", "Norte", "Norte"],
"producto": ["Cuaderno", "Lapicera", "Cuaderno", "Mochila"],
"unidades": [20, 35, 18, 7],
})
precios_sucursal = pl.DataFrame({
"sucursal": ["Centro", "Centro", "Norte", "Sur"],
"producto": ["Cuaderno", "Lapicera", "Cuaderno", "Mochila"],
"precio_unitario": [1200, 800, 1250, 32000],
})
ventas_sucursal.join(precios_sucursal, on=["sucursal", "producto"], how="left")| sucursal | producto | unidades | precio_unitario |
|---|---|---|---|
| str | str | i64 | i64 |
| "Centro" | "Cuaderno" | 20 | 1200 |
| "Centro" | "Lapicera" | 35 | 800 |
| "Norte" | "Cuaderno" | 18 | 1250 |
| "Norte" | "Mochila" | 7 | null |
Escritura de de datos
Así como Polars ofrece varias funciones para leer datos, también incluye métodos para exportarlos. Uno de los más usados es .write_csv, aunque también existen .write_parquet y .write_excel (estos últimos pueden requerir dependencias adicionales).
El primer argumento de .write_csv es la ruta donde queremos guardar el archivo.
datos.head(10).write_csv("primeras_10_filas.csv")datos| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | str | i64 | i64 | f64 | str | str | str | i64 | str | str | str | str | str |
| 1 | "A Dijiang" | "M" | 24 | 180 | 80.0 | "China" | "CHN" | "1992 Summer" | 1992 | "Summer" | "Barcelona" | "Basketball" | "Basketball Men's Basketball" | null |
| 2 | "A Lamusi" | "M" | 23 | 170 | 60.0 | "China" | "CHN" | "2012 Summer" | 2012 | "Summer" | "London" | "Judo" | "Judo Men's Extra-Lightweight" | null |
| 3 | "Gunnar Nielsen Aaby" | "M" | 24 | null | null | "Denmark" | "DEN" | "1920 Summer" | 1920 | "Summer" | "Antwerpen" | "Football" | "Football Men's Football" | null |
| 4 | "Edgar Lindenau Aabye" | "M" | 34 | null | null | "Denmark/Sweden" | "DEN" | "1900 Summer" | 1900 | "Summer" | "Paris" | "Tug-Of-War" | "Tug-Of-War Men's Tug-Of-War" | "Gold" |
| 5 | "Christine Jacoba Aaftink" | "F" | 21 | 185 | 82.0 | "Netherlands" | "NED" | "1988 Winter" | 1988 | "Winter" | "Calgary" | "Speed Skating" | "Speed Skating Women's 500 metr… | null |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 135569 | "Andrzej ya" | "M" | 29 | 179 | 89.0 | "Poland-1" | "POL" | "1976 Winter" | 1976 | "Winter" | "Innsbruck" | "Luge" | "Luge Mixed (Men)'s Doubles" | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135570 | "Piotr ya" | "M" | 27 | 176 | 59.0 | "Poland" | "POL" | "2014 Winter" | 2014 | "Winter" | "Sochi" | "Ski Jumping" | "Ski Jumping Men's Large Hill, … | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 30 | 185 | 96.0 | "Poland" | "POL" | "1998 Winter" | 1998 | "Winter" | "Nagano" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
| 135571 | "Tomasz Ireneusz ya" | "M" | 34 | 185 | 96.0 | "Poland" | "POL" | "2002 Winter" | 2002 | "Winter" | "Salt Lake City" | "Bobsleigh" | "Bobsleigh Men's Four" | null |
df_equipos = (
datos
.filter(pl.col("season") == "Summer")
.group_by("year")
.agg(pl.col("noc").n_unique())
.sort("year")
)
df_equipos| year | noc |
|---|---|
| i64 | u32 |
| 1896 | 12 |
| 1900 | 31 |
| 1904 | 15 |
| 1906 | 21 |
| 1908 | 22 |
| … | … |
| 2000 | 200 |
| 2004 | 201 |
| 2008 | 204 |
| 2012 | 205 |
| 2016 | 207 |
df_equipos.write_csv("equipos_por_año.csv")