data.table's Python port
The Python library landscape is fairly monolithic where in-memory tabular data manipulation is concerned: pandas reigns supreme. It is a brilliant piece of software, obviously, but could it soon face competition from the Python port of R's data.table? It might!
data.table is the package that saved R for the Big Data era, improving spectacularly on the performance of base R since 2006. The main reason to consider data.table in 2019, though, is not its superior speed and advanced algorithms, but the elegance of its compact syntax.
There are many good tutorials covering the practical usage of data.table, but here is the gist of it: operations are defined within the square brackets of the table, in the order DT[i, j, by] - subset rows using i, then compute j, grouped by by.
Many find this syntax hard to parse visually at first, but it will quickly become a data analysis superpower for which your fingers and/or code completion gnomes will be thankful.
Here is how to do the most basic data manipulation operations in R using data.table:
library(data.table)
# Reading a CSV
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
tbl <- fread(url)
# Filtering rows
tbl <- tbl[species != "setosa"]
# Selecting columns
tbl <- tbl[, .(species, sepal_length)]
# Adding a computed column (by reference)
tbl[, sepal_length_sq := sepal_length ** 2]
# Aggregating tables
agg_tbl <- tbl[, .(avg_sq_length = mean(sepal_length_sq)), by = "species"]
# Output the result
agg_tbl
## species avg_sq_length
## 1: versicolor 35.4972
## 2: virginica 43.7980
R as a language is of course highly suitable for creating expressive DSLs, whereas Python tries to be syntactically consistent and explicit (see this oldie). So how does the above code translate to Python?
These are some of the compromises made:
- f is imported to reference the current frame (table)
- Aggregation functions such as mean are imported from datatable
- The by argument for grouping is imported as a function
- Selecting all rows or columns uses the numpy slicing convention: ":"
- Column computations are defined as dictionaries
- Adding columns by reference is explicitly done with the cbind() method
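The f object is the key trick here: its attribute accesses and operators build deferred expressions rather than computing values immediately. As a conceptual sketch only (this is not datatable's actual implementation), the mechanism can be imitated in a few lines of plain Python:

```python
# Conceptual sketch of deferred column expressions, in the spirit of
# datatable's f. Not the real implementation - just operator overloading.

class Expr:
    """A deferred computation over named columns."""
    def __init__(self, fn):
        self.fn = fn  # maps a dict of columns -> list of values

    def __ne__(self, other):
        return Expr(lambda cols: [v != other for v in self.fn(cols)])

    def __pow__(self, p):
        return Expr(lambda cols: [v ** p for v in self.fn(cols)])

    def evaluate(self, cols):
        return self.fn(cols)


class FrameProxy:
    """Stand-in for datatable's f: attribute access yields a column Expr."""
    def __getattr__(self, name):
        return Expr(lambda cols: cols[name])


f = FrameProxy()
cols = {"species": ["setosa", "versicolor"], "sepal_length": [5.1, 7.0]}

mask = (f.species != "setosa").evaluate(cols)  # -> [False, True]
squares = (f.sepal_length ** 2).evaluate(cols)
```

The expression f.species != "setosa" thus produces an Expr object that the frame can later evaluate against its own columns, which is what lets the square-bracket syntax stay declarative.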
In datatable version 0.8.0, the above code then translates to:
import numpy as np
import datatable as dt
from datatable import f, by, mean
# Reading a CSV
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
tbl = dt.fread(url)
# Filtering rows
tbl = tbl[f.species != "setosa", :]
# Selecting columns
tbl = tbl[:, (f.species, f.sepal_length)]
# Adding a computed column (by reference)
tbl.cbind(tbl[:, {"sepal_length_sq" : np.square(f.sepal_length)}])
# Aggregating tables
agg_tbl = tbl[:, {"avg_sq_length" : mean(f.sepal_length_sq)}, by(f.species)]
# Output the result (and convert to pandas)
agg_tbl.to_pandas()
## species avg_sq_length
## 0 versicolor 35.4972
## 1 virginica 43.7980
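For comparison, here is a sketch of the same pipeline in pandas; a tiny inline frame stands in for the CSV, so the numbers differ from the full iris output above:

```python
import pandas as pd

# Tiny stand-in for the iris data; values are illustrative only
tbl = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 7.0, 6.4, 6.3, 5.8],
})

tbl = tbl[tbl["species"] != "setosa"]                       # filtering rows
tbl = tbl[["species", "sepal_length"]]                      # selecting columns
tbl = tbl.assign(sepal_length_sq=tbl["sepal_length"] ** 2)  # computed column
agg_tbl = (tbl.groupby("species", as_index=False)
              .agg(avg_sq_length=("sepal_length_sq", "mean")))
```

Every step here returns a new frame rather than modifying one in place, and the row-filter, column-select and group-by each use a different syntactic device - exactly the kind of verbosity the single square-bracket idiom avoids.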
datatable for Python is still in alpha and not full-featured yet, but I was delighted to see its much-beloved syntax translate so well. And together with joins, row or column deletion and sorting, it already covers 95% of practical data manipulation needs!