data.table's Python port

The Python library landscape is fairly monolithic where in-memory tabular data manipulation is concerned: pandas reigns supreme. It is a brilliant piece of software, obviously, but could it soon face competition from the Python port of R's data.table? It might!

data.table is the package that saved R for the Big Data era, improving spectacularly on the performance of base R ever since 2006. The main reason to consider data.table in 2019, though, is not its superior speed and advanced algorithms, but the elegance of its compact syntax.

There are many good tutorials covering the practical usage of data.table, but here is the gist of it: operations are defined within the table's square brackets, in the following order:

tbl[row operations, column operations, group operations]

Many find this syntax hard to parse visually at first, but it will quickly become a data analysis superpower for which your fingers and/or code completion gnomes will be thankful.

Here is how to do the most basic data manipulation operations in R using data.table:

library(data.table)

# Reading a CSV
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
tbl <- fread(url)

# Filtering rows
tbl <- tbl[species != "setosa"]

# Selecting columns
tbl <- tbl[, .(species, sepal_length)]

# Adding a computed column (by reference)
tbl[, sepal_length_sq := sepal_length ^ 2]

# Aggregating tables
agg_tbl <- tbl[, .(avg_sq_length = mean(sepal_length_sq)), by = "species"]

# Output the result
agg_tbl
##       species avg_sq_length
## 1: versicolor       35.4972
## 2:  virginica       43.7980

R as a language is of course highly suitable for creating expressive DSLs, whereas Python tries to be syntactically consistent and explicit (see this oldie). So how does the above code translate to Python?

These are some of the compromises made:

  • f is imported to reference the current frame (table)
  • Aggregation functions such as mean are imported from datatable
  • The by argument for grouping is imported as a function
  • Selecting all rows or columns uses the numpy slicing convention :
  • Column computations are defined as dictionaries
  • Adding columns by reference is explicitly done with a cbind() method

In datatable version 0.8.0, the above code then translates to:

import numpy as np
import datatable as dt
from datatable import f, by, mean

# Reading a CSV
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
tbl = dt.fread(url)

# Filtering rows
tbl = tbl[f.species != "setosa", :]

# Selecting columns
tbl = tbl[:, (f.species, f.sepal_length)]

# Adding a computed column (by reference)
tbl.cbind(tbl[:, {"sepal_length_sq" : np.square(f.sepal_length)}])

# Aggregating tables
agg_tbl = tbl[:, {"avg_sq_length" : mean(f.sepal_length_sq)}, by(f.species)]

# Output the result (and convert to pandas)
agg_tbl.to_pandas()
##       species  avg_sq_length
## 0  versicolor        35.4972
## 1   virginica        43.7980

datatable for Python is still in alpha and not yet full-featured, but I was delighted to see its much-beloved syntax translate so well. Together with joins, row or column deletion, and sorting, it already covers 95% of practical data manipulation needs!