Matrix cross-distances using Rcpp

I promised to keep these posts short and sweet, and have been failing miserably lately. So how about a quick one?

The problem

One of the things I find myself implementing again and again in R is the computation of the Euclidean cross-distances of all pairwise points between two coordinate matrices. That is, for every point (x,y,z,…) in matrix M1 we want to compute the distance to every point (x,y,z,…) in matrix M2.

The requirements are:

  • Good speed and memory usage
  • Take and return matrix objects
  • Support for two different matrices
  • Support for more than two coordinate dimensions
  • Minimal package dependencies

Unsatisfactory solutions

There are some solutions available both in base R and on CRAN, but none of them seem to satisfy the requirements.

  • stats::dist() is an obvious candidate, but only computes the distances within one matrix. And it returns a dist object instead of a matrix.
  • pdist::pdist() does support using two different matrices, but it is slow and introduces an extra dependency.
  • Built-in R matrix operations can be used, but are very slow when there are many points. pracma::distmat() uses this approach, and has similar performance issues.
  • spatstat::crossdist() is fast, but only works with two dimensions and constitutes a rather large dependency.

Rcpp to the rescue

The easiest way to get around performance issues is to code critical computations directly in C, rather than in R. The Rcpp package is tremendously helpful for this, and indeed you will see many performant packages use it extensively.

And it’s not nearly as scary as you might fear it is!

Take a look at this:

library(Rcpp)

cppFunction('NumericMatrix crossdist(NumericMatrix m1, NumericMatrix m2) {

  int nrow1 = m1.nrow();
  int nrow2 = m2.nrow();
  int ncol = m1.ncol();
 
  if (ncol != m2.ncol()) {
    throw std::runtime_error("Incompatible number of dimensions");
  }
 
  NumericMatrix out(nrow1, nrow2);
 
  for (int r1 = 0; r1 < nrow1; r1++) {
    for (int r2 = 0; r2 < nrow2; r2++) {
      double total = 0;
      for (int c12 = 0; c12 < ncol; c12++) {
        total += pow(m1(r1, c12) - m2(r2, c12), 2);
      }
      out(r1, r2) = sqrt(total);
    }
  }
 
  return out;

}')

This should be relatively easy to understand even for someone who has never coded in C++ - we iterate over all combinations of matrix rows, add up the squared differences between dimensions, and take the square root before moving on to the next row. Rcpp allows us to directly take in, manipulate and return R matrix objects, which otherwise would be a big hassle.

When we compare performance to the alternatives listed above, we get:

library(microbenchmark)

mat1 <- matrix(runif(2000), ncol = 2)
mat2 <- matrix(runif(2000), ncol = 2)

microbenchmark(rcpp = crossdist(mat1, mat2),
               dist = as.matrix(stats::dist(mat1)),
               pdist = as.matrix(pdist::pdist(mat1, mat2)),
               spatstat = spatstat::crossdist(mat1[, 1], mat1[, 2],
                                              mat2[, 1], mat2[, 2]),
               pracma = pracma::distmat(mat1, mat2))
## Unit: milliseconds
##      expr       min        lq     mean    median        uq       max neval
##      rcpp  3.924608  4.154542 10.22706  7.415639 11.834439  80.65002   100
##      dist 12.830634 15.061525 29.33596 22.293739 35.982537 112.22664   100
##     pdist 11.681389 13.729763 28.65005 16.766787 34.149946 106.61124   100
##  spatstat  3.353265  3.596391 14.51705  6.999774  9.762766 586.01026   100
##    pracma 34.795197 59.406904 77.25908 64.281204 78.253411 168.83436   100

Only spatstat::crossdist() is a competitor to this simple Rcpp implementation, but as argued it is limited to two dimensions and constitutes a dependency of considerable size. dist() would be faster if we didn’t convert the dist object to a matrix afterwards - but it is also not doing the same computation.

Of course Rcpp itself is a big dependency. However it is very likely to already be part of your installation if you use any non-base packages.