If it quacks like a package

The curse of modularity

Perhaps data scientists can be classified into different types according to their favourite procrastination activities. There is the one who likes perfecting already-flashy visuals. There is the one who needlessly applies the fanciest algorithms. The one who keeps trying new programming languages and loves demonstrating interop. And then there is the one always perfecting their analysis workflow.

On that last count - guilty as charged! I have spent entirely too much time building one-off analysis code that is perfectly reproducible and consists fully of modular functions, while always concluding afterwards that it wasn’t really worth the effort. Most of those beautifully modular functions are never reused and most analysis projects never need to be reproduced, while the abstractions needed for their consistent modularity and reproducibility require significant mental effort, both at the time of creation and whenever there is an actual need to revisit and understand the code.

And to be perfectly honest, these lofty ideals are regularly dropped anyway as the deadline draws closer, leaving behind an analysis that feels like a dirty failure to me, while satisfying the client fully! So in time, I have gravitated towards a more pragmatic approach: one that is better than the chaos-inviting classical approach of sourcing interdependent .R scripts with library() calls at the top, but that avoids unnecessary overhead as well as the temptation of procrastination-through-modularisation.

Use a package

In short, each analysis for me is a package, but not really. It’s a quasi-package - a quackage! R packages offer significant advantages for managing project code:

  • Packages have their own namespace environment, separate from the global environment
  • Code spread out over multiple .R files can be loaded all at once
  • External dependencies can be declared in a structured manner, and imported into the package namespace selectively using import and importFrom statements in the NAMESPACE file
  • There is a wealth of tooling available for managing and using packages - such as usethis and devtools

At the same time, the full orthodoxy of R package development is overkill for an analysis project. You don’t need to roxygenise your every function parameter, you don’t need to write unit tests, and you do not need to distribute it or even build and install it into your local library.

… a minimal package!

Importantly, we need to recognise that data analysis is an interactive, creative process of understanding the data. Putting abstraction and formalism first only distracts from that. So what do we minimally need to be able to leverage the benefits of packages?

  • A DESCRIPTION file with at least a package name and a version, and preferably a specification of dependencies
  • An R/ directory for code that you have abstracted out into functions
  • A NAMESPACE file to import often-used external dependencies, or individual functions from them
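
As a rough sketch, those three ingredients could be set up from R itself as shown below. The project name quackage and the dependencies dplyr and ggplot2 are placeholders chosen for illustration, not a recommendation; writing the two files by hand in a text editor works just as well.

```r
# Hypothetical project location and name; a temp directory keeps the sketch harmless.
root <- file.path(tempdir(), "quackage")

# 1. An R/ directory for functions you have abstracted out.
dir.create(file.path(root, "R"), recursive = TRUE, showWarnings = FALSE)

# 2. A DESCRIPTION with a name, a version and (preferably) the dependencies.
#    dplyr and ggplot2 are example dependencies only.
writeLines(c(
  "Package: quackage",
  "Version: 0.0.1",
  "Imports: dplyr, ggplot2"
), file.path(root, "DESCRIPTION"))

# 3. A NAMESPACE that imports a whole package, or just individual functions.
writeLines(c(
  "import(dplyr)",
  "importFrom(ggplot2, ggplot, aes)"
), file.path(root, "NAMESPACE"))
```

Nothing here is built or installed: the directory only has to look enough like a package for the tooling to treat it as one.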

And forget about the rest. Calling devtools::load_all() will now load all imported dependencies and every function in your R/ directory at once, and make them available in the interactive session you’re working in. Meanwhile, code not in R/ will not be loaded - this can be your main script using those functions, or the code you are currently working on until you can wrap it into a function in R/.
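
To make that concrete, here is a small self-contained sketch, assuming devtools is installed. It builds a throwaway quackage named duckdemo in a temp directory, with a single made-up function centre() in R/, and then loads it. In a real project you would simply call devtools::load_all() from the project root.

```r
# Throwaway project so the sketch can run anywhere (names are hypothetical).
root <- file.path(tempdir(), "duckdemo")
dir.create(file.path(root, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Package: duckdemo", "Version: 0.0.1"),
           file.path(root, "DESCRIPTION"))
writeLines("importFrom(stats, median)", file.path(root, "NAMESPACE"))

# A function that has been abstracted out into R/.
writeLines("centre <- function(x) x - median(x)",
           file.path(root, "R", "centre.R"))

# Load the imports and everything in R/ in one go.
# Guarded, so the script still runs on a machine without devtools.
if (requireNamespace("devtools", quietly = TRUE)) {
  devtools::load_all(root)
  print(centre(c(1, 2, 10)))
}
```

The main script calling centre() lives outside R/, so it is never loaded by load_all() itself.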

If you need to use global variables accessed by your package functions, perhaps in some cases even as an alternative to the elegant drudgery of cleanly passing data between them, you can safely declare them inside the package namespace - this will then not interfere with whatever you’re doing in the global environment. And whenever you need to reset things or have updated your functions, just call devtools::load_all() again.
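
A minimal sketch of such a package-level global follows. The file names, the variable min_obs and the function keep_large_groups() are all invented for illustration. Both definitions would sit in files under R/.

```r
# --- R/settings.R (hypothetical file) ---
# A "global" that lives with the package code rather than in your workspace.
min_obs <- 30

# --- R/filter.R (hypothetical file) ---
# Uses min_obs directly instead of receiving it as an argument.
keep_large_groups <- function(counts) {
  counts[counts >= min_obs]
}
```

Once loaded through devtools::load_all(), keep_large_groups() should look up min_obs in the package namespace first, so a variable of the same name that you create while experimenting in the global environment would not change its behaviour.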

Finally

This is of course not an encouragement to start writing sloppy code. Do be clear in your intent, do write comments, do check for correctness, don’t throw out best practices altogether. But there is a balance to strike here. Don’t abstract prematurely, don’t straitjacket yourself in a way that removes you from feeling the data interactively, don’t be tempted to merely stimulate yourself intellectually by building intricate code-LEGO that will cost you precious time now and bring you little benefit in the future. Data analysis can benefit from software engineering practices, but it is not software engineering.

My suggestion here is to use a deliberately minimal package structure as a solid middle ground. Quack! 🦆