Introduction #

A fundamental component of data science is the sharing of results, outputs and findings, along with the code that produced them. Consider the following scenarios:

  1. We have an R and/or Python script. It works on our laptop. Can we share it with someone else to run also?
  2. How about a Jupyter/Rmarkdown notebook? What about one that combines R and Python via rpy2 or reticulate?

The correct answer to the questions above will vary, but for complex scripts and projects it will most likely be "not easily".

Few users rely on plain R or Python alone. For example, in R the tidyverse set of packages is an entire ecosystem of tools for manipulating data, while mlr and its dependencies cover tools for machine learning. Installing each of these from scratch can bring hundreds of packages onto our system.

For Python, the situation is no different. Normally, people start with the basics — numpy, pandas, matplotlib, scikit-learn. Already those come with a substantial set of dependencies and versions, and they are only the start — we may want to make nicer plots (plotly, seaborn, ...), or do deep learning via our favourite modelling tools.

Technical reproducibility #

Being able to understand and reproduce the complexity of our compute environment improves the technical reproducibility of our work, in addition to other good practices such as versioning, setting seeds, and not hard-coding values. Here are a few recommendations to keep complex projects reproducible and maintainable over time.
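
As a small illustration of the seed-setting point, a minimal sketch in Python (numpy here is just an example package):

# fix seeds so that random draws are repeatable between runs
import random

import numpy as np

random.seed(42)                     # Python's built-in RNG
rng = np.random.default_rng(42)     # numpy's recommended generator object
print(rng.integers(0, 10, size=3))  # same three numbers on every run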

Minimize and simplify dependencies #

Many packages duplicate functionality. For example, do we really need both caret and mlr, or packages for fitting exotic models that only cover special cases? How about both data.table and the tidyverse? plotly, seaborn and bokeh? Both tensorflow and pytorch? That library that wraps one-hot encoding in a really neat way?

Picking the minimal set of tools (and getting good at using it) goes a long way to not only keeping our work maintainable as it grows, but also producing cleaner and more consistent code. Also, removing things we tried out and then didn't use is a good habit to get into.

Understand the dependencies #

Find and document dependencies. Apart from knowing which packages were used, it is also good to know how things fit together and what the limitations of dependency packages may be. For example, knowing that Keras in R also uses Python, we may want to document the Python dependencies as well. Or, to get more technical, different versions of numerical libraries such as Intel MKL or OpenBLAS may interact differently with compute hardware of various ages (our hardware is a fundamental dependency too, and, although rare, it can have bugs as well).
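
On the Python side, a quick way to see what a package declares as its dependencies is importlib.metadata (pandas below is just an example):

# print a package's version and its declared dependencies
from importlib.metadata import requires, version

print(version("pandas"))
for req in requires("pandas") or []:  # requires() may return None
    print(req)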

Document at runtime #

In R, a good start is to include information from sessionInfo() and .libPaths() in our output. For Python/Jupyter there is a package named watermark that achieves similar things. In a Unix / Linux environment, incorporating environment variables (printenv in the shell) can be helpful also. Finally, it can be useful to get and print runtime information on the system we're on.
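
For Python, a minimal sketch of such run-time documentation could look like this (the packages listed are just examples):

# record interpreter, platform, environment and package versions with the output
import os
import platform
import sys
from importlib.metadata import version

print(sys.version)                          # Python interpreter
print(platform.platform())                  # operating system / machine
print(os.environ.get("CONDA_DEFAULT_ENV"))  # active conda environment, if any
for pkg in ["numpy", "pandas"]:             # packages used by the analysis
    print(pkg, version(pkg))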

Import on top #

In both R and Python we can include / require packages in many places in our code, even dynamically at runtime. If possible, avoid this and load everything consistently at the start. For example, in an Rmarkdown notebook, require all libraries in the first chunk; the same goes for Jupyter/IPython. Python also suggests an ordering for imports via PEP8, which is a good idea to follow.
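
A sketch of the PEP8 ordering for a script header or a notebook's first cell (the module names are only examples, and the local import is hypothetical):

# standard library first
import os
import sys

# then third-party packages
import numpy as np
import pandas as pd

# then our own, local modules (hypothetical example)
from myproject import helpers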

Don't install in ${HOME} or over-customize #

Avoid installing to personal libraries. Use specific library locations, conda environments or virtualenvs / pipenvs for Python, and checkpoint or renv for R.
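
A quick sketch to check from within Python whether we are actually running in an isolated environment rather than the system or user installation:

import os
import sys

# venv/virtualenv changes sys.prefix away from the base installation
print("in virtualenv/venv:", sys.prefix != sys.base_prefix)
# conda sets this variable for the active environment
print("conda environment:", os.environ.get("CONDA_DEFAULT_ENV", "none"))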

Also avoid manual configuration changes (user-level or system-level), such as editing .Renviron.site. Others will not have the same changes in their home directory or on their system.

Fix the versions #

When installing in R and Python, we typically get the newest version. For longer-running projects, it is helpful to pin the versions to the ones we started with and only increment at specific points in time. Tools that help here are the CRAN time machine at MRAN and the checkpoint package, which allow us to install package versions as of a specific date. The renv package allows us to pin the set of versions we have installed right now, and restore them later. In Python, conda environments can be saved to a yaml file via conda env export (or in case of pip: pip freeze). For projects that run a really long time, it may be a good idea to keep the tarballs of packages around on a local mirror.
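
From within Python, we can also produce a pinned list of everything installed in the current environment, similar in spirit to pip freeze:

# list installed distributions as name==version, one per line
from importlib.metadata import distributions

for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f'{dist.metadata["Name"]}=={dist.version}')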

Docker and the cloud #

In the cloud, things can get a little easier. For example, many examples for Tensorflow, and in the ML community generally, use Google Colaboratory and install their dependencies at the start of the notebook. Similarly, Code Ocean is a tool to share our data and analysis together with the environment they were run in. In many cases, the underlying infrastructure uses container orchestration and Docker to ensure flexibility and isolation: each user can set up their compute environment like their own machine without interfering with others. However, being able to use these environments comes with one key caveat: the data we need must be able to live there, too.
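
In a notebook, pinning those start-of-notebook installs keeps this approach reproducible, too; a sketch (the version pins below are only examples):

# first cell of the notebook: install pinned versions into the current kernel
%pip install numpy==1.21.0 pandas==1.3.0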

Of course, using Docker directly is possible, too, and is a good way to check how our code might run on someone else's machine. However, this can take quite some effort. Building Docker images well and reproducibly is a technical science of its own, and archiving pre-built images together with all the security holes and bugs not yet discovered in them is also a risky strategy. Also, we cannot simply run Docker containers anywhere; the system must allow us to do so (and many systems where we are not admins/root actually don't).

Understand library paths #

When we do need to rely on many packages, it is important to know where packages are installed and loaded from. Both R and Python retrieve packages from a set of locations, their respective library paths. These can be fairly complex.

The library path in Python comes from various sources:

  • Our Python's system library path (all base libraries deployed with our Python installation).
  • Potentially additional paths from our Conda environment, virtualenv, pipenv, etc.
  • Paths in our home according to PEP-370.
  • Python may also add the current (or the script's) directory to the search path.

How this comes together is documented here. Short of reading all this, the simplest way to find where Python looks for packages is to run:

import sys; print(sys.path)

Also, sys.path is just a Python list; we can append to it or change it within a script to point to new locations. To add a specific library directory before running a script, use the PYTHONPATH environment variable:

# common example: testing in a Python package, library source lives
# in src, cli interface in bin; this way we can test locally without
# doing an install.
PYTHONPATH=src python bin/test.py
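
The same effect can be achieved from inside the script itself; a sketch reusing the src layout from the example above (mylibrary is a hypothetical package living there):

# put the local source directory on the search path before importing from it
import sys
from pathlib import Path

project_root = Path(__file__).resolve().parent.parent  # bin/ -> project root
sys.path.insert(0, str(project_root / "src"))
import mylibrary  # hypothetical package living in src/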

R tells us its library paths through the .libPaths() function. These paths also come from various sources:

  • Environment variables, namely R_LIBS_USER, R_LIBS_SITE and the library subdirectory of the R_HOME directory.
  • Various config files, in particular the various kinds of .Rprofile / .Renviron files, which may live in our home directory, with the R installation (e.g. .Rprofile.site) or with our project (e.g. created when using renv or packrat). Modifying these files can make it very hard to figure out why things went wrong, so change them with caution. Also, some IDEs like RStudio actually load the .Rprofile in the folder where we open a project, so adding complex logic or lengthy package installations there is not a good idea.

Know how R and Python work together #

Another level of complexity comes in when using R and Python together in the same script. This can be quite a powerful tool (for example, I much prefer plotting in ggplot, but working with environments like Spark in Python is still so much easier).

There are two common ways to do this:

  • reticulate to run Python from R. Reticulate does a good job in terms of configurability: we can point it to a specific version of Python or a conda environment, and in the worst case we can provide a wrapper shell script to set up and run Python. Note that by default, reticulate will use the system-installed version of Python, so being explicit is important for reproducibility. Moreover, reticulate is what runs Tensorflow or Keras from R. So, even if we are only using R, setting up reticulate may be necessary!
  • rpy2 allows us to run R from Python, or from Jupyter/IPython, where we can enable R cells via the %%R magic like this:
    import rpy2
    %load_ext rpy2.ipython
    New cells prefixed with %%R can now be written in R and exchange data with the Python session (see docs here). This works nicely, but it is important to know where rpy2 actually picks up its R version: it is determined by the location of R_HOME (or the location of which R) at the time rpy2 is installed. This can lead to confusion, for example when R is installed both at the system level and in a conda environment (a quick check is sketched right after this list).
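
A quick way to check which R installation rpy2 has picked up is to ask R itself from Python:

# print the home directory and version of the R that rpy2 is using
import rpy2.robjects as ro

print(ro.r('R.home()')[0])
print(ro.r('R.version.string')[0])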