Dynamic Super Computing in R

I have long believed that spending a lot of money on a powerful computer for home and office use is a waste: it’s more efficient to use cloud compute resources on demand than to rely on my local machine. And yet, for just as long, I haven’t actually done that. I’ve run all of my jobs locally because I couldn’t figure out a good way to use a remote computing box that didn’t throw a lot of wrenches into my workflow.

However, recently I’ve had the chance to work extensively with the future package for R and I believe that I can see the path to a bright future of true on-demand dynamic computing (at least in the R ecosystem).

In this post, I’ll start by describing my workflow and what an ideal dynamic compute setup would look like, then explain why some existing solutions haven’t worked well for me, then show how I’ve been using future for R to take a big step toward that ideal setup, and close with some of the limitations and gotchas of using future this way (which is a bit outside of what it was designed for).

My Development Environment & Dream Setup

My development environment is tmux + vim. I use the same setup for working in Python or R or SQL or Ruby on Rails, and I have found that switching to any other development environment (or, even worse, a different environment for each framework) dramatically slows down my development speed. I do not like using RStudio for software development (although it’s a great tool for doing analysis), and I similarly find that my heavily customized vim setup still wins against PyCharm or DataGrip.

In general, I like being able to work with my local filesystem! If I download some data, I want to be able to link directly to it. If I make a change to a file in my local filesystem, I want to be able to reload the package and run my tests with the changes applied.

My ideal dynamic computing environment would support all of this “local” customization while letting me use CPUs and RAM in the cloud rather than just whatever I picked up at the Apple Store. That is, I want to be able to use my local clipboard and my local filesystem, but ship data over the wire for processing on remote boxes. In an ideal world, I’d be able to tap unlimited computing capacity in the cloud while it still “feels” as if I’m just using my local box.

Other solutions / attempts

The first step I took toward making this happen was creating an EC2 image on AWS that was a Linux version of the setup I have locally. I installed tmux, I installed vim, I ported my macOS configuration to the Linux box, and I installed all of the Python and R packages that I normally use. Then, whenever I needed to do a remote compute job, I’d just spin up an EC2 box from that image with whatever specs I needed, ssh in, and get to work.

This idea worked much better in theory than in practice for a number of reasons:

  • Dealing with credentials is a nightmare. You don’t want to store any credentials on your image, so you face a lot of overhead any time you want to connect to other AWS resources or do a git pull. I read a few ideas for workarounds, but I never found any of them actually easy.
  • Keeping the environment up to date / synced is a pain.
    •  Because I still do some work on my local machine, my AWS image was always stale, so every new box came with a pretty long startup time while I got up-to-date packages and environments installed. The extra 20 to 30 minutes of re-installing everything I’m going to need is actually pretty costly!
    •  Additionally, once I update all of the packages, I have to remember to save a new image so I don’t have to do this again next time. Versioning these images and keeping track of what has and hasn’t been updated, especially if I’m using different boxes for different projects, is a real pain.
  • Things like access to the clipboard and the local filesystem end up being … really annoying. I find myself writing a lot of scp ... commands that traverse two different filesystems, which gets frustrating fast. Similarly, differences in how the clipboard is handled over ssh are another annoyance that makes it not especially pleasant to work in that environment.

Instead of putting my entire environment on the box in the cloud, I had considered just running the REPL in the cloud via ssh and then using my local environment as normal. That of course doesn’t work if you want to read local data into memory or work off of the local file system (e.g., source("~/myscript.R")).

I’ve mentioned above that I don’t like using RStudio, which rules out the traditional RStudio-in-the-browser approach. I’ve heard that VS Code has some functionality that can make this work, but I haven’t yet given it a try.

So there are some options that get close to what I want, but none of them felt seamless enough that I’d actually use it as often as I should.

Using future for remote compute in R

And then I found future for R. This package adds support for parallel processing and asynchronous programming in R. It does a lot, but for my use case two things are crucial:

  • It connects a local R session to a remote computing cluster in the cloud
  • It handles shipping local in-memory objects over ssh to the remote cluster for remote processing as well as bringing back the results

If it doesn’t immediately click how monumental that is, I’ll just tell you: it’s a big deal! You can run a local R session as you normally would and offload any compute-intensive jobs directly to the cloud without ever leaving that session. You don’t have to save a CSV of the data, scp it to your box, ssh in, and run your script; everything just happens seamlessly from your local R session.
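
To make that concrete, here is a minimal sketch of the pattern. Everything in it is illustrative: I’m assuming a remote host (called my-ec2-box below) that I can ssh into without a password prompt and that already has R and future installed, plus a made-up local CSV and a toy model standing in for the real compute-heavy job.

    library(future)

    # Point future at the remote machine; everything else stays local.
    plan(cluster, workers = "my-ec2-box")  # hostname is a placeholder

    # Local data, read from the local filesystem as usual (made-up path).
    dat <- read.csv("~/data/some_local_file.csv")

    # This block runs on the remote box: `dat` is automatically serialized,
    # shipped over ssh, used there, and the fitted model is shipped back.
    fit %<-% {
      lm(y ~ x, data = dat)  # stand-in for the actual heavy computation
    }

    summary(fit)  # blocks until the remote result arrives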

With future set up appropriately, I can use my local editor and local configuration and reach for the remote box only for the compute-intensive job where I’m running a thousand different regressions and compiling the results. In combination with the cloudyr aws.ec2 package, I can spin the remote boxes up and down programmatically, so I don’t pay for a minute more than that specific job requires.
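
Here is a rough sketch of what that connection setup can look like. The IP address, ssh user, and key path are placeholders, I’m assuming an instance that already has R and future installed, and the aws.ec2 calls are only indicated in comments since the exact arguments depend on your image and account.

    library(future)

    # 1. Launch an instance from a pre-built image (e.g., with
    #    aws.ec2::run_instances()) and note its public IP once it is up.
    ip <- "203.0.113.10"  # placeholder

    # 2. Build an ssh-backed worker on that machine and point future at it.
    workers <- makeClusterPSOCK(
      ip,
      user        = "ubuntu",                     # placeholder ssh user
      rshopts     = c("-i", "~/.ssh/my-key.pem"), # placeholder key file
      homogeneous = FALSE  # use whatever Rscript is on the remote PATH
    )
    plan(cluster, workers = workers)

    # 3. ...run the compute-heavy futures from the local session...

    # 4. Tear down the worker and terminate the instance (e.g., with
    #    aws.ec2::terminate_instances()) so the meter stops running.
    parallel::stopCluster(workers)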

It really is like magic, and it opens up incredible opportunities to build powerful tools on top of our local (and underpowered) development machines.

Issues and Next Steps

Unfortunately, this computing setup is not the paradigmatic use case for the future package, and there are a number of issues that I’d love to see addressed (likely by copying some of the functionality from future into a separate package specifically optimized for managing these remote-compute jobs).[1]

[1] It is totally unfair to call what’s listed here “limitations” of the future package. Everything I’m doing here is an abuse of that package, and I’m really just trying to highlight what the needs would be for a package that was actually designed to support this use case.

Issues:

  • future relies on ssh, and out of the box a dropped connection will kill your remote R processes. If you’re trying to use this setup to manage long-running jobs on the remote machine, that turns into a lot of headaches.
    •  Using something like nohup can keep your remote processes from being canceled, but then you have to figure out how to get your results back after the connection drops.
    •  It’d be great to have a connection layer that’s more fault-tolerant than ssh (maybe mosh?) so that intermittent network outages don’t kill the connection.
  • future does not provide an easy way to monitor the remote process: while it is running, intermediate logs are not sent back to the local session, so you don’t know whether your job is making progress or has hung somewhere.
  • In general, there’s no easy way to monitor or restart remote processes where things might have gone wrong, or to recover gracefully if one or more of them fails.
  • You have to install whatever libraries are needed for the remote computation on the remote boxes before you ship the code to them (that includes the future library but also rstan or mgcv or whatever)
    •  It’s hard to see a way around this problem, but I’d love to think about ways to make it easier to manage for folks who are less technical (one ad-hoc option is sketched just after this list).
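
That ad-hoc option, under the assumption that a cluster plan like the one above is already active and that the remote box can reach CRAN, is to run install.packages() itself inside a future, so the installation happens in the remote box’s library rather than locally. The package names here are just examples.

    library(future)

    # Assumes plan(cluster, workers = ...) is already pointing at the remote box.
    f <- future({
      install.packages(c("mgcv", "rstan"), repos = "https://cloud.r-project.org")
      rownames(installed.packages())  # report what the remote library now holds
    })
    remote_pkgs <- value(f)  # blocks until the remote installation finishes
    head(remote_pkgs)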

Conclusion

I am so excited by the opportunities presented by this style of computing and workflow. I would love to help push this effort forward, so if you’re a developer looking for help either improving this process for R or building something similar for Python, I’d be glad to pitch in however I can (though I’m sure I’m not the person to actually be the tech lead on the project).

If you’re an R user, hopefully you give this a shot — definitely let me know how it goes!
