Dynamic Super Computing in R

I have long believed that spending a lot of money on a powerful computer for home and office use is a waste: it’s more efficient to use cloud compute resources on demand than to rely on my local machine. And yet, for just as long, I haven’t actually done that. I’ve run all of my jobs locally because I couldn’t figure out a good way to use a remote computing box that didn’t throw a lot of wrenches into my workflow.

However, recently I’ve had the chance to work extensively with the future package for R and I believe that I can see the path to a bright future of true on-demand dynamic computing (at least in the R ecosystem).

In this post, I’ll start by describing my workflow and what an ideal dynamic compute setup would look like, then explain why some existing solutions haven’t worked well for me, then show how I’ve been using future for R to take a big step toward that ideal setup, and close with some of the limitations and gotchas of using future this way (which is a bit outside of what it was designed for).

My Development Environment & Dream Setup

My development environment is tmux + vim. I use the same setup for working in Python or R or SQL or Ruby on Rails, and I have found that switching to any other development environment (or, even worse, a different environment for each framework) dramatically slows down my development speed. I do not like using RStudio for software development (although it’s a great tool for doing analysis), and I similarly find that my heavily customized vim setup still wins against PyCharm or DataGrip.

In general, I like being able to work with my local filesystem! If I download some data, I want to be able to link directly to it. If I make a change to a file in my local filesystem, I want to be able to reload the package and run my tests with the changes applied.

My ideal dynamic computing environment would support all of this “local” customization while letting me use CPUs and RAM in the cloud rather than just whatever I picked up at the Apple Store. That is, I want to be able to use my local clipboard and my local filesystem, but ship data over the wire for processing on remote boxes. In an ideal world, I’d be able to tap unlimited computing capacity in the cloud while it still “feels” as if I’m just using my local box.

Other solutions / attempts

The first step I took toward making this happen was creating an EC2 image on AWS that was a Linux version of the setup I have locally. I installed tmux, I installed vim, I ported my macOS configuration to the Linux box, and I installed all of the Python and R packages that I normally use. Then, whenever I needed to do a remote compute job, I’d just spin up an EC2 box from that image with whatever specs I needed, ssh in, and get to work.

This idea worked much better in theory than in practice for a number of reasons:

  • Dealing with credentials is a nightmare. You don’t want to store any credentials on your image, so you face a lot of overhead any time you want to connect to other AWS resources or do a git pull. I read a few ideas for workarounds, but I never found any of them actually easy.
  • Keeping the environment up to date / synced is a pain.
    •  Because I still do some work on my local machine, my AWS image was always stale, so every new box came with a pretty long startup time while I got up-to-date packages and environments installed. The extra 20 to 30 minutes of re-installing everything I’m going to need is actually pretty costly!
    •  Additionally, once I update all of the packages, I have to remember to save a new image so I don’t have to do this again next time. Versioning these images and keeping track of what has and hasn’t been updated, especially if I’m using different boxes for different projects, is a real pain.
  • Things like access to the clipboard and the local filesystem end up being … really annoying. I find myself writing a lot of scp ... commands that traverse two different filesystems, which gets frustrating fast. Similarly, differences in how the clipboard is handled over ssh are another annoyance that makes it not especially pleasant to work in that environment.

Instead of putting my entire environment on the box in the cloud, I had considered just running the REPL in the cloud via ssh and then using my local environment as normal. That of course doesn’t work if you want to read local data into memory or work off of the local file system (e.g., source("~/myscript.R")).

I’ve mentioned above that I don’t like using RStudio, which rules out the traditional RStudio-in-the-browser approach. I’ve heard that VS Code has some functionality that can make this work, but I haven’t yet given it a try.

So there are some options that get close to what I want, but none of them felt seamless enough that I’d actually use it as often as I should.

Using future for remote compute in R

And then I found future for R. This package adds support for parallel processing and asynchronous programming in R. It does a lot, but for my use case two things are crucial:

  • It connects a local R session to a remote computing cluster in the cloud
  • It handles shipping local in-memory objects over ssh to the remote cluster for remote processing as well as bringing back the results

If it doesn’t immediately click how monumental that is, I’ll just tell you: it’s a big deal! You can run a local R session as you normally would and offload any compute-intensive jobs directly to the cloud without ever leaving that session. You don’t have to save a CSV of the data, scp it to your box, ssh in, and run your script; everything just happens seamlessly from your local R session.
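
To make that concrete, here is a minimal sketch of the pattern. Everything in it is illustrative: I’m assuming a remote host (called my-ec2-box below) that I can ssh into without a password prompt and that already has R and future installed, plus a made-up local CSV and a toy model standing in for the real compute-heavy job.

    library(future)

    # Point future at the remote machine; everything else stays local.
    plan(cluster, workers = "my-ec2-box")  # hostname is a placeholder

    # Local data, read from the local filesystem as usual (made-up path).
    dat <- read.csv("~/data/some_local_file.csv")

    # This block runs on the remote box: `dat` is automatically serialized,
    # shipped over ssh, used there, and the fitted model is shipped back.
    fit %<-% {
      lm(y ~ x, data = dat)  # stand-in for the actual heavy computation
    }

    summary(fit)  # blocks until the remote result arrives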

With future set up appropriately, I can use my local editor and local configuration and reach for the remote box only for the compute-intensive job where I’m running a thousand different regressions and compiling the results. In combination with the cloudyr aws.ec2 package, I can spin the remote boxes up and down programmatically, so I don’t pay for a minute more than that specific job requires.
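
Here is a rough sketch of what that connection setup can look like. The IP address, ssh user, and key path are placeholders, I’m assuming an instance that already has R and future installed, and the aws.ec2 calls are only indicated in comments since the exact arguments depend on your image and account.

    library(future)

    # 1. Launch an instance from a pre-built image (e.g., with
    #    aws.ec2::run_instances()) and note its public IP once it is up.
    ip <- "203.0.113.10"  # placeholder

    # 2. Build an ssh-backed worker on that machine and point future at it.
    workers <- makeClusterPSOCK(
      ip,
      user        = "ubuntu",                     # placeholder ssh user
      rshopts     = c("-i", "~/.ssh/my-key.pem"), # placeholder key file
      homogeneous = FALSE  # use whatever Rscript is on the remote PATH
    )
    plan(cluster, workers = workers)

    # 3. ...run the compute-heavy futures from the local session...

    # 4. Tear down the worker and terminate the instance (e.g., with
    #    aws.ec2::terminate_instances()) so the meter stops running.
    parallel::stopCluster(workers)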

It really is like magic, and it opens up incredible opportunities to build powerful tools on top of our local (and underpowered) development machines.

Issues and Next Steps

Unfortunately, this computing setup is not the paradigmatic use case for the future package, and there are a number of issues that I’d love to see addressed (likely by copying some of the functionality from future into a separate package specifically optimized for managing these remote-compute jobs).[1]

[1] It is totally unfair to call what’s listed here “limitations” of the future package. Everything I’m doing here is an abuse of that package, and I’m really just trying to highlight what the needs would be for a package that was actually designed to support this use case.

Issues:

  • future relies on ssh, and out of the box a dropped connection will kill your remote R processes. If you’re trying to use this setup to manage long-running jobs on the remote machine, that turns into a lot of headaches.
    •  Using something like nohup can keep your remote processes from being canceled, but then you have to figure out how to get your results back after the connection drops.
    •  It’d be great to have a connection layer that’s more fault-tolerant than ssh (maybe mosh?) so that intermittent network outages don’t kill the connection.
  • future does not provide an easy way to monitor the remote process: while it is running, intermediate logs are not sent back to the local session, so you don’t know whether your job is making progress or has hung somewhere.
  • In general, there’s no easy way to monitor or restart remote processes where things might have gone wrong, or to recover gracefully if one or more of them fails.
  • You have to install whatever libraries are needed for the remote computation on the remote boxes before you ship the code to them (that includes the future library but also rstan or mgcv or whatever)
    •  It’s hard to see a way around this problem, but I’d love to think about ways to make it easier to manage for folks who are less technical (one ad-hoc option is sketched just after this list).
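
That ad-hoc option, under the assumption that a cluster plan like the one above is already active and that the remote box can reach CRAN, is to run install.packages() itself inside a future, so the installation happens in the remote box’s library rather than locally. The package names here are just examples.

    library(future)

    # Assumes plan(cluster, workers = ...) is already pointing at the remote box.
    f <- future({
      install.packages(c("mgcv", "rstan"), repos = "https://cloud.r-project.org")
      rownames(installed.packages())  # report what the remote library now holds
    })
    remote_pkgs <- value(f)  # blocks until the remote installation finishes
    head(remote_pkgs)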

Conclusion

I am so excited by the opportunities presented by this style of computing and workflow. I would love to help push this effort forward, so if you’re a developer looking for help either improving this process for R or building something similar for Python, I’d be glad to pitch in however I can (though I’m sure I’m not the person to actually be the tech lead on the project).

If you’re an R user, hopefully you give this a shot — definitely let me know how it goes!
