We’re excited to include the following guest blog post by our friends at Domino.
Domino is a PaaS for data science — our goal is to give analysts and researchers easy access to powerful infrastructure, so they can focus on their analysis instead of worrying about custom tooling and environment setup.
Docker has been enormously helpful to us, saving us many man-months of engineering effort. Before describing that in more detail, we'll give some context about what we mean by "powerful infrastructure." Essentially, Domino provides three things:
- Access to cloud computing resources. Many analysts can write R or Python code, but are not technical enough to set up their own EC2 machine. Domino lets you run your own R, Python or Matlab code on cloud hardware without any special setup. In this sense, Domino is like a Heroku for data science.
- Organization and version control. Data analysis is iterative. A typical workflow involves making changes to code and data, running the code to generate result files (charts, tables, etc.), viewing the results, and iterating. Every time Domino runs your project, we keep a snapshot of all input files and result files, so you can track the progression of your work through time, and trace particular results back to the inputs that generated them. Our users work with data files in the tens of gigabytes — so not true "big data" scale, but well past the point where a simple git solution would work.
- Collaboration: because Domino hosts all project files centrally, multiple users can easily collaborate, sharing files, initiating new runs, and sharing results. Domino handles synchronization, authorization, and notifications.
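To make the snapshotting idea above concrete, here is a minimal sketch in Python. It records a content hash for every file in a project directory so that a run's results can later be traced back to the exact inputs that produced them. This is purely illustrative — Domino's actual storage layer is not shown in this post, and the function and field names here are our own.

```python
import hashlib
import os
import time

def snapshot(project_dir: str) -> dict:
    """Hash every file under project_dir, producing a manifest that
    identifies the exact state of inputs and results for one run."""
    manifest = {}
    for root, _, files in os.walk(project_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            # Store paths relative to the project root for portability.
            manifest[os.path.relpath(path, project_dir)] = digest
    return {"taken_at": time.time(), "files": manifest}
```

Comparing two such manifests is enough to tell which files a run changed, without storing anything beyond hashes until a real copy is needed.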
At its core, Domino executes arbitrary code on behalf of our users. To give each process its own machine would be prohibitively expensive, so user jobs share resources in a cluster. In our early beta versions, this sharing was naive — we just started each process on a host machine. That caused two large problems for us.
First, security: users’ code and data are valuable assets, and we couldn’t have a platform where some malicious script could, say, crawl the file system to find files from other users’ jobs.
Second, users’ scripts often require external dependencies (R packages, Python libraries) that need to be installed in some global location. Having all processes share a file system would lead to conflicts, since different jobs can require different, incompatible versions of the same package.
Docker enabled us to easily upgrade our cluster: instead of launching processes on the host, the machines now launch processes in Docker containers for each user run.
Once we built our container images, modifying our code to use them was straightforward because Docker’s API is quite clean, and because we had already done a decent job of encapsulating our logic for executing commands. Here’s the essence of the code changes we made (with some details elided):
This was the core part of our process-launching code before using Docker:
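The original snippet isn't reproduced in this chunk, but a minimal Python sketch conveys the idea: the user's command ran directly on the shared host, with full access to its file system and globally installed packages. The function names and argument shapes here are illustrative, not Domino's actual code.

```python
import subprocess

def build_host_command(interpreter: str, script: str, args: list[str]) -> list[str]:
    # Pre-Docker: the command is just the user's script, run as-is.
    return [interpreter, script, *args]

def launch_on_host(command: list[str], workdir: str) -> subprocess.Popen:
    # Starts the job directly on the host machine -- no isolation
    # between this run and any other user's run on the same box.
    return subprocess.Popen(command, cwd=workdir)
```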
And here it is after:
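Again, the actual snippet is elided here; as a hedged sketch, the change amounts to wrapping the same user command in `docker run`, so each run gets its own isolated file system and its own installed dependencies. The image name and mount paths below are invented for illustration.

```python
import subprocess

def build_container_command(command: list[str], run_dir: str,
                            image: str = "domino/executor") -> list[str]:
    # Post-Docker: same user command, wrapped in `docker run`.
    return [
        "docker", "run", "--rm",      # remove the container when the run ends
        "-v", f"{run_dir}:/job",      # expose only this run's files
        "-w", "/job",                 # start in the job directory
        image,
        *command,
    ]

def launch_in_container(command: list[str], run_dir: str) -> subprocess.Popen:
    return subprocess.Popen(build_container_command(command, run_dir))
```

Because only the command-construction step changes, the surrounding dispatch logic stays the same — which is what made the upgrade straightforward.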
Now that our users’ code executes in isolation, we’re eager to use Docker for more. Specifically, we have two upcoming use cases in mind:
- Power users may want to use their own, highly customized environments. Custom Docker images per user per project could give them that capability.
- We’d like to use Docker to make our own deployment options more flexible. For example, we’d like to create container images for the various components in our stack (database, message queue, dispatch server, executor slaves) so that we can easily deploy our product on-premise within a company, instead of hosting everything on our own servers.
We have been thoroughly impressed by the quality, power, and convenience of Docker, and we’re excited to be a part of its voyage.
Meet us at the next Docker Meetup
We will give a lightning talk on how and why we use Docker at Domino at the next San Francisco Docker Meetup at Twilio on Thursday. We will be happy to share our experience and answer your questions.