10x Faster Parallel Python Without Python Multiprocessing

Last updated 2 years ago by Robert Nishihara


While Python’s multiprocessing library has been used successfully for a wide range of applications, in this blog post, we show that it falls short for several important classes of applications including numerical data processing, stateful computation, and computation with expensive initialization. There are two main reasons:

  • Inefficient handling of numerical data.
  • Missing abstractions for stateful computation (i.e., an inability to share variables between separate “tasks”).

Ray is a fast, simple framework for building and running distributed applications that addresses these issues. For an introduction to some of the basic concepts, see this blog post. Ray leverages Apache Arrow for efficient data handling and provides task and actor abstractions for distributed computing.

This blog post benchmarks three workloads that aren’t easily expressed with Python multiprocessing and compares Ray, Python multiprocessing, and serial Python code. Note that it’s important to always compare to optimized single-threaded code.

In these benchmarks, Ray is 10–30x faster than serial Python, 5–25x faster than multiprocessing, and 5–15x faster than the faster of these two on a large machine.

The benchmarks were run on EC2 using the m5 instance types (m5.large for 1 physical core and m5.24xlarge for 48 physical cores). Code for running all of the benchmarks is available here. Abbreviated snippets are included in this post. The main differences are that the full benchmarks include 1) timing and printing code, 2) code for warming up the Ray object store, and 3) code for adapting the benchmark to smaller machines.

Benchmark 1: Numerical Data

Many machine learning, scientific computing, and data analysis workloads make heavy use of large arrays of data. For example, an array may represent a large image or dataset, and an application may wish to have multiple tasks analyze the image. Handling numerical data efficiently is critical.

Each pass through the for loop below takes 0.84s with Ray, 7.5s with Python multiprocessing, and 24s with serial Python (on 48 physical cores). This performance gap explains why it is possible to build libraries like Modin on top of Ray but not on top of other libraries.

The code looks as follows with Ray.

By calling ray.put(image), the large array is stored in shared memory and can be accessed by all of the worker processes without creating copies. This works not just with arrays but also with objects that contain arrays (like lists of arrays).

Read full Article