Parallel Python - Standard Library

2018-09-25

If you need to process the items of a list and each item is independent of the others, you can probably benefit from parallelism: split the work into multiple tasks and execute them at the same time. There are two ways of doing parallelism in Python: threads and processes.

I had always parallelized with pools, but while writing this I discovered other interesting ways of using them, along with other parallelization tools. We will start with pools and move to the other topics later.

Pools: simple processing of a list

Pools are very helpful because they hide the implementation details from you, even if handling those details yourself is not that complex. Python ships two pools, one at multiprocessing.Pool and the other at multiprocessing.pool.ThreadPool. The default one works with processes. The code below calculates the factorial of each number produced by range.

# basic.py
import multiprocessing.pool  # explicit submodule import, needed for ThreadPool
from math import factorial


def main():
    print("Running a pool of processes...")
    with multiprocessing.Pool() as pool:
        results = pool.map(factorial, range(10))
    print(f"Processes pool results: {results}")

    print("Running a pool of threads...")
    with multiprocessing.pool.ThreadPool() as pool:
        results = pool.map(factorial, range(10))
    print(f"Threads pool results: {results}")


if __name__ == '__main__':
    main()
$ python basic.py
Running a pool of processes...
Processes pool results: [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
Running a pool of threads...
Threads pool results: [1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]

Python distributes each element of the list to a worker of the pool, so the workers calculate the factorials in parallel. When a worker finishes, it receives another argument to process.
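You can also control this distribution: both the `processes` argument of Pool and the optional `chunksize` argument of map are part of the standard library API. A small sketch:

```python
# chunks.py -- sketch: choosing the number of workers and the chunk size.
import multiprocessing
from math import factorial


def main():
    # Two worker processes instead of the default (one per CPU core).
    with multiprocessing.Pool(processes=2) as pool:
        # chunksize=5: each worker grabs 5 items at a time instead of
        # asking for them one by one, which reduces dispatch overhead.
        results = pool.map(factorial, range(10), chunksize=5)
    print(results)


if __name__ == '__main__':
    main()
```

Larger chunks mean less communication overhead but a coarser balance of work between the workers.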

Timings

The following code times three different styles (serial, processes and threads) on two functions, sleep and square, using the timeit module.

# timing.py
import multiprocessing.pool  # explicit submodule import, needed for ThreadPool
from timeit import timeit
from time import sleep


def square(x):
    # A tiny CPU-bound task: multiply the number by itself.
    return x * x


def serial_sleep():
    list(map(sleep, range(5)))


def processes_sleep():
    with multiprocessing.Pool() as pool:
        pool.map(sleep, range(5))


def threads_sleep():
    with multiprocessing.pool.ThreadPool() as pool:
        pool.map(sleep, range(5))


def serial_square():
    list(map(square, range(1000)))


def processes_square():
    with multiprocessing.Pool() as pool:
        pool.map(square, range(1000))


def threads_square():
    with multiprocessing.pool.ThreadPool() as pool:
        pool.map(square, range(1000))


if __name__ == '__main__':
    for func, repeats in [("sleep", 3), ("square", 100)]:
        for style in ["serial", "processes", "threads"]:
            fn_name = f"{style}_{func}"
            elapsed = timeit(f"{fn_name}()", setup=f"from __main__ import {fn_name}", number=repeats)
            print(f"Time elapsed for {fn_name}: {round(elapsed / repeats, 4)}s")
        print()
$ python timing.py
Time elapsed for serial_sleep: 10.0096s
Time elapsed for processes_sleep: 4.0407s
Time elapsed for threads_sleep: 4.0176s

Time elapsed for serial_square: 0.0005s
Time elapsed for processes_square: 0.1289s
Time elapsed for threads_square: 0.1058s

These results are somewhat unexpected if you know nothing about processes and threads in Python; both come with certain drawbacks. For the sleep functions, the serial version is slower because each iteration has to wait for the previous one, while in the parallelized versions all iterations wait at the same time, so the total is just the longest sleep, 4 seconds. For the square functions, the parallel versions are slower. Why? Because spawning a pool of threads or processes takes time. So if you only have small calculations, or if you won't reuse your pool of workers (as in this benchmark), you are better off running the tasks serially. But if you have slow computations, such as the ones the sleep function simulates, parallelizing is a good choice.
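The arithmetic behind the sleep numbers is easy to check: the serial version waits the sum of all the sleeps, while a pool with at least five workers only waits as long as the longest one:

```python
# Expected wall-clock time for the sleep benchmarks above.
delays = list(range(5))      # sleep(0), sleep(1), ..., sleep(4)

serial_time = sum(delays)    # each sleep runs after the previous one
parallel_time = max(delays)  # all sleeps run at once; the longest wins

print(serial_time)    # 10 -- matches the ~10s measured for serial_sleep
print(parallel_time)  # 4  -- matches the ~4s measured for the pools
```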

To reuse a pool, simply open it without a context manager (no with statement):

import multiprocessing


def main():
    pool = multiprocessing.Pool()
    pool.map(task, arguments)
    pool.close()  # Do not accept more tasks
    pool.join()  # Wait for the tasks to complete


if __name__ == '__main__':
    main()
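A runnable sketch of that reuse pattern, amortizing the pool start-up cost over several map calls (factorial stands in for your real task here):

```python
# reuse.py -- sketch: one pool, several map calls.
import multiprocessing
from math import factorial


def main():
    pool = multiprocessing.Pool()                # workers are spawned once...
    first = pool.map(factorial, range(5))
    second = pool.map(factorial, range(5, 10))   # ...and reused here
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for pending tasks to finish
    print(first + second)


if __name__ == '__main__':
    main()
```

Only the first map call pays the cost of starting the worker processes.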

Threads or processes

There are two major differences between threads and processes: threads share objects (they live in the same memory space), and threads are affected by Python's Global Interpreter Lock. The first difference is easy to demonstrate.

# shared.py

import multiprocessing.pool  # explicit submodule import, needed for ThreadPool


def main():
    shared = []
    with multiprocessing.Pool() as pool:
        pool.map(shared.append, range(20))
    print(shared)

    shared = []
    with multiprocessing.pool.ThreadPool() as pool:
        pool.map(shared.append, range(20))
    print(shared)


if __name__ == '__main__':
    main()
$ python shared.py
[]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

The threaded version populated the shared list: things that happen to an object in one thread can be seen by the other threads out of the box. Processes can share objects too, but you have to do it explicitly. The Global Interpreter Lock is a restriction Python uses to enforce certain guarantees, and it prevents two threads from running Python code at the same time: a thread must wait for the Global Interpreter Lock to be released before it can run. If the task is computationally intensive, the GIL is rarely released, and the total time will be pretty much the same as the serial version. If you want to read more about the GIL and why Python has it, check this post at Real Python about the GIL.
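For completeness, one way to share objects between processes explicitly is multiprocessing.Manager, a standard library facility: the manager runs a server process and hands out proxy objects that every worker can mutate. A sketch, mirroring shared.py:

```python
# manager.py -- sketch: sharing a list across processes explicitly.
import multiprocessing


def main():
    with multiprocessing.Manager() as manager:
        shared = manager.list()  # proxy to a list living in the manager process
        with multiprocessing.Pool() as pool:
            # Each worker talks to the manager through the proxy,
            # so the appends are visible to the parent process.
            pool.map(shared.append, range(20))
        # The order of the appends is not deterministic, so sort for display.
        print(sorted(shared))


if __name__ == '__main__':
    main()
```

This time the parent process does see all twenty elements, at the cost of every append being a round trip to the manager process.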

Summary

Parallelization becomes useful whenever you have slow tasks. Creating threads and processes is itself time consuming, so you should have a decent batch of tasks and reuse your worker pools. While you can parallelize with either threads or processes, you will probably benefit most from processes, since they can run heavy computations in a truly parallel way. They also do not share memory with each other, which is safer.

parallelization apis