Python Multiprocessing: Pool vs Process – Comparative Analysis

Posted by Priyanka Mane on October 4, 2017 in Blog

Introduction To Python Multiprocessing 

Multiprocessing is a great way to improve performance. We came across Python multiprocessing when we had to evaluate millions of Excel expressions using Python code. In such a scenario, evaluating the expressions serially is impractical and time-consuming.

So, we decided to use Python Multiprocessing.

Generally, you can run tasks in parallel using either processes or threads. We first planned to use threads. While researching, however, we learned that Python's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so threads do not speed up CPU-bound work such as ours. On further digging, we found that Python's multiprocessing module provides two classes for running work in separate processes: Process and Pool. The following sections give a brief overview of our experience with both classes, along with a performance comparison that should help you choose the appropriate one for your multiprocessing task.

 

Python Multiprocessing: The Pool and Process class

Though Pool and Process both execute tasks in parallel, they go about it differently.

The Pool starts a fixed number of worker processes and hands tasks to them in FIFO order. It works like a map-reduce architecture: it maps the input across the workers and collects the output from all of them. With pool.map, it waits for all the tasks to finish and then returns the output as a list, in the same order as the input. Only the worker processes exist in memory at any time; tasks that have not started yet simply wait in a queue.
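The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from our project: `square_with_pid` and `run_pool` are hypothetical names, and the worker count of 3 is an arbitrary choice.

```python
import os
from multiprocessing import Pool


def square_with_pid(x):
    """Return x squared together with the PID of the worker that computed it."""
    return x * x, os.getpid()


def run_pool(values, workers=3):
    # The pool starts a fixed number of worker processes; tasks that have not
    # started yet wait in the pool's queue until a worker is free.
    with Pool(processes=workers) as pool:
        # map() blocks until every task is done and keeps the input order.
        results = pool.map(square_with_pid, values)
    squares = [square for square, _ in results]
    worker_pids = {pid for _, pid in results}
    return squares, worker_pids


if __name__ == "__main__":
    squares, worker_pids = run_pool(range(10))
    print(squares)           # squares in the same order as the input
    print(len(worker_pids))  # at most 3: the workers are reused across tasks
```

Ten tasks are handled by at most three processes, which is the main difference from starting one process per task.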

Figure: Pool process

The Process class, by contrast, creates one operating-system process per task, so all the processes you start are in memory at once. The operating system schedules them: when a process is suspended, for example while it waits on IO, another process is scheduled for execution.
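A minimal sketch of the one-process-per-task approach might look like this. The names `square_task` and `run_processes` are hypothetical, and using a Queue to return results is just one possible choice.

```python
import os
from multiprocessing import Process, Queue


def square_task(x, out_queue):
    """Compute x squared in a child process and send the result to the parent."""
    out_queue.put((x, x * x, os.getpid()))


def run_processes(values):
    out_queue = Queue()
    procs = [Process(target=square_task, args=(x, out_queue)) for x in values]
    for p in procs:
        p.start()  # every task gets its own operating-system process
    # Read the results before joining so that no child is left blocked
    # on a queue the parent has not drained yet.
    results = [out_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results)


if __name__ == "__main__":
    for x, square, pid in run_processes([1, 2, 3, 4]):
        print(x, square, pid)  # each task normally reports a different PID
```

Unlike pool.map, nothing here limits how many processes exist at once, so this only scales to a modest number of tasks.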

When to use Pool and Process

I think choosing an appropriate approach depends on the task at hand. Pool allows you to do multiple jobs per process, which may make it easier to parallelize your program. If you have a million tasks to execute in parallel, you can create a Pool with as many processes as CPU cores and then pass the list of the million tasks to pool.map. The pool distributes those tasks to the worker processes (typically the same in number as the available cores), collects the return values in a list, and passes that list to the parent process. Launching a million separate processes would be much less practical (it would probably break your OS).
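As a rough sketch of the million-task case, assuming a hypothetical `evaluate` function that stands in for the real expression evaluator and an arbitrarily chosen chunk size:

```python
import os
from multiprocessing import Pool


def evaluate(n):
    """Hypothetical stand-in for evaluating one expression."""
    return n * 2 + 1


def evaluate_all(inputs, chunksize=1000):
    workers = os.cpu_count() or 1  # one worker process per CPU core
    with Pool(processes=workers) as pool:
        # chunksize sends tasks to the workers in batches, which reduces
        # the inter-process communication overhead for many small tasks.
        return pool.map(evaluate, inputs, chunksize=chunksize)


if __name__ == "__main__":
    results = evaluate_all(range(1_000_000))
    print(len(results))
```

The `chunksize` argument is optional. When it is omitted, pool.map picks a value itself, so tuning it is only worth trying if the tasks are very small.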

Figure: Pool code and Process code

On the other hand, if you have a small number of tasks to execute in parallel, and you only need each task done once, it may be perfectly reasonable to use a separate multiprocessing.Process for each task rather than setting up a Pool.

We used both the Pool and Process classes to evaluate Excel expressions. Following are our observations about the two classes:

  1. Task number

As we have seen, a Pool keeps only its worker processes in memory, whereas the Process class creates a process for every task. So when the number of tasks is small, we can use the Process class, and when the number of tasks is large, we can use Pool. With a large number of tasks, using Process may exhaust memory and destabilize the system. A Pool, however, has some overhead in its creation, so with a small number of tasks, performance suffers when Pool is used.

  2. IO operations

The Pool distributes the tasks among its worker processes in FIFO order, and each worker executes its allocated tasks serially. So, if a task performs a long IO operation, its worker waits until the IO operation completes and does not pick up another task in the meantime. This increases the execution time. With the Process class, every task has its own process, so the operating system suspends a process that is waiting on IO and schedules another. So, for long IO operations, it is advisable to use the Process class.
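The IO observation can be illustrated with a small timing sketch. Here `slow_io` is a hypothetical stand-in that simply sleeps, and the task count, sleep time, and worker count are arbitrary assumptions; actual timings will vary by machine.

```python
import time
from multiprocessing import Pool, Process


def slow_io(seconds):
    """Hypothetical stand-in for a blocking IO call."""
    time.sleep(seconds)
    return seconds


def time_with_pool(n_tasks, seconds, workers):
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        # Each worker runs its tasks one after another, so waits add up.
        pool.map(slow_io, [seconds] * n_tasks)
    return time.perf_counter() - start


def time_with_processes(n_tasks, seconds):
    start = time.perf_counter()
    procs = [Process(target=slow_io, args=(seconds,)) for _ in range(n_tasks)]
    for p in procs:
        p.start()  # all tasks wait on their "IO" at the same time
    for p in procs:
        p.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"Pool (2 workers): {time_with_pool(4, 0.5, 2):.2f} s")
    print(f"Process:          {time_with_processes(4, 0.5):.2f} s")
```

With four half-second waits and two workers, the Pool run needs at least one second, because each worker has to sit through two waits in a row, while the Process run lets all four waits overlap.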

Python Multiprocessing: Performance Comparison

In our case, the performance using the Pool class was as follows:

1) Using pool- 6 secs

2) Without using pool- 10 secs

Process() works by launching an independent system process for every parallel task you want to run. When we used the Process class, the machine became unstable because 1 million processes were created and loaded into memory.

To test further, we reduced the number of arguments in each expression and ran the code for 100 expressions.

The performance using Pool class is as follows:

1) Using pool- 4 secs

2) Without using pool- 3 secs

Then, we increased the arguments to 250 and executed those expressions.

The performance using Pool class is as follows:

1) Using pool- 0.6 secs

2) Without using pool- 3 secs
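Measurements like these could be taken with a small harness along the following lines. This is only a sketch: it assumes that "without using pool" means a plain serial loop, which may not match the original setup, and `evaluate` is a hypothetical stand-in that just sums the arguments of an expression.

```python
import time
from multiprocessing import Pool


def evaluate(expression_args):
    """Hypothetical stand-in for evaluating one expression: sum its arguments."""
    return sum(expression_args)


def timed(fn, *args):
    """Run fn(*args) and return its result with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_serial(expressions):
    return [evaluate(e) for e in expressions]


def run_pool(expressions, workers=4):
    with Pool(processes=workers) as pool:
        return pool.map(evaluate, expressions)


if __name__ == "__main__":
    # 100 expressions with 250 arguments each (sizes chosen for illustration)
    expressions = [list(range(250)) for _ in range(100)]
    serial_result, serial_secs = timed(run_serial, expressions)
    pool_result, pool_secs = timed(run_pool, expressions)
    assert serial_result == pool_result  # both approaches must agree
    print(f"without pool: {serial_secs:.3f} s, using pool: {pool_secs:.3f} s")
```

The timed region for the Pool run includes creating the pool, so the creation overhead discussed earlier shows up in the result.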

To summarize, the Pool class works better when there are many tasks and IO waits are short. The Process class works better when tasks are few and IO operations are long. What was your experience with Python multiprocessing? I would be more than happy to have a conversation around this. Get in touch with me here: priyanka.mane@ellicium.com