Up until now, I had been writing custom thread pools using threading.Thread, but I figured Python must already provide something for this. So I did some searching and found that multiprocessing.pool.Pool was exactly what I was looking for. Creating a process pool with it is super easy. What was I doing all this time?
Note: When I ran parallel processing with a process pool under uWSGI, it was slow; I haven't investigated the cause. When I switched to processing with threading.Thread, it became comfortable again.
from multiprocessing.pool import Pool

def test_procedure():
    test_objects = HogeModel.getxxxxx()  # fetch the objects to validate
    with Pool(20) as p:
        for result in p.map(validate_hoge, test_objects):
            if result:
                print(result)
    print('ok')

def validate_hoge(test_object):
    # Slow processing... returns the processing result
    ...
It would look something like this.
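One practical note: on start methods that spawn rather than fork (Windows, and macOS by default since Python 3.8), the pool must be created under an if __name__ == '__main__' guard, or the worker processes will re-import the module and recurse. Here is a minimal self-contained sketch along those lines, with a dummy sleep standing in for the real slow processing:

import time
from multiprocessing.pool import Pool

def validate_hoge(test_object):
    time.sleep(0.1)  # stand-in for the real slow processing
    return test_object * 2

if __name__ == '__main__':
    with Pool(20) as p:
        for result in p.map(validate_hoge, range(10)):
            print(result)
    print('ok')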
The key point is that the function you pass as the first argument to p.map must be defined at the top level of the module, not as a class method: the pool pickles it so that worker processes can look it up by name. Also, if you need to share memory between processes, an extra step is required.
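For that extra step, the standard library offers multiprocessing.Manager (and multiprocessing.Value for simple scalars). A small sketch using a Manager-backed list, with illustrative names that are not from the original post:

from functools import partial
from multiprocessing import Manager
from multiprocessing.pool import Pool

def record_result(shared_list, item):
    # Every worker appends to the same Manager-backed list.
    shared_list.append(item * item)

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.list()
        with Pool(4) as p:
            # partial binds the shared proxy; the proxy itself is picklable.
            p.map(partial(record_result, shared), range(10))
        print(list(shared))  # append order is nondeterministic

The Manager proxy works here precisely because it can be pickled and sent to the workers, which a plain Python list cannot.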
This time, since I'm assuming network-bound processing, I honestly don't think it makes much difference whether you parallelize with multiprocessing or threading; either is fine. However, if you want to use the CPU efficiently, multiprocessing is the better choice, because it is not constrained by the GIL.
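If you do decide threads are enough for I/O-bound work, the standard library even provides a thread-backed pool with the same interface, so the code barely changes. A sketch, again with a sleep standing in for the network wait:

import time
from multiprocessing.pool import ThreadPool

def validate_hoge(test_object):
    time.sleep(0.1)  # stand-in for network-bound work; waiting releases the GIL
    return test_object

with ThreadPool(20) as p:
    print(p.map(validate_hoge, range(10)))

Note that with ThreadPool nothing is pickled, so the top-level-function restriction disappears: lambdas and instance methods work as map arguments too, and no __main__ guard is needed.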
Also, the first argument of Pool specifies the number of worker processes; since I'm assuming network-bound processing, I've set it to a large number. For CPU-bound processing, creating a Pool with no arguments is convenient: it automatically sizes itself to the number of CPUs available.
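Concretely, Pool() with no argument uses os.cpu_count() workers, which is a sensible default for CPU-bound work:

import os
from multiprocessing.pool import Pool

def square(n):
    return n * n  # stand-in for CPU-bound work

if __name__ == '__main__':
    print('CPUs:', os.cpu_count())
    # No argument: the pool defaults to os.cpu_count() processes.
    with Pool() as p:
        print(p.map(square, range(8)))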
multiprocessing has convenient classes like Pool, so if you're writing parallel code for batch jobs, multiprocessing is the way to go.
I don't know the exact cost of spawning processes and passing data between them, but if you need to share a lot of memory and the work isn't CPU-heavy, threading might be the better fit.