I’m running a Python script that loads images, preprocesses them, and saves the results to a cache directory. There are about 500k images. It was taking too long to run for our use case, so I added multithreading. We’re running on a GCP Compute Engine VM instance, with the raw images located on an HDD and the preprocessed images being saved to an SSD.
When using 8 vCPUs and 12 threads, it processes around 30-40 images per second for the first 30 seconds or so. Awesome speed, and it totally works for our use case. Then, after about 30 seconds, throughput suddenly drops to around 1 image per second, which is not acceptable. The slowdown happens essentially instantaneously, and the rate stays at a consistent 1 image per second over the course of hours.
It behaves similarly when I limit the maximum number of threads to 4, or even to 1. In the single-worker case, it ran for about 3 minutes at 5 images per second before slowing to 1 image per second. So it seems to be some kind of rate limiting, resource exhaustion, or quota limit.
What could be causing this problem, and how can I resolve it?
Here’s a code sample showing usage (note that it uses tqdm for a progress bar):
```python
import concurrent.futures

import pandas as pd
from skimage import io  # note: imread/imsave come from scikit-image, not the stdlib io module
from tqdm import tqdm


def preprocess_row(raw_path, cache_path):
    img = io.imread(raw_path)
    img = ...  # Arbitrary preprocessing
    io.imsave(cache_path, img)
    result = ...  # Boolean that tracks whether an error occurred
    return result


df = pd.DataFrame(...)
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as pool:
    df['results'] = list(
        tqdm(
            pool.map(preprocess_row, df['raw_path'], df['cache_path'], chunksize=1),
            total=df.shape[0],
        )
    )
```
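One way to narrow down where the slowdown comes from is to time each stage (read, preprocess, write) separately, so that a degradation in HDD reads can be distinguished from one in SSD writes. Below is a minimal sketch of that idea; the `timed` helper, `preprocess_row_timed`, and the stage lambdas are hypothetical placeholders standing in for the real `io.imread`, preprocessing, and `io.imsave` calls.

```python
import time
from collections import defaultdict

# Per-stage wall-clock durations, keyed by a stage name such as 'read'.
stage_times = defaultdict(list)


def timed(stage, fn, *args, **kwargs):
    """Run fn and record its duration under the given stage name."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    stage_times[stage].append(time.perf_counter() - start)
    return result


def preprocess_row_timed(raw_path, cache_path):
    # The lambdas below are placeholders for io.imread, the arbitrary
    # preprocessing step, and io.imsave respectively.
    img = timed('read', lambda p: p, raw_path)
    img = timed('process', lambda x: x, img)
    timed('write', lambda p: None, cache_path)
    return True


def mean_latency(stage):
    """Mean recorded duration for a stage, or 0.0 if nothing was recorded."""
    times = stage_times[stage]
    return sum(times) / len(times) if times else 0.0
```

Comparing `mean_latency('read')` against `mean_latency('write')` before and after the slowdown should show which stage degrades when the rate drops.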
Source: Python Questions