Parallelization and profiling#

If you’re one of those people whose scripts always run in a second or less, you can probably skip this tutorial. But if you have time to make yourself a cup of tea while your code is running, you might want to read on. This tutorial covers how to run code in parallel, and how to check its performance to look for improvements.

Parallelization#

Parallelization in Python#

Scary stories of Python’s “global interpreter lock” aside, parallelization is actually fairly simple in Python. However, it’s not particularly intuitive or flexible. We can do vanilla parallelization in Python via something like this:

[1]:
import multiprocessing as mp

# Define a function
def my_func(x):
    return x**2

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(my_func, [1,2,3])

print(results)
[1, 4, 9]

So far so good. But what if we have something more complicated? What if we want to run our function with a different keyword argument, for example? It starts getting kind of crazy:

[2]:
from functools import partial

# Define a (slightly) more complex function
def complex_func(x, arg1=2, arg2=4):
    return x**2 + (arg1 * arg2)

# Make a new function with a different default argument 😱
new_func = partial(complex_func, arg2=10)

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(new_func, [1,2,3])

print(results)
[21, 24, 29]

This works, but that sure was a lot of work just to set a single keyword argument!

Parallelization in Sciris#

With Sciris, you can do it all with one line:

[3]:
import sciris as sc

results = sc.parallelize(complex_func, [1,2,3], arg2=10)

print(results)
[21, 24, 29]

What’s happening here? sc.parallelize() lets you pass keyword arguments directly to the function you’re calling. You can also iterate over multiple arguments rather than just one:

[4]:
args = dict(x=[1,2,3], arg2=[10,20,30])

results = sc.parallelize(complex_func, iterkwargs=args)

print(results)
[21, 44, 69]

(Of course you can do this with vanilla Python too, but you’ll need to define a list of tuples, and you can only assign by position, not by keyword.)
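For comparison, here's a minimal sketch of what that pure-Python version looks like, using pool.starmap() with a list of tuples (this reuses mp and complex_func from the cells above; note that every argument, even ones we'd rather leave at their defaults, must be given by position):

# Vanilla equivalent: a list of tuples, assigned by position only
with mp.Pool() as pool:
    results = pool.starmap(complex_func, [(1, 2, 10), (2, 2, 10), (3, 2, 10)])

print(results) # [21, 24, 29]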

Depending on what you want to run, your inputs might come in one of several different forms: you can supply a list of values (or tuples of values) via iterarg, or a dict of lists or a list of dicts via iterkwargs. An example will probably help:

[5]:
def mult(x, y):
    return x*y

r1 = sc.parallelize(mult, iterarg=[(1,2),(2,3),(3,4)])
r2 = sc.parallelize(mult, iterkwargs={'x':[1,2,3], 'y':[2,3,4]})
r3 = sc.parallelize(mult, iterkwargs=[{'x':1, 'y':2}, {'x':2, 'y':3}, {'x':3, 'y':4}])
print(f'{r1 = }')
print(f'{r2 = }')
print(f'{r3 = }')
r1 = [2, 6, 12]
r2 = [2, 6, 12]
r3 = [2, 6, 12]

All of these are equivalent: choose whichever makes you happy.
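You can also mix iterated and fixed arguments, since (as we saw earlier) any extra keyword arguments are passed straight through to the function on every call. For example:

# Iterate over x while holding y fixed
r4 = sc.parallelize(mult, iterkwargs={'x':[1,2,3]}, y=10)
print(f'{r4 = }') # r4 = [10, 20, 30]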

Advanced usage#

There are lots and lots of options with parallelization, but we’ll only cover a couple here. For example, if you want to start 200 jobs on your laptop with 8 cores, you probably don’t want them to eat up all your CPU or memory and make your computer unusable. You can set maxcpu and maxmem limits to handle that:

[6]:
import numpy as np
import pylab as pl

# Define the function
def rand2d(i, x, y):
    np.random.seed() # Reseed the generator so each process gets different random numbers
    xy = [x+i*np.random.randn(100), y+i*np.random.randn(100)]
    return (i,xy)

# Run in parallel
xy = sc.parallelize(
    func     = rand2d,   # The function to parallelize
    iterarg  = range(5), # Values for first argument
    maxcpu   = 0.8,      # CPU limit (1 = no limit)
    maxmem   = 0.9,      # Memory limit (1 = no limit)
    interval = 0.2,      # How often to re-check the limits (in seconds)
    x = 3, y = 8,        # Keyword arguments for the function
)

# Plot
pl.figure()
colors = sc.gridcolors(len(xy))
for i,(x,y) in reversed(xy): # Reverse order to plot the most widely spaced dots first
    pl.scatter(x, y, c=[colors[i]], alpha=0.7, label=f'Scale={i}')
pl.legend();
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 0 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 1 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 2 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 3 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 4 after 1 tries
[Figure: scatter plot of the five point clouds, with spread increasing with scale]

So far, we’ve used sc.parallelize() as a function. But you can also use it as a class, which gives you more flexibility and control over which jobs are run, and provides more information if any of them fail:

[7]:
def slow_func(i=1):
    sc.randsleep(seed=i)
    if i == 4:
        raise Exception("I don't like seed 4")
    return i**2

# Create the parallelizer object
P = sc.Parallel(
    func = slow_func,
    iterarg = range(10),
    parallelizer = 'multiprocess-async', # Run asynchronously
    die = False, # Keep going if a job crashes
)

# Actually run
P.run_async()

# Monitor progress
P.monitor()

# Get results
P.finalize()

# See how long things took
print(P.times)
Job 4/10 (2.3 s) ••••••••••••—————————————————— 40%
/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/multiprocess/pool.py:48: RuntimeWarning: sc.parallelize(): Task 4 failed, but die=False so continuing.
Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/sciris/sc_parallel.py", line 835, in _task
    result = func(*args, **kwargs) # Call the function!
             ^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipykernel_1915/2785706684.py", line 4, in slow_func
    raise Exception("I don't like seed 4")
Exception: I don't like seed 4

  return list(map(*args))
#0. 'started':  datetime.datetime(2024, 4, 1, 23, 20, 29, 452539)
#1. 'finished': datetime.datetime(2024, 4, 1, 23, 20, 36, 65270)
#2. 'elapsed':  6.612731
#3. 'jobs':     [1.278202772140503, 1.0245444774627686, 0.5267577171325684,
0.17212557792663574, 1.8923094272613525, 1.611030101776123, 1.077981948852539,
1.251112699508667, 0.654982328414917, 1.7413651943206787]
/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/sciris/sc_parallel.py:543: RuntimeWarning: Only 9 of 10 jobs succeeded; see exceptions attribute for details
  self.process_results()

You can see it raised some warnings. These are stored in the Parallel object so we can check back and see what happened:

[8]:
print(f'{P.success = }')
print(f'{P.exceptions = }')
print(f'{P.results = }')
P.success = [True, True, True, True, False, True, True, True, True, True]
P.exceptions = [None, None, None, None, Exception("I don't like seed 4"), None, None, None, None, None]
P.results = [0, 1, 4, 9, None, 25, 36, 49, 64, 81]

Hopefully, you will never need to run a function as poorly written as slow_func()!
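If you do end up with a partial failure like this, one simple follow-up (a sketch using the attributes shown above) is to filter the results down to the jobs that succeeded:

# Keep only the results from jobs that succeeded
good_results = [r for r,ok in zip(P.results, P.success) if ok]
print(good_results) # [0, 1, 4, 9, 25, 36, 49, 64, 81]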

Profiling#

Even parallelization can’t save you if your code is just really slow. Sciris provides a variety of tools to help with this.

Benchmarking#

First off, we can check whether our computer is performing as expected; this is also handy for comparing performance across computers:

[9]:
bm = sc.benchmark() # Check CPU performance, in units of MOPS (million operations per second)
ml = sc.memload() # Check total memory load
ram = sc.checkram() # Check RAM used by this Python instance

print('CPU performance: ', dict(bm))
print('System memory load', ml)
print('Python RAM usage', ram)
CPU performance:  {'python': 4.420045548484547, 'numpy': 138.59488716438034}
System memory load 0.201
Python RAM usage 153.78 MB

We can see that NumPy performance is much higher than Python – hundreds of MOPS† instead of single digits. This makes sense: it’s exactly why we use NumPy for array operations!

† The definition of a single “operation” is a little loose, so these “MOPS” are useful for relative comparisons, but aren’t directly comparable to, say, published processor speeds.
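If you’d like to get a feel for the gap yourself, here’s a quick micro-benchmark sketch using sc.timer() as a context manager (assuming the numpy and sciris imports from earlier; exact timings will of course vary by machine):

# Compare a pure-Python sum with the equivalent NumPy operation
n = 1_000_000
arr = np.random.rand(n)

with sc.timer('Python sum'):
    total = sum(arr.tolist())

with sc.timer('NumPy sum'):
    total = arr.sum()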

Line profiling#

If you want to do serious profiling of your code, take a look at Austin. But if you just want to get a quick sense of where things might be slow, you can use sc.profile(). Applying it to our lousy slow_func() from before:

[10]:
sc.profile(slow_func)
Profiling...
Timer unit: 1e-09 s

Total time: 1.02409 s
File: /tmp/ipykernel_1915/2785706684.py
Function: slow_func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def slow_func(i=1):
     2         1 1024088429.0    1e+09    100.0      sc.randsleep(seed=i)
     3         1       1747.0   1747.0      0.0      if i == 4:
     4                                                   raise Exception("I don't like seed 4")
     5         1       1411.0   1411.0      0.0      return i**2

Done.
[10]:
<line_profiler.line_profiler.LineProfiler at 0x7f0c9c5f5630>

We can see that 100% (well, 99.9997%) of the time was taken by the sleep function. This is not surprising, but it’s reassuring to see the profiler confirm it!

For a slightly more realistic example:

[11]:
def func():
    n = 1000

    # Do some NumPy
    v1 = np.random.rand(n,n)
    v2 = np.random.rand(n,n)
    v3 = v1*v2

    # Do some Python
    means = []
    for i in range(n):
        means.append(sum(v3[i])/n)

sc.profile(func)
Profiling...
Timer unit: 1e-09 s

Total time: 0.110839 s
File: /tmp/ipykernel_1915/701805461.py
Function: func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def func():
     2         1        803.0    803.0      0.0      n = 1000
     3
     4                                               # Do some NumPy
     5         1   10601157.0    1e+07      9.6      v1 = np.random.rand(n,n)
     6         1    9473091.0    9e+06      8.5      v2 = np.random.rand(n,n)
     7         1    3943569.0    4e+06      3.6      v3 = v1*v2
     8
     9                                               # Do some Python
    10         1       1484.0   1484.0      0.0      means = []
    11      1001     278894.0    278.6      0.3      for i in range(n):
    12      1000   86539806.0  86539.8     78.1          means.append(sum(v3[i])/n)

Done.
[11]:
<line_profiler.line_profiler.LineProfiler at 0x7f0c9c5f43d0>

We can see (from the “% Time” column) that, again not surprisingly, the Python math operation is much slower than the NumPy operations.
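The fix, of course, is to push that loop into NumPy too. Here’s a sketch of a vectorized equivalent (fast_func is our name for it, not part of the original example):

def fast_func():
    n = 1000
    v1 = np.random.rand(n,n)
    v2 = np.random.rand(n,n)
    v3 = v1*v2
    means = v3.mean(axis=1) # Compute all row means at once instead of looping
    return means

Profiling this with sc.profile(fast_func) should show the total time dominated by the array creation rather than the loop.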