Multi-processing Example

Scripting style

Start with working code that is clear, simple, and runs top to bottom. This is easy to develop and test incrementally.

import urllib.request

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

for url in sites:
    with urllib.request.urlopen(url) as u:
        page = u.read()
        print(url, len(page))

Function style

import urllib.request

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

def sitesize(url):
    ''' Determine the size of a website '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)

for result in map(sitesize, sites):
    print(result)

Note

A good development strategy is to use map to test the code in a single process and single thread mode before switching to multi-processing.

What is parallelizable?

A key pattern of thinking is to divide the world into “lawn mowing” versus “baby making” – identifying tasks that are significantly parallelizable versus those that are intrinsically serial.

Amdahl’s Law (according to Wikipedia):

Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing is relevant only for a low number of processors and very parallelizable programs.
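In general form, with parallel fraction p and n processors, the predicted speedup is 1 / ((1 − p) + p/n). A quick sketch verifying the numbers above:

def amdahl_speedup(p, n):
    ''' Theoretical speedup with parallel fraction p on n processors '''
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 8))        # ~5.9x on 8 processors
print(amdahl_speedup(0.95, 4096))     # approaches the 1/(1-p) = 20x ceiling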

Detailed example:

def sitesize(url):
    ''' Determine the size of a website

    This is non-parallelizable:
    * UDP DNS request for the url
    * UDP DNS response
    * Acquire socket from the OS
    * TCP Connection:  SYN, ACK, SYN/ACK
    * Send HTTP Request for the root resource
    * Wait for the TCP response which is broken-up
      into packets.
    * Count the characters of the webpage

    This is a bit parallelizable:
    Do ten times in parallel (channel bonding):
        1) DNS lookup (UDP request and response)
        2) Acquire the socket
        3) Send HTTP range requests
        4) The sections come back in parallel
           across different pieces of fiber.
           http://stackoverflow.com/questions/8293687/sample-http-range-request-session
        5) Count the characters for a single
           block as received.
    Add up the 10 results!

    '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)
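A sketch of the “a bit parallelizable” path, using the thread pool introduced in the next section to issue HTTP range requests concurrently. The 100,000-byte total and the split into ten blocks are made-up, and the approach only helps when the server honors the Range header; a server that ignores it returns the full page for every request:

import urllib.request
from multiprocessing.pool import ThreadPool

def fetch_block(args):
    ''' Fetch one byte-range of the resource and return its size '''
    url, start, stop = args
    req = urllib.request.Request(url)
    req.add_header('Range', 'bytes=%d-%d' % (start, stop))
    with urllib.request.urlopen(req) as u:
        return len(u.read())

url = 'http://www.python.org'
blocks = [(url, start, start + 9999) for start in range(0, 100000, 10000)]
with ThreadPool(10) as pool:
    print(url, sum(pool.map(fetch_block, blocks)))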

Pools of processes

import urllib.request
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing.pool import Pool

sites = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

def sitesize(url):
    ''' Determine the size of a website '''
    with urllib.request.urlopen(url) as u:
        page = u.read()
        return url, len(page)

pool = Pool(10)
for result in pool.imap_unordered(sitesize, sites):
    print(result)

Note

imap_unordered is used to improve responsiveness: each result is printed as soon as it arrives, rather than in the order the tasks were submitted.

Note

The use of imap_unordered is made possible by designing the function to return both its argument and its result as a tuple.
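Because each result identifies its own URL, out-of-order arrival is harmless. Continuing the pool example above, the results can even be collected directly into a dict:

sizes = dict(pool.imap_unordered(sitesize, sites))
print(sizes['http://www.python.org'])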

Hazards of thin channel communication (SQL Version)

Note

Don’t make too many trips back and forth

Note

Do significant work on each trip

Note

Don’t send or receive a lot of data

###########################################################################################
# Bringing too much back and not doing enough work while you are there

summary = collections.Counter()
for employee, dept, salary in c.execute('SELECT employee, dept, salary FROM Employee'):
    summary[dept] += salary          # every row crosses the channel just to be summed


###########################################################################################
# Too many trips back and forth

summary = dict()
for (dept,) in c.execute('SELECT DISTINCT dept FROM Employee').fetchall():
    c.execute('SELECT SUM(salary) FROM Employee WHERE dept=?', [dept])
    summary[dept] = c.fetchone()[0]


###########################################################################################
# Right way is one trip where a lot of work gets done and only a summary result is returned

summary = dict(c.execute('SELECT dept, SUM(salary) FROM Employee GROUP BY dept'))
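A minimal, self-contained sketch of the one-trip pattern, assuming an in-memory sqlite3 database and a made-up Employee table:

import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE Employee (employee TEXT, dept TEXT, salary REAL)')
c.executemany('INSERT INTO Employee VALUES (?, ?, ?)', [
    ('alice', 'eng', 100.0),
    ('bob', 'eng', 90.0),
    ('carol', 'sales', 80.0),
])

# One round trip:  the database does the aggregation and only
# a small summary crosses the channel
summary = dict(c.execute('SELECT dept, SUM(salary) FROM Employee GROUP BY dept'))
print(summary)          # {'eng': 190.0, 'sales': 80.0}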

Performance hazards for multi-processing

###########################################################################################
# Too many trips back and forth
# If the input iterable to map is very large, it suggests you're making too many trips

def sitesize(url, start):
    req = urllib.request.Request(url)
    req.add_header('Range', 'bytes=%d-%d' % (start, start + 1000))
    with urllib.request.urlopen(req) as u:
        block = u.read()
    return url, len(block)


###########################################################################################
# Not doing enough work relative to the travel time
# Once you get to a process, be sure to do enough work to make the trip worthwhile

def sitesize(url, results):
    with urllib.request.urlopen(url) as u:
        for line in u:
            results.put((url, len(line)))    # one tiny result per line -- too chatty


###########################################################################################
# Taking too much with you or bringing too much back

def sitesize(url):
    with urllib.request.urlopen(url) as u:
        page = u.read()
    return url, page        # ships the entire page back instead of just its length

Other Multi-processing notes

Note

Never run a multi-processing example from within an IDE that runs in the same process as the code you are developing. Otherwise, the forking step will fork the IDE itself as well as your code.
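Relatedly, a script that creates worker processes should guard its entry point. On platforms that spawn rather than fork, each worker re-imports the main module, and without the guard the workers would try to create workers of their own. A minimal sketch:

from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(work, range(10)))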

Note

When partitioning into subtasks, a common challenge is how to handle data at the boundaries of the partition.
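For example, when counting words over fixed-size chunks of a text, a word can straddle a chunk boundary. One common fix, sketched here with a made-up helper, is to extend each chunk to the next whitespace:

def chunk_on_whitespace(text, size):
    ''' Partition text into chunks, moving each boundary to the next space '''
    start = 0
    while start < len(text):
        stop = min(start + size, len(text))
        while stop < len(text) and not text[stop].isspace():
            stop += 1                       # extend past a straddling word
        yield text[start:stop]
        start = stop

chunks = list(chunk_on_whitespace('the quick brown fox jumps', 7))
assert sum(len(chunk.split()) for chunk in chunks) == 5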

Note

Setting the number of processes is a bit of an art. If the code is CPU bound, the number of cores times two is a reasonable starting point. If the code is IO bound, the number of cores can be much higher. Experimentation is the key.
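A sketch of those starting points, using os.cpu_count() to find the number of cores:

import os
from multiprocessing import Pool

ncores = os.cpu_count() or 1

pool = Pool(ncores * 2)         # CPU bound:  twice the core count as a starting point
# pool = Pool(ncores * 10)      # IO bound:   workers mostly wait, so go much higher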