Bug in multiprocessing + Pipes on macOS #101225
I've been doing some debugging, and the error behind workers getting `EOFError` or `OSError` is a connection-refused error: `[DEBUG/SpawnPoolWorker-4] worker got EOFError or OSError -- exiting: [Errno 61] Connection refused`
This appears to be caused by macOS doing exactly what's being asked: the listen queue for the resource sharer is full, hence the `ConnectionRefusedError`. This is due to `resource_sharer` using the default backlog when creating a `connection.Listener`, and that default is 1.
This PR uses the same backlog value when creating the `connection.Listener` in `resource_sharer` as is used by the manager. On macOS the default backlog (1) is small enough to cause the socket accept queue to fill up when starting a number of children.
The attached PR fixes the issue for me, even when I raise the number of jobs to 50.
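For illustration, here is a rough before/after sketch of the kind of change described above. It is based on this thread's description, not a verbatim quote of the PR, and the backlog value of 16 is only an assumption about what "the same value as the manager" means:

```python
import multiprocessing as mp
from multiprocessing.connection import Listener

# Roughly what resource_sharer does today: no explicit backlog, so the
# Listener default of 1 applies, and macOS refuses connections once the
# accept queue is full.
listener_today = Listener(authkey=mp.current_process().authkey)

# The shape of the described change: pass a larger, explicit backlog.
# 16 is illustrative only; the actual value comes from the linked PR.
listener_patched = Listener(authkey=mp.current_process().authkey, backlog=16)
```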
Bug report
I believe I've found a bug in how the `multiprocessing` package passes the `Connection`s that `Pipe` creates down to the child worker process, but only on macOS. The following minimal example demonstrates the problem:
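The original example is not reproduced in this extract; the following is a sketch reconstructed from the description below (the function and variable names, the logging setup, and the progress loop are my own), showing a `Pipe`'s child `Connection` being passed through the `args` of `apply_async()` to a spawn-based `Pool`:

```python
import logging
import multiprocessing as mp
import time

NUM_JOBS = 20


def job(n, child):
    # Nothing is ever sent through the pipe; the Connection is only passed along.
    print(f"Nth is {n}")
    return n


def main():
    ctx = mp.get_context("spawn")       # the default start method on macOS
    mp.log_to_stderr(logging.DEBUG)     # surfaces the [DEBUG/...] worker messages
    results = []

    with ctx.Pool() as pool:            # more workers makes the failure more likely
        for n in range(NUM_JOBS):
            parent, child = ctx.Pipe()
            # child = None              # uncommenting this makes the script work on macOS
            results.append(pool.apply_async(job, args=(n, child)))
        pool.close()

        # Report progress; on macOS this frequently loops forever because some
        # tasks are silently lost.
        while True:
            remaining = sum(not r.ready() for r in results)
            if remaining == 0:
                break
            print(f"{remaining} jobs remaining")
            time.sleep(1)
        pool.join()


if __name__ == "__main__":
    main()
```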
On Linux, this script prints `Nth is 0`, etc., 20 times and exits. On macOS, it does the same if the line `child = None` is not commented out. If that line is commented out - i.e., if the child `Connection` is passed in the `args` of `apply_async()` - not all the jobs are done, and the script will frequently (if not always) loop forever, reporting some number of jobs remaining.

The logging shows approximately what's happening: the output will have a number of lines of this form:
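The log excerpt itself is missing from this extract, but given the message quoted in the comment above, the lines are presumably of this form:

```
[DEBUG/SpawnPoolWorker-4] worker got EOFError or OSError -- exiting
```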
and the number of log records of that type is exactly the number of jobs reported remaining. This debug message is reported by the `worker()` function in `multiprocessing/pool.py` as it dequeues a task (see the sketch below). When I insert a traceback printout before the debug statement, I find that it's reporting `ConnectionRefusedError`, presumably as it attempts to unpickle the `Connection` object in the worker. The error is caught and the worker exits, but it's already dequeued the task, so the task never gets done.
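For context, the relevant part of that worker loop looks roughly like this (paraphrased from `Lib/multiprocessing/pool.py`, simplified and not a verbatim quote): the queue's `get()` both receives the task bytes and unpickles them, so by the time unpickling the `Connection` raises `ConnectionRefusedError` (an `OSError`), the task has already been consumed and is lost.

```python
while True:  # simplified; the real loop also honors maxtasksperchild
    try:
        task = get()                 # receives *and* unpickles the task
    except (EOFError, OSError):
        util.debug('worker got EOFError or OSError -- exiting')
        break
    # ... run the task and send back the result ...
```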
Note that this has to be due to the `Connection` object being passed; if I uncomment `child = None`, the code works fine. Note that it also has nothing to do with anything passed through the `Pipe`, since the code passes nothing through the pipe. It also has nothing to do with the connection objects being garbage collected because there's no reference to them in the parent process; if I save them in a global list, I get the same error.

I don't understand how this could possibly happen; the `Pipe` is created with `socket.socketpair()`, and I was under the impression that sockets created that way don't require any other initialization to communicate. I do know that it's a race condition; if I insert a short sleep after I create the `Pipe`, say 0.1 second, the code works fine. I've also observed that this is much more likely to happen with large numbers of workers; if the number of workers is 2, I almost never observe the problem.

Your environment
Breaks:
Works: