Bug in multiprocessing + Pipes on macOS #101225
I've been doing some debugging, and the error behind workers getting `EOFError` or `OSError` is a connection-refused error: `[DEBUG/SpawnPoolWorker-4] worker got EOFError or OSError -- exiting: [Errno 61] Connection refused`
This appears to be caused by macOS doing exactly what's being asked: the listen queue for the resource sharer is full, hence the `ConnectionRefusedError`. This is due to `resource_sharer` using the default backlog when creating a `connection.Listener`, and that default is 1.
This PR uses the same backlog value when creating the `connection.Listener` in `resource_sharer` as is used by the manager. On macOS the default backlog (1) is small enough to cause the socket accept queue to fill up when starting a number of children.
The attached PR fixes the issue for me, even when I raise the number of jobs to 50.
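For illustration, here is a rough before/after sketch of the kind of change described above. It is based on this thread's description, not a verbatim quote of the PR, and the backlog value of 16 is only an assumption about what "the same value as the manager" means:

```python
import multiprocessing as mp
from multiprocessing.connection import Listener

# Roughly what resource_sharer does today: no explicit backlog, so the
# Listener default of 1 applies, and macOS refuses connections once the
# accept queue is full.
listener_today = Listener(authkey=mp.current_process().authkey)

# The shape of the described change: pass a larger, explicit backlog.
# 16 is illustrative only; the actual value comes from the linked PR.
listener_patched = Listener(authkey=mp.current_process().authkey, backlog=16)
```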
Bug report
I believe I've found a bug in how the `multiprocessing` package passes the `Connection`s that `Pipe` creates down to the child worker process, but only on macOS. The following minimal example demonstrates the problem:
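The original example is not reproduced in this extract; the following is a sketch reconstructed from the description below (the function and variable names, the logging setup, and the progress loop are my own), showing a `Pipe`'s child `Connection` being passed through the `args` of `apply_async()` to a spawn-based `Pool`:

```python
import logging
import multiprocessing as mp
import time

NUM_JOBS = 20


def job(n, child):
    # Nothing is ever sent through the pipe; the Connection is only passed along.
    print(f"Nth is {n}")
    return n


def main():
    ctx = mp.get_context("spawn")       # the default start method on macOS
    mp.log_to_stderr(logging.DEBUG)     # surfaces the [DEBUG/...] worker messages
    results = []

    with ctx.Pool() as pool:            # more workers makes the failure more likely
        for n in range(NUM_JOBS):
            parent, child = ctx.Pipe()
            # child = None              # uncommenting this makes the script work on macOS
            results.append(pool.apply_async(job, args=(n, child)))
        pool.close()

        # Report progress; on macOS this frequently loops forever because some
        # tasks are silently lost.
        while True:
            remaining = sum(not r.ready() for r in results)
            if remaining == 0:
                break
            print(f"{remaining} jobs remaining")
            time.sleep(1)
        pool.join()


if __name__ == "__main__":
    main()
```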
On Linux, this script prints `Nth is 0`, etc., 20 times and exits. On macOS, it does the same if the line `child = None` is not commented out. If that line is commented out - i.e., if the child `Connection` is passed in the `args` of `apply_async()` - not all the jobs are done, and the script will frequently (if not always) loop forever, reporting some number of jobs remaining.

The logging shows approximately what's happening: the output will have a number of lines of this form:
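The log excerpt itself is missing from this extract, but given the message quoted in the comment above, the lines are presumably of this form:

```
[DEBUG/SpawnPoolWorker-4] worker got EOFError or OSError -- exiting
```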
and the number of log records of that type is exactly the number of jobs reported remaining. This debug message is reported by the `worker()` function in `multiprocessing/pool.py` as it dequeues a task (see the sketch below). When I insert a traceback printout before the debug statement, I find that it's reporting `ConnectionRefusedError`, presumably as it attempts to unpickle the `Connection` object in the worker. The error is caught and the worker exits, but it's already dequeued the task, so the task never gets done.
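For context, the relevant part of that worker loop looks roughly like this (paraphrased from `Lib/multiprocessing/pool.py`, simplified and not a verbatim quote): the queue's `get()` both receives the task bytes and unpickles them, so by the time unpickling the `Connection` raises `ConnectionRefusedError` (an `OSError`), the task has already been consumed and is lost.

```python
while True:  # simplified; the real loop also honors maxtasksperchild
    try:
        task = get()                 # receives *and* unpickles the task
    except (EOFError, OSError):
        util.debug('worker got EOFError or OSError -- exiting')
        break
    # ... run the task and send back the result ...
```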
Note that this has to be due to the `Connection` object being passed; if I uncomment `child = None`, the code works fine. Note that it also has nothing to do with anything passed through the `Pipe`, since the code passes nothing through the pipe. It also has nothing to do with the connection objects being garbage collected because there's no reference to them in the parent process; if I save them in a global list, I get the same error.

I don't understand how this could possibly happen; the `Pipe` is created with `socket.socketpair()`, and I was under the impression that sockets created that way don't require any other initialization to communicate. I do know that it's a race condition; if I insert a short sleep after I create the `Pipe`, say 0.1 second, the code works fine. I've also observed that this is much more likely to happen with large numbers of workers; if the number of workers is 2, I almost never observe the problem.

Your environment
Breaks:
Works: