Possible deadlock at shutdown while recursively acquiring head lock #102126
I'm the downstream bug reporter, and I have verified on my M1 Pro Mac OS 12.6.2 laptop that:
I have not had time to find a more minimal reproduction. |
I've created a smaller reproducible example here: https://github.com/sergei-maertens/threading-deadlock The advantage is that it's possible to reproduce on a linux/docker context and there's no actual application code involved except for two libraries that happen to trigger this when combined (coverage + hypothesis). @CharString remarked that the hypothesis bug report requires two calls to the I looked into the coverage code a bit yesterday and they do some funky stuff to eventually pass it to |
It appears that Python 3.10.10 introduced a regression that can cause deadlocks on threading.local which manifests in CI at the moment for us. For "better safe than sorry" reasons the Docker image version is also pinned as we do NOT want a risk of deadlocks in production. For more information, see python/cpython#102126
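As a complement to pinning the image, a start-up check can flag an interpreter from the release this thread reports as affected. This is a minimal sketch: `AFFECTED_RELEASES` and `is_affected_release` are hypothetical names, and the set contains only 3.10.10 because that is the only release reported here. Whether other releases are affected is not established in this thread.

```python
import sys

# Releases reported as affected in this thread (assumption: only 3.10.10;
# extend the set yourself if you confirm others).
AFFECTED_RELEASES = {(3, 10, 10)}


def is_affected_release(version_info=None):
    """Return True if the (major, minor, micro) triple is a reported-affected release."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) in AFFECTED_RELEASES


if __name__ == "__main__":
    if is_affected_release():
        print("Warning: this interpreter is a release reported in python/cpython#102126",
              file=sys.stderr)
```

A CI job could call `is_affected_release()` early and fail fast instead of hanging at shutdown.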
I was able to repro this issue on my Mac using Docker and the instructions at https://github.com/sergei-maertens/threading-deadlock I also attempted to repro using the same
|
Hey, I've run into the same problem and went to #python on liberachat where I got some help. I have a django project that uses django-haystack, and the same phenomenon occurs. Because of this bug, my entire CI and production system started acting weird because of the hanging processes. Upon debugging the hanging process with gdb with the python extension enabled, I've got the following debug log:
Hope this can help somebody. |
@sergei-maertens I can reproduce the bug with your test on my CI, FreeBSD jail with python3.10.10, and not with python3.10.9. Python is clearly the culprit. Haven't tried on my Linux box, but probably it would result in the same phenomenon. |
Compiling python from the main branch gives a python that doesn't have this issue. I tried to revert to 762745a in there to see if the repo state there results in a |
@karolyi if you try compiling Python from a different commit, say the same commit that was released as |
@carljm the tag v3.10.10 refers to a commit in the 3.10 branch, and the entire 3.10 branch is now broken with respect to this bug as of the referenced merge. Reverting the referenced (cherry-picked) commit in the 3.10 branch fixed the issue, so the culprit is indeed 762745a. However, checking out the main branch at this commit results in a working 3.12, so git bisect isn't helping here. Something in the main branch fixes this issue, probably architectural changes that are much harder for me to track down. |
The reason I ask is because on Ubuntu Linux, as mentioned above, I couldn't reproduce the issue even on 3.10 branch, when compiling Python myself. So are you saying that when compiling Python yourself, you are able to repro the issue from 3.10 branch, but not from main? I didn't see that clearly mentioned above. (When you say you reproed with 3.10.10 and not 3.10.9, it's not clear where those Pythons came from.) |
Interesting. I'm testing on my FreeBSD box now, since I don't use docker and I don't want to pollute my linux box (Manjaro) with a self compiled python. I might just try it with an actual Linux VM. Also the production server which is an ubuntu 20.04 (focal fossa) using http://ppa.launchpad.net/deadsnakes/ppa/ubuntu/, suffers from this issue on python 3.10.10 as well.
Those were the official tarballs from the python.org site, until I started compiling from the github git repo. |
Working on a fix... |
Marking release blocker, cc @pablogsal @ambv |
I'm bisecting the main branch as we speak, cherry-picking 762745a on top of each bisect operation. There is a commit in main that fixes this issue, and I'm eager to find out which one it is. |
Whatever that commit does, the problem exists in main too. It must just be delaying the call to the finalizer so that it doesn't deadlock. |
Are you able to reproduce the problem in main as well? For me on FreeBSD, the main branch doesn't hang. |
@karolyi Can you try applying the following patch on 3.10 and test it? It is #102222 backported to 3.10 head.
diff --git a/Python/pystate.c b/Python/pystate.c
index df98eb11bb..c7a6af5da8 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -293,11 +293,19 @@ interpreter_clear(PyInterpreterState *interp, PyThreadState *tstate)
_PyErr_Clear(tstate);
}
+ // Clear the current/main thread state last.
HEAD_LOCK(runtime);
- for (PyThreadState *p = interp->tstate_head; p != NULL; p = p->next) {
+ PyThreadState *p = interp->tstate_head;
+ HEAD_UNLOCK(runtime);
+ while (p != NULL) {
+ // See https://github.com/python/cpython/issues/102126
+ // Must be called without HEAD_LOCK held as it can deadlock
+ // if any finalizer tries to acquire that lock.
PyThreadState_Clear(p);
+ HEAD_LOCK(runtime);
+ p = p->next;
+ HEAD_UNLOCK(runtime);
}
- HEAD_UNLOCK(runtime);
Py_CLEAR(interp->audit_hooks);
|
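The locking pattern in the patch can be illustrated in Python. This is only a sketch of the idea and not CPython's implementation: `head_lock`, `ThreadState`, `make_chain` and both `clear_*` functions are hypothetical stand-ins. The finalizer uses a timeout so the demo reports the would-be deadlock instead of hanging.

```python
import threading

# Non-reentrant lock, standing in for the runtime head lock (assumption).
head_lock = threading.Lock()


class ThreadState:
    """Hypothetical linked-list node whose clear() needs head_lock."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.cleared = False

    def clear(self):
        # Stand-in for a finalizer that tries to take the head lock.
        if not head_lock.acquire(timeout=0.1):
            raise RuntimeError(f"{self.name}: finalizer would deadlock on head_lock")
        try:
            self.cleared = True
        finally:
            head_lock.release()


def make_chain(names):
    """Build a singly linked list of ThreadState objects and return its head."""
    head = None
    for name in reversed(names):
        node = ThreadState(name)
        node.next = head
        head = node
    return head


def clear_while_holding_lock(head):
    # Old shape: the lock is held across every clear() call.
    with head_lock:
        p = head
        while p is not None:
            p.clear()
            p = p.next


def clear_without_holding_lock(head):
    # Patched shape: take the lock only to read the list pointers.
    with head_lock:
        p = head
    while p is not None:
        p.clear()
        with head_lock:
            p = p.next
```

With a chain from `make_chain(["main", "worker"])`, `clear_while_holding_lock` raises `RuntimeError` on the first node, while `clear_without_holding_lock` clears every node.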
Yes, will do, hold on. In the meantime, you can look into c314198, it's the commit with which the main branch doesn't hang, reverting it causes the main branch to hang. |
This is a GC timing issue, which is unpredictable by nature. Any unrelated change can alter the way objects are deallocated or when finalizers are called, and possibly delay them. |
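A small illustration of why finalizer timing is fragile, assuming CPython's reference counting plus cyclic collector: an object with no cycle is finalized as soon as its last reference goes away, while objects in a reference cycle wait for a collector run. `finalizer_timing_demo` and `Tracked` are hypothetical names used only for this sketch.

```python
import gc


def finalizer_timing_demo():
    """Return snapshots of which finalizers have run at three points in time."""
    log = []

    class Tracked:
        def __init__(self, name):
            self.name = name
            self.partner = None

        def __del__(self):
            log.append(self.name)

    was_enabled = gc.isenabled()
    gc.disable()  # keep the cyclic collector from running on its own
    try:
        solo = Tracked("solo")
        del solo  # no cycle: finalized immediately by reference counting
        after_del_solo = list(log)

        left, right = Tracked("left"), Tracked("right")
        left.partner, right.partner = right, left
        del left, right  # cycle: nothing is finalized yet
        after_del_cycle = list(log)

        gc.collect()  # the cycle is finalized only now
        after_collect = list(log)
    finally:
        if was_enabled:
            gc.enable()
    return after_del_solo, after_del_cycle, after_collect
```

Any change that creates or breaks such a cycle moves the finalizer call to a different point in the program, which is one way an unrelated commit can hide or expose the hang.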
There is an error while compiling with your patch:
|
If you meant |
Yeah, sorry for the typo. Thanks for checking. |
I've also checked my use case with django-haystack and django, the process hanging disappeared there as well. This seems to be the right solution. |
Not sure if this is the right place to ask, but can we get a hotfix version until 3.10.11 arrives? |
…states (pythonGH-102222). (cherry picked from commit 5f11478) Co-authored-by: Kumar Aditya <[email protected]>
HypothesisWorks/hypothesis#3585 is reproducible on Python 3.10.10 but not 3.10.9, and so we suspect that #100922 may have introduced a (rare?) deadlock while fixing the data race in #100892.
The downstream bug report on Hypothesis includes a reliable (but not minimal) reproducer on OSX, though it's unclear whether this might be an artifact of different patch versions of CPython on the various machines that have been checked so far.