Skip to content

[core] Cleanup SRT state after a fork() (issue #3177) #3179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

cl-ment
Copy link
Collaborator

@cl-ment cl-ment commented Jun 6, 2025

This change ensures proper cleanup and reinitialization of the libsrt internal state after a fork() system call. Since the child process does not inherit the parent’s threads, it must explicitly close all inherited SRT sockets and reinitialize the library by calling srt_startup() again.
Without this cleanup, the child process may hang when calling srt_cleanup(). With this pull request, the child process should now exit cleanly without hanging.

@cl-ment cl-ment added the [core] Area: Changes in SRT library core label Jun 6, 2025
@cl-ment cl-ment linked an issue Jun 6, 2025 that may be closed by this pull request
@cl-ment cl-ment added this to the v1.5.5 milestone Jun 6, 2025
@maxsharabayko
Copy link
Collaborator

API change: new API function srt_cleanupAtFork() can't go directly in the API of 1.5.5. Must be placed under ifdef and disabled by default (e.g. see ENABLE_AEAD_API_PREVIEW and ENABLE_MAXREXMITBW for options to be enabled in 1.6.0).
To follow existing naming convention must be named e.g. srt_cleanup_at_fork.

@lelegard
Copy link
Contributor

lelegard commented Jun 6, 2025

API change: new API function srt_cleanupAtFork() ...
To follow existing naming convention must be named e.g. srt_cleanup_at_fork.

Why should such a function appear in the public SRT API? Shouldn't this remain internal to libsrt, as an internal cleanup mechanism?

It user applications do not need to call it, it may create some confusion.

@cl-ment
Copy link
Collaborator Author

cl-ment commented Jun 6, 2025

Apologies, I had misplaced the function declaration. srt_cleanupAtFork is intended to be used internally and should not be exposed for external use.
@lelegard This update should resolve issue #3177 without requiring any changes to existing applications.

ls->second->m_QueuedSockets.erase(s->m_SocketID);
}
s->core().closeInternal();
operator delete(s);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slick. This frees the memory, bypassing the destructor (including unsubscribing from the multiplexer and cleaning out its threads). I'm not sure if socket objects can be safely deleted this way - at least I'd do it by first closing all sockets, then waiting until all multiplexers are gone (this should not involve any waiting if we state that no threads are running, but still multiplexer objects will still exist, even if they don't have the duplicated UDP socket, and therefore they must be also destroyed). Only then can all sockets be deleted, and then the groups, stating that sockets were deleted "decently", that is, they were removed as a group member first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This frees the memory, bypassing the destructor

Why did you write this? The delete C++ operator calls the destructor before freeing the memory. The following two forms are identical, as far as I understand:

operator delete(s);
delete s;

The one which frees memory without calling the destructor, in C++20, probably for desperate situations, is:

operator delete(s, std::destroying_delete);

Did I miss something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are wrong. https://en.cppreference.com/w/cpp/memory/new/operator_delete

The version from C++20 is irrelevant - it contains way more nuisances on its own.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally use delete without calling the destructor here. After a fork(), all threads managing the sockets no longer exist in the child process. Since the destructor attempts to destroy mutexes and condition variables that are left in an undefined state, I chose to simply free the memory without performing any further cleanup to avoid potential issues.

Copy link
Collaborator

@ethouris ethouris Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't refer to just the fact that you freed the memory. What I meant was that there were more things to possibly destroy, maybe not through managing the data or exitting threads, but there are also bound objects - such as multiplexers. As file descriptors also get duplicated (every example of fork() speak about that), then definitely any open UDP sockets inside multiplexers must be closed as well. Not sure how mutexes and condition variables are handled in that case.

It's not wrong that you have only reclaimed memory and bypassed destructors - the problem is that it might be not enough to reallly free all resources that sockets occupied.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are wrong. https://en.cppreference.com/w/cpp/memory/new/operator_delete

You are right, my bad, delete and operator delete are not equivalent, I was confused by the C++20 trick.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That from C++20 works only if you overload the operator delete in particular class this specific way - this will make the delete p call bypass the destructor, exceptionally. I didn't even kow about this trick ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to what I have read about system objects in case of fork - at least on Linux, which has it defined - only one thread is duplicated, the one that calls fork(), and all mutexes and CVs are in the state derived from the parent. So, in order to do a complete cleanup of these things, all mutexes would have to be forcefully unlocked and CVs signaled, then all destroyed. Otherwise these objects constitute a leak. Theoretically most implementations of mutexes do not operate with dynamic allocation here, but none of the standards we use does give any guarantees on that. Of course, we don't risk any stuck threads here, as fork() only duplicates the current thread, but there is no guarantee that deallocation of memory containing a "sunken" mutex without performing any destruction action first is enough to free up all resources that the mutex was using.

@lelegard
Copy link
Contributor

lelegard commented Jun 11, 2025

Hi,

I have tried this patch on the case from #3177. The symptoms are radically different, but still not working properly.

For the record, sender command:

tsp -b 1,000,000 -I null -P regulate -O srt --caller localhost:3000

And receiver command, creating a process every 4 seconds, exhibiting the problem:

tsp -I srt --listener 3000 -P bitrate_monitor --periodic-command 4 --alarm-command echo -O drop

There is no longer any hung or zombie process. However, the SRT session is disconnected, probably due to SRT cleanup in the child process.

The precise scenario is:

  • Start receiver command.
  • Start sender command. Connection established.
  • After 4 seconds, in the receiver process, the child process is created and displays a message.
  • At this precise moment, the sender process stops because of a disconnection on the receiver side.
  • The receiver command continues a while, until the buffers are emptied. There is enough time to display a second command 4 seconds later.
  • The receiver command stops with the following message:
17:17:36.326613/srt!W:SRT.br: CRcvBuffer.readMessage(): nothing to read. Ignored isRcvDataReady() result?

Reproducer:

  • cloned repo https://github.com/cl-ment/srt.git
  • branch issue-3177
  • last commit f5aa892 ("Remove a typo")
  • built libsrt from the branch issue-3177, got a libsrt.so.1.5 binary
    • mkdir build
    • cd build
    • cmake -DENABLE_C_DEPS=ON -DENABLE_SHARED=ON -DENABLE_STATIC=OFF ..
    • make
  • defined LD_LIBRARY_PATH before running the receiver process
  • tested on Ubuntu 25.04, arm64 architecture

If you can't reproduce this behaviour on a test program, I recommend to test with the exact commands above. I can help setting up TSDuck if necessary.

@lelegard
Copy link
Contributor

lelegard commented Jun 11, 2025

Considering the discussions in the previous posts and the results which are observed, I wonder if these cleanup considerations are just over-engineering.

First, in a multi-threaded application, we never fork to continue in the same executable on the long run. This is a parallelism pattern which can be easily replaced with multi-threading. So, if we are in a multithreaded application, we usually fork in two cases only:

  1. Immediately exec().
  2. Do something quick and exit().

In my scenario, the intermediate process is case 2 and the grand-child process is case 1.

So, first consideration, we do not care about "leaks". There is no leak in threads because fork didn't recreate other threads. There is no memory leak either. Having N bytes of memory becoming unused once is just a theoretical leak, not a practical one. A practical leak is having N bytes left unused every X seconds, endlessly. In that case, and in that case only, you eventually run into a problem.

Concerning the sockets (in UDP sense, not SRT socket), the child process should close() the corresponding file descriptors, nothing more. If you try to "disconnect", whatever it means in terms of SRT protocol (about which I am quite ignorant), you may send a message meaning "socket closing" and the peer will probably exits, as I observe with the current fix. The SRT socket continues to live and operate in the parent process. Don't kill it by doing unnecessary things in the child process. In the child process, you just inherit an open file descriptor which is simply useless.

So, probably, a plausible scenario should be:

  • In the atfork handler, on child process side, close all file descriptors without doing anything else. Setup a global boolean which means "libsrt is in a forked state, unstable, useless, do nothing".
  • For any librst call (which should never occur anyway in the child process), when this boolean is set, do nothing, at best return an error status.
  • On srt_cleanup() (which may still be automatically called in destructors in the child process), when this boolean is set, do nothing, nothing at all.
  • Of course, there may be additional internal considerations about libsrt that I ignore.

@ethouris
Copy link
Collaborator

Hmm, this looks like then the only thing to do on fork() is to run over the list of multiplexers and close all UDP sockets in there. Nothing else matters, even memory allocated for the objects because all of this will be disposed anyway on exec().

@lelegard
Copy link
Contributor

Hmm, this looks like then the only thing to do on fork() is to run over the list of multiplexers and close all UDP sockets in there. Nothing else matters, even memory allocated for the objects because all of this will be disposed anyway on exec().

In terms of file descriptors, probably. Not even necessary if FD_CLOEXEC was set on them.

However, there is probably some hidden issue. In the original problem #3177, the intermediate process hangs after returning from srt_cleanup(). Using the debug traces, we can see that the sequence of operations is:

  • exit() called
  • exit handlers called one by one
  • the exit handler which calls the C++ destructors is called
  • the destructor of the singleton in charge of SRT cleanup calls srt_cleanup()
  • we safely returned from srt_cleanup()
  • exit() never completed because ps shows the executable in Ss state.

So, there is something after completion of srt_cleanup() which hangs the process on Linux (but not on macOS). Since the error message on macOS demonstrated that srt_cleanup() called phtread_join(), maybe this invalid phtread_join() is the origin of the later hang on Linux. That's why it could be worth trying with a simple atfork handler which does nothing more than closing all file descriptors. We all know that trying various things until it works is a very bad approach, so it's only a test and a more thorough analysis of the consequences of fork() on libsrt seems necessary anyway.

@ethouris
Copy link
Collaborator

ethouris commented Jun 13, 2025

Ah, yes, this is possible. In case of after-fork destruction there's no thread to join and maybe this isn't properly checked...

How about adding a condition m_GCThread.joinable() in stopGarbageCollector()?

@lelegard
Copy link
Contributor

Is it safe to assume that the thread state can be trusted after fork()? Meaning can we call joinable()?

Why not simply set a global boolean in an atfork handler saying "after fork, in child process"?

This boolean simply means "all libsrt data are unsafe, no thread is available, don't do anything on sockets, because it could interfere with parent process, which remains the sole owner of the SRT state, just close all file descriptors".

@lelegard
Copy link
Contributor

Hi,

With the latest commit, db8375d, the problem disappeared on Linux (not tested yet on macOS). There is no more hung or zombie process, the transmission continues on both sides, process are regularly created and complete without interference.

@cl-ment
Copy link
Collaborator Author

cl-ment commented Jun 18, 2025

I’m glad to hear that. Thank you for your feedback, @lelegard. I will double-check that the UDP sockets are properly closed, and if everything looks good, I will proceed with the merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Libsrt interference with fork/exec
4 participants