
Attempt to expand listener to multi-threaded #1313


Draft
wants to merge 6 commits into master

Conversation

@jiankyu (Contributor) commented May 6, 2025

Attempt to resolve #1312. A summary of the changes can be found in the commit message.
This is still a work in progress, with a stability issue present. Bringing it in early to check whether any failing CI job can shed some light.

This commit resolves issue pistacheio#1312.

Listener now supports multiple "acceptor" threads handling new client
socket connections. For plain HTTP this has no benefit, as only a single
`accept()` syscall is involved. However, if the server is armed with SSL,
this significantly increases server performance in terms of TPS.

The single server thread limits the TPS in two ways:

1. The regular TCP handshake is "non-blocking" for the server: as soon as
the server sends the SYN+ACK frame, the server side is considered
established and `accept()` returns. With SSL, however, the server needs to
send its certificate and wait for the client's response, which blocks the
`SSL_accept()` call until the handshake completes. If the client comes from
a remote data center with high latency, that latency contributes to the
blocking of the listener thread, resulting in very low TPS.

2. The SSL handshake involves cryptographic arithmetic, which is CPU
bound. Even if the client-to-server latency can be ignored, one listening
thread is only able to handle ~1000 TPS on an Intel Xeon core @ 2.1 GHz.

In this commit, we add two more options to Endpoint::Options that specify
the number of "acceptor" threads and the thread name;
Listener::runThreaded() consumes them and initializes that number of
threads. To give every thread an equal chance to receive epoll events, the
lock operation before calling `poller.poll()` is removed. Only two fds are
registered to the Listener's poller, the listen_fd and the shutdownFd, so
there is no scenario in which an fd is removed while it is being handled.
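
For illustration, a minimal sketch of how an application might opt in. The exact names of the two new options are not quoted here, so setAcceptorThreads()/setAcceptorThreadsName() are placeholders shown commented out; everything else is the existing Endpoint API:

#include <pistache/endpoint.h>
#include <pistache/http.h>

using namespace Pistache;

struct Hello : public Http::Handler
{
    HTTP_PROTOTYPE(Hello)
    void onRequest(const Http::Request&, Http::ResponseWriter writer) override
    {
        writer.send(Http::Code::Ok, "hello\n");
    }
};

int main()
{
    Http::Endpoint server(Address(Ipv4::any(), Port(9443)));

    auto opts = Http::Endpoint::options().threads(8); // existing reactor worker threads
    // Placeholder names for the two new options described above:
    // opts.setAcceptorThreads(4);              // number of "acceptor" threads
    // opts.setAcceptorThreadsName("acceptor"); // their thread name

    server.init(opts);
    server.setHandler(Http::make_handler<Hello>());
    server.useSSL("./server.crt", "./server.key"); // SSL is where extra acceptors pay off
    server.serve();
}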
codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 75.90%. Comparing base (31ef837) to head (0fd80ce).
Report is 2 commits behind head on master.

Files with missing lines   Patch %   Lines
src/server/endpoint.cc      40.00%   6 Missing ⚠️
src/server/listener.cc      91.42%   3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1313      +/-   ##
==========================================
- Coverage   76.29%   75.90%   -0.39%     
==========================================
  Files          64       63       -1     
  Lines       11115     9946    -1169     
==========================================
- Hits         8480     7550     -930     
+ Misses       2635     2396     -239     


jiankyu added 2 commits May 6, 2025 10:12
@dgreatwood (Collaborator) commented

Hi @jiankyu -

Couple of things.

1/ Windows compile error

The build sees error:
../src/server/listener.cc(589): error C2065: 'lLoggedSetThreadDescriptionFail': undeclared identifier

This is in the code to set the "thread description" in the Windows case. I think you copied this over from reactor.cc? Which is fine. But you would also need to copy over (i.e. add to listener.cc) the declaration:
static std::atomic_bool lLoggedSetThreadDescriptionFail = false;

2/ Test case

I take it from the prior conversation that you are seeing the SIGPIPE failure from your own application. It seems the SIGPIPE is not demonstrated by the existing Pistache unit tests? Presuming not, could you add a test that activates the problem?
Most likely you would add the test in https_server_test.cc, though you could also inspect https_client_test_net.cc for possibly relevant example tests.

Thanks in advance.

@kiplingw (Member) commented May 6, 2025

@jiankyu, I don't know if it's possible through a unit test, but it would be helpful if you are able to benchmark before and after your (finalized) PR when ready. That will help us evaluate whether the underlying resource bottleneck problem has been resolved.

@jiankyu (Contributor, Author) commented May 7, 2025


@dgreatwood

  1. Yes, I copied the code from reactor.cc; I will also copy the declaration to resolve the compile error.
  2. Sure, I will add a test case to reveal the problem and show the performance data I got.

@jiankyu (Contributor, Author) commented May 7, 2025

Replying to @kiplingw: I actually did the benchmarking. Below is the test data; the numbers were generated by a multi-threaded client request generator.
Column 1 is the number of acceptWorker threads, column 2 is the average requests per second during the test, and columns 3-5 are the median, 90th-, and 99th-percentile request latencies. The last column, efficiency, is defined as ops / num_threads / ops_with_1_thread; for example, with 2 threads I got 1979.47 ops, so its efficiency is 1979.47 / 2 / 1062.23 = 93.18%.

#threads        op/s       P50     P90     P99   efficiency
1            1062.23    37.282  38.688  42.565   100%
2            1979.47    20.087  21.730  23.066   93.18%
3            2850.90    13.996  15.079  16.234   89.46%
4            3820.15    10.361  11.429  13.018   89.90%
5            4397.98     9.035   9.824  10.720   82.81%
6            5085.45     7.764   8.548   9.441   79.79%
7            5874.28     6.742   7.517   8.337   79.00%
8            6431.21     6.110   6.860   7.689   75.68%
9            6939.28     5.640   6.495   7.375   72.59%
10           7561.68     5.156   5.962   6.860   71.19%
15           9806.96     3.935   4.874   5.794   61.55%
20          12185.99     3.119   3.995   5.292   57.36%
25          12980.32     2.984   3.786   5.209   48.87%
30          14252.14     2.698   3.617   4.996   44.72%
35          14826.37     2.481   3.588   5.674   39.88%
40          14826.68     2.477   3.635   5.714   34.90%
45*         15187.71     2.710   4.101   6.111   31.77%
50*         15304.74     3.016   4.558   6.615   28.81%
60*         15156.50     3.672   5.490   7.697   23.78%

As we can see, the maximum ops we can get is 15304.74 at 50 threads; increasing the thread count further doesn't increase the ops.
Also, as we increase the threads, the efficiency drops; the performance gain doesn't scale linearly with the number of threads. This makes me think of a different approach to the problem: maybe we can offload the SSL handshake to the AsyncImpl::Worker threads, so the listener stays single-threaded and only does accept(). This requires changes in the Peer class, and the listener has to provide a read-only interface that exposes its SSL context to the workers so they can create the SSL object later. I didn't take this approach because it involves more changes across different classes and I'm not quite confident about it, but theoretically it should perform better.
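
For illustration only, here is a rough sketch of that alternative idea, written against plain POSIX sockets and OpenSSL rather than Pistache's actual Listener/Peer classes: the listener thread only calls accept(), then hands the raw fd plus a shared SSL_CTX to a worker that performs the blocking SSL_accept():

#include <openssl/ssl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <condition_variable>
#include <mutex>
#include <queue>

// Listener thread: only the cheap, non-cryptographic accept() happens here.
void acceptLoop(int listenFd, std::queue<int>& pending,
                std::mutex& mtx, std::condition_variable& cv)
{
    for (;;)
    {
        int clientFd = ::accept(listenFd, nullptr, nullptr);
        if (clientFd < 0)
            break;
        {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push(clientFd);
        }
        cv.notify_one(); // wake one worker to run the TLS handshake
    }
}

// Worker thread: performs the blocking, CPU-bound SSL_accept(). It needs
// read-only access to the listener's SSL_CTX to create the SSL object.
void handshakeWorker(SSL_CTX* ctx, std::queue<int>& pending,
                     std::mutex& mtx, std::condition_variable& cv)
{
    for (;;)
    {
        int fd;
        {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [&] { return !pending.empty(); });
            fd = pending.front();
            pending.pop();
        }
        SSL* ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);
        if (SSL_accept(ssl) == 1)
        {
            // hand (fd, ssl) over to the normal request-processing path
        }
        else
        {
            SSL_free(ssl);
            ::close(fd);
        }
    }
}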

@dgreatwood (Collaborator) commented

Cool. Out of interest, what kind of CPU was it, with how many parallel execution units?

Also, as per @kiplingw's prior comment, it would be awesome to put some form of this - e.g. 1, 2, 25, 50, 75 threads for a few seconds each - into a unit test if practical.

@jiankyu (Contributor, Author) commented May 7, 2025

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  104
  On-line CPU(s) list:   0-103
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  26
    Socket(s):           2
    Stepping:            7
    CPU(s) scaling MHz:  49%
    CPU max MHz:         4000.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            4200.00

@jiankyu (Contributor, Author) commented May 9, 2025

Unfortunately the current Http::Experimental::Client() can't be used to benchmark this; there are a couple of problems with it.
The blocking issue is that it reuses connections and there is no way to opt out: I used this client to send 10k requests, and only 1 connection establishment was captured.
Secondly, the implementation also seems buggy: it allows setting the "threads" used by the underlying reactor, but only the first thread does any work. When that thread hits 100% CPU it can do around 2400 requests per second, while all the other threads are idle.

The Client performance issue can be worked around by multiple clients making requests simultaneously, but I got stuck on the connection-reuse problem; I tried different fixes, but none of them has worked so far.
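
For reference, the multi-client workaround looks roughly like this, following the usage pattern of Pistache's http_client example; the URL, thread count and request count are arbitrary, and each client still reuses its single connection:

#include <pistache/client.h>
#include <pistache/http.h>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

using namespace Pistache;

// One Http::Experimental::Client per thread, so connections are at least
// established in parallel; each client still reuses its own connection.
void clientThread(const std::string& url, int requests)
{
    Http::Experimental::Client client;
    client.init(Http::Experimental::Client::options().threads(1));

    std::vector<Async::Promise<Http::Response>> responses;
    for (int i = 0; i < requests; ++i)
    {
        auto resp = client.get(url).send();
        resp.then([](Http::Response) {}, [](std::exception_ptr) {});
        responses.push_back(std::move(resp));
    }

    auto all = Async::whenAll(responses.begin(), responses.end());
    Async::Barrier<std::vector<Http::Response>> barrier(all);
    barrier.wait_for(std::chrono::seconds(30));
    client.shutdown();
}

int main()
{
    std::vector<std::thread> clients;
    for (int t = 0; t < 8; ++t)
        clients.emplace_back(clientThread, std::string("https://127.0.0.1:9443/"), 1000);
    for (auto& t : clients)
        t.join();
}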

The shutdownFd is used to tell the acceptor worker threads to exit. When
there are multiple threads, the current implementation is not enough to
ensure that every worker gets the event; if a worker is busy doing
something, it will miss the notification.

Without changing the NotifyFd interface, the guaranteed way is to notify
in a loop and check the number of alive worker threads, and keep notifying
them until all the threads are done.
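
A minimal sketch of that loop, using shutdownFlag and aliveAcceptors as stand-ins for the real shutdownFd notification and worker bookkeeping (not the actual Listener members):

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Stand-ins for the Listener internals: shutdownFlag models the shutdownFd
// event, aliveAcceptors counts acceptor threads still running.
std::atomic<bool> shutdownFlag{false};
std::atomic<int>  aliveAcceptors{0};

void acceptorLoop()
{
    aliveAcceptors.fetch_add(1);
    while (!shutdownFlag.load())
    {
        // real code: poller.poll(), then handle listen_fd / shutdownFd events
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    aliveAcceptors.fetch_sub(1); // the last thread to exit ends the notify loop below
}

void requestShutdown()
{
    // A single notification may be consumed by one worker, or missed by a
    // busy one, so keep notifying until every acceptor thread has exited.
    while (aliveAcceptors.load() > 0)
    {
        shutdownFlag.store(true); // real code: notify the shutdownFd again
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

int main()
{
    std::vector<std::thread> acceptors;
    for (int i = 0; i < 4; ++i)
        acceptors.emplace_back(acceptorLoop);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    requestShutdown();
    for (auto& t : acceptors)
        t.join();
}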

Successfully merging this pull request may close these issues.

Limited Performance on HTTPS server