Retry the writing of messages on transient network errors #668
Conversation
@efaif thanks for the fix. We're under the impression that the syscall errors would already be wrapped by the network layer, and Would you be able to share logs that show which errors you were getting in your program? |
@achille-roussel thanks for the response. Sure, below are the program logs containing the writer error logs (tagged by
You can notice that the writes that failed with a connection error have not been retried and that the |
@achille-roussel is there anything I can do to help progress the PR? |
Thanks for providing the extra context. I guess what I was wondering is whether we could simply test for No concerns with merging the change otherwise. |
@achille-roussel thanks for the response. After a more thorough debugging I found that the syscall connection errors are not covered by the When a // Treat ECONNRESET and ECONNABORTED as temporary errors when
// they come from calling accept. See issue 6163.
if e.Op == "accept" && isConnError(e.Err) {
return true
} This doesn't apply in our case where the if ne, ok := e.Err.(*os.SyscallError); ok {
t, ok := ne.Err.(temporary)
return ok && t.Temporary()
} Which ends up here: func (e Errno) Temporary() bool {
return e == EINTR || e == EMFILE || e == ENFILE || e.Timeout()
} This calls func (e Errno) Timeout() bool {
return e == EAGAIN || e == EWOULDBLOCK || e == ETIMEDOUT
} As you can notice neither |
We're also seeing "Unexpected EOF" write failures when a broker goes down, can we get this merged please? |
@achille-roussel hey, how can we expedite this? |
Changes are looking good, thanks for the contribution!
Hey, can you tag this release please so we can use it? :) |
After upgrading to 0.4 (v0.4.16) from 0.3 we noticed that if a broker goes down in our kafka cluster, the ongoing writes to its partitions are failing with network connection errors and without any retry attempts.
That causes these writes to fail although they would have succeeded if retried due to the other healthy brokers in the cluster taking the partitions leadership.
I think that although network connection errors are not considered temporary, they should be treated as transient errors in this scenario, as the writer is usually working against a cluster of brokers that tolerates server failures (up to the configured replication factor).
In this PR I’ve added another condition to the breaking retry loop check that validates that the received error is not a transient network error before exiting the loop.
I’ve reproduced the issue locally and validated that this change indeed fixes it.
I would love to hear your thoughts.
The text was updated successfully, but these errors were encountered: