Retry the writing of messages on transient network errors #668

efaif · 2021-05-19T08:59:06Z

After upgrading from 0.3 to 0.4 (v0.4.16), we noticed that if a broker in our Kafka cluster goes down, in-flight writes to its partitions fail with network connection errors and are not retried.
These writes fail even though a retry would have succeeded, because the remaining healthy brokers take over leadership of the affected partitions.

I think that although network connection errors are not considered temporary, they should be treated as transient errors in this scenario, as the writer is usually working against a cluster of brokers that tolerates server failures (within the limits of the configured replication factor).

In this PR I’ve added a condition to the check that breaks out of the retry loop: the loop now exits early only if the received error is also not a transient network error.
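
For illustration, here is a minimal Go sketch of such a retry condition. It is a sketch under assumptions, not necessarily the code in this PR: `writeWithRetry`, `errReset`, and `errNotTransient` are hypothetical names, and the set of errors treated as transient (unexpected EOF, connection refused, connection reset, broken pipe) is an assumption based on the failures seen when a broker goes down. The writer's existing temporary-error check is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"syscall"
)

// Simulated errors for the demo (hypothetical, not captured from a real broker).
var (
	errReset        = fmt.Errorf("kafka.(*Client).Produce: %w", syscall.ECONNRESET)
	errNotTransient = errors.New("message too large")
)

// isTransientNetworkError reports whether err wraps one of the connection
// failures assumed here to be worth retrying against a multi-broker cluster.
func isTransientNetworkError(err error) bool {
	return errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, syscall.ECONNREFUSED) ||
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE)
}

// writeWithRetry calls write up to maxAttempts times (maxAttempts >= 1 is
// assumed). It stops on success or on the first error that is not a
// transient network error, and returns the number of attempts made.
func writeWithRetry(maxAttempts int, write func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = write()
		if err == nil || !isTransientNetworkError(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// Fails twice with a connection reset, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errReset
		}
		return nil
	}
	attempts, err := writeWithRetry(5, flaky)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

A real writer would normally also wait between attempts (backoff) and refresh partition metadata so that the retry reaches the new leader; both are omitted here.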

I’ve reproduced the issue locally and validated that this change indeed fixes it.

I would love to hear your thoughts.

achille-roussel · 2021-05-21T17:41:10Z

@efaif thanks for the fix.

We're under the impression that the syscall errors would already be wrapped by the network layer, and isTemporary should match them (see https://golang.org/src/net/net.go?s=16228:16259#L515).

Would you be able to share logs that show which errors you were getting in your program?

efaif · 2021-05-21T21:23:55Z

@achille-roussel thanks for the response.

Sure, below are the program logs containing the writer error logs (tagged by producer.kafkaWriter) and the errors received by the writer’s Completion function (tagged by producer).
I’ve also logged the received net.OpError’s Op value.

...
...
...
2021-05-21T22:08:48.17       INFO    producer        Start producing...
2021-05-21T22:08:57.51       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date   {"topic": "test-topic"}
2021-05-21T22:08:57.55       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): [6] Not Leader For Partition: the client attempted to send messages to a replica that is not the leader for some partition, the client's metadata are likely out of date   {"topic": "test-topic"}
2021-05-21T22:08:57.59       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: unexpected EOF        {"topic": "test-topic"}
2021-05-21T22:08:57.59       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: unexpected EOF", "topic": "test-topic"}
2021-05-21T22:08:57.63       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: write tcp 192.168.0.133:54773->192.168.0.133:9092: write: broken pipe {"topic": "test-topic"}
2021-05-21T22:08:57.63       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: write tcp 192.168.0.133:54773->192.168.0.133:9092: write: broken pipe", "topic": "test-topic"}
2021-05-21T22:08:57.63       ERROR   producer        Got OpError.Op = 'write'  
2021-05-21T22:08:57.63       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: read tcp 192.168.0.133:54771->192.168.0.133:9092: read: connection reset by peer {"topic": "test-topic"}
2021-05-21T22:08:57.63       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: read tcp 192.168.0.133:54771->192.168.0.133:9092: read: connection reset by peer", "topic": "test-topic"}
2021-05-21T22:08:57.63       ERROR   producer        Got OpError.Op = 'read' 
2021-05-21T22:08:57.64       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused      {"topic": "test-topic"}
2021-05-21T22:08:57.64       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused", "topic": "test-topic"}
2021-05-21T22:08:57.64       ERROR   producer        Got OpError.Op = 'dial' 
2021-05-21T22:08:57.65       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused      {"topic": "test-topic"}
2021-05-21T22:08:57.65       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused", "topic": "test-topic"}
2021-05-21T22:08:57.65       ERROR   producer        Got OpError.Op = 'dial' 
2021-05-21T22:08:57.67       ERROR   producer.kafkaWriter    error writing messages to test-topic (partition 1): kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused      {"topic": "test-topic"}
2021-05-21T22:08:57.67       ERROR   producer        Failed to produce messages batch        {"error": "kafka.(*Client).Produce: dial tcp 192.168.0.133:9092: connect: connection refused", "topic": "test-topic"}
2021-05-21T22:08:57.67       ERROR   producer        Got OpError.Op = 'dial'

Note that the writes that failed with a connection error were not retried; the Completion function was invoked immediately.
The net.OpError’s Op is not “accept” for any of these connection errors, so they were not treated as temporary.
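
One way to extract the Op value from an error like the ones above is `errors.As`. The snippet below is a minimal sketch, assuming the error chain can be unwrapped down to a `*net.OpError`; `opOf` and `simulatedDialError` are hypothetical helpers, and the error is constructed by hand rather than captured from a real broker.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// opOf returns the Op field of the first *net.OpError found in err's chain,
// or an empty string if the chain contains none.
func opOf(err error) string {
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return opErr.Op
	}
	return ""
}

// simulatedDialError builds an error shaped like the
// "dial tcp ...: connect: connection refused" failures (simulated).
func simulatedDialError() error {
	return fmt.Errorf("kafka.(*Client).Produce: %w", &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: os.NewSyscallError("connect", syscall.ECONNREFUSED),
	})
}

func main() {
	fmt.Printf("Got OpError.Op = '%s'\n", opOf(simulatedDialError()))
}
```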

efaif · 2021-06-01T21:23:04Z

@achille-roussel is there anything I can do to help progress the PR?

achille-roussel · 2021-06-04T18:11:51Z

Thanks for providing the extra context.

I guess what I was wondering is whether we could simply test for io.ErrUnexpectedEOF; it seemed like the other syscall errors should already be covered by the isTemporary check.

No concerns with merging the change otherwise.

efaif · 2021-06-05T18:36:19Z

@achille-roussel thanks for the response.

After more thorough debugging, I found that the syscall connection errors are not covered by the isTemporary check and are not treated as temporary by the network layer:

When a net.OpError that wraps os.SyscallError (like ECONNREFUSED or ECONNRESET) is received, the isTemporary check calls net.OpError's Temporary.
net.OpError's Temporary treats a connection error as temporary only when its Op is "accept":
https://golang.org/src/net/net.go?s=16559:16575#L515

// Treat ECONNRESET and ECONNABORTED as temporary errors when
// they come from calling accept. See issue 6163.
if e.Op == "accept" && isConnError(e.Err) {
	return true
}

This doesn't apply in our case where the Op is "write", "read" or "dial" as can be seen in the logs posted in a previous comment above.
Therefore, net.OpError's Temporary calls os.SyscallError's Err's Temporary:
https://golang.org/src/net/net.go?s=16708:16721#L515

if ne, ok := e.Err.(*os.SyscallError); ok {
	t, ok := ne.Err.(temporary)
	return ok && t.Temporary()
}

Which ends up here:
https://golang.org/src/syscall/syscall_unix.go?s=3288:3390#L138

func (e Errno) Temporary() bool {
	return e == EINTR || e == EMFILE || e == ENFILE || e.Timeout()
}

This calls syscall.Errno's Timeout:
https://golang.org/src/syscall/syscall_unix.go?s=3388:3482#L142

func (e Errno) Timeout() bool {
	return e == EAGAIN || e == EWOULDBLOCK || e == ETIMEDOUT
}

As you can see, none of ECONNREFUSED, ECONNRESET, or EPIPE is treated as temporary in this case.
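
The behaviour traced above can be checked with a small standalone program. This is a sketch under assumptions: the errors are built by hand to mimic how the net package wraps syscall failures, and `opError`, `isConnFailure`, and `samples` are hypothetical names. On a Unix-like system, `Temporary()` should report false for all three samples, while matching the errno with `errors.Is` identifies each of them.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// opError builds a *net.OpError that wraps errno in an *os.SyscallError,
// mimicking the shape of errors returned by dial, read, and write.
func opError(op, call string, errno syscall.Errno) *net.OpError {
	return &net.OpError{Op: op, Net: "tcp", Err: os.NewSyscallError(call, errno)}
}

// isConnFailure matches the three errnos discussed above by unwrapping the
// chain, independently of what Temporary() reports.
func isConnFailure(err error) bool {
	return errors.Is(err, syscall.ECONNREFUSED) ||
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE)
}

// samples mirrors the dial, read, and write failures from the logs (simulated).
var samples = []*net.OpError{
	opError("dial", "connect", syscall.ECONNREFUSED),
	opError("read", "read", syscall.ECONNRESET),
	opError("write", "write", syscall.EPIPE),
}

func main() {
	for _, e := range samples {
		fmt.Printf("%-5s temporary=%v connFailure=%v\n",
			e.Op, e.Temporary(), isConnFailure(e))
	}
}
```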

dmarkhas · 2021-06-09T10:20:40Z

We're also seeing "Unexpected EOF" write failures when a broker goes down, can we get this merged please?

dmarkhas · 2021-06-20T19:44:59Z

@achille-roussel hey, how can we expedite this?

achille-roussel

Changes are looking good, thanks for the contribution!

dmarkhas · 2021-06-30T07:09:10Z

Hey, can you tag this release please so we can use it? :)

Retry the writing of messages on transient network errors

4780256

achille-roussel self-assigned this May 21, 2021

thisIsAnil mentioned this pull request Jun 16, 2021

Chirpstack stops sending data over kafka after disconnect. brocaar/chirpstack-application-server#604

Closed

2 tasks

achille-roussel approved these changes Jun 20, 2021

achille-roussel merged commit 7113876 into segmentio:master Jun 20, 2021
11 checks passed
