[Bug]: Many operations like load, release and flush failed after milvus recovered from many chaos test due to mixcoord panic when release segment #42568

Closed
@zhuwenxing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.5-20250604-fdfb78b9-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2025/06/05 01:56:48.627 +00:00] [INFO] [task/executor.go:120] ["execute the action of task"] [taskID=1749086769340] [collectionID=458512562322749924] [replicaID=-1] [step=0] [source=segment_checker]
[2025/06/05 01:56:48.627 +00:00] [WARN] [task/executor.go:293] ["no shard leader for the segment to execute releasing"] [taskID=1749086769340] [collectionID=458512562322749924] [replicaID=-1] [segmentID=458512562322554259] [node=4] [source=segment_checker] [error="shard delegator not found: channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]"] [errorVerbose="shard delegator not found: channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/pkg/v2/util/merr.warpChannelErr\n  | \t/workspace/source/pkg/util/merr/utils.go:708\n  | github.com/milvus-io/milvus/pkg/v2/util/merr.WrapErrChannelNotFound\n  | \t/workspace/source/pkg/util/merr/utils.go:714\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:292\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:163\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:123\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1700\nWraps: (2) shard delegator not found\nWraps: (3) channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: (nil)
BACKTRACE:
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: (nil)
BACKTRACE:
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment
	/workspace/source/internal/querycoordv2/task/executor.go:297 pc=0x5180d8e


[2025/06/05 01:56:53.364 +00:00] [INFO] [task/executor.go:120] ["execute the action of task"] [taskID=1749086769343] [collectionID=458512562322749924] [replicaID=-1] [step=0] [source=segment_checker]
[2025/06/05 01:56:53.365 +00:00] [WARN] [task/executor.go:293] ["no shard leader for the segment to execute releasing"] [taskID=1749086769343] [collectionID=458512562322749924] [replicaID=-1] [segmentID=458512562322554768] [node=4] [source=segment_checker] [error="shard delegator not found: channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]"] [errorVerbose="shard delegator not found: channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/pkg/v2/util/merr.warpChannelErr\n  | \t/workspace/source/pkg/util/merr/utils.go:708\n  | github.com/milvus-io/milvus/pkg/v2/util/merr.WrapErrChannelNotFound\n  | \t/workspace/source/pkg/util/merr/utils.go:714\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:292\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:163\n  | github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1\n  | \t/workspace/source/internal/querycoordv2/task/executor.go:123\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1700\nWraps: (2) shard delegator not found\nWraps: (3) channel not found[channel=by-dev-rootcoord-dml_6_458512562322749924v1]\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: (nil)
BACKTRACE:
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment
	/workspace/source/internal/querycoordv2/task/executor.go:297 pc=0x5180d8e


panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5180d8e]

Expected Behavior

No response

Steps To Reproduce

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/19431/pipeline
log:

artifacts-mixcoord-pod-failure-19431-server-logs.tar.gz

Anything else?

This issue also reproduces when no pods get killed, for example in the etcd-followers chaos test.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/19440/pipeline
log:

artifacts-etcd-followers-pod-failure-19440-server-logs.tar.gz

[2025/06/05 03:28:37.382 +00:00] [INFO] [task/scheduler.go:407] ["task added"] [task="[id=1749094110901] [type=Move] [source=balance_checker] [reason=channel unbalanced] [collectionID=458514015146293816] [replicaID=458514526585225217] [resourceGroup=__default_resource_group] [priority=High] [actionsCount=2] [actions={[type=Grow][node=6][shard=by-dev-rootcoord-dml_13_458514015146293816v0]},{[type=Reduce][node=5][shard=by-dev-rootcoord-dml_13_458514015146293816v0]},] [channel=by-dev-rootcoord-dml_13_458514015146293816v0]"]
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment
	/workspace/source/internal/querycoordv2/task/executor.go:297 pc=0x5180d8e


panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5180d8e]

goroutine 3556 gp=0xc001a828c0 m=9 mp=0xc000189008 [running]:
panic({0x5a78b60?, 0x9163150?})
	/usr/local/go/src/runtime/panic.go:811 +0x168 fp=0xc001d07bd0 sp=0xc001d07b20 pc=0x1f5ea08
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:262
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:925 +0x359 fp=0xc001d07c30 sp=0xc001d07bd0 pc=0x1f61d79
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).releaseSegment(0xc00735c000, 0xc002b19820, 0x0)
	/workspace/source/internal/querycoordv2/task/executor.go:297 +0xaee fp=0xc001d07f58 sp=0xc001d07c30 pc=0x5180d8e
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeSegmentAction(0xc00735c000, 0xc002b19820, 0x0)
	/workspace/source/internal/querycoordv2/task/executor.go:163 +0x8f fp=0xc001d07f80 sp=0xc001d07f58 pc=0x517eeef
github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1()
	/workspace/source/internal/querycoordv2/task/executor.go:123 +0x105 fp=0xc001d07fe0 sp=0xc001d07f80 pc=0x517e9e5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc001d07fe8 sp=0xc001d07fe0 pc=0x1f68501
created by github.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute in goroutine 3555
	/workspace/source/internal/querycoordv2/task/executor.go:119 +0x517

cluster:4am
ns:chaos-testing
pods

 + kubectl get pods -o wide
 + grep etcd-followers-pod-failure-19440
 etcd-followers-pod-failure-19440-0                                1/1     Running             0               42m      10.104.24.111   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-1                                1/1     Running             2 (24m ago)     42m      10.104.19.71    4am-node28   <none>           <none>
 etcd-followers-pod-failure-19440-2                                1/1     Running             0               42m      10.104.23.196   4am-node27   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-datanode-894d99d76mdqvz   1/1     Running             2 (41m ago)     42m      10.104.33.154   4am-node36   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-datanode-894d99d76mptbc   1/1     Running             2 (41m ago)     42m      10.104.26.170   4am-node32   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-indexnode-684c4c86bc4lj   1/1     Running             2 (41m ago)     42m      10.104.17.17    4am-node23   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-indexnode-684c4c86tlhf7   1/1     Running             2 (41m ago)     42m      10.104.33.155   4am-node36   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-indexnode-684c4c86xqkq8   1/1     Running             2 (41m ago)     42m      10.104.32.202   4am-node39   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-mixcoord-595fbc5887fk56   1/1     Running             4 (8m43s ago)   42m      10.104.33.153   4am-node36   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-proxy-84fc884889-shpzw    1/1     Running             2 (41m ago)     42m      10.104.17.16    4am-node23   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-querynode-f87f584d2lwzn   1/1     Running             2 (41m ago)     42m      10.104.26.171   4am-node32   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-querynode-f87f584dmsgbs   1/1     Running             2 (41m ago)     42m      10.104.32.203   4am-node39   <none>           <none>
 etcd-followers-pod-failure-19440-milvus-querynode-f87f584dtqqkk   1/1     Running             2 (41m ago)     42m      10.104.33.156   4am-node36   <none>           <none>
 etcd-followers-pod-failure-19440-minio-0                          1/1     Running             0               42m      10.104.19.68    4am-node28   <none>           <none>
 etcd-followers-pod-failure-19440-minio-1                          1/1     Running             0               42m      10.104.24.113   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-minio-2                          1/1     Running             0               42m      10.104.23.197   4am-node27   <none>           <none>
 etcd-followers-pod-failure-19440-minio-3                          1/1     Running             0               42m      10.104.15.82    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-bookie-0                1/1     Running             0               42m      10.104.15.76    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-bookie-1                1/1     Running             0               42m      10.104.19.75    4am-node28   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-bookie-2                1/1     Running             0               42m      10.104.24.119   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-bookie-init-72vm2       0/1     Completed           0               42m      10.104.15.67    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-broker-0                1/1     Running             0               42m      10.104.15.71    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-broker-1                1/1     Running             0               42m      10.104.24.108   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-proxy-0                 1/1     Running             0               42m      10.104.15.68    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-proxy-1                 1/1     Running             0               42m      10.104.19.63    4am-node28   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-pulsar-init-xzn45       0/1     Completed           0               42m      10.104.24.106   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-recovery-0              1/1     Running             0               42m      10.104.9.49     4am-node14   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-zookeeper-0             1/1     Running             0               42m      10.104.15.73    4am-node20   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-zookeeper-1             1/1     Running             0               42m      10.104.24.112   4am-node29   <none>           <none>
 etcd-followers-pod-failure-19440-pulsarv3-zookeeper-2             1/1     Running             0               42m      10.104.19.72    4am-node28   <none>           <none>
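
In the listing above, the mixcoord pod is the only one with 4 restarts, the latest 8m43s ago, while the other Milvus pods show 2 restarts from 41m ago. The following is a rough sketch of how such rows could be filtered automatically. `podRestarts` is a hypothetical helper, and the threshold of 2 is an assumption taken from this particular listing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// podRestarts extracts the pod name and restart count from one data row of
// `kubectl get pods -o wide`. The columns are NAME READY STATUS RESTARTS ...;
// RESTARTS may be followed by a suffix like "(24m ago)", which is ignored.
func podRestarts(row string) (name string, restarts int, ok bool) {
	fields := strings.Fields(row)
	if len(fields) < 4 {
		return "", 0, false
	}
	n, err := strconv.Atoi(fields[3])
	if err != nil {
		return "", 0, false
	}
	return fields[0], n, true
}

func main() {
	// Three rows copied from the listing, with column padding shortened.
	rows := []string{
		"etcd-followers-pod-failure-19440-1   1/1   Running   2 (24m ago)   42m   10.104.19.71   4am-node28   <none>   <none>",
		"etcd-followers-pod-failure-19440-milvus-mixcoord-595fbc5887fk56   1/1   Running   4 (8m43s ago)   42m   10.104.33.153   4am-node36   <none>   <none>",
		"etcd-followers-pod-failure-19440-minio-0   1/1   Running   0   42m   10.104.19.68   4am-node28   <none>   <none>",
	}
	for _, row := range rows {
		// Assumed threshold: more than 2 restarts stands out in this run.
		if name, n, ok := podRestarts(row); ok && n > 2 {
			fmt.Printf("%s restarted %d times\n", name, n)
		}
	}
}
```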

Labels

  • kind/bug: Issues or changes related a bug
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • severity/critical: Critical, lead to crash, data missing, wrong result, function totally doesn't work.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
