DRA kubelet: connection monitoring #132058
Conversation
Replaced the Manager interface with a simple struct. There was no need for the interface. Replaced the global instance of the DRA plugins store with a fresh instance owned by the manager. This makes unit testing a bit easier (no need to restore state; it would enable parallelizing long-running tests).

Simplified the plugin.PluginsStore type to just "plugin.Store" because it stuttered. The Plugin type is kept because having one struct named after its package is a common exception to the "don't stutter" guideline. For the sake of clarity, the package gets imported as "draplugin" (= <parent dir>/<package>).

Removed the unused NewManager "node" parameter and replaced direct construction of cache and manager with calls to NewManager, because that is how the manager should get constructed (less code, too).

Fixed the incorrect description of Manager: the plugin store is the entity which manages drivers, not the manager. The manager is focused on the DRA logic around ResourceClaims.
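For orientation, here is a minimal sketch of the resulting shape. This is not the actual kubelet code; the type layout, fields, and constructor parameters are simplified assumptions:

```go
package dra

import "sync"

// Plugin is a stand-in for the per-driver handle (gRPC endpoint etc.).
type Plugin struct {
	driverName string
	endpoint   string
}

// Store tracks how to reach the plugins registered for DRA drivers.
// It replaces the former package-global pluginsStore instance.
type Store struct {
	sync.RWMutex
	// There may be more than one registered endpoint per driver.
	plugins map[string][]*Plugin
}

func NewStore() *Store {
	return &Store{plugins: map[string][]*Plugin{}}
}

// Manager is a plain struct; the interface was dropped because there is
// only one implementation. It owns its own Store instead of sharing a
// global one, which keeps unit tests independent of each other.
type Manager struct {
	store *Store
	// ... fields for the ResourceClaim handling omitted
}

// NewManager is how the manager is meant to be constructed; callers no
// longer build the cache and store themselves, and the unused "node"
// parameter is gone.
func NewManager() *Manager {
	return &Manager{store: NewStore()}
}
```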
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
type pluginsStore struct {
	sync.RWMutex
	// Store keeps track of how to reach plugins registered for DRA drivers.
	// Each plugin has a gRPC endpoint. There may be more than one plugin per driver.
This is my attempt to rationalize the difference between "kubelet plugin" and "DRA driver".
The DRA manager has traditionally favored "plugin" and "plugin name" where the rest of the system uses the terms "DRA driver" and "DRA driver name". IMHO this is unnecessary, and the code would have been more readable if the use of "plugin" had been restricted to the registration mechanism. Even with some of my updates, that separation is not complete or consistent.
Would it be worth doing more renames?
It gets worse: sometimes the plugin instance is also called "client", leading to "NewDRAPluginClient". Apropos, the "New" is also wrong there because it does not create an instance.
I'll add a commit with further renaming...
It's not just code readability: kubelet also logs "pluginName" as an attribute where kube-scheduler and other components use "driverName". This is visible to users.
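For illustration, a hypothetical call site showing the difference in structured logging keys (using klog; not taken from the actual code):

```go
package dra

import "k8s.io/klog/v2"

func logRegistration(logger klog.Logger, driverName, endpoint string) {
	// Before: kubelet-specific attribute key.
	logger.Info("DRA plugin registered", "pluginName", driverName, "endpoint", endpoint)

	// After: the key that kube-scheduler and other components already use.
	logger.Info("DRA driver registered", "driverName", driverName, "endpoint", endpoint)
}
```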
Force-pushed from e37d351 to 6c65e86.
/test pull-kubernetes-node-e2e-containerd-2-0-dra
Force-pushed from 6c65e86 to 23b2871.
	ginkgo.By("wait for ResourceSlice removal")
	gomega.Eventually(ctx, listResources).Should(gomega.BeEmpty(), "ResourceSlices without plugin")
	gomega.Consistently(ctx, listResources).WithTimeout(5*time.Second).Should(gomega.BeEmpty(), "ResourceSlices without plugin")
})
Please add test cases for reconnection. It would be nice to test two scenarios: reconnection before wiping has started, and reconnection after wiping is done.
The case for "reconnect before wiping" fits here: "ResourceSlices must not be removed if plugin restarts quickly enough"
The one for "reconnection after wiping" doesn't: we allow the slice to be deleted and have the restarted driver recreate it and see it remain, but that doesn't really tell us whether the kubelet has reconnected. I'm putting something next to "must be functional when plugin starts to listen on a service socket after registration": "Resource Kubelet Plugin must be functional after reconnect".
we allow the slice to be deleted and have the restarted driver recreate it and see it remain, but that doesn't really tell us whether the kubelet has reconnected.
I thought about making sure that the slice(s) are deleted by the kubelet before the driver publishes them again on reconnect, by doing the following (see the sketch after this list):
- stopping the plugin
- waiting for slices to be removed by kubelet
- starting the plugin
- maybe waiting for the slices to be updated (not sure whether the plugin updates them unconditionally on start)
- testing that it serves requests successfully
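A hypothetical Ginkgo sketch of those steps, in the style of the surrounding e2e test; helper names such as stopPlugin, startPlugin, listResources, and runPodWithClaim are assumptions, not the actual test helpers:

```go
ginkgo.It("must be functional after its ResourceSlices were wiped", func(ctx context.Context) {
	// Stop the plugin without removing its registration socket,
	// so that only connection monitoring can detect that it is gone.
	stopPlugin(ctx)

	ginkgo.By("wait for ResourceSlice removal")
	gomega.Eventually(ctx, listResources).Should(gomega.BeEmpty(), "ResourceSlices without plugin")

	// Restart the plugin; the kubelet has to reconnect on its own.
	startPlugin(ctx)

	// Verify that the driver serves requests again, e.g. by running
	// a pod which uses a ResourceClaim for this driver.
	ginkgo.By("run pod with claim")
	runPodWithClaim(ctx)
})
```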
That's what I added, without the "waiting for the slices to be updated" (creating slices on startup is covered elsewhere).
Tests look good to me, thanks!
One more thing: here you've mentioned that we need to test the use case(s) with the same registration and service sockets. Would it still make sense to do that?
Yes, worth adding. I had it on my backlog to go through that post before merging.
	//
	// TimedWorkerQueue uses namespace/name as key. We use
	// the driver name as name with no namespace.
	pendingWipes *timedworkers.TimedWorkerQueue
I still think that the added functionality is out of scope for store operations (storing plugins) and should be implemented by separate structures. This approach complicates the code a lot and makes it harder to maintain and test.
The scope of the store has increased. It's no longer limited to just storing entities created elsewhere. We could rename it to "plugin manager", but we already have too many "managers"...
If we split the responsibilities, we are back to having to use different mutexes, with potential time-of-check-time-of-use races. All of that gets avoided by making a single entity responsible for everything related to plugins (tracking registered endpoints, tracking connectedness).
I'm not convinced. I don't see any harm in using mutexes for every structure if the structure and its interface are clear and easy to use. Let's look at the example in my PR, where ResourceSlice cleanup is handled by a separate cleanupHandler structure with two interface methods: cleanupResourceSlices and cancelPendingWipe (which should probably be renamed to cancelCleanupResourceSlices). Those are called from the registration handler when a plugin is registered or unregistered, and by the plugin when the connection status changes. The fact that those methods lock cleanupHandler doesn't create any more of an issue than the fact that the store is locked when plugins are added to or removed from it. Those locks are not exposed outside and are only used internally by cleanupHandler. Can you give an example of when this approach can create time-of-check-time-of-use races?
In my opinion, merging registration and connection monitoring into Store is a big step towards a God object. It's already responsible for three things; what's next — plugin health management? Garbage collection of stale sockets? Updating node resources? The list can be continued easily. Testing this object would also be much harder than testing separate ones, in my opinion.
I think equating this to a "God object" is a bit extreme.
From my perspective this (new) abstraction has a clear set of roles that all fall within the scope of a DRAPluginManager, and these roles logically belong together in a single object:
- It handles the registration / deregistration of DRAPlugins (including wiping their resource slices if necessary), allowing the consumer of this component to not have to worry about these details
- It allows you to get a handle to a specific registered plugin and call methods on it (e.g. NodePrepare/UnprepareResources) when appropriate.
I see no reason to force these two roles into separate abstractions unless we plan to reuse one or the other of them to interface with yet another component in some way. Do you have an additional component in mind that would make keeping these roles in separate abstractions the obvious better choice?
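As a rough summary (not the actual interface; signatures simplified and hypothetical), the two roles could be written as:

```go
type draPluginManagerRoles interface {
	// Role 1: plugin lifecycle, driven by the kubelet's plugin watcher.
	// Deregistration (or a lost connection) eventually wipes the
	// driver's ResourceSlices unless the plugin comes back in time.
	RegisterPlugin(driverName, endpoint string) error
	DeRegisterPlugin(driverName, endpoint string)

	// Role 2: hand out a registered plugin so that callers can invoke
	// NodePrepareResources / NodeUnprepareResources on it.
	GetPlugin(driverName string) (*DRAPlugin, error)
}
```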
I didn't equate it to a "God object". I said it's a big step in that direction, which I still believe it is. Merging registration and connection management functionality into a storage object doesn't look like a good thing to do. Even if the result seemingly makes sense, as you've mentioned above, that doesn't mean that decomposing registration, connection management, or cleanup into separate entities wouldn't make sense. In my opinion it would be easier to understand, maintain, and test.
@@ -22,6 +22,58 @@ should end with a DNS domain that is unique for the plugin. Each time a plugin
starts, it has to delete old sockets if they exist and listen anew under the
same filename.

## Monitoring Plugin Connection |
As this is fully implemented in the DRA code, would it make sense to move the documentation there? Monitoring the service connection is out of scope for pluginwatcher, I believe.
Moving it out of the pluginmanager code would also simplify the approval of this PR, as it could then be approved by DRA approvers alone.
We also need SIG Node approval for the DRA code, so moving the documentation won't simplify approval. Besides, @klueska can also approve this here.
I prefer to keep documentation about kubelet plugin implementation aspects in one place. Perhaps other types will also implement this.
Force-pushed from 347795b to f964121.
//
// It returns an informative error message including the driver name
// with an explanation why the driver is not usable.
func (pm *PluginManager) GetDRAPlugin(driverName string) (*Plugin, error) {
Does it make sense to rename this type to DRAPluginManager, the returned type to DRAPlugin, and then have this function just be called GetPlugin()?
Looks good to me. I'll update the renaming commit...
Done, force-pushed to both PRs.
Force-pushed from f964121 to 6f220fa.
The rest of the system logs information using "driverName" as key in structured logging. The kubelet should do the same. This also gets clarified in the code, together with using a consistent name for a Plugin pointer: "plugin" instead of "client" or "instance". The "New" in NewDRAPluginClient made no sense because it's not constructing anything, and it returns a plugin, not a client -> GetDRAPlugin.
This moves wiping into the plugins store. The advantages are:
- Only a single mutex is needed in the plugin store.
- The code which decides about queuing and canceling wiping has access to all relevant information while that information is protected against concurrent modifications.
- It prepares for making that code more complex ("connection monitoring").

In retrospect it is not clear whether the RegistrationHandler did the right thing in all cases when endpoints got registered and removed concurrently. The RegistrationHandler became a thin, stateless layer on top of the store. Because it's not really needed anymore, the required methods for the plugin manager now get provided directly by the store.

The disadvantage is the slightly more complex initialization, but that was also a problem before which just hadn't been solved: wiping ran without the manager's context. Now it does.
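To illustrate the concurrency concern, here is a hypothetical sketch (stub types, not the real code) of the window that a split design leaves open when registration state and wiping are guarded by separate mutexes:

```go
package sketch

import "sync"

type store struct {
	mu      sync.Mutex
	plugins map[string]bool
}

func (s *store) remove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.plugins, name)
}

type wiper struct {
	mu      sync.Mutex
	pending map[string]bool
}

func (w *wiper) queueWipe(name string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[name] = true
}

// deRegister shows the race: between releasing the store mutex and taking
// the wiper mutex, the driver can re-register, yet a wipe still gets
// queued for it afterwards.
func deRegister(s *store, w *wiper, driverName string) {
	s.remove(driverName)
	// <-- nothing is locked here; a concurrent re-registration of
	//     driverName can happen right now ...
	w.queueWipe(driverName)
	// ... and its freshly published ResourceSlices may still get wiped.
}
```

Keeping both pieces of state behind one mutex closes that window.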
After merging with RegistrationHandler, the store abstraction is more than a dumb map. It implements additional logic, so renaming it seems warranted. To avoid confusion with other "plugin managers", the DRA prefix is used. The plugin type gets updated accordingly. This is done in a separate commit for ease of review. The "store" field is kept because that really is just a dumb lookup structure.
Instead of creating the gRPC connection on demand and forcing gRPC to connect, we establish it immediately and rely on gRPC to handle the underlying connection automatically like it usually does. It's not clear what benefit the one second connection timeout had. The way it is now, gRPC calls still fail when the underlying connection cannot be established. Having to have a separate context for establishing that connection just made the code more complex.

The DRAPluginManager is the central component which manages plugins. Making it responsible for creating them reduces the number of places where a DRAPlugin struct needs to be initialized. Doing this in the DRAPluginManager instead of a stand-alone function simplifies the implementation of connection monitoring, because that will be something that is tied to the DRAPluginManager state.
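A sketch of the connection setup under these assumptions (socket path handling simplified, helper name made up); the key point is that there is no blocking dial and no separate connection timeout, so gRPC manages the underlying connection itself:

```go
package dra

import (
	"context"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// connect creates the ClientConn once, when the plugin gets registered.
// gRPC connects lazily and transparently re-establishes the connection
// when it drops; individual calls simply fail while the plugin is
// unreachable.
func connect(ctx context.Context, socketPath string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
}
```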
This ensures that ResourceSlices get removed also when a plugin becomes unresponsive without removing the registration socket. The tests are from kubernetes#131073 by Ed, with some modifications; the implementation is new.
Conceptually TimedWorkerQueue is similar to the current code: it spawns goroutines and cancels them. Using it makes the code a bit shorter, even though the TimedWorkerQueue API could be a bit nicer and more consistent (key string vs. WorkArgs as parameters). Depending on the tainteviction package is a bit odd, which is the reason why TimedWorkerQueue wasn't already used earlier. But there don't seem to be other implementations of this common problem. https://pkg.go.dev/k8s.io/client-go/util/workqueue#TypedDelayingInterface doesn't work because queue entries cannot be removed. This doesn't really solve the problem of tracking goroutines for wiping, because TimedWorkerQueue doesn't support that. But not tracking is arguably better than doing it wrong, and this only affects unit tests, so it should be okay.
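Conceptually the pattern is "schedule a wipe after a grace period, cancel it if the plugin comes back". A stand-in sketch using time.AfterFunc instead of the actual TimedWorkerQueue API:

```go
package dra

import (
	"sync"
	"time"
)

// wipeScheduler is a conceptual stand-in for the pendingWipes queue.
type wipeScheduler struct {
	mu      sync.Mutex
	pending map[string]*time.Timer // key: driver name, no namespace
}

func newWipeScheduler() *wipeScheduler {
	return &wipeScheduler{pending: map[string]*time.Timer{}}
}

// scheduleWipe arranges for the driver's ResourceSlices to be removed
// after the grace period unless the wipe gets canceled first.
func (s *wipeScheduler) scheduleWipe(driverName string, grace time.Duration, wipe func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.pending[driverName]; ok {
		return // already scheduled
	}
	s.pending[driverName] = time.AfterFunc(grace, wipe)
}

// cancelWipe gets called when the plugin registers again or reconnects
// before the grace period is over.
func (s *wipeScheduler) cancelWipe(driverName string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if t, ok := s.pending[driverName]; ok {
		t.Stop()
		delete(s.pending, driverName)
	}
}
```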
Force-pushed from 6f220fa to aee4073.
/test pull-kubernetes-e2e-kind-dra-canary pull-kubernetes-e2e-kind-dra-n-1-canary
@pohly: The specified target(s) for /test were not found. The following commands are available to trigger optional jobs: …
/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-canary
/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary pull-kubernetes-kind-dra-canary
/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary
@pohly: The following tests failed, say /retest to rerun all failed tests: …
What type of PR is this?
/kind feature
What this PR does / why we need it:
When a kubelet plugin shuts down without removing its registration socket (for example, because it was forcibly killed or crashed), the kubelet does not clean up and delete ResourceSlices even if the plugin does not come back. With connection monitoring of the service socket, the kubelet recognizes when a plugin becomes unusable and then cleans up accordingly.
Which issue(s) this PR fixes:
Fixes #128696
Replaces #131073
Depends on #132096
Special notes for your reviewer:
Please review commit-by-commit. The initial commits refactor code to make the last commits simpler, so at least #132096 should get merged first.
Does this PR introduce a user-facing change?