DRA kubelet: connection monitoring #132058

Open · wants to merge 7 commits into master
Conversation

@pohly (Contributor) commented Jun 2, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

When a kubelet plugin shuts down without removing its registration socket (for example, because it was forcibly killed or crashed), the kubelet currently does not clean up and delete its ResourceSlices, even if the plugin never comes back. With connection monitoring of the service socket, the kubelet recognizes when a plugin has become unusable and cleans up accordingly.
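
For illustration only, here is a minimal sketch of this kind of connection monitoring using gRPC's connectivity-state API. The function name, the grace period, and the wipeResourceSlices callback are assumptions for the example, not the PR's actual implementation:

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// monitorConnection watches the gRPC connection to a plugin's service socket.
// If the connection stays unusable for longer than gracePeriod, it calls
// wipeResourceSlices and returns. All names here are illustrative.
func monitorConnection(ctx context.Context, conn *grpc.ClientConn, gracePeriod time.Duration, wipeResourceSlices func()) {
	for {
		state := conn.GetState()
		if state == connectivity.Ready || state == connectivity.Idle || state == connectivity.Connecting {
			// Connection looks usable; block until the state changes.
			if !conn.WaitForStateChange(ctx, state) {
				return // kubelet is shutting down
			}
			continue
		}
		// Connection is down: give the plugin a grace period to come back.
		waitCtx, cancel := context.WithTimeout(ctx, gracePeriod)
		changed := conn.WaitForStateChange(waitCtx, state)
		cancel()
		if ctx.Err() != nil {
			return // kubelet is shutting down
		}
		if !changed {
			// Still down after the grace period: clean up its ResourceSlices.
			wipeResourceSlices()
			return
		}
	}
}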

Which issue(s) this PR fixes:

Fixes #128696
Replaces #131073
Depends on #132096

Special notes for your reviewer:

Please review commit-by-commit. The initial commits refactor code to make the last commits simpler, so at least #132096 should get merged first.

Does this PR introduce a user-facing change?

DRA kubelet: the kubelet now also cleans up ResourceSlices in some additional failure scenarios (driver gets removed forcibly or crashes and does not restart).

Replaced the Manager interface with a simple struct. There was no need for the
interface.

Replaced the global instance of the DRA plugins store with a fresh instance
owned by the manager. This makes unit testing a bit easier (no need to restore
state, and it would enable parallelizing long-running tests).

Simplified the plugin.PluginsStore type to just "plugin.Store" because it
stuttered. The Plugin type is kept because having one struct named after its
package is a common exception to the "don't stutter" guideline.
For the sake of clarity, the package gets imported as "draplugin" (=
<parent dir>/<package>).

Removed unused NewManager "node" parameter and replaced direct construction of
cache and manager with calls to NewManager because that is how the manager
should get constructed (less code, too).

Fixed incorrect description of Manager: the plugin store is the entity which
manages drivers, not the manager. The manager is focused on the DRA logic
around ResourceClaims.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 2, 2025
@k8s-ci-robot (Contributor):

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 2, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 2, 2025
@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Jun 2, 2025
type pluginsStore struct {
sync.RWMutex
// Store keeps track of how to reach plugins registered for DRA drivers.
// Each plugin has a gRPC endpoint. There may be more than one plugin per driver.
Contributor Author:

This is my attempt to rationalize the difference between "kubelet plugin" and "DRA driver".

The DRA manager has traditionally favored "plugin" and "plugin name" where the rest of the system uses the terms "DRA driver" and "DRA driver name". IMHO this is unnecessary, and the code would be more readable if the use of "plugin" were restricted to the registration mechanism. Even with some of my updates, that separation is not complete and/or consistent.

Would it be worth doing more renames?

Contributor Author:

It gets worse: sometimes the plugin instance is also called "client", leading to "NewDRAPluginClient". Apropos, the "New" is also wrong there because it does not create an instance.

I'll add a commit with further renaming...

Contributor Author:

It's not just code readability: kubelet also logs "pluginName" as an attribute where kube-scheduler and other components use "driverName". This is visible to users.

@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch 3 times, most recently from e37d351 to 6c65e86 on June 3, 2025 10:27
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2025
@pohly commented Jun 3, 2025

/test pull-kubernetes-node-e2e-containerd-2-0-dra

@pohly pohly marked this pull request as draft June 3, 2025 12:15
@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch from 6c65e86 to 23b2871 on June 3, 2025 16:28
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and dims June 4, 2025 13:50
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Jun 4, 2025
ginkgo.By("wait for ResourceSlice removal")
gomega.Eventually(ctx, listResources).Should(gomega.BeEmpty(), "ResourceSlices without plugin")
gomega.Consistently(ctx, listResources).WithTimeout(5*time.Second).Should(gomega.BeEmpty(), "ResourceSlices without plugin")
})
Contributor:

Please add test cases for reconnection. It would be nice to test two scenarios: when reconnection occurs before the wiping has started, and when it occurs after the wiping is done.

Contributor Author:

The case for "reconnect before wiping" fits here: "ResourceSlices must not be removed if plugin restarts quickly enough"

The one for "reconnection after wiping" doesn't: we allow the slice to be deleted and have the restarted driver recreate it and see it remain, but that doesn't really tell us whether the kubelet has reconnected. I'm putting something next to "must be functional when plugin starts to listen on a service socket after registration": "Resource Kubelet Plugin must be functional after reconnect".

Contributor:

we allow the slice to be deleted and have the restarted driver recreate it and see it remain, but that doesn't really tell us whether the kubelet has reconnected.

I thought about making sure that the slice(s) are deleted by the kubelet before the driver publishes them again on reconnect by:

  • stopping the plugin
  • waiting for slices to be removed by kubelet
  • starting the plugin
  • maybe waiting for the slices to be updated (not sure if the plugin updates them unconditionally on start)
  • testing that it serves requests successfully

Contributor Author:

That's what I added, without the "waiting for the slices to be updated" (creating slices on startup is covered elsewhere).
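
For illustration, a rough sketch of the reconnect-after-wiping scenario from the steps above, in the style of the existing e2e tests. The helpers stopPlugin, startPlugin, listResources, and runPodWithClaim are hypothetical stand-ins, not the actual test framework helpers:

ginkgo.It("must be functional after reconnect", func(ctx context.Context) {
	// Stop the plugin without removing its registration socket and wait
	// until the kubelet has wiped its ResourceSlices.
	stopPlugin(ctx)
	gomega.Eventually(ctx, listResources).Should(gomega.BeEmpty(), "ResourceSlices after plugin stopped")

	// Restart the plugin; it re-registers and publishes its slices again.
	startPlugin(ctx)
	gomega.Eventually(ctx, listResources).ShouldNot(gomega.BeEmpty(), "ResourceSlices after plugin restart")

	// Verify that the kubelet can talk to the plugin again by running a
	// pod which uses a ResourceClaim.
	runPodWithClaim(ctx)
})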

Contributor:

Tests look good to me, thanks!

Contributor:

One more thing: here you've mentioned that we need to test the use case(s) with the same registration and service sockets. Would it still make sense to do that?

Contributor Author:

Yes, worth adding. I had it on my backlog to go through that post before merging.

//
// TimedWorkerQueue uses namespace/name as key. We use
// the driver name as name with no namespace.
pendingWipes *timedworkers.TimedWorkerQueue
Contributor:

I still think that the added functionality is out of scope for store operations (storing plugins) and should be implemented by separate structures. This approach complicates the code a lot and makes it harder to maintain and test.

Contributor Author:

The scope of the store has increased. It's no longer limited to just storing entities created elsewhere. We could rename it to "plugin manager", but we already have too many "managers"...

If we split the responsibilities, we are back to having to use different mutexes, with potential time-of-check/time-of-use races. All of that gets avoided by making a single entity responsible for everything related to plugins (tracking registered endpoints, tracking connectedness).

Contributor:

I'm not convinced. I don't see any harm in using mutexes for every structure if the structure and its interface are clear and easy to use. Take the example in my PR, where ResourceSlice cleanup is handled by a separate cleanupHandler structure with two interface methods: cleanupResourceSlices and cancelPendingWipe (which should probably be renamed to cancelCleanupResourceSlices). Those are called from the registration handler when a plugin is registered or unregistered, and by the plugin when the connection status changes. The fact that those methods lock the cleanupHandler doesn't create any more of an issue than the fact that the store is locked when plugins are added to or removed from it. Those locks are not exposed outside and are only used internally by the cleanupHandler. Can you give an example of how this approach can create time-of-check/time-of-use races?

In my opinion, merging registration and connection monitoring into the Store is a big step towards a God object. It's already responsible for three things; what's next to be added to it? Plugin health management? Garbage collection of stale sockets? Updating node resources? The list can easily be continued. Testing this object would also be much harder than testing separate ones, in my opinion.

Contributor:

I think equating this to a "God object" is a bit extreme.

From my perspective, this (new) abstraction has a clear set of roles that all fall within the scope of a DRAPluginManager, and these roles logically make sense to me as belonging together in a single object:

  1. It handles the registration / deregistration of DRAPlugins (including wiping their resource slices if necessary), allowing the consumer of this component to not have to worry about these details
  2. It allows you to get a handle to a specific registered plugin and call methods on it (e.g. NodePrepare/UnprepareResources) when appropriate.

I see no reason to force these two roles into separate abstractions unless we plan to reuse one or the other of them to interface with yet another component in some way. Do you have an additional component in mind that would make keeping these roles in separate abstractions the obviously better choice?

Contributor:

I didn't equate it to a "God object". I said it's a big step in that direction, which I still believe it is. Merging registration and connection management functionality into a storage object doesn't look like a good thing to do. Even if the result seemingly makes sense, as you've mentioned above, it doesn't mean that decomposing registration, connection management, or cleanup into separate entities wouldn't make sense. In my opinion it would be easier to understand, maintain, and test.

@@ -22,6 +22,58 @@ should end with a DNS domain that is unique for the plugin. Each time a plugin
starts, it has to delete old sockets if they exist and listen anew under the
same filename.

## Monitoring Plugin Connection
Contributor:

As this is fully implemented in the dra code, would it make sense to move the documentation there? Monitoring the service connection is out of scope for pluginwatcher, I believe.
Moving it out of the pluginmanager code would also simplify the approval of this PR, as it could then be approved by DRA approvers alone.

Contributor Author:

We also need SIG Node approval for the DRA code, so moving the documentation won't simplify approval. Besides, @klueska can also approve this here.

I prefer to keep documentation about kubelet plugin implementation aspects in one place. Perhaps other types will also implement this.

@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch 3 times, most recently from 347795b to f964121 on June 5, 2025 19:12
//
// It returns an informative error message including the driver name
// with an explanation why the driver is not usable.
func (pm *PluginManager) GetDRAPlugin(driverName string) (*Plugin, error) {
Contributor:

Does it make sense to rename this type to DRAPluginManager, the returned type to DRAPlugin and then have this function just be called GetPlugin()?

Contributor Author:

Looks good to me. I'll update the renaming commit...

Contributor Author:

Done, force-pushed to both PRs.
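
For illustration, a small sketch of how a caller might use the renamed types discussed above (DRAPluginManager, DRAPlugin, GetPlugin). The NodePrepareResources call and the drapb request type are assumptions for the example, not necessarily the exact shape of the code in this PR:

// prepareResources looks up the registered plugin for a driver and calls
// one of its gRPC methods. Error handling is sketched only.
func prepareResources(ctx context.Context, pm *DRAPluginManager, driverName string) error {
	plugin, err := pm.GetPlugin(driverName)
	if err != nil {
		// The error already includes the driver name and why it is unusable.
		return err
	}
	_, err = plugin.NodePrepareResources(ctx, &drapb.NodePrepareResourcesRequest{})
	return err
}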

@pohly pohly force-pushed the dra-kubelet-connection-monitoring branch from f964121 to 6f220fa on June 6, 2025 13:14
pohly and others added 6 commits June 6, 2025 18:24
The rest of the system logs information using "driverName" as key in structured
logging. The kubelet should do the same.

This also gets clarified in the code, together with using a consistent name for
a Plugin pointer: "plugin" instead of "client" or "instance".

The New in NewDRAPluginClient made no sense because it's not constructing
anything, and it returns a plugin, not a client -> GetDRAPlugin.
This moves wiping into the plugins store. The advantages are:
- Only a single mutex is needed in the plugin store.
- The code which decides about queuing and canceling wiping
  has access to all relevant information while that information
  is protected against concurrent modifications.
- It prepares for making that code more complex ("connection
  monitoring").

In retrospect it is not clear whether the RegistrationHandler did the right
thing in all cases when endpoints got registered and removed concurrently.

The RegistrationHandler became a thin, stateless layer on top of the store.
Because it's not really needed anymore, the required methods for the plugin
manager now get provided directly by the store.

The disadvantage is the slightly more complex initialization, but that was also
a problem before which just hadn't been solved: wiping ran without the context
of the manager. Now it does.
After merging with RegistrationHandler, the store abstraction is more than a
dumb map. It implements additional logic, so renaming it seems warranted.
To avoid confusion with other "plugin managers", the DRA prefix is used.
The plugin type gets updated accordingly.

This is done in a separate commit for ease of review. The "store" field is
kept because that really is just a dumb lookup structure.
Instead of creating the gRPC connection on demand and forcing gRPC to connect,
we establish it immediately and rely on gRPC to handle the underlying
connection automatically like it usually does.

It's not clear what benefit the one-second connection timeout had. As it is
now, gRPC calls still fail when the underlying connection cannot be
established. Needing a separate context for establishing that connection
just made the code more complex.
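
As an illustration of this approach, a minimal sketch of establishing a non-blocking gRPC client connection to a plugin's Unix-domain service socket; the socket path and the use of grpc.NewClient are assumptions for the example, not necessarily what the PR does:

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// connect creates the client connection immediately. grpc.NewClient does not
// block; gRPC establishes and re-establishes the underlying connection in the
// background, and individual calls fail if it cannot be established.
func connect(endpoint string) (*grpc.ClientConn, error) {
	// endpoint is an absolute path such as
	// "unix:///var/lib/kubelet/plugins/<driver>/dra.sock" (illustrative).
	conn, err := grpc.NewClient(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	conn.Connect() // start connecting right away instead of on the first call
	return conn, nil
}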

The DRAPluginManager is the central component which manages plugins. Making it responsible
for creating them reduces the number of places where a DRAPlugin struct needs to
be initialized. Doing this in the DRAPluginManager instead of a stand-alone function
simplifies the implementation of connection monitoring, because that will be
something that is tied to the DRAPluginManager state.
This ensures that ResourceSlices also get removed when a plugin becomes
unresponsive without removing its registration socket.

Tests are from kubernetes#131073 by Ed,
with some modifications; the implementation is new.
Conceptually TimedWorkersQueue is similar to the current code: it spawns
goroutines and cancels them. Using it makes the code a bit shorter, even though
the TimedWorkersQueue API could be a bit nicer and more consistent (key string
vs. WorkerArgs as parameters).

Depending on the tainteviction package is a bit odd, which is the reason why
TimedWorkersQueue wasn't already used earlier. But there don't seem to be other
implementations of this common
problem. https://pkg.go.dev/k8s.io/client-go/util/workqueue#TypedDelayingInterface
doesn't work because queue entries cannot be removed.

This doesn't really solve the problem of tracking goroutines for wiping because
TimedWorkersQueue doesn't support that. But not tracking is arguably better
than doing it wrong and this only affects unit tests, so it should be okay.
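
For context, a minimal sketch of the schedule-then-cancel pattern that TimedWorkerQueue provides, written here with plain timers; the type and method names are illustrative, not the tainteviction API:

import (
	"sync"
	"time"
)

// pendingWipes schedules a delayed ResourceSlice wipe per driver and cancels
// it when the driver comes back in time. Names and the callback are
// illustrative only.
type pendingWipes struct {
	mutex  sync.Mutex
	timers map[string]*time.Timer // key: driver name
	wipe   func(driverName string)
}

func newPendingWipes(wipe func(driverName string)) *pendingWipes {
	return &pendingWipes{timers: map[string]*time.Timer{}, wipe: wipe}
}

// schedule queues a wipe after the grace period unless one is already pending.
func (p *pendingWipes) schedule(driverName string, gracePeriod time.Duration) {
	p.mutex.Lock()
	defer p.mutex.Unlock()
	if _, pending := p.timers[driverName]; pending {
		return
	}
	p.timers[driverName] = time.AfterFunc(gracePeriod, func() {
		p.mutex.Lock()
		delete(p.timers, driverName)
		p.mutex.Unlock()
		p.wipe(driverName)
	})
}

// cancel stops a pending wipe, typically because the driver reconnected or
// re-registered in time.
func (p *pendingWipes) cancel(driverName string) {
	p.mutex.Lock()
	defer p.mutex.Unlock()
	if timer, pending := p.timers[driverName]; pending {
		timer.Stop()
		delete(p.timers, driverName)
	}
}
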
@pohly commented Jun 11, 2025

/test pull-kubernetes-e2e-kind-dra-canary pull-kubernetes-e2e-kind-dra-n-1-canary

@k8s-ci-robot (Contributor):

@pohly: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-cos-containerd-e2e-ubuntu-gce
/test pull-kubernetes-cmd
/test pull-kubernetes-cmd-canary
/test pull-kubernetes-cmd-go-canary
/test pull-kubernetes-conformance-kind-ga-only-parallel
/test pull-kubernetes-coverage-unit
/test pull-kubernetes-dependencies
/test pull-kubernetes-dependencies-go-canary
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce-100-performance
/test pull-kubernetes-e2e-gce-cos
/test pull-kubernetes-e2e-gce-cos-canary
/test pull-kubernetes-e2e-gce-cos-no-stage
/test pull-kubernetes-e2e-gce-network-proxy-http-connect
/test pull-kubernetes-e2e-gce-pull-through-cache
/test pull-kubernetes-e2e-gce-scale-performance-manual
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-kind-ipv6
/test pull-kubernetes-e2e-storage-kind-alpha-beta-features-slow
/test pull-kubernetes-integration
/test pull-kubernetes-integration-canary
/test pull-kubernetes-integration-go-canary
/test pull-kubernetes-kubemark-e2e-gce-scale
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-typecheck
/test pull-kubernetes-unit
/test pull-kubernetes-unit-go-canary
/test pull-kubernetes-update
/test pull-kubernetes-verify
/test pull-kubernetes-verify-go-canary

The following commands are available to trigger optional jobs:

/test check-dependency-stats
/test pull-crio-cgroupv1-node-e2e-eviction
/test pull-crio-cgroupv1-node-e2e-features
/test pull-crio-cgroupv1-node-e2e-hugepages
/test pull-crio-cgroupv1-node-e2e-resource-managers
/test pull-crio-cgroupv2-imagefs-separatedisktest
/test pull-crio-cgroupv2-node-e2e-eviction
/test pull-crio-cgroupv2-node-e2e-hugepages
/test pull-crio-cgroupv2-node-e2e-resource-managers
/test pull-crio-cgroupv2-splitfs-separate-disk
/test pull-e2e-gce-cloud-provider-disabled
/test pull-e2e-gci-gce-alpha-enabled-default
/test pull-kubernetes-apidiff
/test pull-kubernetes-apidiff-client-go
/test pull-kubernetes-conformance-image-test
/test pull-kubernetes-conformance-kind-ga-only
/test pull-kubernetes-conformance-kind-ipv6-parallel
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-crio-node-memoryqos-cgrpv2
/test pull-kubernetes-cross
/test pull-kubernetes-e2e-autoscaling-hpa-cm
/test pull-kubernetes-e2e-autoscaling-hpa-cpu
/test pull-kubernetes-e2e-autoscaling-hpa-cpu-alpha-beta
/test pull-kubernetes-e2e-capz-azure-disk
/test pull-kubernetes-e2e-capz-azure-disk-vmss
/test pull-kubernetes-e2e-capz-azure-disk-windows
/test pull-kubernetes-e2e-capz-azure-file
/test pull-kubernetes-e2e-capz-azure-file-vmss
/test pull-kubernetes-e2e-capz-azure-file-windows
/test pull-kubernetes-e2e-capz-conformance
/test pull-kubernetes-e2e-capz-master-windows-nodelogquery
/test pull-kubernetes-e2e-capz-windows-alpha-feature-vpa
/test pull-kubernetes-e2e-capz-windows-alpha-features
/test pull-kubernetes-e2e-capz-windows-master
/test pull-kubernetes-e2e-capz-windows-serial-slow
/test pull-kubernetes-e2e-capz-windows-serial-slow-hpa
/test pull-kubernetes-e2e-containerd-gce
/test pull-kubernetes-e2e-ec2
/test pull-kubernetes-e2e-ec2-arm64
/test pull-kubernetes-e2e-ec2-conformance
/test pull-kubernetes-e2e-ec2-conformance-arm64
/test pull-kubernetes-e2e-ec2-device-plugin-gpu
/test pull-kubernetes-e2e-gce-canary
/test pull-kubernetes-e2e-gce-correctness
/test pull-kubernetes-e2e-gce-cos-alpha-features
/test pull-kubernetes-e2e-gce-csi-serial
/test pull-kubernetes-e2e-gce-device-plugin-gpu
/test pull-kubernetes-e2e-gce-disruptive-canary
/test pull-kubernetes-e2e-gce-kubelet-credential-provider
/test pull-kubernetes-e2e-gce-network-policies
/test pull-kubernetes-e2e-gce-network-proxy-grpc
/test pull-kubernetes-e2e-gce-serial
/test pull-kubernetes-e2e-gce-serial-canary
/test pull-kubernetes-e2e-gce-storage-disruptive
/test pull-kubernetes-e2e-gce-storage-selinux
/test pull-kubernetes-e2e-gce-storage-slow
/test pull-kubernetes-e2e-gce-storage-snapshot
/test pull-kubernetes-e2e-gci-gce-autoscaling
/test pull-kubernetes-e2e-gci-gce-ingress
/test pull-kubernetes-e2e-gci-gce-ipvs
/test pull-kubernetes-e2e-gci-gce-kube-dns-nodecache
/test pull-kubernetes-e2e-gci-gce-nftables
/test pull-kubernetes-e2e-kind-alpha-beta-features
/test pull-kubernetes-e2e-kind-alpha-features
/test pull-kubernetes-e2e-kind-beta-features
/test pull-kubernetes-e2e-kind-canary
/test pull-kubernetes-e2e-kind-cloud-provider-loadbalancer
/test pull-kubernetes-e2e-kind-dependencies
/test pull-kubernetes-e2e-kind-dual-canary
/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-e2e-kind-ipv6-canary
/test pull-kubernetes-e2e-kind-ipvs
/test pull-kubernetes-e2e-kind-kms
/test pull-kubernetes-e2e-kind-multizone
/test pull-kubernetes-e2e-kind-nftables
/test pull-kubernetes-e2e-relaxed-environment-variable-validation
/test pull-kubernetes-e2e-storage-kind-disruptive
/test pull-kubernetes-e2e-storage-kind-volume-group-snapshots
/test pull-kubernetes-e2e-unit-dependencies
/test pull-kubernetes-integration-race
/test pull-kubernetes-kind-dra
/test pull-kubernetes-kind-dra-all
/test pull-kubernetes-kind-dra-all-canary
/test pull-kubernetes-kind-dra-canary
/test pull-kubernetes-kind-dra-n-1-canary
/test pull-kubernetes-kind-json-logging
/test pull-kubernetes-kind-text-logging
/test pull-kubernetes-kubemark-e2e-gce-big
/test pull-kubernetes-linter-hints
/test pull-kubernetes-local-e2e
/test pull-kubernetes-node-arm64-e2e-containerd-ec2
/test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
/test pull-kubernetes-node-arm64-ubuntu-serial-gce
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv2-e2e
/test pull-kubernetes-node-crio-cgrpv2-e2e-canary
/test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
/test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e
/test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
/test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial
/test pull-kubernetes-node-crio-e2e
/test pull-kubernetes-node-e2e-alpha-ec2
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-containerd-1-7-dra-canary
/test pull-kubernetes-node-e2e-containerd-2-0-dra
/test pull-kubernetes-node-e2e-containerd-2-0-dra-canary
/test pull-kubernetes-node-e2e-containerd-alpha-features
/test pull-kubernetes-node-e2e-containerd-ec2
/test pull-kubernetes-node-e2e-containerd-features
/test pull-kubernetes-node-e2e-containerd-features-kubetest2
/test pull-kubernetes-node-e2e-containerd-kubelet-psi
/test pull-kubernetes-node-e2e-containerd-kubetest2
/test pull-kubernetes-node-e2e-containerd-serial-ec2
/test pull-kubernetes-node-e2e-containerd-serial-ec2-eks
/test pull-kubernetes-node-e2e-containerd-standalone-mode
/test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
/test pull-kubernetes-node-e2e-cri-proxy-serial
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra-canary
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra-canary
/test pull-kubernetes-node-e2e-resource-health-status
/test pull-kubernetes-node-kubelet-containerd-flaky
/test pull-kubernetes-node-kubelet-credential-provider
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
/test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
/test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-podresources
/test pull-kubernetes-node-kubelet-serial-topology-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
/test pull-kubernetes-node-swap-conformance-fedora-serial
/test pull-kubernetes-node-swap-conformance-ubuntu-serial
/test pull-kubernetes-node-swap-fedora
/test pull-kubernetes-node-swap-fedora-serial
/test pull-kubernetes-node-swap-ubuntu-serial
/test pull-kubernetes-scheduler-perf
/test pull-kubernetes-unit-windows-master
/test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

pull-kubernetes-cmd
pull-kubernetes-conformance-kind-ga-only-parallel
pull-kubernetes-dependencies
pull-kubernetes-e2e-ec2
pull-kubernetes-e2e-gce
pull-kubernetes-e2e-kind
pull-kubernetes-e2e-kind-ipv6
pull-kubernetes-integration
pull-kubernetes-kind-dra
pull-kubernetes-kind-dra-all
pull-kubernetes-linter-hints
pull-kubernetes-node-e2e-containerd
pull-kubernetes-node-e2e-crio-cgrpv1-dra
pull-kubernetes-typecheck
pull-kubernetes-unit
pull-kubernetes-unit-windows-master
pull-kubernetes-verify

In response to this:

/test pull-kubernetes-e2e-kind-dra-canary pull-kubernetes-e2e-kind-dra-n-1-canary

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pohly commented Jun 11, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-canary

@pohly commented Jun 11, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary pull-kubernetes-kind-dra-canary

1 similar comment
@pohly commented Jun 11, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary pull-kubernetes-kind-dra-canary

@pohly commented Jun 12, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary

2 similar comments
@pohly commented Jun 12, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary

@pohly commented Jun 12, 2025

/test pull-kubernetes-kind-dra-n-1-canary pull-kubernetes-kind-dra-n-2-canary

@k8s-ci-robot (Contributor):

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-linter-hints aee4073 link false /test pull-kubernetes-linter-hints
pull-kubernetes-unit-windows-master aee4073 link false /test pull-kubernetes-unit-windows-master
pull-kubernetes-kind-dra-n-1-canary aee4073 link false /test pull-kubernetes-kind-dra-n-1-canary
pull-kubernetes-kind-dra-n-2-canary aee4073 link false /test pull-kubernetes-kind-dra-n-2-canary

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 🏗 In progress
Status: Archive-it
Development

Successfully merging this pull request may close these issues.

DRA: detect stale DRA plugin sockets
4 participants