[WIP] Dra device health status #130606

Jpsassine · 2025-03-06T01:09:57Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements KEP-4680 by adding device health tracking for DRA plugins and integrating this status into the PodStatus. It allows Kubelet and users to observe the health of allocated DRA resources.

Optional gRPC Health Service (dra-health/v1alpha1):
- Defines a new NodeHealthService with a WatchResources stream RPC in staging/src/k8s.io/kubelet/pkg/apis/dra-health/v1alpha1/api.proto.
- DRA plugins can optionally implement this service to stream device health updates (pool_name, device_name, health, Last_updated) to Kubelet.
Kubelet Plugin Integration (cm/dra/plugin):
- RegistrationHandler (registration.go) updated to:
  - Attempt to start the WatchResources stream upon plugin registration via plugin.WatchResources.
  - Handle plugins that don't implement the NodeHealthService (or other stream startup errors) by logging the error and proceeding with registration without health monitoring.
  - Launch a handler goroutine (via StreamHandler interface implemented by dra.ManagerImpl) in the DRA Manager upon successful stream initiation.
  - Cancel the health stream context (HealthStreamCancel) during plugin deregistration (DeregisterPlugin) or replacement.
Health Cache (cm/dra/healthinfo.go, cm/dra/state/state.go):
- Implements healthInfoCache for persistent (dra_health_state file), thread-safe storage of device health (Healthy, Unhealthy, Unknown) and timestamps.
- Includes updateHealthInfo for full-state reconciliation based on plugin updates, handling timeouts (healthTimeout constant, e.g., 30s) by marking stale devices as "Unknown". Saves checkpoint on change.
- Includes getHealthInfo to retrieve current status (returns "Unknown" if stale/missing) and clearDriver for cleanup.
DRA Manager Integration (cm/dra/manager.go):
- ManagerImpl implements the plugin.StreamHandler interface.
- HandleWatchResourcesStream goroutine consumes updates from the plugin stream (NodeHealth.WatchResourcesClient), calls healthInfoCache.updateHealthInfo, finds affected Pod UIDs from claimInfoCache, and sends notifications via an internal update channel (non-blocking).
- Added defer healthInfoCache.clearDriver(pluginName) within HandleWatchResourcesStream to ensure cache cleanup for the driver upon goroutine exit (due to error, cancellation, or EOF).
- Adds Updates() method to return the update channel.
- Updates UpdateAllocatedResourcesStatus to read health from healthInfoCache and populate pod.Status.ContainerStatuses[].AllocatedResourcesStatus using the KEP-specified structure: v1.ResourceStatus (named by claim (Name field), containing a Resources slice where each element is a v1.ResourceHealth struct per device (ResourceID and Health fields)).
Container Manager Integration (cm/container_manager_linux.go):
- Updates() method merges update signals from Device Manager and DRA Manager (via draManager.Updates()).
- UpdateAllocatedResourcesStatus() method calls both Device Manager and DRA Manager update functions (DRA call guarded by feature gate DynamicResourceAllocation and nil check).
Testing:
- Adds unit tests for healthinfo.go (healthinfo_test.go).
- Adds unit tests for manager.go (manager_test.go) covering HandleWatchResourcesStream and UpdateAllocatedResourcesStatus. Fixes existing tests (TestPrepareResources, TestUnprepareResources) to align with updated signatures/logic.
- WIP E2E Test

Which issue(s) this PR fixes:

Fixes #126243

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4680-add-resource-health-to-pod-status/README.md

This Health Cache should account for lingering devices and ensure that they are properly makred as healthy or unhleahty or unknown. Changes to state.go were necessary to support this tracking of lingering machines via a last updated field. Initial file for unit testing created and first unit test passing.

linux-foundation-easycla · 2025-03-06T01:10:01Z

The committers listed above are authorized under a signed CLA.

✅ login: Jpsassine / name: JP (9fa282c, c81a4c0, 2671894, b358ac8, 8ffca04, bfb31e7, 7b142ca, 0fe0b4f, ce58745, 41bef3a, 1721aae, ff07c2d, 2399d37, 608460a, d02371a, 0d50b97, a8152a8, dbb37b3, 381da5f, 0b1dcab, 7c4674f, 5fb8a27, 936f530)

k8s-ci-robot · 2025-03-06T01:10:06Z

Hi @Jpsassine. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-03-06T01:10:06Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

googs1025 · 2025-03-06T02:58:26Z

/ok-to-test

bart0sh · 2025-03-10T09:00:24Z

@Jpsassine Thank you for your PR. Please sign the CLA to proceed further, thanks.

BTW, is there any KEP or another design document describing/discussing these changes? If so, please provide links in the PR description.

k8s-ci-robot · 2025-06-12T20:08:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Jpsassine
Once this PR has been reviewed and has the lgtm label, please assign klueska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

SergeyKanzhelev · 2025-06-12T20:24:05Z

/check-required-labels

k8s-ci-robot · 2025-06-13T21:27:19Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jpsassine added 2 commits March 5, 2025 02:01

Add full more unit tests for full coverage

bfb31e7

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 6, 2025

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 6, 2025

k8s-ci-robot requested review from haircommander and tzneal March 6, 2025 01:10

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Mar 6, 2025

github-project-automation bot added this to SIG Node: code and documentation PRs Mar 6, 2025

k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 6, 2025

github-project-automation bot added this to Dynamic Resource Allocation Mar 6, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Mar 6, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Mar 6, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 6, 2025

bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Mar 10, 2025

k8s-ci-robot removed the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 11, 2025

github-project-automation bot added this to SIG Apps and SIG Auth Jun 12, 2025

k8s-ci-robot added sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Jun 12, 2025

github-project-automation bot added this to SIG CLI Jun 12, 2025

github-project-automation bot moved this to Needs Triage in SIG Apps Jun 12, 2025

k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jun 12, 2025

github-project-automation bot moved this to Needs Triage in SIG CLI Jun 12, 2025

Jpsassine force-pushed the dra_device_health_status branch from f76f9db to 608460a Compare June 12, 2025 20:08

enj moved this to Needs Triage in SIG Auth Jun 13, 2025

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] Dra device health status #130606

[WIP] Dra device health status #130606

Jpsassine commented Mar 6, 2025 •

edited

Loading

Uh oh!

linux-foundation-easycla bot commented Mar 6, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Mar 6, 2025

Uh oh!

k8s-ci-robot commented Mar 6, 2025

Uh oh!

googs1025 commented Mar 6, 2025

Uh oh!

bart0sh commented Mar 10, 2025

Uh oh!

k8s-ci-robot commented Jun 12, 2025

Uh oh!

SergeyKanzhelev commented Jun 12, 2025

Uh oh!

k8s-ci-robot commented Jun 13, 2025

Uh oh!

Uh oh!

[WIP] Dra device health status #130606

Are you sure you want to change the base?

[WIP] Dra device health status #130606

Conversation

Jpsassine commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

linux-foundation-easycla bot commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Mar 6, 2025

Uh oh!

k8s-ci-robot commented Mar 6, 2025

Uh oh!

googs1025 commented Mar 6, 2025

Uh oh!

bart0sh commented Mar 10, 2025

Uh oh!

k8s-ci-robot commented Jun 12, 2025

Uh oh!

SergeyKanzhelev commented Jun 12, 2025

Uh oh!

k8s-ci-robot commented Jun 13, 2025

Uh oh!

Uh oh!

Jpsassine commented Mar 6, 2025 •

edited

Loading

linux-foundation-easycla bot commented Mar 6, 2025 •

edited

Loading