Skip to content

[WIP] Dra device health status #130606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

Jpsassine
Copy link

@Jpsassine Jpsassine commented Mar 6, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements KEP-4680 by adding device health tracking for DRA plugins and integrating this status into the PodStatus. It allows Kubelet and users to observe the health of allocated DRA resources.

  1. Optional gRPC Health Service (dra-health/v1alpha1):

    • Defines a new NodeHealthService with a WatchResources stream RPC in staging/src/k8s.io/kubelet/pkg/apis/dra-health/v1alpha1/api.proto.
    • DRA plugins can optionally implement this service to stream device health updates (pool_name, device_name, health, Last_updated) to Kubelet.
  2. Kubelet Plugin Integration (cm/dra/plugin):

    • RegistrationHandler (registration.go) updated to:
      • Attempt to start the WatchResources stream upon plugin registration via plugin.WatchResources.
      • Handle plugins that don't implement the NodeHealthService (or other stream startup errors) by logging the error and proceeding with registration without health monitoring.
      • Launch a handler goroutine (via StreamHandler interface implemented by dra.ManagerImpl) in the DRA Manager upon successful stream initiation.
      • Cancel the health stream context (HealthStreamCancel) during plugin deregistration (DeregisterPlugin) or replacement.
  3. Health Cache (cm/dra/healthinfo.go, cm/dra/state/state.go):

    • Implements healthInfoCache for persistent (dra_health_state file), thread-safe storage of device health (Healthy, Unhealthy, Unknown) and timestamps.
    • Includes updateHealthInfo for full-state reconciliation based on plugin updates, handling timeouts (healthTimeout constant, e.g., 30s) by marking stale devices as "Unknown". Saves checkpoint on change.
    • Includes getHealthInfo to retrieve current status (returns "Unknown" if stale/missing) and clearDriver for cleanup.
  4. DRA Manager Integration (cm/dra/manager.go):

    • ManagerImpl implements the plugin.StreamHandler interface.
    • HandleWatchResourcesStream goroutine consumes updates from the plugin stream (NodeHealth.WatchResourcesClient), calls healthInfoCache.updateHealthInfo, finds affected Pod UIDs from claimInfoCache, and sends notifications via an internal update channel (non-blocking).
    • Added defer healthInfoCache.clearDriver(pluginName) within HandleWatchResourcesStream to ensure cache cleanup for the driver upon goroutine exit (due to error, cancellation, or EOF).
    • Adds Updates() method to return the update channel.
    • Updates UpdateAllocatedResourcesStatus to read health from healthInfoCache and populate pod.Status.ContainerStatuses[].AllocatedResourcesStatus using the KEP-specified structure: v1.ResourceStatus (named by claim (Name field), containing a Resources slice where each element is a v1.ResourceHealth struct per device (ResourceID and Health fields)).
  5. Container Manager Integration (cm/container_manager_linux.go):

    • Updates() method merges update signals from Device Manager and DRA Manager (via draManager.Updates()).
    • UpdateAllocatedResourcesStatus() method calls both Device Manager and DRA Manager update functions (DRA call guarded by feature gate DynamicResourceAllocation and nil check).
  6. Testing:

    • Adds unit tests for healthinfo.go (healthinfo_test.go).
    • Adds unit tests for manager.go (manager_test.go) covering HandleWatchResourcesStream and UpdateAllocatedResourcesStatus. Fixes existing tests (TestPrepareResources, TestUnprepareResources) to align with updated signatures/logic.
    • WIP E2E Test

Which issue(s) this PR fixes:

Fixes #126243

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4680-add-resource-health-to-pod-status/README.md


This Health Cache should account for lingering devices and ensure that they are properly makred as healthy or unhleahty or unknown. Changes to state.go were necessary to support this tracking of lingering machines via a last updated field. Initial file for unit testing created and first unit test passing.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 6, 2025
Copy link

linux-foundation-easycla bot commented Mar 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 6, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @Jpsassine. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 6, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Mar 6, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 6, 2025
@googs1025
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 6, 2025
@bart0sh
Copy link
Contributor

bart0sh commented Mar 10, 2025

@Jpsassine Thank you for your PR. Please sign the CLA to proceed further, thanks.

BTW, is there any KEP or another design document describing/discussing these changes? If so, please provide links in the PR description.

@bart0sh bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Mar 10, 2025
@k8s-ci-robot k8s-ci-robot removed the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 11, 2025
@k8s-ci-robot k8s-ci-robot added area/code-generation area/dependency Issues or PRs related to dependency changes area/kube-proxy area/kubectl area/release-eng Issues or PRs related to the Release Engineering subproject kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Jun 12, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Jun 12, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG CLI Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jun 12, 2025
@Jpsassine Jpsassine force-pushed the dra_device_health_status branch from f76f9db to 608460a Compare June 12, 2025 20:08
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Jpsassine
Once this PR has been reviewed and has the lgtm label, please assign klueska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SergeyKanzhelev
Copy link
Member

/check-required-labels

@enj enj moved this to Needs Triage in SIG Auth Jun 13, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver area/cloudprovider area/code-generation area/dependency Issues or PRs related to dependency changes area/kube-proxy area/kubectl area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 🏗 In progress
Status: Needs Triage
Status: Needs Triage
Status: Needs Triage
Status: PRs Waiting on Author
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

6 participants