[WIP] Dra device health status #130606
This health cache should account for lingering devices and ensure that they are properly marked as healthy, unhealthy, or unknown. Changes to state.go were necessary to support tracking of lingering devices via a last-updated field. Initial unit test file created; the first unit test passes.
/ok-to-test
@Jpsassine Thank you for your PR. Please sign the CLA to proceed further, thanks. BTW, is there any KEP or another design document describing/discussing these changes? If so, please provide links in the PR description.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Jpsassine.
/check-required-labels
PR needs rebase.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements KEP-4680 by adding device health tracking for DRA plugins and integrating this status into the PodStatus. It allows Kubelet and users to observe the health of allocated DRA resources.
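To make the PodStatus integration more concrete, here is a rough sketch of assembling one status entry per claim, with one health entry per allocated device. The `ResourceStatus` and `ResourceHealth` types below are simplified local stand-ins for the `v1.ResourceStatus` and `v1.ResourceHealth` types in `k8s.io/api/core/v1`, and `buildStatuses` is a hypothetical helper, not a function from this PR.

```go
package main

import "fmt"

// ResourceHealth is a simplified stand-in for v1.ResourceHealth.
type ResourceHealth struct {
	ResourceID string
	Health     string
}

// ResourceStatus is a simplified stand-in for v1.ResourceStatus:
// one entry per claim, holding one ResourceHealth per device.
type ResourceStatus struct {
	Name      string
	Resources []ResourceHealth
}

// buildStatuses is a hypothetical helper. For each claim, in the order
// given, it emits a ResourceStatus whose Resources slice lists every
// allocated device together with the health returned by lookup.
func buildStatuses(claims []string, devices map[string][]string, lookup func(string) string) []ResourceStatus {
	out := make([]ResourceStatus, 0, len(claims))
	for _, claim := range claims {
		rs := ResourceStatus{Name: claim}
		for _, dev := range devices[claim] {
			rs.Resources = append(rs.Resources, ResourceHealth{ResourceID: dev, Health: lookup(dev)})
		}
		out = append(out, rs)
	}
	return out
}

func main() {
	// Assumed health data; a real lookup would consult the health cache.
	health := map[string]string{"gpu-0": "Healthy", "gpu-1": "Unhealthy"}
	lookup := func(id string) string {
		if h, ok := health[id]; ok {
			return h
		}
		return "Unknown" // devices with no report fall back to Unknown
	}
	statuses := buildStatuses(
		[]string{"claim-a"},
		map[string][]string{"claim-a": {"gpu-0", "gpu-1", "gpu-2"}},
		lookup,
	)
	for _, s := range statuses {
		fmt.Printf("%s: %+v\n", s.Name, s.Resources)
	}
}
```

Taking the lookup as a function keeps the sketch independent of how health is stored; in the Kubelet the equivalent data would come from the DRA manager's cache.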
Optional gRPC Health Service (`dra-health/v1alpha1`):
- `NodeHealthService` with a `WatchResources` stream RPC in `staging/src/k8s.io/kubelet/pkg/apis/dra-health/v1alpha1/api.proto`.
- […] (`pool_name`, `device_name`, `health`, `Last_updated`) to Kubelet.

Kubelet Plugin Integration (`cm/dra/plugin`):
- `RegistrationHandler` (`registration.go`) updated to:
  - […] `WatchResources` stream upon plugin registration via `plugin.WatchResources`.
  - […] `NodeHealthService` (or other stream startup errors) by logging the error and proceeding with registration without health monitoring.
  - […] (`StreamHandler` interface implemented by `dra.ManagerImpl`) in the DRA Manager upon successful stream initiation.
  - […] (`HealthStreamCancel`) during plugin deregistration (`DeregisterPlugin`) or replacement.

Health Cache (`cm/dra/healthinfo.go`, `cm/dra/state/state.go`):
- `healthInfoCache` for persistent (`dra_health_state` file), thread-safe storage of device health (`Healthy`, `Unhealthy`, `Unknown`) and timestamps.
- `updateHealthInfo` for full-state reconciliation based on plugin updates, handling timeouts (`healthTimeout` constant, e.g., 30s) by marking stale devices as "Unknown". Saves checkpoint on change.
- `getHealthInfo` to retrieve current status (returns "Unknown" if stale/missing) and `clearDriver` for cleanup.

DRA Manager Integration (`cm/dra/manager.go`):
- `ManagerImpl` implements the `plugin.StreamHandler` interface.
- `HandleWatchResourcesStream` goroutine consumes updates from the plugin stream (`NodeHealth.WatchResourcesClient`), calls `healthInfoCache.updateHealthInfo`, finds affected Pod UIDs from `claimInfoCache`, and sends notifications via an internal update channel (non-blocking).
- `defer healthInfoCache.clearDriver(pluginName)` within `HandleWatchResourcesStream` to ensure cache cleanup for the driver upon goroutine exit (due to error, cancellation, or EOF).
- `Updates()` method to return the update channel.
- `UpdateAllocatedResourcesStatus` to read health from `healthInfoCache` and populate `pod.Status.ContainerStatuses[].AllocatedResourcesStatus` using the KEP-specified structure: `v1.ResourceStatus` (named by claim (`Name` field), containing a `Resources` slice where each element is a `v1.ResourceHealth` struct per device (`ResourceID` and `Health` fields)).

Container Manager Integration (`cm/container_manager_linux.go`):
- `Updates()` method merges update signals from Device Manager and DRA Manager (via `draManager.Updates()`).
- `UpdateAllocatedResourcesStatus()` method calls both Device Manager and DRA Manager update functions (DRA call guarded by feature gate `DynamicResourceAllocation` and nil check).

Testing:
- `healthinfo.go` (`healthinfo_test.go`).
- `manager.go` (`manager_test.go`) covering `HandleWatchResourcesStream` and `UpdateAllocatedResourcesStatus`. Fixes existing tests (`TestPrepareResources`, `TestUnprepareResources`) to align with updated signatures/logic.

Which issue(s) this PR fixes:
Fixes #126243
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: