[Controllers Health Check] Implement controllers custom health check function with informer readiness #132137

@haoranleo

What would you like to be added?

Implement the HealthCheckable interface for controllers so that the health check includes informer cache readiness. Each controller would then have a real health check that reports whether it is running and ready to process requests.

Why is this needed?

Currently, controllers that do not define a custom health check function fall back to the default ping health checker, and no controller implements one today. As a result, the check always reports healthy regardless of the controller's actual running state.

check := controllerhealthz.NamedPingChecker(controllerName)
if ctrl != nil {
	// check if the controller supports and requests a debugHandler
	// and it needs the unsecuredMux to mount the handler onto.
	if debuggable, ok := ctrl.(controller.Debuggable); ok && unsecuredMux != nil {
		if debugHandler := debuggable.DebuggingHandler(); debugHandler != nil {
			basePath := "/debug/controllers/" + controllerName
			unsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler))
			unsecuredMux.UnlistedHandlePrefix(basePath+"/", http.StripPrefix(basePath, debugHandler))
		}
	}
	if healthCheckable, ok := ctrl.(controller.HealthCheckable); ok {
		if realCheck := healthCheckable.HealthChecker(); realCheck != nil {
			check = controllerhealthz.NamedHealthChecker(controllerName, realCheck)
		}
	}
}

Although controllers wait for the informer cache to sync before they start processing any requests, we lack visibility into whether a controller is actually ready.

For example, the job-controller waits for its caches to sync in the jm.Run function:

logger.Info("Starting job controller")
defer logger.Info("Shutting down job controller")
if !cache.WaitForNamedCacheSync("job", ctx.Done(), jm.podStoreSynced, jm.jobStoreSynced) {
	return
}
for i := 0; i < workers; i++ {
	go wait.UntilWithContext(ctx, jm.worker, time.Second)
	go wait.UntilWithContext(ctx, jm.orphanWorker, time.Second)
}

If the cache fails to sync, jm.Run returns, and the job-controller goroutine started here exits:

go jobController.Run(ctx, int(controllerContext.ComponentConfig.JobController.ConcurrentJobSyncs))

But the registered ping check still returns healthy. The registered health checks are also exposed as kubernetes_healthcheck metrics, which KCM emits through /metrics/slis, so a custom health check for each controller would give users better visibility into a controller's runtime health.

A similar issue, #128233, was filed earlier but did not get attention. This new issue focuses on improving the health check functions for each controller. Informer cache sync can be the first (and generic) check included in every custom health check function.
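A generic informer sync check could look roughly like the following sketch. It uses only the standard library, so the `UnnamedHealthChecker` and `HealthCheckable` interfaces are local stand-ins declared here for illustration (modeled on a checker that takes a request and returns an error), and `informerSyncChecker` and `jobController` are hypothetical names, not existing Kubernetes types.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
)

// UnnamedHealthChecker is a local stand-in: a check that returns an error
// when unhealthy.
type UnnamedHealthChecker interface {
	Check(req *http.Request) error
}

// HealthCheckable is a local stand-in for the interface a controller
// would implement to expose its own health check.
type HealthCheckable interface {
	HealthChecker() UnnamedHealthChecker
}

// informerSyncChecker (hypothetical) reports unhealthy while any of the
// named informer caches has not synced.
type informerSyncChecker struct {
	synced map[string]func() bool
}

// Check returns an error naming every cache that has not synced yet.
func (c informerSyncChecker) Check(_ *http.Request) error {
	var pending []string
	for name, hasSynced := range c.synced {
		if !hasSynced() {
			pending = append(pending, name)
		}
	}
	if len(pending) == 0 {
		return nil
	}
	sort.Strings(pending) // stable message regardless of map iteration order
	return fmt.Errorf("informer caches not synced: %s", strings.Join(pending, ", "))
}

// jobController is a hypothetical controller holding two synced functions,
// named after the fields used in the job controller snippet.
type jobController struct {
	podStoreSynced func() bool
	jobStoreSynced func() bool
}

// HealthChecker exposes the informer sync check for this controller.
func (jm *jobController) HealthChecker() UnnamedHealthChecker {
	return informerSyncChecker{synced: map[string]func() bool{
		"pods": jm.podStoreSynced,
		"jobs": jm.jobStoreSynced,
	}}
}

func main() {
	podsSynced := false
	jm := &jobController{
		podStoreSynced: func() bool { return podsSynced },
		jobStoreSynced: func() bool { return true },
	}

	var ctrl HealthCheckable = jm
	check := ctrl.HealthChecker()

	fmt.Println(check.Check(nil)) // pod cache still pending
	podsSynced = true
	fmt.Println(check.Check(nil)) // all caches synced
}
```

Because the checker only needs a set of synced functions, the same type could be reused by any controller. A real design would still have to decide how an unsynced cache during normal startup should be reported, which this sketch does not address.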

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.