
Fix several goroutine leaks on controllers #131199

Open - wants to merge 9 commits into base: master

Conversation

@ntnn commented Apr 8, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

As reported in kcp-dev/kcp#3350, several controllers leak goroutines: they register handlers with informers but do not clean those registrations up on shutdown.

I presume this also occurs in Kubernetes on leader changes.
When the ResourceQuota controller starts and stops QuotaMonitors, each stopped QuotaMonitor leaks goroutines as well.

Additionally, if the informer returns an error and does not actually register the handlers, the controller will do nothing, with almost no indication as to why, because the errors are silently ignored.

Fix several goroutine leaks in controllers on controller shutdown
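
To illustrate the pattern this PR follows, here is a minimal, hypothetical sketch (illustrative names, not the actual diff): the controller keeps the registration handle returned by the informer and removes it when it shuts down, instead of discarding both the handle and the error.

package example

import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

// exampleController is a hypothetical controller used only to illustrate the
// pattern: keep the registration handle and remove it on shutdown.
type exampleController struct {
	informer     cache.SharedIndexInformer
	registration cache.ResourceEventHandlerRegistration
}

func newExampleController(informer cache.SharedIndexInformer) (*exampleController, error) {
	c := &exampleController{informer: informer}
	// AddEventHandler returns a registration handle and an error; previously
	// many controllers discarded both.
	reg, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { /* enqueue obj */ },
	})
	if err != nil {
		return nil, err
	}
	c.registration = reg
	return c, nil
}

func (c *exampleController) Run(ctx context.Context) {
	defer func() {
		// Unregister the handler so the informer does not keep goroutines
		// alive for a controller that has already stopped.
		utilruntime.HandleError(c.informer.RemoveEventHandler(c.registration))
	}()
	<-ctx.Done()
}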

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

I'd be happy to split this PR into multiple PRs, e.g. one per controller.

Does this PR introduce a user-facing change?

Some functions that create controllers now return the error from handler registration; previously it was ignored.
IMHO returning the errors is the correct choice, as without the handlers the controllers would do nothing.
Alternatively, the errors could be logged using the (util)runtime.HandleError function.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 8, 2025
linux-foundation-easycla bot commented Apr 8, 2025

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2025
@k8s-ci-robot (Contributor)

Welcome @ntnn!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 8, 2025
@k8s-ci-robot (Contributor)

Hi @ntnn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 8, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ntnn
Once this PR has been reviewed and has the lgtm label, please assign caseydavenport, deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kube-proxy sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 8, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 8, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 8, 2025
@aojea (Member) commented Apr 8, 2025

why is the context not cancelling the handlers?

func (p *sharedProcessor) run(ctx context.Context) {
	func() {
		p.listenersLock.RLock()
		defer p.listenersLock.RUnlock()
		for listener := range p.listeners {
			p.wg.Start(listener.run)
			p.wg.Start(listener.pop)
		}
		p.listenersStarted = true
	}()
	<-ctx.Done()

	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	for listener := range p.listeners {
		close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
	}

	// Wipe out list of listeners since they are now closed
	// (processorListener cannot be re-used)
	p.listeners = nil

	// Reset to false since no listeners are running
	p.listenersStarted = false

	p.wg.Wait() // Wait for all .pop() and .run() to stop
}

@ntnn (Author) commented Apr 8, 2025

why is the context not cancelling the handlers?

Because the context is for the informer, not the controllers.

Meaning that if the informer / informer factory is shut down, the handlers are also shut down.

However if a controller is stopped - but not the informer - then the goroutines for the handlers keep running.
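
A minimal reproduction sketch of that situation (hypothetical code using a fake clientset, not taken from this PR): the factory outlives the controller, so a handler that is never removed keeps its listener goroutines running after the controller's context is cancelled.

package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The informer factory has its own stop channel and keeps running.
	factoryStop := make(chan struct{})
	factory := informers.NewSharedInformerFactory(fake.NewSimpleClientset(), 0)
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(factoryStop)

	// The "controller" has its own lifetime and registers a handler that it
	// never removes.
	ctrlCtx, stopController := context.WithCancel(context.Background())
	_, _ = podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {},
	})
	go func() { <-ctrlCtx.Done() }() // stand-in for the controller's Run loop

	stopController()
	// The controller is gone, but the listener goroutines for its handler keep
	// running until factoryStop is closed (or the handler is removed).
	time.Sleep(100 * time.Millisecond)
	close(factoryStop)
}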

@aojea (Member) commented Apr 8, 2025

However if a controller is stopped - but not the informer - then the goroutines for the handlers keep running.

when does this happen in the controller manager? I don't remember the controllers being stopped on their own - isn't the whole context cancelled at the same time?

@ntnn (Author) commented Apr 8, 2025

E.g. the QuotaMonitors leak goroutines whenever a monitor is stopped, as the handlers are registered but never cleaned up:

shared.Informer().AddEventHandlerWithResyncPeriod(handlers, qm.resyncPeriod())

QuotaMonitors are regularly synced; monitors that are no longer needed are removed by closing their stopCh:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/resourcequota/resource_quota_monitor.go#L238

Through the wait.Until... here:

if err := rq.resyncMonitors(ctx, newResources); err != nil {

That's calling .resyncMonitors and in turn .SyncMonitors here:

// resyncMonitors starts or stops quota monitors as needed to ensure that all
// (and only) those resources present in the map are monitored.
func (rq *Controller) resyncMonitors(ctx context.Context, resources map[schema.GroupVersionResource]struct{}) error {
	if rq.quotaMonitor == nil {
		return nil
	}
	if err := rq.quotaMonitor.SyncMonitors(ctx, resources); err != nil {
		return err
	}
	rq.quotaMonitor.StartMonitors(ctx)
	return nil
}

This happens whenever quota-relevant GVRs are removed:

// GetQuotableResources returns all resources that the quota system should recognize.
// It requires a resource supports the following verbs: 'create','list','delete'
// This function may return both results and an error. If that happens, it means that the discovery calls were only
// partially successful. A decision about whether to proceed or not is left to the caller.
func GetQuotableResources(discoveryFunc NamespacedResourcesFunc) (map[schema.GroupVersionResource]struct{}, error) {
	possibleResources, discoveryErr := discoveryFunc()
	if discoveryErr != nil && len(possibleResources) == 0 {
		return nil, fmt.Errorf("failed to discover resources: %v", discoveryErr)
	}
	quotableResources := discovery.FilteredBy(discovery.SupportsAllVerbs{Verbs: []string{"create", "list", "watch", "delete"}}, possibleResources)
	quotableGroupVersionResources, err := discovery.GroupVersionResources(quotableResources)
	if err != nil {
		return nil, fmt.Errorf("failed to parse resources: %v", err)
	}
	// return the original discovery error (if any) in addition to the list
	return quotableGroupVersionResources, discoveryErr
}

That is one case that likely happens in production.

As for the controller manager I'd have to check - likely it doesn't matter there, because the controllers and informers are shut down with everything else when the node is restarting.

For KCP this happens because KCP starts and stops controllers dynamically for workspaces - the documentation does a better job at explaining workspaces than I would: https://docs.kcp.io/kcp/v0.27/concepts/workspaces/

Imho it doesn't matter that the informer cancels any handlers still registered on shutdown, since controllers should clean up any resources they acquire when shutting down. Both cases make sense (the handlers being stopped by the informer when the informer quits, as well as the controller stopping the handlers when the controller shuts down).

@ntnn (Author) commented Apr 8, 2025

As far as the controller manager goes - these leaks do affect it.

This is the function starting the controllers and informers:

run := func(ctx context.Context, controllerDescriptors map[string]*ControllerDescriptor) {
	controllerContext, err := CreateControllerContext(ctx, c, rootClientBuilder, clientBuilder)
	if err != nil {
		logger.Error(err, "Error building controller context")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}
	if err := StartControllers(ctx, controllerContext, controllerDescriptors, unsecuredMux, healthzHandler); err != nil {
		logger.Error(err, "Error starting controllers")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}

	controllerContext.InformerFactory.Start(stopCh)
	controllerContext.ObjectOrMetadataInformerFactory.Start(stopCh)
	close(controllerContext.InformersStarted)

	<-ctx.Done()
}

Notice that StartControllers uses the passed ctx - the informer factories, however, get the stopCh, which is defined here:

This means that the controllers are also leaking goroutines in the controller manager every time the leader changes.

Edit: That should be fixed as well, because every time the leader changes the informer factories will leak their goroutines. I could add a commit to replace the channel with context cancellation unless there are arguments against it.
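
For reference, a sketch of what that commit could look like, reusing the names from the snippet above and only swapping the stop channel for the context (an assumption about the change, not a tested diff):

run := func(ctx context.Context, controllerDescriptors map[string]*ControllerDescriptor) {
	controllerContext, err := CreateControllerContext(ctx, c, rootClientBuilder, clientBuilder)
	if err != nil {
		logger.Error(err, "Error building controller context")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}
	if err := StartControllers(ctx, controllerContext, controllerDescriptors, unsecuredMux, healthzHandler); err != nil {
		logger.Error(err, "Error starting controllers")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}

	// ctx.Done() instead of the separate stopCh, so the informer factories
	// stop together with the controllers when the context is cancelled.
	controllerContext.InformerFactory.Start(ctx.Done())
	controllerContext.ObjectOrMetadataInformerFactory.Start(ctx.Done())
	close(controllerContext.InformersStarted)

	<-ctx.Done()
}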

@aojea (Member) commented Apr 8, 2025

This means that the controllers are also leaking goroutines in the controller manager every time the leader changes.

when the leader changes the process exits

OnStoppedLeading: func() {
	logger.Error(nil, "leaderelection lost")
	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
},

@ntnn (Author) commented Apr 9, 2025

when the leader changes the process exits

Ah, missed that; so it doesn't matter for the controller manager.

@aojea (Member) commented Apr 9, 2025

when the leader changes the process exits

Ah, missed that; so it doesn't matter for the controller manager.

I'm just trying to understand the changes and the motivation. We went through a lot to avoid goroutine leaks in controllers in the apiserver, see #108483, so we should not leak goroutines.

That said, we need to understand the benefit of the change and the ROI, since touching these critical pieces can introduce subtle bugs and the blast radius is very large ... I do not see any risk of goroutine leaks if the context termination leads to stopping the process

@ntnn (Author) commented Apr 9, 2025

I more than understand the hesitation about a change of this impact, but I'd rather put the "best practice" solution forward for discussion - this could definitely happen for other projects, and also for kube itself if the lifecycle of controllers is ever changed.

@ntnn (Author) commented Apr 9, 2025

I had another idea to keep going with the context cancellation - the shared informer could get a method like this:

AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error

Either with or without returning a registration handle - though if the handler is removed through the context provided by the controller, the returned registration is superfluous.

A very simple implementation would just launch a goroutine that blocks until the context is cancelled and removes the handler:

func (s *sharedIndexInformer) AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error {
	registration, err := s.AddEventHandlerWithOptions(handler, options)
	if err != nil {
		return err
	}
	go func() {
		// Block until the controller's context is cancelled, then remove the handler.
		<-ctx.Done()
		utilruntime.HandleError(s.RemoveEventHandler(registration))
	}()
	return nil
}

And errors could still be handed off to utilruntime.HandleError to not change the function signatures of the controllers.
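
A hypothetical usage sketch from a controller's point of view, assuming such a method existed on the shared informer: the handler's lifetime is tied to the controller's context and no registration handle has to be kept around.

func startExampleController(ctx context.Context, podInformer cache.SharedIndexInformer) error {
	// The registration is removed automatically once ctx is cancelled, so the
	// controller does not need to track it.
	return podInformer.AddContextEventHandlerWithOptions(ctx, cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue obj */ },
		DeleteFunc: func(obj interface{}) { /* enqueue obj */ },
	}, cache.HandlerOptions{})
}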

@Jefftree (Member) commented Apr 9, 2025

That said, we need to understand the benefit of the change and the ROI, since touching these critical pieces can introduce subtle bugs and the blast radius is very large ... I do not see any risk of goroutine leaks if the context termination leads to stopping the process

In the future, I would like to explore an option to transition leaders without stopping the process, so fixing goroutine leaks would be important. I have similar questions as @aojea though; I was also under the impression that shutting down an informer factory would clean up these handlers. Is there no way of making this fix without having to rewrite every controller?

@ntnn (Author) commented Apr 9, 2025

Stopping the informer does clean up those handlers - and cleaning the handlers up can only be done by either updating the controllers or by shutting down the informer.

At least at the moment it doesn't make a difference for kube; even if you went into leader migration without stopping the former leader's process, you would likely shut down the informer (e.g. by changing the stopCh here to also use ctx.Done()).

However if the informer should be kept running on non-leaders the handlers would leak on a leader migration.

Tbh I think the option with AddContextEventHandler.. would be the cleanest solution.

@Jefftree (Member) commented Apr 9, 2025

Ah thanks for the explanation. So this only happens for very specific controllers that can exit while keeping the informerFactory running. None of the KCM controllers should fall into that bucket then?

However if the informer should be kept running on non-leaders the handlers would leak on a leader migration.

Tbh I think the option with AddContextEventHandler.. would be the cleanest solution.

I recently wrote an apiserver controller that falls into the exact condition you described - it must be able to stop and start without touching the informerFactory (https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/leaderelection/leaderelection_controller.go#L80-L89) - and removing the handler does feel a bit hacky. AddContextEventHandler (or something similar) is a pattern we could benefit from.

@ntnn (Author) commented Apr 9, 2025

None of the KCM controllers should fall into that bucket then?

Correct - nothing started by the controller manager is directly affected, because the controller manager shuts down both the informers and the controllers at the same time.

removing the handler does feel a bit hacky.

Yeah, that's how I felt as well, also about the .ShutDown methods in this PR. Your solution is effectively the same as what I did in this PR.

AddContextEventHandler (or something similar) is a pattern we could benefit from.

I'll prep a PR tomorrow

@aojea (Member) commented Apr 9, 2025

I'll prep a PR tomorrow

please don't make API-incompatible changes to those informer methods, it is very painful for downstream projects to rewrite all the controllers just to adapt to a new field

@cici37 (Contributor) commented Apr 15, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2025
@aramase aramase moved this from Needs Triage to Pre-triage follow-up in SIG Auth May 12, 2025
@micahhausler (Member)

It's been ~two months since the last activity here - is this still being worked on?

@ntnn (Author) commented Jun 9, 2025

Sort of - for controllers to unregister handlers upon shutdown they either need to retain the handler registration and unregister it (like in this PR), or AddContextEventHandler gets added to support context-aware handlers (as in the linked PR, after which I'd update this PR).
@aojea had also mentioned that he wanted a more holistic solution for controllers with lifecycles - which would also be great.

I'm basically just waiting for reviews and for PRs to be merged or closed.

@aojea (Member) commented Jun 9, 2025

@aojea had also mentioned that he wanted a more holistic solution for controllers with lifecycles - which would also be great.

My bad, sorry, this slipped. Regarding the changes in the controller manager: those are not required, as the controller manager does not leak goroutines #131199 (comment)

Regarding the holistic solution, there have been similar requests lately, so I was wondering whether it would be better to build a broader solution around this whole problem or whether this context cancellation is enough ... but you need to involve the API machinery leads for that, @jpbetz and @deads2k (Joe is out at least a few weeks IIRC)

Labels
area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Needs Triage
Status: Pre-triage follow-up
6 participants