
Fix several goroutine leaks on controllers #131199

Open - wants to merge 9 commits into base: master

Conversation

@ntnn commented Apr 8, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

As reported in kcp-dev/kcp#3350, several controllers leak goroutines: they register handlers with informers but do not clean those registrations up on shutdown.

I presume this also occurs in Kubernetes on leader changes.
When the ResourceQuota controller starts and stops QuotaMonitors, each stopped QuotaMonitor leaks goroutines as well.

Additionally, if the informer returns an error and does not actually register the handlers, the controller will do nothing, with almost no indication as to why, because the errors are silently ignored.

Fix several goroutine leaks in controllers on controller shutdown
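
To illustrate the pattern this PR follows, here is a minimal, hypothetical sketch (illustrative names, not the actual diff): the controller keeps the registration handle returned by the informer and removes it when it shuts down, instead of discarding both the handle and the error.

package example

import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

// exampleController is a hypothetical controller used only to illustrate the
// pattern: keep the registration handle and remove it on shutdown.
type exampleController struct {
	informer     cache.SharedIndexInformer
	registration cache.ResourceEventHandlerRegistration
}

func newExampleController(informer cache.SharedIndexInformer) (*exampleController, error) {
	c := &exampleController{informer: informer}
	// AddEventHandler returns a registration handle and an error; previously
	// many controllers discarded both.
	reg, err := informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { /* enqueue obj */ },
	})
	if err != nil {
		return nil, err
	}
	c.registration = reg
	return c, nil
}

func (c *exampleController) Run(ctx context.Context) {
	defer func() {
		// Unregister the handler so the informer does not keep goroutines
		// alive for a controller that has already stopped.
		utilruntime.HandleError(c.informer.RemoveEventHandler(c.registration))
	}()
	<-ctx.Done()
}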

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

I'd be happy to split this PR into multiple PRs, e.g. one per controller.

Does this PR introduce a user-facing change?

Some functions that create controllers now return the error from handler registration; previously it was ignored.
IMHO returning the errors is the correct choice, as without the handlers the controllers would do nothing.
Alternatively, the errors could be logged using the (util)runtime.HandleError function.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 8, 2025
linux-foundation-easycla bot commented Apr 8, 2025

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2025
@k8s-ci-robot (Contributor)

Welcome @ntnn!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 8, 2025
@k8s-ci-robot (Contributor)

Hi @ntnn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 8, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ntnn
Once this PR has been reviewed and has the lgtm label, please assign caseydavenport, deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/kube-proxy sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 8, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 8, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 8, 2025
@aojea (Member) commented Apr 8, 2025

why is the context not cancelling the handlers?

func (p *sharedProcessor) run(ctx context.Context) {
	func() {
		p.listenersLock.RLock()
		defer p.listenersLock.RUnlock()
		for listener := range p.listeners {
			p.wg.Start(listener.run)
			p.wg.Start(listener.pop)
		}
		p.listenersStarted = true
	}()
	<-ctx.Done()

	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()
	for listener := range p.listeners {
		close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
	}

	// Wipe out list of listeners since they are now closed
	// (processorListener cannot be re-used)
	p.listeners = nil

	// Reset to false since no listeners are running
	p.listenersStarted = false

	p.wg.Wait() // Wait for all .pop() and .run() to stop
}

@ntnn (Author) commented Apr 8, 2025

why is the context not cancelling the handlers?

Because the context is for the informer, not the controllers.

Meaning that if the informer / informer factory is shut down, the handlers are also shut down.

However if a controller is stopped - but not the informer - then the goroutines for the handlers keep running.
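
A minimal reproduction sketch of that situation (hypothetical code using a fake clientset, not taken from this PR): the factory outlives the controller, so a handler that is never removed keeps its listener goroutines running after the controller's context is cancelled.

package main

import (
	"context"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The informer factory has its own stop channel and keeps running.
	factoryStop := make(chan struct{})
	factory := informers.NewSharedInformerFactory(fake.NewSimpleClientset(), 0)
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(factoryStop)

	// The "controller" has its own lifetime and registers a handler that it
	// never removes.
	ctrlCtx, stopController := context.WithCancel(context.Background())
	_, _ = podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {},
	})
	go func() { <-ctrlCtx.Done() }() // stand-in for the controller's Run loop

	stopController()
	// The controller is gone, but the listener goroutines for its handler keep
	// running until factoryStop is closed (or the handler is removed).
	time.Sleep(100 * time.Millisecond)
	close(factoryStop)
}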

@aojea (Member) commented Apr 8, 2025

However if a controller is stopped - but not the informer - then the goroutines for the handlers keep running.

when does this happen in the controller manager? I don't remember the controllers being stopped on their own - isn't the whole context cancelled at the same time?

@ntnn (Author) commented Apr 8, 2025

E.g. the QuotaMonitors leak goroutines whenever a monitor is stopped, as the handlers are registered but never cleaned up:

shared.Informer().AddEventHandlerWithResyncPeriod(handlers, qm.resyncPeriod())

QuotaMonitors are regularly synced; monitors that are no longer needed are removed by closing their stopCh:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/resourcequota/resource_quota_monitor.go#L238

Through the wait.Until... here:

if err := rq.resyncMonitors(ctx, newResources); err != nil {

That's calling .resyncMonitors and in turn .SyncMonitors here:

// resyncMonitors starts or stops quota monitors as needed to ensure that all
// (and only) those resources present in the map are monitored.
func (rq *Controller) resyncMonitors(ctx context.Context, resources map[schema.GroupVersionResource]struct{}) error {
	if rq.quotaMonitor == nil {
		return nil
	}
	if err := rq.quotaMonitor.SyncMonitors(ctx, resources); err != nil {
		return err
	}
	rq.quotaMonitor.StartMonitors(ctx)
	return nil
}

This happens whenever quota-relevant GVRs are removed:

// GetQuotableResources returns all resources that the quota system should recognize.
// It requires a resource supports the following verbs: 'create','list','delete'
// This function may return both results and an error. If that happens, it means that the discovery calls were only
// partially successful. A decision about whether to proceed or not is left to the caller.
func GetQuotableResources(discoveryFunc NamespacedResourcesFunc) (map[schema.GroupVersionResource]struct{}, error) {
	possibleResources, discoveryErr := discoveryFunc()
	if discoveryErr != nil && len(possibleResources) == 0 {
		return nil, fmt.Errorf("failed to discover resources: %v", discoveryErr)
	}
	quotableResources := discovery.FilteredBy(discovery.SupportsAllVerbs{Verbs: []string{"create", "list", "watch", "delete"}}, possibleResources)
	quotableGroupVersionResources, err := discovery.GroupVersionResources(quotableResources)
	if err != nil {
		return nil, fmt.Errorf("failed to parse resources: %v", err)
	}
	// return the original discovery error (if any) in addition to the list
	return quotableGroupVersionResources, discoveryErr
}

That is one case that likely happens in production.

As for the controller manager I'd have to check - likely it doesn't matter there, because the controllers and informers are shut down with everything else when the node is restarting.

For KCP this happens because KCP starts and stops controllers dynamically for workspaces - the documentation does a better job at explaining workspaces than I would: https://docs.kcp.io/kcp/v0.27/concepts/workspaces/

Imho it doesn't matter that the informer cancels any handlers still registered on shutdown, since controllers should clean up any resources they acquire when shutting down. Both cases make sense (the handlers being stopped by the informer when the informer quits, as well as the controller stopping the handlers when the controller shuts down).

@ntnn (Author) commented Apr 8, 2025

As far as the controller manager goes - these leaks do affect it.

This is the function starting the controllers and informers:

run := func(ctx context.Context, controllerDescriptors map[string]*ControllerDescriptor) {
	controllerContext, err := CreateControllerContext(ctx, c, rootClientBuilder, clientBuilder)
	if err != nil {
		logger.Error(err, "Error building controller context")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}
	if err := StartControllers(ctx, controllerContext, controllerDescriptors, unsecuredMux, healthzHandler); err != nil {
		logger.Error(err, "Error starting controllers")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}

	controllerContext.InformerFactory.Start(stopCh)
	controllerContext.ObjectOrMetadataInformerFactory.Start(stopCh)
	close(controllerContext.InformersStarted)

	<-ctx.Done()
}

Notice that StartControllers uses the passed ctx - the informer factories, however, get the stopCh, which is defined here:

This means that the controllers are also leaking goroutines in the controller manager every time the leader changes.

Edit: That should be fixed as well, because every time the leader changes the informer factories will leak their goroutines. I could add a commit to replace the channel with context cancellation unless there are arguments against it.
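
For reference, a sketch of what that commit could look like, reusing the names from the snippet above and only swapping the stop channel for the context (an assumption about the change, not a tested diff):

run := func(ctx context.Context, controllerDescriptors map[string]*ControllerDescriptor) {
	controllerContext, err := CreateControllerContext(ctx, c, rootClientBuilder, clientBuilder)
	if err != nil {
		logger.Error(err, "Error building controller context")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}
	if err := StartControllers(ctx, controllerContext, controllerDescriptors, unsecuredMux, healthzHandler); err != nil {
		logger.Error(err, "Error starting controllers")
		klog.FlushAndExit(klog.ExitFlushTimeout, 1)
	}

	// ctx.Done() instead of the separate stopCh, so the informer factories
	// stop together with the controllers when the context is cancelled.
	controllerContext.InformerFactory.Start(ctx.Done())
	controllerContext.ObjectOrMetadataInformerFactory.Start(ctx.Done())
	close(controllerContext.InformersStarted)

	<-ctx.Done()
}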

@aojea (Member) commented Apr 8, 2025

This means that the controllers are also leaking goroutines in the controller manager every time the leader changes.

when the leader changes the process exits

OnStoppedLeading: func() {
	logger.Error(nil, "leaderelection lost")
	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
},

@ntnn (Author) commented Apr 9, 2025

when the leader changes the process exits

Ah, missed that; so it doesn't matter for the controller manager.

@aojea (Member) commented Apr 9, 2025

when the leader changes the process exits

Ah, missed that; so it doesn't matter for the controller manager.

I'm just trying to understand the changes and the motivation. We went through a lot to avoid goroutine leaks in controllers in the apiserver, see #108483, so we should not leak goroutines.

That said, we need to understand the benefit of the change and the ROI, since touching these critical pieces can introduce subtle bugs and the blast radius is very large ... I do not see any risk of goroutine leaks if the context termination leads to stopping the process

@ntnn (Author) commented Apr 9, 2025

I more than understand the hesitation about a change of this impact, but I'd rather put the "best practice" solution forward for discussion - this could definitely happen for other projects, and also for kube itself if the lifecycle of controllers is ever changed.

@ntnn (Author) commented Apr 9, 2025

I had another idea to keep going with the context cancellation - the shared informer could get a method like this:

AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error

Either with or without returning a registration handle - though if the handler is removed through the context provided by the controller, the returned registration is superfluous.

A very simple implementation would just launch a goroutine that blocks until the context is cancelled and removes the handler:

func (s *sharedIndexInformer) AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error {
	registration, err := s.AddEventHandlerWithOptions(handler, options)
	if err != nil {
		return err
	}
	go func() {
		// Block until the controller's context is cancelled, then remove the handler.
		<-ctx.Done()
		utilruntime.HandleError(s.RemoveEventHandler(registration))
	}()
	return nil
}

And errors could still be handed off to utilruntime.HandleError to not change the function signatures of the controllers.
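
A hypothetical usage sketch from a controller's point of view, assuming such a method existed on the shared informer: the handler's lifetime is tied to the controller's context and no registration handle has to be kept around.

func startExampleController(ctx context.Context, podInformer cache.SharedIndexInformer) error {
	// The registration is removed automatically once ctx is cancelled, so the
	// controller does not need to track it.
	return podInformer.AddContextEventHandlerWithOptions(ctx, cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue obj */ },
		DeleteFunc: func(obj interface{}) { /* enqueue obj */ },
	}, cache.HandlerOptions{})
}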

@Jefftree (Member) commented Apr 9, 2025

That said, we need to understand the benefit of the change and the ROI, since touching these critical pieces can introduce subtle bugs and the blast radius is very large ... I do not see any risk of goroutine leaks if the context termination leads to stopping the process

In the future, I would like to explore an option to transition leaders without stopping the process, so fixing goroutine leaks would be important. I have similar questions as @aojea though; I was also under the impression that shutting down an informer factory would clean up these handlers. Is there no way of making this fix without having to rewrite every controller?

@ntnn (Author) commented Apr 9, 2025

Stopping the informer does clean up those handlers - and cleaning the handlers up can only be done by either updating the controllers or by shutting down the informer.

At least at the moment it doesn't make a difference for kube; even if you went into leader migration without stopping the former leader's process, you would likely shut down the informer (e.g. by changing the stopCh here to also use ctx.Done()).

However if the informer should be kept running on non-leaders the handlers would leak on a leader migration.

Tbh I think the option with AddContextEventHandler.. would be the cleanest solution.

@Jefftree (Member) commented Apr 9, 2025

Ah thanks for the explanation. So this only happens for very specific controllers that can exit while keeping the informerFactory running. None of the KCM controllers should fall into that bucket then?

However if the informer should be kept running on non-leaders the handlers would leak on a leader migration.

Tbh I think the option with AddContextEventHandler.. would be the cleanest solution.

I recently wrote an apiserver controller that falls into the exact condition you described - it must be able to stop and start without touching the informerFactory (https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/leaderelection/leaderelection_controller.go#L80-L89) - and removing the handler does feel a bit hacky. AddContextEventHandler (or something similar) is a pattern we could benefit from.

@ntnn (Author) commented Apr 9, 2025

None of the KCM controllers should fall into that bucket then?

Correct - nothing started by the controller manager is directly affected, because the controller manager shuts down both the informers and the controllers at the same time.

removing the handler does feel a bit hacky.

Yeah, that's how I felt as well, also about the .ShutDown methods in this PR. Your solution is effectively the same as what I did in this PR.

AddContextEventHandler (or something similar) is a pattern we could benefit from.

I'll prep a PR tomorrow

@aojea (Member) commented Apr 9, 2025

I'll prep a PR tomorrow

please don't make API-incompatible changes to those informer methods, it is very painful for downstream projects to rewrite all the controllers just to adapt to a new field

@cici37 (Contributor) commented Apr 15, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 15, 2025
@aramase aramase moved this from Needs Triage to Pre-triage follow-up in SIG Auth May 12, 2025
@micahhausler (Member)

It's been ~two months since the last activity here - is this still being worked on?

@ntnn (Author) commented Jun 9, 2025

Sort of - for controllers to unregister handlers upon shutdown they either need to retain the handler registration and unregister it (like in this PR), or AddContextEventHandler gets added to support context-aware handlers (as in the linked PR, after which I'd update this PR).
@aojea had also mentioned that he wanted a more holistic solution for controllers with lifecycles - which would also be great.

I'm basically just waiting for reviews and for PRs to be merged or closed.

@aojea (Member) commented Jun 9, 2025

@aojea had also mentioned that he wanted a more holistic solution for controllers with lifecycles - which would also be great.

My bad, sorry, this slipped. Regarding the changes in the controller manager: those are not required, as the controller manager does not leak goroutines #131199 (comment)

Regarding the holistic solution, there have been similar requests lately, so I was wondering whether it would be better to build a broader solution around this whole problem or whether this context cancellation is enough ... but you need to involve the API machinery leads for that, @jpbetz and @deads2k (Joe is out at least a few weeks IIRC)

Labels
area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Needs Triage
Status: Pre-triage follow-up
6 participants