Fix several goroutine leaks on controllers #131199
base: master
Conversation
The committers listed above are authorized under a signed CLA.
Welcome @ntnn!
Hi @ntnn. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ntnn. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Why is the context not cancelling the handlers? kubernetes/staging/src/k8s.io/client-go/tools/cache/shared_informer.go, lines 884 to 909 in 195803c
Because the context is for the informer, not the controllers. Meaning if the informer / informer factory is shut down, the handlers would also be shut down. However, if a controller is stopped - but not the informer - then the goroutines for the handlers keep running.
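To make the lifetime mismatch concrete, here is a minimal sketch of the leak pattern (the controller name is made up, and it assumes a client-go version where AddEventHandler returns a registration handle): the controller's own goroutine exits when its stop channel closes, but the handler it registered stays attached to the shared informer.

```go
package example

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// leakyController is a hypothetical controller illustrating the leak: it
// registers a handler but never removes it again.
type leakyController struct {
	queue chan string
}

func (c *leakyController) Run(factory informers.SharedInformerFactory, stopCh <-chan struct{}) error {
	podInformer := factory.Core().V1().Pods().Informer()

	// The registration returned by AddEventHandler is discarded, so there is
	// no way to call RemoveEventHandler later.
	_, err := podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			select {
			case c.queue <- "add":
			default: // nobody drains the queue once the controller has stopped
			}
		},
	})
	if err != nil {
		return err
	}

	go func() {
		<-stopCh
		// The controller's goroutine exits here, but the handler above (and
		// the informer goroutines feeding it) keep running until the informer
		// itself is shut down.
	}()
	return nil
}
```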
When does this happen in the controller manager? I don't remember the controllers being stopped; isn't all of the context cancelled at the same time?
E.g. the QuotaMonitors leak goroutines whenever a monitor is stopped, as the handlers are registered but never cleaned up:
QuotaMonitors are regularly synced, and monitors to be removed are stopped by closing their stop channel. That sync path goes through kubernetes/pkg/controller/resourcequota/resource_quota_controller.go, lines 536 to 548 in ab3e83f.
This happens whenever quota-relevant GVRs are removed: kubernetes/pkg/controller/resourcequota/resource_quota_controller.go, lines 550 to 566 in ab3e83f
That is one case that likely happens in production. As for the controller manager I'd have to check - likely it doesn't matter there, because the controllers and informers are shut down with everything else when the node is restarting. For KCP this happens because KCP starts and stops controllers dynamically for workspaces - the documentation does a better job at explaining workspaces than I could: https://docs.kcp.io/kcp/v0.27/concepts/workspaces/ IMHO it doesn't matter that the informer cancels any handlers still registered on shutdown; controllers should clean up any resources they require when shutting down anyway. Both cases make sense: the informer stopping the handlers when it shuts down, and the controller removing its handlers when it shuts down.
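For reference, the cleanup pattern applied to the affected controllers looks roughly like this (a minimal sketch, not the actual diff; the controller name is made up): keep the registration handle returned by AddEventHandler and remove it when the controller shuts down.

```go
package example

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// tidyController is a hypothetical controller that unregisters its handler
// when it stops, so the informer can release the associated goroutines.
type tidyController struct {
	registration cache.ResourceEventHandlerRegistration
}

func (c *tidyController) Run(factory informers.SharedInformerFactory, stopCh <-chan struct{}) error {
	podInformer := factory.Core().V1().Pods().Informer()

	// Keep the registration handle so the handler can be removed again.
	reg, err := podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { /* enqueue work */ },
	})
	if err != nil {
		// Without the handler the controller would never see events, so
		// surface the error instead of silently ignoring it.
		return err
	}
	c.registration = reg

	go func() {
		<-stopCh
		// Unregister on shutdown so the informer stops delivering events and
		// the goroutines backing this handler can exit.
		if err := podInformer.RemoveEventHandler(c.registration); err != nil {
			utilruntime.HandleError(err)
		}
	}()
	return nil
}
```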
As far as the controller manager goes - these leaks do affect it. This is the function starting the controllers and informers: kubernetes/cmd/kube-controller-manager/app/controllermanager.go, lines 239 to 256 in ab3e83f
Notice that a stop channel is used there rather than the context. This means that the controllers are also leaking goroutines in the controller manager every time the leader changes. Edit: That should be fixed as well, because every time the leader changes the informer factories will leak their goroutines. I could add a commit to replace the channel with context cancellation unless there are arguments against it.
When the leader changes, the process exits: kubernetes/cmd/kube-controller-manager/app/controllermanager.go, lines 339 to 341 in ab3e83f
Ah, missed that; so it doesn't matter for the controller manager.
I'm just trying to understand the changes and the motivation. We went through a lot to avoid goroutine leaks on controllers in the apiserver, see #108483, so we should not leak goroutines. That said, we need to understand the benefit of the change and the ROI, since touching these critical pieces can have subtle bugs and the blast radius is very large ... I do not see any risk of goroutine leaks if the context termination leads to stopping the process.
I more than understand the hesitation on a change of this impact, but I'd rather put the "best practice" solution forward for discussion - this could definitely happen for other projects, and also for kube itself if the lifecycle of controllers is ever changed.
I had another idea to keep going with the context cancellation - the shared informer could get a method like this:

```go
AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error
```

Either with or without returning a handle - though if the handler is cancelled through the context provided by the controller, returning the registration is superfluous. A very simple implementation would just launch a goroutine that blocks until the context is cancelled and then removes the handler:

```go
func (s *sharedIndexInformer) AddContextEventHandlerWithOptions(ctx context.Context, handler ResourceEventHandler, options HandlerOptions) error {
	registration, err := s.AddEventHandlerWithOptions(handler, options)
	if err != nil {
		return err
	}
	go func() {
		<-ctx.Done()
		utilruntime.HandleError(s.RemoveEventHandler(registration))
	}()
	return nil
}
```

And errors could still be handed off to utilruntime.HandleError.
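Pending such a method in client-go, roughly the same effect can be had from the caller's side today; a hedged sketch (the helper name addEventHandlerUntilDone is made up, and it assumes a client-go version that already has AddEventHandlerWithOptions, as the snippet above does):

```go
package example

import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

// addEventHandlerUntilDone registers handler on informer and removes it again
// once ctx is cancelled, mirroring the proposed AddContextEventHandlerWithOptions
// from outside the informer.
func addEventHandlerUntilDone(ctx context.Context, informer cache.SharedIndexInformer, handler cache.ResourceEventHandler, options cache.HandlerOptions) error {
	reg, err := informer.AddEventHandlerWithOptions(handler, options)
	if err != nil {
		return err
	}
	go func() {
		<-ctx.Done()
		// Errors from unregistering go to the runtime error handler, as in
		// the sketch above.
		utilruntime.HandleError(informer.RemoveEventHandler(reg))
	}()
	return nil
}
```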
In the future, I would like to explore an option to transition leaders without stopping the process, so fixing goroutine leaks would be important. I also have similar questions as @aojea though; I was under the impression that shutting down an informer factory would clean up these handlers. Is there no way of making this fix without having to rewrite every controller?
Stopping the informer does clean up those handlers - and cleaning the handlers up can only be done by either updating the controllers or by shutting down the informer. At least at the moment it doesn't make a difference for kube; even if you went into a leader migration without stopping the former leader's process, you would likely shut down the informer anyway. However, if the informer should be kept running on non-leaders, the handlers would leak on a leader migration. Tbh I think the option with AddContextEventHandlerWithOptions would be the cleaner approach.
Ah thanks for the explanation. So this only happens for very specific controllers that can exit while keeping the informerFactory running. None of the KCM controllers should fall into that bucket then?
I recently wrote an apiserver controller that falls into the exact condition you described (it must be able to stop and start without touching the informerFactory: https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/leaderelection/leaderelection_controller.go#L80-L89), and removing the handler does feel a bit hacky.
Correct, nothing started by the controller manager is directly affected, because the controller manager shuts down both the informers and controllers at the same time.
Yeah, that's what I felt as well.
I'll prep a PR tomorrow
Please don't make API-incompatible changes to those informer methods; it is very painful for downstream projects to rewrite all the controllers just to adapt to a new field.
/triage accepted
It's been ~two months since the last activity here; is this still being worked on?
Sort of - for controllers to unregister handlers upon shutdown they either need to retain the handler ref and unregister it (like in this PR) or use something like the context-based registration discussed above. I'm basically just waiting for reviews and for PRs to be merged or closed.
My bad, sorry, this slipped. Regarding the changes in the controller manager: those are not required, as they do not leak goroutines (#131199 (comment)). Regarding the holistic solution: there have been similar requests lately, so I was wondering whether it would be better to build a broader solution around this whole problem or whether this context cancellation is enough ... but you need to involve the API machinery leads for that, @jpbetz and @deads2k (Joe is out for at least a few weeks IIRC).
What type of PR is this?
/kind bug
What this PR does / why we need it:
As reported in kcp-dev/kcp#3350 several controllers are leaking goroutines as they are registering handlers with informers but are not cleaning those registrations up on shutdown.
I presume that this also occurs in Kubernetes on leader changes.
When the ResourceQuota controller starts and stops QuotaMonitors, each stopped QuotaMonitor leaks goroutines as well. Additionally, if the informer returns an error and doesn't actually register handlers, the controller will do nothing, with almost no indication as to why, as the errors are silently ignored.
Which issue(s) this PR fixes:
None
Special notes for your reviewer:
I'd be happy to split this PR into multiple PRs e.g. based on controller.
Does this PR introduce a user-facing change?
Some functions creating controllers are now returning the error from the handler registration. This was previously ignored.
IMHO returning the errors is the correct choice, as without the handlers the controllers would do nothing. Alternatively, the errors could be logged using the (util)runtime.HandleError function.
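For illustration, a hedged sketch of what that change looks like in a controller constructor (the names are made up, not taken from the actual diff): the error from the handler registration is now propagated instead of being dropped.

```go
package example

import (
	"k8s.io/client-go/tools/cache"
)

type exampleController struct{}

func (c *exampleController) enqueue(obj interface{}) { /* add obj to a work queue */ }

// newExampleController is a hypothetical constructor: the error returned by
// AddEventHandler is now returned to the caller; previously it was silently
// ignored and the controller would simply never receive events.
func newExampleController(podInformer cache.SharedIndexInformer) (*exampleController, error) {
	c := &exampleController{}
	if _, err := podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: c.enqueue,
	}); err != nil {
		return nil, err
	}
	return c, nil
}
```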
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None