podresources: list: use active pods #132028
Conversation
/sig node
Force-pushed from 5b9084a to 7715e6d
/triage accepted
we have enough evidence this is a real bug, and this breaks https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md - the scheduler plugin implemented workarounds before, but the memory part cannot be worked around
Force-pushed from 7715e6d to 095f354
the PR is reviewable. I'm finishing the e2e tests.
LGTM label has been added. Git tree hash: 0b6612b4133c881b304e571145bccf9010d51bb4
/hold for another round of testing
Thanks @ffromani, LGTM from me, fwiw
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ffromani, marquiz The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from 5f70307 to f647e63
uhm:
The podresources API `List` implementation uses the internal data of the resource managers as source of truth. Looking at the implementation here: https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60 we take care of syncing the device allocation data before querying the device manager to return its pod->devices assignment. This is needed because otherwise the device manager (and all the other resource managers) would do the cleanup asynchronously, so the `List` call would return incorrect data. But we don't do this syncing for either CPUs or memory, so when we report these we get stale data, as issue kubernetes#132020 demonstrates.

For the CPU manager, however, we have the reconcile loop which cleans the stale data periodically. It turns out this timing interplay was actually the reason the existing issue kubernetes#119423 seemed fixed (see: kubernetes#119423 (comment)). If in the reproducer we set the `cpuManagerReconcilePeriod` to a very high value (>= 5 minutes), the issue still reproduces against the current master branch (https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).

Taking a step back, we can see multiple problems:
1. not syncing the resource managers' internal data before querying for pod assignment (no removeStaleState calls), but most importantly
2. the `List` call iterates over all the pods known to the kubelet. But the resource managers do NOT hold resources for non-running pods, so it is better, and actually correct, to iterate only over the active pods. This also avoids problem 1 above.

Furthermore, the resource managers all iterate over the active pods anyway.

`List` is using all the pods known about:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215

But all the resource managers are using the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198

So this change also makes the `List` view consistent with the resource managers' view, which is a promise of the API that is currently broken.

We also need to acknowledge the warning in the docstring of GetActivePods. Arguably, having the endpoint use a different pod set with respect to the resource managers, with the related desync, causes way more harm than good. And arguably, it's better to fix this issue in just one place instead of having `List` use a different pod set for no clear reason. For these reasons, while important, I don't think the warning per se invalidates this change.

We need to further acknowledge that the `List` endpoint has used the full pod list since its inception. So, we will add a Feature Gate to disable this fix and restore the old behavior. We plan to keep this Feature Gate for quite a long time (at least 4 more releases) considering how stable this change was. Should a consumer of the API be broken by this change, we have the option to restore the old behavior and to craft a more elaborate fix.

The old `v1alpha1` endpoint is intentionally left unmodified.

Signed-off-by: Francesco Romani <[email protected]>
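The core of the change can be sketched in isolation. The types and helper names below are hypothetical stand-ins, not the actual kubelet code: the point is that filtering to active pods before querying the resource managers keeps the `List` view consistent with what the managers actually track.

```go
package main

import "fmt"

// Pod is a minimal stand-in for the kubelet's pod object.
type Pod struct {
	Name       string
	Terminated bool // true once the pod is no longer running
}

// listAllPods models the old behavior: List iterated over every pod
// known to the kubelet, including terminated ones whose resources the
// managers have already released (or will release asynchronously).
func listAllPods(pods []Pod) []string {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.Name)
	}
	return names
}

// listActivePods models the fix: iterate only over active pods, the
// same set the resource managers reconcile against, so List never
// reports stale assignments for terminated pods.
func listActivePods(pods []Pod) []string {
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		if p.Terminated {
			continue
		}
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pods := []Pod{{Name: "running"}, {Name: "done", Terminated: true}}
	fmt.Println(listAllPods(pods))    // old view: includes the terminated pod
	fmt.Println(listActivePods(pods)) // new view: only the active pod
}
```

In this toy model the stale-data bug is the gap between the two functions: with the old behavior, `List` could report resources for "done" even though the managers no longer account for it.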
Since KEP 4885 (https://github.com/kubernetes/enhancements/blob/master/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md), the memory manager is also supported on Windows. Plus, we want to add podresources e2e tests which configure the memory manager. Both these facts suggest it's useful to build the e2e memory manager tests on all OSes, not just on Linux. However, since we are not sure we are ready to run these tests everywhere, we tag them LinuxOnly to preserve most of the old behavior.

Signed-off-by: Francesco Romani <[email protected]>
Force-pushed from f647e63 to 30aefb4
/retest-required
add more e2e tests to cover the interaction with core resource managers (cpu, memory) and to ensure proper reporting. Signed-off-by: Francesco Romani <[email protected]>
Force-pushed from 30aefb4 to a90b42c
/hold cancel
all tests look good now
/test pull-kubernetes-node-kubelet-serial-podresources
/lgtm
LGTM label has been added. Git tree hash: d59f5ac68482a511817b1408edb0eb760b117781
/assign @derekwaynecarr
What type of PR is this?
/kind bug
What this PR does / why we need it:
The podresources API `List` implementation uses the internal data of the resource managers as source of truth. But it iterates over the full list of pods known by the kubelet. In contrast, all the resource managers do their internal sync using the list of active pods, fetched at each reconciliation step using a helper function.
This causes the `List` endpoint to return incorrect data, because it has a view of the world different from what the resource managers actually have.

Which issue(s) this PR fixes:
Fixes #132020
Fixes #119423
Fixes #131999
Special notes for your reviewer:
We need to acknowledge the warning in the docstring of GetActivePods. Arguably, having the endpoint use a different pod set than the resource managers causes way more harm than good because of the desync. And arguably, it's better to fix this issue in just one place instead of having `List` use a different pod set for no clear reason. For these reasons, while important, I don't think the warning per se invalidates this change.

We need to further acknowledge that the `List` endpoint has used the full pod list since its inception. So, we will add a way in the `v1` endpoint to preserve the old behavior, to zero the chance of breaking users of the API who depend on the old behavior. The old `v1alpha1` endpoint is intentionally left unmodified.

Does this PR introduce a user-facing change?
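The opt-out described above can be sketched as a simple gated selection of the pod set. This is a minimal model, not the kubelet's actual feature-gate machinery; the flag and function names are hypothetical.

```go
package main

import "fmt"

// Pod is a minimal stand-in for the kubelet's pod object.
type Pod struct {
	Name   string
	Active bool
}

// podsForList selects the pod set the v1 List endpoint walks over.
// With useActivePods=false (the opt-out restoring the legacy
// behavior), every known pod is returned; with useActivePods=true
// only active pods are returned, matching the resource managers' view.
func podsForList(all []Pod, useActivePods bool) []Pod {
	if !useActivePods {
		return all // legacy: full pod list, may include stale entries
	}
	active := make([]Pod, 0, len(all))
	for _, p := range all {
		if p.Active {
			active = append(active, p)
		}
	}
	return active
}

func main() {
	all := []Pod{{Name: "a", Active: true}, {Name: "b"}}
	fmt.Println(len(podsForList(all, false))) // legacy behavior: 2 pods
	fmt.Println(len(podsForList(all, true)))  // fixed behavior: 1 pod
}
```

Keeping both paths behind a single switch is what lets the project ship the fix as default while retaining a cheap escape hatch for API consumers that turn out to depend on the old semantics.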