
podresources: list: use active pods #132028


Open
wants to merge 3 commits into base: master

Conversation

ffromani
Contributor

@ffromani ffromani commented May 30, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The podresources API List implementation uses the internal data of the resource managers as its source of truth, yet it iterates over the full list of pods known to the kubelet. The resource managers, in contrast, sync their internal state against the list of active pods, fetched at each reconciliation step using a helper function.
This causes the List endpoint to return incorrect data, because its view of the world differs from the one the resource managers actually hold.
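The desync can be modeled in a few lines of Go (illustrative only; these names do not match the actual kubelet code): the managers reconcile against active pods, while List walks every pod, so it reports entries the managers no longer track.

```go
package main

import "fmt"

type pod struct {
	name   string
	active bool
}

// reconcile mimics the resource managers: they only hold
// allocation state for active pods.
func reconcile(pods []pod) map[string]string {
	allocations := map[string]string{}
	for _, p := range pods {
		if p.active {
			allocations[p.name] = "cpus=0-3"
		}
	}
	return allocations
}

// listAll mimics the old List behavior: it walks every pod the
// kubelet knows about, including ones the managers dropped,
// which yields entries with empty (stale) allocation data.
func listAll(pods []pod, allocations map[string]string) []string {
	var out []string
	for _, p := range pods {
		out = append(out, fmt.Sprintf("pod=%s alloc=%q", p.name, allocations[p.name]))
	}
	return out
}

func main() {
	allPods := []pod{{"web", true}, {"batch-done", false}}
	for _, line := range listAll(allPods, reconcile(allPods)) {
		fmt.Println(line)
	}
}
```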

Which issue(s) this PR fixes:

Fixes #132020
Fixes #119423
Fixes #131999

Special notes for your reviewer:

We need to acknowledge the warning in the docstring of GetActivePods. Arguably, having the endpoint use a different pod set than the resource managers causes far more harm than good because of the desync.
And arguably, it's better to fix this issue in just one place instead of having List use a different pod set for no clear reason. For these reasons, while the warning is important, I don't think it invalidates this change per se.

We further need to acknowledge that the List endpoint has used the full pod list since its inception. So we will add a way in the v1 endpoint to preserve the old behavior, to minimize the chance of breaking users of the API who depend on it.

The old v1alpha1 endpoint is intentionally left unmodified.

Does this PR introduce a user-facing change?

Change the node-local podresources API List endpoint to only consider active pods. Because this fix changes a long-established behavior, users observing a regression can use the KubeletPodResourcesListUseActivePods feature gate (default on) to restore the old behavior. Please file an issue if you encounter problems and have to use the feature gate.
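For operators hitting a regression, disabling the gate goes through the standard kubelet configuration file. A sketch (the gate name is the one introduced by this PR; the rest is the usual KubeletConfiguration layout):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Set to false to restore the pre-fix behavior (List over all pods).
  KubeletPodResourcesListUseActivePods: false
```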

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 30, 2025
@ffromani
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2025
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 5b9084a to 7715e6d Compare May 30, 2025 08:06
@ffromani
Contributor Author

/triage accepted
/priority important-soon

we have enough evidence this is a real bug, and it breaks https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md - the scheduler plugin implemented workarounds before, but the memory part cannot be worked around

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 30, 2025
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 7715e6d to 095f354 Compare May 30, 2025 15:57
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 30, 2025
@ffromani
Contributor Author

the PR is reviewable. I'm finishing the e2e tests.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0b6612b4133c881b304e571145bccf9010d51bb4

@ffromani ffromani moved this from Needs Reviewer to Needs Approver in SIG Node: code and documentation PRs Jun 9, 2025
@ffromani
Contributor Author

ffromani commented Jun 9, 2025

/hold

for another round of testing

Contributor

@marquiz marquiz left a comment


Thanks @ffromani, LGTM from me, fwiw

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ffromani, marquiz
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ffromani ffromani force-pushed the podresources-list-active-pods branch from 5f70307 to f647e63 Compare June 10, 2025 13:27
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2025
@k8s-ci-robot k8s-ci-robot requested a review from harche June 10, 2025 13:27
@ffromani
Contributor Author

uhm:

panic: feature "KubeletPodResourcesListUseActivePods" introduced as deprecated must provide a 1.0 entry indicating initial state
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.Must(...)
	k8s.io/apimachinery/pkg/util/runtime/runtime.go:293
k8s.io/kubernetes/pkg/features.init.0()
	k8s.io/kubernetes/pkg/features/kube_features.go:1885 +0x138
!!! [0610 13:33:30] Call tree:
!!! [0610 13:33:30]  1: hack/make-rules/test-cmd.sh:196 run_kube_apiserver(...)
junit report dir: /logs/artifacts
+++ [0610 13:33:30] Clean up complete 
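The panic above comes from feature gate registration: a gate whose first versioned entry is already "Deprecated" carries no record of its pre-deprecation default. A self-contained model of that check (illustrative; not the real k8s.io/component-base code):

```go
package main

import (
	"errors"
	"fmt"
)

type spec struct {
	Version    string // e.g. "1.0", "1.34"
	PreRelease string // "GA", "Beta", "Deprecated", ...
}

// validate models the registration-time check: a feature introduced
// as deprecated must carry a 1.0 entry describing its initial state,
// otherwise there is no baseline to fall back to.
func validate(name string, specs []spec) error {
	if len(specs) > 0 && specs[0].PreRelease == "Deprecated" && specs[0].Version != "1.0" {
		return errors.New("feature " + name + " introduced as deprecated must provide a 1.0 entry indicating initial state")
	}
	return nil
}

func main() {
	bad := []spec{{Version: "1.34", PreRelease: "Deprecated"}}
	good := []spec{
		{Version: "1.0", PreRelease: "GA"},
		{Version: "1.34", PreRelease: "Deprecated"},
	}
	fmt.Println(validate("KubeletPodResourcesListUseActivePods", bad))
	fmt.Println(validate("KubeletPodResourcesListUseActivePods", good))
}
```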

ffromani added 2 commits June 10, 2025 15:41
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager for its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do its cleanup asynchronously, so the `List` call
would return incorrect data.

But we don't do this syncing for either CPUs or memory,
so when we report those resources we return stale data, as issue kubernetes#132020 demonstrates.

For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was actually the reason the existing issue kubernetes#119423 seemed fixed
(see: kubernetes#119423 (comment)).
But it really is just timing: if in the reproducer we set the `cpuManagerReconcilePeriod` to a very
high value (>= 5 minutes), the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
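The reproducer tweak described above would look roughly like this in the kubelet configuration (a sketch: `cpuManagerReconcilePeriod` is the standard KubeletConfiguration field, and the 5-minute value is the threshold mentioned above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Stretch the reconcile period so stale CPU assignments are not
# cleaned up in time, making the desync observable via List.
cpuManagerReconcilePeriod: 5m
```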

Taking a step back, we can see multiple problems:
1. we do not sync the resource managers' internal data before querying for
   pod assignments (no removeStaleState calls); but, most importantly,
2. the `List` call iterates over all the pods known to the kubelet. The
   resource managers do NOT hold resources for non-running pods, so it is
   better, and indeed correct, to iterate only over the active pods.
   This also avoids problem 1 above.

Furthermore, the resource managers all iterate over the active pods
anyway.
`List` uses all the pods the kubelet knows about:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215

while all the resource managers use the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198

So this change also makes the `List` view consistent with the resource
managers' view; that consistency is a promise of the API which is
currently broken.
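The shape of the change can be sketched in Go (hypothetical names; the real code lives in server_v1.go and the kubelet pod helpers): `List` switches from the full pod list to the same active-pod list the managers reconcile against, with the legacy behavior kept behind the new feature gate.

```go
package main

import "fmt"

type pod struct {
	name   string
	active bool
}

// activePods mirrors the helper the resource managers use at each
// reconciliation step (GetActivePods in the kubelet).
func activePods(all []pod) []pod {
	var out []pod
	for _, p := range all {
		if p.active {
			out = append(out, p)
		}
	}
	return out
}

// list picks the pod set for the podresources List response.
// useActivePods stands in for the KubeletPodResourcesListUseActivePods
// feature gate (default on); disabling it restores the old behavior.
func list(all []pod, useActivePods bool) []pod {
	if useActivePods {
		return activePods(all) // same view as the resource managers
	}
	return all // legacy: every pod known to the kubelet
}

func main() {
	all := []pod{{"web", true}, {"batch-done", false}}
	fmt.Println(len(list(all, true)), len(list(all, false)))
}
```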

We also need to acknowledge the warning in the docstring of GetActivePods.
Arguably, having the endpoint use a different pod set than the resource
managers, with the related desync, causes far more harm than good.
And arguably, it's better to fix this issue in just one place instead of
having `List` use a different pod set for no clear reason.
For these reasons, while the warning is important, I don't think it
invalidates this change per se.

We further need to acknowledge that the `List` endpoint has used the full
pod list since its inception. So we will add a feature gate to disable this
fix and restore the old behavior. We plan to keep this feature gate around
for quite a long time (at least 4 more releases), considering how
long-established the previous behavior was. Should a consumer of the API
be broken by this change, we have the option to restore the old behavior
and craft a more elaborate fix.

The old `v1alpha1` endpoint is intentionally left unmodified.

Signed-off-by: Francesco Romani <[email protected]>
Since KEP 4885
(https://github.com/kubernetes/enhancements/blob/master/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md)
the memory manager is also supported on Windows.

Plus, we want to add podresources e2e tests which configure
the memory manager. Both facts suggest it is useful to build
the e2e memory manager tests on all OSes, not just on Linux.

However, since we are not sure we are ready to run these tests
everywhere, we tag them LinuxOnly to preserve most of the
old behavior.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the podresources-list-active-pods branch from f647e63 to 30aefb4 Compare June 10, 2025 13:41
@ffromani
Contributor Author

/retest-required

add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 30aefb4 to a90b42c Compare June 11, 2025 11:14
@ffromani
Copy link
Contributor Author

/hold cancel

all tests look good now

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025
@ffromani
Contributor Author

/test pull-kubernetes-node-kubelet-serial-podresources

Contributor

@marquiz marquiz left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d59f5ac68482a511817b1408edb0eb760b117781

@ffromani
Contributor Author

/assign @derekwaynecarr
/assign @SergeyKanzhelev

Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Archive-it