
podresources: list: use active pods #132028


Open
wants to merge 3 commits into base: master

Conversation

ffromani
Contributor

@ffromani ffromani commented May 30, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The podresources API List implementation uses the internal data of the resource managers as its source of truth, yet it iterates over the full list of pods known to the kubelet. The resource managers, in contrast, sync their internal state against the list of active pods, fetched at each reconciliation step using a helper function.
This causes the List endpoint to return incorrect data, because its view of the world differs from the one the resource managers actually hold.
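The desync can be modeled in a few lines of Go (illustrative only; these names do not match the actual kubelet code): the managers reconcile against active pods, while List walks every pod, so it reports entries the managers no longer track.

```go
package main

import "fmt"

type pod struct {
	name   string
	active bool
}

// reconcile mimics the resource managers: they only hold
// allocation state for active pods.
func reconcile(pods []pod) map[string]string {
	allocations := map[string]string{}
	for _, p := range pods {
		if p.active {
			allocations[p.name] = "cpus=0-3"
		}
	}
	return allocations
}

// listAll mimics the old List behavior: it walks every pod the
// kubelet knows about, including ones the managers dropped,
// which yields entries with empty (stale) allocation data.
func listAll(pods []pod, allocations map[string]string) []string {
	var out []string
	for _, p := range pods {
		out = append(out, fmt.Sprintf("pod=%s alloc=%q", p.name, allocations[p.name]))
	}
	return out
}

func main() {
	allPods := []pod{{"web", true}, {"batch-done", false}}
	for _, line := range listAll(allPods, reconcile(allPods)) {
		fmt.Println(line)
	}
}
```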

Which issue(s) this PR fixes:

Fixes #132020
Fixes #119423
Fixes #131999

Special notes for your reviewer:

We need to acknowledge the warning in the docstring of GetActivePods. Arguably, having the endpoint use a different pod set than the resource managers causes far more harm than good because of the desync.
And arguably, it's better to fix this issue in just one place instead of having List use a different pod set for no clear reason. For these reasons, while the warning is important, I don't think it invalidates this change per se.

We further need to acknowledge that the List endpoint has used the full pod list since its inception. So we will add a way in the v1 endpoint to preserve the old behavior, to minimize the chance of breaking users of the API who depend on it.

The old v1alpha1 endpoint is intentionally left unmodified.

Does this PR introduce a user-facing change?

Change the node-local podresources API List endpoint to only consider active pods. Because this fix changes a long-established behavior, users observing a regression can use the KubeletPodResourcesListUseActivePods feature gate (default on) to restore the old behavior. Please file an issue if you encounter problems and have to use the feature gate.
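For operators hitting a regression, disabling the gate goes through the standard kubelet configuration file. A sketch (the gate name is the one introduced by this PR; the rest is the usual KubeletConfiguration layout):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Set to false to restore the pre-fix behavior (List over all pods).
  KubeletPodResourcesListUseActivePods: false
```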

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 30, 2025
@ffromani
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 30, 2025
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 5b9084a to 7715e6d Compare May 30, 2025 08:06
@ffromani
Contributor Author

/triage accepted
/priority important-soon

we have enough evidence this is a real bug, and it breaks https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md - the scheduler plugin implemented workarounds before, but the memory part cannot be worked around

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 30, 2025
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 7715e6d to 095f354 Compare May 30, 2025 15:57
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 30, 2025
@ffromani
Contributor Author

the PR is reviewable. I'm finishing the e2e tests.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0b6612b4133c881b304e571145bccf9010d51bb4

@ffromani ffromani moved this from Needs Reviewer to Needs Approver in SIG Node: code and documentation PRs Jun 9, 2025
@ffromani
Contributor Author

ffromani commented Jun 9, 2025

/hold

for another round of testing

Contributor

@marquiz marquiz left a comment


Thanks @ffromani, LGTM from me, fwiw

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ffromani, marquiz
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ffromani ffromani force-pushed the podresources-list-active-pods branch from 5f70307 to f647e63 Compare June 10, 2025 13:27
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2025
@k8s-ci-robot k8s-ci-robot requested a review from harche June 10, 2025 13:27
@ffromani
Contributor Author

uhm:

panic: feature "KubeletPodResourcesListUseActivePods" introduced as deprecated must provide a 1.0 entry indicating initial state
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.Must(...)
	k8s.io/apimachinery/pkg/util/runtime/runtime.go:293
k8s.io/kubernetes/pkg/features.init.0()
	k8s.io/kubernetes/pkg/features/kube_features.go:1885 +0x138
!!! [0610 13:33:30] Call tree:
!!! [0610 13:33:30]  1: hack/make-rules/test-cmd.sh:196 run_kube_apiserver(...)
junit report dir: /logs/artifacts
+++ [0610 13:33:30] Clean up complete 
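The panic above comes from feature gate registration: a gate whose first versioned entry is already "Deprecated" carries no record of its pre-deprecation default. A self-contained model of that check (illustrative; not the real k8s.io/component-base code):

```go
package main

import (
	"errors"
	"fmt"
)

type spec struct {
	Version    string // e.g. "1.0", "1.34"
	PreRelease string // "GA", "Beta", "Deprecated", ...
}

// validate models the registration-time check: a feature introduced
// as deprecated must carry a 1.0 entry describing its initial state,
// otherwise there is no baseline to fall back to.
func validate(name string, specs []spec) error {
	if len(specs) > 0 && specs[0].PreRelease == "Deprecated" && specs[0].Version != "1.0" {
		return errors.New("feature " + name + " introduced as deprecated must provide a 1.0 entry indicating initial state")
	}
	return nil
}

func main() {
	bad := []spec{{Version: "1.34", PreRelease: "Deprecated"}}
	good := []spec{
		{Version: "1.0", PreRelease: "GA"},
		{Version: "1.34", PreRelease: "Deprecated"},
	}
	fmt.Println(validate("KubeletPodResourcesListUseActivePods", bad))
	fmt.Println(validate("KubeletPodResourcesListUseActivePods", good))
}
```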

ffromani added 2 commits June 10, 2025 15:41
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager for its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do its cleanup asynchronously, so the `List` call
would return incorrect data.

But we don't do this syncing for either CPUs or memory,
so when we report those resources we return stale data, as issue kubernetes#132020 demonstrates.

For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was actually the reason the existing issue kubernetes#119423 seemed fixed
(see: kubernetes#119423 (comment)).
But it really is just timing: if in the reproducer we set the `cpuManagerReconcilePeriod` to a very
high value (>= 5 minutes), the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
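The reproducer tweak described above would look roughly like this in the kubelet configuration (a sketch: `cpuManagerReconcilePeriod` is the standard KubeletConfiguration field, and the 5-minute value is the threshold mentioned above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Stretch the reconcile period so stale CPU assignments are not
# cleaned up in time, making the desync observable via List.
cpuManagerReconcilePeriod: 5m
```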

Taking a step back, we can see multiple problems:
1. we do not sync the resource managers' internal data before querying for
   pod assignments (no removeStaleState calls); but, most importantly,
2. the `List` call iterates over all the pods known to the kubelet. The
   resource managers do NOT hold resources for non-running pods, so it is
   better, and indeed correct, to iterate only over the active pods.
   This also avoids problem 1 above.

Furthermore, the resource managers all iterate over the active pods
anyway.
`List` uses all the pods the kubelet knows about:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215

while all the resource managers use the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes into
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198

So this change also makes the `List` view consistent with the resource
managers' view; that consistency is a promise of the API which is
currently broken.
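The shape of the change can be sketched in Go (hypothetical names; the real code lives in server_v1.go and the kubelet pod helpers): `List` switches from the full pod list to the same active-pod list the managers reconcile against, with the legacy behavior kept behind the new feature gate.

```go
package main

import "fmt"

type pod struct {
	name   string
	active bool
}

// activePods mirrors the helper the resource managers use at each
// reconciliation step (GetActivePods in the kubelet).
func activePods(all []pod) []pod {
	var out []pod
	for _, p := range all {
		if p.active {
			out = append(out, p)
		}
	}
	return out
}

// list picks the pod set for the podresources List response.
// useActivePods stands in for the KubeletPodResourcesListUseActivePods
// feature gate (default on); disabling it restores the old behavior.
func list(all []pod, useActivePods bool) []pod {
	if useActivePods {
		return activePods(all) // same view as the resource managers
	}
	return all // legacy: every pod known to the kubelet
}

func main() {
	all := []pod{{"web", true}, {"batch-done", false}}
	fmt.Println(len(list(all, true)), len(list(all, false)))
}
```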

We also need to acknowledge the warning in the docstring of GetActivePods.
Arguably, having the endpoint use a different pod set than the resource
managers, with the related desync, causes far more harm than good.
And arguably, it's better to fix this issue in just one place instead of
having `List` use a different pod set for no clear reason.
For these reasons, while the warning is important, I don't think it
invalidates this change per se.

We further need to acknowledge that the `List` endpoint has used the full
pod list since its inception. So we will add a feature gate to disable this
fix and restore the old behavior. We plan to keep this feature gate around
for quite a long time (at least 4 more releases), considering how
long-established the previous behavior was. Should a consumer of the API
be broken by this change, we have the option to restore the old behavior
and craft a more elaborate fix.

The old `v1alpha1` endpoint is intentionally left unmodified.

Signed-off-by: Francesco Romani <[email protected]>
Since KEP 4885
(https://github.com/kubernetes/enhancements/blob/master/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md)
the memory manager is also supported on Windows.

Plus, we want to add podresources e2e tests which configure
the memory manager. Both facts suggest it is useful to build
the e2e memory manager tests on all OSes, not just on Linux.

However, since we are not sure we are ready to run these tests
everywhere, we tag them LinuxOnly to preserve most of the
old behavior.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the podresources-list-active-pods branch from f647e63 to 30aefb4 Compare June 10, 2025 13:41
@ffromani
Contributor Author

/retest-required

add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the podresources-list-active-pods branch from 30aefb4 to a90b42c Compare June 11, 2025 11:14
@ffromani
Copy link
Contributor Author

/hold cancel

all tests look good now

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025
@ffromani
Contributor Author

/test pull-kubernetes-node-kubelet-serial-podresources

Contributor

@marquiz marquiz left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d59f5ac68482a511817b1408edb0eb760b117781

@ffromani
Contributor Author

/assign @derekwaynecarr
/assign @SergeyKanzhelev

Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Archive-it