Skip to content

[WIP][FG:InPlacePodVerticalScaling] Fix Static CPU management policy alongside InPlacePodVerticalScaling #129719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

esotsal
Copy link
Contributor

@esotsal esotsal commented Jan 20, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

It is needed to resize allocated CPU of a Guaranteed QoS Class with integer CPU requests Pod without restart with Static CPU management policy alongside InPlacePodVerticalScaling.

Current PR retains "promised" upon start of container CPUs

Which issue(s) this PR fixes:

It is mentioned the issue in https://kubernetes.io/blog/2023/05/12/in-place-pod-resize-alpha/#known-issues last bullet.
I wasn't able to find a K8s issue, posting this PR following "I found it, I fix it" ethos after discussion in slack.

Update: Attempt to fix #127262

Special notes for your reviewer:

[Update] Latest 14th April 2025 Demo screencast

Screencast.from.2025-04-14.10-02-45.webm

This PR replace #123319 , needed due to company transfer.

This PR includes also a merge from esotsal#9 , to test proposals and review in one common PR. Thanks @Chunxia202410 for the contributions.

  • A proposal for CPU allocated strategy when Pod scale up and down
  • ~ Add mustKeepCPUs Interface~ ( Update 14th April 2025 , replaced with a local checkpoint solution as per sig-node meeting decisions )

Does this PR introduce a user-facing change?

Fixed an issue where static CPU management policy would not work alongside in-place
vertical scaling of a Pod.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 20, 2025
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 20, 2025
@esotsal
Copy link
Contributor Author

esotsal commented Jan 20, 2025

@Chunxia202410, @hshiina , @ffromani , @AnishShah , @tallclair , @SergeyKanzhelev , @vinaykul , moved old PR here due to company transfer, now i can continue working on this. Will prioritize updating the commit with the recent changes done in InPlacePodVerticalScaling and the proposals shared by Chunxia202410.

Copy link

linux-foundation-easycla bot commented Feb 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 5, 2025
@esotsal
Copy link
Contributor Author

esotsal commented Feb 5, 2025

Thanks @Chunxia202410 for the contributions, merged your two PRs to check tests . I have some questions about the second PR regarding strategy , thought best to discuss here, will ask later this week, need to do some more tests.

Thanks for the API PR seems is one of the options discussed.

Will try to update tests tomorrow or by end of this week, to make sure test is covered and passed before the Beta in v1.33 to have a point of reference.

We will need to raise the final solution or solutions in Sig-node , after v1.33 is finished and this PR is refactored with the new InPlacePodVerticalScaling KEP ongoing changes .

Thanks again for your PRs

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2025
@esotsal esotsal force-pushed the policy_static branch 2 times, most recently from b65f294 to fc5ddd1 Compare May 31, 2025 07:36
@esotsal
Copy link
Contributor Author

esotsal commented May 31, 2025

/retest

@esotsal esotsal force-pushed the policy_static branch 4 times, most recently from 0153b60 to 09b2d95 Compare June 1, 2025 09:02
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2025
@esotsal
Copy link
Contributor Author

esotsal commented Jun 4, 2025

/retest

@esotsal esotsal force-pushed the policy_static branch 4 times, most recently from df313e9 to c4aaa63 Compare June 8, 2025 10:46
esotsal and others added 8 commits June 11, 2025 09:40
Use new topology.Allocation struct (a CPU set plus
alignment metadata) instead of CPU set, due to rebase.

Remove duplicate unecessary SetDefaultCPUSet call as per
review comment.
- Revert introduction of API env mustKeepCPUs
- Replace mustKeepCPUs with local checkpoint "promised"
- Introduce "promised" in CPUManagerCheckpointV3 format
- Add logic, refactor with Beta candidate
- Fix lint issues
- Fail if mustKeepCPUs are not subset of resulted CPUs
- Fail if reusableCPUsForResize, mustKeepCPUs are not a subset
  of aligned CPUs
- Fail if mustKeepCPUs are not a subset of reusable CPUs
- TODO improve align resize tests, go through testing, corner cases
       refactor using cpumanager_test.go
- TODO improve CPUManagerCheckpointV3 tests
- TODO address code review/feedback to try different approach to allocate
       stepwise instead of once off when resizing
- TODO check init-containers
- TODO check migration from v2 to v3 CPU Manager checkpoint
- TODO check kubectl failure when prohibited can this be done earlier?
- TODO update CPU Manager tests to use refactpred cpu_manager_test
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jun 11, 2025

@esotsal: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-capz-windows-master 9c411d0 link false /test pull-kubernetes-e2e-capz-windows-master
pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers 9c411d0 link false /test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
pull-kubernetes-node-kubelet-serial-crio-cgroupv2 9c411d0 link false /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
pull-kubernetes-node-kubelet-serial-podresources 9c411d0 link false /test pull-kubernetes-node-kubelet-serial-podresources
pull-kubernetes-node-kubelet-serial-containerd-alpha-features 9c411d0 link false /test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
pull-kubernetes-node-kubelet-serial-containerd 9c411d0 link false /test pull-kubernetes-node-kubelet-serial-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ffromani
Copy link
Contributor

I'm struggling to find the time to review this important PR. I'll keep pushing, but feel free to pull in more reviewers.

@esotsal
Copy link
Contributor Author

esotsal commented Jun 11, 2025

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Needs Triage
Status: Archive-it
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FG:InPlacePodVerticalScaling] Static CPU management policy alongside InPlacePodVerticalScaling
7 participants