[WIP]Release non-restartable InitContainer CPUs after terminated #131764

Open
wants to merge 1 commit into
base: master
from

Conversation

Chunxia202410

@Chunxia202410 Chunxia202410 commented May 14, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

In some cases, the app container cannot reuse the CPUs of the InitContainer, which leads to a CPU leak.
Several solutions are discussed in #112228 (comment).

This PR implements solution 2: release the CPUs of a non-restartable InitContainer after it terminates.
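
The accounting effect of releasing CPUs on termination can be sketched as below. This is a minimal, self-contained model and not the kubelet's CPU manager code: the `pool` type and its `allocate` and `release` methods are hypothetical names invented for illustration, CPUs are plain integers, and NUMA alignment and the kubelet's existing reuse mechanism are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// pool is a hypothetical, simplified stand-in for static CPU manager state.
// It is NOT the kubelet's real data structure.
type pool struct {
	free     map[int]bool     // CPUs available for exclusive allocation
	assigned map[string][]int // container name -> exclusively assigned CPUs
}

// newPool creates a pool with CPUs 0..n-1, all free.
func newPool(n int) *pool {
	p := &pool{free: map[int]bool{}, assigned: map[string][]int{}}
	for i := 0; i < n; i++ {
		p.free[i] = true
	}
	return p
}

// allocate gives the n lowest-numbered free CPUs to the named container.
// On failure the pool is left unchanged.
func (p *pool) allocate(name string, n int) error {
	ids := make([]int, 0, len(p.free))
	for id := range p.free {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	if len(ids) < n {
		return fmt.Errorf("not enough free CPUs for %s: want %d, have %d", name, n, len(ids))
	}
	for _, id := range ids[:n] {
		delete(p.free, id)
	}
	p.assigned[name] = ids[:n]
	return nil
}

// release returns a terminated container's CPUs to the free set.
func (p *pool) release(name string) {
	for _, id := range p.assigned[name] {
		p.free[id] = true
	}
	delete(p.assigned, name)
}

func main() {
	p := newPool(4)
	_ = p.allocate("init", 2)

	// While the init container still holds its CPUs, a 4-CPU app request fails.
	fmt.Println("before release:", p.allocate("app", 4))

	// Assumed trigger: the non-restartable init container has terminated.
	p.release("init")
	fmt.Println("after release:", p.allocate("app", 4))
}
```

In this model the app container's request only succeeds once the init container's CPUs are back in the free set, which is the behaviour the PR aims for.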

Which issue(s) this PR fixes:

Fixes #112228

Special notes for your reviewer:

The Pod restart case test fails: when the Pod restarts, the non-restartable InitContainer restarts but is not allocated CPUs again.

Adding logic to reallocate CPUs for the non-restartable InitContainer would make the change more complicated, because several situations need to be considered, such as whether to reallocate only the CPUs or the CPUs, memory, and devices together. If only the CPUs are reallocated, the reallocation may fail because no preferred NUMA nodes are available...

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot
Contributor

Hi @Chunxia202410. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 14, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and klueska May 14, 2025 11:44
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 14, 2025
@Chunxia202410 Chunxia202410 changed the title Release non-restart InitContainer CPUs after terminated Release non-restartable InitContainer CPUs after terminated May 15, 2025
@Chunxia202410
Author

/sig node

@Chunxia202410 Chunxia202410 force-pushed the cpu_manager_initC branch 3 times, most recently from 215fe84 to 050fce7 Compare May 15, 2025 08:53
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Chunxia202410
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Chunxia202410 Chunxia202410 changed the title Release non-restartable InitContainer CPUs after terminated [WIP]Release non-restartable InitContainer CPUs after terminated May 15, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2025
@Chunxia202410
Author

When I test the Pod restart case, the non-restartable InitContainer restarts but is not allocated CPUs again. If it has no exclusive CPUs, it uses the default CPUs. So this solution may not cover this case.
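
The fallback described in this comment can be illustrated with a small model. It is a sketch under stated assumptions, not kubelet code: the `state` type, its fields, and `cpusFor` are hypothetical names, and the CPU numbers are made up.

```go
package main

import "fmt"

// state is a hypothetical, simplified model of CPU assignments.
// It is NOT the kubelet's real state type.
type state struct {
	defaultCPUs []int            // shared (default) CPU set
	exclusive   map[string][]int // container name -> exclusive CPUs
}

// cpusFor returns the container's exclusive CPUs if it has any,
// otherwise the shared default set. The bool reports exclusivity.
func (s *state) cpusFor(name string) ([]int, bool) {
	if cpus, ok := s.exclusive[name]; ok {
		return cpus, true
	}
	return s.defaultCPUs, false
}

func main() {
	s := &state{
		defaultCPUs: []int{0, 1},
		exclusive:   map[string][]int{"init": {2, 3}},
	}

	// First run: the init container has exclusive CPUs.
	cpus, excl := s.cpusFor("init")
	fmt.Println("first run:", cpus, "exclusive:", excl)

	// The init container terminates and its CPUs are released.
	delete(s.exclusive, "init")

	// Pod restart: the init container runs again without a new allocation,
	// so in this model it falls back to the default CPUs.
	cpus, excl = s.cpusFor("init")
	fmt.Println("after restart:", cpus, "exclusive:", excl)
}
```

The second lookup returns the shared set, which matches the observation that the restarted InitContainer no longer runs on exclusive CPUs.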

@Chunxia202410 Chunxia202410 force-pushed the cpu_manager_initC branch 2 times, most recently from ff19d69 to c9a865f Compare May 26, 2025 10:19
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2025
@Chunxia202410 Chunxia202410 force-pushed the cpu_manager_initC branch 3 times, most recently from 841ed62 to 3389167 Compare June 10, 2025 05:59
@bart0sh bart0sh moved this from Triage to Work in progress in SIG Node: code and documentation PRs Jun 10, 2025
@Chunxia202410 Chunxia202410 force-pushed the cpu_manager_initC branch 2 times, most recently from 4f233ff to d445769 Compare June 12, 2025 06:00
Labels
area/kubelet
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels.
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
kind/bug Categorizes issue or PR as related to a bug.
needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
needs-priority Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/node Categorizes an issue or PR as relevant to SIG Node.
size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

Static CPU Manager can fail with UnexpectedAdmissionError with init-containers requesting integer CPUs
2 participants