Skip to content

[FG:InPlacePodVerticalScaling] Add Pod resize complete event #130387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

shiya0705
Copy link

@shiya0705 shiya0705 commented Feb 24, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
Add events when pod resize complete

Which issue(s) this PR fixes:
Fixes #127172

Special notes for your reviewer:
N/A

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
N/A

Does this PR introduce a user-facing change?
Event is a user-facing change

Added detailed event for in-place pod vertical scaling completed, improving cluster management and debugging

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 24, 2025
Copy link

linux-foundation-easycla bot commented Feb 24, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 24, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @shiya0705. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 24, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 24, 2025
@googs1025
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2025
@@ -2872,6 +2872,40 @@ func (kl *Kubelet) canResizePod(pod *v1.Pod) (bool, v1.PodResizeStatus, string)
return true, v1.PodResizeStatusInProgress, ""
}


func resizeMsgGenerate(allocatedPod *v1.Pod, podStatus *kubecontainer.PodStatus) string {
var msg string = "}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just use "}" ? This might make the return value a bit weird. 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed code, var msg string = "}" is compared with msg = fmt.Sprintf("Pod resize complete, {") + msg

Copy link
Member

@googs1025 googs1025 Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this is not a correct way to initialize. Can we use like [] string to get msg and Join operation at the end instead of this string concatenation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments, I change the string definition.

@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch from 8159cfe to 1435a12 Compare February 24, 2025 08:34
@shiya0705 shiya0705 closed this Feb 24, 2025
@shiya0705 shiya0705 deleted the Pod_Resize_complete_event branch February 24, 2025 08:54
@shiya0705 shiya0705 changed the title first modified version Pod resize complete event Feb 24, 2025
@shiya0705 shiya0705 changed the title Pod resize complete event Add Pod resize complete event Feb 24, 2025
@shiya0705
Copy link
Author

/retest

@shiya0705 shiya0705 restored the Pod_Resize_complete_event branch February 24, 2025 09:18
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch from a0049a9 to c13720a Compare May 3, 2025 08:40
@natasha41575
Copy link
Contributor

/retest

@@ -2967,8 +3013,19 @@ func (kl *Kubelet) handlePodResourcesResize(pod *v1.Pod, podStatus *kubecontaine
// If a resize is in progress, make sure the cache has the correct state in case the Kubelet restarted.
kl.statusManager.SetPodResizeInProgressCondition(pod.UID, "", "", false)
} else {
PodResizeConditionsLenBefore := len(kl.statusManager.GetPodResizeConditions(pod.UID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than tracking this here, let's just make ClearPodResizeInProgressCondition return whether a condition was cleared. Then the logic just becomes:

if kl.statusManager.ClearPodResizeInProgressCondition(pod.UID) {

Copy link
Contributor

@natasha41575 natasha41575 May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair do you still want the event emitted when pending resize is reverted? If so, just checking that the pod ResizeInProgress condition is cleared is not enough (since pending resizes are tracked in the other condition). To handle that I think we'd need to move this line to the beginning of the function and unconditionally check if the entire resize status has been cleared at the end of the defer block (which is what I was suggesting in #130387 (comment))

(referencing your comment #127172 (comment))

Copy link
Author

@shiya0705 shiya0705 May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair @natasha41575

Thanks for your comments.

  1. I agree with natasha's comment, just checking pod ResizeInProgress condition is cleared is not enough, but this has nothing to do with "event emitted when pending resize is reverted". If only check if kl.statusManager.ClearPodResizeInProgressCondition(pod.UID),after resize completed,resize complete event will be emitted continually every time when SyncPod is called. Because func handlePodResourcesResize defer block will check isPodResizeInProgress even after pod resize is completed, this is not what we want. We just want to emit complete event once resize is finished.

  2. I think current code has satisfied this requirement, emit complete event when pending resize is reverted.
    Pending resize is reverted, first both InProgress and Pending condition are cleared,then set InProgress condition,after resize complete,clear InProgress condition( emit complete event). The detailed is as below.

  • When pending
    ->SetPodResizePendingCondition
  • pending reverted
    -> ClearPodResizePendingCondition/ClearPodResizeInProgressCondition(func handlePodResourcesResize)
    -> SetPodResizeInProgressCondition (func (kl *Kubelet) SyncPod)
    -> ClearPodResizeInProgressCondition(pod.UID) (defer func() in func handlePodResourcesResize )

Please correct me if there is any error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair
Could you please give your comment?

@@ -2953,6 +2954,51 @@ func disallowResizeForSwappableContainers(runtime kubecontainer.Runtime, desired
return false, ""
}

// PodResourceSummary save container and pod-level resources for resizing
type PodResourceSummary struct {
// TODO: Add pod-level resources here once resizing pod-level resources is supported
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ndixita

// PodResourceSummary save container and pod-level resources for resizing
type PodResourceSummary struct {
// TODO: Add pod-level resources here once resizing pod-level resources is supported
InitContainers map[string]v1.ResourceRequirements `json:"InitContainers,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on the type for this. The map[string]v1.ResourceRequriements is a more compact json representation, but it doesn't preserve container order. The alternative would be something like

InitContainers []ContainerResourceSummary
Containers []ContainerResourceSummary

// ...

type ContainerResourceSummary struct {
  Name string
  Resources v1.ResourceRequirements
}

I'm leaning toward the latter, since it would basically present as a filtered view of the pod spec, rather than a completely new type, if that makes sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your consideration, applied it to the new code


containerStatus := podStatus.FindContainerStatusByName(allocatedContainer.Name)
if containerStatus == nil || containerStatus.State != kubecontainer.ContainerStateRunning {
// If the container isn't running, it doesn't need to be saved to completed event
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this. The resize might not have been "actuated", but a waiting container can still be resized - it will use the newly allocated values when it starts up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I delete this condition judgment in the new code, so that even the containerStatus is nil, but container resource in the completed message is displayed.

// If pod resize completed, print pod resize complete event
if (PodResizeConditionsLenBefore != 0) && (len(kl.statusManager.GetPodResizeConditions(pod.UID)) == 0) {
// PodResizecompleteAnnotations including pod all resizeable container resrouce, it is printed in event annotaions
PodResizecompleteAnnotations := kl.getPodResizeInfo(allocatedPod, podStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why put this in annotations rather than the event message? If it's part of the message, we don't need to worry about the timestamp.

Alternatively, if it's too much information to dump in the message, maybe we could put the allocated generation in the message instead of the timestamp.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it to annotations to refer this comment, but annotations do not update if event message content is same, so added a time stamp to distinguish different times of resize completed event.
#127172 (comment)

But add a timestamp in the event message is indeed strange.
In the new code, I deleted timestamp and moved detailed information from annotations to the message.

}

// getPodResizeInfo to cache containers and pod resources CPU and memory resources once resizing is completed
func (kl *Kubelet) getPodResizeInfo(allocatedPod *v1.Pod, podStatus *kubecontainer.PodStatus) map[string]string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this. I think it should have its own dedicated test.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 7, 2025
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch 2 times, most recently from ea9c2c0 to 64d90ff Compare May 8, 2025 02:11
@@ -2967,8 +3015,18 @@ func (kl *Kubelet) handlePodResourcesResize(pod *v1.Pod, podStatus *kubecontaine
// If a resize is in progress, make sure the cache has the correct state in case the Kubelet restarted.
kl.statusManager.SetPodResizeInProgressCondition(pod.UID, "", "", false)
} else {
// podResizeConditionsOriginalLen is used to save original resize condition
podResizeConditionsOriginalLen := len(kl.statusManager.GetPodResizeConditions(pod.UID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this here just so it doesn't get lost but make sure you get an answer from @tallclair on #130387 (comment) on the desired behavior

@@ -2967,8 +3015,18 @@ func (kl *Kubelet) handlePodResourcesResize(pod *v1.Pod, podStatus *kubecontaine
// If a resize is in progress, make sure the cache has the correct state in case the Kubelet restarted.
kl.statusManager.SetPodResizeInProgressCondition(pod.UID, "", "", false)
} else {
// podResizeConditionsOriginalLen is used to save original resize condition
podResizeConditionsOriginalLen := len(kl.statusManager.GetPodResizeConditions(pod.UID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are holding off on merging #131157 for the time being, meaning I wouldn't expect the code that backfills the status manager conditions from there to be merged any time soon. If we want to move forward here, can you copy the condition backfilling into this PR? Basically this section

(This is a follow-up to my other comment: #130387 (comment))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch from edfadd9 to 9eb657a Compare May 13, 2025 11:29
Comment on lines 2699 to 2706
for _, c := range pod.Status.Conditions {
switch c.Type {
case v1.PodResizePending:
kl.statusManager.SetPodResizePendingCondition(pod.UID, c.Reason, c.Message)
case v1.PodResizeInProgress:
kl.statusManager.SetPodResizeInProgressCondition(pod.UID, c.Reason, c.Message, true)
}
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natasha41575, for this comment #130387 (comment)
I merged backfills status manager conditions code to here, is my understanding correct?

@Rajalakshmi-Girish Rajalakshmi-Girish moved this to Pending inclusion in [sig-release] Bug Triage May 23, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2025
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch 2 times, most recently from ac792a4 to 4087930 Compare June 6, 2025 06:05
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 6, 2025
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch from 4087930 to 5451237 Compare June 6, 2025 06:24
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 6, 2025
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch 2 times, most recently from ae0472a to 79887c0 Compare June 6, 2025 08:41
@shiya0705 shiya0705 force-pushed the Pod_Resize_complete_event branch from b46eeb7 to 780529e Compare June 6, 2025 10:30
@k8s-ci-robot
Copy link
Contributor

@shiya0705: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-node-e2e-containerd 780529e link true /test pull-kubernetes-node-e2e-containerd
pull-kubernetes-conformance-kind-ga-only-parallel 780529e link true /test pull-kubernetes-conformance-kind-ga-only-parallel
pull-kubernetes-e2e-kind 780529e link true /test pull-kubernetes-e2e-kind
pull-kubernetes-e2e-gce 780529e link true /test pull-kubernetes-e2e-gce
pull-kubernetes-typecheck 780529e link true /test pull-kubernetes-typecheck
pull-kubernetes-e2e-kind-ipv6 780529e link true /test pull-kubernetes-e2e-kind-ipv6
pull-kubernetes-unit 780529e link true /test pull-kubernetes-unit
pull-kubernetes-linter-hints 780529e link false /test pull-kubernetes-linter-hints
pull-kubernetes-unit-windows-master 780529e link false /test pull-kubernetes-unit-windows-master
pull-kubernetes-verify 780529e link true /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@natasha41575
Copy link
Contributor

@shiya0705 I was originally thinking it would be best to get this in before #131612, so that you could avoid the merge conflict. But after reflection, I realized that if we do that, all the work you've doing here would more or less be wasted, because #131612 completely changes the logic for resize allocation and determining if a resize is done. It would also mean that we'd have to deal with the merge conflict in #131612, but #131612 is a stronger priority for 1.34. I believe the refactor in #131612 should also greatly simplify the logic required for emitting the resize event, which I think will make it easier for both you and reviewers.

If you are willing to wait for #131612 and rebase on it, I can prioritize review of this one after.

@shiya0705
Copy link
Author

@shiya0705 I was originally thinking it would be best to get this in before #131612, so that you could avoid the merge conflict. But after reflection, I realized that if we do that, all the work you've doing here would more or less be wasted, because #131612 completely changes the logic for resize allocation and determining if a resize is done. It would also mean that we'd have to deal with the merge conflict in #131612, but #131612 is a stronger priority for 1.34. I believe the refactor in #131612 should also greatly simplify the logic required for emitting the resize event, which I think will make it easier for both you and reviewers.

If you are willing to wait for #131612 and rebase on it, I can prioritize review of this one after.

@natasha41575 I get your idea, it make sense to make this event after resize logic become easier and clearer. I will keep monitoring #131612 and rebase code after it is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/code-generation area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Archived in project
Status: In Progress
Status: Archive-it
Status: Pending inclusion
Development

Successfully merging this pull request may close these issues.

[FG:InPlacePodVerticalScaling] Emit a events when resize status changes