chore(kubelet): add container name for sending Pod Hook fail event #130471

Open: googs1025 wants to merge 1 commit into master from feature/add_event_info_hook

Conversation

@googs1025 (Member) commented Feb 27, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

We have multiple containers with hooks; one succeeds and the other fails, but we cannot tell from the events which container's hook failed, which hampers troubleshooting.
We should at least record the container name. (The detailed error message is intentionally not recorded in the event; see https://github.com/kubernetes/kubernetes/pull/86139/files#r665010260.) Reproduction manifest:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-prestarthook-error
spec:
  containers:
  - name: container-with-hooks1
    image: nginx
    lifecycle:
      # Both hooks fail at runtime: nonexistentcommand is not on the PATH.
      preStop:
        exec:
          command: ["/bin/sh", "-c", "echo 'PreStop Hook Executing...' && nonexistentcommand 2>&1"]
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo 'PostStart Hook Executing...' && nonexistentcommand 2>&1"]
  - name: container-with-hooks2
    image: nginx
    lifecycle:
      # Both hooks fail at runtime: the unterminated quote is a shell syntax error.
      preStop:
        exec:
          command: ["/bin/sh", "-c", "echo 'PreStop Hook Executing..."]
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo 'PostStart Hook Executing..."]
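
Describing the pod shows both containers waiting with reason PostStartHookError, while the hook failure events carry no container name: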
Name:             test-pod-prestarthook-error
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Thu, 27 Feb 2025 20:03:53 +0800
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               10.244.18.115
IPs:
  IP:  10.244.18.115
Containers:
  container-with-hooks1:
    Container ID:   docker://2253ef036bfbcb8db798c3e9e176d3aa2cd7a29101e196b304d0ea70f4f0d31e
    Image:          nginx
    Image ID:       docker-pullable://nginx@sha256:9d6b58feebd2dbd3c56ab5853333d627cc6e281011cfd6050fa4bcf2072c9496
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PostStartHookError
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 Feb 2025 20:03:58 +0800
      Finished:     Thu, 27 Feb 2025 20:03:59 +0800
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q82bj (ro)
  container-with-hooks2:
    Container ID:   docker://bededf7f0bf6908f51ed30b52766b6137b932c3afb3a8db5f5f4282017231d38
    Image:          nginx
    Image ID:       docker-pullable://nginx@sha256:9d6b58feebd2dbd3c56ab5853333d627cc6e281011cfd6050fa4bcf2072c9496
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PostStartHookError
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 Feb 2025 20:04:03 +0800
      Finished:     Thu, 27 Feb 2025 20:04:03 +0800
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q82bj (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-q82bj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age               From               Message
  ----     ------               ----              ----               -------
  Normal   Scheduled            20s               default-scheduler  Successfully assigned default/test-pod-prestarthook-error to minikube
  Normal   Pulled               15s               kubelet            Successfully pulled image "nginx" in 4.991s (4.991s including waiting). Image size: 197285467 bytes.
  Normal   Pulled               10s               kubelet            Successfully pulled image "nginx" in 4.128s (4.128s including waiting). Image size: 197285467 bytes.
  Normal   Pulling              9s (x2 over 20s)  kubelet            Pulling image "nginx"
  Normal   Pulling              5s (x2 over 14s)  kubelet            Pulling image "nginx"
  Warning  FailedPostStartHook  5s (x2 over 14s)  kubelet            PostStartHook failed
  Normal   Killing              5s (x2 over 14s)  kubelet            FailedPostStartHook
  Warning  FailedPreStopHook    5s (x2 over 14s)  kubelet            PreStopHook failed
  Normal   Started              5s (x2 over 14s)  kubelet            Started container container-with-hooks1
  Normal   Created              5s (x2 over 15s)  kubelet            Created container: container-with-hooks1
  Normal   Pulled               5s                kubelet            Successfully pulled image "nginx" in 4.317s (4.317s including waiting). Image size: 197285467 bytes.
  Normal   Created              0s (x2 over 10s)  kubelet            Created container: container-with-hooks2
  Normal   Started              0s (x2 over 10s)  kubelet            Started container container-with-hooks2
  Warning  FailedPostStartHook  0s (x2 over 10s)  kubelet            PostStartHook failed
  Normal   Killing              0s (x2 over 10s)  kubelet            FailedPostStartHook
  Warning  FailedPreStopHook    0s (x2 over 10s)  kubelet            PreStopHook failed
  Normal   Pulled               0s                kubelet            Successfully pulled image "nginx" in 4.162s (4.162s including waiting). Image size: 197285467 bytes.
  • add more info for Pod Hook fail event

Which issue(s) this PR fixes:

Fixes #: see the discussion in #130438

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Enhance Pod lifecycle hook failure events with container names

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot added labels: release-note-none, size/XS, kind/cleanup, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, area/kubelet, sig/node; and removed: do-not-merge/needs-sig. Feb 27, 2025
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: googs1025
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bart0sh (Contributor) commented Feb 27, 2025

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 27, 2025
@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node: code and documentation PRs Feb 27, 2025
  // do not record the message in the event so that secrets won't leak from the server.
- m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, "PostStartHook failed")
+ m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, eventMessage)
A reviewer (Contributor) commented on this diff:
nit: recordContainerEvent can format messages:

Suggested change:
- m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, eventMessage)
+ m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, "PostStartHook failed for container %s", container.Name)
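
For context: recordContainerEvent treats its message as a printf-style format string whenever trailing arguments are supplied, which is what the suggestion above relies on. A minimal standalone sketch of that behavior (recordEvent here is a hypothetical stand-in, not the kubelet helper's real signature):

package main

import "fmt"

// recordEvent mimics the formatting behavior of kubelet's recordContainerEvent:
// if trailing args are supplied, message is treated as a fmt format string.
func recordEvent(reason, message string, args ...interface{}) {
	eventMessage := message
	if len(args) > 0 {
		eventMessage = fmt.Sprintf(message, args...)
	}
	fmt.Printf("%s: %s\n", reason, eventMessage)
}

func main() {
	// Mirrors the suggested call: the container name rides in as a format argument.
	recordEvent("FailedPostStartHook", "PostStartHook failed for container %s", "container-with-hooks1")
	// Output: FailedPostStartHook: PostStartHook failed for container container-with-hooks1
}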

@googs1025 (Member, Author) replied:

done

  // do not record the message in the event so that secrets won't leak from the server.
- m.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeWarning, events.FailedPreStopHook, "PreStopHook failed")
+ m.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeWarning, events.FailedPreStopHook, eventMessage)
The same reviewer (Contributor):

the same here.
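
Presumably the analogous change here, mirroring the PostStart suggestion above, would be along these lines (a sketch, not a confirmed suggestion from the review):

m.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeWarning, events.FailedPreStopHook, "PreStopHook failed for container %s", containerSpec.Name)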

@googs1025 googs1025 force-pushed the feature/add_event_info_hook branch from 0ed5048 to ea7fc2d Compare February 27, 2025 13:57
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 27, 2025
@bart0sh (Contributor) commented Feb 27, 2025

@googs1025 can you look at other calls of recordContainerEvent and change them as well if they'd benefit from adding the container name to the message.

@googs1025 (Member, Author) replied:

> @googs1025 can you look at other calls of recordContainerEvent and change them as well if they'd benefit from adding the container name to the message.

Thanks for the feedback. I took a quick look; the other events already seem to include detailed message info. I understand the message should contain enough information to determine which container it is. 🤔

@HirazawaUi (Contributor) commented Feb 28, 2025

This is similar to #129300; my position: the container information is already present in the Event through the involvedObject.fieldPath field, but kubectl doesn't display it by default.
Ref:

func GenerateContainerRef(pod *v1.Pod, container *v1.Container) (*v1.ObjectReference, error) {
	fieldPath, err := fieldPath(pod, container)
	if err != nil {
		// TODO: figure out intelligent way to refer to containers that we implicitly
		// start (like the pod infra container). This is not a good way, ugh.
		fieldPath = ImplicitContainerPrefix + container.Name
	}
	ref, err := ref.GetPartialReference(legacyscheme.Scheme, pod, fieldPath)
	if err != nil {
		return nil, err
	}
	return ref, nil
}
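
(For reference: for a regular named container, the generated fieldPath takes the form spec.containers{container-with-hooks1}, and that is the string stored in the event's involvedObject.fieldPath.)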

If we want to show container names in events by default, maybe we should implement this in kubectl's output formatting rather than elsewhere.

/cc @soltysh @ardaguclu Would like to get some input from kubectl maintainers.

@googs1025 (Member, Author) replied:

Why do we print the container name in other events?

 Normal   Started              39m (x3 over 39m)      kubelet            Started container container-with-hooks1
  Normal   Created              39m (x3 over 39m)      kubelet            Created container: container-with-hooks1

We mainly have user feedback (in our group) that when a postStart hook fails, it is not clear which container failed to execute it.

@HirazawaUi (Contributor) replied:

I'm not sure if my understanding is correct. The sig-node maintainers might have more accurate and comprehensive insights.

If you want to retrieve the involvedObject of an event, you can use kubectl get events -o wide
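
(In that wide output the container should appear in the SUBOBJECT column, which is populated from involvedObject.fieldPath.)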

@googs1025 googs1025 requested a review from bart0sh March 3, 2025 05:09
@googs1025 (Member, Author):

/test pull-kubernetes-e2e-kind-ipv6

@googs1025 (Member, Author):

@SergeyKanzhelev @ffromani can you please take a look?

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2025
@SergeyKanzhelev (Member) commented:

> I'm not sure if my understanding is correct. The sig-node maintainers might have more accurate and comprehensive insights.
>
> If you want to retrieve the involvedObject of an event, you can use kubectl get events -o wide

Can you clarify, is this a formatting issue or consistency with other events? @googs1025 please clarify the intent here. I do not know the history so if you have done digging already, it will help if you can summarize the state we are in.

@googs1025 (Member, Author) replied:

> Can you clarify, is this a formatting issue or consistency with other events? @googs1025 please clarify the intent here. I do not know the history so if you have done digging already, it will help if you can summarize the state we are in.

@SergeyKanzhelev thanks for this.

This is not just a formatting issue, but rather a matter of completeness in event reporting.

When multiple containers in a Pod define lifecycle hooks (e.g., postStart, preStop), and one of them fails during execution, the current events only show:

Warning  FailedPostStartHook  ...  PostStartHook failed

There's no indication of which container triggered the failure. This makes troubleshooting significantly harder.

We've observed that other container-related events (like Started, Failed, Killing) do include the container name for clarity:

Normal   Started     ...  Started container my-container
Warning  Failed      ...  Failed to start container "my-container": ...

However, lifecycle hook failures are missing this context, leading to inconsistency across similar types of events.

The purpose of this PR is to close that gap by including the container name in lifecycle hook failure events, improving both observability and debuggability.
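
For example, after this change the event above would read roughly:

Warning  FailedPostStartHook  ...  PostStartHook failed for container container-with-hooks1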

@ffromani (Contributor) commented:

/remove-lifecycle stale
/triage accepted
/priority backlog

it seems a nice QoL change to me

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 13, 2025
@HirazawaUi (Contributor) commented Jun 13, 2025

To mitigate any potential errors caused by my earlier assumptions, I conducted some research:

In PR #73892, we previously adjusted messages for some container events by adding container names to messages for events like Created, Started, and Stopping. However, these changes did not include container lifecycle hook events.

In PR #86139, we refactored the behavior of lifecycle hooks. The current message format for container lifecycle hook events was established in that PR and has remained unchanged since. Before that PR, messages for container lifecycle hook events would log the hook's failure message; after it, we only record messages like "PostStartHook failed", etc.

Furthermore, during PR #86139 we did not discuss whether to add container names to the messages for container lifecycle hook events; this aspect appears to have been overlooked.

Additionally, prior to #73892, similar efforts had been made by others. However, @yujuhong rejected the proposal using reasoning similar to mine. At that time, kubectl describe pods would display the container name in the Subject field of events. Now, kubectl describe no longer displays this field.

Ref: #66855 (comment)

Nevertheless, since #73892 has already been merged, I believe maintaining consistency in container event messages is what we should do.

@googs1025 (Member, Author) replied:

> (quoting @HirazawaUi's research summary above in full)

@HirazawaUi Thank you so much for taking the time to search for the context. 😄
