What happened?
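According to the metric definition in kubernetes/pkg/kubelet/metrics/metrics.go (lines 240 to 257 in 832be95), kubelet_pod_start_sli_duration_seconds should exclude the time spent running init containers.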
I tested this by creating many pods with a 10-second sleep in their init containers and then checking the metric on the relevant kubelet. I see output like this:
kubelet_pod_start_sli_duration_seconds_bucket{le="0.5"} 0
kubelet_pod_start_sli_duration_seconds_bucket{le="1"} 0
kubelet_pod_start_sli_duration_seconds_bucket{le="2"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="3"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="4"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="5"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="6"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="8"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="10"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="20"} 43
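Prometheus histogram buckets are cumulative, so the skew is easier to see once the counts are converted to per-interval values. The following is a minimal sketch that does this for the sample above; the helper name per_bucket_counts is made up for illustration and is not part of any Kubernetes or Prometheus tooling.

```python
import re

# Sample copied from the kubelet output above (cumulative `le` buckets).
SAMPLE = """\
kubelet_pod_start_sli_duration_seconds_bucket{le="0.5"} 0
kubelet_pod_start_sli_duration_seconds_bucket{le="1"} 0
kubelet_pod_start_sli_duration_seconds_bucket{le="2"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="3"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="4"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="5"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="6"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="8"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="10"} 1
kubelet_pod_start_sli_duration_seconds_bucket{le="20"} 43
"""

BUCKET_RE = re.compile(r'_bucket\{le="([^"]+)"\}\s+(\d+)')


def per_bucket_counts(text):
    """Hypothetical helper: turn cumulative bucket counts into per-interval counts."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_RE.search(line)
        if match:
            buckets.append((float(match.group(1)), int(match.group(2))))
    buckets.sort()

    result = []
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        # Each interval holds the observations added since the previous bound.
        result.append(((prev_bound, bound), count - prev_count))
        prev_bound, prev_count = bound, count
    return result


counts = per_bucket_counts(SAMPLE)
for (low, high), n in counts:
    if n:
        print(f"({low:g}, {high:g}] seconds: {n} pod(s)")
```

For this sample, 42 of the 43 observations land in the (10, 20] interval, which is what one would expect if the 10-second init container sleep were being counted.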
What did you expect to happen?
I would expect the le="10", le="8", le="6", etc. counters to increment in this case, since init container time should be excluded.
How can we reproduce it (as minimally and precisely as possible)?
I used a deployment yaml like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 50
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      nodeName: $YOUR_NODE_NAME
      initContainers:
      - name: init
        image: busybox
        command: ["sh", "-c", "sleep 10"]
      containers:
      - name: test
        image: k8s.gcr.io/pause:3.9
Then SSH onto $YOUR_NODE_NAME and, with an appropriate bearer token, run:
curl -sk -H "Authorization: Bearer $BEARER_TOKEN" https://localhost:10250/metrics | grep kubelet_pod_start_sli_duration
(https://yuki-nakamura.com/2023/10/15/get-kubelets-metrics-manually/ is a helpful resource on creating a ServiceAccount and ClusterRoleBinding to get a bearer token for the kubelet metrics endpoint. It is a little out of date, though: you will also need to follow something like https://stackoverflow.com/questions/73164466/how-to-create-a-secret-for-service-account-using-kubernetes-version-1-24 to create a Secret from which to read the bearer token.)
Anything else we need to know?
By code inspection, I cannot find any code that deducts init container time.
I believe the key part of the code is here:
kubernetes/pkg/kubelet/util/pod_startup_latency_tracker.go
Lines 99 to 123 in 832be95
I read this as calculating t(all containers started) - t(pod created) - (t(last pull finished) - t(first pull started))
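To make that reading concrete, here is a small worked sketch. It models only the formula as I read it, not the kubelet's actual code, and every timestamp below is a made-up offset from pod creation chosen to resemble the 10-second init container test.

```python
from datetime import timedelta


def sli_as_read(created, all_started, first_pull_started, last_pull_finished):
    """Model of the formula as read from the tracker (an interpretation, not kubelet code)."""
    return (all_started - created) - (last_pull_finished - first_pull_started)


# Hypothetical timeline, expressed as offsets from pod creation.
created = timedelta(seconds=0)
first_pull_started = timedelta(seconds=1)
last_pull_finished = timedelta(seconds=3)
init_duration = timedelta(seconds=10)  # the `sleep 10` init container
all_started = timedelta(seconds=14)

# Only the image pull window is subtracted: 14 - 0 - (3 - 1) = 12 seconds.
observed = sli_as_read(created, all_started, first_pull_started, last_pull_finished)

# If init container time were also deducted, the result would be 2 seconds.
expected = observed - init_duration

print(observed.total_seconds(), expected.total_seconds())
```

Under these assumed timings the formula yields 12 seconds, which falls in the le="20" bucket, whereas deducting the init container time would give 2 seconds.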
I'm also suspicious because I didn't see anything that excludes pods with stateful volumes, although I might have just missed it.
Similarly, I suspect that this metric implementation currently includes the time between pod creation and successful scheduling. If so, I doubt that it matches the intent of the term "schedulable" in the SLO documentation.
Kubernetes version
$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.7
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux aks-nodepool1-58959923-vmss000000 5.15.0-1087-azure #96-Ubuntu SMP Fri Mar 28 20:31:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)