Closed
Description
What happened?
A Pod that has a device allocated to it stays in the Terminating status when it is deleted after the DRA Driver has been stopped.
I don't know whether this is intentional or a bug.
What did you expect to happen?
The Pod is terminated completely.
How can we reproduce it (as minimally and precisely as possible)?
We can reproduce it using dra-example-driver.
Summary
- Install the DRA Driver (dra-example-driver) and create a DeviceClass.
- Create a ResourceClaimTemplate.
- Deploy a Pod that is allocated a device via the ResourceClaimTemplate.
- Stop the DRA Driver.
- Delete the Pod.
- The Pod remains in the Terminating status.
Procedure
Install the DRA Driver and create a DeviceClass by following the dra-example-driver demo.
Create a ResourceClaimTemplate.
resource-claim-template-0.yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
$ kubectl apply -f resource-claim-template-0.yaml
Deploy a Pod that is allocated a device via the ResourceClaimTemplate.
sample-pod-0.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod-0
  labels:
    app: sample-pod-0
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
$ kubectl apply -f sample-pod-0.yaml
The current state of the cluster is as follows.
$ kubectl get deviceclasses,resourceclaimtemplates,resourceclaim,pods
NAME                                          AGE
deviceclass.resource.k8s.io/gpu.example.com   54m

NAME                                               AGE
resourceclaimtemplate.resource.k8s.io/single-gpu   85s

NAME                                                   STATE                AGE
resourceclaim.resource.k8s.io/sample-pod-0-gpu-hxsn7   allocated,reserved   77s

NAME               READY   STATUS    RESTARTS   AGE
pod/sample-pod-0   1/1     Running   0          77s
Stop the DRA Driver.
In this case, we can uninstall the dra-example-driver via Helm.
$ helm -n dra-example-driver uninstall dra-example-driver
Delete the Pod. Its status remains Terminating.
$ kubectl delete po sample-pod-0
pod "sample-pod-0" deleted
(the command hangs here...)
$ kubectl get pod
NAME           READY   STATUS        RESTARTS   AGE
sample-pod-0   0/1     Terminating   0          17m
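To pick out the stuck Pods from a longer listing, a small filter over the `kubectl get pod` table may help. This is only a sketch: `terminating_pods` is a hypothetical helper name, and it assumes the default column order (NAME READY STATUS RESTARTS AGE) shown above.

```shell
# Hypothetical helper: list Pods whose STATUS column is Terminating.
# Reads the table printed by `kubectl get pod` on stdin.
terminating_pods() {
  # Skip the header row; STATUS is the third column in the default output.
  awk 'NR > 1 && $3 == "Terminating" { print $1 }'
}

# Usage against a cluster (not run here):
#   kubectl get pod | terminating_pods
```

In the reproduction above, this would print only sample-pod-0.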
Anything else we need to know?
The kubelet log shows the following error.
# journalctl -xu kubelet
...
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: I1226 05:35:41.323475 231 kubelet.go:2490] "SyncLoop DELETE" source="api" pods=["default/sample-pod-0"]
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: I1226 05:35:41.323611 231 kuberuntime_container.go:809] "Killing container with a grace period" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc" containerName="ctr0" containerID="containerd://f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05" gracePeriod=30
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: E1226 05:35:41.415637 231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032843 231 generic.go:358] "Generic (PLEG): container finished" podID="5598e74f-08ff-40ab-aba3-fa811874f9dc" containerID="f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05" exitCode=0
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032885 231 kubelet.go:2506] "SyncLoop (PLEG): event for pod" pod="default/sample-pod-0" event={"ID":"5598e74f-08ff-40ab-aba3-fa811874f9dc","Type":"ContainerDied","Data":"f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05"}
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032901 231 kubelet.go:2506] "SyncLoop (PLEG): event for pod" pod="default/sample-pod-0" event={"ID":"5598e74f-08ff-40ab-aba3-fa811874f9dc","Type":"ContainerDied","Data":"220eab5b9bc2277f4ac7777dabb82a92bee9d3d17bfa42f92d204cb9d85c936d"}
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032907 231 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="220eab5b9bc2277f4ac7777dabb82a92bee9d3d17bfa42f92d204cb9d85c936d"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032934 231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: E1226 05:35:42.044794 231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:35:53 kind-v1.32.0-worker kubelet[231]: I1226 05:35:53.469847 231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:35:53 kind-v1.32.0-worker kubelet[231]: E1226 05:35:53.482748 231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:36:08 kind-v1.32.0-worker kubelet[231]: I1226 05:36:08.468746 231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:36:08 kind-v1.32.0-worker kubelet[231]: E1226 05:36:08.480012 231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
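The "Error syncing pod, skipping" entries above all carry the same "not found in the list of registered DRA plugins" message, repeated on each sync attempt. A minimal sketch for counting them in a kubelet log is shown below. `count_dra_plugin_errors` is a hypothetical helper name, and the match string is copied from the log lines above, so it may need adjusting for other Kubernetes versions.

```shell
# Hypothetical helper: count kubelet log lines reporting that a DRA driver
# is not registered. Reads log text on stdin; $1 is the driver name.
count_dra_plugin_errors() {
  driver="$1"
  # -F matches a fixed string, -c prints the number of matching lines.
  # grep exits non-zero when the count is 0, so "|| true" keeps the
  # helper's exit status clean while still printing 0.
  grep -Fc "plugin name ${driver} not found in the list of registered DRA plugins" || true
}

# Usage on the worker node (not run here):
#   journalctl -u kubelet | count_dra_plugin_errors gpu.example.com
```

A count that keeps growing while the Pod sits in Terminating suggests the kubelet is still retrying the same failing driver lookup.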
Kubernetes version
$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
Cloud provider
none
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
$ kind version
kind v0.26.0 go1.23.4 linux/amd64
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)