
DRA: Pod termination is stuck when DRA Driver is stopped #129402

Closed
@mochizuki875

Description


What happened?

A Pod that has been allocated a device remains in the Terminating status when the DRA Driver is stopped.
I don't know whether this is intentional or a bug.

What did you expect to happen?

The Pod is completely terminated.

How can we reproduce it (as minimally and precisely as possible)?

We can reproduce it using the dra-example-driver.
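
For reference, the dra-example-driver demo runs on a kind cluster with the DynamicResourceAllocation feature gate and the resource.k8s.io/v1beta1 API enabled. A minimal kind configuration along those lines (an illustrative sketch with a hypothetical file name; the demo ships its own cluster config) is:

kind-dra-cluster.yaml

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# Enable the DRA feature gate on all components.
featureGates:
  DynamicResourceAllocation: true
# Enable the beta resource.k8s.io API group on the API server.
runtimeConfig:
  "resource.k8s.io/v1beta1": "true"
nodes:
- role: control-plane
- role: worker
$ kind create cluster --config kind-dra-cluster.yaml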

Summary

  1. Install the DRA Driver (dra-example-driver) and create a DeviceClass.
  2. Create a ResourceClaimTemplate.
  3. Deploy a Pod that is allocated a device via the ResourceClaimTemplate.
  4. Stop the DRA Driver.
  5. Delete the Pod.
  6. The Pod remains in the Terminating status.

Procedure

Install the DRA Driver and create a DeviceClass by following the dra-example-driver demo.
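
The demo's DeviceClass is named gpu.example.com and selects the devices published by the example driver. A sketch of such a DeviceClass (illustrative; the demo provides the actual manifest) looks roughly like:

apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  # Match only devices published by the gpu.example.com driver.
  - cel:
      expression: device.driver == "gpu.example.com"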

Create a ResourceClaimTemplate.

resource-claim-template-0.yaml

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com
$ kubectl apply -f resource-claim-template-0.yaml

Deploy a Pod that is allocated a device via the ResourceClaimTemplate.

sample-pod-0.yaml

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod-0
  labels:
    app: sample-pod-0
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
$ kubectl apply -f sample-pod-0.yaml

The current status of the cluster is as follows.

$ kubectl get deviceclasses,resourceclaimtemplates,resourceclaim,pods
NAME                                          AGE
deviceclass.resource.k8s.io/gpu.example.com   54m

NAME                                               AGE
resourceclaimtemplate.resource.k8s.io/single-gpu   85s

NAME                                                   STATE                AGE
resourceclaim.resource.k8s.io/sample-pod-0-gpu-hxsn7   allocated,reserved   77s

NAME               READY   STATUS    RESTARTS   AGE
pod/sample-pod-0   1/1     Running   0          77s

Stop the DRA Driver.
In this case, we can uninstall the dra-example-driver with Helm.

$ helm -n dra-example-driver uninstall dra-example-driver
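
Before deleting the Pod, we can confirm the driver's kubelet plugin pods are gone (the namespace name follows the demo; adjust if yours differs), which should report that no resources are found once the uninstall completes:

$ kubectl get pods -n dra-example-driver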

Delete the Pod; its status remains Terminating.

$ kubectl delete po sample-pod-0
pod "sample-pod-0" deleted
(stuck here...)

$ kubectl get pod
NAME           READY   STATUS        RESTARTS   AGE
sample-pod-0   0/1     Terminating   0          17m
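
As a workaround (not a fix), the stuck Pod object can be force-deleted, which removes it from the API server without waiting for the kubelet; note that this does not make the kubelet unprepare the claim's resources on the node:

$ kubectl delete pod sample-pod-0 --grace-period=0 --force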

Anything else we need to know?

The kubelet log shows the following error.

# journalctl -xu kubelet
...
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: I1226 05:35:41.323475     231 kubelet.go:2490] "SyncLoop DELETE" source="api" pods=["default/sample-pod-0"]
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: I1226 05:35:41.323611     231 kuberuntime_container.go:809] "Killing container with a grace period" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc" containerName="ctr0" containerID="containerd://f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05" gracePeriod=30
Dec 26 05:35:41 kind-v1.32.0-worker kubelet[231]: E1226 05:35:41.415637     231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032843     231 generic.go:358] "Generic (PLEG): container finished" podID="5598e74f-08ff-40ab-aba3-fa811874f9dc" containerID="f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05" exitCode=0
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032885     231 kubelet.go:2506] "SyncLoop (PLEG): event for pod" pod="default/sample-pod-0" event={"ID":"5598e74f-08ff-40ab-aba3-fa811874f9dc","Type":"ContainerDied","Data":"f42f59f5684ff9be1a0ee57d70231456530e40e87e652ddf0b72c7639a544a05"}
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032901     231 kubelet.go:2506] "SyncLoop (PLEG): event for pod" pod="default/sample-pod-0" event={"ID":"5598e74f-08ff-40ab-aba3-fa811874f9dc","Type":"ContainerDied","Data":"220eab5b9bc2277f4ac7777dabb82a92bee9d3d17bfa42f92d204cb9d85c936d"}
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032907     231 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="220eab5b9bc2277f4ac7777dabb82a92bee9d3d17bfa42f92d204cb9d85c936d"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: I1226 05:35:42.032934     231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:35:42 kind-v1.32.0-worker kubelet[231]: E1226 05:35:42.044794     231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:35:53 kind-v1.32.0-worker kubelet[231]: I1226 05:35:53.469847     231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:35:53 kind-v1.32.0-worker kubelet[231]: E1226 05:35:53.482748     231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
Dec 26 05:36:08 kind-v1.32.0-worker kubelet[231]: I1226 05:36:08.468746     231 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="default/sample-pod-0"
Dec 26 05:36:08 kind-v1.32.0-worker kubelet[231]: E1226 05:36:08.480012     231 pod_workers.go:1301] "Error syncing pod, skipping" err="get gRPC client for DRA driver gpu.example.com: plugin name gpu.example.com not found in the list of registered DRA plugins" pod="default/sample-pod-0" podUID="5598e74f-08ff-40ab-aba3-fa811874f9dc"
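
The error comes from the kubelet's DRA manager: to finish terminating the pod it needs a gRPC client for the gpu.example.com driver (to call NodeUnprepareResources), but the plugin was deregistered when the driver was uninstalled, so the sync keeps failing and the Pod stays in Terminating. Reinstalling the driver re-registers the plugin and lets the termination complete. With the dra-example-driver repository checked out, that is roughly (the exact chart path and values depend on how the demo was installed):

$ helm upgrade -i --create-namespace --namespace dra-example-driver \
    dra-example-driver deployments/helm/dra-example-driver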

Kubernetes version

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0

Cloud provider

none

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

$ kind version
kind v0.26.0 go1.23.4 linux/amd64

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels

kind/bug, priority/backlog, sig/node, triage/accepted, wg/device-management