Kubelet cpumanager inaccurately calculates the narrowest matching NUMANodeAffinity #132187

Closed
@jimkingstone

What happened?

On a machine with 128 cores and 8 NUMA nodes (AMD + NVIDIA 4090D), the kubelet is configured with the TopologyManager set to the restricted policy, and one CPU on each NUMA node is reserved. The total, allocatable, and reserved CPUs per NUMA node are as follows:

NUMA Node  Total CPU IDs (16)  Allocatable CPU IDs (15)  Reserved CPU IDs (1)
NUMA 0     0-7,64-71           1-7,64-71                 0
NUMA 1     8-15,72-79          9-15,72-79                8
NUMA 2     16-23,80-87         17-23,80-87               16
NUMA 3     24-31,88-95         25-31,88-95               24
NUMA 4     32-39,96-103        33-39,96-103              32
NUMA 5     40-47,104-111       41-47,104-111             40
NUMA 6     48-55,112-119       49-55,112-119             48
NUMA 7     56-63,120-127       57-63,120-127             56

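When creating a pod that requests 112 CPUs and 8 GPUs, the CPU Manager does not take the reserved CPUs into account when calculating the narrowest matching NUMANodeAffinity. It counts all 16 CPUs on each NUMA node, so 7 NUMA nodes (7 × 16 = 112) appear sufficient, and the narrowest affinity is computed as a 7 NUMA node bitmask (for example, 01111111).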

However, when generating hints, the CPU Manager uses the actual allocatable CPUs and generates a hint with a bitmask of 11111111. Since the narrowest matching NUMANodeAffinity is 7, the hint with the full 8-node bitmask (11111111) is marked as preferred=false.

As a result, no preferred hint exists for this pod, and under the restricted policy the pod is rejected and cannot be created.
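The mismatch described above can be reproduced with a small model. This is a simplified sketch, not the kubelet source: the names `narrowest`, `totalPerNode`, and `allocPerNode` are hypothetical, and it assumes every NUMA node contributes the same number of CPUs, as in the table above. It enumerates NUMA bitmasks once with the total CPU count per node (how the narrowest affinity is described as being computed) and once with the allocatable count (how hints are described as being generated).

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	numaNodes    = 8
	totalPerNode = 16 // CPUs per NUMA node, reserved CPU included
	allocPerNode = 15 // CPUs per NUMA node, reserved CPU excluded
)

// narrowest enumerates every non-empty NUMA bitmask and returns the smallest
// number of set bits for which the mask covers the request, assuming each
// node contributes cpusPerNode CPUs. It starts from numaNodes, so it returns
// numaNodes when no narrower mask qualifies.
func narrowest(cpusPerNode, request int) int {
	best := numaNodes
	for mask := uint(1); mask < 1<<numaNodes; mask++ {
		width := bits.OnesCount(mask)
		if width*cpusPerNode >= request && width < best {
			best = width
		}
	}
	return best
}

func main() {
	request := 112

	// Narrowest affinity computed without subtracting reserved CPUs.
	minAffinity := narrowest(totalPerNode, request)
	// Narrowest mask for which a hint can actually be generated.
	hintWidth := narrowest(allocPerNode, request)

	// A hint is preferred only if its width equals the narrowest affinity.
	fmt.Printf("minAffinity=%d hintWidth=%d preferred=%v\n",
		minAffinity, hintWidth, hintWidth == minAffinity)
}
```

With the values from this report, the model yields a narrowest affinity of 7 but a narrowest achievable hint of 8, so the only hint is not preferred.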

Related code:
https://github.com/kubernetes/kubernetes/blob/v1.25.12/pkg/kubelet/cm/cpumanager/policy_static.go#L536-L574

What did you expect to happen?

Because of the reserved CPUs, a hint with bitmask = 01111111 will never be generated: any 7 NUMA nodes provide only 7 × 15 = 105 allocatable CPUs, fewer than the 112 requested. Calculating the narrowest matching NUMANodeAffinity as 7 is therefore meaningless. Reserved CPUs should be taken into account when computing the narrowest matching NUMANodeAffinity, and 8 would be the correct value (8 × 15 = 120 ≥ 112).

How can we reproduce it (as minimally and precisely as possible)?

  • A kubelet with the Topology Manager configured as follows (main settings only):
apiVersion: kubelet.config.k8s.io/v1beta1
……
kubeReserved:
  cpu: 200m
  ephemeral-storage: 1Gi
  memory: 300Mi
systemReserved:
  cpu: 200m
  ephemeral-storage: 1Gi
  memory: 1Gi
reservedSystemCPUs: "0,8,16,24,32,40,48,56"
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: restricted
topologyManagerScope: pod
featureGates:
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
reservedMemory:
- numaNode: 0
  limits:
    memory: "178Mi"
- numaNode: 1
  limits:
    memory: "178Mi"
- numaNode: 2
  limits:
    memory: "178Mi"
- numaNode: 3
  limits:
    memory: "178Mi"
- numaNode: 4
  limits:
    memory: "178Mi"
- numaNode: 5
  limits:
    memory: "178Mi"
- numaNode: 6
  limits:
    memory: "178Mi"
- numaNode: 7
  limits:
    memory: "178Mi"
  • A machine with 128 cores and 8 NUMA nodes, such as an AMD system with an NVIDIA 4090D.

  • Set 56Gi of hugepages-2Mi on each NUMA node.

#!/bin/bash
# Run as root. 28672 pages x 2 MiB = 56 GiB of hugepages per NUMA node.

for node_path in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
    echo 28672 > "$node_path/nr_hugepages"
done
  • Create a pod with the following resource configuration:
    resources:
      limits:
        cpu: "112"
        hugepages-2Mi: 448Gi
        memory: "3212837233"
        nvidia.com/AD102_GEFORCE_RTX_4090_D: "8"
      requests:
        cpu: "112"
        hugepages-2Mi: 448Gi
        memory: "3212837233"
        nvidia.com/AD102_GEFORCE_RTX_4090_D: "8"
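As a sanity check on the reproduction steps, the hugepage arithmetic can be verified with a few lines. This is only an illustrative sketch with hypothetical names (`perNodeGiB`, `totalGiB`); it confirms that the `nr_hugepages` value in the script corresponds to 56Gi per NUMA node and that the pod's 448Gi request spans all 8 nodes.

```go
package main

import "fmt"

const (
	pagesPerNode = 28672 // value written to nr_hugepages on each NUMA node
	pageSizeMiB  = 2     // hugepages-2Mi
	numaNodes    = 8
)

// perNodeGiB converts the per-node page count into GiB.
func perNodeGiB() int { return pagesPerNode * pageSizeMiB / 1024 }

// totalGiB is the hugepage capacity summed over all NUMA nodes.
func totalGiB() int { return perNodeGiB() * numaNodes }

func main() {
	fmt.Printf("per node: %dGi, total: %dGi\n", perNodeGiB(), totalGiB())
}
```

The pod therefore needs hugepages from every NUMA node, consistent with the 8-node hint discussed above.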

Anything else we need to know?

No response

Kubernetes version

$ kubelet --version
Kubernetes v1.25.12

Cloud provider

OS version

$ cat /etc/redhat-release
Rocky Linux release 9.2 (Blue Onyx)

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels: kind/bug, needs-triage, sig/node
