Description
What happened?
On a machine with 128 cores and 8 NUMA nodes (AMD CPU + NVIDIA 4090D GPUs), the kubelet is configured with the TopologyManager `restricted` policy and has a portion of the CPUs reserved. The reserved and allocatable CPU resources are as follows:
| NUMA Node | Total CPU IDs (16 per node) | Allocatable CPU IDs (15 per node) | Reserved CPU IDs (1 per node) |
|---|---|---|---|
| NUMA 0 | 0-7,64-71 | 1-7,64-71 | 0 |
| NUMA 1 | 8-15,72-79 | 9-15,72-79 | 8 |
| NUMA 2 | 16-23,80-87 | 17-23,80-87 | 16 |
| NUMA 3 | 24-31,88-95 | 25-31,88-95 | 24 |
| NUMA 4 | 32-39,96-103 | 33-39,96-103 | 32 |
| NUMA 5 | 40-47,104-111 | 41-47,104-111 | 40 |
| NUMA 6 | 48-55,112-119 | 49-55,112-119 | 48 |
| NUMA 7 | 56-63,120-127 | 57-63,120-127 | 56 |
When creating a pod that requests 112 CPUs and 8 GPUs, the CPU Manager does not take the reserved CPUs into account when calculating the narrowest matching `NUMANodeAffinity`, which comes out as a 7-NUMA-node bitmask (`01111111`). However, when generating hints, the CPU Manager uses the actual allocatable CPUs and produces only a hint with the full 8-node bitmask (`11111111`). Since the narrowest matching `NUMANodeAffinity` spans only 7 nodes, that 8-node hint is marked `preferred=false`.
As a result, under the `restricted` policy, a pod with this configuration cannot be created.
Related code:
https://github.com/kubernetes/kubernetes/blob/v1.25.12/pkg/kubelet/cm/cpumanager/policy_static.go#L536-L574
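For illustration, here is a minimal, self-contained sketch (not the kubelet code itself) that models the arithmetic described above using the numbers from the table: the minimum affinity size is derived from the 16 total CPUs per NUMA node (reserved CPU included), while hints are only generated for masks whose 15 allocatable CPUs per node can cover the request.

```go
package main

import "fmt"

func main() {
	const (
		numaNodes    = 8
		cpusPerNode  = 16 // total CPUs per NUMA node (reserved CPU included)
		allocPerNode = 15 // allocatable CPUs per NUMA node (1 reserved)
		request      = 112
	)

	// The minimum affinity size is derived from total CPUs per node,
	// mirroring how policy_static.go counts CPUs in each NUMA mask.
	minAffinitySize := numaNodes
	for nodes := 1; nodes <= numaNodes; nodes++ {
		if nodes*cpusPerNode >= request && nodes < minAffinitySize {
			minAffinitySize = nodes // becomes 7: 7*16 = 112 >= 112
		}
	}

	// A hint for an N-node mask is only emitted if the allocatable CPUs
	// covered by the mask can satisfy the request.
	for nodes := 1; nodes <= numaNodes; nodes++ {
		if nodes*allocPerNode < request {
			continue // 7*15 = 105 < 112: no 7-node hint is ever generated
		}
		preferred := nodes == minAffinitySize
		fmt.Printf("hint: %d-node mask, preferred=%v\n", nodes, preferred)
	}
	// Prints only: "hint: 8-node mask, preferred=false",
	// which the 'restricted' policy rejects.
}
```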
What did you expect to happen?
In fact, because of the reserved CPUs, a hint with bitmask `01111111` will never be generated, so calculating the narrowest matching `NUMANodeAffinity` as 7 is meaningless. Reserved CPUs should be taken into account when computing the narrowest matching `NUMANodeAffinity`; 8 would be the correct value.
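For comparison, a minimal sketch of the expected calculation, assuming the minimum affinity size were computed from the 15 allocatable CPUs per node shown in the table:

```go
package main

import "fmt"

func main() {
	const numaNodes, allocPerNode, request = 8, 15, 112

	minAffinitySize := numaNodes
	for nodes := 1; nodes <= numaNodes; nodes++ {
		// Count only allocatable CPUs, i.e. exclude the reserved CPU on each node.
		if nodes*allocPerNode >= request && nodes < minAffinitySize {
			minAffinitySize = nodes
		}
	}
	// 7*15 = 105 < 112, so minAffinitySize stays 8; the full 8-node hint
	// (11111111) would then be marked preferred=true and the pod admitted.
	fmt.Println(minAffinitySize) // 8
}
```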
How can we reproduce it (as minimally and precisely as possible)?
- A kubelet configured with the topology manager. The main configuration:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
……
kubeReserved:
  cpu: 200m
  ephemeral-storage: 1Gi
  memory: 300Mi
systemReserved:
  cpu: 200m
  ephemeral-storage: 1Gi
  memory: 1Gi
reservedSystemCPUs: "0,8,16,24,32,40,48,56"
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: restricted
topologyManagerScope: pod
featureGates:
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: "true"
reservedMemory:
  - numaNode: 0
    limits:
      memory: "178Mi"
  - numaNode: 1
    limits:
      memory: "178Mi"
  - numaNode: 2
    limits:
      memory: "178Mi"
  - numaNode: 3
    limits:
      memory: "178Mi"
  - numaNode: 4
    limits:
      memory: "178Mi"
  - numaNode: 5
    limits:
      memory: "178Mi"
  - numaNode: 6
    limits:
      memory: "178Mi"
  - numaNode: 7
    limits:
      memory: "178Mi"
```
- A machine with 128 cores and 8 NUMA nodes, such as an AMD system with NVIDIA 4090D GPUs.
- Set 56Gi of hugepages-2Mi on each NUMA node:
```bash
#!/bin/bash
for node_path in /sys/devices/system/node/node*/hugepages/hugepages-2048kB; do
  echo 28672 > "$node_path/nr_hugepages"  # 28672 * 2MiB = 56GiB per node
done
```
- Create a pod with the following resource configuration:
```yaml
resources:
  limits:
    cpu: "112"
    hugepages-2Mi: 448Gi
    memory: "3212837233"
    nvidia.com/AD102_GEFORCE_RTX_4090_D: "8"
  requests:
    cpu: "112"
    hugepages-2Mi: 448Gi
    memory: "3212837233"
    nvidia.com/AD102_GEFORCE_RTX_4090_D: "8"
```
Anything else we need to know?
No response
Kubernetes version
```console
$ kubelet --version
Kubernetes v1.25.12
```
Cloud provider
OS version
```console
$ cat /etc/redhat-release
Rocky Linux release 9.2 (Blue Onyx)
```
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)