Description
Context
Kubernetes was originally implemented with cgroup v1 in mind.
In cgroup v1, the CPU shares are derived very simply from the container's CPU request expressed in millicpu.
As an example, for a container requesting 1 CPU (1000m), the kubelet sets cpu.shares = 1024.
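As a quick illustration, here is a minimal Go sketch of that millicpu-to-shares mapping (the function name is made up for the example; the kubelet's real helper also clamps to a minimum of 2 shares):
package main

import "fmt"

// milliCPUToShares sketches how a CPU request in millicores maps to
// cgroup v1 cpu.shares (1 CPU == 1000m == 1024 shares).
func milliCPUToShares(milliCPU uint64) uint64 {
	return milliCPU * 1024 / 1000
}

func main() {
	fmt.Println(milliCPUToShares(1000)) // 1 CPU -> 1024
	fmt.Println(milliCPUToShares(100))  // 100m  -> 102
}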
Later, when the focus shifted to supporting cgroup v2, a dedicated KEP-2254 was submitted.
Cgroup v1 and v2 have very different ranges of values for CPU shares and weight.
Cgroup v1 uses a range of [2^1 - 2^18] == [2 - 262144] for CPU shares.
Cgroup v2 uses a range of [10^0 - 10^4] == [1 - 10000] for CPU weight.
As part of this KEP, it was agreed to use the following formula to convert cgroup v1's cpu.shares to cgroup v2's CPU weight:
cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]
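A minimal standalone Go sketch of that formula (just for checking the numbers below; the kubelet has an equivalent helper, but this version is written from the formula above):
package main

import "fmt"

// cpuSharesToWeight applies the KEP-2254 conversion from cgroup v1
// cpu.shares ([2, 262144]) to cgroup v2 cpu.weight ([1, 10000]).
func cpuSharesToWeight(shares uint64) uint64 {
	return 1 + (shares-2)*9999/262142
}

func main() {
	fmt.Println(cpuSharesToWeight(1024)) // 1 CPU request    -> 39
	fmt.Println(cpuSharesToWeight(102))  // 100m CPU request -> 4
}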
What it looks like
Let's start with an example to see what the cgroup configuration looks like in both environments.
I'll use the following dummy pod and run it on v1 and v2 setups:
apiVersion: v1
kind: Pod
metadata:
  name: dummy-sleeping-pod
spec:
  containers:
  - name: sleep-container
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1
On cgroup v1 the underlying configuration is pretty intuitive:
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
1024
On v2, the configuration looks like the following:
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
39
And indeed, according to the formula above, cpu.weight = (1 + ((1024 - 2) * 9999) / 262142) ~= 39.9, which the integer arithmetic truncates to 39.
If I changed the pod to request only 100m CPU, the configuration would look like the following:
On v1:
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
102
On v2:
> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
4
The problem
The above formula focuses on converting values from one range to the other while keeping each value at the same relative position within its range. As an example, if a certain cpu.shares value sits at 20% of the v1 range, it will still sit at 20% of the v2 range after conversion.
However, this introduces several problems.
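To make the "same relative position" point concrete, here is a small standalone Go check (my own illustration, not code from the KEP) comparing where a 1 CPU request lands in each range:
package main

import "fmt"

func main() {
	// 1 CPU: 1024 shares on v1, weight 39 on v2 (see above).
	sharesPos := float64(1024-2) / float64(262144-2) // position in [2, 262144]
	weightPos := float64(39-1) / float64(10000-1)    // position in [1, 10000]
	fmt.Printf("v1: %.2f%% of the shares range\n", sharesPos*100) // ~0.39%
	fmt.Printf("v2: %.2f%% of the weight range\n", weightPos*100) // ~0.38%
}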
A non-Kubernetes workload has a much higher priority in v2
The default CPU shares value in cgroup v1 is 1024.
This means that when Kubernetes workloads compete with non-Kubernetes workloads (system daemons, drivers, the kubelet itself, etc.), a container requesting 1 CPU has the same CPU priority as a "regular" process. Asking for less than 1 CPU grants a lower priority, and vice versa.
However, in cgroup v2, the default CPU weight is 100.
This means that (as can be seen above) a container asking for 1 CPU now has less than 40% of the default CPU weight.
The implication is that under v2, Kubernetes workloads get much lower CPU priority relative to non-Kubernetes workloads.
A too-small granularity
As can be seen above, a container that requests 100m CPU gets a CPU weight of only 4, while on v1 it would get 102 CPU shares.
This value is not granular enough.
This is relevant for use cases in which sub-cgroups need to be configured inside a container to further distribute resources within it.
As an example, there could be a container running a few CPU intensive processes and one managerial process that does not need to consume a lot of CPU, but needs to be very responsive. In such a case, sub-cgroups can be created inside the container, leaving 90% of the weight to the CPU-bound processes and 10% to the other process.
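For illustration, a minimal Go sketch of such a split, assuming the container is allowed to manage its own cgroup v2 subtree (cpu controller enabled for children via cgroup.subtree_control, processes moved into leaf cgroups per the "no internal processes" rule) and using hypothetical child-group names:
package main

import (
	"os"
	"path/filepath"
)

// writeWeight creates a child cgroup (if missing) and sets its cpu.weight.
func writeWeight(base, name, weight string) error {
	dir := filepath.Join(base, name)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "cpu.weight"), []byte(weight), 0o644)
}

func main() {
	// Container-local cgroup root as seen through the cgroup namespace.
	base := "/sys/fs/cgroup"
	// Hypothetical 90/10 split between the CPU-bound workers and the
	// latency-sensitive manager process.
	if err := writeWeight(base, "workers", "90"); err != nil {
		panic(err)
	}
	if err := writeWeight(base, "manager", "10"); err != nil {
		panic(err)
	}
}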
Proposed solution
When thinking about this, a simple solution comes to mind. Define: cpu.weight = cpu_request / 10 (where cpu_request is expressed in the cgroup v1 shares form, i.e. 1024 per CPU).
This makes sense to me since asking for 1 CPU (cpu.shares = 1024) results in a CPU weight of 102.
With this simple solution we solve the two problems above.
First, a container asking for 1 CPU now gets a weight (102) that is very close to cgroup v2's default of 100.
In addition, the granularity would improve, leaving more headroom for further configuration.
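For comparison, a hedged sketch of the proposed mapping next to the current one (assuming, as the 1024 -> 102 example implies, that the input is the cgroup v1 shares value derived from the request):
package main

import "fmt"

// proposedWeight sketches the suggested mapping; the input is assumed to
// be the cgroup v1 cpu.shares value derived from the CPU request.
func proposedWeight(cpuShares uint64) uint64 {
	return cpuShares / 10
}

func main() {
	for _, shares := range []uint64{1024, 102} { // 1 CPU and 100m CPU
		fmt.Printf("cpu.shares=%d -> proposed cpu.weight=%d (KEP-2254 formula gives %d)\n",
			shares, proposedWeight(shares), 1+(shares-2)*9999/262142)
	}
}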
Note: this is just an initial thought. I'd be more than happy to discuss alternative solutions.
Other details
/sig node
$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.3
# On Linux:
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="39 (Container Image)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-05-14
VARIANT="Container Image"
VARIANT_ID=container
$ uname -a
Linux 69097037f6cf 6.11.9-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Nov 17 18:52:19 UTC 2024 x86_64 GNU/Linux