
Conversion of cgroup v1 CPU shares to v2 CPU weight causes workloads to have low CPU priority #131216

Description

@iholder101

Context

Kubernetes was originally implemented with cgroup v1 in mind.
In cgroup v1, CPU shares are derived directly from the container's CPU request in millicpu form: cpu.shares = milliCPU * 1024 / 1000.

As an example, a container requesting 1 CPU (i.e. 1000m CPU) gets cpu.shares = 1024.

Later, when cgroup v2 support became necessary and the focus shifted to it, a dedicated KEP-2254 was submitted.

Cgroup v1 and v2 have very different value ranges for CPU shares and weight.
Cgroup v1 uses a range of [2, 262144] (i.e. [2^1, 2^18]) for CPU shares.
Cgroup v2 uses a range of [1, 10000] (i.e. [10^0, 10^4]) for CPU weight.

As part of this KEP, the following formula was agreed upon for converting cgroup v1's cpu.shares to cgroup v2's CPU weight, as can be seen here:
cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]
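For illustration, the conversion can be reproduced with shell integer arithmetic. This is just a minimal sketch mirroring the formula above, not the actual kubelet/runtime code:

# Sketch: convert a cgroup v1 cpu.shares value to a cgroup v2 cpu.weight,
# using the KEP-2254 formula with truncating integer arithmetic.
shares_to_weight() {
    echo $(( 1 + ($1 - 2) * 9999 / 262142 ))
}

shares_to_weight 1024   # prints 39
shares_to_weight 102    # prints 4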

What it looks like

Let's start with an example to see how the cgroup configuration looks in both environments.
I'll use the following dummy pod and run it on both v1 and v2 setups:

apiVersion: v1
kind: Pod
metadata:
  name: dummy-sleeping-pod
spec:
  containers:
  - name: sleep-container
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1

On cgroup v1 the underlying configuration is pretty intuitive:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
1024

On v2, the configuration looks like the following:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
39

And indeed, according to the formula above, cpu.weight = (1 + ((1024 - 2) * 9999) / 262142) ~= 39.9, which integer arithmetic truncates to 39.

If I change the pod to request only 100m CPU, the configuration looks like the following:
on v1:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
102

on v2:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
4

The problem

The above formula focuses on mapping values from one range to the other while preserving their relative position within the range. As an example, if a cpu.shares value sits at 20% of the v1 range, the converted value sits at 20% of the v2 range.

However, this introduces several problems.

A non-Kubernetes workload has a much higher priority in v2

The default CPU shares value in cgroup v1 is 1024.
This means that when Kubernetes workloads compete with non-Kubernetes workloads (system daemons, drivers, the kubelet itself, etc.), a container requesting 1 CPU has the same CPU priority as a "regular" process. Requesting less than 1 CPU grants lower priority, and vice versa.

However, in cgroup v2, the default CPU weight is 100.
This means that (as can be seen above) a container requesting 1 CPU now gets less than 40% of the default CPU weight (39 vs. 100).

The implication is that under v2, Kubernetes workloads get much lower CPU priority relative to non-Kubernetes workloads.
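This is easy to verify on a cgroup v2 host: any cgroup without an explicitly configured weight reports the default of 100 (system.slice below is just one example path; any unconfigured cgroup behaves the same):

> cat /sys/fs/cgroup/system.slice/cpu.weight
100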

Insufficient granularity

As can be seen above, a container requesting 100m CPU ends up with a CPU weight of only 4, while on v1 it would get 102 CPU shares.

This value is not granular enough.
This matters for use cases in which sub-cgroups need to be configured inside a container to further distribute CPU time within it.

As an example, a container could run a few CPU-intensive processes alongside one managerial process that does not need much CPU but must stay very responsive. In such a case, sub-cgroups can be created inside the container, giving 90% of the weight to the CPU-bound processes and 10% to the managerial one.
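As a rough sketch of that setup, assuming the container's cgroup is delegated and writable, and noting that cgroup v2's no-internal-process rule requires moving the processes into leaf cgroups before enabling the controller ($WORKER_PID and $MANAGER_PID are hypothetical placeholders):

# Inside the container, create one leaf cgroup per process group.
mkdir /sys/fs/cgroup/workers /sys/fs/cgroup/manager
# Move the processes out of the parent cgroup into the leaves.
echo $WORKER_PID  > /sys/fs/cgroup/workers/cgroup.procs
echo $MANAGER_PID > /sys/fs/cgroup/manager/cgroup.procs
# Enable the cpu controller for the children, then split the weight 90/10.
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
echo 90 > /sys/fs/cgroup/workers/cpu.weight
echo 10 > /sys/fs/cgroup/manager/cpu.weight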

Proposed solution

Thinking about this, a simple solution comes to mind. Define: cpu.weight = cpu.shares / 10, i.e. a tenth of the v1 value.

This makes sense to me since requesting 1 CPU (cpu.shares = 1024 on v1) results in a CPU weight of 102.

This simple solution addresses both problems above.
First, a container requesting 1 CPU gets a weight (102) very close to cgroup v2's default of 100.
In addition, granularity improves, leaving more headroom for further subdivision.
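To compare the two mappings side by side, here is a small sketch (the proposed values would presumably need clamping into v2's [1, 10000] range, which the snippet does explicitly):

for shares in 102 1024 2048 262144; do
    kep=$(( 1 + (shares - 2) * 9999 / 262142 ))
    proposed=$(( shares / 10 ))
    # Clamp the proposed value into cgroup v2's valid [1, 10000] range.
    [ "$proposed" -gt 10000 ] && proposed=10000
    [ "$proposed" -lt 1 ] && proposed=1
    echo "shares=$shares  kep_weight=$kep  proposed_weight=$proposed"
done

For a 1-CPU container this yields 39 (KEP formula) vs. 102 (proposed); for a 100m container, 4 vs. 10.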

Note: this is just an initial thought. I'd be more than happy to discuss alternative solutions.

Other details

/sig node

$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.3
# On Linux:
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="39 (Container Image)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-05-14
VARIANT="Container Image"
VARIANT_ID=container

$ uname -a
Linux 69097037f6cf 6.11.9-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Nov 17 18:52:19 UTC 2024 x86_64 GNU/Linux
