
Conversion of cgroup v1 CPU shares to v2 CPU weight causes workloads to have low CPU priority #131216

Description

@iholder101

Context

Kubernetes was originally implemented with cgroup v1 in mind.
In cgroup v1, CPU shares are derived directly from the container's CPU request in millicpu form: cpu.shares = milliCPU * 1024 / 1000.

As an example, a container requesting 1 CPU (i.e. 1000m CPU) gets cpu.shares = 1024.

Later, when cgroup v2 support became necessary and the focus shifted to it, a dedicated KEP-2254 was submitted.

Cgroup v1 and v2 have very different value ranges for CPU shares and weight.
Cgroup v1 uses a range of [2, 262144] (i.e. [2^1, 2^18]) for CPU shares.
Cgroup v2 uses a range of [1, 10000] (i.e. [10^0, 10^4]) for CPU weight.

As part of this KEP, the following formula was agreed upon for converting cgroup v1's cpu.shares to cgroup v2's CPU weight, as can be seen here:
cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142) // convert from [2-262144] to [1-10000]
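For illustration, the conversion can be reproduced with shell integer arithmetic. This is just a minimal sketch mirroring the formula above, not the actual kubelet/runtime code:

# Sketch: convert a cgroup v1 cpu.shares value to a cgroup v2 cpu.weight,
# using the KEP-2254 formula with truncating integer arithmetic.
shares_to_weight() {
    echo $(( 1 + ($1 - 2) * 9999 / 262142 ))
}

shares_to_weight 1024   # prints 39
shares_to_weight 102    # prints 4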

What it looks like

Let's start with an example to see how the cgroup configuration looks in both environments.
I'll use the following dummy pod and run it on both v1 and v2 setups:

apiVersion: v1
kind: Pod
metadata:
  name: dummy-sleeping-pod
spec:
  containers:
  - name: sleep-container
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1

On cgroup v1 the underlying configuration is pretty intuitive:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
1024

On v2, the configuration looks like the following:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
39

And indeed, according to the formula above, cpu.weight = (1 + ((1024 - 2) * 9999) / 262142) ~= 39.9, which integer arithmetic truncates to 39.

If I change the pod to request only 100m CPU, the configuration looks like the following:
on v1:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu/cpu.shares"
102

on v2:

> k exec -it dummy-sleeping-pod -- sh -c "cat /sys/fs/cgroup/cpu.weight"
4

The problem

The above formula focuses on mapping values from one range to the other while preserving their relative position within the range. As an example, if a cpu.shares value sits at 20% of the v1 range, the converted value sits at 20% of the v2 range.

However, this introduces several problems.

A non-Kubernetes workload has a much higher priority in v2

The default CPU shares value in cgroup v1 is 1024.
This means that when Kubernetes workloads compete with non-Kubernetes workloads (system daemons, drivers, the kubelet itself, etc.), a container requesting 1 CPU has the same CPU priority as a "regular" process. Requesting less than 1 CPU grants lower priority, and vice versa.

However, in cgroup v2, the default CPU weight is 100.
This means that (as can be seen above) a container requesting 1 CPU now gets less than 40% of the default CPU weight (39 vs. 100).

The implication is that under v2, Kubernetes workloads get much lower CPU priority relative to non-Kubernetes workloads.
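This is easy to verify on a cgroup v2 host: any cgroup without an explicitly configured weight reports the default of 100 (system.slice below is just one example path; any unconfigured cgroup behaves the same):

> cat /sys/fs/cgroup/system.slice/cpu.weight
100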

Insufficient granularity

As can be seen above, a container requesting 100m CPU ends up with a CPU weight of only 4, while on v1 it would get 102 CPU shares.

This value is not granular enough.
This matters for use cases in which sub-cgroups need to be configured inside a container to further distribute CPU time within it.

As an example, a container could run a few CPU-intensive processes alongside one managerial process that does not need much CPU but must stay very responsive. In such a case, sub-cgroups can be created inside the container, giving 90% of the weight to the CPU-bound processes and 10% to the managerial one.
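As a rough sketch of that setup, assuming the container's cgroup is delegated and writable, and noting that cgroup v2's no-internal-process rule requires moving the processes into leaf cgroups before enabling the controller ($WORKER_PID and $MANAGER_PID are hypothetical placeholders):

# Inside the container, create one leaf cgroup per process group.
mkdir /sys/fs/cgroup/workers /sys/fs/cgroup/manager
# Move the processes out of the parent cgroup into the leaves.
echo $WORKER_PID  > /sys/fs/cgroup/workers/cgroup.procs
echo $MANAGER_PID > /sys/fs/cgroup/manager/cgroup.procs
# Enable the cpu controller for the children, then split the weight 90/10.
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
echo 90 > /sys/fs/cgroup/workers/cpu.weight
echo 10 > /sys/fs/cgroup/manager/cpu.weight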

Proposed solution

Thinking about this, a simple solution comes to mind. Define: cpu.weight = cpu.shares / 10, i.e. a tenth of the v1 value.

This makes sense to me since requesting 1 CPU (cpu.shares = 1024 on v1) results in a CPU weight of 102.

This simple solution addresses both problems above.
First, a container requesting 1 CPU gets a weight (102) very close to cgroup v2's default of 100.
In addition, granularity improves, leaving more headroom for further subdivision.
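To compare the two mappings side by side, here is a small sketch (the proposed values would presumably need clamping into v2's [1, 10000] range, which the snippet does explicitly):

for shares in 102 1024 2048 262144; do
    kep=$(( 1 + (shares - 2) * 9999 / 262142 ))
    proposed=$(( shares / 10 ))
    # Clamp the proposed value into cgroup v2's valid [1, 10000] range.
    [ "$proposed" -gt 10000 ] && proposed=10000
    [ "$proposed" -lt 1 ] && proposed=1
    echo "shares=$shares  kep_weight=$kep  proposed_weight=$proposed"
done

For a 1-CPU container this yields 39 (KEP formula) vs. 102 (proposed); for a 100m container, 4 vs. 10.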

Note: this is just an initial thought. I'd be more than happy to discuss alternative solutions.

Other details

/sig node

$ kubectl version
Client Version: v1.32.3
Kustomize Version: v5.5.0
Server Version: v1.32.3
# On Linux:
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="39 (Container Image)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora Linux 39 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f39/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-05-14
VARIANT="Container Image"
VARIANT_ID=container

$ uname -a
Linux 69097037f6cf 6.11.9-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Nov 17 18:52:19 UTC 2024 x86_64 GNU/Linux
