Description
What would you like to be added?
The API Priority and Fairness (APF) mechanism effectively protects the Kubernetes API server from CPU-bound overloads. However, it currently falls short in preventing memory exhaustion, particularly from large LIST responses. The size of LIST responses can vary drastically, ranging from a few kilobytes to hundreds of megabytes. APF's current cost estimation for requests does not adequately account for the memory footprint of these responses. This oversight can lead to API server memory exhaustion (OOMs), reduced stability, and inefficient resource utilization.
Proposal
To mitigate memory overloads and prevent OOMs, I propose to incorporate the response size as a primary signal into the APF LIST work estimator. This will allow APF to better reflect the true resource cost of these requests and allocate "seats" proportionally.
Specifically, I propose the following changes to the LIST work estimator (a rough sketch of the resulting seat calculation follows the list):
- Replace the existing object count estimator with a size-based model:
- Requests served from etcd: Assign 1 seat per 100 KB of estimated response size, without an upper limit.
- Requests served from cache: Assign 1 seat per 100 KB of estimated response size; if the response is streamed, cap the request at 10 seats, otherwise apply no upper limit.
- Increase the global maximum seats from the current 10 to 100.
- Account for borrowing when counting max seats.
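A minimal sketch of how the proposed seat calculation could look, to make the model concrete; the function, parameter, and constant names are illustrative assumptions, not the actual identifiers used by the APF work estimator:

```go
package main

import "fmt"

// Illustrative constants; the proposal separately raises the global maximum
// seats from 10 to 100, which is assumed to be enforced where maximumSeats
// is applied today, outside this function.
const (
	bytesPerSeat    = 100 * 1024 // proposed: 1 seat per 100 KB of estimated response size
	streamedSeatCap = 10         // proposed cap for streamed responses served from cache
)

// estimateListSeats sketches the proposed size-based model. estimatedResponseBytes
// would come from the apiserver's object size statistics; servedFromCache and
// streamed describe how the request will be served.
func estimateListSeats(estimatedResponseBytes int64, servedFromCache, streamed bool) int64 {
	seats := estimatedResponseBytes / bytesPerSeat
	if seats < 1 {
		seats = 1
	}
	// Streamed responses served from the cache have bounded memory cost, so
	// they are capped; etcd reads and unstreamed cache reads are not.
	if servedFromCache && streamed && seats > streamedSeatCap {
		seats = streamedSeatCap
	}
	return seats
}

func main() {
	fmt.Println(estimateListSeats(10*1024*1024, false, false)) // 10 MB from etcd: 102 seats
	fmt.Println(estimateListSeats(10*1024*1024, true, true))   // 10 MB streamed from cache: 10 seats
}
```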
Experiments
The proposal above is informed by experiments in which the following observations were made about memory usage when serving LIST requests (the mechanism is illustrated by the sketch after this list):
- The number of objects has no direct effect on memory usage; it instead increases the CPU cost of allocations and garbage collection. At the same QPS this can still drive memory up, but it should not be part of the request cost estimation, because the memory allocated is proportional to the CPU spent.
- From etcd: Memory consumption is directly proportional to the response size and is uncapped. This is primarily due to loading the entire etcd response into memory.
- From cache: Memory usage is primarily driven by response encoding and, because the response is streamed, it is capped.
- If the response is not streamed, memory usage grows proportionally to response size even when serving from cache.
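As an illustration of the mechanism behind the last three observations (a minimal sketch only, not the actual apiserver code paths): a buffered write materializes the entire response before sending it, while a streamed write encodes one object at a time, keeping peak memory roughly flat.

```go
package main

import (
	"encoding/json"
	"io"
	"os"
)

// Pod stands in for any listed object; Payload makes the response size visible.
type Pod struct {
	Name    string `json:"name"`
	Payload string `json:"payload"`
}

// writeBuffered is analogous in spirit to serving from etcd: the entire
// response is materialized in memory before the first byte is written, so
// peak memory grows with response size regardless of object count.
func writeBuffered(w io.Writer, pods []Pod) error {
	buf, err := json.Marshal(pods)
	if err != nil {
		return err
	}
	_, err = w.Write(buf)
	return err
}

// writeStreamed is analogous in spirit to the streamed cache path: objects are
// encoded and written one at a time, so peak memory stays roughly flat no
// matter how large the full response is.
func writeStreamed(w io.Writer, pods []Pod) error {
	enc := json.NewEncoder(w)
	for i := range pods {
		if err := enc.Encode(&pods[i]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pods := []Pod{{Name: "a", Payload: "x"}, {Name: "b", Payload: "y"}}
	_ = writeBuffered(os.Stdout, pods)
	_ = writeStreamed(io.Discard, pods)
}
```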
Setup
Experiments were conducted on a Kind cluster with an API server configured with 16 cores and a maximum of 800 concurrent requests. This value was selected because at 800 concurrent requests the global-default priority level reaches its maxSeats limit of 10, and global-default requests are able to borrow up to 500 seats. The cluster ran the latest Kubernetes build from the master branch, with a patch that allowed request seats to be set explicitly to 1, 10, or 100 (bypassing the maxSeats configuration for experimental control).
Method
For each combination of input parameters (response size, object count, and etcd/cache serving), the following procedure was executed (a sketch of a possible load harness follows the list):
- Initialize the number of allocated seats to 1.
- Determine the maximum Queries Per Second (QPS) that the API server could sustainably serve without saturation.
- At the determined maximum QPS, measure the API server's peak memory usage for different enforced seat values: 1, 10, and 100.
- Increase the QPS by 33% to simulate an overload condition and repeat the memory measurement.
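A minimal sketch of a load harness that could implement this procedure, assuming a locally reachable API server; the URL, target QPS, and duration are placeholders, authentication is omitted for brevity, and peak memory would be read separately from the apiserver /metrics endpoint (process_resident_memory_bytes):

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const (
		targetQPS = 20              // maximum sustainable QPS for this data point, or +33% for the overload run
		duration  = 2 * time.Minute // long enough to observe peak memory
		// resourceVersion=0 serves the list from the watch cache; omit it to read from etcd.
		listURL = "https://127.0.0.1:6443/api/v1/pods?resourceVersion=0"
	)

	client := &http.Client{Transport: &http.Transport{
		// Test-cluster only; a real harness would use proper certificates and a bearer token.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	limiter := rate.NewLimiter(rate.Limit(targetQPS), 1)

	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()
	for limiter.Wait(ctx) == nil { // Wait returns an error once the deadline is reached
		go func() {
			resp, err := client.Get(listURL)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			_, _ = io.Copy(io.Discard, resp.Body) // drain so connections are reused
		}()
	}
	fmt.Println("run finished; record peak process_resident_memory_bytes from the apiserver /metrics endpoint")
}
```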
CPU saturation
Throughout these tests, it was consistently observed that a QPS threshold existed beyond which CPU scaling ceased. The API server was unable to process additional queries despite available CPU resources. It was critical to ensure that CPU usage remained capped below the available CPU capacity to prevent starvation of the garbage collector (GC). Starving the GC invariably led to significant memory leaks. The current APF mechanism was not designed to address this specific situation.
Results
Requests served from etcd
Response size | Object count | QPS | CPU | Memory [MB] seat 1 | Memory [MB] seat 10 | Memory [MB] seat 100 |
---|---|---|---|---|---|---|
100KB | 10 | 1800 | 654% | 412.4 | 471.9 | |
100KB | 10 | 2400 | 692% | 573.2 | 699.5 | |
1MB | 1000 | 200 | 770% | 1076.4 | 748.7 | |
1MB | 1000 | 265 | 932% | 1337.5 | 791.5 | |
1MB | 100 | 250 | 669% | 811.9 | 611.8 | |
1MB | 100 | 333 | 679% | 1316.7 | 743.6 | |
1MB | 10 | 260 | 617% | 902.7 | 623.0 | |
1MB | 10 | 345 | 641% | 1341.1 | 729.3 | |
10MB | 10000 | 15 | 583% | 726.9 | 727.5 | 643.8 |
10MB | 10000 | 20 | 682% | 3690.5 | 1957.6 | 691.8 |
10MB | 1000 | 15 | 393% | 602.2 | 578.3 | 603.0 |
10MB | 1000 | 20 | 426% | 2622.6 | 1849.0 | 611.5 |
10MB | 100 | 15 | 380% | 532.1 | 520.2 | 583.3 |
10MB | 100 | 20 | 436% | 2124.9 | 1902.9 | 586.9 |
10MB | 10 | 15 | 372% | 579.1 | 542.8 | 551.6 |
10MB | 10 | 20 | 398% | 2742.6 | 1526.6 | 583.6 |
Requests served from cache
Response size | Object count | QPS | CPU | Memory [MB] seat 1 | Memory [MB] seat 10 |
---|---|---|---|---|---|
100KB | 10 | 6300 | 1058% | 612.2 | |
100KB | 10 | 8400 | 1249% | 565.2 | |
1MB | 1000 | 480 | 1117% | 1786.1 | 637.5 |
1MB | 1000 | 640 | 1178% | 2308.1 | 840.8 |
1MB | 100 | 540 | 850% | 1483.5 | 559.4 |
1MB | 100 | 720 | 901% | 1713.0 | 745.7 |
10MB | 1000 | 80 | 917% | 2236.4 | 622.0 |
10MB | 1000 | 105 | 930% | 2253.9 | 659.3 |
10MB | 100 | 85 | 909% | 1771.1 | 582.2 |
10MB | 100 | 115 | 911% | 1794.1 | 612.8 |
10MB | 10 | 80 | 844% | 2692.3 | 680.1 |
10MB | 10 | 105 | 886% | 2757.0 | 710.8 |
Where a value is missing, the seat value was too large and resulted in most requests being dropped, without any improvement to memory usage.
Why is this needed?
Prevent OOMs