Description
What would you like to be added?
The API Priority and Fairness (APF) mechanism effectively protects the Kubernetes API server from CPU-bound overloads. However, it currently falls short in preventing memory exhaustion, particularly from large LIST responses. The size of LIST responses can vary drastically, ranging from a few kilobytes to hundreds of megabytes. APF's current cost estimation for requests does not adequately account for the memory footprint of these responses. This oversight can lead to API server memory exhaustion (OOMs), reduced stability, and inefficient resource utilization.
Proposal
To mitigate memory overloads and prevent OOMs, I propose to incorporate the response size as a primary signal into the APF LIST work estimator. This will allow APF to better reflect the true resource cost of these requests and allocate "seats" proportionally.
Specifically, I propose the following changes to the LIST work estimator (a rough sketch of the resulting seat calculation follows the list):
- Replace the existing object count estimator with a size-based model:
- Requests served from etcd: Assign 1 seat per 100 KB of estimated response size, without an upper limit.
- Requests served from cache: Assign 1 seat per 100 KB of estimated response size; if the response is streamed, cap the request at 10 seats, otherwise apply no upper limit.
- Increase the global maximum seats from the current 10 to 100.
- Account for borrowing when counting max seats.
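A minimal sketch of how the proposed seat calculation could look, to make the model concrete; the function, parameter, and constant names are illustrative assumptions, not the actual identifiers used by the APF work estimator:

```go
package main

import "fmt"

// Illustrative constants; the proposal separately raises the global maximum
// seats from 10 to 100, which is assumed to be enforced where maximumSeats
// is applied today, outside this function.
const (
	bytesPerSeat    = 100 * 1024 // proposed: 1 seat per 100 KB of estimated response size
	streamedSeatCap = 10         // proposed cap for streamed responses served from cache
)

// estimateListSeats sketches the proposed size-based model. estimatedResponseBytes
// would come from the apiserver's object size statistics; servedFromCache and
// streamed describe how the request will be served.
func estimateListSeats(estimatedResponseBytes int64, servedFromCache, streamed bool) int64 {
	seats := estimatedResponseBytes / bytesPerSeat
	if seats < 1 {
		seats = 1
	}
	// Streamed responses served from the cache have bounded memory cost, so
	// they are capped; etcd reads and unstreamed cache reads are not.
	if servedFromCache && streamed && seats > streamedSeatCap {
		seats = streamedSeatCap
	}
	return seats
}

func main() {
	fmt.Println(estimateListSeats(10*1024*1024, false, false)) // 10 MB from etcd: 102 seats
	fmt.Println(estimateListSeats(10*1024*1024, true, true))   // 10 MB streamed from cache: 10 seats
}
```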
Experiments
The proposal above is informed by experiments in which the following observations were made about memory usage when serving LIST requests (the mechanism is illustrated by the sketch after this list):
- The number of objects has no direct effect on memory usage; it instead increases the CPU cost of allocations and garbage collection. At the same QPS this can still drive memory up, but it should not be part of the request cost estimation, because the memory allocated is proportional to the CPU spent.
- From etcd: Memory consumption is directly proportional to the response size and is uncapped. This is primarily due to loading the entire etcd response into memory.
- From cache: Memory usage is primarily driven by response encoding and, because the response is streamed, it is capped.
- If the response is not streamed, memory usage grows proportionally to response size even when serving from cache.
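As an illustration of the mechanism behind the last three observations (a minimal sketch only, not the actual apiserver code paths): a buffered write materializes the entire response before sending it, while a streamed write encodes one object at a time, keeping peak memory roughly flat.

```go
package main

import (
	"encoding/json"
	"io"
	"os"
)

// Pod stands in for any listed object; Payload makes the response size visible.
type Pod struct {
	Name    string `json:"name"`
	Payload string `json:"payload"`
}

// writeBuffered is analogous in spirit to serving from etcd: the entire
// response is materialized in memory before the first byte is written, so
// peak memory grows with response size regardless of object count.
func writeBuffered(w io.Writer, pods []Pod) error {
	buf, err := json.Marshal(pods)
	if err != nil {
		return err
	}
	_, err = w.Write(buf)
	return err
}

// writeStreamed is analogous in spirit to the streamed cache path: objects are
// encoded and written one at a time, so peak memory stays roughly flat no
// matter how large the full response is.
func writeStreamed(w io.Writer, pods []Pod) error {
	enc := json.NewEncoder(w)
	for i := range pods {
		if err := enc.Encode(&pods[i]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pods := []Pod{{Name: "a", Payload: "x"}, {Name: "b", Payload: "y"}}
	_ = writeBuffered(os.Stdout, pods)
	_ = writeStreamed(io.Discard, pods)
}
```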
Setup
Experiments were conducted on a Kind cluster with an API server configured with 16 cores and a maximum of 800 concurrent requests. This value was selected because at 800 concurrent requests the global-default priority level reaches its maxSeats limit of 10, and global-default requests are able to borrow up to 500 seats. The cluster ran the latest Kubernetes build from the master branch, with a patch that allowed request seats to be set explicitly to 1, 10, or 100 (bypassing the maxSeats configuration for experimental control).
Method
For each combination of input parameters (response size, object count, and etcd/cache serving), the following procedure was executed (a sketch of a possible load harness follows the list):
- Initialize the number of allocated seats to 1.
- Determine the maximum Queries Per Second (QPS) that the API server could sustainably serve without saturation.
- At the determined maximum QPS, measure the API server's peak memory usage for different enforced seat values: 1, 10, and 100.
- Increase the QPS by 33% to simulate an overload condition and repeat the memory measurement.
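A minimal sketch of a load harness that could implement this procedure, assuming a locally reachable API server; the URL, target QPS, and duration are placeholders, authentication is omitted for brevity, and peak memory would be read separately from the apiserver /metrics endpoint (process_resident_memory_bytes):

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const (
		targetQPS = 20              // maximum sustainable QPS for this data point, or +33% for the overload run
		duration  = 2 * time.Minute // long enough to observe peak memory
		// resourceVersion=0 serves the list from the watch cache; omit it to read from etcd.
		listURL = "https://127.0.0.1:6443/api/v1/pods?resourceVersion=0"
	)

	client := &http.Client{Transport: &http.Transport{
		// Test-cluster only; a real harness would use proper certificates and a bearer token.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	limiter := rate.NewLimiter(rate.Limit(targetQPS), 1)

	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()
	for limiter.Wait(ctx) == nil { // Wait returns an error once the deadline is reached
		go func() {
			resp, err := client.Get(listURL)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			_, _ = io.Copy(io.Discard, resp.Body) // drain so connections are reused
		}()
	}
	fmt.Println("run finished; record peak process_resident_memory_bytes from the apiserver /metrics endpoint")
}
```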
CPU saturation
Throughout these tests, it was consistently observed that a QPS threshold existed beyond which CPU scaling ceased. The API server was unable to process additional queries despite available CPU resources. It was critical to ensure that CPU usage remained capped below the available CPU capacity to prevent starvation of the garbage collector (GC). Starving the GC invariably led to significant memory leaks. The current APF mechanism was not designed to address this specific situation.
Results
Requests served from etcd
Response size | Object count | QPS | CPU | Memory [MB] seat 1 | Memory [MB] seat 10 | Memory [MB] seat 100 |
---|---|---|---|---|---|---|
100KB | 10 | 1800 | 654% | 412.4 | 471.9 | |
100KB | 10 | 2400 | 692% | 573.2 | 699.5 | |
1MB | 1000 | 200 | 770% | 1076.4 | 748.7 | |
1MB | 1000 | 265 | 932% | 1337.5 | 791.5 | |
1MB | 100 | 250 | 669% | 811.9 | 611.8 | |
1MB | 100 | 333 | 679% | 1316.7 | 743.6 | |
1MB | 10 | 260 | 617% | 902.7 | 623.0 | |
1MB | 10 | 345 | 641% | 1341.1 | 729.3 | |
10MB | 10000 | 15 | 583% | 726.9 | 727.5 | 643.8 |
10MB | 10000 | 20 | 682% | 3690.5 | 1957.6 | 691.8 |
10MB | 1000 | 15 | 393% | 602.2 | 578.3 | 603.0 |
10MB | 1000 | 20 | 426% | 2622.6 | 1849.0 | 611.5 |
10MB | 100 | 15 | 380% | 532.1 | 520.2 | 583.3 |
10MB | 100 | 20 | 436% | 2124.9 | 1902.9 | 586.9 |
10MB | 10 | 15 | 372% | 579.1 | 542.8 | 551.6 |
10MB | 10 | 20 | 398% | 2742.6 | 1526.6 | 583.6 |
Requests served from cache
Response size | Object count | QPS | CPU | Memory [MB] seat 1 | Memory [MB] seat 10 |
---|---|---|---|---|---|
100KB | 10 | 6300 | 1058% | 612.2 | |
100KB | 10 | 8400 | 1249% | 565.2 | |
1MB | 1000 | 480 | 1117% | 1786.1 | 637.5 |
1MB | 1000 | 640 | 1178% | 2308.1 | 840.8 |
1MB | 100 | 540 | 850% | 1483.5 | 559.4 |
1MB | 100 | 720 | 901% | 1713.0 | 745.7 |
10MB | 1000 | 80 | 917% | 2236.4 | 622.0 |
10MB | 1000 | 105 | 930% | 2253.9 | 659.3 |
10MB | 100 | 85 | 909% | 1771.1 | 582.2 |
10MB | 100 | 115 | 911% | 1794.1 | 612.8 |
10MB | 10 | 80 | 844% | 2692.3 | 680.1 |
10MB | 10 | 105 | 886% | 2757.0 | 710.8 |
Where a value is missing, the seat value was too large and resulted in most requests being dropped, without any improvement to memory usage.
Why is this needed?
Prevent OOMs