add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods #130411


Open

atiratree wants to merge 2 commits into master from the replicaset-failed-pods-backoff branch

Conversation

atiratree
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

We need to make sure the ReplicaSet controller does not get into a hot feedback loop with the kubelet. As shown in the bug, this can result in the ReplicaSet controller generating 1800+ pods per minute. This can have a negative impact on etcd/apiserver and cluster stability.

Which issue(s) this PR fixes:

Partially solves #72593

Special notes for your reviewer:

There is a proposed ReschedulePodsOnKubeletRejection feature, but there is not enough bandwidth to implement it. This PR does not solve the issue fully, but it at least slows pod generation down from a hot loop to a cold loop; over time we will still accumulate up to 12500 failed pods before they are garbage collected. ReschedulePodsOnKubeletRejection should turn the cold loop into immediate feedback between the kubelet and the scheduler, moving the pod to another node without failing pods in the first place.

Does this PR introduce a user-facing change?

add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods to make sure we do not get into a hot feedback loop

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2025
@atiratree
Member Author

atiratree commented Feb 25, 2025

added tests

@atiratree atiratree force-pushed the replicaset-failed-pods-backoff branch 2 times, most recently from 243e30c to 24e50ae Compare February 26, 2025 12:16
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 26, 2025
@atiratree atiratree changed the title [WIP] add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods Feb 26, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@atiratree atiratree force-pushed the replicaset-failed-pods-backoff branch from 24e50ae to e5791fc Compare February 26, 2025 19:36
//
// Returns the time-weighted failed pod count
// and the next time a pod should be promoted from one bucket to the next.
func calculateTimeWeightedFailedPodsCount(clock clock.PassiveClock, replicasToCreate int, failedPods []*v1.Pod) (int, *int) {
Member

It would be helpful to look at how other pod-creating controllers handle backoff in failure cases when the kubelet rejects a pod. For example:

Daemonset controller:

// This is a critical place where DS is often fighting with kubelet that rejects pods.
// We need to avoid hot looping and backoff.
backoffKey := failedPodsBackoffKey(ds, node.Name)
now := dsc.failedPodsBackoff.Clock.Now()
inBackoff := dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now)
if inBackoff {
	delay := dsc.failedPodsBackoff.Get(backoffKey)
	logger.V(4).Info("Deleting failed pod on node has been limited by backoff",
		"pod", klog.KObj(pod), "node", klog.KObj(node), "currentDelay", delay)
	dsc.enqueueDaemonSetAfter(ds, delay)
	continue
}
dsc.failedPodsBackoff.Next(backoffKey, now)
msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
logger.V(2).Info("Found failed daemon pod on node, will try to kill it", "pod", klog.KObj(pod), "node", klog.KObj(node))
// Emit an event so that it's discoverable to users.
dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
podsToDelete = append(podsToDelete, pod.Name)

Job controller (I'm not sure if this handles the case where kubelet rejects pods):
https://github.com/kubernetes/kubernetes/blob/efac8fdea24d37aa86b1fd72f526e3f0acaf9e13/pkg/controller/job/job_controller.go#L901C31-L912

What do other pod-creating controllers (like StatefulSet) do? Should we add distinct, special behavior to the ReplicaSet controller, or try to do something common?

Member Author

@atiratree Mar 12, 2025

StatefulSet

StatefulSet has the same problem as ReplicaSet. It goes into a hot loop here:

if isFailed(replicas[i]) || isSucceeded(replicas[i]) {
	if replicas[i].DeletionTimestamp == nil {
		if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
			return true, err
		}
	}
	// New pod should be generated on the next sync after the current pod is removed from etcd.
	return true, nil
}

One advantage is that it does not pollute the cluster with failed pods, so the problem is not as noticeable. But it is still a strain on resources...

I think we could easily solve this with a backoff, same as we do in the DaemonSet.
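
A minimal sketch of what that could look like in the StatefulSet controller, assuming a flowcontrol.Backoff field (here called failedPodsBackoff) were added to it; the field, the key format, and the requeue plumbing are illustrative, not existing StatefulSet controller code:

if isFailed(replicas[i]) || isSucceeded(replicas[i]) {
	// Assumed field, mirroring the DaemonSet controller's failedPodsBackoff.
	backoffKey := fmt.Sprintf("%s/%s/%s", set.Namespace, set.Name, replicas[i].Name)
	now := ssc.failedPodsBackoff.Clock.Now()
	if ssc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now) {
		// Still backing off: skip the delete (and thus the re-create) for now; the
		// StatefulSet would need to be requeued after ssc.failedPodsBackoff.Get(backoffKey).
		return true, nil
	}
	ssc.failedPodsBackoff.Next(backoffKey, now)
	if replicas[i].DeletionTimestamp == nil {
		if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
			return true, err
		}
	}
	// New pod should be generated on the next sync after the current pod is removed from etcd.
	return true, nil
}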

Job

I think we do not have to solve this problem for a Job. The Job will usually just finish faster after going over a certain number of failures (BackoffLimit, and other fields...). It would still be useful to have a ReschedulePodsOnKubeletRejection feature here in the future though.

ReplicaSet

The difference from StatefulSet/DaemonSet is that we do not have a pod identity we could base the backoff on. We only see that either a portion or all of the created replicas are failing. If we based the backoff on any single failing pod, we would slow down scaling. So we need a custom backoff, which I tried to explain in the code comments.

Contributor

Job
It would still be useful to have a ReschedulePodsOnKubeletRejection feature here in the future though.

Actually, for Job we already have a mechanism to throttle pod re-creation in case of failed pods: we count the number of failed pods since the last successful pod and back off exponentially with base 10s (so the delays are 10s, 20s, 40s, ...), see docs and code. So it is not clear to me what the benefit of ReschedulePodsOnKubeletRejection would be for Job.

btw, I think 10s is too much of a base delay, but that's another discussion :).

Member Author

So, it is not clear to me what would be the benefit of ReschedulePodsOnKubeletRejection in Job.

The backoff is only applicable for jobs with .spec.backoffLimitPerIndex, right? (updated my comment above). But even in that case ReschedulePodsOnKubeletRejection would still help as it would not fail the pod at all and just reschedule it to a better node.

Contributor

@mimowo Mar 13, 2025

The backoff is only applicable for jobs with .spec.backoffLimitPerIndex, right?

Actually, the backoff delay applies to both the classical "backoffLimit" and "backoffLimitPerIndex". The only difference is that for "backoffLimitPerIndex" the counting of failed pods since the last successful one is scoped to the index.

Contributor

This will require some testing, but from my local tests creating a 200-replica RS takes up to 10s:

I0317 19:12:43.159542       1 replica_set.go:779] ">>> slowStartBatch is starting..." logger="replicaset-controller" count=200
I0317 19:12:43.159579       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=1 remaining=200
I0317 19:12:43.194089       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=2 remaining=199
I0317 19:12:43.197996       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=4 remaining=197
I0317 19:12:43.202078       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=8 remaining=193
I0317 19:12:43.206846       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=16 remaining=185
I0317 19:12:43.414170       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=32 remaining=169
I0317 19:12:44.460584       1 request.go:752] "Waited before sending request" logger="replicaset-controller" delay="1.046108269s" reason="client-side throttling, not priority and fairness" verb="POST" URL="https://172.18.0.2:6443/api/v1/namespaces/default/pods"
I0317 19:12:45.063424       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=64 remaining=137
I0317 19:12:48.314100       1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=73 remaining=73
I0317 19:12:52.015248       1 replica_set.go:803] ">>> slowStartBatch ..." logger="replicaset-controller" succeeded=200

Logs were added for clarity in the slowStartBatch method, which is responsible for creating pods in batches (starting with 1; every fully successful batch allows the batch size to double, up to a limit of 500 in a single batch).
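
For context, a simplified sketch of the doubling pattern that slowStartBatch implements (not the exact upstream code; the overall cap comes from the caller limiting the total count to burstReplicas):

func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
	remaining := count
	successes := 0
	// Batches start at initialBatchSize; every fully successful batch doubles the next one.
	for batchSize := min(remaining, initialBatchSize); batchSize > 0; batchSize = min(2*batchSize, remaining) {
		errCh := make(chan error, batchSize)
		var wg sync.WaitGroup
		wg.Add(batchSize)
		for i := 0; i < batchSize; i++ {
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		successes += batchSize - len(errCh)
		if len(errCh) > 0 {
			// Any error in the batch stops further (larger) batches.
			return successes, <-errCh
		}
		remaining -= batchSize
	}
	return successes, nil
}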

The discussion during today's sig-apps call focused on expanding that method with the ability to periodically check the state of the created pods (every n-th iteration, so the checks do not overwhelm the creation path) and, when needed, prevent the doubling from happening (slowStartBatch does not double the batch size if we encounter an error). Currently only API errors when creating pods stop the accelerated creation. We propose to extend that with pod status checks, which should also let us slow down when pods are failing.

This will not resolve the issue, but our goal is to slow the problem down so that other tools (primarily monitoring) can notice the increase in pod creation and react, rather than allowing an avalanche of bad pods to thrash the cluster.

Member Author

I have experimented with an alternative solution that detects the failures in the slowStartBatch function (please see the code in #131050), but I have found a couple of problems.

We do not have good control over the backoff. When we create some pods and observe them fail, the new pods can trigger an RS resync via the update handler before the RS backoff kicks in, and that new sync can create pods again. Basically, we can get into a loop as well. The bigger the ReplicaSet and the kubelet/scheduler delay, the bigger the chance of running into it.

The reasons for the loop are:

  • It can potentially take a long time before we get from Pending -> Failed.
  • We observe updates from pods that land on the good nodes.
  • When we get to the pod GC, each pod deletion can trigger the RS sync as well.

Also, for small ReplicaSets it is hard to preserve the backoff: we always reset it to 0 because we are unable to detect the failures within a single sync.

It seems the best solution is to first do a lookback and decide how many pods have failed per unit of time before we start creating new pods, as proposed in the current PR. Please let me know if you have a better idea on how to do the lookback.

I could also make the lookback part of the slowStartBatch, but then we run into the problem of polluting the log with

rsc.expectations.ExpectCreations(logger, rsKey, diff)
logger.V(2).Info("Too few replicas", "replicaSet", klog.KObj(rs), "need", *(rs.Spec.Replicas), "creating", diff)

since such a sync may end up creating no pods at all.

Contributor

It seems the best solution is to first do a lookback and decide how many pods should have failed per unit of time before we start creating new pods as proposed in the the current PR. Please let me know if you have a better idea on how to do the lookback.

I could also make the lookback part of the slowStartBatch, but then we run into the problem of polluting the log with

Making the lookback part of the slowStartBatch, to be exact here, seems like a reasonable approach, especially since we recently added a pod indexer, which should ensure that listing the pods and filtering out the failed ones isn't too heavy.

So your proposed solution from #131050 (assuming you'll use rsc.getRSPods(rs)) is the right path forward. Also, to avoid looping through the pod list too many times, I'd suggest doing both the failed and start-time filtering in a single pass.
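
For illustration, a single-pass filter could look roughly like this (the helper name and the window parameter are placeholders, not code from the PR):

// countRecentFailedPods counts, in one pass over the ReplicaSet's pods, those
// that are failed and started within the lookback window.
func countRecentFailedPods(pods []*v1.Pod, now time.Time, window time.Duration) int {
	count := 0
	for _, pod := range pods {
		if pod.Status.Phase != v1.PodFailed {
			continue
		}
		if pod.Status.StartTime != nil && now.Sub(pod.Status.StartTime.Time) < window {
			count++
		}
	}
	return count
}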

Member Author

Okay, makes sense. I will make the lookback part of slowStartBatch.

  • it will be more responsive
  • we need to do the first check even before creating the first pod
  • I will try to use the backoff mechanism from this PR (as mentioned, the one in #131050 does not work)

Contributor

I'd like to re-use the default re-queuing mechanism, which already supports rate limiting, rather than introduce an entirely new mechanism. See my other comment where I lay out what it should look like, and drop this entire piece of code.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2025
@atiratree atiratree force-pushed the replicaset-failed-pods-backoff branch from e5791fc to 94d43ca Compare March 25, 2025 11:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 25, 2025
@atiratree
Member Author

/test pull-kubernetes-e2e-gce

@atiratree
Member Author

/test pull-kubernetes-e2e-gce

@atiratree atiratree force-pushed the replicaset-failed-pods-backoff branch from 94d43ca to 25c7bb2 Compare May 15, 2025 12:43
@atiratree
Member Author

rebased

@atiratree
Member Author

flake #131758

/test pull-kubernetes-e2e-kind-alpha-beta-features

atiratree added 2 commits May 29, 2025 10:50
and improve code flow in syncReplicaSet
ReplicaSet pods to make sure we do not get into a hot feedback loop
@atiratree atiratree force-pushed the replicaset-failed-pods-backoff branch from 25c7bb2 to 49e85ad Compare May 30, 2025 12:17
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree
Member Author

As discussed, the lookback is now part of the slowStartBatch.

I also opened a pre-refactoring PR: #132031

rsFailedPods = append(rsFailedPods, rsPod)
// If we see any failed replicas in the last 20 minutes, decrease the max burst.
if maxPodsToCreate > rsc.burstReplicas/2 && rsPod.Status.StartTime != nil && now.Sub(rsPod.Status.StartTime.Time) < LimitBurstReplicasIfFailedPodsDuration {
	maxPodsToCreate = rsc.burstReplicas / 2
Member Author

This limits the absolute number of pods to create in a single burst if we observe failing replicas. We could limit it even further, but halving it seems like a good compromise.

// Subtract pods from already executed batches and weighted failed pods
// If a pod fails quickly it can be counted twice (i.e. have double the weight)
// in a single slowStartBatch execution -> fail this sync faster.
maxPodsToCreate -= podsCreated + weightedFailedPodCount
Member Author

  • failed pods will limit the creation of new pods (e.g. weight 1)
  • pods that failed in the current slowStartBatch execution have double the weight
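
A hypothetical numeric illustration of how these weights reduce the creation budget (all numbers are made up):

maxPodsToCreate := 250              // rsc.burstReplicas / 2, because failed replicas were seen recently
podsCreated := 20                   // pods created by earlier batches in this sync
weightedFailedPodCount := 30 + 2*10 // 30 older failures (weight 1) + 10 failures observed during this sync (weight 2)
maxPodsToCreate -= podsCreated + weightedFailedPodCount
// maxPodsToCreate is now 180; once it drops to 0 or below, the sync stops creating pods.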

@k8s-ci-robot
Contributor

@atiratree: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-kind-alpha-beta-features 49e85ad link false /test pull-kubernetes-e2e-kind-alpha-beta-features
pull-kubernetes-e2e-gce 49e85ad link true /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


)

const (
	// Realistic value of the burstReplica field for the replica set manager based off
	// performance requirements for kubernetes 1.0.
	BurstReplicas = 500

	LimitBurstReplicasIfFailedPodsDuration = 20 * time.Minute
Contributor

Nit: can you add a comment explaining what this value stands for? Also, something like FailedPodsWindow or another more descriptive name would be better.
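
For example, something along these lines (the new name is only a suggestion):

const (
	// BurstReplicas is a realistic value of the burstReplica field for the
	// replica set manager, based off performance requirements for kubernetes 1.0.
	BurstReplicas = 500

	// FailedPodsWindow is how far back the controller looks for failed pods;
	// observing a failed pod that started within this window halves the maximum
	// burst of new pod creations for the ReplicaSet.
	FailedPodsWindow = 20 * time.Minute
)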

// We can further limit the number of pods to create in batchPrecondition.
successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func(podsCreated int) (int, error) {
	maxPodsToCreate := diff
	if utilfeature.DefaultFeatureGate.Enabled(features.ReplicaSetFailedPodsBackoff) {
Contributor

Why do we need a feature gate? If you want to go with a feature gate, you'll need to open a KEP and go through the entire graduation and release process, because every FG is automatically a user-facing change; in this case it's the --feature-gates value that the cluster admin has to configure.

We rarely decide to go with feature gates for bug fixes, and this particular case is no different. I thought about it for a bit yesterday, and I think we can safely treat this as a bug fix: the surface is small, and with sufficient tests ensuring that the error path isn't worse than before and that the "happy path" stays the same, I think we can proceed with just the controller changes.

Unless I missed something, did I?

Contributor

I did some more testing and careful review of this code, and I have to admit it unnecessarily complicates the replica set controller, weaving the backoff calculations deep inside the controller (see the piece below around calculateTimeWeightedFailedPodsCount) alongside the usual mechanism we already have in the queue, which in case of failures re-queues the particular RS rate limited, see

err := rsc.syncHandler(ctx, key)
if err == nil {
	rsc.queue.Forget(key)
	return true
}
utilruntime.HandleError(fmt.Errorf("sync %q failed with %v", key, err))
rsc.queue.AddRateLimited(key)

I would like to be able to re-use just that mechanism to slow down the creations, rather than introduce yet another one. To achieve that, I'm proposing:

  1. Modify the slowStartBatch function similarly to what you're doing here, introducing the batchPrecondition check. It will look like this: batchPrecondition func(successes int) error. It accepts the number of successes and returns an error. Its implementation will be a much simpler version of what you coded in this PR:
    • Count failed pods over the last unit of time; the hardcoded 20 min might be too long, so I'm open to suggestions. I was personally thinking about 5 min for starters, or we could try to get that information from the queue and modify the window based on NumRequeues, for example.
    • If the number of failed pods significantly exceeds the successes (I'm open to any particular implementation here, as long as the precondition function is idempotent), return an error stating that we are failing to create pods.
  2. Inside slowStartBatch I'd invoke the function and break the loop every time the precondition returns an error. This will cause manageReplicas to finish with an error, which will trigger the default backoff (the one I pointed out above) to kick in.

// It returns the number of successful calls to the function.
func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
func slowStartBatch(count int, initialBatchSize int, batchPrecondition func(successes int) (int, error), fn func() error) (int, error) {
Contributor

Let's extract the precondition function as a type:

type batchPreconditionFunc  func(successes int) (int, error)

and make sure to properly document the inputs and outputs; this will make the code easier to follow.
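
Something like this, for example (the doc comment wording is only a sketch):

// batchPreconditionFunc is invoked before each batch of creations inside
// slowStartBatch. successes is the number of pods created successfully so far
// in this sync. It returns the maximum number of pods that may still be
// created, and a non-nil error when no more pods should be created, which
// aborts the remaining batches and lets the default rate-limited requeue of
// the ReplicaSet kick in.
type batchPreconditionFunc func(successes int) (int, error)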

}
}
if nextSyncInSeconds != nil {
	rsc.queue.AddAfter(key, time.Duration(*nextSyncInSeconds)*time.Second)
Contributor

Drop these changes entirely, and rely on the fact that manageReplicas should return an error and fall back to the default rate-limited re-queuing.

if maxBatchSize < batchSize {
	batchSize = maxBatchSize
}
}
Contributor

In my mind, and per what I wrote above, this entire piece should look like this:

if batchPrecondition != nil {
	if err := batchPrecondition(successes); err != nil {
		return successes, err
	}
}

@soltysh
Contributor

soltysh commented Jun 4, 2025

/assign

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2025
@k8s-ci-robot
Contributor

PR needs rebase.


Labels
area/test, cncf-cla: yes, kind/bug, needs-priority, needs-rebase, needs-triage, release-note, sig/apps, sig/testing, size/XL
Projects
Status: Needs Triage
5 participants