add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods #130411
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
added tests
force-pushed from 243e30c to 24e50ae
force-pushed from 24e50ae to e5791fc
//
// Returns the time-weighted failed pod count
// and the next time a pod should be promoted from one bucket to the next.
func calculateTimeWeightedFailedPodsCount(clock clock.PassiveClock, replicasToCreate int, failedPods []*v1.Pod) (int, *int) {
It would be helpful to look at how other pod-creating controllers handle backoff in failure cases when the kubelet rejects a pod. For example:
Daemonset controller:
kubernetes/pkg/controller/daemon/daemon_controller.go
Lines 805 to 825 in efac8fd
// This is a critical place where DS is often fighting with kubelet that rejects pods.
// We need to avoid hot looping and backoff.
backoffKey := failedPodsBackoffKey(ds, node.Name)

now := dsc.failedPodsBackoff.Clock.Now()
inBackoff := dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now)
if inBackoff {
	delay := dsc.failedPodsBackoff.Get(backoffKey)
	logger.V(4).Info("Deleting failed pod on node has been limited by backoff",
		"pod", klog.KObj(pod), "node", klog.KObj(node), "currentDelay", delay)
	dsc.enqueueDaemonSetAfter(ds, delay)
	continue
}

dsc.failedPodsBackoff.Next(backoffKey, now)

msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
logger.V(2).Info("Found failed daemon pod on node, will try to kill it", "pod", klog.KObj(pod), "node", klog.KObj(node))
// Emit an event so that it's discoverable to users.
dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
podsToDelete = append(podsToDelete, pod.Name)
Job controller (I'm not sure if this handles the case where kubelet rejects pods):
https://github.com/kubernetes/kubernetes/blob/efac8fdea24d37aa86b1fd72f526e3f0acaf9e13/pkg/controller/job/job_controller.go#L901C31-L912
What do other pod-creating controllers (like statefulset) do? Should we make a distinct special behavior for replicaset controller, or try to do something common?
StatefulSet
StatefulSet has the same problem as ReplicaSet. It goes into a hot loop here:
kubernetes/pkg/controller/statefulset/stateful_set_control.go
Lines 378 to 386 in efac8fd
if isFailed(replicas[i]) || isSucceeded(replicas[i]) {
	if replicas[i].DeletionTimestamp == nil {
		if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
			return true, err
		}
	}
	// New pod should be generated on the next sync after the current pod is removed from etcd.
	return true, nil
}
One advantage is that it does not pollute the cluster with failed pods, so the problem is not as noticeable. But it is still a strain on resources...
I think we could easily solve this with a backoff, same as we do in the DaemonSet.
Job
I think we do not have to solve this problem for a Job. The Job will usually just finish faster after going over a certain number of failures (BackoffLimit and other fields...). It would still be useful to have a ReschedulePodsOnKubeletRejection feature here in the future, though.
ReplicaSet
The difference from StatefulSet/DaemonSet is that we do not have a pod identity that we could use to base the backoff on. We only see that either a portion or all of the created replicas are failing. If we base the backoff on any failing pod, we would have scaling speed issues. So we need a custom backoff which I tried to explain in the code comments.
Job
It would still be useful to have a ReschedulePodsOnKubeletRejection feature here in the future though.
Actually, for Job we already have a mechanism to throttle pod re-creation in case of failed pods - we count the number of failed pods since the last successful pod and back off exponentially with a base of 10s (so the delays are 10s, 20s, 40s, ...), see docs and code. So, it is not clear to me what the benefit of ReschedulePodsOnKubeletRejection would be in Job.
btw, I think 10s is too much of a base delay, but that's another discussion :).
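For illustration, that progression could be computed roughly like this (a minimal sketch, not the Job controller's actual code; the function name and the maxDelay cap are assumptions for this example):

```go
// delayBeforeRecreate sketches the exponential delay described above: 10s
// after the first failure since the last success, doubling with each further
// failure, capped at maxDelay. Illustrative only.
func delayBeforeRecreate(failedSinceLastSuccess int, maxDelay time.Duration) time.Duration {
	if failedSinceLastSuccess <= 0 {
		return 0
	}
	delay := 10 * time.Second
	for i := 1; i < failedSinceLastSuccess; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}
```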
So, it is not clear to me what would be the benefit of ReschedulePodsOnKubeletRejection in Job.
The backoff is only applicable for jobs with .spec.backoffLimitPerIndex, right? (Updated my comment above.) But even in that case ReschedulePodsOnKubeletRejection would still help, as it would not fail the pod at all and just reschedule it to a better node.
The backoff is only applicable for jobs with .spec.backoffLimitPerIndex, right?
Actually, the backoff delay applies both to the classical "backoffLimit" and to "backoffLimitPerIndex". The only difference is that for "backoffLimitPerIndex" the number of failed pods since the last successful one is counted within the index.
This will require some testing, but in my local tests creating a 200-replica RS takes up to 10s:
I0317 19:12:43.159542 1 replica_set.go:779] ">>> slowStartBatch is starting..." logger="replicaset-controller" count=200
I0317 19:12:43.159579 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=1 remaining=200
I0317 19:12:43.194089 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=2 remaining=199
I0317 19:12:43.197996 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=4 remaining=197
I0317 19:12:43.202078 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=8 remaining=193
I0317 19:12:43.206846 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=16 remaining=185
I0317 19:12:43.414170 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=32 remaining=169
I0317 19:12:44.460584 1 request.go:752] "Waited before sending request" logger="replicaset-controller" delay="1.046108269s" reason="client-side throttling, not priority and fairness" verb="POST" URL="https://172.18.0.2:6443/api/v1/namespaces/default/pods"
I0317 19:12:45.063424 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=64 remaining=137
I0317 19:12:48.314100 1 replica_set.go:783] ">>> slowStartBatch ..." logger="replicaset-controller" batchSize=73 remaining=73
I0317 19:12:52.015248 1 replica_set.go:803] ">>> slowStartBatch ..." logger="replicaset-controller" succeeded=200
Logs added for clarity in the slowStartBatch method, which is responsible for creating pods in batches (starting with 1; every successful batch allows us to double the batch size, up to a limit of 500 in a single batch).
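For illustration, the doubling pattern described above looks roughly like this (a simplified, sequential sketch with a made-up name; the real slowStartBatch runs the calls within each batch concurrently):

```go
// slowStartDoubling creates count pods via create, starting with a batch of
// initialBatchSize, doubling the batch size after every fully successful
// batch up to maxBatchSize, and stopping on the first error.
func slowStartDoubling(count, initialBatchSize, maxBatchSize int, create func() error) (int, error) {
	successes := 0
	remaining := count
	batchSize := initialBatchSize
	for remaining > 0 {
		if batchSize > remaining {
			batchSize = remaining
		}
		if batchSize > maxBatchSize {
			batchSize = maxBatchSize
		}
		for i := 0; i < batchSize; i++ {
			if err := create(); err != nil {
				return successes, err
			}
			successes++
		}
		remaining -= batchSize
		batchSize *= 2
	}
	return successes, nil
}
```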
The discussion during today's sig-apps meeting focused on expanding that method with the ability to periodically (every n-th iteration, so as not to overwhelm the creation with those checks) check the state of the created pods, and eventually not allow the doubling to happen (slowStartBatch will not double if we encounter an error). Currently, only API errors when creating pods trigger a stop to the increased creations. We propose to expand that with pod status checks, which should also allow us to slow down the failure rate if something like that happens.
This will not resolve the issue, but our goal is to slow the problem down so that other tools (primarily monitoring) can notice the increase in pod creation and react to the problem, rather than allowing an avalanche of bad pods to trash the cluster.
I have experimented with an alternative solution that detects the failures in the slowStartBatch function. Please see the code in #131050, but I have found a couple of problems.
We do not have good control over the backoff. When we create pods in some capacity and observe them fail, the new pods can trigger an RS resync via the update handler before the RS backoff kicks in, and this new sync can create new pods again. Basically, we can get into a loop as well. The bigger the ReplicaSet and the kubelet/scheduler delay, the bigger the chance of running into it.
The reasons for the loop are:
- It can potentially take a long time before we get from Pending -> Failed.
- We observe updates from pods that land on the good nodes.
- When we get to the pod GC, each pod deletion can trigger the RS sync as well.
Also, for small replica sets it is hard to preserve the backoff, since we always reset it to 0 because we are unable to detect the failures within a single sync.
It seems the best solution is to first do a lookback and decide how many pods should have failed per unit of time before we start creating new pods, as proposed in the current PR. Please let me know if you have a better idea for how to do the lookback.
I could also make the lookback part of the slowStartBatch, but then we run into the problem of polluting the log with
kubernetes/pkg/controller/replicaset/replica_set.go
Lines 589 to 590 in ee41b03
rsc.expectations.ExpectCreations(logger, rsKey, diff)
logger.V(2).Info("Too few replicas", "replicaSet", klog.KObj(rs), "need", *(rs.Spec.Replicas), "creating", diff)
since we can create no pods in empty syncs.
It seems the best solution is to first do a lookback and decide how many pods should have failed per unit of time before we start creating new pods as proposed in the current PR. Please let me know if you have a better idea on how to do the lookback.
I could also make the lookback part of the slowStartBatch, but then we run into the problem of polluting the log with
The lookback being part of the slowStartBatch, to be exact here, seems like a reasonable approach, especially since we've recently added a pod indexer which ensures that listing the pods and filtering out the failed ones shouldn't be too heavy.
So your proposed solution from #131050 (assuming you'll use rsc.getRSPods(rs)) is the right path forward. Also, to avoid looping through the pod list too many times, I'd suggest doing both the failed and start-time filtering in a single pass.
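A minimal sketch of such a single-pass filter (the helper name is hypothetical, v1 refers to k8s.io/api/core/v1, and keying the window on Status.StartTime is an assumption carried over from the PR):

```go
// countRecentFailures walks the ReplicaSet's pods once and counts those that
// are Failed and whose start time falls within the lookback window.
func countRecentFailures(pods []*v1.Pod, now time.Time, window time.Duration) int {
	recentFailures := 0
	for _, pod := range pods {
		if pod.Status.Phase != v1.PodFailed {
			continue
		}
		if pod.Status.StartTime != nil && now.Sub(pod.Status.StartTime.Time) < window {
			recentFailures++
		}
	}
	return recentFailures
}
```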
Okay, makes sense. I will make the lookback part of the slowStartBatch:
- it will be more responsive
- we need to do the first check even before creating the first pod
- I will try to use the backoff mechanism from this PR (as mentioned the one in #131050 does not work)
I'd like to re-use the default re-queuing mechanism, which already supports rate limiting, rather than introducing an entirely new mechanism. See my other comment where I'm laying out what it should look like, and drop this entire piece of code.
force-pushed from e5791fc to 94d43ca
/test pull-kubernetes-e2e-gce
/test pull-kubernetes-e2e-gce
force-pushed from 94d43ca to 25c7bb2
rebased
flake #131758 /test pull-kubernetes-e2e-kind-alpha-beta-features
and improve code flow in syncReplicaSet
ReplicaSet pods to make sure we do not get into a hot feedback loop
force-pushed from 25c7bb2 to 49e85ad
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: atiratree. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
As discussed, the lookback is now part of the slowStartBatch. I also opened a pre-refactoring PR: #132031
rsFailedPods = append(rsFailedPods, rsPod)
// If we see any failed replicas in the last 20 minutes, decrease the max burst.
if maxPodsToCreate > rsc.burstReplicas/2 && rsPod.Status.StartTime != nil && now.Sub(rsPod.Status.StartTime.Time) < LimitBurstReplicasIfFailedPodsDuration {
	maxPodsToCreate = rsc.burstReplicas / 2
This is to limit the absolute number of pods to create in a single burst if we observe failing replicas. We could limit this even further; halving it sounds like a good compromise (with the default burstReplicas of 500, that caps a single burst at 250 new pods).
// Subtract pods from already executed batches and weighted failed pods
// If a pod fails quickly it can be counted twice (i.e. have double the weight)
// in a single slowStartBatch execution -> fail this sync faster.
maxPodsToCreate -= podsCreated + weightedFailedPodCount
- failed pods will limit the creation of new pods (e.g. weight 1)
- pods that failed in the current slowStartBatch execution have double the weight
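A worked example with illustrative numbers only, assuming the weighting above: with maxPodsToCreate = 20, 5 pods created in earlier batches of this sync (podsCreated = 5), 3 pods that failed before this sync (weight 1 each), and 1 pod that failed during this slowStartBatch execution (counted twice, weight 2), the weighted failed pod count is 3 + 2 = 5, so maxPodsToCreate becomes 20 - (5 + 5) = 10.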
@atiratree: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
)

const (
	// Realistic value of the burstReplica field for the replica set manager based off
	// performance requirements for kubernetes 1.0.
	BurstReplicas = 500

	LimitBurstReplicasIfFailedPodsDuration = 20 * time.Minute
Nit: can you add a comment explaining what this value stands for? Also, maybe something like FailedPodsWindow or another more descriptive name would be better.
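Something along these lines, for example (a sketch; the exact name and wording are up for discussion):

```go
const (
	// FailedPodsWindow is the lookback window: if any of the ReplicaSet's
	// pods failed within this duration, the controller halves the maximum
	// number of pods it is allowed to create in a single burst.
	FailedPodsWindow = 20 * time.Minute
)
```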
// We can further limit the number of pods to create in batchPrecondition.
successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func(podsCreated int) (int, error) {
	maxPodsToCreate := diff
	if utilfeature.DefaultFeatureGate.Enabled(features.ReplicaSetFailedPodsBackoff) {
Why do we need a feature gate? If you want to go with a feature gate you'll need to open a KEP and go through the entire graduation and release process, because every FG is automatically a user-facing change; in this case it's the --feature-gates value that is to be configured by the cluster admin.
We rarely decide to go with feature gates for bug fixes, and this particular case is no different. I thought about it for a bit yesterday, and I think we can safely assume this is fixing a bug, so the surface is small. With sufficient tests ensuring both that the error path isn't worse than before and that the "happy path" stays the same, I think we can proceed with just the controller changes.
Unless I missed something, did I?
I did some more testing and carefully reviewed this code, and I have to admit that this is unnecessarily complicating the replica set controller, weaving the backoff calculations deeply inside the controller (see the piece below around calculateTimeWeightedFailedPodsCount) alongside the usual mechanism we already have inside the queue, which in case of failures re-queues a particular RS with rate limiting, see
kubernetes/pkg/controller/replicaset/replica_set.go
Lines 559 to 566 in caa9324
err := rsc.syncHandler(ctx, key)
if err == nil {
	rsc.queue.Forget(key)
	return true
}

utilruntime.HandleError(fmt.Errorf("sync %q failed with %v", key, err))
rsc.queue.AddRateLimited(key)
I would like to be able to re-use just that mechanism to slow down the creations, rather than introduce yet another one. To achieve that, I'm proposing:
- Modify the slowStartBatch function similarly to what you're doing here, introducing the batchPrecondition check. It will look like this: batchPrecondition func(successes int) error. It accepts the number of successes and returns an error. This function's implementation will be a much simpler version of what you coded in this PR:
  - Count failed pods in the last unit of time. The hardcoded 20 min might be too long, so I'm open to suggestions; I was personally thinking about 5 min for starters, or we can try to get that information from the queue and modify the window based on NumRequeue, for example.
  - If the number of failed pods significantly exceeds the successes (I'm open to any particular implementation here, as long as the precondition function is idempotent), return an error stating that we're failing to create pods.
- Inside slowStartBatch I'd invoke the function and break the loop every time the precondition returns an error. This will cause manageReplicas to finish with an error, which will trigger the default backoff to kick in, the one I pointed to above. (A rough sketch of such a precondition follows below.)
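A minimal sketch of what such a precondition could look like under the assumptions above (the names newBatchPrecondition and errTooManyFailedPods are hypothetical; recentlyFailed would come from the lookback, counted once before the batches start so the precondition stays idempotent):

```go
// errTooManyFailedPods makes slowStartBatch stop early, so manageReplicas
// returns an error and the default rate-limited re-queue kicks in.
var errTooManyFailedPods = errors.New("too many recently failed pods, delaying pod creation")

// newBatchPrecondition returns a function matching the proposed
// batchPrecondition func(successes int) error shape.
func newBatchPrecondition(recentlyFailed int) func(successes int) error {
	return func(successes int) error {
		// "Significantly exceeding" is deliberately kept simple here; the
		// exact threshold is an open question in the thread above.
		if recentlyFailed > successes {
			return errTooManyFailedPods
		}
		return nil
	}
}
```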
// It returns the number of successful calls to the function.
func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
func slowStartBatch(count int, initialBatchSize int, batchPrecondition func(successes int) (int, error), fn func() error) (int, error) {
Let's extract the precondition function as a type:
type batchPreconditionFunc func(successes int) (int, error)
and make sure to properly document the inputs and outputs; this will make the code easier to follow.
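For example, the extracted type with documented inputs and outputs might look like this (a sketch; the exact wording is to be decided):

```go
// batchPreconditionFunc is evaluated before each batch of pod creations in
// slowStartBatch. successes is the number of pods created successfully so far
// in this call. It returns the maximum number of pods that may still be
// created and a non-nil error when creation should stop for this sync.
type batchPreconditionFunc func(successes int) (int, error)
```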
	}
}
if nextSyncInSeconds != nil {
	rsc.queue.AddAfter(key, time.Duration(*nextSyncInSeconds)*time.Second)
Drop these changes entirely, and rely on the fact that manageReplicas should return an error and fall back to the default rate-limited re-queuing.
	if maxBatchSize < batchSize {
		batchSize = maxBatchSize
	}
}
In my mind, and per what I wrote above, I think this entire piece should look like this:
if batchPrecondition != nil {
	if err := batchPrecondition(successes); err != nil {
		return successes, err
	}
}
//
// Returns the time-weighted failed pod count
// and the next time a pod should be promoted from one bucket to the next.
func calculateTimeWeightedFailedPodsCount(clock clock.PassiveClock, replicasToCreate int, failedPods []*v1.Pod) (int, *int) {
I'd like to re-use the default re-queuing mechanism, which already supports rate limiting, rather than introducing an entirely new mechanism. See my other comment where I'm laying out what it should look like, and drop this entire piece of code.
/assign
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
We need to make sure the ReplicaSet controller does not get into a hot feedback loop with the kubelet. As shown in the bug, this can result in the ReplicaSet controller generating 1800+ pods per minute. This can have a negative impact on etcd/apiserver and cluster stability.
Which issue(s) this PR fixes:
Partially solves #72593
Special notes for your reviewer:
There is a proposed ReschedulePodsOnKubeletRejection feature, but there is not enough bandwidth to implement it. This PR will not solve the issue fully, but it will at least reduce the pod generation speed: from a hot loop to a cold loop. So over time we will still accumulate up to 12500 failed pods before they are garbage collected. ReschedulePodsOnKubeletRejection should turn the cold loop into immediate feedback between the kubelet and scheduler, moving the pod to another node without failing the pods in the first place.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: