Horizontal Pod Autoscaler reporting FailedRescale error #132002

Closed
@sekar-saravanan

Description


What happened?

We are currently running several applications that are configured with metric-based scaling. These applications are expected to scale frequently in response to varying traffic patterns.

Occasionally, we observe the following error during scaling events (though it is not frequent), and there is no noticeable impact on the actual scaling process:

reason: FailedRescale, New size: <size>; reason: All metrics <below>/<above> target; error: Operation cannot be fulfilled on deployments.apps: the object has been modified; please apply your changes to the latest version and try again.

Our understanding is that this is a race condition: another controller or a CI/CD process patches or updates the Deployment object at the same moment the Horizontal Pod Autoscaler (HPA) attempts to scale it.
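
The same conflict can be reproduced by hand with kubectl, purely as an illustration of the API server's optimistic concurrency (my-app is a hypothetical deployment name):

$ kubectl get deployment my-app -o yaml > /tmp/stale.yaml    # snapshot carrying the current resourceVersion
$ kubectl annotate deployment my-app demo/touched=true --overwrite    # any write bumps the resourceVersion
$ kubectl replace -f /tmp/stale.yaml    # rejected, because the snapshot's resourceVersion is now stale

The final replace fails with the same Conflict message, since the API server refuses any full-object write whose resourceVersion no longer matches the stored object.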

In some cases, we observed the issue even when there were no apparent changes to the Deployment object itself; the only change seemed to be to its resourceVersion, and we were unable to confirm what caused it.

As a precaution, we have set up an alert that notifies us whenever the HPA fails to scale a deployment for any reason. This specific error, however, only generates noise and false positives in our alerting system.
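
For anyone tuning a similar alert in the meantime, the offending events can be listed directly; reason is a supported field selector for events:

$ kubectl get events -A --field-selector reason=FailedRescale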

What did you expect to happen?

We expect the Horizontal Pod Autoscaler (HPA) to perform retries internally and suppress the reporting of this error, as its occurrence generates unnecessary noise and may cause undue concern regarding system scaling.
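
A minimal sketch of the retry pattern we have in mind, built on client-go's stock conflict helper. This is not the HPA controller's actual code (the real controller writes through the scale subresource); the function name and the direct Deployment update are ours, for illustration only:

    package scaleutil

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/util/retry"
    )

    // scaleWithRetry re-reads the Deployment and reapplies the desired replica
    // count whenever the write loses the optimistic-concurrency race, instead
    // of surfacing the Conflict to the user as a FailedRescale event.
    func scaleWithRetry(ctx context.Context, client kubernetes.Interface, ns, name string, replicas int32) error {
        return retry.RetryOnConflict(retry.DefaultRetry, func() error {
            d, err := client.AppsV1().Deployments(ns).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return err
            }
            d.Spec.Replicas = &replicas
            _, err = client.AppsV1().Deployments(ns).Update(ctx, d, metav1.UpdateOptions{})
            return err // RetryOnConflict retries only when this is a 409 Conflict
        })
    }

With this shape, a lost race is retried against fresh state a few times before anything is reported, which is the behaviour we would like the HPA to expose.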

How can we reproduce it (as minimally and precisely as possible)?

To reproduce the issue:

  • Configure a Deployment and Horizontal Pod Autoscaler (HPA) with CPU or memory-based scaling.
  • Apply load to the Deployment pods to trigger scaling by the HPA.
  • Simultaneously, perform multiple consecutive patch operations on the Deployment (a loop sketch follows below).

This sequence will result in the HPA reporting the same error in its events.
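
Step 3 can be driven with a simple loop (my-app is a hypothetical deployment name; the annotation patch is a no-op for the workload but bumps the resourceVersion on every iteration):

$ while true; do
    kubectl patch deployment my-app --type=merge \
      -p "{\"metadata\":{\"annotations\":{\"repro/ts\":\"$(date +%s)\"}}}"
  done

While the loop runs and the HPA is actively rescaling under load, the FailedRescale events appear on the HPA object.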

Anything else we need to know?

NA

Kubernetes version

$ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.7-eks-bcf3d70

Cloud provider

AWS - EKS

OS version

# On macOS:

OS: 15.0.1

$ uname -a
Darwin NOVI-QH6F676G2Y 24.0.0 Darwin Kernel Version 24.0.0: Tue Sep 24 23:39:07 PDT 2024; root:xnu-11215.1.12~1/RELEASE_ARM64_T6000 arm64




Install tools

NA

Container runtime (CRI) and version (if applicable)

NA

Related plugins (CNI, CSI, ...) and versions (if applicable)

NA

Metadata

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
  • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.
