Stateful Indexed Job #115066

Open
@ahg-g

Description

What would you like to be added?

An automated way of creating PVCs for the individual pods of an Indexed Job. This is similar to PersistentVolumeClaimTemplate (the `volumeClaimTemplates` field) in StatefulSets.
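To make the request concrete, here is a minimal Go sketch of how per-index claim names could be derived. It assumes a naming scheme that mirrors the StatefulSet convention of `<template>-<owner>-<ordinal>`; the issue does not propose a scheme, and `pvcNameForIndex`, `pvcNamesForJob`, and the example names `checkpoints` and `train` are hypothetical, not part of any Kubernetes API.

```go
package main

import "fmt"

// pvcNameForIndex returns a hypothetical PVC name for one completion index.
// Assumption: the name follows the StatefulSet-style pattern
// <template>-<owner>-<ordinal>; nothing in the issue fixes this format.
func pvcNameForIndex(templateName, jobName string, index int) string {
	return fmt.Sprintf("%s-%s-%d", templateName, jobName, index)
}

// pvcNamesForJob lists one claim name per completion index in [0, completions).
// A negative completions value is treated as zero.
func pvcNamesForJob(templateName, jobName string, completions int) []string {
	if completions < 0 {
		completions = 0
	}
	names := make([]string, 0, completions)
	for i := 0; i < completions; i++ {
		names = append(names, pvcNameForIndex(templateName, jobName, i))
	}
	return names
}

func main() {
	// Example: an Indexed Job named "train" with 3 completions and a
	// claim template named "checkpoints".
	for _, name := range pvcNamesForJob("checkpoints", "train", 3) {
		fmt.Println(name)
	}
}
```

Because the name depends only on the template, the Job, and the index, a replacement pod for a given index would resolve to the same claim as the pod it replaces.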

Why is this needed?

Many long-running training workloads use checkpointing to recover from failures. In some cases, the individual workers create node-local checkpoints. To ensure that the replacement pod restarts quickly after a failure, it should be scheduled on the same node as the failed pod. At the same time, if the node itself fails and is gone, the replacement pod is free to schedule on any available node.
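The placement rule described above can be sketched as a small decision function. This is an illustration only, under the assumption that the controller knows the failed pod's node and the set of nodes that still exist; `placementForReplacement` and the node names are hypothetical and do not correspond to any existing scheduler or Job controller code.

```go
package main

import "fmt"

// placementForReplacement sketches the rule from the issue:
// if the failed pod's node still exists, pin the replacement to it so the
// node-local checkpoint can be reused; otherwise leave the pod unpinned so
// it can be scheduled on any available node.
// Assumption: liveNodes holds the names of nodes that currently exist.
func placementForReplacement(previousNode string, liveNodes map[string]bool) (node string, pinned bool) {
	if previousNode != "" && liveNodes[previousNode] {
		return previousNode, true
	}
	return "", false
}

func main() {
	liveNodes := map[string]bool{"node-a": true, "node-b": true}

	// The pod failed but its node is still present: reuse the node.
	node, pinned := placementForReplacement("node-a", liveNodes)
	fmt.Printf("node=%q pinned=%v\n", node, pinned)

	// The node itself is gone: no constraint on the replacement.
	node, pinned = placementForReplacement("node-c", liveNodes)
	fmt.Printf("node=%q pinned=%v\n", node, pinned)
}
```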

/wg batch
/sig storage
/sig apps

Metadata

Assignees

No one assigned

    Labels

    kind/feature: Categorizes issue or PR as related to a new feature.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    sig/apps: Categorizes an issue or PR as relevant to SIG Apps.
    sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
    wg/batch: Categorizes an issue or PR as relevant to WG Batch.
