Open
Description
What would you like to be added?
An automated way of creating PVCs for the individual pods of an Indexed Job. This is similar to the PersistentVolumeClaim templates (`volumeClaimTemplates`) in StatefulSets.
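To make the request concrete, here is a minimal Python sketch of what per-index PVC creation could look like. It is only an illustration, not a proposed API: the helper `pvc_manifests_for_indexed_job`, the `<claim>-<job>-<index>` naming (borrowed from how StatefulSets name PVCs created from `volumeClaimTemplates`), the reuse of the Job pod labels on the PVCs, and the `local-storage` storage class are all assumptions.

```python
import copy


def pvc_manifests_for_indexed_job(job_name, completions, claim_template):
    """Build one PVC manifest (plain dict) per completion index.

    Hypothetical helper: nothing like this exists in the Job API today.
    """
    claim_name = claim_template["metadata"]["name"]
    manifests = []
    for index in range(completions):
        pvc = copy.deepcopy(claim_template)
        pvc["apiVersion"] = "v1"
        pvc["kind"] = "PersistentVolumeClaim"
        # Assumed naming, mirroring StatefulSets: <claim>-<owner>-<ordinal>.
        pvc["metadata"]["name"] = f"{claim_name}-{job_name}-{index}"
        # Assumption: reuse the labels that Job pods carry so each PVC
        # can be matched to the pod with the same completion index.
        labels = pvc["metadata"].setdefault("labels", {})
        labels["batch.kubernetes.io/job-name"] = job_name
        labels["batch.kubernetes.io/job-completion-index"] = str(index)
        manifests.append(pvc)
    return manifests


# Example template, shaped like an entry in a StatefulSet's volumeClaimTemplates.
template = {
    "metadata": {"name": "checkpoints"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "local-storage",  # hypothetical storage class
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

pvcs = pvc_manifests_for_indexed_job("train", 3, template)
names = [pvc["metadata"]["name"] for pvc in pvcs]
```

With a template like this, each index of the Job would get its own claim, and a replacement pod for a given index could reattach to the claim that the failed pod used.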
Why is this needed?
Many long-running training workloads use checkpointing to recover from failures. In some cases, the individual workers create node-local checkpoints. To ensure that the replacement pod restarts quickly after a failure, it should be scheduled on the same node as the failed pod. At the same time, if the node itself fails and is gone, the replacement pod should be free to schedule on any available node.
/wg batch
/sig storage
/sig apps
Status
Needs Triage