fix(core): Prevent worker from recovering finished executions #16094
+48
−4
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Crash recovery has plenty of known issues but I found a fundamental one.
Consider this scenario:
error
statusIn this scenario, we end up with inconsistent event logs:
This is because of our lifecycle hooks setup:
getLifecycleHooksForScalingMain
which includeshookFunctionsWorkflowEvents
getLifecycleHooksForScalingWorker
which excludeshookFunctionsWorkflowEvents
getLifecycleHooksForSubExecutions
which includeshookFunctionsWorkflowEvents
This event log inconsistency impacts crash recovery, which assumes all events for an execution are in the event log for a given instance. This means crash recovery wrongly identifies execution 1 as unfinished (instead of finished with an error) leading it to wrongly amend details for a finished execution (instead of skipping it), which will also lead to another crash recovery on the next restart, as the details are incomplete.
In short, this means every worker tries to recover its own finished failed executions (instead of only unfinished crashed ones), which the worker does not even have the data for, and so this cycle repeats on restart. The root solution would be to straighten out our hooks setup and have event logs for workers remain with workers, but lifecycle hooks are so central and so legacy that this would be a big dedicated effort.
Hence for now, this fix filters out finished executions downstream and defers crash recovery logging until after we know which were the actually recovered execution IDs.
Related Linear tickets, Github issues, and Community forum posts
n/a
Review / Merge checklist
release/backport
(if the PR is an urgent fix that needs to be backported)