Skip to content

fix(core): Prevent worker from recovering finished executions #16094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ivov
Copy link
Contributor

@ivov ivov commented Jun 6, 2025

Summary

Crash recovery has plenty of known issues but I found a fundamental one.

Consider this scenario:

  • Main enqueues a job for execution 1 whose workflow has an error-handling workflow
  • Worker A picks up job, runs execution 1, which throws, finishing with error status
  • Worker A enqueues job for execution 2, for the error-handling workflow
  • Worker B picks up job, runs execution 2 successfully

In this scenario, we end up with inconsistent event logs:

n8nEventLog.log (main)
- n8n.workflow.started for execution 1
- n8n.workflow.failed for execution 1

n8nEventLog-worker.log (worker A)
- n8n.node.{started|finished} events for execution 1

n8nEventLog-worker.log (worker B)
- n8n.workflow.started for execution 2
- n8n.node.{started|finished} events for execution 2
- n8n.workflow.success for execution 2

This is because of our lifecycle hooks setup:

  • Main uses getLifecycleHooksForScalingMain which includes hookFunctionsWorkflowEvents
  • Worker uses getLifecycleHooksForScalingWorker which excludes hookFunctionsWorkflowEvents
  • Subexecutions use getLifecycleHooksForSubExecutions which includes hookFunctionsWorkflowEvents

This event log inconsistency impacts crash recovery, which assumes all events for an execution are in the event log for a given instance. This means crash recovery wrongly identifies execution 1 as unfinished (instead of finished with an error) leading it to wrongly amend details for a finished execution (instead of skipping it), which will also lead to another crash recovery on the next restart, as the details are incomplete.

In short, this means every worker tries to recover its own finished failed executions (instead of only unfinished crashed ones), which the worker does not even have the data for, and so this cycle repeats on restart. The root solution would be to straighten out our hooks setup and have event logs for workers remain with workers, but lifecycle hooks are so central and so legacy that this would be a big dedicated effort.

Hence for now, this fix filters out finished executions downstream and defers crash recovery logging until after we know which were the actually recovered execution IDs.

Related Linear tickets, Github issues, and Community forum posts

n/a

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with release/backport (if the PR is an urgent fix that needs to be backported)

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cubic found 1 issue across 3 files. Review it in cubic.dev

React with 👍 or 👎 to teach cubic. Tag @cubic-dev-ai to give specific feedback.

Copy link

codecov bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...rc/eventbus/message-event-bus/message-event-bus.ts 0.00% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ivov ivov requested a review from despairblue June 6, 2025 12:54
@n8n-assistant n8n-assistant bot added core Enhancement outside /nodes-base and /editor-ui n8n team Authored by the n8n team labels Jun 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Enhancement outside /nodes-base and /editor-ui n8n team Authored by the n8n team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant