Ingestion: Add recovery mechanism for ingestion jobs interrupted by issues such as a server crash #57

@tekrajchhetri

Problem

When the server crashes, any in-progress jobs are left in the Running state even though they are no longer executing. As a result, some jobs remain stuck indefinitely (see the examples in the screenshot: statuses 3/20 and 3/16).
[Image: screenshot of the stuck jobs referenced above]

Expected Behavior

Jobs should not stay in a perpetual Running state when execution has stopped.

Proposed Solution

Add a recovery / termination mechanism (similar to a max-job timeout or watchdog) that:

  • Detects stale or orphaned running jobs after a crash or restart
  • Automatically marks them as Failed or Re-queued
  • Prevents jobs from remaining stuck in the Running state indefinitely
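
One possible shape for such a watchdog is sketched below. It is not taken from the existing ingestion code. The `Job` record, the `last_heartbeat` field, the `stale_after` threshold and the `max_attempts` limit are all hypothetical names chosen for illustration. The sketch assumes each worker periodically updates a heartbeat timestamp on its job, so a Running job whose heartbeat is older than the threshold can be treated as orphaned.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Iterable, List


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    # Hypothetical job record; the real ingestion schema may differ.
    job_id: str
    status: JobStatus
    last_heartbeat: datetime  # assumed to be refreshed by the worker while it runs
    attempts: int = 0         # number of times the job has been re-queued


def recover_stale_jobs(
    jobs: Iterable[Job],
    now: datetime,
    stale_after: timedelta = timedelta(minutes=10),
    max_attempts: int = 3,
) -> List[Job]:
    """Re-queue or fail Running jobs whose heartbeat is older than stale_after.

    Returns the jobs whose status was changed.
    """
    recovered: List[Job] = []
    for job in jobs:
        if job.status is not JobStatus.RUNNING:
            continue  # only Running jobs can be orphaned
        if now - job.last_heartbeat < stale_after:
            continue  # heartbeat is recent, so the job is presumed alive
        if job.attempts < max_attempts:
            # Give the job another chance by putting it back in the queue.
            job.status = JobStatus.QUEUED
            job.attempts += 1
        else:
            # Too many retries: mark it Failed so it no longer shows as Running.
            job.status = JobStatus.FAILED
        recovered.append(job)
    return recovered
```

A function like this could be called once at server startup (to clean up after a crash) and then on a periodic timer (to act as the max-job timeout). Whether stale jobs should be re-queued or failed outright is a design choice left open by this issue; the retry limit above is only one way to combine the two.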
