Implement dead-letter queue and remediation workflow for failed transitions #140

@HunteRoi

Summary

Implement a dead-letter queue and staff remediation workflow for transition writes that fail after bounded retries or otherwise reach an unrecoverable state.

Problem

The specification already requires:

  • strict optimistic locking with retry
  • bounded retry behavior
  • operator visibility when retries are exhausted

That means failed transitions cannot simply disappear into logs. They need a durable failure path and an operator-facing remediation workflow.

Without this, transition failures are hard to diagnose and cannot be safely recovered from in production.

Why This Matters

This protects:

  • operational safety
  • recoverability of lifecycle failures
  • visibility into race-condition fallout
  • trust in the membership state machine

It also completes the failure-handling contract implied by optimistic locking.

Required Behavior

  1. Exhausted transition retries create a dead-letter record.
  2. Dead-letter records are durable and queryable.
  3. Staff can inspect failure details.
  4. Staff can resolve, retry, or otherwise mark the record handled.
  5. Resolution actions are audited.
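
A minimal sketch of behavior 1, under stated assumptions: every identifier here (`runTransition`, `AttemptResult`, `DeadLetterEntry`, `deadLetters`, the `activateMembership` operation name) is hypothetical and not taken from the codebase, and an in-memory array stands in for durable storage. It shows a bounded retry loop that records a dead-letter entry when optimistic-lock conflicts exhaust the attempts or when an attempt fails unrecoverably, and records nothing on success.

```typescript
// Hypothetical result of one transition write attempt.
type AttemptResult =
  | { kind: "ok" }
  | { kind: "conflict" }                // optimistic-lock version mismatch, retryable
  | { kind: "fatal"; reason: string };  // unrecoverable, never retried

// Trimmed-down entry; the full field list is under "Dead-Letter Record Should Include".
interface DeadLetterEntry {
  operation: string;
  reason: string;
  retryCount: number; // retries made after the first attempt
  createdAt: Date;
}

// Assumption: in-memory stand-in for a durable, queryable store.
const deadLetters: DeadLetterEntry[] = [];

// Runs `attempt` up to `maxAttempts` times. Returns true on success;
// otherwise writes a dead-letter entry and returns false.
function runTransition(
  operation: string,
  maxAttempts: number,
  attempt: () => AttemptResult,
): boolean {
  let attempts = 0;
  let reason = "retries exhausted after version conflicts";
  while (attempts < maxAttempts) {
    attempts++;
    const result = attempt();
    if (result.kind === "ok") return true;
    if (result.kind === "fatal") {
      reason = result.reason;
      break; // do not retry unrecoverable failures
    }
    // kind === "conflict": fall through and retry
  }
  deadLetters.push({
    operation,
    reason,
    retryCount: Math.max(0, attempts - 1),
    createdAt: new Date(),
  });
  return false;
}

// Usage: a write that always hits a version conflict ends up dead-lettered.
const alwaysConflict = (): AttemptResult => ({ kind: "conflict" });
const succeeded = runTransition("activateMembership", 3, alwaysConflict);
```

Returning a boolean rather than throwing keeps the sketch short; a real implementation might instead surface the dead-letter identifier to the caller.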

Acceptance Criteria

  1. Retry exhaustion creates a dead-letter entry instead of silently failing.
  2. Dead-letter entry contains enough context to diagnose the problem.
  3. Staff can view unresolved entries.
  4. Staff can perform a remediation action on an entry.
  5. Remediation history is recorded.

Dead-Letter Record Should Include

At minimum:

  • aggregate identity (guildId, userId, or equivalent)
  • operation or transition name
  • attempted source state
  • attempted target state
  • reason for failure
  • retry count
  • timestamp
  • transition source (bot, api, manual, etc.)
  • correlation/idempotency key if available
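
The field list above could be expressed as a TypeScript shape along the following lines. This is a sketch only: the property names (`fromState`, `failureReason`, `correlationKey`, and so on), the `TransitionSource` values beyond `bot`, `api`, and `manual`, and the `missingFields` helper are assumptions made for illustration, not names from the specification or the existing schema.

```typescript
// Assumed set of transition sources; "other" covers the "etc." in the list above.
type TransitionSource = "bot" | "api" | "manual" | "other";

interface DeadLetterRecord {
  guildId: string;          // aggregate identity
  userId: string;           // aggregate identity
  operation: string;        // operation or transition name
  fromState: string;        // attempted source state
  toState: string;          // attempted target state
  failureReason: string;    // reason for failure
  retryCount: number;
  failedAt: Date;           // timestamp
  source: TransitionSource; // transition source
  correlationKey?: string;  // correlation/idempotency key, if available
}

// Every field except the optional correlation key.
const REQUIRED_FIELDS = [
  "guildId", "userId", "operation", "fromState", "toState",
  "failureReason", "retryCount", "failedAt", "source",
] as const;

// Returns the names of required fields that are absent or empty, so a caller
// can refuse to write a record that lacks diagnostic context.
function missingFields(candidate: Partial<DeadLetterRecord>): string[] {
  return REQUIRED_FIELDS.filter(
    (field) => candidate[field] === undefined || candidate[field] === "",
  );
}

// Usage: a complete record reports no missing fields.
const sampleRecord: DeadLetterRecord = {
  guildId: "guild-1",
  userId: "user-1",
  operation: "activateMembership",
  fromState: "pending",
  toState: "active",
  failureReason: "retries exhausted after version conflicts",
  retryCount: 3,
  failedAt: new Date(),
  source: "bot",
};
const sampleMissing = missingFields(sampleRecord);
```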

Suggested Implementation Targets

  • new queue or dead-letter persistence subsystem
  • remediation panel or admin workflow to be created
  • prisma/schema.prisma
  • src/services/PostgresDatabaseService.ts
  • future API/admin tooling

Suggested Technical Direction

Model dead-letter handling as a first-class operational workflow.

Separate concerns:

  1. dead-letter creation
  2. dead-letter inspection
  3. dead-letter remediation
  4. remediation audit

Do not rely only on unstructured logs for retry exhaustion.
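
One way to picture the four separated concerns is a small in-memory model, sketched below. All names (`InMemoryDeadLetterQueue`, `RemediationAction`, `AuditEvent`, the status values, the `staffId` parameter) are hypothetical, and arrays stand in for the persistence layer that the issue suggests placing behind Prisma and `PostgresDatabaseService`. The point is the separation of methods, not the storage.

```typescript
type RemediationAction = "retried" | "resolved" | "dismissed";
type DeadLetterStatus = "unresolved" | RemediationAction;

interface DeadLetter {
  id: number;
  operation: string;
  reason: string;
  status: DeadLetterStatus;
}

interface AuditEvent {
  deadLetterId: number;
  action: RemediationAction;
  staffId: string;
  note: string;
  at: Date;
}

class InMemoryDeadLetterQueue {
  private entries: DeadLetter[] = [];
  private auditLog: AuditEvent[] = [];
  private nextId = 1;

  // 1. Creation: called when retries are exhausted.
  create(operation: string, reason: string): DeadLetter {
    const entry: DeadLetter = { id: this.nextId++, operation, reason, status: "unresolved" };
    this.entries.push(entry);
    return entry;
  }

  // 2. Inspection: staff view of entries that still need attention.
  listUnresolved(): DeadLetter[] {
    return this.entries.filter((e) => e.status === "unresolved");
  }

  // 3. Remediation: change the status once, and
  // 4. Audit: record who did what in the same call.
  remediate(id: number, action: RemediationAction, staffId: string, note: string): DeadLetter {
    const entry = this.entries.filter((e) => e.id === id)[0];
    if (!entry) throw new Error(`dead letter ${id} not found`);
    if (entry.status !== "unresolved") throw new Error(`dead letter ${id} already handled`);
    entry.status = action;
    this.auditLog.push({ deadLetterId: id, action, staffId, note, at: new Date() });
    return entry;
  }

  // Remediation history for a single entry.
  historyFor(id: number): AuditEvent[] {
    return this.auditLog.filter((a) => a.deadLetterId === id);
  }
}

// Usage: create an entry, inspect it, remediate it, then read the audit trail.
const dlq = new InMemoryDeadLetterQueue();
const created = dlq.create("activateMembership", "retries exhausted");
const unresolvedBefore = dlq.listUnresolved().length;
dlq.remediate(created.id, "resolved", "staff-42", "fixed state manually");
const unresolvedAfter = dlq.listUnresolved().length;
const trail = dlq.historyFor(created.id);
```

In a database-backed version, the status change and the audit write would presumably share one transaction so that a remediation can never be applied without its audit record.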

Validation

  • unit test: retry exhaustion creates dead-letter record
  • integration test: failed transition appears in unresolved dead-letter list
  • integration test: remediation action changes record status correctly
  • audit test: remediation actions are recorded
  • regression test: successful transition path does not create dead-letter entries

Traceability

  • Spec: docs/specs/issue-93-specification.md
  • Matrix rule: TRC-025
  • Related docs:
    • docs/specs/traceability-matrix.md
    • docs/specs/issue-drafts.md
    • docs/specs/architecture.md
    • docs/specs/state-machine.md

Related Issues
