Summary
Implement a dead-letter queue and staff remediation workflow for transition writes that fail after bounded retries or otherwise reach an unrecoverable state.
Problem
The specification already requires:
- strict optimistic locking with retry
- bounded retry behavior
- operator visibility when retries are exhausted
That means failed transitions cannot simply disappear into logs. They need a durable failure path and an operator-facing remediation workflow.
Without this, transition failures are hard to diagnose and cannot be recovered safely in production.
Why This Matters
This protects:
- operational safety
- recoverability of lifecycle failures
- visibility into race-condition fallout
- trust in the membership state machine
It also completes the failure-handling contract implied by optimistic locking.
Required Behavior
- Exhausted transition retries create a dead-letter record.
- Dead-letter records are durable and queryable.
- Staff can inspect failure details.
- Staff can resolve, retry, or otherwise mark the record handled.
- Resolution actions are audited.
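The first behavior above (exhausted retries create a dead-letter record) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `VersionConflictError`, `DeadLetterSink`, `DeadLetterInput`, and `withBoundedRetry` are hypothetical names, and the sketch assumes that only optimistic-lock conflicts are worth retrying while any other error is treated as unrecoverable.

```typescript
// All names here are hypothetical and not taken from the codebase.

/** Assumed error type thrown when an optimistic-lock version check fails. */
class VersionConflictError extends Error {}

interface DeadLetterInput {
  operation: string;
  reason: string;
  retryCount: number; // retries made after the first attempt
  occurredAt: Date;
}

/** Durable destination for failed transitions (for example a database table). */
interface DeadLetterSink {
  record(entry: DeadLetterInput): Promise<void>;
}

type WriteOutcome = { status: "committed" } | { status: "dead-lettered" };

async function withBoundedRetry(
  operation: string,
  maxAttempts: number,
  write: () => Promise<void>,
  sink: DeadLetterSink,
): Promise<WriteOutcome> {
  let lastError: unknown = undefined;
  let attempts = 0;

  while (attempts < maxAttempts) {
    attempts++;
    try {
      await write();
      return { status: "committed" };
    } catch (err) {
      lastError = err;
      // Retry only on version conflicts; stop immediately on anything else.
      if (!(err instanceof VersionConflictError)) break;
    }
  }

  // Retries exhausted or unrecoverable error: persist instead of only logging.
  await sink.record({
    operation,
    reason: lastError instanceof Error ? lastError.message : String(lastError),
    retryCount: Math.max(0, attempts - 1),
    occurredAt: new Date(),
  });
  return { status: "dead-lettered" };
}
```

Returning an explicit outcome rather than rethrowing lets the caller tell a committed transition from a dead-lettered one without parsing errors. Whether the real write path should rethrow after recording is a design choice this sketch leaves open.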
Acceptance Criteria
- Retry exhaustion creates a dead-letter entry instead of silently failing.
- Each dead-letter entry contains enough context to diagnose the problem.
- Staff can view unresolved entries.
- Staff can perform a remediation action on an entry.
- Remediation history is recorded.
Dead-Letter Record Should Include
At minimum:
- aggregate identity (guildId, userId, or equivalent)
- operation or transition name
- attempted source state
- attempted target state
- reason for failure
- retry count
- timestamp
- transition source (bot, api, manual, etc.)
- correlation/idempotency key if available
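As one possible shape for the fields listed above, the record could be typed along these lines. The field names (`fromState`, `toState`, `failureReason`, `correlationKey`), the `TransitionSource` union, and the `missingContext` helper are assumptions made for illustration. The actual schema would live in prisma/schema.prisma and may differ.

```typescript
// Hypothetical shape; names are illustrative, not taken from the codebase.
type TransitionSource = "bot" | "api" | "manual" | "other";

interface DeadLetterRecord {
  guildId: string;          // aggregate identity
  userId: string;           // aggregate identity
  operation: string;        // operation or transition name
  fromState: string;        // attempted source state
  toState: string;          // attempted target state
  failureReason: string;    // reason for failure
  retryCount: number;       // retries made before giving up
  failedAt: Date;           // timestamp
  source: TransitionSource; // where the transition originated
  correlationKey?: string;  // correlation/idempotency key, if available
}

const REQUIRED_FIELDS: (keyof DeadLetterRecord)[] = [
  "guildId", "userId", "operation", "fromState", "toState",
  "failureReason", "retryCount", "failedAt", "source",
];

/** Returns the required fields that are absent or empty on a candidate record. */
function missingContext(candidate: Partial<DeadLetterRecord>): string[] {
  return REQUIRED_FIELDS.filter(
    (field) => candidate[field] === undefined || candidate[field] === "",
  );
}
```

A check like `missingContext` could back the "enough context to diagnose the problem" acceptance criterion in a unit test.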
Suggested Implementation Targets
- new queue or dead-letter persistence subsystem
- remediation panel or admin workflow (to be created)
- prisma/schema.prisma
- src/services/PostgresDatabaseService.ts
- future API/admin tooling
Suggested Technical Direction
Model dead-letter handling as a first-class operational workflow.
Separate concerns:
- dead-letter creation
- dead-letter inspection
- dead-letter remediation
- remediation audit
Do not rely only on unstructured logs for retry exhaustion.
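The four concerns above could be kept apart behind separate methods, as in the in-memory sketch below. Everything here is hypothetical: the class name, the status values, and the action names are assumptions, and a real version would persist to Postgres rather than a `Map`. The "retry" action only marks the entry as retried. Re-running the transition itself is outside this sketch.

```typescript
// Hypothetical in-memory model; names and statuses are illustrative only.
type DeadLetterStatus = "unresolved" | "retried" | "resolved" | "dismissed";
type RemediationAction = "retry" | "resolve" | "dismiss";

interface DeadLetterEntry {
  id: number;
  operation: string;
  reason: string;
  status: DeadLetterStatus;
}

interface AuditEvent {
  entryId: number;
  action: RemediationAction;
  staffId: string;
  note: string;
  at: Date;
}

const STATUS_AFTER: Record<RemediationAction, DeadLetterStatus> = {
  retry: "retried",
  resolve: "resolved",
  dismiss: "dismissed",
};

class InMemoryDeadLetterStore {
  private entries = new Map<number, DeadLetterEntry>();
  private audit: AuditEvent[] = [];
  private nextId = 1;

  /** Creation: called by the write path when retries are exhausted. */
  create(operation: string, reason: string): DeadLetterEntry {
    const entry: DeadLetterEntry = { id: this.nextId++, operation, reason, status: "unresolved" };
    this.entries.set(entry.id, entry);
    return { ...entry };
  }

  /** Inspection: what staff see in the unresolved list. */
  listUnresolved(): DeadLetterEntry[] {
    return [...this.entries.values()]
      .filter((entry) => entry.status === "unresolved")
      .map((entry) => ({ ...entry }));
  }

  /** Remediation: changes status and always writes an audit event. */
  remediate(id: number, action: RemediationAction, staffId: string, note: string): DeadLetterEntry {
    const entry = this.entries.get(id);
    if (!entry) throw new Error(`dead-letter entry ${id} not found`);
    if (entry.status !== "unresolved") throw new Error(`dead-letter entry ${id} already handled`);
    entry.status = STATUS_AFTER[action];
    this.audit.push({ entryId: id, action, staffId, note, at: new Date() });
    return { ...entry };
  }

  /** Audit: remediation history for one entry. */
  history(id: number): AuditEvent[] {
    return this.audit.filter((event) => event.entryId === id);
  }
}
```

Rejecting a second remediation on an already handled entry keeps the audit trail unambiguous. Whether handled entries should be reopenable is a policy decision the sketch does not make.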
Validation
- unit test: retry exhaustion creates dead-letter record
- integration test: failed transition appears in unresolved dead-letter list
- integration test: remediation action changes record status correctly
- audit test: remediation actions are recorded
- regression test: successful transition path does not create dead-letter entries
Traceability
- Spec: docs/specs/issue-93-specification.md
- Matrix rule: TRC-025
- Related docs:
  - docs/specs/traceability-matrix.md
  - docs/specs/issue-drafts.md
  - docs/specs/architecture.md
  - docs/specs/state-machine.md
Related Issues