Implement p95 under 500ms SLO instrumentation and alerting #139

## Summary

Implement instrumentation, monitoring, and alerting so DataDrop can measure and enforce the p95 latency target of under 500ms for core actions.

## Problem

The specification defines a concrete SLO:

- p95 latency under 500ms

Right now this target exists only on paper. Without instrumentation and alerting, the team cannot verify whether the bot and the future backend API actually meet it.

## Why This Matters

This supports:

- operational reliability
- regression detection
- confidence in multi-tenant scaling
- visibility into performance problems affecting verification, config, and lifecycle actions

It also makes the performance requirement actionable instead of aspirational.

## Required Behavior

1. Core actions emit latency measurements.
2. p95 latency can be computed from emitted measurements.
3. Alerting exists for sustained SLO breach.
4. Metrics are available for both bot and backend API entrypoints.
5. Measurements are scoped enough to identify which operation is slow.
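
As an illustration of items 2 and 5, here is a minimal sketch of computing p95 per operation from raw duration samples. It assumes the nearest-rank percentile definition and that measurements arrive as `(operation, duration_ms)` pairs; the function names and the operation names in the usage note are hypothetical, and a real metrics backend would more likely compute percentiles from histograms.

```python
from collections import defaultdict

SLO_P95_MS = 500.0  # target taken from the specification


def p95(durations_ms):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations_ms:
        raise ValueError("p95 is undefined for an empty sample set")
    ordered = sorted(durations_ms)
    # 1-based rank = ceil(0.95 * n), done in integer arithmetic
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_operation(samples):
    """Group (operation, duration_ms) pairs and return {operation: p95}."""
    grouped = defaultdict(list)
    for operation, duration_ms in samples:
        grouped[operation].append(duration_ms)
    return {operation: p95(values) for operation, values in grouped.items()}
```

For example, `p95_by_operation([("verify.submit", 120.0), ("verify.submit", 640.0), ("config.update", 80.0)])` yields one p95 value per operation, each of which can then be compared against `SLO_P95_MS`.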

## Scope of Measurement

At minimum, instrument:

- verification flow actions
- state transition writes
- config mutation actions
- yearly reset planning/execution actions
- web API authentication and protected-operation latency, once the API exists

## Acceptance Criteria

1. Latency metrics are emitted for core bot operations.
2. Latency metrics are emitted for core API operations when API endpoints exist.
3. Dashboard or equivalent view exposes p95 latency.
4. Alerting is triggered on sustained SLO breach.
5. Measurements can be segmented by action type.
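
Criterion 4 depends on what "sustained" means, which this issue does not define. The sketch below assumes one possible definition: the p95 of N consecutive evaluation windows is at or above the threshold. The class name, the default of three windows, and the window length are all assumptions for illustration; most alerting systems express the same idea declaratively in their own rule syntax.

```python
from collections import deque


class SustainedBreachDetector:
    """Signals a breach only after several consecutive windows exceed the SLO."""

    def __init__(self, threshold_ms=500.0, required_windows=3):
        self.threshold_ms = threshold_ms
        self.required_windows = required_windows
        # Keeps only the most recent `required_windows` breach flags.
        self._recent = deque(maxlen=required_windows)

    def observe(self, window_p95_ms):
        """Record one window's p95 and return True if the alert should fire."""
        # The target is "under 500ms", so exactly 500ms counts as a breach.
        self._recent.append(window_p95_ms >= self.threshold_ms)
        return len(self._recent) == self.required_windows and all(self._recent)
```

A single slow window does not fire the alert; any window back under the threshold resets the run of consecutive breaches.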

## Suggested Implementation Targets

- bot runtime instrumentation
- future API middleware instrumentation
- monitoring configuration to be created
- metrics/reporting integration to be created

## Suggested Technical Direction

Use structured timing around command handling, event-driven transitions, and API requests.

Metrics should capture at minimum:

- operation name
- duration
- success/failure outcome
- tenant/guild context where safe and appropriate
- timestamp

Prefer metrics and alerting that can evolve with both the bot runtime and the web/API stack.
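
A minimal sketch of such a timing wrapper, assuming a Python runtime and a caller-supplied `emit` callback standing in for whatever metrics sink is chosen. `LatencyRecord`, `timed`, and the `clock`/`now` parameters are hypothetical names introduced here, not existing DataDrop code. The injectable clock exists so the unit test listed under Validation can assert exact durations.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass(frozen=True)
class LatencyRecord:
    operation: str
    duration_ms: float
    outcome: str                # "success" or "failure"
    tenant_id: Optional[str]    # guild/tenant context, only where safe to record
    timestamp: float            # Unix epoch seconds at emission


@contextmanager
def timed(
    operation: str,
    emit: Callable[[LatencyRecord], None],
    tenant_id: Optional[str] = None,
    clock: Callable[[], float] = time.perf_counter,
    now: Callable[[], float] = time.time,
) -> Iterator[None]:
    """Time the wrapped block and emit one record, even if the block raises."""
    start = clock()
    outcome = "success"
    try:
        yield
    except Exception:
        outcome = "failure"
        raise  # never swallow the caller's error
    finally:
        duration_ms = (clock() - start) * 1000.0
        emit(LatencyRecord(operation, duration_ms, outcome, tenant_id, now()))
```

Usage would look like `with timed("config.update", sink.append, tenant_id=guild_id): ...`, where `sink` and `guild_id` are placeholders for the real metrics client and tenant identifier.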

## Validation

- unit test: timing wrapper records operation duration
- integration test: core operation emits metric
- integration test: p95 aggregation path works with emitted data
- alert test: sustained artificial breach triggers alert path
- regression test: instrumentation overhead stays within an agreed budget (threshold to be defined)

## Traceability

- Spec: docs/specs/issue-93-specification.md
- Matrix rule: TRC-024
- Related docs:
  - docs/specs/traceability-matrix.md
  - docs/specs/issue-drafts.md
  - docs/specs/mvp-scope.md

## Related Issues

- #93
- #132
- #138
