Implement p95 under 500ms SLO instrumentation and alerting #139

@HunteRoi

Summary

Implement instrumentation, monitoring, and alerting so DataDrop can measure and enforce the performance target for core actions: p95 latency under 500ms.

Problem

The specification defines a concrete SLO:

  • p95 latency under 500ms

Right now this target is only a requirement on paper. Without instrumentation and alerting, the team cannot verify whether the bot and future backend API actually meet the target.

Why This Matters

This protects:

  • operational reliability
  • regression detection
  • confidence in multi-tenant scaling
  • visibility into performance problems affecting verification, config, and lifecycle actions

It also makes the performance requirement actionable instead of aspirational.

Required Behavior

  1. Core actions emit latency measurements.
  2. p95 latency can be computed from emitted measurements.
  3. Alerting exists for sustained SLO breach.
  4. Metrics are available for both bot and backend API entrypoints.
  5. Measurements are labeled per operation, so a slow operation can be identified.
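
As a rough illustration of items 2 and 5, the sketch below computes a per-operation p95 from raw duration samples using the nearest-rank method. It is written in Python purely for illustration and does not reflect the project's actual stack. The operation names (`verify.submit`, `config.update`), the sample data, and the helper names are all hypothetical. A real deployment would more likely lean on a metrics backend's histogram or quantile support than on in-process sorting.

```python
import math
from collections import defaultdict

SLO_P95_MS = 500.0  # target stated in the specification


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def p95_by_operation(samples):
    """samples: iterable of (operation_name, duration_ms) pairs."""
    grouped = defaultdict(list)
    for operation, duration_ms in samples:
        grouped[operation].append(duration_ms)
    return {op: percentile(durations, 95) for op, durations in grouped.items()}


# Hypothetical data: 20 samples from 10ms to 200ms for one operation,
# plus two samples for another operation, one of them slow.
samples = [("verify.submit", float(ms)) for ms in range(10, 210, 10)]
samples += [("config.update", 120.0), ("config.update", 640.0)]

report = p95_by_operation(samples)
# "Under 500ms" is read here as strictly below the threshold.
breaching = sorted(op for op, p95 in report.items() if p95 >= SLO_P95_MS)
```

Grouping by operation name before aggregating is what lets a breach be attributed to a specific action rather than to the system as a whole.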

Scope of Measurement

At minimum, instrument:

  • verification flow actions
  • state transition writes
  • config mutation actions
  • yearly reset planning/execution actions
  • web API authentication and protected-operation latency, once the API exists

Acceptance Criteria

  1. Latency metrics are emitted for core bot operations.
  2. Latency metrics are emitted for core API operations when API endpoints exist.
  3. A dashboard or equivalent view exposes p95 latency.
  4. An alert fires on a sustained SLO breach.
  5. Measurements can be segmented by action type.
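
The issue does not define what "sustained" means. One possible reading, sketched below, is that the p95 must meet or exceed the threshold for several consecutive evaluation windows before an alert fires, so that a single noisy window does not page anyone. The class name, the three-window default, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SustainedBreachDetector:
    threshold_ms: float = 500.0  # SLO target from the specification
    required_windows: int = 3    # assumption: consecutive breaching windows that count as "sustained"
    _streak: int = 0

    def observe(self, window_p95_ms: float) -> bool:
        """Feed one window's p95; return True while the alert should be firing."""
        if window_p95_ms >= self.threshold_ms:
            self._streak += 1
        else:
            self._streak = 0  # any healthy window resets the streak
        return self._streak >= self.required_windows


detector = SustainedBreachDetector()
# Hypothetical per-window p95 values in milliseconds.
window_p95s = [420.0, 610.0, 380.0, 530.0, 560.0, 700.0, 450.0]
firing = [detector.observe(p95) for p95 in window_p95s]
```

In this run the isolated 610ms window does not fire an alert, while the third consecutive breaching window (700ms) does. Most monitoring systems express the same idea declaratively, as a "for" duration on an alert rule.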

Suggested Implementation Targets

  • bot runtime instrumentation
  • future API middleware instrumentation
  • monitoring configuration (to be created)
  • metrics/reporting integration (to be created)

Suggested Technical Direction

Use structured timing around command handling, event-driven transitions, and API requests.

Metrics should capture at minimum:

  • operation name
  • duration
  • success/failure outcome
  • tenant/guild context where safe and appropriate
  • timestamp
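
A minimal sketch of such a timing wrapper, assuming a context-manager shape and a pluggable `sink` callback, might look like the following. `LatencyRecord`, `timed`, the `sink` parameter, and the example operation names are hypothetical and not taken from the DataDrop codebase. The record carries the five fields listed above.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass(frozen=True)
class LatencyRecord:
    operation: str
    duration_ms: float
    outcome: str              # "success" or "failure"
    guild_id: Optional[str]   # tenant context; None when it should not be recorded
    timestamp: float          # Unix epoch seconds at the start of the operation


@contextmanager
def timed(
    operation: str,
    sink: Callable[[LatencyRecord], None],
    guild_id: Optional[str] = None,
) -> Iterator[None]:
    started_at = time.time()
    start = time.perf_counter()  # monotonic clock for the duration itself
    outcome = "success"
    try:
        yield
    except Exception:
        outcome = "failure"
        raise  # the wrapper observes failures but never swallows them
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        sink(LatencyRecord(operation, duration_ms, outcome, guild_id, started_at))


records = []

with timed("config.update", records.append, guild_id="guild-123"):
    time.sleep(0.01)  # stand-in for real work

try:
    with timed("verify.submit", records.append):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

Emitting the record in `finally` means failed operations are measured too, which matters because slow failures (timeouts, for example) often dominate the tail. Keeping `sink` as a plain callable leaves the choice of metrics backend open.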

Prefer metrics and alerting that can evolve with both the bot runtime and the web/API stack.

Validation

  • unit test: timing wrapper records operation duration
  • integration test: core operation emits metric
  • integration test: p95 aggregation path works with emitted data
  • alert test: sustained artificial breach triggers alert path
  • regression test: instrumentation overhead remains acceptable
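
The first validation item could be made deterministic by injecting the clock, so the test never depends on real elapsed time. The sketch below uses Python's standard `unittest` module. The `measure` helper and `FakeClock` are hypothetical stand-ins for whatever timing wrapper the project ends up with.

```python
import unittest


def measure(fn, clock):
    """Hypothetical timing wrapper: returns (result, duration_ms) using an injected clock."""
    start = clock()
    result = fn()
    return result, (clock() - start) * 1000.0


class FakeClock:
    """Returns pre-scripted readings (in seconds) instead of real time."""

    def __init__(self, ticks):
        self._ticks = iter(ticks)

    def __call__(self):
        return next(self._ticks)


class TimingWrapperTest(unittest.TestCase):
    def test_records_operation_duration(self):
        clock = FakeClock([10.0, 10.25])  # start and end readings
        result, duration_ms = measure(lambda: "ok", clock)
        self.assertEqual(result, "ok")
        self.assertAlmostEqual(duration_ms, 250.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TimingWrapperTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same fake-clock approach could drive the alert test: feed scripted slow durations through the aggregation path and check that the alert path is reached.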

Traceability

  • Spec: docs/specs/issue-93-specification.md
  • Matrix rule: TRC-024
  • Related docs:
    • docs/specs/traceability-matrix.md
    • docs/specs/issue-drafts.md
    • docs/specs/mvp-scope.md
