docs: KMS context injection quality spec (signed off)#37

Open
ryaker wants to merge 1 commit into main from docs/kms-quality-spec
@ryaker ryaker commented Apr 13, 2026

Owner

Summary

Contract for fixing KMS context injection quality. Injection quality is currently unmeasured, and anecdotally only about 22% of injected items are on-topic (Leg 0 will produce the real number). This spec pins down what "working right" means quantitatively, how we'll measure it, how measurement flows back into ranking, and how we'll roll it out with kill criteria per phase.

Deliverable is the spec itself — no code, no schema, no hooks. The implementation plan for Leg 0 is a separate PR that will reference this spec as its acceptance contract.

Complements the corrective tools in #36 — that PR adds the surface for correcting wrong entries; this spec defines the feedback loop that decides which entries to correct and when.

What the spec covers

| § | Section | What it contains |
|---|---|---|
| 1 | Context & current state | Polyglot architecture, corrective-tools cross-ref, ~22% on-topic baseline flagged as anecdotal |
| 2 | Four quant metrics | on-topic / usage / contradiction / discovery, each with formula, baseline, target, rolling window |
| 3 | Measurement infrastructure | Async Stop-hook scorer → `kms_quality_log` collection in existing Mongo DB, Haiku 4.5 judge with Sonnet 4.6 auto-fallback |
| 4 | Feedback loop into ranking | `effective_confidence = stored × (1 + α·usage − β·contradiction)`, α=0.5, β=2.0 |
| 5 | Closed corrective loop | Contradictions auto-stage `kms_supersede` suggestions that surface in the next session's injected bundle |
| 6 | Phased rollout | Leg 0 → 1 (Mem0) → 2 (Neo4j) → 3 (Mongo + closed loop) with numeric entry/exit/kill criteria and ≥1 week measurement windows |
| 7 | Failure modes & revert | Seven failure modes, detection + mitigation each, single-flag revert (`KMS_QUALITY_RANKING_ENABLED=false`) |
| 8 | Decisions (signed off) | All four open questions resolved with rationale: Haiku judge, full-text excerpt, same-DB new-collection, α/β = 0.5/2.0 |
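
As a rough illustration of how rows 4 and 7 could fit together, the sketch below ranks items by effective confidence and falls back to stored confidence when the revert flag is off. Only the formula, α=0.5, β=2.0, and the flag name `KMS_QUALITY_RANKING_ENABLED` come from the spec summary. The `RankedItem` shape, the function names, passing the environment in as a plain object, and treating any value other than `"false"` as enabled are all assumptions made for the example.

```typescript
// Hypothetical item shape; field names are illustrative, not from the spec.
interface RankedItem {
  id: string;
  storedConfidence: number;   // confidence as stored in the backend
  usageScore: number;         // per-item usage signal, assumed in [0, 1]
  contradictionScore: number; // per-item contradiction signal, assumed in [0, 1]
}

const ALPHA = 0.5; // usage reward weight (spec section 4)
const BETA = 2.0;  // contradiction penalty weight (spec section 4)

// effective_confidence = stored × (1 + α·usage − β·contradiction)
function effectiveConfidence(item: RankedItem): number {
  return item.storedConfidence *
    (1 + ALPHA * item.usageScore - BETA * item.contradictionScore);
}

// Assumption: any value other than the literal string "false" leaves ranking on.
function rankItems(
  items: RankedItem[],
  env: Record<string, string | undefined>,
): RankedItem[] {
  const enabled = env["KMS_QUALITY_RANKING_ENABLED"] !== "false";
  const score = enabled
    ? effectiveConfidence
    : (item: RankedItem) => item.storedConfidence;
  // Copy before sorting so the caller's array is left untouched.
  return [...items].sort((a, b) => score(b) - score(a));
}

const sample: RankedItem[] = [
  { id: "stale", storedConfidence: 0.9, usageScore: 0.0, contradictionScore: 0.4 },
  { id: "useful", storedConfidence: 0.7, usageScore: 0.8, contradictionScore: 0.0 },
];
const rankedOn = rankItems(sample, {}).map((i) => i.id);
const rankedOff = rankItems(sample, { KMS_QUALITY_RANKING_ENABLED: "false" }).map((i) => i.id);
```

With the flag on, the contradicted item drops below the used one even though its stored confidence is higher; with the flag set to `false`, the order reverts to stored confidence alone.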

Why these four metrics

They're orthogonal: on-topic rate measures retrieval quality, usage rate measures whether on-topic results were actually load-bearing, contradiction rate flags wrong stored facts, discovery rate measures whether the corrective tools are reaching daily flow. You can't game one without moving another.

Why the β > α asymmetry

β=2.0 vs α=0.5 means a single contradiction punishes ranking ~4× harder than a single use rewards it. This embodies the principle that "wrong information is more costly than missing information": wrong facts should fade fast, while useful facts should promote cautiously. §7 defines telemetry (false_demotion_rate) that detects whether this is too aggressive, and the mitigation is explicit: raise α, drop β.
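
The 4× figure can be read directly off the §4 formula. As a worked sketch, write the multiplier as $m$, the usage score as $u$, and the contradiction score as $c$ (symbols introduced here for illustration only), then compare equal-sized changes in each score:

```math
\begin{aligned}
m &= 1 + \alpha u - \beta c, \qquad \alpha = 0.5,\ \beta = 2.0 \\
\Delta m_{\text{use}} &= \alpha\,\Delta u = +0.5\,\Delta u \\
\Delta m_{\text{contradiction}} &= -\beta\,\Delta c = -2.0\,\Delta c \\
\left|\frac{\Delta m_{\text{contradiction}}}{\Delta m_{\text{use}}}\right| &= \frac{\beta}{\alpha} = 4 \quad \text{when } \Delta u = \Delta c
\end{aligned}
```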

What happens after sign-off

  1. Leg 0 implementation PR opens, references this spec as its contract
  2. Leg 0 ships: Stop-hook scorer + kms_quality_log + rollup script, no ranking changes
  3. Judge validation runs against 50 hand-labeled turns; if Haiku <85% agreement, auto-upgrade to Sonnet
  4. 7 days of clean measurement data → Leg 1 (Mem0 leg of feedback loop) proceeds
  5. Each subsequent leg gates on measured improvement from the previous one. If a leg regresses, one-flag revert and rethink.
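
The judge-validation gate in step 3 could look roughly like the sketch below. The 50 hand-labeled turns and the 85% threshold come from this description. The label type, the function names, and the `"haiku"` / `"sonnet"` strings are placeholders for illustration, not real model identifiers or spec-defined names.

```typescript
// Labels a judge or a human assigns to one turn (illustrative).
type Label = "relevant" | "irrelevant";

// Fraction of turns where the judge's label matches the hand label.
function agreementRate(judge: Label[], human: Label[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  const matches = judge.filter((label, i) => label === human[i]).length;
  return matches / judge.length;
}

const AGREEMENT_THRESHOLD = 0.85; // from the description: below 85% triggers the upgrade

// Placeholder strings stand in for the two judge models.
function chooseJudge(haikuAgreement: number): "haiku" | "sonnet" {
  return haikuAgreement < AGREEMENT_THRESHOLD ? "sonnet" : "haiku";
}

// 50 hand-labeled turns, with the judge disagreeing on 8 of them (42/50 = 84%).
const human: Label[] = Array.from({ length: 50 }, (): Label => "relevant");
const judge: Label[] = human.map((label, i): Label => (i < 8 ? "irrelevant" : label));
const rate = agreementRate(judge, human);
const chosen = chooseJudge(rate);
```

In this made-up run the judge agrees on 42 of 50 turns, which falls just under the threshold and selects the fallback.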

Test plan

  • All 8 sections present and self-contained
  • Every metric in §2 has formula + baseline + target + window (no vibes language)
  • §3 reuses existing infrastructure (Mongo connection, existing Stop-hook slot) — no new infra invented
  • §6 has explicit numeric kill criteria per phase, single-flag revert
  • §7 has both detection and mitigation for every failure mode
  • §8 decisions are resolved with rationale, not left as open questions
  • Reviewer: confirm metric definitions are what you want to optimize for
  • Reviewer: confirm the Leg 0 → Leg 1 → Leg 2 → Leg 3 sequencing matches your ordering preference
  • Reviewer: confirm the single-flag revert mechanism is sufficient (no other data written beyond kms_quality_log)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added comprehensive KMS context injection quality specification that defines measurable metrics, quality assessment processes, feedback mechanisms, and phased rollout plan for continuous improvement.

Contract for how we fix KMS context injection — currently unmeasured and
surfacing ~22% on-topic items (anecdotal). Before any code lands, this
spec pins down what "working right" means quantitatively, how we measure
it, how measurement flows back into ranking, and how we roll it out with
kill criteria per phase.

Eight sections:
- §1 Context & current state
- §2 Four orthogonal metrics (on-topic, usage, contradiction, discovery)
  with formulas, baselines, targets, and rolling windows
- §3 Measurement infrastructure: async Stop-hook scorer, new kms_quality_log
  collection in the existing KMS MongoDB database, Haiku 4.5 judge with
  Sonnet 4.6 auto-fallback if judge agreement <85% on 50 hand-labeled turns
- §4 Feedback loop: effective_confidence = stored × (1 + α·usage − β·contradiction)
  with α=0.5, β=2.0 (contradictions punish 4x harder than uses reward)
- §5 Closed corrective loop: contradictions auto-stage kms_supersede
  suggestions that surface in the next session's injected bundle
- §6 Four-phase rollout (Leg 0 measurement → Leg 1 Mem0 → Leg 2 Neo4j →
  Leg 3 MongoDB + closed loop), each with numeric entry/exit/kill criteria
  and a one-week measurement window minimum
- §7 Seven failure modes with detection + mitigation, single-flag revert
  (KMS_QUALITY_RANKING_ENABLED=false)
- §8 Four decisions signed off with rationale

Deliverable is the spec itself — no code, no schema, no hooks. The
implementation plan for Leg 0 is a separate PR that references this spec
as its acceptance contract.

Complements the corrective tools in PR #36 — that PR added the surface for
correcting wrong entries; this spec defines the feedback loop that decides
*which* entries to correct and *when*.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented Apr 13, 2026

Walkthrough

This PR adds a new documentation file that specifies a complete quality measurement and improvement framework for KMS context injection, including quantitative metrics, measurement pipeline architecture, feedback loops for ranking corrections, and a phased rollout plan with explicit exit criteria and a revert mechanism.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **KMS Quality Specification** `docs/CONTEXT_INJECTION_QUALITY_SPEC.md` | Introduces comprehensive quality spec with four metrics (on-topic rate, usage rate, contradiction rate, discovery rate), asynchronous scoring pipeline using Anthropic judge, MongoDB storage strategy with TTL, effective_confidence ranking feedback loop, correction workflow with kms_supersede suggestions, and phased rollout plan (4 legs) with kill switches and numeric thresholds. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐰 A quality spec hops into view,
With metrics so shiny and new,
Four measures of grace,
A judge keeps the pace,
While corrections bloom red, white, and blue!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: a new KMS context injection quality specification document with sign-off decisions. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive quality specification for KMS context injection, defining quantitative metrics, measurement infrastructure, and a self-healing feedback loop for ranking stored facts. The documentation is well-structured and includes a phased rollout plan with clear kill criteria. The review feedback identifies three areas for improvement: clarifying the Discovery rate formula to account for cross-session events, adding a safety clamp to the effective confidence formula to prevent negative values, and specifying the smoothing factor for the exponential moving averages used in scoring.

| **On-topic injection rate** | `(# injected items judged relevant to user prompt) / (# injected items)` | Stop-hook scorer (LLM judge over `{prompt, injected_item}`) | ~22% (anecdotal) | ≥80% | rolling 7d |
| **Usage rate** | `(# injected items the next assistant turn referenced or acted on) / (# injected items)` | Stop-hook scorer (LLM judge over `{injected_item, assistant_response}`) | unknown | ≥50% | rolling 7d |
| **Contradiction rate** | `(# injected items the assistant explicitly contradicted) / (# injected items)` | Stop-hook scorer; auto-stages a `kms_supersede` suggestion | unknown | ≤5% | rolling 7d |
| **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
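
The first three formulas in the table share a denominator, the number of injected items, so they can be computed in one pass. The sketch below assumes a hypothetical per-item verdict record written by the scorer. The `ItemVerdict` and `Rates` shapes and the function name are illustrative, and the caller is assumed to have already filtered verdicts to the rolling 7-day window.

```typescript
// Hypothetical per-item verdict written by the scorer; names are illustrative.
interface ItemVerdict {
  relevant: boolean;     // judged relevant to the user prompt
  used: boolean;         // referenced or acted on in the next assistant turn
  contradicted: boolean; // explicitly contradicted by the assistant
}

interface Rates {
  onTopic: number;
  usage: number;
  contradiction: number;
}

// Each rate divides a count of flagged items by the number of injected items.
function computeRates(verdicts: ItemVerdict[]): Rates | null {
  const total = verdicts.length;
  if (total === 0) return null; // no injections in the window, so the rates are undefined
  const count = (pick: (v: ItemVerdict) => boolean): number =>
    verdicts.filter(pick).length;
  return {
    onTopic: count((v) => v.relevant) / total,
    usage: count((v) => v.used) / total,
    contradiction: count((v) => v.contradicted) / total,
  };
}

const windowVerdicts: ItemVerdict[] = [
  { relevant: true, used: true, contradicted: false },
  { relevant: true, used: false, contradicted: false },
  { relevant: false, used: false, contradicted: true },
  { relevant: false, used: false, contradicted: false },
];
const rates = computeRates(windowVerdicts);
```

Returning `null` for an empty window is one way to avoid reporting a misleading 0% when nothing was injected; the spec excerpt does not say how that case should be handled.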


medium

The Discovery rate formula (# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted) is slightly ambiguous because §5 explains that suggestions are surfaced in the next session. This implies the "warranted" event and the "happened" event occur in different turns. The definition should clarify how these events are linked across sessions to ensure the metric is accurately measurable.

Suggested change
| **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
| **Discovery rate** | `(# warranted corrections resolved via tool) / (# warranted corrections identified)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
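
One way the cross-session linkage could be made measurable is to key each staged suggestion by an identifier and count how many identifiers later show up as resolved, regardless of which session the resolution happened in. This is only a sketch: the `StagedSuggestion` and `Resolution` records, their fields, and the function name are hypothetical and not defined by the spec excerpt.

```typescript
// Hypothetical records; the spec excerpt does not define these shapes.
interface StagedSuggestion {
  suggestionId: string;
  stagedInSession: string;
}

interface Resolution {
  suggestionId: string;
  resolvedInSession: string;
}

// Discovery rate = warranted corrections resolved via tool / warranted corrections identified.
// Linking by suggestionId lets the resolution land in a later session than the staging.
function discoveryRate(
  staged: StagedSuggestion[],
  resolved: Resolution[],
): number | null {
  if (staged.length === 0) return null; // nothing was warranted in the window
  const resolvedIds = new Set(resolved.map((r) => r.suggestionId));
  const hits = staged.filter((s) => resolvedIds.has(s.suggestionId)).length;
  return hits / staged.length;
}

const stagedExample: StagedSuggestion[] = [
  { suggestionId: "s1", stagedInSession: "A" },
  { suggestionId: "s2", stagedInSession: "A" },
  { suggestionId: "s3", stagedInSession: "B" },
  { suggestionId: "s4", stagedInSession: "B" },
];
const resolvedExample: Resolution[] = [
  { suggestionId: "s1", resolvedInSession: "B" },
  { suggestionId: "s3", resolvedInSession: "C" },
  { suggestionId: "s4", resolvedInSession: "C" },
];
const discovery = discoveryRate(stagedExample, resolvedExample);
```

Here three of four staged suggestions are resolved in a later session, which would sit just above the ≥70% target.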

The structural fix. Today `UnifiedSearchTool.rankResults` (`src/tools/UnifiedSearchTool.ts:430-444`) sorts purely on `result.confidence`. The spec defines an **effective confidence** that combines stored confidence with usage signal:

```
effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
```


medium

The effective_confidence formula can produce negative values if the contradiction_score is high (e.g., with the default β=2.0, any contradiction score > 0.5 results in a negative multiplier). Since the ranking logic likely expects non-negative scores, the specification should explicitly include clamping the result to a minimum of zero.

Suggested change
effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
effective_confidence = max(0, stored_confidence × (1 + α·usage_score − β·contradiction_score))
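
A small sketch of the difference the clamp makes, using the spec's α=0.5 and β=2.0 as defaults. The function names and the sample numbers are made up for illustration.

```typescript
// Unclamped form of the formula from the spec excerpt.
function rawEffectiveConfidence(
  stored: number,
  usage: number,
  contradiction: number,
  alpha = 0.5,
  beta = 2.0,
): number {
  return stored * (1 + alpha * usage - beta * contradiction);
}

// Clamped form from the suggested change: never returns a negative score.
function clampedEffectiveConfidence(
  stored: number,
  usage: number,
  contradiction: number,
  alpha = 0.5,
  beta = 2.0,
): number {
  return Math.max(0, rawEffectiveConfidence(stored, usage, contradiction, alpha, beta));
}

// With no usage and a contradiction score of 0.75 the multiplier is 1 - 1.5 = -0.5.
const rawScore = rawEffectiveConfidence(0.8, 0, 0.75);
const clampedScore = clampedEffectiveConfidence(0.8, 0, 0.75);
```

The unclamped score here is negative, while the clamped version bottoms out at zero, so a heavily contradicted item sorts last instead of carrying a negative score into the ranking.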

Comment on lines +158 to +159
- `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.
- `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.


medium

The specification uses the term "rolling EMA" for usage_score and contradiction_score but does not define the smoothing factor (often denoted as α or γ). Defining this parameter is necessary to ensure consistent implementation of the feedback loop's responsiveness across different components.

Suggested change
- `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.
- `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.
- `usage_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(uses) / (injections)` from `kms_quality_rollup`.
- `contradiction_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(contradictions) / (injections)`.
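
For reference, a standard exponential moving average update looks like the sketch below. The value γ=0.3 is the reviewer's suggestion above rather than a number confirmed by the spec, and seeding the average at 0 for a new item is an assumption made for the example.

```typescript
// One EMA step: next = γ·observation + (1 − γ)·previous.
function emaUpdate(previous: number, observation: number, gamma: number): number {
  if (gamma <= 0 || gamma > 1) {
    throw new RangeError("gamma must be in (0, 1]");
  }
  return gamma * observation + (1 - gamma) * previous;
}

// Reviewer-suggested smoothing factor; not confirmed by the spec.
const GAMMA = 0.3;

// Fold a series of per-window ratios (for example uses / injections) into one score.
// Assumption: a new item starts from a seed of 0.
function emaOver(observations: number[], gamma: number, seed = 0): number {
  return observations.reduce((acc, obs) => emaUpdate(acc, obs, gamma), seed);
}

// Two windows where the item was always used, then one where it was never used.
const usageScore = emaOver([1, 1, 0], GAMMA);
```

A larger γ makes the score react faster to recent windows and forget history sooner, which is why leaving it unspecified could make two implementations of the same feedback loop behave differently.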

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/CONTEXT_INJECTION_QUALITY_SPEC.md`:
- Around line 33-35: Three fenced code blocks in the document lack language
identifiers: the block that contains the literal
"docs/CONTEXT_INJECTION_QUALITY_SPEC.md", the ASCII-art block starting with
"┌─────────────────────────────────────────────────────────────┐" and the block
containing the formula starting with "effective_confidence = stored_confidence ×
(1 + α·usage_score − β·contradiction_score)". Add appropriate fenced-code
language tags (e.g., ```text for plain text/ascii art and ```math or ```text for
the formula depending on your linting preference) to each fence so they comply
with MD040; ensure you replace the opening ``` with ```text (or another suitable
language) for those three specific blocks.
- Line 138: Clarify the ambiguous "content excerpt (full text, first 500 chars)"
in the injected_items[] schema by deciding and stating explicitly whether the
field stores the full document or only the excerpt; update the wording for the
"content excerpt" entry to either "content (full text)" or "content_excerpt
(first 500 characters)" and, if keeping an excerpt, add the exact truncation
rule and character limit and whether it uses Unicode codepoints or bytes; ensure
the change is applied to the injected_items[] description so implementers of the
scorer/log schema know the precise field name and length behavior.

📥 Commits

Reviewing files that changed from the base of the PR and between eff278f and e1e963e.

📒 Files selected for processing (1)
  • docs/CONTEXT_INJECTION_QUALITY_SPEC.md

Comment on lines +33 to +35
```
docs/CONTEXT_INJECTION_QUALITY_SPEC.md
```


⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks (markdownlint MD040).

The fences starting at Line 33, Line 45, and Line 154 should specify a language for lint compliance.

Markdownlint-compliant diff

    -```
    +```text
     docs/CONTEXT_INJECTION_QUALITY_SPEC.md
     ```

    -```
    +```text
     ┌─────────────────────────────────────────────────────────────┐
     │ This plan │
     ...
     ```

    -```
    +```text
     effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
     ```

Also applies to: 45-102, 154-156

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 33-33: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

- **What it captures per turn**:
  - `session_id`, `turn_id`, `timestamp`
  - `user_prompt` (truncated)
  - `injected_items[]` — id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection


⚠️ Potential issue | 🟡 Minor

Clarify the excerpt definition to avoid implementation drift.

Line 138 mixes “full text” with “first 500 chars,” which can be interpreted two ways. Please make it unambiguous (store full content vs store excerpt only) so the scorer/log schema is implemented consistently.

Suggested wording tweak
-  - `injected_items[]` — id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection
+  - `injected_items[]` — id, source backend, **content_excerpt (first 500 chars only)**, confidence at time of injection
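
If the excerpt-only reading is chosen, the truncation rule still needs a unit. The sketch below truncates by Unicode code points so that a surrogate pair is never split. Choosing code points over bytes or UTF-16 code units is an assumption made here; the spec excerpt leaves it open, and the function and constant names are illustrative.

```typescript
const EXCERPT_LIMIT = 500; // character limit named in the spec excerpt

// Truncate by Unicode code points rather than UTF-16 code units, so a
// surrogate pair (for example an emoji) is never cut in half.
// Assumption: code points are the intended unit; the spec does not say.
function contentExcerpt(fullText: string, limit: number = EXCERPT_LIMIT): string {
  const codePoints = Array.from(fullText); // string iteration yields whole code points
  return codePoints.length <= limit
    ? fullText
    : codePoints.slice(0, limit).join("");
}

const shortExcerpt = contentExcerpt("hello");
const emojiText = "😀".repeat(600);
const emojiExcerpt = contentExcerpt(emojiText);
```

A naive `fullText.slice(0, 500)` counts UTF-16 code units instead, so for text made of emoji it would keep only 250 visible characters and could end on half of a pair when the cut falls at an odd offset.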
