docs: KMS context injection quality spec (signed off) by ryaker · Pull Request #37 · ryaker/KMSmcp

ryaker · 2026-04-13T02:15:41Z

Summary

Contract for fixing KMS context injection quality. Injection quality is currently unmeasured, and roughly 22% of injected items are on-topic (anecdotal; Leg 0 produces the real number). This spec pins down what "working right" means quantitatively, how we'll measure it, how measurement flows back into ranking, and how we'll roll it out with kill criteria per phase.

Deliverable is the spec itself — no code, no schema, no hooks. The implementation plan for Leg 0 is a separate PR that will reference this spec as its acceptance contract.

Complements the corrective tools in #36 — that PR adds the surface for correcting wrong entries; this spec defines the feedback loop that decides which entries to correct and when.

What the spec covers

| § | Section | What it contains |
|---|---|---|
| 1 | Context & current state | Polyglot architecture, corrective-tools cross-ref, ~22% on-topic baseline flagged as anecdotal |
| 2 | Four quant metrics | on-topic / usage / contradiction / discovery, each with formula, baseline, target, rolling window |
| 3 | Measurement infrastructure | Async Stop-hook scorer → `kms_quality_log` collection in existing Mongo DB, Haiku 4.5 judge with Sonnet 4.6 auto-fallback |
| 4 | Feedback loop into ranking | `effective_confidence = stored × (1 + α·usage − β·contradiction)`, α=0.5, β=2.0 |
| 5 | Closed corrective loop | Contradictions auto-stage `kms_supersede` suggestions that surface in the next session's injected bundle |
| 6 | Phased rollout | Leg 0 → 1 (Mem0) → 2 (Neo4j) → 3 (Mongo + closed loop) with numeric entry/exit/kill criteria and ≥1 week measurement windows |
| 7 | Failure modes & revert | Seven failure modes, detection + mitigation each, single-flag revert (`KMS_QUALITY_RANKING_ENABLED=false`) |
| 8 | Decisions (signed off) | All four open questions resolved with rationale: Haiku judge, full-text excerpt, same-DB new-collection, α/β = 0.5/2.0 |

Why these four metrics

They're orthogonal: on-topic rate measures retrieval quality, usage rate measures whether on-topic results were actually load-bearing, contradiction rate flags wrong stored facts, discovery rate measures whether the corrective tools are reaching daily flow. You can't game one without moving another.

Why the β > α asymmetry

β=2.0 vs α=0.5 means a single contradiction punishes ranking ~4× harder than a single use rewards it. Embodies the principle "wrong information is more costly than missing information" — wrong facts should fade fast; useful facts should promote cautiously. §7 has telemetry (false_demotion_rate) that detects if this is too aggressive, and the mitigation is explicit: raise α, drop β.
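
To make the asymmetry concrete, the sensitivities of the §4 formula can be worked out directly. This is a sketch that assumes `usage_score` and `contradiction_score` are rates in [0, 1]; the shorthand symbols s, u, c and the sample values are illustrative and do not come from the spec.

```latex
% s = stored_confidence, u = usage_score, c = contradiction_score
e(u, c) = s\,(1 + \alpha u - \beta c), \qquad \alpha = 0.5,\ \beta = 2.0

\frac{\partial e}{\partial u} = \alpha s = 0.5\,s, \qquad
\frac{\partial e}{\partial c} = -\beta s = -2.0\,s, \qquad
\left|\frac{\partial e / \partial c}{\partial e / \partial u}\right| = \frac{\beta}{\alpha} = 4

% Break-even: the two terms cancel when \alpha u = \beta c, i.e. u = 4c.
% Illustrative values: s = 0.8, u = 0.4, c = 0.1
e = 0.8\,(1 + 0.5 \cdot 0.4 - 2.0 \cdot 0.1) = 0.8\,(1 + 0.2 - 0.2) = 0.8
```

Under this reading, an item needs a usage score four times its contradiction score just to hold its stored confidence.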

What happens after sign-off

1. Leg 0 implementation PR opens, references this spec as its contract
2. Leg 0 ships: Stop-hook scorer + kms_quality_log + rollup script, no ranking changes
3. Judge validation runs against 50 hand-labeled turns; if Haiku <85% agreement, auto-upgrade to Sonnet
4. 7 days of clean measurement data → Leg 1 (Mem0 leg of feedback loop) proceeds
5. Each subsequent leg gates on measured improvement from the previous one. If a leg regresses, one-flag revert and rethink.
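
The judge-validation gate in the steps above can be sketched as a plain agreement check. This is a minimal illustration and not the spec's implementation: the `LabeledTurn` shape, the binary label set, and the function names are hypothetical; only the 50-turn sample size and the 85% threshold come from the PR description.

```typescript
// Hypothetical label set: the PR does not say how turns are labeled.
type Label = "on_topic" | "off_topic";

interface LabeledTurn {
  human: Label; // hand label
  judge: Label; // label produced by the Haiku judge
}

// Fraction of turns where the judge matches the hand label.
function agreementRate(turns: LabeledTurn[]): number {
  if (turns.length === 0) {
    throw new Error("need at least one labeled turn");
  }
  const matches = turns.filter((t) => t.human === t.judge).length;
  return matches / turns.length;
}

// Below the threshold, fall back to the stronger judge.
function selectJudge(
  turns: LabeledTurn[],
  threshold: number = 0.85,
): "haiku" | "sonnet" {
  return agreementRate(turns) < threshold ? "sonnet" : "haiku";
}

// Helper for the example: `matches` agreeing turns out of `total`.
function sampleTurns(matches: number, total: number): LabeledTurn[] {
  return Array.from({ length: total }, (_, i): LabeledTurn => ({
    human: "on_topic",
    judge: i < matches ? "on_topic" : "off_topic",
  }));
}

console.log(selectJudge(sampleTurns(42, 50))); // 42/50 = 0.84, below 0.85
console.log(selectJudge(sampleTurns(43, 50))); // 43/50 = 0.86, at or above 0.85
```

With 50 turns, the threshold falls between 42 and 43 matching labels, so the first call selects the fallback judge and the second keeps Haiku.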

Test plan

- All 8 sections present and self-contained
- Every metric in §2 has formula + baseline + target + window (no vibes language)
- §3 reuses existing infrastructure (Mongo connection, existing Stop-hook slot), no new infra invented
- §6 has explicit numeric kill criteria per phase, single-flag revert
- §7 has both detection and mitigation for every failure mode
- §8 decisions are resolved with rationale, not left as open questions
- Reviewer: confirm metric definitions are what you want to optimize for
- Reviewer: confirm the Leg 0 → Leg 1 → Leg 2 → Leg 3 sequencing matches your ordering preference
- Reviewer: confirm the single-flag revert mechanism is sufficient (no other data written beyond kms_quality_log)

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Added comprehensive KMS context injection quality specification that defines measurable metrics, quality assessment processes, feedback mechanisms, and phased rollout plan for continuous improvement.

Contract for how we fix KMS context injection, which is currently unmeasured and surfacing ~22% on-topic items (anecdotal). Before any code lands, this spec pins down what "working right" means quantitatively, how we measure it, how measurement flows back into ranking, and how we roll it out with kill criteria per phase.

Eight sections:

- §1 Context & current state
- §2 Four orthogonal metrics (on-topic, usage, contradiction, discovery) with formulas, baselines, targets, and rolling windows
- §3 Measurement infrastructure: async Stop-hook scorer, new kms_quality_log collection in the existing KMS MongoDB database, Haiku 4.5 judge with Sonnet 4.6 auto-fallback if judge agreement <85% on 50 hand-labeled turns
- §4 Feedback loop: effective_confidence = stored × (1 + α·usage − β·contradiction) with α=0.5, β=2.0 (contradictions punish 4x harder than uses reward)
- §5 Closed corrective loop: contradictions auto-stage kms_supersede suggestions that surface in the next session's injected bundle
- §6 Four-phase rollout (Leg 0 measurement → Leg 1 Mem0 → Leg 2 Neo4j → Leg 3 MongoDB + closed loop), each with numeric entry/exit/kill criteria and a one-week measurement window minimum
- §7 Seven failure modes with detection + mitigation, single-flag revert (KMS_QUALITY_RANKING_ENABLED=false)
- §8 Four decisions signed off with rationale

Deliverable is the spec itself: no code, no schema, no hooks. The implementation plan for Leg 0 is a separate PR that references this spec as its acceptance contract.

Complements the corrective tools in PR #36: that PR added the surface for correcting wrong entries; this spec defines the feedback loop that decides *which* entries to correct and *when*.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-13T02:16:02Z

Walkthrough

Adds a new documentation file specifying a complete quality measurement and improvement framework for KMS context injection, including quantitative metrics, a measurement pipeline architecture, feedback loops for ranking corrections, and a phased rollout plan with explicit exit criteria and a revert mechanism.

Changes

| Cohort / File(s) | Summary |
|---|---|
| KMS Quality Specification `docs/CONTEXT_INJECTION_QUALITY_SPEC.md` | Introduces comprehensive quality spec with four metrics (on-topic rate, usage rate, contradiction rate, discovery rate), asynchronous scoring pipeline using Anthropic judge, MongoDB storage strategy with TTL, `effective_confidence` ranking feedback loop, correction workflow with `kms_supersede` suggestions, and phased rollout plan (4 legs) with kill switches and numeric thresholds. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐰 A quality spec hops into view,
With metrics so shiny and new,
Four measures of grace,
A judge keeps the pace,
While corrections bloom red, white, and blue! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: a new KMS context injection quality specification document with sign-off decisions. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

gemini-code-assist

Code Review

This pull request introduces a comprehensive quality specification for KMS context injection, defining quantitative metrics, measurement infrastructure, and a self-healing feedback loop for ranking stored facts. The documentation is well-structured and includes a phased rollout plan with clear kill criteria. The review feedback identifies three areas for improvement: clarifying the Discovery rate formula to account for cross-session events, adding a safety clamp to the effective confidence formula to prevent negative values, and specifying the smoothing factor for the exponential moving averages used in scoring.

gemini-code-assist · 2026-04-13T02:16:53Z

+| **On-topic injection rate** | `(# injected items judged relevant to user prompt) / (# injected items)` | Stop-hook scorer (LLM judge over `{prompt, injected_item}`) | ~22% (anecdotal) | ≥80% | rolling 7d |
+| **Usage rate** | `(# injected items the next assistant turn referenced or acted on) / (# injected items)` | Stop-hook scorer (LLM judge over `{injected_item, assistant_response}`) | unknown | ≥50% | rolling 7d |
+| **Contradiction rate** | `(# injected items the assistant explicitly contradicted) / (# injected items)` | Stop-hook scorer; auto-stages a `kms_supersede` suggestion | unknown | ≤5% | rolling 7d |
+| **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |


The Discovery rate formula (# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted) is slightly ambiguous because §5 explains that suggestions are surfaced in the next session. This implies the "warranted" event and the "happened" event occur in different turns. The definition should clarify how these events are linked across sessions to ensure the metric is accurately measurable.

Suggested change

| **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |

| **Discovery rate** | (# warranted corrections resolved via tool) / (# warranted corrections identified) | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
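
One way to make the cross-session linkage measurable is to track each warranted correction as a record that is identified in one turn and resolved later. The sketch below assumes exactly that; the `WarrantedCorrection` shape and its field names are hypothetical and appear in neither the spec nor this review. Only the ratio and the rolling 7-day window come from the table.

```typescript
// Hypothetical record: one per contradiction the scorer judged to warrant
// a supersede/flag. Resolution may happen in a later session.
interface WarrantedCorrection {
  itemId: string;
  identifiedAt: Date; // when the scorer flagged the contradiction
  resolvedAt?: Date;  // when a corrective tool was invoked for the item, if ever
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// (# warranted corrections resolved via tool) / (# warranted corrections identified),
// over corrections identified inside the rolling window ending at `now`.
// Returns null when nothing was identified, to avoid reporting 0/0 as a rate.
function discoveryRate(
  corrections: WarrantedCorrection[],
  now: Date,
  windowDays: number = 7,
): number | null {
  const end = now.getTime();
  const start = end - windowDays * MS_PER_DAY;
  const identified = corrections.filter((c) => {
    const t = c.identifiedAt.getTime();
    return t >= start && t <= end;
  });
  if (identified.length === 0) {
    return null;
  }
  const resolved = identified.filter(
    (c) => c.resolvedAt !== undefined && c.resolvedAt.getTime() <= end,
  ).length;
  return resolved / identified.length;
}
```

Keying on the identification time means a correction resolved in the next session still counts toward the window in which it was flagged. Whether that is the intended attribution is a decision for the spec.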

gemini-code-assist · 2026-04-13T02:16:53Z

+The structural fix. Today `UnifiedSearchTool.rankResults` (`src/tools/UnifiedSearchTool.ts:430-444`) sorts purely on `result.confidence`. The spec defines an **effective confidence** that combines stored confidence with usage signal:
+
+```
+effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)


The effective_confidence formula can produce negative values if the contradiction_score is high (e.g., with the default β=2.0 and a usage_score of 0, any contradiction_score > 0.5 results in a negative multiplier; with α=0.5 and a usage_score of 1, the threshold rises to 0.75). Since the ranking logic likely expects non-negative scores, the specification should explicitly include clamping the result to a minimum of zero.

Suggested change

effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)

effective_confidence = max(0, stored_confidence × (1 + α·usage_score − β·contradiction_score))
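
A minimal sketch of the clamped formula combined with the single-flag revert follows. It is not the code in `UnifiedSearchTool.rankResults`: the `ScoredItem` shape and the function names are hypothetical, and the flag is passed as a boolean parameter instead of being read from `KMS_QUALITY_RANKING_ENABLED`, so the example stays self-contained.

```typescript
// Hypothetical item shape; the real result type in the repo may differ.
interface ScoredItem {
  id: string;
  confidence: number;         // stored confidence
  usageScore: number;         // assumed to be in [0, 1]
  contradictionScore: number; // assumed to be in [0, 1]
}

const ALPHA = 0.5; // usage weight from the spec
const BETA = 2.0;  // contradiction weight from the spec

// Clamped variant suggested in the review: never returns a negative score.
function effectiveConfidence(
  item: ScoredItem,
  alpha: number = ALPHA,
  beta: number = BETA,
): number {
  const multiplier =
    1 + alpha * item.usageScore - beta * item.contradictionScore;
  return Math.max(0, item.confidence * multiplier);
}

// With the flag off, ordering falls back to stored confidence only,
// which is the behavior the PR describes for today's ranking.
function rankByQuality(
  items: ScoredItem[],
  qualityRankingEnabled: boolean,
): ScoredItem[] {
  const score = (i: ScoredItem): number =>
    qualityRankingEnabled ? effectiveConfidence(i) : i.confidence;
  // Copy before sorting so the caller's array is left untouched.
  return [...items].sort((a, b) => score(b) - score(a));
}

const items: ScoredItem[] = [
  { id: "stale", confidence: 0.9, usageScore: 0, contradictionScore: 0.8 },
  { id: "useful", confidence: 0.6, usageScore: 0.5, contradictionScore: 0 },
];

console.log(rankByQuality(items, true).map((i) => i.id));  // "useful" first
console.log(rankByQuality(items, false).map((i) => i.id)); // "stale" first
```

The "stale" item has an unclamped multiplier of 1 − 1.6 = −0.6, so its effective confidence clamps to zero and it drops below "useful" whenever the flag is on.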

gemini-code-assist · 2026-04-13T02:16:54Z

+- `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.
+- `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.


The specification uses the term "rolling EMA" for usage_score and contradiction_score but does not define the smoothing factor (often denoted as α or γ). Defining this parameter is necessary to ensure consistent implementation of the feedback loop's responsiveness across different components.

Suggested change

- `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.

- `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.

- `usage_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(uses) / (injections)` from `kms_quality_rollup`.

- `contradiction_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(contradictions) / (injections)`.
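
For reference, the update rule implied by a smoothing factor can be sketched as below. γ=0.3 is the value proposed in this suggestion. Everything else is an assumption made for illustration: the function names, seeding the average with the first observation, and feeding in one `(uses) / (injections)` ratio per rollup period.

```typescript
// One EMA step: new = γ·observation + (1 − γ)·previous.
// `previous === null` means no history yet, so the first observation seeds the average.
function emaUpdate(
  previous: number | null,
  observation: number,
  gamma: number = 0.3,
): number {
  if (!(gamma > 0 && gamma <= 1)) {
    throw new RangeError("gamma must be in (0, 1]");
  }
  if (previous === null) {
    return observation;
  }
  return gamma * observation + (1 - gamma) * previous;
}

// Fold a series of per-period ratios into a single score.
function emaOver(ratios: number[], gamma: number = 0.3): number | null {
  let score: number | null = null;
  for (const r of ratios) {
    score = emaUpdate(score, r, gamma);
  }
  return score;
}

// Example: usage ratio jumps from 0.5 to 1.0 in the second period.
// Second step is 0.3 * 1.0 + 0.7 * 0.5, roughly 0.65.
console.log(emaOver([0.5, 1.0]));
```

A larger γ makes the score react faster to recent periods; a smaller one makes it more stable. That trade-off is the reason the parameter needs to be pinned in the spec.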

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/CONTEXT_INJECTION_QUALITY_SPEC.md`:
- Around line 33-35: Three fenced code blocks in the document lack language
identifiers: the block that contains the literal
"docs/CONTEXT_INJECTION_QUALITY_SPEC.md", the ASCII-art block starting with
"┌─────────────────────────────────────────────────────────────┐" and the block
containing the formula starting with "effective_confidence = stored_confidence ×
(1 + α·usage_score − β·contradiction_score)". Add appropriate fenced-code
language tags (e.g., ```text for plain text/ascii art and ```math or ```text for
the formula depending on your linting preference) to each fence so they comply
with MD040; ensure you replace the opening ``` with ```text (or another suitable
language) for those three specific blocks.
- Line 138: Clarify the ambiguous "content excerpt (full text, first 500 chars)"
in the injected_items[] schema by deciding and stating explicitly whether the
field stores the full document or only the excerpt; update the wording for the
"content excerpt" entry to either "content (full text)" or "content_excerpt
(first 500 characters)" and, if keeping an excerpt, add the exact truncation
rule and character limit and whether it uses Unicode codepoints or bytes; ensure
the change is applied to the injected_items[] description so implementers of the
scorer/log schema know the precise field name and length behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4352a016-0c3e-4f66-8fa8-0089508f0f95

📥 Commits

Reviewing files that changed from the base of the PR and between eff278f and e1e963e.

📒 Files selected for processing (1)

docs/CONTEXT_INJECTION_QUALITY_SPEC.md

coderabbitai · 2026-04-13T02:18:04Z

+```
+docs/CONTEXT_INJECTION_QUALITY_SPEC.md
+```


⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks (markdownlint MD040).

The fences starting at Line 33, Line 45, and Line 154 should specify a language for lint compliance.

Markdownlint-compliant diff

-```
+```text
docs/CONTEXT_INJECTION_QUALITY_SPEC.md

-```
+```text
┌─────────────────────────────────────────────────────────────┐
│ This plan │
...

-```
+```text
effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)

Also applies to: 45-102, 154-156

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 33-33: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

coderabbitai · 2026-04-13T02:18:04Z

+- **What it captures per turn**:
+  - `session_id`, `turn_id`, `timestamp`
+  - `user_prompt` (truncated)
+  - `injected_items[]` — id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection


⚠️ Potential issue | 🟡 Minor

Clarify the excerpt definition to avoid implementation drift.

Line 138 mixes “full text” with “first 500 chars,” which can be interpreted two ways. Please make it unambiguous (store full content vs store excerpt only) so the scorer/log schema is implemented consistently.

Suggested wording tweak

- - `injected_items[]` — id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection
+ - `injected_items[]` — id, source backend, **content_excerpt (first 500 chars only)**, confidence at time of injection
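
If the excerpt option is chosen, the truncation rule could be pinned down along these lines. This sketch assumes the limit counts Unicode code points, which is only one of the options raised in this review (bytes is the other). The function name and the `InjectedItemLog` shape, including its field names, are hypothetical.

```typescript
// Hypothetical log entry shape based on the fields listed in the diff above.
interface InjectedItemLog {
  id: string;
  sourceBackend: string;
  contentExcerpt: string;        // first 500 code points of the content
  confidenceAtInjection: number;
}

// Truncate by Unicode code points. Array.from iterates a string by code point,
// so surrogate pairs (e.g. emoji) are never split in half. Note that this does
// not protect multi-code-point grapheme clusters such as combined emoji.
function excerptByCodePoints(text: string, limit: number = 500): string {
  const codePoints = Array.from(text);
  if (codePoints.length <= limit) {
    return text;
  }
  return codePoints.slice(0, limit).join("");
}

const entry: InjectedItemLog = {
  id: "item-1",
  sourceBackend: "mem0",
  contentExcerpt: excerptByCodePoints("x".repeat(600)),
  confidenceAtInjection: 0.8,
};

console.log(Array.from(entry.contentExcerpt).length); // 500
```

A naive `text.slice(0, 500)` counts UTF-16 code units instead and can cut an astral character in half, which is why the counting unit needs to be stated in the spec.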

gemini-code-assist Bot reviewed Apr 13, 2026

coderabbitai Bot reviewed Apr 13, 2026

ryaker mentioned this pull request Apr 13, 2026

fix: return JSON stubs for OAuth discovery probes (Re-authenticate error) #38

Merged

5 tasks

ryaker wants to merge 1 commit into main from docs/kms-quality-spec
