docs: KMS context injection quality spec (signed off)#37
Contract for how we fix KMS context injection, which is currently unmeasured and surfacing ~22% on-topic items (anecdotal). Before any code lands, this spec pins down what "working right" means quantitatively, how we measure it, how measurement flows back into ranking, and how we roll it out with kill criteria per phase.

Eight sections:

- §1 Context & current state
- §2 Four orthogonal metrics (on-topic, usage, contradiction, discovery) with formulas, baselines, targets, and rolling windows
- §3 Measurement infrastructure: async Stop-hook scorer, new `kms_quality_log` collection in the existing KMS MongoDB database, Haiku 4.5 judge with Sonnet 4.6 auto-fallback if judge agreement <85% on 50 hand-labeled turns
- §4 Feedback loop: `effective_confidence = stored × (1 + α·usage − β·contradiction)` with α=0.5, β=2.0 (contradictions punish 4x harder than uses reward)
- §5 Closed corrective loop: contradictions auto-stage `kms_supersede` suggestions that surface in the next session's injected bundle
- §6 Four-phase rollout (Leg 0 measurement → Leg 1 Mem0 → Leg 2 Neo4j → Leg 3 MongoDB + closed loop), each with numeric entry/exit/kill criteria and a one-week measurement window minimum
- §7 Seven failure modes with detection + mitigation, single-flag revert (`KMS_QUALITY_RANKING_ENABLED=false`)
- §8 Four decisions signed off with rationale

Deliverable is the spec itself: no code, no schema, no hooks. The implementation plan for Leg 0 is a separate PR that references this spec as its acceptance contract.

Complements the corrective tools in PR #36: that PR added the surface for correcting wrong entries; this spec defines the feedback loop that decides *which* entries to correct and *when*.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

A new documentation file specifies a complete quality measurement and improvement framework for KMS context injection, including quantitative metrics, measurement pipeline architecture, feedback loops for ranking corrections, and a phased rollout plan with explicit exit criteria and a revert mechanism.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Code Review
This pull request introduces a comprehensive quality specification for KMS context injection, defining quantitative metrics, measurement infrastructure, and a self-healing feedback loop for ranking stored facts. The documentation is well-structured and includes a phased rollout plan with clear kill criteria. The review feedback identifies three areas for improvement: clarifying the Discovery rate formula to account for cross-session events, adding a safety clamp to the effective confidence formula to prevent negative values, and specifying the smoothing factor for the exponential moving averages used in scoring.
| Metric | Formula | Measured by | Baseline | Target | Window |
|---|---|---|---|---|---|
| **On-topic injection rate** | `(# injected items judged relevant to user prompt) / (# injected items)` | Stop-hook scorer (LLM judge over `{prompt, injected_item}`) | ~22% (anecdotal) | ≥80% | rolling 7d |
| **Usage rate** | `(# injected items the next assistant turn referenced or acted on) / (# injected items)` | Stop-hook scorer (LLM judge over `{injected_item, assistant_response}`) | unknown | ≥50% | rolling 7d |
| **Contradiction rate** | `(# injected items the assistant explicitly contradicted) / (# injected items)` | Stop-hook scorer; auto-stages a `kms_supersede` suggestion | unknown | ≤5% | rolling 7d |
| **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
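
The first three formulas in the table share one denominator, the number of injected items. A minimal sketch of that arithmetic, assuming the Stop-hook scorer emits one boolean verdict per injected item and that the caller has already restricted the input to the rolling 7-day window; `ItemVerdict` and `injectionRates` are hypothetical names, not part of the spec:

```typescript
// Hypothetical per-item verdict produced by the Stop-hook scorer's LLM judge.
interface ItemVerdict {
  onTopic: boolean;      // judged relevant to the user prompt
  used: boolean;         // referenced or acted on by the next assistant turn
  contradicted: boolean; // explicitly contradicted by the assistant
}

interface InjectionRates {
  onTopicRate: number;
  usageRate: number;
  contradictionRate: number;
}

// All three rates divide by the number of injected items.
// Returns null for an empty window so "no data" is not mistaken for 0%.
function injectionRates(verdicts: ItemVerdict[]): InjectionRates | null {
  const total = verdicts.length;
  if (total === 0) return null;
  const count = (pick: (v: ItemVerdict) => boolean): number =>
    verdicts.filter(pick).length;
  return {
    onTopicRate: count((v) => v.onTopic) / total,
    usageRate: count((v) => v.used) / total,
    contradictionRate: count((v) => v.contradicted) / total,
  };
}
```

Returning `null` for an empty window is a design choice of this sketch; the spec excerpt does not say how an empty window should be reported.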
The Discovery rate formula (# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted) is slightly ambiguous because §5 explains that suggestions are surfaced in the next session. This implies the "warranted" event and the "happened" event occur in different turns. The definition should clarify how these events are linked across sessions to ensure the metric is accurately measurable.
Suggested change
- | **Discovery rate** | `(# turns where supersede/flag was warranted AND happened) / (# turns where it was warranted)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
+ | **Discovery rate** | `(# warranted corrections resolved via tool) / (# warranted corrections identified)` | Stop-hook scorer (judges from contradiction signal) | ~0% | ≥70% | rolling 7d |
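
One way to make the cross-session linking concrete is to key both events on the stored item they concern. The sketch below assumes that linkage and counts a warranted correction as resolved only if a tool resolution for the same item arrives at or after the moment it was identified. `WarrantedCorrection`, `CorrectionResolution`, and `discoveryRate` are hypothetical names, and linking by item id is an assumption rather than something the spec excerpt states:

```typescript
// Hypothetical record: the scorer decided a stored item needs correcting.
interface WarrantedCorrection {
  itemId: string;
  identifiedAt: number; // epoch milliseconds
}

// Hypothetical record: a corrective tool call (e.g. a supersede) landed.
interface CorrectionResolution {
  itemId: string;
  resolvedAt: number; // epoch milliseconds
}

// Discovery rate = resolved warranted corrections / identified warranted corrections.
// A resolution in a later session still counts, because matching is by item id
// and time order rather than by turn.
function discoveryRate(
  warranted: WarrantedCorrection[],
  resolutions: CorrectionResolution[],
): number | null {
  if (warranted.length === 0) return null;
  const resolved = warranted.filter((w) =>
    resolutions.some(
      (r) => r.itemId === w.itemId && r.resolvedAt >= w.identifiedAt,
    ),
  ).length;
  return resolved / warranted.length;
}
```

A resolution that predates the identification is ignored here, so an old correction cannot satisfy a newly detected contradiction on the same item.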
The structural fix. Today `UnifiedSearchTool.rankResults` (`src/tools/UnifiedSearchTool.ts:430-444`) sorts purely on `result.confidence`. The spec defines an **effective confidence** that combines stored confidence with usage signal:

```
effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
```
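
A rough sketch of how a ranking step could switch between the two orderings, using the formula exactly as written above with α=0.5 and β=2.0. `RankableResult`, `effectiveConfidenceOf`, and `rankByQuality` are hypothetical names, descending order is assumed, and the boolean parameter stands in for however `KMS_QUALITY_RANKING_ENABLED` is actually read:

```typescript
// Hypothetical shape; the real result type in UnifiedSearchTool.ts may differ.
interface RankableResult {
  id: string;
  confidence: number;         // stored confidence
  usageScore: number;         // assumed to lie in [0, 1]
  contradictionScore: number; // assumed to lie in [0, 1]
}

const ALPHA = 0.5; // usage weight from the spec
const BETA = 2.0;  // contradiction weight from the spec

// The formula as stated in the spec excerpt, with no extra safeguards.
function effectiveConfidenceOf(r: RankableResult): number {
  return r.confidence * (1 + ALPHA * r.usageScore - BETA * r.contradictionScore);
}

// Sorts a copy in descending score order. With the flag off, ordering falls
// back to stored confidence, which is the single-flag revert path.
function rankByQuality(
  results: RankableResult[],
  qualityRankingEnabled: boolean,
): RankableResult[] {
  const score: (r: RankableResult) => number = qualityRankingEnabled
    ? effectiveConfidenceOf
    : (r) => r.confidence;
  return [...results].sort((a, b) => score(b) - score(a));
}
```

Passing the flag as a parameter keeps the function pure and easy to test; reading the environment variable would happen once at the call site.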
The `effective_confidence` formula can produce negative values when `contradiction_score` is high. With the defaults α=0.5 and β=2.0, the multiplier goes negative once `contradiction_score` exceeds 0.5 + 0.25·`usage_score`: above 0.5 for an item with no usage, and above 0.75 even at full usage. Since the ranking logic likely expects non-negative scores, the specification should explicitly clamp the result to a minimum of zero.
Suggested change
- effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
+ effective_confidence = max(0, stored_confidence × (1 + α·usage_score − β·contradiction_score))
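
A short worked derivation of where the unclamped multiplier turns negative, writing $u$ for `usage_score` and $c$ for `contradiction_score` and assuming both lie in $[0, 1]$ (that range is an assumption about how the scores are seeded and updated):

```math
\begin{aligned}
1 + \alpha u - \beta c < 0 \;&\Longleftrightarrow\; c > \frac{1 + \alpha u}{\beta} \\
\alpha = 0.5,\ \beta = 2.0 \;&\Longrightarrow\; c > 0.5 + 0.25\,u
\end{aligned}
```

As an illustrative case with made-up numbers, a stored confidence of 0.8 with $u = 0.2$ and $c = 0.7$ gives a multiplier of $1 + 0.1 - 1.4 = -0.3$, so the unclamped result is $-0.24$ and the clamped result is $0$.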
- `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.
- `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.
The specification uses the term "rolling EMA" for usage_score and contradiction_score but does not define the smoothing factor (often denoted as α or γ). Defining this parameter is necessary to ensure consistent implementation of the feedback loop's responsiveness across different components.
Suggested change
- - `usage_score`: rolling EMA over the item's `(uses) / (injections)` from `kms_quality_rollup`.
- - `contradiction_score`: rolling EMA over the item's `(contradictions) / (injections)`.
+ - `usage_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(uses) / (injections)` from `kms_quality_rollup`.
+ - `contradiction_score`: rolling EMA (smoothing factor γ=0.3) over the item's `(contradictions) / (injections)`.
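
For reference, a minimal sketch of the standard EMA update with the suggested γ=0.3, where each observation is one period's ratio such as `(uses) / (injections)`. Seeding the average with the first observation is an assumption of this sketch, as are the names `GAMMA` and `updateEma`:

```typescript
const GAMMA = 0.3; // smoothing factor proposed in the suggestion above

// Standard exponential moving average step:
//   next = gamma * observation + (1 - gamma) * previous
// A null previous value means the item has no history yet; the first
// observation is used as the seed (an assumption, not stated in the spec).
function updateEma(
  previous: number | null,
  observation: number,
  gamma: number = GAMMA,
): number {
  if (previous === null) return observation;
  return gamma * observation + (1 - gamma) * previous;
}
```

With γ=0.3, an item whose score was 1 and then sees a period with ratio 0 drops to about 0.7, so several consecutive bad periods are needed before the score approaches zero.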
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/CONTEXT_INJECTION_QUALITY_SPEC.md`:
- Around line 33-35: Three fenced code blocks in the document lack language
identifiers: the block that contains the literal
"docs/CONTEXT_INJECTION_QUALITY_SPEC.md", the ASCII-art block starting with
"┌─────────────────────────────────────────────────────────────┐" and the block
containing the formula starting with "effective_confidence = stored_confidence ×
(1 + α·usage_score − β·contradiction_score)". Add appropriate fenced-code
language tags (e.g., ```text for plain text/ascii art and ```math or ```text for
the formula depending on your linting preference) to each fence so they comply
with MD040; ensure you replace the opening ``` with ```text (or another suitable
language) for those three specific blocks.
- Line 138: Clarify the ambiguous "content excerpt (full text, first 500 chars)"
in the injected_items[] schema by deciding and stating explicitly whether the
field stores the full document or only the excerpt; update the wording for the
"content excerpt" entry to either "content (full text)" or "content_excerpt
(first 500 characters)" and, if keeping an excerpt, add the exact truncation
rule and character limit and whether it uses Unicode codepoints or bytes; ensure
the change is applied to the injected_items[] description so implementers of the
scorer/log schema know the precise field name and length behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4352a016-0c3e-4f66-8fa8-0089508f0f95
📒 Files selected for processing (1)
docs/CONTEXT_INJECTION_QUALITY_SPEC.md
```
docs/CONTEXT_INJECTION_QUALITY_SPEC.md
```
Add language identifiers to fenced code blocks (markdownlint MD040).
The fences starting at Line 33, Line 45, and Line 154 should specify a language for lint compliance.
Markdownlint-compliant diff
-```
+```text
 docs/CONTEXT_INJECTION_QUALITY_SPEC.md
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
 │ This plan │
 ...
-```
+```
-```
+```text
 effective_confidence = stored_confidence × (1 + α·usage_score − β·contradiction_score)
Also applies to: 45-102, 154-156
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 33-33: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
- **What it captures per turn**:
  - `session_id`, `turn_id`, `timestamp`
  - `user_prompt` (truncated)
  - `injected_items[]`: id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection
Clarify the excerpt definition to avoid implementation drift.
Line 138 mixes “full text” with “first 500 chars,” which can be interpreted two ways. Please make it unambiguous (store full content vs store excerpt only) so the scorer/log schema is implemented consistently.
Suggested wording tweak
- - `injected_items[]`: id, source backend, **content excerpt (full text, first 500 chars)**, confidence at time of injection
+ - `injected_items[]`: id, source backend, **content_excerpt (first 500 chars only)**, confidence at time of injection
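
If the excerpt-only reading is chosen, the truncation rule still needs a unit. A minimal sketch that cuts at 500 Unicode code points rather than UTF-16 code units, so a surrogate pair is never split; `EXCERPT_LIMIT` and `contentExcerpt` are hypothetical names, and counting code points is one possible choice, not something the spec excerpt decides:

```typescript
const EXCERPT_LIMIT = 500; // character limit quoted in the spec line above

// Array.from iterates a string by Unicode code point, so slicing the
// resulting array never cuts through a surrogate pair. Multi-code-point
// grapheme clusters (for example, emoji joined by ZWJ) can still be split.
function contentExcerpt(content: string, limit: number = EXCERPT_LIMIT): string {
  return Array.from(content).slice(0, limit).join("");
}
```

Note that the returned string's `.length` can exceed the limit when the text contains astral characters, because `.length` counts UTF-16 code units; a byte-based limit would need a different implementation.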
Summary
Contract for fixing KMS context injection quality. Injection is currently unmeasured and surfacing roughly 22% on-topic items (an anecdotal figure; Leg 0 produces the real number). This spec pins down what "working right" means quantitatively, how we'll measure it, how measurement flows back into ranking, and how we'll roll it out with kill criteria per phase.
Deliverable is the spec itself — no code, no schema, no hooks. The implementation plan for Leg 0 is a separate PR that will reference this spec as its acceptance contract.
Complements the corrective tools in #36 — that PR adds the surface for correcting wrong entries; this spec defines the feedback loop that decides which entries to correct and when.
What the spec covers
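- §1 Context & current state
- §2 Four orthogonal metrics (on-topic, usage, contradiction, discovery) with formulas, baselines, targets, and rolling windows
- §3 Measurement infrastructure: async Stop-hook scorer, new `kms_quality_log` collection in existing Mongo DB, Haiku 4.5 judge with Sonnet 4.6 auto-fallback
- §4 Feedback loop: `effective_confidence = stored × (1 + α·usage − β·contradiction)`, α=0.5, β=2.0
- §5 Closed corrective loop: contradictions auto-stage `kms_supersede` suggestions that surface in the next session's injected bundle
- §6 Four-phase rollout with numeric entry/exit/kill criteria per phase
- §7 Seven failure modes with detection + mitigation, single-flag revert (`KMS_QUALITY_RANKING_ENABLED=false`)
- §8 Four decisions signed off with rationale

Why these four metrics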
They're orthogonal: on-topic rate measures retrieval quality, usage rate measures whether on-topic results were actually load-bearing, contradiction rate flags wrong stored facts, discovery rate measures whether the corrective tools are reaching daily flow. You can't game one without moving another.
Why the β > α asymmetry
β=2.0 vs α=0.5 means a single contradiction punishes ranking ~4× harder than a single use rewards it. This embodies the principle "wrong information is more costly than missing information": wrong facts should fade fast; useful facts should promote cautiously. §7 has telemetry (`false_demotion_rate`) that detects if this is too aggressive, and the mitigation is explicit: raise α, drop β.

What happens after sign-off
`kms_quality_log` + rollup script, no ranking changes

Test plan

`kms_quality_log`)

🤖 Generated with Claude Code