Prompt-eval self-improve loop, ADP/HK doc remedies, and viewer UX polish#17
Open
ffedoroff wants to merge 35 commits into
Open
Prompt-eval self-improve loop, ADP/HK doc remedies, and viewer UX polish#17ffedoroff wants to merge 35 commits into
ffedoroff wants to merge 35 commits into
Conversation
Map interaction fixes surfaced while dogfooding the report:
- Flicker: drop the opacity transition on edges. Animating opacity on
hundreds of edge path/polygon elements promoted each to its own
compositor layer for the tween and flickered the whole page (worse the
more elements there are). Snapping is a single repaint. Confirmed via a
performance trace (7408 opacity Animation events on edges).
- Hover de-flicker: ignore the synthetic re-mouseenter that raisePaint's
reparent fires on the same node; ignore a cluster mouseleave while the
pointer is still inside the cluster bbox (crossing sibling nodes/edges
painted over the folder background no longer toggles the highlight).
- Navigation: clicking a single-file box (files-level, key == node id)
opens the file modal instead of drilling into a degenerate single-file
group (which also leaked a {target}-prefixed id into the breadcrumb);
Shift-click now selects file-node boxes in the overview too.
- Open-source modifier: Alt/Option on every platform (was Cmd on macOS /
Ctrl elsewhere — Cmd clashed with copy/paste, Ctrl is right-click on
macOS); rename the cursor class .ctrl-link -> .src-link.
- Crate cluster at dig>0 rendered neutral grey (red/pink reserved for the
crate overview at dig 0).
- Layout: .svg-frame min-height 420px; zoom buttons and the shortcut
legend lifted clear of the bottom status bar; tier-menu options share
one style (drop the active .on accent).
Docs synced: docs/code-ranker-viewer/{PRD,DESIGN}.md.
…line --doc catalog Make the vocabulary consistent across CLI, config, snapshot, viewer and docs, and make every principle/metric doc reachable offline. Breaking by design (few clients yet); SARIF's standard `rules` field is the only "rules" kept. Terminology - preset → **principle** everywhere: Rust types (`Preset`→`Principle`, `preset.rs`→`principle.rs`), snapshot JSON field `presets`→`principles`, config `[[presets]]`/`[presets.<ID>]` → `[[principles]]`/`[principles.<ID>]`, viewer JS (`snap.principles`) + CSS classes (`exp-principle-*`), docs. - `--focus-rule` → **`--focus`** (old name removed, no alias). value_name fixed (`check`: RULE|GROUP, `report`: METRIC | PRINCIPLE); dropped the false "Mirrors check's --focus-rule" docstring (the two are different operations). - **rule** is reserved for the `check` gate only (`[rules]`, rule ids, SARIF `rules`); principles/metrics are no longer called "rules". - **lens**/**axis** (the framing concept) → **focus**. Kept the unrelated viewer "reveal-depth lens chip" and OCP's "axis of change". Offline docs - `remediation` now points at `code-ranker report --doc <ID>` instead of a GitHub download URL; `--doc` resolves any base doc by filename stem (`Fan-in`, `metrics`, `AI`) and `--doc cycle` → the ADP doc. - New **`--doc AI`**: an AI-agent overview + a catalog auto-assembled from every base doc's TL;DR (`languages/base/AI.md` + `templates.rs::tldr_index`). - The `cycle` focus prompt now mirrors the ADP prompt (cycle is ADP's metric lens) — `synth_metric_principle` borrows the ADP principle's framing. Goldens regenerated (snapshot key + remediation strings). New unit tests cover the cycle→ADP borrow, `--doc cycle`→ADP, the AI index expansion and filename fallback. `make all` + `make e2e` green.
…to 4.0.0-alpha.1
Three independent format versions, each its own constant, bumped on its own
criterion (documented in docs/versions.md):
- **app** — Cargo `[workspace.package] version` → 4.0.0-alpha.1.
- **config + CLI** — `code_ranker_graph::version::CONFIG_VERSION` ("4.0").
- **JSON snapshot + viewer** — `…::version::SCHEMA_VERSION` ("4.0", was "3").
config:
- `code-ranker.toml` now MUST declare a matching `version`; `config::load` rejects
a missing/mismatched value with a directional hint (older → migrate, newer →
upgrade) instead of a cryptic `unknown field`. `CONFIG_SCHEMA_VERSION` aliases
`CONFIG_VERSION`. Added `version = "4.0"` to the root config + 9 samples.
json + viewer:
- snapshot `schema_version` synced to `SCHEMA_VERSION`; the renderer injects
`window.SCHEMA_VERSION` and the viewer rejects an incompatible swapped-in
snapshot (snap-controls).
The number lives only at its constant — config/e2e fixtures `format!` it from the
constant, never hardcode it. Goldens regenerated (`schema_version`).
docs: new docs/versions.md (the 3 surfaces, where each constant lives, when/how to
bump, branch discipline); config.md documents the required `version`; PRD/DESIGN/
node_schema/e2e schema_version refs synced to 4.0. The /update-docs checklist gains
a format-compatibility/version-bump step (moved early, Step 3).
make all + make e2e green.
The product PRD should carry no on-disk-format or implementation detail: - §5.1 Snapshot File Format: drop the JSON example + field-by-field spec; keep a business note pointing at node_schema.md / DESIGN.md for the shape. - §7.3 Graph JSON Schema: removed (the JSON contract lives in node_schema.md). - §7.2 Plugin Model: replaced the trait / inventory / metric-pipeline internals with a business description (supported languages, --plugin, in-process/offline, no external loading); the mechanism is in DESIGN.md. - Repointed the node_schema.md cross-refs that named the removed PRD §7.3. JSON-shape and internals now live solely in node_schema.md / DESIGN.md.
Replace the maintenance-only `docs` subcommand (and its GitHub Pages corpus step) with a project-facing `ai` command, and stop tests littering .code-ranker/. - remove `docs` subcommand + `templates::build_corpus` + the pages.yml corpus-compose step; the corpus stays binary-embedded (`--doc <ID>`), no longer served over a URL. The HTML report publish in pages.yml stays. - add `ai`: prints the offline AI-agent playbook (`base/AI.md`), no analysis, always exits 0. Resolves the language plugin only to pick output — full playbook + principle/metric catalog when resolved; a brief intro + a dynamic 'Select a language' section (template authored in AI.md, values filled by `ai::fill_select`) when none resolves, with the catalog withheld. - AI.md: expanded product description + a Commands section (check/report/ai/help). - e2e: run report-emitting helpers from a temp cwd so the default json+html pair (always written, since the built-in [output.*] paths are set) no longer lands in the repo's own .code-ranker/. - docs: CLI.md/PRD/DESIGN/ai-skill/templates.md synced. No version bump: the removed/added CLI surface rides this branch's existing CONFIG_VERSION 4.0 (already a major vs main 3.0.2).
The diff-coverage gate flagged 7 added lines with no test: - report.rs `--output.prompt --focus <metric>|<principle>`: two e2e tests for the metric-lens and principle arms of the prompt builder. - templates.rs: extract `catalog_entry` (no-summary arm unreachable via the real corpus) and a shared `with_trailing_newline` helper (DRY with the ai and report `--doc` stdout paths); unit-test both arms. diff-coverage vs origin/main: every added/changed Rust line covered.
When a cycle's only back-edges to the module root are `pub(in crate::…::<ancestor>)` visibility paths (not `use` statements), the smallest fix is to tighten the visibility (`pub(super)`/`pub(crate)`): Code Ranker models the visibility path as a dependency edge, so narrowing it removes the edge with no restructure. Guarded on every consumer already living in the narrower scope (else extract/invert) so cheaper models don't narrow blindly and silence the metric. Surfaced by the prompt-eval reference runs.
- Move the prompt-self-improvement playbook docs/ -> contrib/ and rework it into a closed, self-improving loop: top-level goal, three objectives (quality/cost/clarity), a meta-loop that edits the playbook itself, a metrics.csv schema + scoring rubric, the full run procedure (incl. collecting the agent's own writes into the run dir), and a one-mechanism-per-sweep note. - Add contrib/prompt-eval-metrics.py: extracts a run's objective metrics from its artifacts (chat.jsonl, before/after.json, branch git diff) and appends a row to prompt-eval/metrics.csv; judged columns via flags; format-aware token extraction; --dry-run.
…n TODO notes Prompt-eval branches in PROJECT are flat and recur across builds, so the run id <model>-<focus>-<n> collides between prompt versions. Suffix the branch with the code-ranker commit hash -> <model>-<focus>-<n>-<CR_SHA>, unique per prompt version and tied to the run's build dir. The collector now defaults --branch to <run>-<cr_sha> (cr_sha derived from the run path), so loc/files come from the right branch without passing it; same-commit re-runs bump <n>. Also add rust/todo.md: research notes on the Rust plugin's unsafe detection, the two extension paths (numeric counter vs CEL fact-rule), candidate anti-patterns, and the opt-in per-function analysis level. Reference only; not yet acted on.
- meta-loop: a user correction is the strongest signal; fix THIS file before continuing, not after the sweep (the file not changing IS the bug) - algorithm: drive the loop to convergence; a lever edit is unfinished without its verifying re-run; don't pause for permission between iterations - tuning rule: diagnose from the transcript by hand, not from aggregate counts - launching: sub-agent cwd = code-ranker repo makes it 'cargo run' instead of the PATH binary, inflating cost columns
…ng source doc_note read as a soft mid-prompt aside; cheaper tiers explored the code first and re-derived the mechanism the doc already states (verbatim, with a worked example for this very project). Reframe as a directive first step that promises the payoff (cause + smallest fix + worked example). Hypothesis (prompt-eval): cuts pre-edit exploration on the next sonnet-ADP sweep — first_edit_turn / commands / tool_calls / output_tokens drop vs the f6ddc88 sonnet-ADP-1 baseline. Verify-or-revert.
… of base A lever for a cheaper tier's missing remedy must not leak language-specifics (Rust pub(in ...)) into languages/base/AI.md or the neutral defaults.toml; the base lever stays generic and points at the per-language doc.
…fix-loop step Haiku skipped the deep doc (read_doc_focus=0) despite used_generated_prompt=1 and a front-loaded doc_note — the base AI.md fix loop had no doc-read step, so the agent treated it as optional and reached for a heavy crate-level extraction (ADP 16->9 only, new_cycle, -15 tests). Add it as step 3 (not optional), language-neutral; the per-language cause/remedy lives in languages/<lang>/<ID>.md. Hypothesis (prompt-eval): haiku-ADP-2 read_doc_focus 0->1, fix shape -> in-place visibility narrowing, focus_delta -7->-11, new_cycles 1->0, tests preserved.
…efore extract/move haiku-ADP-2 read --doc ADP and produced a correct fix, but over-extracted (moved transformers.rs to a leaf, 11 files) where in-place pub(in...)->pub(super) narrowing (opus/sonnet, +5/-5) was the minimal fix. The visibility remedy was present but sat before prominent extract-to-leaf shapes; cheaper tiers picked the heavier pattern. Reframe as an ordered procedure: find back-edges -> narrow visibility paths first -> extract/split only for genuine use back-edges. Hypothesis (prompt-eval): haiku-ADP-3 keeps quality (cycle gone, tests intact) but shrinks the fix to in-place narrowing -> loc/files near opus/sonnet, quality 4->5.
A cycle whose membership only shrank (subset of a pre-existing cycle the fix partially cleared) registers as new_cycles; it nearly mis-scored haiku-ADP-3 (dc06762) down despite a clean minimal fix. Diff cycle node-sets before scoring.
Branch = <ts>_<CR_SHA> (e.g. 20260623T1849Z_dc06762), matching the evidence folder name 1:1; UTC <ts> makes each run unique (no bump <n>). One run per build dir; pass --branch to the collector (no longer <run>-<CR_SHA>).
…branches A <ts>_<CR_SHA> branch carries no ticket, so a project prepare-commit-msg hook rejects the commit; --no-verify does not skip prepare-commit-msg. Prefix eval commits with a pseudo-ticket (PROMPTEVAL-N:).
Run 'cfs init' (engine v1.5.9, kit sdlc v1.2.1): adds root AGENTS.md/CLAUDE.md navigation blocks and gitignore entries for cf-* agent integrations. The .cf-studio/ dir stays gitignored (pre-existing rule).
cfs validate-toc flagged 15 headings (h3 subsections) missing from their TOCs. Regenerated with 'cfs toc'; additive only (+15 entries), now all TOCs validate.
…etrics) from_snapshots was hardcoded to the cycle metric, so an HK/sloc/cognitive run reported ADP cycle counts in focus_*/worst_*/new_cycles. Now: cycle FOCUS keeps the SCC logic; a metric FOCUS reads the metric off the module nodes -> worst_* = worst module's value (the --top 1 target), focus_* = project-wide sum (flat total beside a dropped worst = relocated, not dissolved), new_cycles blank. Direction from node_attributes schema (higher_better -> worst is min). read_doc_focus alias set is now focus-aware too. Playbook column docs updated to match. Exposed by the first HK run (opus max): worst HK 390825->140697, total 603969->353313.
…it BY ROLE
Reading 3 HK runs' real diffs: opus/sonnet narrowed pub(in) visibility (root
cause, 1 line); haiku followed the doc's only stated remedy ('prefer the split')
and did a mechanical facade split (trait away from impl) that shaved sloc and
even widened field visibility -- HK number down, coupling not separated.
Fix: (1) note pub(in <ancestor>) inflates fan_in artificially (consistent with
ADP.md); (2) state the highest-value HK fix is splitting a multi-ROLE hub by
responsibility (cuts real fan_in AND fan_out), NOT shaving sloc by moving a decl
from its impl. Ordered: narrow artificial fan-in -> split by role -> never
mechanical-split a cohesive role.
Hypothesis (prompt-eval): haiku-HK-2 switches from facade split to visibility
narrowing / role split -> q4->q5, footprint collapses. Verify-or-revert.
doc_rel_path hardcoded metric docs to base/<doc>.md, so a per-language metric doc (languages/rust/HK.md) was embedded but NEVER served: --doc HK always returned the neutral base doc, while --doc ADP (a principle) correctly composed rust/ADP.md. Metric docs now reuse the override the plugin already applied to its principle docs (override_lang, inferred from principle doc_urls): serve <lang>/<doc>.md when present, else base/. Verified: --doc HK now composes rust/HK.md; --doc ADP unchanged; base fallback clean; 145 unit tests pass.
The b32aee6 scaffolding change (doc_note front-load) is embedded verbatim in each report's prompt block, which the e2e goldens compare exactly — so all 9 went stale. Regenerated per docs/e2e.md (report + frozen header). Also re-freezes the rust golden's header, which had been committed with real machine values (branch/paths), and syncs versions 3.0.0-alpha.1 -> 4.0.0-alpha.1. Full workspace test suite green.
haiku-HK on cyberfabric-core's gear.rs (HK 149M) read the doc, picked 'split by role', but split the hub by its own internal seams (one file per trait impl) -> shaved sloc 761->345, fan_in ROSE 9->13, HK only -34%, gear still #1. opus/sonnet instead ran the audiences check, found 8-9 consumers all reached in for one wrong-audience type (AppServices), moved it to a leaf -> fan_in ->1/0, HK ~-100%. Reframe 'Remedies, in order' as audiences-FIRST: trace what each fan-in consumer imports; the answer's shape picks the remedy (same item for many -> move to leaf; visibility path -> narrow; different parts -> split by role). Add an explicit trap: splitting a hub by its internal seams shaves sloc without cutting coupling. Hypothesis (prompt-eval): haiku-HK-2 on gear.rs runs the audiences check and does the wrong-audience extraction (gear.rs HK ~-100%, q2->q5) instead of the capability split.
…t step" This reverts commit 919de6c.
If the agent reads the lever and still doesn't do the named step (and a stronger model on the same prompt does), the gap is model capability, not the prompt: revert the lever (failed hypothesis), record a capability ceiling, stop -- don't burn the 3rd iteration. Observed on cyberfabric-core gear.rs HK (haiku, two reverted levers).
From the cyberfabric-core cycle sweep (20 cycles -> 0, 28 Haiku passes): - step 5 / tests_pass: per-crate green is NOT enough on a multi-crate workspace; gate on cargo check --workspace AND cargo test --workspace --no-run (feature- unified + cfg(test) paths break while the touched crate stays green). Also watch the test COUNT (a fix can drop tests and leave survivors green). - new 'Sweeps' subsection: track net progress (stall on 2 no-net-decrease passes; cap iters), gate compilation at workspace scope between passes / checkpoint, measure from artifacts not the agent's summary, and the cheap-tier-converges-on- cycles-but-not-HK-hubs finding.
- header brand now reads 'code-ranker' (not upper-cased); the exact tool version (snapshot.versions["code-ranker"]) appears small, centred, under the brand on header hover (injected by lib.rs for the __CR_VERSION__ placeholder) - the 'stat' button now toggles the diff-summary popup (second click closes) and carries an .active highlight while open docs: README lists supported languages (Rust stable, the other eight beta); viewer DESIGN/PRD updated for the brand/version + stat toggle.
code-rankerVerdict vs No new violations vs the baseline. 🎉 📦 Full HTML report: see the code-ranker-report artifact on this run. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #17 +/- ##
==========================================
+ Coverage 97.68% 97.79% +0.10%
==========================================
Files 121 123 +2
Lines 13971 14319 +348
==========================================
+ Hits 13648 14003 +355
+ Misses 323 316 -7 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Add a doc_rel_path test for the metric-doc <lang>/ override branch (566fb23): a rust-routing principle + an hk metric whose rust/HK.md exists now asserts the override is served (was only exercised on the base/ fallback path). Also fold the inner early-return into an Option chain (map -> filter -> unwrap_or_else) so the closing brace after the return is gone — llvm-cov flagged it as an uncoverable region. Behaviour unchanged; diff-coverage vs origin/main is now clean.
README: top-row badges for the live code-ranker report, the project website (code-ranker.com) and the GitHub App install; a CI-integration note framing the App as the zero-config per-PR-report option. GitHub Actions Rust guide gets a matching 'prefer zero config -> install the App' cross-reference. Docs-only; no code or format change.
`--config levels.functions=true` was rejected ("unknown config key") — the
inline KEY=VALUE dispatcher had no arm for it, so the functions level could only
be enabled via a TOML file. Add the arm (parse_on_off), extend the inline-keys
test, and note the inline form in config.md.
Bump 4.0.0-alpha.1 -> 4.0.0 (make bump VERSION=4.0.0): workspace Cargo.toml/Cargo.lock and doc version refs. Workspace compiles clean.
Drop the leftover 4.0.0-alpha.1 strings the bump didn't touch: the 9 e2e golden snapshots' embedded code-ranker version, the docs/versions.md examples, and the prompting-self-improve example rows. README status updated from 'pre-alpha' to 4.0.0 (Rust production-ready, other languages beta). e2e + markdownlint green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Accumulated work on this branch. Main groups:
Prompt-eval (self-improving fix-prompt loop)
contrib/prompting-self-improve.mdplaybook: encode session corrections, branch = build-dir naming,new_cyclesfalse-positive on shrunk cycles, prompt-gap-vs-capability-ceiling rule, and workspace-scope verification + sweep-mode lessons (cargo check --workspace+cargo test --workspace --no-run, not just per-crate).contrib/prompt-eval-metrics.py: focus-aware collector for non-cycle metrics.Rust ADP/HK doc remedies
AI.md: reading--doc <principle>is a non-optional fix-loop step; scaffolding front-loads--doc {id}.CLI / docs
doc_overrides, not alwaysbase/.Viewer UX
code-ranker(not upper-cased); exact tool version (snapshot.versions["code-ranker"]) shown small, centred, under the brand on hover.statbutton toggles the diff-summary popup (second click closes) and stays highlighted (.active) while open.Docs / README
Validation:
make allgreen (build / test / clippy / lint / self-check; coverage 96.3% lines), markdownlint + lychee clean.