Skip to content

Prompt-eval self-improve loop, ADP/HK doc remedies, and viewer UX polish#17

Open
ffedoroff wants to merge 35 commits into
mainfrom
refactor/principle-focus-terminology
Open

Prompt-eval self-improve loop, ADP/HK doc remedies, and viewer UX polish#17
ffedoroff wants to merge 35 commits into
mainfrom
refactor/principle-focus-terminology

Conversation

@ffedoroff

Copy link
Copy Markdown
Owner

Accumulated work on this branch. Main groups:

Prompt-eval (self-improving fix-prompt loop)

  • New/expanded contrib/prompting-self-improve.md playbook: encode session corrections, branch = build-dir naming, new_cycles false-positive on shrunk cycles, prompt-gap-vs-capability-ceiling rule, and workspace-scope verification + sweep-mode lessons (cargo check --workspace + cargo test --workspace --no-run, not just per-crate).
  • contrib/prompt-eval-metrics.py: focus-aware collector for non-cycle metrics.

Rust ADP/HK doc remedies

  • ADP: visibility-narrowing as the explicit step-1 remedy (before extract/move).
  • HK: ordered remedies (narrow artificial fan-in, then split by role); an audiences-first lever was trialed and reverted (capability ceiling on god-modules).
  • AI.md: reading --doc <principle> is a non-optional fix-loop step; scaffolding front-loads --doc {id}.

CLI / docs

  • Fix: metric docs route through doc_overrides, not always base/.
  • e2e report goldens regenerated for the front-loaded doc_note.

Viewer UX

  • Header brand reads code-ranker (not upper-cased); exact tool version (snapshot.versions["code-ranker"]) shown small, centred, under the brand on hover.
  • stat button toggles the diff-summary popup (second click closes) and stays highlighted (.active) while open.

Docs / README

  • README lists supported languages (Rust production-ready; the other eight beta).
  • Viewer DESIGN/PRD synced for the brand/version + stat toggle.

Validation: make all green (build / test / clippy / lint / self-check; coverage 96.3% lines), markdownlint + lychee clean.

ffedoroff added 29 commits June 22, 2026 23:23
Map interaction fixes surfaced while dogfooding the report:

- Flicker: drop the opacity transition on edges. Animating opacity on
  hundreds of edge path/polygon elements promoted each to its own
  compositor layer for the tween and flickered the whole page (worse the
  more elements there are). Snapping is a single repaint. Confirmed via a
  performance trace (7408 opacity Animation events on edges).
- Hover de-flicker: ignore the synthetic re-mouseenter that raisePaint's
  reparent fires on the same node; ignore a cluster mouseleave while the
  pointer is still inside the cluster bbox (crossing sibling nodes/edges
  painted over the folder background no longer toggles the highlight).
- Navigation: clicking a single-file box (files-level, key == node id)
  opens the file modal instead of drilling into a degenerate single-file
  group (which also leaked a {target}-prefixed id into the breadcrumb);
  Shift-click now selects file-node boxes in the overview too.
- Open-source modifier: Alt/Option on every platform (was Cmd on macOS /
  Ctrl elsewhere — Cmd clashed with copy/paste, Ctrl is right-click on
  macOS); rename the cursor class .ctrl-link -> .src-link.
- Crate cluster at dig>0 rendered neutral grey (red/pink reserved for the
  crate overview at dig 0).
- Layout: .svg-frame min-height 420px; zoom buttons and the shortcut
  legend lifted clear of the bottom status bar; tier-menu options share
  one style (drop the active .on accent).

Docs synced: docs/code-ranker-viewer/{PRD,DESIGN}.md.
…line --doc catalog

Make the vocabulary consistent across CLI, config, snapshot, viewer and docs,
and make every principle/metric doc reachable offline. Breaking by design
(few clients yet); SARIF's standard `rules` field is the only "rules" kept.

Terminology
- preset → **principle** everywhere: Rust types (`Preset`→`Principle`,
  `preset.rs`→`principle.rs`), snapshot JSON field `presets`→`principles`,
  config `[[presets]]`/`[presets.<ID>]` → `[[principles]]`/`[principles.<ID>]`,
  viewer JS (`snap.principles`) + CSS classes (`exp-principle-*`), docs.
- `--focus-rule` → **`--focus`** (old name removed, no alias). value_name fixed
  (`check`: RULE|GROUP, `report`: METRIC | PRINCIPLE); dropped the false
  "Mirrors check's --focus-rule" docstring (the two are different operations).
- **rule** is reserved for the `check` gate only (`[rules]`, rule ids, SARIF
  `rules`); principles/metrics are no longer called "rules".
- **lens**/**axis** (the framing concept) → **focus**. Kept the unrelated
  viewer "reveal-depth lens chip" and OCP's "axis of change".

Offline docs
- `remediation` now points at `code-ranker report --doc <ID>` instead of a
  GitHub download URL; `--doc` resolves any base doc by filename stem
  (`Fan-in`, `metrics`, `AI`) and `--doc cycle` → the ADP doc.
- New **`--doc AI`**: an AI-agent overview + a catalog auto-assembled from every
  base doc's TL;DR (`languages/base/AI.md` + `templates.rs::tldr_index`).
- The `cycle` focus prompt now mirrors the ADP prompt (cycle is ADP's metric
  lens) — `synth_metric_principle` borrows the ADP principle's framing.

Goldens regenerated (snapshot key + remediation strings). New unit tests cover
the cycle→ADP borrow, `--doc cycle`→ADP, the AI index expansion and filename
fallback. `make all` + `make e2e` green.
…to 4.0.0-alpha.1

Three independent format versions, each its own constant, bumped on its own
criterion (documented in docs/versions.md):

- **app** — Cargo `[workspace.package] version` → 4.0.0-alpha.1.
- **config + CLI** — `code_ranker_graph::version::CONFIG_VERSION` ("4.0").
- **JSON snapshot + viewer** — `…::version::SCHEMA_VERSION` ("4.0", was "3").

config:
- `code-ranker.toml` now MUST declare a matching `version`; `config::load` rejects
  a missing/mismatched value with a directional hint (older → migrate, newer →
  upgrade) instead of a cryptic `unknown field`. `CONFIG_SCHEMA_VERSION` aliases
  `CONFIG_VERSION`. Added `version = "4.0"` to the root config + 9 samples.

json + viewer:
- snapshot `schema_version` synced to `SCHEMA_VERSION`; the renderer injects
  `window.SCHEMA_VERSION` and the viewer rejects an incompatible swapped-in
  snapshot (snap-controls).

The number lives only at its constant — config/e2e fixtures `format!` it from the
constant, never hardcode it. Goldens regenerated (`schema_version`).

docs: new docs/versions.md (the 3 surfaces, where each constant lives, when/how to
bump, branch discipline); config.md documents the required `version`; PRD/DESIGN/
node_schema/e2e schema_version refs synced to 4.0. The /update-docs checklist gains
a format-compatibility/version-bump step (moved early, Step 3).

make all + make e2e green.
The product PRD should carry no on-disk-format or implementation detail:
- §5.1 Snapshot File Format: drop the JSON example + field-by-field spec; keep a
  business note pointing at node_schema.md / DESIGN.md for the shape.
- §7.3 Graph JSON Schema: removed (the JSON contract lives in node_schema.md).
- §7.2 Plugin Model: replaced the trait / inventory / metric-pipeline internals
  with a business description (supported languages, --plugin, in-process/offline,
  no external loading); the mechanism is in DESIGN.md.
- Repointed the node_schema.md cross-refs that named the removed PRD §7.3.

JSON-shape and internals now live solely in node_schema.md / DESIGN.md.
Replace the maintenance-only `docs` subcommand (and its GitHub Pages corpus
step) with a project-facing `ai` command, and stop tests littering .code-ranker/.

- remove `docs` subcommand + `templates::build_corpus` + the pages.yml
  corpus-compose step; the corpus stays binary-embedded (`--doc <ID>`), no longer
  served over a URL. The HTML report publish in pages.yml stays.
- add `ai`: prints the offline AI-agent playbook (`base/AI.md`), no analysis,
  always exits 0. Resolves the language plugin only to pick output — full playbook
  + principle/metric catalog when resolved; a brief intro + a dynamic
  'Select a language' section (template authored in AI.md, values filled by
  `ai::fill_select`) when none resolves, with the catalog withheld.
- AI.md: expanded product description + a Commands section (check/report/ai/help).
- e2e: run report-emitting helpers from a temp cwd so the default json+html pair
  (always written, since the built-in [output.*] paths are set) no longer lands in
  the repo's own .code-ranker/.
- docs: CLI.md/PRD/DESIGN/ai-skill/templates.md synced.

No version bump: the removed/added CLI surface rides this branch's existing
CONFIG_VERSION 4.0 (already a major vs main 3.0.2).
The diff-coverage gate flagged 7 added lines with no test:
- report.rs `--output.prompt --focus <metric>|<principle>`: two e2e tests
  for the metric-lens and principle arms of the prompt builder.
- templates.rs: extract `catalog_entry` (no-summary arm unreachable via the
  real corpus) and a shared `with_trailing_newline` helper (DRY with the ai
  and report `--doc` stdout paths); unit-test both arms.

diff-coverage vs origin/main: every added/changed Rust line covered.
When a cycle's only back-edges to the module root are
`pub(in crate::…::<ancestor>)` visibility paths (not `use` statements), the
smallest fix is to tighten the visibility (`pub(super)`/`pub(crate)`): Code
Ranker models the visibility path as a dependency edge, so narrowing it removes
the edge with no restructure. Guarded on every consumer already living in the
narrower scope (else extract/invert) so cheaper models don't narrow blindly and
silence the metric. Surfaced by the prompt-eval reference runs.
- Move the prompt-self-improvement playbook docs/ -> contrib/ and rework it into
  a closed, self-improving loop: top-level goal, three objectives
  (quality/cost/clarity), a meta-loop that edits the playbook itself, a
  metrics.csv schema + scoring rubric, the full run procedure (incl. collecting
  the agent's own writes into the run dir), and a one-mechanism-per-sweep note.
- Add contrib/prompt-eval-metrics.py: extracts a run's objective metrics from
  its artifacts (chat.jsonl, before/after.json, branch git diff) and appends a
  row to prompt-eval/metrics.csv; judged columns via flags; format-aware token
  extraction; --dry-run.
…n TODO notes

Prompt-eval branches in PROJECT are flat and recur across builds, so the run id
<model>-<focus>-<n> collides between prompt versions. Suffix the branch with the
code-ranker commit hash -> <model>-<focus>-<n>-<CR_SHA>, unique per prompt version
and tied to the run's build dir. The collector now defaults --branch to
<run>-<cr_sha> (cr_sha derived from the run path), so loc/files come from the right
branch without passing it; same-commit re-runs bump <n>.

Also add rust/todo.md: research notes on the Rust plugin's unsafe detection, the
two extension paths (numeric counter vs CEL fact-rule), candidate anti-patterns,
and the opt-in per-function analysis level. Reference only; not yet acted on.
- meta-loop: a user correction is the strongest signal; fix THIS file before
  continuing, not after the sweep (the file not changing IS the bug)
- algorithm: drive the loop to convergence; a lever edit is unfinished without
  its verifying re-run; don't pause for permission between iterations
- tuning rule: diagnose from the transcript by hand, not from aggregate counts
- launching: sub-agent cwd = code-ranker repo makes it 'cargo run' instead of
  the PATH binary, inflating cost columns
…ng source

doc_note read as a soft mid-prompt aside; cheaper tiers explored the code
first and re-derived the mechanism the doc already states (verbatim, with a
worked example for this very project). Reframe as a directive first step that
promises the payoff (cause + smallest fix + worked example).

Hypothesis (prompt-eval): cuts pre-edit exploration on the next sonnet-ADP
sweep — first_edit_turn / commands / tool_calls / output_tokens drop vs the
f6ddc88 sonnet-ADP-1 baseline. Verify-or-revert.
… of base

A lever for a cheaper tier's missing remedy must not leak language-specifics
(Rust pub(in ...)) into languages/base/AI.md or the neutral defaults.toml; the
base lever stays generic and points at the per-language doc.
…fix-loop step

Haiku skipped the deep doc (read_doc_focus=0) despite used_generated_prompt=1 and
a front-loaded doc_note — the base AI.md fix loop had no doc-read step, so the agent
treated it as optional and reached for a heavy crate-level extraction (ADP 16->9
only, new_cycle, -15 tests). Add it as step 3 (not optional), language-neutral; the
per-language cause/remedy lives in languages/<lang>/<ID>.md.

Hypothesis (prompt-eval): haiku-ADP-2 read_doc_focus 0->1, fix shape -> in-place
visibility narrowing, focus_delta -7->-11, new_cycles 1->0, tests preserved.
…efore extract/move

haiku-ADP-2 read --doc ADP and produced a correct fix, but over-extracted
(moved transformers.rs to a leaf, 11 files) where in-place pub(in...)->pub(super)
narrowing (opus/sonnet, +5/-5) was the minimal fix. The visibility remedy was
present but sat before prominent extract-to-leaf shapes; cheaper tiers picked the
heavier pattern. Reframe as an ordered procedure: find back-edges -> narrow
visibility paths first -> extract/split only for genuine use back-edges.

Hypothesis (prompt-eval): haiku-ADP-3 keeps quality (cycle gone, tests intact) but
shrinks the fix to in-place narrowing -> loc/files near opus/sonnet, quality 4->5.
A cycle whose membership only shrank (subset of a pre-existing cycle the fix
partially cleared) registers as new_cycles; it nearly mis-scored haiku-ADP-3
(dc06762) down despite a clean minimal fix. Diff cycle node-sets before scoring.
Branch = <ts>_<CR_SHA> (e.g. 20260623T1849Z_dc06762), matching the evidence
folder name 1:1; UTC <ts> makes each run unique (no bump <n>). One run per build
dir; pass --branch to the collector (no longer <run>-<CR_SHA>).
…branches

A <ts>_<CR_SHA> branch carries no ticket, so a project prepare-commit-msg hook
rejects the commit; --no-verify does not skip prepare-commit-msg. Prefix eval
commits with a pseudo-ticket (PROMPTEVAL-N:).
Run 'cfs init' (engine v1.5.9, kit sdlc v1.2.1): adds root AGENTS.md/CLAUDE.md navigation blocks and gitignore entries for cf-* agent integrations. The .cf-studio/ dir stays gitignored (pre-existing rule).
cfs validate-toc flagged 15 headings (h3 subsections) missing from their TOCs. Regenerated with 'cfs toc'; additive only (+15 entries), now all TOCs validate.
…etrics)

from_snapshots was hardcoded to the cycle metric, so an HK/sloc/cognitive run
reported ADP cycle counts in focus_*/worst_*/new_cycles. Now: cycle FOCUS keeps
the SCC logic; a metric FOCUS reads the metric off the module nodes -> worst_* =
worst module's value (the --top 1 target), focus_* = project-wide sum (flat total
beside a dropped worst = relocated, not dissolved), new_cycles blank. Direction
from node_attributes schema (higher_better -> worst is min). read_doc_focus alias
set is now focus-aware too. Playbook column docs updated to match.

Exposed by the first HK run (opus max): worst HK 390825->140697, total 603969->353313.
…it BY ROLE

Reading 3 HK runs' real diffs: opus/sonnet narrowed pub(in) visibility (root
cause, 1 line); haiku followed the doc's only stated remedy ('prefer the split')
and did a mechanical facade split (trait away from impl) that shaved sloc and
even widened field visibility -- HK number down, coupling not separated.

Fix: (1) note pub(in <ancestor>) inflates fan_in artificially (consistent with
ADP.md); (2) state the highest-value HK fix is splitting a multi-ROLE hub by
responsibility (cuts real fan_in AND fan_out), NOT shaving sloc by moving a decl
from its impl. Ordered: narrow artificial fan-in -> split by role -> never
mechanical-split a cohesive role.

Hypothesis (prompt-eval): haiku-HK-2 switches from facade split to visibility
narrowing / role split -> q4->q5, footprint collapses. Verify-or-revert.
doc_rel_path hardcoded metric docs to base/<doc>.md, so a per-language metric
doc (languages/rust/HK.md) was embedded but NEVER served: --doc HK always
returned the neutral base doc, while --doc ADP (a principle) correctly composed
rust/ADP.md. Metric docs now reuse the override the plugin already applied to its
principle docs (override_lang, inferred from principle doc_urls): serve
<lang>/<doc>.md when present, else base/. Verified: --doc HK now composes
rust/HK.md; --doc ADP unchanged; base fallback clean; 145 unit tests pass.
The b32aee6 scaffolding change (doc_note front-load) is embedded verbatim in each
report's prompt block, which the e2e goldens compare exactly — so all 9 went stale.
Regenerated per docs/e2e.md (report + frozen header). Also re-freezes the rust
golden's header, which had been committed with real machine values (branch/paths),
and syncs versions 3.0.0-alpha.1 -> 4.0.0-alpha.1. Full workspace test suite green.
haiku-HK on cyberfabric-core's gear.rs (HK 149M) read the doc, picked 'split by
role', but split the hub by its own internal seams (one file per trait impl) ->
shaved sloc 761->345, fan_in ROSE 9->13, HK only -34%, gear still #1. opus/sonnet
instead ran the audiences check, found 8-9 consumers all reached in for one
wrong-audience type (AppServices), moved it to a leaf -> fan_in ->1/0, HK ~-100%.

Reframe 'Remedies, in order' as audiences-FIRST: trace what each fan-in consumer
imports; the answer's shape picks the remedy (same item for many -> move to leaf;
visibility path -> narrow; different parts -> split by role). Add an explicit trap:
splitting a hub by its internal seams shaves sloc without cutting coupling.

Hypothesis (prompt-eval): haiku-HK-2 on gear.rs runs the audiences check and does the
wrong-audience extraction (gear.rs HK ~-100%, q2->q5) instead of the capability split.
If the agent reads the lever and still doesn't do the named step (and a stronger
model on the same prompt does), the gap is model capability, not the prompt: revert
the lever (failed hypothesis), record a capability ceiling, stop -- don't burn the
3rd iteration. Observed on cyberfabric-core gear.rs HK (haiku, two reverted levers).
From the cyberfabric-core cycle sweep (20 cycles -> 0, 28 Haiku passes):
- step 5 / tests_pass: per-crate green is NOT enough on a multi-crate workspace;
  gate on cargo check --workspace AND cargo test --workspace --no-run (feature-
  unified + cfg(test) paths break while the touched crate stays green). Also watch
  the test COUNT (a fix can drop tests and leave survivors green).
- new 'Sweeps' subsection: track net progress (stall on 2 no-net-decrease passes;
  cap iters), gate compilation at workspace scope between passes / checkpoint,
  measure from artifacts not the agent's summary, and the cheap-tier-converges-on-
  cycles-but-not-HK-hubs finding.
- header brand now reads 'code-ranker' (not upper-cased); the exact tool
  version (snapshot.versions["code-ranker"]) appears small, centred, under the
  brand on header hover (injected by lib.rs for the __CR_VERSION__ placeholder)
- the 'stat' button now toggles the diff-summary popup (second click closes) and
  carries an .active highlight while open

docs: README lists supported languages (Rust stable, the other eight beta);
viewer DESIGN/PRD updated for the brand/version + stat toggle.
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

code-ranker

Verdict vs main: ❔ unknown

No new violations vs the baseline. 🎉

📦 Full HTML report: see the code-ranker-report artifact on this run.

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.79%. Comparing base (c2fb6ac) to head (2901617).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
+ Coverage   97.68%   97.79%   +0.10%     
==========================================
  Files         121      123       +2     
  Lines       13971    14319     +348     
==========================================
+ Hits        13648    14003     +355     
+ Misses        323      316       -7     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Add a doc_rel_path test for the metric-doc <lang>/ override branch (566fb23):
a rust-routing principle + an hk metric whose rust/HK.md exists now asserts the
override is served (was only exercised on the base/ fallback path).

Also fold the inner early-return into an Option chain (map -> filter ->
unwrap_or_else) so the closing brace after the return is gone — llvm-cov flagged
it as an uncoverable region. Behaviour unchanged; diff-coverage vs origin/main
is now clean.
README: top-row badges for the live code-ranker report, the project website
(code-ranker.com) and the GitHub App install; a CI-integration note framing the
App as the zero-config per-PR-report option. GitHub Actions Rust guide gets a
matching 'prefer zero config -> install the App' cross-reference.

Docs-only; no code or format change.
`--config levels.functions=true` was rejected ("unknown config key") — the
inline KEY=VALUE dispatcher had no arm for it, so the functions level could only
be enabled via a TOML file. Add the arm (parse_on_off), extend the inline-keys
test, and note the inline form in config.md.
Bump 4.0.0-alpha.1 -> 4.0.0 (make bump VERSION=4.0.0): workspace Cargo.toml/Cargo.lock
and doc version refs. Workspace compiles clean.
Drop the leftover 4.0.0-alpha.1 strings the bump didn't touch: the 9 e2e golden
snapshots' embedded code-ranker version, the docs/versions.md examples, and the
prompting-self-improve example rows. README status updated from 'pre-alpha' to
4.0.0 (Rust production-ready, other languages beta). e2e + markdownlint green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant