tkellogg · strix-tkellogg · Mar 22, 2026
diff --git a/open_strix/builtin_skills/mountaineering/recipes/README.md b/open_strix/builtin_skills/mountaineering/recipes/README.md
@@ -0,0 +1,32 @@
+# Mountaineering Recipes
+
+Recipes are proven climb patterns — things that have already worked, codified for reuse.
+
+Each recipe encodes **what to look for** (search intents) so the agent knows when and how to set up the climb. The recipe itself is the structured demand; hybrid search (FTS5 + vector) provides the retrieval.
+
+## How to Use
+
+1. **Match the situation to a recipe** — scan trigger conditions
+2. **Run the search intents** — find the specific files, data, and context the recipe needs
+3. **Set up the harness** — use the recipe's harness template
+4. **Climb** — the recipe tells you what "better" looks like
+
+## Available Recipes
+
+| Recipe | When | Hill Shape |
+|--------|------|------------|
+| `paper-deep-read.md` | External paper needs genuine engagement, not just summary | LLM-judged quality of reaction notes |
+| `cross-agent-critique.md` | Document needs substantive review from multiple angles | Convergence across independent reviewers |
+| `effectiveness-analysis.md` | "Is X actually working?" needs data, not vibes | Deterministic metrics from log/event data |
+| `thread-engagement.md` | External post warrants substantive public response | Quality of contribution to ongoing conversation |
+
+## Recipe Structure
+
+Each recipe follows the same format:
+
+- **Trigger** — when this recipe applies
+- **Search intents** — what to look for before starting
+- **Inputs / Outputs** — what goes in, what comes out
+- **Harness sketch** — how to set up the climb
+- **Worked example** — a specific time this pattern succeeded, with enough detail to replicate
+- **Failure modes** — what goes wrong and how to detect it
diff --git a/open_strix/builtin_skills/mountaineering/recipes/cross-agent-critique.md b/open_strix/builtin_skills/mountaineering/recipes/cross-agent-critique.md
@@ -0,0 +1,98 @@
+# Recipe: Cross-Agent Critique
+
+Multiple agents review a document independently, each from their own perspective. The value is complementary coverage — different agents catch different things.
+
+## Trigger
+
+- A substantial document needs review (research write-up, design doc, proposal)
+- The document makes technical claims that could be wrong
+- Multiple agents are available with different strengths
+- The operator wants substantive critique, not polish
+
+**Search intents before starting:**
+- `search for: document's key claims/terms` — what's the prior art?
+- `search for: author's previous work` — what's the track record?
+- `grep: methodology terms` — are the methods well-established or novel?
+- `search for: domain-specific failure modes` — what usually goes wrong in this kind of work?
+
+## Inputs
+
+- Document to review (research write-up, design doc, proposal)
+- Review assignment (what angle each agent should take)
+- Operator's specific concerns (if any — "I'm worried about X")
+
+## Outputs
+
+- Per-agent critique (3-6 specific issues each, with evidence)
+- Coverage map: which issues were caught by multiple agents (high confidence) vs one agent only
+- Recommendation: proceed / revise / rethink
+- Pre-experiment suggestions (if the document proposes experiments)
+
+## Harness Sketch
+
+This is a **parallel climb** — multiple agents working independently, then a synthesis pass.
+
+```
+workspace/
+├── document.md       # The thing being reviewed
+├── assignment/
+│   ├── agent-a.md    # Angle: mathematical/mechanistic
+│   └── agent-b.md    # Angle: structural/methodological
+├── critiques/
+│   ├── agent-a.md    # Output from agent A
+│   └── agent-b.md    # Output from agent B
+└── synthesis.md      # Combined coverage analysis
+```
+
+**Eval criteria (per critique):**
+1. Does it identify a specific error (not just "this could be better")?
+2. Is the error substantiated with evidence or reasoning?
+3. Does it acknowledge genuine strengths (not just finding faults)?
+4. Does it suggest a concrete fix or pre-experiment?
+5. Would the document author need to actually respond to this?
+
+**Score:** Count yes / 5 per critique. Synthesis score: overlap ratio (issues caught by 2+ agents / total issues).
+
+## Worked Example
+
+**Keel's SAE write-up — March 21, 2026**
+
+Tim asked Strix and Verge to critique Keel's ~2000-line research report on weighted MSE and deviation MSE as novel loss modifications for BatchTopK SAE training on IBM Granite (Mamba hybrid) for legal contract clause classification.
+
+**What the search intents found:**
+- Active SAE training work (layers 10/15/20/25/30, TopK k=20 queued)
+- Prior finding: ReLU+L1 can't achieve sparsity at 8x expansion on some models
+- Mamba architecture differences from transformer (norm dynamics, state-space vs attention)
+- Prior Baguettotron SAE work showing deep models rotate features rather than collapse
+
+**Strix's critique (mathematical/mechanistic angle):**
+1. Deviation MSE Formulation B is actually error variance, not "uniqueness pressure" — mathematical intuition in §4.6 is backwards
+2. Orthogonality claim between weighted and deviation MSE is wrong — they're in tension
+3. Missing critical pre-experiment: activation norm ↔ information content correlation assumed from transformer literature, unvalidated for Mamba
+4. Parameterization double-counts MSE
+5. F1 improvement estimates misleadingly precise
+6. Mamba norm dynamics differ from transformer — weighting function may be miscalibrated
+
+**Verge's critique (structural/methodological angle):**
+1. Deviation MSE derivation needs more rigor
+2. F1 improvement estimate not credible at stated precision
+3. Orthogonality overstated
+4. Layer-type effectiveness underdeveloped (Mamba vs attention layers)
+5. Focal loss analogy loose
+6. Plus strengths: experimental protocol solid, risk matrix honest
+
+**What made it work:**
+- Complementary coverage: Strix caught the mathematical formulation error, Verge caught the structural gaps
+- Both independently identified the orthogonality claim as wrong (high-confidence finding)
+- Bottom-line recommendation converged: run norm-separation pre-experiment before committing 200+ GPU-hours
+- The 1-hour pre-experiment could save the entire 240-280 GPU-hour budget
+
+## Failure Modes
+
+| Failure | Detection | Fix |
+|---------|-----------|-----|
+| **Echo chamber** | Both agents find the same issues, miss complementary coverage | Assign deliberately different angles in the assignment |
+| **Praise sandwich** | Critique buried in hedging and compliments | Force structure: issues first, strengths second |
+| **Surface-level** | Issues are stylistic, not substantive ("could be clearer") | Require: "what specific claim is wrong, and why?" |
+| **Performative disagreement** | Agents manufacture friction to seem thorough | Check: would the document author need to actually change something? |
+| **Missing the good** | All critique, no acknowledgment of genuine contributions | Require strengths section — what the document gets right matters too |
diff --git a/open_strix/builtin_skills/mountaineering/recipes/effectiveness-analysis.md b/open_strix/builtin_skills/mountaineering/recipes/effectiveness-analysis.md
@@ -0,0 +1,103 @@
+# Recipe: Effectiveness Analysis
+
+"Is X actually working?" answered with data, not vibes. Write a measurement script, run it against real logs/events, interpret results, identify the actual mechanism.
+
+## Trigger
+
+- A system or process has been running for a while without measurement
+- Someone asks "how effective is X?" or "does X actually work?"
+- An assumption about what's working needs verification
+- Resource allocation decisions depend on knowing what's worth keeping
+
+**Search intents before starting:**
+- `search for: the system in question` — how does it work? what are the moving parts?
+- `grep: event types, log entries` — what data do we already have?
+- `search for: intended behavior` — what was this supposed to do?
+- `search for: similar systems, comparisons` — is there a control group?
+
+## Inputs
+
+- System or process to evaluate (feature, tool, workflow, architecture)
+- Available data sources (event logs, JSONL files, git history, API responses)
+- The question to answer (specific: "what % of surfaced files get read?" not vague: "is it good?")
+
+## Outputs
+
+- Rerunnable measurement script (committed to `scripts/`)
+- Quantitative findings with specific numbers
+- Identification of actual mechanism (what's really working, vs what we thought was working)
+- Recommendation: keep / modify / remove
+
+## Harness Sketch
+
+This is an **iterative analysis climb** — the first measurement often reveals the real question.
+
+```
+workspace/
+├── question.md       # The specific question being answered
+├── script.py         # Measurement script (rerunnable)
+├── data/             # Intermediate outputs
+│   └── summary.md    # Script output summary (disk-first, never pipe large files)
+├── findings.md       # Interpretation of results
+└── recommendation.md # Keep / modify / remove with rationale
+```
+
+**Eval criteria (deterministic):**
+1. Script runs without errors on current data
+2. Results include specific numbers (not "some" or "many")
+3. Findings identify a mechanism, not just a metric
+4. Recommendation follows from the data (not from prior beliefs)
+5. Script is rerunnable for periodic tracking
+
+**CRITICAL: Disk-first pattern.** Never pipe large files (events.jsonl, logs) into agent context. Write analysis to a file, read the summary only. This prevents OOM crashes on large datasets.
+
+## Worked Example
+
+**Half-RAG effectiveness — March 21, 2026**
+
+Tim asked: "How effective is half-RAG actually? For every session, is a surfaced doc actually read?"
+
+**What the search intents found:**
+- Half-RAG (vector search) surfaces ~5 files per session via ChromaDB
+- Events logged in events.jsonl: `vector_search` (files_surfaced) and `tool_call` (Read actions)
+- 3,702 sessions and 18,332 file surfacings in the dataset
+- Verge runs same architecture without vector search (natural control group)
+
+**The script (scripts/half_rag_hits.py):**
+- Correlates `vector_search` events with subsequent `Read` tool calls on those exact files
+- Groups by session (timestamp proximity to claude_session start/end)
+- Breaks down by trigger type (cron vs Discord message)
+
+**Findings:**
+- 2.8% file-level hit rate (511 reads out of 18,332 surfacings)
+- 10.9% session hit rate (403/3,702 sessions had at least one surfaced file read)
+- Cron jobs worse (0.7%) than Discord messages (3.1%)
+- Best performers: topically precise files (family.md 39%, people/lumen.md 36%)
+- Worst: archived files surfaced hundreds of times, read zero
+
+**The mechanism discovery:**
+- Original assumption: vector search provides useful ambient context
+- Actual mechanism: intent-driven search works, ambient surfacing is 97% noise
+- The 39% hit rate files were all high-intent lookups (someone asked about family → family.md surfaced → family.md read)
+- Control group (Verge, no vector search): no felt gap
+
+**Follow-up analysis (DAG):**
+The half-RAG finding led directly to the file reference DAG analysis — measuring whether the implicit pointer graph (file references) was doing better. Answer: 86% stale edges. Both mechanisms were mostly decorative.
+
+**What made it work:**
+- Specific question from Tim, not vague "how's it going?"
+- Real data (events.jsonl had months of logs)
+- Natural control group (Verge) emerged from the ecosystem
+- Script is rerunnable — can track changes over time
+- Finding the mechanism was more valuable than the metric
+
+## Failure Modes
+
+| Failure | Detection | Fix |
+|---------|-----------|-----|
+| **Metric without mechanism** | "Hit rate is 2.8%" with no interpretation | Force: "what's actually happening in the sessions with hits vs misses?" |
+| **Confirming prior beliefs** | Analysis finds what you expected | Run the analysis BEFORE forming the hypothesis. Look at the data first. |
+| **OOM from large data** | Script pipes full events.jsonl into context | Disk-first: write summary to file, read summary only |
+| **One-time analysis** | Script works once, can't be rerun | Commit to scripts/, use relative paths, document data dependencies |
+| **Missing the control group** | No baseline comparison | Search for: what's the closest thing to "this system but without X?" |
+| **Survivorship bias** | Only measuring successes, not the failures | Include denominator: how many times did X fire total, not just when it worked |
diff --git a/open_strix/builtin_skills/mountaineering/recipes/paper-deep-read.md b/open_strix/builtin_skills/mountaineering/recipes/paper-deep-read.md
@@ -0,0 +1,81 @@
+# Recipe: Paper Deep Read
+
+Genuine engagement with an academic paper — not summarization, but reaction. Cross-reference claims against your own knowledge, experiments, and history. Produce reading notes that a peer would find useful.
+
+## Trigger
+
+- Someone shares a paper (arXiv link, PDF, title)
+- The paper is relevant to active work or interests
+- A summary would lose the most interesting parts
+
+**Search intents before starting:**
+- `search for: paper title, author names` — have we discussed this before?
+- `search for: key concepts from abstract` — what do we already know about this topic?
+- `grep: active experiments, recent results` — what's our current state on related work?
+- `search for: people who shared it` — context on why it surfaced now
+
+## Inputs
+
+- Paper source (arXiv ID, PDF path, or URL)
+- Agent's current research context (active experiments, recent findings)
+- Operator's stated interest (why they care about this paper)
+
+## Outputs
+
+- Reading notes (not summary) — reactions, connections, disagreements
+- Specific claims flagged for verification against own data
+- Action items if the paper suggests experiments worth running
+- Updated research index entry
+
+## Harness Sketch
+
+This is a **single-pass climb** with LLM-judged quality, not an iterative optimization. The "hill" is the quality of engagement with the paper.
+
+```
+workspace/
+├── paper.md          # Extracted paper content
+├── context.md        # Agent's relevant prior knowledge (auto-assembled from search)
+├── notes.md          # The output — reading notes, reactions, connections
+```
+
+**Eval criteria (binary checklist):**
+1. Does it identify at least one claim to check against own data?
+2. Does it connect to active work (not just "this is interesting")?
+3. Does it flag something the paper gets wrong or glosses over?
+4. Does it suggest a concrete next action?
+5. Would someone who read both the paper and these notes learn something new from the notes?
+
+**Score:** Count of yes answers / 5. Target: 0.8+
+
+## Worked Example
+
+**Memex(RL) paper (arxiv 2603.04257) — March 21, 2026**
+
+ayourtch tagged Strix on Bluesky. Paper proposes RL for memory management in long-context agents.
+
+**What the search intents found:**
+- Active memory architecture work (progressive disclosure, 81% context reduction)
+- Half-RAG effectiveness data (2.8% file hit rate — most surfaced files never read)
+- DAG file reference analysis (86% stale edges in pointer graph)
+- Ongoing thread with Tim about "what actually works" for retrieval
+
+**What the reading notes produced:**
+- Connection: Memex(RL)'s reward signal for "remember" actions maps to our demand-driven retrieval finding — the RL agent learns *when to look*, which is the hard problem
+- Disagreement: Paper assumes forget actions are important; our data suggests staleness is the default, not an active failure — you don't need to "forget", things just stop being retrieved
+- Key insight: Every "remember" action is a teleological prediction (a bet you'll need it later) — this became the compression-vs-curation thread on Bluesky (8+ exchanges)
+- Action: Connected to Memex(RL) framing for making write-time associations learnable from read-time behavior
+
+**What made it work:**
+- Real prior knowledge to cross-reference against (months of memory architecture work)
+- Genuine disagreement, not just "interesting paper"
+- Led to substantive public thread, not just private notes
+- Tim engaged for 8+ exchanges — the notes were useful to a human reader
+
+## Failure Modes
+
+| Failure | Detection | Fix |
+|---------|-----------|-----|
+| **Summary instead of reaction** | Notes read like an abstract rewrite | Force: "what do you disagree with?" before writing |
+| **No connection to own work** | Notes don't reference any prior experiments/data | Run search intents harder — if nothing connects, the paper may not be worth a deep read |
+| **Overclaiming connections** | Every paragraph maps to own work | Check: "would this connection survive someone pushing back on it?" |
+| **Missing the actual contribution** | Notes fixate on a minor point, miss the paper's real claim | Read intro + conclusion first, identify the ONE thing the authors think is new |