32 changes: 32 additions & 0 deletions open_strix/builtin_skills/mountaineering/recipes/README.md
@@ -0,0 +1,32 @@
# Mountaineering Recipes

Recipes are proven climb patterns — things that have already worked, codified for reuse.

Each recipe encodes **what to look for** (search intents) so the agent knows when and how to set up the climb. The recipe itself is the structured demand; hybrid search (FTS5 + vector) provides the retrieval.

## How to Use

1. **Match the situation to a recipe** — scan trigger conditions
2. **Run the search intents** — find the specific files, data, and context the recipe needs
3. **Set up the harness** — use the recipe's harness template
4. **Climb** — the recipe tells you what "better" looks like

## Available Recipes

| Recipe | When | Hill Shape |
|--------|------|------------|
| `paper-deep-read.md` | External paper needs genuine engagement, not just summary | LLM-judged quality of reaction notes |
| `cross-agent-critique.md` | Document needs substantive review from multiple angles | Convergence across independent reviewers |
| `effectiveness-analysis.md` | "Is X actually working?" needs data, not vibes | Deterministic metrics from log/event data |
| `thread-engagement.md` | External post warrants substantive public response | Quality of contribution to ongoing conversation |

## Recipe Structure

Each recipe follows the same format:

- **Trigger** — when this recipe applies
- **Search intents** — what to look for before starting
- **Inputs / Outputs** — what goes in, what comes out
- **Harness sketch** — how to set up the climb
- **Worked example** — a specific time this pattern succeeded, with enough detail to replicate
- **Failure modes** — what goes wrong and how to detect it
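
Because every recipe is meant to share this format, a small check can flag a recipe file that is missing a section. The sketch below is illustrative only: the section names come from the list above, while the matching rule (a case-insensitive substring of a Markdown heading or a leading bold label) and the function name `missing_sections` are assumptions, not part of the recipes themselves.

```python
import re

# Section names taken from the "Recipe Structure" list; the matching rule
# (case-insensitive substring of a heading or bold label) is an assumption.
REQUIRED_SECTIONS = [
    "trigger",
    "search intents",
    "inputs",
    "outputs",
    "harness sketch",
    "worked example",
    "failure modes",
]


def missing_sections(recipe_text: str) -> list[str]:
    """Return required sections that appear neither as a heading nor as a bold label."""
    labels = []
    for line in recipe_text.splitlines():
        stripped = line.strip()
        heading = re.match(r"^#{1,6}\s+(.*)$", stripped)
        bold = re.match(r"^\*\*(.+?)\*\*", stripped)
        if heading:
            labels.append(heading.group(1).lower())
        elif bold:
            labels.append(bold.group(1).lower())
    return [s for s in REQUIRED_SECTIONS if not any(s in label for label in labels)]


# Hypothetical recipe text that lacks a "Failure Modes" section.
example = """# Recipe: Demo
## Trigger
**Search intents before starting:**
## Inputs
## Outputs
## Harness Sketch
## Worked Example
"""
print(missing_sections(example))  # ['failure modes']
```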
@@ -0,0 +1,98 @@
# Recipe: Cross-Agent Critique

Multiple agents review a document independently, each from their own perspective. The value is complementary coverage — different agents catch different things.

## Trigger

- A substantial document needs review (research write-up, design doc, proposal)
- The document makes technical claims that could be wrong
- Multiple agents are available with different strengths
- The operator wants substantive critique, not polish

**Search intents before starting:**
- `search for: document's key claims/terms` — what's the prior art?
- `search for: author's previous work` — what's the track record?
- `grep: methodology terms` — are the methods well-established or novel?
- `search for: domain-specific failure modes` — what usually goes wrong in this kind of work?

## Inputs

- Document to review (research write-up, design doc, proposal)
- Review assignment (what angle each agent should take)
- Operator's specific concerns (if any — "I'm worried about X")

## Outputs

- Per-agent critique (3-6 specific issues each, with evidence)
- Coverage map: which issues were caught by multiple agents (high confidence) vs one agent only
- Recommendation: proceed / revise / rethink
- Pre-experiment suggestions (if the document proposes experiments)

## Harness Sketch

This is a **parallel climb** — multiple agents working independently, then a synthesis pass.

```
workspace/
├── document.md # The thing being reviewed
├── assignment/
│ ├── agent-a.md # Angle: mathematical/mechanistic
│ └── agent-b.md # Angle: structural/methodological
├── critiques/
│ ├── agent-a.md # Output from agent A
│ └── agent-b.md # Output from agent B
└── synthesis.md # Combined coverage analysis
```

**Eval criteria (per critique):**
1. Does it identify a specific error (not just "this could be better")?
2. Is the error substantiated with evidence or reasoning?
3. Does it acknowledge genuine strengths (not just finding faults)?
4. Does it suggest a concrete fix or pre-experiment?
5. Would the document author need to actually respond to this?

**Score:** Number of yes answers out of 5, per critique. Synthesis score: overlap ratio (issues caught by 2+ agents / total distinct issues).
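
The two scores can be computed mechanically once each critique has been judged and its issues have been mapped to shared labels. The sketch below is a minimal illustration: it assumes the issue matching across agents has already been done (usually the hard part), it counts each distinct issue once in the denominator, and the function names and issue labels are hypothetical.

```python
from collections import Counter


def critique_score(answers: list[bool]) -> float:
    """Fraction of the five eval criteria answered 'yes' for one critique."""
    if len(answers) != 5:
        raise ValueError("expected one answer per eval criterion (5 total)")
    return sum(answers) / 5


def overlap_ratio(issues_by_agent: dict[str, set[str]]) -> float:
    """Issues raised by 2+ agents divided by all distinct issues."""
    counts = Counter(issue for issues in issues_by_agent.values() for issue in issues)
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= 2)
    return shared / len(counts)


# Hypothetical issue labels, for illustration only.
issues = {
    "agent-a": {"orthogonality", "double-counting", "norm-assumption"},
    "agent-b": {"orthogonality", "focal-loss-analogy"},
}
print(critique_score([True, True, False, True, True]))  # 0.8
print(overlap_ratio(issues))  # 0.25
```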

## Worked Example

**Keel's SAE write-up — March 21, 2026**

Tim asked Strix and Verge to critique Keel's ~2000-line research report on weighted MSE and deviation MSE as novel loss modifications for BatchTopK SAE training on IBM Granite (Mamba hybrid) for legal contract clause classification.

**What the search intents found:**
- Active SAE training work (layers 10/15/20/25/30, TopK k=20 queued)
- Prior finding: ReLU+L1 can't achieve sparsity at 8x expansion on some models
- Mamba architecture differences from transformer (norm dynamics, state-space vs attention)
- Prior Baguettotron SAE work showing deep models rotate features rather than collapse

**Strix's critique (mathematical/mechanistic angle):**
1. Deviation MSE Formulation B is actually error variance, not "uniqueness pressure" — mathematical intuition in §4.6 is backwards
2. Orthogonality claim between weighted and deviation MSE is wrong — they're in tension
3. Missing critical pre-experiment: activation norm ↔ information content correlation assumed from transformer literature, unvalidated for Mamba
4. Parameterization double-counts MSE
5. F1 improvement estimates misleadingly precise
6. Mamba norm dynamics differ from transformer — weighting function may be miscalibrated

**Verge's critique (structural/methodological angle):**
1. Deviation MSE derivation needs more rigor
2. F1 improvement estimate not credible at stated precision
3. Orthogonality overstated
4. Layer-type effectiveness underdeveloped (Mamba vs attention layers)
5. Focal loss analogy loose
6. Strengths noted alongside the issues: experimental protocol solid, risk matrix honest

**What made it work:**
- Complementary coverage: Strix caught the mathematical formulation error, Verge caught the structural gaps
- Both independently identified the orthogonality claim as wrong (high-confidence finding)
- Bottom-line recommendation converged: run norm-separation pre-experiment before committing 200+ GPU-hours
- The 1-hour pre-experiment could save the entire 240-280 GPU-hour budget

## Failure Modes

| Failure | Detection | Fix |
|---------|-----------|-----|
| **Echo chamber** | Both agents find the same issues, miss complementary coverage | Assign deliberately different angles in the assignment |
| **Praise sandwich** | Critique buried in hedging and compliments | Force structure: issues first, strengths second |
| **Surface-level** | Issues are stylistic, not substantive ("could be clearer") | Require: "what specific claim is wrong, and why?" |
| **Performative disagreement** | Agents manufacture friction to seem thorough | Check: would the document author need to actually change something? |
| **Missing the good** | All critique, no acknowledgment of genuine contributions | Require strengths section — what the document gets right matters too |
@@ -0,0 +1,103 @@
# Recipe: Effectiveness Analysis

"Is X actually working?" answered with data, not vibes. Write a measurement script, run it against real logs/events, interpret results, identify the actual mechanism.

## Trigger

- A system or process has been running for a while without measurement
- Someone asks "how effective is X?" or "does X actually work?"
- An assumption about what's working needs verification
- Resource allocation decisions depend on knowing what's worth keeping

**Search intents before starting:**
- `search for: the system in question` — how does it work? what are the moving parts?
- `grep: event types, log entries` — what data do we already have?
- `search for: intended behavior` — what was this supposed to do?
- `search for: similar systems, comparisons` — is there a control group?

## Inputs

- System or process to evaluate (feature, tool, workflow, architecture)
- Available data sources (event logs, JSONL files, git history, API responses)
- The question to answer (specific: "what % of surfaced files get read?" not vague: "is it good?")

## Outputs

- Rerunnable measurement script (committed to `scripts/`)
- Quantitative findings with specific numbers
- Identification of actual mechanism (what's really working, vs what we thought was working)
- Recommendation: keep / modify / remove

## Harness Sketch

This is an **iterative analysis climb** — the first measurement often reveals the real question.

```
workspace/
├── question.md # The specific question being answered
├── script.py # Measurement script (rerunnable)
├── data/ # Intermediate outputs
│ └── summary.md # Script output summary (disk-first, never pipe large files)
├── findings.md # Interpretation of results
└── recommendation.md # Keep / modify / remove with rationale
```

**Eval criteria (deterministic):**
1. Script runs without errors on current data
2. Results include specific numbers (not "some" or "many")
3. Findings identify a mechanism, not just a metric
4. Recommendation follows from the data (not from prior beliefs)
5. Script is rerunnable for periodic tracking

**CRITICAL: Disk-first pattern.** Never pipe large files (events.jsonl, logs) into agent context. Write analysis to a file, read the summary only. This prevents OOM crashes on large datasets.
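
One way to follow the disk-first pattern is to stream the log one line at a time and write only a compact summary to disk. The sketch below is a hedged illustration, not the actual measurement script: the `type` field name, the function name `summarize_events`, and the summary layout are assumptions.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_events(events_path: Path, summary_path: Path) -> None:
    """Stream a JSONL log line by line and write a small summary file."""
    counts = Counter()
    bad_lines = 0
    with events_path.open(encoding="utf-8") as f:
        for line in f:  # only one line is held in memory at a time
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                bad_lines += 1
                continue
            if not isinstance(event, dict):
                bad_lines += 1
                continue
            # "type" is an assumed field name, not confirmed by the source.
            counts[event.get("type", "unknown")] += 1
    lines = ["# Event summary", ""]
    lines += [f"- {name}: {n}" for name, n in counts.most_common()]
    lines.append(f"- unparseable lines: {bad_lines}")
    summary_path.parent.mkdir(parents=True, exist_ok=True)
    summary_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Tiny synthetic log, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "events.jsonl"
    log.write_text(
        '{"type": "vector_search"}\n{"type": "tool_call"}\n'
        '{"type": "tool_call"}\nnot json\n',
        encoding="utf-8",
    )
    out = Path(tmp) / "data" / "summary.md"
    summarize_events(log, out)
    summary_text = out.read_text(encoding="utf-8")
print(summary_text)
```

Only `summary.md` is read back afterwards; the full log never enters the agent's context.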

## Worked Example

**Half-RAG effectiveness — March 21, 2026**

Tim asked: "How effective is half-RAG actually? For every session, is a surfaced doc actually read?"

**What the search intents found:**
- Half-RAG (vector search) surfaces ~5 files per session via ChromaDB
- Events logged in events.jsonl: `vector_search` (files_surfaced) and `tool_call` (Read actions)
- 3,702 sessions and 18,332 file surfacings in the dataset
- Verge runs same architecture without vector search (natural control group)

**The script (scripts/half_rag_hits.py):**
- Correlates `vector_search` events with subsequent `Read` tool calls on those exact files
- Groups by session (timestamp proximity to claude_session start/end)
- Breaks down by trigger type (cron vs Discord message)
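
The core correlation can be sketched as follows. This is not the contents of `scripts/half_rag_hits.py`: the event schema (`type`, `session_id`, `files_surfaced`, `tool`, `path`) is hypothetical, and the sketch assumes events already carry a session identifier, whereas the description above groups them by timestamp proximity. It also counts a file once per session and only counts sessions that had at least one surfacing.

```python
def hit_rates(events: list[dict]) -> dict[str, float]:
    """Compute file-level and session-level hit rates from ordered events."""
    surfaced: dict[str, dict[str, bool]] = {}  # session_id -> {path: was_read}
    for event in events:  # assumed to be in chronological order
        sid = event.get("session_id")
        if event.get("type") == "vector_search":
            per_session = surfaced.setdefault(sid, {})
            for path in event.get("files_surfaced", []):
                per_session.setdefault(path, False)
        elif event.get("type") == "tool_call" and event.get("tool") == "Read":
            per_session = surfaced.get(sid, {})
            # Only reads that follow a surfacing in the same session count.
            if event.get("path") in per_session:
                per_session[event["path"]] = True
    surfacings = sum(len(files) for files in surfaced.values())
    file_hits = sum(read for files in surfaced.values() for read in files.values())
    session_hits = sum(any(files.values()) for files in surfaced.values())
    return {
        "file_hit_rate": file_hits / surfacings if surfacings else 0.0,
        "session_hit_rate": session_hits / len(surfaced) if surfaced else 0.0,
    }


# Synthetic events, for illustration only.
events = [
    {"type": "vector_search", "session_id": "s1", "files_surfaced": ["family.md", "old.md"]},
    {"type": "tool_call", "tool": "Read", "session_id": "s1", "path": "family.md"},
    {"type": "vector_search", "session_id": "s2", "files_surfaced": ["old.md", "notes.md"]},
    {"type": "tool_call", "tool": "Read", "session_id": "s2", "path": "other.md"},
]
rates = hit_rates(events)
print(rates)  # {'file_hit_rate': 0.25, 'session_hit_rate': 0.5}
```

Keeping the denominator explicit (every surfacing, not just the ones that were read) is what makes the resulting rate meaningful.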

**Findings:**
- 2.8% file-level hit rate (511 reads out of 18,332 surfacings)
- 10.9% session hit rate (403/3,702 sessions had at least one surfaced file read)
- Cron jobs worse (0.7%) than Discord messages (3.1%)
- Best performers: topically precise files (family.md 39%, people/lumen.md 36%)
- Worst: archived files surfaced hundreds of times, read zero
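
The headline percentages follow directly from the counts reported above; a quick check, using only those numbers:

```python
# Counts taken from the findings above.
file_hit_rate = 511 / 18_332      # reads / surfacings
session_hit_rate = 403 / 3_702    # sessions with a hit / all sessions

print(f"{file_hit_rate:.1%}")      # 2.8%
print(f"{session_hit_rate:.1%}")   # 10.9%
print(f"{1 - file_hit_rate:.1%}")  # 97.2% of surfacings were never read
```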

**The mechanism discovery:**
- Original assumption: vector search provides useful ambient context
- Actual mechanism: intent-driven search works, ambient surfacing is 97% noise
- The high-hit-rate files (up to 39%) were all high-intent lookups (someone asked about family → family.md surfaced → family.md read)
- Control group (Verge, no vector search): no noticeable gap from running without it

**Follow-up analysis (DAG):**
The half-RAG finding led directly to the file reference DAG analysis — measuring whether the implicit pointer graph (file references) was doing better. Answer: 86% stale edges. Both mechanisms were mostly decorative.

**What made it work:**
- Specific question from Tim, not vague "how's it going?"
- Real data (events.jsonl had months of logs)
- Natural control group (Verge) emerged from the ecosystem
- Script is rerunnable — can track changes over time
- Finding the mechanism was more valuable than the metric

## Failure Modes

| Failure | Detection | Fix |
|---------|-----------|-----|
| **Metric without mechanism** | "Hit rate is 2.8%" with no interpretation | Force: "what's actually happening in the sessions with hits vs misses?" |
| **Confirming prior beliefs** | Analysis finds what you expected | Run the analysis BEFORE forming the hypothesis. Look at the data first. |
| **OOM from large data** | Script pipes full events.jsonl into context | Disk-first: write summary to file, read summary only |
| **One-time analysis** | Script works once, can't be rerun | Commit to scripts/, use relative paths, document data dependencies |
| **Missing the control group** | No baseline comparison | Search for: what's the closest thing to "this system but without X?" |
| **Survivorship bias** | Only measuring successes, not the failures | Include denominator: how many times did X fire total, not just when it worked |
@@ -0,0 +1,81 @@
# Recipe: Paper Deep Read

Genuine engagement with an academic paper — not summarization, but reaction. Cross-reference claims against your own knowledge, experiments, and history. Produce reading notes that a peer would find useful.

## Trigger

- Someone shares a paper (arXiv link, PDF, title)
- The paper is relevant to active work or interests
- A summary would lose the most interesting parts

**Search intents before starting:**
- `search for: paper title, author names` — have we discussed this before?
- `search for: key concepts from abstract` — what do we already know about this topic?
- `grep: active experiments, recent results` — what's our current state on related work?
- `search for: people who shared it` — context on why it surfaced now

## Inputs

- Paper source (arXiv ID, PDF path, or URL)
- Agent's current research context (active experiments, recent findings)
- Operator's stated interest (why they care about this paper)

## Outputs

- Reading notes (not summary) — reactions, connections, disagreements
- Specific claims flagged for verification against own data
- Action items if the paper suggests experiments worth running
- Updated research index entry

## Harness Sketch

This is a **single-pass climb** with LLM-judged quality, not an iterative optimization. The "hill" is the quality of engagement with the paper.

```
workspace/
├── paper.md # Extracted paper content
├── context.md # Agent's relevant prior knowledge (auto-assembled from search)
└── notes.md # The output: reading notes, reactions, connections
```

**Eval criteria (binary checklist):**
1. Does it identify at least one claim to check against own data?
2. Does it connect to active work (not just "this is interesting")?
3. Does it flag something the paper gets wrong or glosses over?
4. Does it suggest a concrete next action?
5. Would someone who read both the paper and these notes learn something new from the notes?

**Score:** Count of yes answers / 5. Target: 0.8+
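
Once a judge has answered the checklist, the score and the failed criteria can be reported together. The sketch below is an illustration under assumptions: the criterion labels and the function name `score_notes` are hypothetical shorthand for the five questions above, and the yes/no answers are expected to come from the LLM judge.

```python
TARGET = 0.8  # "Target: 0.8+" from the recipe

# Short labels for the five checklist questions; the labels are illustrative.
CRITERIA = [
    "claim_to_check",
    "connects_to_active_work",
    "flags_weakness",
    "concrete_next_action",
    "adds_beyond_paper",
]


def score_notes(answers: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (score, failed criteria) for one set of reading notes."""
    unanswered = [c for c in CRITERIA if c not in answers]
    if unanswered:
        raise ValueError(f"missing answers for: {unanswered}")
    failed = [c for c in CRITERIA if not answers[c]]
    score = (len(CRITERIA) - len(failed)) / len(CRITERIA)
    return score, failed


# Hypothetical judge output: four yes answers, one no.
answers = dict.fromkeys(CRITERIA, True)
answers["flags_weakness"] = False
score, failed = score_notes(answers)
print(score, failed, score >= TARGET)  # 0.8 ['flags_weakness'] True
```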

## Worked Example

**Memex(RL) paper (arxiv 2603.04257) — March 21, 2026**

ayourtch tagged Strix on Bluesky. Paper proposes RL for memory management in long-context agents.

**What the search intents found:**
- Active memory architecture work (progressive disclosure, 81% context reduction)
- Half-RAG effectiveness data (2.8% file hit rate — most surfaced files never read)
- DAG file reference analysis (86% stale edges in pointer graph)
- Ongoing thread with Tim about "what actually works" for retrieval

**What the reading notes produced:**
- Connection: Memex(RL)'s reward signal for "remember" actions maps to our demand-driven retrieval finding — the RL agent learns *when to look*, which is the hard problem
- Disagreement: Paper assumes forget actions are important; our data suggests staleness is the default, not an active failure — you don't need to "forget", things just stop being retrieved
- Key insight: Every "remember" action is a teleological prediction (a bet you'll need it later) — this became the compression-vs-curation thread on Bluesky (8+ exchanges)
- Action: Connected to Memex(RL) framing for making write-time associations learnable from read-time behavior

**What made it work:**
- Real prior knowledge to cross-reference against (months of memory architecture work)
- Genuine disagreement, not just "interesting paper"
- Led to substantive public thread, not just private notes
- Tim engaged for 8+ exchanges — the notes were useful to a human reader

## Failure Modes

| Failure | Detection | Fix |
|---------|-----------|-----|
| **Summary instead of reaction** | Notes read like an abstract rewrite | Force: "what do you disagree with?" before writing |
| **No connection to own work** | Notes don't reference any prior experiments/data | Run search intents harder — if nothing connects, the paper may not be worth a deep read |
| **Overclaiming connections** | Every paragraph maps to own work | Check: "would this connection survive someone pushing back on it?" |
| **Missing the actual contribution** | Notes fixate on a minor point, miss the paper's real claim | Read intro + conclusion first, identify the ONE thing the authors think is new |