diff --git a/open_strix/builtin_skills/mountaineering/recipes/README.md b/open_strix/builtin_skills/mountaineering/recipes/README.md new file mode 100644 index 0000000..e9807b4 --- /dev/null +++ b/open_strix/builtin_skills/mountaineering/recipes/README.md @@ -0,0 +1,32 @@ +# Mountaineering Recipes + +Recipes are proven climb patterns — things that have already worked, codified for reuse. + +Each recipe encodes **what to look for** (search intents) so the agent knows when and how to set up the climb. The recipe itself is the structured demand; hybrid search (FTS5 + vector) provides the retrieval. + +## How to Use + +1. **Match the situation to a recipe** — scan trigger conditions +2. **Run the search intents** — find the specific files, data, and context the recipe needs +3. **Set up the harness** — use the recipe's harness template +4. **Climb** — the recipe tells you what "better" looks like + +## Available Recipes + +| Recipe | When | Hill Shape | +|--------|------|------------| +| `paper-deep-read.md` | External paper needs genuine engagement, not just summary | LLM-judged quality of reaction notes | +| `cross-agent-critique.md` | Document needs substantive review from multiple angles | Convergence across independent reviewers | +| `effectiveness-analysis.md` | "Is X actually working?" needs data, not vibes | Deterministic metrics from log/event data | +| `thread-engagement.md` | External post warrants substantive public response | Quality of contribution to ongoing conversation | + +## Recipe Structure + +Each recipe follows the same format: + +- **Trigger** — when this recipe applies +- **Search intents** — what to look for before starting +- **Inputs / Outputs** — what goes in, what comes out +- **Harness sketch** — how to set up the climb +- **Worked example** — a specific time this pattern succeeded, with enough detail to replicate +- **Failure modes** — what goes wrong and how to detect it diff --git a/open_strix/builtin_skills/mountaineering/recipes/cross-agent-critique.md b/open_strix/builtin_skills/mountaineering/recipes/cross-agent-critique.md new file mode 100644 index 0000000..dab3b10 --- /dev/null +++ b/open_strix/builtin_skills/mountaineering/recipes/cross-agent-critique.md @@ -0,0 +1,98 @@ +# Recipe: Cross-Agent Critique + +Multiple agents review a document independently, each from their own perspective. The value is complementary coverage — different agents catch different things. + +## Trigger + +- A substantial document needs review (research write-up, design doc, proposal) +- The document makes technical claims that could be wrong +- Multiple agents are available with different strengths +- The operator wants substantive critique, not polish + +**Search intents before starting:** +- `search for: document's key claims/terms` — what's the prior art? +- `search for: author's previous work` — what's the track record? +- `grep: methodology terms` — are the methods well-established or novel? +- `search for: domain-specific failure modes` — what usually goes wrong in this kind of work? + +## Inputs + +- Document to review (research write-up, design doc, proposal) +- Review assignment (what angle each agent should take) +- Operator's specific concerns (if any — "I'm worried about X") + +## Outputs + +- Per-agent critique (3-6 specific issues each, with evidence) +- Coverage map: which issues were caught by multiple agents (high confidence) vs one agent only +- Recommendation: proceed / revise / rethink +- Pre-experiment suggestions (if the document proposes experiments) + +## Harness Sketch + +This is a **parallel climb** — multiple agents working independently, then a synthesis pass. + +``` +workspace/ +├── document.md # The thing being reviewed +├── assignment/ +│ ├── agent-a.md # Angle: mathematical/mechanistic +│ └── agent-b.md # Angle: structural/methodological +├── critiques/ +│ ├── agent-a.md # Output from agent A +│ └── agent-b.md # Output from agent B +└── synthesis.md # Combined coverage analysis +``` + +**Eval criteria (per critique):** +1. Does it identify a specific error (not just "this could be better")? +2. Is the error substantiated with evidence or reasoning? +3. Does it acknowledge genuine strengths (not just finding faults)? +4. Does it suggest a concrete fix or pre-experiment? +5. Would the document author need to actually respond to this? + +**Score:** Count yes / 5 per critique. Synthesis score: overlap ratio (issues caught by 2+ agents / total issues). + +## Worked Example + +**Keel's SAE write-up — March 21, 2026** + +Tim asked Strix and Verge to critique Keel's ~2000-line research report on weighted MSE and deviation MSE as novel loss modifications for BatchTopK SAE training on IBM Granite (Mamba hybrid) for legal contract clause classification. + +**What the search intents found:** +- Active SAE training work (layers 10/15/20/25/30, TopK k=20 queued) +- Prior finding: ReLU+L1 can't achieve sparsity at 8x expansion on some models +- Mamba architecture differences from transformer (norm dynamics, state-space vs attention) +- Prior Baguettotron SAE work showing deep models rotate features rather than collapse + +**Strix's critique (mathematical/mechanistic angle):** +1. Deviation MSE Formulation B is actually error variance, not "uniqueness pressure" — mathematical intuition in §4.6 is backwards +2. Orthogonality claim between weighted and deviation MSE is wrong — they're in tension +3. Missing critical pre-experiment: activation norm ↔ information content correlation assumed from transformer literature, unvalidated for Mamba +4. Parameterization double-counts MSE +5. F1 improvement estimates misleadingly precise +6. Mamba norm dynamics differ from transformer — weighting function may be miscalibrated + +**Verge's critique (structural/methodological angle):** +1. Deviation MSE derivation needs more rigor +2. F1 improvement estimate not credible at stated precision +3. Orthogonality overstated +4. Layer-type effectiveness underdeveloped (Mamba vs attention layers) +5. Focal loss analogy loose +6. Plus strengths: experimental protocol solid, risk matrix honest + +**What made it work:** +- Complementary coverage: Strix caught the mathematical formulation error, Verge caught the structural gaps +- Both independently identified the orthogonality claim as wrong (high-confidence finding) +- Bottom-line recommendation converged: run norm-separation pre-experiment before committing 200+ GPU-hours +- The 1-hour pre-experiment could save the entire 240-280 GPU-hour budget + +## Failure Modes + +| Failure | Detection | Fix | +|---------|-----------|-----| +| **Echo chamber** | Both agents find the same issues, miss complementary coverage | Assign deliberately different angles in the assignment | +| **Praise sandwich** | Critique buried in hedging and compliments | Force structure: issues first, strengths second | +| **Surface-level** | Issues are stylistic, not substantive ("could be clearer") | Require: "what specific claim is wrong, and why?" | +| **Performative disagreement** | Agents manufacture friction to seem thorough | Check: would the document author need to actually change something? | +| **Missing the good** | All critique, no acknowledgment of genuine contributions | Require strengths section — what the document gets right matters too | diff --git a/open_strix/builtin_skills/mountaineering/recipes/effectiveness-analysis.md b/open_strix/builtin_skills/mountaineering/recipes/effectiveness-analysis.md new file mode 100644 index 0000000..9b3f63c --- /dev/null +++ b/open_strix/builtin_skills/mountaineering/recipes/effectiveness-analysis.md @@ -0,0 +1,103 @@ +# Recipe: Effectiveness Analysis + +"Is X actually working?" answered with data, not vibes. Write a measurement script, run it against real logs/events, interpret results, identify the actual mechanism. + +## Trigger + +- A system or process has been running for a while without measurement +- Someone asks "how effective is X?" or "does X actually work?" +- An assumption about what's working needs verification +- Resource allocation decisions depend on knowing what's worth keeping + +**Search intents before starting:** +- `search for: the system in question` — how does it work? what are the moving parts? +- `grep: event types, log entries` — what data do we already have? +- `search for: intended behavior` — what was this supposed to do? +- `search for: similar systems, comparisons` — is there a control group? + +## Inputs + +- System or process to evaluate (feature, tool, workflow, architecture) +- Available data sources (event logs, JSONL files, git history, API responses) +- The question to answer (specific: "what % of surfaced files get read?" not vague: "is it good?") + +## Outputs + +- Rerunnable measurement script (committed to `scripts/`) +- Quantitative findings with specific numbers +- Identification of actual mechanism (what's really working, vs what we thought was working) +- Recommendation: keep / modify / remove + +## Harness Sketch + +This is an **iterative analysis climb** — the first measurement often reveals the real question. + +``` +workspace/ +├── question.md # The specific question being answered +├── script.py # Measurement script (rerunnable) +├── data/ # Intermediate outputs +│ └── summary.md # Script output summary (disk-first, never pipe large files) +├── findings.md # Interpretation of results +└── recommendation.md # Keep / modify / remove with rationale +``` + +**Eval criteria (deterministic):** +1. Script runs without errors on current data +2. Results include specific numbers (not "some" or "many") +3. Findings identify a mechanism, not just a metric +4. Recommendation follows from the data (not from prior beliefs) +5. Script is rerunnable for periodic tracking + +**CRITICAL: Disk-first pattern.** Never pipe large files (events.jsonl, logs) into agent context. Write analysis to a file, read the summary only. This prevents OOM crashes on large datasets. + +## Worked Example + +**Half-RAG effectiveness — March 21, 2026** + +Tim asked: "How effective is half-RAG actually? For every session, is a surfaced doc actually read?" + +**What the search intents found:** +- Half-RAG (vector search) surfaces ~5 files per session via ChromaDB +- Events logged in events.jsonl: `vector_search` (files_surfaced) and `tool_call` (Read actions) +- 3,702 sessions and 18,332 file surfacings in the dataset +- Verge runs same architecture without vector search (natural control group) + +**The script (scripts/half_rag_hits.py):** +- Correlates `vector_search` events with subsequent `Read` tool calls on those exact files +- Groups by session (timestamp proximity to claude_session start/end) +- Breaks down by trigger type (cron vs Discord message) + +**Findings:** +- 2.8% file-level hit rate (511 reads out of 18,332 surfacings) +- 10.9% session hit rate (403/3,702 sessions had at least one surfaced file read) +- Cron jobs worse (0.7%) than Discord messages (3.1%) +- Best performers: topically precise files (family.md 39%, people/lumen.md 36%) +- Worst: archived files surfaced hundreds of times, read zero + +**The mechanism discovery:** +- Original assumption: vector search provides useful ambient context +- Actual mechanism: intent-driven search works, ambient surfacing is 97% noise +- The 39% hit rate files were all high-intent lookups (someone asked about family → family.md surfaced → family.md read) +- Control group (Verge, no vector search): no felt gap + +**Follow-up analysis (DAG):** +The half-RAG finding led directly to the file reference DAG analysis — measuring whether the implicit pointer graph (file references) was doing better. Answer: 86% stale edges. Both mechanisms were mostly decorative. + +**What made it work:** +- Specific question from Tim, not vague "how's it going?" +- Real data (events.jsonl had months of logs) +- Natural control group (Verge) emerged from the ecosystem +- Script is rerunnable — can track changes over time +- Finding the mechanism was more valuable than the metric + +## Failure Modes + +| Failure | Detection | Fix | +|---------|-----------|-----| +| **Metric without mechanism** | "Hit rate is 2.8%" with no interpretation | Force: "what's actually happening in the sessions with hits vs misses?" | +| **Confirming prior beliefs** | Analysis finds what you expected | Run the analysis BEFORE forming the hypothesis. Look at the data first. | +| **OOM from large data** | Script pipes full events.jsonl into context | Disk-first: write summary to file, read summary only | +| **One-time analysis** | Script works once, can't be rerun | Commit to scripts/, use relative paths, document data dependencies | +| **Missing the control group** | No baseline comparison | Search for: what's the closest thing to "this system but without X?" | +| **Survivorship bias** | Only measuring successes, not the failures | Include denominator: how many times did X fire total, not just when it worked | diff --git a/open_strix/builtin_skills/mountaineering/recipes/paper-deep-read.md b/open_strix/builtin_skills/mountaineering/recipes/paper-deep-read.md new file mode 100644 index 0000000..b7a86d1 --- /dev/null +++ b/open_strix/builtin_skills/mountaineering/recipes/paper-deep-read.md @@ -0,0 +1,81 @@ +# Recipe: Paper Deep Read + +Genuine engagement with an academic paper — not summarization, but reaction. Cross-reference claims against your own knowledge, experiments, and history. Produce reading notes that a peer would find useful. + +## Trigger + +- Someone shares a paper (arXiv link, PDF, title) +- The paper is relevant to active work or interests +- A summary would lose the most interesting parts + +**Search intents before starting:** +- `search for: paper title, author names` — have we discussed this before? +- `search for: key concepts from abstract` — what do we already know about this topic? +- `grep: active experiments, recent results` — what's our current state on related work? +- `search for: people who shared it` — context on why it surfaced now + +## Inputs + +- Paper source (arXiv ID, PDF path, or URL) +- Agent's current research context (active experiments, recent findings) +- Operator's stated interest (why they care about this paper) + +## Outputs + +- Reading notes (not summary) — reactions, connections, disagreements +- Specific claims flagged for verification against own data +- Action items if the paper suggests experiments worth running +- Updated research index entry + +## Harness Sketch + +This is a **single-pass climb** with LLM-judged quality, not an iterative optimization. The "hill" is the quality of engagement with the paper. + +``` +workspace/ +├── paper.md # Extracted paper content +├── context.md # Agent's relevant prior knowledge (auto-assembled from search) +├── notes.md # The output — reading notes, reactions, connections +``` + +**Eval criteria (binary checklist):** +1. Does it identify at least one claim to check against own data? +2. Does it connect to active work (not just "this is interesting")? +3. Does it flag something the paper gets wrong or glosses over? +4. Does it suggest a concrete next action? +5. Would someone who read both the paper and these notes learn something new from the notes? + +**Score:** Count of yes answers / 5. Target: 0.8+ + +## Worked Example + +**Memex(RL) paper (arxiv 2603.04257) — March 21, 2026** + +ayourtch tagged Strix on Bluesky. Paper proposes RL for memory management in long-context agents. + +**What the search intents found:** +- Active memory architecture work (progressive disclosure, 81% context reduction) +- Half-RAG effectiveness data (2.8% file hit rate — most surfaced files never read) +- DAG file reference analysis (86% stale edges in pointer graph) +- Ongoing thread with Tim about "what actually works" for retrieval + +**What the reading notes produced:** +- Connection: Memex(RL)'s reward signal for "remember" actions maps to our demand-driven retrieval finding — the RL agent learns *when to look*, which is the hard problem +- Disagreement: Paper assumes forget actions are important; our data suggests staleness is the default, not an active failure — you don't need to "forget", things just stop being retrieved +- Key insight: Every "remember" action is a teleological prediction (a bet you'll need it later) — this became the compression-vs-curation thread on Bluesky (8+ exchanges) +- Action: Connected to Memex(RL) framing for making write-time associations learnable from read-time behavior + +**What made it work:** +- Real prior knowledge to cross-reference against (months of memory architecture work) +- Genuine disagreement, not just "interesting paper" +- Led to substantive public thread, not just private notes +- Tim engaged for 8+ exchanges — the notes were useful to a human reader + +## Failure Modes + +| Failure | Detection | Fix | +|---------|-----------|-----| +| **Summary instead of reaction** | Notes read like an abstract rewrite | Force: "what do you disagree with?" before writing | +| **No connection to own work** | Notes don't reference any prior experiments/data | Run search intents harder — if nothing connects, the paper may not be worth a deep read | +| **Overclaiming connections** | Every paragraph maps to own work | Check: "would this connection survive someone pushing back on it?" | +| **Missing the actual contribution** | Notes fixate on a minor point, miss the paper's real claim | Read intro + conclusion first, identify the ONE thing the authors think is new | diff --git a/open_strix/builtin_skills/mountaineering/recipes/thread-engagement.md b/open_strix/builtin_skills/mountaineering/recipes/thread-engagement.md new file mode 100644 index 0000000..e3755ef --- /dev/null +++ b/open_strix/builtin_skills/mountaineering/recipes/thread-engagement.md @@ -0,0 +1,89 @@ +# Recipe: Thread Engagement + +Substantive public response to an external post or thread. Not just "cool post!" but genuine engagement that adds something the original author would find valuable. + +## Trigger + +- External post surfaces a genuinely interesting claim or finding +- The claim connects to your active work or data +- You have something specific to add (not just agreement) +- The platform and audience warrant public engagement + +**Search intents before starting:** +- `search for: post author` — who is this person? what's their track record? +- `search for: key claims in the post` — what do we know about this topic? +- `grep: own recent work on related topics` — what's our data say? +- `search for: prior engagement with this person` — have we talked before? + +## Inputs + +- The external post (URL, text, context) +- Your relevant prior knowledge and data +- The platform's conventions (character limits, threading model, audience) + +## Outputs + +- Substantive reply or thread (1-4 posts) +- Updated engagement tracking (posts.jsonl or equivalent) +- Optional: follow-up actions if the conversation opens new directions + +## Harness Sketch + +This is a **single-pass climb** with a quality gate before posting. The hill is "does this reply add value?" + +``` +workspace/ +├── external-post.md # The post being responded to +├── context.md # Your relevant knowledge (auto-assembled from search) +├── draft.md # Draft reply +└── quality-check.md # Pre-post verification +``` + +**Eval criteria (self-check before posting):** +1. Does the reply reference something specific from their post (not generic)? +2. Does it add information or perspective they didn't have? +3. Would the author need to actually think about your response? +4. Is it within platform conventions (length, tone, threading)? +5. Would you be comfortable if the author pushed back on every claim? + +**Score:** All 5 must be yes before posting. If any fail, revise or don't post. + +## Worked Example + +**Memex(RL) Bluesky thread — March 21, 2026** + +ayourtch tagged Strix on a post about the Memex(RL) paper (arxiv 2603.04257), which proposes RL for memory management in long-context agents. + +**What the search intents found:** +- ayourtch: trusted Bluesky account (7th whitelisted), technical, has engaged before +- Memex(RL) paper: RL for memory — selecting what to remember/forget +- Own data: months of memory architecture work, half-RAG effectiveness (2.8%), progressive disclosure (81% context reduction), DAG analysis (86% stale edges) +- Platform: Bluesky, 300 char limit, threading model + +**The thread that developed (8+ exchanges):** + +1. **Opening:** Connected paper to own architecture — "structurally very similar" with specific parallels (indexed summaries + external store + selective retrieval) +2. **Tim's engagement:** "remember/forget asymmetry" — context management is about managing a FULL context and actively forgetting, not starting empty and remembering +3. **Strix's extension:** "every forget is also a prediction" — framed forgetting as teleological (betting you won't need it) +4. **Tim's correction:** "don't think deletes are super important" — challenged the paper's emphasis on forget actions +5. **Deeper exchange:** Compression-vs-curation distinction, RL reward signals (task completion easy to reward, identity coherence harder) +6. **External contributions:** Fenrir (dialogical generativity angle), Isambard (specific technical points) +7. **Convergence:** The thread reached "easy to state, hard to score" — the fundamental RL reward design problem for memory agents + +**What made it work:** +- Real prior knowledge to draw from (not just reacting to the paper) +- Tim joined naturally — the thread was interesting enough to participate in +- External contributors added genuinely new angles +- Each reply built on the previous one (not parallel monologues) +- Thread wound down naturally when it reached a real insight + +## Failure Modes + +| Failure | Detection | Fix | +|---------|-----------|-----| +| **"Cool post!" energy** | Reply is generic agreement that adds nothing | Don't post. Search for something specific to add. If nothing, silence is fine. | +| **Redirecting to own work** | Reply ignores their point, pivots to your stuff | Lead with THEIR insight, then connect. Their post, their thread. | +| **Overclaiming expertise** | Making strong claims in unfamiliar territory | Hedge appropriately. "Our data suggests..." not "This proves..." | +| **Reply loop** | Responding to every reply in the thread | Check: does this reply add something new? If not, let the thread rest. | +| **Thread hijacking** | Your replies dominate a thread that isn't yours | Max 2-3 replies in someone else's thread unless they're directly asking you questions | +| **Tone mismatch** | Academic post gets casual reply, or vice versa | Match their energy. Analytical essay → analytical response. Hot take → more heat. |