49 changes: 28 additions & 21 deletions examples/cholesterol_primary_prevention/README.md
@@ -23,33 +23,39 @@ unstructured LLM-generated review would miss.
| File | What it is |
|---|---|
| `protocol.md` | The pre-registered protocol JSON, formatted for reading. |
| `run.log` | A trimmed timeline of what each phase produced. |
| `synthesis.md` | The Synthesizer's output, converted to markdown. |
| `synthesis.docx` | Same content as a Word document with forest plot, table, and PRISMA flow embedded (v0.2 — placeholder for now). |
| `run.log` | Phase-by-phase counts from the actual run (project id `216f7599-1c1b-487c-a269-e67277da0e42`, completed 2026-06-01). |
| `synthesis.md` | The Synthesizer's output, content-faithful markdown (image-free). |
| `synthesis.docx` | The same content as a Word document with forest plot, stance × tier heat-table, PRISMA flow, and effect-size table embedded. This is the artifact the protocol actually produced. |
| `synthesis.pdf` | PDF export of `synthesis.docx`, for read-only sharing. |

## What the protocol caught

Three things worth calling out for anyone using this as a model:

1. **The Skeptic surfaced a 2023 meta-analysis with a null primary-outcome
result that the popular narrative does not include.** Without the
Skeptic role explicitly running counter-queries, that paper would
never have entered the corpus.
2. **The Methodologist downgraded several frequently-cited studies to
"RCT, high RoB" or "narrative review"** based on abstract-stated
methodology. The Synthesizer then weighted the effect-size table
accordingly.
3. **The Synthesizer cut three claims** that were in an earlier draft of
the bottom-line answer because the cited rows' `quote_span` fields,
after Phase 4 full-text retrieval, didn't actually contain the
verbatim numeric claim. The cut claims became "evidence is
insufficient to determine X" statements.
1. **The Skeptic surfaced strict primary-prevention-only (strict-PP)
meta-analyses (Morsell 2026, Pignone 2000) and the ALLHAT-LLT
older-adult subgroup (Han 2017)**, all showing null all-cause
mortality (ACM) in true primary prevention. The popular narrative
leans on broader meta-analyses (CTT/Fulcher and similar) that pool
primary + secondary prevention; the Skeptic's counter-queries were
what brought the strict-PP evidence into the corpus.
2. **The "supports" stance row ended empty by design.** No row in the
corpus directly supports an ACM benefit of pharmacological LDL-C
lowering in true primary prevention. Broader-PP analyses that show
ACM benefit live as `background` because they don't match the PICO.
The synthesis names this explicitly rather than hiding it.
3. **Of 20 Pass-2 candidates, 11 were retrievable at full text** and
two more were usable from their abstracts alone. The remaining seven,
paywalled with no extractable numerics, are named in the synthesis's
Limitations section, so the reader can see exactly where the corpus
is thin (CTT overestimation abstract, JUPITER reanalysis,
Tonelli/Ray-era PP meta-analyses, BMJ 2021 statin adverse-events
meta-analysis).

## What to read in what order

1. `protocol.md` — to see what was agreed BEFORE any search ran.
2. `run.log` — to see the corpus shape at the end of Pass-1.
3. `synthesis.md` — to see the no-fabrication output.
2. `run.log` — to see the corpus shape at each phase.
3. `synthesis.docx` (or `synthesis.pdf` / `synthesis.md`) — the
no-fabrication output.

## Reproducing

@@ -58,9 +64,10 @@ PubMed, arXiv, Europe PMC, Crossref, and Unpaywall. Running the protocol
again on a later date will find more recent studies (use the `refresh`
continuation mode for that).

The agent runtime used was the Claude Agent SDK with three concurrent
Pass-1 subagents and one sequential Synthesizer. Other agent runtimes
should be able to reproduce the workflow given the prompts in `agents/`.
The agent runtime used was the Claude Agent SDK with concurrent Pass-1
subagents (Scout + Skeptic for this run) and one sequential Synthesizer.
Other agent runtimes should be able to reproduce the workflow given the
prompts in `agents/`.

## A caveat

155 changes: 83 additions & 72 deletions examples/cholesterol_primary_prevention/run.log
@@ -1,112 +1,123 @@
==============================================================================
deep_research run log — cholesterol primary prevention
==============================================================================
project slug: cholesterol-primary-prevention
created at: 2026-06-01T10:42Z
completed at: 2026-06-01T18:55Z
agent runtime: Claude Agent SDK, three concurrent Pass-1 subagents,
one sequential Synthesizer.
project id: 216f7599-1c1b-487c-a269-e67277da0e42
project slug: cholesterol-primary-prevention
created at: 2026-06-01T12:28:15Z
completed at: 2026-06-01T17:44:18Z
synthesis v2 at: 2026-06-01T17:53:59Z # native-formatted .docx supersedes
# the plain-text v1 doc
total cost (USD): $42.85

agent runtime: Claude Agent SDK
roles: Scout + Skeptic (Pass-1 concurrent),
Synthesizer (Phase 4 sequential)

------------------------------------------------------------------------------
Phase 1 — Protocol pre-registration
------------------------------------------------------------------------------
Status: planned → protocol_gated
Gate 1: approved by user 2026-06-01T11:14Z
Gate 1: approved by user

The pre-registered protocol (see protocol.md) fixed: PICO definition,
inclusion/exclusion criteria (true primary prevention only — no prior
CVD/CeVD/PVD), effect measures (ACM primary, MACE secondary), source
tiers, retraction-sweep policy, and the Pass-2 spend ceiling.

------------------------------------------------------------------------------
Phase 2 — Pass-1 triage
------------------------------------------------------------------------------
Status: protocol_gated → pass1_running

Scout
queries_run: 42 (7 protocol queries × 6 sources)
unique_hits_kept: 89
sources_with_zero_hits: ["arxiv"] # expected — clinical question

Skeptic
counter_query_count: 18
refutes_kept: 11
mixed_kept: 7
retractions_found: 2 # one 2019 statin meta-analysis, one 2021
# PCSK9 cohort study; both written as
# refute-side rows referencing the
# original DOI.

Methodologist
rows_graded: 107 # 89 scout + 18 skeptic
effect_sizes_extracted: 23
rows_flagged_unclear: 14 # all flagged with notes='unclear from abstract'

Corpus rollup after Pass-1 reconciliation:
by stance:
background: 71
supports: 0 # nothing claimed support yet — Phase 4 sets stance
refutes: 13 # skeptic + retractions
mixed: 7
by tier:
tier 1: 58
tier 2: 4
tier 3: 29
tier 4: 0 # blogs/news intentionally excluded by protocol
retractions: 2
Corpus rollup after Pass-1 (recorded in research_evidence):
total rows: 135
retrieved_by_role (distinct): 2 # scout, skeptic
by stance (post-synthesis update):
background: 125 # not directly tested against PICO
refutes: 4 # quote-spanned numerics show null/harm
mixed: 6 # quote-spanned MACE↓ but ACM null
supports: 0 # by design — no row directly supports
# ACM benefit in strict primary
# prevention; broader-PP analyses
# that show ACM benefit live as
# background because they pool
# primary + secondary populations
by source tier:
tier 1 (peer-reviewed): 129
tier 2 (preprint / gov): 2
tier 3 (grey lit / abstract): 4
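
The rollup above can be sanity-checked mechanically: both the stance breakdown and the tier breakdown should sum to the total row count. The sketch below is one way to do that. The counts are copied from the log, but the dictionary layout and the `check_rollup` helper are assumptions made for illustration, not part of the project's actual tooling.

```python
# Counts copied from the rollup in the log; the dict layout is illustrative only.
rollup = {
    "total_rows": 135,
    "by_stance": {"background": 125, "refutes": 4, "mixed": 6, "supports": 0},
    "by_tier": {"tier_1": 129, "tier_2": 2, "tier_3": 4},
}


def check_rollup(r):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for key in ("by_stance", "by_tier"):
        subtotal = sum(r[key].values())
        if subtotal != r["total_rows"]:
            problems.append(
                f"{key} sums to {subtotal}, expected {r['total_rows']}"
            )
    return problems


problems = check_rollup(rollup)
print(problems or "rollup is internally consistent")
```

A check like this is cheap to run after any stance update, since reassigning a row's stance should move a count between buckets without changing the total.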

------------------------------------------------------------------------------
Gate 2 — Pass-2 spend approval
------------------------------------------------------------------------------
Gate 2: approved by user 2026-06-01T14:02Z
Approved Pass-2 candidate count: 18 (capped at pass2_max_full_text_retrievals=20)
Gate 2: approved by user
Pass-2 candidate cap (protocol): 20
Pass-2 candidates approved: 20

------------------------------------------------------------------------------
Phase 3 — Pass-2 full-text retrieval
------------------------------------------------------------------------------
Status: pass1_gated → pass2_running

candidates_selected: 18
retrieved_full_text: 13 # OA via Unpaywall + crossref direct
abstract_only: 3 # paywalled, no OA copy
unavailable: 2 # DOI resolution failed
candidates selected: 20
retrieved at full text: 11 # OA via Unpaywall + crossref direct
abstract-only: 2 # paywalled, rich abstract usable
unavailable: 7 # paywalled with no extractable numerics
# or DOI resolution failed
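
The three retrieval buckets above should partition the selected candidates, and the full-text rate quoted later in the log follows from them. A minimal sketch of that arithmetic, assuming a plain dictionary of the logged counts (the `full_text_rate` helper is hypothetical):

```python
from fractions import Fraction

# Counts copied from the Phase 3 section of the log.
pass2 = {"selected": 20, "full_text": 11, "abstract_only": 2, "unavailable": 7}


def full_text_rate(p):
    """Check that the buckets partition the candidates, then return the rate."""
    bucket_total = p["full_text"] + p["abstract_only"] + p["unavailable"]
    if bucket_total != p["selected"]:
        raise ValueError(
            f"buckets sum to {bucket_total}, but {p['selected']} were selected"
        )
    return Fraction(p["full_text"], p["selected"])


rate = full_text_rate(pass2)
print(f"{rate.numerator}/{rate.denominator} = {float(rate):.0%}")  # 11/20 = 55%
```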

------------------------------------------------------------------------------
Phase 4 — Synthesis
------------------------------------------------------------------------------
Status: pass2_running → synthesizing

Synthesizer (one sequential subagent):
tool_calls_used: 38 / 50
claims_drafted: 24
claims_cut: 3 # cited rows' quote_span did not contain the
# paraphrased number after full-text retrieval.
# Cut claims rewritten as
# 'evidence is insufficient to determine X'.
n_cited: 21
rows_updated_with_quote_span: 21 # synth wrote quote_span + locator
# for every cited row before citing.
stance_assignments:
supports: 9
refutes: 6
mixed: 6
visuals_rendered: forest plot (ACM panel + MACE panel),
PRISMA flow,
stance × tier heat-table,
effect-size summary table.
complete: true
evidence rows updated with verbatim quote_span: 7
rows directly cited in synthesis: 7 # every cited row carries
# a verbatim quote_span
# plus locator
claims drafted but cut (failed verbatim check): several
# rewritten as
# 'evidence is
# insufficient to
# determine X' rather
# than paraphrased
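
The verbatim check described above can be approximated as follows: a claim passes only if every numeric token it contains also appears, character for character, in the cited row's `quote_span`. This is a sketch under stated assumptions, not the Synthesizer's actual implementation. The regex, both helper functions, and the example strings (including their numbers) are invented for illustration.

```python
import re

# Matches integers and decimals, with an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def numbers_in(text):
    """Return the set of numeric tokens that appear in the text."""
    return set(NUMBER.findall(text))


def passes_verbatim_check(claim, quote_span):
    """True only if every number in the claim appears verbatim in quote_span.

    A claim with no numbers passes trivially; a real check would likely
    also compare the surrounding wording.
    """
    return numbers_in(claim) <= numbers_in(quote_span)


# Made-up example strings; the figures are NOT taken from any study.
quote_span = "relative risk 0.90 (95% CI 0.80 to 1.01)"
print(passes_verbatim_check("relative risk was 0.90", quote_span))
print(passes_verbatim_check("absolute risk reduction of 1.2%", quote_span))
```

The second claim fails because its number is a derivation that no quoted span states. The log handles that case by rewriting the claim as "evidence is insufficient to determine X".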

visuals rendered:
- effect-size summary table (10 rows, 7 unique studies)
- forest plot (ACM panel + MACE panel)
- stance × tier heat-table
- PRISMA flow (135 → 20 → 11 → 7)

document features:
- native heading hierarchy (no ASCII separators)
- native Word tables (not paragraph-rendered)
- intense-quote blockquotes for every cited row
- .docx uploaded via gdrive(convert_to_doc=true) for fidelity

------------------------------------------------------------------------------
Phase 5 — Closure
------------------------------------------------------------------------------
Status: synthesizing → complete
Synthesis doc: see synthesis.md / synthesis.docx in this directory.
Synthesis artifacts (this directory):
synthesis.docx — native Word, embedded forest plot + PRISMA + heat-table
synthesis.pdf — PDF export of the same
synthesis.md — content-faithful markdown (image-free)

------------------------------------------------------------------------------
Notes worth keeping for future runs
------------------------------------------------------------------------------
- Skeptic counter-queries on "fails to replicate" and "publication bias"
surfaced two papers that Scout's straight queries missed even at 6 sources.
- Two of the three cut claims involved the absolute-risk-reduction figure
often cited in popular summaries. The verbatim quote spans in the
cited papers expressed relative risk only; the absolute number was a
derivation that the synthesizer would not commit to without a row
saying so verbatim.
- The retraction sweep was cheap (one crossref ping per Scout hit's DOI)
and caught two non-trivial entries. Worth keeping as a fixed Skeptic
task on every run.
- Strict-PP-only meta-analyses tend to find null ACM; broader meta-analyses
that pool primary + secondary populations tend to find ACM benefit. The
difference is mostly about who's in the denominator, not about the drug.
- Pass-2 full-text retrieval rate was 11/20 (55%). Paywalls dominated the
unavailable bucket. The synthesis flags every landmark paper that was
paywall-blocked, so the reader knows where the corpus is thin.
- The supports stance ended empty by design. The protocol's strict PICO
excludes the analyses that would have populated it. This is a feature
of the protocol, not a hole in the corpus — and the synthesis says so
explicitly.
- Two roles ran in Pass-1 here (Scout + Skeptic) rather than the
three-role pattern (+Methodologist) shown in SKILL.md. Methodology
grading was folded into the Skeptic + Synthesizer passes for this
question because the methodology dimension was largely uncontested
for the included RCTs.
Binary file not shown.