sensein/ohbm2026


OHBM 2026 Pipeline

This repository builds a local OHBM 2026 abstract corpus from the Oxford Abstracts GraphQL API and carries it through figure enrichment, reference matching, embeddings, clustering, and a static search UI.

This README is the runbook for a person or agent who needs to go from the original abstract download to the current latest step.

Repository home: https://github.com/sensein/ohbm2026

Project conventions that should not be violated live in CONSTITUTION.md, including these rules:

  • Python work stays inside the repository-local .venv
  • recorded experiment runs write to fresh directories instead of overwriting prior outputs
  • behavior-changing work stays plan-first and test-driven
  • secrets never get copied into the repo or logs

This README is the operational runbook, not the full project charter. For the repo-level intent, reproducibility model, authoritative defaults, key decisions, and experiment history, start with docs/reproducibility-vision.md.

If you only read one document before changing behavior, read docs/reproducibility-vision.md first.

Catalogs for the rest of the repository:

Recommended reading order for a new person or agent:

  1. docs/reproducibility-vision.md
  2. README.md
  3. docs/README.md
  4. CONSTITUTION.md
  5. memory/summary.md
  6. the specific plan or experiment README closest to the work you are changing

What The Pipeline Produces

Core artifacts:

  • data/inputs/abstracts_graphql__<state-key>.json
    • GraphQL-fetched source snapshot for the latest ingest run
  • data/primary/abstracts.json
    • canonical normalized accepted abstracts derived from the fetched snapshot
  • data/inputs/assets/
    • downloaded local figure files, restricted to methods/results figures
  • data/cache/figure_analysis/image_analyses_<backend>__<state-key>.json
    • resumable figure-analysis cache with direct state-key lookup
  • data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json
    • resumable cllm claim-extraction cache with direct state-key lookup
  • data/outputs/experiments/title_audit/title_modifications.json
    • audit log of cleaned abstract titles versus original raw titles
  • data/primary/abstracts_enriched.json
    • enriched abstract corpus with markdown sections, figure analyses, and claim extraction when available
  • data/primary/reference_metadata.json
    • OpenAlex-matched reference metadata
  • data/outputs/experiments/embeddings/*
    • canonical embedding bundles, stage-2 projections, and neighbors
  • data/outputs/experiments/*__<state-key>/
    • clustering, projection, and other experiment-style derived outputs
  • data/outputs/exported-sites/ui-site__<state-key>/
    • local exported-site bundle before optional publish mirroring
  • data/outputs/proposals/*__<state-key>/
    • proposal bundles and proposal-adjacent analysis outputs
  • export/ui-site/
    • optional publish mirror of the latest exported-site bundle

Local artifact layout rules:

  • data/inputs/ is for fetched snapshots, API-derived inputs, and manual operator-supplied inputs
  • data/primary/ is for canonical normalized datasets consumed by downstream stages
  • data/cache/ is for resumable caches and checkpoints
  • data/outputs/experiments/, data/outputs/exported-sites/, and data/outputs/proposals/ are for local derived outputs
  • archive/ is for local pre-migration backups that preserve legacy paths
  • data/, export/, and tmp/ remain ignored by git
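The fresh-directory rule can be sketched with a small helper (hypothetical code, not part of the repo) that timestamps each run and refuses to overwrite an existing one:

```python
from datetime import datetime, timezone
from pathlib import Path

def fresh_run_dir(root, name):
    """Create a new timestamped run directory instead of reusing an old one."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(root) / f"{name}__{stamp}"
    # fail loudly rather than overwrite a prior recorded run
    path.mkdir(parents=True, exist_ok=False)
    return path
```

Anything shaped like this keeps prior experiment outputs immutable, which is the CONSTITUTION.md expectation for recorded runs.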

Current Latest Step

The latest end state of the project is:

  1. accepted abstracts downloaded locally
  2. methods/results figures downloaded and linked
  3. OpenAI figure text promoted into the main enriched abstract dataset
  4. reference metadata matched with OpenAlex where possible
  5. multiple embedding bundles generated
  6. published NeuroScape stage-2 applied to Voyage embeddings
  7. clustering benchmarks run on embedding bundles
  8. static UI built with:
    • lexical search
    • browser-side semantic search
    • facets
    • UMAP selection
    • two semantic cluster lenses:
      • 25-cluster benchmark
      • claims 28-cluster benchmark

External Requirements

Required:

  • python 3.11+
  • uv

Optional, depending on which branch of the pipeline you run:

  • ollama
  • local Ollama model qwen3.5:35b
  • Hugging Face access for downloading sentence-transformer models
  • OpenAI API access for hosted figure analysis, OpenAI embeddings, and cllm claim extraction
  • Voyage API access for Voyage embeddings
  • OpenAlex API key for authenticated reference matching

Environment Variables

Create .env from .env.sample.

Common keys:

  • OHBM2026_API
    • required for Oxford Abstracts ingest and author lookup
  • OPENAI_API_KEY
    • required for OpenAI figure analysis, OpenAI embeddings, extract-claims with OpenAI, and OpenAI-backed reference splitting
  • ANTHROPIC_API_KEY
    • required only if extract-claims is run with --llm-provider anthropic
  • VOYAGE_API
    • required for Voyage embeddings
  • OPENALEX_API
    • optional but recommended for reference enrichment
  • HF_TOKEN
    • optional for Hugging Face model downloads

No API key is needed for local Ollama figure analysis.

Treat .env and shell environment variables as the only valid homes for these secrets. Do not commit tokens, paste them into docs, or leave them in command logs.
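A minimal preflight sketch (the helper and the key list are illustrative, not part of ohbmcli) that fails fast when a required key is absent from the environment:

```python
import os

def check_required_env(keys):
    """Return the subset of `keys` that is missing or empty in the environment."""
    return [key for key in keys if not os.environ.get(key)]

# Example: keys a hosted figure-analysis run would need.
missing = check_required_env(["OHBM2026_API", "OPENAI_API_KEY"])
if missing:
    print(f"missing keys: {', '.join(missing)}")
```

Running a check like this before a long hosted run avoids a partial cache built against a half-configured environment.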

Token And Tool Matrix

Use this as the quick answer to "what do I need before I run this step?"

| Workflow | Required secret(s) | Extra local tool(s) | Notes |
| --- | --- | --- | --- |
| ohbmcli ingest | OHBM2026_API | none | Fetches accepted abstracts and figure assets |
| ohbmcli refresh-assets | none | none | Uses the existing local normalized corpus |
| ohbmcli authors | OHBM2026_API | none | Pulls author details from Oxford Abstracts |
| ohbmcli analyze-figures --vision-backend openai | OPENAI_API_KEY | none | Current preferred figure-analysis route |
| ohbmcli analyze-figures --vision-backend ollama | none | ollama, qwen3.5:35b | No hosted token required |
| ohbmcli extract-claims | OPENAI_API_KEY or ANTHROPIC_API_KEY | cllm installed in .venv | Provider depends on --llm-provider |
| ohbmcli enrich | none | none | Consumes local caches only |
| ohbmcli title-audit | none | none | Reads local normalized corpus only |
| ohbmcli reference-metadata | optional OPENALEX_API; OPENAI_API_KEY only if using OpenAI reference splitting | none | OPENALEX_API is recommended for authenticated reference matching |
| ohbmcli embed-minilm / embed-hf | optional HF_TOKEN | sentence-transformers | HF_TOKEN is only needed for gated/private Hub access |
| ohbmcli embed-openai | OPENAI_API_KEY | none | Hosted embedding route |
| ohbmcli embed-voyage | VOYAGE_API | none | Voyage embedding route |
| ohbmcli apply-published-stage2 / embed-stage2 | none | local model dependencies already in .venv | Uses local artifacts |
| ohbmcli semantic-analysis / cluster-benchmark / umap-plot / compare-projections / optimize-projections | none | optional plotly, umap-learn | Purely local once embeddings exist |
| scripts/optimize_poster_layout.py / scripts/analyze_poster_layout.py | none | none | Uses local proposal inputs, authors, and layout assets |
| poster sequencing scripts under scripts/ | none | none | Use local proposals and embeddings |
| ohbmcli export-ui / build-ui | none | none | Consumes local corpora, caches, clusters, and manual inputs |

Setup

Do not use system Python in this repo. Create or refresh .venv with uv, and run Python commands through .venv/bin/python or through uv pointed at that interpreter.

Create the virtual environment and run tests:

UV_CACHE_DIR=.uv-cache uv venv --python 3.11 .venv
PYTHONPATH=src .venv/bin/python -m unittest discover -s tests -v

Optional Python packages by workflow:

MiniLM or HF embeddings:

UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python sentence-transformers

Interactive projections:

UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python plotly umap-learn

Claim extraction:

UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python git+https://github.com/OpenEvalProject/cllm.git

Headless layout review:

UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python ".[review]"
PYTHONPATH=src .venv/bin/python -m playwright install chromium
PYTHONPATH=src .venv/bin/python scripts/check_layout_review.py

For local figure analysis, confirm Ollama can see the required model:

ollama list

Recommended Sequences

Pick the sequence that matches what you are trying to regenerate.

Full Rebuild To The Static UI

Run these in order when rebuilding the main deliverable from upstream data:

  1. ohbmcli ingest
  2. ohbmcli authors
  3. ohbmcli analyze-figures --vision-backend openai
  4. ohbmcli extract-claims
  5. ohbmcli enrich
  6. ohbmcli title-audit
  7. ohbmcli reference-metadata --use-title-search
  8. one or more embedding commands such as embed-minilm, embed-voyage, or embed-openai
  9. ohbmcli apply-published-stage2 if you want the published Voyage stage-2 space
  10. ohbmcli semantic-analysis, cluster-benchmark, umap-plot, or compare-projections for the cluster and projection products you want the UI to consume
  11. ohbmcli export-ui or ohbmcli build-ui

Add Or Refresh A Cluster Family

Use this when you already have the corpora and want a new cluster output:

  1. confirm the required embedding bundle exists under data/outputs/experiments/embeddings/
  2. run ohbmcli semantic-analysis for community-detection style outputs
  3. run ohbmcli cluster-benchmark for k-sweep style outputs
  4. optionally run scripts/evaluate_label_systems.py to compare a new cluster family against the submitter taxonomy
  5. point export-ui, build-ui, or layout scripts at the new cluster directory

Generate Or Refresh A Layout Proposal

Use this when you want a new organizer-facing proposal:

  1. confirm data/primary/abstracts.json, data/inputs/authors.json, and data/inputs/poster_layout/layout_assets/layout_geometry.json exist
  2. choose the embedding bundle and any claims/layout cluster inputs you want to drive the proposal
  3. run scripts/optimize_poster_layout.py into a fresh proposal directory under data/outputs/proposals/
  4. run scripts/analyze_poster_layout.py on that proposal
  5. optionally run comparison or review scripts against multiple proposal directories

Run A Poster Sequencing Experiment

Use this when you already have a base proposal and want comparative sequencing evidence:

  1. pick a base proposal under data/outputs/proposals/
  2. run one of the sequencing scripts under scripts/ into a fresh dated experiment directory under experiments/ or a fresh local output root under data/outputs/proposals/
  3. keep the experiment outputs immutable and compare them rather than overwriting the active proposal set

Minimal UI Refresh

Use this when the data products already exist locally:

  1. rerun only the upstream steps that changed
  2. rerun ohbmcli export-ui or ohbmcli build-ui
  3. do not rerun hosted/API steps unless their inputs or parameters changed

End-To-End Workflow

Use ohbmcli for the corpus, enrichment, embedding, clustering, and UI pipeline. Use the script wrappers under scripts/ for proposal generation, layout analysis, and sequencing experiments.

1. Download The Raw Abstracts And Figures

This is the canonical starting point.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli ingest

What it does:

  • fetches accepted abstracts from Oxford Abstracts
  • stores the normalized corpus in data/primary/abstracts.json
  • downloads only methods/results figure images
  • writes local figure links into each abstract

Important behavior:

  • retries use an exponential backoff schedule starting at 100 ms and capped at 10 s
  • figure downloads are reuse-aware
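The retry schedule can be sketched as follows (a hedged illustration of "exponential, starting at 100 ms, capped at 10 s"; the real logic lives in src/ohbm2026/graphql_api.py and may differ in detail):

```python
def retry_delays(attempts, base=0.1, cap=10.0):
    """Exponential backoff schedule in seconds: 0.1, 0.2, 0.4, ... capped at 10."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]
```

The first delay is 0.1 s, and the schedule saturates at the 10 s cap after seven doublings.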

2. Refresh Assets Without Rerunning Abstract Extraction

Use this if the raw JSON already exists and you only need to rebuild or prune local figure links.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli refresh-assets --reuse-existing-assets-only

3. Export Authors

Optional, but useful if you want a separate author database.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli authors

Output:

  • data/inputs/authors.json

4. Run Figure Analysis

There are two supported routes.

Route A: OpenAI figure analysis

This is the current preferred route for the main enriched corpus.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli analyze-figures \
  --vision-backend openai \
  --openai-model gpt-4.1-mini \
  --enriched-output data/outputs/experiments/enrichment/abstracts_enriched_openai.json

Notes:

  • the cache is incremental and resumable
  • current code batches OpenAI image requests for better throughput
  • finished analyses are written as they complete
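The incremental, resumable cache behavior can be sketched like this (a simplified pattern; the actual cache keys on the corpus state key and backend):

```python
import json
from pathlib import Path

def analyze_with_cache(items, cache_path, analyze):
    """Run `analyze` only for items missing from the JSON cache,
    checkpointing the cache after each completed analysis."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    for item_id, payload in items.items():
        if item_id in cache:
            continue  # already analyzed in a previous run
        cache[item_id] = analyze(payload)
        path.write_text(json.dumps(cache, indent=2))  # write as we go
    return cache
```

Because each result is written as it completes, an interrupted run loses at most the in-flight item.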

Route B: Local Ollama figure analysis

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli analyze-figures \
  --vision-backend ollama \
  --vision-model qwen3.5:35b

5. Build The Main Enriched Abstract Dataset

This step converts abstract content to ordered markdown and merges figure analysis plus any cached claim extraction back into the canonical enriched corpus.

Current default:

  • enrich now defaults to the OpenAI figure-analysis cache under data/cache/figure_analysis/
  • enrich now also defaults to the cllm claim cache under data/cache/claim_analysis/

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli enrich

Explicit form:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli enrich \
  --input data/primary/abstracts.json \
  --image-analyses-input data/cache/figure_analysis/image_analyses_openai__<state-key>.json \
  --claim-analyses-input data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json \
  --enriched-output data/primary/abstracts_enriched.json

Output:

  • data/primary/abstracts_enriched.json

This is the main corpus used by downstream steps.

6. Audit And Clean Display Titles

The raw Oxford Abstracts export is kept unchanged, but downstream consumers now normalize obvious title issues such as leading bullets, wrapping quotes, and stray outer whitespace.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli title-audit \
  --input data/primary/abstracts.json \
  --output data/outputs/experiments/title_audit/title_modifications.json

Output:

  • data/outputs/experiments/title_audit/title_modifications.json

This file records each changed title with the original string, cleaned title, and normalization reasons.
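The normalization described above can be sketched as follows (illustrative rules only; the actual cleaning logic and audit format live in the title-audit code):

```python
def clean_title(raw):
    """Strip leading bullets, wrapping quotes, and stray outer whitespace."""
    title = raw.strip().lstrip("\u2022*- ").strip()
    pairs = [('"', '"'), ("'", "'"), ("\u201c", "\u201d")]
    for open_q, close_q in pairs:
        if len(title) > 1 and title.startswith(open_q) and title.endswith(close_q):
            title = title[1:-1].strip()
    return title

def audit_record(raw):
    """Return an audit entry when cleaning changed the title, else None."""
    cleaned = clean_title(raw)
    if cleaned == raw:
        return None
    return {"original": raw, "cleaned": cleaned}
```

Keeping the raw export untouched and recording only the delta is what makes the audit file a safe, replayable log.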

7. Match References With OpenAlex

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli reference-metadata \
  --input data/primary/abstracts.json \
  --output data/primary/reference_metadata.json \
  --use-title-search

Output:

  • data/primary/reference_metadata.json

Generation of this file is resumable and checkpoint-friendly: a rerun continues from the existing output instead of starting over.

Reference resolution now follows this order:

  • markdown normalization of the raw references field
  • LLM-assisted splitting of the full reference markdown block, validated against the source text
  • exact DOI -> OpenAlex
  • exact PMID -> OpenAlex
  • direct OpenAlex title search for references with a title
  • Semantic Scholar full-reference search only for references that still have neither DOI nor title
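The resolution order can be sketched as a cascade (the lookup callables here are hypothetical stand-ins for the real API clients in src/ohbm2026/openalex.py):

```python
def resolve_reference(ref, by_doi, by_pmid, by_title, full_search):
    """Try resolution strategies in order, returning the first hit."""
    if ref.get("doi"):
        hit = by_doi(ref["doi"])
        if hit:
            return hit
    if ref.get("pmid"):
        hit = by_pmid(ref["pmid"])
        if hit:
            return hit
    if ref.get("title"):
        hit = by_title(ref["title"])
        if hit:
            return hit
    # last resort: full-reference search only when neither DOI nor title exists
    if not ref.get("doi") and not ref.get("title"):
        return full_search(ref["reference"])
    return None
```

The ordering matters: exact identifiers are cheap and unambiguous, so the expensive fuzzy search is reserved for references that carry neither a DOI nor a title.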

Current operational notes:

  • the OpenAI splitter runs one request per abstract attempt and can be driven concurrently
  • failed or invalid splits can be requeued and retried before falling back to a single-block record
  • OpenAlex title search can also run concurrently with an explicit requests-per-second cap
  • OpenAlex /rate-limit is the best way to inspect current search budget before a long rerun
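The explicit requests-per-second cap can be sketched as a simple thread-safe pacer (illustrative only; the CLI's actual limiter may be implemented differently):

```python
import threading
import time

class RatePacer:
    """Block callers so calls average at most max_rps per second (thread-safe)."""
    def __init__(self, max_rps):
        self.interval = 1.0 / max_rps
        self.next_slot = 0.0
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            now = time.monotonic()
            sleep_for = self.next_slot - now
            self.next_slot = max(now, self.next_slot) + self.interval
        if sleep_for > 0:
            time.sleep(sleep_for)
```

Concurrent title-search workers would call `pacer.wait()` before each request, which keeps the aggregate rate under the cap regardless of worker count.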

Useful options:

  • --no-doi-discovery
    • skip the Semantic Scholar full-reference DOI-discovery fallback
  • --no-llm-reference-splitting
    • skip the OpenAI/Ollama splitting pass and fall back to local markdown heuristics
  • --reference-splitting-backend openai
    • use OpenAI for the splitting helper
  • --reference-splitting-model gpt-5-nano
    • model used for reference structuring; defaults to gpt-5-nano
    • the OpenAI backend uses the Responses API with a strict JSON schema, producing output of the form {"references": [{"reference": ..., "title": ..., "doi": ...}]}
    • extracted title and doi values are only used downstream if they are lexically present in the returned reference text
  • --split-concurrency 500
    • number of in-flight OpenAI reference-splitting requests during collect
  • --split-max-requeues 5
    • maximum retries for failed or invalid split attempts before falling back to a single merged block
  • --title-concurrency 50
    • number of concurrent OpenAlex title-search workers
  • --title-max-rps 90
    • soft request-rate cap for OpenAlex title search; useful for staying below OpenAlex short-window throttle limits
  • --doi-discovery-similarity-threshold 0.8
    • minimum title similarity required before accepting a discovered DOI
  • --delay-seconds 1.05
    • pacing for sequential fallback phases such as Semantic Scholar DOI discovery
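The lexical-presence rule for extracted title and doi values described above can be sketched as (a simplified check, not the project's exact validation):

```python
def validate_split(record):
    """Keep extracted title/doi only if they literally appear in the
    returned reference text; drop fields that fail the check."""
    ref_text = record.get("reference", "")
    validated = {"reference": ref_text}
    for field in ("title", "doi"):
        value = record.get(field)
        if value and value.lower() in ref_text.lower():
            validated[field] = value
    return validated
```

This guards against the model hallucinating a title or DOI that was never present in the source reference string.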

If a completed reference map still contains fallback split cases, rerun only those abstracts and merge the repaired results back into the existing output:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli reference-metadata \
  --input data/primary/abstracts.json \
  --output data/primary/reference_metadata.json \
  --repair-failed-splits-from data/primary/reference_metadata.json \
  --use-title-search \
  --reference-splitting-backend openai \
  --reference-splitting-model gpt-5-nano

8. Optional Claim Extraction

If you want claim lists over the abstracts, run this after figure analysis so cached figure notes can be included in the cllm manuscript.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli extract-claims

What it does:

  • reads data/primary/abstracts.json
  • reads the OpenAI figure-analysis cache under data/cache/figure_analysis/ by default so figure-analysis text can be appended when present
  • builds a manuscript from the title, introduction, methods, results, discussion, conclusion, and filtered additional-content fields
  • excludes references and acknowledgements from the claim prompt
  • writes a resumable cache under data/cache/claim_analysis/

Current default OpenAI path:

  • provider: openai
  • model: gpt-4o-2024-08-06

Useful explicit form:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli extract-claims \
  --input data/primary/abstracts.json \
  --image-analyses-input data/cache/figure_analysis/image_analyses_openai__<state-key>.json \
  --claim-analyses-output data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json \
  --openai-model gpt-4o-2024-08-06

If you want the claims to appear in the UI, rerun:

  • enrich
  • build-ui

9. Generate Embeddings

Pick one or more embedding routes.

MiniLM:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm

OpenAI:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-openai

Voyage:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-voyage

Hugging Face model:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-hf \
  --model neuml/pubmedbert-base-embeddings

Embedding text is built on demand from:

  • title
  • claims
  • introduction
  • methods
  • results
  • conclusion

You can override the fields at runtime, for example:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm \
  --fields title methods results

To build a claims-only embedding bundle:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm \
  --fields claims \
  --output-name minilm_claims

This uses claim_extraction.claims from data/primary/abstracts_enriched.json and formats each extracted claim as a short bullet containing the claim statement itself.
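Building the embedding text can be sketched as follows (field handling mirrors the README; the per-claim statement key name is an assumption, not confirmed project schema):

```python
def build_embedding_text(abstract, fields):
    """Concatenate the selected fields into one embedding input string."""
    parts = []
    for field in fields:
        if field == "claims":
            claims = abstract.get("claim_extraction", {}).get("claims", [])
            # "statement" is a hypothetical key for the claim text
            parts.extend(f"- {claim['statement']}" for claim in claims)
        elif abstract.get(field):
            parts.append(abstract[field])
    return "\n\n".join(parts)
```

With `--fields claims` the input reduces to the bullet list alone, which is what makes the minilm_claims bundle a claims-only semantic space.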

10. Apply Or Train Stage 2

Apply the published NeuroScape stage-2 model

Use this when you have a compatible Voyage stage-1 bundle.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli apply-published-stage2

Train a local stage-2 model

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-stage2

11. Build Semantic Analysis And Cluster Outputs

Community detection over an embedding bundle:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli semantic-analysis \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published

Clustering benchmark over an embedding bundle:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli cluster-benchmark \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --output-dir data/outputs/experiments/clustering_benchmark__<state-key>

To benchmark a claims-only bundle around 25-30 clusters:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli cluster-benchmark \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --output-dir data/outputs/experiments/clustering_benchmark_claims_25_30__<state-key> \
  --k-min 25 \
  --k-max 30

This is the current claims-cluster artifact consumed by the UI. The latest run selected a 28-cluster k-means solution inside that benchmark output.
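The k-sweep idea can be sketched with a tiny hand-rolled k-means (illustration only; the real cluster-benchmark uses production clustering code and scores more than inertia):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers

def sweep_k(points, k_min, k_max):
    """Score each k in the range by within-cluster inertia (lower is better)."""
    results = {}
    for k in range(k_min, k_max + 1):
        labels, centers = kmeans(points, k)
        results[k] = float(((points - centers[labels]) ** 2).sum())
    return results
```

Inertia always falls as k grows, so a real sweep also weighs silhouette or stability before fixing a single k such as the 28-cluster solution above.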

If you want to score a new cluster family against the submitter taxonomy:

PYTHONPATH=src .venv/bin/python scripts/evaluate_label_systems.py \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --raw-input data/primary/abstracts.json \
  --label-system submitter_parent \
  --label-system submitter_exact \
  --label-system candidate=data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark/cluster_assignments.json \
  --output-dir data/outputs/experiments/embeddings/voyage_stage2_published/category_evaluation

Projection outputs:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli umap-plot
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli compare-projections
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli optimize-projections

12. Generate And Analyze Layout Proposals

The stable route for proposal generation currently lives in the script wrappers under scripts/, not in ohbmcli.

Generate a fresh proposal bundle:

PYTHONPATH=src .venv/bin/python scripts/optimize_poster_layout.py \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-dir data/outputs/proposals/layout_claims__<fresh-run-name>

Analyze that proposal:

PYTHONPATH=src .venv/bin/python scripts/analyze_poster_layout.py \
  --assignment data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output data/outputs/proposals/layout_claims__<fresh-run-name>/analysis.json

To drive the layout with a learned label system instead of the submitter taxonomy, add:

  • --layout-cluster-assignments <cluster_assignments.json>
  • --layout-cluster-summaries <cluster_summaries.json>
  • --layout-label-system <name>

Use a fresh --output-dir whenever the layout label system, embeddings, or weights change. The default output-root hash does not encode every proposal option.

13. Run Poster Sequencing And Proposal Experiments

Once a base proposal exists, the sequencing and comparison workflows are also script-driven. Write these outputs to fresh experiment directories or fresh proposal output roots.

Graph benchmark against an existing proposal:

PYTHONPATH=src .venv/bin/python scripts/benchmark_poster_sequencing.py \
  --proposal data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-root experiments/<date>-poster-sequencing-benchmark/runs/<fresh-run-name>

Advanced non-diffusion global-path experiment:

PYTHONPATH=src .venv/bin/python scripts/run_advanced_global_path_experiment.py \
  --proposal data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-root experiments/<date>-advanced-global-path/runs/<fresh-run-name>

The same pattern applies to scripts/sweep_diffusion_variants.py, scripts/sweep_global_path_variants.py, and scripts/sweep_global_path_mapalign_variants.py: pass explicit current paths for the proposal, corpora, authors, embeddings, and output root rather than relying on older baked-in defaults.

14. Build The Static UI

This is the current latest delivery step.

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli build-ui

The current default UI build uses:

  • data/primary/abstracts.json
  • data/primary/abstracts_enriched.json
  • data/primary/reference_metadata.json
  • the OpenAI figure-analysis cache under data/cache/figure_analysis/
  • data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark
  • data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30
  • data/outputs/experiments/embeddings/minilm_stage1/umap_title-introduction-methods-results-conclusion.json

By default build-ui now writes the local bundle under data/outputs/exported-sites/ui-site__<state-key>/ and mirrors that bundle to export/ui-site/. Pass --site-output-dir or --publish-dir to override one or both locations.

Useful explicit form if you want to point the UI at a different claims-cluster run:

PYTHONPATH=src .venv/bin/python -m ohbm2026.cli build-ui \
  --site-output-dir data/outputs/exported-sites/ui-site__<state-key> \
  --publish-dir export/ui-site \
  --cluster-25-dir data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark \
  --claims-cluster-dir data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30

The exported detail payload now includes:

  • merged claim_extraction from data/primary/abstracts_enriched.json
  • reference_summary from data/primary/reference_metadata.json
  • semantic_25 and claims_28 cluster lenses in the facet and detail metadata

Then serve it locally from the repository root:

.venv/bin/python -m http.server 8000

Open:

  • http://localhost:8000/export/ui-site/

Suggested Minimal Rebuilds

If you already have raw abstracts:

  • rerun figure analysis
  • rerun extract-claims if claim prompts should reflect updated figure analyses
  • rerun enrich
  • rerun build-ui

If you already have figures and only changed UI code:

  • rerun build-ui

If you already have fresh figure analyses and only changed claim extraction:

  • rerun extract-claims
  • rerun enrich
  • rerun build-ui

If you already have embeddings but want new cluster evaluations:

  • rerun cluster-benchmark
  • optionally rerun scripts/evaluate_label_systems.py
  • optionally rerun build-ui

If you specifically want to refresh the claims-based semantic lens:

  • rerun embed-minilm --fields claims --output-name minilm_claims
  • rerun cluster-benchmark --embeddings-dir data/outputs/experiments/embeddings/minilm_claims --output-dir data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30 --k-min 25 --k-max 30
  • rerun build-ui

If you want to regenerate a proposal without touching the corpora:

  • rerun scripts/optimize_poster_layout.py into a fresh data/outputs/proposals/... directory
  • rerun scripts/analyze_poster_layout.py

If you want to rerun sequencing experiments on an existing proposal:

  • pick the proposal JSON under data/outputs/proposals/
  • rerun the relevant script under scripts/ into a fresh experiment run directory

Module Layout

  • src/ohbm2026/graphql_api.py
    • GraphQL access, env loading, batching, retries
  • src/ohbm2026/assets.py
    • abstract ingest and figure asset download/refresh
  • src/ohbm2026/enrichment.py
    • markdown conversion, figure analysis, claim extraction, enrichment assembly
  • src/ohbm2026/openalex.py
    • reference parsing and OpenAlex matching
  • src/ohbm2026/neuroscape.py
    • embeddings, stage-2 paths, semantic analysis, clustering, projections
  • src/ohbm2026/ui.py
    • static UI export/build pipeline
  • src/ohbm2026/cli.py
    • unified CLI entrypoint

Main Outputs By Stage

  • raw ingest
    • data/primary/abstracts.json
    • data/inputs/assets/
  • manual and operator inputs
    • data/inputs/abstracts_with_phenomena_with_theories_refined.csv
    • data/inputs/poster_layout/layout_assets/
  • authors
    • data/inputs/authors.json
  • figure analysis
    • data/cache/figure_analysis/image_analyses_ollama__<state-key>.json
    • data/cache/figure_analysis/image_analyses_openai__<state-key>.json
  • claim extraction
    • data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json
  • enriched corpus
    • data/primary/abstracts_enriched.json
  • audit outputs
    • data/outputs/experiments/title_audit/title_modifications.json
  • references
    • data/primary/reference_metadata.json
  • embeddings and clustering
    • data/outputs/experiments/embeddings/*
  • static site
    • data/outputs/exported-sites/ui-site__<state-key>/
    • optional publish mirror at export/ui-site/

Validation

Default validation command:

PYTHONPATH=src .venv/bin/python -m unittest discover -s tests -v

If an agent is taking over this repo, this should be the first command after setting up the environment.
