This repository builds a local OHBM 2026 abstract corpus from the Oxford Abstracts GraphQL API and carries it through figure enrichment, reference matching, embeddings, clustering, and a static search UI.
This README is the runbook for a person or an agent that needs to go from the original abstract download to the current latest step.
Repository home:
- Git remote origin: `git@github.com:sensein/ohbm2026.git`
- GitHub URL: `github.com/sensein/ohbm2026`
Project conventions that should not be violated live in
CONSTITUTION.md,
including the rules that Python work stays inside the repository-local .venv,
recorded experiment runs write to fresh directories instead of overwriting prior
outputs, behavior-changing work stays plan-first and test-driven, and secrets
never get copied into the repo or logs.
This README is the operational runbook, not the full project charter. For the repo-level intent, reproducibility model, authoritative defaults, key decisions, and experiment history, start with docs/reproducibility-vision.md.
If you only read one document before changing behavior, read docs/reproducibility-vision.md first.
Catalogs for the rest of the repository live in `docs/README.md`.
Recommended reading order for a new person or agent:
- docs/reproducibility-vision.md
- README.md
- docs/README.md
- CONSTITUTION.md
- memory/summary.md
- the specific plan or experiment README closest to the work you are changing
Core artifacts:
- `data/inputs/abstracts_graphql__<state-key>.json` - GraphQL-fetched source snapshot for the latest ingest run
- `data/primary/abstracts.json` - canonical normalized accepted abstracts derived from the fetched snapshot
- `data/inputs/assets/` - downloaded local figure files, restricted to methods/results figures
- `data/cache/figure_analysis/image_analyses_<backend>__<state-key>.json` - resumable figure-analysis cache with direct state-key lookup
- `data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json` - resumable `cllm` claim-extraction cache with direct state-key lookup
- `data/outputs/experiments/title_audit/title_modifications.json` - audit log of cleaned abstract titles versus original raw titles
- `data/primary/abstracts_enriched.json` - enriched abstract corpus with markdown sections, figure analyses, and claim extraction when available
- `data/primary/reference_metadata.json` - OpenAlex-matched reference metadata
- `data/outputs/experiments/embeddings/*` - canonical embedding bundles, stage-2 projections, and neighbors
- `data/outputs/experiments/*__<state-key>/` - clustering, projection, and other experiment-style derived outputs
- `data/outputs/exported-sites/ui-site__<state-key>/` - local exported-site bundle before optional publish mirroring
- `data/outputs/proposals/*__<state-key>/` - proposal bundles and proposal-adjacent analysis outputs
- `export/ui-site/` - optional publish mirror of the latest exported-site bundle
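Many artifact names end in a `<state-key>` suffix. The exact derivation lives in the repo's code; purely as an illustration of the idea, such a key is typically a short deterministic content hash over the inputs that define a run (the function name and hashed fields below are assumptions, not the repo's actual scheme):

```python
import hashlib
import json


def state_key(inputs: dict, length: int = 12) -> str:
    """Derive a short, deterministic key from run-defining inputs."""
    # Canonical JSON so key order does not change the hash.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# The same inputs always map to the same suffix, regardless of dict ordering.
key = state_key({"source": "oxford-abstracts", "event": "ohbm2026"})
assert key == state_key({"event": "ohbm2026", "source": "oxford-abstracts"})
```

The practical consequence is the one the artifact list relies on: reruns with identical inputs resolve to the same cache file, while changed inputs land in a fresh path instead of overwriting prior outputs.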
Local artifact layout rules:
- `data/inputs/` is for fetched snapshots, API-derived inputs, and manual operator-supplied inputs
- `data/primary/` is for canonical normalized datasets consumed by downstream stages
- `data/cache/` is for resumable caches and checkpoints
- `data/outputs/experiments/`, `data/outputs/exported-sites/`, and `data/outputs/proposals/` are for local derived outputs
- `archive/` is for local pre-migration backups that preserve legacy paths
- `data/`, `export/`, and `tmp/` remain ignored by git
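For an agent deciding where a new artifact belongs, the rules above can be encoded as a small prefix lookup (a sketch; the prefixes mirror the list above, the function name is mine):

```python
# Repo-relative path prefix -> purpose, following the layout rules above.
LAYOUT_RULES = {
    "data/inputs/": "fetched snapshots, API-derived inputs, operator-supplied inputs",
    "data/primary/": "canonical normalized datasets",
    "data/cache/": "resumable caches and checkpoints",
    "data/outputs/": "local derived outputs",
    "archive/": "local pre-migration backups",
}


def classify_artifact(path: str) -> str:
    """Return the layout bucket description for a repo-relative path."""
    for prefix, purpose in LAYOUT_RULES.items():
        if path.startswith(prefix):
            return purpose
    raise ValueError(f"no layout rule for {path!r}")


assert classify_artifact("data/primary/abstracts.json") == "canonical normalized datasets"
```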
The latest end state of the project is:
- accepted abstracts downloaded locally
- methods/results figures downloaded and linked
- OpenAI figure text promoted into the main enriched abstract dataset
- reference metadata matched with OpenAlex where possible
- multiple embedding bundles generated
- published NeuroScape stage-2 applied to Voyage embeddings
- clustering benchmarks run on embedding bundles
- static UI built with:
- lexical search
- browser-side semantic search
- facets
- UMAP selection
- two semantic cluster lenses:
  - 25-cluster benchmark
  - claims 28-cluster benchmark
Required:
- `python` 3.11+
- `uv`

Optional, depending on which branch of the pipeline you run:
- `ollama` with a local Ollama model such as `qwen3.5:35b`
- Hugging Face access for downloading sentence-transformer models
- OpenAI API access for hosted figure analysis, OpenAI embeddings, and `cllm` claim extraction
- Voyage API access for Voyage embeddings
- OpenAlex API key for authenticated reference matching
Create `.env` from `.env.sample`.
Common keys:
- `OHBM2026_API` - required for Oxford Abstracts ingest and author lookup
- `OPENAI_API_KEY` - required for OpenAI figure analysis, OpenAI embeddings, `extract-claims` with OpenAI, and OpenAI-backed reference splitting
- `ANTHROPIC_API_KEY` - required only if `extract-claims` is run with `--llm-provider anthropic`
- `VOYAGE_API` - required for Voyage embeddings
- `OPENALEX_API` - optional but recommended for reference enrichment
- `HF_TOKEN` - optional for Hugging Face model downloads
No API key is needed for local Ollama figure analysis.
Treat .env and shell environment variables as the only valid homes for these
secrets. Do not commit tokens, paste them into docs, or leave them in command
logs.
Use this as the quick answer to "what do I need before I run this step?"
| Workflow | Required secret(s) | Extra local tool(s) | Notes |
|---|---|---|---|
| `ohbmcli ingest` | `OHBM2026_API` | none | Fetches accepted abstracts and figure assets |
| `ohbmcli refresh-assets` | none | none | Uses the existing local normalized corpus |
| `ohbmcli authors` | `OHBM2026_API` | none | Pulls author details from Oxford Abstracts |
| `ohbmcli analyze-figures --vision-backend openai` | `OPENAI_API_KEY` | none | Current preferred figure-analysis route |
| `ohbmcli analyze-figures --vision-backend ollama` | none | `ollama`, `qwen3.5:35b` | No hosted token required |
| `ohbmcli extract-claims` | `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` | `cllm` installed in `.venv` | Provider depends on `--llm-provider` |
| `ohbmcli enrich` | none | none | Consumes local caches only |
| `ohbmcli title-audit` | none | none | Reads local normalized corpus only |
| `ohbmcli reference-metadata` | optional `OPENALEX_API`; `OPENAI_API_KEY` only if using OpenAI reference splitting | none | `OPENALEX_API` is recommended for authenticated reference matching |
| `ohbmcli embed-minilm` / `embed-hf` | optional `HF_TOKEN` | `sentence-transformers` | `HF_TOKEN` is only needed for gated/private Hub access |
| `ohbmcli embed-openai` | `OPENAI_API_KEY` | none | Hosted embedding route |
| `ohbmcli embed-voyage` | `VOYAGE_API` | none | Voyage embedding route |
| `ohbmcli apply-published-stage2` / `embed-stage2` | none | local model dependencies already in `.venv` | Uses local artifacts |
| `ohbmcli semantic-analysis` / `cluster-benchmark` / `umap-plot` / `compare-projections` / `optimize-projections` | none | optional `plotly`, `umap-learn` | Purely local once embeddings exist |
| `scripts/optimize_poster_layout.py` / `scripts/analyze_poster_layout.py` | none | none | Uses local proposal inputs, authors, and layout assets |
| poster sequencing scripts under `scripts/` | none | none | Use local proposals and embeddings |
| `ohbmcli export-ui` / `build-ui` | none | none | Consumes local corpora, caches, clusters, and manual inputs |
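The table above can also serve as a machine-checkable preflight before a hosted run. A minimal sketch (the mapping mirrors a few table rows; `missing_secrets` is a hypothetical helper, not part of `ohbmcli`):

```python
import os

# Workflow -> required env vars, following the table above (subset shown).
REQUIRED_SECRETS = {
    "ingest": ["OHBM2026_API"],
    "authors": ["OHBM2026_API"],
    "analyze-figures-openai": ["OPENAI_API_KEY"],
    "embed-openai": ["OPENAI_API_KEY"],
    "embed-voyage": ["VOYAGE_API"],
}


def missing_secrets(workflow: str, env=os.environ) -> list:
    """List required keys that are absent from the environment."""
    return [key for key in REQUIRED_SECRETS.get(workflow, []) if key not in env]


# Example: fail fast before a hosted run.
# if missing_secrets("embed-voyage"):
#     raise SystemExit("set VOYAGE_API in .env before running embed-voyage")
```

Checking presence this way never prints the secret values themselves, which keeps the check compatible with the no-secrets-in-logs rule.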
Do not use system Python in this repo. Create or refresh .venv with uv, and
run Python commands through .venv/bin/python or uv targeting that
interpreter.
Create the virtual environment and run tests:

```
UV_CACHE_DIR=.uv-cache uv venv --python 3.11 .venv
PYTHONPATH=src .venv/bin/python -m unittest discover -s tests -v
```

Optional Python packages by workflow:

MiniLM or HF embeddings:

```
UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python sentence-transformers
```

Interactive projections:

```
UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python plotly umap-learn
```

Claim extraction:

```
UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python git+https://github.com/OpenEvalProject/cllm.git
```

Headless layout review:

```
UV_CACHE_DIR=.uv-cache uv pip install --python .venv/bin/python ".[review]"
PYTHONPATH=src .venv/bin/python -m playwright install chromium
PYTHONPATH=src .venv/bin/python scripts/check_layout_review.py
```

For local figure analysis, confirm Ollama can see the required model:

```
ollama list
```

Pick the sequence that matches what you are trying to regenerate.
Run these in order when rebuilding the main deliverable from upstream data:
- `ohbmcli ingest`
- `ohbmcli authors`
- `ohbmcli analyze-figures --vision-backend openai`
- `ohbmcli extract-claims`
- `ohbmcli enrich`
- `ohbmcli title-audit`
- `ohbmcli reference-metadata --use-title-search`
- one or more embedding commands such as `embed-minilm`, `embed-voyage`, or `embed-openai`
- `ohbmcli apply-published-stage2` if you want the published Voyage stage-2 space
- `ohbmcli semantic-analysis`, `cluster-benchmark`, `umap-plot`, or `compare-projections` for the cluster and projection products you want the UI to consume
- `ohbmcli export-ui` or `ohbmcli build-ui`
Use this when you already have the corpora and want a new cluster output:
- confirm the required embedding bundle exists under `data/outputs/experiments/embeddings/`
- run `ohbmcli semantic-analysis` for community-detection style outputs
- run `ohbmcli cluster-benchmark` for k-sweep style outputs
- optionally run `scripts/evaluate_label_systems.py` to compare a new cluster family against the submitter taxonomy
- point `export-ui`, `build-ui`, or layout scripts at the new cluster directory
Use this when you want a new organizer-facing proposal:
- confirm `data/primary/abstracts.json`, `data/inputs/authors.json`, and `data/inputs/poster_layout/layout_assets/layout_geometry.json` exist
- choose the embedding bundle and any claims/layout cluster inputs you want to drive the proposal
- run `scripts/optimize_poster_layout.py` into a fresh proposal directory under `data/outputs/proposals/`
- run `scripts/analyze_poster_layout.py` on that proposal
- optionally run comparison or review scripts against multiple proposal directories
Use this when you already have a base proposal and want comparative sequencing evidence:
- pick a base proposal under `data/outputs/proposals/`
- run one of the sequencing scripts under `scripts/` into a fresh dated experiment directory under `experiments/` or a fresh local output root under `data/outputs/proposals/`
- keep the experiment outputs immutable and compare them rather than overwriting the active proposal set
Use this when the data products already exist locally:
- rerun only the upstream steps that changed
- rerun `ohbmcli export-ui` or `ohbmcli build-ui`
- do not rerun hosted/API steps unless their inputs or parameters changed
Use ohbmcli for the corpus, enrichment, embedding, clustering, and UI
pipeline. Use the script wrappers under scripts/ for proposal generation,
layout analysis, and sequencing experiments.
This is the canonical starting point.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli ingest
```

What it does:
- fetches accepted abstracts from Oxford Abstracts
- stores the normalized corpus in `data/primary/abstracts.json`
- downloads only methods/results figure images
- writes local figure links into each abstract

Important behavior:
- retries use an exponential timeout schedule starting at `100ms` and capped at `10s`
- figure downloads are reuse-aware
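The retry schedule described above (exponential, starting at 100 ms, capped at 10 s) amounts to the following; this is a sketch of the stated policy, not the ingest code itself:

```python
def backoff_schedule(attempts: int, base: float = 0.1, cap: float = 10.0) -> list:
    """Exponential retry delays in seconds: 0.1, 0.2, 0.4, ... capped at 10.0."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


assert backoff_schedule(3) == [0.1, 0.2, 0.4]
assert backoff_schedule(10)[-1] == 10.0  # the cap kicks in by the 8th retry
```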
Use this if the raw JSON already exists and you only need to rebuild or prune local figure links.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli refresh-assets --reuse-existing-assets-only
```

Optional, but useful if you want a separate author database.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli authors
```

Output: `data/inputs/authors.json`
There are two supported routes.

This is the current preferred route for the main enriched corpus.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli analyze-figures \
  --vision-backend openai \
  --openai-model gpt-4.1-mini \
  --enriched-output data/outputs/experiments/enrichment/abstracts_enriched_openai.json
```

Notes:
- the cache is incremental and resumable
- current code batches OpenAI image requests for better throughput
- finished analyses are written as they complete

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli analyze-figures \
  --vision-backend ollama \
  --vision-model qwen3.5:35b
```

This step converts abstract content to ordered markdown and merges figure analysis plus any cached claim extraction back into the canonical enriched corpus.
Current defaults:
- `enrich` now defaults to the OpenAI figure-analysis cache under `data/cache/figure_analysis/`
- `enrich` now also defaults to the `cllm` claim cache under `data/cache/claim_analysis/`

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli enrich
```

Explicit form:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli enrich \
  --input data/primary/abstracts.json \
  --image-analyses-input data/cache/figure_analysis/image_analyses_openai__<state-key>.json \
  --claim-analyses-input data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json \
  --enriched-output data/primary/abstracts_enriched.json
```

Output: `data/primary/abstracts_enriched.json`

This is the main corpus used by downstream steps.
The raw Oxford Abstracts export is kept unchanged, but downstream consumers now normalize obvious title issues such as leading bullets, wrapping quotes, and stray outer whitespace.
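The normalizations mentioned above (leading bullets, wrapping quotes, stray outer whitespace) amount to something like the following sketch; this is illustrative only, and the repo's actual cleaning rules may differ:

```python
def clean_title(raw: str) -> str:
    """Strip leading bullet markers, wrapping quotes, and stray outer whitespace."""
    title = raw.strip()
    title = title.lstrip("-*• ").strip()  # leading bullet markers
    # Same-character wrapping quotes around the whole title.
    if len(title) >= 2 and title[0] == title[-1] and title[0] in "\"'":
        title = title[1:-1].strip()
    return title


assert clean_title('  - "A study of X"  ') == "A study of X"
```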
```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli title-audit \
  --input data/primary/abstracts.json \
  --output data/outputs/experiments/title_audit/title_modifications.json
```

Output: `data/outputs/experiments/title_audit/title_modifications.json`

This file records each changed title with the original string, the cleaned title, and the normalization reasons.
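As an illustration, one audit record plausibly looks like the following; the field names here are assumptions based on the description above, not a schema guarantee:

```python
# Hypothetical shape of a single title-audit record.
record = {
    "abstract_id": "12345",                        # assumed identifier field
    "original_title": '- "Mapping the thalamus"',
    "cleaned_title": "Mapping the thalamus",
    "reasons": ["leading_bullet", "wrapping_quotes"],
}
assert record["cleaned_title"] != record["original_title"]
```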
```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli reference-metadata \
  --input data/primary/abstracts.json \
  --output data/primary/reference_metadata.json \
  --use-title-search
```

Output: `data/primary/reference_metadata.json`

This file is resumable and checkpoint-friendly.
Reference resolution now follows this order:
- markdown normalization of the raw references field
- LLM-assisted splitting of the full reference markdown block, validated against the source text
- exact DOI -> OpenAlex
- exact PMID -> OpenAlex
- direct OpenAlex title search for references with a title
- Semantic Scholar full-reference search only for references that still have neither DOI nor title
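The lookup stages of the order above form a first-match cascade; a condensed sketch of the routing decision (the function and route names are mine, not repo identifiers):

```python
def resolution_route(ref: dict) -> str:
    """Pick the first applicable lookup route for a parsed reference."""
    if ref.get("doi"):
        return "openalex:doi"
    if ref.get("pmid"):
        return "openalex:pmid"
    if ref.get("title"):
        return "openalex:title-search"
    # Only references with neither DOI nor title fall through here.
    return "semantic-scholar:full-reference"


assert resolution_route({"doi": "10.1000/xyz"}) == "openalex:doi"
assert resolution_route({"title": "Some paper"}) == "openalex:title-search"
assert resolution_route({}) == "semantic-scholar:full-reference"
```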
Current operational notes:
- the OpenAI splitter runs one request per abstract attempt and can be driven concurrently
- failed or invalid splits can be requeued and retried before falling back to a single-block record
- OpenAlex title search can also run concurrently with an explicit requests-per-second cap
- the OpenAlex `/rate-limit` endpoint is the best way to inspect the current search budget before a long rerun
Useful options:
- `--no-doi-discovery` - skip the Semantic Scholar full-reference DOI-discovery fallback
- `--no-llm-reference-splitting` - skip the OpenAI/Ollama splitting pass and fall back to local markdown heuristics
- `--reference-splitting-backend openai` - use OpenAI for the splitting helper
- `--reference-splitting-model gpt-5-nano` - model used for reference structuring; defaults to `gpt-5-nano`
  - the OpenAI backend uses the Responses API with a strict JSON schema for `{"references": [{"reference", "title", "doi"}]}` output
  - extracted `title` and `doi` values are only used downstream if they are lexically present in the returned reference text
- `--split-concurrency 500` - number of in-flight OpenAI reference-splitting requests during collect
- `--split-max-requeues 5` - maximum retries for failed or invalid split attempts before falling back to a single merged block
- `--title-concurrency 50` - number of concurrent OpenAlex title-search workers
- `--title-max-rps 90` - soft request-rate cap for OpenAlex title search; useful for staying below OpenAlex short-window throttle limits
- `--doi-discovery-similarity-threshold 0.8` - minimum title similarity required before accepting a discovered DOI
- `--delay-seconds 1.05` - pacing for sequential fallback phases such as Semantic Scholar DOI discovery
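The `--title-max-rps` cap behaves like a simple request pacer: each worker sleeps just long enough to keep the aggregate rate under the limit. A minimal single-threaded sketch of the idea (not the repo's implementation, which also coordinates concurrent workers):

```python
import time


class RateCap:
    """Sleep just enough to keep calls under max_rps requests per second."""

    def __init__(self, max_rps: float):
        self.min_interval = 1.0 / max_rps
        self.last = 0.0

    def wait(self):
        now = time.monotonic()
        delay = self.last + self.min_interval - now
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()


cap = RateCap(max_rps=90)  # mirrors --title-max-rps 90
```

Before a long rerun, pairing a cap like this with a check of the OpenAlex `/rate-limit` budget avoids tripping short-window throttles midway through.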
If a completed reference map still contains fallback split cases, rerun only those abstracts and merge the repaired results back into the existing output:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli reference-metadata \
  --input data/primary/abstracts.json \
  --output data/primary/reference_metadata.json \
  --repair-failed-splits-from data/primary/reference_metadata.json \
  --use-title-search \
  --reference-splitting-backend openai \
  --reference-splitting-model gpt-5-nano
```

If you want claim lists over the abstracts, run this after figure analysis so cached figure notes can be included in the `cllm` manuscript.
```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli extract-claims
```

What it does:
- reads `data/primary/abstracts.json`
- reads the OpenAI figure-analysis cache under `data/cache/figure_analysis/` by default so figure-analysis text can be appended when present
- builds a manuscript from the title, introduction, methods, results, discussion, conclusion, and filtered additional-content fields
- excludes references and acknowledgements from the claim prompt
- writes a resumable cache under `data/cache/claim_analysis/`
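The manuscript-assembly step above can be pictured as joining the allowed sections in order while skipping excluded ones. A sketch under the assumption that sections are plain string fields keyed by name (the actual field names and filtering are the repo's):

```python
# Sections included in the claim prompt, in manuscript order.
SECTIONS = ["title", "introduction", "methods", "results", "discussion", "conclusion"]
# Never fed to the claim prompt.
EXCLUDED = {"references", "acknowledgements"}


def build_manuscript(abstract: dict) -> str:
    """Concatenate allowed sections in order, skipping empty and excluded fields."""
    parts = [abstract[s] for s in SECTIONS if s not in EXCLUDED and abstract.get(s)]
    return "\n\n".join(parts)


doc = build_manuscript({"title": "T", "methods": "M", "references": "R"})
assert doc == "T\n\nM"  # references never reach the prompt
```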
Current default OpenAI path:
- provider: `openai`
- model: `gpt-4o-2024-08-06`

Useful explicit form:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli extract-claims \
  --input data/primary/abstracts.json \
  --image-analyses-input data/cache/figure_analysis/image_analyses_openai__<state-key>.json \
  --claim-analyses-output data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json \
  --openai-model gpt-4o-2024-08-06
```

If you want the claims to appear in the UI, rerun:
- `enrich`
- `build-ui`
Pick one or more embedding routes.

MiniLM:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm
```

OpenAI:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-openai
```

Voyage:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-voyage
```

Hugging Face model:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-hf \
  --model neuml/pubmedbert-base-embeddings
```

Embedding text is built on demand from: `title`, `claims`, `introduction`, `methods`, `results`, `conclusion`.

You can override the fields at runtime, for example:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm \
  --fields title methods results
```

To build a claims-only embedding bundle:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-minilm \
  --fields claims \
  --output-name minilm_claims
```

This uses `claim_extraction.claims` from `data/primary/abstracts_enriched.json` and formats each extracted claim as a short bullet containing the claim statement itself.
Use this when you have a compatible Voyage stage-1 bundle.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli apply-published-stage2
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli embed-stage2
```

Community detection over an embedding bundle:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli semantic-analysis \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published
```

Clustering benchmark over an embedding bundle:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli cluster-benchmark \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --output-dir data/outputs/experiments/clustering_benchmark__<state-key>
```

To benchmark a claims-only bundle around 25-30 clusters:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli cluster-benchmark \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --output-dir data/outputs/experiments/clustering_benchmark_claims_25_30__<state-key> \
  --k-min 25 \
  --k-max 30
```

This is the current claims-cluster artifact consumed by the UI. The latest run selected a 28-cluster k-means solution inside that benchmark output.
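A k-sweep benchmark like the one above ends with picking a winning k from per-k quality scores; the selection itself is just an argmax. A sketch with invented scores (the real benchmark may combine several metrics, and the numbers here are made up for illustration):

```python
def best_k(scores: dict) -> int:
    """Return the k with the highest score, breaking ties toward smaller k."""
    return min(scores, key=lambda k: (-scores[k], k))


# Hypothetical per-k scores from a 25-30 sweep.
sweep = {25: 0.41, 26: 0.44, 27: 0.44, 28: 0.47, 29: 0.43, 30: 0.40}
assert best_k(sweep) == 28
```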
If you want to score a new cluster family against the submitter taxonomy:

```
PYTHONPATH=src .venv/bin/python scripts/evaluate_label_systems.py \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --raw-input data/primary/abstracts.json \
  --label-system submitter_parent \
  --label-system submitter_exact \
  --label-system candidate=data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark/cluster_assignments.json \
  --output-dir data/outputs/experiments/embeddings/voyage_stage2_published/category_evaluation
```

Projection outputs:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli umap-plot
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli compare-projections
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli optimize-projections
```

The stable route for proposal generation currently lives in the script wrappers under `scripts/`, not in `ohbmcli`.
Generate a fresh proposal bundle:

```
PYTHONPATH=src .venv/bin/python scripts/optimize_poster_layout.py \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-dir data/outputs/proposals/layout_claims__<fresh-run-name>
```

Analyze that proposal:

```
PYTHONPATH=src .venv/bin/python scripts/analyze_poster_layout.py \
  --assignment data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --embeddings-dir data/outputs/experiments/embeddings/minilm_claims \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output data/outputs/proposals/layout_claims__<fresh-run-name>/analysis.json
```

To drive the layout with a learned label system instead of the submitter taxonomy, add:
- `--layout-cluster-assignments <cluster_assignments.json>`
- `--layout-cluster-summaries <cluster_summaries.json>`
- `--layout-label-system <name>`

Use a fresh `--output-dir` whenever the layout label system, embeddings, or weights change. The default output-root hash does not encode every proposal option.
Once a base proposal exists, the sequencing and comparison workflows are also script-driven. Write these outputs to fresh experiment directories or fresh proposal output roots.
Graph benchmark against an existing proposal:

```
PYTHONPATH=src .venv/bin/python scripts/benchmark_poster_sequencing.py \
  --proposal data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-root experiments/<date>-poster-sequencing-benchmark/runs/<fresh-run-name>
```

Advanced non-diffusion global-path experiment:

```
PYTHONPATH=src .venv/bin/python scripts/run_advanced_global_path_experiment.py \
  --proposal data/outputs/proposals/layout_claims__<fresh-run-name>/proposal.json \
  --raw-input data/primary/abstracts.json \
  --authors-input data/inputs/authors.json \
  --embeddings-dir data/outputs/experiments/embeddings/voyage_stage2_published \
  --claims-cluster-assignments data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_assignments.json \
  --claims-cluster-summaries data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30/cluster_summaries.json \
  --output-root experiments/<date>-advanced-global-path/runs/<fresh-run-name>
```

The same pattern applies to `scripts/sweep_diffusion_variants.py`, `scripts/sweep_global_path_variants.py`, and `scripts/sweep_global_path_mapalign_variants.py`: pass explicit current paths for the proposal, corpora, authors, embeddings, and output root rather than relying on older baked-in defaults.
This is the current latest delivery step.

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli build-ui
```

The current default UI build uses:
- `data/primary/abstracts.json`
- `data/primary/abstracts_enriched.json`
- `data/primary/reference_metadata.json`
- the OpenAI figure-analysis cache under `data/cache/figure_analysis/`
- `data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark`
- `data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30`
- `data/outputs/experiments/embeddings/minilm_stage1/umap_title-introduction-methods-results-conclusion.json`

By default `build-ui` now writes the local bundle under `data/outputs/exported-sites/ui-site__<state-key>/` and mirrors that bundle to `export/ui-site/`. Pass `--site-output-dir` or `--publish-dir` to override one or both locations.

Useful explicit form if you want to point the UI at a different claims-cluster run:

```
PYTHONPATH=src .venv/bin/python -m ohbm2026.cli build-ui \
  --site-output-dir data/outputs/exported-sites/ui-site__<state-key> \
  --publish-dir export/ui-site \
  --cluster-25-dir data/outputs/experiments/embeddings/voyage_stage2_published/clustering_benchmark \
  --claims-cluster-dir data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30
```

The exported detail payload now includes:
- merged `claim_extraction` from `data/primary/abstracts_enriched.json`
- `reference_summary` from `data/primary/reference_metadata.json`
- `semantic_25` and `claims_28` cluster lenses in the facet and detail metadata
Then serve it locally:

```
.venv/bin/python -m http.server 8000
```

Open: `http://localhost:8000/export/ui-site/`
If you already have raw abstracts:
- rerun figure analysis
- rerun `extract-claims` if claim prompts should reflect updated figure analyses
- rerun `enrich`
- rerun `build-ui`

If you already have figures and only changed UI code:
- rerun `build-ui`

If you already have fresh figure analyses and only changed claim extraction:
- rerun `extract-claims`
- rerun `enrich`
- rerun `build-ui`

If you already have embeddings but want new cluster evaluations:
- rerun `cluster-benchmark`
- optionally rerun `scripts/evaluate_label_systems.py`
- optionally rerun `build-ui`

If you specifically want to refresh the claims-based semantic lens:
- rerun `embed-minilm --fields claims --output-name minilm_claims`
- rerun `cluster-benchmark --embeddings-dir data/outputs/experiments/embeddings/minilm_claims --output-dir data/outputs/experiments/embeddings/minilm_claims/clustering_benchmark_25_30 --k-min 25 --k-max 30`
- rerun `build-ui`

If you want to regenerate a proposal without touching the corpora:
- rerun `scripts/optimize_poster_layout.py` into a fresh `data/outputs/proposals/...` directory
- rerun `scripts/analyze_poster_layout.py`

If you want to rerun sequencing experiments on an existing proposal:
- pick the proposal JSON under `data/outputs/proposals/`
- rerun the relevant script under `scripts/` into a fresh experiment run directory
- `src/ohbm2026/graphql_api.py` - GraphQL access, env loading, batching, retries
- `src/ohbm2026/assets.py` - abstract ingest and figure asset download/refresh
- `src/ohbm2026/enrichment.py` - markdown conversion, figure analysis, claim extraction, enrichment assembly
- `src/ohbm2026/openalex.py` - reference parsing and OpenAlex matching
- `src/ohbm2026/neuroscape.py` - embeddings, stage-2 paths, semantic analysis, clustering, projections
- `src/ohbm2026/ui.py` - static UI export/build pipeline
- `src/ohbm2026/cli.py` - unified CLI entrypoint
- raw ingest
  - `data/primary/abstracts.json`
  - `data/inputs/assets/`
- manual and operator inputs
  - `data/inputs/abstracts_with_phenomena_with_theories_refined.csv`
  - `data/inputs/poster_layout/layout_assets/`
- authors
  - `data/inputs/authors.json`
- figure analysis
  - `data/cache/figure_analysis/image_analyses_ollama__<state-key>.json`
  - `data/cache/figure_analysis/image_analyses_openai__<state-key>.json`
- claim extraction
  - `data/cache/claim_analysis/claim_analyses_cllm__<state-key>.json`
- enriched corpus
  - `data/primary/abstracts_enriched.json`
- audit outputs
  - `data/outputs/experiments/title_audit/title_modifications.json`
- references
  - `data/primary/reference_metadata.json`
- embeddings and clustering
  - `data/outputs/experiments/embeddings/*`
- static site
  - `data/outputs/exported-sites/ui-site__<state-key>/`
  - optional publish mirror at `export/ui-site/`
Default validation command:

```
PYTHONPATH=src .venv/bin/python -m unittest discover -s tests -v
```

If an agent is taking over this repo, this should be the first command after setting up the environment.