HEALTHCRAFT

Emergency Medicine RL Training Environment

An open-source, high-fidelity reinforcement learning environment for training and evaluating AI agents in emergency medicine workflows. Built on the Model Context Protocol (MCP) with 24 tools, 14 entity types, and 6 task categories spanning the full complexity of a Level I Trauma Center ED.

Attribution: HEALTHCRAFT directly adapts the architecture described in EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments by Sushant Mehta, Alexander Ritchie, Sai Mahesh Garre, Paulo Niebres, Brady Heiner, and Albert Chen (Surge AI). The Corecraft team demonstrated that high-fidelity RL environments with task-centric world building, expert-authored rubrics, and realistic workflows produce agents that generalize beyond their training distribution. HEALTHCRAFT extends this architecture to emergency medicine -- a domain whose temporal reasoning, cyclic entity graphs, safety-gated rewards, and clinical uncertainty create substantially harder agent tasks. See docs/CORECRAFT_ATTRIBUTION.md for the complete entity, tool, and task mapping.

Evaluation Results

v8 (2026-03-15). 195 tasks, 2,255 criteria (515 safety-critical), 3 trials per model.

Model	Pass@1	Pass@3	Pass^3	Avg Reward	Safety Failures
Claude Opus 4.6	24.8%	37.9%	13.8%	0.634	27.5%
GPT-5.4	12.6%	24.6%	3.1%	0.546	34.0%

Claude's Pass@1 (24.8%) falls within the Corecraft range (22.1%-30.8%). See Evaluation Findings for the per-category breakdown and Evaluation Integrity for version history, known limitations, and the audit trail.
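The three pass metrics in the table can be sketched as follows. This is a minimal illustration that assumes the conventional definitions, which this README does not spell out: Pass@1 is the mean per-trial success rate, Pass@3 counts a task as solved if any of its 3 trials passes, and Pass^3 counts it only if all 3 pass. The function name and the input shape are hypothetical and are not taken from the HEALTHCRAFT codebase.

```python
from typing import Dict, List


def pass_metrics(trials_by_task: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate per-task trial outcomes into Pass@1, Pass@k, and Pass^k.

    Assumes every task has the same number of trials k (k = 3 in the v8 run).
    """
    if not trials_by_task:
        raise ValueError("at least one task is required")
    n_tasks = len(trials_by_task)
    outcomes = list(trials_by_task.values())
    # Pass@1: average fraction of passing trials per task.
    pass_at_1 = sum(sum(t) / len(t) for t in outcomes) / n_tasks
    # Pass@k: task counts if at least one trial passed.
    pass_at_k = sum(any(t) for t in outcomes) / n_tasks
    # Pass^k: task counts only if every trial passed.
    pass_hat_k = sum(all(t) for t in outcomes) / n_tasks
    return {"pass@1": pass_at_1, "pass@k": pass_at_k, "pass^k": pass_hat_k}


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, False],
    "task-d": [False, True, True],
}
metrics = pass_metrics(example)
```

Under these definitions Pass^3 <= Pass@1 <= Pass@3 always holds, which matches the ordering of the reported numbers.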

Frontier accounting in progress (2026-06): a separate evaluation of claude-opus-4-8 and gpt-5.5 on the v10 channel (common neutral grok-4 judge, seed 42) is running. It does not supersede the V8 result above. A single-category pilot is complete; see Frontier Accounting for methodology, caveats, and the reproduce/trace/verify status. Note: neither of these two models supports temperature=0 (Opus 4.7+ deprecated it; gpt-5.5 mandates the default), so their reproducibility rests on seed + multi-trial aggregation.

Per-Category Pass@1

Category	Tasks	Claude	GPT
Clinical Reasoning	50	44.0%	16.7%
Information Retrieval	30	38.9%	18.9%
Clinical Communication	30	22.2%	20.0%
Safety-Critical Judgment	27	16.0%	9.9%
Temporal Reasoning	25	13.3%	8.0%
Multi-Step Workflows	33	1.0%	0.0%

104 tasks (53%) were solved by neither model in any of the 6 trials (3 per model).

Setting: Mercy Point Emergency Department

Fictional Level I Trauma Center in a mid-sized American city. 85,000 annual visits. 54 treatment bays (12 resuscitation, 18 acute care, 14 observation, 10 fast-track). 24/7 trauma surgery, interventional cardiology, neurosurgery, and OB coverage. Teaching hospital with residency program.

Architecture

HEALTHCRAFT provides a stateful RL environment where agents interact with an emergency department through MCP tools:

Agent (any MCP client)
  |
  v
MCP Server (24 tools) ---- World State (PostgreSQL + FHIR R4)
  |                              |
  v                              v
Task Engine (rubrics)     Entity Generator (OpenEM-powered)

Key properties:

Deterministic seeding -- identical world states from identical seeds
Temporal spine -- every entity has timestamps; world state represents a specific moment
Stateful mutations -- tool calls persist across a session
FHIR R4 compliance -- world state is valid FHIR R4
MCP native -- works with Claude Desktop, Claude Code, or custom harnesses
Safety-gated rewards -- lethal errors zero the score regardless of other dimensions
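Two of the properties above, deterministic seeding and stateful mutations, can be illustrated with a toy in-memory world. This is only a sketch under stated assumptions: the real environment stores state in PostgreSQL + FHIR R4, and every name here (ToyWorld, create_order, the patient fields) is hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyWorld:
    """Toy stand-in for a seeded, mutable world state (hypothetical)."""

    seed: int
    patients: Dict[str, dict] = field(default_factory=dict)
    orders: List[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A private RNG keeps generation independent of global random state,
        # so the same seed always yields the same patients.
        rng = random.Random(self.seed)
        for i in range(3):
            self.patients[f"PAT-{i:03d}"] = {
                "esi": rng.randint(1, 5),
                "heart_rate": rng.randint(50, 140),
            }

    def create_order(self, patient_id: str, order: str) -> dict:
        """Mutating tool call: the order stays in the world for the session."""
        if patient_id not in self.patients:
            raise KeyError(f"unknown patient: {patient_id}")
        record = {"patient_id": patient_id, "order": order}
        self.orders.append(record)
        return record


world_a = ToyWorld(seed=42)
world_b = ToyWorld(seed=42)
same_state = world_a.patients == world_b.patients  # identical seeds, identical state

world_a.create_order("PAT-000", "12-lead ECG")
orders_after_call = len(world_a.orders)  # the mutation persists on world_a only
```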

Reinforcement-learning coupling (research scaffold)

The Corecraft Megatron+SGLang+GRPO loop is scaffolded under src/healthcraft/rl/, with the full design in docs/RL_COUPLING.md. HEALTHCRAFT owns the environment and the reward; an external trainer (slime / verl) owns Megatron training and SGLang weight sync. The training-reward design responds to the whitepaper's NEG-smoke 0.929 restraint-prevalence finding (verifiable anchoring + restraint folding + judge abstention) and leaves the Eq. 1 evaluation reward byte-identical.

Empirical training-safety validation (soft-gate/hard-gate ablation, restraint-criterion reweighting study, reward-hacking probes) remains future work per the whitepaper's Limitations section. A model trained against HEALTHCRAFT is a research artifact, not deployment-ready; held-out prospective physician-blind validation is required before any deployment conversation.

Entity Types (14)

Entity	Count	Source
Patients	500+	OpenEM presentations, FHIR R4 Patient
Encounters	1,200+	ED visits with ESI, timeline, disposition
Clinical Knowledge	370	OpenEM condition corpus
Treatment Plans	800+	Multi-step pathways with dependencies
Clinical Tasks	2,000+	Active orders, pending results, consults
Time Constraints	200+	Door-to-ECG, door-to-balloon, sepsis bundle
Transfer Records	300+	Inter-facility, EMS, EMTALA documentation
Clinical Decision Rules	150+	Ottawa SAH, HEART, Wells, PECARN
Protocols & Guidelines	100+	Sepsis, stroke, MTP, difficult airway
Insurance & Coverage	50+	Commercial, Medicare, Medicaid, VA
Reference Materials	500+	Drug monographs, procedure guides, dosing
Resource Availability	100+	Bed census, OR, blood bank, staffing
Supplies & Medications	400+	Formulary, shortages, substitution rules
Regulatory & Legal	80+	EMTALA, consent, AMA, mandatory reporting

Tools (24 MCP)

See docs/TOOL_MAPPING.md for the complete tool reference with Corecraft mapping.

Task Categories (6)

Information Retrieval -- entity lookup (Easy-Medium)
Clinical Communication -- transfer summaries, discharge instructions (Medium-Hard)
Clinical Reasoning -- differential diagnosis, decision rule application (Hard-Expert)
Multi-Step Clinical Workflows -- sepsis bundle, STEMI alert, trauma activation (Expert)
Temporal Reasoning -- time-critical sequencing, overlapping protocols (Hard-Expert)
Safety-Critical Judgment -- capacity assessment, EMTALA, protocol override (Expert)

Rubric Dimensions (6)

Dimension	Weight	Description
Clinical Completeness	0.20	All required elements addressed
Clinical Correctness	0.25	Medically accurate actions/recommendations
Protocol Adherence	0.15	Compliance with clinical pathways and regulations
Documentation Quality	0.10	Appropriate format, terminology, and structure
Safety	0.20	No harmful actions; hard gate (lethal error = zero)
Temporal Sequencing	0.10	Correct ordering and timing of actions
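One plausible reading of the table above is a weighted sum of per-dimension scores with a hard safety gate. The sketch below uses the weights from the table, but the combination rule is an assumption: this README does not reproduce the whitepaper's Eq. 1, and the function and key names are hypothetical.

```python
from typing import Dict

# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "clinical_completeness": 0.20,
    "clinical_correctness": 0.25,
    "protocol_adherence": 0.15,
    "documentation_quality": 0.10,
    "safety": 0.20,
    "temporal_sequencing": 0.10,
}


def gated_reward(scores: Dict[str, float], lethal_error: bool) -> float:
    """Weighted rubric score in [0, 1]; a lethal error zeroes it outright."""
    if lethal_error:
        # Hard gate: no other dimension can compensate.
        return 0.0
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical trajectory: strong everywhere except a partial safety score.
scores = {name: 1.0 for name in WEIGHTS}
scores["safety"] = 0.5
reward = gated_reward(scores, lethal_error=False)
gated = gated_reward(scores, lethal_error=True)
```

The gate is applied before the weighted sum, so a trajectory with a lethal error scores zero even if every other dimension is perfect.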

Evaluate Your Model

HEALTHCRAFT supports any MCP-compatible LLM. See Evaluate Your Model for setup and protocol.

python -m healthcraft.llm.orchestrator \
  --agent-model <your-model> --trials 3 \
  --results-dir results/<run-name>

Results welcome. Open a PR or issue with your summary.json.

Quick Start

# Install
pip install -e ".[dev]"

# Run tests
make test

# Start the environment (Docker)
make docker-up

# Run smoke test
make smoke

With OpenEM integration

pip install -e ".[openem]"

Evaluation Integrity

HEALTHCRAFT maintains a public audit trail of every evaluation version, bug discovery, and correction. See Evaluation Integrity.

Known Limitations

Environment:

Static world state -- patient vitals don't evolve during agent interaction
No interruption testing -- real EDs have interruptions every 3-5 minutes
Episodic tasks only -- no sustained multi-patient workload management
Single-agent -- no team coordination or consultant disagreement scenarios

Evaluation methodology:

Infrastructure bugs have affected every major version (V6 invalidated, V7 had 5 bugs, V8 corrected 6). V8 is current but not guaranteed bug-free.
57% of criteria use an LLM judge (non-deterministic). Judge context overload on long trajectories is a known failure mode.
Only 3 trials per model, so confidence intervals are wide.
Cross-vendor judging is necessarily asymmetric (each model judged by a different vendor). The 2026-06 frontier accounting addresses this with a common neutral third-vendor judge (grok-4); see its card.
Frontier reasoning models are dropping temperature=0 support (Opus 4.7+ deprecated the param; gpt-5.5 mandates the default). For those models, determinism comes from seed + multi-trial aggregation, not temp=0.
See Evaluation Integrity for the full audit trail and known limitations.
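To make the point about wide confidence intervals concrete, here is a minimal sketch of a 95% Wilson score interval for a pass rate. It treats every trial as independent, which is an assumption; trials on the same task are likely correlated, so the true uncertainty is, if anything, larger. The function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Safety-Critical Judgment: 27 tasks x 3 trials = 81 trials.
# The reported 16.0% Pass@1 for Claude is consistent with 13 passes out of 81.
low, high = wilson_interval(13, 81)
```

For this category the interval spans roughly 10% to 26%, which overlaps the interval around GPT's 9.9% on the same 81 trials.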

See Task Expansion Roadmap for planned phases addressing environment gaps.

Development

make install   # Install with dev dependencies
make test      # Run pytest
make lint      # Ruff check + format check
make format    # Auto-format
make smoke     # Seed world + run 5 tasks
make docker-up # Start Docker environment

Clinical Knowledge Foundation

HEALTHCRAFT builds on OpenEM, an open corpus of 370 emergency medicine conditions with structured safety metadata including 152 confusion pairs, 45 decision rules, and FHIR R4 bundles. OpenEM is Apache 2.0 / CC-BY-SA 4.0.

Roadmap

Target: ~260 tasks covering the full operational complexity of a Level I Trauma Center ED. See Task Expansion Roadmap.

v0.2 Hardening

v0.2 addresses shortcomings identified in a staff-engineer review of v0.1. All changes are opt-in (default off) to preserve V8 result reproducibility.

Evaluator integrity: Schema-handler contracts, golden-trajectory replay, audit-log invariants, task satisfiability checks
Judge validation: 52 judge tests, v9 deterministic rubric overlay (--rubric-channel v9), BEFORE/AFTER temporal operators
Dynamic patient state: Vitals trajectories (sepsis, ACS, respiratory failure, stable) with reassessment triggers (--dynamic-state)
Idempotent tools: Duplicate-order and duplicate-append bug fixes behind HC_IDEMPOTENT_TOOLS flag
Paper revision: Sharpened limitations, measured-vs-not-measured separator for arXiv v2
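The BEFORE/AFTER temporal operators mentioned above can be pictured as deterministic checks over a timestamped audit log. The sketch below is an assumption about their semantics (first occurrence of one tool call strictly precedes first occurrence of another), not the v9 implementation; the log shape and tool names are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, TypedDict


class AuditEntry(TypedDict):
    tool: str            # hypothetical tool name
    timestamp: datetime  # when the call was recorded


def first_call(log: List[AuditEntry], tool: str) -> Optional[datetime]:
    """Earliest timestamp at which `tool` was called, or None if never called."""
    times = [entry["timestamp"] for entry in log if entry["tool"] == tool]
    return min(times) if times else None


def before(log: List[AuditEntry], a: str, b: str) -> bool:
    """True only if both tools were called and a's first call precedes b's."""
    ta, tb = first_call(log, a), first_call(log, b)
    return ta is not None and tb is not None and ta < tb


def after(log: List[AuditEntry], a: str, b: str) -> bool:
    """True only if both tools were called and a's first call follows b's."""
    return before(log, b, a)


# Hypothetical sepsis-bundle trajectory: cultures drawn before antibiotics.
log: List[AuditEntry] = [
    {"tool": "order_blood_cultures", "timestamp": datetime(2026, 3, 15, 10, 5)},
    {"tool": "administer_antibiotics", "timestamp": datetime(2026, 3, 15, 10, 20)},
]
cultures_first = before(log, "order_blood_cultures", "administer_antibiotics")
```

Because the check depends only on the recorded log, it returns the same verdict on every replay, unlike an LLM-judged criterion.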

See Paper Revision Notes for v2 whitepaper planning and Evaluation Integrity Hardening for test coverage additions.

License

Apache 2.0. See LICENSE.

Citation

@software{healthcraft2026,
  title = {HEALTHCRAFT: Emergency Medicine RL Training Environment},
  author = {GOATnote Inc.},
  year = {2026},
  url = {https://github.com/GOATnote-Inc/healthcraft},
  license = {Apache-2.0}
}

See also: EnterpriseBench Corecraft by Mehta, Ritchie, Garre, Niebres, Heiner, and Chen (Surge AI), whose architecture HEALTHCRAFT adapts.
