Document Type: Technical Specification
Document ID: RADSLICE-TSD-001
Maintained by: GOATnote Inc.
Responsible Engineer: B, M.D., Chief Engineer, GOATnote Inc.
Repository: github.com/GOATnote-Inc/radslice
License: Apache 2.0
| Revision | Date | Author | Description |
|---|---|---|---|
| 0.1.0 | 2026-02-27 | GOATnote | Initial framework implementation: task schema, grading pipeline, CLI |
| 0.2.0 | 2026-02-28 | GOATnote | Recursive self-improvement architecture: saturation detection, governance |
| 0.3.0 | 2026-02-28 | GOATnote | DICOM-native image pipeline, corpus provenance mapping, smoke test config |
| 1.0.0-rc | 2026-03-03 | GOATnote | First 4-modality evaluation: 51 tasks × 2 models × 3 trials; grading red-team audit |
| 1.1.0-rc | 2026-03-07 | GOATnote | Full-judge evaluation: 44 tasks × 2 models × 3 trials; L0 inflation eliminated, 5 tasks re-sourced, 7 excluded; cross-modal blind spot analysis; physician adjudication framework |
RadSlice is an evaluation benchmark for assessing the diagnostic image interpretation capabilities of frontier vision-language models (VLMs) across four radiological modalities: X-ray, computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound (US). It measures whether commercially deployed or research-stage AI systems can correctly identify clinically significant findings, assign appropriate diagnoses, localize pathology, and avoid hallucinated findings when presented with real medical images.
RadSlice is designed to support the following regulatory and policy workflows:
- Pre-deployment safety evaluation of AI/ML systems that interpret medical images, as referenced in the FDA's January 2025 draft guidance "Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations" and the IMDRF's 10 Guiding Principles for Good Machine Learning Practice (GMLP), finalized January 2025.
- Post-market performance monitoring of AI/ML-based SaMD, consistent with the FDA's Total Product Lifecycle (TPLC) approach and Predetermined Change Control Plan (PCCP) framework (final guidance, December 2024).
- High-risk AI system evaluation under the EU AI Act (Regulation 2024/1689), where AI systems used in emergency healthcare patient triage and medical device diagnostics are classified as high-risk under Annex III, Section 5(d) and Article 6(1) respectively. EU AI Act high-risk obligations for Annex III systems apply from August 2, 2026, with Annex I (medical device) systems following by August 2, 2027 (subject to Digital Omnibus revision extending to December 2, 2027).
- Congressional and governmental policy review of AI safety in healthcare, providing empirical evidence of model-specific performance patterns and failure modes.
RadSlice is an evaluation tool. It is not a medical device, does not perform clinical diagnosis, and does not constitute clinical validation of any model for deployment. Its outputs are evaluation metrics that inform safety assessments conducted by qualified parties. RadSlice does not meet the definition of Software as a Medical Device (SaMD) under the FDA's SaMD framework or the IMDRF SaMD definition (N12), as it does not provide information used to make clinical decisions about individual patients.
| Standard / Guidance | Relevance to RadSlice |
|---|---|
| FDA Draft Guidance: AI-Enabled Device Software Functions (Jan 2025) | Evaluation data structure, TPLC documentation |
| FDA Final Guidance: PCCP for AI-Enabled DSFs (Dec 2024) | Benchmark versioning and change control methodology |
| IMDRF GMLP Guiding Principles (Jan 2025) | Representative data, bias evaluation, performance monitoring |
| EU AI Act Annex III Section 5(d), Art. 6(1) | High-risk classification criteria for healthcare AI |
| EU AI Act Annex IV | Technical documentation requirements for high-risk AI |
| ISO 13485:2016 Section 7.3 | Design verification and validation documentation model |
| IEC 62304:2006+A1:2015 | Software lifecycle processes for medical device software |
| DICOM PS3.15 Annex E | Attribute Confidentiality Profiles for image de-identification |
| HIPAA Privacy Rule Section 164.514(b)(2) | Safe Harbor de-identification standard |
RadSlice defines 330 evaluation tasks spanning 141 unique clinical conditions from the OpenEM emergency medicine corpus. Each task specifies a clinical condition, imaging modality, ground truth diagnosis, expected radiological findings, and grading criteria. Tasks are organized by modality and type:
| Modality / Type | Tasks | Unique Conditions | Difficulty Distribution |
|---|---|---|---|
| X-ray | 72 | 72 | 8 basic / 20 intermediate / 39 advanced / 5 expert |
| CT | 106 | 106 | 3 basic / 28 intermediate / 64 advanced / 11 expert |
| MRI | 53 | 53 | 1 basic / 6 intermediate / 40 advanced / 6 expert |
| Ultrasound | 89 | 89 | 9 basic / 31 intermediate / 30 advanced / 19 expert |
| Incidental Detection | 5 | 5 | 0 basic / 2 intermediate / 3 advanced / 0 expert |
| Report Audit | 5 | 5 | 0 basic / 0 intermediate / 5 advanced / 0 expert |
| Total | 330 | 141 unique | 21 basic / 85 intermediate / 185 advanced / 39 expert |
65 tasks include cross-references to LostBench safety persistence scenarios (e.g., MTR-016), enabling cross-benchmark analysis of whether models that correctly identify emergency findings on imaging also maintain those recommendations under conversational pressure.
RadSlice implements a DICOM-native image pipeline (src/radslice/dicom.py) that preserves clinical correctness through the full processing chain:
- DICOM loading via pydicom with RescaleSlope/RescaleIntercept application for Hounsfield Unit conversion
- Clinically validated window presets (CT soft tissue, lung, bone, brain, liver, abdomen) with fallback to DICOM-embedded Window Center/Width tags, then histogram-based auto-windowing
- Photometric interpretation handling including MONOCHROME1 inversion (where high pixel values represent dark regions)
- Multi-frame DICOM support for cine ultrasound and cardiac imaging
- Format detection via DICOM magic bytes at offset 128, with transparent routing between DICOM and raster (PNG/JPEG) paths
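The processing steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in src/radslice/dicom.py: the function names, the preset values (commonly cited center/width pairs), and the simplified linear window formula are all assumptions made for the example.

```python
import numpy as np

# Illustrative (center, width) presets in Hounsfield Units. These are commonly
# cited values and are assumptions, not necessarily the presets in dicom.py.
WINDOW_PRESETS = {
    "lung": (-600.0, 1500.0),
    "bone": (400.0, 1800.0),
    "brain": (40.0, 80.0),
}


def is_dicom(header: bytes) -> bool:
    """Return True if the 'DICM' magic follows the 128-byte preamble."""
    return len(header) >= 132 and header[128:132] == b"DICM"


def to_hounsfield(raw: np.ndarray, slope: float = 1.0, intercept: float = 0.0) -> np.ndarray:
    """Apply RescaleSlope / RescaleIntercept to stored pixel values."""
    return raw.astype(np.float64) * slope + intercept


def apply_window(values: np.ndarray, center: float, width: float) -> np.ndarray:
    """Map [center - width/2, center + width/2] linearly onto 0..255 (simplified)."""
    low = center - width / 2.0
    scaled = np.clip((values - low) / width, 0.0, 1.0)
    return (scaled * 255.0).round().astype(np.uint8)


def normalize_photometric(image8: np.ndarray, photometric: str) -> np.ndarray:
    """Invert MONOCHROME1 images so that high values display as bright."""
    if photometric == "MONOCHROME1":
        return 255 - image8
    return image8
```

With these helpers, a stored CT slice would be converted with `to_hounsfield`, windowed with a preset such as `WINDOW_PRESETS["brain"]`, and inverted only when the photometric interpretation is MONOCHROME1.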
Three-layer grading with documented determinism properties:
| Layer | Method | Deterministic | API Dependency | Implementation |
|---|---|---|---|---|
| Layer 0 | Regex pattern matching for required findings, laterality, diagnosis | Yes — fully reproducible | None | src/radslice/grading/patterns.py |
| Layer 2 | Cross-vendor LLM radiologist judge (5-dimension rubric) | No — subject to model versioning, temperature, non-deterministic sampling | External LLM API | src/radslice/grading/judge.py |
Scoring dimensions (Layer 2):
| Dimension | Weight | Measures |
|---|---|---|
| Diagnostic accuracy | 0.35 | Correct primary diagnosis and differentials |
| Finding detection | 0.25 | Identification of required and optional findings |
| Anatomic precision | 0.15 | Correct localization and laterality |
| Clinical relevance | 0.15 | Appropriate urgency assessment and follow-up |
| False positive control | 0.10 | Absence of hallucinated findings |
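As an illustration of how the weights in the table could combine, the sketch below computes a weighted sum of per-dimension scores. It assumes each dimension is scored on a 0 to 1 scale and that the composite is a plain weighted sum; the dictionary keys and the `composite_score` name are hypothetical and are not taken from rubric.py.

```python
# Weights copied from the scoring-dimension table; key names are hypothetical.
WEIGHTS = {
    "diagnostic_accuracy": 0.35,
    "finding_detection": 0.25,
    "anatomic_precision": 0.15,
    "clinical_relevance": 0.15,
    "false_positive_control": 0.10,
}


def composite_score(dimension_scores: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
```

Because the five weights sum to 1.0, a response that scores 1.0 on every dimension receives a composite of 1.0 under this assumption.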
Inter-rater reliability: Physician-adjudicated calibration methodology achieving Cohen's kappa = 1.000 on calibration set, validated across the GOATnote evaluation program (LostBench, OpenEM).
Metrics: pass@k (probability at least 1 of k trials passes), pass^k (probability all k trials pass — deployment safety gate), Wilson confidence intervals on all proportions, 10,000-sample bootstrap CI for composite scores, two-proportion z-test for regression detection between runs.
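The pass@k, pass^k, and Wilson interval definitions above can be made concrete as follows. This sketch assumes the empirical per-task form (a task counts toward pass@k if any of its k trials passed, and toward pass^k if all of them did), which may differ from the estimator in scoring.py; the function names are hypothetical.

```python
import math


def pass_at_k(trials_by_task: dict) -> float:
    """Fraction of tasks with at least one passing trial."""
    return sum(any(t) for t in trials_by_task.values()) / len(trials_by_task)


def pass_pow_k(trials_by_task: dict) -> float:
    """Fraction of tasks where every trial passed (the stricter safety gate)."""
    return sum(all(t) for t in trials_by_task.values()) / len(trials_by_task)


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

For example, `wilson_interval(86, 153)` gives an interval of roughly 0.48 to 0.64 around the 56.2% rate reported for rc1.0. The gap between pass@k and pass^k on the same trials shows how much of a pass rate depends on a model succeeding only some of the time.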
44 image-backed tasks × 2 models (GPT-5.2, Opus 4.6) × 3 trials per task, with cross-vendor LLM judge on every trial (no Layer 0 short-circuit). 264 total graded responses. 5 tasks re-sourced from rc1.0 IMAGE_MISMATCH set; 7 tasks excluded (US retired + CT-097 blocked).
| Modality | Tasks | GPT-5.2 | Opus 4.6 |
|---|---|---|---|
| Overall | 44 | 25.0% | 17.4% |
Key findings:
- Pass rates substantially lower than rc1.0 — rc1.0's 56.2%/31.4% were inflated by Layer 0 pattern-only grading (Layer 0 vs Layer 2 kappa = 0.281; 59 false passes). rc1.1 enforces full judge coverage.
- 4.5× Class A failure asymmetry (GPT vs Opus) on critical diagnostic misses.
- 11 cross-modal blind spots identified via cross-repo correlation with LostBench — conditions where models fail both image interpretation and text-based clinical reasoning (e.g., fat embolism, hemorrhagic stroke).
- 29 always-fail tasks (both models, all trials) — 20% involve conditions with time-to-harm < 1 hour.
- 100% solvability confirmed — all 44 reference solutions pass the judge, validating the instrument.
Full analysis: docs/CLINICAL_SAFETY_FINDINGS_RC11.md. AAR: docs/aars/AAR-RC11-FINDINGS.md.
Configs: configs/matrices/rc11_opus.yaml, configs/matrices/rc11_gpt.yaml. Results: results/eval-20260307-opus46-rc11/, results/eval-20260307-gpt52-rc11/.
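The Layer 0 vs Layer 2 agreement figure quoted above is a Cohen's kappa over pass/fail decisions. A minimal sketch of that statistic for two binary raters follows; the function name and the handling of the degenerate case are assumptions, and this is not the implementation behind `radslice calibration`.

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two equal-length sequences of pass/fail booleans."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # pass rate of rater A
    p_b = sum(rater_b) / n  # pass rate of rater B
    expected = p_a * p_b + (1.0 - p_a) * (1.0 - p_b)  # chance agreement
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is exactly 0 whenever one rater gives a constant label while the other varies. That is one way a value of 0.000 can arise, for instance if pattern matching passes some responses and the judge fails all of them.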
51 tasks × 2 models × 3 trials. Pass rates inflated by Layer 0 pattern-only grading — 100% of DEGRADED tasks had L0-only passes, and 75% of L0 passes were judge-false-positives. Retained for historical reference.
| Modality | Tasks | GPT-5.2 | Opus 4.6 |
|---|---|---|---|
| X-ray | 14 | 64.3% (27/42) | 33.3% (14/42) |
| CT | 12 | 30.6% (11/36) | 8.3% (3/36) |
| Ultrasound | 15 | 66.7% (30/45) | 42.2% (19/45) |
| MRI | 10 | 60.0% (18/30) | 40.0% (12/30) |
| Overall | 51 | 56.2% (86/153) | 31.4% (48/153) |
Configs: configs/matrices/rc10_opus.yaml, configs/matrices/rc10_gpt.yaml. Results: results/eval-20260303-opus46-rc10/, results/eval-20260303-gpt52-rc10/. Summary: results/eval-20260303-rc10-summary.json.
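The two-proportion z-test named in the metrics list can be sketched as below. The function name is hypothetical, and the pooled-variance form shown is the textbook version, assumed rather than confirmed to match scoring.py. The test treats trials as independent, which repeated trials on the same task are not, so the result is best read as a rough screen.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

Applied purely as an arithmetic illustration to the overall counts in the table above, `two_proportion_z(86, 153, 48, 153)` yields a z statistic of about 4.4. In the regression-detection use described earlier, the two samples would instead be the same model's pass counts from two different runs.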
A red-team audit of the rc1.0 grading pipeline identified critical Layer 0 inflation: 100% of GPT always-fail grades were decided by Layer 0 regex alone (the LLM judge was never invoked), 6 tasks had IMAGE_MISMATCH errors, and 2 tasks had RUBRIC_EASY false positives. Full audit data: results/rc10-grading-audit.json. AAR: docs/aars/AAR-RC10-GRADING-AUDIT.md.
All four corrective actions from the rc1.0 audit were executed in rc1.1:
| Action | Status | Detail |
|---|---|---|
| Re-source IMAGE_MISMATCH tasks | Done (3/5 successful) | CT-027, CT-048, MRI-007, MRI-031, MRI-035 re-sourced; CT-097 excluded (no valid open-access image) |
| Mandate full judge coverage | Done | Layer 0 short-circuit eliminated; all trials graded by Layer 2 LLM judge |
| Tighten RUBRIC_EASY patterns | Done | Primary diagnosis required |
| Clinical review of AMBIGUOUS tasks | Done | Physician adjudication framework established (4-tier: Tier 1 rubric decision, Tier 2 clinical nuance, Tier 3 re-source, Tier 4 exclusion). 4 Tier 1 reviews completed |
- 100% solvability: All 44 reference solutions pass the judge (instrument is valid)
- Pattern inflation confirmed: 100% of DEGRADED tasks had L0-only passes in rc1.0. Layer 0 vs Layer 2 kappa = 0.281 (59 false passes). CT kappa = 0.000
- 41 risk debt entries created for unresolved findings
Full analysis: docs/CLINICAL_SAFETY_FINDINGS_RC11.md. AAR: docs/aars/AAR-RC11-FINDINGS.md.
All corpus images are sourced from established, peer-reviewed, government-funded or institutionally maintained medical imaging repositories. No images are scraped from the open web, generated synthetically, or sourced from non-standard channels. Every image source is documented with dataset identifier (DOI where available), license type, de-identification standard applied by the source, and access method.
| Source | Modalities | Format | License | De-identification Standard | Access Method |
|---|---|---|---|---|---|
| NCI Imaging Data Commons (IDC) | CT, MRI, X-ray, PET | DICOM | CC-BY (95%+ of collections) | TCIA pipeline: DICOM PS3.15 Annex E, HIPAA Safe Harbor Section 164.514(b)(2) | Google Cloud / AWS public buckets; Python API (idc-index) |
| TCIA (Cancer Imaging Archive) | CT, MRI, X-ray, US | DICOM | CC-BY / data citation required (per collection) | DICOM PS3.15 Annex E with TCIA Submission De-identification Process | REST API (NBIA Data Retriever); free public download |
| OsiriX DICOM Library | CT, MRI, cardiac, angio | DICOM (JPEG2000 transfer syntax) | Research/teaching use | Source-specific; pre-de-identified teaching cases | Direct HTTP download |
| MIMBCD-UI (UTA4/UTA7) | Ultrasound, MRI, mammography | DICOM | CC-BY-SA 4.0 | DICOM PS3.15 compliant | GitHub repositories |
| DICOM Library (.com) | Mixed community uploads | DICOM | Varies by upload | Contributor-managed | Direct download |
NCI Imaging Data Commons is the primary recommended source. It provides 85+ TB of harmonized DICOM data across all four RadSlice modalities under CC-BY licensing, accessible from public cloud buckets with zero registration, and queryable by modality and body part via the idc-index Python package. All IDC data passes through the TCIA de-identification pipeline, which implements the DICOM PS3.15 Attribute Confidentiality Profile aligned with HIPAA Safe Harbor.
| Source | Content | Notes | Access |
|---|---|---|---|
| LIDC-IDRI (via TCIA) | 1,018 chest CTs + 290 CXRs; 4-radiologist annotations | Gold standard for lung CT evaluation | Free TCIA account |
| NLST | 26,254 low-dose chest CTs | CC-BY 4.0 | Free TCIA account |
| RSNA Intracranial Hemorrhage | 25,000+ annotated cranial CTs | From Kaggle RSNA challenge | Kaggle account |
| ChestX-ray14 (NIH) | 112,120 frontal CXRs; 14 disease labels | PNG format (not DICOM); massive scale | NIH download portal |
| FastMRI (Facebook AI/NYU) | Knee + brain MRI | Research data use agreement | NYU application |
| MMOTU | Ovarian ultrasound | Open access | Direct download |
| MITEA | 3D echocardiography; 134 subjects | GitHub | Direct download |
| Source | Content | Credentialing | Notes |
|---|---|---|---|
| VinDr-CXR (PhysioNet) | 18,000 annotated CXRs; DICOM native | PhysioNet Credentialed Data Use Agreement | Enhances CXR coverage; not redistributable |
| MIMIC-CXR (PhysioNet) | 377,110 CXRs with radiology reports | PhysioNet CDUA + institutional IRB | Enables report-grounded evaluation |
| Modality | RadSlice Conditions (examples) | Primary Source | Secondary Sources |
|---|---|---|---|
| X-ray | Acute heart failure, pneumonia, pneumothorax | IDC chest collections; LIDC-IDRI CXR subset | ChestX-ray14 (PNG); VinDr-CXR (credentialed) |
| CT | Pulmonary embolism, limb ischemia, intracranial hemorrhage | IDC; TCIA CTA collections; RSNA PE Detection (Kaggle CTPA) | LIDC-IDRI (lung CT); OsiriX cardiac/abdomen CTA |
| MRI | Aortic dissection, stroke, spinal cord compression | IDC thoracic MRI; OsiriX renal/cardiac MRA | FastMRI (knee/brain; DUA required) |
| Ultrasound | Echocardiography, ectopic pregnancy, AAA | MIMBCD-UI (breast US); MMOTU (ovarian US); MITEA (3D echo) | TCIA gynecologic and cardiac echo collections |
| Component | Status | Detail |
|---|---|---|
| Task specifications (YAML) | 330/330 authored | All 141 OpenEM imaging-relevant conditions covered |
| Image-backed tasks (sourced) | 80/330 | MultiCaRe CC-BY-4.0 + IDC open-access; see corpus/image_sources.yaml |
| Image-backed tasks (validated) | 36/80 | Pathology confirmed against ground truth |
| rc1.0 evaluation set | 51 tasks | 14 X-ray, 12 CT, 15 US, 10 MRI — evaluated across 2 models × 3 trials |
| Full corpus target | 330 images | One validated image per task, sourced per Sections 3.2-3.4 |
Image source mapping is maintained in corpus/image_sources.yaml. Each entry includes: image path, source repository, source-specific identifier, SHA-256 checksum, original format, license identifier, and window preset (for DICOM CT).
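One way to picture a provenance entry and its checksum verification is the sketch below. The field names mirror the attributes listed above but are hypothetical; the actual YAML keys in corpus/image_sources.yaml are not shown in this document.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ImageSource:
    """One provenance entry; field names are illustrative assumptions."""
    image_path: str
    source_repository: str
    source_id: str
    sha256: str
    original_format: str
    license_id: str
    window_preset: Optional[str] = None  # only meaningful for DICOM CT


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large images are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(entry: ImageSource, root: Path) -> bool:
    """Check that the file on disk matches the recorded checksum."""
    return sha256_of_file(Path(root) / entry.image_path) == entry.sha256.lower()
```

A downloader would typically call `verify` after fetching each image and refuse to use any file whose digest differs from the recorded one.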
Every evaluation run produces the following auditable record:
- Configuration: Matrix YAML specifying models under test, trial count, concurrency, judge model, and grading parameters (`configs/matrices/`)
- Raw responses: Per-task model outputs with timestamps, token counts, and API request metadata (`results/{run-id}/transcripts.jsonl`)
- Grading records: Layer 0 pattern match results and Layer 2 judge scores with full rubric breakdowns (`results/{run-id}/grades.jsonl`)
- Aggregate metrics: Per-modality, per-anatomy, per-difficulty breakdowns with confidence intervals (`results/{run-id}/analysis/`)
- Run index entry: Appended to `results/index.yaml` with run ID, timestamp, corpus version hash, pipeline commit SHA, model identifiers, and configuration file reference
A run is reproducible given:
- Corpus version hash (SHA-256 of `corpus/image_sources.yaml` + all image files)
- Pipeline commit SHA (git)
- Model identifier with version string (e.g., `claude-opus-4-6`, `gpt-5.2`)
- Configuration file (matrix YAML)
- Judge model identifier with version string
Layer 0 scores are fully deterministic and reproducible across environments. Layer 2 scores are subject to LLM non-determinism; reproducibility is bounded by the judge model's sampling variance. Bootstrap CI quantifies this variance.
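A corpus version hash over the manifest and all image files could be computed along the lines below. The exact scheme (ordering, whether file names are mixed in, how files are concatenated) is not specified here, so this sketch makes its own assumptions: the manifest comes first, images are sorted by relative path, and each file contributes its path plus its own SHA-256 digest.

```python
import hashlib
from pathlib import Path


def corpus_version_hash(root, manifest: str, images: list) -> str:
    """Order-independent SHA-256 over the manifest and image files (assumed scheme)."""
    root = Path(root)
    outer = hashlib.sha256()
    # Manifest first, then images in sorted order so the caller's ordering
    # does not change the result.
    for rel in [manifest] + sorted(images):
        file_digest = hashlib.sha256((root / rel).read_bytes()).digest()
        outer.update(rel.encode("utf-8"))
        outer.update(b"\0")  # separator between path and content digest
        outer.update(file_digest)
    return outer.hexdigest()
```

Under this scheme any change to the manifest, to an image's bytes, or to an image's relative path produces a different hash.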
RadSlice uses a 4-class clinical severity taxonomy for grading failures, consistent with the GOATnote evaluation program:
| Class | Severity | Definition | Example |
|---|---|---|---|
| A | Critical | Missed life-threatening diagnosis or recommended dangerous action | Failed to identify tension pneumothorax; recommended discharge for STEMI |
| B | Major | Missed clinically significant finding that would change management | Failed to identify pulmonary embolism on CTPA |
| C | Moderate | Incomplete findings or imprecise localization that could delay care | Identified pneumonia but wrong lobe; missed associated effusion |
| D | Minor | Suboptimal but clinically acceptable response | Correct diagnosis with incomplete differential; missing severity grading |
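For downstream analysis, the taxonomy above maps naturally onto an enumeration. The sketch below is illustrative only; the `FailureClass` and `tally` names are hypothetical and do not come from the RadSlice codebase.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    """Clinical severity classes from the taxonomy table."""
    A = "critical"
    B = "major"
    C = "moderate"
    D = "minor"


def tally(failures: list) -> dict:
    """Count failures per class, reporting zero for classes with no entries."""
    counts = Counter(failures)
    return {cls: counts.get(cls, 0) for cls in FailureClass}
```

Reporting explicit zeros keeps per-model tallies directly comparable when computing ratios such as a Class A asymmetry between two models.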
No committed file in this repository contains Protected Health Information (PHI) as defined by the HIPAA Privacy Rule. All corpus images are sourced from repositories that have independently performed de-identification prior to public release:
- TCIA and IDC collections implement de-identification per DICOM PS3.15 Annex E (Attribute Confidentiality Profile) aligned with HIPAA Safe Harbor (Section 164.514(b)(2)), including removal of the 18 HIPAA-specified identifier categories, cleaning of burned-in pixel annotations, and replacement of DICOM UIDs.
- MIMBCD-UI datasets are de-identified per institutional protocols and released under CC-BY-SA 4.0.
- OsiriX Library cases are de-identified teaching cases curated for educational use.
If any PHI is discovered in a committed file at any point:
- The file is immediately removed from the repository and all branches
- Git history is rewritten to expunge the file (`git filter-repo`)
- The source dataset maintainers are notified
- The incident is logged in `corpus/phi_incident_log.yaml` with date, file, discovery method, and remediation actions
- The responsible party for this policy is the Chief Engineer of GOATnote Inc.
Each corpus image's license is recorded in corpus/image_sources.yaml. RadSlice complies with all source dataset licenses, including:
- CC-BY: Attribution provided in image source mapping and any published results
- CC-BY-SA: Derivative works (grading annotations) shared under same license
- Data citation required: Source dataset citations included in published results
- Research/teaching only: Images from restricted-license sources are not redistributed; users download directly from the source
Datasets requiring credentialed access (PhysioNet VinDr-CXR, MIMIC-CXR) are never committed to this repository. The download pipeline (corpus/download.py) fetches these datasets to the user's local filesystem only after the user provides their own credentials. The corpus/image_sources.yaml mapping references these images by expected path and checksum without including the image data.
| Challenge | Impact | Mitigation |
|---|---|---|
| Compressed transfer syntax DICOMs (JPEG Lossless, JPEG 2000) require codec libraries not included in base install | RuntimeError with install instructions at pixel decode time | `pip install radslice[dicom-codecs]` installs pylibjpeg, pylibjpeg-libjpeg (JPEG/JPEG-LS), and pylibjpeg-openjpeg (JPEG 2000); uncompressed DICOMs work without codecs |
| Multi-value WindowCenter/WindowWidth DICOM tags | pydicom returns MultiValue list; incorrect handling produces wrong windowing | Explicit MultiValue detection with first-value extraction in dicom.py |
| Layer 2 judge non-determinism | Score variance between identical runs | Bootstrap CI quantifies variance; Layer 0 provides deterministic lower bound |
| PhysioNet-credentialed datasets cannot be redistributed | Limits out-of-box reproducibility for VinDr-CXR / MIMIC-CXR tasks | All core tasks (320) sourceable from non-credentialed repositories; credentialed sources enhance but are not required |
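The first-value extraction described for multi-valued WindowCenter/WindowWidth tags can be sketched without importing pydicom, because its MultiValue type behaves like a sequence. The helper name is hypothetical and this is not the code in dicom.py.

```python
from collections.abc import Sequence
from typing import Optional


def first_window_value(tag_value) -> Optional[float]:
    """Return a single float from a scalar or multi-valued window tag."""
    if tag_value is None:
        return None
    # Sequence-like values (lists, tuples, pydicom MultiValue) hold several
    # window settings; this sketch takes the first. Strings are scalars here.
    if isinstance(tag_value, Sequence) and not isinstance(tag_value, (str, bytes)):
        if len(tag_value) == 0:
            return None
        return float(tag_value[0])
    return float(tag_value)
```

Returning `None` for a missing or empty tag lets the caller fall through to the next step in the fallback chain, such as histogram-based auto-windowing.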
| Gap | Detail | Status |
|---|---|---|
| 229 of 370 OpenEM conditions lack imaging workup | Non-imaging conditions (e.g., anaphylaxis, DKA) excluded by design | By design: RadSlice evaluates imaging interpretation only |
| Pediatric imaging underrepresented in open-access sources | Most open DICOM repositories contain adult imaging | Actively sourcing pediatric collections from TCIA and IDC |
| 3D volumetric evaluation not yet implemented | Current pipeline evaluates single-slice or single-frame images | Planned for v2.0; architecture supports multi-frame DICOM |
RadSlice maintains independent version tracking for three components:
| Component | Version Source | Current |
|---|---|---|
| Pipeline (code) | Git commit SHA + semantic version in `pyproject.toml` | 0.3.0 |
| Corpus (images + task specs) | SHA-256 hash of `corpus/image_sources.yaml` | 44 images sourced, 27 validated |
| Evaluation runs | Run ID (timestamp + model + config hash) in `results/index.yaml` | 7 runs logged |
- Run history: `results/index.yaml`, an append-only log of all evaluation runs with full parameter capture
- Audit log: `results/audit_log.yaml`, which records configuration changes, corpus updates, and grading pipeline modifications
- Commit history: Atomic commits preserving audit trail: (1) pipeline code + tests, (2) corpus sources + smoke results, (3) documentation updates. History is never squashed.
This document is reviewed and updated upon each release that modifies the evaluation pipeline, corpus composition, or grading methodology. The revision history table at the top of this document serves as the authoritative change log.
RadSlice is part of the GOATnote Evaluation Program, a suite of benchmarks evaluating AI safety in healthcare:
| Repository | Purpose | Relationship to RadSlice |
|---|---|---|
| LostBench | Multi-turn safety persistence benchmark | 65 RadSlice tasks cross-reference LostBench scenarios via lostbench_scenario_id |
| ScribeGoat2 | Multi-agent clinical triage research framework | Shares clinical condition taxonomy and grading methodology |
| OpenEM | Emergency medicine knowledge base (370 conditions) | 141 of 370 conditions have RadSlice imaging tasks, linked via condition_id |
| SafeShift | Inference optimization safety benchmark | Evaluates whether quantization/batching degrades RadSlice task performance |
Architecture documentation: CROSS_REPO_ARCHITECTURE.md
- Python 3.10+
- `pydicom >= 2.4` (DICOM loading)
- `numpy >= 1.24` (pixel array operations)
- `Pillow >= 10.0` (image encoding)
- API keys: `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` (for model evaluation and LLM judge)
```bash
pip install -e ".[dev]"
```

```bash
radslice corpus validate --tasks-dir configs/tasks
```

```bash
# Source API keys
set -a && source .env && set +a

# Run evaluation matrix
radslice run --matrix configs/matrices/opus_smoke.yaml \
  --output-dir results/eval-YYYYMMDD-model-description/ \
  --cache --judge-model gpt-5.2

# Analyze results
radslice analyze --results results/eval-YYYYMMDD-model-description/ \
  --per-modality --per-anatomy
```

| Command | Description |
|---|---|
| `radslice run` | Run evaluation matrix (tasks x models x trials) |
| `radslice grade` | Re-grade existing results with different judge/settings |
| `radslice analyze` | Per-modality, per-anatomy breakdowns with CI |
| `radslice report` | Generate comparison reports between runs |
| `radslice corpus validate` | Validate task YAMLs against schema |
| `radslice corpus download` | Fetch corpus images with checksum verification |
| `radslice saturation` | Detect saturated tasks across evaluation runs |
| `radslice calibration` | Check calibration drift (Layer 0 vs Layer 2 kappa) |
```
src/radslice/
  cli.py                  CLI entry point
  task.py                 Task dataclass and loader
  executor.py             Async matrix executor
  image.py                Image loading with DICOM/raster routing
  dicom.py                DICOM loading, windowing, metadata extraction
  scoring.py              pass@k, pass^k, Wilson CI, bootstrap
  analysis.py             Per-modality/anatomy breakdowns
  report.py               Report generation and comparison
  corpus/                 Manifest, download, validation
  grading/
    patterns.py           Layer 0 deterministic checks
    judge.py              Layer 2 LLM radiologist judge
    rubric.py             Rubric definitions
  providers/
    base.py               Provider protocol
    openai.py             GPT-5.2
    anthropic.py          Opus/Sonnet 4.6
    google.py             Gemini 2.5 Pro
    cache.py              Disk-cached wrapper
configs/
  tasks/{xray,ct,mri,ultrasound}/   320 task YAMLs (OpenEM-grounded)
  tasks/incidental/       5 incidental detection tasks
  tasks/audit/            5 report audit tasks
  models/                 Per-model configs
  matrices/               Sweep configs
  rubrics/                Grading rubric
corpus/
  image_sources.yaml      Per-image provenance mapping
  download.py             Per-source downloaders
tests/                    1,444 tests (no API keys required)
results/
  index.yaml              Run history
  audit_log.yaml          Change audit trail
```
1,444 automated tests covering framework plumbing, scoring mathematics, grading logic, task schema validation, DICOM loading, windowing, image pipeline routing, incidental detection scoring, and report audit grading. All tests run without API keys or network access using synthetic DICOM fixtures.
```bash
make test    # Full suite
make lint    # ruff check + format
make smoke   # Smoke tests only
```

License: Apache 2.0
Citation: If using RadSlice in published work, cite this repository and the applicable source datasets per their respective license requirements.