OpenICU is an open-source Python framework for extracting and harmonising intensive care unit (ICU) data from heterogeneous sources — such as MIMIC-IV, eICU-CRD, and your own institutional database exports — into the standardised MEDS (Medical Event Data Standard) format.
Instead of writing one-off SQL or pandas scripts for every dataset and every study, you describe what to extract in declarative YAML configurations and let OpenICU handle the how: typed reading, joins, timestamp reconstruction, unit-aware event codes, and cross-dataset concept harmonisation. OpenICU ships with curated configurations for public ICU datasets and a growing dictionary of clinical concepts (vital signs, laboratory values, vasopressors, ventilation, and more), so the same study code can run against any supported dataset.
OpenICU is the spiritual successor to dataset-harmonisation tools like ricu (R), rebuilt in Python on a modern stack: Polars for fast, out-of-core streaming processing, Pydantic for validated configuration, and MEDS for an interoperable output format that plugs directly into the wider MEDS ecosystem of modelling and evaluation tools.
Source code: https://github.com/aidh-ms/OpenICU · Issues: https://github.com/aidh-ms/OpenICU/issues
- One concept, many datasets. Define a clinical concept (e.g. heart rate, norepinephrine rate, antibiotics) once; map it per dataset with small YAML files. Extraction code stays identical across MIMIC-IV, eICU-CRD, NWICU, and custom sources.
- MEDS-native output. All output is written as MEDS-compliant Parquet (subject_id, time, code, numeric_value, text_value, plus configurable extension columns) with full metadata (dataset.json, codes.parquet), ready for MEDS-compatible downstream tooling.
- Declarative, versioned, reproducible. Every table, event, and concept is a versioned YAML config with a stable identifier. The exact merged configuration used for a run is snapshotted into the output project, so results can be traced and reproduced.
- Versions and variants as diffs. New dataset versions and variants (like the eICU demo) extend a reference version and state only their differences, so there are no copied configs to keep in sync. The entire eICU demo configuration is two small files.
- Fully offline. Designed for sensitive medical data: no network access required, nothing leaves your secure perimeter.
- Scales down and up. Polars lazy streaming lets you process full MIMIC-IV on a 16–32 GB laptop, without a database server or cluster.
- Extensible. Add new datasets by writing YAML (no Python required); add new transformation callbacks or complex concept logic in Python where YAML isn't enough.
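To illustrate the "versions and variants as diffs" idea, here is a minimal sketch of how a variant that states only its differences could be resolved against a reference config. It is not OpenICU's actual merge logic: the function name `merge_config`, the recursive merge rule, and the example keys are assumptions made for illustration.

```python
from copy import deepcopy
from typing import Any


def merge_config(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `base` with `override` applied recursively (hypothetical helper)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists from the variant replace the reference value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical reference config and a variant that lists only what differs.
reference = {
    "path": "icu/chartevents.csv.gz",
    "params": {"format": "%Y-%m-%d %H:%M:%S", "separator": ","},
}
variant = {"path": "icu/chartevents.csv", "params": {"separator": ";"}}

resolved = merge_config(reference, variant)
print(resolved)
```

The reference dictionary is left untouched, so several variants can be resolved against the same reference.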
OpenICU organises processing as a pipeline of steps that operate inside a project directory:
flowchart LR
A[Raw source data<br>CSV/CSV.GZ] -->|ExtractionStep| B[Dataset-level MEDS events<br>e.g. mimic-iv//chartevents//220045//Heart Rate//bpm]
B -->|ConceptStep| C[Harmonised concepts<br>e.g. heart_rate//bpm]
C --> D[Your analysis /<br>MEDS ecosystem tools]
- Extraction step: reads the raw tables of each dataset (typed CSV scan, lookup-table joins, timestamp reconstruction) and emits one MEDS event stream per configured event. Event codes preserve full provenance: dataset//table//itemid//label//unit.
- Concept step: maps dataset-specific event codes onto shared, dataset-agnostic clinical concepts using pattern matching (simple concepts), computation over other concepts (derived concepts), or custom Python transformers (complex concepts). Concept dependencies are resolved automatically.
- Sharding step (in development) — re-partitions harmonised data into analysis-ready shards.
Each step reads its inputs and writes its outputs as a self-contained MEDS dataset inside the project, so intermediate results are inspectable and individual steps can be re-run in isolation.
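Automatic dependency resolution between concepts usually means processing them in topological order, so that a derived concept runs only after the concepts it depends on. The sketch below shows the idea with the standard-library `graphlib` module. The concept names and the dependency table are hypothetical, and OpenICU may resolve dependencies differently.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: concept -> concepts it is computed from.
dependencies = {
    "heart_rate": set(),
    "weight": set(),
    "height": set(),
    "bmi": {"weight", "height"},  # derived concept
}

# static_order() yields every concept after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if concepts depend on each other in a cycle, which is a convenient way to reject an invalid configuration early.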
OpenICU ships with ready-to-use extraction and concept configurations under config/:
| Dataset | Version | Extraction configs | Concept mappings |
|---|---|---|---|
| MIMIC-IV | 3.1, 2.2 | 19 tables (ICU + hosp) | ~80 concepts |
| MIMIC-IV demo | 2.2 | inherited from MIMIC-IV via extends | inherited |
| eICU-CRD | 2.0 | 14 tables | in progress |
| eICU-CRD demo | 2.0 | inherited from eICU-CRD via extends | inherited |
| NWICU | 0.1.0 | 9 tables | in progress |
The shared concept dictionary in config/concept/ currently covers ~90 concepts across vital signs, blood gas, clinical chemistry, hematology, medications (incl. vasopressors and antibiotics), neurological scores, respiratory parameters, fluid output, and demographics.
Note: Access to the public datasets themselves requires the usual credentialing on PhysioNet. OpenICU works on the downloaded CSV/CSV.GZ files, so no database setup is needed.
OpenICU requires Python 3.13+. Until the first PyPI release, install from GitHub:
# with pip
pip install git+https://github.com/aidh-ms/OpenICU
# with uv
uv add git+https://github.com/aidh-ms/OpenICU

For development, clone the repository and use the included dev container or uv:
git clone https://github.com/aidh-ms/OpenICU.git
cd OpenICU
uv sync --all-groups

A pipeline is two small YAML files plus a few lines of Python.
1. Configure the extraction step (config/extraction.yml) — point OpenICU at the bundled dataset configs and your local data:
name: Extraction
version: 1.0.0
config_files:
  - path: /path/to/OpenICU/config/dataset/mimic-iv/3.1/dataset/
config:
  data:
    - name: mimic-iv
      path: /path/to/physionet.org/files/mimiciv/3.1

2. Configure the concept step (config/concept.yml): select the concept dictionary and per-dataset mappings:
name: Concept
version: 1.0.0
config_files:
  - path: /path/to/OpenICU/config/concept
config:
  extraction_step: Extraction
  dataset_configs:
    - name: mimic-iv
      path: /path/to/OpenICU/config/dataset/mimic-iv/3.1/concept/

3. Run the pipeline:
from pathlib import Path

from open_icu import OpenICUProject, ExtractionStep, ConceptStep

config_path = Path.cwd() / "config"
project_path = Path.cwd() / "output" / "project"

with OpenICUProject(project_path) as project:
    extraction_step = ExtractionStep.load(project, config_path / "extraction.yml")
    extraction_step.run()

    concept_step = ConceptStep.load(project, config_path / "concept.yml")
    concept_step.run()

4. Use the results. The project directory now contains one MEDS dataset per step:
output/project/
├── configs/ # snapshot of every config used (reproducibility)
├── workspace/ # intermediate per-step files
└── datasets/
├── extraction/
│ ├── data/mimic-iv/3.1/<table>/<EVENT>.parquet
│ └── metadata/{dataset.json, codes.parquet}
└── concept/
├── data/<concept>/<version>/<dataset>.parquet
└── metadata/{dataset.json, codes.parquet}
Read it with any Parquet-capable tool, e.g.:
import polars as pl

heart_rate = pl.scan_parquet(
    project_path / "datasets" / "concept" / "data" / "heart_rate" / "1.0.0" / "mimic-iv.parquet"
).collect()

A complete runnable example lives in example/pipeline.ipynb.
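Because the concept output follows the fixed layout shown above (data/<concept>/<version>/<dataset>.parquet), locating files needs nothing beyond `pathlib`. The helpers below are a small sketch built on that layout. `concept_file` and `available_concepts` are hypothetical names and are not part of the OpenICU API.

```python
from pathlib import Path


def concept_file(project_path: Path, concept: str, version: str, dataset: str) -> Path:
    """Build the expected Parquet path for one concept, version, and dataset."""
    return project_path / "datasets" / "concept" / "data" / concept / version / f"{dataset}.parquet"


def available_concepts(project_path: Path) -> list[tuple[str, str, str]]:
    """List (concept, version, dataset) triples found in the concept output."""
    data_dir = project_path / "datasets" / "concept" / "data"
    if not data_dir.is_dir():
        return []
    found = []
    for path in sorted(data_dir.glob("*/*/*.parquet")):
        # Layout: <concept>/<version>/<dataset>.parquet
        found.append((path.parent.parent.name, path.parent.name, path.stem))
    return found


print(concept_file(Path("output/project"), "heart_rate", "1.0.0", "mimic-iv"))
```

The returned path can be passed straight to `pl.scan_parquet`, and `available_concepts` gives a quick inventory of what a run produced.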
OpenICU is configured at three levels — see the documentation for the full reference.
Dataset/table configs (config/dataset/<dataset>/<version>/dataset/*.yml) describe how to turn one raw table into MEDS events: typed columns, joins, and event definitions.
path: icu/chartevents.csv.gz
columns:
  - name: subject_id
    type: int64
  - name: charttime
    type: datetime
    params: { format: "%Y-%m-%d %H:%M:%S" }
  # ...
join:
  - path: icu/d_items.csv.gz
    both_on: [itemid]
    columns: [{ name: itemid, type: int64 }, { name: label, type: string }]
events:
  - name: CHART
    columns:
      subject_id: col(subject_id)
      time: col(charttime)
      code: [col(itemid), col(label), col(valueuom)]
      numeric_value: col(valuenum)
      text_value: col(value)

Concept configs (config/concept/<category>/*.yml) define dataset-agnostic clinical concepts:
name: heart_rate
version: 1.0.0
unit: bpm

Concept mappings (config/dataset/<dataset>/<version>/concept/*.yml) connect a concept to dataset-specific codes:
type: simple
mappings:
  - pattern:
      table: chartevents
      event: CHART
      code: (220045//Heart Rate)
    columns:
      numeric_value: col(numeric_value)

Wherever values are computed, configs use a small expression language with composable callbacks, e.g. reconstructing absolute timestamps from eICU's relative offsets:
callbacks:
  - to_datetime(hospitaldischargeyear, 1, 1, hospitaladmittime24, hospitaladmitoffset, "minutes", output=admission_timestamp)
post_join_callbacks:
  - add_offset(admission_timestamp, labresultoffset, output=event_timestamp)

Arithmetic, comparisons, and boolean logic work as ordinary expressions (col(weight) / (col(height) * col(height))), and custom callbacks can be registered from Python.
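To make the offset arithmetic concrete, the sketch below shows in plain Python what shifting an anchor timestamp by a relative offset amounts to. It does not reproduce OpenICU's `add_offset` callback. `shift_by_offset` is a hypothetical stand-in, the timestamps are made up, and the real callbacks operate on whole columns rather than single values.

```python
from datetime import datetime, timedelta


def shift_by_offset(anchor: datetime, offset: float, unit: str = "minutes") -> datetime:
    """Return `anchor` shifted by `offset`; `unit` must be a timedelta keyword."""
    return anchor + timedelta(**{unit: offset})


# Made-up admission time; offsets are minutes relative to it.
admission = datetime(2015, 1, 1, 8, 30)

lab_time = shift_by_offset(admission, 95)        # 95 minutes after admission
pre_admission = shift_by_offset(admission, -45)  # negative offsets precede it

print(lab_time)       # 2015-01-01 10:05:00
print(pre_admission)  # 2015-01-01 07:45:00
```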
- Package overview
- Installation
- Basic usage
- User guide: pipeline & projects · extraction configs · dataset versions & variants · concept configs · expression language
- Contributing
- Architecture documentation following arc42 in docs/arc/
OpenICU is in active development (pre-1.0); configuration formats may still change between minor versions. Current focus areas:
- Completing concept mappings for eICU-CRD and NWICU
- The sharding step for analysis-ready data partitioning
- Additional datasets (e.g. MIMIC-IV demo, HiRID, AmsterdamUMCdb)
- A first PyPI release
Contributions of dataset configurations and concept definitions are especially welcome — they are pure YAML and require no changes to the framework itself.
We welcome contributions! Please use the dev container for a ready-to-go environment, follow Conventional Commits (feat:, fix:, docs:, …), and make sure checks pass before opening a PR:
uv run ruff format && uv run ruff check # format & lint
uv run ty check . # type checking
uv run pytest # tests

See the contributing guide for details, and our Code of Conduct.
If you use OpenICU in your research, please cite it via the metadata in CITATION.cff (use the "Cite this repository" button on GitHub).
- MEDS: the Medical Event Data Standard that OpenICU targets
- ricu: R package for ICU data harmonisation that inspired OpenICU's concept dictionary
- YAIB: Yet Another ICU Benchmark, a complementary benchmarking framework
OpenICU is released under the MIT License. Developed by the AIDH MS team at the University of Münster and the Medical University of Innsbruck.