A curated collection of open-source technologies and an accompanying CLI (dml) for experimenting with modern data architecture and MLOps locally.
Provisioning a local data environment with distributed systems can be highly complex. The Open DataML Stack streamlines this process by resolving dependency conflicts, network routing configurations, and integration challenges across tools like Kafka, Spark, Flink, Iceberg, and Airflow. It provides a cohesive, Docker-based blueprint that operates seamlessly out of the box.
The stack is organized into distinct profiles that can be launched independently or together:
- Event Streaming: Kafka (KRaft), Schema Registry (Karapace), Kafka Connect
- Processing Engines: Apache Spark, Apache Flink
- Storage & Catalog: SeaweedFS (S3-compatible), Iceberg REST Catalog, ClickHouse, PostgreSQL (pgvector), Valkey, Apache Fluss
- Orchestration & MLOps: Apache Airflow, MLflow, Feast Feature Store
- Federation & BI: Trino, Metabase
- Governance & Observability: OpenMetadata, Marquez (Lineage), Prometheus, Grafana
- Docker: Docker Engine or Docker Desktop must be running. We recommend allocating 8 GB to 16 GB of RAM to Docker, as the data processing engines are resource-heavy.
- Python: Version 3.10 or higher.
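If you want to verify these prerequisites from a script before installing, a minimal sketch could look like the following. It is not part of dml; the helper names `python_ok` and `docker_cli_found` are made up for illustration, and the Docker check only confirms that a `docker` executable is on your PATH, not that the daemon is running.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites above


def python_ok(version=None):
    """Return True if a (major, minor) version tuple meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= MIN_PYTHON


def docker_cli_found():
    """Return True if a `docker` executable is on PATH.

    This only checks that the CLI is installed. It does not confirm
    that the Docker daemon is running or how much RAM it has.
    """
    return shutil.which("docker") is not None


if __name__ == "__main__":
    print(f"Python >= 3.10: {python_ok()}")
    print(f"docker on PATH: {docker_cli_found()}")
```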
Since dml is a CLI tool, it is highly recommended to install it in an isolated environment using uv tool or pipx.
Using uv (Recommended):
uv tool install dml-cli
Using pipx:
pipx install dml-cli
Using pip:
pip install dml-cli
Get your local cluster up and running in three simple steps.
1. Initialize your workspace
This command copies the default Docker Compose files and configurations into a hidden .dml folder in your current directory.
dml init
2. Explore available profiles
See a full list of technologies you can launch.
dml list
3. Launch the streaming and batch processing engines
Bring up a robust data engineering environment.
dml up kafka flink1 spark
Note: You do not need to memorize dependencies. The CLI will automatically detect that these profiles require foundational infrastructure and will launch PostgreSQL, SeaweedFS (S3), and the Iceberg REST Catalog for you before starting the target compute engines.
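To build intuition for this kind of dependency resolution, here is a rough sketch using a topological sort from Python's standard library. It is an illustration only, not the CLI's actual implementation: the `DEPENDENCIES` map, the profile names for the infrastructure services, and the edges between them are all assumptions made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map, for illustration only. The names and edges
# below are assumptions, not the stack's real dependency graph.
DEPENDENCIES = {
    "postgres": [],
    "seaweedfs": [],
    "iceberg-rest": ["postgres", "seaweedfs"],
    "kafka": [],
    "flink1": ["iceberg-rest"],
    "spark": ["iceberg-rest"],
}


def resolve_launch_order(targets, dependencies=DEPENDENCIES):
    """Return the targets plus all transitive dependencies, dependencies first."""
    needed = {}
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        if name not in dependencies:
            raise KeyError(f"unknown profile: {name}")
        needed[name] = dependencies[name]
        stack.extend(dependencies[name])
    # TopologicalSorter expects {node: predecessors} and yields predecessors
    # before the nodes that depend on them.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(resolve_launch_order(["kafka", "flink1", "spark"]))
```

In this sketch, requesting `kafka`, `flink1`, and `spark` also pulls in the three infrastructure entries and places them ahead of the engines that need them.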
The dml CLI orchestrates the Open DataML Stack and is logically grouped by functionality. You can append --help to any command for deeper parameter details.
- --verbose: Enable debug-level logging across all commands.
- -w, --workspace PATH: Path to the DML workspace directory (default: ./.dml).
- dml list: List all available profiles and their capabilities.
- dml explain <profile>: Explain the details, services, images, and dependencies of a profile.
- dml ps: List Docker containers managed by the Open DataML Stack.
- dml info: View package and system-wide Docker daemon health status.
- dml init: Initialize a local .dml workspace for custom configurations.
- dml pull: Pre-fetch Docker images without starting the containers.
- dml up: Launch DataML profiles (automatically resolves upstream dependencies).
- dml down: Stop and remove profile containers and networks.
- dml iceberg: Execute PyIceberg CLI commands natively within the stack.
- dml logs: Fetch the logs of containers managed by specific profiles.
- dml restart: Restart one or more specific profiles.
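If you drive dml from Python automation, a small helper for assembling the argument list can keep scripts tidy. The sketch below is hypothetical: `build_dml_command` is not part of dml, and it assumes the global options (`--verbose`, `--workspace`) are placed before the subcommand, which you should confirm with `dml --help`. It only prints the commands, so it runs without dml installed.

```python
import shlex


def build_dml_command(subcommand, *args, workspace=None, verbose=False):
    """Assemble an argv list for the dml CLI.

    Assumption: global options go before the subcommand.
    """
    argv = ["dml"]
    if verbose:
        argv.append("--verbose")
    if workspace is not None:
        argv += ["--workspace", str(workspace)]
    argv.append(subcommand)
    argv.extend(args)
    return argv


if __name__ == "__main__":
    commands = [
        build_dml_command("explain", "kafka"),
        build_dml_command("up", "kafka", "flink1", "spark", verbose=True),
        build_dml_command("down", "--all", "--volumes", workspace="./.dml"),
    ]
    for cmd in commands:
        # Print instead of executing; pass cmd to subprocess.run to run it.
        print(shlex.join(cmd))
```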
# View all profiles and exposed ports
$ dml list -d
# See exactly what the kafka profile provisions
$ dml explain kafka
# Launch specific profiles and their dependencies
$ dml up flink1 kafka spark
# Complete teardown and wipe all data
$ dml down --all --volumes
The Open DataML Stack is designed to be fully hackable. When you run dml init, a local ./.dml/ workspace is generated in your current working directory.
This folder contains all the underlying configurations that power the stack:
- compose-*.yml: The actual Docker Compose definitions. You can edit these to change exposed ports, adjust memory limits, or inject new environment variables.
- registry.yml: The internal dependency graph.
- .env: The environment variables used across the stack (e.g., default credentials or timezones).
The CLI will always prioritize the files in your local ./.dml/ directory. If you make a mistake, you can always revert to the pristine default state by running dml init --force.
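Before editing anything, it can help to see what is actually in the workspace. The following sketch is not part of dml; `describe_workspace` is a hypothetical helper that relies only on the file names mentioned above (compose-*.yml, registry.yml, .env) and makes no assumptions about their contents.

```python
from pathlib import Path


def describe_workspace(workspace="./.dml"):
    """Summarize which configuration files exist in a DML workspace."""
    root = Path(workspace)
    return {
        # Names of the Docker Compose definitions, sorted for stable output.
        "compose_files": sorted(p.name for p in root.glob("compose-*.yml")),
        "has_registry": (root / "registry.yml").is_file(),
        "has_env": (root / ".env").is_file(),
    }


if __name__ == "__main__":
    summary = describe_workspace()
    print("Compose files:", ", ".join(summary["compose_files"]) or "(none found)")
    print("registry.yml present:", summary["has_registry"])
    print(".env present:", summary["has_env"])
```

Running it in a directory where dml init has not been run yet simply reports that nothing was found.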
If you want to contribute to the CLI itself, we welcome pull requests!
- Clone the repository.
- Install uv for dependency management.
- Sync the dependencies and install the project in development mode:
uv sync
- Install the pre-commit hooks to ensure formatting checks pass:
uv run pre-commit install
- Run the test suite:
uv run pytest tests/
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
