Open DataML Stack

A curated collection of open-source technologies and an accompanying CLI (dml) for experimenting with modern data architecture and MLOps locally.

Provisioning a local data environment built on distributed systems is complex. The Open DataML Stack takes care of dependency conflicts, network routing, and integration between tools such as Kafka, Spark, Flink, Iceberg, and Airflow. It provides a cohesive, Docker-based blueprint that works out of the box.

Bundled Technologies

The stack is organized into distinct profiles that can be launched independently or together:

Event Streaming: Kafka (KRaft), Schema Registry (Karapace), Kafka Connect
Processing Engines: Apache Spark, Apache Flink
Storage & Catalog: SeaweedFS (S3-compatible), Iceberg REST Catalog, ClickHouse, PostgreSQL (pgvector), Valkey, Apache Fluss
Orchestration & MLOps: Apache Airflow, MLflow, Feast Feature Store
Federation & BI: Trino, Metabase
Governance & Observability: OpenMetadata, Marquez (Lineage), Prometheus, Grafana

Prerequisites & Installation

Requirements

Docker: Docker Engine or Docker Desktop must be running. We recommend allocating 8 GB to 16 GB of RAM to Docker, because the data processing engines are resource-heavy.
Python: Version 3.10 or higher.
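
The requirements above can be checked with a short preflight script. This is a minimal sketch and not part of the dml CLI: the function names are hypothetical, and it assumes the `docker` binary is on your PATH so that `docker info --format '{{.MemTotal}}'` can report the memory available to the daemon in bytes.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)        # minimum version from the Requirements list
MIN_DOCKER_MEM_GIB = 8      # lower end of the recommended RAM range


def python_ok(version=sys.version_info):
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version[:2]) >= MIN_PYTHON


def memory_ok(mem_bytes, minimum_gib=MIN_DOCKER_MEM_GIB):
    """Return True if Docker reports at least `minimum_gib` GiB of memory."""
    return mem_bytes >= minimum_gib * 1024**3


def docker_memory_bytes():
    """Ask the Docker daemon for its total memory; None if it is unreachable."""
    if shutil.which("docker") is None:
        return None
    try:
        result = subprocess.run(
            ["docker", "info", "--format", "{{.MemTotal}}"],
            capture_output=True, text=True, timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    mem = docker_memory_bytes()
    if mem is None:
        print("Docker: not reachable (is the daemon running?)")
    else:
        print(f"Docker memory: {mem / 1024**3:.1f} GiB, enough: {memory_ok(mem)}")
```

The value Docker reports can be slightly below the configured allocation, so treat a near miss as a hint rather than a hard failure.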

Installation

Because dml is a CLI tool, we recommend installing it in an isolated environment with uv tool or pipx.

Using uv (Recommended):

uv tool install dml-cli

Using pipx:

pipx install dml-cli

Using pip:

pip install dml-cli

Quick Start

Get your local cluster up and running in three simple steps.

1. Initialize your workspace. This command copies the default Docker Compose files and configurations into a hidden .dml folder in your current directory.

dml init

2. Explore available profiles. See a full list of technologies you can launch.

dml list

3. Launch the streaming and batch processing engines. This brings up Kafka, Flink, and Spark with a single command.

dml up kafka flink1 spark

Note: You do not need to memorize dependencies. The CLI will automatically detect that these profiles require foundational infrastructure and will launch PostgreSQL, SeaweedFS (S3), and the Iceberg REST Catalog for you before starting the target compute engines.
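
To illustrate the idea behind automatic dependency resolution, here is a minimal sketch using a topological sort. It is not the CLI's actual implementation: the dependency map and the profile names `postgres`, `seaweedfs`, and `iceberg-rest` are hypothetical, and the real graph lives in the workspace's registry.yml, whose format is not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: profile -> profiles that must start first.
# The real relationships are defined by the stack, not by this sketch.
DEPENDENCIES = {
    "postgres": set(),
    "seaweedfs": set(),
    "iceberg-rest": {"postgres", "seaweedfs"},
    "kafka": set(),
    "flink1": {"iceberg-rest"},
    "spark": {"iceberg-rest"},
}


def resolve(targets, dependencies=DEPENDENCIES):
    """Return the targets plus all upstream profiles, dependencies first."""
    needed, stack = set(), list(targets)
    # Walk upstream from each requested profile to collect everything required.
    while stack:
        profile = stack.pop()
        if profile not in needed:
            needed.add(profile)
            stack.extend(dependencies[profile])
    # Order the collected profiles so that each one follows its dependencies.
    subgraph = {p: dependencies[p] for p in needed}
    return list(TopologicalSorter(subgraph).static_order())


order = resolve(["kafka", "flink1", "spark"])
print(order)
```

The order among independent profiles is not fixed; the only guarantee in this sketch is that every profile appears after the profiles it depends on.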

CLI Command Reference

The dml CLI orchestrates the Open DataML Stack. Its commands are grouped by functionality below. Append --help to any command for details on its parameters.

Global Options

--verbose: Enable debug-level logging across all commands.
-w, --workspace PATH: Path to the DML workspace directory (default: ./.dml).

Inspection & Info

dml list: List all available profiles and their capabilities.
dml explain <profile>: Explain the details, services, images, and dependencies of a profile.
dml ps: List Docker containers managed by the Open DataML Stack.
dml info: View package and system-wide Docker daemon health status.

Workspace

dml init: Initialize a local .dml workspace for custom configurations.

Cluster Lifecycle

dml pull: Pre-fetch Docker images without starting the containers.
dml up: Launch DataML profiles (automatically resolves upstream dependencies).
dml down: Stop and remove profile containers and networks.

Data Operations

dml iceberg: Execute PyIceberg CLI commands natively within the stack.

Management

dml logs: Fetch the logs of containers managed by specific profiles.
dml restart: Restart one or more specific profiles.

Examples

# View all profiles and exposed ports
$ dml list -d

# See exactly what the kafka profile provisions
$ dml explain kafka

# Launch specific profiles and their dependencies
$ dml up flink1 kafka spark

# Complete teardown and wipe all data
$ dml down --all --volumes

Workspace Customization (.dml)

The Open DataML Stack is designed to be fully hackable. When you run dml init, a local ./.dml/ workspace is generated in your current working directory.

This folder contains all the underlying configurations that power the stack:

compose-*.yml: The actual Docker Compose definitions. You can edit these to change exposed ports, adjust memory limits, or inject new environment variables.
registry.yml: The internal dependency graph.
.env: The environment variables used across the stack (e.g., default credentials or timezones).

The CLI will always prioritize the files in your local ./.dml/ directory. If you make a mistake, you can always revert to the pristine default state by running dml init --force.
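
Because dml init --force restores the defaults, you may want to keep a copy of your edited files first. The following is a minimal sketch of such a backup helper, not a feature of the CLI: `backup_workspace` and the backup directory naming are hypothetical, and the `.env` content in the demo is a placeholder.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_workspace(workspace=".dml", backup_root=None):
    """Copy the workspace to a timestamped directory and return its path."""
    source = Path(workspace)
    if not source.is_dir():
        raise FileNotFoundError(f"No workspace found at {source}")
    # Default to placing the backup next to the workspace itself.
    root = Path(backup_root) if backup_root else source.resolve().parent
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = root / f"{source.name}.backup-{stamp}"
    shutil.copytree(source, target)
    return target


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so the sketch is safe to run anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / ".dml"
        demo.mkdir()
        (demo / ".env").write_text("TZ=UTC\n")  # placeholder content
        print("Backed up to", backup_workspace(demo))
```

In a real project you would call `backup_workspace()` from the directory that contains .dml and then run dml init --force.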

Local Development & Contributing

If you want to contribute to the CLI itself, we welcome pull requests!

Clone the repository.
Install uv for dependency management.
Sync the dependencies and install the project in development mode:
```
uv sync
```
Install the pre-commit hooks to ensure formatting checks pass:
```
uv run pre-commit install
```
Run the test suite:
```
uv run pytest tests/
```

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
