prakulhiremath/flashback

⚡ flashback

Git for Datasets — time-travel debugging and transformation lineage tracking for pandas & Polars.

📂 load  ──▶  🔍 filter  ──▶  ➕ with_columns  ──▶  ⏪ lag  ──▶  HEAD
                  │
              (before-lag)  ◀── fb.checkout("before-lag")

Why this exists

Every ML researcher has asked: "Why did my metric change?" Nobody knows.

You ran a 6-hour training job, the Sharpe ratio dropped from 1.4 to 0.9, and somewhere between the raw tick data and the feature matrix a silent transformation introduced look-ahead bias. You have no idea where.

DVC is too heavy — it versions entire files with S3 backends, CI pipelines, and YAML configs. You don't want to learn a new orchestration system; you want to know what happened to column price_lag1 between step 3 and step 7.

Git doesn't understand columns. git diff on a Parquet file is binary noise. It cannot tell you "this .filter() removed 412 rows" or "this .with_columns() introduced a null in 3% of rows."

flashback fixes this.

It wraps your DataFrame in a lightweight proxy that records every transformation as a node in an in-memory Directed Acyclic Graph (DAG). Each node is identified by a deterministic SHA-256 hash of its parent IDs, operation name, arguments, and schema, giving you:

  • Instant time-travel: fb.checkout("before-lag") returns the exact frame at that checkpoint with no I/O unless you ask for it.
  • Structural diffing: frame.diff(other) shows you exactly which rows were added or removed between any two checkpoints.
  • Beautiful lineage views: fb.visualize() renders a rich-powered git-log-style tree in your terminal, or an SVG graph in Jupyter.
  • Reproducibility: identical transformations applied to identical data always produce the same node ID, because node IDs are deterministic by construction.

Install

pip install flashback-df
# or, if you use uv (recommended):
uv add flashback-df

Requirements: Python ≥ 3.10, Polars ≥ 0.20, pandas ≥ 2.0.


Quickstart

import flashback as fb

# ── 1. Load any source ──────────────────────────────────────────────────────
df = fb.load("trades.parquet")          # Parquet
df = fb.load("prices.csv")             # CSV
df = fb.load(my_polars_df)             # existing Polars DataFrame
df = fb.load(my_pandas_df)             # existing Pandas DataFrame

# ── 2. Transform — every step is recorded automatically ─────────────────────
df = df.filter(fb.col("price") > 0)
df = df.with_columns(
    (fb.col("price") * fb.col("volume")).alias("notional")
)

# Tag a checkpoint before the next risky operation.
df = df.tag("before-lag")

df = df.lag("price", 1)               # sugar for shift(1) + tracking
df = df.rolling_mean("notional", 5)

# ── 3. Time-travel ──────────────────────────────────────────────────────────
df_clean = fb.checkout("before-lag")  # ← instant; no disk I/O

# ── 4. See what broke your Sharpe ratio ─────────────────────────────────────
fb.visualize()

Terminal output:

╭─ flashback lineage  •  4 commits  •  HEAD → rolling_mean ────────────────────────────────╮
│                                                                                          │
│  📂 LOAD  5,000 rows × 4 cols  [14:03:01]                                                │
│  │                                                                                       │
│  ├─ 🔍 filter  arg_0=...col("price")...  4,823 rows × 4 cols  #a1b2c3d4                  │
│  │                                                                                       │
│  ├─ ➕ with_columns  arg_0=...alias("notional")  4,823 rows × 5  [before-lag]  #e5f6a7   │
│  │                                                                                       │
│  ├─ ⏪ lag  column='price'  n=1  4,823 rows × 6  #b8c9d0                                 │
│  │                                                                                       │
│  └─ 📈 rolling_mean  window=5  4,823 rows × 7 ● HEAD  #01e2f3a4                          │
│                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────╯

API Reference

fb.load(source, *, label=None, track=True)

Load a DataFrame from a file path, a Polars DataFrame, a pandas DataFrame, or an existing FlashbackFrame, and begin tracking its lineage.

Param   Type                                                Description
source  str | pl.DataFrame | pd.DataFrame | FlashbackFrame  Data source
label   str | None                                          Human-readable root label (default: filename stem or "root")
track   bool                                                Register with the global registry (default: True)

Supported formats: .parquet, .csv, .json, .ndjson, .ipc, .arrow


fb.col(name)

Alias for polars.col. Use inside transform chains for IDE-friendly imports:

df = df.filter(fb.col("price") > 0)

fb.commit(frame, label, *, message="")

Tag the current state of frame with a human-readable label — analogous to git tag.

df = fb.commit(df, "before-normalise", message="Raw features, no scaling")

Or use the method form:

df = df.tag("before-normalise", message="Raw features, no scaling")

fb.checkout(label, *, frame=None)

Time-travel to a named checkpoint. Returns a new FlashbackFrame at that exact state, fully materialised.

df_original = fb.checkout("before-normalise")

If frame is provided, searches only that frame's lineage. Otherwise, searches the global registry.
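
The two lookup scopes can be pictured as two label tables, one global and one per frame. The sketch below is a hypothetical illustration in plain Python, not flashback's implementation: GLOBAL_LABELS, tag, and checkout are stand-in names, and frames are represented by plain tuples of rows.

```python
GLOBAL_LABELS = {}  # hypothetical global registry: label -> saved state


def tag(state, label, frame_labels=None):
    """Record `state` under `label` globally and, optionally, in a per-frame table."""
    GLOBAL_LABELS[label] = state
    if frame_labels is not None:
        frame_labels[label] = state
    return state


def checkout(label, frame_labels=None):
    """Look up `label` in the per-frame table if one is given, else globally."""
    table = GLOBAL_LABELS if frame_labels is None else frame_labels
    if label not in table:
        raise KeyError(f"unknown checkpoint: {label!r}")
    return table[label]


# States are immutable tuples, so a saved checkpoint cannot change later.
lineage_a = {}
raw = tag(((1, 10.0), (2, -5.0)), "raw", lineage_a)
clean = tuple(row for row in raw if row[1] > 0)

restored = checkout("raw", lineage_a)  # scoped to one frame's lineage
restored_global = checkout("raw")      # global lookup
```

Because the checkpoint holds the earlier state itself rather than a recipe for rebuilding it, the later filter that produced clean does not affect what checkout returns.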


fb.visualize(frame=None, *, style="tree", max_width=120)

Render the transformation lineage.

  • style="tree" — rich tree with icons, timestamps, shapes, node IDs.
  • style="dag" — compact ASCII graph (git log --graph style).
  • In Jupyter, automatically falls back to an SVG/HTML widget.

FlashbackFrame.lag(column, n=1, *, alias=None)

Shift column forward by n periods, so that each row receives the value from n rows earlier, and record the step in the lineage.

df = df.lag("price", 1)                    # → price_lag1
df = df.lag("price", 3, alias="price_t3")  # → price_t3
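
A lag should only ever look backwards: row t receives the value from row t - n. The plain-Python sketch below illustrates that behaviour on a list. It is not flashback's implementation; the standalone function lag and the None padding are assumptions chosen to mirror a positive shift.

```python
def lag(values, n=1):
    """Shift a sequence forward by n periods: row t receives the value from t - n."""
    if n < 0:
        # A negative shift would pull future values into the present row.
        raise ValueError("n must be non-negative; a negative shift looks ahead")
    values = list(values)
    kept = max(len(values) - n, 0)
    # The first n rows have no history yet, so they are padded with None.
    return [None] * (len(values) - kept) + values[:kept]


prices = [10.0, 11.0, 12.0, 13.0]
price_lag1 = lag(prices, 1)
price_lag3 = lag(prices, 3)
```

Rejecting negative n is a design choice of this sketch: it makes an accidental lead, the usual source of look-ahead bias, fail loudly instead of silently.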

FlashbackFrame.rolling_mean(column, window, *, alias=None, min_periods=None)

Rolling mean over window periods with lineage tracking.

df = df.rolling_mean("notional", 20)  # → notional_rmean20
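
To make the roles of window and min_periods concrete, here is a minimal plain-Python sketch of a trailing rolling mean. It is an illustration under assumptions, not flashback's code: it assumes that min_periods defaults to the window size and that rows with too little history yield None.

```python
def rolling_mean(values, window, min_periods=None):
    """Trailing mean: row t averages rows t - window + 1 .. t, never later rows."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if min_periods is None:
        min_periods = window  # assumed default: require a full window
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        # Emit None until enough observations have accumulated.
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


notional = [1.0, 2.0, 3.0, 4.0, 5.0]
strict = rolling_mean(notional, 3)
lenient = rolling_mean(notional, 3, min_periods=1)
```

With the assumed default, the first window - 1 rows are empty. Lowering min_periods fills them with means over shorter histories.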

FlashbackFrame.diff(other)

Structural diff between two frames. Returns a Polars DataFrame with a _diff column of "added" / "removed".

import polars as pl

delta = df_now.diff(df_old)
print(delta.filter(pl.col("_diff") == "removed"))
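
The added/removed semantics can be sketched without any DataFrame library. The example below treats each frame as a list of row tuples and compares them as multisets. This is an assumption about how a structural diff might behave (including the orientation: "added" means present in the newer frame only), not a description of flashback's internals; diff_rows is a hypothetical helper.

```python
from collections import Counter


def diff_rows(new_rows, old_rows):
    """Return (row, status) pairs: 'added' if only in new, 'removed' if only in old."""
    new_counts = Counter(new_rows)
    old_counts = Counter(old_rows)
    delta = []
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    for row, count in (new_counts - old_counts).items():
        delta.extend([(row, "added")] * count)
    for row, count in (old_counts - new_counts).items():
        delta.extend([(row, "removed")] * count)
    return delta


old = [(1, 10.0), (2, -5.0), (3, 7.5)]
new = [(1, 10.0), (3, 7.5), (4, 8.0)]
delta = diff_rows(new, old)
removed = [row for row, status in delta if status == "removed"]
```

Rows present in both frames do not appear in the result, which keeps the diff small when only a few rows change.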

FlashbackFrame.history()

Return the full transformation chain as a list of dicts (root → HEAD):

for step in df.history():
    print(step["op_name"], step["shape"], step["label"])

Persistence

Lineage graphs can be saved to and loaded from disk:

from flashback.storage import Storage

store = Storage(".flashback")  # or Storage.from_cwd()
store.save(df, frame_id="experiment-001")

# Later, in another session:
df = store.load("experiment-001")

The .flashback/ directory layout:

.flashback/
├── config.json
├── graphs/
│   └── experiment-001.json   # serialised DAG
└── cache/
    └── <node_id>.parquet     # materialised node snapshots
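
As a rough picture of what saving a graph under graphs/ could involve, the sketch below writes a list of node records to <root>/graphs/<frame_id>.json and reads it back. The JSON shape, the helper names save_graph and load_graph, and the node fields are all hypothetical; the actual on-disk format is defined by flashback's Storage class.

```python
import json
import tempfile
from pathlib import Path


def save_graph(root, frame_id, nodes):
    """Write node records to <root>/graphs/<frame_id>.json (hypothetical layout)."""
    graphs = Path(root) / "graphs"
    graphs.mkdir(parents=True, exist_ok=True)
    path = graphs / f"{frame_id}.json"
    path.write_text(json.dumps({"frame_id": frame_id, "nodes": nodes}, indent=2))
    return path


def load_graph(root, frame_id):
    """Read the node records back from disk."""
    path = Path(root) / "graphs" / f"{frame_id}.json"
    return json.loads(path.read_text())["nodes"]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".flashback"
    nodes = [
        {"id": "n0", "op": "load", "parents": []},
        {"id": "n1", "op": "filter", "parents": ["n0"]},
    ]
    saved_path = save_graph(root, "experiment-001", nodes)
    restored = load_graph(root, "experiment-001")
    round_trip_ok = restored == nodes
```

Only the graph structure is stored here. Materialised data would live separately, as the cache/ directory in the layout above suggests.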

How it works

┌──────────────────────────────────────────────────────────┐
│  FlashbackFrame                                          │
│                                                          │
│  ┌──────────────┐    intercept    ┌───────────────────┐  │
│  │  Polars API  │ ─────────────▶ │   LineageDAG      │  │
│  │  .filter()   │                │                   │  │
│  │  .sort()     │  record node   │  root ──▶ filter  │  │
│  │  .join()     │ ◀──────────── │         ──▶ sort  │  │
│  └──────────────┘                │         ──▶ join  │  │
│         │                        └───────────────────┘  │
│         ▼                                               │
│  polars.DataFrame  (unchanged; Polars still optimises)  │
└──────────────────────────────────────────────────────────┘
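
The interception step in the diagram can be sketched with Python's __getattr__ hook. The example below is a toy under stated assumptions, not flashback's source: ToyFrame stands in for a real DataFrame, TrackedFrame is a hypothetical proxy, and the lineage is a flat tuple of operation names rather than a full DAG.

```python
class ToyFrame:
    """Stand-in for a real DataFrame: just a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        return ToyFrame(r for r in self.rows if predicate(r))

    def sort(self, key):
        return ToyFrame(sorted(self.rows, key=lambda r: r[key]))


class TrackedFrame:
    """Hypothetical proxy: forwards calls to the wrapped frame and logs them."""

    def __init__(self, inner, log=()):
        self._inner = inner
        self._log = tuple(log)

    def __getattr__(self, name):
        # Only reached for attributes that the proxy itself does not define.
        target = getattr(self._inner, name)
        if not callable(target):
            return target

        def wrapper(*args, **kwargs):
            result = target(*args, **kwargs)
            if isinstance(result, type(self._inner)):
                # Frame-returning calls become new lineage entries.
                return TrackedFrame(result, self._log + (name,))
            return result

        return wrapper

    def history(self):
        return list(self._log)


df = TrackedFrame(ToyFrame([{"price": 3}, {"price": -1}, {"price": 2}]))
df = df.filter(lambda r: r["price"] > 0).sort("price")
ops = df.history()
rows = df.rows
```

The wrapped object does all the real work. The proxy only observes which method ran and wraps the result, which is why the underlying engine's behaviour is unchanged.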

Node identity is a SHA-256 digest, truncated to 20 hex characters, of:

{
  "parents": ["<parent_node_id>"],
  "op": "filter",
  "kwargs": {"arg_0": "[(col(\"price\")) > (0)]"},
  "schema": {"id": "Int64", "price": "Float64", ...}
}

This means:

  • Identical pipelines on identical data always hash to the same node → instant cache hits.
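  • Changing any argument or parent state produces a different hash → distinct pipelines do not silently share a node (20 hex characters is 80 bits, so accidental collisions are negligible in practice).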
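
The hashing scheme above can be sketched with the standard library. This is an illustration only, not flashback's actual code: the function name node_id, the canonical JSON serialisation (sorted keys, compact separators), and taking the first 20 hex characters of the digest are assumptions.

```python
import hashlib
import json


def node_id(parents, op, kwargs, schema, length=20):
    """Hypothetical helper: hash a node description to a short hex ID."""
    payload = {"parents": parents, "op": op, "kwargs": kwargs, "schema": schema}
    # Sorted keys and fixed separators make the serialisation canonical,
    # so equal descriptions always produce equal bytes.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]


schema = {"id": "Int64", "price": "Float64"}
a = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
b = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
c = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (1)]'}, schema)

same_description_same_id = a == b
changed_argument_new_id = a != c
```

Because the parent ID is part of the payload, a change anywhere upstream propagates to every downstream node ID, much as a Git commit hash depends on its parent commits.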

Development

git clone https://github.com/prakulhiremath/flashback
cd flashback
pip install -e ".[dev]"

# Lint
ruff check flashback tests
ruff format --check flashback tests

# Type-check
mypy flashback

# Test with coverage
pytest

The CI matrix runs across Ubuntu × macOS × Windows and Python 3.10 – 3.13 with a hard 90% coverage threshold.


Roadmap

  • Branching: fb.branch("experiment-A") for parallel pipeline exploration
  • Merge: reconcile two branches at the DAG level
  • Remote storage: push/pull lineage graphs to S3 / GCS
  • Streaming Polars: track lazy plans before .collect()
  • Notebook integration: %load_ext flashback magic with live DAG sidebar
  • Export to DVC: generate .dvc stage files from a flashback DAG

License

MIT — see LICENSE.


Built with Polars · Rich · NetworkX
