Skip to content

0xsyax/dataalchemist

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🧙 DataAlchemist

Intelligent Data Transformation & ETL Pipeline Agent

Python MiMo Data

Schema inference · Data quality · Auto-transformation · Lineage tracking


🚨 The Problem

A fintech company integrated data from 12 sources (PostgreSQL, MongoDB, Salesforce, APIs). Their data team spent:

  • 40% of time: Writing repetitive ETL scripts
  • 30%: Debugging data quality issues in production
  • 20%: Explaining to stakeholders why numbers don't match
  • 10%: Actually analyzing data

✅ The Solution

DataAlchemist is an autonomous data transformation agent:

  • 🔍 Schema inference: Reads messy CSV/JSON and infers correct types
  • 🧠 Smart cleaning: Context-aware deduplication, imputation, outlier handling
  • 🔄 Auto-transformation: Generates SQL/dbt models from natural language
  • 📊 Lineage tracking: Full audit trail from source to dashboard

🏗️ ETL Pipeline

Raw Sources (CSV, JSON, API, DB)
         ↓
Schema Profiler → type inference, null analysis, cardinality
         ↓
Quality Agent → anomaly detection, drift monitoring
         ↓
Transform Agent → dedup, impute, normalize, aggregate
         ↓
Validator → Great Expectations suite execution
         ↓
Destination → warehouse (Snowflake/BigQuery/Postgres)

🚀 Key Features

Feature Detail Accuracy
Schema Inference Detects dates in strings, categorical vs numeric 97%
Smart Deduplication Fuzzy matching for "John Smith" vs "Jon Smyth" 94%
NL-to-SQL "Monthly revenue by region" → working query 91%
Drift Detection Alerts when data distribution shifts Real-time

📊 Real-World Impact

Fintech data platform:

  • ETL development: 3 days → 45 minutes
  • Data quality incidents: 15/month → 1/month
  • Schema drift detection: Caught 3 breaking changes before production
  • Duplicate records eliminated: 12,847 in first week

📈 Token Consumption

Pipeline Complexity Sources Monthly Tokens Time
Simple 2-3 ~800K 10 min
Medium 5-8 ~3M 1 hr
Enterprise 12+ ~10M 4 hrs

🛠️ Quick Start

git clone https://github.com/0xsyax/dataalchemist.git
cd dataalchemist
pip install -r requirements.txt
python alchemist.py --sources ./sources/ --target postgres --schedule daily

🔧 Tech Stack

  • Python 3.11 · MiMo API (transformation logic)
  • Pandas · Polars · SQLAlchemy
  • Great Expectations (validation)
  • dbt (transformation models)
  • Apache Airflow (scheduling)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages