
Data Platform

Multi-project data platform demonstrating production patterns used in real data engineering work.

Streaming ingestion with Redpanda (Kafka), medallion-layer transformations with dbt, polyglot implementations in Python and Go, and a hybrid-cloud bridge to BigQuery. Every project is self-contained, runs locally via Docker, and mirrors patterns from client work at retail and SaaS scale.

Live CV · Portfolio · LinkedIn


Repository Layout

```
data-platform/
├── foundation/                    # Shared infrastructure (Docker services)
│   ├── docker-compose.yml         # PostgreSQL, Redpanda, Redis
│   └── shared/                    # Reusable libraries (messaging, database, models)
│
├── warehouse/                     # BigQuery + dbt medallion (staging → intermediate → marts)
│   └── models/
│       ├── staging/               # stg_events — source normalization + country null-fix
│       ├── intermediate/          # int_events — bot detection, PPP pricing, country backfill
│       └── marts/                 # mart_funnel, mart_campaign_performance, mart_session_stats
│
├── projects/
│   ├── ecommerce-dbt/             # Python — Kafka → PostgreSQL → dbt
│   ├── go-ecommerce/              # Go — direct port of ecommerce-dbt for shadow deployment
│   ├── go-marketing-analytics/    # Go — multi-source marketing platform (GA4 + CRM + Ads)
│   └── hybrid-cloud-bridge/       # Python + GCP — GCS → BigQuery via Cloud Functions
│
├── scripts/                       # Setup, verification, and quality check scripts
└── tests/                         # Unit and integration tests
```

High-Level Architecture

```mermaid
flowchart TB
    subgraph Foundation["Foundation (Docker)"]
        PG[(PostgreSQL<br/>:5433)]
        RP[Redpanda<br/>:19092]
        RD[(Redis<br/>:6379)]
    end

    subgraph Projects
        EC_PY[ecommerce-dbt<br/>Python]
        EC_GO[go-ecommerce<br/>Go]
        MKT[go-marketing-analytics<br/>Go]
        HB[hybrid-cloud-bridge<br/>Python]
    end

    subgraph Cloud["GCP"]
        BQ[(BigQuery)]
        GCS[(GCS)]
    end

    subgraph Warehouse["Warehouse (dbt)"]
        STG[Staging]
        INT[Intermediate]
        MRT[Marts]
    end

    EC_PY --> RP
    EC_GO --> RP
    MKT --> RP
    RP --> PG
    HB --> BQ
    HB --> GCS
    GCS --> BQ
    BQ --> STG
    STG --> INT
    INT --> MRT
```

Foundation

Containerized shared infrastructure. Every project runs against the same local services with namespace isolation (separate Postgres databases, prefixed Kafka topics, prefixed Redis keys).

| Service    | Purpose                   | Port  | Web UI  |
|------------|---------------------------|-------|---------|
| PostgreSQL | Operational + warehouse DB | 5433  |         |
| Redpanda   | Kafka-compatible streaming | 19092 | Console |
| Redis      | Caching layer              | 6379  |         |

See foundation/README.md for architecture details.
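
The namespace-isolation convention can be sketched as a small helper; the prefix formats below are illustrative, not the repo's actual shared library:

```python
# Hypothetical sketch of the namespace-isolation convention: each project
# gets its own Postgres database, a Kafka topic prefix, and a Redis key
# prefix, so all projects can share one local stack without collisions.

def kafka_topic(project: str, name: str) -> str:
    """Prefix Kafka topics per project, e.g. 'ecommerce.orders'."""
    return f"{project}.{name}"

def redis_key(project: str, key: str) -> str:
    """Prefix Redis keys the same way, e.g. 'ecommerce:session:42'."""
    return f"{project}:{key}"

print(kafka_topic("ecommerce", "orders"))    # ecommerce.orders
print(redis_key("marketing", "session:42"))  # marketing:session:42
```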


Projects

1. projects/ecommerce-dbt — Python

End-to-end real-time e-commerce pipeline. Python data generator publishes to Redpanda, a Kafka consumer lands events into PostgreSQL, and dbt transforms them into analytics-ready tables.

Stack: Python, Redpanda (Kafka), PostgreSQL, dbt, Docker Project details →
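
The generator's core idea is weighted sampling over event types; a minimal sketch (event names and weights here are illustrative, not the repo's actual distribution):

```python
import random

# Minimal sketch of a synthetic e-commerce event generator using weighted
# sampling. A seeded RNG keeps runs reproducible for testing.
EVENT_WEIGHTS = {"view": 0.80, "click": 0.15, "purchase": 0.05}

def generate_events(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    events, weights = zip(*EVENT_WEIGHTS.items())
    return [
        {"event_id": i, "event_type": rng.choices(events, weights=weights)[0]}
        for i in range(n)
    ]

sample = generate_events(1000)
views = sum(e["event_type"] == "view" for e in sample)
print(views)  # roughly 800 of 1000 with these weights
```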


2. projects/go-ecommerce — Go

Direct port of ecommerce-dbt to Go, designed to run side-by-side with the Python version against the same Redpanda topics and PostgreSQL schema. Uses separate Kafka consumer groups (orders_ingestion_go vs orders_ingestion), so both pipelines can process the same events concurrently — useful for performance comparisons and shadow deployments.

Highlights:

  • Event generator mirroring the exact statistical distribution of the Python version
  • Concurrent Kafka consumers using goroutines (one per topic)
  • Bulk insert via jackc/pgx with ON CONFLICT deduplication
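
The ON CONFLICT deduplication is what makes Kafka replays idempotent. The repo does this in SQL via jackc/pgx; the same semantics can be sketched in plain Python:

```python
# Sketch of idempotent ingestion: replaying the same batch twice must not
# duplicate rows, mirroring INSERT ... ON CONFLICT (order_id) DO NOTHING.

def bulk_insert(table: dict, rows: list[dict]) -> int:
    """Insert rows keyed by order_id; return how many were actually new."""
    inserted = 0
    for row in rows:
        if row["order_id"] not in table:   # ON CONFLICT DO NOTHING
            table[row["order_id"]] = row
            inserted += 1
    return inserted

orders = {}
batch = [{"order_id": 1, "total": 9.99}, {"order_id": 2, "total": 4.50}]
print(bulk_insert(orders, batch))  # 2 — first delivery inserts both
print(bulk_insert(orders, batch))  # 0 — replaying the batch is a no-op
```

Because replays are no-ops, both the Python and Go pipelines can safely reprocess the same topic from their own consumer-group offsets.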

Stack: Go 1.25, segmentio/kafka-go, jackc/pgx, Redpanda, PostgreSQL Project details →


3. projects/go-marketing-analytics — Go

Real-time marketing data platform. Simulates GA4 page views, CRM lead lifecycle events (created → qualified → opportunity → won/lost), and paid-media ad spend — all unified through Redpanda ingestion and attributed across sources in PostgreSQL.

Highlights:

  • Full GA4-style event tracking with UTM parameters
  • Cross-source attribution: GA4 utm_source → CRM lead_source → Ads platform spend
  • High-throughput Go services with goroutines, Kafka batching, and pgx connection pooling
  • BigQuery client integration (cloud.google.com/go/bigquery) for warehouse uplift

Stack: Go 1.25, segmentio/kafka-go, jackc/pgx, cloud.google.com/go/bigquery, Redpanda, PostgreSQL, dbt Project details →
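
The cross-source attribution join can be sketched as a roll-up on the shared source key; field names and figures below are illustrative, not the repo's actual schema:

```python
# Hypothetical cross-source attribution sketch: GA4 sessions, CRM leads, and
# ad spend are unified on a shared source key (utm_source == lead_source ==
# ads platform), then rolled up per source.

ga4_sessions = [
    {"session_id": "s1", "utm_source": "google"},
    {"session_id": "s2", "utm_source": "facebook"},
    {"session_id": "s3", "utm_source": "google"},
]
crm_leads = [{"lead_id": "l1", "lead_source": "google", "status": "won"}]
ad_spend = {"google": 120.0, "facebook": 80.0}

def attribute(sessions, leads, spend):
    """Roll up sessions, won leads, and spend per traffic source."""
    report = {}
    for src in spend:
        report[src] = {
            "sessions": sum(s["utm_source"] == src for s in sessions),
            "won_leads": sum(
                l["lead_source"] == src and l["status"] == "won" for l in leads
            ),
            "spend": spend[src],
        }
    return report

print(attribute(ga4_sessions, crm_leads, ad_spend)["google"])
# {'sessions': 2, 'won_leads': 1, 'spend': 120.0}
```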


4. projects/hybrid-cloud-bridge — Python + GCP

Hybrid-cloud data flow: a local Dockerized producer uploads mock data to GCS, and the upload triggers a Cloud Function that loads the Parquet files into BigQuery. Demonstrates the ingestion half of the medallion stack before the dbt transformations take over.

Stack: Python, Docker, GCS, BigQuery, Cloud Functions Project folder →
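
One routing decision such a bridge has to make is which BigQuery table each GCS object lands in. A hypothetical sketch of that mapping (the bucket layout and `raw` dataset name are assumptions, not the repo's actual convention):

```python
# Hypothetical routing rule: derive the BigQuery target table from the first
# path segment of the GCS object name, so each source folder maps to a table.

def target_table(object_name: str, dataset: str = "raw") -> str:
    """'events/2024/01/15/batch_001.parquet' -> 'raw.events'."""
    source = object_name.split("/", 1)[0]
    return f"{dataset}.{source}"

print(target_table("events/2024/01/15/batch_001.parquet"))  # raw.events
```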


Warehouse — BigQuery + dbt Medallion

Production-grade dbt project targeting Google BigQuery, implementing a full medallion architecture with self-healing patterns.

Data Model Layers

| Layer        | Model                     | Purpose                                                      |
|--------------|---------------------------|--------------------------------------------------------------|
| Staging      | stg_events                | Source normalization, NULL country → 'XX' placeholder        |
| Intermediate | int_events                | Business logic: bot detection, PPP pricing, country backfill |
| Marts        | mart_funnel               | Conversion funnel: View → Click → Purchase                   |
| Marts        | mart_campaign_performance | Revenue breakdown by campaign, device, country               |
| Marts        | mart_session_stats        | Session-level aggregations                                   |

Key Patterns

  • Self-healing ingestion: window-function backfill of corrupt country codes in int_events without touching source data
  • Bot detection: automated flagging of high-frequency sessions (>50 events/session)
  • Regional pricing (PPP): tier-based value adjustment (US/UK/DE/JP = 100%, BR/FR/CA = 50%, others = 20%)
  • Campaign hierarchy extraction: parsing campaign category from composite IDs (cmp_US_blackfriday → blackfriday)
  • Revenue attribution: separate streams for sales, clicks, and views
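
Two of these patterns, sketched in plain Python (the repo implements them in dbt/SQL; the tier table below follows the percentages listed above):

```python
# Sketch of bot detection (>50 events/session) and tier-based PPP price
# adjustment: US/UK/DE/JP pay 100%, BR/FR/CA 50%, all other countries 20%.

PPP_TIERS = {
    **dict.fromkeys(["US", "UK", "DE", "JP"], 1.00),
    **dict.fromkeys(["BR", "FR", "CA"], 0.50),
}

def ppp_price(base: float, country: str) -> float:
    """Adjust a base price by the country's PPP tier (default 20%)."""
    return round(base * PPP_TIERS.get(country, 0.20), 2)

def is_bot(events_in_session: int) -> bool:
    """Flag high-frequency sessions as likely bots."""
    return events_in_session > 50

print(ppp_price(100.0, "DE"))  # 100.0
print(ppp_price(100.0, "BR"))  # 50.0
print(ppp_price(100.0, "IN"))  # 20.0 (unlisted country -> 20% tier)
print(is_bot(51))              # True
```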

Lineage

```mermaid
flowchart LR
    subgraph Bronze["Bronze"]
        SRC[(events)]
    end

    subgraph Silver["Silver"]
        STG[stg_events]
        INT[int_events]
    end

    subgraph Gold["Gold"]
        F1[mart_funnel]
        F2[mart_campaign_performance]
        F3[mart_session_stats]
    end

    SRC --> STG
    STG --> INT
    INT --> F1
    INT --> F2
    INT --> F3

    STG -.-|"NULL → XX"| STG
    INT -.-|"Bot Detection<br/>PPP Pricing"| INT
```

Stack: BigQuery, dbt, Python, GCS (Parquet)


Quick Start

Prerequisites

  • Docker Desktop (running)
  • Python 3.10+ with venv
  • Go 1.21+ (for Go projects)
  • Git

Start the Platform

```bash
# 1. Clone
git clone https://github.com/hailtr/data-platform.git
cd data-platform

# 2. Start infrastructure (PostgreSQL, Redpanda, Redis)
cd foundation && docker-compose up -d && cd ..

# 3. Python environment
python -m venv venv
source venv/bin/activate          # macOS/Linux
# venv\Scripts\activate            # Windows
pip install -r requirements.txt

# 4. Initialize the database
python scripts/init_database.py

# 5. (Optional) Run quality checks
./scripts/run_checks.sh            # macOS/Linux
# scripts\run_checks.bat           # Windows
```

Quick Verification

```bash
# Check services are up
python scripts/check_services.py

# Redpanda Console
open http://localhost:8080         # macOS
# start http://localhost:8080      # Windows
```

Techniques Applied

Infrastructure & Architecture

  • Docker containerization with namespace isolation across projects
  • Multi-tenant platform design (DB-per-project, topic-prefixing, Redis key-prefixing)
  • Shared-library pattern (foundation/shared/) for cross-project reuse

Streaming & Ingestion

  • Kafka-compatible streaming via Redpanda
  • Consumer-group shadowing (run two pipeline implementations against the same topic concurrently)
  • Bulk insert with ON CONFLICT dedup for idempotent replays

Warehousing & Analytics

  • BigQuery medallion architecture (staging → intermediate → marts)
  • dbt: incremental models, window functions, cross-model macros
  • Self-healing data patterns (country-code backfill via window functions)
  • Bot detection and PPP-adjusted revenue attribution
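
The self-healing backfill can be sketched in plain Python (the repo does this with a SQL window function; field names here are illustrative): replace the 'XX' placeholder with a real country code observed elsewhere in the same session, analogous to `MAX(country) OVER (PARTITION BY session_id)`.

```python
# Sketch of the self-healing country backfill: patch 'XX' placeholders using
# any real country code seen in the same session, without touching sessions
# that never reported a valid code.

def backfill_country(events: list[dict]) -> list[dict]:
    known = {}
    for e in events:                       # first pass: learn real codes
        if e["country"] != "XX":
            known.setdefault(e["session_id"], e["country"])
    return [                               # second pass: patch placeholders
        {**e, "country": known.get(e["session_id"], e["country"])}
        for e in events
    ]

events = [
    {"session_id": "s1", "country": "XX"},
    {"session_id": "s1", "country": "DE"},
    {"session_id": "s2", "country": "XX"},  # no real code -> stays XX
]
print([e["country"] for e in backfill_country(events)])  # ['DE', 'DE', 'XX']
```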

Go

  • Goroutine-based concurrent Kafka consumers
  • jackc/pgx connection pooling and bulk insert
  • BigQuery client integration via cloud.google.com/go/bigquery

Python

  • Synthetic data generation with tunable statistical distributions
  • Kafka consumer pipelines
  • Quality gating via Black + Flake8 + Pytest

Quality & CI

```bash
# All checks (format, lint, test)
./scripts/run_checks.sh            # macOS/Linux
scripts\run_checks.bat             # Windows
```

Tooling: Black (formatting), Flake8 (linting), Pytest (unit + integration tests).


License

MIT — see LICENSE.


Built by Rafael Ortiz — Senior Data Engineer. These patterns are derived from real client work on Snowflake → ClickHouse migrations, Microsoft Fabric lakehouses, and streaming backends. See the live CV for the full context.
