
Data Platform

Multi-project data platform demonstrating production patterns used in real data engineering work.

Streaming ingestion with Redpanda (Kafka), medallion-layer transformations with dbt, polyglot implementations in Python and Go, and a hybrid-cloud bridge to BigQuery. Every project is self-contained, runs locally via Docker, and mirrors patterns from client work at retail and SaaS scale.

Live CV · Portfolio · LinkedIn


Repository Layout

```
data-platform/
├── foundation/                    # Shared infrastructure (Docker services)
│   ├── docker-compose.yml         # PostgreSQL, Redpanda, Redis
│   └── shared/                    # Reusable libraries (messaging, database, models)
│
├── warehouse/                     # BigQuery + dbt medallion (staging → intermediate → marts)
│   └── models/
│       ├── staging/               # stg_events — source normalization + country null-fix
│       ├── intermediate/          # int_events — bot detection, PPP pricing, country backfill
│       └── marts/                 # mart_funnel, mart_campaign_performance, mart_session_stats
│
├── projects/
│   ├── ecommerce-dbt/             # Python — Kafka → PostgreSQL → dbt
│   ├── go-ecommerce/              # Go — direct port of ecommerce-dbt for shadow deployment
│   ├── go-marketing-analytics/    # Go — multi-source marketing platform (GA4 + CRM + Ads)
│   └── hybrid-cloud-bridge/       # Python + GCP — GCS → BigQuery via Cloud Functions
│
├── scripts/                       # Setup, verification, and quality check scripts
└── tests/                         # Unit and integration tests
```

High-Level Architecture

```mermaid
flowchart TB
    subgraph Foundation["Foundation (Docker)"]
        PG[(PostgreSQL<br/>:5433)]
        RP[Redpanda<br/>:19092]
        RD[(Redis<br/>:6379)]
    end

    subgraph Projects
        EC_PY[ecommerce-dbt<br/>Python]
        EC_GO[go-ecommerce<br/>Go]
        MKT[go-marketing-analytics<br/>Go]
        HB[hybrid-cloud-bridge<br/>Python]
    end

    subgraph Cloud["GCP"]
        BQ[(BigQuery)]
        GCS[(GCS)]
    end

    subgraph Warehouse["Warehouse (dbt)"]
        STG[Staging]
        INT[Intermediate]
        MRT[Marts]
    end

    EC_PY --> RP
    EC_GO --> RP
    MKT --> RP
    RP --> PG
    HB --> BQ
    HB --> GCS
    GCS --> BQ
    BQ --> STG
    STG --> INT
    INT --> MRT
```

Foundation

Containerized shared infrastructure. Every project runs against the same local services with namespace isolation (separate Postgres databases, prefixed Kafka topics, prefixed Redis keys).

| Service    | Purpose                   | Port  | Web UI  |
|------------|---------------------------|-------|---------|
| PostgreSQL | Operational + warehouse DB | 5433  |         |
| Redpanda   | Kafka-compatible streaming | 19092 | Console |
| Redis      | Caching layer              | 6379  |         |

See foundation/README.md for architecture details.
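
The namespace-isolation convention can be sketched as a small helper; the prefix formats below are illustrative, not the repo's actual shared library:

```python
# Hypothetical sketch of the namespace-isolation convention: each project
# gets its own Postgres database, a Kafka topic prefix, and a Redis key
# prefix, so all projects can share one local stack without collisions.

def kafka_topic(project: str, name: str) -> str:
    """Prefix Kafka topics per project, e.g. 'ecommerce.orders'."""
    return f"{project}.{name}"

def redis_key(project: str, key: str) -> str:
    """Prefix Redis keys the same way, e.g. 'ecommerce:session:42'."""
    return f"{project}:{key}"

print(kafka_topic("ecommerce", "orders"))    # ecommerce.orders
print(redis_key("marketing", "session:42"))  # marketing:session:42
```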


Projects

1. projects/ecommerce-dbt — Python

End-to-end real-time e-commerce pipeline. Python data generator publishes to Redpanda, a Kafka consumer lands events into PostgreSQL, and dbt transforms them into analytics-ready tables.

Stack: Python, Redpanda (Kafka), PostgreSQL, dbt, Docker Project details →
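
The generator's core idea is weighted sampling over event types; a minimal sketch (event names and weights here are illustrative, not the repo's actual distribution):

```python
import random

# Minimal sketch of a synthetic e-commerce event generator using weighted
# sampling. A seeded RNG keeps runs reproducible for testing.
EVENT_WEIGHTS = {"view": 0.80, "click": 0.15, "purchase": 0.05}

def generate_events(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)
    events, weights = zip(*EVENT_WEIGHTS.items())
    return [
        {"event_id": i, "event_type": rng.choices(events, weights=weights)[0]}
        for i in range(n)
    ]

sample = generate_events(1000)
views = sum(e["event_type"] == "view" for e in sample)
print(views)  # roughly 800 of 1000 with these weights
```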


2. projects/go-ecommerce — Go

Direct port of ecommerce-dbt to Go, designed to run side-by-side with the Python version against the same Redpanda topics and PostgreSQL schema. Uses separate Kafka consumer groups (orders_ingestion_go vs orders_ingestion), so both pipelines can process the same events concurrently — useful for performance comparisons and shadow deployments.

Highlights:

  • Event generator mirroring the exact statistical distribution of the Python version
  • Concurrent Kafka consumers using goroutines (one per topic)
  • Bulk insert via jackc/pgx with ON CONFLICT deduplication
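
The ON CONFLICT deduplication is what makes Kafka replays idempotent. The repo does this in SQL via jackc/pgx; the same semantics can be sketched in plain Python:

```python
# Sketch of idempotent ingestion: replaying the same batch twice must not
# duplicate rows, mirroring INSERT ... ON CONFLICT (order_id) DO NOTHING.

def bulk_insert(table: dict, rows: list[dict]) -> int:
    """Insert rows keyed by order_id; return how many were actually new."""
    inserted = 0
    for row in rows:
        if row["order_id"] not in table:   # ON CONFLICT DO NOTHING
            table[row["order_id"]] = row
            inserted += 1
    return inserted

orders = {}
batch = [{"order_id": 1, "total": 9.99}, {"order_id": 2, "total": 4.50}]
print(bulk_insert(orders, batch))  # 2 — first delivery inserts both
print(bulk_insert(orders, batch))  # 0 — replaying the batch is a no-op
```

Because replays are no-ops, both the Python and Go pipelines can safely reprocess the same topic from their own consumer-group offsets.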

Stack: Go 1.25, segmentio/kafka-go, jackc/pgx, Redpanda, PostgreSQL Project details →


3. projects/go-marketing-analytics — Go

Real-time marketing data platform. Simulates GA4 page views, CRM lead lifecycle events (created → qualified → opportunity → won/lost), and paid-media ad spend — all unified through Redpanda ingestion and attributed across sources in PostgreSQL.

Highlights:

  • Full GA4-style event tracking with UTM parameters
  • Cross-source attribution: GA4 utm_source → CRM lead_source → Ads platform spend
  • High-throughput Go services with goroutines, Kafka batching, and pgx connection pooling
  • BigQuery client integration (cloud.google.com/go/bigquery) for warehouse uplift

Stack: Go 1.25, segmentio/kafka-go, jackc/pgx, cloud.google.com/go/bigquery, Redpanda, PostgreSQL, dbt Project details →
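
The cross-source attribution join can be sketched as a roll-up on the shared source key; field names and figures below are illustrative, not the repo's actual schema:

```python
# Hypothetical cross-source attribution sketch: GA4 sessions, CRM leads, and
# ad spend are unified on a shared source key (utm_source == lead_source ==
# ads platform), then rolled up per source.

ga4_sessions = [
    {"session_id": "s1", "utm_source": "google"},
    {"session_id": "s2", "utm_source": "facebook"},
    {"session_id": "s3", "utm_source": "google"},
]
crm_leads = [{"lead_id": "l1", "lead_source": "google", "status": "won"}]
ad_spend = {"google": 120.0, "facebook": 80.0}

def attribute(sessions, leads, spend):
    """Roll up sessions, won leads, and spend per traffic source."""
    report = {}
    for src in spend:
        report[src] = {
            "sessions": sum(s["utm_source"] == src for s in sessions),
            "won_leads": sum(
                l["lead_source"] == src and l["status"] == "won" for l in leads
            ),
            "spend": spend[src],
        }
    return report

print(attribute(ga4_sessions, crm_leads, ad_spend)["google"])
# {'sessions': 2, 'won_leads': 1, 'spend': 120.0}
```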


4. projects/hybrid-cloud-bridge — Python + GCP

Hybrid-cloud data flow: a local Dockerized producer uploads mock data to GCS, and the upload triggers a Cloud Function that loads the Parquet files into BigQuery. Demonstrates the ingestion half of the medallion stack before the dbt transformations take over.

Stack: Python, Docker, GCS, BigQuery, Cloud Functions Project folder →
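
One routing decision such a bridge has to make is which BigQuery table each GCS object lands in. A hypothetical sketch of that mapping (the bucket layout and `raw` dataset name are assumptions, not the repo's actual convention):

```python
# Hypothetical routing rule: derive the BigQuery target table from the first
# path segment of the GCS object name, so each source folder maps to a table.

def target_table(object_name: str, dataset: str = "raw") -> str:
    """'events/2024/01/15/batch_001.parquet' -> 'raw.events'."""
    source = object_name.split("/", 1)[0]
    return f"{dataset}.{source}"

print(target_table("events/2024/01/15/batch_001.parquet"))  # raw.events
```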


Warehouse — BigQuery + dbt Medallion

Production-grade dbt project targeting Google BigQuery, implementing a full medallion architecture with self-healing patterns.

Data Model Layers

| Layer        | Model                     | Purpose                                                      |
|--------------|---------------------------|--------------------------------------------------------------|
| Staging      | stg_events                | Source normalization, NULL country → 'XX' placeholder        |
| Intermediate | int_events                | Business logic: bot detection, PPP pricing, country backfill |
| Marts        | mart_funnel               | Conversion funnel: View → Click → Purchase                   |
| Marts        | mart_campaign_performance | Revenue breakdown by campaign, device, country               |
| Marts        | mart_session_stats        | Session-level aggregations                                   |

Key Patterns

  • Self-healing ingestion: window-function backfill of corrupt country codes in int_events without touching source data
  • Bot detection: automated flagging of high-frequency sessions (>50 events/session)
  • Regional pricing (PPP): tier-based value adjustment (US/UK/DE/JP = 100%, BR/FR/CA = 50%, others = 20%)
  • Campaign hierarchy extraction: parsing campaign category from composite IDs (cmp_US_blackfriday → blackfriday)
  • Revenue attribution: separate streams for sales, clicks, and views
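
Two of these patterns, sketched in plain Python (the repo implements them in dbt/SQL; the tier table below follows the percentages listed above):

```python
# Sketch of bot detection (>50 events/session) and tier-based PPP price
# adjustment: US/UK/DE/JP pay 100%, BR/FR/CA 50%, all other countries 20%.

PPP_TIERS = {
    **dict.fromkeys(["US", "UK", "DE", "JP"], 1.00),
    **dict.fromkeys(["BR", "FR", "CA"], 0.50),
}

def ppp_price(base: float, country: str) -> float:
    """Adjust a base price by the country's PPP tier (default 20%)."""
    return round(base * PPP_TIERS.get(country, 0.20), 2)

def is_bot(events_in_session: int) -> bool:
    """Flag high-frequency sessions as likely bots."""
    return events_in_session > 50

print(ppp_price(100.0, "DE"))  # 100.0
print(ppp_price(100.0, "BR"))  # 50.0
print(ppp_price(100.0, "IN"))  # 20.0 (unlisted country -> 20% tier)
print(is_bot(51))              # True
```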

Lineage

```mermaid
flowchart LR
    subgraph Bronze["Bronze"]
        SRC[(events)]
    end

    subgraph Silver["Silver"]
        STG[stg_events]
        INT[int_events]
    end

    subgraph Gold["Gold"]
        F1[mart_funnel]
        F2[mart_campaign_performance]
        F3[mart_session_stats]
    end

    SRC --> STG
    STG --> INT
    INT --> F1
    INT --> F2
    INT --> F3

    STG -.-|"NULL → XX"| STG
    INT -.-|"Bot Detection<br/>PPP Pricing"| INT
```

Stack: BigQuery, dbt, Python, GCS (Parquet)


Quick Start

Prerequisites

  • Docker Desktop (running)
  • Python 3.10+ with venv
  • Go 1.21+ (for Go projects)
  • Git

Start the Platform

```bash
# 1. Clone
git clone https://github.com/hailtr/data-platform.git
cd data-platform

# 2. Start infrastructure (PostgreSQL, Redpanda, Redis)
cd foundation && docker-compose up -d && cd ..

# 3. Python environment
python -m venv venv
source venv/bin/activate          # macOS/Linux
# venv\Scripts\activate            # Windows
pip install -r requirements.txt

# 4. Initialize the database
python scripts/init_database.py

# 5. (Optional) Run quality checks
./scripts/run_checks.sh            # macOS/Linux
# scripts\run_checks.bat           # Windows
```

Quick Verification

```bash
# Check services are up
python scripts/check_services.py

# Redpanda Console
open http://localhost:8080         # macOS
# start http://localhost:8080      # Windows
```

Techniques Applied

Infrastructure & Architecture

  • Docker containerization with namespace isolation across projects
  • Multi-tenant platform design (DB-per-project, topic-prefixing, Redis key-prefixing)
  • Shared-library pattern (foundation/shared/) for cross-project reuse

Streaming & Ingestion

  • Kafka-compatible streaming via Redpanda
  • Consumer-group shadowing (run two pipeline implementations against the same topic concurrently)
  • Bulk insert with ON CONFLICT dedup for idempotent replays

Warehousing & Analytics

  • BigQuery medallion architecture (staging → intermediate → marts)
  • dbt: incremental models, window functions, cross-model macros
  • Self-healing data patterns (country-code backfill via window functions)
  • Bot detection and PPP-adjusted revenue attribution
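
The self-healing backfill can be sketched in plain Python (the repo does this with a SQL window function; field names here are illustrative): replace the 'XX' placeholder with a real country code observed elsewhere in the same session, analogous to `MAX(country) OVER (PARTITION BY session_id)`.

```python
# Sketch of the self-healing country backfill: patch 'XX' placeholders using
# any real country code seen in the same session, without touching sessions
# that never reported a valid code.

def backfill_country(events: list[dict]) -> list[dict]:
    known = {}
    for e in events:                       # first pass: learn real codes
        if e["country"] != "XX":
            known.setdefault(e["session_id"], e["country"])
    return [                               # second pass: patch placeholders
        {**e, "country": known.get(e["session_id"], e["country"])}
        for e in events
    ]

events = [
    {"session_id": "s1", "country": "XX"},
    {"session_id": "s1", "country": "DE"},
    {"session_id": "s2", "country": "XX"},  # no real code -> stays XX
]
print([e["country"] for e in backfill_country(events)])  # ['DE', 'DE', 'XX']
```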

Go

  • Goroutine-based concurrent Kafka consumers
  • jackc/pgx connection pooling and bulk insert
  • BigQuery client integration via cloud.google.com/go/bigquery

Python

  • Synthetic data generation with tunable statistical distributions
  • Kafka consumer pipelines
  • Quality gating via Black + Flake8 + Pytest

Quality & CI

```bash
# All checks (format, lint, test)
./scripts/run_checks.sh            # macOS/Linux
scripts\run_checks.bat             # Windows
```

Tooling: Black (formatting), Flake8 (linting), Pytest (unit + integration tests).


License

MIT — see LICENSE.


Built by Rafael Ortiz — Senior Data Engineer. These patterns are derived from real client work on Snowflake → ClickHouse migrations, Microsoft Fabric lakehouses, and streaming backends. See the live CV for the full context.
