xForge — Football Analytics Platform

CI/CD · Python 3.10 · Airflow 2.8 · dbt 1.7 · PostgreSQL 15 · Docker · XGBoost · License: MIT

Production-grade football analytics platform — 9.2M+ events, 3,464 matches, XGBoost xP (AUC 0.8948) + xG (AUC 0.7822, 88k calibrated predictions). Engineered for real-world constraints: OOM-safe server-side cursor writes, idempotent fault-tolerant restarts, and fully autonomous PDF/XML match reporting. Orchestrated end-to-end by Apache Airflow.


Dashboard Previews

[Images: xT Surface Heatmap · Player xT Ranking · Match xT Balance (Home vs Away) · PDF Match Report]

Architecture

flowchart TD
    SB[("StatsBomb\nOpen Data\n9.2M events / 3,464 matches")]

    subgraph AIRFLOW["⚙️ Apache Airflow — 3 DAGs"]
        direction TB
        DAG1["ingestion_pipeline\n● daily 02:00 UTC\n● incremental, ~2.5 s/match"]
        DAG2["ml_training\n● weekly Sun 03:00\n● xP + xG + K-Means in parallel"]
        DAG3["matchday_push\n● manual trigger\n● PDF + XML generation"]
    end

    subgraph PG["🗄️ PostgreSQL 15"]
        FACT["fact_events\n9.2M rows · 40+ LIST partitions\nxt_value · xp_value · xg_value columns"]
        DIM["dim_matches · dim_players\ndim_teams · dim_competitions"]
        AUX["xt_surface (192 cells)\nmodel_registry\nset_piece_clusters"]
    end

    subgraph DBT["📦 dbt 1.7"]
        STG["3 staging models\nstg_events · stg_passes · stg_shots"]
        MART["4 mart models\nplayer_metrics (total_xg · avg_xg · finishing_quality)\nteam_summary · match_summary\ncompetition_leaderboard (xt_per_match rank)"]
    end

    subgraph ML["🤖 ML Pipeline"]
        XT["xT Model\nValue iteration · 16×12 grid\n5,375,085 rows written"]
        XP["XGBoost xP\nAUC = 0.8948 · log-loss = 0.3236\n3,387,760 predictions"]
        XG["XGBoost xG\nAUC = 0.7822 · calibrated avg=0.111\n88,023 shot predictions"]
        KM["K-Means Clustering\n6 corner delivery zones · 6 shot zones\npress trigger detection"]
    end

    subgraph SERVE["📊 Serving Layer"]
        SS["Apache Superset\n7 charts · Matchday Analytics dashboard"]
        PDF["PDF Match Reports\nmplsoccer · 5-page per match"]
        XML["SportsCode XML\n25 top-xT events · Hudl-compatible"]
        GR["Grafana\npipeline & container monitoring"]
    end

    SB --> DAG1
    DAG1 --> FACT & DIM
    FACT --> DAG1
    FACT --> DAG2
    DAG2 --> XT & XP & XG & KM
    XT & XP & XG & KM --> AUX
    FACT & DIM --> STG --> MART
    MART --> SS
    FACT --> DAG3 --> PDF & XML
    PG --> GR

Business Use Cases

xForge is not just a data pipeline; it is an operational engine designed to solve real-world football analytics bottlenecks:

  • Data-Driven Scouting: Traditional completion rates are misleading. By filtering players who successfully complete high-difficulty (low xP) but high-reward (high xT) passes, scouting departments can identify undervalued, high-vision playmakers before their market value peaks.
  • Tactical Opponent Analysis: Autonomous generation of Press Intensity and xT Surface Heatmaps allows coaching staffs to instantly identify opposition defensive vulnerabilities (e.g., high-threat leaks in specific half-spaces) without manual data slicing.
  • Autonomous Video Analysis Integration: The pipeline automatically generates HUDL Sportscode-compatible XML files mapped to the top 25 highest-xT events of a match. This eliminates hours of manual video tagging for performance analysts, allowing them to focus strictly on tactical review.
  • Striker Evaluation via Finishing Quality: The finishing_quality metric (goals − total_xG) separates clinical finishers from shot-volume players. A striker with finishing_quality +189 (Messi, Barcelona) is systematically outperforming expected goals — exactly what recruitment departments need when evaluating transfer value beyond raw goal counts.
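The scouting filter described above (completed passes that were hard to complete yet valuable) can be sketched in a few lines. This is only an illustration: the record layout, the `rank_playmakers` helper and the two thresholds are hypothetical, and only the `xp_value` / `xt_value` names come from the warehouse columns.

```python
from collections import defaultdict

# Hypothetical pass records; xp_value / xt_value mirror the fact_events columns.
passes = [
    {"player": "A", "completed": True, "xp_value": 0.35, "xt_value": 0.08},
    {"player": "A", "completed": True, "xp_value": 0.90, "xt_value": 0.01},
    {"player": "B", "completed": True, "xp_value": 0.40, "xt_value": 0.05},
    {"player": "B", "completed": False, "xp_value": 0.30, "xt_value": 0.09},
    {"player": "C", "completed": True, "xp_value": 0.95, "xt_value": 0.07},
]


def rank_playmakers(passes, max_xp=0.5, min_xt=0.03):
    """Sum xT over completed passes that were difficult (low xP) and valuable (high xT)."""
    totals = defaultdict(float)
    for p in passes:
        if p["completed"] and p["xp_value"] <= max_xp and p["xt_value"] >= min_xt:
            totals[p["player"]] += p["xt_value"]
    # Highest accumulated threat first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_playmakers(passes)
```

Player C drops out despite a valuable pass because it was an easy one (xP 0.95). In practice the thresholds would be tuned per competition and the totals normalised per 90 minutes.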

Future Roadmap

While the current architecture handles event data at scale, the next iterations of this project will focus on:

  • Tracking Data Integration: Transitioning from an RDBMS to a Data Lake architecture to ingest and process 25 FPS X,Y coordinate tracking data.
  • Real-time Streaming: Replacing batch Airflow ingestion with Apache Kafka to process match events in milliseconds, powering live in-match dashboards.
  • MLOps Implementation: Integrating MLflow for continuous model registry, monitoring XGBoost performance over time, and automated retraining triggers to prevent model drift.

Key Engineering Features

Memory-Safe ML at Scale

Training an XGBoost classifier on all 3.4M pass events inside a 4 GB Codespace would get the process OOM-killed. The pipeline avoids this in two stages:

  1. Training: random sample of 300k rows via ORDER BY RANDOM() LIMIT 300000, which is sufficient for AUC 0.8948.
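  2. Prediction: a server-side psycopg2 named cursor streams rows in 50k-row chunks. Training data is explicitly freed with del + gc.collect() before the prediction phase begins.

# Server-side (named) cursor: rows stay in PostgreSQL until fetched,
# so the full result set never sits in RAM.
# withhold=True keeps the cursor usable across the per-chunk commits;
# without it, a commit on this same connection closes a named cursor.
cur = conn.cursor("xp_pred_cur", withhold=True)
cur.execute("SELECT ... FROM fact_events WHERE xp_value IS NULL")
while rows := cur.fetchmany(50_000):
    probas = model.predict_proba(build_features(rows))
    write_chunk(probas)   # commit per chunk
cur.close()               # release the held cursor explicitly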

Idempotent, Fault-Tolerant Writes

Every prediction write filters WHERE xp_value IS NULL. If the container is killed mid-run, restarting resumes exactly where it stopped — no duplicates, no data loss. The postStartCommand in .devcontainer auto-triggers on Codespace restart; the run() entrypoint exits immediately if the model file exists and zero rows remain unpredicted.
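A minimal sketch of why the `WHERE xp_value IS NULL` filter makes restarts safe. It uses an in-memory SQLite table as a stand-in for PostgreSQL so that it runs anywhere. The `predict_missing` helper, the `distance` column and the toy model are assumptions for illustration, not the project's code.

```python
import sqlite3


def predict_missing(conn, predict, chunk_size=2, max_chunks=None):
    """Fill xp_value for rows where it is still NULL, committing after every chunk."""
    chunks_done = 0
    while max_chunks is None or chunks_done < max_chunks:
        rows = conn.execute(
            "SELECT event_id, distance FROM fact_events "
            "WHERE xp_value IS NULL ORDER BY event_id LIMIT ?",
            (chunk_size,),
        ).fetchall()
        if not rows:
            break  # nothing left to predict: the run is a no-op
        conn.executemany(
            "UPDATE fact_events SET xp_value = ? "
            "WHERE event_id = ? AND xp_value IS NULL",
            [(predict(distance), event_id) for event_id, distance in rows],
        )
        conn.commit()  # a crash after this point loses at most one chunk of work
        chunks_done += 1
    return chunks_done


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_events (event_id INTEGER PRIMARY KEY, distance REAL, xp_value REAL)"
)
conn.executemany(
    "INSERT INTO fact_events (event_id, distance) VALUES (?, ?)",
    [(i, 5.0 * i) for i in range(1, 8)],
)
conn.commit()


def toy_model(distance):
    # Stand-in for model.predict_proba: shorter passes are easier
    return round(1.0 / (1.0 + distance / 20.0), 4)


first = predict_missing(conn, toy_model, max_chunks=2)  # simulate a kill after 2 chunks
second = predict_missing(conn, toy_model)               # restart picks up the remaining rows
third = predict_missing(conn, toy_model)                # everything predicted: exits immediately
remaining = conn.execute(
    "SELECT COUNT(*) FROM fact_events WHERE xp_value IS NULL"
).fetchone()[0]
```

Because each run selects only rows that are still NULL, a restarted run never rewrites a committed prediction, and a run with nothing left to do returns without touching the table.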

Partitioned Data Warehouse

fact_events is partitioned by competition_id using PostgreSQL LIST partitioning (40+ partitions). The query planner prunes irrelevant partitions automatically whenever a query filters on competition_id, so single-competition scans stay fast at 9.2M rows without manual sharding.
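For illustration, the DDL for one LIST partition per competition could be generated as below. The `partition_ddl` helper, the `fact_events_c<id>` naming scheme, the trimmed column list and the example IDs are assumptions; the real definitions live in scripts/init/01_schema.sql.

```python
# Trimmed, hypothetical parent table: only the partition key matters here.
PARENT_DDL = (
    "CREATE TABLE fact_events (\n"
    "    event_id TEXT NOT NULL,\n"
    "    competition_id INT NOT NULL,\n"
    "    xp_value DOUBLE PRECISION\n"
    ") PARTITION BY LIST (competition_id);"
)


def partition_ddl(competition_ids, parent="fact_events"):
    """Build one CREATE TABLE ... PARTITION OF statement per distinct competition_id."""
    statements = []
    for cid in sorted(set(int(c) for c in competition_ids)):
        # int() above guarantees only numeric values are interpolated into SQL
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {parent}_c{cid} "
            f"PARTITION OF {parent} FOR VALUES IN ({cid});"
        )
    return statements


ddl = partition_ddl([11, 2, 11, 43])
```

With this layout, a query such as `SELECT ... FROM fact_events WHERE competition_id = 11` only needs to touch the `fact_events_c11` partition.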

Autonomous Reporting Pipeline

The matchday_push DAG generates a complete 5-page PDF and a SportsCode/Hudl-compatible XML file for any match_id without human intervention. Reports are written to a volume-mounted reports/ directory accessible from the host.
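A rough sketch of the XML step: pick the 25 highest-xT events and write one instance per event. The element names (`file`, `ALL_INSTANCES`, `instance`, `ID`, `start`, `end`, `code`, `label/text`) follow the layout commonly seen in Sportscode XML exports and are an assumption here. The event fields (`player`, `second`) and the clip padding are hypothetical as well, and the project's xml_generator.py may differ.

```python
import xml.etree.ElementTree as ET


def top_xt_xml(events, top_n=25, clip_padding=5.0):
    """Serialise the top_n events by xt_value as Sportscode-style instances."""
    top = sorted(events, key=lambda e: e["xt_value"], reverse=True)[:top_n]
    root = ET.Element("file")
    instances = ET.SubElement(root, "ALL_INSTANCES")
    for idx, ev in enumerate(top, start=1):
        inst = ET.SubElement(instances, "instance")
        ET.SubElement(inst, "ID").text = str(idx)
        # Clip window around the event, in seconds from the start of the video
        ET.SubElement(inst, "start").text = f"{max(ev['second'] - clip_padding, 0.0):.2f}"
        ET.SubElement(inst, "end").text = f"{ev['second'] + clip_padding:.2f}"
        ET.SubElement(inst, "code").text = ev["player"]
        label = ET.SubElement(inst, "label")
        ET.SubElement(label, "text").text = f"xT {ev['xt_value']:.3f}"
    return ET.tostring(root, encoding="unicode")


# 40 synthetic events; only the 25 with the highest xT should survive
events = [
    {"player": f"Player {i}", "second": 60.0 * i, "xt_value": i / 100}
    for i in range(40)
]
xml_text = top_xt_xml(events)
```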


ML Models

| Model | Algorithm | Target | Result |
|---|---|---|---|
| xT Surface | Value iteration (15×) | Threat per pitch cell | 192 cells, max=0.298 |
| xG Classifier | XGBoost | Goal probability per shot | AUC 0.7822, 88,023 predictions, post-hoc calibrated |
| xP Classifier | XGBoost | Pass completion probability | AUC 0.8948, log-loss 0.3236 |
| Corner Delivery Clustering | K-Means | Set-piece delivery zones | 6 clusters on corner origin coords |
| Shot Location Clustering | K-Means | Shot zone patterns | 6 clusters, correct StatsBomb zones (6-yard box: x>114, penalty area: x>102) |
| Press Trigger | Rule-based sequence | High-press moment detection | Ball recovery + 3 defensive actions / 5 s |
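The xT surface row refers to value iteration. A minimal sketch of the usual formulation is given here on a toy 3-cell pitch, where each cell's threat is the immediate chance of scoring from a shot plus the threat of the cells the ball is likely to be moved to. All probabilities below are made up. In the project the same update would run over the 192 cells of the flattened 16×12 grid, with probabilities estimated from event data.

```python
def value_iteration_xt(shoot_prob, goal_prob, move_prob, transition, n_iter=15):
    """xT[c] = shoot[c] * goal[c] + move[c] * sum_d transition[c][d] * xT[d], repeated n_iter times."""
    n = len(shoot_prob)
    xt = [0.0] * n  # start from zero threat everywhere
    for _ in range(n_iter):
        xt = [
            shoot_prob[c] * goal_prob[c]
            + move_prob[c] * sum(transition[c][d] * xt[d] for d in range(n))
            for c in range(n)
        ]
    return xt


# Toy pitch: cell 0 = own half, 1 = midfield, 2 = penalty box (illustrative numbers)
shoot_prob = [0.0, 0.05, 0.40]   # P(shoot | ball in cell)
goal_prob = [0.0, 0.02, 0.30]    # P(goal | shot from cell)
move_prob = [1.0, 0.95, 0.60]    # P(move | ball in cell) = 1 - shoot_prob
transition = [                   # P(destination cell | move from cell), rows sum to 1
    [0.6, 0.4, 0.0],
    [0.2, 0.5, 0.3],
    [0.0, 0.5, 0.5],
]

xt = value_iteration_xt(shoot_prob, goal_prob, move_prob, transition, n_iter=15)
```

Each extra iteration lets threat propagate one more action back from the goal, which is why cells far from goal gain value only after several passes over the grid.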

xG features: distance_to_goal, angle_to_goal, under_pressure, minute_bin — location-based coordinate model consistent with academic xG literature. Class imbalance (~10% goals) handled with dynamic scale_pos_weight. Raw XGBoost probabilities were post-hoc calibrated via multiplicative rescaling (xg_value × actual_goals / sum_predicted_xg) so that sum(xG) = 9,790 = actual goals, yielding avg_xg = 0.111 (11.1% goal rate) ✅
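The multiplicative rescaling can be checked with a few lines of arithmetic. The helper names are hypothetical; the totals 9,790 (actual goals), 35,658 (sum of raw predictions) and 88,023 (shots) are the figures reported in this README.

```python
import math


def rescale_factor(actual_goals, sum_predicted_xg):
    """Single multiplier that makes the predictions sum to the observed goal count."""
    return actual_goals / sum_predicted_xg


def rescale(raw_xg, actual_goals):
    """Apply the multiplier to every raw probability."""
    factor = rescale_factor(actual_goals, sum(raw_xg))
    # Caveat: a factor above 1 could push values past 1.0; here the factor is below 1
    return [p * factor for p in raw_xg]


# Toy example: four shots whose raw xG sums to 2.0, but only 1 goal was scored
raw = [0.9, 0.5, 0.3, 0.3]
calibrated = rescale(raw, actual_goals=1)

# Figures from this README
readme_factor = rescale_factor(9790, 35658)  # multiplier used in the UPDATE statement
readme_avg_xg = 9790 / 88023                 # mean xG per shot after rescaling
```

Because every value is multiplied by the same positive constant, the ranking of shots is unchanged, so the AUC is the same before and after calibration; only the scale of the probabilities moves.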

xP features: start_x/y, end_x/y, distance, angle_to_goal, under_pressure, minute_bin

finishing_quality = goals − total_xG per player. Positive = clinical finisher outperforming expectation; negative = poor conversion. Available in mart_player_metrics. Example: Messi (Barcelona) = +189.6.


Airflow DAGs

| DAG | Schedule | Tasks |
|---|---|---|
| ingestion_pipeline | Daily 02:00 UTC | ingest → dbt_run → dbt_test → xt_model → superset_init |
| ml_training | Weekly Sun 03:00 | tactical_models ‖ predictive_models ‖ xg_model → dbt_refresh_marts (all 3 ML tasks in parallel) |
| matchday_push | Manual trigger | ingest_match → generate_pdf → generate_xml → send_email |

Data at Scale

| Metric | Value |
|---|---|
| Matches ingested | 3,464 |
| Total events | 9,200,000+ |
| DB partitions | 40+ (by competition) |
| xT records written | 5,375,085 |
| xP predictions written | 3,387,760 |
| xG predictions written | 88,023 (calibrated, avg_xg = 0.111) |
| Ingestion throughput | ~2.5 s / match |
| Write chunk size | 50,000 rows |

Live Pipeline Output

XGBoost xP Model — Training & Prediction

INFO  Loading pass data for training (sample 300,000 rows)…
INFO  Training XGBoost classifier…
INFO  === xP Model Metrics ===
INFO  AUC:       0.8948
INFO  Log-loss:  0.3236
INFO  Accuracy:  0.8221
INFO  Model saved → /opt/airflow/models/xp_model.joblib

INFO  Starting chunked prediction (server-side cursor, chunk=50,000)…
INFO  Chunk 1/68  written  50,000 rows   [total:    50,000 / 3,387,760]
INFO  Chunk 2/68  written  50,000 rows   [total:   100,000 / 3,387,760]
…
INFO  Chunk 68/68 written  37,760 rows   [total: 3,387,760 / 3,387,760]
INFO  xP write complete — 3,387,760 predictions committed to fact_events.xp_value
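The chunk counts in the log follow directly from the row total and the chunk size. A small sketch of that arithmetic, using a hypothetical `chunk_plan` helper:

```python
import math


def chunk_plan(total_rows, chunk_size):
    """Return (number of chunks, size of the final chunk) for a chunked write."""
    if total_rows == 0:
        return 0, 0
    n_chunks = math.ceil(total_rows / chunk_size)
    last_chunk = total_rows - (n_chunks - 1) * chunk_size
    return n_chunks, last_chunk


n_chunks, last_chunk = chunk_plan(3_387_760, 50_000)
print(n_chunks, last_chunk)  # 68 37760
```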

XGBoost xG Model — Training & Prediction

INFO  Training sample loaded: 80000 shots
INFO  Class balance: 8885 goals / 71115 non-goals → scale_pos_weight=8.01
INFO  === xG Model Metrics ===
INFO  AUC:      0.7822
INFO  Log-loss: 0.3541
INFO  Accuracy: 0.8893
INFO  xG model saved → /opt/airflow/reports/xg_model.joblib

INFO  Training data freed. Starting prediction phase...
INFO  xG written: 50000 rows committed
INFO  xG written: 88023 rows committed
INFO  xG write complete — 88023 shots updated

-- Post-hoc calibration (DB-level rescaling):
UPDATE fact_events
SET xg_value = ROUND((xg_value * 9790.0 / 35658.0)::numeric, 6)
WHERE event_type = 'Shot' AND xg_value IS NOT NULL;
-- UPDATE 88023
-- After: avg_xg = 0.1112  ✓  total_xg = 9790 = actual goals ✓

dbt Marts — Full Refresh

$ dbt run --select marts
Running with dbt=1.7.4

Concurrency: 1 threads (target='prod')

1 of 4 START sql table model analytics_marts.mart_player_metrics ........... [RUN]
1 of 4 OK created sql table model analytics_marts.mart_player_metrics ...... [SELECT 11778 in 4.83s]

2 of 4 START sql table model analytics_marts.mart_team_summary ............. [RUN]
2 of 4 OK created sql table model analytics_marts.mart_team_summary ........ [SELECT 337 in 3.21s]

3 of 4 START sql table model analytics_marts.mart_match_summary ............ [RUN]
3 of 4 OK created sql table model analytics_marts.mart_match_summary ....... [SELECT 3464 in 5.67s]

4 of 4 START sql table model analytics_marts.mart_competition_leaderboard .. [RUN]
4 of 4 OK created sql table model analytics_marts.mart_competition_leaderboard [SELECT 11367 in 2.94s]

Finished running 4 table models in 0 hours 0 minutes and 16.65 seconds (0:00:16).

Completed successfully

Done. PASS=4 WARN=0 ERROR=0 SKIP=0 TOTAL=4

Final Verification Query

SELECT
  (SELECT COUNT(*) FROM fact_events WHERE xt_value IS NOT NULL)  AS xt_rows,
  (SELECT COUNT(*) FROM fact_events WHERE xp_value IS NOT NULL)  AS xp_rows,
  (SELECT COUNT(*) FROM fact_events WHERE xg_value IS NOT NULL)  AS xg_rows,
  (SELECT ROUND(AVG(xg_value)::numeric, 4)
     FROM fact_events WHERE event_type = 'Shot')                 AS avg_xg,
  (SELECT COUNT(*) FROM set_piece_clusters)                      AS clusters,
  (SELECT COUNT(*) FROM model_registry)                          AS models,
  (SELECT COUNT(*) FROM xt_surface)                              AS xt_surface_cells;
 xt_rows  | xp_rows   | xg_rows | avg_xg | clusters | models | xt_surface_cells
----------+-----------+---------+--------+----------+--------+------------------
 5375085  | 3387760   |   88023 | 0.1112 |       24 |      4 |              192

CI / CD — GitHub Actions

$ gh run list --limit 5

STATUS  TITLE                                                  WORKFLOW  BRANCH  ELAPSED
✓       fix: dual-perspective audit — football accuracy + …   CI / CD   main    1m31s
✓       feat: audit fixes — coverage threshold, new tests …   CI / CD   main    1m44s
✓       fix: correct ml_dag task IDs in DAG tests + LICENSE   CI / CD   main    1m38s
✓       docs: add Live Pipeline Output section to README       CI / CD   main    1m29s
✓       test: add DAG integrity tests (load, task IDs, …)     CI / CD   main    1m35s

All five runs pass. Each run covers lint (black · isort · flake8), unit tests (46+ tests across 7 files, including xG model tests; DAG tests skip gracefully without Airflow), and a dbt compile check.


Project Structure

xforge/
├── .devcontainer/
│   └── devcontainer.json          # Codespaces: docker up + xP auto-resume on restart
├── .github/
│   └── workflows/deploy.yml       # lint → test → dbt-check → deploy (EC2)
├── config/
│   ├── grafana/                   # Provisioned dashboards & Prometheus datasource
│   └── superset_config.py         # Superset secret key & DB URI
├── dags/
│   ├── ingestion_dag.py           # Daily ETL — ingest → dbt → xT → superset
│   ├── ml_dag.py                  # Weekly: xP + xG + K-Means in parallel → dbt marts
│   └── matchday_dag.py            # On-demand — ingest → PDF → XML → email
├── dbt_project/
│   └── models/
│       ├── staging/               # stg_events, stg_passes, stg_shots (3 models)
│       └── marts/                 # player_metrics, team_summary,
│                                  # match_summary, competition_leaderboard
├── scripts/
│   ├── init/                      # 01_schema.sql — tables, partitions, indexes
│   ├── massive_ingestion.py       # Incremental StatsBomb loader (upsert, 50k chunks)
│   ├── xt_model.py                # Value-iteration xT surface builder
│   ├── predictive_models.py       # XGBoost xP — sample train + chunked prediction
│   ├── xg_model.py                # XGBoost xG — shot goal probability + calibration
│   ├── tactical_models.py         # K-Means set-piece clustering + press detection
│   ├── report_generator.py        # 5-page PDF via mplsoccer + matplotlib
│   ├── xml_generator.py           # SportsCode/Hudl XML — top-25 xT events
│   ├── setup_superset.py          # Autonomous Superset bootstrap — 7 charts + dashboard
│   └── superset_init.py           # Bootstraps saved queries on first run
├── tests/                         # pytest suite — unit + schema validation
├── docker-compose.yml             # 8 services: Airflow (3), Postgres, Superset,
│                                  #   Grafana, pgAdmin, Redis
├── Dockerfile.airflow             # Custom image: Python deps + dbt + mplsoccer
├── Makefile                       # Developer shortcuts (see below)
├── requirements.txt
└── .env.example                   # Template — copy to .env before first run

Getting Started

Prerequisites

  • Docker ≥ 24 and Docker Compose v2
  • 4 GB RAM minimum (8 GB recommended)
  • Git

1. Clone and configure

git clone https://github.com/bbasaranemir/xforge.git
cd xforge
cp .env.example .env

Generate a Fernet key for Airflow and paste it into .env:

python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

2. Launch all services

make up       # build images + start 8 containers
make status   # verify all services are healthy

3. Run the full pipeline

make ingest   # unpause + trigger ingestion_pipeline DAG
make logs     # tail scheduler logs

4. Generate a match report

make report MATCH_ID=3942349
# → reports/match_3942349.pdf  (5 pages)
# → reports/match_3942349_sportscode.xml  (25 events)

5. Access services

| Service | URL | Default credentials |
|---|---|---|
| Airflow | http://localhost:8080 | see .env (AIRFLOW_USER / AIRFLOW_PASSWORD) |
| Superset | http://localhost:8088 | see .env (SUPERSET_USER / SUPERSET_PASSWORD) |
| Grafana | http://localhost:3000 | admin / admin |
| pgAdmin | http://localhost:5050 | see .env (PGADMIN_EMAIL / PGADMIN_PASSWORD) |

GitHub Codespaces

Click Code → Codespaces → New. All services start automatically via .devcontainer; forwarded ports are pre-configured.


CI/CD

Every push to main runs:

lint (black · isort · flake8)
  └─► unit tests (pytest + coverage → Codecov)
        └─► dbt compile check
              └─► deploy to EC2 (requires secrets)

| Secret | Purpose |
|---|---|
| EC2_HOST | EC2 public IP or DNS |
| EC2_USER | SSH username |
| EC2_SSH_KEY | Private key (PEM contents) |

The deploy job is non-blocking (continue-on-error: true) — CI stays green in environments without EC2 configured.


Database Schema

dim_competitions ─┐
dim_seasons ──────┤
dim_matches ──────┤
dim_players ──────┼──► fact_events  (PARTITION BY LIST competition_id, 40+ parts)
dim_teams ────────┘         │
                            ├──► xt_surface          (192 cells, 16×12 grid)
                            ├──► model_registry       (AUC, log-loss, artifact path)
                            ├──► set_piece_clusters   (24 centroids)
                            ├──► press_events         (trigger sequences)
                            │
                            └──► analytics_marts.*
                                  ├── mart_player_metrics        (total_xg · avg_xg · finishing_quality)
                                  ├── mart_team_summary          (avg_xp — NULL-aware)
                                  ├── mart_match_summary
                                  └── mart_competition_leaderboard  (xt_per_match · xt_per_match_rank)

Data Source

StatsBomb Open Data — used under the StatsBomb Open Data Licence. This project is not affiliated with or endorsed by StatsBomb.


License

MIT
