⚽ Football Event Detection from ASR Transcripts

Automatically detects football events (Goal, Penalty, Card, Substitution) from ASR transcripts of match commentary, using a hybrid pipeline that combines rule-based detection with a Transformer ensemble.

📊 Results

Metric	Score
Accuracy	95.7%
F1 Macro	94.3%
F1 Weighted	95.8%

Per-class F1

Event	F1	Event	F1
⚽ GOAL	93.4%	🟨 YELLOW_CARD	94.1%
🅿️ PENALTY	87.0%	🟥 RED_CARD	93.9%
🔄 SUBSTITUTION	100%	📝 NONE	97.2%

🏗️ System Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  ASR Input   │────▶│  Layer 1     │────▶│  Layer 2     │
│ (1_asr.json  │     │  Rule-Based  │     │  Transformer │
│  2_asr.json) │     │  Detection   │     │  Ensemble    │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                │
                    ┌──────────────┐     ┌──────▼───────┐
                    │  Layer 4     │◀────│  Layer 3     │
                    │  Score       │     │  Dedup       │
                    │  Validation  │     │  (Cooldown)  │
                    └──────┬───────┘     └──────────────┘
                           │
                    ┌──────▼───────┐     ┌──────────────┐
                    │  Layer 5     │────▶│   Output     │
                    │  Refinement  │     │  (Dataset)   │
                    └──────────────┘     └──────────────┘

Layer 1 — Rule-Based Detection

80+ regex patterns detect events from action verbs
Separate exclusion patterns for each event type (to filter out false positives)
Confidence tiers (0.75–0.98) based on how specific the pattern is
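
The three ideas above (action-verb patterns, per-event exclusions, confidence tiers) can be sketched as follows. This is a minimal illustration, not the project's actual rule set: the `RULES` and `EXCLUSIONS` tables, the individual regexes, the confidence values, and the `detect` function are all hypothetical.

```python
import re

# Hypothetical patterns: (compiled regex, confidence tier).
# More specific phrasing gets a higher tier, within the 0.75-0.98 range.
RULES = {
    "GOAL": [
        (re.compile(r"\bback of the net\b", re.I), 0.98),
        (re.compile(r"\bscores\b", re.I), 0.90),
        (re.compile(r"\bgoal\b", re.I), 0.75),
    ],
    "YELLOW_CARD": [
        (re.compile(r"\b(?:shown|gets|receives) (?:a|the) yellow card\b", re.I), 0.95),
        (re.compile(r"\bbooked\b", re.I), 0.80),
    ],
}

# Hypothetical per-event exclusions: any hit vetoes that event type.
EXCLUSIONS = {
    "GOAL": [re.compile(r"\b(?:goal kick|goalkeeper|disallowed|no goal)\b", re.I)],
    "YELLOW_CARD": [re.compile(r"\bno (?:yellow )?card\b", re.I)],
}


def detect(text):
    """Return (label, confidence) of the best matching rule, or ("NONE", 0.0)."""
    best = ("NONE", 0.0)
    for label, patterns in RULES.items():
        # Skip this event type entirely if one of its exclusions matches.
        if any(p.search(text) for p in EXCLUSIONS.get(label, [])):
            continue
        for pattern, conf in patterns:
            if pattern.search(text) and conf > best[1]:
                best = (label, conf)
    return best
```

Keeping exclusions per event type means that a phrase such as "goal kick" suppresses only GOAL and leaves the other event types free to match.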

Layer 2 — Transformer Ensemble

Model: microsoft/MiniLM-L12-H384-uncased (33M params, fine-tuned)
Input: 3-sentence context window (prev + current + next)
Ensemble: Rule 40% + Transformer 60%; the rule output takes priority when an exclusion pattern matches strongly
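
A rough sketch of the context window and the 40/60 weighting, assuming both components expose per-label probabilities as plain dicts. The function names, the dict representation, and the `strong_exclusion` flag are assumptions made for illustration; the README does not say how a "strong" exclusion match is measured.

```python
def context_window(sentences, i):
    """Join the previous, current and next sentence (where they exist)."""
    lo, hi = max(0, i - 1), min(len(sentences), i + 2)
    return " ".join(sentences[lo:hi])


def ensemble(rule_probs, model_probs, rule_weight=0.4, strong_exclusion=False):
    """Combine two label -> probability dicts and return (label, score).

    With strong_exclusion=True the rule output is returned unchanged,
    mirroring "rule takes priority when an exclusion matches strongly".
    """
    if strong_exclusion:
        label = max(rule_probs, key=rule_probs.get)
        return label, rule_probs[label]
    labels = sorted(set(rule_probs) | set(model_probs))
    combined = {
        label: rule_weight * rule_probs.get(label, 0.0)
        + (1.0 - rule_weight) * model_probs.get(label, 0.0)
        for label in labels
    }
    best = max(combined, key=combined.get)
    return best, combined[best]
```

For example, a rule score of 0.9 for GOAL combined with a transformer score of 0.3 gives 0.4 × 0.9 + 0.6 × 0.3 = 0.54, so GOAL still wins over NONE at 0.46.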

Layer 3 — Deduplication

Cooldown period: GOAL 25s, SUBSTITUTION 20s, PENALTY 15s, CARD 10s
Keep the highest-confidence event within each cooldown window
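
One way to read the cooldown rule is sketched below. It assumes events are dicts with `label`, `time` (seconds) and `confidence` keys, that the window is measured from the first event of each cluster, and that the "CARD 10s" value applies to both card labels; none of these details are confirmed by the README.

```python
# Cooldown in seconds per label; "CARD 10s" is assumed to cover both card types.
COOLDOWN = {
    "GOAL": 25,
    "SUBSTITUTION": 20,
    "PENALTY": 15,
    "YELLOW_CARD": 10,
    "RED_CARD": 10,
}


def deduplicate(events):
    """Collapse same-label events inside one cooldown window into the best one."""
    kept = []
    window = {}  # label -> (index into kept, start time of the current window)
    for ev in sorted(events, key=lambda e: e["time"]):
        label = ev["label"]
        if label in window:
            idx, start = window[label]
            if ev["time"] - start <= COOLDOWN.get(label, 0):
                # Same window: keep whichever event has the higher confidence.
                if ev["confidence"] > kept[idx]["confidence"]:
                    kept[idx] = ev
                continue
        window[label] = (len(kept), ev["time"])
        kept.append(ev)
    return sorted(kept, key=lambda e: e["time"])
```

Windows are tracked per label, so a yellow card a few seconds after a goal is not swallowed by the goal's cooldown.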

Layer 4 — Score Validation

Extract the ground-truth score from the match folder name (e.g. Chelsea 2 - 2 Swansea)
Trim surplus goals by confidence ranking → 0 false goals
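
The score check might look like the sketch below. It assumes the score appears in the folder name as two standalone numbers separated by " - " (as in the example above) and that events are dicts with `label` and `confidence` keys; `SCORE_RE`, `parse_score` and `trim_goals` are hypothetical names, and the real layer may handle edge cases differently.

```python
import re

# Two standalone integers separated by " - ". The whitespace lookarounds
# keep hyphenated date or time tokens from being mistaken for a score.
SCORE_RE = re.compile(r"(?<!\S)(\d+) - (\d+)(?!\S)")


def parse_score(folder_name):
    """Return (home_goals, away_goals), or None if no score is found."""
    matches = SCORE_RE.findall(folder_name)
    if not matches:
        return None
    home, away = matches[-1]
    return int(home), int(away)


def trim_goals(events, folder_name):
    """Keep at most home + away GOAL events, preferring the most confident."""
    score = parse_score(folder_name)
    if score is None:
        return events  # no ground truth available, leave events untouched
    expected = sum(score)
    goals = sorted(
        (e for e in events if e["label"] == "GOAL"),
        key=lambda e: e["confidence"],
        reverse=True,
    )
    keep = {id(e) for e in goals[:expected]}
    return [e for e in events if e["label"] != "GOAL" or id(e) in keep]
```

Trimming only ever removes GOAL events. If the pipeline finds fewer goals than the final score, this step cannot recover the missing ones.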

Layer 5 — Context Refinement

GOAL + context contains "miss/save/wide/blocked" → NONE
PENALTY + context contains "no penalty/waves away" → NONE
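
As a minimal sketch, the two refinement rules can be expressed as a lookup of negating cues per label. The cue lists come from the rules above; the `REFINE` table, the `refine` function, and the use of plain substring matching are assumptions.

```python
# Cues that overturn a detection when they appear in the surrounding context.
REFINE = {
    "GOAL": ("miss", "save", "wide", "blocked"),
    "PENALTY": ("no penalty", "waves away"),
}


def refine(label, context):
    """Downgrade a detection to NONE if its context contains a negating cue."""
    text = context.lower()
    if any(cue in text for cue in REFINE.get(label, ())):
        return "NONE"
    return label
```

Substring matching also catches inflected forms such as "missed" or "saved", at the cost of occasional over-matching (for instance "widely"); word-boundary regexes would be the stricter alternative.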

🧠 Training Pipeline

Step 1: Weak Annotation          Step 2: Gold Set           Step 3: Fine-tune
Rule pipeline auto-labels   →    Build 516 reviewed    →    Train MiniLM on
~5,000 samples                   samples as eval set        the weak labels

Parameter	Value
Model	`microsoft/MiniLM-L12-H384-uncased`
Train	4,986 samples (weak labels)
Eval	516 samples (gold set)
Epochs	5
Batch size	16
Learning rate	2e-5
Loss	CrossEntropy + class weights
Early stopping	Patience 2
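
The README does not say how the class weights for the loss are derived. One common choice, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where each class gets n_samples / (n_classes × class_count). The function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}
```

The resulting values, ordered by label index, could then be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, so that rare classes such as RED_CARD contribute more per sample than NONE.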

📁 Project Structure

football/
├── app.py                         # Flask web server
├── create_football_dataset.py     # Pipeline orchestrator (CLI + class)
├── create_gold_set.py             # Build the gold set for evaluation
├── train_transformer.py           # Fine-tune transformer model
├── gold_set.json                  # Gold set (516 samples)
├── pipeline/
│   ├── __init__.py
│   ├── layer1_detection.py        # Rule-based detection
│   ├── layer2_transformer.py      # Transformer + ensemble
│   ├── layer3_dedup.py            # Deduplication
│   ├── layer4_validation.py       # Score validation
│   ├── layer5_refinement.py       # Context refinement
│   └── utils.py                   # Merge ASR, context window, balancing
├── checkpoints/best/              # Trained model weights
├── templates/index.html           # Web UI (dark mode)
└── whisper_v1_en/                 # Data for 367 matches
    └── {league}/{date} {team1} {score} {team2}/
        ├── 1_asr.json
        └── 2_asr.json

📦 Data

Source: SoccerNet/sn-echoes (Whisper v1 English ASR)
Scale: 367 matches from 6 competitions: EPL, La Liga, Bundesliga, Ligue 1, Serie A, Champions League
Format: {"segments": {"0": [start, end, "text"], ...}}
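
A small sketch of reading this format into ordered `(start, end, text)` rows. The `load_segments` helper is hypothetical; the segment keys are sorted numerically because, as strings, "10" would otherwise sort before "2".

```python
import json


def load_segments(raw_json):
    """Parse an ASR JSON string into a list of (start, end, text) tuples."""
    segments = json.loads(raw_json)["segments"]
    ordered = sorted(segments.items(), key=lambda kv: int(kv[0]))
    return [(float(start), float(end), text.strip()) for _, (start, end, text) in ordered]
```

For a file on disk, the same function can be applied to the file contents, for example `load_segments(open(path, encoding="utf-8").read())` with `path` pointing at `1_asr.json` or `2_asr.json`.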

Labels (6 classes)

Label	Description	Count
GOAL	Goal	~1,187
PENALTY	Penalty kick	~367
YELLOW_CARD	Yellow card	~592
RED_CARD	Red card	~111
SUBSTITUTION	Substitution	~168
NONE	Not an event	Remainder

🚀 How to Run

Requirements

pip install flask torch transformers scikit-learn accelerate

Web App

python app.py
# Open http://localhost:5000

CLI: Build the dataset

# All matches
python create_football_dataset.py whisper_v1_en -o output.json --split --verify

# A single match
python create_football_dataset.py whisper_v1_en/england_epl/... --single-match

# Custom options
python create_football_dataset.py whisper_v1_en \
  --none-ratio 2.0 \
  --split --test-ratio 0.2 \
  --verify

Training

# Create the gold set
python create_gold_set.py --create --matches 10

# Fine-tune transformer
python train_transformer.py \
  --train football_dataset_production_train.json \
  --eval gold_set.json \
  --model microsoft/MiniLM-L12-H384-uncased \
  --epochs 5

💡 Design Decisions

Precision > Recall: strict patterns; missing an event is preferred over a false detection
Per-event exclusions: each event type has its own exclusion set
Weak supervision: the rule pipeline generates labels automatically, so the training set needs no manual annotation
Ensemble > Single model: rules are strong on precision, the transformer on recall
Temporal split: train/test are split by time, not at random
Score validation from metadata: the score in the folder name serves as ground truth
3-sentence context window: gives context to both the transformer and the refinement layer
Graceful fallback: if no transformer is available, the pipeline runs in rule-only mode
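
The temporal split can be illustrated with match folder names alone. This sketch assumes each folder name begins with an ISO-style date (so that lexicographic order is chronological) and holds out the most recent matches for testing; the `temporal_split` function and the example folder names are hypothetical.

```python
def temporal_split(match_dirs, test_ratio=0.2):
    """Hold out the most recent matches as the test set.

    Assumes folder names start with a sortable date such as "2015-02-21".
    """
    ordered = sorted(match_dirs)
    n_test = round(len(ordered) * test_ratio)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical folder names used only to show the call.
train, test = temporal_split([
    "2016-05-01 A 1 - 0 B",
    "2014-09-13 C 2 - 2 D",
    "2015-02-21 E 0 - 3 F",
    "2016-11-05 G 1 - 1 H",
    "2015-08-08 I 4 - 0 J",
])
```

Splitting on whole matches in date order keeps segments from one match out of both sets at once, which a random per-sample split would not guarantee.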

🛠️ Tech Stack

Component	Technology
NLP Engine	Regex + HuggingFace Transformers
Model	MiniLM-L12-H384-uncased (33M params)
Training	PyTorch + HuggingFace Trainer
Backend	Flask
Frontend	HTML/CSS/JS (dark mode)
ASR Source	OpenAI Whisper v1
Dataset	SoccerNet/sn-echoes
