| title | Fleet-Watch |
|---|---|
| emoji | 👁️ |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| app_port | 7860 |
Developed for the Meta PyTorch OpenEnv Hackathon × Scaler 2026
FleetWatch is a reinforcement learning environment built to train Large Language Models (LLMs) to detect sophisticated, coordinated fraud patterns in multi-agent fleet operations. It employs curriculum learning and a multi-signal reward mechanism to identify deception across complex fleet operation logs.
Fleet operations are highly vulnerable to coordinated fraud schemes in which multiple agents collaborate to conceal evidence. Traditional rule-based detection systems consistently fail against these sophisticated patterns. FleetWatch addresses this gap by training LLMs to:
- Analyze noisy, multi-source operational data
- Identify subtle correlations across agent logs
- Detect coordinated deception across multiple actors
- Produce evidence-based fraud assessments with causal reasoning
FleetWatch implements a full reinforcement learning pipeline with:
- Curriculum Learning — Progressive task difficulty, from single-agent anomalies to multi-agent collusion
- Multi-Signal Rewards — A 7-component reward system that prevents gaming and rewards genuine reasoning
- Anti-Cheat Mechanisms — Penalties for pattern exploitation without supporting evidence
- Evidence-Based Evaluation — Rewards tied to specific log references and causal chains
| Feature | Description |
|---|---|
| Progressive Task Complexity | 5 fraud detection scenarios from single-agent GPS tampering to multi-agent financial collusion |
| Robust Reward System | 7-signal evaluation mechanism with anti-gaming penalties |
| Curriculum Learning | Adaptive task selection based on real-time agent performance |
| Production-Ready API | RESTful endpoints for real-time fraud detection with full documentation |
| Efficient Training | 4-bit quantized Llama-3-8B with LoRA fine-tuning (~30 min on a T4 GPU) |
| Comprehensive Evaluation | Evidence-based scoring with per-signal reward breakdown |
Training with REINFORCE policy-gradient optimization raises the average reward on all five detection tasks. The full training analysis, baseline vs. enhanced, is shown below.
| Metric | Baseline (50 eps) | Enhanced (75 eps) |
|---|---|---|
| Mean Reward | 0.047 | 0.521 |
| Best Reward | 0.733 | 0.800 |
| Final 20% Avg | 0.001 | 0.377 |
| Std Dev (lower = better) | 0.238 | 0.156 |
| Fraction of Episodes > 0.6 | 0.040 | 0.440 |
The Enhanced model achieves an 11× higher mean reward (0.521 vs. 0.047), an 11× larger share of episodes exceeding 0.6 (0.440 vs. 0.040), and a tighter reward distribution (standard deviation 0.156 vs. 0.238). This pattern is consistent with stable fraud detection rather than occasional lucky guesses.
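The summary metrics reported above can be recomputed from a list of per-episode rewards. The following is a minimal sketch, not the project's evaluation script: the function name `summarize_rewards`, the use of population standard deviation, and the rounding of the final 20% window are assumptions, and the sample rewards are invented for illustration.

```python
from statistics import mean, pstdev


def summarize_rewards(rewards, threshold=0.6, tail_fraction=0.2):
    """Summarize per-episode rewards.

    Assumptions: population standard deviation, and a "final 20%" window
    made of the last 20% of episodes (rounded down, at least one episode).
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    tail_len = max(1, int(len(rewards) * tail_fraction))
    return {
        "mean_reward": mean(rewards),
        "best_reward": max(rewards),
        "final_tail_avg": mean(rewards[-tail_len:]),
        "std_dev": pstdev(rewards),
        "frac_above_threshold": sum(r > threshold for r in rewards) / len(rewards),
    }


# Illustrative rewards only; these are not FleetWatch training results.
example = summarize_rewards([0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.7, 0.7, 0.9, 0.7])
```

Running the same function on the baseline and enhanced reward logs and subtracting the results would give the per-metric deltas.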
| Task | Label | Avg Reward (Before) | Avg Reward (After) | Delta |
|---|---|---|---|---|
| T1 | Obvious | 0.065 | 0.528 | +0.463 |
| T2 | Pattern | 0.074 | 0.550 | +0.476 |
| T3 | Adversarial | 0.045 | 0.587 | +0.543 |
| T4 | Cascade | 0.009 | 0.563 | +0.553 |
| T5 | Collusion | 0.043 | 0.377 | +0.334 |
T4 (Cascade Negligence) shows the largest absolute gain (+0.553) from the lowest baseline (0.009), demonstrating the model's ability to master complex causal chain analysis. T5 (Financial Collusion) has the lowest post-training score (0.377), reflecting the inherent difficulty of multi-agent phantom transaction detection.
| Statistic | Baseline | Enhanced |
|---|---|---|
| Mean (B mean) | 0.047 | n/a |
| Mean (E mean) | n/a | 0.521 |
| Mean Shift (Δ) | n/a | +0.474 |
The Baseline reward distribution is heavily concentrated near 0 (peak frequency ~47 episodes at reward < 0.1), while the Enhanced model's distribution spreads broadly toward higher reward values — mean shift of +0.474.
| Model | Best Reward Achieved |
|---|---|
| Baseline | 0.733 |
| Enhanced | 0.800 |
The Enhanced model reaches its peak best reward of 0.800 early in training and sustains it, while the Baseline plateaus at 0.733 — a gain of +0.067 in peak performance.
| Metric | Δ Change | Direction |
|---|---|---|
| Mean Reward | +0.474 | ✅ Higher is better |
| Best Reward | +0.067 | ✅ Higher is better |
| Final 20% Avg | +0.376 | ✅ Higher is better |
| Std Dev | −0.082 | ✅ Lower is better |
| Fraction of eps > 0.6 | +0.400 | ✅ Higher is better |
Note: Scores are normalized to the [0.001, 0.999] range using the 7-component reward function. Baseline represents random-initialization performance (50 episodes). Enhanced represents the trained model (75 episodes, REINFORCE + LoRA).
FleetWatch implements five progressively complex fraud detection scenarios designed to stress-test multi-agent reasoning capabilities.
| Task | Fraud Type | Agents | Complexity | Key Challenges |
|---|---|---|---|---|
| T1 | GPS Tampering | 1 | Low | Route deviation detection, tracker manipulation |
| T2 | Timesheet Fraud | 1 | Medium | Pattern recognition, odometer correlation |
| T3 | Collision Cover-up | 2 | High | Log tampering, witness coercion, cross-agent analysis |
| T4 | Cascade Negligence | 3 | High | Causal chain reconstruction, inspection records |
| T5 | Financial Collusion | 3 | Critical | Shell vendor identification, phantom transactions |
Design Philosophy: Each task requires deeper reasoning than the last, progressing from single-agent pattern detection to full multi-agent coordination analysis.
┌─────────────────────────────────────────────────────────────┐
│  FleetWatch Environment (HuggingFace Spaces)                │
│                                                             │
│  ┌──────────┐    ┌──────────────────┐    ┌───────────────┐  │
│  │ 5 Tasks  │───▶│ Adaptive         │───▶│ Master Grader │  │
│  │ T1→T5    │    │ Curriculum       │    │ (7-signal)    │  │
│  └──────────┘    └──────────────────┘    └───────────────┘  │
│                                                             │
│  POST /reset   POST /step   GET /state   GET /health        │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Training Pipeline (Google Colab, T4 GPU)                   │
│                                                             │
│  • Llama-3-8B-Instruct (4-bit quantized)                    │
│  • LoRA Fine-tuning (r=8, alpha=16)                         │
│  • REINFORCE + Advantage Baseline + Entropy Bonus           │
│  • Curriculum Learning + Adaptive LR + Gradient Clipping    │
└─────────────────────────────────────────────────────────────┘
| Component | Description |
|---|---|
| Environment | OpenEnv-compliant RL environment with 5 fraud detection tasks |
| Curriculum | Adaptive task selection based on rolling agent performance |
| Grader | Multi-signal reward function with anti-cheat mechanisms |
| Model | 4-bit quantized Llama-3-8B-Instruct with LoRA adapters |
| Trainer | REINFORCE policy gradient with variance reduction and entropy bonus |
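One way to realize "adaptive task selection based on rolling agent performance" is to unlock the next task once the rolling average reward on the hardest unlocked task passes a threshold. The sketch below illustrates that idea only; the class name `AdaptiveCurriculum`, the window size, the 0.6 promotion threshold, and the sampling weights are hypothetical and are not taken from the FleetWatch source.

```python
from collections import deque
import random


class AdaptiveCurriculum:
    """Unlock harder tasks once the rolling reward on the hardest unlocked task is high enough."""

    def __init__(self, num_tasks=5, window=10, promote_at=0.6, seed=0):
        self.num_tasks = num_tasks
        self.promote_at = promote_at   # hypothetical promotion threshold
        self.max_unlocked = 1          # start with T1 only
        self.history = {t: deque(maxlen=window) for t in range(1, num_tasks + 1)}
        self.rng = random.Random(seed)

    def sample_task(self):
        # Favor the hardest unlocked task, but keep revisiting easier ones.
        tasks = list(range(1, self.max_unlocked + 1))
        weights = [2.0 if t == self.max_unlocked else 1.0 for t in tasks]
        return self.rng.choices(tasks, weights=weights, k=1)[0]

    def record(self, task_id, reward):
        # Store the reward and promote when the rolling window is full and strong.
        window = self.history[task_id]
        window.append(reward)
        window_full = len(window) == window.maxlen
        if (task_id == self.max_unlocked
                and window_full
                and sum(window) / len(window) >= self.promote_at
                and self.max_unlocked < self.num_tasks):
            self.max_unlocked += 1


curriculum = AdaptiveCurriculum()
for _ in range(10):
    curriculum.record(1, 0.8)   # ten strong T1 episodes fill the window
next_task = curriculum.sample_task()
```

Requiring a full window before promotion keeps a single lucky episode from unlocking a harder task.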
Seven independent reward signals are normalized to [0.001, 0.999]. The system is designed to reward genuine, evidence-backed reasoning and penalize shortcut strategies.
| Signal | Weight | Description |
|---|---|---|
| Valid JSON Format | +0.4 | Proper structured response formatting |
| Anomaly Detection | +1.5 | Accurate fraud identification |
| Agent Identification | +0.8 | Correct perpetrator(s) identified |
| Severity Classification | +0.4 | Appropriate severity level assigned |
| Keyword Coverage | +0.8 | Evidence-based language used |
| Contextual Reasoning | +0.4 | Logical connections drawn |
| Evidence Integration | +0.3 | Specific log references cited |
| Task Complexity Bonus | +0.2 | Extra credit for Tasks 3–5 |
| Anti-Cheat Penalty | −0.2 | Applied when gaming is detected |
Why it works:
- Always claiming fraud → low agent identification score
- Always claiming no fraud → missed anomaly penalty
- Only genuine, evidence-grounded reasoning consistently scores high
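The weights in the reward table can be combined into a single normalized score roughly as follows. This is a sketch under stated assumptions rather than the actual grader: the signal key names and the function `combine_signals` are hypothetical, each signal is assumed to arrive as a score in [0, 1], the complexity bonus is assumed to apply unconditionally on Tasks 3–5, and normalization is assumed to divide by the maximum attainable raw score before clipping to [0.001, 0.999].

```python
# Weights copied from the reward table; the key names are illustrative.
SIGNAL_WEIGHTS = {
    "valid_json": 0.4,
    "anomaly_detection": 1.5,
    "agent_identification": 0.8,
    "severity_classification": 0.4,
    "keyword_coverage": 0.8,
    "contextual_reasoning": 0.4,
    "evidence_integration": 0.3,
}
COMPLEXITY_BONUS = 0.2     # extra credit for Tasks 3-5
ANTI_CHEAT_PENALTY = 0.2   # subtracted when gaming is detected
MAX_RAW = sum(SIGNAL_WEIGHTS.values()) + COMPLEXITY_BONUS


def combine_signals(signals, task_id, gaming_detected=False):
    """Combine per-signal scores in [0, 1] into one reward in [0.001, 0.999].

    Assumption: divide by the maximum attainable raw score, then clip.
    The real grader may normalize differently.
    """
    raw = sum(w * signals.get(name, 0.0) for name, w in SIGNAL_WEIGHTS.items())
    if task_id >= 3:
        raw += COMPLEXITY_BONUS
    if gaming_detected:
        raw -= ANTI_CHEAT_PENALTY
    return min(0.999, max(0.001, raw / MAX_RAW))


# "Always claim fraud" strategy: valid JSON and a correct anomaly flag,
# but no correct agent, evidence, or reasoning.
always_fraud = combine_signals({"valid_json": 1.0, "anomaly_detection": 1.0}, task_id=1)
perfect = combine_signals({name: 1.0 for name in SIGNAL_WEIGHTS}, task_id=3)
nothing = combine_signals({}, task_id=1)
```

Under these assumptions a response that only flags fraud stays well below a fully evidenced one, which matches the intent described in the list above.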
Test the deployed model directly against a fraud detection task:
curl -X POST https://shiva0999-fleet-watch.hf.space/test/3 \
  -H "Content-Type: application/json" \
  -d '{
    "anomaly_detected": true,
    "agent_id": "DRIVER-22, DRIVER-08",
    "severity": "critical",
    "summary": "Coordinated collision cover-up with log tampering and witness coercion."
  }'
Response: Returns a normalized score (0–1) with a full 7-signal reward breakdown.
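When requests are generated programmatically, it can help to build the JSON body with a small helper instead of writing it by hand. The sketch below mirrors the four fields used in the request above; the helper name `make_action` is hypothetical, and the severity levels "low" and "medium" are assumed, since only "high" and "critical" appear in this README.

```python
import json

# Assumption: only "high" and "critical" appear in this README; the other
# two levels are hypothetical placeholders.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def make_action(anomaly_detected, agent_ids, severity, summary):
    """Build an action dict shaped like the request bodies shown in this README."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if not summary.strip():
        raise ValueError("summary must not be empty")
    return {
        "anomaly_detected": bool(anomaly_detected),
        "agent_id": ", ".join(agent_ids),   # multiple agents are comma-separated
        "severity": severity,
        "summary": summary,
    }


payload = make_action(
    True,
    ["DRIVER-22", "DRIVER-08"],
    "critical",
    "Coordinated collision cover-up with log tampering and witness coercion.",
)
body = json.dumps(payload)   # JSON string that can be sent as the request body
```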
Train a custom model using Google Colab with a free T4 GPU (~30 minutes):
Step 1: Install dependencies (restart the runtime afterwards)
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps trl peft accelerate bitsandbytes
Step 2: Upload and run the training script
# Upload FleetWatch_Colab_Train.py to Colab, then execute:
exec(open("FleetWatch_Colab_Train.py").read())
Training Configuration
| Parameter | Value |
|---|---|
| Base Model | Llama-3-8B-Instruct (4-bit quantized) |
| Fine-tuning Method | LoRA (rank=8, alpha=16) |
| Training Algorithm | REINFORCE with advantage baseline |
| Optimizer | AdamW with gradient clipping |
| Task Sampling | Adaptive curriculum based on performance |
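To make the "REINFORCE with advantage baseline" entry concrete, the toy example below applies the same update rule to a three-action softmax policy instead of an LLM. It is a minimal sketch, not the FleetWatch trainer: the function `train_bandit`, the moving-average baseline, and every hyperparameter value are assumptions, and plain gradient ascent stands in for AdamW.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards_per_action, steps=500, lr=0.5, entropy_coef=0.01,
                 baseline_decay=0.9, max_grad_norm=1.0, seed=0):
    """REINFORCE on a softmax policy over a few discrete actions."""
    rng = random.Random(seed)
    n = len(rewards_per_action)
    logits = [0.0] * n
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n), weights=probs, k=1)[0]
        reward = rewards_per_action[action]
        advantage = reward - baseline   # baseline subtraction reduces variance
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        grad = []
        for i, p in enumerate(probs):
            # d log pi(action) / d logit_i for a softmax policy
            g_policy = (1.0 if i == action else 0.0) - p
            # d entropy / d logit_i
            g_entropy = -p * (math.log(p) + entropy) if p > 0 else 0.0
            grad.append(advantage * g_policy + entropy_coef * g_entropy)
        # Gradient clipping by global L2 norm
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > max_grad_norm:
            grad = [g * max_grad_norm / norm for g in grad]
        logits = [z + lr * g for z, g in zip(logits, grad)]   # gradient ascent
        baseline = baseline_decay * baseline + (1 - baseline_decay) * reward
    return softmax(logits)


# Only the third action is rewarded, so its probability should grow.
final_probs = train_bandit([0.0, 0.0, 1.0])
```

In the LLM setting the logits come from the model and the update flows through the LoRA parameters, but the advantage-weighted log-probability gradient has the same form.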
Base URL: https://shiva0999-fleet-watch.hf.space
| Method | Endpoint | Description |
|---|---|---|
| POST | /reset | Start a new episode |
| POST | /step | Submit a fraud detection action |
| GET | /state | Retrieve the current episode state |
| GET | /health | Health check |
| POST | /test/{task_id} | Test a specific task (1–5) |
Python Example
import requests
BASE_URL = "https://shiva0999-fleet-watch.hf.space"
# Start a new episode
task_data = requests.post(f"{BASE_URL}/reset").json()
# Submit a detection action
action = {
"anomaly_detected": True,
"agent_id": "DRIVER-22",
"severity": "high",
"summary": "GPS tampering detected with route deviation exceeding 40km."
}
response = requests.post(f"{BASE_URL}/step", json=action).json()
print(f"Reward: {response['reward']}")
| Layer | Technology |
|---|---|
| Base Model | Llama-3-8B-Instruct (4-bit quantized) |
| Training | Unsloth · REINFORCE · LoRA (r=8) |
| Framework | FastAPI · Docker |
| Deployment | HuggingFace Spaces |
| Libraries | transformers · peft · bitsandbytes · accelerate · trl |
- Live API: https://shiva0999-fleet-watch.hf.space
- HuggingFace Space: https://huggingface.co/spaces/shiva0999/Fleet-Watch
- GitHub Repository: https://github.com/shivakewat1/FleetWatch
- Colab Training Notebook: https://colab.research.google.com/drive/1ZYWRl3NI86Cz8VrrxGm3vOGrKxpqHKp1?usp=sharing
- OpenEnv Framework: https://openenv.dev
- Unsloth: https://github.com/unslothai/unsloth
- Meta Llama 3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
Made with ❤️ for better fleet safety and accountability.
