---
title: GridMind-RL
emoji: ⚡
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---
GridMind-RL — Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.
Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives—but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.
| | URL |
|---|---|
| Environment API | https://prajwal782007-gridmind.hf.space |
| Live Dashboard | https://prajwal782007-gridmind.hf.space/dashboard |
Quick test:

```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```

| | Description |
|---|---|
| Observation | 13 fields: temperature, storage, process demand, price, price forecast, grid stress, carbon intensity, hour of day, batch queue, cumulative cost, HVAC efficiency, active faults, instruction card |
| Actions | HVAC level (0-1), thermal charge (-1 to 1), batch slot (0-4), load shed (0-0.5) |
| Reward | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| Episode | 96 steps = 24 simulated hours @ 15-min resolution |
| Tasks | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |
Weights reflect real-world building operator priorities — not arbitrary values:
| Component | Weight | Rationale |
|---|---|---|
| cost_savings | 0.28 | Primary operator KPI — energy spend is the main business metric |
| carbon_reward | 0.20 | ESG compliance — increasingly mandatory for industrial operators |
| temp_constraint | 0.20 | Hard safety constraint — comfort SLA violations incur penalties |
| grid_response | 0.20 | Regulatory SLA — demand response programs pay operators to shed load |
| batch_deadline | 0.12 | Production continuity — missing batch deadlines causes downstream losses |
| efficiency_bonus | 0.05 | Storage arbitrage — incentivises smart charge/discharge timing |
| stability_penalty | -0.05 | Anti-cycling — prevents HVAC thrashing that causes equipment wear |
| task_satisfaction | 0.50* | Task 4 only — weighted per the episode's instruction card |
| fault_mitigation | dynamic | Emergency response — computed based on fault type and response |
*Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
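For intuition, the per-step shaped reward is the weighted sum of these component scores. A minimal Python sketch of that composition (the authoritative computation lives in env/rewards.go; the function and key names here are illustrative):

```python
# Illustrative reward composition mirroring the weight table above.
# The real implementation is in env/rewards.go; names are assumptions.
WEIGHTS = {
    "cost_savings": 0.28,
    "carbon_reward": 0.20,
    "temp_constraint": 0.20,
    "grid_response": 0.20,
    "batch_deadline": 0.12,
    "efficiency_bonus": 0.05,
    "stability_penalty": -0.05,  # negative weight: penalises HVAC cycling
}

def shaped_reward(components: dict[str, float], instruction_weight: float = 0.0) -> float:
    """Weighted sum of per-component scores (each roughly in [0, 1])."""
    reward = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    # Task 4 only: the instruction weight comes from the sampled instruction card.
    reward += instruction_weight * components.get("task_satisfaction", 0.0)
    # Fault mitigation is dynamic, computed from fault type and response.
    reward += components.get("fault_mitigation", 0.0)
    return reward
```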
Observation fields:

| Field | Type | Description |
|---|---|---|
| indoor_temperature | float | °C |
| thermal_storage_level | float | 0-1 (0=empty, 1=full) |
| process_demand | float | kW current industrial power demand |
| current_price | float | $/kWh |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | gCO2/kWh |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | pending job deadline slots |
| cumulative_cost | float | $ total incurred this episode |
| hvac_efficiency | float | 1.0 → degrades to 0.5 over episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Task 4 objective only |
| price_forecast | float[] | 4-step upcoming price preview |
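For concreteness, a single observation might look like this (values and the fault string are illustrative, not drawn from a real episode):

```python
observation = {
    "indoor_temperature": 21.4,        # °C
    "thermal_storage_level": 0.62,     # 0 = empty, 1 = full
    "process_demand": 184.0,           # kW
    "current_price": 0.17,             # $/kWh
    "grid_stress_signal": 0.74,        # > 0.7 = critical
    "carbon_intensity": 412.0,         # gCO2/kWh
    "hour_of_day": 14,
    "batch_queue": [12, 40, 71],       # pending job deadline slots
    "cumulative_cost": 38.25,          # $ incurred so far this episode
    "hvac_efficiency": 0.83,           # degrades from 1.0 toward 0.5
    "active_faults": ["HVAC_DEGRADED"],  # illustrative alarm string
    "instruction_card": None,          # populated for Task 4 only
    "price_forecast": [0.19, 0.24, 0.31, 0.28],  # next 4 steps
}
```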
Action fields:

| Field | Type | Range |
|---|---|---|
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |
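A valid action payload, as an agent would send it to /step (values are illustrative; see openenv.yaml for the exact request schema):

```python
action = {
    "hvac_power_level": 0.35,     # back off HVAC while prices peak
    "thermal_charge_rate": -0.8,  # discharge thermal storage
    "batch_job_slot": 2,          # defer the queued batch to slot 2
    "load_shed_fraction": 0.25,   # shed 25% of load under grid stress
}
```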
A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads /feeder to see fleet-wide demand, then sets per-building price multipliers via /coordinate to orchestrate behavior.
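A minimal coordinator sketch, assuming /feeder returns per-building demand and /coordinate accepts a map of price multipliers (the real schemas live in openenv.yaml; the field names below are assumptions):

```python
import requests

BASE = "https://prajwal782007-gridmind.hf.space"

# Read fleet-wide state from the feeder endpoint.
fleet = requests.get(f"{BASE}/feeder").json()

# Hypothetical shaping rule: raise prices for buildings drawing above the
# fleet average, nudging them to shed load; field names are assumptions.
demands = {b["id"]: b["demand_kw"] for b in fleet.get("buildings", [])}
avg = sum(demands.values()) / max(len(demands), 1)
multipliers = {bid: (1.3 if d > avg else 0.9) for bid, d in demands.items()}

requests.post(f"{BASE}/coordinate", json={"price_multipliers": multipliers})
```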
Task 4 presents a natural language objective card like "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than apply greedy per-step control.
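An instruction card might surface in the observation's instruction_card field roughly like this (structure is illustrative; cards are sampled fresh each episode):

```python
instruction_card = {
    "text": "Keep total energy cost under $2.50 while maintaining 19-23°C",
    "constraints": {"max_cost": 2.50, "temp_range": [19.0, 23.0]},  # assumed parsed form
    "reward_weight": 0.50,  # sampled per episode, not a fixed value
}
```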
These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.
A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.
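For contrast, the baseline's time-of-day policy amounts to something like the sketch below (illustrative; the actual heuristic lives behind inference.py --fast-mode):

```python
def heuristic_action(obs: dict) -> dict:
    """Fixed time-of-day HVAC schedule: high during work hours, low otherwise."""
    working_hours = 8 <= obs["hour_of_day"] < 18
    return {
        "hvac_power_level": 0.7 if working_hours else 0.3,
        "thermal_charge_rate": 0.0,  # never exploits price arbitrage
        "batch_job_slot": 0,         # runs batches immediately
        "load_shed_fraction": 0.0,   # ignores grid stress signals
    }
```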
| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
|---|---|---|---|---|
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | — | — | — | — |
GRPO fine-tuned scores will be filled in after the full training run on a T4 GPU completes. The training plots below show live progress from that run.
*Figure: Reward vs. training step (blue = per-step reward, red dashed = smoothed average).*

*Figure: Training loss decreasing over steps, confirming the model is updating.*

*Figure: Grade scores per task, heuristic baseline vs. GRPO-trained LLM.*
Scores are episode grade scores clamped to the 0.0–1.0 range. Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over one episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.
🔄 Live update: GRPO fine-tuned scores will be filled in here immediately after the final training run completes on the T4 GPU.
Run the environment server locally:

```bash
go run main.go
```

Set up your API token:

```bash
cp .env.example .env
# Edit .env with HF_TOKEN
```

Run the agent on each task:

```bash
# Task 1: Cost minimization
python inference.py --task 1 --episodes 5

# Task 2: Temperature management
python inference.py --task 2 --episodes 5

# Task 3: Full demand response
python inference.py --task 3 --episodes 5

# Task 4: Instruction following
python inference.py --task 4 --episodes 5

# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```

Multi-building fleet demo:

```bash
python scripts/multi_building_demo.py
```

Training and plotting:

```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
python scripts/plot_results.py
```

Architecture:

```
Agent (python/inference.py)
  → HTTP POST /step, /reset, /grade
      ↓
Go Environment Server (main.go) → Port 7860
      ↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
      ↓
Web Dashboard (dashboard/server.py) → Port 7861
```
Design philosophy:
- Separation of concerns: Physics engine (Go) decoupled from policy layer (Python)
- OpenEnv compliance: Standardized REST API enables any language agent
- Deterministic simulation: Seeded RNG for reproducible experiments
- Dense rewards: 9-component reward for effective learning
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start new episode |
| POST | /step | Take action step |
| GET | /state | Get current state |
| GET | /grade | Grade episode (0.0-1.0 score) |
| GET | /tasks | Available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |
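A minimal episode loop against the live API (a sketch; the request bodies, including the seed field assumed here for reproducibility, should be verified against openenv.yaml and /info):

```python
import requests

BASE = "https://prajwal782007-gridmind.hf.space"

# Start a new Task 1 (cost) episode; the seed field is an assumption,
# per the deterministic-simulation design note above.
obs = requests.post(f"{BASE}/reset", json={"task": 1, "seed": 42}).json()

for _ in range(96):  # 96 steps = 24 simulated hours at 15-min resolution
    action = {
        "hvac_power_level": 0.5,
        "thermal_charge_rate": 0.0,
        "batch_job_slot": 0,
        "load_shed_fraction": 0.0,
    }
    step = requests.post(f"{BASE}/step", json=action).json()

# Grade the finished episode: returns a 0.0-1.0 score.
print(requests.get(f"{BASE}/grade").json())
```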
```
gridmind-rl/
├── main.go                      # HTTP server & OpenEnv API
├── inference.py                 # Agent entry point (LLM + heuristic)
├── openenv.yaml                 # OpenEnv spec
├── Dockerfile                   # Container build
├── HF_BLOG_POST.md              # Blog write-up
├── baseline_scores.json         # Heuristic baseline scores
├── env/
│   ├── environment.go           # Physics simulation
│   ├── models.go                # Data models
│   ├── rewards.go               # Reward computation
│   ├── tasks.go                 # Task grading
│   └── faults.go                # Fault injection
├── scripts/
│   ├── train_unsloth.py         # GRPO training
│   ├── plot_results.py          # Training curve visualizer
│   ├── multi_building_demo.py   # Fleet AI demo
│   └── gridmind_grpo_colab.ipynb  # Colab training notebook
├── server/
│   └── app.py                   # Python fallback server
├── dashboard/
│   ├── server.py                # Web server (port 7861)
│   └── static/                  # Frontend assets
├── curves/                      # Training curves
│   └── train N/                 # Per-run plots
├── results/                     # Training outputs (generated)
└── README.md
```
- 🤗 HuggingFace Space: GridMind-RL
- 📓 Training Notebook: gridmind_grpo_colab.ipynb
- 📝 Blog Post: Read the write-up
- 🐙 GitHub: Code Repository
MIT License. See LICENSE file.
Questions? Open an issue on GitHub.