
---
title: GridMind-RL
emoji:
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
---

GridMind-RL — Train LLMs to manage industrial buildings under faults, grid stress, and natural language objectives.

OpenEnv Compatible · Go 1.21 · Python 3.11 · Docker Ready · License: MIT


Why This Environment Is Novel

Industrial buildings consume ~40% of global electricity yet rely on naive "always-on" HVAC policies. LLMs can reason about pricing curves, fault alerts, and natural language objectives—but no environment trains them for this. GridMind-RL simulates a full 24-hour building energy system with stochastic electricity prices, equipment faults, and instruction cards, creating a genuinely challenging domain where learned policies translate to real operational value.

Live Demo

| Resource | URL |
| --- | --- |
| Environment API | https://prajwal782007-gridmind.hf.space |
| Live Dashboard | https://prajwal782007-gridmind.hf.space/dashboard |

Quick test:

```bash
curl https://prajwal782007-gridmind.hf.space/health
curl https://prajwal782007-gridmind.hf.space/tasks
```

Environment

| Aspect | Description |
| --- | --- |
| Observation | 13 fields, including temperature, storage level, price, grid stress, carbon intensity, faults, HVAC efficiency, process demand, batch queue, and a price forecast |
| Actions | HVAC level (0-1), thermal charge rate (-1 to 1), batch slot (0-4), load shed fraction (0-0.5) |
| Reward | 9-component weighted sum: cost, temperature, grid, deadline, efficiency, stability, carbon, instruction, fault_mitigation |
| Episode | 96 steps = 24 simulated hours at 15-minute resolution |
| Tasks | 4 tasks: (1) cost, (2) temperature, (3) demand_response, (4) instruction_following |

Reward Weight Rationale

Weights reflect real-world building operator priorities — not arbitrary values:

| Component | Weight | Rationale |
| --- | --- | --- |
| cost_savings | 0.28 | Primary operator KPI: energy spend is the main business metric |
| carbon_reward | 0.20 | ESG compliance, increasingly mandatory for industrial operators |
| temp_constraint | 0.20 | Hard safety constraint: comfort SLA violations incur penalties |
| grid_response | 0.20 | Regulatory SLA: demand response programs pay operators to shed load |
| batch_deadline | 0.12 | Production continuity: missing batch deadlines causes downstream losses |
| efficiency_bonus | 0.05 | Storage arbitrage: incentivises smart charge/discharge timing |
| stability_penalty | -0.05 | Anti-cycling: prevents HVAC thrashing that causes equipment wear |
| task_satisfaction | 0.50* | Task 4 only, weighted per the episode's instruction card |
| fault_mitigation | dynamic | Emergency response, computed from the fault type and the agent's response |

*Task 4 instruction reward weight comes from the sampled instruction card, not a fixed value.
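
Concretely, the per-step shaped reward for tasks 1-3 is a weighted sum over these components. A minimal Python sketch of that arithmetic follows; the authoritative implementation is env/rewards.go, and treating each component score as normalised to [0, 1] is an assumption made here for illustration:

```python
# Hedged sketch of the 9-component weighted reward; see env/rewards.go
# for the real implementation. Component scores are assumed normalised
# to [0, 1]; stability_penalty's negative weight makes HVAC thrashing
# subtract from the total. task_satisfaction (task 4) and
# fault_mitigation use episode-specific weights, so they are omitted.
REWARD_WEIGHTS = {
    "cost_savings":      0.28,
    "carbon_reward":     0.20,
    "temp_constraint":   0.20,
    "grid_response":     0.20,
    "batch_deadline":    0.12,
    "efficiency_bonus":  0.05,
    "stability_penalty": -0.05,
}

def step_reward(components: dict[str, float]) -> float:
    """Weighted sum over the component scores emitted this step."""
    return sum(w * components.get(name, 0.0) for name, w in REWARD_WEIGHTS.items())
```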

Observation Fields

| Field | Type | Description |
| --- | --- | --- |
| indoor_temperature | float | °C |
| thermal_storage_level | float | 0-1 (0 = empty, 1 = full) |
| process_demand | float | Current industrial power demand, kW |
| current_price | float | $/kWh |
| grid_stress_signal | float | 0-1 (>0.7 = critical) |
| carbon_intensity | float | gCO2/kWh |
| hour_of_day | int | 0-23 |
| batch_queue | int[] | Pending job deadline slots |
| cumulative_cost | float | Total $ incurred this episode |
| hvac_efficiency | float | Starts at 1.0, degrades to 0.5 over the episode |
| active_faults | string[] | Active fault alarm strings |
| instruction_card | object | Task 4 objective only |
| price_forecast | float[] | 4-step upcoming price preview |
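
For orientation, a single observation might look like the sketch below. The field names follow the table; the values, the fault string, and the exact JSON envelope returned by the server are illustrative assumptions:

```python
# Illustrative observation; values (and the fault string) are made up,
# and the exact envelope returned by the server may differ.
example_observation = {
    "indoor_temperature": 21.4,          # °C
    "thermal_storage_level": 0.62,       # 62% full
    "process_demand": 310.0,             # kW
    "current_price": 0.32,               # $/kWh (peak pricing)
    "grid_stress_signal": 0.74,          # above the 0.7 critical threshold
    "carbon_intensity": 412.0,           # gCO2/kWh
    "hour_of_day": 18,
    "batch_queue": [2, 4],               # deadline slots of pending jobs
    "cumulative_cost": 1.87,             # $ spent so far this episode
    "hvac_efficiency": 0.81,             # degrading from 1.0 toward 0.5
    "active_faults": ["SENSOR_DRIFT"],   # hypothetical fault string
    "instruction_card": None,            # populated only on task 4
    "price_forecast": [0.32, 0.28, 0.12, 0.08],  # next 4 steps
}
```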

Action Fields

| Field | Type | Range |
| --- | --- | --- |
| hvac_power_level | float | 0.0-1.0 |
| thermal_charge_rate | float | -1.0 to 1.0 |
| batch_job_slot | int | 0-4 |
| load_shed_fraction | float | 0.0-0.5 |
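
Actions go to POST /step as JSON. A minimal sketch with the requests library follows; the field names match the table, but wrapping them under an "action" key is an assumption, so check GET /info for the exact schema:

```python
import requests

BASE = "https://prajwal782007-gridmind.hf.space"

# Charge thermal storage while running HVAC gently and shedding a
# little load. Field names are from the Action Fields table.
action = {
    "hvac_power_level": 0.3,
    "thermal_charge_rate": 1.0,   # +1 = charge at full rate, -1 = discharge
    "batch_job_slot": 0,
    "load_shed_fraction": 0.1,
}

# The request envelope is an assumption; the server may expect the
# fields at the top level instead of under an "action" key.
resp = requests.post(f"{BASE}/step", json={"action": action}, timeout=10)
resp.raise_for_status()
print(resp.json())
```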

Core Capabilities

Multi-Agent Coordination

A single oversight LLM coordinates multiple buildings through price signals. The coordinator reads /feeder to see fleet-wide demand, then sets per-building price multipliers via /coordinate to orchestrate behavior.
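
A hedged sketch of that loop is below; the endpoint names come from the API reference later in this README, but the payload fields (demand_by_building, multipliers) and the 400 kW threshold are illustrative assumptions:

```python
import requests

BASE = "https://prajwal782007-gridmind.hf.space"

# Read fleet-wide demand, then steer buildings with price multipliers.
# Both payload shapes are assumptions; see GET /info for the schema.
feeder = requests.get(f"{BASE}/feeder", timeout=10).json()

multipliers = {}
for building_id, demand_kw in feeder.get("demand_by_building", {}).items():
    # Hypothetical policy: make power pricier for the heaviest loads,
    # nudging those buildings to shed or shift demand.
    multipliers[building_id] = 1.5 if demand_kw > 400 else 1.0

requests.post(f"{BASE}/coordinate", json={"multipliers": multipliers}, timeout=10)
```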

Long-Horizon Instruction Following

Task 4 presents a natural language objective card such as "Keep total energy cost under $2.50 while maintaining 19-23°C". Agents must plan across all 96 steps rather than rely on greedy per-step control.
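
For illustration, an instruction card for that objective could be represented roughly as below; only the $2.50 budget and 19-23°C band come from the example, while the structure itself is a hypothetical sketch:

```python
# Hypothetical instruction_card shape; the real schema is defined by the
# environment's sampled cards (see the instruction_card observation field).
instruction_card = {
    "text": "Keep total energy cost under $2.50 while maintaining 19-23°C",
    "max_cost": 2.50,            # $ budget for the whole 96-step episode
    "temp_range": (19.0, 23.0),  # °C comfort band
}

def on_budget(obs: dict, card: dict) -> bool:
    """Mid-episode check: is the cumulative spend still under budget?"""
    return obs["cumulative_cost"] <= card["max_cost"]
```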

These two capabilities map directly to Theme 1 and Theme 3 of the OpenEnv Hackathon.


Results

What the Agent Learns

A naive heuristic runs HVAC at fixed levels based on time-of-day. After GRPO training on GridMind-RL, the agent learns to charge thermal storage during off-peak hours (4¢/kWh) and discharge during peak (32¢/kWh), voluntarily shed load during grid stress signals above 0.7, and adjust HVAC intensity as efficiency degrades over the episode. None of these behaviors are hardcoded — the agent discovers them through the reward signal alone.

| Policy | Task 1 | Task 2 | Task 3 | Task 4 |
| --- | --- | --- | --- | --- |
| Heuristic Baseline | 0.494 | 0.471 | 0.748 | 0.478 |
| Zero-shot LLM | 0.715 | 0.645 | 0.610 | 0.582 |
| GRPO Fine-tuned LLM | pending | pending | pending | pending |

GRPO fine-tuned scores are pending the full training run on a T4 GPU. The training plots below show live progress from that run.

Reward Curve: reward vs. training step (blue = per-step reward, red dashed = smoothed average).

Loss Curve: training loss decreasing over steps, confirming the model is updating.

Baseline Comparison: grade scores per task, heuristic baseline vs. GRPO-trained LLM.

Scores are episode grade scores clamped to [0.0, 1.0]. Heuristic = fixed policy with no learning. Zero-shot = Qwen2.5-1.5B-Instruct prompted with the task description, no fine-tuning, evaluated over 1 episode per task. Fine-tuned = GRPO-trained on the GridMind-RL environment.

🔄 Live update: GRPO fine-tuned scores will be filled in here immediately after the final training run completes on the T4 GPU.


How to Run

Start the environment server

```bash
go run main.go
```

Run the LLM agent (tasks 1-4)

```bash
# Set up your API token
cp .env.example .env
# Edit .env with HF_TOKEN

# Task 1: Cost minimization
python inference.py --task 1 --episodes 5

# Task 2: Temperature management
python inference.py --task 2 --episodes 5

# Task 3: Full demand response
python inference.py --task 3 --episodes 5

# Task 4: Instruction following
python inference.py --task 4 --episodes 5

# Heuristic baseline (fast, no LLM)
python inference.py --fast-mode --task 3 --episodes 5
```

Run multi-building coordinator demo

```bash
python scripts/multi_building_demo.py
```

Run training (requires GPU)

```bash
python scripts/train_unsloth.py --steps 500 --output-csv results/training_log.csv
```

Generate training curve plot

```bash
python scripts/plot_results.py
```

Architecture

```
Agent (python/inference.py)
    → HTTP POST /step, /reset, /grade
    ↓
Go Environment Server (main.go) → Port 7860
    ↓
Physics Engine (env/environment.go) + Rewards (env/rewards.go) + Tasks (env/tasks.go)
    ↓
Web Dashboard (dashboard/server.py) → Port 7861
```

Design philosophy:

- Separation of concerns: physics engine (Go) decoupled from the policy layer (Python)
- OpenEnv compliance: standardized REST API enables agents in any language
- Deterministic simulation: seeded RNG for reproducible experiments
- Dense rewards: 9-component reward for effective learning

API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /ping | Liveness probe |
| POST | /reset | Start a new episode |
| POST | /step | Take an action step |
| GET | /state | Get current state |
| GET | /grade | Grade the episode (0.0-1.0 score) |
| GET | /tasks | List available tasks |
| GET | /metrics | Prometheus metrics |
| GET | /replay | Episode history |
| GET | /feeder | Aggregate fleet state |
| POST | /coordinate | Set price multipliers |
| POST | /simulate | World model prediction |
| POST | /coordinator/reset | Reset a multi-building episode |
| POST | /coordinator/step | Step with per-building actions |
| GET | /info | OpenEnv metadata |
| GET | /ws | WebSocket endpoint |
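
Putting the core endpoints together, one episode looks roughly like the sketch below. The reset payload, the response keys, and the cheap-power heuristic are all assumptions rather than the documented schema:

```python
import requests

BASE = "https://prajwal782007-gridmind.hf.space"

# Start a fresh episode on task 1 (cost). The "task" field and the
# response keys used below are assumptions; see GET /info for the schema.
obs = requests.post(f"{BASE}/reset", json={"task": 1}, timeout=10).json()

for _ in range(96):  # one episode = 96 fifteen-minute steps
    cheap = obs.get("current_price", 0.0) < 0.10
    action = {  # hypothetical policy: work harder while power is cheap
        "hvac_power_level": 0.8 if cheap else 0.2,
        "thermal_charge_rate": 1.0 if cheap else -0.5,
        "batch_job_slot": 0,
        "load_shed_fraction": 0.0,
    }
    step = requests.post(f"{BASE}/step", json={"action": action}, timeout=10).json()
    obs = step.get("observation", step)

print(requests.get(f"{BASE}/grade", timeout=10).json())  # 0.0-1.0 score
```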

Project Structure

```
gridmind-rl/
├── main.go                    # HTTP server & OpenEnv API
├── inference.py               # Agent entry point (LLM + heuristic)
├── openenv.yaml               # OpenEnv spec
├── Dockerfile                 # Container build
├── HF_BLOG_POST.md            # Blog write-up
├── baseline_scores.json       # Heuristic baseline scores
├── env/
│   ├── environment.go         # Physics simulation
│   ├── models.go              # Data models
│   ├── rewards.go             # Reward computation
│   ├── tasks.go               # Task grading
│   └── faults.go              # Fault injection
├── scripts/
│   ├── train_unsloth.py       # GRPO training
│   ├── plot_results.py        # Training curve visualizer
│   ├── multi_building_demo.py # Fleet AI demo
│   └── gridmind_grpo_colab.ipynb  # Colab training notebook
├── server/
│   └── app.py                 # Python fallback server
├── dashboard/
│   ├── server.py              # Web server (port 7861)
│   └── static/                # Frontend assets
├── curves/                    # Training curves
│   └── train N/               # Per-run plots
├── results/                   # Training outputs (generated)
└── README.md
```

License

MIT License. See LICENSE file.


Questions? Open an issue on GitHub.
