RL training environments with verifiable rewards for coding agents. Works with TRL, Unsloth, verl, OpenRLHF.
Updated Apr 24, 2026 - Python
[NeurIPS MAR 2024] Official codebase for "Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding"
Verifiable RL rewards for LLMs
Author verifiable eval records through a draft → review → revise → submit loop with server-enforced graders; compile to JSONL/CSV/Inspect/lm-eval via MCP. STDIO or Streamable HTTP.
Outcome-verified agent trajectories, benchmarks, and RL environments — with a live leaderboard and a CI gate for your agents. Offline-first, MIT.
RLVR on FinQA: a verifiable numeric reward for financial reasoning, and a like-for-like online (GRPO) vs offline (DPO) comparison on the same signal. Leads with the reward-design and headroom findings.
GRPO reinforcement-learning fine-tuning, implemented from scratch in numpy (CPU-verified) + a TRL/Qwen GPU path on Modal. CLI-first, MLflow-tracked.
GRPO + verifiable rewards — teaching a small LM to reason over structured data on one GPU.
A verifiable-reward agentic benchmark: does an LLM correctly allocate verification when orchestrating a fallible biology foundation model?
An RL environment where a physics simulator (ngspice) is the verifiable reward — training a 3B LLM to design analog circuits. RLVR/GRPO, 2 decontaminated benchmarks, reward-hacking audit.
Reinforcement learning with verifiable rewards, using the On-Line Encyclopedia of Integer Sequences (OEIS) to generate diverse Python programming tasks.
Finance-domain LLM-Wiki: a system that automatically extracts, verifies, and compounds methodologies from YouTube investing channels. Karpathy LLM-Wiki + Verifiable Rewards (forward returns).