verifiable-rewards

Here are 12 public repositories matching this topic...

DeepGym / deepgym

RL training environments with verifiable rewards for coding agents. Works with TRL, Unsloth, verl, OpenRLHF.

python machine-learning reinforcement-learning deep-learning sandbox evaluation rl code-execution ai-agents daytona llm unsloth coding-agents grpo verifiable-rewards openrlhf reward-function grpo-training

Updated Apr 24, 2026
Python

AlaaLab / Dr-LLaVA

[ NeurIPS MAR 2024 ] Official Codebase for "Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding"

post-training reinforcement-learning-from-human-feedback multimodal-large-language-models verifiable-rewards medical-vlms

Updated Dec 4, 2024
Python

Think-a-Tron / zeno

Verifiable RL rewards for LLMs

reinforcement-learning rewards verifiable-rewards

Updated May 20, 2025
Python

cyanheads / evals-mcp-server

Author verifiable eval records through a draft → review → revise → submit loop with server-enforced graders; compile to JSONL/CSV/Inspect/lm-eval via MCP. STDIO or Streamable HTTP.

typescript mcp evaluation grader evals model-context-protocol rlvr verifiable-rewards inspect-ai lm-eval-harness cyanheads

Updated Jun 28, 2026
TypeScript

agentsynth / agentsynth

Outcome-verified agent trajectories, benchmarks, and RL environments — with a live leaderboard and a CI gate for your agents. Offline-first, MIT.

benchmark leaderboard datasets trajectories synthetic-data tool-use rl-environments llm-finetuning llm-as-judge agentic-ai agent-evaluation verifiable-rewards

Updated Jun 25, 2026
Python

RLVR on FinQA: a verifiable numeric reward for financial reasoning, and a like-for-like online (GRPO) vs offline (DPO) comparison on the same signal. Leads with the reward-design and headroom findings.

post-training financial-nlp rlvr verifiable-rewards

Updated Jun 20, 2026
Python

RubenHaisma / rl-studio

GRPO reinforcement-learning fine-tuning, implemented from scratch in NumPy (CPU-verified) + a TRL/Qwen GPU path on Modal. CLI-first, MLflow-tracked.

Updated Jun 8, 2026
Python

CanadaApollo6 / tabular-reasoning-grpo

GRPO + verifiable rewards — teaching a small LM to reason over structured data on one GPU.

reinforcement-learning post-training trl rlhf vllm grpo verifiable-rewards

Updated Jun 24, 2026
Python

jang1563 / verify-or-trust

A verifiable-reward agentic benchmark: does an LLM correctly allocate verification when orchestrating a fallible biology foundation model?

benchmark evaluation computational-biology single-cell agents ai-safety perturbation foundation-models llm verifiable-rewards

Updated Jun 18, 2026
Python

nithingovindugari / phychip

An RL environment where a physics simulator (ngspice) is the verifiable reward — training a 3B LLM to design analog circuits. RLVR/GRPO, 2 decontaminated benchmarks, reward-hacking audit.

reinforcement-learning eda spice ngspice post-training analog-circuits llm grpo rlvr verifiable-rewards