llm-reliability

Here are 8 public repositories matching this topic...

Eatosin / Structura

Turn Chaos Into Structure. A Type-Safe AI Agent that extracts valid JSON from unstructured data using PydanticAI, FastHTML, and Gemini 2.5.

data-extraction fasthtml ai-agents unstructured-data type-safe-builder gemini-ai pydanticai self-healing-ai llm-reliability json-extraction

Updated Jan 10, 2026
Python

North-Shore-AI / crucible_examples

Interactive Phoenix LiveView demonstrations of the Crucible Framework - showcasing ensemble voting, request hedging, statistical analysis, and more with mock LLMs

Updated Apr 23, 2026
Elixir

MukundaKatta / lightweight-agent-eval-paper

Public artifact bundle for the preprint 'Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents'

benchmarking ai-agents preprint agent-evaluation llm-reliability research-artifacts

Updated May 5, 2026
Python

hsieh89t-cloud / legal-agent-reliability-benchmark

Reliability and hallucination mitigation research for tool-augmented legal AI agents using QC-Sentinel verification architecture.

benchmark ai-safety ai-agents legal-ai openai-api prompt-engineering llm-reliability

Updated Mar 6, 2026

TianbaoZhang001 / OpenCAAF

Reference implementation of CAAF — three-pillar agent framework with monotonic convergence.

constraint-satisfaction industrial-ai agent-framework llm-agents agentic-ai llm-reliability deterministic-ai

Updated Apr 28, 2026
Python

shashidharReddy866 / llm-evaluation-system

Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.

nlp json-schema nextjs model-evaluation hono structured-output few-shot-learning ai-evaluation prompt-engineering anthropic llm-evaluation hallucination-detection llm-reliability eval-harness prompt-comparison

Updated May 1, 2026
TypeScript

MukundaKatta / lightweight-eval-scorecards-paper

Preprint paper package — Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Zenodo DOI 10.5281/zenodo.20034550)

research open-science ai-agents preprint tool-use agent-evaluation llm-reliability workflow-evaluation artifact-paper operational-scorecards

Updated May 7, 2026
Python

North-Shore-AI / crucible_framework

CrucibleFramework: A scientific platform for LLM reliability research on the BEAM

documentation machine-learning elixir otp research ai reproducible-research beam reliability ai-research ensemble-methods statistical-testing research-framework experiment-framework llm llm-testing llm-reliability nshkr-crucible

Updated Apr 4, 2026
Elixir

