A Python library for evaluating LLM responses using Google Gemini as an automated judge.
llm-eval provides a simple way to score LLM responses on five key dimensions: accuracy, relevance, coherence, hallucination risk, and conciseness. It uses Google's Gemini model to perform the evaluation and returns structured JSON results.
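To make the result shape concrete, here is a minimal sketch that checks a JSON result for the five dimensions, each with a score and a reason. The names `DIMENSIONS`, `DimensionScore`, and `parse_evaluation` are hypothetical and are not part of llm-eval; the key names are taken from the example output in this README.

```python
import json
from dataclasses import dataclass

# Dimension keys as they appear in this README's example output.
DIMENSIONS = ("accuracy", "relevance", "coherence", "hallucination_risk", "conciseness")


@dataclass(frozen=True)
class DimensionScore:
    score: int
    reason: str


def parse_evaluation(raw: str) -> dict[str, DimensionScore]:
    """Parse a JSON result and verify that every expected dimension is present."""
    data = json.loads(raw)
    missing = [name for name in DIMENSIONS if name not in data]
    if missing:
        raise ValueError(f"missing dimensions: {', '.join(missing)}")
    return {
        name: DimensionScore(score=int(data[name]["score"]), reason=str(data[name]["reason"]))
        for name in DIMENSIONS
    }
```

A check like this can be useful because judge models occasionally return incomplete JSON, and failing early is easier to debug than a missing key further downstream.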
- Clone the repository:
  git clone <repository-url>
  cd llm-eval
- Install dependencies:
  pip install -r requirements.txt
Create a .env file in the project root with your Gemini API key:
GEMINI_API_KEY=your_api_key_here
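How llm-eval itself reads this file is not shown here. As a rough, standard-library-only sketch of the idea, the hypothetical helpers `load_env_file` and `get_gemini_api_key` below copy `KEY=VALUE` pairs from `.env` into the process environment and fail early if the key is missing.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE pairs from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values already set in the real environment take precedence.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_gemini_api_key() -> str:
    """Return GEMINI_API_KEY or fail with a clear message."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment")
    return key
```

Keep the `.env` file out of version control so the key is not committed by accident.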
Use the command-line interface to evaluate responses:
python cli.py --question "What is the capital of France?" --response "Paris"
Start the FastAPI server:
uvicorn app:app --reload
Send a POST request to /evaluate:
curl -X POST "http://localhost:8000/evaluate" \
-H "Content-Type: application/json" \
-d '{"question": "What is the capital of France?", "response": "Paris"}'
Example output:
{
"accuracy": {
"score": 10,
"reason": "The response is factually correct."
},
"relevance": {
"score": 10,
"reason": "Directly answers the question."
},
"coherence": {
"score": 9,
"reason": "Clear and well-structured."
},
"hallucination_risk": {
"score": 10,
"reason": "No unsupported information."
},
"conciseness": {
"score": 10,
"reason": "Brief and to the point."
}
}
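The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server is running at the local address from the curl example; `EVALUATE_URL`, `build_request`, `evaluate`, and `mean_score` are hypothetical names, not part of llm-eval.

```python
import json
import urllib.request

# Assumed local address, matching the curl example in this README.
EVALUATE_URL = "http://localhost:8000/evaluate"


def build_request(question: str, response: str, url: str = EVALUATE_URL) -> urllib.request.Request:
    """Build the same POST request as the curl example (hypothetical helper)."""
    body = json.dumps({"question": question, "response": response}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(question: str, response: str) -> dict:
    """Send the request and return the parsed JSON result (needs a running server)."""
    with urllib.request.urlopen(build_request(question, response)) as resp:
        return json.load(resp)


def mean_score(result: dict) -> float:
    """Average the per-dimension scores of a parsed result."""
    scores = [entry["score"] for entry in result.values()]
    return sum(scores) / len(scores)
```

With the server running, `evaluate("What is the capital of France?", "Paris")` should return a dictionary shaped like the example output, and `mean_score` reduces it to a single number for quick comparisons between responses.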