A benchmarking platform that compares the reasoning quality, latency, and accuracy of three LLM inference paradigms: LLM-only, Vector RAG, and GraphRAG. It is built with TigerGraph, Groq, and FastAPI, and runs all three pipelines against a dense corpus of medical research papers on diabetes.
graph TD
%% Styling
classDef frontend fill:#1E293B,stroke:#4ADE80,stroke-width:2px,color:white;
classDef backend fill:#1E293B,stroke:#F472B6,stroke-width:2px,color:white;
classDef database fill:#0F172A,stroke:#38BDF8,stroke-width:2px,color:white;
classDef llm fill:#451A03,stroke:#FBBF24,stroke-width:2px,color:white;
classDef evaluator fill:#14532D,stroke:#A3E635,stroke-width:2px,color:white;
%% Components
UI["🖥️ React / Vite Dashboard"]:::frontend
API["⚡ FastAPI Backend"]:::backend
%% Databases
VectorDB[("📚 ChromaDB Vector Store")]:::database
TG[("🐯 TigerGraph Cloud")]:::database
%% LLMs
Groq1["🧠 Groq Llama 3.1"]:::llm
Groq2["🧠 Groq Llama 3.1"]:::llm
Groq3["🧠 Groq Llama 3.1"]:::llm
%% Evaluators
Judge["⚖️ Gemini 2.5 Flash Judge"]:::evaluator
Bert["📊 BERTScore Evaluator"]:::evaluator
%% Flow: Request
UI -- "User Query" --> API
%% Flow: Parallel Pipelines
API -- "Pipeline 1: No Context" ---> Groq1
API -- "Pipeline 2: Semantic Search" --> VectorDB
VectorDB -- "Context Chunks" --> Groq2
API -- "Pipeline 3: Graph Traversal" --> TG
TG -- "Multi-hop Graph Context" --> Groq3
%% Flow: Evaluation
Groq1 -- "Answer 1" --> Judge
Groq2 -- "Answer 2" --> Judge
Groq3 -- "Answer 3" --> Judge
Groq1 -- "Answer 1" --> Bert
Groq2 -- "Answer 2" --> Bert
Groq3 -- "Answer 3" --> Bert
%% Flow: Response
Judge -- "Pass/Fail" --> API
Bert -- "Score" --> API
API -- "Aggregated Metrics & Results" --> UI
- Three Reasoning Engines Head-to-Head:
  - LLM-Only: Parametric memory only, no retrieval.
  - Basic RAG: Standard vector retrieval using sentence embeddings over the document corpus.
  - GraphRAG: TigerGraph-powered multi-hop retrieval that traverses entities and relationships to assemble connected context for the LLM.
- Blazing Fast Inference: Powered by Groq's LPU inference engine running llama-3.1-8b-instant.
- Live LLM Judge: Every answer is strictly graded by an automated LLM judge (Gemini 2.5 Flash) for factual consistency and hallucination detection.
- BERTScore Evaluation: Measures semantic similarity against a highly grounded reference answer.
- Beautiful Dashboard: A modern React/Vite UI that renders answers, latencies, token counts, and judge evaluations in real time.
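To make the BERTScore feature above more concrete, here is a minimal sketch of the greedy-matching computation at its core: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into an F1. The 2-D vectors are made-up toy values, not output from a real model, and `greedy_match_f1` is a hypothetical helper name. The optional IDF weighting and baseline rescaling of the full metric are left out, and this is not necessarily how the project computes its score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_f1(candidate_vecs, reference_vecs):
    """BERTScore-style precision, recall, and F1 over token embeddings."""
    if not candidate_vecs or not reference_vecs:
        return 0.0, 0.0, 0.0
    # Precision: best reference match for every candidate token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: best candidate match for every reference token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "token embeddings" (hypothetical values for illustration only).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0]]
p, r, f1 = greedy_match_f1(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the candidate covers only one of the two reference tokens, so precision is perfect while recall is halved. An answer that omits part of the reference is penalized in the same way.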
- Graph Database: TigerGraph Cloud
- LLM Engine: Groq (Llama 3.1)
- Evaluator Judge: Google Gemini (2.5 Flash)
- Backend API: FastAPI / Python
- Frontend UI: React + Vite + TailwindCSS + Framer Motion
- Embeddings: SentenceTransformers / ChromaDB (for basic RAG)
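As a rough illustration of what the embeddings layer does for basic RAG, the sketch below ranks document chunks by cosine similarity to a query vector and keeps the top k. It is a dependency-free stand-in: in the actual stack SentenceTransformers would produce the vectors and ChromaDB would perform the search. The 3-D vectors, chunk labels, and the `top_k` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Placeholder chunks with made-up embeddings (not real model output).
chunks = [
    ("chunk A: insulin resistance and obesity", [0.9, 0.1, 0.0]),
    ("chunk B: treatment guidelines", [0.1, 0.9, 0.0]),
    ("chunk C: screening schedules", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of the user query

context = top_k(query_vec, chunks, k=2)
print(context)
```

The retrieved chunks would then be placed in the prompt as context for the LLM. Each chunk is scored independently of the others, so this kind of retrieval cannot follow relationships between chunks, which is the gap the GraphRAG pipeline targets.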
You will need API keys for Groq and Gemini, as well as a TigerGraph Cloud instance.
Create a .env file in the root directory with the following variables:
# TigerGraph Configuration
TG_HOST=https://your-tigergraph-domain.i.tgcloud.io
TG_USERNAME=your_email@domain.com
TG_TGCLOUD=true
TG_SECRET=your_secret_key
TG_GRAPH=your_graph_name
# LLM Providers
GROQ_API_KEY=gsk_your_groq_key_here
GEMINI_API_KEY=your_gemini_key_here
# Model Selection (Defaults provided)
LLM_MODEL=llama-3.1-8b-instant
JUDGE_MODEL=gemini-2.5-flash
Set up a virtual environment and run the backend API.
# Create and activate virtual environment
python -m venv .venv
.\.venv\Scripts\activate # Windows
# source .venv/bin/activate # Mac/Linux
# Install dependencies (ensure fastapi, uvicorn, groq, google-genai, etc. are installed)
pip install -r requirements.txt
# Start the FastAPI benchmark server
python -m uvicorn benchmark_api:app --host 127.0.0.1 --port 8010
Navigate to the UI folder, install dependencies, and start the development server.
cd graphrag/graphrag-ui
npm install
npm run dev
Visit http://localhost:5173 to access the Benchmark Dashboard and run your first comparison!
- User Query: The user asks a complex question (e.g., "How does insulin resistance relate to obesity?").
- Parallel Processing: The FastAPI backend routes the query to three pipelines simultaneously:
  - Direct LLM inference.
  - Vector Search -> LLM synthesis.
  - TigerGraph searchDocuments -> LLM synthesis.
- Reference Generation: A highly grounded reference answer is generated.
- Evaluation: Gemini 2.5 Flash acts as a judge, assigning a strict PASS/FAIL based on material correctness. BERTScore measures semantic similarity to the reference.
- Results Rendering: The UI displays latency, token cost, text output, and the judge's verdict side-by-side.
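The parallel-processing step above can be sketched with `asyncio.gather`, which runs the three pipelines concurrently and returns their results in the order they were submitted. This is a minimal sketch under assumptions: the pipeline functions are stubs that sleep instead of calling Groq, ChromaDB, or TigerGraph, and all names (`llm_only`, `vector_rag`, `graph_rag`, `timed`, `run_benchmark`) are hypothetical rather than taken from the project's code.

```python
import asyncio
import time


async def llm_only(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a direct LLM call
    return f"[llm-only] {query}"


async def vector_rag(query: str) -> str:
    await asyncio.sleep(0.02)  # stand-in for vector search + LLM synthesis
    return f"[vector-rag] {query}"


async def graph_rag(query: str) -> str:
    await asyncio.sleep(0.03)  # stand-in for graph retrieval + LLM synthesis
    return f"[graph-rag] {query}"


async def timed(name, pipeline, query):
    """Run one pipeline and record its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    answer = await pipeline(query)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"pipeline": name, "answer": answer, "latency_ms": latency_ms}


async def run_benchmark(query: str):
    """Fan the query out to all three pipelines concurrently."""
    pipelines = {
        "llm_only": llm_only,
        "vector_rag": vector_rag,
        "graph_rag": graph_rag,
    }
    tasks = [timed(name, fn, query) for name, fn in pipelines.items()]
    # gather preserves submission order, so results line up with `pipelines`.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_benchmark("How does insulin resistance relate to obesity?"))
for row in results:
    print(f"{row['pipeline']:<11} {row['latency_ms']:.1f} ms  {row['answer']}")
```

Because the stubs run concurrently, the total wall-clock time is close to that of the slowest pipeline rather than the sum of all three. Each pipeline's own latency is still measured separately, so the side-by-side comparison stays meaningful.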
