Zem-0/Autonomous-Code-Review-Agent

Autonomous Code Review Agent

Three AI models debate your code. No human input between passes.

A multi-agent code review system where Groq, Mistral, and NVIDIA NIM work together in an autonomous pipeline — each model challenging and building on the previous one's findings — to produce a structured, verified code quality report.

Supports single files, multi-file projects, entire directories, and GitHub PRs. Comes with both a CLI and a dark-themed Gradio web UI with live streaming output.


Demo

>> Scanner  (Groq / Llama 3.3 70B)     →  Found 11 raw issues
>> Critic   (Mistral Large)             →  Confirmed 9, rejected 2 false positives, added 3 missed
>> Arbitrator (NVIDIA / Llama 3.1 70B) →  Final report: Score 42/100

How It Works

Unlike single-model reviews, this system runs a 3-agent debate. Each agent has a different role, receives the previous agent's output as context, and is allowed to challenge, reject, or extend earlier findings before the final report is written.

flowchart TD
    INPUT([Code Input\nfile · files · directory · GitHub PR · paste])
    INPUT --> SCANNER

    subgraph AGENT1 ["🔍  Agent 1 — Scanner  (Groq · Llama 3.3 70B)"]
        SCANNER[Reads raw code\nFinds ALL potential issues\nCasts wide net — over-reporting is OK]
    end

    SCANNER -->|"JSON array of raw issues"| AGENT2

    subgraph AGENT2 ["🧐  Agent 2 — Critic  (Mistral Large)"]
        CRITIC[Receives code + Scanner findings\nConfirms real issues\nRejects false positives with reasoning\nAdds issues Scanner missed]
    end

    AGENT2 -->|"Verified + enriched issue list"| AGENT3

    subgraph AGENT3 ["⚖️  Agent 3 — Arbitrator  (NVIDIA NIM · Llama 3.1 70B)"]
        ARBITRATOR[Receives full debate transcript\nResolves conflicts between agents\nWrites concrete fix suggestions\nAssigns quality score 0–100\nProduces executive summary]
    end

    ARBITRATOR --> REPORT([Final Report\nMarkdown · JSON · Web UI · Terminal])

Agent Pipeline

Each agent has a fixed role and a fixed provider:

| Agent | Provider | Model | Responsibility |
| --- | --- | --- | --- |
| Scanner | Groq | llama-3.3-70b-versatile | Exhaustive, fast issue discovery that finds everything, including edge cases |
| Critic | Mistral | mistral-large-latest | Challenges Scanner findings, eliminates false positives, adds missed issues |
| Arbitrator | NVIDIA NIM | meta/llama-3.1-70b-instruct | Reads the full debate, resolves disagreements, writes the final authoritative report |
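
As a rough illustration of the fixed role-to-provider mapping above, the roles could be modelled as immutable records. This is only a sketch: `AgentRole`, `PIPELINE`, and `describe` are hypothetical names, and the real definitions in agents.py may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """One stage of the review pipeline (hypothetical structure)."""
    name: str
    provider: str
    model: str
    responsibility: str


# Order matters: each agent receives the output of the one before it.
PIPELINE = (
    AgentRole("Scanner", "groq", "llama-3.3-70b-versatile",
              "Find every potential issue"),
    AgentRole("Critic", "mistral", "mistral-large-latest",
              "Confirm, reject, or add issues"),
    AgentRole("Arbitrator", "nvidia", "meta/llama-3.1-70b-instruct",
              "Resolve disagreements and write the final report"),
)


def describe(pipeline=PIPELINE) -> str:
    """Return a one-line summary of the pipeline order."""
    return " -> ".join(f"{role.name} ({role.model})" for role in pipeline)
```

Freezing the dataclass reflects the "fixed role, fixed provider" idea: a role cannot be reassigned to another model at runtime by accident.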

Why three models?

flowchart LR
    A["Single model\nreviews code"] -->|"One perspective\nNo verification\nHigh false-positive rate"| B["Report"]

    C["Scanner\nfinds issues"] --> D["Critic\nchallenges them"] --> E["Arbitrator\nfinal verdict"] --> F["Verified Report"]

    style A fill:#ef4444,color:#fff
    style B fill:#ef4444,color:#fff
    style C fill:#6366f1,color:#fff
    style D fill:#f97316,color:#fff
    style E fill:#22c55e,color:#fff
    style F fill:#22c55e,color:#fff

The Critic's job is to be skeptical: it actively tries to disprove the Scanner's findings. Scanner findings that the Critic rejects are dropped, while issues the Critic confirms or adds itself move on to the Arbitrator, which resolves any remaining disagreements with a senior engineer's judgment.
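
The debate described above can be sketched as three chained calls, with each agent injected as a plain function so the flow can be exercised without any API. Everything here is an assumption made for illustration: `run_debate`, the `fake_*` stand-ins, and the `confirmed` / `rejected` / `added` keys are not taken from the project's code.

```python
from typing import Callable

Issue = dict  # e.g. {"id": "s1", "description": "..."}


def run_debate(
    code: str,
    scan: Callable[[str], list],
    critique: Callable[[str, list], dict],
    arbitrate: Callable[[str, list, dict], dict],
) -> dict:
    """Run Scanner -> Critic -> Arbitrator, passing earlier output forward."""
    raw_issues = scan(code)                     # pass 1: cast a wide net
    verdict = critique(code, raw_issues)        # pass 2: challenge the findings
    return arbitrate(code, raw_issues, verdict)  # pass 3: final report


# Stand-in agents; a real run would call the three providers instead.
def fake_scan(code: str) -> list:
    return [
        {"id": "s1", "description": "division by zero"},
        {"id": "s2", "description": "unused import"},
    ]


def fake_critique(code: str, raw: list) -> dict:
    return {
        "confirmed": [raw[0]],
        "rejected": [raw[1]],
        "added": [{"id": "c1", "description": "no input validation"}],
    }


def fake_arbitrate(code: str, raw: list, verdict: dict) -> dict:
    # Only confirmed and newly added issues reach the final report.
    return {
        "issues": verdict["confirmed"] + verdict["added"],
        "rejected_count": len(verdict["rejected"]),
    }
```

Calling `run_debate("def divide(a, b): return a/b", fake_scan, fake_critique, fake_arbitrate)` yields a report containing `s1` and `c1`, with one rejected finding.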


System Architecture

flowchart TD
    subgraph INPUT_LAYER ["Input Layer  (inputs.py)"]
        F1[Single File]
        F2[Multiple Files]
        F3[Project Directory]
        F4[GitHub PR]
        F5[Pasted Code]
        COMBINE[combine_files\nBuilds multi-file context\nwith ### File: headers]
        F1 & F2 & F3 & F4 & F5 --> COMBINE
    end

    subgraph PROVIDER_LAYER ["Provider Layer  (providers.py)"]
        GROQ_CLIENT[Groq Client\nOpenAI SDK → api.groq.com]
        MISTRAL_CLIENT[Mistral Client\nOpenAI SDK → api.mistral.ai]
        NVIDIA_CLIENT[NVIDIA Client\nOpenAI SDK → integrate.api.nvidia.com]
        RETRY[Tenacity retry\n3 attempts · exponential backoff]
        GROQ_CLIENT & MISTRAL_CLIENT & NVIDIA_CLIENT --> RETRY
    end

    subgraph REVIEW_LAYER ["Review Layer  (multi_reviewer.py)"]
        PASS1[Pass 1 — Scanner\nSCANNER_PROMPT]
        PASS2[Pass 2 — Critic\nCRITIC_PROMPT\nincludes Pass 1 output]
        PASS3[Pass 3 — Arbitrator\nARBITRATOR_PROMPT\nincludes Pass 1 + Pass 2 output]
        PASS1 -->|raw issues JSON| PASS2
        PASS2 -->|verified issues JSON| PASS3
    end

    subgraph PARSE_LAYER ["Parse Layer  (parsers.py)"]
        P1[parse_issues\nJSON array → list of Issue]
        P2[parse_verified_issues\nJSON array → confirmed + rejected]
        P3[parse_report\nJSON object → ReviewReport\nhandles array fallback]
    end

    subgraph OUTPUT_LAYER ["Output Layer"]
        CLI[CLI — Rich tables\nagent.py]
        WEB[Web UI — Gradio\napp.py\nLive streaming]
        MD[Markdown Report\nreport.py]
        JSON_OUT[JSON Report\nreport.py]
    end

    INPUT_LAYER --> REVIEW_LAYER
    REVIEW_LAYER <--> PROVIDER_LAYER
    REVIEW_LAYER --> PARSE_LAYER
    PARSE_LAYER --> OUTPUT_LAYER
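
To make the provider layer more concrete, here is a minimal standard-library sketch of a provider table and a "3 attempts with exponential backoff" helper. The project itself uses Tenacity and the OpenAI SDK; this stand-in avoids both so it runs anywhere. The full base URLs are assumptions (the diagram only names the hosts), and `PROVIDERS` and `call_with_retry` are hypothetical names.

```python
import time

# Assumed OpenAI-compatible base URLs; only the host names come from the README.
PROVIDERS = {
    "groq": {"base_url": "https://api.groq.com/openai/v1", "env_key": "GROQ_API_KEY"},
    "mistral": {"base_url": "https://api.mistral.ai/v1", "env_key": "MISTRAL_API_KEY"},
    "nvidia": {"base_url": "https://integrate.api.nvidia.com/v1", "env_key": "NVIDIA_API_KEY"},
}


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last error once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; Tenacity's own wait strategies may differ in detail, for example by adding jitter.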

Multi-File Flow

When reviewing connected files, all files are combined into one structured context and sent through the same 3-agent pipeline. The Scanner is explicitly instructed to look for cross-file issues.

flowchart TD
    subgraph FILES ["Project Files"]
        FA[models.py]
        FB[api.py]
        FC[utils.py]
        FD[auth.py]
    end

    FILES --> COMBINER

    subgraph COMBINER ["combine_files  (inputs.py)"]
        HEADER["## Multi-File Project Review\n[1] models.py  Python\n[2] api.py     Python\n..."]
        BLOCKS["### File: models.py\n```python\n...\n```\n---\n### File: api.py\n```python\n...\n```"]
        HEADER --> BLOCKS
    end

    COMBINER --> SCANNER_MULTI

    subgraph SCANNER_MULTI ["Scanner — Cross-File Awareness"]
        XF1[Broken imports between files]
        XF2[API contract mismatches]
        XF3[Type inconsistencies across modules]
        XF4[Circular dependency risks]
        XF5[Dead code — exported but never imported]
        XF6[Shared mutable state across files]
    end

    SCANNER_MULTI --> CRITIC_MULTI[Critic verifies\ncross-file findings]
    CRITIC_MULTI --> ARBITRATOR_MULTI[Arbitrator writes\nfinal multi-file report]
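
The combined context shown in the diagram could be produced along these lines. This is a sketch under assumptions: the real `combine_files` in inputs.py may take different arguments, detect the language per file, and lay out the index differently.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly inside Markdown


def combine_files(files: dict, language: str = "python") -> str:
    """Merge {filename: source} into one prompt-ready context string."""
    # Numbered index of all files, as in the diagram's header block.
    header = ["## Multi-File Project Review"]
    for index, name in enumerate(files, start=1):
        header.append(f"[{index}] {name}  {language.capitalize()}")

    # One fenced block per file, each introduced by a "### File:" heading.
    blocks = [
        f"### File: {name}\n{FENCE}{language}\n{source.rstrip()}\n{FENCE}"
        for name, source in files.items()
    ]
    return "\n".join(header) + "\n\n" + "\n---\n".join(blocks)
```

Keeping the file name directly above each block gives the Scanner a stable label to cite when it reports a cross-file issue.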

Auto-excluded from directory scans: __pycache__, node_modules, .git, .venv, dist, build, target, .next, and all binary/lock files.
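
A directory scan with these exclusions might be sketched as follows. The directory names come from the list above; `SKIPPED_SUFFIXES` is an assumed sample rather than the project's actual binary and lock-file list, and both function names are hypothetical.

```python
import os

EXCLUDED_DIRS = {
    "__pycache__", "node_modules", ".git", ".venv",
    "dist", "build", "target", ".next",
}
# Assumed sample of binary/lock suffixes, for illustration only.
SKIPPED_SUFFIXES = (".lock", ".pyc", ".png", ".jpg", ".zip")


def is_reviewable(rel_path: str) -> bool:
    """True if no parent directory is excluded and the suffix is allowed."""
    parts = rel_path.replace("\\", "/").split("/")
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return False
    return not parts[-1].endswith(SKIPPED_SUFFIXES)


def iter_source_files(root: str):
    """Yield reviewable file paths relative to root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning in place stops os.walk from descending into excluded dirs.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if is_reviewable(rel):
                yield rel
```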


Input Modes

| Mode | CLI | Web UI |
| --- | --- | --- |
| Paste code | --code "def foo(): ..." | Paste Code tab |
| Single file | --file app.py | Upload Files → pick 1 file |
| Multiple files | --files models.py api.py utils.py | Upload Files → Ctrl+click multiple |
| Entire directory | --dir ./src | Upload Files → enter folder path |
| GitHub PR | --github https://github.com/owner/repo/pull/42 | GitHub PR tab |
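
For the GitHub PR mode, the URL has to be split into owner, repository, and PR number before the diff can be fetched. A minimal sketch of that step, assuming the URL shape shown in the table; `parse_pr_url` is a hypothetical helper, not a function from inputs.py.

```python
import re

PR_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/pull/(?P<number>\d+)/?$"
)


def parse_pr_url(url: str) -> tuple:
    """Split a pull request URL into (owner, repo, number)."""
    match = PR_URL.match(url.strip())
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {url!r}")
    return match["owner"], match["repo"], int(match["number"])
```

Rejecting anything that does not match early keeps a mistyped URL from turning into a confusing API error later.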

Installation

Prerequisites

Steps

# 1. Clone the repo
git clone https://github.com/Zem-0/Autonomous-Code-Review-Agent.git
cd Autonomous-Code-Review-Agent

# 2. Install dependencies
pip install -r requirements.txt

# 3. Configure API keys
cp .env.example .env
# Edit .env and add your three keys

Configuration

Copy .env.example to .env and fill in your keys:

GROQ_API_KEY=gsk_...
MISTRAL_API_KEY=...
NVIDIA_API_KEY=nvapi-...

# Optional — only needed for GitHub PR reviews
GITHUB_TOKEN=ghp_...

You can also pass keys directly via CLI flags or enter them in the web UI — the .env file is a convenience default.
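
The precedence described here (an explicit key wins, the environment is the fallback) can be sketched in a few lines. `resolve_key` is a hypothetical helper; the project may resolve keys differently, and python-dotenv would normally have loaded .env into the environment before this runs.

```python
import os


def resolve_key(env_name: str, override=None, environ=os.environ) -> str:
    """Return the explicit override if given, else the environment value."""
    value = override or environ.get(env_name)
    if not value:
        raise RuntimeError(
            f"Missing API key: set {env_name} or pass it explicitly"
        )
    return value
```

For example, `resolve_key("GROQ_API_KEY", override=cli_value)` would use the CLI value when one was supplied and fall back to the environment otherwise.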


Usage — CLI

# Review a single file
python agent.py --file mycode.py

# Review multiple connected files together
python agent.py --files models.py api.py utils.py auth.py

# Review an entire project directory
python agent.py --dir ./src --output report.md

# Review a GitHub PR
python agent.py --github https://github.com/owner/repo/pull/42

# Paste a snippet inline
python agent.py --code "def divide(a, b): return a/b"

# Save both markdown and JSON
python agent.py --file app.py --output report.md --output-json report.json

# Preview what would happen without calling any provider API
python agent.py --dir ./src --dry-run

# Get raw JSON output (useful for piping)
python agent.py --file app.py --json | jq '.issues[] | select(.severity == "CRITICAL")'

CLI flags

| Flag | Description |
| --- | --- |
| --file PATH | Review a single local file |
| --files PATH [PATH ...] | Review multiple connected files |
| --dir PATH | Review an entire project directory |
| --code STRING | Review an inline snippet |
| --github URL | Review a GitHub PR |
| --language LANG | Override auto-detected language |
| --output PATH | Save Markdown report |
| --output-json PATH | Save JSON report |
| --json | Print JSON to stdout |
| --dry-run | Preview without calling APIs |
| --groq-key KEY | Override GROQ_API_KEY |
| --mistral-key KEY | Override MISTRAL_API_KEY |
| --nvidia-key KEY | Override NVIDIA_API_KEY |
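
The flag table maps naturally onto `argparse`. The sketch below is not the parser from agent.py; in particular, treating the five input flags as mutually exclusive is an assumption, since the README does not say what happens when several are combined.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="agent.py", description="Multi-agent code review"
    )

    # Assumption: exactly one input source per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", metavar="PATH")
    source.add_argument("--files", metavar="PATH", nargs="+")
    source.add_argument("--dir", metavar="PATH")
    source.add_argument("--code", metavar="STRING")
    source.add_argument("--github", metavar="URL")

    parser.add_argument("--language", metavar="LANG")
    parser.add_argument("--output", metavar="PATH")
    parser.add_argument("--output-json", metavar="PATH")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--dry-run", action="store_true")

    # One override flag per provider, e.g. --groq-key.
    for provider in ("groq", "mistral", "nvidia"):
        parser.add_argument(f"--{provider}-key", metavar="KEY")
    return parser
```

With this layout, `build_parser().parse_args(["--files", "models.py", "api.py", "--dry-run"])` gives `args.files == ["models.py", "api.py"]` and `args.dry_run is True`.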

Usage — Web UI

python app.py
# Opens at http://localhost:7860

flowchart LR
    subgraph LEFT ["Left Panel — Input"]
        TAB1[Paste Code tab]
        TAB2[Upload Files tab\nMulti-select · Folder path]
        TAB3[GitHub PR tab]
        LANG[Language dropdown]
        KEYS[API Key fields\nGroq · Mistral · NVIDIA]
        BTN[Start Review button]
    end

    subgraph RIGHT ["Right Panel — Live Results"]
        PROG[Agent Progress\nScanner · Critic · Arbitrator\nwith live state badges]
        STREAM[Debate Transcript\nRaw streaming output\nfrom each agent]
        SCORE[Quality Score\n0–100 with colour badge]
        STATS[Issue counts\nCritical · High · Medium · Low · Info]
        TABLE[Issues Table\nSeverity · Category · Line · Description · Fix]
        DL[Download buttons\nMarkdown report · JSON report]
    end

    BTN --> PROG --> STREAM --> SCORE --> STATS --> TABLE --> DL

The right panel updates in real time as each agent streams its output — you can watch the Critic disagree with the Scanner live.
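
Live updates of this kind are commonly built on generators: each agent yields text chunks as they arrive, and the UI re-renders on every yield. The following sketch shows the idea without Gradio or any API call. `stream_debate` and the two fake agents are hypothetical, and the way earlier output is appended to the context is an assumption about the real loop in multi_reviewer.py.

```python
from typing import Callable, Iterable, Iterator


def stream_debate(
    agents: list, code: str
) -> Iterator[tuple]:
    """Yield (agent_name, chunk) pairs while running agents in order.

    `agents` is a list of (name, fn) pairs, where fn(context) returns an
    iterable of text chunks. Each agent sees the code plus all earlier output.
    """
    context = code
    for name, agent in agents:
        chunks = []
        for chunk in agent(context):
            chunks.append(chunk)
            yield name, chunk  # the UI can update on every chunk
        context += f"\n\n[{name} output]\n" + "".join(chunks)


# Stand-in agents that stream canned text.
def fake_scanner(context: str) -> Iterable[str]:
    yield "found "
    yield "2 issues"


def fake_critic(context: str) -> Iterable[str]:
    scanner_text = context.split("[Scanner output]\n")[1]
    yield "scanner said: " + scanner_text
```

A web handler could wrap `stream_debate` in its own generator and yield the accumulated transcript after each chunk.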


File Structure

Autonomous-Code-Review-Agent/
├── agent.py            # CLI entry point — Rich terminal UI
├── app.py              # Gradio web UI entry point
│
├── multi_reviewer.py   # 3-agent agentic loop (streaming + blocking)
├── reviewer.py         # Legacy single-provider loop
│
├── providers.py        # Unified API client for Groq, Mistral, NVIDIA NIM
├── agents.py           # Agent role definitions (persona, provider, prompts)
├── grok_client.py      # Original xAI/Grok client (backward compat)
│
├── inputs.py           # All input modes: file, files, dir, GitHub, paste
├── parsers.py          # Parse LLM JSON responses → Pydantic models
├── models.py           # Pydantic models: Issue, ReviewReport, ReviewSession
├── prompts.py          # Prompt templates for scanner / critic / arbitrator
├── report.py           # Markdown and JSON report generation
│
├── requirements.txt
├── .env.example        # API key template
└── .gitignore          # .env is blocked from commits

Report Output

Every review produces a quality score, a list of issues tagged by severity, and a summary.

Quality Score

A 0–100 score assigned by the Arbitrator based on the severity and breadth of confirmed issues:

| Score | Label | Meaning |
| --- | --- | --- |
| 85–100 | Excellent | Production-ready, minor improvements only |
| 70–84 | Good | A few issues worth fixing |
| 40–69 | Fair | Meaningful bugs or security gaps present |
| 0–39 | Needs Work | Critical or high issues that must be fixed |
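
The score bands above translate directly into a lookup. `score_label` is a hypothetical helper written for illustration; the thresholds are the ones listed in the table.

```python
def score_label(score: int) -> str:
    """Map a 0-100 quality score to its label from the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Needs Work"
```

The demo score of 42 therefore falls in the "Fair" band.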

Issue Severity Levels

| Severity | Description |
| --- | --- |
| CRITICAL | Must fix before shipping: security vulnerabilities, data loss risks |
| HIGH | Significant bugs or security weaknesses |
| MEDIUM | Logic errors, connection leaks, performance problems |
| LOW | Style, naming, missing validation |
| INFO | Observations and improvement suggestions |

Sample JSON output

{
  "overall_score": 42,
  "summary": "The code contains a SQL injection vulnerability and uses MD5 for password hashing, both of which are critical security issues that must be fixed before deployment.",
  "total_issues": 9,
  "critical_count": 2,
  "high_count": 3,
  "issues": [
    {
      "id": "a3f1b2c4",
      "severity": "CRITICAL",
      "category": "security",
      "line_number": 7,
      "description": "SQL injection via string concatenation in login()",
      "suggestion": "Use parameterized queries:\ncursor.execute('SELECT * FROM users WHERE username=? AND password=?', (username, password))",
      "confidence": "high",
      "confirmed": true
    }
  ]
}
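
Since the Arbitrator's reply is model output, a parser has to tolerate some variation in shape. The architecture diagram mentions an "array fallback" in `parse_report`; the sketch below shows one plausible reading of that, using plain dictionaries instead of the project's Pydantic models. The fallback behaviour and the choice to recompute the counts from the issue list are assumptions.

```python
import json
from collections import Counter


def parse_report(raw: str) -> dict:
    """Parse a report from JSON text, accepting an object or a bare issue array."""
    data = json.loads(raw)
    if isinstance(data, list):
        # Array fallback: the model returned only the issues.
        data = {"overall_score": None, "summary": "", "issues": data}

    issues = data.get("issues", [])
    counts = Counter(issue.get("severity", "INFO") for issue in issues)

    # Recompute the counts rather than trusting numbers written by the model.
    data["total_issues"] = len(issues)
    data["critical_count"] = counts["CRITICAL"]
    data["high_count"] = counts["HIGH"]
    return data
```

Recomputing the counts keeps the summary fields consistent with the `issues` array even when the model miscounts.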

Dependencies

| Package | Purpose |
| --- | --- |
| openai | API client for all three providers (all use OpenAI-compatible endpoints) |
| gradio | Web UI |
| pydantic | Data models and validation |
| tenacity | Retry logic with exponential backoff |
| rich | Terminal tables and progress display |
| python-dotenv | .env file loading |
| PyGithub | GitHub PR diff fetching |

License

MIT

About

An agentic code review system that analyses GitHub PRs for bugs, anti-patterns, and security vulnerabilities via multi-step LLM reasoning. In self-measured tests across 200 cases, it reduced the review cycle from ~9 min to ~4 min per PR and held webhook latency under 10 seconds end-to-end.
