LazyBeaver007/PaperPilot


PaperPilot

PaperPilot is a local-first research workflow tool that helps you:

  • Harvest AI/ML papers from arXiv.
  • Filter papers for relevance.
  • Ingest and index paper content into a vector database.
  • Ask questions in a chat UI and receive source-grounded answers.

The backend uses FastAPI, ChromaDB, and Ollama (Llama 3.2). The frontend is a React + Vite app.

Features

  • arXiv crawler with optional year filter and max results.
  • Relevance filtering using Gemini (GEMINI_API_KEY) with keyword fallback when unavailable.
  • PDF ingestion pipeline with robust text extraction:
    • pypdf first for stability.
    • Optional docling fallback with timeout and page limits.
  • Chunking strategy using markdown headers + recursive chunk splitting.
  • Vector indexing in persistent ChromaDB.
  • Streaming RAG chat (SSE) with paper ID source attribution.
  • Paper library browsing with search and year filtering.
  • Background crawl/index jobs exposed via API.
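
The chunking bullet above can be illustrated with a small, dependency-free sketch. This is not the code in ingestor.py: the function names (split_by_headers, recursive_split, chunk_markdown), the separator order, and the 200-character default are assumptions chosen only to show the idea of splitting on markdown headers first and then recursively splitting oversized sections.

```python
import re

# Hypothetical separator order: paragraphs, lines, sentences, then words.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_by_headers(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into (header, body) sections at lines starting with '#'."""
    sections: list[tuple[str, str]] = []
    header, body = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            if header or any(s.strip() for s in body):
                sections.append((header, "\n".join(body).strip()))
            header, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if header or any(s.strip() for s in body):
        sections.append((header, "\n".join(body).strip()))
    return sections


def recursive_split(text: str, max_chars: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current.strip():
        chunks.append(current)
    return chunks


def chunk_markdown(markdown: str, max_chars: int = 200) -> list[dict]:
    """Return chunks tagged with the section header they came from."""
    return [
        {"section": header, "text": chunk}
        for header, body in split_by_headers(markdown)
        for chunk in recursive_split(body, max_chars)
    ]
```

Keeping the section header alongside each chunk is one way to preserve context for retrieval; whether the project stores it as metadata is not stated here.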

Architecture

High-level flow:

  1. crawler.py downloads paper metadata (papers_db/*.json) and PDFs (papers_db/*.pdf).
  2. ingestor.py extracts text, chunks content, generates embeddings, and stores vectors in chroma_db/.
  3. services/rag_service.py retrieves top-k chunks from Chroma and calls Ollama chat model.
  4. api/routes_chat.py streams generated tokens and source IDs over Server-Sent Events.
  5. frontend/ renders the library UI and live chat panel.

Core backend modules:

  • main.py: FastAPI app and route registration.
  • api/routes_jobs.py: background crawl/index jobs.
  • api/routes_papers.py: paper listing and detail APIs.
  • api/routes_chat.py: streaming chat API.
  • services/db_service.py: paper metadata and status resolution.
  • brain.py: CLI assistant for local terminal-based Q&A.

Repository Structure

.
├── api/
├── services/
├── frontend/
├── papers_db/
├── chroma_db/
├── crawler.py
├── ingestor.py
├── brain.py
├── main.py
└── utils.py

Prerequisites

  • Python 3.10+ (3.11 recommended)
  • Node.js 18+ and npm
  • Ollama installed and running locally
  • Llama model pulled in Ollama:
ollama pull llama3.2

Backend Setup

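Create and activate a virtual environment, then install dependencies. The activation command below is for Windows; on macOS or Linux, use source .venv/bin/activate instead.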

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Create a .env file (optional but recommended):

GEMINI_API_KEY=your_api_key_here
USE_DOCLING=0

Notes:

  • If GEMINI_API_KEY is not set, relevance filtering falls back to keyword heuristics.
  • USE_DOCLING=1 enables docling fallback for extraction when pypdf yields no text.
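
As a rough illustration of how these two settings could be interpreted, the sketch below reads them from a mapping such as os.environ. The names Settings, load_settings, and relevance_mode are hypothetical and not taken from the project, and the sketch assumes the .env values have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: Optional[str]
    use_docling: bool

    @property
    def relevance_mode(self) -> str:
        # Gemini when a key is present, keyword heuristics otherwise.
        return "gemini" if self.gemini_api_key else "keyword"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment-style key/value pairs."""
    key = env.get("GEMINI_API_KEY", "").strip() or None
    # Only the documented value "1" enables the docling fallback in this sketch.
    use_docling = env.get("USE_DOCLING", "0").strip() == "1"
    return Settings(gemini_api_key=key, use_docling=use_docling)
```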

Frontend Setup

cd frontend
npm install
npm run dev

The frontend runs on http://localhost:5173.

Run the Backend API

From the repository root:

uvicorn main:app --reload --port 8000

API base URL: http://localhost:8000/api

Typical Workflow

  1. Start backend: uvicorn main:app --reload --port 8000
  2. Start frontend: cd frontend && npm run dev
  3. In UI sidebar:
    • Click Harvest (Crawler) to download paper metadata and PDFs.
    • Click Index (Ingestor) to build vector embeddings in Chroma.
  4. Open Library and ask questions in the chat panel.
  5. Review returned source paper IDs for grounding.
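
The harvest and index steps can also be scripted against the job endpoints listed under API Endpoints. The sketch below is a minimal polling client using only the standard library. The response field names job_id and status are assumptions (the README documents the status values but not the response schema), and run_job and http_json are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

API_BASE = "http://localhost:8000/api"
TERMINAL_STATES = {"completed", "failed"}


def http_json(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_job(kind: str, body: Optional[dict] = None, *,
            send: Callable[..., dict] = http_json,
            poll_seconds: float = 2.0, max_polls: int = 300) -> str:
    """Start a 'crawl' or 'index' job and poll until it completes or fails."""
    started = send("POST", f"{API_BASE}/jobs/{kind}", body)
    job_id = started["job_id"]  # assumed response field
    for _ in range(max_polls):
        status = send("GET", f"{API_BASE}/jobs/{job_id}")["status"]  # assumed field
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

With the backend running, run_job("crawl", {"year": 2025, "max_results": 50}) followed by run_job("index") would mirror the two sidebar buttons, provided the assumed field names match the actual responses.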

API Endpoints

  • GET /
    Health message.

  • POST /api/jobs/crawl
    Starts a background crawl job.

Request body (optional):

{
  "year": 2025,
  "max_results": 50
}

  • POST /api/jobs/index
    Starts background indexing of local PDFs.

  • GET /api/jobs/{job_id}
    Returns job status (pending, running, completed, failed).

  • GET /api/papers?q=...&year=...
    Lists paper metadata with optional text and year filters.

  • GET /api/papers/{paper_id}
    Returns a single paper record.

  • POST /api/chat/stream
    Streams chat responses as SSE events.

Request body:

{
  "query": "What are recent trends in federated learning?"
}

Event types:

  • token: incremental generated text.
  • done: final event with sources array.
  • error: structured error payload.
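
For clients other than the bundled frontend, the stream from POST /api/chat/stream can be read by parsing standard Server-Sent Events framing. The sketch below assumes the server emits ordinary event: and data: fields separated by blank lines. It returns the data field as a raw string because the payload format of each event type is not documented here; iter_sse_events is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from the text lines of an SSE stream."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space from the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

A caller would decode each line of the HTTP response as UTF-8, pass the lines to iter_sse_events, append token data to the displayed answer, and stop on a done or error event.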

CLI Assistant

You can also chat from the terminal:

python brain.py

Data and Persistence

  • papers_db/ stores raw metadata JSON and downloaded PDFs.
  • chroma_db/ stores the persistent Chroma vector index.

These directories are local state. Back them up if you need to preserve indexed data.
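
One simple way to back up these directories is to zip each one with a timestamped name. The sketch below uses only the standard library; backup_state and the archive naming scheme are illustrative assumptions, not part of the project. Archiving while the backend is writing could capture an inconsistent index, so it is probably safest to stop the backend first.

```python
import shutil
from datetime import datetime
from pathlib import Path

STATE_DIRS = ("papers_db", "chroma_db")


def backup_state(repo_root: str, dest_dir: str) -> list[Path]:
    """Zip each existing state directory into dest_dir with a timestamped name."""
    root, dest = Path(repo_root), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archives = []
    for name in STATE_DIRS:
        if (root / name).is_dir():
            # base_dir keeps the directory name as the top-level folder in the zip.
            archive = shutil.make_archive(
                str(dest / f"{name}-{stamp}"), "zip",
                root_dir=str(root), base_dir=name,
            )
            archives.append(Path(archive))
    return archives
```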

Troubleshooting

  • Ollama connection/model errors:

    • Ensure Ollama is running.
    • Ensure llama3.2 is pulled (ollama pull llama3.2).
  • Empty or weak retrieval:

    • Run the index job after harvesting.
    • Verify PDFs exist in papers_db/.
    • Try broader or clearer queries.
  • PDF extraction failures:

    • Enable docling fallback with USE_DOCLING=1.
    • Large/complex PDFs may still fail; check ingestion logs.
  • SSL/certificate issues during arXiv downloads:

    • The crawler configures a certifi-backed SSL context automatically.
    • Ensure certifi is installed.
  • CORS/frontend API errors:

    • Run frontend on http://localhost:5173.
    • Run backend on http://localhost:8000.
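
For the Ollama items above, a quick scripted check can tell a stopped server apart from a missing model. The sketch below queries Ollama's /api/tags endpoint, which lists locally pulled models. It assumes Ollama is listening on its default address, http://localhost:11434; model_available and check_ollama are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def model_available(tags_payload: dict, model: str = "llama3.2") -> bool:
    """Return True if `model` (with any tag, e.g. ':latest') is in an /api/tags payload."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return any(n == model or n.split(":", 1)[0] == model for n in names)


def check_ollama(model: str = "llama3.2", timeout: float = 3.0) -> str:
    """Report whether Ollama is reachable and the model has been pulled."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return f"Ollama not reachable at {OLLAMA_URL}: {exc}"
    if model_available(payload, model):
        return f"{model} is available"
    return f"Ollama is running, but {model} is missing; run: ollama pull {model}"
```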

Current Limitations

  • Job state is in-memory; restarting the backend clears job history.
  • No authentication or multi-user isolation.
  • The summarized status is reserved but not yet implemented in the pipeline.
  • No automated test suite is included yet.

Development Notes

  • Keep API and frontend base URLs aligned (frontend/src/lib/api.ts, frontend/src/lib/chatStream.ts).
  • Re-index after major ingestion/chunking changes.
  • frontend/dist/ is build output and may not match current source during development.

License

This project is licensed under the MIT License. See LICENSE.
