PaperPilot is a local-first research workflow tool that helps you:
- Harvest AI/ML papers from arXiv.
- Filter papers for relevance.
- Ingest and index paper content into a vector database.
- Ask questions in a chat UI and receive source-grounded answers.
The backend uses FastAPI, ChromaDB, and Ollama (Llama 3.2). The frontend is a React + Vite app.

- arXiv crawler with optional year filter and max results.
- Relevance filtering using Gemini (GEMINI_API_KEY) with keyword fallback when unavailable.
- PDF ingestion pipeline with robust text extraction:
  - pypdf first for stability.
  - Optional docling fallback with timeout and page limits.
- Chunking strategy using markdown headers + recursive chunk splitting.
- Vector indexing in persistent ChromaDB.
- Streaming RAG chat (SSE) with paper ID source attribution.
- Paper library browsing with search and year filtering.
- Background crawl/index jobs exposed via API.
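The chunking strategy listed above (markdown headers plus recursive splitting) can be sketched in plain Python. This is only an illustration of the general technique, not the code in ingestor.py: the function names, the separator order, and the max_chars value are all assumptions.

```python
import re

# Hypothetical separator order: paragraphs, lines, sentences, then words.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_by_headers(markdown: str) -> list[str]:
    """Split markdown into sections, each beginning at a header line."""
    sections, current = [], []
    for line in markdown.splitlines():
        # A new header closes the section collected so far.
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def recursive_split(text: str, max_chars: int = 500,
                    separators: list[str] = SEPARATORS) -> list[str]:
    """Split text on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)
    chunks, buf = [], ""
    for piece in pieces:
        candidate = piece if not buf else buf + sep + piece
        if len(candidate) <= max_chars:
            buf = candidate  # keep packing pieces into the current chunk
            continue
        if buf:
            chunks.append(buf)
            buf = ""
        if len(piece) <= max_chars:
            buf = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
    if buf:
        chunks.append(buf)
    return chunks


def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Header-aware chunking: split by headers, then size-limit each section."""
    chunks = []
    for section in split_by_headers(markdown):
        chunks.extend(recursive_split(section, max_chars))
    return chunks
```

Splitting on headers first keeps each chunk inside a single section, so a retrieved chunk rarely mixes unrelated parts of a paper; the recursive pass then only enforces the size limit.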
High-level flow:
- crawler.py downloads paper metadata (papers_db/*.json) and PDFs (papers_db/*.pdf).
- ingestor.py extracts text, chunks content, generates embeddings, and stores vectors in chroma_db/.
- services/rag_service.py retrieves top-k chunks from Chroma and calls the Ollama chat model.
- api/routes_chat.py streams generated tokens and source IDs over Server-Sent Events.
- frontend/ renders the library UI and live chat panel.
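To make the retrieval step in this flow concrete, here is a toy version of top-k retrieval followed by prompt assembly. In PaperPilot the similarity search is done by Chroma; this sketch swaps in plain cosine similarity over hand-written vectors, and every name in it (top_k_chunks, build_prompt, the sample paper IDs) is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, index, k=3):
    """Return the k best (score, paper_id, text) tuples.

    index is a list of (paper_id, chunk_text, vector) tuples, standing in
    for what a vector database would hold.
    """
    scored = [(cosine_similarity(query_vec, vec), pid, text)
              for pid, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, hits) -> str:
    """Assemble a grounded prompt that tags each chunk with its paper ID."""
    context = "\n\n".join(f"[{pid}] {text}" for _, pid, text in hits)
    return (
        "Answer using only the context below and cite paper IDs.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Toy data: two-dimensional "embeddings" and made-up paper IDs.
toy_index = [
    ("paper-a", "Federated averaging basics.", [1.0, 0.0]),
    ("paper-b", "Image segmentation survey.", [0.0, 1.0]),
    ("paper-c", "Federated optimization tricks.", [0.9, 0.1]),
]
hits = top_k_chunks([1.0, 0.0], toy_index, k=2)
prompt = build_prompt("What is federated averaging?", hits)
```

Carrying the paper ID alongside each chunk is what allows the chat response to report its sources.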
Core backend modules:
- main.py: FastAPI app and route registration.
- api/routes_jobs.py: background crawl/index jobs.
- api/routes_papers.py: paper listing and detail APIs.
- api/routes_chat.py: streaming chat API.
- services/db_service.py: paper metadata and status resolution.
- brain.py: CLI assistant for local terminal-based Q&A.
.
├── api/
├── services/
├── frontend/
├── papers_db/
├── chroma_db/
├── crawler.py
├── ingestor.py
├── brain.py
├── main.py
└── utils.py
- Python 3.10+ (3.11 recommended)
- Node.js 18+ and npm
- Ollama installed and running locally
- Llama model pulled in Ollama:
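  ollama pull llama3.2
Create and activate a virtual environment, then install dependencies. The activation command shown is for Windows; on macOS/Linux use source .venv/bin/activate instead.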
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
Create a .env file (optional but recommended):
GEMINI_API_KEY=your_api_key_here
USE_DOCLING=0
Notes:
- If GEMINI_API_KEY is not set, relevance filtering falls back to keyword heuristics.
- USE_DOCLING=1 enables docling fallback for extraction when pypdf yields no text.
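As a rough picture of how these two settings might drive behavior, the sketch below reads them from the environment and applies a keyword heuristic when no Gemini key is present. It assumes the .env values have already been loaded into the process environment; load_settings, keyword_relevant, and the keyword list are hypothetical and not taken from the project.

```python
import os

# Hypothetical keyword list for the fallback heuristic.
DEFAULT_KEYWORDS = ("federated", "transformer", "reinforcement learning")


def load_settings(env=None) -> dict:
    """Read GEMINI_API_KEY and USE_DOCLING from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    return {
        "gemini_api_key": api_key or None,
        # Without a key, relevance filtering uses the keyword fallback.
        "use_gemini": bool(api_key),
        # Only the literal value "1" switches the docling fallback on.
        "use_docling": env.get("USE_DOCLING", "0").strip() == "1",
    }


def keyword_relevant(title: str, abstract: str,
                     keywords=DEFAULT_KEYWORDS) -> bool:
    """Fallback filter: case-insensitive keyword match on title and abstract."""
    haystack = f"{title} {abstract}".lower()
    return any(keyword.lower() in haystack for keyword in keywords)
```

Passing the environment in as a parameter keeps the function easy to exercise without touching real environment variables.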
cd frontend
npm install
npm run dev
The frontend runs on http://localhost:5173.
From the repository root:
uvicorn main:app --reload --port 8000
API base URL: http://localhost:8000/api
- Start backend: uvicorn main:app --reload --port 8000
- Start frontend: cd frontend && npm run dev
- In UI sidebar:
  - Click Harvest (Crawler) to download paper metadata and PDFs.
  - Click Index (Ingestor) to build vector embeddings in Chroma.
- Open Library and ask questions in the chat panel.
- Review returned source paper IDs for grounding.
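The harvest and index steps can presumably also be scripted against the job endpoints documented in this README (POST /api/jobs/crawl and POST /api/jobs/index). The sketch below only builds the HTTP requests with the standard library; the helper names are hypothetical, and because the response format is not documented here, the code does not parse any reply.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/api"  # base URL stated in this README


def build_crawl_request(year=None, max_results=None, api_base=API_BASE):
    """Build a POST request for /jobs/crawl with the optional JSON body."""
    body = {}
    if year is not None:
        body["year"] = year
    if max_results is not None:
        body["max_results"] = max_results
    return urllib.request.Request(
        f"{api_base}/jobs/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_index_request(api_base=API_BASE):
    """Build a POST request for /jobs/index (no body documented)."""
    return urllib.request.Request(f"{api_base}/jobs/index", data=b"", method="POST")


crawl_request = build_crawl_request(year=2025, max_results=50)
# With the backend running, the request could be sent like this:
# with urllib.request.urlopen(crawl_request) as response:
#     print(response.read().decode("utf-8"))
```

Whether the server accepts an empty JSON object when both fields are omitted is an assumption; the README only states that the body is optional.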
- GET /
  Health message.
- POST /api/jobs/crawl
  Starts a background crawl job.
  Request body (optional):
  {
    "year": 2025,
    "max_results": 50
  }
- POST /api/jobs/index
  Starts background indexing of local PDFs.
- GET /api/jobs/{job_id}
  Returns job status (pending, running, completed, failed).
- GET /api/papers?q=...&year=...
  Lists paper metadata with optional text + year filter.
- GET /api/papers/{paper_id}
  Returns a single paper record.
- POST /api/chat/stream
  Streams chat responses as SSE events.
  Request body:
  {
    "query": "What are recent trends in federated learning?"
  }
  Event types:
  - token: incremental generated text.
  - done: final event with sources array.
  - error: structured error payload.
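A client consuming the chat stream has to group SSE lines into events and react to the three event types. The sketch below shows one way to do that over already-decoded lines. It assumes the standard SSE wire format (event: and data: fields, blank line between events), that token events carry raw text, and that the done event carries a JSON object with a sources key; the actual payload shapes in api/routes_chat.py may differ.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_type, data) pairs from decoded SSE lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, a single leading space is dropped.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)


def collect_answer(lines: Iterable[str]) -> tuple[str, list]:
    """Concatenate token events and return (answer_text, sources)."""
    tokens, sources = [], []
    for event, data in parse_sse(lines):
        if event == "token":
            tokens.append(data)  # assumed: raw text fragment
        elif event == "done":
            sources = json.loads(data).get("sources", [])  # assumed JSON shape
        elif event == "error":
            raise RuntimeError(f"chat stream error: {data}")
    return "".join(tokens), sources


# Hand-written sample stream with a made-up source ID.
sample_stream = [
    "event: token\n", "data: Hello\n", "\n",
    "event: token\n", "data:  world\n", "\n",
    "event: done\n", 'data: {"sources": ["paper-a"]}\n', "\n",
]
answer, sources = collect_answer(sample_stream)
```

Separating parse_sse from collect_answer keeps the wire-format handling reusable if the payload shapes turn out to be different from the ones assumed here.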
You can also chat from the terminal:
python brain.py
- papers_db/ stores raw metadata JSON and downloaded PDFs.
- chroma_db/ stores the persistent Chroma vector index.
These directories are local state. Back them up if you need to preserve indexed data.
- Ollama connection/model errors:
  - Ensure Ollama is running.
  - Ensure llama3.2 is pulled (ollama pull llama3.2).
- Empty or weak retrieval:
  - Run the index job after harvesting.
  - Verify PDFs exist in papers_db/.
  - Try broader or clearer queries.
- PDF extraction failures:
  - Enable docling fallback with USE_DOCLING=1.
  - Large/complex PDFs may still fail; check ingestion logs.
- SSL/certificate issues during arXiv downloads:
  - The crawler configures a certifi-backed SSL context automatically.
  - Ensure certifi is installed.
- CORS/frontend API errors:
  - Run frontend on http://localhost:5173.
  - Run backend on http://localhost:8000.
- Job state is in-memory; restarting backend clears job history.
- No authentication or multi-user isolation.
- Status summarized is reserved but not yet implemented in the pipeline.
- No automated test suite is included yet.
- Keep API and frontend base URLs aligned (frontend/src/lib/api.ts, frontend/src/lib/chatStream.ts).
- Re-index after major ingestion/chunking changes.
- frontend/dist/ is build output and may not match current source during development.
This project is licensed under the MIT License. See LICENSE.