PaperPilot

PaperPilot is a local-first research workflow tool that helps you:

Harvest AI/ML papers from arXiv.
Filter papers for relevance.
Ingest and index paper content into a vector database.
Ask questions in a chat UI and receive source-grounded answers.

The backend uses FastAPI, ChromaDB, and Ollama (Llama 3.2). The frontend is a React + Vite app.

Features

arXiv crawler with optional year filter and max results.
Relevance filtering using Gemini (requires GEMINI_API_KEY), with a keyword-based fallback when the key is not set.
PDF ingestion pipeline with robust text extraction:
- pypdf first for stability.
- Optional docling fallback with timeout and page limits.
Chunking strategy using markdown headers + recursive chunk splitting.
Vector indexing in persistent ChromaDB.
Streaming RAG chat (SSE) with paper ID source attribution.
Paper library browsing with search and year filtering.
Background crawl/index jobs exposed via API.
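
The chunking strategy listed above (split on markdown headers, then recursively split oversized sections) can be illustrated with a minimal pure-Python sketch. This is not the code in ingestor.py: the function names, the 1000-character default, and the separator order are all assumptions chosen for illustration.

```python
import re


def split_by_headers(markdown_text):
    """Split markdown into (header, body) sections at lines that start with '#'."""
    sections = []
    header, lines = "", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the previous section if it has a header or any non-blank text.
            if header or any(ln.strip() for ln in lines):
                sections.append((header, "\n".join(lines).strip()))
            header, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if header or any(ln.strip() for ln in lines):
        sections.append((header, "\n".join(lines).strip()))
    return sections


def recursive_split(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest separator present, recursing on oversized parts."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(part) <= max_chars:
                current = part
            else:
                # The part alone is too long: retry with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return [c for c in chunks if c.strip()]
    # No separator applies: fall back to fixed-width slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_markdown(markdown_text, max_chars=1000):
    """Return chunks that each remember the header of the section they came from."""
    chunks = []
    for header, body in split_by_headers(markdown_text):
        for piece in recursive_split(body, max_chars):
            chunks.append({"header": header, "text": piece})
    return chunks
```

Keeping the section header next to each chunk is one way to preserve context that would otherwise be lost when a long section is split into several pieces.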

Architecture

High-level flow:

crawler.py downloads paper metadata (papers_db/*.json) and PDFs (papers_db/*.pdf).
ingestor.py extracts text, chunks content, generates embeddings, and stores vectors in chroma_db/.
services/rag_service.py retrieves the top-k chunks from Chroma and calls the Ollama chat model.
api/routes_chat.py streams generated tokens and source IDs over Server-Sent Events.
frontend/ renders the library UI and live chat panel.
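
To make the retrieval step in the flow above more concrete, here is a hedged sketch of how retrieved chunks might be assembled into a source-grounded prompt. It does not reproduce services/rag_service.py: the function name, the prompt wording, and the `paper_id` and `text` keys are hypothetical, and the paper IDs in the usage note are placeholders.

```python
def build_grounded_prompt(question, retrieved):
    """Build a prompt from retrieved chunks and collect the distinct source IDs.

    `retrieved` is a list of dicts with the hypothetical keys 'paper_id' and 'text'.
    """
    context_blocks = []
    sources = []
    for item in retrieved:
        # Label every chunk with its paper ID so the model can cite it.
        context_blocks.append(f"[{item['paper_id']}]\n{item['text']}")
        if item["paper_id"] not in sources:
            sources.append(item["paper_id"])
    context = "\n\n".join(context_blocks)
    prompt = (
        "Answer the question using only the context below. "
        "Cite paper IDs in square brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return prompt, sources
```

Calling `build_grounded_prompt("What?", [{"paper_id": "paper-a", "text": "..."}])` returns the prompt text plus a de-duplicated list of source IDs, which is the kind of list a chat endpoint could send alongside the generated answer.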

Core backend modules:

main.py: FastAPI app and route registration.
api/routes_jobs.py: background crawl/index jobs.
api/routes_papers.py: paper listing and detail APIs.
api/routes_chat.py: streaming chat API.
services/db_service.py: paper metadata and status resolution.
brain.py: CLI assistant for local terminal-based Q&A.

Repository Structure

.
├── api/
├── services/
├── frontend/
├── papers_db/
├── chroma_db/
├── crawler.py
├── ingestor.py
├── brain.py
├── main.py
└── utils.py

Prerequisites

Python 3.10+ (3.11 recommended)
Node.js 18+ and npm
Ollama installed and running locally
The llama3.2 model pulled in Ollama:

ollama pull llama3.2

Backend Setup

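Create and activate a virtual environment, then install dependencies. The activation command below is for Windows; on macOS or Linux, run source .venv/bin/activate instead.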

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Create a .env file (optional but recommended):

GEMINI_API_KEY=your_api_key_here
USE_DOCLING=0

Notes:

If GEMINI_API_KEY is not set, relevance filtering falls back to keyword heuristics.
USE_DOCLING=1 enables docling fallback for extraction when pypdf yields no text.
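
The two notes above can be summarized as a small settings function. This is a sketch under assumptions, not the project's actual configuration code: the function name and the returned keys are hypothetical, and it reads from a plain mapping rather than showing how the .env file gets loaded.

```python
import os


def load_settings(env=None):
    """Derive feature switches from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "").strip()
    return {
        # Without a key, relevance filtering uses keyword heuristics.
        "relevance_mode": "gemini" if api_key else "keyword",
        # docling is only tried when explicitly enabled with USE_DOCLING=1.
        "use_docling": env.get("USE_DOCLING", "0").strip() == "1",
    }
```

Passing a dictionary instead of relying on `os.environ` makes the behaviour easy to check, for example `load_settings({"USE_DOCLING": "1"})`.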

Frontend Setup

cd frontend
npm install
npm run dev

The frontend runs on http://localhost:5173.

Run the Backend API

From the repository root:

uvicorn main:app --reload --port 8000

API base URL: http://localhost:8000/api

Typical Workflow

Start backend: uvicorn main:app --reload --port 8000
Start frontend: cd frontend && npm run dev
In the UI sidebar:
- Click Harvest (Crawler) to download paper metadata and PDFs.
- Click Index (Ingestor) to build vector embeddings in Chroma.
Open Library and ask questions in the chat panel.
Review returned source paper IDs for grounding.
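
The harvest and index steps can presumably also be scripted against the job endpoints (POST /api/jobs/crawl, POST /api/jobs/index, GET /api/jobs/{job_id}). The sketch below uses only the Python standard library. The response shape is an assumption: it expects the start response to contain a `job_id` field and the status response to contain a `status` field, neither of which is confirmed by this README. The helper names are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "http://localhost:8000/api"


def http_json(method, url, payload=None):
    """Send a JSON request and decode the JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_job(kind, payload=None, request=http_json, poll_seconds=2.0, max_polls=150):
    """Start a 'crawl' or 'index' job, then poll until it completes or fails."""
    started = request("POST", f"{API_BASE}/jobs/{kind}", payload)
    job_id = started["job_id"]  # ASSUMPTION: field name is not documented here
    for _ in range(max_polls):
        job = request("GET", f"{API_BASE}/jobs/{job_id}")
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Example usage (requires the backend to be running):
#   run_job("crawl", {"year": 2025, "max_results": 50})
#   run_job("index")
```

The `request` parameter exists so the polling logic can be exercised with a stub instead of a live server.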

API Endpoints

GET /
Health message.
POST /api/jobs/crawl
Starts a background crawl job.

Request body (optional):

{
  "year": 2025,
  "max_results": 50
}

POST /api/jobs/index
Starts background indexing of local PDFs.
GET /api/jobs/{job_id}
Returns job status (pending, running, completed, failed).
GET /api/papers?q=...&year=...
Lists paper metadata with optional text and year filters.
GET /api/papers/{paper_id}
Returns a single paper record.
POST /api/chat/stream
Streams chat responses as SSE events.

Request body:

{
  "query": "What are recent trends in federated learning?"
}

Event types:

token: incremental generated text.
done: final event with sources array.
error: structured error payload.
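
A client has to split the stream into these events before it can render tokens or sources. The parser below is a minimal sketch that assumes the endpoint emits standard Server-Sent Events framing (`event:` and `data:` lines separated by a blank line); the exact wire format and payload fields of PaperPilot are not shown in this README, so treat the function name and the sample payloads as assumptions.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Only a single leading space is stripped, so token whitespace survives.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        # Lenient: also emit a trailing event that lacks a final blank line.
        yield event, "\n".join(data)
```

A consumer would typically append the data of each `token` event to the displayed answer, stop on `done`, and surface the payload of an `error` event to the user.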

CLI Assistant

You can also chat from terminal:

python brain.py

Data and Persistence

papers_db/ stores raw metadata JSON and downloaded PDFs.
chroma_db/ stores the persistent Chroma vector index.

These directories are local state. Back them up if you need to preserve indexed data.

Troubleshooting

Ollama connection/model errors:
- Ensure Ollama is running.
- Ensure llama3.2 is pulled (ollama pull llama3.2).
Empty or weak retrieval:
- Run index job after harvesting.
- Verify PDFs exist in papers_db/.
- Try broader or clearer queries.
PDF extraction failures:
- Enable docling fallback with USE_DOCLING=1.
- Large/complex PDFs may still fail; check ingestion logs.
SSL/certificate issues during arXiv downloads:
- The crawler configures certifi-backed SSL context automatically.
- Ensure certifi is installed.
CORS/frontend API errors:
- Run frontend on http://localhost:5173.
- Run backend on http://localhost:8000.

Current Limitations

Job state is held in memory; restarting the backend clears job history.
No authentication or multi-user isolation.
The summarized status is reserved but not yet implemented in the pipeline.
No automated test suite is included yet.

Development Notes

Keep API and frontend base URLs aligned (frontend/src/lib/api.ts, frontend/src/lib/chatStream.ts).
Re-index after major ingestion/chunking changes.
frontend/dist/ is build output and may not match current source during development.

License

This project is licensed under the MIT License. See LICENSE.
