feat(rag): add extraction and indexing of PDF image captions by nancysangani · Pull Request #490 · param20h/PDF-Assistant-RAG

nancysangani · 2026-06-06T06:25:57Z

🔗 Related Issue

Closes #438

📝 What does this PR do?

Completes image caption extraction and indexing for the RAG pipeline across two files:

backend/app/rag/vision.py (rewrite of existing stub):

Adds extract_captions_from_pdf(), which uses PyMuPDF bounding-box proximity to find the nearest text block below (or above) each figure. Zero cost, fully offline.
Fixes the broken OpenAI hook: openai.Image.create was an image-generation call, not a vision call. It is replaced with the gpt-4o-mini Chat Completions vision API, guarded behind VISION_PROVIDER=openai + OPENAI_API_KEY in settings.
Restructures the caption_image() resolution order: OpenAI (if configured) → OCR (pytesseract) → size-aware placeholder.
generate_captions_for_chunks() now respects the pre-extracted image_caption written by the ingestion pass and only falls back to OCR/placeholder for gaps.
Raw image_bytes are always popped from chunks in the finally block, so they are never serialised into ChromaDB.
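As a rough illustration of the bounding-box proximity idea behind extract_captions_from_pdf(), here is a minimal sketch in plain Python. It is not the code from this PR: nearest_caption, max_gap and the tuple layout are assumptions made for the example. The layout mirrors the first five fields of the tuples that PyMuPDF's page.get_text("blocks") returns, with y growing downward.

```python
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward
TextBlock = Tuple[float, float, float, float, str]  # bbox fields plus block text


def _overlaps_horizontally(a: BBox, b: BBox) -> bool:
    """True when the two boxes share some horizontal extent."""
    return min(a[2], b[2]) > max(a[0], b[0])


def nearest_caption(
    image_bbox: BBox,
    blocks: Sequence[TextBlock],
    max_gap: float = 40.0,  # hypothetical threshold, in page units
) -> Optional[str]:
    """Pick the closest text block below the image, else the closest above."""
    below: List[Tuple[float, str]] = []
    above: List[Tuple[float, str]] = []
    for x0, y0, x1, y1, text in blocks:
        text = text.strip()
        if not text or not _overlaps_horizontally(image_bbox, (x0, y0, x1, y1)):
            continue
        gap_below = y0 - image_bbox[3]  # block top minus image bottom
        gap_above = image_bbox[1] - y1  # image top minus block bottom
        if 0 <= gap_below <= max_gap:
            below.append((gap_below, text))
        elif 0 <= gap_above <= max_gap:
            above.append((gap_above, text))
    # Blocks below the figure win; blocks above are only a fallback.
    for candidates in (below, above):
        if candidates:
            return min(candidates, key=lambda c: c[0])[1]
    return None
```

With an image at (100, 100, 300, 250), a block starting 10 units below it is chosen over a heading 10 units above it, and a block in another column is ignored because it has no horizontal overlap.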

backend/app/services/document_ingestion.py (surgical addition):

Adds a proximity caption pass between chunk_document() and store_chunks().
Calls extract_captions_from_pdf(), matches captions to image chunks by (page, figure_index) order, and writes image_caption + bbox before store_chunks() / generate_captions_for_chunks() runs.
Wrapped in try/except, so a caption failure never blocks ingestion.
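The matching step described above might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: attach_captions and caption_pass are hypothetical helpers, chunks and captions are assumed to be plain dicts, and the keys page, figure_index and text are invented for the example (is_image, image_caption and bbox come from the description).

```python
import logging
from collections import defaultdict
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)

Record = Dict[str, Any]


def attach_captions(chunks: List[Record], captions: List[Record]) -> None:
    """Pair the n-th image chunk on a page with the n-th caption on that page."""
    by_page: Dict[int, List[Record]] = defaultdict(list)
    for cap in sorted(captions, key=lambda c: (c["page"], c["figure_index"])):
        by_page[cap["page"]].append(cap)

    seen: Dict[int, int] = defaultdict(int)  # image chunks seen so far, per page
    for chunk in chunks:
        if not chunk.get("is_image"):
            continue
        page = chunk["page"]
        position = seen[page]
        seen[page] += 1
        if position < len(by_page[page]):
            cap = by_page[page][position]
            if cap.get("text"):  # leave gaps for the later OCR/placeholder fallback
                chunk["image_caption"] = cap["text"]
                chunk["bbox"] = cap["bbox"]


def caption_pass(chunks: List[Record], extract: Callable[[], List[Record]]) -> List[Record]:
    """Run the caption pass without letting a failure stop ingestion."""
    try:
        attach_captions(chunks, extract())
    except Exception:
        logger.warning("caption pass failed; continuing ingestion", exc_info=True)
    return chunks
```

Passing the extractor as a callable keeps the failure boundary in one place: whether extraction or matching raises, the chunks are returned and ingestion proceeds without proximity captions.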

🗂️ Type of Change

✨ New feature
🔧 Refactor / code cleanup

🧪 How was this tested?

Ran the backend locally (uvicorn app.main:app --reload)
Uploaded a PDF with labelled figures; confirmed ChromaDB stored chunks with is_image=True and non-empty image_caption matching adjacent text
Verified images with no adjacent text receive the size-aware placeholder
Confirmed decorative images below 1,000 px² are skipped
Confirmed raw image_bytes are never present in stored chunk metadata
Confirmed existing text and table chunks are unaffected (no regression)
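The behaviours checked above (placeholder fallback, skipping small decorative images, and never storing image_bytes) can be pictured with the following sketch. It is a simplified stand-in rather than the PR's implementation: resolve_caption, caption_chunks, the width and height keys, the placeholder wording and the place where the 1,000 px² check happens are all assumptions. Providers are plain callables standing in for the OpenAI and OCR steps.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

MIN_IMAGE_AREA = 1_000  # px²; threshold taken from the testing notes

Record = Dict[str, Any]
Captioner = Callable[[bytes], Optional[str]]


def placeholder_caption(width: int, height: int) -> str:
    """Size-aware placeholder; the exact wording is hypothetical."""
    return f"[Image {width}x{height} px, no caption available]"


def resolve_caption(
    image_bytes: bytes, width: int, height: int, providers: Sequence[Captioner]
) -> str:
    """Try each provider in order; fall back to the placeholder."""
    for provider in providers:
        try:
            caption = provider(image_bytes)
        except Exception:
            caption = None  # a failing provider just hands over to the next one
        if caption and caption.strip():
            return caption.strip()
    return placeholder_caption(width, height)


def caption_chunks(chunks: List[Record], providers: Sequence[Captioner] = ()) -> List[Record]:
    """Caption image chunks, drop decorative ones, and strip raw bytes."""
    kept: List[Record] = []
    for chunk in chunks:
        try:
            if not chunk.get("is_image"):
                kept.append(chunk)  # text and table chunks pass through untouched
                continue
            width, height = chunk.get("width", 0), chunk.get("height", 0)
            if width * height < MIN_IMAGE_AREA:
                continue  # decorative image: skipped
            if not chunk.get("image_caption"):  # respect pre-extracted captions
                chunk["image_caption"] = resolve_caption(
                    chunk.get("image_bytes", b""), width, height, providers
                )
            kept.append(chunk)
        finally:
            chunk.pop("image_bytes", None)  # runs on every path, including continue
    return kept
```

Putting the pop in a finally clause means the bytes are removed even when a chunk is skipped or a step raises, which is the property the last two checks rely on.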

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

nancysangani · 2026-06-06T06:29:13Z

Hi @param20h, please review the PR when you get a chance. Thanks!

feat(rag): add extraction and indexing of PDF image captions

4fc43dc

nancysangani requested a review from param20h as a code owner June 6, 2026 06:25

nancysangani wants to merge 1 commit into param20h:dev from nancysangani:feat/pdf-image-caption-indexing
