
feat(rag): add extraction and indexing of PDF image captions #490

Open
nancysangani wants to merge 1 commit into param20h:dev from nancysangani:feat/pdf-image-caption-indexing


@nancysangani
Contributor

🔗 Related Issue

Closes #438


📝 What does this PR do?

Completes image caption extraction and indexing for the RAG pipeline across two files:

backend/app/rag/vision.py (rewrite of existing stub):

  • Adds extract_captions_from_pdf() — uses PyMuPDF bounding-box proximity to find
    the nearest text block below (or above) each figure. Zero cost, fully offline.
  • Fixes the broken OpenAI hook: openai.Image.create is an image-generation call,
    not a vision call. It is replaced with a gpt-4o-mini request through the Chat
    Completions API with image input, guarded behind VISION_PROVIDER=openai +
    OPENAI_API_KEY in settings.
  • Restructures caption_image() resolution order:
    OpenAI (if configured) → OCR (pytesseract) → size-aware placeholder.
  • generate_captions_for_chunks() now respects pre-extracted image_caption
    written by the ingestion pass and only falls back to OCR/placeholder for gaps.
  • Raw image_bytes are always popped from chunks in the finally block —
    they are never serialised into ChromaDB.
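
For reviewers, the proximity rule described above can be sketched in plain Python. This is a minimal illustration, not the code in vision.py: the function names, the MAX_GAP threshold, and the tuple layout are assumptions, and the area cutoff simply mirrors the 1,000 px² figure from the testing notes. In PyMuPDF the text boxes would typically come from page.get_text("blocks"); here they are passed in directly so the sketch stays self-contained.

```python
from typing import Optional, Sequence, Tuple

# (x0, y0, x1, y1) with y growing downward, as in PyMuPDF page coordinates.
BBox = Tuple[float, float, float, float]

MIN_IMAGE_AREA = 1_000.0  # assumption: mirrors the "below 1,000 px²" skip rule
MAX_GAP = 60.0            # hypothetical: max vertical distance figure <-> caption


def horizontal_overlap(a: BBox, b: BBox) -> float:
    """Width of the horizontal span shared by two boxes (0 if none)."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def nearest_caption(
    image: BBox, text_blocks: Sequence[Tuple[BBox, str]]
) -> Optional[str]:
    """Return the closest text block below the image, else the closest above."""
    width = image[2] - image[0]
    height = image[3] - image[1]
    if width * height < MIN_IMAGE_AREA:
        return None  # treat tiny images as decorative

    below, above = [], []
    for bbox, text in text_blocks:
        text = text.strip()
        if not text or horizontal_overlap(image, bbox) <= 0:
            continue  # empty block, or not in the same column as the figure
        gap_below = bbox[1] - image[3]  # text top minus image bottom
        gap_above = image[1] - bbox[3]  # image top minus text bottom
        if 0 <= gap_below <= MAX_GAP:
            below.append((gap_below, text))
        elif 0 <= gap_above <= MAX_GAP:
            above.append((gap_above, text))

    # Prefer a block below the figure; fall back to one above it.
    for candidates in (below, above):
        if candidates:
            return min(candidates, key=lambda c: c[0])[1]
    return None
```

Preferring the block below reflects the common layout where captions sit under figures; the "above" branch only applies when nothing suitable is found underneath.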

backend/app/services/document_ingestion.py (surgical addition):

  • Adds a proximity caption pass between chunk_document() and store_chunks().
  • Calls extract_captions_from_pdf(), matches captions to image chunks by
    (page, figure_index) order, and writes image_caption + bbox before
    store_chunks() / generate_captions_for_chunks() run.
  • Wrapped in try/except — a caption failure never blocks ingestion.
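
The matching step above could look roughly like the following. This is a sketch under stated assumptions rather than the code in document_ingestion.py: the helper name attach_captions and the caption record keys ("page", "figure_index", "caption") are hypothetical, while is_image, image_caption, and bbox are the fields named in this PR.

```python
import logging
from collections import defaultdict
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def attach_captions(
    chunks: List[Dict[str, Any]], captions: List[Dict[str, Any]]
) -> int:
    """Write image_caption + bbox onto image chunks, keyed by (page, figure_index).

    Returns the number of chunks that received a caption. Any error is logged
    and swallowed so that a caption failure cannot block ingestion.
    """
    attached = 0
    try:
        # Hypothetical record shape: {"page", "figure_index", "caption", "bbox"}.
        lookup = {(c["page"], c["figure_index"]): c for c in captions}
        seen: Dict[int, int] = defaultdict(int)  # image chunks seen per page

        for chunk in chunks:
            if not chunk.get("is_image"):
                continue  # text and table chunks are left untouched
            page = chunk["page"]
            key = (page, seen[page])  # n-th image chunk on this page
            seen[page] += 1
            cap = lookup.get(key)
            if cap and cap.get("caption"):
                chunk["image_caption"] = cap["caption"]
                chunk["bbox"] = cap["bbox"]
                attached += 1
    except Exception:
        logger.exception("caption pass failed; continuing ingestion")
    return attached
```

Looking captions up by key rather than by list position keeps the pairing stable when some figures on a page have no caption record.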

🗂️ Type of Change

  • ✨ New feature
  • 🔧 Refactor / code cleanup

🧪 How was this tested?

  • Ran the backend locally (uvicorn app.main:app --reload)
  • Uploaded a PDF with labelled figures; confirmed ChromaDB stored chunks
    with is_image=True and non-empty image_caption matching adjacent text
  • Verified images with no adjacent text receive the size-aware placeholder
  • Confirmed decorative images below 1,000 px² are skipped
  • Confirmed raw image_bytes are never present in stored chunk metadata
  • Confirmed existing text and table chunks are unaffected (no regression)
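
The placeholder and image_bytes checks above rely on the fallback order and the cleanup in the finally block. A minimal sketch of that behaviour, useful as a starting point for an automated test, is shown below. The function names, the provider-callable interface, the width/height keys, and the placeholder wording are all assumptions made for illustration; the real providers (OpenAI, pytesseract) are stood in for by plain callables.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

# A provider takes raw image bytes and returns a caption, or None if it has none.
Captioner = Callable[[bytes], Optional[str]]


def placeholder(width: int, height: int) -> str:
    """Size-aware placeholder; the exact wording here is hypothetical."""
    return f"[image {width}x{height}px, no caption available]"


def caption_with_fallback(
    image_bytes: bytes, width: int, height: int, providers: Sequence[Captioner]
) -> str:
    """Try each provider in order (e.g. OpenAI, then OCR), then the placeholder."""
    for provider in providers:
        try:
            text = provider(image_bytes)
        except Exception:
            continue  # a failing provider just hands over to the next one
        if text and text.strip():
            return text.strip()
    return placeholder(width, height)


def fill_missing_captions(
    chunks: List[Dict[str, Any]], providers: Sequence[Captioner]
) -> List[Dict[str, Any]]:
    """Keep pre-extracted captions, fill gaps, and always drop image_bytes."""
    for chunk in chunks:
        if not chunk.get("is_image"):
            continue
        try:
            if not chunk.get("image_caption"):
                chunk["image_caption"] = caption_with_fallback(
                    chunk.get("image_bytes", b""),
                    chunk.get("width", 0),
                    chunk.get("height", 0),
                    providers,
                )
        finally:
            # Raw bytes must never reach the vector store.
            chunk.pop("image_bytes", None)
    return chunks
```

A test built on this shape can assert that no stored chunk contains an image_bytes key, whichever provider succeeded or failed.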

✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@nancysangani requested a review from param20h as a code owner (June 6, 2026 06:25)
@nancysangani
Contributor Author

Hi @param20h, please review the PR when you get a chance. Thanks!


Development

Successfully merging this pull request may close these issues.

feat(rag): Add support for extracting and indexing image captions from PDFs