[FEAT] : add PDF Image Parsing Memory Optimizations (OOM Prevention) by hrshjswniii · Pull Request #516 · param20h/PDF-Assistant-RAG

hrshjswniii · 2026-06-07T13:50:29Z

🔗 Related Issue

Closes #487

📝 What does this PR do?

This PR reduces the risk of out-of-memory (OOM) failures during PDF ingestion by changing how images are parsed and captioned:

Page-by-Page Generator: Modified extract_pdf_images to be a generator that yields images lazily, page-by-page.
On-the-fly Captioning: Caption generation is now executed during the main page chunking loop in chunk_document. Raw image byte strings are processed, transformed into text descriptions, and freed immediately.
Memory Leak Safeguards: Wrapped the cleanup of the local image variable in a finally block, so the reference to the raw binary data is released page by page even if captioning raises an exception.
Robust Ingestion Fallback: Added an exception handler inside chunk_document so that if the captioning model/API fails (e.g. rate limit, network timeout), the image isn't dropped. Instead, it falls back to a placeholder text chunk, "Image on page X.", so the image's presence in the document is still recorded.
Unit Testing: Added test_pdf_image_captioning_on_the_fly to verify the generator loop behavior, error fallbacks, and byte-clearing correctness.
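
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names extract_pdf_images and chunk_document come from the description, but their signatures, the Page type, the caption_image callback, and the chunk dictionary layout are all assumptions made for the example. A real extractor would read image bytes from the PDF on demand instead of from an in-memory list.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a parsed PDF: (page_number, [raw image bytes, ...]).
Page = Tuple[int, List[bytes]]


def extract_pdf_images(pages: Iterable[Page]) -> Iterator[Tuple[int, bytes]]:
    """Yield (page_number, image_bytes) lazily, one image at a time."""
    for page_number, images in pages:
        for image_bytes in images:
            yield page_number, image_bytes


def chunk_document(
    pages: Iterable[Page],
    caption_image: Callable[[bytes], str],  # assumed captioner interface
) -> List[Dict[str, object]]:
    """Caption each image as it is yielded and keep only text in the result."""
    chunks: List[Dict[str, object]] = []
    for page_number, image_bytes in extract_pdf_images(pages):
        try:
            text = caption_image(image_bytes)
        except Exception:
            # Captioner failed (rate limit, timeout, ...): keep a placeholder
            # so the image is not silently dropped.
            text = f"Image on page {page_number}."
        finally:
            # Drop this function's reference to the raw bytes before moving on.
            image_bytes = None
        chunks.append({"page": page_number, "text": text})
    return chunks
```

Because the returned chunks hold only the page number and caption text, the raw bytes for one image are no longer referenced by the chunker once the next image is requested.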

🗂️ Type of Change

🐛 Bug fix
🔧 Refactor / code cleanup
🧪 Tests

🧪 How was this tested?

Tested the affected API endpoints manually
Added / updated tests (Created test_pdf_image_captioning_on_the_fly in backend/tests/test_chunker.py)
Ran full backend test suite (.venv/Scripts/pytest — all 120 tests passed)

⚠️ Anything to flag for reviewers?

The old in-place captioning pass, generate_captions_for_chunks in vision.py, is kept for API compatibility. It is now bypassed during PDF chunking, because the chunks returned by chunk_document already contain captioned image descriptions and no raw byte payloads.
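
To illustrate why a legacy pass of this kind has nothing left to do on pre-captioned chunks, here is a hedged sketch. It is not the function from vision.py: the signature, the dictionary-based chunk layout, and the image_bytes key are hypothetical, chosen only to show the idea of skipping chunks that carry no raw payload.

```python
from typing import Callable, Dict, List


def generate_captions_for_chunks(
    chunks: List[Dict[str, object]],
    caption_image: Callable[[bytes], str],  # assumed captioner interface
) -> List[Dict[str, object]]:
    """Legacy-style pass: caption only chunks that still carry raw bytes."""
    for chunk in chunks:
        # "image_bytes" is a hypothetical key name for the raw payload.
        raw = chunk.pop("image_bytes", None)
        if raw is None:
            # Already captioned upstream; nothing to do.
            continue
        chunk["text"] = caption_image(raw)
    return chunks
```

Under these assumptions, passing pre-captioned chunks through the function leaves them unchanged and never calls the captioner.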

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

[FEAT] : add PDF Image Parsing Memory Optimizations (OOM Prevention)

f170f4c

hrshjswniii requested a review from param20h as a code owner June 7, 2026 13:50

param20h approved these changes Jun 7, 2026

param20h merged commit bc47827 into param20h:dev Jun 7, 2026
7 checks passed

github-actions bot added the labels enhancement (New feature or improvement), gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), and mentor:param20h (Mentor for this PR) on Jun 7, 2026

param20h added the labels level:advanced (+55 pts) and type:feature (+10 pts) and removed the label enhancement (New feature or improvement) on Jun 7, 2026

param20h merged 1 commit into param20h:dev from hrshjswniii:feature/OOM-Prevention