
[FEAT] : add PDF Image Parsing Memory Optimizations (OOM Prevention) #516

Merged: param20h merged 1 commit into param20h:dev from hrshjswniii:feature/OOM-Prevention on Jun 7, 2026

Conversation

@hrshjswniii
Contributor

🔗 Related Issue


Closes #487



📝 What does this PR do?


This PR reduces the risk of memory exhaustion (OOM) during PDF ingestion by changing how images are parsed and captioned:

  • Page-by-Page Generator: extract_pdf_images is now a generator that yields images lazily, one page at a time, instead of returning every image in the document at once.
  • On-the-fly Captioning: Captions are now generated inside the main page chunking loop in chunk_document. Each raw image byte string is converted to a text description and released immediately.
  • Memory Leak Safeguards: Cleanup of the local image variables is wrapped in a finally block, so references to raw binary data are dropped page by page even if captioning raises an exception.
  • Robust Ingestion Fallback: Added an exception handler inside chunk_document so that if the captioning model/API fails (e.g. rate limit, network timeout), the image is not dropped. It is replaced by a placeholder text chunk, "Image on page X.", so the document still records that an image was present.
  • Unit Testing: Added test_pdf_image_captioning_on_the_fly to verify the generator loop behavior, the error fallback, and that image bytes are cleared.
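The flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function bodies, the `pages` argument (a stand-in for a real PDF reader), the `caption_image` callable, and the chunk dictionary keys are all assumptions made for the example. Only the function names and the placeholder text come from the description.

```python
from typing import Callable, Iterable, Iterator


def extract_pdf_images(pages: Iterable[list[bytes]]) -> Iterator[tuple[int, list[bytes]]]:
    """Yield (page_number, images) lazily, one page at a time.

    `pages` is a hypothetical stand-in for a PDF reader: each item is the
    list of raw image byte strings found on one page.
    """
    for page_number, images in enumerate(pages, start=1):
        yield page_number, images


def chunk_document(
    pages: Iterable[list[bytes]],
    caption_image: Callable[[bytes], str],  # hypothetical captioning callable
) -> list[dict]:
    """Caption images page by page; returned chunks hold text only."""
    chunks: list[dict] = []
    for page_number, images in extract_pdf_images(pages):
        image_bytes = None
        try:
            for image_bytes in images:
                try:
                    text = caption_image(image_bytes)
                except Exception:
                    # Captioner failed (rate limit, timeout, ...): keep a placeholder
                    # so the image is still represented in the output.
                    text = f"Image on page {page_number}."
                chunks.append({"page": page_number, "type": "image", "text": text})
        finally:
            # Drop this page's references to raw bytes before the next page is read.
            # Note: clear() empties the list in place, including for the caller.
            image_bytes = None
            images.clear()
    return chunks
```

Because `extract_pdf_images` is a generator, at most one page's images are referenced by the loop at any time, provided `pages` is itself produced lazily.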


🗂️ Type of Change


  • 🐛 Bug fix
  • 🔧 Refactor / code cleanup
  • 🧪 Tests


🧪 How was this tested?


  • Tested the affected API endpoints manually
  • Added / updated tests (Created test_pdf_image_captioning_on_the_fly in backend/tests/test_chunker.py)
  • Ran full backend test suite (.venv/Scripts/pytest — all 120 tests passed)
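For readers who want a feel for what a test of this behavior might check, here is a hedged sketch. It is not the test added in backend/tests/test_chunker.py; `caption_pages`, `fake_pages`, and `fake_caption` are hypothetical stand-ins written only for this example, and the test name is invented.

```python
from typing import Callable, Iterator


def caption_pages(pages: Iterator[list[bytes]], caption: Callable[[bytes], str]) -> list[str]:
    """Stand-in for the captioning loop (assumption, not the project's code)."""
    texts = []
    for page_number, images in enumerate(pages, start=1):
        try:
            for image in images:
                try:
                    texts.append(caption(image))
                except Exception:
                    texts.append(f"Image on page {page_number}.")
        finally:
            images.clear()  # release this page's raw bytes
    return texts


def test_captioning_is_lazy_with_fallback():
    produced = []      # page numbers the generator has yielded so far
    seen_at_call = []  # how many pages had been produced at each caption call
    kept = []          # references to each page's image list

    def fake_pages():
        for n in (1, 2):
            images = [b"img%d" % n]
            produced.append(n)
            kept.append(images)
            yield images

    def fake_caption(data):
        seen_at_call.append(len(produced))
        if data == b"img2":
            raise TimeoutError("simulated network timeout")
        return "caption for " + data.decode()

    texts = caption_pages(fake_pages(), fake_caption)

    # The failing image falls back to the placeholder instead of being dropped.
    assert texts == ["caption for img1", "Image on page 2."]
    # Page 2 had not been read yet when page 1 was captioned.
    assert seen_at_call == [1, 2]
    # Each page's image list was emptied after processing.
    assert kept == [[], []]
```

Run under pytest, this exercises the three properties the PR description names: lazy page-by-page iteration, the error fallback, and clearing of image bytes.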


⚠️ Anything to flag for reviewers?


  • The old generate_captions_for_chunks function in vision.py, which walked the chunk list and captioned images in place, is kept for API compatibility. It is now effectively bypassed during PDF chunking, because the chunks returned by chunk_document already contain captioned image descriptions and no raw byte payloads.
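One way such a bypass can work is shown below, purely as an illustration. The signature, the `image_bytes` key, and the chunk layout are hypothetical and are not taken from vision.py; the point is only that a pass which captions chunks carrying raw bytes has nothing to do when none are present.

```python
from typing import Callable


def generate_captions_for_chunks(chunks: list[dict], caption: Callable[[bytes], str]) -> list[dict]:
    """Caption, in place, any chunk that still carries raw image bytes.

    Stand-in with an assumed signature; "image_bytes" is a hypothetical key.
    """
    for chunk in chunks:
        raw = chunk.pop("image_bytes", None)
        if raw is not None:
            chunk["text"] = caption(raw)
    return chunks


calls = []


def counting_caption(raw: bytes) -> str:
    """Record each call so we can see whether the captioner was used."""
    calls.append(raw)
    return f"caption for {len(raw)} bytes"


# Chunk shaped like the new chunk_document output: text only, no raw bytes.
pre_captioned = [{"page": 1, "text": "A bar chart of monthly revenue."}]
# Chunk shaped like the old flow: raw bytes still attached.
legacy = [{"page": 2, "image_bytes": b"\x89PNG"}]

generate_captions_for_chunks(pre_captioned, counting_caption)
generate_captions_for_chunks(legacy, counting_caption)
```

In this sketch the captioner is called once, for the legacy chunk only; the pre-captioned chunk passes through unchanged.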


✅ Self-Review Checklist


  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@hrshjswniii requested a review from param20h as a code owner on June 7, 2026, 13:50
@param20h param20h merged commit bc47827 into param20h:dev Jun 7, 2026
7 checks passed
@github-actions (bot) added labels on Jun 7, 2026: enhancement (New feature or improvement), gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), mentor:param20h (Mentor for this PR)
@param20h added labels level:advanced (+55 pts) and type:feature (+10 pts), and removed enhancement (New feature or improvement), on Jun 7, 2026
