Feature/optimize summary pipeline 479 by suhaniiz · Pull Request #481 · param20h/PDF-Assistant-RAG

suhaniiz · 2026-06-05T18:13:11Z

📋 PR Checklist

Thank you for contributing to PDF-Assistant-RAG! 🎉
Please fill out this template before submitting. PRs without it filled in will be closed.

🔗 Related Issue

Closes #479

📝 What does this PR do?

Optimizes the generate_document_summary pipeline by allowing it to accept pre-extracted chunks directly.

Previously, the pipeline forced a redundant disk I/O and re-chunking operation by calling chunk_document(filePath) inside the summary generator, even if the document had already been processed. This PR refactors the function to take an optional chunks list, bypassing redundant CPU/disk usage while maintaining full backward compatibility with a fallback mechanism if only a filePath is provided.

It also adds defensive checks against TypeError when chunk dictionaries contain empty or malformed text fields.
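
For illustration, the behaviour described above could look roughly like the sketch below. It is not the code from this PR: the `chunk_document` body, the `"text"` key, the `ValueError` raised when neither argument is given, and the naive sentence splitting are all assumptions made so the example runs on its own.

```python
from __future__ import annotations

import re


def chunk_document(filePath: str) -> list[dict]:
    """Hypothetical stand-in for the project's chunker (splits on blank lines)."""
    with open(filePath, encoding="utf-8") as fh:
        raw = fh.read()
    return [{"text": part} for part in raw.split("\n\n")]


def generate_document_summary(
    filePath: str | None = None,
    max_sentences: int = 3,
    chunks: list[dict] | None = None,
) -> str:
    # Fast path: reuse chunks the caller already extracted.
    # Fallback: re-chunk from disk only when no chunks were passed.
    if chunks is None:
        if filePath is None:
            # Assumption: the real function may handle this case differently.
            raise ValueError("Provide either chunks or filePath")
        chunks = chunk_document(filePath)

    texts = []
    for chunk in chunks:
        # Defensive check: skip chunks whose text is missing, None, or not a string.
        text = chunk.get("text") if isinstance(chunk, dict) else None
        if isinstance(text, str) and text.strip():
            texts.append(text.strip())

    # Placeholder summarizer: keep the first max_sentences sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", " ".join(texts)) if s]
    return " ".join(sentences[:max_sentences])
```

Calling `generate_document_summary(chunks=already_extracted)` never touches the disk in this sketch, while `generate_document_summary("doc.txt")` still works through the fallback.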

🗂️ Type of Change

🧪 How was this tested?

Tested the affected API endpoints manually
Added / updated tests

📸 Screenshots (if UI change)

N/A (Backend optimization)

⚠️ Anything to flag for reviewers?

The function signature was changed to generate_document_summary(filePath: str | None = None, max_sentences: int = 3, chunks: list[dict] | None = None). This maintains full backward compatibility for any existing background tasks or endpoints that pass filePath positionally or via keywords without updating their payload structure immediately.
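
As a quick check of the compatibility claim, the sketch below binds a few call styles against a stub that carries the quoted signature. The stub body, the `bound` helper, and the `report.pdf` argument are placeholders for illustration, not code from this branch.

```python
from __future__ import annotations

import inspect


def generate_document_summary(
    filePath: str | None = None,
    max_sentences: int = 3,
    chunks: list[dict] | None = None,
) -> str:
    """Stub with the signature quoted above; the body is a placeholder."""
    return "chunks" if chunks is not None else "filePath"


sig = inspect.signature(generate_document_summary)


def bound(*args, **kwargs) -> dict:
    """Hypothetical helper: show which parameter each argument binds to."""
    b = sig.bind(*args, **kwargs)
    b.apply_defaults()
    return dict(b.arguments)


# Existing callers: chunks was appended last, so earlier positions keep their meaning.
legacy_positional = bound("report.pdf", 5)
legacy_keyword = bound(filePath="report.pdf")

# New callers: pass pre-extracted chunks by keyword and omit filePath.
new_style = bound(chunks=[{"text": "Already extracted."}])
```

Because `chunks` comes after `max_sentences`, a caller that passes `(filePath, max_sentences)` positionally binds exactly as before.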

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

suhaniiz · 2026-06-05T18:13:48Z

Hey @param20h, this PR is under GSSoC 2026!

suhaniiz · 2026-06-07T06:12:24Z

@param20h, kindly review it and let me know if any changes are needed!

param20h · 2026-06-07T10:04:41Z

Checks are failing, @suhaniiz.

suhaniiz · 2026-06-07T10:17:00Z

Hi @param20h,

It looks like the Playwright E2E tests workflow failed during the authentication flow (auth-and-chat.spec.ts).

I wanted to confirm that this failure is unrelated to the changes in this branch. This PR is strictly scoped to adding defensive isinstance(text, str) data-type validation inside the backend's generate_document_summary function. Because it doesn't touch any frontend routing, login selectors, or authentication flows, it should not be able to affect this test.

The failure appears to be due to an intermittent CI environment timeout or an existing issue on the base branch. Since I don't have write permissions to trigger a workflow re-run on this repository, feel free to either restart the failed job or proceed with merging the changes directly.

Thank you!

suhaniiz added 2 commits June 5, 2026 23:30

test(backend): implement end-to-end integration tests for celery inge…

04014d6

…stion task closes param20h#448

perf: pass pre-extracted chunks to minimize disk I/O overhead param20…

d9928a6

…h#479

suhaniiz requested a review from param20h as a code owner June 5, 2026 18:13

suhaniiz wants to merge 2 commits into param20h:dev from suhaniiz:feature/optimize-summary-pipeline-479
