Feature/optimize summary pipeline 479 #481

Open
suhaniiz wants to merge 2 commits into param20h:dev from suhaniiz:feature/optimize-summary-pipeline-479

Conversation

suhaniiz commented Jun 5, 2026

📋 PR Checklist

Thank you for contributing to PDF-Assistant-RAG! 🎉
Please fill out this template before submitting. PRs without it filled in will be closed.


🔗 Related Issue

Closes #479


📝 What does this PR do?

Optimizes the generate_document_summary pipeline by allowing it to accept pre-extracted chunks directly.

Previously, the summary generator always called chunk_document(filePath) internally, forcing a redundant disk read and re-chunking pass even when the document had already been processed. This PR refactors the function to accept an optional chunks list, which skips that redundant CPU and disk work. If only a filePath is provided, the function falls back to chunking the file itself, so existing callers remain fully backward compatible.

This PR also adds defensive checks that prevent a TypeError when a chunk dictionary has an empty or malformed text field.
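
One way such a defensive check could look, as a sketch rather than the code in this PR. The helper name safe_chunk_texts and the "text" key are assumptions made for illustration.

```python
def safe_chunk_texts(chunks: list[dict]) -> list[str]:
    """Hypothetical helper: collect only usable, non-empty string text fields."""
    texts: list[str] = []
    for chunk in chunks:
        # Skip entries that are not dictionaries at all.
        text = chunk.get("text") if isinstance(chunk, dict) else None
        # Skip missing, None, non-string, or whitespace-only values
        # instead of letting string operations raise TypeError later.
        if isinstance(text, str) and text.strip():
            texts.append(text.strip())
    return texts
```

For example, `safe_chunk_texts([{"text": " hi "}, {"text": None}, {"text": 5}, {}])` keeps only the first entry.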


🗂️ Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 🔧 Refactor / code cleanup
  • 📝 Documentation update
  • 🎨 UI / styling change
  • ⚙️ CI / tooling / config change
  • 🧪 Tests

🧪 How was this tested?

  • Tested the affected API endpoints manually
  • Added / updated tests

📸 Screenshots (if UI change)

N/A (Backend optimization)


⚠️ Anything to flag for reviewers?

The function signature was changed to generate_document_summary(filePath: str | None = None, max_sentences: int = 3, chunks: list[dict] | None = None). Because chunks is the last parameter and is optional, existing background tasks or endpoints that pass filePath positionally or by keyword keep working without any immediate changes to their calls.


✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@suhaniiz suhaniiz requested a review from param20h as a code owner June 5, 2026 18:13
Author

suhaniiz commented Jun 5, 2026

Hey @param20h, this PR is under GSSoC 2026!

Author

suhaniiz commented Jun 7, 2026

@param20h, kindly review it and let me know if any changes are needed!

Owner

param20h commented Jun 7, 2026

Checks are failing, @suhaniiz

Author

suhaniiz commented Jun 7, 2026

Hi @param20h,

It looks like the Playwright E2E tests workflow failed during the authentication flow (auth-and-chat.spec.ts).

I wanted to confirm that this failure appears to be unrelated to the changes in this branch. This PR only changes the backend's generate_document_summary function, including the defensive isinstance(text, str) data-type validation. Because it doesn't touch any frontend routing, login selectors, or authentication flows, it should not affect the E2E authentication tests.

The failure appears to be due to an intermittent CI environment timeout or an existing issue on the base branch. Since I don't have write permissions to trigger a workflow re-run on this repository, feel free to either restart the failed job or proceed with merging the changes directly.

Thank you!

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Optimize generate_document_summary pipeline by passing pre-extracted chunks

2 participants