Feature/optimize summary pipeline 479 #481
|
hey @param20h , this PR is under GSSoC 2026! |
|
@param20h , kindly review it and lemme know if changes are to be made! |
|
checks failing @suhaniiz |
|
Hi @param20h, it looks like a check is failing. I wanted to confirm that this failure is unrelated to the changes in this branch. This PR is strictly scoped to the summary pipeline optimization and the defensive checks described in the PR description. The failure appears to be due to an intermittent CI environment timeout or an existing issue on the base branch. Since I don't have write permissions to trigger a workflow re-run on this repository, feel free to either restart the failed job or proceed with merging the changes directly. Thank you! |
📋 PR Checklist
🔗 Related Issue
Closes #479
📝 What does this PR do?
Optimizes the `generate_document_summary` pipeline by allowing it to accept pre-extracted chunks directly.
Previously, the pipeline forced a redundant disk I/O and re-chunking operation by calling `chunk_document(filePath)` inside the summary generator, even if the document had already been processed. This PR refactors the function to take an optional `chunks` list, bypassing redundant CPU/disk usage while maintaining full backward compatibility with a fallback mechanism if only a `filePath` is provided.
Additionally, added defensive checks against `TypeError` when handling empty or malformed text fields within chunk dictionaries.
🗂️ Type of Change
🧪 How was this tested?
📸 Screenshots (if UI change)
N/A (Backend optimization)
The function signature was changed to `generate_document_summary(filePath: str | None = None, max_sentences: int = 3, chunks: list[dict] | None = None)`. This maintains full backward compatibility for any existing background tasks or endpoints that pass `filePath` positionally or via keywords without updating their payload structure immediately.
✅ Self-Review Checklist
`dev`, not `main`
`main` branch or any HuggingFace deployment config
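For reviewers, the chunks-first fallback and the defensive text checks described in this PR can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this branch: the body of `chunk_document`, the `"text"` key on each chunk, the `ValueError` raised when neither argument is given, and the "first few sentences" summarization step are all hypothetical stand-ins used only for illustration.

```python
from __future__ import annotations


def chunk_document(file_path: str) -> list[dict]:
    """Hypothetical stand-in: the real function reads and chunks the file on disk."""
    with open(file_path, encoding="utf-8") as fh:
        return [{"text": line.strip()} for line in fh if line.strip()]


def generate_document_summary(
    filePath: str | None = None,
    max_sentences: int = 3,
    chunks: list[dict] | None = None,
) -> str:
    # Prefer pre-extracted chunks; only fall back to disk I/O when none are given.
    if chunks is None:
        if filePath is None:
            # Assumption: the real code may handle this case differently.
            raise ValueError("Provide either chunks or filePath")
        chunks = chunk_document(filePath)

    sentences: list[str] = []
    for chunk in chunks:
        # Defensive check: skip non-dict chunks and missing, None, or non-string text.
        text = chunk.get("text") if isinstance(chunk, dict) else None
        if not isinstance(text, str) or not text.strip():
            continue
        sentences.extend(s.strip() for s in text.split(".") if s.strip())

    # Placeholder summarization (assumption): keep the first few sentences.
    return ". ".join(sentences[:max_sentences]) + ("." if sentences else "")
```

Existing callers can keep calling `generate_document_summary(path)` or `generate_document_summary(filePath=path)`, while callers that already hold chunks can pass `chunks=...` and skip the re-chunking step.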