fix: exclude soft-deleted documents from RAG retrieval #403
Open
just-tanvi wants to merge 44 commits into
feat(rag): Add OCR support for scanning image-based PDFs
…port Revert "feat(rag): Add OCR support for scanning image-based PDFs"
chore: sync main with dev
…into fix-upload-50mb-limit-363
Fix duplicate email check in auth profile update
…ion-registration feat: add email verification flow
feat(ui): add text-to-speech control for AI responses
feat(api): build user profile and avatar upload endpoints
…breadcrumbs feat(drive): add sync progress bar and folder breadcrumbs (param20h#276)
feat: implement keyboard shortcuts and help UI modal param20h#28
fix: validate HF token before initializing clients
…t-363 fix(upload): align document upload limit with advertised 50MB maximum
feat: add empty state UI component
…ge-link feat(ui): clicking logo navigates to homepage param20h#364
Owner
rebase and pr correctly
Summary
This PR fixes an issue where soft-deleted documents (is_deleted = True) could still be retrieved and used as context during general (unscoped) chat queries.
Although deleted documents were hidden from the application through SQLite metadata, their ChromaDB embeddings and BM25 index entries remained available for retrieval. This allowed content from deleted documents to appear in responses and source citations.
The retrieval pipeline now filters results to include only active documents, ensuring deleted content is no longer surfaced during searches.
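As a rough illustration of the filtering described above (not the code in this PR), the sketch below resolves the active document IDs from SQLite metadata and drops hits that belong to soft-deleted documents. The `documents` table layout and the helper names `get_active_document_ids` and `drop_deleted_hits` are assumptions made for the example; only the `is_deleted` flag comes from the description.

```python
import sqlite3
from typing import List, Set, Tuple

Hit = Tuple[str, float]  # (document id, relevance score)


def get_active_document_ids(conn: sqlite3.Connection) -> Set[str]:
    """Return IDs of documents that are not soft-deleted (hypothetical schema)."""
    rows = conn.execute("SELECT id FROM documents WHERE is_deleted = 0").fetchall()
    return {row[0] for row in rows}


def drop_deleted_hits(hits: List[Hit], active_ids: Set[str]) -> List[Hit]:
    """Keep only hits whose document is still active, preserving order."""
    return [(doc_id, score) for doc_id, score in hits if doc_id in active_ids]


def demo() -> List[Hit]:
    # In-memory database standing in for the application's metadata store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id TEXT PRIMARY KEY, is_deleted INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [("doc-1", 0), ("doc-2", 1)],  # doc-2 is soft-deleted
    )
    # Raw hits as an index might return them, still containing the deleted doc.
    raw_hits = [("doc-2", 0.91), ("doc-1", 0.80)]
    return drop_deleted_hits(raw_hits, get_active_document_ids(conn))
```

Here `demo()` keeps only the hit for `doc-1`, even though the soft-deleted `doc-2` scored higher in the raw results.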
Changes
BM25 Retrieval
query_bm25() now accepts an optional list of document IDs.
Hybrid Retrieval
CustomVectorRetriever and CustomBM25Retriever.
Tests
Type of Change
Testing
Added Tests
test_retrieve_includes_active_but_excludes_deleted
Validation
python -m pytest: all 106 tests passed.
Notes for Reviewers
The fix is implemented at the retrieval entry point so that filtering is applied consistently across both ChromaDB and BM25 retrieval paths. This keeps the change focused while ensuring soft-deleted documents cannot influence retrieval results.
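A minimal sketch of what filtering at the retrieval entry point could look like, assuming each retrieval path accepts an optional list of allowed document IDs. The names `make_keyword_retriever` and `hybrid_retrieve` are hypothetical, and the keyword scorer is a toy stand-in for both the ChromaDB and BM25 paths, not the project's retrievers.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Hit = Tuple[str, float]  # (document id, score)
Retriever = Callable[[str, Optional[List[str]]], List[Hit]]


def make_keyword_retriever(index: Dict[str, str]) -> Retriever:
    """Build a toy retriever over {doc_id: text}; stands in for a real index."""

    def query(text: str, doc_ids: Optional[List[str]] = None) -> List[Hit]:
        # None means "no restriction"; an empty list means "nothing is allowed".
        allowed = None if doc_ids is None else set(doc_ids)
        terms = text.lower().split()
        hits: List[Hit] = []
        for doc_id, body in index.items():
            if allowed is not None and doc_id not in allowed:
                continue  # skip documents outside the allowed set
            words = body.lower().split()
            score = float(sum(words.count(term) for term in terms))
            if score > 0:
                hits.append((doc_id, score))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    return query


def hybrid_retrieve(
    text: str, active_ids: Iterable[str], retrievers: List[Retriever]
) -> List[str]:
    """Entry point: pass the same allowed IDs to every retrieval path."""
    allowed = list(active_ids)
    best: Dict[str, float] = {}
    for retrieve in retrievers:
        for doc_id, score in retrieve(text, allowed):
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return sorted(best, key=lambda doc_id: best[doc_id], reverse=True)
```

Because the allowed IDs are resolved once and handed to every path, a document missing from that list cannot reach the merged results through either index.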
Self-Review Checklist
dev
Closes [BUG] : Soft-deleted documents are retrieved and queried during general/multi-document search #399