
fix: exclude soft-deleted documents from RAG retrieval #403

Open
just-tanvi wants to merge 44 commits into param20h:dev from just-tanvi:fix/exclude-deleted-documents


@just-tanvi


Summary

This PR fixes an issue where soft-deleted documents (is_deleted = True) could still be retrieved and used as context during general (unscoped) chat queries.

Although deleted documents were hidden in the application through their SQLite metadata, their embeddings in ChromaDB and their entries in the BM25 indices remained searchable. As a result, content from deleted documents could appear in responses and source citations.

The retrieval pipeline now filters results to include only active documents, ensuring deleted content is no longer surfaced during searches.


Changes

BM25 Retrieval

  • Updated query_bm25() to accept an optional list of document IDs.
  • When document IDs are provided, BM25 retrieval is limited to indices belonging to those documents.
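
The optional filter described above can be illustrated with a minimal sketch. It is not the PR's implementation: `query_bm25_sketch`, the `INDICES` dictionary, and the term-overlap score (a simple stand-in for real BM25 scoring) are all hypothetical, as is the convention that `None` means "no filter" while an explicit list restricts the search.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for per-document BM25 indices:
# document ID -> list of text chunks.
INDICES: Dict[str, List[str]] = {
    "doc-active": ["refund policy lasts thirty days"],
    "doc-deleted": ["old refund policy lasted ten days"],
}


def query_bm25_sketch(
    query: str,
    document_ids: Optional[List[str]] = None,
    top_k: int = 5,
) -> List[Tuple[str, str, int]]:
    """Return (document_id, chunk, score) tuples, best score first."""
    # Assumption: None searches every index; an explicit list (even an
    # empty one) limits the search to the listed documents.
    allowed = set(INDICES) if document_ids is None else set(document_ids)
    terms = set(query.lower().split())
    hits = []
    for doc_id, chunks in INDICES.items():
        if doc_id not in allowed:
            continue  # skip indices that belong to excluded documents
        for chunk in chunks:
            # Term overlap is a placeholder score, not real BM25.
            score = len(terms & set(chunk.lower().split()))
            if score > 0:
                hits.append((doc_id, chunk, score))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits[:top_k]
```

Calling `query_bm25_sketch("refund policy", ["doc-active"])` only ever looks at the index for `doc-active`, so chunks from the excluded document cannot be returned regardless of how well they match.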

Hybrid Retrieval

  • Added document ID filtering support to both CustomVectorRetriever and CustomBM25Retriever.
  • Updated the main retrieval flow to fetch active (non-deleted) document IDs for the current user before running retrieval.
  • Added an early return when no active documents are available.
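
As a rough illustration of the flow described above, the sketch below looks up active document IDs, returns early when there are none, and otherwise passes the same IDs to every retriever. The table layout (`documents` with `id`, `user_id`, `is_deleted`), the function names, and the retriever call signature are assumptions for illustration and are not taken from the PR.

```python
import sqlite3
from typing import Callable, List

# Hypothetical retriever signature: (query, allowed document IDs) -> chunks.
Search = Callable[[str, List[str]], List[str]]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table shaped like the assumed document metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id TEXT PRIMARY KEY, user_id TEXT, is_deleted INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [("a", "u1", 0), ("b", "u1", 1), ("c", "u2", 1)],
    )
    return conn


def get_active_document_ids(conn: sqlite3.Connection, user_id: str) -> List[str]:
    """Return the IDs of the user's documents that are not soft-deleted."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE user_id = ? AND is_deleted = 0 ORDER BY id",
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]


def retrieve_sketch(
    conn: sqlite3.Connection, user_id: str, query: str, searches: List[Search]
) -> List[str]:
    """Run every retriever against the same list of active document IDs."""
    active_ids = get_active_document_ids(conn, user_id)
    if not active_ids:
        # Early return: with no active documents there is nothing to search,
        # and the retrievers are never called with an empty filter.
        return []
    results: List[str] = []
    for search in searches:
        results.extend(search(query, active_ids))
    return results
```

Resolving the IDs once and handing the same list to each retriever keeps the vector and BM25 paths from drifting apart in what they consider visible.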

Tests

  • Added coverage to verify that soft-deleted documents are excluded from retrieval results.
  • Added validation that active documents continue to be retrieved correctly.
  • Updated existing database mocks where needed.
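
For readers who want a feel for this kind of coverage, here is a hedged sketch of a test in the same spirit. It is not the test added in this PR: `retrieve_with_filter` is a hypothetical stand-in for the retrieval entry point, it filters hits after the search rather than before, and the database lookup and search backend are replaced with `unittest.mock.MagicMock` objects.

```python
from unittest.mock import MagicMock


def retrieve_with_filter(query, user_id, get_active_ids, search):
    """Hypothetical entry point: keep only hits from active documents."""
    active_ids = get_active_ids(user_id)
    if not active_ids:
        return []
    allowed = set(active_ids)
    return [hit for hit in search(query) if hit["document_id"] in allowed]


def test_retrieve_includes_active_but_excludes_deleted_sketch():
    # Mocked database lookup: only one of the two documents is active.
    get_active_ids = MagicMock(return_value=["doc-active"])
    # Mocked search backend that still holds chunks from a deleted document.
    search = MagicMock(
        return_value=[
            {"document_id": "doc-active", "text": "kept"},
            {"document_id": "doc-deleted", "text": "dropped"},
        ]
    )

    hits = retrieve_with_filter("any question", "user-1", get_active_ids, search)

    assert [hit["document_id"] for hit in hits] == ["doc-active"]
    get_active_ids.assert_called_once_with("user-1")
```

Asserting on both the surviving hit and the lookup call checks the two behaviours listed above: active documents are still retrieved, and deleted ones are dropped.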

Type of Change

  • Bug fix
  • Tests

Testing

Added Tests

  • test_retrieve_includes_active_but_excludes_deleted

Validation

  • Ran the full backend test suite (python -m pytest): all 106 tests passed.
  • Backend lint checks passed.
  • Frontend lint completed successfully (warnings only, no errors).
  • Frontend type checking passed.
  • Production build completed successfully.

Notes for Reviewers

The fix is implemented at the retrieval entry point so that filtering is applied consistently across both ChromaDB and BM25 retrieval paths. This keeps the change focused while ensuring soft-deleted documents cannot influence retrieval results.
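
On the vector side, one way such an entry-point filter could reach ChromaDB is through a metadata `where` clause using the `$in` operator. The helper below only builds that dictionary and does not import or call ChromaDB. It assumes each chunk is stored with a `document_id` metadata key, which this PR description does not confirm, and `build_where_filter` is a hypothetical name.

```python
from typing import Any, Dict, List


def build_where_filter(
    active_ids: List[str], key: str = "document_id"
) -> Dict[str, Any]:
    """Build a metadata filter restricting a query to the given documents.

    Assumption: chunks carry their parent document ID under `key`.
    """
    if not active_ids:
        # Refuse to build a filter from an empty list so that "no active
        # documents" is never mistaken for "no filter". The caller is
        # expected to return early instead.
        raise ValueError("active_ids must not be empty")
    return {key: {"$in": list(active_ids)}}
```

The resulting dictionary would be passed as the `where` argument of a collection query, while the same `active_ids` list goes to the BM25 path, so both retrievers see an identical set of documents.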


Self-Review Checklist

Exodus2004 and others added 26 commits June 1, 2026 16:27
feat(rag): Add OCR support for scanning image-based PDFs
…port

Revert "feat(rag): Add OCR support for scanning image-based PDFs"
@just-tanvi just-tanvi requested a review from param20h as a code owner June 4, 2026 11:47
param20h added 3 commits June 4, 2026 21:22
Fix duplicate email check in auth profile update
…ion-registration

feat: add email verification flow
feat(ui): add text-to-speech control for AI responses
param20h and others added 15 commits June 4, 2026 21:33
feat(api): build user profile and avatar upload endpoints
…breadcrumbs

feat(drive): add sync progress bar and folder breadcrumbs (param20h#276)
feat: implement keyboard shortcuts and help UI modal param20h#28
fix: validate HF token before initializing clients
…t-363

fix(upload): align document upload limit with advertised 50MB maximum
…ge-link

feat(ui): clicking logo navigates to homepage param20h#364
@param20h
Owner

param20h commented Jun 5, 2026

rebase and pr correctly



Development

Successfully merging this pull request may close these issues.

[BUG] : Soft-deleted documents are retrieved and queried during general/multi-document search