fix: harden rag index thread safety by grtninja · Pull Request #746 · amd/gaia

grtninja · 2026-04-11T07:59:04Z

What

add RLock protection around RAG index mutation paths in src/gaia/rag/sdk.py
rebuild chunk/index state before publishing it so failed rebuilds do not leak partial state
query against consistent snapshots so concurrent reads do not mix old indexes with new chunk maps
add targeted regressions for concurrent query-vs-state-swap and failed rebuild rollback

Why

Closes #455 by tightening the shared-state contract on the RAG index surface used by concurrent chat requests.

Scope

src/gaia/rag/sdk.py
tests/test_rag.py

Validation

python -m py_compile src/gaia/rag/sdk.py tests/test_rag.py
PYTHONPATH=src pytest tests/test_rag.py -k "consistent_snapshot or rebuild_fails or document_indexing or document_querying or status_reporting or cache_functionality" -q

Notes

kept the lane focused to the issue surface only
no UI or broader agent-loop changes

grtninja · 2026-04-11T08:10:52Z

@coderabbitai review

kovtcharov · 2026-04-11T18:30:52Z

@claude review this PR.

github-actions · 2026-04-11T18:31:15Z

Claude finished @kovtcharov's task —— View job

PR Review: RAG Index Thread Safety

Read PR diff and changed files
Review src/gaia/rag/sdk.py changes
Review tests/test_rag.py changes
Provide comprehensive review feedback

Summary

This PR adds thread safety to the RAG SDK by introducing RLock protection around index mutation paths and ensuring queries operate on consistent snapshots. The implementation is well-structured and addresses the core issue of concurrent access to shared state.

Review Feedback

🟢 Strengths

Snapshot-based query isolation (sdk.py:1454-1464): The _snapshot_query_state() method correctly captures a consistent read snapshot under the lock, allowing expensive embedding/search work to proceed without holding the lock.
Atomic state publishing (sdk.py:1526-1530): The remove_document() method now builds new state structures first, then publishes them atomically only after _create_vector_index() succeeds. This prevents partial state from being visible.
Good use of RLock: Using threading.RLock() allows nested locking (e.g., reindex_document() calling remove_document() and index_document() within the same lock).
Well-targeted tests: The two new tests (test_query_uses_consistent_snapshot_during_state_swap and test_remove_document_preserves_state_when_rebuild_fails) directly validate the race conditions being fixed.

🟡 Suggestions

Deadlock risk in reindex_document() (sdk.py:1579-1595): This method holds _state_lock while calling index_document(), which also acquires _state_lock. While RLock handles this correctly, the current structure holds the lock across both the remove and index operations. Consider whether this is intentional, as it prevents concurrent reads during reindexing.
Lock scope in _retrieve_chunks_from_file() (sdk.py:2192-2206): The method correctly snapshots state under the lock, but then the rebuilding of a missing per-file index (sdk.py:2216-2223) doesn't cache the result back. The old code cached it:
```
# Old code:
self.file_indices[file_path] = file_index
self.file_embeddings[file_path] = file_embeddings
```
This was intentionally removed to avoid mutating shared state mid-query, but it means repeated queries for the same uncached file will rebuild the index each time. Consider documenting this tradeoff.
Bounds check in _search_chunks() (sdk.py:2320-2326): Good addition of bounds validation:
```
if idx < 0 or idx >= len(chunks_snapshot):
    continue
```
This handles edge cases where FAISS returns invalid indices.
Double capacity check (sdk.py:1888-1398 and 2028-2052): The capacity check happens twice in the cache-load path (once before checking cache, once after loading cache). This is intentional for TOCTOU safety but could benefit from a brief comment explaining why.

🟢 Minor Notes

The clear_cache() method now acquires the lock (sdk.py:2491-2495) - good for consistency.
The _retrieve_chunks_with_metadata() signature change to accept an optional search_snapshot parameter (sdk.py:2254-2255) is a clean refactor that enables snapshot reuse.

Verdict

Approve - This is a solid thread-safety improvement. The changes are focused, well-tested, and follow the correct pattern of snapshot-then-process for read operations and build-then-publish for write operations.

kovtcharov · 2026-04-11T18:34:12Z

Can you address the 4 suggestions by claude and i'll approve.

grtninja · 2026-04-11T18:41:03Z

yeah I'm working on this now, thanks!

grtninja · 2026-04-12T02:05:15Z

Addressing Claude's 4 suggestions now on the latest head.

What changed:

documented the intentional reindex_document() lock scope so readers do not observe a remove/readd gap during reindexing
documented the query-local uncached per-file index rebuild tradeoff in _retrieve_chunks_from_file()
documented the FAISS bounds guard in _search_chunks()
documented the double capacity check as intentional TOCTOU protection
cleaned the two touched-surface style nits in the same files

Validation:

PYTHONPATH=src python -m pytest -q tests/test_rag.py -k "consistent_snapshot or preserves_state_when_rebuild_fails"
ruff check src/gaia/rag/sdk.py tests/test_rag.py
python -m black --check src/gaia/rag/sdk.py tests/test_rag.py
git diff --check

- Move v0.17.2 to Shipped section with release summary - Add v0.17.3 as current release (RAG reliability, security, website) - Update v0.18.x tables to reflect issue triage (moved 10 issues out of v0.17.3) - Add v0.19.0 model fine-tuning deliverables (was missing from table) - Add v0.22.0 App Consolidation section and Mermaid node (was missing) - Remove duplicate #455 from v0.18.2 (addressed by PR #746 in v0.17.3) - Add Detailed Plan links to v0.17.3 - Set sequential weekly deadlines: Apr 17 through Jun 16 - Update timestamp to April 13, 2026 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kovtcharov · 2026-04-15T08:52:07Z

@claude please review this PR.

github-actions · 2026-04-15T08:52:27Z

Claude finished @kovtcharov's task —— View job

PR Review: RAG Index Thread Safety (Follow-up)

Read PR diff and changed files
Review src/gaia/rag/sdk.py changes (documentation updates)
Review tests/test_rag.py changes
Provide comprehensive review feedback

Summary

This PR successfully addresses all 4 suggestions from the previous review. The documentation additions clearly explain the threading invariants and design tradeoffs.

Verification of Previous Suggestions

1. `reindex_document()` Lock Scope ✅

Location: sdk.py:1579-1583

# Keep remove+reindex under the same lock so readers never observe a
# gap where the document disappeared between generations. Query
# paths snapshot state quickly under this same lock and then do the
# expensive work outside it.

The comment clearly documents the intentional design decision.

2. `_retrieve_chunks_from_file()` Caching Tradeoff ✅

Location: sdk.py:2225-2237

# Use cached per-file index when available; otherwise rebuild locally
# from the snapshot without mutating shared state mid-query.
...
# Intentionally keep this rebuilt index query-local. Publishing it
# back into the shared per-file caches here would mutate shared
# state from a read path after the snapshot was taken.

The tradeoff is now explicitly documented.

3. FAISS Bounds Guard ✅

Location: sdk.py:2336-2338

# FAISS can surface stale or invalid positions relative to the
# captured chunk snapshot; skip anything outside the snapshot.
if idx < 0 or idx >= len(chunks_snapshot):

Good defensive coding with clear explanation.

4. Double Capacity Check (TOCTOU Protection) ✅

Location: sdk.py:2046-2049

# This second capacity check is intentional: the optimistic
# pre-flight above avoids expensive extraction work, while this
# locked check protects against TOCTOU drift after extraction
# and chunking finish.

The design rationale is now clear.

Tests Review

Both new tests are well-designed:

test_query_uses_consistent_snapshot_during_state_swap (test_rag.py:470-523)
- Uses threading events to create a deterministic race condition
- Verifies query returns data from the snapshot it captured, not the swapped state
test_remove_document_preserves_state_when_rebuild_fails (test_rag.py:525-573)
- Patches _create_vector_index to raise RuntimeError
- Verifies all state fields remain unchanged after failed rebuild

Minor Note

🟢 The style fix at sdk.py:1226 (line.strip()[-1] not in instead of not line.strip()[-1] in) is a good cleanup for Python idiom.

Verdict

Approve - All 4 documentation suggestions have been addressed. The comments are clear, concise, and explain the "why" behind each design decision. Ready to merge.

Resolves conflicts between PR #768 (pickle → JSON+HMAC cache) and main: - #746 (RAG index thread safety): integrated state-lock + publish-after- rebuild pattern into the cache-load path - #784 (encrypted PDF handling): `is_encrypted = False` added to all PdfReader mocks so the new encryption guard doesn't short-circuit TestCacheSecurity Changes: - src/gaia/rag/sdk.py: cache load path wraps `_verify_and_load_cache` result in `self._state_lock`; cache save uses `_save_cache`; removed obsolete CACHE_HEADER/MAX_CACHE_SIZE constants - tests/test_rag.py: dropped 4 obsolete pickle-format tests (coverage lives in TestCacheSecurity for JSON+HMAC now); added class-level mock_dependencies fixture to TestCacheSecurity; added is_encrypted mock attribute to the module-level fixture as a robustness fallback Post-merge: `pytest tests/test_rag.py tests/unit/rag/` → 40 pass, 1 skip. No pickle/CACHE_HEADER references remain in merged code.

@kovtcharov-amd

# GAIA v0.17.3 Release Notes GAIA v0.17.3 is an extensibility and resilience release. You can now package your own agents into a custom GAIA installer and seed them on first launch, point GAIA at alternative OpenAI-compatible inference servers from the C++ library (Ollama, for example), and start from three new reference agents (weather, RAG Q&A, HTML mockup) that execute against real Lemonade hardware in CI. It also hardens the RAG cache against an insecure-deserialization class of bug (CWE-502) — all users should upgrade. **Why upgrade:** - **Ship your own GAIA** — Export and import agents between machines, follow a new guide to produce a custom installer that seeds your agents on first launch, and on Windows install everything in one step because the installer now includes the Lemonade Server MSI. - **Work with alternative inference backends** — The C++ library now preserves OpenAI-compatible `/v1` base URLs instead of rewriting them to `/api/v1`, so servers that expose the standard `/v1` path (Ollama, for example) work out of the box. - **Start from a working example** — Three new reference agents (weather via MCP, RAG document Q&A, HTML landing-page generator) with integration tests that actually execute against Lemonade on a Strix CI runner. - **Safer RAG cache** — Replaces `pickle` deserialization with JSON + HMAC-SHA256 (CWE-502). Unsigned or tampered caches are rejected and transparently rebuilt on the next query. - **Better document handling** — Encrypted or corrupted PDFs now produce distinct, actionable errors (`EncryptedPDFError`, `CorruptedPDFError`) instead of generic failures, and the RAG index is hardened for concurrent queries. --- ## What's New ### Custom Installers and Agent Portability You can now package a custom GAIA installer that ships with your own agents pre-loaded, and move agents between machines with export/import (PR #795). On Windows, the official installer now includes the Lemonade Server MSI and runs it during install, so a fresh machine has the complete local-LLM stack after a single download (PR #781). **What you can do:** - Export an agent from `~/.gaia/agents/` to a portable bundle with `gaia agents export` and import it on another machine with `gaia agents import` - Follow the new custom-installer playbook at [`docs/playbooks/custom-installer/index.mdx`](/playbooks/custom-installer) to distribute GAIA with your agents pre-loaded — useful for workshops, team deployments, and internal tooling - On Windows, the installer now includes Lemonade Server — no separate download for a complete first-run experience **Under the hood:** - `gaia agents export` / `gaia agents import` CLI commands round-trip agents between machines as portable bundles - First-launch agent seeder (`src/gaia/apps/webui/services/agent-seeder.cjs`) copies `<resourcesPath>/agents/<id>/` into `~/.gaia/agents/<id>/` the first time the app starts - Windows NSIS installer embeds `lemonade-server-minimal.msi` into `$PLUGINSDIR` and runs it via `msiexec /i ... /qn /norestart` during install (auto-cleaned on exit) --- ### Broader Backend Compatibility in the C++ Library The C++ library now preserves OpenAI-compatible `/v1` base URLs (PR #773) instead of rewriting them to `/api/v1`. That means inference servers that expose the standard OpenAI `/v1` path — for example, Ollama at `http://localhost:11434/v1` — work out of the box without needing a special adapter. --- ### Reference Agents and Real-Hardware Integration Tests Three new example agents and a Strix-runner CI workflow land together (PR #340). **What you can do:** - Copy `examples/weather_agent.py`, `examples/rag_doc_agent.py`, or `examples/product_mockup_agent.py` as a starting point for your own agents - Run the new integration tests locally against Lemonade to validate agents end-to-end, not just structurally **Under the hood:** - `tests/integration/test_example_agents.py` executes agents and validates responses with a 5-minute-per-test timeout - `.github/workflows/test_examples.yml` runs on the self-hosted Strix runner (`stx` label) with Lemonade serving `Qwen3-4B-Instruct-2507-GGUF` - Docs homepage refreshed with a technical value prop ("Agent SDK for AMD Ryzen AI") and MCP / CUA added to the capabilities list --- ### Smarter PDF Handling in RAG Encrypted and corrupted PDFs now surface as distinct, actionable errors (`EncryptedPDFError`, `CorruptedPDFError`, `EmptyPDFError`) instead of generic failures or silent 0-chunk indexes (PR #784, closes #451). Encrypted PDFs are detected before extraction; corrupted PDFs are caught during extraction with a clear message. Combined with the indexing-failure surfacing in PR #723, you get a visible indexing-failed status the moment a document fails — and the RAG index itself is now thread-safe under concurrent queries (PR #746). --- ## Security ### RAG Cache Deserialization Replaced with JSON + HMAC Fixes an insecure-deserialization issue in the RAG cache (CWE-502, PR #768). Previously, cached document indexes were serialized with Python `pickle`; if an attacker could write to `~/.gaia/` — via a shared drive, a sync conflict, or a malicious extension — loading that cache could execute arbitrary code. v0.17.3 replaces `pickle` with signed JSON: caches are now serialized as JSON and authenticated with HMAC-SHA256 using a per-install key stored at `~/.gaia/cache/hmac.key`. Unsigned or tampered caches are rejected and transparently rebuilt on the next query. Old `.pkl` caches from previous GAIA versions are ignored and re-indexed the next time you query a document. **You should upgrade if you** share `~/.gaia/` across machines (Dropbox, iCloud, network home directories), run GAIA in a multi-user environment, or have ever imported RAG caches from another source. --- ## Bug Fixes - **Ask Agent attaches files before sending to chat** (PR #725) — Dropped files are indexed into RAG and attached to the active session before the prompt is consumed, so the model sees the document on the first turn instead of the second. - **Document indexing failures are surfaced** (PR #723) — A document that produces 0 chunks now raises `RuntimeError` in the SDK and surfaces as `indexing_status: failed` in the UI, instead of looking like a silent success. Covers RAG SDK, background indexing, and re-index paths. - **Encrypted or corrupted PDFs produce actionable errors** (PR #784, closes #451) — RAG now raises distinct `EncryptedPDFError` and `CorruptedPDFError` exceptions instead of generic failures, so you see exactly what went wrong. - **RAG index thread safety hardened** (PR #746) — Adds `RLock` protection around index mutation paths and rebuilds chunk/index state atomically before publishing it, so concurrent queries read consistent snapshots and failed rebuilds no longer leak partial state. - **MCP JSON-RPC handler guards against non-dict bodies** (PR #803) — A malformed JSON-RPC payload (array, string, null) now returns HTTP 400 `Invalid Request: expected JSON object` instead of an HTTP 500 from a `TypeError`. - **File-search count aligned with accessible results** (PR #754) — The returned count now matches the number of files the tool actually surfaces, instead of a pre-filter total that over-reported results the caller could not access. - **Tracked block cursor replaces misplaced decorative cursor** (PR #727) — Fixes the mis-positioned blinking cursor in the chat input box, which now tracks the actual caret position via a mirror-div technique. - **Ad-hoc sign the macOS app bundle instead of skipping code signing** (PR #765) — The `.app` bundle inside the DMG now carries an ad-hoc signature, so Gatekeeper presents a single "Open Anyway" bypass in System Settings instead of the unrecoverable "is damaged" error. Full Apple Developer ID signing is still being finalized. --- ## Release & CI - **Publish workflow: single approval gate, no legacy Electron apps** (PR #758) — Removed the legacy jira and example standalone Electron apps from the publish pipeline; a single `publish` environment gate governs PyPI, npm, and installer publishing. - **Claude CI modernization** (PR #797, PR #799, PR #783) — Migrated all four `claude-code-action` call sites to `v1.0.99` (pinned by SHA, fixes an issue-handler hang), bumped `--max-turns` from 20 to 50 on both `pr-review` and `pr-comment` for deeper analysis, upgraded to Opus 4.7, standardized 23 subagent definitions with explicit when-to-use sections and tool allowlists, and added agent-builder tooling (manifest schema, `lint.py --agents`, BuilderAgent mixins). --- ## Docs - **Roadmap overhaul** (PR #710) — Milestone-aligned plans with voice-first as P0 and 9 new plan documents for upcoming initiatives. - **Plan: email triage agent** (PR #796) — Specification for an upcoming email triage agent. - **Docs/source drift resolved** (PR #794) — Fixed broken SDK examples across 15 docs, rewrote 5 spec files against the current source (including two that documented entire APIs that don't exist in code), added 20+ missing CLI flags to the CLI reference, and removed 2 already-shipped plan documents (installer, mcp-client). - **FAQ: data-privacy answer clarified for external LLM providers** (PR #798) — Sharper guidance on what leaves your machine when you point GAIA at Claude or OpenAI. --- ## Full Changelog **21 commits** since v0.17.2: - `6d3f3f71` — fix: replace misplaced decorative cursor with tracked terminal block cursor (#727) - `874cf2a3` — fix: Ask Agent indexes and attaches files before sending to chat (#725) - `4fa121e2` — fix: surface document indexing failures instead of silent 0-chunk success (#723) - `34b1d06e` — fix(ci): ad-hoc sign macOS DMG instead of skipping code signing (#765) - `7188b83c` — Roadmap overhaul: milestone-aligned plans with voice-first P0 and 9 new plan documents (#710) - `1beddac5` — cpp: support Ollama-compatible /v1 endpoints (#773) - `cf9ac995` — fix: harden rag index thread safety (#746) - `1c55c31b` — fix(ci): remove legacy electron apps from publish, single approval gate (#758) - `52946a7a` — feat(installer): bundle Lemonade Server MSI into Windows installer (#774) (#781) - `e96b3686` — ci(claude): review infra + conventions + subagent overhaul + agent-builder tooling (#783) - `058674b5` — fix(rag): detect encrypted and corrupted PDFs with actionable errors (#451) (#784) - `7bcb5d51` — fix: replace insecure pickle deserialization with JSON + HMAC in RAG cache (CWE-502) (#768) - `a5167e5f` — fix: keep file-search count aligned with accessible results (#754) - `da5ba458` — ci(claude): migrate to claude-code-action v1.0.99 + fix issue-handler hang (#797) - `03f546b9` — ci(claude): bump pr-review and pr-comment --max-turns 20 -> 50 (#799) - `4119d564` — docs(faq): clarify data-privacy answer re: external LLM providers (#798) - `0cfbcf41` — Add example agents and integration test workflow (#340) - `c4bd15fb` — docs: fix drift between docs and source (docs review pass 1 + 2) (#794) - `407ed5b8` — docs(plans): add email triage agent spec (#796) - `06fb04a4` — fix(mcp): guard JSON-RPC handler against non-dict body (#803) - `880ad603` — feat(installer): custom installer guide, agent export/import, first-launch seeder (#795) Full Changelog: [v0.17.2...v0.17.3](v0.17.2...v0.17.3) --- ## Release checklist - [x] `util/validate_release_notes.py docs/releases/v0.17.3.mdx --tag v0.17.3` passes - [x] `src/gaia/version.py` → `0.17.3` - [x] `src/gaia/apps/webui/package.json` → `0.17.3` - [x] Navbar label in `docs/docs.json` → `v0.17.3 · Lemonade 10.0.0` - [x] All 21 PRs in the range (v0.17.2..HEAD) are represented in the notes - [ ] Review from @kovtcharov-amd addressed

fix: harden rag index thread safety

d3c3c97

grtninja requested a review from kovtcharov-amd as a code owner April 11, 2026 07:59

github-actions bot added rag RAG system changes tests Test changes performance Performance-critical changes labels Apr 11, 2026

kovtcharov assigned kovtcharov and unassigned kovtcharov Apr 11, 2026

kovtcharov added this to the v0.17.3 — Website, and RAG bug fixes [OSS] milestone Apr 11, 2026

kovtcharov assigned grtninja Apr 11, 2026

docs(rag): address thread-safety review follow-ups

4ca3245

kovtcharov approved these changes Apr 15, 2026

View reviewed changes

kovtcharov added this pull request to the merge queue Apr 15, 2026

Merged via the queue into amd:main with commit cf9ac99 Apr 15, 2026
19 checks passed

grtninja deleted the codex/issue-455-rag-thread-safety branch April 15, 2026 09:40

itomek mentioned this pull request Apr 20, 2026

Release v0.17.3 #831

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden rag index thread safety#746

fix: harden rag index thread safety#746
kovtcharov merged 2 commits intoamd:mainfrom
grtninja:codex/issue-455-rag-thread-safety

grtninja commented Apr 11, 2026

Uh oh!

grtninja commented Apr 11, 2026

Uh oh!

kovtcharov commented Apr 11, 2026

Uh oh!

github-actions bot commented Apr 11, 2026 •

edited

Loading

Uh oh!

kovtcharov commented Apr 11, 2026

Uh oh!

grtninja commented Apr 11, 2026

Uh oh!

grtninja commented Apr 12, 2026

Uh oh!

kovtcharov commented Apr 15, 2026

Uh oh!

github-actions bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

grtninja commented Apr 11, 2026

What

Why

Scope

Validation

Notes

Uh oh!

grtninja commented Apr 11, 2026

Uh oh!

kovtcharov commented Apr 11, 2026

Uh oh!

github-actions bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: RAG Index Thread Safety

Summary

Review Feedback

🟢 Strengths

🟡 Suggestions

🟢 Minor Notes

Verdict

Uh oh!

kovtcharov commented Apr 11, 2026

Uh oh!

grtninja commented Apr 11, 2026

Uh oh!

grtninja commented Apr 12, 2026

Uh oh!

kovtcharov commented Apr 15, 2026

Uh oh!

github-actions bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: RAG Index Thread Safety (Follow-up)

Summary

Verification of Previous Suggestions

1. reindex_document() Lock Scope ✅

2. _retrieve_chunks_from_file() Caching Tradeoff ✅

3. FAISS Bounds Guard ✅

4. Double Capacity Check (TOCTOU Protection) ✅

Tests Review

Minor Note

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Apr 11, 2026 •

edited

Loading

github-actions bot commented Apr 15, 2026 •

edited

Loading

1. `reindex_document()` Lock Scope ✅

2. `_retrieve_chunks_from_file()` Caching Tradeoff ✅