Skip to content

fix: harden rag index thread safety#746

Merged
kovtcharov merged 2 commits intoamd:mainfrom
grtninja:codex/issue-455-rag-thread-safety
Apr 15, 2026
Merged

fix: harden rag index thread safety#746
kovtcharov merged 2 commits intoamd:mainfrom
grtninja:codex/issue-455-rag-thread-safety

Conversation

@grtninja
Copy link
Copy Markdown
Contributor

What

  • add RLock protection around RAG index mutation paths in src/gaia/rag/sdk.py
  • rebuild chunk/index state before publishing it so failed rebuilds do not leak partial state
  • query against consistent snapshots so concurrent reads do not mix old indexes with new chunk maps
  • add targeted regressions for concurrent query-vs-state-swap and failed rebuild rollback

Why

Closes #455 by tightening the shared-state contract on the RAG index surface used by concurrent chat requests.

Scope

  • src/gaia/rag/sdk.py
  • tests/test_rag.py

Validation

  • python -m py_compile src/gaia/rag/sdk.py tests/test_rag.py
  • PYTHONPATH=src pytest tests/test_rag.py -k "consistent_snapshot or rebuild_fails or document_indexing or document_querying or status_reporting or cache_functionality" -q

Notes

  • kept the lane focused to the issue surface only
  • no UI or broader agent-loop changes

@github-actions github-actions bot added rag RAG system changes tests Test changes performance Performance-critical changes labels Apr 11, 2026
@grtninja
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@kovtcharov
Copy link
Copy Markdown
Collaborator

@claude review this PR.

@kovtcharov kovtcharov assigned kovtcharov and unassigned kovtcharov Apr 11, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 11, 2026

Claude finished @kovtcharov's task —— View job


PR Review: RAG Index Thread Safety

  • Read PR diff and changed files
  • Review src/gaia/rag/sdk.py changes
  • Review tests/test_rag.py changes
  • Provide comprehensive review feedback

Summary

This PR adds thread safety to the RAG SDK by introducing RLock protection around index mutation paths and ensuring queries operate on consistent snapshots. The implementation is well-structured and addresses the core issue of concurrent access to shared state.

Review Feedback

🟢 Strengths

  1. Snapshot-based query isolation (sdk.py:1454-1464): The _snapshot_query_state() method correctly captures a consistent read snapshot under the lock, allowing expensive embedding/search work to proceed without holding the lock.

  2. Atomic state publishing (sdk.py:1526-1530): The remove_document() method now builds new state structures first, then publishes them atomically only after _create_vector_index() succeeds. This prevents partial state from being visible.

  3. Good use of RLock: Using threading.RLock() allows nested locking (e.g., reindex_document() calling remove_document() and index_document() within the same lock).

  4. Well-targeted tests: The two new tests (test_query_uses_consistent_snapshot_during_state_swap and test_remove_document_preserves_state_when_rebuild_fails) directly validate the race conditions being fixed.

🟡 Suggestions

  1. Deadlock risk in reindex_document() (sdk.py:1579-1595): This method holds _state_lock while calling index_document(), which also acquires _state_lock. While RLock handles this correctly, the current structure holds the lock across both the remove and index operations. Consider whether this is intentional, as it prevents concurrent reads during reindexing.

  2. Lock scope in _retrieve_chunks_from_file() (sdk.py:2192-2206): The method correctly snapshots state under the lock, but then the rebuilding of a missing per-file index (sdk.py:2216-2223) doesn't cache the result back. The old code cached it:

    # Old code:
    self.file_indices[file_path] = file_index
    self.file_embeddings[file_path] = file_embeddings

    This was intentionally removed to avoid mutating shared state mid-query, but it means repeated queries for the same uncached file will rebuild the index each time. Consider documenting this tradeoff.

  3. Bounds check in _search_chunks() (sdk.py:2320-2326): Good addition of bounds validation:

    if idx < 0 or idx >= len(chunks_snapshot):
        continue

    This handles edge cases where FAISS returns invalid indices.

  4. Double capacity check (sdk.py:1888-1398 and 2028-2052): The capacity check happens twice in the cache-load path (once before checking cache, once after loading cache). This is intentional for TOCTOU safety but could benefit from a brief comment explaining why.

🟢 Minor Notes

  • The clear_cache() method now acquires the lock (sdk.py:2491-2495) - good for consistency.
  • The _retrieve_chunks_with_metadata() signature change to accept an optional search_snapshot parameter (sdk.py:2254-2255) is a clean refactor that enables snapshot reuse.

Verdict

Approve - This is a solid thread-safety improvement. The changes are focused, well-tested, and follow the correct pattern of snapshot-then-process for read operations and build-then-publish for write operations.


@kovtcharov
Copy link
Copy Markdown
Collaborator

Can you address the 4 suggestions by claude and i'll approve.

@grtninja
Copy link
Copy Markdown
Contributor Author

yeah I'm working on this now, thanks!

@grtninja
Copy link
Copy Markdown
Contributor Author

Addressing Claude's 4 suggestions now on the latest head.

What changed:

  • documented the intentional reindex_document() lock scope so readers do not observe a remove/readd gap during reindexing
  • documented the query-local uncached per-file index rebuild tradeoff in _retrieve_chunks_from_file()
  • documented the FAISS bounds guard in _search_chunks()
  • documented the double capacity check as intentional TOCTOU protection
  • cleaned the two touched-surface style nits in the same files

Validation:

  • PYTHONPATH=src python -m pytest -q tests/test_rag.py -k "consistent_snapshot or preserves_state_when_rebuild_fails"
  • ruff check src/gaia/rag/sdk.py tests/test_rag.py
  • python -m black --check src/gaia/rag/sdk.py tests/test_rag.py
  • git diff --check

kovtcharov-amd pushed a commit that referenced this pull request Apr 13, 2026
- Move v0.17.2 to Shipped section with release summary
- Add v0.17.3 as current release (RAG reliability, security, website)
- Update v0.18.x tables to reflect issue triage (moved 10 issues out of v0.17.3)
- Add v0.19.0 model fine-tuning deliverables (was missing from table)
- Add v0.22.0 App Consolidation section and Mermaid node (was missing)
- Remove duplicate #455 from v0.18.2 (addressed by PR #746 in v0.17.3)
- Add Detailed Plan links to v0.17.3
- Set sequential weekly deadlines: Apr 17 through Jun 16
- Update timestamp to April 13, 2026

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kovtcharov
Copy link
Copy Markdown
Collaborator

@claude please review this PR.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 15, 2026

Claude finished @kovtcharov's task —— View job


PR Review: RAG Index Thread Safety (Follow-up)

  • Read PR diff and changed files
  • Review src/gaia/rag/sdk.py changes (documentation updates)
  • Review tests/test_rag.py changes
  • Provide comprehensive review feedback

Summary

This PR successfully addresses all 4 suggestions from the previous review. The documentation additions clearly explain the threading invariants and design tradeoffs.

Verification of Previous Suggestions

1. reindex_document() Lock Scope ✅

Location: sdk.py:1579-1583

# Keep remove+reindex under the same lock so readers never observe a
# gap where the document disappeared between generations. Query
# paths snapshot state quickly under this same lock and then do the
# expensive work outside it.

The comment clearly documents the intentional design decision.

2. _retrieve_chunks_from_file() Caching Tradeoff ✅

Location: sdk.py:2225-2237

# Use cached per-file index when available; otherwise rebuild locally
# from the snapshot without mutating shared state mid-query.
...
# Intentionally keep this rebuilt index query-local. Publishing it
# back into the shared per-file caches here would mutate shared
# state from a read path after the snapshot was taken.

The tradeoff is now explicitly documented.

3. FAISS Bounds Guard ✅

Location: sdk.py:2336-2338

# FAISS can surface stale or invalid positions relative to the
# captured chunk snapshot; skip anything outside the snapshot.
if idx < 0 or idx >= len(chunks_snapshot):

Good defensive coding with clear explanation.

4. Double Capacity Check (TOCTOU Protection) ✅

Location: sdk.py:2046-2049

# This second capacity check is intentional: the optimistic
# pre-flight above avoids expensive extraction work, while this
# locked check protects against TOCTOU drift after extraction
# and chunking finish.

The design rationale is now clear.

Tests Review

Both new tests are well-designed:

  1. test_query_uses_consistent_snapshot_during_state_swap (test_rag.py:470-523)

    • Uses threading events to create a deterministic race condition
    • Verifies query returns data from the snapshot it captured, not the swapped state
  2. test_remove_document_preserves_state_when_rebuild_fails (test_rag.py:525-573)

    • Patches _create_vector_index to raise RuntimeError
    • Verifies all state fields remain unchanged after failed rebuild

Minor Note

🟢 The style fix at sdk.py:1226 (line.strip()[-1] not in instead of not line.strip()[-1] in) is a good cleanup for Python idiom.

Verdict

Approve - All 4 documentation suggestions have been addressed. The comments are clear, concise, and explain the "why" behind each design decision. Ready to merge.


@kovtcharov kovtcharov added this pull request to the merge queue Apr 15, 2026
Merged via the queue into amd:main with commit cf9ac99 Apr 15, 2026
19 checks passed
@grtninja grtninja deleted the codex/issue-455-rag-thread-safety branch April 15, 2026 09:40
kovtcharov added a commit that referenced this pull request Apr 17, 2026
Resolves conflicts between PR #768 (pickle → JSON+HMAC cache) and main:
- #746 (RAG index thread safety): integrated state-lock + publish-after-
  rebuild pattern into the cache-load path
- #784 (encrypted PDF handling): `is_encrypted = False` added to all
  PdfReader mocks so the new encryption guard doesn't short-circuit
  TestCacheSecurity

Changes:
- src/gaia/rag/sdk.py: cache load path wraps `_verify_and_load_cache`
  result in `self._state_lock`; cache save uses `_save_cache`; removed
  obsolete CACHE_HEADER/MAX_CACHE_SIZE constants
- tests/test_rag.py: dropped 4 obsolete pickle-format tests (coverage
  lives in TestCacheSecurity for JSON+HMAC now); added class-level
  mock_dependencies fixture to TestCacheSecurity; added is_encrypted
  mock attribute to the module-level fixture as a robustness fallback

Post-merge: `pytest tests/test_rag.py tests/unit/rag/` → 40 pass, 1 skip.
No pickle/CACHE_HEADER references remain in merged code.
@itomek itomek mentioned this pull request Apr 20, 2026
6 tasks
github-merge-queue bot pushed a commit that referenced this pull request Apr 20, 2026
# GAIA v0.17.3 Release Notes

GAIA v0.17.3 is an extensibility and resilience release. You can now
package your own agents into a custom GAIA installer and seed them on
first launch, point GAIA at alternative OpenAI-compatible inference
servers from the C++ library (Ollama, for example), and start from three
new reference agents (weather, RAG Q&A, HTML mockup) that execute
against real Lemonade hardware in CI. It also hardens the RAG cache
against an insecure-deserialization class of bug (CWE-502) — all users
should upgrade.

**Why upgrade:**
- **Ship your own GAIA** — Export and import agents between machines,
follow a new guide to produce a custom installer that seeds your agents
on first launch, and on Windows install everything in one step because
the installer now includes the Lemonade Server MSI.
- **Work with alternative inference backends** — The C++ library now
preserves OpenAI-compatible `/v1` base URLs instead of rewriting them to
`/api/v1`, so servers that expose the standard `/v1` path (Ollama, for
example) work out of the box.
- **Start from a working example** — Three new reference agents (weather
via MCP, RAG document Q&A, HTML landing-page generator) with integration
tests that actually execute against Lemonade on a Strix CI runner.
- **Safer RAG cache** — Replaces `pickle` deserialization with JSON +
HMAC-SHA256 (CWE-502). Unsigned or tampered caches are rejected and
transparently rebuilt on the next query.
- **Better document handling** — Encrypted or corrupted PDFs now produce
distinct, actionable errors (`EncryptedPDFError`, `CorruptedPDFError`)
instead of generic failures, and the RAG index is hardened for
concurrent queries.

---

## What's New

### Custom Installers and Agent Portability

You can now package a custom GAIA installer that ships with your own
agents pre-loaded, and move agents between machines with export/import
(PR #795). On Windows, the official installer now includes the Lemonade
Server MSI and runs it during install, so a fresh machine has the
complete local-LLM stack after a single download (PR #781).

**What you can do:**
- Export an agent from `~/.gaia/agents/` to a portable bundle with `gaia
agents export` and import it on another machine with `gaia agents
import`
- Follow the new custom-installer playbook at
[`docs/playbooks/custom-installer/index.mdx`](/playbooks/custom-installer)
to distribute GAIA with your agents pre-loaded — useful for workshops,
team deployments, and internal tooling
- On Windows, the installer now includes Lemonade Server — no separate
download for a complete first-run experience

**Under the hood:**
- `gaia agents export` / `gaia agents import` CLI commands round-trip
agents between machines as portable bundles
- First-launch agent seeder
(`src/gaia/apps/webui/services/agent-seeder.cjs`) copies
`<resourcesPath>/agents/<id>/` into `~/.gaia/agents/<id>/` the first
time the app starts
- Windows NSIS installer embeds `lemonade-server-minimal.msi` into
`$PLUGINSDIR` and runs it via `msiexec /i ... /qn /norestart` during
install (auto-cleaned on exit)

---

### Broader Backend Compatibility in the C++ Library

The C++ library now preserves OpenAI-compatible `/v1` base URLs (PR
#773) instead of rewriting them to `/api/v1`. That means inference
servers that expose the standard OpenAI `/v1` path — for example, Ollama
at `http://localhost:11434/v1` — work out of the box without needing a
special adapter.

---

### Reference Agents and Real-Hardware Integration Tests

Three new example agents and a Strix-runner CI workflow land together
(PR #340).

**What you can do:**
- Copy `examples/weather_agent.py`, `examples/rag_doc_agent.py`, or
`examples/product_mockup_agent.py` as a starting point for your own
agents
- Run the new integration tests locally against Lemonade to validate
agents end-to-end, not just structurally

**Under the hood:**
- `tests/integration/test_example_agents.py` executes agents and
validates responses with a 5-minute-per-test timeout
- `.github/workflows/test_examples.yml` runs on the self-hosted Strix
runner (`stx` label) with Lemonade serving `Qwen3-4B-Instruct-2507-GGUF`
- Docs homepage refreshed with a technical value prop ("Agent SDK for
AMD Ryzen AI") and MCP / CUA added to the capabilities list

---

### Smarter PDF Handling in RAG

Encrypted and corrupted PDFs now surface as distinct, actionable errors
(`EncryptedPDFError`, `CorruptedPDFError`, `EmptyPDFError`) instead of
generic failures or silent 0-chunk indexes (PR #784, closes #451).
Encrypted PDFs are detected before extraction; corrupted PDFs are caught
during extraction with a clear message. Combined with the
indexing-failure surfacing in PR #723, you get a visible indexing-failed
status the moment a document fails — and the RAG index itself is now
thread-safe under concurrent queries (PR #746).

---

## Security

### RAG Cache Deserialization Replaced with JSON + HMAC

Fixes an insecure-deserialization issue in the RAG cache (CWE-502, PR
#768). Previously, cached document indexes were serialized with Python
`pickle`; if an attacker could write to `~/.gaia/` — via a shared drive,
a sync conflict, or a malicious extension — loading that cache could
execute arbitrary code.

v0.17.3 replaces `pickle` with signed JSON: caches are now serialized as
JSON and authenticated with HMAC-SHA256 using a per-install key stored
at `~/.gaia/cache/hmac.key`. Unsigned or tampered caches are rejected
and transparently rebuilt on the next query. Old `.pkl` caches from
previous GAIA versions are ignored and re-indexed the next time you
query a document.

**You should upgrade if you** share `~/.gaia/` across machines (Dropbox,
iCloud, network home directories), run GAIA in a multi-user environment,
or have ever imported RAG caches from another source.

---

## Bug Fixes

- **Ask Agent attaches files before sending to chat** (PR #725) —
Dropped files are indexed into RAG and attached to the active session
before the prompt is consumed, so the model sees the document on the
first turn instead of the second.
- **Document indexing failures are surfaced** (PR #723) — A document
that produces 0 chunks now raises `RuntimeError` in the SDK and surfaces
as `indexing_status: failed` in the UI, instead of looking like a silent
success. Covers RAG SDK, background indexing, and re-index paths.
- **Encrypted or corrupted PDFs produce actionable errors** (PR #784,
closes #451) — RAG now raises distinct `EncryptedPDFError` and
`CorruptedPDFError` exceptions instead of generic failures, so you see
exactly what went wrong.
- **RAG index thread safety hardened** (PR #746) — Adds `RLock`
protection around index mutation paths and rebuilds chunk/index state
atomically before publishing it, so concurrent queries read consistent
snapshots and failed rebuilds no longer leak partial state.
- **MCP JSON-RPC handler guards against non-dict bodies** (PR #803) — A
malformed JSON-RPC payload (array, string, null) now returns HTTP 400
`Invalid Request: expected JSON object` instead of an HTTP 500 from a
`TypeError`.
- **File-search count aligned with accessible results** (PR #754) — The
returned count now matches the number of files the tool actually
surfaces, instead of a pre-filter total that over-reported results the
caller could not access.
- **Tracked block cursor replaces misplaced decorative cursor** (PR
#727) — Fixes the mis-positioned blinking cursor in the chat input box,
which now tracks the actual caret position via a mirror-div technique.
- **Ad-hoc sign the macOS app bundle instead of skipping code signing**
(PR #765) — The `.app` bundle inside the DMG now carries an ad-hoc
signature, so Gatekeeper presents a single "Open Anyway" bypass in
System Settings instead of the unrecoverable "is damaged" error. Full
Apple Developer ID signing is still being finalized.

---

## Release & CI

- **Publish workflow: single approval gate, no legacy Electron apps**
(PR #758) — Removed the legacy jira and example standalone Electron apps
from the publish pipeline; a single `publish` environment gate governs
PyPI, npm, and installer publishing.
- **Claude CI modernization** (PR #797, PR #799, PR #783) — Migrated all
four `claude-code-action` call sites to `v1.0.99` (pinned by SHA, fixes
an issue-handler hang), bumped `--max-turns` from 20 to 50 on both
`pr-review` and `pr-comment` for deeper analysis, upgraded to Opus 4.7,
standardized 23 subagent definitions with explicit when-to-use sections
and tool allowlists, and added agent-builder tooling (manifest schema,
`lint.py --agents`, BuilderAgent mixins).

---

## Docs

- **Roadmap overhaul** (PR #710) — Milestone-aligned plans with
voice-first as P0 and 9 new plan documents for upcoming initiatives.
- **Plan: email triage agent** (PR #796) — Specification for an upcoming
email triage agent.
- **Docs/source drift resolved** (PR #794) — Fixed broken SDK examples
across 15 docs, rewrote 5 spec files against the current source
(including two that documented entire APIs that don't exist in code),
added 20+ missing CLI flags to the CLI reference, and removed 2
already-shipped plan documents (installer, mcp-client).
- **FAQ: data-privacy answer clarified for external LLM providers** (PR
#798) — Sharper guidance on what leaves your machine when you point GAIA
at Claude or OpenAI.

---

## Full Changelog

**21 commits** since v0.17.2:

- `6d3f3f71` — fix: replace misplaced decorative cursor with tracked
terminal block cursor (#727)
- `874cf2a3` — fix: Ask Agent indexes and attaches files before sending
to chat (#725)
- `4fa121e2` — fix: surface document indexing failures instead of silent
0-chunk success (#723)
- `34b1d06e` — fix(ci): ad-hoc sign macOS DMG instead of skipping code
signing (#765)
- `7188b83c` — Roadmap overhaul: milestone-aligned plans with
voice-first P0 and 9 new plan documents (#710)
- `1beddac5` — cpp: support Ollama-compatible /v1 endpoints (#773)
- `cf9ac995` — fix: harden rag index thread safety (#746)
- `1c55c31b` — fix(ci): remove legacy electron apps from publish, single
approval gate (#758)
- `52946a7a` — feat(installer): bundle Lemonade Server MSI into Windows
installer (#774) (#781)
- `e96b3686` — ci(claude): review infra + conventions + subagent
overhaul + agent-builder tooling (#783)
- `058674b5` — fix(rag): detect encrypted and corrupted PDFs with
actionable errors (#451) (#784)
- `7bcb5d51` — fix: replace insecure pickle deserialization with JSON +
HMAC in RAG cache (CWE-502) (#768)
- `a5167e5f` — fix: keep file-search count aligned with accessible
results (#754)
- `da5ba458` — ci(claude): migrate to claude-code-action v1.0.99 + fix
issue-handler hang (#797)
- `03f546b9` — ci(claude): bump pr-review and pr-comment --max-turns 20
-> 50 (#799)
- `4119d564` — docs(faq): clarify data-privacy answer re: external LLM
providers (#798)
- `0cfbcf41` — Add example agents and integration test workflow (#340)
- `c4bd15fb` — docs: fix drift between docs and source (docs review pass
1 + 2) (#794)
- `407ed5b8` — docs(plans): add email triage agent spec (#796)
- `06fb04a4` — fix(mcp): guard JSON-RPC handler against non-dict body
(#803)
- `880ad603` — feat(installer): custom installer guide, agent
export/import, first-launch seeder (#795)

Full Changelog:
[v0.17.2...v0.17.3](v0.17.2...v0.17.3)

---

## Release checklist
- [x] `util/validate_release_notes.py docs/releases/v0.17.3.mdx --tag
v0.17.3` passes
- [x] `src/gaia/version.py` → `0.17.3`
- [x] `src/gaia/apps/webui/package.json` → `0.17.3`
- [x] Navbar label in `docs/docs.json` → `v0.17.3 · Lemonade 10.0.0`
- [x] All 21 PRs in the range (v0.17.2..HEAD) are represented in the
notes
- [ ] Review from @kovtcharov-amd addressed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add thread safety to RAG index operations

2 participants