feat(cost): semantic caching engine with pgvector integration (#396)#440
feat(cost): semantic caching engine with pgvector integration (#396)#440Mohammedsami001 wants to merge 2 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 5 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds a semantic caching engine: a hash-based ChangesSemantic Caching Engine
Sequence Diagram(s)sequenceDiagram
participant Caller
participant AsyncCompletions
participant SemanticCache
participant DB as PostgreSQL/pgvector
participant OpenAI as OpenAI API
Caller->>AsyncCompletions: create(messages, extra_body={cache_ttl_days: N})
AsyncCompletions->>SemanticCache: get(last_message_content, ttl_days=N)
alt DB session present
SemanticCache->>DB: SELECT by cosine_distance with TTL cutoff
DB-->>SemanticCache: SemanticCacheEntry row (or empty)
else In-memory fallback
SemanticCache->>SemanticCache: prune expired entries, cosine similarity search
end
alt Cache hit
SemanticCache-->>AsyncCompletions: cached response_text
AsyncCompletions-->>Caller: synthetic ChatCompletion
else Cache miss
AsyncCompletions->>OpenAI: original create(messages, ...)
OpenAI-->>AsyncCompletions: ChatCompletion
AsyncCompletions->>SemanticCache: set(prompt, response_text, framework="openai")
SemanticCache->>DB: INSERT SemanticCacheEntry + commit
AsyncCompletions-->>Caller: real ChatCompletion
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (2)
tests/test_caching.py (2)
95-203: ⚡ Quick winAdd interception regression tests for malformed TTL and streaming mode.
Current suite doesn’t assert behavior for invalid
AGENTWATCH_CACHE_TTL_DAYS/cache_ttl_daysvalues orstream=True, which are high-risk paths in the wrapper.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_caching.py` around lines 95 - 203, Add two new test functions to cover high-risk paths not currently tested. First, create a test for malformed TTL values by testing invalid inputs for both the AGENTWATCH_CACHE_TTL_DAYS environment variable (e.g., non-numeric strings) and the cache_ttl_days override in extra_body metadata to ensure the semantic cache gracefully handles these invalid values. Second, create a test for streaming mode by calling client.chat.completions.create with stream=True parameter and verify that the cache behaves correctly when streaming is enabled, checking whether streaming responses are properly cached or handled appropriately. Both tests should follow the same pattern as test_semantic_cache_manager_interception and test_semantic_cache_manager_config_override by mocking the embedding provider, using patch_openai and unpatch_openai context, and asserting expected behavior.
205-241: ⚡ Quick winAdd a DB-path negative test that enforces similarity threshold.
This test currently mocks a DB hit unconditionally; it won’t catch regressions where low-similarity nearest neighbors are incorrectly returned.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_caching.py` around lines 205 - 241, The test_semantic_cache_manager_db_backend function currently mocks unconditional database hits without verifying that the similarity threshold is enforced. Add a negative test case after the existing assertions that sets up a second mock_entry with embedding vectors that produce low similarity (e.g., orthogonal vectors like [1.0, 0.0] vs [0.0, 1.0]) and verifies that when cache.get() is called with a query that has low similarity to the cached entry, it returns None or no hit instead of the cached response, confirming the similarity threshold blocks incorrect matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agentwatch/adapters/interception.py`:
- Around line 79-83: The await semantic_cache.set() call in the response
handling block is not protected from exceptions, which means any failure in the
cache population operation will raise and mask the successful upstream response.
Wrap the await semantic_cache.set() call in a try-except block to catch and
handle any exceptions gracefully, ensuring that cache failures do not prevent
the successful response from being returned to the caller.
- Around line 28-37: The code directly converts untrusted values using int()
without error handling, which can raise ValueError and crash request handling.
Add try-except blocks around both the int(global_ttl_env) conversion on line 29
and the int(override_ttl) conversion on line 37 to catch ValueError exceptions.
For each conversion, handle the exception gracefully by either logging the
invalid value, falling back to a default TTL, or skipping the override, ensuring
the system continues operating safely with malformed input.
- Around line 45-50: The current code mutates the shared
`semantic_cache.ttl_days` instance variable across an await point, which causes
concurrent requests to interfere with each other's TTL settings. Remove the
lines that save the original ttl_days value and restore it after the await in
the semantic_cache.get call. Instead, modify the SemanticCache.get method
signature to accept an optional ttl_days_override parameter, and pass the ttl
value as an argument to that method rather than mutating the instance state
directly. This way, the TTL override is scoped to the specific request without
affecting shared state.
- Around line 40-74: The code currently returns a ChatCompletion object on all
cache hits, but this breaks the API contract when stream=True is passed in
kwargs. Before returning the cached_response in the semantic cache hit block,
check if kwargs.get("stream") is True. If streaming is enabled, construct and
return an AsyncStream containing ChatCompletionChunk objects with the
appropriate delta fields instead of the ChatCompletion object. Only return the
ChatCompletion object when streaming is disabled or not specified.
In `@agentwatch/cost/caching.py`:
- Around line 23-35: The cache dictionary (self._cache) is keyed only by
prompt_hash in both the storage operation where CacheHit is assigned and the
retrieval operation in the search method, which causes identical prompts from
different frameworks to overwrite each other. Modify the cache key generation to
include both the prompt_hash and the framework parameter so that the same prompt
with different frameworks are stored as separate cache entries. Update the key
construction logic in the cache assignment (around the CacheHit instantiation)
and in the search method's cache retrieval call to use a composite key that
combines prompt_hash and framework.
In `@agentwatch/cost/semantic_cache.py`:
- Around line 79-85: The return statement in the semantic cache lookup is
returning best_match_db.response_text without verifying that the match meets the
similarity threshold. Add a check after retrieving best_match_db to calculate
the cosine distance between the query_vec and the best match's prompt_vector,
then only return best_match_db.response_text if the distance satisfies the
threshold condition (distance <= 1 - similarity_threshold). If the threshold is
not met, allow the function to continue or return None to indicate no suitable
match was found.
- Around line 131-140: The database commit operation in the SemanticCacheEntry
creation block can raise exceptions and cause caller failures even when the
upstream operation succeeds. Wrap the self.db_session.add(db_entry) and await
self.db_session.commit() calls in a try-except block that logs any errors
without re-raising them, ensuring cache persistence failures do not propagate to
callers and break successful operations.
In `@pyproject.toml`:
- Line 11: The requires-python setting in pyproject.toml specifies >=3.10, but
the code uses datetime.UTC in agentwatch/cost/semantic_cache.py (lines 25, 67,
93) which is only available in Python 3.11+. Either update the requires-python
constraint to >=3.11 to match the actual minimum version required by the
codebase, or replace all occurrences of datetime.UTC with datetime.timezone.utc
throughout semantic_cache.py, which is compatible with Python 3.10.
---
Nitpick comments:
In `@tests/test_caching.py`:
- Around line 95-203: Add two new test functions to cover high-risk paths not
currently tested. First, create a test for malformed TTL values by testing
invalid inputs for both the AGENTWATCH_CACHE_TTL_DAYS environment variable
(e.g., non-numeric strings) and the cache_ttl_days override in extra_body
metadata to ensure the semantic cache gracefully handles these invalid values.
Second, create a test for streaming mode by calling
client.chat.completions.create with stream=True parameter and verify that the
cache behaves correctly when streaming is enabled, checking whether streaming
responses are properly cached or handled appropriately. Both tests should follow
the same pattern as test_semantic_cache_manager_interception and
test_semantic_cache_manager_config_override by mocking the embedding provider,
using patch_openai and unpatch_openai context, and asserting expected behavior.
- Around line 205-241: The test_semantic_cache_manager_db_backend function
currently mocks unconditional database hits without verifying that the
similarity threshold is enforced. Add a negative test case after the existing
assertions that sets up a second mock_entry with embedding vectors that produce
low similarity (e.g., orthogonal vectors like [1.0, 0.0] vs [0.0, 1.0]) and
verifies that when cache.get() is called with a query that has low similarity to
the cached entry, it returns None or no hit instead of the cached response,
confirming the similarity threshold blocks incorrect matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: b5c8e4c1-4e8d-4146-b7d4-e1526c4e76e0
📒 Files selected for processing (6)
agentwatch/adapters/interception.pyagentwatch/cost/caching.pyagentwatch/cost/semantic_cache.pyagentwatch/models/cache.pypyproject.tomltests/test_caching.py
|
Hi @sreerevanth 👋 I've pushed a commit addressing the CodeRabbit feedback. The PR is now ready to merge! Fixes include:
Tests are passing locally. Let me know if anything else is needed! 🚀 |
🧪 PR Test Results
Python 3.12 · commit 9c31784 |
Description
Closes #396
This PR introduces the Semantic Caching Engine to significantly reduce LLM costs and latency by intercepting
openainetwork requests and returning cached responses for semantically identical prompts.The feature was implemented using strict Test-Driven Development (TDD) to ensure robust handling of caching edge cases, cache-eviction, and network abstractions.
What Was Accomplished
1.
SemanticCacheEngine (agentwatch/cost/semantic_cache.py)_cosine_similarity.TTLvalidation to automatically ignore stale entries based on a configurable timeframe (ttl_days).2. OpenAI Network Interception (
agentwatch/adapters/interception.py)openai.AsyncClient.chat.completions.createviapatch_openai().ChatCompletionobjects upon cache hits, ensuring downstream applications (e.g., streaming and non-streaming) are completely unaware the response was fetched locally.AsyncStreamattributes by safely validatinghasattr(response, "choices").3. PostgreSQL /
pgvectorIntegration (agentwatch/models/cache.py)SemanticCacheEntrySQLAlchemy model leveraging theVectorcolumn type frompgvector.AsyncSessionis provided to the cache manager, it scales out of local memory and queries the backend usingpgvector's native.cosine_distanceto rapidly sort vector similarity.4. Per-Session Overrides
AGENTWATCH_CACHE_TTL_DAYS) can be dynamically bypassed or overridden viaextra_body={"agentwatch_metadata": {"cache_ttl_days": X}}in the LLM payload, giving granular control back to the specific execution session.Testing & Verification
Comprehensive TDD testing is included (
tests/test_caching.py). All 6 behavioral suites successfully pass:Checklist
Summary by CodeRabbit
Release Notes
New Features
Chores
Tests