Skip to content

feat(cost): semantic caching engine with pgvector integration (#396)#440

Open
Mohammedsami001 wants to merge 2 commits into
sreerevanth:mainfrom
Mohammedsami001:feature/issue-396-semantic-caching
Open

feat(cost): semantic caching engine with pgvector integration (#396)#440
Mohammedsami001 wants to merge 2 commits into
sreerevanth:mainfrom
Mohammedsami001:feature/issue-396-semantic-caching

Conversation

@Mohammedsami001

@Mohammedsami001 Mohammedsami001 commented Jun 19, 2026

Copy link
Copy Markdown

Description

Closes #396

This PR introduces the Semantic Caching Engine to significantly reduce LLM costs and latency by intercepting openai network requests and returning cached responses for semantically identical prompts.

The feature was implemented using strict Test-Driven Development (TDD) to ensure robust handling of caching edge cases, cache-eviction, and network abstractions.

What Was Accomplished

1. SemanticCache Engine (agentwatch/cost/semantic_cache.py)

  • Built an in-memory & database-agnostic caching engine.
  • Implements both exact hashing (SHA-256) and fuzzy semantic matching using _cosine_similarity.
  • Contains integrated TTL validation to automatically ignore stale entries based on a configurable timeframe (ttl_days).

2. OpenAI Network Interception (agentwatch/adapters/interception.py)

  • Dynamically monkeys-patches openai.AsyncClient.chat.completions.create via patch_openai().
  • Defensively intercepts requests, routing them to the semantic cache before hitting the network.
  • Safely reconstructs standard ChatCompletion objects upon cache hits, ensuring downstream applications (e.g., streaming and non-streaming) are completely unaware the response was fetched locally.
  • Robustness: Protects against crashing AsyncStream attributes by safely validating hasattr(response, "choices").

3. PostgreSQL / pgvector Integration (agentwatch/models/cache.py)

  • Added the SemanticCacheEntry SQLAlchemy model leveraging the Vector column type from pgvector.
  • When an AsyncSession is provided to the cache manager, it scales out of local memory and queries the backend using pgvector's native .cosine_distance to rapidly sort vector similarity.

4. Per-Session Overrides

  • Global caching constraints (e.g., AGENTWATCH_CACHE_TTL_DAYS) can be dynamically bypassed or overridden via extra_body={"agentwatch_metadata": {"cache_ttl_days": X}} in the LLM payload, giving granular control back to the specific execution session.

Testing & Verification

Comprehensive TDD testing is included (tests/test_caching.py). All 6 behavioral suites successfully pass:

  • Exact matching retrieval.
  • Semantic fuzzy matching retrieval.
  • TTL logic expiration and cache invalidation.
  • Network isolation and monkey-patch interception.
  • Session-level override precedence.
  • Mocked database backend operations.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

Release Notes

  • New Features

    • Added semantic caching for AI model responses with TTL-based expiration.
    • Supports both in-memory and database-backed caching with vector similarity matching.
    • Automatic integration with OpenAI client for cached request interception.
  • Chores

    • Lowered minimum Python version requirement to 3.10.
    • Added pgvector dependency for vector storage support.
  • Tests

    • Comprehensive test suite for caching functionality and OpenAI integration.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@Mohammedsami001, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 5 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a91b0a69-4f0b-45a3-be3d-b7d538b1f6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 383c9dd and 9c31784.

📒 Files selected for processing (6)
  • agentwatch/adapters/interception.py
  • agentwatch/cost/caching.py
  • agentwatch/cost/semantic_cache.py
  • agentwatch/models/cache.py
  • pyproject.toml
  • tests/test_caching.py
📝 Walkthrough

Walkthrough

Adds a semantic caching engine: a hash-based SemanticCacheManager, a pgvector-backed SemanticCacheEntry SQLAlchemy model, TTL and optional DB-session support in SemanticCache, and patch_openai/unpatch_openai functions that monkey-patch AsyncCompletions.create to intercept OpenAI calls and serve or populate the cache. Six async tests cover all paths.

Changes

Semantic Caching Engine

Layer / File(s) Summary
Cache data shapes: CacheHit, CacheEntry, SemanticCacheEntry model
agentwatch/cost/caching.py, agentwatch/cost/semantic_cache.py, agentwatch/models/cache.py
CacheHit dataclass holds prompt hash, response text, and framework; CacheEntry gains a created_at UTC timestamp; SemanticCacheEntry SQLAlchemy model defines the semantic_cache table with a 384-dimension pgvector column and framework/created_at fields.
SemanticCacheManager: in-memory hash-based store/search
agentwatch/cost/caching.py
SemanticCacheManager stores and retrieves CacheHit entries from an internal dictionary using SHA-256 prompt hashing via async store and search methods.
SemanticCache: TTL expiry and DB-backed get/set paths
agentwatch/cost/semantic_cache.py
SemanticCache.__init__ gains ttl_days and db_session parameters; get adds a DB cosine-distance query with TTL cutoff and falls back to in-memory search with expired-entry pruning; set writes a SemanticCacheEntry row when a DB session is present.
OpenAI AsyncCompletions monkey-patching adapter
agentwatch/adapters/interception.py
patch_openai replaces AsyncCompletions.create with a wrapper that reads TTL from AGENTWATCH_CACHE_TTL_DAYS env or extra_body metadata, returns a synthetic ChatCompletion on cache hits, and stores real responses on misses; unpatch_openai restores the original method.
Python version and pgvector dependency
pyproject.toml
requires-python loosened to >=3.10; pgvector>=0.2.0 added to runtime dependencies.
Async test suite
tests/test_caching.py
Six async tests cover exact-match store/search, semantic matching via mocked embeddings, TTL expiry via created_at backdating, OpenAI interception with patch/unpatch, per-request extra_body TTL override, and mocked async DB session write/read.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant AsyncCompletions
  participant SemanticCache
  participant DB as PostgreSQL/pgvector
  participant OpenAI as OpenAI API

  Caller->>AsyncCompletions: create(messages, extra_body={cache_ttl_days: N})
  AsyncCompletions->>SemanticCache: get(last_message_content, ttl_days=N)
  alt DB session present
    SemanticCache->>DB: SELECT by cosine_distance with TTL cutoff
    DB-->>SemanticCache: SemanticCacheEntry row (or empty)
  else In-memory fallback
    SemanticCache->>SemanticCache: prune expired entries, cosine similarity search
  end
  alt Cache hit
    SemanticCache-->>AsyncCompletions: cached response_text
    AsyncCompletions-->>Caller: synthetic ChatCompletion
  else Cache miss
    AsyncCompletions->>OpenAI: original create(messages, ...)
    OpenAI-->>AsyncCompletions: ChatCompletion
    AsyncCompletions->>SemanticCache: set(prompt, response_text, framework="openai")
    SemanticCache->>DB: INSERT SemanticCacheEntry + commit
    AsyncCompletions-->>Caller: real ChatCompletion
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

  • [Premium] CST-005: Semantic Caching Engine #396 — This PR directly implements the CST-005 semantic caching engine: pgvector integration (SemanticCacheEntry), cosine similarity thresholding, OpenAI adapter interception (patch_openai), configurable TTL, and the new agentwatch/cost/caching.py and agentwatch/models/cache.py files all match the issue's requirements and expected file list.
  • [Feat] [ELUSOC] Implement Semantic Cache for Repeated LLM Subtasks #382 — The PR implements semantic matching, OpenAI interception, TTL-based expiry, and optional DB persistence that directly addresses the cost-reduction caching objectives described in this retrieved issue.

Possibly related PRs

  • sreerevanth/AgentWatch#412: Directly extends the same SemanticCache/CacheEntry implementation touched by that PR, adding TTL, DB persistence, and OpenAI interception on top of its foundation.

Poem

🐇 Hop, hop, a cache so bright,
No more asking OpenAI twice tonight!
With vectors stored and TTLs set,
The costliest prompts we shall not forget.
pgvector hums, the bunny grins—
Semantic savings? Everyone wins! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 52.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately and concisely describes the main change: introduction of a semantic caching engine with pgvector integration for cost reduction.
Linked Issues check ✅ Passed All coding requirements from issue #396 are met: vector store integration via pgvector, configurable similarity thresholding, OpenAI adapter interception, and TTL/threshold configurability.
Out of Scope Changes check ✅ Passed All changes directly support semantic caching objectives. Python version loosening to 3.10 is reasonable for pgvector dependency compatibility; no unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (2)
tests/test_caching.py (2)

95-203: ⚡ Quick win

Add interception regression tests for malformed TTL and streaming mode.

Current suite doesn’t assert behavior for invalid AGENTWATCH_CACHE_TTL_DAYS / cache_ttl_days values or stream=True, which are high-risk paths in the wrapper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_caching.py` around lines 95 - 203, Add two new test functions to
cover high-risk paths not currently tested. First, create a test for malformed
TTL values by testing invalid inputs for both the AGENTWATCH_CACHE_TTL_DAYS
environment variable (e.g., non-numeric strings) and the cache_ttl_days override
in extra_body metadata to ensure the semantic cache gracefully handles these
invalid values. Second, create a test for streaming mode by calling
client.chat.completions.create with stream=True parameter and verify that the
cache behaves correctly when streaming is enabled, checking whether streaming
responses are properly cached or handled appropriately. Both tests should follow
the same pattern as test_semantic_cache_manager_interception and
test_semantic_cache_manager_config_override by mocking the embedding provider,
using patch_openai and unpatch_openai context, and asserting expected behavior.

205-241: ⚡ Quick win

Add a DB-path negative test that enforces similarity threshold.

This test currently mocks a DB hit unconditionally; it won’t catch regressions where low-similarity nearest neighbors are incorrectly returned.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_caching.py` around lines 205 - 241, The
test_semantic_cache_manager_db_backend function currently mocks unconditional
database hits without verifying that the similarity threshold is enforced. Add a
negative test case after the existing assertions that sets up a second
mock_entry with embedding vectors that produce low similarity (e.g., orthogonal
vectors like [1.0, 0.0] vs [0.0, 1.0]) and verifies that when cache.get() is
called with a query that has low similarity to the cached entry, it returns None
or no hit instead of the cached response, confirming the similarity threshold
blocks incorrect matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agentwatch/adapters/interception.py`:
- Around line 79-83: The await semantic_cache.set() call in the response
handling block is not protected from exceptions, which means any failure in the
cache population operation will raise and mask the successful upstream response.
Wrap the await semantic_cache.set() call in a try-except block to catch and
handle any exceptions gracefully, ensuring that cache failures do not prevent
the successful response from being returned to the caller.
- Around line 28-37: The code directly converts untrusted values using int()
without error handling, which can raise ValueError and crash request handling.
Add try-except blocks around both the int(global_ttl_env) conversion on line 29
and the int(override_ttl) conversion on line 37 to catch ValueError exceptions.
For each conversion, handle the exception gracefully by either logging the
invalid value, falling back to a default TTL, or skipping the override, ensuring
the system continues operating safely with malformed input.
- Around line 45-50: The current code mutates the shared
`semantic_cache.ttl_days` instance variable across an await point, which causes
concurrent requests to interfere with each other's TTL settings. Remove the
lines that save the original ttl_days value and restore it after the await in
the semantic_cache.get call. Instead, modify the SemanticCache.get method
signature to accept an optional ttl_days_override parameter, and pass the ttl
value as an argument to that method rather than mutating the instance state
directly. This way, the TTL override is scoped to the specific request without
affecting shared state.
- Around line 40-74: The code currently returns a ChatCompletion object on all
cache hits, but this breaks the API contract when stream=True is passed in
kwargs. Before returning the cached_response in the semantic cache hit block,
check if kwargs.get("stream") is True. If streaming is enabled, construct and
return an AsyncStream containing ChatCompletionChunk objects with the
appropriate delta fields instead of the ChatCompletion object. Only return the
ChatCompletion object when streaming is disabled or not specified.

In `@agentwatch/cost/caching.py`:
- Around line 23-35: The cache dictionary (self._cache) is keyed only by
prompt_hash in both the storage operation where CacheHit is assigned and the
retrieval operation in the search method, which causes identical prompts from
different frameworks to overwrite each other. Modify the cache key generation to
include both the prompt_hash and the framework parameter so that the same prompt
with different frameworks are stored as separate cache entries. Update the key
construction logic in the cache assignment (around the CacheHit instantiation)
and in the search method's cache retrieval call to use a composite key that
combines prompt_hash and framework.

In `@agentwatch/cost/semantic_cache.py`:
- Around line 79-85: The return statement in the semantic cache lookup is
returning best_match_db.response_text without verifying that the match meets the
similarity threshold. Add a check after retrieving best_match_db to calculate
the cosine distance between the query_vec and the best match's prompt_vector,
then only return best_match_db.response_text if the distance satisfies the
threshold condition (distance <= 1 - similarity_threshold). If the threshold is
not met, allow the function to continue or return None to indicate no suitable
match was found.
- Around line 131-140: The database commit operation in the SemanticCacheEntry
creation block can raise exceptions and cause caller failures even when the
upstream operation succeeds. Wrap the self.db_session.add(db_entry) and await
self.db_session.commit() calls in a try-except block that logs any errors
without re-raising them, ensuring cache persistence failures do not propagate to
callers and break successful operations.

In `@pyproject.toml`:
- Line 11: The requires-python setting in pyproject.toml specifies >=3.10, but
the code uses datetime.UTC in agentwatch/cost/semantic_cache.py (lines 25, 67,
93) which is only available in Python 3.11+. Either update the requires-python
constraint to >=3.11 to match the actual minimum version required by the
codebase, or replace all occurrences of datetime.UTC with datetime.timezone.utc
throughout semantic_cache.py, which is compatible with Python 3.10.

---

Nitpick comments:
In `@tests/test_caching.py`:
- Around line 95-203: Add two new test functions to cover high-risk paths not
currently tested. First, create a test for malformed TTL values by testing
invalid inputs for both the AGENTWATCH_CACHE_TTL_DAYS environment variable
(e.g., non-numeric strings) and the cache_ttl_days override in extra_body
metadata to ensure the semantic cache gracefully handles these invalid values.
Second, create a test for streaming mode by calling
client.chat.completions.create with stream=True parameter and verify that the
cache behaves correctly when streaming is enabled, checking whether streaming
responses are properly cached or handled appropriately. Both tests should follow
the same pattern as test_semantic_cache_manager_interception and
test_semantic_cache_manager_config_override by mocking the embedding provider,
using patch_openai and unpatch_openai context, and asserting expected behavior.
- Around line 205-241: The test_semantic_cache_manager_db_backend function
currently mocks unconditional database hits without verifying that the
similarity threshold is enforced. Add a negative test case after the existing
assertions that sets up a second mock_entry with embedding vectors that produce
low similarity (e.g., orthogonal vectors like [1.0, 0.0] vs [0.0, 1.0]) and
verifies that when cache.get() is called with a query that has low similarity to
the cached entry, it returns None or no hit instead of the cached response,
confirming the similarity threshold blocks incorrect matches.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b5c8e4c1-4e8d-4146-b7d4-e1526c4e76e0

📥 Commits

Reviewing files that changed from the base of the PR and between 3b1f4b5 and 383c9dd.

📒 Files selected for processing (6)
  • agentwatch/adapters/interception.py
  • agentwatch/cost/caching.py
  • agentwatch/cost/semantic_cache.py
  • agentwatch/models/cache.py
  • pyproject.toml
  • tests/test_caching.py

Comment thread agentwatch/adapters/interception.py Outdated
Comment thread agentwatch/adapters/interception.py
Comment thread agentwatch/adapters/interception.py Outdated
Comment thread agentwatch/adapters/interception.py Outdated
Comment thread agentwatch/cost/caching.py Outdated
Comment thread agentwatch/cost/semantic_cache.py
Comment thread agentwatch/cost/semantic_cache.py
Comment thread pyproject.toml Outdated
@Mohammedsami001

Copy link
Copy Markdown
Author

Hi @sreerevanth 👋

I've pushed a commit addressing the CodeRabbit feedback. The PR is now ready to merge!

Fixes include:

  • Robust Parsing: Guarded TTL parsing with try-except blocks.
  • State Mutation: Passed ttl_days_override directly to avoid mutating shared state across await.
  • Cache Isolation: Namespaced cache keys by framework to prevent cross-provider collisions.
  • Streaming Support: Returns an AsyncStream of chunks on cache hits when stream=True to maintain the API contract.
  • Version targeting: Reverted the pyproject.toml bump to correctly stay on Python 3.12.

Tests are passing locally. Let me know if anything else is needed! 🚀

@github-actions

Copy link
Copy Markdown

🧪 PR Test Results

Check Result
Tests (pytest tests/) ❌ failure
Lint (ruff check .) ✅ success
Coverage (agentwatch) 73.23%

Python 3.12 · commit 9c31784

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Premium] CST-005: Semantic Caching Engine

2 participants