Local Supermemory MCP Server for AI Agents
A self-hosted Model Context Protocol server that gives AI agents persistent, structured memory. Rememory combines a knowledge graph with SuperRAG (hybrid vector + BM25 + graph search) so agents can store, connect, and recall information across conversations — all running locally in Docker with your own embedding and LLM endpoints.
- Knowledge graph memory: Memories are linked with typed relationships (`updates`, `extends`, `derives`) so agents can trace how knowledge evolves over time.
- Automatic memory extraction: Documents are chunked, embedded, and analyzed by an LLM to extract atomic facts, preferences, and episodes.
- Hybrid search (SuperRAG): Combines vector similarity, BM25 full-text, and graph traversal via Reciprocal Rank Fusion for high-quality recall.
- Content-type-aware chunking: Smart splitting for markdown, code, HTML, and plain text.
- Automatic forgetting: Exponential decay for episodes, confidence thresholds, and noise filtering keep memory clean.
- Configurable endpoints: Bring your own embedding and LLM endpoints (Ollama, vLLM, OpenAI, or any compatible API).
- Persistent local storage: SQLite (graph + FTS5) and LanceDB (vectors) with Docker volume persistence.
- Privacy-first: All data stays on your machine; no external services are required.
```text
┌─────────────────────────────────────────────────────────┐
│                        AI Agent                         │
└──────────────────────┬──────────────────────────────────┘
                       │ MCP (stdio / HTTP)
┌──────────────────────▼──────────────────────────────────┐
│  rememory server                                        │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Pipeline    │  │  Search      │  │  Forgetting   │  │
│  │  ──────────  │  │  ──────────  │  │  ──────────   │  │
│  │  chunker     │  │  vector      │  │  decay        │  │
│  │  extractor   │  │  BM25        │  │  expiration   │  │
│  │  embedder    │  │  graph       │  │  cleanup      │  │
│  │  relationship│  │  RRF fusion  │  │               │  │
│  └──────┬───────┘  └──────┬───────┘  └───────┬───────┘  │
│         │                 │                  │          │
│  ┌──────▼─────────────────▼──────────────────▼───────┐  │
│  │                   Storage Layer                   │  │
│  │  ┌──────────────────┐  ┌────────────────────────┐ │  │
│  │  │  SQLite + FTS5   │  │  LanceDB (vectors)     │ │  │
│  │  │  (graph, BM25)   │  │                        │ │  │
│  │  └──────────────────┘  └────────────────────────┘ │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
          ▲                                ▲
          │ HTTP                           │ HTTP
    ┌─────┴───────┐                 ┌──────┴──────┐
    │  Embedding  │                 │     LLM     │
    │  Endpoint   │                 │  Endpoint   │
    │  (Ollama)   │                 │  (Ollama)   │
    └─────────────┘                 └─────────────┘
```

Pull the default embedding and chat models (assuming Ollama serves both endpoints, as in the default configuration):

```bash
ollama pull nomic-embed-text
ollama pull llama3.2
```

Clone the repository and create your configuration:

```bash
git clone https://github.com/your-org/rememory.git
cd rememory
cp .env.example .env
# Edit .env if your endpoints differ from the defaults
```

Build the image and start the server:

```bash
docker compose build
docker compose run --rm -i rememory
```

The server starts in stdio mode by default: it reads JSON-RPC messages from stdin and writes responses to stdout, ready for MCP clients.
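To see what an MCP client actually sends over stdio, the sketch below builds the newline-delimited JSON-RPC messages for a minimal session: an `initialize` request, the `notifications/initialized` notification, and a `tools/call` request for the `search` tool. The message shapes follow the MCP stdio transport (one JSON object per line); the protocol version string, client name, and search arguments are illustrative assumptions, not values rememory requires.

```typescript
// Minimal sketch of MCP stdio framing: one JSON-RPC message per line.
// Assumptions: the protocol version and clientInfo are illustrative;
// the "search" arguments use parameter names from the tool table.
type JsonRpcMessage = {
  jsonrpc: "2.0";
  id?: number;
  method: string;
  params?: Record<string, unknown>;
};

function frame(message: JsonRpcMessage): string {
  // JSON.stringify without indentation escapes newlines inside strings,
  // so each message fits on exactly one line.
  return JSON.stringify(message) + "\n";
}

function buildSession(query: string, topK: number): string {
  const messages: JsonRpcMessage[] = [
    {
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2024-11-05",
        capabilities: {},
        clientInfo: { name: "smoke-test", version: "0.0.1" },
      },
    },
    // Notification (no id): the client has finished initializing.
    { jsonrpc: "2.0", method: "notifications/initialized" },
    {
      jsonrpc: "2.0",
      id: 2,
      method: "tools/call",
      params: { name: "search", arguments: { query, topK } },
    },
  ];
  return messages.map(frame).join("");
}

console.log(buildSession("authentication scheme", 5));
```

A real client writes these lines to the server's stdin and waits for the `initialize` response before sending anything else; responses arrive on stdout, one JSON object per line.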
For Claude Desktop, add the following to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):

```json
{
  "mcpServers": {
    "rememory": {
      "command": "docker",
      "args": [
        "compose", "-f", "/path/to/rememory/docker-compose.yml",
        "run", "--rm", "-i", "rememory"
      ]
    }
  }
}
```

For VS Code, add the following to your `settings.json`:
```json
{
  "mcp": {
    "servers": {
      "rememory": {
        "command": "docker",
        "args": [
          "compose", "-f", "/path/to/rememory/docker-compose.yml",
          "run", "--rm", "-i", "rememory"
        ]
      }
    }
  }
}
```

For Cursor, add the following to `.cursor/mcp.json` in your project, or to `~/.cursor/mcp.json` globally:
```json
{
  "mcpServers": {
    "rememory": {
      "command": "docker",
      "args": [
        "compose", "-f", "/path/to/rememory/docker-compose.yml",
        "run", "--rm", "-i", "rememory"
      ]
    }
  }
}
```

Rememory exposes 9 MCP tools:
| Tool | Description | Key Parameters |
|---|---|---|
| `add_document` | Ingest a document: chunk, embed, extract memories, and build relationships | `content`, `contentType`, `title`, `containerTags`, `extractMemories` |
| `add_memory` | Directly add a memory unit (fact, preference, or episode) | `content`, `memoryType`, `containerTags`, `expiresAt` |
| `search` | Hybrid search across memories (vector + BM25 + graph) | `query`, `searchMode`, `topK`, `containerTags`, `includeRelated` |
| `get_memory` | Retrieve a memory by ID with its relationships and version history | `id` |
| `list_memories` | List memories with filtering and pagination | `containerTags`, `memoryTypes`, `onlyLatest`, `limit`, `offset` |
| `delete_memory` | Delete a memory and its associated vectors (cascades to relationships) | `id` |
| `get_related` | Traverse the memory graph from a seed memory | `memoryId`, `depth`, `direction`, `relationshipTypes` |
| `forget` | Trigger memory cleanup: decay, expiration, and noise removal | `containerTags`, `dryRun` |
| `get_status` | Health check and storage statistics | (none) |
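The `get_related` tool walks the graph outward from a seed memory. One way such a bounded walk can work is a breadth-first search that respects a hop limit, an edge direction, and a relationship-type filter, sketched below over an in-memory edge list. The `Edge` shape, the `"outgoing" | "incoming" | "both"` direction values, and `getRelated` itself are assumptions for illustration; rememory keeps its graph in SQLite, and its actual query may differ.

```typescript
// Hypothetical edge and direction types; only the relationship names come from this README.
type RelationshipType = "updates" | "extends" | "derives";
type Direction = "outgoing" | "incoming" | "both";

interface Edge {
  from: string;
  to: string;
  type: RelationshipType;
}

// Breadth-first traversal from a seed memory. Returns each reachable
// memory ID mapped to the hop count at which it was first reached.
function getRelated(
  edges: Edge[],
  seedId: string,
  depth: number,
  direction: Direction = "both",
  relationshipTypes?: RelationshipType[],
): Map<string, number> {
  const allowed = relationshipTypes ? new Set(relationshipTypes) : null;
  const visited = new Map<string, number>([[seedId, 0]]);
  let frontier = [seedId];

  for (let hop = 1; hop <= depth && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const edge of edges) {
        if (allowed && !allowed.has(edge.type)) continue;
        // Follow the edge forward, backward, or both, depending on direction.
        const neighbors: string[] = [];
        if (direction !== "incoming" && edge.from === id) neighbors.push(edge.to);
        if (direction !== "outgoing" && edge.to === id) neighbors.push(edge.from);
        for (const n of neighbors) {
          if (!visited.has(n)) {
            visited.set(n, hop);
            next.push(n);
          }
        }
      }
    }
    frontier = next;
  }
  visited.delete(seedId); // Report related memories only, not the seed.
  return visited;
}

const edges: Edge[] = [
  { from: "m2", to: "m1", type: "updates" },
  { from: "m3", to: "m2", type: "extends" },
  { from: "m4", to: "m3", type: "derives" },
];
// m2 is reached at hop 1 and m3 at hop 2; m4 is beyond the depth limit.
console.log(getRelated(edges, "m1", 2));
```

Breadth-first order guarantees that each memory is recorded at its shortest hop distance from the seed, which is what a `depth` limit needs.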
All configuration is via environment variables. Copy .env.example to .env and customize as needed.
| Variable | Default | Description |
|---|---|---|
| `EMBEDDING_ENDPOINT` | `http://host.docker.internal:11434/api/embed` | Embedding API endpoint |
| `EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name |
| `EMBEDDING_FORMAT` | `ollama` | Response format: `ollama` or `openai` |
| `EMBEDDING_DIMENSIONS` | `auto` | Vector dimensions (`auto` to detect, or a number) |
| `EMBEDDING_BATCH_SIZE` | `32` | Max texts per embedding request |
| `LLM_ENDPOINT` | `http://host.docker.internal:11434/v1/chat/completions` | OpenAI-compatible chat completions endpoint |
| `LLM_MODEL` | `llama3.2` | LLM model name |
| `DATA_DIR` | `/data` | Storage directory (mapped to a Docker volume) |
| `TRANSPORT` | `stdio` | MCP transport: `stdio` or `http` |
| `PORT` | `3111` | HTTP port (when `TRANSPORT=http`) |
| `LOG_LEVEL` | `info` | Log level: `trace`, `debug`, `info`, `warn`, `error`, `fatal` |
| `SEARCH_TOP_K` | `10` | Default number of search results |
| `CHUNK_SIZE` | `1000` | Chunk size in characters |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks in characters |
| `WEIGHT_VECTOR` | `0.5` | Hybrid search weight for vector similarity |
| `WEIGHT_BM25` | `0.3` | Hybrid search weight for BM25 full-text |
| `WEIGHT_GRAPH` | `0.2` | Hybrid search weight for graph traversal |
| `DECAY_LAMBDA` | `0.05` | Exponential decay rate for episode confidence |
| `CONFIDENCE_THRESHOLD` | `0.1` | Auto-remove memories below this confidence |
| `CLEANUP_INTERVAL_MINUTES` | `60` | Minutes between automatic forgetting cycles |
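Because all settings arrive as strings in environment variables, a loader along these lines can turn them into typed values with the documented defaults and fail fast on bad input. This is a sketch, not rememory's own config module: the `Config` shape and the `loadConfig`, `num`, and `parseDimensions` names are hypothetical, and only a subset of the variables is covered.

```typescript
// Hypothetical typed view over a subset of rememory's environment variables.
interface Config {
  embeddingEndpoint: string;
  embeddingDimensions: number | "auto";
  transport: "stdio" | "http";
  port: number;
  chunkSize: number;
  chunkOverlap: number;
  weights: { vector: number; bm25: number; graph: number };
}

type Env = Record<string, string | undefined>;

// Read a numeric variable, falling back to the documented default when unset.
function num(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) throw new Error(`${key} must be a number, got "${raw}"`);
  return value;
}

// "auto" is kept as a marker so the dimension can be detected at runtime.
function parseDimensions(raw: string | undefined): number | "auto" {
  if (raw === undefined || raw === "" || raw === "auto") return "auto";
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`EMBEDDING_DIMENSIONS must be "auto" or a positive integer, got "${raw}"`);
  }
  return n;
}

function isTransport(value: string): value is Config["transport"] {
  return value === "stdio" || value === "http";
}

function loadConfig(env: Env): Config {
  const transport = env.TRANSPORT || "stdio";
  if (!isTransport(transport)) {
    throw new Error(`TRANSPORT must be "stdio" or "http", got "${transport}"`);
  }
  return {
    embeddingEndpoint: env.EMBEDDING_ENDPOINT || "http://host.docker.internal:11434/api/embed",
    embeddingDimensions: parseDimensions(env.EMBEDDING_DIMENSIONS),
    transport,
    port: num(env, "PORT", 3111),
    chunkSize: num(env, "CHUNK_SIZE", 1000),
    chunkOverlap: num(env, "CHUNK_OVERLAP", 200),
    weights: {
      vector: num(env, "WEIGHT_VECTOR", 0.5),
      bm25: num(env, "WEIGHT_BM25", 0.3),
      graph: num(env, "WEIGHT_GRAPH", 0.2),
    },
  };
}

console.log(loadConfig({ TRANSPORT: "http", EMBEDDING_DIMENSIONS: "768" }));
```

In a Node process you would pass `process.env` to `loadConfig`; taking a plain record keeps the function easy to test.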
A document is raw input — a text, a conversation, a code file. When you call add_document, rememory chunks the document, embeds the chunks for vector search, and optionally uses an LLM to extract memories: atomic, self-contained knowledge units typed as:
- Fact — A persistent truth ("The API uses JWT authentication")
- Preference — A repeated pattern or user preference ("Prefers TypeScript over JavaScript")
- Episode — A time-bound event ("Deployed v2.3 on March 15th") that can expire
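The chunking step described above is controlled by `CHUNK_SIZE` and `CHUNK_OVERLAP` (1000 and 200 characters by default). The sketch below shows only the plain-text baseline: fixed-size character windows in which each chunk repeats the last `overlap` characters of the previous one, so text near a boundary appears in two chunks. Rememory's content-type-aware splitting for markdown, code, and HTML is not reproduced here, and `chunkText` is a hypothetical name.

```typescript
// Plain fixed-window chunking with overlap (no content-type awareness).
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");
  const chunks: string[] = [];
  const step = size - overlap; // How far each window advances.
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // This window reached the end.
  }
  return chunks;
}

const chunks = chunkText("a".repeat(2500));
console.log(chunks.map((c) => c.length)); // [ 1000, 1000, 900 ]
```

With the defaults, each window advances 800 characters, so a 2,500-character document yields windows starting at 0, 800, and 1,600.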
Every new memory is compared against existing memories using embedding similarity and LLM classification. Three relationship types form the knowledge graph:
- `updates`: The new memory contradicts or supersedes an existing one. The old memory is marked `isLatest = false`, preserving full version history.
- `extends`: The new memory adds detail to an existing one. Both remain current.
- `derives`: The new memory is inferred from an existing one during extraction.
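The version bookkeeping behind these relationship types can be sketched as follows. It illustrates only the `isLatest` semantics described above; the `Memory` and `Relationship` shapes and the `linkMemory` function are hypothetical, and the similarity and LLM classification steps that choose the relationship type are omitted.

```typescript
type MemoryType = "fact" | "preference" | "episode";
type RelationshipType = "updates" | "extends" | "derives";

// Hypothetical record shapes for illustration.
interface Memory {
  id: string;
  content: string;
  memoryType: MemoryType;
  isLatest: boolean;
}

interface Relationship {
  from: string; // The new memory.
  to: string; // The existing memory it relates to.
  type: RelationshipType;
}

// Store a new memory and link it to an existing one.
function linkMemory(
  store: Map<string, Memory>,
  edges: Relationship[],
  newMemory: Memory,
  existingId: string,
  type: RelationshipType,
): void {
  const existing = store.get(existingId);
  if (!existing) throw new Error(`unknown memory ${existingId}`);
  store.set(newMemory.id, { ...newMemory, isLatest: true });
  edges.push({ from: newMemory.id, to: existingId, type });
  // Only "updates" supersedes: the old version stays stored for history
  // but is no longer current. "extends" and "derives" leave both current.
  if (type === "updates") existing.isLatest = false;
}

const store = new Map<string, Memory>([
  ["m1", { id: "m1", content: "Deploys run on Fridays", memoryType: "fact", isLatest: true }],
]);
const edges: Relationship[] = [];
linkMemory(
  store,
  edges,
  { id: "m2", content: "Deploys run on Tuesdays", memoryType: "fact", isLatest: true },
  "m1",
  "updates",
);
console.log(store.get("m1")?.isLatest, store.get("m2")?.isLatest); // false true
```

Keeping the superseded memory instead of deleting it is what lets `get_memory` return version history while search still filters to `isLatest = true`.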
Search combines three signals via Reciprocal Rank Fusion (RRF, k=60):
- Vector similarity (default weight: 0.5) — Cosine similarity via LanceDB
- BM25 full-text (default weight: 0.3) — SQLite FTS5 with Porter stemming
- Graph traversal (default weight: 0.2) — 1–2 hop traversal from top vector matches
Results are filtered to isLatest = true memories and optionally expanded with related memories via graph traversal.
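Reciprocal Rank Fusion merges the three result lists by rank position rather than raw score, so cosine similarities and BM25 scores never need to share a scale. How rememory applies the `WEIGHT_*` values is an assumption here: the sketch uses the common weighted form, where a memory's score is the sum over lists of `weight / (k + rank)` with 1-based ranks and k = 60. `fuseRankings` and the sample IDs are hypothetical.

```typescript
// Weighted Reciprocal Rank Fusion (assumed weighting scheme; see above).
interface RankedList {
  weight: number;
  ids: string[]; // Memory IDs, best match first.
}

function fuseRankings(lists: RankedList[], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { weight, ids } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // RRF uses 1-based ranks.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const fused = fuseRankings([
  { weight: 0.5, ids: ["a", "b", "c"] }, // vector similarity
  { weight: 0.3, ids: ["b", "a"] }, // BM25 full-text
  { weight: 0.2, ids: ["c", "b"] }, // graph traversal
]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

Here `b` wins because it appears in all three lists, even though `a` ranks first in the most heavily weighted vector list; the large `k` keeps any single first-place rank from dominating.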
The forgetting engine keeps memory clean and relevant:
- Expiration: Episodes past their `expiresAt` are removed.
- Decay: Episode confidence decays exponentially as `confidence × e^(-λ × days) × (1 + 0.1 × accessCount)`, where λ is `DECAY_LAMBDA`.
- Threshold: Memories below `CONFIDENCE_THRESHOLD` are auto-removed.
- Strengthening: Frequently accessed preferences (`accessCount > 5`) have their strength increased.
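Run the `forget` tool manually (set `dryRun` to preview what would be removed), or let the automatic cleanup cycle, which runs every `CLEANUP_INTERVAL_MINUTES`, handle it.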
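The decay and threshold rules above can be written as two small functions. This sketch transcribes the decay formula literally, with λ defaulting to `DECAY_LAMBDA` (0.05) and the cutoff to `CONFIDENCE_THRESHOLD` (0.1). Treating `days` as days elapsed since the episode was recorded is an assumption, and `decayedConfidence` and `shouldForget` are hypothetical names.

```typescript
// Literal transcription of the decay rule for episodes:
//   confidence × e^(-λ × days) × (1 + 0.1 × accessCount)
// Assumption: `days` is the number of days since the episode was recorded.
function decayedConfidence(
  confidence: number,
  days: number,
  accessCount: number,
  lambda = 0.05, // DECAY_LAMBDA default
): number {
  return confidence * Math.exp(-lambda * days) * (1 + 0.1 * accessCount);
}

// An episode is removed once its decayed confidence drops below the threshold.
function shouldForget(
  confidence: number,
  days: number,
  accessCount: number,
  threshold = 0.1, // CONFIDENCE_THRESHOLD default
  lambda = 0.05,
): boolean {
  return decayedConfidence(confidence, days, accessCount, lambda) < threshold;
}

for (const days of [0, 14, 30, 60]) {
  const value = decayedConfidence(0.8, days, 0);
  console.log(`day ${days}: confidence ${value.toFixed(3)}, forget=${shouldForget(0.8, days, 0)}`);
}
```

With these defaults, an episode that starts at confidence 0.8 and is never accessed crosses the 0.1 threshold after about 42 days, since 0.8 × e^(-0.05 × d) < 0.1 once d > ln(8) / 0.05 ≈ 41.6. Each recorded access raises the multiplier by 0.1, which pushes that point further out.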
To run rememory locally without Docker, install the dependencies:

```bash
npm install
```

Then set the environment variables and start the development server:

```bash
# Set environment variables (or create a .env file)
export EMBEDDING_ENDPOINT=http://localhost:11434/api/embed
export LLM_ENDPOINT=http://localhost:11434/v1/chat/completions
export DATA_DIR=./data
npm run dev
```

Build, test, and lint:

```bash
npm run build
npm test
npm run test:watch  # Watch mode
npm run lint
```

Rememory includes a LongMemEval integration for benchmarking memory accuracy across 500 questions covering temporal reasoning, knowledge updates, multi-session aggregation, and more.
```bash
# Quick 30-question pilot (~4-5 hours)
npx tsx scripts/longmemeval.ts --limit 30 --stratify --output data/results.jsonl

# Evaluate results
npx tsx scripts/longmemeval-eval.ts --results data/results.jsonl
```

See `docs/BENCHMARK.md` for the full guide, configuration options, and our 83.3% accuracy results on the 30-question stratified pilot.
License: MIT