A high-performance Rust library and CLI suite featuring an FST-backed phrase matcher, on-demand document indexer, and field-aware BM25 hybrid search engine.
Get Lume up and running in under two minutes.
Ensure you have the Rust toolchain installed (see the comprehensive rust_setup.md for local environment configuration), then clone and compile:
# Clone the repository
git clone https://github.com/kordless/lume.git
cd lume
# Run the test suite (MiniRoaring bitsets, FST match matrices, and spelling correctors)
cargo test
# Build fully optimized production release binaries
cargo build --release
To fetch and format The Count of Monte Cristo (~2.66 MB) as an on-demand test corpus:
mkdir -p examples
curl -L -s https://www.gutenberg.org/files/1184/1184-0.txt > examples/monte_cristo.txt
# Convert Gutenberg chapter titles into markdown headers for the parser
sed -E 's/^(CHAPTER [0-9]+)\. (.*)$/# \1. \2/' examples/monte_cristo.txt > examples/monte_cristo.md
Lume crawls, tokenizes, indexes, and queries the corpus in milliseconds:
# Crawl and search the entire examples/ directory on-the-fly
DATA="examples/data" cargo run --release --bin search -- examples "monte cristo"
Coordinate with a remote neural embedder (shivvr.nuts.services) to run conceptual searches:
DATA="examples/data" ALPHA=2.0 cargo run --release --bin hatcher-boost -- examples/monte_cristo.md
Lume is the story of ideas moving from one person to another: a search meme carried through years of crawling systems, open-source heritage, industrial search consulting, and modern AI capability.
It all started with web crawling. Back in the early days of distributed search, Kord Campbell created Grub—a massively distributed web crawler. After installing Lucene, Kord sent an email to Eric Schmidt (then-CEO of Google), saying: "Hey, I've got this super fast distributed crawler." Schmidt replied with a classic search insight: "That's not the problem. We've got crawling figured out. Indexing is the challenge."
Decades later, that conversation has come full circle. In the age of AI, crawling is everything again. To feed frontier LLMs, you have to crawl to get the content, and you need a crawler that you can control.
But once you crawl it, where do you put it?
You can't crawl the web fresh every single time you need an answer. Web pages are a type of document memory. Unlike bot or conversational memory (like an LLM remembering that a user's parrot is blue), document memory is about capturing the precise text you just saw. Some of these pages never update, while others update every minute with stock prices or weather data. You need a dedicated, extremely fast local document store to hold and index this memory.
That's when the pieces fell into place. Kord was watching LinkedIn and saw Steve Harris post about porting his zero-dependency JavaScript FST tagger to Rust (released as fstguardrails). Steve had run Portaltown, a search consultancy, and had worked for Lucidworks. His background as a U.S. Marine Corps air traffic controller deeply influenced how he designed systems: a pure focus on safety, extreme precision, and bare-metal performance.
Kord saw Steve's post and realized: "That FST tagger is the first part of our document index."
To turn that FST tagger into a complete, lightweight search engine, Kord drew on years of shared search history. During his time consulting at Lucidworks, Kord had met OG search veterans Trey Grainger and Erik Hatcher.
Trey's work on Solr's Semantic Knowledge Graph (SKG) had always stuck with Kord. The concept seemed complex, but Erik Hatcher had delivered the ultimate "aha" moment by putting it simply:
Facets are just counts of the occurrences of something in a document. The Knowledge Graph is simply looking at those counts across all documents to perform document intersections. It is just counting the counts of things.
That was the magic of Erik Hatcher: he has always had the unique gift of taking complex technology and showing everyone how it actually works under the hood. (We throw affectionate shade at Trey for making it look complicated, and at Erik for making it look too simple!)
Understanding that primitive meant realizing a high-speed search engine didn't need millions of lines of code. It just needed to do simple things incredibly fast: FSTs for words, roaring bitmaps for set intersections, spell correction for misspellings, and multiplicative hybrid boosting for vector context. We wanted a system built with only one external library (tantivy-fst), and the rest made of pure, clean, understandable ideas.
A few months ago, Kord fired up Claude and said, "Build me a simple BM25 index for these words." It did, and looking at the clean code made the mathematical simplicity of BM25 click.
When Steve's Rust FST was released, Kord fired up Antigravity—a frontier-level AI coding assistant that Google just dropped days ago. Since Google is the undisputed king of web search, it is beautifully poetic that their own state-of-the-art AI pair-programmer helped write this search engine.
Working in a continuous human-AI feedback loop with Kord's prattling systems-brain guiding the high-level design, Antigravity rapidly wrote and assembled the primitives. In a single 24-hour sprint, Lume went from an FST tagger to a complete, powerful, ultra-small search engine mesh. Kord called Steve, and they decided to put it out there for the world.
Lume is designed as a stack of modular, self-contained search primitives. Each layer builds upon the FST word tree to add query understanding, search relevance, spatial graphs, and semantic intent.
graph TD
P1[Primitive 1: FST Word Tree] --> P2[Primitive 2: MiniRoaring Postings]
P2 --> P3[Primitive 3: Gödel & PrimeFilters]
P3 --> P4[Primitive 4: Field-Aware BM25]
P4 --> P5[Primitive 5: Trigram Spelling Index]
P5 --> P6[Primitive 6: Semantic Entity Co-occurrence Graph]
P6 --> P7[Primitive 7: Hatcher Semantic Boosting]
The foundation of the engine is the Finite State Transducer. It parses dictionaries of phrases (loaded dynamically from CSV files in the DATA directory) and compiles them into a single FST byte map.
- What it does: Scans incoming text streams and tags entities in $O(\text{text length})$ time, matching synonyms and resolving overlapping spans using a longest-match policy.
- OG Code Reference:
  // Compiles search phrases into deterministic state paths in src/lib.rs
  let mut builder = MapBuilder::memory();
  for (key, idx) in &keyed {
      builder.insert(key, *idx)?;
  }
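The longest-match policy can be illustrated without the FST itself. The following is a minimal sketch, assuming whitespace tokenization and a plain `HashMap` standing in for the compiled FST; `Span` and `tag_longest` are hypothetical names, not Lume's API. Unlike a real FST walk, it retries windows of decreasing length at each position, so it shows the policy rather than the linear-time traversal.

```rust
use std::collections::HashMap;

/// A tagged span: token range [start, end) plus the dictionary id it matched.
#[derive(Debug, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
    pub id: u64,
}

/// Tags `text` against `dict` (lowercased phrase -> id), preferring the
/// longest phrase that starts at each token position.
pub fn tag_longest(text: &str, dict: &HashMap<String, u64>) -> Vec<Span> {
    let tokens: Vec<String> = text.split_whitespace().map(|t| t.to_lowercase()).collect();
    // Longest dictionary phrase, measured in tokens.
    let max_len = dict.keys().map(|k| k.split(' ').count()).max().unwrap_or(0);
    let mut spans = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let longest = max_len.min(tokens.len() - i);
        let mut matched = false;
        // Try the widest window first so longer phrases win over their prefixes.
        for len in (1..=longest).rev() {
            let candidate = tokens[i..i + len].join(" ");
            if let Some(&id) = dict.get(&candidate) {
                spans.push(Span { start: i, end: i + len, id });
                i += len; // jump past the match so spans never overlap
                matched = true;
                break;
            }
        }
        if !matched {
            i += 1;
        }
    }
    spans
}
```

With a dictionary containing both "monte cristo" and "count of monte cristo", the text "The Count of Monte Cristo left Paris" yields a single span for the longer phrase rather than two overlapping ones.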
To support lightning-fast document isolation, we represent the posting lists of the search index using roaring bitmaps.
- What it does: Instead of tracking document lists with basic integer arrays or maps, Lume maps terms to custom-built MiniRoaring bitsets. For multi-word queries, it computes bitset intersections (AND) or unions (OR) in microseconds to immediately restrict candidate documents.
- OG Code Reference:
  // Intersects document hit candidate sets instantly in src/fast_retrieval.rs
  pub fn intersection(&self, other: &Self) -> Self {
      // High-speed bitwise AND operations over packed integer blocks
  }
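To make the AND/OR idea concrete, here is a minimal sketch of posting lists as packed 64-bit words. It is a flat bitset, not a roaring layout (a roaring bitmap additionally splits the id space into containers), and `DocBitset` is a hypothetical name rather than Lume's MiniRoaring type.

```rust
/// A fixed-universe bitset: bit `d` is set when document `d` contains the term.
#[derive(Clone, Debug, PartialEq)]
pub struct DocBitset {
    words: Vec<u64>,
}

impl DocBitset {
    /// Creates an empty set able to hold document ids in 0..num_docs.
    pub fn new(num_docs: usize) -> Self {
        Self { words: vec![0; (num_docs + 63) / 64] }
    }

    /// Marks `doc` as present. Panics if `doc` is outside the universe.
    pub fn insert(&mut self, doc: usize) {
        self.words[doc / 64] |= 1u64 << (doc % 64);
    }

    /// AND: documents present in both sets (assumes equal universes).
    pub fn intersection(&self, other: &Self) -> Self {
        let words: Vec<u64> = self.words.iter().zip(&other.words).map(|(a, b)| a & b).collect();
        Self { words }
    }

    /// OR: documents present in either set (assumes equal universes).
    pub fn union(&self, other: &Self) -> Self {
        let words: Vec<u64> = self.words.iter().zip(&other.words).map(|(a, b)| a | b).collect();
        Self { words }
    }

    /// Lists the set document ids in ascending order.
    pub fn docs(&self) -> Vec<usize> {
        let mut out = Vec::new();
        for (w, &word) in self.words.iter().enumerate() {
            let mut bits = word;
            while bits != 0 {
                out.push(w * 64 + bits.trailing_zeros() as usize);
                bits &= bits - 1; // clear the lowest set bit
            }
        }
        out
    }
}
```

Each 64-bit AND resolves 64 documents at once, which is why restricting candidates for a multi-word query is cheap compared with merging sorted integer lists.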
We use number theory to completely bypass expensive scoring loops for irrelevant documents.
- What it does:
  - Gödel Modulo Pruning: If a query has FST tags, Lume tests each candidate document in $O(1)$ time by checking whether the document's Gödel tag signature is divisible by the query's prime signature, and skips the document when it is not: tagSignature % queryTagPrime == 0
  - PrimeFilter Skips: Before scoring a document, Lume checks a bitset-like signature bucket. If the division has a remainder, the document is guaranteed not to contain the term, and we skip standard HashMap lookups entirely: signatures[bucket] % termPrime == 0
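The divisibility test above can be sketched in a few lines. This assumes each tag id is assigned one prime and a document's signature is the product of the primes of its distinct tags; the `PRIMES` table, `signature`, and `may_contain` are hypothetical names, and the eight-entry table is only large enough for illustration.

```rust
/// One prime per tag id. A sketch: a real dictionary would need a larger
/// table and a strategy for signatures that overflow the integer width.
const PRIMES: [u128; 8] = [2, 3, 5, 7, 11, 13, 17, 19];

/// Multiplies the primes of all distinct tags into a single signature.
/// Returns None for an unknown tag id or on overflow.
pub fn signature(tag_ids: &[usize]) -> Option<u128> {
    let mut seen = [false; PRIMES.len()];
    let mut sig: u128 = 1;
    for &t in tag_ids {
        let p = *PRIMES.get(t)?;
        if !seen[t] {
            seen[t] = true; // each tag contributes its prime once
            sig = sig.checked_mul(p)?;
        }
    }
    Some(sig)
}

/// True when every tag encoded in `query_sig` is also encoded in `doc_sig`.
/// A non-zero remainder proves at least one query tag is missing.
pub fn may_contain(doc_sig: u128, query_sig: u128) -> bool {
    doc_sig % query_sig == 0
}
```

For example, a document tagged with ids 0, 2, and 3 gets the signature 2 * 5 * 7 = 70; a query for tags 0 and 2 has signature 10, and 70 % 10 == 0, so the document stays in the candidate set. A query for tag 1 (prime 3) leaves a remainder, so the document is skipped without touching its term data.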
The lexical scoring core implements BM25 ranking, allowing fields (like document titles vs. document bodies) to carry different weights.
- What it does: Evaluates matches and assigns relevance scores based on three configurable formulations: Classic, BM25+, and BM25-L (optimized for varying document lengths).
- OG Code Reference:
  // Scoring logic configured dynamically via environment parameters:
  let bm25_score = idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * (doc_len / avg_len)));
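As a self-contained illustration of the formula above, the sketch below wraps it in a small type and adds a BM25+ variant. Several pieces are assumptions rather than Lume's confirmed behavior: the Lucene-style IDF, the default `k1 = 1.2` and `b = 0.75`, the `delta` parameter, and treating "field-aware" as a weighted sum of per-field scores. All names are hypothetical.

```rust
/// BM25 free parameters. The defaults are common choices, not Lume's confirmed settings.
pub struct Bm25 {
    pub k1: f64,
    pub b: f64,
}

impl Default for Bm25 {
    fn default() -> Self {
        Self { k1: 1.2, b: 0.75 }
    }
}

impl Bm25 {
    /// Lucene-style inverse document frequency (assumed variant; always positive).
    pub fn idf(num_docs: f64, doc_freq: f64) -> f64 {
        ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln()
    }

    /// Classic BM25 contribution of one term, same shape as the reference line above.
    pub fn classic(&self, idf: f64, tf: f64, doc_len: f64, avg_len: f64) -> f64 {
        idf * (tf * (self.k1 + 1.0))
            / (tf + self.k1 * (1.0 - self.b + self.b * (doc_len / avg_len)))
    }

    /// BM25+ adds a constant `delta` to the saturated term frequency so that
    /// very long documents still receive credit for a match.
    pub fn plus(&self, idf: f64, tf: f64, doc_len: f64, avg_len: f64, delta: f64) -> f64 {
        let norm = tf + self.k1 * (1.0 - self.b + self.b * (doc_len / avg_len));
        idf * ((tf * (self.k1 + 1.0)) / norm + delta)
    }
}

/// Combines per-field scores as a weighted sum of (weight, score) pairs,
/// e.g. a title field weighted above the body field.
pub fn field_weighted(per_field: &[(f64, f64)]) -> f64 {
    per_field.iter().map(|(w, s)| w * s).sum()
}
```

With `b = 0.0` length normalization is switched off, so a single occurrence (`tf = 1`) scores exactly `idf`; raising `b` penalizes documents longer than the corpus average.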
A dedicated spelling corrector built directly into the indexing phase.
- What it does: Breaks down both the static FST tagger phrases and the corpus terms into character-level trigrams (e.g., "this" $\rightarrow$ ["_th", "thi", "his", "is_"]). It builds an inverted index of these trigrams using roaring bitmaps and BM25.
- Fuzzy Guardrails: If a user misspells an FST tag (including naughty/offensive words), Lume's spelling index automatically maps it back to the closest matching dictionary term using Levenshtein edit-distance checks before querying.
- OG Code Reference:
  // Resolves typos like "lucne" to "lucene" in src/spelling.rs
  let suggestions = spell_index.correct_word("lucne", 1); // Returns "lucene"
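The trigram-plus-Levenshtein idea can be sketched as follows. This is a simplified stand-in: it scans a small dictionary slice instead of querying an inverted trigram index, and `trigrams`, `levenshtein`, and `correct` are hypothetical names, not the functions in src/spelling.rs.

```rust
/// Splits a word into character trigrams after padding both ends with '_'.
pub fn trigrams(word: &str) -> Vec<String> {
    let padded: Vec<char> = format!("_{}_", word).chars().collect();
    padded.windows(3).map(|w| w.iter().collect()).collect()
}

/// Levenshtein edit distance via the standard two-row dynamic program.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Returns the closest dictionary term within `max_edits`, considering only
/// candidates that share at least one trigram with the input.
pub fn correct<'a>(word: &str, dictionary: &[&'a str], max_edits: usize) -> Option<&'a str> {
    let grams = trigrams(word);
    dictionary
        .iter()
        .copied()
        .filter(|cand| trigrams(cand).iter().any(|g| grams.contains(g)))
        .map(|cand| (levenshtein(word, cand), cand))
        .filter(|(dist, _)| *dist <= max_edits)
        .min_by_key(|(dist, _)| *dist)
        .map(|(_, cand)| cand)
}
```

The trigram filter is the cheap step that narrows the dictionary; the edit-distance check is the exact step that runs only on the survivors.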
A mathematical graphing layer crossing FST dictionary tags with document roaring bitmaps, inspired by Trey Grainger's Semantic Knowledge Graph (SKG) design.
- What it does: Computes the exact co-occurrence frequency and Jaccard similarity between all registered entities based on their document overlap:
  Jaccard(A, B) = |A ∩ B| / |A ∪ B|
  Drawing from Trey's original work on Solr's Semantic Knowledge Graph (SKG) at Lucidworks, this demonstrates how search indices can calculate semantic relationships dynamically. Lume serializes this network into a clean monte_cristo_graph.json mesh file in less than 1 millisecond using a custom zero-dependency JSON writer, displaying a beautiful box-aligned relationship grid in the console:
┌──────────────────────────────┬──────────────────────────────┬────────┬───────┐
│ ENTITY A                     │ ENTITY B                     │JACCARD │CO-OCC │
├──────────────────────────────┼──────────────────────────────┼────────┼───────┤
│ VALENTINE                    │ VILLEFORT                    │ 0.4400 │ 33/75 │
│ DANTES                       │ MARSEILLES                   │ 0.4386 │ 25/57 │
│ ALBERT                       │ MONTECRISTO                  │ 0.4368 │ 38/87 │
│ DANTES                       │ MERCEDES                     │ 0.4348 │ 20/46 │
│ DANGLARS                     │ MONTECRISTO                  │ 0.4272 │ 44/103 │
│ ALBERT                       │ PARIS                        │ 0.4186 │ 36/86 │
│ MONTECRISTO                  │ VILLEFORT                    │ 0.4144 │ 46/111 │
│ MAXIMILIAN                   │ VALENTINE                    │ 0.4043 │ 19/47 │
│ DANGLARS                     │ VILLEFORT                    │ 0.4021 │ 39/97 │
│ MARSEILLES                   │ PARIS                        │ 0.4000 │ 36/90 │
│ DANTES                       │ PHARAON                      │ 0.3750 │ 15/40 │
│ MARSEILLES                   │ VILLEFORT                    │ 0.3721 │ 32/86 │
│ FERNAND                      │ MERCEDES                     │ 0.3714 │ 13/35 │
│ DANTES                       │ FARIA                        │ 0.3684 │ 14/38 │
│ CHATEAUDIF                   │ DANTES                       │ 0.3500 │ 14/40 │
│ MARSEILLES                   │ MERCEDES                     │ 0.3455 │ 19/55 │
│ MERCEDES                     │ PHARAON                      │ 0.3429 │ 12/35 │
│ MAXIMILIAN                   │ NOIRTIER                     │ 0.3400 │ 17/50 │
└──────────────────────────────┴──────────────────────────────┴────────┴───────┘
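Reading the grid, the CO-OCC column is the intersection size over the union size, and JACCARD is that ratio as a decimal. A minimal sketch of the computation, using `BTreeSet<u32>` document-id sets in place of roaring bitmaps (the function name `jaccard` is hypothetical):

```rust
use std::collections::BTreeSet;

/// Returns (|A ∩ B|, |A ∪ B|, Jaccard similarity) for two document-id sets.
/// Two empty sets are given a similarity of 0.0 to avoid dividing by zero.
pub fn jaccard(a: &BTreeSet<u32>, b: &BTreeSet<u32>) -> (usize, usize, f64) {
    let inter = a.intersection(b).count();
    // Inclusion-exclusion avoids materializing the union.
    let union = a.len() + b.len() - inter;
    let score = if union == 0 { 0.0 } else { inter as f64 / union as f64 };
    (inter, union, score)
}
```

For instance, two made-up sets of 54 documents each that share 33 documents give a union of 75 and a similarity of 33/75 = 0.44, the same ratio as the first row of the grid.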
Our flagship hybrid integration, implementing the two-stage Semantic Boosting pattern pioneered by Erik Hatcher (co-founder of Lucidworks).
- What it does: Combines the precision and safety of local lexical search with the conceptual awareness of deep-neural ONNX embeddings:
  - Stage 1 (ONNX Semantic Retrieval & Local Cache): Searches our local persistent index (.lume-semantic-cache.json) to find query results instantly offline (<1ms). On a cache miss, it lazily connects to an ephemeral session on https://shivvr.nuts.services/ to fetch semantic similarity scores and automatically preserves the query to the local cache.
  - Stage 2 (Local Lexical Scoring): Calculates standard BM25 rankings.
  - True Union Blending: Blends the candidates into a true Set Engine Union. If a document matches lexically, we apply Erik Hatcher's multiplicative boost:
    Score_hybrid = Score_BM25 * (1.0 + alpha * Similarity_semantic)
    If a document only matches semantically (without keyword matches), it falls back to its raw Similarity_semantic score, ensuring conceptual matches are still ranked and retrieved.
- OG Code Reference:
  // Core hybrid blend in src/bin/hatcher_boost.rs
  let hybrid_score = if bm25_score > 0.0 {
      bm25_score * (1.0 + alpha * sem_score)
  } else {
      sem_score
  };
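To show how the blend behaves across a full candidate union, here is a minimal sketch that applies the same branch to two score maps keyed by document id. The function name `hybrid_rank` and the map-based inputs are assumptions for illustration; only the scoring expression comes from the reference above.

```rust
use std::collections::{BTreeSet, HashMap};

/// Ranks the union of lexical and semantic candidates, highest score first.
/// `bm25` and `semantic` map document ids to their respective scores.
pub fn hybrid_rank(
    bm25: &HashMap<u32, f64>,
    semantic: &HashMap<u32, f64>,
    alpha: f64,
) -> Vec<(u32, f64)> {
    // Union of candidate ids from both retrieval stages.
    let ids: BTreeSet<u32> = bm25.keys().chain(semantic.keys()).copied().collect();
    let mut ranked: Vec<(u32, f64)> = ids
        .into_iter()
        .map(|id| {
            let bm25_score = bm25.get(&id).copied().unwrap_or(0.0);
            let sem_score = semantic.get(&id).copied().unwrap_or(0.0);
            let hybrid = if bm25_score > 0.0 {
                bm25_score * (1.0 + alpha * sem_score) // lexical hit, semantically boosted
            } else {
                sem_score // semantic-only hit keeps its raw similarity
            };
            (id, hybrid)
        })
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

With `alpha = 2.0`, a document scoring 1.0 lexically and 0.9 semantically reaches 2.8 and overtakes a purely lexical document scoring 2.0, while a semantic-only document at 0.5 still appears at the bottom of the list instead of being dropped.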
| Subsystem Component | Inputs | Primary Outputs | Primary Technology |
|---|---|---|---|
| FST Tagger (tag-server) | Text Stream + CSVs | Structured JSON Spans | tantivy-fst Trie Walking |
| Lexical Search (search) | Multi-term Query | Ranked Text Snippets | Field-Aware BM25 + Highlighting |
| Spelling Index | String with Typos | Suggested Corrections | Trigram Bitmaps + Levenshtein DP |
| Entity Mesh (Option A) | Document Corpus | monte_cristo_graph.json | Pairwise Roaring Bitset Jaccard |
| Prose Generator (Option C) | Seed Token | Generated Text Paragraphs | Trigram Markov Chain + SimpleRng |
| Semantic Boost (hatcher-boost) | Text Stream + Query | Comparative Relevance Grid | Local BM25 + Remote Vector Embeddings |
Lume is compiled as a single unified binary (lume). The individual commands (tag, tag-server, search, hatcher-boost) are also generated as thin, backward-compatible wrappers that delegate to the unified library entrypoints.
# Clean build and compile fully optimized binaries
cargo build --release
# Single-file search targeting a markdown document
DATA="examples/data" ./target/release/lume search examples/monte_cristo.md "mercedes dantes"
# On-demand recursive directory crawling and indexing
DATA="examples/data" ./target/release/lume search examples "monte cristo"
# Construct entity relationship mesh (computes Jaccard > 0.02)
DATA="examples/data" ./target/release/lume search examples/monte_cristo.md graph 0.02
# Seed paragraph writer in Dumas' style
DATA="examples/data" ./target/release/lume search examples/monte_cristo.md generate Dantès
# One-shot semantic-boosting with custom alpha weight
DATA="examples/data" ALPHA=3.0 ./target/release/lume hatcher-boost examples/monte_cristo.md "mercedes dantes"
# Launch the interactive semantic-boost REPL console
DATA="examples/data" ALPHA=2.0 ./target/release/lume hatcher-boost examples/monte_cristo.md
Lume integrates seamlessly with grub.nuts.services to crawl online resources, convert them to high-fidelity markdown format on-the-fly, and automatically save them to your personal search engine document collection (examples/crawled/) for immediate BM25 hybrid querying.
Note
Authentication Token: A valid nuts.services token is required. You can obtain your own crawler key directly at nuts.services.
Set your token in your local .env file (which .gitignore already excludes from git):
NUTS_SERVICES_TOKEN=your_token_here
# Crawl any webpage directly into your personal document library
./target/release/lume crawl https://news.ycombinator.com
# Instantly query the crawled collection with BM25 hybrid search
DATA="examples/data" ./target/release/lume search examples/crawled "show hn"
Lume integrates a high-performance Model Context Protocol (MCP) server natively over stdio. It exposes search, entity extraction, and text generation primitives directly to AI agents and IDEs (like Cursor or Claude Desktop) offline.
DATA="examples/data" ./target/release/lume mcp
- lume_tag: Extracts FST-dictionary entities from a text block, returning structured offsets, kinds, unique IDs, and outputs as JSON.
- lume_search: Performs field-aware BM25 hybrid search over a file or directory on-the-fly, returning results in Markdown format with inline entity highlights.
- lume_generate: Builds a trigram Markov model over a document on-the-fly and generates guided text with live stochastic attention traces.
Tip
Dynamic Index Caching: The MCP server automatically caches parsed BM25 index states in a thread-safe global static. By tracking the modification time (mtime) of target paths, Lume skips re-indexing when documents haven't changed, reducing subsequent query response times to under 1ms.
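The mtime check described in the tip can be sketched with the standard library alone. This is an assumption-laden illustration, not the MCP server's code: `Index` is a stand-in that only counts tokens, `load_index` and `CACHE` are hypothetical names, and the sketch handles single files only (a directory's mtime does not change when a nested file is edited, so a real implementation would have to walk entries).

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Mutex, OnceLock};
use std::time::SystemTime;

/// Stand-in for a parsed BM25 index; here it only records a token count.
#[derive(Clone, Debug, PartialEq)]
pub struct Index {
    pub tokens: usize,
}

/// Thread-safe global cache keyed by path, storing the mtime seen at index time.
static CACHE: OnceLock<Mutex<HashMap<PathBuf, (SystemTime, Index)>>> = OnceLock::new();

/// Returns the index for `path` plus a flag that is true on a cache hit.
/// The file is re-read and re-indexed only when its mtime has changed.
pub fn load_index(path: &Path) -> io::Result<(Index, bool)> {
    let mtime = fs::metadata(path)?.modified()?;
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    // The lock is held while indexing, which keeps the sketch simple but
    // serializes concurrent callers.
    let mut guard = cache.lock().unwrap();
    if let Some((cached_mtime, index)) = guard.get(path) {
        if *cached_mtime == mtime {
            return Ok((index.clone(), true));
        }
    }
    let text = fs::read_to_string(path)?;
    let index = Index { tokens: text.split_whitespace().count() };
    guard.insert(path.to_path_buf(), (mtime, index.clone()));
    Ok((index, false))
}
```

The first call for a path pays the indexing cost; repeated calls against an unchanged file cost one `metadata` lookup and a map read.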
Lume is packaged into a secure, multi-stage production container. The runner stage features a non-privileged user (lume:lume) and has ca-certificates installed to support HTTPS connections (e.g., semantic boosting requests via ureq).
DOCKER_BUILDKIT=1 docker build -t lume:latest .
By default, the container boots the HTTP REST API server on port 8282:
docker run -d \
-p 8282:8282 \
--name lume-server \
lume:latest
Mount your document directory and run Lume as an interactive MCP server inside your IDE:
docker run -i --rm \
-v /path/to/local/data:/app/data \
lume:latest mcp
Lume follows the strict Semantic Versioning 2.0.0 (SemVer) specification.
- Versioning is declared in the root Cargo.toml (currently 0.1.0).
- Releases are marked with Git tags (e.g., v0.1.0), which trigger the publishing process of the unified binary (lume) and standard wrappers.
Yes! Every push and pull request targeting the main or bm25-search-mesh branches triggers our automated GitHub Actions CI pipeline:
- Verification: Compiles the source using cargo check and runs the full test suite (cargo test).
- Multi-Platform Compilation: Compiles fully optimized release binaries across:
  - Linux (x86_64-unknown-linux-gnu)
  - macOS (x86_64-apple-darwin)
  - Windows (x86_64-pc-windows-msvc)
- Artifact Bundling: Packs the unified lume binary alongside backward-compatible wrappers (tag, tag-server, search, hatcher-boost) into a single release package uploaded straight to the GitHub run interface.
Both the search and hatcher-boost subcommands boot into an interactive shell when run without a query argument.
- Standard REPL (search >):
  - Type any search terms (e.g. faria dungeon) to print ranked and highlighted snippets.
  - graph [min_similarity] (e.g. graph 0.02) to compute the entity network.
  - generate [seed] (e.g. generate Mercédès) to draft Dumas-styled prose.
- Hybrid REPL (hybrid-search >):
  - Type queries to display side-by-side comparative columns for Pure Lexical BM25, Pure Semantic (ONNX), and Hatcher Boosted Hybrid rankings.
- Type exit or quit in either interface to safely terminate the session.