v0.1.1
Fast HNSW search in Clojure with no third-party dependencies, delivering sub-millisecond results via a SIMD- and BLAS-optimized distance core that achieves a 4–16× speedup. The strength of this solution is that it targets the main runtime bottleneck, leverages data-level parallelism, minimizes overhead, and aligns computation with the hardware's strengths, significantly reducing query latency and increasing throughput.
Background: About 3–4 years ago I began exploring how to implement HNSW in Clojure without third-party dependencies and with full customizability, because AI keeps changing, and so do the needs.
Why accelerating distance computation matters
- In HNSW, 70–95% of query time is spent computing distances/inner products.
- SIMD + Java Vector API: 4–16× fewer instructions, fewer cache misses, lower latency.
- JBLAS (BLAS kernels): Near-peak CPU throughput with SIMD + multithreading for large batches.
- Result: Higher QPS, lower P95/P99 latency.
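To make the bottleneck concrete, here is a minimal scalar cosine-distance kernel over double-arrays, the hot loop that the SIMD (Java Vector API) and BLAS kernels replace. This is an illustrative sketch, not the library's actual distance function:

```clojure
;; Scalar baseline: one multiply-accumulate per element, per pass.
;; A Vector API / BLAS kernel does the same work on 4-16 lanes at once.
(defn cosine-distance ^double [^doubles a ^doubles b]
  (let [n (alength a)]
    (loop [i 0, dot 0.0, na 0.0, nb 0.0]
      (if (< i n)
        (let [x (aget a i) y (aget b i)]
          (recur (unchecked-inc i)
                 (+ dot (* x y))   ; accumulate a . b
                 (+ na (* x x))    ; accumulate |a|^2
                 (+ nb (* y y))))  ; accumulate |b|^2
        (- 1.0 (/ dot (Math/sqrt (* na nb))))))))
```

Since a query visits hundreds of nodes and each visit runs this loop over 768 elements, vectorizing it is where the 4–16× end-to-end speedup comes from.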
I use the Old Hungarian Károli Bible for semantic similarity research, where pairing Hebrew and Greek meanings is a challenging task (and it was). In this environment it becomes immediately clear when something is not working. I tested on 31,173 verses, the full Károli Bible (mpnet-v2, 768-dim embeddings, 492.9 MB index):
- Peak Performance: 0.186 ms/query (5,376 QPS) with 20 threads
- Average (5 runs): 0.195 ms/query (5,128 QPS)
- Consistency: <5% variance across runs
| Configuration | QPS | Latency (ms) | vs Target | Status |
|---|---|---|---|---|
| Best Run | 5,376 | 0.186 | +14% | ✅ Exceeded |
| Run 2 | 5,236 | 0.191 | +10% | ✅ Exceeded |
| Run 3 | 4,773 | 0.209 | +1% | ✅ Met |
| Average | 5,128 | 0.195 | +9% | ✅ Exceeded |
| Original Target | 4,719 | 0.212 | - | Baseline |
- Single query: 1.093 ms/query (915 QPS)
- Speedup: 5.8x with 20-thread parallelization
- vs Linear search: 86x faster (single), 500x faster (parallel)
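For reference, the linear-search baseline in that comparison amounts to scoring the query against every stored vector and keeping the top-k. A sketch, where the `[id vector]` pair shape and the `euclidean` helper are illustrative assumptions rather than the library's API:

```clojure
;; Helper: squared-error Euclidean distance over double-arrays.
(defn euclidean ^double [^doubles a ^doubles b]
  (Math/sqrt (areduce a i acc 0.0
                      (let [d (- (aget a i) (aget b i))]
                        (+ acc (* d d))))))

;; Brute-force k-NN: O(n * d) work per query, no index needed.
;; HNSW replaces the full scan with a greedy graph walk that
;; touches only a few hundred candidates.
(defn linear-search [vectors query k dist-fn]
  (->> vectors
       (map (fn [[id v]] [id (dist-fn query v)]))
       (sort-by second)
       (take k)))
```

Usage: `(linear-search [["a" (double-array [0.0 0.0])] ["b" (double-array [1.0 1.0])]] (double-array [0.1 0.1]) 1 euclidean)` returns `"a"` as the closest id.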
| Approach | Best for | Key benefits |
|---|---|---|
| 🧮 Java Vector API (SIMD) | Per-query, low-latency search | 4–16× fewer instructions, branchless tight loops, cache-friendly arrays, no JNI overhead |
| 📦 JBLAS (native BLAS) | Large batches, high throughput | Optimized BLAS kernels (OpenBLAS/MKL), GEMM batching, multithreaded + SIMD |
Impact:
- 4–16× faster distance kernel (CPU/ISA dependent)
- Lower P95/P99 latency, higher QPS
Build navigable graph structures, search by hopping from node to node.
| Implementation | Namespace | Description | Build | Search | Recall |
|---|---|---|---|---|---|
| Pure HNSW | hnsw.hnsw-search | Original HNSW algorithm | Extremely slow (300s+) ** | 0.2ms | 99%+ |
| Ultra Fast HNSW | hnsw.ultra-fast | Optimized HNSW | Medium | 0.3ms | 98%+ |
| Ultra Optimized | hnsw.ultra-optimized | Max optimized | Medium | 0.2ms | 98%+ |
** I'm working on this [WIP]
Divide data into partitions, search in parallel.
| Implementation | Namespace | Description | Build | Search | Recall |
|---|---|---|---|---|---|
| Partitioned HNSW | hnsw.partitioned-hnsw | Multiple HNSW graphs in parallel | 6-7s | 0.4-0.6ms | 91% |
| Lightning | hnsw.lightning | Simple partitions | 61ms | 1.5ms | 29% ** |
| IVF-FLAT | hnsw.ivf-flat | K-means + flat search | 2-5s | 5-10ms | 95%+ |
** A parallel Lightning variant (similar to kNN++) builds its index in 5-7s and reaches very good recall at low-millisecond search times.
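The partitioned idea fits in a few lines: shard the data, search every shard in parallel with `pmap`, then merge the per-shard top-k lists. The shard-local search is stubbed as a linear scan here; the real implementations build an HNSW graph (or k-means cells) per shard:

```clojure
;; Distance helper (squared Euclidean is enough for ranking).
(defn sq-dist ^double [^doubles a ^doubles b]
  (areduce a i acc 0.0
           (let [d (- (aget a i) (aget b i))]
             (+ acc (* d d)))))

;; Shard-local top-k, stubbed as a flat scan for illustration.
(defn search-shard [shard query k dist-fn]
  (->> shard
       (map (fn [[id v]] [id (dist-fn query v)]))
       (sort-by second)
       (take k)))

;; Search all shards in parallel, then merge the partial results.
(defn partitioned-search [shards query k dist-fn]
  (->> shards
       (pmap #(search-shard % query k dist-fn))
       (apply concat)
       (sort-by second)
       (take k)))
```

Because each shard returns its own top-k, the merge step never misses a true neighbor within a shard; recall is lost only when a query's neighbors are split across under-searched shards, which is why the recall numbers above vary with the partitioning scheme.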
Group similar vectors using hash functions.
| Implementation | Namespace | Description | Build | Search | Recall |
|---|---|---|---|---|---|
| Hybrid LSH | hnsw.hybrid-lsh | Multi-probe LSH | 0.5-1s | 2-5ms | 45% |
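The grouping step can be illustrated with classic random-hyperplane LSH: each hash bit is the sign of the dot product with a random hyperplane, so vectors at a small angle tend to land in the same bucket. A sketch with hypothetical helper names (the actual multi-probe scheme in hnsw.hybrid-lsh is more involved):

```clojure
;; n-bits random Gaussian hyperplanes of the given dimensionality.
;; Fixed seed so the same key function is reproducible across runs.
(defn rand-hyperplanes [n-bits dim]
  (let [^java.util.Random rng (java.util.Random. 42)]
    (vec (repeatedly n-bits
                     #(double-array (repeatedly dim (fn [] (.nextGaussian rng))))))))

;; Hash v to an n-bit integer: bit i is the sign of (v . plane_i).
(defn lsh-key [planes ^doubles v]
  (reduce (fn [acc ^doubles p]
            (let [dot (areduce p i s 0.0 (+ s (* (aget p i) (aget v i))))]
              (+ (* 2 acc) (if (pos? dot) 1 0))))
          0
          planes))
```

Multi-probe LSH additionally checks neighboring buckets (keys at small Hamming distance) to recover candidates that fell just across a hyperplane, which is how recall is pushed above what a single lookup gives.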
Combine multiple techniques.
| Implementation | Namespace | Description | Build | Search | Recall |
|---|---|---|---|---|---|
| IVF-HNSW | hnsw.ivf-hnsw | IVF + HNSW graph | 30-90s | 2-5ms | 85-90% |
First reduce dimensions, then search in reduced space. (my favorite)
| Implementation | Namespace | Description | Build | Search | Recall |
|---|---|---|---|---|---|
| P-HNSW (PCAF) | hnsw.pcaf | Random Projection + HNSW | ~5s | 1-2ms | 85-90% |
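The reduction step can be sketched as a dense random Gaussian projection: by the Johnson-Lindenstrauss lemma, projecting 768-dim vectors down to a few dozen dimensions roughly preserves pairwise distances, so search in the reduced space stays accurate while each distance computation gets proportionally cheaper. The helper names below are illustrative, not the hnsw.pcaf API:

```clojure
;; out-dim x in-dim random Gaussian matrix, scaled by 1/sqrt(out-dim)
;; so projected distances approximate the originals in expectation.
(defn projection-matrix [out-dim in-dim]
  (let [^java.util.Random rng (java.util.Random. 7)
        scale (/ 1.0 (Math/sqrt out-dim))]
    (vec (repeatedly out-dim
                     #(double-array (repeatedly in-dim
                                                (fn [] (* scale (.nextGaussian rng)))))))))

;; Map v (in-dim) to the reduced out-dim space: one dot product per row.
(defn project ^doubles [proj ^doubles v]
  (double-array
   (map (fn [^doubles row]
          (areduce row i acc 0.0 (+ acc (* (aget row i) (aget v i)))))
        proj)))
```

The same fixed matrix must be used at both index-build time and query time; a per-query matrix would destroy the distance-preservation guarantee.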
Automatically pick one verse → compare it with all other verses → return the top 5 most similar items.
🔍 Search> love 5
🎯 SEARCHING FOR: "love" (top 5 results)
============================================================
✅ Found 179 verses containing 'love'
📖 First match: [Gen 21:23]
"Now therefore swear unto me here by God that thou wilt not deal falsely..."
🔗 Top 5 semantically similar verses:
------------------------------------------------------------
⏱️ Search completed in 10.12 ms
1. [Josh 2:12] (84.0%)
"Now therefore, I pray you, swear unto me by the LORD..."
2. [Ps 71:18] (83.4%)
"Now also when I am old and greyheaded, O God, forsake me not..."
3. [2Sam 16:19] (81.2%)
"And again, whom should I serve? should I not serve..."
4. [1Kings 9:6] (81.0%)
"But if ye shall at all turn from following me..."
5. [Gen 27:29] (81.0%)
"Let people serve thee, and nations bow down to thee..."
🔍 Search> beginning 10
🎯 SEARCHING FOR: "beginning" (top 10 results)
============================================================
✅ Found 8 verses containing 'beginning'
📖 First match: [Gen 1:1]
"In the beginning God created the heaven and the earth."
⏱️ Search completed in 4.45 ms
1. [Gen 2:4] (92.2%)
"These are the generations of the heavens and of the earth..."
2. [Ps 124:8] (81.2%)
"Our help is in the name of the LORD, who made heaven and earth."
3. [Heb 1:10] (80.8%)
"And, Thou, Lord, in the beginning hast laid the foundation..."
- Ultra-optimized distance functions (SIMD via Java Vector API + JBLAS)
- Sub-millisecond search on 31K+ vectors
- Interactive search shell with colored output
- Save/load index to disk
- Multiple distance metrics (Cosine, Euclidean)
- Parallel search with linear scaling
- Performance benchmarking tools
- Delete vectors from index
- Filter functions (by metadata, categories)
- Update existing vectors
- Incremental index building
- REST API server
- Distributed index support
- GPU acceleration
- Memory-mapped indices
src/hnsw/
├── ultra-optimized.clj # Main HNSW implementation
├── ultra-fast.clj # SIMD-optimized distance functions
└── index-io.clj # Save/load functionality
src/hnsw/
├── core.clj # Examples and demos
├── graph.clj # Original HNSW algorithm
├── simd.clj # SIMD support detection
├── simd-optimized.clj # SIMD implementations
├── filtered.clj # Filter support (WIP)
├── benchmark.clj # Benchmarking utilities
└── api.clj # Public API wrapper
scripts/
├── build_index_complete.clj # Build 31K index
├── interactive_search.clj # Interactive shell
└── test_*.clj # Performance tests
final-test-bench/
└── hnsw-clj-our-solution/ # Production-ready demo
├── interactive_search_31k.clj
├── PERFORMANCE_ANALYSIS*.md
└── run_interactive_31k.sh
;; 1. Build index
(require '[hnsw.ultra-optimized :as opt])
(require '[hnsw.index-io :as io])
(def vectors [["id1" (double-array [0.1 0.2 ...])]
["id2" (double-array [0.4 0.5 ...])]])
(def index (opt/build-index vectors))
(io/save-index index "my-index.hnsw")
;; 2. Load and search
(def index (io/load-index "my-index.hnsw" opt/fast-cosine-distance))
(def results (opt/search index query-vector 10))

| Dataset | Vectors | Dims | Build Time | Search (k=10) | QPS | Single Thread |
|---|---|---|---|---|---|---|
| Small | 1,000 | 128 | ~7s* | 0.05 ms | 20,000 | (0.1 ms / 10,000 QPS) |
| Medium | 10,000 | 768 | ~69s* | 0.15 ms | 6,667 | (0.5 ms / 2,000 QPS) |
| Bible | 31,173 | 768 | 215s | 0.212 ms | 4,719 | (1.199 ms / 834 QPS) |
| Large | 100,000 | 768 | ~690s* | 0.50 ms | 2,000 | (2.5 ms / 400 QPS) |
Performance with 20 parallel threads (single thread results in parentheses)
*Build times estimated based on Bible dataset (only Bible dataset values are measured)
Identical parameters: M=16, ef_construction=200, ef_search=50, k=10, vectors=31,173, dimensions=768
Single difference: Distance metric - Python (Cosine) vs Clojure (Euclidean)
| Metric | Python hnswlib | Clojure HNSW-CLJ | Winner | Difference |
|---|---|---|---|---|
| Build Time | 25-30s | 215s | 🐍 Python | 8x faster |
| Single Thread QPS | 1,125 | 834 | 🐍 Python | 1.3x faster |
| Multi Thread QPS (20) | 1,330 | 4,719 | 🔧 Clojure | 3.5x faster |
| Average Latency | 0.85ms | 0.212ms | 🔧 Clojure | 4x better |
| Parallel Scaling | 1.3x (GIL) | 5.7x | 🔧 Clojure | True parallelism |
Use Python hnswlib when:
- Fast prototyping needed
- Building indices quickly (8x faster)
- Integration with Python ML ecosystem
- Single-threaded applications
Use Clojure HNSW-CLJ when:
- Production serving with high QPS requirements
- Real-time applications needing low latency
- High concurrency environments
- Scalable microservices
# 1. Build index with Python (fast: 25-30 sec)
import hnswlib
index = hnswlib.Index(space='cosine', dim=768)
index.init_index(max_elements=31173, ef_construction=200, M=16)
index.add_items(vectors)
index.save_index('index.bin')

;; 2. Serve with Clojure (high performance: 4,700+ QPS)
(def index (load-index "index.bin"))
(serve-api index {:port 8080})

This combines Python's fast build time with Clojure's excellent serving performance!
For detailed comparison, see PARAMETER_COBENCHMARK_RESULTS_ACTUAL.md and BENCHMARK_RESULTS_PYTHON.md
- Clojure 1.11+
- Java 21+ (for Vector API)
- 3GB heap for 31K index
MIT