High-performance Rust library for generating text embeddings using llama-cpp.
- High Performance: Optimized for speed with parallel pre/post-processing
- Thread Safety: Compile-time guarantees for safe concurrent usage
- Multiple Models: Support for managing multiple embedding models
- Batch Processing: Efficient batch embedding generation
- Flexible Configuration: Extensive configuration options for model tuning
- Multiple Pooling Strategies: Mean, CLS, Max, and MeanSqrt pooling
- Hardware Acceleration: Support for Metal (macOS), CUDA (NVIDIA), Vulkan, and optimized CPU backends
# fn main() -> anyhow::Result<()> {
use embellama::{ModelConfig, EngineConfig, EmbeddingEngine, NormalizationMode};
// Build model configuration
let model_config = ModelConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.with_normalization_mode(NormalizationMode::L2)
.build()?;
// Build engine configuration
let engine_config = EngineConfig::builder()
.with_model_config(model_config)
.build()?;
// Create engine
let engine = EmbeddingEngine::new(engine_config)?;
// Generate single embedding
let text = "Hello, world!";
let embedding = engine.embed(None, text)?;
// Generate batch embeddings
let texts = vec!["Text 1", "Text 2", "Text 3"];
let embeddings = engine.embed_batch(None, &texts)?;
# Ok(())
# }

The engine can optionally use a singleton pattern for shared access across your application. The singleton methods return Arc<Mutex<EmbeddingEngine>> for thread-safe access:
# fn main() -> anyhow::Result<()> {
# use embellama::{ModelConfig, EngineConfig, EmbeddingEngine};
# let model_config = ModelConfig::builder()
# .with_model_path("/path/to/model.gguf")
# .with_model_name("my-model")
# .build()?;
# let config = EngineConfig::builder()
# .with_model_config(model_config)
# .build()?;
// Get or initialize singleton instance (returns Arc<Mutex<EmbeddingEngine>>)
let engine = EmbeddingEngine::get_or_init(config)?;
// Access the singleton from anywhere in your application
let engine_clone = EmbeddingEngine::instance()
.expect("Engine not initialized");
// Use the engine (requires locking the mutex)
let embedding = {
let engine_guard = engine.lock().unwrap();
engine_guard.embed(None, "text")?
};
# Ok(())
# }

The library has been tested with the following GGUF models:
- MiniLM-L6-v2 (Q4_K_M): ~15MB, 384-dimensional embeddings - used for integration tests
- Jina Embeddings v2 Base Code (Q4_K_M): ~110MB, 768-dimensional embeddings - used for benchmarks
- BAAI/bge-reranker-v2-m3 (Q4_K_M): Cross-encoder reranking model - auto-detected from GGUF metadata
Both BERT-style and LLaMA-style embedding models are supported, as well as cross-encoder reranking models.
Add this to your Cargo.toml:
[dependencies]
embellama = "0.10.1"

Platform-specific GPU acceleration and other optional features are available via Cargo features. See the [features] section in Cargo.toml for the full list.
# fn main() -> anyhow::Result<()> {
use embellama::{ModelConfig, EngineConfig};
let model_config = ModelConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.build()?;
let config = EngineConfig::builder()
.with_model_config(model_config)
.build()?;
# Ok(())
# }

# fn main() -> anyhow::Result<()> {
use embellama::{ModelConfig, EngineConfig, PoolingStrategy, NormalizationMode};
let model_config = ModelConfig::builder()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.with_context_size(2048)
.with_n_threads(8)
.with_n_gpu_layers(32)
.with_normalization_mode(NormalizationMode::L2)
.with_pooling_strategy(PoolingStrategy::Mean)
.build()?;
let config = EngineConfig::builder()
.with_model_config(model_config)
.with_use_gpu(true)
.with_batch_size(64)
.build()?;
# Ok(())
# }

The library can automatically detect and use the best available backend:
# fn main() -> anyhow::Result<()> {
use embellama::{EngineConfig, detect_best_backend, BackendInfo};
// Automatic backend detection
let config = EngineConfig::with_backend_detection()
.with_model_path("/path/to/model.gguf")
.with_model_name("my-model")
.build()?;
// Check which backend was selected
let backend_info = BackendInfo::new();
println!("Using backend: {}", backend_info.backend);
println!("Available features: {:?}", backend_info.available_features);
# Ok(())
# }

Embellama supports cross-encoder reranking models like bge-reranker-v2-m3. Reranker models
are auto-detected from GGUF metadata (pooling_type = 4), so you can just load the model
without any special configuration:
# fn main() -> anyhow::Result<()> {
use embellama::{EngineConfig, EmbeddingEngine};
let config = EngineConfig::builder()
.with_model_path("/path/to/bge-reranker-v2-m3.gguf")
.with_model_name("reranker")
.build()?;
let engine = EmbeddingEngine::new(config)?;
let results = engine.rerank(
None,
"What is the capital of France?",
&["Paris is the capital of France.", "Berlin is in Germany."],
None, // return all results
true, // apply sigmoid normalization
)?;
for r in &results {
println!("[{:.4}] Document {}", r.relevance_score, r.index);
}
# Ok(())
# }

- Mean: Average pooling across all tokens (default for encoder models)
- CLS: Use the CLS token embedding
- Max: Element-wise maximum across all tokens
- MeanSqrt: Mean pooling with square root of sequence length normalization
- Last: Use the last token embedding (auto-selected for decoder models)
- Rank: Cross-encoder reranking (auto-detected from GGUF metadata)
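
The embedding strategies listed above can be illustrated with a small, standard-library-only sketch. This is not the crate's implementation: the `Pooling` enum, `pool`, and `l2_normalize` are hypothetical names, the input is assumed to be one `Vec<f32>` per token with the CLS token first, and MeanSqrt is assumed to mean the column sum divided by the square root of the token count. Rank is omitted because it yields a relevance score rather than an embedding.

```rust
/// Pooling strategies over per-token embeddings (rows = tokens, columns = dimensions).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Pooling {
    Mean,
    Cls,
    Max,
    MeanSqrt,
    Last,
}

/// Collapse a `tokens x dims` matrix into a single vector.
/// Returns `None` for an empty sequence or rows of unequal length.
pub fn pool(tokens: &[Vec<f32>], strategy: Pooling) -> Option<Vec<f32>> {
    let first = tokens.first()?;
    let dims = first.len();
    if tokens.iter().any(|t| t.len() != dims) {
        return None;
    }
    let n = tokens.len() as f32;
    // Column-wise sum, shared by Mean and MeanSqrt.
    let sum = || {
        let mut acc = vec![0.0f32; dims];
        for t in tokens {
            for (a, x) in acc.iter_mut().zip(t) {
                *a += *x;
            }
        }
        acc
    };
    Some(match strategy {
        Pooling::Mean => sum().into_iter().map(|s| s / n).collect(),
        // Assumption: divide the sum by sqrt(sequence length).
        Pooling::MeanSqrt => sum().into_iter().map(|s| s / n.sqrt()).collect(),
        // Assumption: the CLS token is the first token in the sequence.
        Pooling::Cls => first.clone(),
        Pooling::Last => tokens.last()?.clone(),
        Pooling::Max => {
            let mut acc = first.clone();
            for t in &tokens[1..] {
                for (a, x) in acc.iter_mut().zip(t) {
                    *a = (*a).max(*x);
                }
            }
            acc
        }
    })
}

/// Scale a vector to unit Euclidean length; a zero vector is returned unchanged.
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}
```

Calling `l2_normalize` on the pooled vector mirrors, in spirit, what combining a pooling strategy with `NormalizationMode::L2` is meant to achieve: vectors of unit length whose dot product equals their cosine similarity.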
The LlamaContext from llama-cpp is !Send and !Sync, which means:
- Models cannot be moved between threads
- Models cannot be shared using Arc alone
- Each thread must own its model instance
The library is designed with these constraints in mind:
- Use thread-local storage for model instances
- Batch processing uses parallel pre/post-processing with sequential inference
- The singleton pattern provides Arc<Mutex<EmbeddingEngine>> for cross-thread coordination
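
The general shape of this design can be sketched with the standard library alone: a shareable registry holds plain configuration data, while the `!Send` model instances live in thread-local storage and are created on first use. Everything named here (`LocalModel`, `Registry`, `with_model`) is hypothetical and only stands in for the idea; it is not how embellama is actually implemented.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::marker::PhantomData;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a `!Send + !Sync` context.
/// The raw-pointer marker opts the type out of both auto traits.
pub struct LocalModel {
    pub path: String,
    _not_send: PhantomData<*const ()>,
}

/// Shared registry (model name -> model path). It holds only plain data,
/// so it can be cloned and handed to other threads.
#[derive(Clone, Default)]
pub struct Registry {
    configs: Arc<Mutex<HashMap<String, String>>>,
}

thread_local! {
    /// Per-thread model instances; they never leave the owning thread.
    static LOADED: RefCell<HashMap<String, LocalModel>> = RefCell::new(HashMap::new());
}

impl Registry {
    /// Store the configuration without loading anything.
    pub fn register(&self, name: &str, path: &str) {
        self.configs
            .lock()
            .unwrap()
            .insert(name.to_string(), path.to_string());
    }

    pub fn is_registered(&self, name: &str) -> bool {
        self.configs.lock().unwrap().contains_key(name)
    }

    /// True only if the calling thread has already created its instance.
    pub fn is_loaded_in_thread(&self, name: &str) -> bool {
        LOADED.with(|m| m.borrow().contains_key(name))
    }

    /// Run `f` against this thread's instance, creating it on first use.
    /// Returns `None` if the name was never registered.
    pub fn with_model<R>(&self, name: &str, f: impl FnOnce(&LocalModel) -> R) -> Option<R> {
        // Copy the path out so the registry lock is released before "loading".
        let path = self.configs.lock().unwrap().get(name).cloned()?;
        LOADED.with(|m| {
            let mut map = m.borrow_mut();
            let model = map.entry(name.to_string()).or_insert_with(|| LocalModel {
                path,
                _not_send: PhantomData,
            });
            Some(f(model))
        })
    }
}
```

Because each thread creates its own instance, a model that one thread has loaded still reports as not loaded when queried from another thread, even though both see the same registration.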
The library provides granular control over model lifecycle:
- Registration: Model configuration stored in registry
- Loading: Model actually loaded in thread-local memory
# fn main() -> anyhow::Result<()> {
# use embellama::{ModelConfig, EngineConfig, EmbeddingEngine};
# let model_config = ModelConfig::builder()
# .with_model_path("/path/to/model.gguf")
# .with_model_name("my-model")
# .build()?;
# let config = EngineConfig::builder()
# .with_model_config(model_config)
# .build()?;
# let engine = EmbeddingEngine::new(config)?;
// Check if model is registered (has configuration)
if engine.is_model_registered("my-model") {
println!("Model configuration exists");
}
// Check if model is loaded in current thread
if engine.is_model_loaded_in_thread("my-model") {
println!("Model is ready to use in this thread");
}
# Ok(())
# }

# fn main() -> anyhow::Result<()> {
# use embellama::{ModelConfig, EngineConfig, EmbeddingEngine};
# let model_config = ModelConfig::builder()
# .with_model_path("/path/to/model.gguf")
# .with_model_name("my-model")
# .build()?;
# let config = EngineConfig::builder()
# .with_model_config(model_config)
# .build()?;
# let mut engine = EmbeddingEngine::new(config)?;
// Remove only from current thread (keeps registration)
engine.drop_model_from_thread("my-model")?;
// Model can be reloaded on next use
// Remove only from registry (prevents future loads)
engine.unregister_model("my-model")?;
// Existing thread-local instances continue working
// Full unload - removes from both registry and thread
engine.unload_model("my-model")?;
// Completely removes the model
# Ok(())
# }

- Initial model (via EmbeddingEngine::new()): Loaded immediately in current thread
- Additional models (via load_model()): Lazy-loaded on first use
# fn main() -> anyhow::Result<()> {
# use embellama::{ModelConfig, EngineConfig, EmbeddingEngine};
# let model_config = ModelConfig::builder()
# .with_model_path("/path/to/model.gguf")
# .with_model_name("model1")
# .build()?;
# let config = EngineConfig::builder()
# .with_model_config(model_config)
# .build()?;
# let model_config2 = ModelConfig::builder()
# .with_model_path("/path/to/model2.gguf")
# .with_model_name("model2")
# .build()?;
# let config2 = EngineConfig::builder()
# .with_model_config(model_config2)
# .build()?;
// First model - loaded immediately
let mut engine = EmbeddingEngine::new(config)?;
assert!(engine.is_model_loaded_in_thread("model1"));
// Additional model - lazy loaded
engine.load_model(config2)?;
assert!(engine.is_model_registered("model2"));
assert!(!engine.is_model_loaded_in_thread("model2")); // Not yet loaded
// Triggers actual loading in thread
engine.embed(Some("model2"), "text")?;
assert!(engine.is_model_loaded_in_thread("model2")); // Now loaded
# Ok(())
# }

The library is optimized for high performance:
- Parallel tokenization for batch processing
- Efficient memory management
- Configurable thread counts
- GPU acceleration support
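
To make the "parallel pre/post-processing with sequential inference" idea concrete, here is a rough, standard-library-only sketch. The `tokenize`, `infer`, and `postprocess` functions are hypothetical placeholders (they do not call llama-cpp), and `parallel_map` is an illustrative helper built on `std::thread::scope`; the crate's real pipeline may be organised quite differently.

```rust
use std::thread;

/// Placeholder tokenizer: one "token" per whitespace-separated word (its length).
fn tokenize(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Placeholder forward pass; in the real design this stays on the owning thread.
fn infer(tokens: &[u32]) -> Vec<f32> {
    vec![tokens.len() as f32, tokens.iter().sum::<u32>() as f32]
}

/// Placeholder post-processing step: L2 normalization.
fn postprocess(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter().map(|x| x / norm).collect()
    } else {
        v.to_vec()
    }
}

/// Apply `f` to every item on up to `workers` scoped threads, preserving input order.
fn parallel_map<T: Sync, U: Send>(
    items: &[T],
    workers: usize,
    f: impl Fn(&T) -> U + Sync,
) -> Vec<U> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk = ((items.len() + workers - 1) / workers).max(1);
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(f).collect::<Vec<U>>()))
            .collect();
        // Joining in spawn order keeps results aligned with the inputs.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

/// Parallel tokenization, sequential inference, parallel post-processing.
pub fn embed_batch(texts: &[&str], workers: usize) -> Vec<Vec<f32>> {
    let tokenized = parallel_map(texts, workers, |t| tokenize(t));
    let raw: Vec<Vec<f32>> = tokenized.iter().map(|t| infer(t)).collect();
    parallel_map(&raw, workers, |v| postprocess(v))
}
```

Only the middle step touches the model, so it is the only step that has to stay on the thread that owns the model instance.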
Run benchmarks with:
EMBELLAMA_BENCH_MODEL=/path/to/model.gguf cargo bench

- Batch Processing: Use embed_batch() for multiple texts
- Thread Configuration: Set n_threads based on CPU cores
- GPU Acceleration: Enable GPU for larger models
- Warmup: Call warmup_model() before processing
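
For the thread configuration tip, one possible starting point is to ask the standard library how much parallelism is available and cap it. The helper below is a hypothetical sketch (`suggested_n_threads` is not part of embellama), and the best value for a given model and machine should still be confirmed by benchmarking.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Suggest a thread count: the parallelism the OS reports, capped at `max`
/// (treated as at least 1), falling back to 1 if the value cannot be queried.
pub fn suggested_n_threads(max: usize) -> usize {
    let available = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    available.min(max.max(1))
}
```

The result could then be converted to whatever integer type `with_n_threads` expects and passed to the model configuration builder.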
For development setup, testing, and contributing guidelines, please see DEVELOPMENT.md.
This project uses git-cliff to generate changelogs from conventional commits. Install with cargo install git-cliff, then:
just changelog # Regenerate CHANGELOG.md
just changelog-unreleased # Preview unreleased changes

See the examples/ directory for more examples:
- simple.rs - Basic embedding generation
- batch.rs - Batch processing example
- multi_model.rs - Using multiple models
- config.rs - Configuration examples
- error_handling.rs - Error handling patterns
- reranking.rs - Cross-encoder reranking
Run examples with:
cargo run --example simple

Licensed under the Apache License, Version 2.0. See LICENSE for details.
Contributions are welcome! Please see DEVELOPMENT.md for development setup and contribution guidelines.
For issues and questions, please use the GitHub issue tracker.