Documentation site: the VitePress site lives under docs/. Build and preview locally with pnpm docs:dev and pnpm docs:build.
xlai is a Rust-first AI integration workspace for building reusable AI calling flows across native applications and the browser.
The name was inspired by moeru-ai/xsai, which focuses on extra-small AI SDK.
The project is designed around a unified API with pluggable backends. Today it includes:
The long-term goal is to support API-based models, tool integration, skill management, knowledge retrieval, vector search, and local/device inference behind the same overall API model.
This repository is in an early stage and the public API is still evolving.
Already implemented:
- Cargo workspace with native and
wasm32build support - OpenAI-compatible chat backend
- chat sessions with per-session tool registration
- streaming chat output
- configurable tool-call execution mode: concurrent by default, sequential optional
- unit/mock test lane and protected e2e test lane
- GitHub Actions for build, test, formatting, clippy, and e2e
xlai/
├── vendor/
│ └── native/ # git submodules: llama.cpp, ggml (for xlai-sys-*)
├── crates/
│ ├── core/
│ │ └── xlai-core/
│ ├── runtime/
│ │ └── xlai-runtime/
│ ├── backends/
│ │ ├── xlai-backend-openai/
│ │ ├── xlai-backend-transformersjs/
│ │ ├── xlai-backend-gemini/
│ │ └── xlai-backend-llama-cpp/
│ ├── qts/
│ │ ├── xlai-qts-core/
│ │ ├── xlai-qts-manifest/
│ │ └── xlai-qts-cli/
│ ├── sys/
│ │ ├── xlai-build-native/
│ │ ├── xlai-sys-llama/
│ │ └── xlai-sys-ggml/
│ └── platform/
│ ├── xlai-facade/
│ ├── xlai-native/
│ ├── xlai-wasm/
│ └── xlai-ffi/
├── packages/
│ └── xlai/
└── .github/workflows/
For crate boundaries and request flow, see ARCHITECTURE.md. For publish vs internal crates, see docs/development/crates-taxonomy.md.
crates/core/xlai-coreShared domain types and traits for chat, tools, embeddings, knowledge, and vector search.crates/platform/xlai-ffiNative C ABI facade crate for future FFI integrations.crates/runtime/xlai-runtimeRuntime builder, chat session API, streaming, and tool-calling orchestration. Local-backend prompt prep (PreparedLocalChatRequest, tool JSON, templates) lives inxlai_runtime::local_common(used by llama.cpp and transformers.js backends).crates/platform/xlai-facadeInternal native aggregate wiring and re-exports forxlai-nativeonly (not used byxlai-wasm). Not published to crates.io.crates/platform/xlai-nativeNative Rust-facing entrypoint: explicit re-exports, optionalqts, andgeminisubmodule for workspace-only Gemini types.crates/platform/xlai-wasmBrowser-facingwasm-bindgencrate for web integration. Default Cargo featureqtsenables local QTS entrypoints (qtsBrowserTts, manifest validation); the in-browser engine is still a stub until GGML/ORT WASM work lands (seedocs/qts/wasm-browser-runtime.md).crates/backends/xlai-backend-llama-cppNativellama.cppchat backend for local GGUF inference.crates/sys/xlai-build-nativeSharedbuild.rshelpers for native CMake/OpenBLAS/Vulkan and llama.cpp CMake patches (build-dependency only).crates/sys/xlai-sys-llamaVendoredllama.cppnative stack for the local chat backend.crates/sys/xlai-sys-ggmlVendored standaloneggmlnative stack for QTS.packages/xlaiVite-based TypeScript package published as@yai-xlai/xlai, built on top ofxlai-wasm, with Vitest coverage.crates/backends/xlai-backend-openaiOpenAI-compatible backend implementation usingreqwest.crates/backends/xlai-backend-transformersjsBrowser chat backend that delegates generation to a JavaScript adapter (WASM).crates/backends/xlai-backend-geminiGoogle Gemini HTTP backend (workspace-only;publish = false).crates/qts/xlai-qts-manifestSerde types for browser QTS model manifests and capability JSON (no GGML/ORT); used byxlai-wasm(featureqts) and re-exported asxlai_qts_core::browser.crates/qts/xlai-qts-coreQwen3 TTS engine; links standaloneggmlthroughxlai-sys-ggml. Exposes nativeTtsModel(QtsTtsModel, WAV output; tuning viaTtsRequestmetadataxlai.qts.*).VoiceSpec::Cloneuses the first reference sample (inline WAV only): x-vector and ICL prompts, with optionalxlai.qts.voice_clone_mode(xvector|icl). ICL needsqwen3-tts-reference-codec.onnx+ preprocess JSON fromuv run export-model-artifacts(seedocs/qts/export-and-hf-publish.md). Pipelined vocoder chunking and overlap are documented indocs/qts/vocoder-streaming.md.xlai_qts_core::browserre-exportsxlai-qts-manifest.crates/qts/xlai-qts-cliBinaryxlai-qts:synthesize,profile, and interactivetui. Without voice-clone flags,synthesizeusesxlai-runtime+xlai-qts-core(QtsTtsModel). With--voice-clone-promptor--ref-audio, it uses the direct engine path. Runcargo run -p xlai-qts-cli -- --help(or… synthesize --help) for flags.
- Rust stable
- minimum supported Rust version:
1.94
The workspace uses:
- edition
2024 - Apache 2.0 license
cargo build --workspacecargo test --workspacerustup target add wasm32-unknown-unknown
cargo check -p xlai-wasm --target wasm32-unknown-unknown --features qtspnpm installpnpm --filter @yai-xlai/xlai buildpnpm --filter @yai-xlai/xlai testcargo clippy --workspace --all-targets -- -D warningscargo fmt --all -- --checkFor local end-to-end testing, copy the template and fill in your values:
cp .env.example .envCurrent variables:
OPENAI_API_KEYOPENAI_BASE_URLOPENAI_MODELOPENAI_TRANSCRIPTION_MODEL
Local .env files are ignored by Git.
This is the current native Rust usage style:
use xlai_native::core::{
ToolCallExecutionMode, ToolDefinition, ToolParameter, ToolParameterType, ToolResult,
};
use xlai_native::{OpenAiConfig, RuntimeBuilder};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let runtime = RuntimeBuilder::new()
.with_chat_backend(OpenAiConfig::new(
std::env::var("OPENAI_BASE_URL")
.unwrap_or_else(|_| "https://api.openai.com/v1".to_owned()),
std::env::var("OPENAI_API_KEY")?,
std::env::var("OPENAI_MODEL")
.unwrap_or_else(|_| "gpt-4.1-mini".to_owned()),
))
.build()?;
let mut chat = runtime
.chat_session()
.with_system_prompt("Be concise.");
chat.register_tool(
ToolDefinition {
name: "lookup_weather".into(),
description: "Lookup current weather".into(),
parameters: vec![ToolParameter {
name: "city".into(),
description: "City name".into(),
kind: ToolParameterType::String,
required: true,
}],
execution_mode: ToolCallExecutionMode::Concurrent,
},
|arguments| async move {
let city = arguments["city"].as_str().unwrap_or("unknown");
Ok(ToolResult {
tool_name: "lookup_weather".into(),
content: format!("weather for {city}: sunny"),
is_error: false,
metadata: Default::default(),
})
},
);
let response = chat.prompt("What's the weather in Paris?").await?;
println!("{}", response.message.content);
Ok(())
}Agent sessions are available through the same runtime:
use futures_util::StreamExt;
use xlai_native::core::{
ChatChunk, ToolCallExecutionMode, ToolDefinition, ToolParameter, ToolParameterType, ToolResult,
};
use xlai_native::{ChatExecutionEvent, OpenAiConfig, RuntimeBuilder};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let runtime = RuntimeBuilder::new()
.with_chat_backend(OpenAiConfig::new(
std::env::var("OPENAI_BASE_URL")
.unwrap_or_else(|_| "https://api.openai.com/v1".to_owned()),
std::env::var("OPENAI_API_KEY")?,
std::env::var("OPENAI_MODEL")
.unwrap_or_else(|_| "gpt-4.1-mini".to_owned()),
))
.build()?;
let mut agent = runtime
.agent_session()?
.with_system_prompt("Use tools when helpful.");
// Streaming tool loop:
// - `.with_max_tool_round_trips(n)` — cap tool round-trips (default 8).
// - `.with_context_compressor(|msgs, est| async move { Ok(msgs) })` — optional Rust-only hook
// before each streamed model call (see README “Context compression hook”).
agent.register_tool(
ToolDefinition {
name: "lookup_weather".into(),
description: "Lookup current weather".into(),
parameters: vec![ToolParameter {
name: "city".into(),
description: "City name".into(),
kind: ToolParameterType::String,
required: true,
}],
execution_mode: ToolCallExecutionMode::Concurrent,
},
|arguments| async move {
let city = arguments["city"].as_str().unwrap_or("unknown");
Ok(ToolResult {
tool_name: "lookup_weather".into(),
content: format!("weather for {city}: sunny"),
is_error: false,
metadata: Default::default(),
})
},
);
// Unary `prompt` / `execute` / `prompt_*` = one model call (no tool callbacks).
// Use `stream_prompt` / `stream` / `stream_*` for the multi-round tool loop.
let mut stream = agent.stream_prompt("What's the weather in Paris?");
let mut response = None;
while let Some(item) = stream.next().await {
let item = item?;
if let ChatExecutionEvent::Model(ChatChunk::Finished(resp)) = item {
response = Some(resp);
}
}
let response = response.expect("stream ended without a finished model response");
println!("{}", response.message.content);
Ok(())
}Add futures-util to your crate’s dependencies (same major line as the workspace) so StreamExt is available.
For local native inference, the same builder can be pointed at llama.cpp:
use xlai_native::{LlamaCppConfig, RuntimeBuilder};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let runtime = RuntimeBuilder::new()
.with_chat_backend(
LlamaCppConfig::new(std::env::var("LLAMA_CPP_MODEL")?)
.with_context_size(4096)
.with_max_output_tokens(256),
)
.build()?;
let response = runtime
.chat_session()
.with_system_prompt("Be concise.")
.prompt("Reply with a short greeting.")
.await?;
println!("{}", response.message.content);
Ok(())
}Chat sessions can register tools directly with async callbacks. Agent sessions
also expose an McpRegistry via agent.mcp_registry() so MCP-provided tools can
be registered separately from built-in agent tools.
The WebAssembly package mirrors this split with chat(...), createChatSession(...),
agent(...), and createAgentSession(...).
Current behavior:
- tools are registered per chat or agent session
- tool calls are exposed to the model through the runtime request
Chatperforms one model call perprompt/execute/stream; it does not execute tool callbacks or run multiple model rounds—you get the model’s response (including anytool_calls) and can drive the next step yourselfAgent- Unary
prompt/execute/prompt_parts/prompt_content: exactly one model call; registered tools are not run by the runtime (same “no silent multi-minute loop” guarantee as chat). - Streaming
stream/stream_prompt/stream_prompt_content/stream_prompt_parts: runs the automatic tool loop. After each assistant turn that finishes with tool calls, tools run and another model turn starts until a response has no tool calls orAgent::with_max_tool_round_trips(default8) is exceeded.
- Unary
- when tools run (on
Agent), local session tools are used before falling back to a runtime-level tool executor - each tool’s
ToolDefinition::execution_modecontrols how its calls interact with other calls in the same model turn - if any invoked tool in a turn is
Sequential, all tool calls in that turn run sequentially in model order (no overlap) - otherwise, tool calls in a turn run concurrently
| Surface | Control | Effect |
|---|---|---|
Rust Agent |
with_max_tool_round_trips(usize) |
Maximum model↔tool rounds per stream (default 8). |
Chat never runs this loop. Unary agent calls never block on long multi-round tool execution without you choosing a streaming API.
On Agent, with_context_compressor (Rust) registers an async closure that runs once per streaming tool-loop round, immediately before each model call. It receives the full accumulated ChatMessage list for that stream and a best-effort Option<u32> input-token estimate (JSON serialization of the outgoing ChatRequest, bytes÷4 heuristic—not provider-tokenizer-accurate). It must return the message list to send for that round (often a compressed copy). The agent still appends assistant and tool messages to its internal history after each round; only the outgoing request for that step uses the returned list. Returning an empty list fails the stream with a provider error. Agent::register_context_compressor is the &mut self variant for the same hook.
The hook is not used for unary prompt / execute.
JavaScript (@yai-xlai/xlai): on AgentSession, call registerContextCompressor(async (messages, estimatedInputTokens) => messages) before streamPrompt / streamPromptWithContent. The callback receives the same semantics as Rust (message array in Rust JSON shape, estimatedInputTokens as number | null). WASM exports registerContextCompressor, streamPrompt, and streamPromptWithContent on AgentSession.
The runtime supports streamed chat output through ChatChunk and ChatExecutionEvent.
Chat: eachstream/stream_prompt/stream_*call performs one model run; events are deltas and a finalChatChunk::Finishedfor that turn.Agent: streaming uses the same chunk types, but the stream may include multiple model rounds while tools are still being requested. Between rounds you may seeChatExecutionEvent::ToolCallandToolResultevents after a finished assistant message that contained tool calls. An optional context-compressor hook may rewrite the message list before each of those model calls (Rustwith_context_compressor, or JSAgentSession.registerContextCompressor). In JS,streamPrompt/streamPromptWithContentcollect the full event list in order (one round-trip through the WASM bridge; not a browserReadableStreamyet).
Streaming currently includes:
- message start events
- content delta events
- tool call delta events
- final response events (per model turn; agent streams may emit several before the stream ends)
The repository uses two test classes.
These are the default tests and do not require API tokens.
Run locally with:
cargo test --workspaceThese use real provider credentials and are ignored by default.
Run locally with:
cargo test --workspace -- --ignored --test-threads=1The current OpenAI smoke tests will also load .env automatically for local runs.
For ASR/transcription end-to-end coverage, set OPENAI_TRANSCRIPTION_MODEL to a
transcription-capable model such as gpt-4o-mini-transcribe.
.github/workflows/build.yml builds:
- Linux
- Windows
- macOS arm64
- macOS x86_64 via cross-target build
wasm32-unknown-unknown- the
@yai-xlai/xlaipackage bundle through thepnpmworkspace
.github/workflows/test.yml runs:
cargo fmt --all -- --checkcargo clippy --workspace --all-targets -- -D warningscargo test --workspacepnpm --filter @yai-xlai/xlai test
.github/workflows/publish.yml runs crates.io and npm checks:
- On pull requests and pushes to
main:cargo publish --dry-runfor the publishable Rust crate subset (in dependency order) andnpm publish --dry-runfor@yai-xlai/xlaiafter a full package build. - On pushes to tags matching
v*: realcargo publishandnpm publishusing secrets on the GitHubpublishenvironment (CRATES_IO_TOKEN,NPM_TOKEN).
See docs/development/publishing.md for ordering, version bumps, and crates that are not on crates.io yet.
.github/workflows/e2e.yml runs ignored tests with provider credentials (and skips QTS model-dir tests until a CI fixture download exists).
It is intended to use a protected GitHub Environment such as e2e, with maintainer approval and environment secrets.
The OpenAI e2e environment currently expects:
OPENAI_API_KEYas a GitHub secretOPENAI_BASE_URLas a GitHub environment variableOPENAI_MODELas a GitHub environment variableOPENAI_TRANSCRIPTION_MODELas a GitHub environment variable
The ignored native llama.cpp smoke test expects:
LLAMA_CPP_MODELpointing to a local GGUF model file, or the default fixture pathfixtures/llama.cpp/Qwen3.5-0.8B-Q4_0.gguf
The repository intentionally ignores downloaded GGUF fixtures under fixtures/llama.cpp/.
CI downloads that fixture with the Hugging Face CLI and caches it between runs.
Ignored QTS integration tests in xlai-qts-core expect a full Qwen3 TTS model directory:
XLAI_QTS_MODEL_DIRpointing at a folder containing the GGUF talker, ONNX vocoder (qwen3-tts-vocoder.onnx), tokenizer, andconfig.json(seexlai_qts_core::ModelPaths). For ICL voice clone, also addqwen3-tts-reference-codec.onnxandqwen3-tts-reference-codec-psreprocess.json(export withuv run export-model-artifacts; seedocs/qts/export-and-hf-publish.md).
CI e2e currently skips those tests because no Hugging Face download step is wired yet; run them locally after downloading weights.
To build xlai-native with the QTS backend available, enable --features qts (optional; avoids linking QTS/ggml in default builds).
- unified API across native and browser targets
- target-specific dependency wiring for native vs
wasm32 - OpenAI-compatible backend uses
reqwest - public APIs aim to be async-first and streaming-capable
Planned or expected next areas include:
- more provider backends
- typed tool argument helpers
- embeddings and retrieval backends
- skill management APIs
- local/device inference backends
- more browser-focused examples
- OpenAI Backends
- OpenRouter Backend
- llama.cpp Backends
- CPU
- OpenBLAS
- Accelerate.framework
- Metal
- Vulkan
- CUDA
- HIP
- OpenVINO
- OpenAI
- QTS
- Transformer
- CPU
- OpenBLAS
- Accelerate.framework
- Metal
- Vulkan
- CUDA
- HIP
- OpenVINO
- Vocoder (ORT default EPs)
- Transformer
Apache License 2.0. See LICENSE.