The Only LLM Inference Server You Don't Have to Trust
Cryptographically prove that a specific model runs unmodified inside hardware-encrypted memory, without trusting the infrastructure operator.
The Problem • How Power Solves It • Features • Architecture • Layer-Streaming • Installation • Configuration • API Reference • Development
Every LLM inference server (Ollama, vLLM, llama.cpp, TGI, LocalAI) was designed for a world where you trust the machine. You send your prompts to a server and hope the operator doesn't look at them. That's a policy promise, not a technical guarantee.
For healthcare (HIPAA), finance (SOX/GLBA), government (classified data), and any multi-tenant AI deployment where the infrastructure operator is a different party than the data owner, "we promise not to look" is not enough.
A3S Power runs LLM inference inside Trusted Execution Environments (AMD SEV-SNP / Intel TDX). The CPU encrypts all memory. The infrastructure operator cannot read prompts, responses, or model weights; the hardware enforces it.
But hardware isolation alone isn't enough. You need to verify it. Power provides a complete chain of cryptographic proof:
┌─────────────────────────────────────────────────────────────────────┐
│  a3s-box MicroVM (AMD SEV-SNP / Intel TDX)                          │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  a3s-power                                                    │  │
│  │                                                               │  │
│  │  1. Verify model integrity (SHA-256 + Ed25519 signature)      │  │
│  │  2. Bind model hash into hardware attestation report          │  │
│  │  3. Serve inference via OpenAI-compatible API                 │  │
│  │  4. Redact all inference content from logs and metrics        │  │
│  │  5. Zero all memory on model unload                           │  │
│  └───────────────────────────────────────────────────────────────┘  │
│  Hardware-encrypted memory: host cannot read                        │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼  Client verifies independently:
┌─────────────────────────────────────────────────────────────────────┐
│  a3s-power-verify                                                   │
│  ✓ Nonce binding (prevents replay)                                  │
│  ✓ Model hash binding (proves which model is running)               │
│  ✓ Hardware signature (AMD KDS P-384 / Intel PCS P-256)             │
│  ✓ Platform measurement (proves unmodified code)                    │
└─────────────────────────────────────────────────────────────────────┘
The difference: every other inference server asks you to trust. Power lets you verify.
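To make the nonce-binding check from the verification list above more concrete, here is a minimal sketch of a comparison that does not exit early on the first mismatching byte. The helper names `ct_eq` and `nonce_is_fresh` are hypothetical and are not taken from the a3s-power-verify SDK; how the nonce is actually extracted from a report is not shown.

```rust
/// Compare two byte slices without returning early on the first mismatch.
/// The length is not treated as secret, so a length mismatch returns at once.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        // Accumulate differences; any differing bit leaves `diff` non-zero.
        diff |= x ^ y;
    }
    diff == 0
}

/// Hypothetical helper: accept a report only if it echoes the nonce we sent.
/// An empty nonce is rejected because it gives no replay protection.
fn nonce_is_fresh(sent_nonce: &[u8], reported_nonce: &[u8]) -> bool {
    !sent_nonce.is_empty() && ct_eq(sent_nonce, reported_nonce)
}
```

A plain loop like this only approximates constant-time behavior because the compiler is free to optimize it; production code would more likely rely on a vetted crate such as `subtle`.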
| Capability | Ollama | vLLM | TGI | Power |
|---|---|---|---|---|
| OpenAI-compatible API | ✅ | ✅ | ✅ | ✅ |
| GPU acceleration | ✅ | ✅ | ✅ | ✅ |
| Streaming | ✅ | ✅ | ✅ | ✅ |
| TEE hardware isolation (SEV-SNP / TDX) | ❌ | ❌ | ❌ | ✅ |
| Remote attestation (hardware-signed proof) | ❌ | ❌ | ❌ | ✅ |
| Model-attestation binding (prove which model runs) | ❌ | ❌ | ❌ | ✅ |
| RA-TLS (attestation in TLS handshake) | ❌ | ❌ | ❌ | ✅ |
| Encrypted model loading (AES-256-GCM file-backed, picolm RAM, chunk primitive) | ❌ | ❌ | ❌ | ✅ |
| Deep log redaction (10 keys + error sanitization) | ❌ | ❌ | ❌ | ✅ |
| Memory zeroing (zeroize on drop) | ❌ | ❌ | ❌ | ✅ |
| Client-side verification SDK | ❌ | ❌ | ❌ | ✅ |
| Hardware signature verification (AMD KDS / Intel PCS) | ❌ | ❌ | ❌ | ✅ |
| Layer-streaming for memory-constrained TEE | ❌ | ❌ | ❌ | ✅ |
| Pure Rust inference (fully auditable, no C++) | ❌ | ❌ | ❌ | ✅ |
The bottom half of this table is Power's moat. No other inference server has a threat model. They all assume you trust the machine.
A3S Power is a privacy-preserving LLM inference server designed to run inside Trusted Execution Environments (TEE). It provides an OpenAI-compatible API for chat completions, text completions, and embeddings, with hardware-enforced memory encryption, model integrity verification, and automatic log redaction.
Power is built to run inside a3s-box MicroVMs with AMD SEV-SNP or Intel TDX, ensuring that inference data (prompts, responses, model weights) never leaves the encrypted enclave.
These features exist in no other LLM inference server:
- TEE-Aware Runtime: Auto-detects AMD SEV-SNP (`/dev/sev-guest`) and Intel TDX (`/dev/tdx_guest`) at startup; simulated mode for development (`A3S_TEE_SIMULATE=1`)
- Remote Attestation: Real hardware ioctl (AMD `SNP_GET_REPORT` and Intel `TDX_CMD_GET_REPORT0`) generates firmware-signed proof that inference runs in a genuine TEE; full raw reports included for client verification
- Model/Runtime/GPU-Attestation Binding: `GET /v1/attestation?model=<name>` re-hashes the current local model artifact (file, deterministic directory manifest, or encrypted artifact), emits an `AttestationClaimsV2` claim set, and binds `sha256(canonical_claims_v2)` into CPU TEE `report_data`; encrypted model pins cover decrypted plaintext when configured and include ciphertext provenance digests; model-bound claims include applied chat-template digests plus a canonical GPU execution/offload digest, and `tee_policy_mode = "gpu-confidential"` additionally binds NVIDIA GPU CC evidence, NRAS verdict digests, and structured NVIDIA device identity/freshness claims from live `nvattest-cli` collection or direct NRAS REST attestation using the same request nonce
- Request-Level Inference Receipts: Chat and text completion responses include `attestation_receipt` plus `attestation_receipt_sha256`, covering prompt-bearing API input, model runtime chat-template/GPU execution policy claims, exposed decoding parameters, streaming request options, stop tokens, response format, tools including function `strict` schema flags, tool choice, and parallel tool-call policy; local renderers, mistralrs text tokenization, and opt-in proxy upstreams can add an `effective_prompt` digest; streaming responses emit the receipt in a final SSE event before `[DONE]`
- RA-TLS Transport: TLS certificate embeds the attestation report as a custom X.509 extension (OID `1.3.6.1.4.1.56560.1.1`); clients verify the TEE during the TLS handshake itself, no separate API call needed
- Hardware Signature Verification: Client-side SDK has `VerificationPolicy::strict()` / `verify_report_strict()` for fail-closed verification with mandatory hardware signatures, operator-pinned launch measurement verification, simulated-report rejection, optional required GPU evidence/device-claim/runtime policy checks, NVIDIA NRAS verdict digest pinning, NVIDIA GPU provider/format/evidence-count, exact GPU/NVSwitch topology, claims schema version, and GPU plus NVSwitch UEID/OEM ID/hwmodel/firmware pinning, GPU driver pinning, request receipt shape/digest/policy helpers, attestation-to-receipt runtime policy binding, and effective-prompt digest pinning when a receipt exposes it
- Client Verification CLI: `a3s-power-verify` defaults to strict verification with mandatory `--expected-measurement`; skipping hardware signatures and measurement pinning requires the explicit `--allow-offline` development/offline flag; `--hw-cert-cache-ttl-secs` tunes AMD KDS / Intel PCS certificate cache duration for strict verifier processes; NVIDIA GPU confidential-computing deployments can use `--gpu-confidential` to require v2 claims, top-level GPU evidence nonce freshness, structured device nonce freshness, pinned NVIDIA NRAS verdict digest binding, verifier-pinned GPU provider/format/count policy, structured NVIDIA device claims, verifier-pinned exact GPU topology, claims schema version, and identity/version policy, runtime policy, and a pinned GPU execution digest; individual GPU/runtime pins remain available with `--require-gpu-evidence`, `--require-gpu-device-claims`, `--gpu-provider`, `--gpu-evidence-format`, `--gpu-verdict-format`, `--gpu-evidence-count`, `--gpu-count`, `--nvswitch-count`, `--gpu-claims-version`, `--gpu-ueid`, `--gpu-oemid`, `--gpu-hwmodel`, `--gpu-driver-version`, `--gpu-firmware-version`, `--nvswitch-claims-version`, `--nvswitch-ueid`, `--nvswitch-oemid`, `--nvswitch-hwmodel`, `--nvswitch-firmware-version`, `--require-runtime-policy`, and `--gpu-execution-digest`; `--print-gpu-execution-digest` computes GPU execution pins with Power's canonicalizer
- Encrypted Model Loading: AES-256-GCM file-backed `DecryptedModel` loading with zero-overwrite cleanup; `in_memory_decrypt = true` loads verified plaintext directly from `MemoryDecryptedModel` locked RAM when the backend supports it (`picolm` GGUF), otherwise fails closed; `streaming_decrypt = true` passes `LayerStreamingDecryptedModel` plaintext to supporting backends (`picolm` GGUF), with unsupported backends failing closed. Configured encrypted-model integrity pins and signatures are checked against decrypted plaintext SHA-256, with ciphertext SHA-256 exposed separately in attestation claims
- KeyProvider Trait: Abstract key loading for HSM integration; `StaticKeyProvider` (file/env) + `RotatingKeyProvider` (zero-downtime rotation)
- Deep Log Redaction: Strips inference content from all log output: 10 sensitive JSON keys (`content`, `prompt`, `text`, `arguments`, `input`, `delta`, `system`, `message`, `query`, `instruction`); `sanitize_error()` strips prompt fragments from error messages; `suppress_token_metrics` rounds token counts to nearest 10 to prevent side-channel inference
- Memory Zeroing: `SensitiveString` wrapper auto-zeroizes on drop; all inference buffers cleared via the `zeroize` crate, so the operator cannot recover prompts or responses from memory dumps
- Model Integrity: SHA-256 hash verification at startup + Ed25519 publisher signatures; fails fast on tampering
- picolm Layer-Streaming: Pure Rust GGUF inference with true O(layer_size) peak RAM via `madvise(DONTNEED)` page release after each layer. Real transformer ops: multi-head/GQA attention, SwiGLU/GeGLU FFN, RoPE, RMSNorm. FP16 KV cache with fused f16 dot/accumulate (no intermediate buffer). Fused dequant+dot kernels. NEON SIMD (aarch64) + AVX2 (x86_64). Rayon parallel matmul. Pre-computed RoPE tables. Batch prefill, tool calling, grammar-constrained output. Selectable speculative-decoding modes (`spec_mode`: `off` / `prompt-lookup` / DSpark-like `ngram-context`) with batched layer-streaming verify (a draft block is verified in one weight-streaming pass instead of one pass per token), adaptive draft length, and lossless rejection-sampling acceptance (output matches plain decoding for the same seed). Zero-alloc hot path. 14+ tok/s decode on Apple Silicon. Enables 7B+ models inside 512MB TEE EPC. No C/C++ inference backend, ~4,500 lines of fully auditable Rust.
- Pure Rust Inference Path: Default backend via `mistralrs` (candle), with no C++ inference engine in the trusted computing base; the `tee-minimal` build (~1,220 dep tree lines) is the smallest auditable LLM inference stack that exists
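The `prompt-lookup` speculative-decoding mode named in the picolm entry above can be pictured with the generic prompt-lookup idea: find an earlier occurrence of the most recent n-gram in the context and propose the tokens that followed it as a draft. The sketch below is an illustration of that general technique only; the function name `prompt_lookup_draft` and its parameters are hypothetical, and picolm's actual drafting, adaptive draft length, and rejection-sampling acceptance are not shown.

```rust
/// Propose up to `max_draft` tokens by matching the last `ngram` tokens of
/// `context` against an earlier position in the same context.
/// Returns an empty draft when no earlier match exists.
fn prompt_lookup_draft(context: &[u32], ngram: usize, max_draft: usize) -> Vec<u32> {
    if ngram == 0 || max_draft == 0 || context.len() <= ngram {
        return Vec::new();
    }
    let suffix_start = context.len() - ngram;
    let suffix = &context[suffix_start..];
    // Scan candidate start positions from the most recent to the oldest,
    // excluding the suffix itself.
    for start in (0..suffix_start).rev() {
        if &context[start..start + ngram] == suffix {
            let from = start + ngram;
            let to = (from + max_draft).min(context.len());
            return context[from..to].to_vec();
        }
    }
    Vec::new()
}
```

The draft is only a proposal: the model still has to verify it, and only the accepted tokens are emitted.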
Full-featured LLM inference, competitive with any standalone server:
- OpenAI-Compatible API: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`; works with any OpenAI SDK
- True Token-by-Token Streaming: Per-token SSE delivery via `stream_chat_request`
- Multiple Backends: mistralrs (pure Rust, default), llama.cpp (C++ bindings, optional), picolm (TEE layer-streaming, optional), proxy (forwards to an upstream OpenAI-compatible server, e.g. vLLM/TGI/SGLang/OpenAI, so Power can front an existing accelerated engine)
- Model Formats: GGUF, SafeTensors (ISQ quantization), Vision/Multimodal (LLaVA, Phi-3-Vision), HuggingFace Embeddings (Qwen3, GTE, NomicBert)
- GPU Acceleration: Auto-detection of Apple Metal and NVIDIA CUDA; configurable layer offloading, multi-GPU support
- Tool/Function Calling: Structured tool definitions with XML, Mistral, and JSON output parsing
- JSON Schema Structured Output: Constrain local llama.cpp output via JSON Schema → GBNF grammar conversion; unsupported local backend/schema combinations fail closed instead of silently ignoring output policy
- Thinking & Reasoning: Streaming `<think>` block parser for DeepSeek-R1, QwQ reasoning models
- Chat Template Engine: Jinja2-compatible rendering via `minijinja` (Llama 3, ChatML, Phi, Gemma, custom); model-provided raw templates fail closed on render errors instead of silently switching prompt formats
- KV Cache Reuse: Prefix matching across multi-turn requests for conversation speedup
- Remote Model Hub Pull: `POST /v1/models/pull` with SSE progress, Range resume, concurrent dedup, source-specific token auth for ModelScope or HuggingFace Hub
- Content-Addressed Storage: Model blobs stored by SHA-256 hash with automatic deduplication
- Automatic Model Lifecycle: LRU eviction, configurable keep-alive, background reaper for idle models
- Rate Limiting & Admission Control: Per-second token-bucket on `/v1/*` returns `429` with an OpenAI-style error; concurrency (`max_concurrent_requests`) uses vLLM-style backpressure: excess requests queue for an admission permit (held across the streamed body) rather than being rejected, with a `power_requests_waiting` gauge
- Prometheus Metrics: 16 metric groups: HTTP, inference, TTFT, GPU, TEE attestations, model decryptions, log redactions
- Audit Logging: JSONL / Encrypted / Async / Noop; flushed on graceful shutdown
- Vsock Transport: AF_VSOCK for a3s-box MicroVM guest-host communication (Linux only)
- HCL Configuration: HashiCorp Configuration Language for all settings
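The per-second token bucket mentioned under rate limiting above can be sketched as follows. This is a generic token-bucket illustration under stated assumptions, not Power's middleware: the `TokenBucket` type, its fields, and the choice to pass the current time in explicitly (which keeps the logic easy to test) are all hypothetical.

```rust
use std::time::Instant;

/// Generic token bucket: holds up to `capacity` tokens and refills
/// continuously at `refill_per_sec` tokens per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        let capacity = f64::from(capacity);
        Self { capacity, tokens: capacity, refill_per_sec, last: now }
    }

    /// Returns true if the request is admitted. A caller would map `false`
    /// to an HTTP 429 response.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Refill based on elapsed time, never exceeding the capacity.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Note the contrast with the concurrency limit described in the same entry: the bucket rejects excess requests immediately, whereas admission control queues them for a permit.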
A3S Power is organized into 6 layers. Each layer has a clear responsibility and communicates only with adjacent layers through trait-based interfaces.
a3s-power
- API Layer
  - Endpoints: /v1/chat/completions, /v1/completions, /v1/models, /v1/models/pull, /v1/models/:name, /v1/embeddings, /v1/attestation, /health, /metrics
  - autoload: LRU eviction → decrypt → integrity check → load
- Server Layer
  - Middleware Stack (outermost → innermost): RateLimiter → RequestID → Metrics → Tracing → CORS → Auth
  - AppState (model lifecycle, LRU, privacy)
  - Auth (Bearer, SHA256, const-time)
  - Audit (JSONL / encrypt / async / noop)
  - Metrics (Prometheus, 16 metric groups)
  - Transport (TCP / TLS / Vsock)
- Backend Layer
  - BackendRegistry (priority-based, TEE-aware routing)
    - MistralRsBackend: pure Rust (candle); GGUF / SafeTensors / HuggingFace / Vision; ISQ quantization
    - LlamaCppBackend: C++ bindings; GGUF only; KV cache, LoRA, grammar, vision
    - PicolmBackend: pure Rust; layer-stream, O(layer_size); TEE-optimized
  - Shared: chat_template · gpu · json_schema · tool_parser · think_parser · gguf_stream
- Model Layer
  - ModelRegistry (RwLock<Map>, manifest persistence)
  - BlobStorage (SHA-256 content-addressed)
  - GgufMeta (parser, memory estimation)
  - HfPull (Range resume, SSE progress)
- TEE Layer (cross-cutting security)
  - Attestation (TeeProvider: SEV-SNP, TDX, ioctl)
  - Encrypted Model (AES-256-GCM, 3 modes)
  - Privacy (Provider: redact, zeroize, suppress)
  - Model Seal (SHA-256 + Ed25519 sig)
  - KeyProvider (Static, Rotating, HSM ext.)
  - TeePolicy (allowlist, measurement pin)
  - EPC (memory detect, routing)
  - RA-TLS Cert (X.509 + attestation extension)
- Verify Layer (client-side SDK)
  - verify_report(): nonce binding (const-time) · model hash binding · measurement check
  - HardwareVerifier trait: SevSnpVerifier (AMD KDS) · TdxVerifier (Intel PCS) · ECDSA P-384 / P-256
- Infrastructure: config.rs (HCL) · dirs.rs · error.rs (14 variants)
Power follows the Minimal Core + External Extensions pattern. Core components are stable and non-replaceable; extensions are trait-based and swappable.
| Core (7) | Extensions (8 trait-based) |
|---|---|
| AppState (model lifecycle) | Backend: MistralRs / LlamaCpp / Picolm |
| BackendRegistry + Backend trait | TeeProvider: SEV-SNP / TDX / Simulated |
| ModelRegistry + ModelManifest | PrivacyProvider: redaction policy |
| PowerConfig (HCL) | TeePolicy: allowlist + measurement pin |
| PowerError (14 variants → HTTP) | KeyProvider: Static / Rotating / KMS |
| Router + middleware stack | AuthProvider: API key (SHA-256) |
| RequestContext (per-request) | AuditLogger: JSONL / Encrypted / Async / Noop |
| | HardwareVerifier: AMD KDS / Intel PCS |
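As an illustration of what a swappable, trait-based extension might look like, the sketch below implements an allowlist plus measurement-pin policy. The `TeeType` enum and `TeePolicy` trait are local stand-ins declared here so the example is self-contained; `PinnedPolicy`, `from_hex`, and `decode_hex` are hypothetical names, and the 48-byte pin length follows the 96-character hex measurements used in the configuration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TeeType { SevSnp, Tdx, Simulated, None }

// Local stand-in for the extension trait, declared here for self-containment.
trait TeePolicy: Send + Sync {
    fn is_allowed(&self, tee_type: TeeType) -> bool;
    fn validate_measurement(&self, measurement: &[u8]) -> bool;
}

/// Decode a hex string; returns None on odd length or non-hex characters.
fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

/// Hypothetical policy: a TEE-type allowlist plus one pinned 48-byte measurement.
struct PinnedPolicy {
    allowed: Vec<TeeType>,
    pin: [u8; 48],
}

impl PinnedPolicy {
    /// Builds the policy from a 96-character hex pin; any other length is rejected.
    fn from_hex(allowed: Vec<TeeType>, pin_hex: &str) -> Option<Self> {
        let bytes = decode_hex(pin_hex)?;
        if bytes.len() != 48 {
            return None;
        }
        let mut pin = [0u8; 48];
        pin.copy_from_slice(&bytes);
        Some(Self { allowed, pin })
    }
}

impl TeePolicy for PinnedPolicy {
    fn is_allowed(&self, tee_type: TeeType) -> bool {
        self.allowed.contains(&tee_type)
    }

    fn validate_measurement(&self, measurement: &[u8]) -> bool {
        // Plain equality for brevity; this sketch is not constant-time.
        measurement == self.pin.as_slice()
    }
}
```

Rejecting a malformed pin at construction time mirrors the fail-closed stance the rest of the configuration takes.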
Client
  │
  │  POST /v1/chat/completions
  ▼
Middleware Stack
  RateLimiter → RequestID → Metrics → Tracing → CORS → Auth
  │
  ▼
chat::handler()
  1. Build RequestContext (request_id, auth_id)
  2. Privacy: sanitize_log() if redaction enabled
  3. ModelRegistry.get(model) → ModelManifest
  4. BackendRegistry.find_for_format(format) → Backend
  5. autoload::ensure_loaded()
     ├─ LRU eviction if at max_loaded_models
     ├─ If .enc: KeyProvider.get_key() → AES-256-GCM decrypt
     │    ├─ MemoryDecryptedModel (mlock RAM, zeroize on drop)
     │    ├─ DecryptedModel (temp file, secure wipe on drop)
     │    └─ LayerStreamingDecryptedModel (chunk-by-chunk)
     ├─ model_seal: verify SHA-256 integrity
     ├─ model_seal: verify Ed25519 signature (if configured)
     └─ Backend.load(manifest)
  6. Backend.chat(model, request) → Stream<ChatResponseChunk>
  7. Streaming SSE: role → content chunks (TTFT) → usage → DONE
  8. Privacy: zeroize buffers, round token counts
  9. Timing padding (±20% jitter) if configured
  10. Audit: log event, Metrics: record duration/tokens
  11. If keep_alive=0: Backend.unload() → RAII secure wipe
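The LRU eviction performed in step 5 of the flow above can be sketched with a small recency queue. This is a minimal illustration, assuming eviction is driven purely by a `max_loaded_models`-style limit; the `LoadedModels` type and its method are hypothetical and leave out decryption, integrity checks, and the actual backend load.

```rust
use std::collections::VecDeque;

/// Tracks loaded model names, ordered from least to most recently used.
struct LoadedModels {
    max_loaded: usize,
    order: VecDeque<String>,
}

impl LoadedModels {
    fn new(max_loaded: usize) -> Self {
        // A limit of zero would make every load impossible, so clamp to one.
        Self { max_loaded: max_loaded.max(1), order: VecDeque::new() }
    }

    /// Marks `name` as in use and returns the models evicted to make room.
    fn ensure_loaded(&mut self, name: &str) -> Vec<String> {
        if let Some(pos) = self.order.iter().position(|m| m.as_str() == name) {
            // Already loaded: move it to the most-recently-used end.
            if let Some(model) = self.order.remove(pos) {
                self.order.push_back(model);
            }
            return Vec::new();
        }
        let mut evicted = Vec::new();
        while self.order.len() >= self.max_loaded {
            match self.order.pop_front() {
                Some(old) => evicted.push(old),
                None => break,
            }
        }
        self.order.push_back(name.to_string());
        evicted
    }
}
```

In a real server each evicted name would be handed to the backend's unload path so that the secure wipe runs before the new model is loaded.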
The TEE layer is cross-cutting: it integrates at every layer of the stack.

| Layer | TEE Integration |
|---|---|
| API | Log redaction, buffer zeroization, token rounding, timing padding, attestation endpoint (nonce + model bind) |
| Server | Encrypted audit logs (AES-256-GCM), constant-time auth, RAII decrypted model storage, RA-TLS cert with attestation X.509 extension, TEE-specific Prometheus counters |
| Backend | EPC-aware routing (auto picolm when model > 75% EPC), KV cache isolation per request, mlock weight pinning |
| Model | Content-addressed SHA-256 storage, GGUF memory estimation for EPC budget planning |
| TEE | Attestation (SEV-SNP/TDX ioctl), AES-256-GCM encryption (file-backed loading plus RAM/streaming primitives), Ed25519 model signatures, key rotation, policy enforcement, log redaction (10 keys), SensitiveString (auto-zeroize), EPC memory detection |
| Verify | Client-side: nonce binding, model hash binding, measurement check (all constant-time), hardware signature verification via AMD KDS / Intel PCS certificate chains |
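The token rounding listed for the API layer can be pictured as bucketing counts to the nearest multiple of ten before they reach metrics. A minimal sketch, assuming ties round up (the text only says "nearest 10"); the function name is hypothetical.

```rust
/// Round a token count to the nearest multiple of 10 so that exact lengths
/// are not exposed through metrics. Ties (…5) round up in this sketch.
fn round_to_nearest_10(tokens: u64) -> u64 {
    (tokens.saturating_add(5) / 10) * 10
}
```

With this bucketing, prompts of 120 and 124 tokens report the same count, which is the side-channel reduction the feature list describes.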
KeyProvider.get_key()  (Static / Rotating / HSM ext.)
  │
  │  AES-256-GCM key
  │
  ├─ DecryptedModel (file)
  │    Temp .dec file on disk, zero overwrite + delete on drop.
  │    End-to-end backend path today.
  ├─ MemoryDecryptedModel (RAM)
  │    mlock-pinned RAM buffer, zeroize on drop.
  │    End-to-end with picolm GGUF memory loading.
  └─ LayerStreamingDecryptedModel (chunks)
       Chunked plaintext access primitive with Zeroizing chunk buffers.
       End-to-end with picolm GGUF; unsupported backends fail closed.
Encrypted-model autoload supports in_memory_decrypt = true only when the
selected backend explicitly accepts locked plaintext buffers. Today that means
picolm for GGUF models; other backends fail closed before load.
streaming_decrypt = true similarly requires a backend that explicitly accepts
LayerStreamingDecryptedModel; today that path is picolm for GGUF models.
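The fail-closed rule in the paragraph above can be sketched as a capability check that runs before any load is attempted. All names here (`DecryptMode`, `BackendCaps`, `LoadError`, `check_mode`) are hypothetical, and the sketch assumes file-backed loading is accepted by every backend; it is an illustration of the rule, not Power's autoload code.

```rust
#[derive(Debug, PartialEq)]
enum DecryptMode { FileBacked, InMemory, LayerStreaming }

#[derive(Debug, PartialEq)]
enum LoadError {
    /// The selected backend does not accept the requested plaintext form.
    Unsupported { backend: String, mode: DecryptMode },
}

/// Hypothetical capability flags a backend would advertise.
struct BackendCaps {
    name: &'static str,
    accepts_locked_buffers: bool,
    accepts_layer_streaming: bool,
}

/// Returns the mode to use, or an error instead of silently falling back
/// to a weaker mode.
fn check_mode(requested: DecryptMode, caps: &BackendCaps) -> Result<DecryptMode, LoadError> {
    let supported = match requested {
        DecryptMode::FileBacked => true, // assumption: always available
        DecryptMode::InMemory => caps.accepts_locked_buffers,
        DecryptMode::LayerStreaming => caps.accepts_layer_streaming,
    };
    if supported {
        Ok(requested)
    } else {
        Err(LoadError::Unsupported { backend: caps.name.to_string(), mode: requested })
    }
}
```

The design point is that an unsupported combination produces an error rather than a quiet downgrade to temp-file decryption, which would change where plaintext weights end up.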
Three backends are available, each feature-gated:
- `mistralrs` (default): Pure Rust inference via candle. GGUF, SafeTensors, HuggingFace, Vision formats. ISQ on-load quantization. No C++ inference toolchain. Ideal for TEE supply-chain auditing.
- `llamacpp` (optional): C++ llama.cpp via `llama-cpp-2` bindings. GGUF only. Session KV cache with prefix matching, LoRA adapters, MTMD multimodal, grammar constraints, mirostat sampling.
- `picolm` (optional): Pure Rust layer-streaming. GGUF only. Real transformer inference (multi-head/GQA attention, SwiGLU/GeGLU FFN, RoPE, RMSNorm). Peak RAM = O(layer_size) not O(model_size) via `madvise(DONTNEED)` page release. FP16 KV cache with fused f16 dot/accumulate. Fused dequant+dot kernels (Q4_K, Q6_K, Q8_0). NEON SIMD (aarch64) + AVX2 (x86_64). Rayon parallel matmul. Batch prefill, speculative decoding, tool calling, grammar-constrained output. 14+ tok/s decode on Apple Silicon. Enables 7B+ models in 512MB TEE EPC. No C/C++ inference backend: ~4,500 lines of fully auditable Rust.
The BackendRegistry selects backends by priority and model format. In TEE environments, find_for_tee() auto-routes to picolm when the model exceeds 75% of available EPC memory.
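The 75% threshold described above reduces to a single comparison. A minimal sketch using integer arithmetic to avoid floating-point rounding; the function name is hypothetical and it does not reproduce `find_for_tee()` itself.

```rust
/// True when the model is larger than 75% of the available EPC memory,
/// i.e. when model_bytes > 0.75 * epc_bytes.
fn exceeds_epc_budget(model_bytes: u64, epc_bytes: u64) -> bool {
    // Widen to u128 so the multiplications cannot overflow.
    u128::from(model_bytes) * 4 > u128::from(epc_bytes) * 3
}
```

For a 512 MiB EPC the cutoff is 384 MiB: a model of exactly that size stays on the default backend, and anything larger would be routed to the layer-streaming backend.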
Without any backend feature enabled, Power can manage models but returns "backend not available" for inference.
#[async_trait]
pub trait Backend: Send + Sync {
    fn name(&self) -> &str;
    fn supports(&self, format: &ModelFormat) -> bool;
    async fn load(&self, manifest: &ModelManifest) -> Result<()>;
    async fn unload(&self, model_name: &str) -> Result<()>;
    async fn chat(&self, model_name: &str, request: ChatRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<ChatResponseChunk>> + Send>>>;
    async fn complete(&self, model_name: &str, request: CompletionRequest)
        -> Result<Pin<Box<dyn Stream<Item = Result<CompletionResponseChunk>> + Send>>>;
    async fn embed(&self, model_name: &str, request: EmbeddingRequest)
        -> Result<EmbeddingResponse>;
}

All extension points are trait-based with working default implementations; the system works out of the box:
/// Remote attestation provider (TEE hardware abstraction).
pub trait TeeProvider: Send + Sync {
    async fn attestation_report(&self, nonce: Option<&[u8]>) -> Result<AttestationReport>;
    async fn attestation_report_with_model(
        &self, nonce: Option<&[u8]>, model_hash: Option<&[u8]>
    ) -> Result<AttestationReport>;
    fn is_tee_environment(&self) -> bool;
    fn tee_type(&self) -> TeeType; // SevSnp | Tdx | Simulated | None
}

/// Privacy protection for inference logs.
pub trait PrivacyProvider: Send + Sync {
    fn should_redact(&self) -> bool;
    fn sanitize_log(&self, msg: &str) -> String;
    fn sanitize_error(&self, err: &str) -> String;
    fn should_suppress_token_metrics(&self) -> bool;
}

/// Model decryption key management (extensible to HSM/KMS).
pub trait KeyProvider: Send + Sync {
    async fn get_key(&self) -> Result<[u8; 32]>;
    async fn rotate_key(&self) -> Result<[u8; 32]>;
    fn provider_name(&self) -> &str;
}

/// Authentication mechanism.
pub trait AuthProvider: Send + Sync {
    fn authenticate(&self, token: &str) -> Result<AuthId>;
}

/// Audit trail persistence.
pub trait AuditLogger: Send + Sync {
    fn log(&self, event: AuditEvent);
    async fn flush(&self);
}

/// TEE policy enforcement.
pub trait TeePolicy: Send + Sync {
    fn is_allowed(&self, tee_type: TeeType) -> bool;
    fn validate_measurement(&self, measurement: &[u8]) -> bool;
}

/// Client-side hardware attestation signature verification.
pub trait HardwareVerifier: Send + Sync {
    async fn verify(&self, report: &AttestationReport) -> Result<()>;
}

# Default: pure Rust inference via mistral.rs (no C++ toolchain needed)
cargo install a3s-power
# With llama.cpp inference backend (requires C++ compiler + CMake)
cargo install a3s-power --no-default-features --features llamacpp
# Model management only (no inference)
cargo install a3s-power --no-default-features

git clone https://github.com/A3S-Lab/Power.git
cd Power
# Default: pure Rust inference via mistral.rs
cargo build --release
# With llama.cpp inference instead
cargo build --release --no-default-features --features llamacpp
# Binary at target/release/a3s-power

brew tap a3s-lab/tap https://github.com/A3S-Lab/homebrew-tap
brew install a3s-power

Configuration is read from `~/.a3s/power/config.hcl` (HCL format):
host = "127.0.0.1"
port = 11434
max_loaded_models = 1
keep_alive = "5m"
# TEE privacy protection
tee_mode = true
tee_policy_mode = "strict"
redact_logs = true
# Production TEE launch measurement pinning. Strict/gpu-confidential policy
# requires the detected hardware TEE type to have a 48-byte measurement pin.
expected_measurements = {
  "sev-snp" = "<96-char-measurement-hex>"
  # "tdx" = "<96-char-mrtd-hex>"
}
# Model integrity verification (required by strict policy when tee_mode = true)
model_hashes = {
  "llama3.2:3b" = "sha256:abc123..."
  "qwen2.5:7b" = "sha256:def456..."
}
# GPU acceleration
gpu {
  gpu_layers = -1 # -1 = offload all layers, 0 = CPU only
  main_gpu = 0
}
# NVIDIA GPU confidential-computing evidence binding
# Required only when tee_policy_mode = "gpu-confidential".
gpu_attestation {
  # Preferred live path: invoke NVIDIA nvattest for each attestation request.
  source = "nvattest-cli"
  provider = "nvidia-nras"
  nvattest_path = "/usr/local/bin/nvattest"
  nvattest_verifier = "remote"
  nvattest_gpu_evidence_source = "nvml"
  # nras_url = "https://<your-nras-endpoint>"

  # Alternative compatibility path for external evidence pipelines:
  # The configured verdict must be NVIDIA nvattest/NRAS JSON whose eat_nonce
  # matches each /v1/attestation nonce; stale verdicts fail closed.
  # source = "configured"
  # evidence_path = "/run/a3s/nvidia-gpu-evidence.json"
  # verdict_path = "/run/a3s/nvidia-nras-verdict.json"

  # Direct NRAS REST path for deployments that collect DeviceEvidenceV2 JSON
  # through their own NVIDIA evidence collector:
  # source = "nras-rest"
  # evidence_path = "/run/a3s/nvidia-gpu-evidence-list.json"
  # nras_url = "https://nras.attestation.nvidia.com" # no embedded credentials
  # nras_gpu_architecture = "HOPPER" # or "BLACKWELL"
  # nras_claims_version = "3.0"
  # nras_bearer_token_env = "NRAS_BEARER_TOKEN"
}
# Optional proxy backend integration.
# When enabled, Power asks the upstream for the exact rendered prompt digest
# before proxied chat inference. If required = true, missing support fails closed.
# proxy_upstreams = {
# "llama-70b" = "http://vllm:8000"
# }
# proxy_effective_prompt_digest = true
# proxy_effective_prompt_digest_required = false
# proxy_effective_prompt_digest_path = "/v1/chat/effective-prompt-digest"| Field | Default | Description |
|---|---|---|
host |
127.0.0.1 |
HTTP server bind address |
port |
11434 |
HTTP server port |
data_dir |
~/.a3s/power |
Base directory for model storage |
max_loaded_models |
1 |
Maximum models loaded concurrently |
keep_alive |
"5m" |
Auto-unload idle models ("0" = immediate, "-1" = never); invalid config or request values fail closed |
spec_mode |
"prompt-lookup" |
picolm speculative-decoding mode: "off", "prompt-lookup", or "ngram-context"; unknown values fail configuration validation |
use_mlock |
false |
Lock model weights in memory (prevent swapping) |
num_thread |
auto | Thread count for inference |
flash_attention |
false |
Enable flash attention |
num_parallel |
1 |
Concurrent inference slots |
tee_mode |
false |
Enable TEE: attestation, integrity checks, log redaction |
tee_policy_mode |
"strict" |
TEE attestation policy: "strict" for production, "development" for simulated/local tests, "gpu-confidential" for NVIDIA GPU confidential-computing deployments with bound GPU evidence |
| expected_measurements | {} | Expected 48-byte launch measurements per detected hardware TEE type; required by strict and GPU-confidential policy ("sev-snp" measurement or "tdx" MRTD) |
| redact_logs | false | Redact inference content from logs |
| audit_log | false | Enable structured audit logging |
| audit_log_path | null | Audit log path; defaults to $A3S_POWER_HOME/audit.jsonl; startup fails closed if the path cannot be opened while audit_log = true |
| audit_log_encrypt | false | Encrypt audit log entries at rest; requires audit_key_source and fails configuration validation when missing |
| audit_key_source | null | AES-256-GCM key source for encrypted audit logs: { file = "/path/to/key.hex" } or { env = "AUDIT_KEY_VAR" } |
| model_hashes | {} | Expected SHA-256 hashes for model verification |
| model_signing_key | null | Valid 32-byte Ed25519 public key (hex) for verifying model .sig signatures; invalid values fail configuration validation; /v1/attestation?model=... re-verifies the current runtime digest signature when no explicit model_hashes pin is configured |
| gpu.gpu_layers | 0 | GPU layer offloading (-1 = all) |
| gpu.main_gpu | 0 | Primary GPU index |
| gpu_attestation.source | "configured" | GPU CC evidence source: "configured" for file/hex bytes, "nvattest-cli" for live NVIDIA nvattest, or "nras-rest" for direct NVIDIA NRAS REST attestation |
| gpu_attestation.provider | "nvidia-nras" | Provider label for NVIDIA GPU confidential-computing evidence claims; gpu-confidential production policy requires "nvidia-nras" |
| gpu_attestation.evidence_path | null | Path to raw NVIDIA GPU CC evidence bytes; mutually exclusive with evidence_hex and fails configuration validation if both are set; required when source = "nras-rest"; gpu-confidential production policy requires an absolute path to an existing non-empty regular file when file-backed evidence is configured; configured evidence sources are capped at 64 MiB |
| gpu_attestation.evidence_hex | null | Hex-encoded raw NVIDIA GPU CC evidence bytes; mutually exclusive with evidence_path and fails configuration validation if both are set; required when source = "nras-rest" unless evidence_path is set; configured evidence sources are capped at 64 MiB |
| gpu_attestation.verdict_path | null | Path to raw NVIDIA NRAS verdict bytes; mutually exclusive with verdict_hex and fails configuration validation if both are set; must be unset when source = "nras-rest" because NRAS REST obtains the verdict directly; gpu-confidential production policy requires an absolute path to an existing non-empty regular file when configured evidence uses a file-backed verdict; configured verdict sources are capped at 64 MiB |
| gpu_attestation.verdict_hex | null | Hex-encoded raw NVIDIA NRAS verdict bytes; mutually exclusive with verdict_path and fails configuration validation if both are set; must be unset when source = "nras-rest" because NRAS REST obtains the verdict directly; configured verdict sources are capped at 64 MiB |
| gpu_attestation.nvattest_path | "nvattest" | Path to NVIDIA's nvattest CLI when source = "nvattest-cli"; gpu-confidential production policy requires an absolute path to an existing executable file |
| gpu_attestation.nvattest_verifier | "remote" | nvattest attest --verifier value; must be "remote" or "local" when source = "nvattest-cli"; gpu-confidential mode requires "remote" for NRAS |
| gpu_attestation.nvattest_gpu_evidence_source | "nvml" | nvattest collect-evidence --gpu-evidence-source; must be "nvml" or "corelib" when source = "nvattest-cli"; use "nvml" for H100 confidential-computing deployments |
| gpu_attestation.nvattest_gpu_architecture | null | GPU architecture value required when source = "nvattest-cli" and nvattest_gpu_evidence_source = "corelib" |
| gpu_attestation.nras_url | null | Optional NRAS URL. For nvattest-cli, passed to nvattest attest --nras-url; for nras-rest, may be a service root/base path or full /v4/attest/gpu endpoint. In gpu-confidential production policy, custom NRAS URLs must use HTTPS and must not include embedded credentials |
| gpu_attestation.nras_gpu_architecture | null | GPU architecture required when source = "nras-rest": "HOPPER" or "BLACKWELL" |
| gpu_attestation.nras_claims_version | "3.0" | NVIDIA NRAS REST claims version ("2.0" or "3.0"); invalid values fail configuration validation when source = "nras-rest" |
| gpu_attestation.nras_bearer_token_env | null | Optional environment variable name containing a bearer token for direct NRAS REST calls; the name is trimmed and must be a portable ASCII identifier ([A-Za-z_][A-Za-z0-9_]*); use this instead of embedding credentials in nras_url |
| gpu_attestation.nras_timeout_secs | 30 | Timeout for each direct NRAS REST request; must be greater than zero when source = "nras-rest" |
| gpu_attestation.rim_url | null | Optional RIM URL passed to nvattest attest --rim-url; gpu-confidential production policy requires HTTPS when configured |
| gpu_attestation.ocsp_url | null | Optional OCSP URL passed to nvattest attest --ocsp-url; gpu-confidential production policy requires HTTPS when configured |
| gpu_attestation.relying_party_policy_path | null | Optional relying-party policy file for nvattest attest; gpu-confidential production policy requires an absolute path to an existing non-empty regular file when configured |
| gpu_attestation.nvattest_timeout_secs | 30 | Timeout for each nvattest command; must be greater than zero when source = "nvattest-cli" |
| model_key_source | null | Decryption key for .enc model files: { file = "/path/to/key.hex" } or { env = "MY_KEY_VAR" } |
| key_provider | "static" | Key provider type: "static" (uses model_key_source) or "rotating" (uses key_rotation_sources); unknown values fail configuration validation |
| key_rotation_sources | [] | For rotating provider: list of key sources in rotation order; required when key_provider = "rotating" |
| in_memory_decrypt | false | Load encrypted GGUF plaintext from locked RAM when the selected backend supports it (picolm); unsupported backends fail closed |
| streaming_decrypt | false | Load encrypted GGUF plaintext through LayerStreamingDecryptedModel when the selected backend supports it (picolm); unsupported backends fail closed |
| suppress_token_metrics | false | Round token counts in responses to nearest 10 (prevents exact token-count side-channel) |
| rate_limit_rps | 0 | Max requests per second for /v1/* endpoints (0 = unlimited) |
| max_concurrent_requests | 0 | Max concurrent in-flight inference requests; excess queue for an admission permit held across the streamed response (0 = unlimited) |
| proxy_upstreams | {} | Map of model name → upstream base URL to proxy to an OpenAI-compatible server (vLLM/TGI/SGLang/OpenAI), e.g. { "llama-70b" = "http://vllm:8000" }. Proxied inference runs on the upstream, outside any TEE |
| proxy_effective_prompt_digest | false | Ask proxy upstreams for a rendered chat prompt digest before inference and include it in receipts when returned |
| proxy_effective_prompt_digest_required | false | Fail closed when a proxy upstream does not support or cannot return the rendered prompt digest |
| proxy_effective_prompt_digest_path | "/v1/chat/effective-prompt-digest" | Upstream endpoint path for proxy rendered prompt digest requests |
| tls_port | null | TLS server port; when set, a TLS server starts in parallel; configuration validation fails unless the binary was built with the tls feature |
| tls_sans | [] | Additional DNS names or IP addresses for the TLS certificate; invalid entries fail closed instead of being skipped |
| ra_tls | false | Embed TEE attestation in TLS cert (RA-TLS); fails configuration validation unless tls_port and tee_mode = true are set, and startup fails closed if no attestation report can be embedded |
| vsock_port | null | Vsock port for guest-host communication (vsock feature, Linux only) |
| Variable | Description |
|---|---|
| A3S_POWER_HOME | Base directory for all Power data (default: ~/.a3s/power) |
| A3S_POWER_HOST | Server bind address |
| A3S_POWER_PORT | Server port; invalid values fail closed |
| A3S_POWER_DATA_DIR | Model storage directory |
| A3S_POWER_MAX_MODELS | Max concurrent loaded models; invalid values fail closed |
| A3S_POWER_KEEP_ALIVE | Default keep-alive duration |
| A3S_POWER_SPEC_MODE | picolm speculative-decoding mode ("off", "prompt-lookup", or "ngram-context"); invalid values fail closed |
| A3S_POWER_MODEL_SOURCE | Remote model hub source for pull ("modelscope", "hf", or "huggingface"); invalid configured values fail closed |
| A3S_POWER_HUB_TOKEN | Generic bearer token fallback for remote model hub pulls |
| A3S_POWER_GPU_LAYERS | GPU layer offloading; invalid values fail closed |
| A3S_POWER_GPU_ATTESTATION_SOURCE | GPU CC evidence source ("configured", "nvattest-cli", or "nras-rest"); invalid values fail closed |
| A3S_POWER_GPU_ATTESTATION_PROVIDER | Provider label for NVIDIA GPU CC evidence claims |
| A3S_POWER_GPU_ATTESTATION_EVIDENCE_PATH | Path to raw NVIDIA GPU CC evidence bytes |
| A3S_POWER_GPU_ATTESTATION_EVIDENCE_HEX | Hex-encoded raw NVIDIA GPU CC evidence bytes |
| A3S_POWER_GPU_ATTESTATION_VERDICT_PATH | Path to raw NVIDIA NRAS verdict bytes |
| A3S_POWER_GPU_ATTESTATION_VERDICT_HEX | Hex-encoded raw NVIDIA NRAS verdict bytes |
| A3S_POWER_GPU_ATTESTATION_NVATTEST_PATH | Path to NVIDIA's nvattest CLI |
| A3S_POWER_GPU_ATTESTATION_NVATTEST_VERIFIER | nvattest attest --verifier value |
| A3S_POWER_GPU_ATTESTATION_NVATTEST_GPU_EVIDENCE_SOURCE | Live GPU evidence source ("nvml" or "corelib") |
| A3S_POWER_GPU_ATTESTATION_NVATTEST_GPU_ARCHITECTURE | Architecture value for corelib evidence collection |
| A3S_POWER_GPU_ATTESTATION_NRAS_URL | Optional NRAS URL |
| A3S_POWER_GPU_ATTESTATION_NRAS_GPU_ARCHITECTURE | GPU architecture for direct NRAS REST ("HOPPER" or "BLACKWELL") |
| A3S_POWER_GPU_ATTESTATION_NRAS_CLAIMS_VERSION | Claims version for direct NRAS REST ("2.0" or "3.0") |
| A3S_POWER_GPU_ATTESTATION_NRAS_BEARER_TOKEN_ENV | Environment variable containing an optional NRAS REST bearer token |
| A3S_POWER_GPU_ATTESTATION_NRAS_TIMEOUT_SECS | Timeout for each direct NRAS REST request; invalid values fail closed |
| A3S_POWER_GPU_ATTESTATION_RIM_URL | Optional RIM URL |
| A3S_POWER_GPU_ATTESTATION_OCSP_URL | Optional OCSP URL |
| A3S_POWER_GPU_ATTESTATION_RELYING_PARTY_POLICY_PATH | Optional relying-party policy file |
| A3S_POWER_GPU_ATTESTATION_NVATTEST_TIMEOUT_SECS | Timeout for each nvattest command; invalid values fail closed |
| A3S_POWER_PROXY_EFFECTIVE_PROMPT_DIGEST | Enable proxy upstream rendered prompt digest requests; invalid values fail closed |
| A3S_POWER_PROXY_EFFECTIVE_PROMPT_DIGEST_REQUIRED | Require proxy upstream rendered prompt digest support and fail closed when missing; invalid values fail closed |
| A3S_POWER_PROXY_EFFECTIVE_PROMPT_DIGEST_PATH | Upstream endpoint path for rendered prompt digest requests |
| A3S_POWER_TEE_MODE | Enable TEE mode ("1" or "true"); invalid values fail closed |
| A3S_POWER_TEE_POLICY_MODE | Set TEE policy mode ("strict", "development", or "gpu-confidential"); invalid values fail closed |
| A3S_POWER_TEE_STRICT | Legacy shortcut: "1" selects strict policy and removes simulated TEE from the allowlist |
| A3S_POWER_REDACT_LOGS | Enable log redaction ("1" or "true"); invalid values fail closed |
| A3S_POWER_TLS_PORT | TLS server port (tls feature required); invalid values fail closed |
| A3S_POWER_RA_TLS | Enable RA-TLS attestation embedding ("1" or "true"); invalid values fail closed |
| A3S_POWER_AUDIT_LOG | Enable structured audit logging ("1" or "true"); invalid values fail closed |
| A3S_POWER_VSOCK_PORT | Vsock port (vsock feature, Linux only); invalid values fail closed |
| A3S_TEE_SIMULATE | Simulate TEE environment for development ("1") |
When tee_mode = true, Power uses tee_policy_mode = "strict" by default. Strict policy refuses to start unless the detected hardware TEE type has a 48-byte expected_measurements launch-measurement pin and local models are pinned with model_hashes or covered by model_signing_key; it also rejects simulated TEE evidence. Use tee_policy_mode = "development" only for local tests that intentionally rely on A3S_TEE_SIMULATE=1.
tee_policy_mode = "gpu-confidential" is for deployments that require NVIDIA GPU confidential-computing evidence to be bound into the CPU TEE attestation. Power supports ordinary CUDA acceleration separately, but ordinary CUDA is not a GPU confidential-computing attestation claim. In GPU confidential mode, startup requires final gpu.gpu_layers != 0 so a CPU-only execution policy cannot be paired with GPU evidence; use tee_policy_mode = "strict" for CPU-only TEE deployments. On NVIDIA CUDA hosts, the default GPU auto-configuration sets gpu_layers = -1 unless the operator overrides it. In GPU confidential mode, /v1/attestation requires a 32-byte nonce, encoded as 64 hex characters. Power gives that same nonce to the GPU evidence provider, hashes the raw GPU evidence and NRAS verdict bytes, emits a GpuEvidenceClaim, and binds it together with the CPU nonce into sha256(canonical_claims_v2) in CPU TEE report_data. For live nvattest-cli, direct nras-rest, and configured NVIDIA verdict JSON paths that expose claims, the GPU claim also includes structured NVIDIA device claims extracted from the verdict: device type, eat_nonce, hardware model, UEID/OEM ID, driver and firmware versions, measurement result, secure-boot/debug status, and normalized validation booleans for report signature, nonce match, FWID match, RIM schema validation, RIM signature, version match, and measurement availability. Power fails closed if an NVIDIA GPU/NVSwitch claim reports measres != "success", omits secboot or sets it to false, omits dbgstat, or reports a debug state other than disabled.
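As a client-side illustration of the nonce requirement above, the following Python sketch generates a fresh 32-byte nonce, checks that it is encoded as 64 hex characters, and builds a /v1/attestation URL. The helper names (new_nonce, is_valid_nonce, attestation_url) and the base URL are assumptions made for this example and are not part of Power's SDK; only the nonce and model query parameters come from the endpoint description in this document.

```python
import re
import secrets
from urllib.parse import urlencode

# 32 bytes encoded as hex is exactly 64 hex characters.
NONCE_RE = re.compile(r"[0-9a-fA-F]{64}")


def new_nonce() -> str:
    """Return a fresh random 32-byte nonce as 64 lowercase hex characters."""
    return secrets.token_hex(32)


def is_valid_nonce(nonce: str) -> bool:
    """Client-side check that a nonce is exactly 64 hex characters."""
    return NONCE_RE.fullmatch(nonce) is not None


def attestation_url(base_url, nonce, model=None):
    """Build an attestation URL (hypothetical helper, not a Power API)."""
    if not is_valid_nonce(nonce):
        raise ValueError("nonce must be 64 hex characters (32 bytes)")
    params = {"nonce": nonce}
    if model is not None:
        params["model"] = model  # urlencode percent-encodes ':' and '/'
    return f"{base_url.rstrip('/')}/v1/attestation?{urlencode(params)}"
```

Generating a new nonce for every request is what lets a verifier reject replayed reports, so a client would typically call new_nonce() immediately before each attestation request and keep the value to pass to the verifier.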
When gpu_attestation.source = "configured", Power reads externally produced
GPU evidence and verdict bytes from file or hex configuration. Startup verifies
that evidence and verdict sources exist, and file-backed sources must use
absolute paths to existing non-empty regular files in the production profile.
Configured evidence and verdict byte sources are capped at 64 MiB.
Each nonce-bound /v1/attestation call then requires the configured verdict to
be NVIDIA nvattest or NRAS JSON whose device eat_nonce matches the request
nonce; stale or non-parseable verdicts fail closed instead of being rebound to a
fresh CPU nonce.
When gpu_attestation.source = "nvattest-cli", Power invokes NVIDIA's
nvattest binary for collection and attestation. Configure remote NRAS
credentials, service keys, and relying-party policy according to the NVIDIA
nvattest deployment instructions; Power does not store NRAS service keys in
HCL. In gpu-confidential policy mode, startup requires
provider = "nvidia-nras", an absolute executable nvattest_path, and
nvattest_verifier = "remote" so NVIDIA NRAS verifies the GPU evidence; local
nvattest verification and PATH-resolved CLI lookup are reserved for
development or non-production experiments outside the production profile. If a
relying-party policy path is configured, it must be an absolute path to an
existing non-empty regular file. If custom nras_url, rim_url, or ocsp_url
values are configured in the production profile, they must be HTTPS URLs. The
temporary evidence file passed from nvattest collect-evidence to
nvattest attest is created with exclusive create semantics and owner-only
permissions on Unix. Power reads nvattest stdout/stderr with bounded buffers:
evidence and verdict stdout are capped at 64 MiB, and stderr diagnostics are
capped at 1 MiB.
When gpu_attestation.source = "nras-rest", Power reads configured
DeviceEvidenceV2 JSON from evidence_path or evidence_hex, posts
nonce, arch, evidence_list, and claims_version to NVIDIA NRAS
/v4/attest/gpu, then binds the returned detached EAT JSON as the verdict.
File-backed evidence must use an absolute path to an existing non-empty regular
file in the production profile.
GPU confidential mode requires clients to send a 64-character hex nonce to
/v1/attestation. The evidence JSON may be a single
{ evidence, certificate, firmware_version? } object, an evidence_list
array, or an nvattest collect-evidence wrapper whose embedded nonce matches
the attestation nonce. Power validates evidence and certificate locally as
non-empty base64/base64url before posting to NRAS, and rejects evidence lists
or structured verdict device-claim lists with more than 1024 entries. In the
production profile, the default NVIDIA NRAS endpoint is used when nras_url is
omitted. Direct
nras-rest overrides must use HTTPS and may be a service root/base path that
Power expands to /v4/attest/gpu, or the exact full /v4/attest/gpu
endpoint; query strings, fragments, embedded credentials, and unsupported
versioned paths are rejected during gpu-confidential startup and before
direct NRAS REST requests. Direct NRAS REST response bodies are capped at 16
MiB before JSON or detached-EAT parsing. Detached EAT values must appear in
explicit EAT fields, are capped at 1024 tokens, and must contain JWTs with
base64url JSON payloads; unrelated version strings elsewhere in the response
are not treated as token candidates.
If nras_bearer_token_env is configured, it names the environment variable
holding the bearer token; Power trims the name and rejects empty or non-portable
names before making NRAS requests. The token value itself is trimmed, must be
non-empty, and must contain only visible ASCII characters.
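The name and token rules above can be sketched in Python as follows. This is only an illustration of the documented checks, not Power's implementation: the function names are hypothetical, and "visible ASCII" is assumed here to mean the byte range 0x21 to 0x7E (no spaces or control characters).

```python
import re

# Portable ASCII identifier, as documented for nras_bearer_token_env.
ENV_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_env_name(raw: str) -> str:
    """Trim the configured name and reject empty or non-portable names."""
    name = raw.strip()
    if not ENV_NAME_RE.fullmatch(name):
        raise ValueError("env var name must be a portable ASCII identifier")
    return name


def validate_token(raw: str) -> str:
    """Trim the token and require non-empty, visible-ASCII-only content."""
    token = raw.strip()
    if not token:
        raise ValueError("bearer token must be non-empty")
    # Assumption: visible ASCII means 0x21..0x7E inclusive.
    if any(not (0x21 <= ord(ch) <= 0x7E) for ch in token):
        raise ValueError("bearer token must contain only visible ASCII")
    return token


def load_bearer_token(env_name: str, environ) -> str:
    """Resolve the token through the named variable in a mapping such as os.environ."""
    name = validate_env_name(env_name)
    if name not in environ:
        raise KeyError(name)
    return validate_token(environ[name])
```

Passing the environment as a mapping keeps the sketch easy to exercise; a real caller would pass os.environ.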
Verifiers can pin the exact GPU deployment identity with
ExpectedGpuEvidence and ExpectedGpuDevices in the SDK or with CLI flags
such as --gpu-provider, --gpu-evidence-format, --gpu-verdict-format,
--gpu-evidence-count, --gpu-count, --nvswitch-count,
--gpu-claims-version, --gpu-ueid, --gpu-oemid, --gpu-hwmodel,
--gpu-driver-version, --gpu-firmware-version, --nvswitch-claims-version,
--nvswitch-ueid, --nvswitch-oemid, --nvswitch-hwmodel, and
--nvswitch-firmware-version. UEID pinning is an exact device-set check;
GPU/NVSwitch counts pin the attested topology; claims version, OEM ID, hwmodel,
driver version, and firmware version pins are allow-lists applied to every
attested NVIDIA GPU or NVSwitch claim for the matching device type. OEM ID is
supplemental and does not replace exact UEID pinning or deployment-specific
model/version pins in the production profile.
For production NVIDIA GPU confidential-computing deployments, prefer
a3s-power-verify --gpu-confidential: it requires a 32-byte --nonce,
--gpu-verdict-digest, and bundles v2 claims, top-level GPU evidence nonce
freshness, structured device nonce freshness, NVIDIA NRAS verdict binding,
required GPU provider/format/count pins, structured NVIDIA device claims,
required exact GPU topology plus claims-version and identity/version pins,
runtime policy, and GPU
execution/offload digest checks into one verifier profile. The production
profile requires --gpu-claims-version plus either an exact --gpu-ueid set or
--gpu-count together with at least one of --gpu-hwmodel,
--gpu-driver-version, or --gpu-firmware-version, and it rejects NVIDIA
device claims unless secure boot is enabled and debug is disabled. When
--nvswitch-count is greater than zero, the production profile also requires
--nvswitch-claims-version plus either an exact --nvswitch-ueid set or at
least one of --nvswitch-hwmodel or --nvswitch-firmware-version.
Verifiers can also pin the model execution/offload policy with
--require-runtime-policy --gpu-execution-digest <64-char-hex>, which checks
the attested runtime.execution.gpu_sha256 digest over canonical
gpu_layers, main_gpu, and tensor_split values. To compute that value
without reimplementing Power's canonical JSON semantics, use:
a3s-power-verify --print-gpu-execution-digest \
--gpu-layers <N> \
--main-gpu <N> \
--tensor-split <CSV>
For CPU TEE hardware-signature operations, including hw-verify builds,
AMD KDS / Intel PCS outbound access, raw-report requirements, and production
failure handling, see
docs/hardware-verifier-operations.md.
tee_mode = true
tee_policy_mode = "gpu-confidential"
model_hashes = {
"llama3.2:3b" = "sha256:a1b2c3d4e5f6..."
}
gpu_attestation {
source = "nvattest-cli"
provider = "nvidia-nras"
nvattest_path = "/usr/local/bin/nvattest"
nvattest_verifier = "remote"
nvattest_gpu_evidence_source = "nvml"
# nras_url = "https://<your-nras-endpoint>"
}
INFO TEE mode enabled tee_type="sev-snp"
INFO Model integrity verified model="llama3.2:3b"
INFO All model integrity checks passed count=1
The TeeProvider detects the TEE environment and generates attestation reports:
| TEE Type | Detection | Description |
|---|---|---|
| AMD SEV-SNP | /dev/sev-guest | Hardware memory encryption + attestation |
| Intel TDX | /dev/tdx_guest | Trust Domain Extensions |
| Simulated | A3S_TEE_SIMULATE=1 | Development/testing mode; rejected by strict production policy |
| None | (default) | No TEE detected |
The /health endpoint exposes TEE status:
{
"status": "ok",
"version": "0.4.0",
"uptime_seconds": 120,
"loaded_models": 1,
"tee": {
"enabled": true,
"type": "sev-snp",
"models_verified": true
}
}
When redact_logs = true, the PrivacyProvider automatically strips inference content from all log output:
// Before redaction:
{"content": "tell me a secret", "model": "llama3"}
// After redaction:
{"content": "[REDACTED]", "model": "llama3"}
Redacted JSON keys: "content", "prompt", "text", "arguments", "input", "delta", "system", "message", "query", "instruction". Together these cover chat messages, tool call arguments, streaming deltas, system prompts, and completion requests.
Error messages that echo prompt content are also sanitized via sanitize_error(). When suppress_token_metrics = true, token counts in responses are rounded to the nearest 10 to prevent exact token-count side-channel inference.
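To make the redaction and rounding behavior concrete, here is a minimal Python sketch of both ideas. It is not the PrivacyProvider implementation: the function names are hypothetical, the recursive traversal is an assumption about how nested values are handled, and rounding exact ties upward (15 becomes 20) is an assumption because the text only says "nearest 10".

```python
# Key list copied from the documented set of redacted JSON keys.
REDACTED_KEYS = {
    "content", "prompt", "text", "arguments", "input",
    "delta", "system", "message", "query", "instruction",
}


def redact(value):
    """Return a copy of a JSON-like value with sensitive keys masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in REDACTED_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # scalars outside a sensitive key are left untouched


def round_token_count(count: int) -> int:
    """Round a non-negative token count to the nearest 10 (ties round up)."""
    return (count + 5) // 10 * 10
```

Applied to the example above, redact({"content": "tell me a secret", "model": "llama3"}) keeps the model name and masks only the content value.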
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check with TEE status, version, uptime, loaded models |
| GET | /metrics | Prometheus metrics (requests, durations, tokens, inference, TTFT, model memory, GPU) |
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming/non-streaming, vision, tools, thinking) |
| POST | /v1/completions | Text completion (streaming/non-streaming) |
| POST | /v1/embeddings | Generate embeddings |
| GET | /v1/models | List all registered models |
| GET | /v1/models/:name | Get a single model by name |
| POST | /v1/models | Register a local model artifact (name, path, optional format of gguf, safetensors, or huggingface); unsupported formats and unknown fields fail closed |
| DELETE | /v1/models/:name | Unload and deregister a model |
| POST | /v1/models/pull | Pull a GGUF model from the configured remote model hub (ModelScope by default, or HuggingFace Hub); body fields are name plus optional force and token; unknown fields fail closed; streams SSE progress events; requires hf feature; concurrent pulls of the same model are deduplicated |
| GET | /v1/models/pull/:name/status | Get persisted pull progress for a model (status, completed, total, error); URL-encode names that contain / or : |
| GET | /v1/attestation | TEE attestation report (returns 503 if TEE not enabled); optional ?nonce=<hex> binds client nonce; optional ?model=<name> emits v2 model/runtime claims and binds the claims digest into report_data; unknown query parameters fail closed; gpu-confidential mode also binds GPU evidence claims and requires a 32-byte ?nonce=<64-hex> |
The a3s-power models show and a3s-power models rm commands encode model
names as URL path segments automatically. Manual HTTP clients must percent-encode
path parameters that contain /, :, spaces, or query-special characters.
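For manual HTTP clients, the percent-encoding rule above can be sketched with Python's standard library. The helper names and the base URL are assumptions made for this example; the status path is the one listed in the endpoint table.

```python
from urllib.parse import quote


def model_path_segment(name: str) -> str:
    """Encode a model name as one URL path segment.

    safe="" makes quote() escape "/" as well, so a name such as
    "org/repo:tag" stays a single segment instead of splitting the path.
    """
    return quote(name, safe="")


def pull_status_url(base_url: str, name: str) -> str:
    """Build the pull-status URL for a model (hypothetical helper)."""
    segment = model_path_segment(name)
    return f"{base_url.rstrip('/')}/v1/models/pull/{segment}/status"
```

The default quote() call leaves "/" unescaped, which is why the sketch passes safe="" explicitly.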
Chat and text completion responses include an attestation_receipt object and
attestation_receipt_sha256. For streaming calls, Power emits a final SSE event
with those fields before [DONE]; when stream_options.include_usage = true or
suppress_token_metrics = true, that final event also includes usage.
Power rejects stream_options on non-streaming requests because those options
do not affect a non-streaming response shape, and currently supports only
stream_options.include_usage; other stream option fields fail closed.
Embedding requests accept only implemented top-level fields; unsupported fields
such as user fail closed instead of being silently dropped.
The v2 receipt covers prompt-bearing API input, model runtime
chat-template/GPU execution policy claims, request decoding parameters
including extended local sampling controls, streaming request options, stop
tokens, response format, tools including function strict schema flags, tool
choice, and parallel tool-call policy. Unknown top-level chat/text completion
fields and unknown nested message/content, response-format, tool definition, and
tool-choice fields fail closed instead of being silently dropped before proxying
or receipt hashing. Local chat backends reject unsupported message roles instead
of coercing them to user; remote/proxy models preserve roles for upstream
enforcement. Chat receipts also include
effective_prompt when the selected backend can expose the exact prompt
representation it submits to the model. llama.cpp and picolm text-only chat
emit kind = "chat.rendered-prompt" for post-template prompt bytes. mistralrs
text chat emits kind = "chat.prompt-token-ids" for a domain-separated SHA-256
over the token ID sequence produced by mistralrs' own chat tokenization path;
vision and multimodal llama.cpp/picolm/mistralrs requests leave the field absent.
Proxy backends leave the field absent by default, but can include an
upstream-declared digest when
proxy_effective_prompt_digest = true and the upstream implements the configured
digest endpoint. The proxy sends the same OpenAI-compatible chat body used for
inference, including structured multimodal content, tools, tool choice,
parallel tool-call policy, response format, and sampling controls, with
stream = false; the endpoint should return either
{ "sha256": "<64 hex>", "kind": "chat.rendered-prompt", "backend": "..." } or
the same object nested under effective_prompt. Unsupported proxy endpoints are
ignored unless proxy_effective_prompt_digest_required = true; malformed
digests fail closed.
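The two accepted response shapes for the digest endpoint can be illustrated with a small Python parser. This is a sketch under stated assumptions rather than Power's proxy code: the function name is hypothetical, and how Power treats a missing kind or backend field is not specified here, so the sketch simply passes those values through.

```python
import re

SHA256_HEX_RE = re.compile(r"[0-9a-fA-F]{64}")


def parse_effective_prompt_digest(body):
    """Extract the digest object from a flat or nested endpoint response."""
    if not isinstance(body, dict):
        raise ValueError("digest response must be a JSON object")
    # Accept either the flat object or the same object nested under
    # "effective_prompt".
    obj = body.get("effective_prompt", body)
    if not isinstance(obj, dict):
        raise ValueError("effective_prompt must be a JSON object")
    digest = obj.get("sha256")
    if not isinstance(digest, str) or not SHA256_HEX_RE.fullmatch(digest):
        # Malformed digests are rejected rather than ignored.
        raise ValueError("sha256 must be 64 hex characters")
    return {
        "sha256": digest,
        "kind": obj.get("kind"),
        "backend": obj.get("backend"),
    }
```

Raising on a malformed digest mirrors the fail-closed behavior described above; deciding whether a missing endpoint is an error is left to the caller, as it depends on proxy_effective_prompt_digest_required.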
a3s-power-verify can bind a saved receipt back to a saved or fetched
attestation report. Receipt verification first checks the receipt schema,
request type/input-kind pairing, and all receipt digest fields before comparing
the receipt runtime policy with the attested runtime policy. Verifiers can also
pin receipt-level policy with --receipt-model, --receipt-request-type,
--receipt-input-digest, --receipt-decoding-parameters-digest,
--receipt-stream-options-digest, --receipt-stop-tokens-digest,
--receipt-response-format-digest, --receipt-tools-digest,
--receipt-tool-choice-digest, --effective-prompt-digest (all digest pins
are 64-character SHA-256 hex values),
--require-effective-prompt-absent, --effective-prompt-backend, and
--effective-prompt-kind. When the original request JSON is available,
--receipt-chat-request-file or --receipt-completion-request-file recomputes
and compares every request-derived receipt field:
a3s-power-verify --file report.json \
--receipt-file receipt.json \
--receipt-chat-request-file chat-request.json \
--receipt-digest <64-char-hex> \
--receipt-model llama3 \
--receipt-request-type chat-completion \
--receipt-input-digest <64-char-hex> \
--receipt-decoding-parameters-digest <64-char-hex> \
--receipt-stream-options-digest <64-char-hex> \
--receipt-stop-tokens-digest <64-char-hex> \
--allow-offline
Use --effective-prompt-digest <64-char-hex> when receipt policy pins the
exact rendered-prompt or prompt-token-ID digest exposed in effective_prompt.
Use --require-effective-prompt-absent for opaque multimodal paths where the
receipt must prove that Power did not overclaim a post-template prompt digest.
When --receipt-chat-request-file points to an image-bearing request,
a3s-power-verify applies that absence requirement by default unless the
verifier explicitly pins an effective-prompt digest, backend, or kind.
Use --require-runtime-policy --gpu-execution-digest <64-char-hex> when
verifier policy pins the exact GPU execution/offload configuration used by the
attested server.
SDK callers that still have the original request can use
verify_receipt_matches_chat_request() or
verify_receipt_matches_completion_request() to recompute and compare all
request-derived receipt fields before separately checking attestation runtime
policy or effective_prompt pins.
Use this command to calculate the GPU execution pin with Power's own canonicalizer:
a3s-power-verify --print-gpu-execution-digest \
--gpu-layers <N> \
--main-gpu <N> \
--tensor-split <CSV>
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello"}]
}'
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true
}'
curl http://localhost:11434/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"prompt": "Once upon a time"
}'
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "What is the weather in SF?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}
}]
}'
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "List 3 colors with hex codes"}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "color_list",
"strict": true,
"schema": {
"type": "object",
"properties": {
"colors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"hex": {"type": "string"}
},
"required": ["name", "hex"],
"additionalProperties": false
}
}
},
"required": ["colors"],
"additionalProperties": false
}
}
}
}'
Local JSON Schema enforcement requires a backend that can apply the requested
grammar. Power rejects unsupported local backend/schema combinations instead of
silently ignoring response_format; remote models preserve the OpenAI wire
shape for upstream enforcement.
Per-request keep_alive overrides must use the same validated duration format
as configuration. Invalid values are rejected instead of falling back to the
server default.
curl http://localhost:11434/v1/models
Requires the hf feature (cargo build --features hf). Power pulls from
ModelScope by default; set A3S_POWER_MODEL_SOURCE=hf or
A3S_POWER_MODEL_SOURCE=huggingface to use HuggingFace Hub. Any other
configured source value fails closed instead of silently falling back. Streams
SSE progress:
# By quantization tag (resolves filename via HF API)
curl -N http://localhost:11434/v1/models/pull \
-H "Content-Type: application/json" \
-d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M"}'
# By exact filename
curl -N http://localhost:11434/v1/models/pull \
-H "Content-Type: application/json" \
-d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf"}'
# Private/gated model with a hub token
curl -N http://localhost:11434/v1/models/pull \
-H "Content-Type: application/json" \
-d '{"name": "meta-llama/Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf", "token": "hf_..."}'
# Force re-download
curl -N http://localhost:11434/v1/models/pull \
-H "Content-Type: application/json" \
-d '{"name": "bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", "force": true}'
# Check persisted progress for a pull; manual HTTP clients must URL-encode names
# containing "/", ":", spaces, or query-special characters.
curl http://localhost:11434/v1/models/pull/bartowski%2FLlama-3.2-3B-Instruct-GGUF%3AQ4_K_M/status
# The CLI encodes the model name automatically.
a3s-power models status bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
SSE response stream:
data: {"status":"resuming","offset":104857600,"total":2147483648}   (only if resuming)
data: {"status":"downloading","completed":209715200,"total":2147483648}
data: {"status":"verifying"}
data: {"status":"success","id":"bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M","object":"model","created":1234567890}
Interrupted downloads resume automatically on retry: the partial file is identified by a SHA-256 of the canonical download URL and picked up via HTTP Range requests. Hub download and file-list URLs are built with a URL parser, preserving intended repo/file subdirectories while percent-encoding spaces and query-special characters. Set the selected hub's token env var (MODELSCOPE_TOKEN or HF_TOKEN) or A3S_POWER_HUB_TOKEN as an alternative to passing token in the request body.
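A client that consumes the progress stream might decode it along these lines. The sketch assumes each event arrives as a single "data: <json>" line, as in the sample stream above; the function names are hypothetical and error handling is deliberately minimal.

```python
import json


def parse_sse_data(lines):
    """Yield the decoded JSON payload of every SSE data line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


def progress_percent(event):
    """Return download progress in percent, or None for other events."""
    if event.get("status") != "downloading":
        return None
    total = event.get("total")
    if not total:
        return None  # avoid dividing by a missing or zero total
    return 100.0 * event["completed"] / total
```

With the sample "downloading" event above (209715200 of 2147483648 bytes), progress_percent returns roughly 9.77.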
curl http://localhost:11434/health
Models are stored in ~/.a3s/power/ (override with $A3S_POWER_HOME):
~/.a3s/power/
├── config.hcl              # HCL configuration
└── models/
    ├── manifests/          # JSON manifest files
    │   ├── llama3.2-3b.json
    │   └── qwen2.5-7b.json
    └── blobs/              # Content-addressed model files
        ├── sha256-abc123...
        └── sha256-def456...
Model files are stored by SHA-256 hash, enabling deduplication and integrity verification.
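As an illustration of content addressing, the following Python sketch computes a file's SHA-256 in fixed-size chunks, derives a blob file name in the sha256-<hex> style shown in the tree above, and compares the digest against a "sha256:<hex>" pin of the kind used in model_hashes. The function names are hypothetical, and Power's own verification may differ in detail.

```python
import hashlib
import hmac


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large model files are never fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blob_name(digest_hex: str) -> str:
    """Return a content-addressed blob file name such as sha256-<hex>."""
    return f"sha256-{digest_hex}"


def matches_pin(path, pin: str) -> bool:
    """Check a file against a pin written as "sha256:<hex>"."""
    algorithm, _, expected = pin.partition(":")
    if algorithm != "sha256" or not expected:
        raise ValueError("pin must look like sha256:<hex>")
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(sha256_file(path), expected.lower())
```

Because the name is derived from the content, two manifests that reference the same weights resolve to the same blob, which is where the deduplication comes from.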
| Flag | Default | Description |
|---|---|---|
| mistralrs | enabled | Pure Rust inference backend via mistralrs (candle-based). No C++ inference toolchain required. Ideal for TEE auditing. |
| llamacpp | disabled | llama.cpp inference backend via llama-cpp-2. Requires C++ compiler + CMake. Full-featured (KV cache, LoRA, grammar, mirostat). |
| picolm | disabled | Pure Rust layer-streaming GGUF inference. Real transformer ops (multi-head attention, SwiGLU FFN, RoPE, RMSNorm). Peak RAM = O(layer_size) not O(model_size) via madvise(DONTNEED). FP16 KV cache with fused f16 dot/accumulate. Fused dequant+dot kernels. NEON SIMD (aarch64) + AVX2 (x86_64). Batch prefill, speculative decoding, tool calling, grammar-constrained output. 14+ tok/s decode on Apple Silicon. Enables 7B+ models in 512MB TEE EPC. No C/C++ inference backend. ~4,500 lines of pure Rust. |
| hf | disabled | Remote model hub pull (POST /v1/models/pull). Range resume, SSE progress, source-specific hub token auth. |
| tls | disabled | RA-TLS transport: TLS server with self-signed cert + optional attestation X.509 extension. Adds axum-server, rcgen, time deps. |
| vsock | disabled | Vsock transport for a3s-box MicroVM guest-host HTTP. Linux only; requires AF_VSOCK kernel support. Adds tokio-vsock and hyper-util deps. |
| hw-verify | disabled | Hardware attestation signature verification. AMD KDS (ECDSA P-384) + Intel PCS (ECDSA P-256) certificate chain validation. |
| tee-minimal | disabled | Composite: picolm + tls + vsock. Smallest auditable TEE build: no mistralrs/candle and no C++ inference engine. TLS/crypto still uses native ring/aws-lc-sys build dependencies. |
Without a backend feature (mistralrs, llamacpp, or picolm), Power can manage models but inference calls return "backend not available".
For production TEE deployments (AMD SEV-SNP / Intel TDX), use the tee-minimal build profile:
cargo build --release --no-default-features --features tee-minimal

Inside a TEE, every crate in the inference path is part of the trusted computing base. The tee-minimal profile minimizes this surface:
| Profile | Inference backend | Dep tree lines | Native inference deps | Other native deps |
|---|---|---|---|---|
| `default` | mistralrs (candle) | ~2,000 | None | TLS/HTTP crypto crates may build C crypto helpers |
| `tee-minimal` | picolm (pure Rust) | ~1,220 | None | ring/aws-lc-sys via TLS/RA-TLS crypto |
| `llamacpp` | llama.cpp | ~1,800+ | Yes (C++) | C++ compiler + CMake |
- picolm backend: Pure Rust layer-streaming GGUF inference (~4,500 lines, fully auditable). Real transformer ops, 14+ tok/s decode, FP16 KV cache, true O(layer_size) peak RAM.
- Full TEE stack: attestation, model integrity (SHA-256), log redaction, memory zeroing
- Encrypted model loading: AES-256-GCM file-backed loading plus `picolm` GGUF loading from locked plaintext RAM or `LayerStreamingDecryptedModel`; unsupported backends fail closed before load
- RA-TLS transport: attestation embedded in X.509 cert
- Vsock transport: for a3s-box MicroVM guest-host communication
Traditional LLM inference loads the entire model into RAM before generating a single token. A 7B Q4_K_M model needs ~4 GB. Inside a TEE, the Enclave Page Cache (EPC) is often limited to 512 MB to 1 GB. The model simply doesn't fit.
picolm solves this with layer-streaming: instead of loading all weights at once, it memory-maps the GGUF file and processes one transformer layer at a time. Only the current layer's weights occupy physical RAM. After processing, the OS reclaims those pages.
Traditional (mistralrs / llama.cpp):
┌────────────────────────────────────────────────────┐
│ All 32 layers loaded in RAM simultaneously         │
│ Peak RAM ≈ model_size (e.g. 4 GB for 7B Q4_K_M)    │
└────────────────────────────────────────────────────┘
picolm layer-streaming:
┌────────────────────────────────────────────────────┐
│ mmap(model.gguf) → virtual address space only      │
│                    no physical RAM allocated       │
│                                                    │
│ for layer in 0..n_layers:                          │
│   ┌──────────────────────────┐                     │
│   │ blk.{layer}.* tensors    │ ← OS pages in       │
│   │ (~120 MB for 7B Q4_K_M)  │   weights on demand │
│   └──────────────────────────┘                     │
│   forward_pass(hidden_state, layer_weights)        │
│   madvise(MADV_DONTNEED) → release physical pages  │
│                                                    │
│ Peak RAM ≈ layer_size + KV cache (FP16)            │
│          ≈ 120 MB + 44 MB (7B, 2048 ctx)           │
└────────────────────────────────────────────────────┘
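The peak-RAM estimate in the diagram can be reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope model, not code from picolm: it assumes weights are spread evenly across layers and ignores the embedding and output tensors, so it lands slightly above the ~120 MB per-layer figure quoted above. The function names are hypothetical.

```rust
/// Decimal megabyte, used only to keep the example numbers readable.
const MB: u64 = 1_000_000;

/// Approximate weight bytes resident while one layer is being processed,
/// assuming the model's bytes are spread evenly across its layers.
fn layer_size_bytes(model_bytes: u64, n_layers: u64) -> u64 {
    model_bytes / n_layers.max(1)
}

/// Peak RAM under layer-streaming: one layer's weights plus the KV cache.
fn peak_ram_bytes(model_bytes: u64, n_layers: u64, kv_cache_bytes: u64) -> u64 {
    layer_size_bytes(model_bytes, n_layers) + kv_cache_bytes
}
```

For a 4 GB model with 32 layers and the 44 MB KV cache from the diagram, `peak_ram_bytes(4000 * MB, 32, 44 * MB)` comes to 169 MB, compared with roughly 4 GB when every layer is resident.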
The implementation has two components:
1. gguf_stream.rs: Zero-Copy GGUF Parser
Opens the GGUF file via `mmap` with `PROT_READ` and `MAP_PRIVATE`. Parses the header (v2/v3), metadata, and tensor descriptors, but does not load any weight data. Each tensor is recorded as an (offset, size) pair into the mmap region.
When picolm requests a layer's weights, `tensor_bytes(name)` returns a `&[u8]` slice directly into the mmap, with zero copy and zero allocation. The OS kernel pages in the data on first access and can evict it under memory pressure.
GGUF file on disk:
┌────────┬──────────┬────────────────────────────────────┐
│ Header │ Metadata │ Tensor Data (aligned)              │
│ 8 bytes│ variable │ blk.0.attn_q | blk.0.attn_k | ...  │
└────────┴──────────┴────────────────────────────────────┘
                                       ↑
                           mmap returns &[u8] slice
                           directly into this region
                           (no memcpy, no allocation)
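The (offset, size) bookkeeping can be illustrated without an actual mmap. In the sketch below a plain byte slice stands in for the mapped file, and `TensorIndex` with its `register` method is a hypothetical stand-in for the parser's descriptor table; only the `tensor_bytes(name)` lookup mirrors the name used above.

```rust
use std::collections::HashMap;

/// Tensor descriptors over a borrowed byte region (a stand-in for the mmap).
struct TensorIndex<'a> {
    data: &'a [u8],
    tensors: HashMap<String, (usize, usize)>, // name -> (offset, size)
}

impl<'a> TensorIndex<'a> {
    fn new(data: &'a [u8]) -> Self {
        TensorIndex { data, tensors: HashMap::new() }
    }

    /// Records a descriptor; rejects ranges that fall outside the region.
    fn register(&mut self, name: &str, offset: usize, size: usize) -> bool {
        match offset.checked_add(size) {
            Some(end) if end <= self.data.len() => {
                self.tensors.insert(name.to_string(), (offset, size));
                true
            }
            _ => false,
        }
    }

    /// Returns a slice borrowed from the region: no copy, no allocation.
    fn tensor_bytes(&self, name: &str) -> Option<&'a [u8]> {
        let &(offset, size) = self.tensors.get(name)?;
        let data: &'a [u8] = self.data;
        Some(&data[offset..offset + size])
    }
}
```

Validating the range once at registration keeps the lookup itself free of failure paths other than an unknown name.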
2. picolm.rs + picolm_ops/: Layer-Streaming Forward Pass
Iterates blk.0.* through blk.{n-1}.*, applying each layer's weights to the hidden state. After processing layer N, `madvise(MADV_DONTNEED)` explicitly releases the physical pages. On Linux the kernel drops those pages immediately, so they are reclaimed before layer N+1 is paged in; this is what keeps peak RAM at O(layer_size).
Key optimizations:
- TensorCache: All tensor byte slices and types resolved once at load time into a flat array. The hot path indexes by `layer * 10 + slot`: zero string formatting, zero HashMap lookups.
- ForwardBuffers: All working buffers (q, k, v, gate, up, down, normed, logits, scores, attn_out) pre-allocated once. Zero heap allocation during inference.
- Fused vec_dot: Dequant+dot in a single pass per row, with no intermediate f32 buffer. Dedicated kernels for Q4_K, Q6_K, Q8_0.
- Rayon parallel matmul: Multi-threaded row parallelism for matrices with >64 rows.
- FP16 KV cache: Keys and values stored as `f16`, converted on read. Halves KV cache memory.
- Pre-computed RoPE: cos/sin tables built at load time. No transcendental functions in the hot path.
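The flat-array indexing behind the TensorCache can be sketched like this. The slot names and the `FlatTensorCache` type are assumptions chosen for illustration (the real slot layout is not shown in this README); only the `layer * 10 + slot` arithmetic comes from the description above.

```rust
/// Ten slots per layer, matching the `layer * 10 + slot` indexing.
const SLOTS_PER_LAYER: usize = 10;

/// Hypothetical slot assignment for one transformer block.
#[derive(Clone, Copy)]
enum Slot {
    AttnNorm = 0,
    AttnQ = 1,
    AttnK = 2,
    AttnV = 3,
    AttnOut = 4,
    FfnNorm = 5,
    FfnGate = 6,
    FfnUp = 7,
    FfnDown = 8,
}

/// Tensor byte slices resolved once, then looked up by integer index.
struct FlatTensorCache<'a> {
    entries: Vec<Option<&'a [u8]>>,
}

impl<'a> FlatTensorCache<'a> {
    fn new(n_layers: usize) -> Self {
        FlatTensorCache { entries: vec![None; n_layers * SLOTS_PER_LAYER] }
    }

    /// Called at load time; panics if `layer` is out of range.
    fn set(&mut self, layer: usize, slot: Slot, bytes: &'a [u8]) {
        self.entries[layer * SLOTS_PER_LAYER + slot as usize] = Some(bytes);
    }

    /// Hot-path lookup: one multiply, one add, one bounds-checked index.
    fn get(&self, layer: usize, slot: Slot) -> Option<&'a [u8]> {
        self.entries
            .get(layer * SLOTS_PER_LAYER + slot as usize)
            .copied()
            .flatten()
    }
}
```

Compared with formatting a name such as `blk.3.attn_q.weight` and hashing it on every token, the lookup touches no strings at all.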
// Simplified flow (actual code in src/backend/picolm.rs)
let gguf = GgufFile::open("model.gguf")?;       // mmap, parse header only
let tc = TensorCache::build(&gguf, n_layers)?;  // resolve tensor pointers once
let rope_table = RopeTable::new(max_seq, head_dim, rope_dim, theta);
let mut hidden = vec![0.0f32; n_embd];
let mut buf = ForwardBuffers::new(/* pre-allocate all working buffers */);
for layer in 0..n_layers {
    attention_layer(&mut hidden, &tc, layer, pos, kv_cache, &rope_table, &mut buf)?;
    ffn_layer(&mut hidden, &tc, layer, activation, &mut buf)?;
    tc.release_layer(&gguf, layer);  // madvise(DONTNEED): free physical pages
}

For encrypted models (.enc), `LayerStreamingDecryptedModel` exposes chunked plaintext access where each returned chunk is wrapped in `Zeroizing<Vec<u8>>`. `streaming_decrypt = true` passes this source to backends that explicitly support it. Today that means picolm for GGUF models; unsupported backends fail closed before load.
- Chunk buffers are zeroized when dropped
- The full decrypted plaintext is still held in locked memory because the current AES-GCM artifact format is not independently seekable
- End-to-end inference from chunked plaintext requires a backend loader that consumes this source directly
Encrypted layer-streaming:
┌────────────────────────────────────────────────────────┐
│ model.gguf.enc (AES-256-GCM encrypted on disk)         │
│                                                        │
│ after AES-GCM authentication + decrypt to locked RAM:  │
│ for each requested range:                              │
│   chunk = read_chunk(layer_offset, layer_len)          │
│   chunk: Zeroizing<Vec<u8>> ← auto-zeroed on drop      │
│   // future backend path consumes chunk directly       │
│   // chunk dropped → chunk memory zeroed immediately   │
└────────────────────────────────────────────────────────┘
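The zero-on-drop behaviour of each chunk can be sketched with the standard library alone. `WipedChunk` below is a simplified stand-in for `Zeroizing<Vec<u8>>`, and this `read_chunk` is a hypothetical free function over an already-decrypted buffer rather than the real method; production code should rely on the `zeroize` crate.

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Owned copy of a plaintext range that is overwritten with zeros on drop.
struct WipedChunk {
    bytes: Vec<u8>,
}

impl WipedChunk {
    /// Volatile writes keep the compiler from eliding the wipe as a dead store.
    fn wipe(&mut self) {
        for b in self.bytes.iter_mut() {
            unsafe { ptr::write_volatile(b, 0) };
        }
        compiler_fence(Ordering::SeqCst);
    }
}

impl Drop for WipedChunk {
    fn drop(&mut self) {
        self.wipe();
    }
}

/// Copies `len` bytes starting at `offset`; `None` if the range is invalid.
fn read_chunk(plaintext: &[u8], offset: usize, len: usize) -> Option<WipedChunk> {
    let end = offset.checked_add(len)?;
    let slice = plaintext.get(offset..end)?;
    Some(WipedChunk { bytes: slice.to_vec() })
}
```

This only covers the chunk copies; as noted above, the full decrypted plaintext still sits in locked memory while the handle is live.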
| Model | Traditional | picolm Layer-Streaming | Reduction |
|---|---|---|---|
| 0.5B Q4_K_M (~350 MB) | ~350 MB | ~15 MB + KV | 23× |
| 3B Q4_K_M (~2 GB) | ~2 GB | ~60 MB + KV | 33× |
| 7B Q4_K_M (~4 GB) | ~4 GB | ~120 MB + KV | 33× |
| 13B Q4_K_M (~7 GB) | ~7 GB | ~200 MB + KV | 35× |
| 70B Q4_K_M (~40 GB) | ~40 GB | ~1.1 GB + KV | 36× |
KV cache uses FP16 storage (half the memory of F32). For 7B at 2048 context: ~44 MB.
picolm is a production-ready pure Rust inference engine. The full transformer forward pass is implemented:
- Attention: Multi-head attention with Grouped-Query Attention (GQA), Q/K/V bias support (Qwen, Phi)
- FFN: SwiGLU (LLaMA, Mistral, Phi) and GeGLU (Gemma) activation variants
- RoPE: Pre-computed cos/sin tables with partial-dimension support
- RMSNorm: On-the-fly dequantization per layer (output norm pre-dequantized)
- Dequantization: Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32
- Fused vec_dot: Dequant+dot in a single pass, with no intermediate f32 buffer
- Parallel matmul: Rayon multi-threaded row parallelism for large matrices
- FP16 KV cache: Half-precision storage with fused f16→f32 dot product and accumulate, with no intermediate buffer in attention
- Tensor cache: Pre-resolved tensor pointers, zero HashMap lookups in the hot path
- Pre-allocated buffers: Zero heap allocation during inference (including sampler probs/indices)
- True layer-streaming: `madvise(MADV_DONTNEED)` releases physical pages after each layer
- BPE tokenizer: Full GPT-style byte-pair encoding with ChatML template support
- Batch prefill: Process prompt tokens in batch for faster time-to-first-token
- Speculative decoding: Prompt-lookup draft for faster decode throughput
- Tool/function calling: OpenAI-compatible `tool_calls` with auto-dispatch
- Grammar-constrained output: JSON Schema enforcement during generation
- Repeat/frequency/presence penalty: Configurable repetition control (zero-alloc, stack-based dedup)
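To make the pre-computed RoPE item concrete, here is a minimal sketch of a cos/sin table with partial-dimension support. It is not picolm's `RopeTable` (whose constructor takes different arguments); the struct name, the interleaved pairing of elements, and the frequency formula `theta^(-2i/rope_dim)` are assumptions based on the common RoPE formulation.

```rust
/// cos/sin values for every (position, frequency) pair, built once.
struct RopeSketch {
    cos: Vec<f32>,
    sin: Vec<f32>,
    half_dim: usize,
}

impl RopeSketch {
    /// All transcendental math happens here, at load time.
    fn new(max_seq: usize, rope_dim: usize, theta: f32) -> Self {
        let half_dim = rope_dim / 2;
        let mut cos = Vec::with_capacity(max_seq * half_dim);
        let mut sin = Vec::with_capacity(max_seq * half_dim);
        for pos in 0..max_seq {
            for i in 0..half_dim {
                let exponent = -2.0 * (i as f32) / (rope_dim as f32);
                let angle = (pos as f32) * theta.powf(exponent);
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        RopeSketch { cos, sin, half_dim }
    }

    /// Rotates adjacent pairs in the first `2 * half_dim` elements of `x`.
    /// Elements past that range are left untouched (partial-dimension RoPE).
    /// Only table lookups and multiply-adds run per token.
    fn apply(&self, x: &mut [f32], pos: usize) {
        for i in 0..self.half_dim {
            let c = self.cos[pos * self.half_dim + i];
            let s = self.sin[pos * self.half_dim + i];
            let (a, b) = (x[2 * i], x[2 * i + 1]);
            x[2 * i] = a * c - b * s;
            x[2 * i + 1] = a * s + b * c;
        }
    }
}
```

Each rotation preserves the length of the pair it acts on, which gives a cheap sanity check when validating a kernel against a reference.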
Performance on Qwen 2.5 0.5B Q4_K_M (Apple Silicon):
- Decode: 14+ tok/s
- Prefill: 15+ tok/s
- 900+ tests across unit, integration, and real-model validation profiles
Profiling breakdown of the decode hot path (per token):
| Stage | % Time | Status |
|---|---|---|
| Embedding lookup | 0.3% | Optimized |
| Attention (Q·K scores + V weighted sum) | 22.1% | Fused f16 KV dot/accumulate, NEON softmax |
| FFN (gate + up + down matvec) | 63.4% | Fused vec_dot, Rayon parallel, NEON SiLU/residual |
| Logit projection | 9.1% | Rayon parallel matmul |
| Sampling | 0.3% | Zero-alloc (pre-allocated probs/indices) |
Completed optimizations:
- NEON SIMD for softmax, RMSNorm, SiLU, add_residual (aarch64)
- AVX2 SIMD for Q4_K, Q6_K vec_dot kernels (x86_64)
- Q4_K NEON kernel: register-based nibble extraction via `vld1_lane_u32` + `vand`/`vshr`
- Fused f16 KV attention: `k_dot()` and `v_accumulate()` skip intermediate f32 buffer
- Zero-alloc sampler: pre-allocated `probs_buf` and `indices_buf` in `ForwardBuffers`
- Zero-alloc repeat penalty: stack-based `[(u32, u32); 64]` dedup, no HashMap
- Pre-computed RoPE cos/sin tables: no transcendental functions in hot path
- TensorCache: flat array indexed by `layer * SLOTS + slot`, zero HashMap lookups
- ForwardBuffers: all working buffers pre-allocated, zero heap allocation per token
- FP16 KV cache: halves memory via `half` crate batch SIMD conversion
- Rayon parallel matmul: multi-threaded row parallelism for matrices with >64 rows
- Decode profiling instrumentation: per-stage timing breakdown for continuous optimization
Remaining optimization opportunities (diminishing returns):
- Block-wise quantized matmul: process multiple output rows per pass for better cache locality
- Integer-only Q4_K accumulation: accumulate in i32, avoid f32 conversion overhead
- Tiled matmul with explicit prefetch hints: improve L1/L2 cache utilization
- Fused gate+up projection: single matmul pass if weight layout permits
- AMX/SME acceleration: Apple Silicon matrix coprocessor (requires nightly Rust)
# config.hcl: TEE deployment with file-backed encrypted-model loading
tee_mode = true
redact_logs = true
# File-backed DecryptedModel loading works with file-based backends.
# in_memory_decrypt works for GGUF models when the selected backend is picolm;
# other backends fail closed rather than reading the encrypted path.
# Direct plaintext-buffer mode:
# in_memory_decrypt = true
# LayerStreamingDecryptedModel mode; requires a supporting backend such as picolm GGUF:
# streaming_decrypt = true

See docs/supply-chain.md for:
- Full dependency listing per feature profile
- Audit status for each crate in the `tee-minimal` inference path
- Security properties of `LayerStreamingDecryptedModel`
- How to reproduce dependency counts and audit unsafe blocks

See docs/hardware-verifier-operations.md for strict AMD SEV-SNP / Intel TDX verifier operations.
# Build with TLS support
cargo build --features tls
# Test TLS cert generation
cargo test --features tls -p a3s-power tee::cert

To enable RA-TLS, set `tls_port` and `ra_tls = true` alongside `tee_mode = true`:

tee_mode = true
tls_port = 11443
ra_tls = true

At startup, the TLS server binds on the configured port with a fresh self-signed ECDSA P-256 certificate. When `ra_tls = true`, startup first requires a TEE provider to generate an attestation report and embeds it as OID extension 1.3.6.1.4.1.56560.1.1; report generation failures abort startup before the TLS listener is bound. Clients can extract and verify this extension to confirm they are communicating with a genuine TEE before trusting inference output.
# Build
cargo build -p a3s-power # Debug (default: mistralrs)
cargo build -p a3s-power --release # Release
cargo build -p a3s-power --no-default-features --features llamacpp # With llama.cpp
# Test (900+ tests across current validation profiles)
cargo test -p a3s-power --lib -- --test-threads=1
cargo test -p a3s-power --test integration
# Test with TLS feature
cargo test -p a3s-power --features tls --lib -- --test-threads=1
# Lint
cargo clippy -p a3s-power -- -D warnings
cargo fmt -p a3s-power -- --check
# Run
cargo run -p a3s-power # Start server

power/
├── Cargo.toml
├── justfile  # Build, test, coverage, lint, CI targets
├── README.md
└── src/
    ├── main.rs  # Entry point: load HCL config → server::start()
    ├── lib.rs  # Module declarations
    ├── config.rs  # PowerConfig (HCL deserialization + env overrides)
    ├── dirs.rs  # Platform paths (~/.a3s/power/{manifests,blobs,pulls})
    ├── error.rs  # PowerError enum (14 variants) + HTTP status mapping
    │
    ├── api/  # API layer: OpenAI-compatible HTTP handlers
    │   ├── mod.rs  # Shared utilities, timestamp helpers
    │   ├── types.rs  # OpenAI request/response types (chat, completion, embedding)
    │   ├── receipt.rs  # Request-level attestation receipt hashing
    │   ├── health.rs  # GET /health (TEE status, version, uptime, loaded models)
    │   ├── autoload.rs  # Model lifecycle: LRU eviction → decrypt → verify → load
    │   └── openai/  # OpenAI-compatible endpoint handlers
    │       ├── mod.rs  # Route definitions, openai_error() helper
    │       ├── chat.rs  # POST /v1/chat/completions (streaming SSE + JSON)
    │       ├── completions.rs  # POST /v1/completions
    │       ├── embeddings.rs  # POST /v1/embeddings
    │       ├── models.rs  # GET/POST/DELETE /v1/models, POST /v1/models/pull
    │       └── attestation.rs  # GET /v1/attestation (nonce + model hash binding)
    │
    ├── backend/  # Backend layer: inference engine abstraction
    │   ├── mod.rs  # Backend trait (8 methods) + BackendRegistry (priority, TEE routing)
    │   ├── types.rs  # ChatRequest, ChatResponseChunk, EmbeddingRequest, Tool, ToolCall
    │   ├── mistralrs_backend.rs  # Pure Rust: GGUF/SafeTensors/HF/Vision, ISQ (feature: mistralrs, default)
    │   ├── llamacpp.rs  # C++ bindings: KV cache, LoRA, MTMD vision, grammar (feature: llamacpp)
    │   ├── picolm.rs  # Pure Rust layer-streaming, O(layer_size) RAM (feature: picolm)
    │   ├── picolm_ops/  # picolm transformer ops (~4,500 lines, pure Rust)
    │   │   ├── attention.rs  # Multi-head / GQA attention with Q/K/V bias support
    │   │   ├── buffers.rs  # Pre-allocated working buffers (zero heap alloc in hot path)
    │   │   ├── dequant.rs  # Dequantization kernels (Q4_K, Q5_K, Q6_K, Q8_0, F16, F32)
    │   │   ├── ffn.rs  # SwiGLU / GeGLU feed-forward network
    │   │   ├── kv_cache.rs  # FP16 KV cache (half memory vs F32)
    │   │   ├── matmul.rs  # Fused vec_dot + rayon parallel matmul
    │   │   ├── norm.rs  # RMSNorm (raw + pre-dequantized weights)
    │   │   ├── rope.rs  # RoPE with pre-computed cos/sin tables
    │   │   ├── tensor_cache.rs  # Per-layer tensor pointer cache (zero HashMap lookups)
    │   │   ├── tokenizer.rs  # BPE tokenizer with ChatML template support
    │   │   └── vec_dot.rs  # Fused dequant+dot kernels (Q4_K, Q6_K, Q8_0)
    │   ├── chat_template.rs  # Jinja2 chat template rendering (ChatML/Llama/Phi/Generic)
    │   ├── gpu.rs  # Metal + CUDA detection, auto gpu_layers config
    │   ├── json_schema.rs  # JSON Schema → GBNF grammar for constrained output
    │   ├── tool_parser.rs  # Tool call parsing (XML/Hermes, Mistral, raw JSON)
    │   ├── think_parser.rs  # Streaming <think> block extraction (DeepSeek-R1, QwQ)
    │   ├── gguf_stream.rs  # GGUF v2/v3 mmap reader for picolm layer-streaming
    │   └── test_utils.rs  # MockBackend for testing
    │
    ├── model/  # Model layer: storage, registry, pull
    │   ├── mod.rs  # Module declarations
    │   ├── manifest.rs  # ModelManifest, ModelFormat (Gguf/SafeTensors/HuggingFace/Vision)
    │   ├── registry.rs  # ModelRegistry (RwLock<HashMap>, JSON manifest persistence)
    │   ├── storage.rs  # Content-addressed blob store (SHA-256 naming, prune)
    │   ├── gguf.rs  # GGUF metadata reader, memory estimation (KV cache + compute)
    │   ├── pull.rs  # HuggingFace Hub pull with Range resume, SSE progress (feature: hf)
    │   └── pull_state.rs  # Persistent pull state (Pulling/Done/Failed) as JSON
    │
    ├── server/  # Server layer: transport, auth, metrics, audit
    │   ├── mod.rs  # Server startup orchestration (TCP/TLS/Vsock), graceful shutdown
    │   ├── state.rs  # AppState: model lifecycle, LRU, decrypted model RAII, privacy
    │   ├── router.rs  # Axum router + middleware: rate limit, request ID, metrics, auth
    │   ├── auth.rs  # AuthProvider trait, ApiKeyAuth (SHA-256, constant-time)
    │   ├── audit.rs  # AuditLogger trait: JSONL / Encrypted / Async / Noop
    │   ├── metrics.rs  # Prometheus metrics (16 groups: HTTP, inference, TTFT, GPU, TEE)
    │   ├── request_context.rs  # Per-request context (request_id, auth_id, created_at)
    │   ├── lock.rs  # Shared RwLock helpers
    │   └── vsock.rs  # AF_VSOCK transport (feature: vsock, Linux only)
    │
    ├── tee/  # TEE layer: cross-cutting security
    │   ├── mod.rs  # Module entry
    │   ├── attestation.rs  # TeeProvider trait, SEV-SNP/TDX ioctl, report_data binding
    │   ├── encrypted_model.rs  # AES-256-GCM: DecryptedModel / MemoryDecrypted / LayerStreaming
    │   ├── key_provider.rs  # KeyProvider trait: StaticKeyProvider + RotatingKeyProvider
    │   ├── model_seal.rs  # SHA-256 integrity + Ed25519 signature verification
    │   ├── policy.rs  # TeePolicy trait: allowlist + measurement pinning
    │   ├── privacy.rs  # PrivacyProvider: log redaction (10 keys), SensitiveString, zeroize
    │   ├── epc.rs  # EPC memory detection (/proc/meminfo), 75% threshold routing
    │   └── cert.rs  # RA-TLS X.509 cert with attestation extension (feature: tls)
    │
    ├── verify/  # Verify layer: client-side attestation SDK
    │   ├── mod.rs  # verify_report(), nonce/hash/measurement binding (constant-time)
    │   └── hw_verify.rs  # SevSnpVerifier (AMD KDS) + TdxVerifier (Intel PCS)
    │
    └── bin/
        └── a3s-power-verify.rs  # CLI for strict attestation report verification
A3S Power is the inference engine of the A3S privacy-preserving AI platform. It runs inside a3s-box MicroVMs to provide hardware-isolated LLM inference.
┌──────────────────────────────────────────────────────────────────┐
│                          A3S Ecosystem                           │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ a3s-box MicroVM (AMD SEV-SNP / Intel TDX)                  │  │
│  │ ┌────────────────────────────────────────────────────────┐ │  │
│  │ │ a3s-power                                              │ │  │
│  │ │ OpenAI API ↔ Vsock/RA-TLS ↔ host                       │ │  │
│  │ └────────────────────────────────────────────────────────┘ │  │
│  │ Hardware-encrypted memory: host cannot read                │  │
│  └────────────────────────────────────────────────────────────┘  │
│       ▲ Vsock                                                    │
│       │                                                          │
│  ┌────┴─────────┐ ┌──────────────┐ ┌──────────────────────────┐  │
│  │ a3s-gateway  │ │ a3s-event    │ │ a3s-code                 │  │
│  │ (API route)  │ │ (event bus)  │ │ (AI coding agent)        │  │
│  └──────────────┘ └──────────────┘ └──────────────────────────┘  │
│                                                                  │
│  Client-side:                                                    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ a3s-power verify SDK                                       │  │
│  │ Nonce binding · Model hash binding · HW signature check    │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
| Component | Relationship to Power |
|---|---|
| a3s-box | Hosts Power inside TEE-enabled MicroVMs (AMD SEV-SNP / Intel TDX) |
| a3s-code | Uses Power as a local inference backend |
| a3s-gateway | Routes inference requests to Power instances |
| a3s-event | Distributes inference events across the platform |
| verify SDK | Client-side attestation verification (nonce, model hash, HW signature) |
- Core inference engine (llama.cpp, chat templates, tool calling, structured output, thinking)
- Pure Rust inference backend: `mistralrs` feature (default), GGUF inference via candle with no C++ dependency; ideal for TEE supply-chain auditing
- OpenAI-compatible API (`/v1/chat/completions`, `/v1/completions`, `/v1/models`, `/v1/embeddings`)
- Content-addressed model storage with SHA-256
- GPU auto-detection and acceleration (Metal, CUDA, multi-GPU)
- KV cache reuse with prefix matching
- Prometheus metrics and health endpoint
- TEE refactoring: removed Ollama compatibility layer (~6,900 lines deleted)
- HCL-only configuration (removed TOML)
- TEE awareness: `TeeProvider` trait, `DefaultTeeProvider` (SEV-SNP, TDX, Simulated)
- Model integrity verification: SHA-256 at startup
- Privacy protection: `PrivacyProvider` trait, log redaction
- TEE status in `/health` endpoint
- Attestation endpoint: `GET /v1/attestation` for clients to verify TEE
- Memory zeroing: `zeroize` crate, `SensitiveString` auto-zeroize wrapper
- Encrypted model loading: AES-256-GCM, `DecryptedModel` RAII secure wipe, key from file/env
- PrivacyProvider integrated into inference chain: prompt/response wrapped in `SensitiveString`, `sanitize_log` applied at every log site
- EncryptedModel integrated into autoload: `.enc` models auto-detected, decrypted, RAII cleanup on unload/eviction
- TEE metrics: Prometheus counters for attestation reports, model decryptions, and log redactions
- Attestation nonce: `?nonce=<hex>` binds client nonce into `report_data` to prevent replay attacks
- RA-TLS transport: `tls` feature with self-signed ECDSA P-256 cert; `ra_tls = true` embeds JSON attestation report as custom X.509 extension (OID 1.3.6.1.4.1.56560.1.1); TLS server spawned in parallel with plain HTTP
- Vsock transport: `vsock` feature (Linux only), AF_VSOCK server for a3s-box MicroVM guest-host HTTP communication; uses same axum router as TCP; no network config required inside the VM
- SEV-SNP ioctl: real `/dev/sev-guest` ioctl (`SNP_GET_REPORT`) for hardware attestation reports; extracts `report_data` (64 bytes) and `measurement` (48 bytes) from firmware response; full raw report included for client-side verification
- TDX ioctl: real `/dev/tdx-guest` ioctl (`TDX_CMD_GET_REPORT0`) for hardware attestation reports; extracts `reportdata` (64 bytes) and `mrtd` (48 bytes) from TDREPORT; supports both `/dev/tdx-guest` and `/dev/tdx_guest` device paths
- KeyProvider trait: `StaticKeyProvider` (wraps file/env key source) + `RotatingKeyProvider` (multiple keys, zero-downtime rotation via `rotate_key()`); initialized on server startup; `AppState.key_provider` field
- Deep log redaction: `PrivacyProvider` covers 10 sensitive JSON keys; `sanitize_error()` strips prompt fragments from error messages
- Token metric suppression: `suppress_token_metrics` config rounds token counts to nearest 10 to prevent side-channel inference
- In-memory encrypted-model backend loading: `in_memory_decrypt` decrypts into `MemoryDecryptedModel` locked RAM and loads GGUF plaintext through `picolm`; unsupported backends fail closed before load
- Rate limiting: token-bucket middleware (`rate_limit_rps`) + concurrency cap (`max_concurrent_requests`) on `/v1/*`; returns `429` with OpenAI-style error
- Model/runtime/GPU-attestation binding: `AttestationClaimsV2` + `sha256(canonical_claims_v2)` in CPU TEE `report_data`; `GET /v1/attestation?model=<name>` re-hashes the current local model artifact, including deterministic directory manifests and encrypted plaintext/ciphertext claims, and fails on missing or stale hashes; model-bound claims include applied chat-template digests plus canonical GPU execution/offload digests; `gpu-confidential` mode binds NVIDIA GPU CC evidence, NRAS verdict digests, and structured NVIDIA device identity/freshness claims from live `nvattest-cli` collection or direct `nras-rest` attestation and requires a 32-byte nonce
- Embedding model support: `ModelFormat::HuggingFace` variant; `MistralRsBackend` loads HF embedding models via `EmbeddingModelBuilder` with local path; `POST /v1/embeddings` fully functional; register with `format=huggingface`
- SafeTensors inference: `ModelFormat::SafeTensors` variant; `MistralRsBackend` loads local safetensors chat models via `TextModelBuilder` with ISQ on-load quantization; ISQ type configurable via `default_parameters.isq` (Q4_0, Q4K, Q6K, Q8_0, HQQ4, HQQ8, etc.); omitted ISQ defaults to Q8_0, while explicit invalid ISQ values fail closed; register with `format=safetensors`
- Client attestation verification SDK: `verify` module with `verify_report()`, `verify_report_strict()`, `VerificationPolicy`, `ExpectedGpuEvidence`, `ExpectedGpuDevices`, `ExpectedReceipt`, `verify_nonce_binding()`, `verify_model_hash_binding()`, `verify_claims_gpu_evidence_binding()`, `verify_claims_expected_gpu_evidence()`, `verify_claims_gpu_device_claims()`, `verify_claims_expected_gpu_devices()`, `verify_claims_runtime_policy_binding()`, `verify_receipt_well_formed()`, `verify_receipt_policy()`, `verify_receipt_matches_chat_request()`, `verify_receipt_matches_completion_request()`, `verify_receipt_against_attestation()`, `verify_receipt_digest_hex()`, `verify_receipt_effective_prompt_digest_hex()`, and `verify_measurement()`; `HardwareVerifier` trait for pluggable hardware signature verification; strict verification requires hardware signatures and an expected launch measurement; `VerificationPolicy::gpu_confidential()` and `a3s-power-verify --gpu-confidential` bundle production NVIDIA GPU confidential-computing checks and require a 32-byte nonce, top-level GPU evidence nonce, `--gpu-verdict-digest`, GPU provider/format/count, exact GPU/NVSwitch topology, claims schema version, and identity/version pins; `a3s-power-verify` defaults to strict mode, requires `--expected-measurement`, requires `--allow-offline` to skip hardware signatures/measurement pinning, supports hardware certificate cache TTL tuning with `--hw-cert-cache-ttl-secs`, GPU provider/format/count, GPU execution digest, exact GPU/NVSwitch count pins, GPU/NVSwitch claims version pins, and device identity pins including UEID/OEM ID plus `--receipt-file` / `--receipt-digest` / `--receipt-model` / `--receipt-request-type` / `--receipt-chat-request-file` / `--receipt-completion-request-file` / `--receipt-input-digest` / receipt decoding, stream-options, and output-policy digest pins / `--effective-prompt-digest` for attestation-to-receipt verification, and requires `--nonce` when GPU evidence, device-claim, or identity pinning is used
- Graceful shutdown: SIGTERM + Ctrl-C handled via `shutdown_signal()`; unloads all models (triggers RAII zeroize of decrypted weights); flushes audit log via `AuditLogger::flush()` before exit; `AsyncJsonLinesAuditLogger` flush uses oneshot channel to wait for background writer to drain
- Remote model hub pull: `hf` feature, `POST /v1/models/pull` downloads GGUF models from ModelScope or HuggingFace Hub; supports `owner/repo:Q4_K_M` (resolves filename via hub API) and `owner/repo/file.gguf` (direct); streams SSE progress events (`resuming`, `downloading`, `verifying`, `success`); resume interrupted downloads via HTTP Range requests (deterministic partial filename = SHA-256 of the canonical URL); hub/API URLs percent-encode repo, filename, and query components while preserving intended subdirectories; source-specific token auth for private/gated models via `token` request field, `MODELSCOPE_TOKEN` / `HF_TOKEN`, or `A3S_POWER_HUB_TOKEN`; stores in content-addressed blob store; SHA-256 verified; `force` flag for re-download
- Pull concurrent control: `Mutex<HashSet>` in `AppState` deduplicates concurrent pulls of the same model; returns `409 Conflict` if a pull is already in progress
- Pull progress persistence: JSON state files in `~/.a3s/power/pulls/`; `GET /v1/models/pull/:name/status` returns `{status, completed, total, error}` and accepts URL-encoded model names; survives server restarts; throttled writes (every 5%) to minimize disk I/O
- True token-by-token streaming: `stream_chat_request` replaces non-streaming path; each `Response::Chunk` forwarded immediately via mpsc channel; `Response::Done` sets `finish_reason`
- Request-level inference receipts: `/v1/chat/completions` and `/v1/completions` return v2 `attestation_receipt` plus `attestation_receipt_sha256`; receipts include model runtime chat-template/GPU execution policy claims, request decoding/output policy digests, and stream-options digests; streaming responses emit the receipt in a final SSE event before `[DONE]`
- Effective prompt digest coverage for deterministic chat paths: llama.cpp and picolm text-only chat return local rendered-prompt digests; mistralrs text chat returns a domain-separated prompt-token-ID digest; proxy backends can include an upstream-declared digest through the opt-in `/v1/chat/effective-prompt-digest` contract
- Effective prompt digest coverage for remaining opaque renderers: llama.cpp, picolm, and mistralrs vision/multimodal paths must either expose exact prompt representations or continue leaving `effective_prompt` absent
- Vision/multimodal inference: `ModelFormat::Vision` variant; `MistralRsBackend` loads vision models via `VisionModelBuilder` with ISQ; base64 images accepted via `images` field or OpenAI `image_url` content parts; decoded with `image` + `base64` crates
- picolm backend: pure Rust layer-streaming GGUF inference (`picolm` feature); real transformer forward pass (multi-head/GQA attention, SwiGLU/GeGLU FFN, RoPE, RMSNorm); fused dequant+dot kernels (Q4_K, Q6_K, Q8_0); rayon parallel matmul; FP16 KV cache; pre-computed RoPE tables; tensor cache (zero HashMap lookups); pre-allocated buffers (zero heap allocation in hot path); true O(layer_size) peak RAM via `madvise(MADV_DONTNEED)` page release; BPE tokenizer with ChatML template; 14+ tok/s decode on Apple Silicon; ~4,500 lines of pure Rust; no C/C++ inference backend
- picolm features: batch prefill (faster time-to-first-token); speculative decoding via prompt-lookup; tool/function calling (OpenAI-compatible `tool_calls`); grammar-constrained structured output (JSON Schema enforcement); repeat/frequency/presence penalty
- picolm SIMD: NEON (aarch64): softmax, RMSNorm, SiLU, add_residual, Q4_K nibble extraction; AVX2 (x86_64): Q4_K, Q6_K vec_dot kernels
- picolm performance: fused f16 KV attention (`k_dot`/`v_accumulate` skip intermediate f32 buffer); zero-alloc sampler (pre-allocated probs/indices in ForwardBuffers); zero-alloc repeat penalty (stack-based `[(u32,u32); 64]` dedup); Q4_K NEON register-based nibble extraction; decode profiling instrumentation (per-stage timing breakdown); 900+ tests across current validation profiles
- EPC memory detection: `tee::epc` module reads `/proc/meminfo`; `BackendRegistry::find_for_tee()` auto-routes to picolm when model exceeds 75% of available EPC
- `LayerStreamingDecryptedModel` primitive: chunked access to AES-256-GCM encrypted models; each returned chunk is `Zeroizing<Vec<u8>>`; `streaming_decrypt = true` passes this plaintext source to supporting backends and fails closed for unsupported backends
- End-to-end chunked encrypted-model backend loading: `picolm` GGUF consumes `LayerStreamingDecryptedModel` plaintext for streaming decrypt mode instead of loading the encrypted path; the current AES-GCM artifact format is still non-seekable, so full plaintext remains locked in RAM while the handle is live
- `tee-minimal` feature profile: `picolm` + `tls` + `vsock`; smallest auditable TEE build (~1,220 dep tree lines vs ~2,000 for default); no mistralrs/candle and no C++ inference engine; TLS/crypto still brings native `ring`/`aws-lc-sys` build dependencies
- Supply-chain audit document: `docs/supply-chain.md`; per-profile dependency listing, audit status table, threat model
Automated via GitHub Actions:
- CI (`.github/workflows/ci.yml`): Format check, Clippy (6 feature combos across all targets), unit tests, cross-build (4 platforms)
- Release (`.github/workflows/release.yml`): CI gate → 4-platform build → GitHub Release → crates.io → Homebrew formula update
| Target | OS | Cross |
|---|---|---|
| `aarch64-apple-darwin` | macOS (Apple Silicon) | Native |
| `x86_64-apple-darwin` | macOS (Intel) | Native |
| `aarch64-unknown-linux-gnu` | Linux (ARM64) | cross |
| `x86_64-unknown-linux-gnu` | Linux (x86_64) | Native |
# 1. Bump version in Cargo.toml
# 2. Commit and tag
git add -A && git commit -m "chore: release v0.x.y"
git tag v0.x.y && git push origin main --tags
# 3. GitHub Actions builds, publishes to crates.io, creates GitHub Release, updates Homebrew formula

Join us on Discord for questions, discussions, and updates.
MIT