Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
Updated Nov 6, 2023 - Python
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware
Runs LLaMA at extremely high speed
LLM inference in Fortran
Speaker diarization for Python — "who spoke when?" CPU-only, no API keys, Apache 2.0. ~10.8% DER on VoxConverse, 8x faster than real-time.
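The ~10.8% figure above is a diarization error rate (DER), the standard metric for "who spoke when" systems. A minimal sketch of how DER is computed from scored durations (the function name and example durations are illustrative, not from the repository):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / total scored speech.

    All arguments are durations in seconds; the result is a fraction
    (multiply by 100 for a percentage).
    """
    return (missed + false_alarm + confusion) / total_speech

# 3 s missed, 2 s false alarm, 5 s speaker confusion over 100 s of speech
der = diarization_error_rate(3.0, 2.0, 5.0, 100.0)  # -> 0.10, i.e. 10% DER
```

Lower is better; a DER near 10% on VoxConverse is competitive for a CPU-only pipeline.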
Pure C inference engine for Qwen3-TTS text-to-speech. No Python, no PyTorch — just C and BLAS. Supports 0.6B and 1.7B models, 9 voices, 10 languages.
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
The bare metal in my basement
Non-bijunctive attention collapse for LLM inference — POWER8 hardware AES (vcipher) + AltiVec vec_perm. Hebbian path selection, cross-head diffusion, O(1) KV prefiltering.
eLLM infers LLMs on CPUs in real time
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
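TurboQuant's exact on-disk format isn't documented here, but the general idea behind low-bit KV-cache compression is mapping floating-point values to a small signed integer range plus a scale factor. A generic sketch of symmetric 3-bit quantization (all names are my own; this is an illustration of the technique, not the repository's implementation):

```python
import numpy as np

def quantize_3bit(x):
    """Symmetric per-tensor 3-bit quantization.

    Maps floats to integers in [-4, 3] (the 8 values a 3-bit code can hold)
    using a single scale factor, so each value needs 3 bits plus shared scale.
    """
    scale = max(np.abs(x).max() / 4.0, 1e-12)  # avoid division by zero
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from 3-bit codes and the shared scale."""
    return q.astype(np.float32) * scale

kv = np.array([0.5, -1.0, 2.0], dtype=np.float32)
q, s = quantize_3bit(kv)          # q = [1, -2, 3], s = 0.5
approx = dequantize(q, s)         # [0.5, -1.0, 1.5] -- 2.0 clipped to 1.5
```

Real KV-cache schemes typically quantize per head or per channel group and pack codes into bytes; the memory saving comes from storing 3 bits per value instead of 16 or 32.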
Portable LLM - A Rust library for LLM inference
V-lang API wrapper for llm-inference chatllm.cpp
A wrapper that simplifies using Llama 2 GGUF quantized models.
Lightning-fast RAG for AI agents. ONNX-powered, 4-layer fusion, MCP server. No PyTorch.
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
Privacy-focused RAG chatbot for network documentation. Chat with your PDFs locally using Ollama, Chroma & LangChain. CPU-only, fully offline.
Simple large language model playground app
VB.NET API wrapper for llm-inference chatllm.cpp