Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
Updated Nov 6, 2023 - Python
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware
Runs LLaMA at extremely high speed
LLM inference in Fortran
Speaker diarization for Python — "who spoke when?" CPU-only, no API keys, Apache 2.0. ~10.8% DER on VoxConverse, 8x faster than real-time.
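The ~10.8% figure above is a diarization error rate (DER), the standard metric for "who spoke when" systems. A minimal sketch of how DER is computed from scored durations (the function name and example durations are illustrative, not from the repository):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed speech + false alarm + speaker confusion) / total scored speech.

    All arguments are durations in seconds; the result is a fraction
    (multiply by 100 for a percentage).
    """
    return (missed + false_alarm + confusion) / total_speech

# 3 s missed, 2 s false alarm, 5 s speaker confusion over 100 s of speech
der = diarization_error_rate(3.0, 2.0, 5.0, 100.0)  # -> 0.10, i.e. 10% DER
```

Lower is better; a DER near 10% on VoxConverse is competitive for a CPU-only pipeline.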
Pure C inference engine for Qwen3-TTS text-to-speech. No Python, no PyTorch — just C and BLAS. Supports 0.6B and 1.7B models, 9 voices, 10 languages.
Running Mixture of Agents on CPU: LFM2.5 Brain (1.2B) + Falcon-R Reasoner (600M) + Tool Caller (90M). CPU-only, 16GB RAM. Lightweight AI Legion.
A GPU defined in software. Runs Llama 3.2 1B at 3.6 tok/sec. Zero dependencies.
The bare metal in my basement
Non-bijunctive attention collapse for LLM inference — POWER8 hardware AES (vcipher) + AltiVec vec_perm. Hebbian path selection, cross-head diffusion, O(1) KV prefiltering.
eLLM infers LLMs on CPUs in real time
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
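TurboQuant's exact on-disk format isn't documented here, but the general idea behind low-bit KV-cache compression is mapping floating-point values to a small signed integer range plus a scale factor. A generic sketch of symmetric 3-bit quantization (all names are my own; this is an illustration of the technique, not the repository's implementation):

```python
import numpy as np

def quantize_3bit(x):
    """Symmetric per-tensor 3-bit quantization.

    Maps floats to integers in [-4, 3] (the 8 values a 3-bit code can hold)
    using a single scale factor, so each value needs 3 bits plus shared scale.
    """
    scale = max(np.abs(x).max() / 4.0, 1e-12)  # avoid division by zero
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from 3-bit codes and the shared scale."""
    return q.astype(np.float32) * scale

kv = np.array([0.5, -1.0, 2.0], dtype=np.float32)
q, s = quantize_3bit(kv)          # q = [1, -2, 3], s = 0.5
approx = dequantize(q, s)         # [0.5, -1.0, 1.5] -- 2.0 clipped to 1.5
```

Real KV-cache schemes typically quantize per head or per channel group and pack codes into bytes; the memory saving comes from storing 3 bits per value instead of 16 or 32.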
Portable LLM - A Rust library for LLM inference
V-lang API wrapper for llm-inference chatllm.cpp
A wrapper that simplifies using Llama 2 GGUF quantized models.
Lightning-fast RAG for AI agents. ONNX-powered, 4-layer fusion, MCP server. No PyTorch.
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
Privacy-focused RAG chatbot for network documentation. Chat with your PDFs locally using Ollama, Chroma & LangChain. CPU-only, fully offline.
Simple large language model playground app
VB.NET API wrapper for llm-inference chatllm.cpp