A practical guide for local LLM deployment with 4-bit quantization.
4-bit量化本地大模型部署实战指南
Local Deployment of Qwen2.5 on RTX 4060: 4-bit Quantization & Gradio UI
本项目完整记录了如何在搭载 NVIDIA RTX 4060 的笔记本上,从零搭建 AI 环境,并实现大语言模型的本地私有化、全速运行。
- 环境搭建:Anaconda + CUDA 12.1 + PyTorch。
- 推理加速:
BitsAndBytes4-bit 量化(显存降低50%以上)。 - 交互方式:提供命令行版(快速测试)和Gradio Web UI版(美观交互,支持手机访问)。
- 显卡:RTX 4060 Laptop (8GB VRAM)
- 模型:Qwen2.5-1.5B-Instruct
- 加载时间:约 7s(4-bit 量化)
- 推理速度:接近网页版原生体验
conda create -n qwen25 python=3.11
conda activate qwen25pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121pip install -r requirements.txt选项A:命令行快速测试版(无量化,适合快速验证)
python scripts/run_qwen.py选项B:4-bit 量化 + Gradio Web UI版(推荐!带漂亮聊天界面)
python scripts/my_local_ai.py启动后浏览器会自动打开界面。 手机访问方法:同一 WiFi 下,用手机浏览器打开 http://你的笔记本IP:7860
账号:local,密码:123456。如需修改,请编辑 scripts/my_local_ai.py 最后一行的 auth=("local", "123456")。
- torch(CUDA 12.1)
- transformers, accelerate, bitsandbytes
- gradio
- sentencepiece, protobuf 等
- 第一次运行会自动下载模型(约1GB),建议开启科学上网或使用 hf-mirror。
- 4-bit 量化版(my_local_ai.py)显存占用更低,推荐在 4060 上使用。
- 如果出现 CUDA out of memory,可以尝试降低 max_new_tokens 或关闭浏览器其他标签。
- Gradio 默认端口为 7860,可在代码中修改。
This project documents the end-to-end process of building a local AI environment on a laptop with an NVIDIA RTX 4060, achieving high-speed local inference of LLMs.
- Environment: Precise configuration of CUDA 12.1 + PyTorch (GPU version) using Anaconda.
- Acceleration: Implemented 4-bit Quantization via
BitsAndBytes, reducing VRAM usage by >50% and enabling instant response. - UI & Interaction: Built a custom chat interface with
Gradio, supporting local network tunneling for mobile access.
- GPU: RTX 4060 Laptop (8GB VRAM)
- Model: Qwen2.5-1.5B-Instruct
- Loading Time: ~7s
- Speed: Near-native web experience under 4-bit quantization.
conda create -n qwen25 python=3.11
conda activate qwen25pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121pip install -r requirements.txtOption A: Command-line Quick Version (No quantification, suitable for rapid verification)
python scripts/run_qwen.pyOption B: 4-bit quantization + Gradio web UI version (recommended! Comes with a beautiful chat interface)
python scripts/my_local_ai.pyAfter startup, the browser will automatically open the interface. Mobile access method: Under the same WiFi network, open http://your laptop’s IP address:7860 in your mobile browser.
Account:local,Password:123456. To change credentials, edit the last line in scripts/my_local_ai.py.
- First run will download the model (~1GB). Use hf-mirror for faster speed.
- 4-bit version uses less VRAM — strongly recommended for RTX 4060.
- For CUDA out of memory, reduce max_new_tokens or close other tabs.
Author: Haven (Jinan University - AI Major)
License: MIT