Haven16262/LLM-Local-Deployment-Guide

LLM-Local-Deployment-Guide

A practical guide for local LLM deployment with 4-bit quantization.

🚀 Local Deployment of Qwen2.5-1.5B on an RTX 4060 Laptop: 4-bit Quantization & Gradio UI

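Chinese | English


🇨🇳 Chinese Guide

🌟 Highlights

This project documents, step by step, how to set up an AI environment from scratch on a laptop with an NVIDIA RTX 4060 and run a large language model locally and privately at full speed.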

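🛠️ Core Technical Approach

  1. Environment setup: Anaconda + CUDA 12.1 + PyTorch.
  2. Inference acceleration: BitsAndBytes 4-bit quantization (reduces VRAM usage by more than 50%).
  3. Interaction: a command-line version (for quick tests) and a Gradio Web UI version (a polished chat interface that can also be opened from a phone).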
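
As a rough sanity check on the "more than 50%" figure, weight memory can be estimated from the parameter count and the bits stored per weight. The sketch below assumes about 1.5 billion parameters (rounded from the model name) and counts weights only. It ignores the KV cache, activations, quantization metadata, and any layers that stay unquantized, so real VRAM usage will be higher than these numbers.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 1.5e9  # assumption: rounded from "Qwen2.5-1.5B"

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision baseline
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights
saving = 1 - int4_gib / fp16_gib              # fraction of weight memory saved

print(f"FP16 weights : {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
print(f"Weight-only saving: {saving:.0%}")
```

Going from 16 bits to 4 bits per weight removes three quarters of the weight memory, which is consistent with the claim above even after overheads are added back.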

📊 Performance

  • GPU: RTX 4060 Laptop (8GB VRAM)
  • Model: Qwen2.5-1.5B-Instruct
  • Load time: about 7 s (4-bit quantization)
  • Inference speed: close to the experience of the hosted web version

🚀 Quick Start (Recommended)

1. Create and activate the Anaconda environment

conda create -n qwen25 python=3.11
conda activate qwen25

2. Install PyTorch (CUDA 12.1 build)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

3. Install dependencies

pip install -r requirements.txt

4. Run the project (choose one)

Option A: command-line quick-test version (no quantization, suitable for quick verification)

python scripts/run_qwen.py

Option B: 4-bit quantization + Gradio Web UI version (recommended; includes a chat interface)

python scripts/my_local_ai.py

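After startup, the browser opens the interface automatically. To access it from a phone, connect the phone to the same WiFi network and open http://<your-laptop-IP>:7860 in the phone's browser.

Username: local, password: 123456. To change them, edit auth=("local", "123456") on the last line of scripts/my_local_ai.py.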

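📋 Dependencies (already included in requirements.txt)

  • torch (CUDA 12.1)
  • transformers, accelerate, bitsandbytes
  • gradio
  • sentencepiece, protobuf, etc.

Notes:

  • The first run downloads the model automatically (about 1GB); using a proxy or hf-mirror is recommended.
  • The 4-bit quantized version (my_local_ai.py) uses less VRAM and is recommended on the 4060.
  • If you hit CUDA out of memory, try lowering max_new_tokens or closing other browser tabs.
  • Gradio's default port is 7860; it can be changed in the code.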
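
The advice about lowering max_new_tokens can also be automated. The helper below is a hypothetical sketch and is not taken from the project's scripts: it retries a generation call with half the token budget each time an out-of-memory error is raised. The names generate_with_backoff, min_tokens, and oom_error are illustrative. With PyTorch you would likely pass oom_error=torch.cuda.OutOfMemoryError and call torch.cuda.empty_cache() between attempts.

```python
def generate_with_backoff(generate, max_new_tokens=512, min_tokens=32,
                          oom_error=MemoryError):
    """Call generate(max_new_tokens=...), halving the budget on each oom_error."""
    budget = max_new_tokens
    while True:
        try:
            return generate(max_new_tokens=budget)
        except oom_error:
            if budget // 2 < min_tokens:
                raise  # already at the floor: surface the original error
            budget //= 2  # retry with a smaller generation budget


# Usage with a stand-in for model.generate that "runs out of memory"
# whenever more than 128 new tokens are requested.
def fake_generate(max_new_tokens):
    if max_new_tokens > 128:
        raise MemoryError("simulated CUDA out of memory")
    return f"generated up to {max_new_tokens} tokens"


print(generate_with_backoff(fake_generate, max_new_tokens=512))
```

Halving keeps the number of retries small; the floor stops the loop from producing uselessly short replies.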

📚 More Documents


🇬🇧 English Guide

🌟 Highlights

This project documents the end-to-end process of building a local AI environment on a laptop with an NVIDIA RTX 4060, achieving high-speed local inference of LLMs.

🛠️ Tech Stack

  1. Environment: CUDA 12.1 + PyTorch (GPU build), managed with Anaconda.
  2. Acceleration: 4-bit quantization via BitsAndBytes, reducing VRAM usage by more than 50% while keeping responses fast.
  3. UI & Interaction: a chat interface built with Gradio, reachable from a phone on the same local network.
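
For readers who want to see what 4-bit loading typically looks like with transformers and bitsandbytes, here is a minimal sketch. It is not the contents of scripts/my_local_ai.py. The NF4 quantization type, double quantization, and the FP16 compute dtype are assumptions chosen as common defaults, and load_quantized and quant_settings are illustrative names.

```python
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # Hugging Face Hub ID of the model named above


def quant_settings() -> dict:
    """4-bit settings as a plain dict, so they can be inspected without a GPU."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",       # assumption: NF4 rather than plain FP4
        "bnb_4bit_use_double_quant": True,  # assumption: also quantize the scales
    }


def load_quantized(model_id: str = MODEL_ID):
    """Load the tokenizer and a 4-bit model.

    Needs a CUDA GPU plus torch, transformers, accelerate, and bitsandbytes.
    """
    # Imported lazily so this module can be read and tested without the GPU stack.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # assumption: FP16 matmuls
        **quant_settings(),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place the layers on the GPU
    )
    return tokenizer, model


print(quant_settings())
```

Calling load_quantized() on the laptop returns a tokenizer and model ready for generate(); nothing is downloaded until that call is made.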

📊 Performance

  • GPU: RTX 4060 Laptop (8GB VRAM)
  • Model: Qwen2.5-1.5B-Instruct
  • Loading Time: ~7 s (with 4-bit quantization)
  • Speed: close to the experience of the hosted web version under 4-bit quantization.

🚀 Quick Start

1. Create Conda Environment

conda create -n qwen25 python=3.11
conda activate qwen25

2. Install PyTorch (CUDA 12.1)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
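
Before moving on, it can help to confirm that the installed PyTorch build actually sees the GPU. The check below is a small sketch, and cuda_report is an illustrative name. It returns a summary dict rather than failing when torch is missing, so it is safe to run at any point in the setup.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarize whether PyTorch is installed and whether it can see a CUDA GPU."""
    report = {"torch_installed": False, "cuda_available": False, "device": None}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)  # name of the first GPU
    return report


print(cuda_report())
```

If cuda_available is False after installation, the usual suspects are a CPU-only wheel or an outdated NVIDIA driver.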

3. Install Dependencies

pip install -r requirements.txt

4. Run (Choose One)

Option A: command-line quick version (no quantization, suitable for rapid verification)

python scripts/run_qwen.py

Option B: 4-bit quantization + Gradio web UI version (recommended; comes with a chat interface)

python scripts/my_local_ai.py

After startup, the browser opens the interface automatically. To access it from a phone, connect the phone to the same WiFi network and open http://<your-laptop-IP>:7860 in the phone's browser.

Username: local, password: 123456. To change the credentials, edit the last line of scripts/my_local_ai.py.
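
The browser, phone access, and login behavior described above correspond to a handful of Gradio launch() arguments. The sketch below shows one way to wire them up, assuming a chat function with the (message, history) signature that gr.ChatInterface expects. It is not the project's actual script, and lan_url and launch_chat are illustrative names.

```python
import socket

PORT = 7860  # Gradio's default port, as used in the URL above


def lan_url(port: int = PORT) -> str:
    """Best-effort guess at the URL a phone on the same WiFi would open."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # Connecting a UDP socket sends no packets; it only makes the OS
            # pick the outgoing network interface, whose address we then read.
            sock.connect(("8.8.8.8", 80))
            ip = sock.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"  # no usable route: fall back to loopback
    return f"http://{ip}:{port}"


def launch_chat(chat_fn):
    """Serve chat_fn(message, history) -> str on the LAN behind a basic login."""
    import gradio as gr  # imported lazily so lan_url() works without Gradio

    demo = gr.ChatInterface(fn=chat_fn)
    demo.launch(
        server_name="0.0.0.0",     # listen on all interfaces, not just localhost
        server_port=PORT,
        auth=("local", "123456"),  # credentials quoted in this README; change them
        inbrowser=True,            # open the default browser on startup
    )


print("Open on your phone:", lan_url())
```

Binding to 0.0.0.0 exposes the interface to every device on the network, which is why changing the default credentials is worth doing before sharing the URL.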

📋 Notes

  • The first run downloads the model (~1GB). Use hf-mirror for faster downloads.
  • The 4-bit version uses less VRAM and is strongly recommended on the RTX 4060.
  • If you hit CUDA out of memory, reduce max_new_tokens or close other browser tabs.
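
One common way to route downloads through hf-mirror is the HF_ENDPOINT environment variable, which huggingface_hub reads when it is first imported. The snippet below is a sketch under the assumption that the mirror is served at https://hf-mirror.com. Place it at the very top of the script, before importing transformers.

```python
import os

# Assumption: hf-mirror is reachable at this URL. setdefault() keeps any
# endpoint that was already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

# Import transformers / huggingface_hub only after this point, because the
# endpoint is read once at import time.
print("Model downloads will use:", os.environ["HF_ENDPOINT"])
```

The same effect can be had by exporting HF_ENDPOINT in the shell before running either script.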

📚 More documents


Author: Haven (Jinan University - AI Major)
License: MIT
