# 🧠 uzLLM: Uzbek-first LLM training pipeline

Fine-tune, instruct, and deploy Uzbek AI assistants with HuggingFace Transformers.

Built on 🤗 Transformers · ⚡ PyTorch · 🧪 Prompt Engineering · 🚀 Accelerate

uzLLM is a lightweight but powerful training framework for fine-tuning open-source LLMs (such as Mistral, LLaMA, and Falcon) on Uzbek datasets, giving you full control over instruction-following AI, local assistant models, or even Uzbek GPT-style chatbots.
- ✅ Fine-tune HuggingFace LLMs on Uzbek text
- ✅ Instruction-style prompt formatting (OpenAssistant style)
- ✅ Supports Mistral / LLaMA / Falcon / Qwen (any causal LM)
- ✅ Multi-GPU or Colab-compatible
- ✅ Clean conversational chat history handling
- ✅ Easy inference API via CLI or Streamlit (optional)
```bash
git clone https://github.com/ShohjahonObruyevOybekovich/UzLLM.git
cd UzLLM
pip install -r r.txt
```
Prepare your dataset in `.jsonl` format:

```json
{"instruction": "Toshkent qaerda joylashgan?", "response": "Toshkent Oʻzbekiston poytaxti."}
```
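Before training, it can help to sanity-check every line of the dataset. A minimal stdlib-only sketch (the function name is illustrative; the `instruction`/`response` field names follow the example above):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl_line(line):
    """Parse one .jsonl line and check that the expected fields exist and are non-empty strings."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return record
```

Running it over `data/uzbek-instructions.jsonl` before a multi-hour training run catches malformed records early.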
Start fine-tuning:

```bash
python train.py \
  --model_name_or_path NousResearch/Nous-Hermes-2-Mistral-7B-DPO \
  --train_file data/uzbek-instructions.jsonl \
  --output_dir models/uzllm-mistral \
  --per_device_train_batch_size 2 \
  --num_train_epochs 3 \
  --fp16
```
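Before tokenization, each `.jsonl` record has to be rendered into the model's instruction template. A hedged sketch of that step (the function name and default system prompt are illustrative, not part of `train.py`):

```python
DEFAULT_SYSTEM = "Siz foydalanuvchiga oʻzbek tilida aqlli, aniq va foydali javoblar berasiz."

def render_example(record, system=DEFAULT_SYSTEM):
    """Render one {"instruction", "response"} record into the Mistral/LLaMA [INST] format."""
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n"
        f"{record['instruction']} [/INST] {record['response']}</s>"
    )
```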
## 🧪 Example Prompt Format

Your data should follow the `[INST]` style used by LLaMA/Mistral:

```text
<s>[INST] <<SYS>>
Siz foydalanuvchiga oʻzbek tilida aqlli, aniq va foydali javoblar berasiz.
<</SYS>>
Salom! Bugun ob-havo qanday?
[/INST]
```
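For multi-turn conversations, a common convention (assumed here, following typical Mistral/LLaMA usage) is to repeat the `[INST]` wrapper per turn while keeping the system block only in the first turn:

```python
def format_chat(history, system):
    """history is a list of (user, assistant) turns; the system prompt appears only in turn 0."""
    parts = []
    for i, (user, assistant) in enumerate(history):
        if i == 0:
            parts.append(f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{user} [/INST] {assistant}</s>")
        else:
            parts.append(f"<s>[INST] {user} [/INST] {assistant}</s>")
    return "".join(parts)
```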
## 📦 Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("models/uzllm-mistral")
model = AutoModelForCausalLM.from_pretrained("models/uzllm-mistral")

prompt = "<s>[INST] Salom, sen kimsan? [/INST]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
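Note that `generate` returns the prompt tokens followed by the completion, so the decoded string still contains the prompt. One simple way to recover just the model's reply (a string-level convention, not a Transformers API) is to split on the last `[/INST]` marker:

```python
def extract_reply(decoded):
    """Return only the text after the final [/INST] marker, dropping any trailing </s>."""
    reply = decoded.rsplit("[/INST]", 1)[-1]
    return reply.replace("</s>", "").strip()
```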
## 🧰 Tech Stack

- 🤗 Transformers
- 🧨 Datasets
- 🧪 Accelerate
- 🧠 PyTorch
## 🗂️ Project Structure

```text
uzllm/
├── data/        # Uzbek training data (jsonl)
├── models/      # Output fine-tuned models
├── train.py     # Training script
├── infer.py     # Inference demo
├── prompts/     # Prompt templates
└── README.md
```
## 🌍 Contribute Uzbek AI 🇺🇿

We're just getting started. If you're working with Uzbek NLP or LLMs, join us!

- 🧠 Add new Uzbek datasets
- 🔥 Share fine-tuned checkpoints
- 💬 Improve system prompts
- 🤖 Add inference UI (Streamlit / Telegram bot)
## 🚀 Coming Soon

- Streamlit chat demo
- LoRA-based training
- HuggingFace Hub auto-push
- Uzbek QA dataset release
## 📄 License

MIT: do whatever you want, just give credit.

Made with ❤️ for the development of AI in the Uzbek language.