shokhsmee/UzLLM


# 🇺🇿 uzLLM – Train Your Own Uzbek Large Language Model

🧠 Fine-tune, instruct, and deploy Uzbek AI assistants with Hugging Face Transformers.



Uzbek-first LLM training pipeline
Built on 🤗 Transformers · ⚡ PyTorch · 🧪 Prompt Engineering · 🚀 Accelerate

## 📌 What is uzLLM?

uzLLM is a lightweight but powerful training framework for fine-tuning open-source LLMs (such as Mistral, LLaMA, or Falcon) on Uzbek datasets, giving you full control over instruction-following AI, local assistant models, or even Uzbek GPT-style chatbots.


## ✨ Features

- ✅ Fine-tune Hugging Face LLMs on Uzbek text
- ✅ Instruction-style prompt formatting (OpenAssistant style)
- ✅ Supports Mistral / LLaMA / Falcon / Qwen (any causal LM)
- ✅ Multi-GPU or Colab-compatible
- ✅ Clean conversational chat-history handling
- ✅ Easy inference API via CLI or Streamlit (optional)


πŸ› οΈ Quickstart

git clone https://github.com/ShohjahonObruyevOybekovich/UzLLM.git
cd uzllm
pip install -r r.txt
Prepare your dataset in .jsonl format:

```json
{"instruction": "Toshkent qaerda joylashgan?", "response": "Toshkent Oʻzbekiston poytaxti."}
```
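The loader inside `train.py` isn't shown in this README; a minimal sketch of reading and validating this `.jsonl` format could look like the following (the `load_uzbek_jsonl` name is illustrative, not a function from the repo):

```python
import json
from pathlib import Path


def load_uzbek_jsonl(path):
    """Read instruction/response pairs from a .jsonl file.

    Each non-blank line must be a JSON object with "instruction" and
    "response" keys; a malformed line raises ValueError with its
    line number so bad training data is caught early.
    """
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        obj = json.loads(line)
        if "instruction" not in obj or "response" not in obj:
            raise ValueError(f"line {lineno}: missing 'instruction' or 'response'")
        records.append(obj)
    return records
```

Validating up front is cheaper than letting a bad record fail mid-epoch.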
Start fine-tuning:

```bash
python train.py \
  --model_name_or_path NousResearch/Nous-Hermes-2-Mistral-7B-DPO \
  --train_file data/uzbek-instructions.jsonl \
  --output_dir models/uzllm-mistral \
  --per_device_train_batch_size 2 \
  --num_train_epochs 3 \
  --fp16
```
## 🧪 Example Prompt Format

Your data should follow the `[INST]` style used in LLaMA/Mistral:

```text
<s>[INST] <<SYS>>
Siz foydalanuvchiga o‘zbek tilida aqlli, aniq va foydali javoblar berasiz.
<</SYS>>

Salom! Bugun ob-havo qanday?
[/INST]
```
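A small helper for assembling this template programmatically might look like the sketch below (`build_prompt` and `SYSTEM_PROMPT` are illustrative names, not code from the repo):

```python
# Default Uzbek system prompt, matching the example above:
# "You give the user smart, precise, and helpful answers in Uzbek."
SYSTEM_PROMPT = (
    "Siz foydalanuvchiga o‘zbek tilida aqlli, aniq va foydali "
    "javoblar berasiz."
)


def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap a user message in the LLaMA/Mistral [INST] template."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message}\n[/INST]"
    )
```

Building prompts through one helper keeps training and inference formatting identical, which matters because `[INST]`-tuned models are sensitive to spacing and tag placement.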
## 📦 Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("models/uzllm-mistral")
model = AutoModelForCausalLM.from_pretrained("models/uzllm-mistral")

prompt = "<s>[INST] Salom, sen kimsan? [/INST]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
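The features list mentions clean chat-history handling. One common convention for multi-turn `[INST]` prompts is to close each earlier turn with the assistant reply and `</s>`; the sketch below assumes that convention and is not code from `infer.py` (`format_chat` is a hypothetical name):

```python
def format_chat(history, user_message, system_prompt=None):
    """Build a multi-turn Mistral/LLaMA-style prompt.

    history: list of (user, assistant) pairs from earlier turns.
    The <<SYS>> block, if given, is attached to the first [INST] only;
    each completed turn is terminated with </s>.
    """
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    parts = []
    for i, (user, assistant) in enumerate(history):
        prefix = sys_block if i == 0 else ""
        parts.append(f"<s>[INST] {prefix}{user} [/INST] {assistant}</s>")
    # The new user turn is left open so the model generates the reply.
    prefix = sys_block if not history else ""
    parts.append(f"<s>[INST] {prefix}{user_message} [/INST]")
    return "".join(parts)
```

Feeding the full formatted history back in each turn is what keeps the conversation coherent, since the model itself is stateless between `generate` calls.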
## 🧰 Tech Stack

- 🤗 Transformers
- 🧨 Datasets
- 🧪 Accelerate
- 🧠 PyTorch
πŸ—‚οΈ Project Structure
bash
Copy
Edit
uzllm/
β”œβ”€β”€ data/                  # Uzbek training data (jsonl)
β”œβ”€β”€ models/                # Output fine-tuned models
β”œβ”€β”€ train.py               # Training script
β”œβ”€β”€ infer.py               # Inference demo
β”œβ”€β”€ prompts/               # Prompt templates
└── README.md
## 🙌 Contribute Uzbek AI 🇺🇿

We're just getting started. If you're working with Uzbek NLP or LLMs, join us!

- 🧠 Add new Uzbek datasets
- 🔥 Share fine-tuned checkpoints
- 💬 Improve system prompts
- 🤖 Add an inference UI (Streamlit / Telegram bot)

## 🏁 Coming Soon

- [ ] Streamlit chat demo
- [ ] LoRA-based training
- [ ] HuggingFace Hub auto-push
- [ ] Uzbek QA dataset release

## 📜 License

MIT. Do whatever you want, just give credit.

Made with ❤️ for the development of AI in the Uzbek language.
