Fine-tune Qwen3-4B to generate high-quality Effect-style TypeScript code using a two-stage pipeline: SFT (Supervised Fine-Tuning) + GRPO (Group Relative Policy Optimization).
Note: Marcus Aurelius training is also available in the Practice repo (src/industry_ml_lab/training/philosopher*).
| Repo | Model | Approach | Use Case |
|---|---|---|---|
| Training | Qwen3-4B | SFT + GRPO with Unsloth LoRA | Production ML pipeline, advanced RLHF |
| Practice | Qwen3-100M | Full fine-tuning with PyTorch | Experimentation, learning, tiny LLM |
Why both?
- Training repo: Effect TypeScript fine-tuning (primary focus)
- Practice repo: Marcus Aurelius as a learning experiment (separate use case)
See training/marcus/README.md for Marcus training details in this repo.
Download the trained LoRA adapter: https://huggingface.co/Kodep/qwen3-4b-effect-codegen-v2
| Feature | Value |
|---|---|
| Base Model | Qwen3-4B (Qwen3ForCausalLM) |
| Number of Layers | 36 |
| Hidden Size | 2560 |
| Attention Heads | 32 (Q), 8 (KV) |
| Vocabulary Size | 151936 |
| Max Position Embeddings | 40960 |
| LoRA Rank | 64 |
| Trainable Parameters | 132M (3.18% of base model) |
| Precision | float16 |
| License | Apache 2.0 |
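The trainable-parameter figure can be sanity-checked from the architecture numbers in the table. The sketch below assumes LoRA adapters on all seven linear projections of every decoder layer, a head dimension of 128, and an MLP intermediate size of 9728. None of these three values appear in the table, so treat them as assumptions taken from the published Qwen3-4B configuration.

```python
# Rough LoRA parameter count for Qwen3-4B at rank 64.
# Assumed (not listed in the table): HEAD_DIM, INTERMEDIATE, and the set of
# adapted projections.
RANK = 64
NUM_LAYERS = 36
HIDDEN = 2560
Q_HEADS, KV_HEADS = 32, 8
HEAD_DIM = 128        # assumption
INTERMEDIATE = 9728   # assumption

# (in_features, out_features) for each adapted projection in one layer
projections = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # A LoRA adapter adds two matrices: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 132,120,576
```

Under those assumptions the count comes to about 132M, in line with the table.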
| Metric | Value |
|---|---|
| Final Reward | 0.9775 ± 0.2134 |
| Reward Improvement | +0.2012 (+26%) |
| KL Divergence | 0.19 |
| Gradient Norm | 0.282 |
| Training Time | ~63 minutes total |
Full post-training summary: See POST_TRAINING_SUMMARY.md
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Kodep/qwen3-4b-effect-codegen-v2",
max_seq_length=4096,
load_in_4bit=True,
)
# Stage 1: SFT (Supervised Fine-Tuning)
.\unsloth_env\Scripts\python.exe training\run-sft-training-1.py
# Stage 2: GRPO (Reinforcement Fine-Tuning)
.\unsloth_env\Scripts\python.exe training\run-grpo-training-2.py
See training/README.md for detailed training playbook and testing instructions.
Fast supervised fine-tuning on exact examples to teach the model Effect code patterns.
- Duration: ~7.5 minutes on RTX 4090
- Final loss: ~0.834
- Output: `training/output-v2/qwen3-4b-effect-codegen-sft/`
Reinforcement learning with reward functions to optimize code quality.
- Duration: ~66.6 minutes on RTX 4090
- Final reward: ~0.698
- Output: `training/output-v2/qwen3-4b-effect-codegen-v2/`
- +1.0 for `<CODE>` tags
- +0.5 for Effect imports (`from effect`, `import Effect`)
- +0.3 for Schema usage
- +0.2 for exports
- +0.3 to +0.7 based on output length (shorter = lower, up to 2000 chars)
- -0.5 for responses too short (< 100 chars)
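The scoring rules above can be sketched as a single function. This is a hypothetical illustration, not the code in `run-grpo-training-2.py`: the name `score_completion`, the exact string and regex checks, and the linear length bonus between 100 and 2000 characters are all assumptions.

```python
import re


def score_completion(text: str) -> float:
    """Hypothetical reward mirroring the list above; the repo's checks may differ."""
    score = 0.0

    # +1.0 when the response contains a <CODE> tag
    if "<CODE>" in text:
        score += 1.0

    # +0.5 for Effect imports (assumed patterns)
    if re.search(r"""from\s+["']effect""", text) or "import Effect" in text:
        score += 0.5

    # +0.3 for Schema usage (assumed: a plain substring check)
    if "Schema" in text:
        score += 0.3

    # +0.2 for exports
    if re.search(r"\bexport\b", text):
        score += 0.2

    # Length: penalty below 100 chars, otherwise +0.3 rising to +0.7 at 2000 chars.
    # The linear ramp is an assumption; the source only gives the endpoints.
    n = len(text)
    if n < 100:
        score -= 0.5
    else:
        frac = min(n - 100, 1900) / 1900
        score += 0.3 + 0.4 * frac

    return score
```

Keeping each component as a separate check makes it straightforward to log per-component scores alongside the total.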
# Stage 1: SFT only
.\unsloth_env\Scripts\python.exe training\run-sft-training-1.py
# Stage 2: GRPO (requires SFT first)
.\unsloth_env\Scripts\python.exe training\run-grpo-training-2.py
# Test trained model
.\unsloth_env\Scripts\python.exe training\test-training.py --stage sft # After SFT
.\unsloth_env\Scripts\python.exe training\test-training.py --stage grpo # After GRPO
# Convert to GGUF for llama.cpp inference
.\unsloth_env\Scripts\python.exe training\convert-to-gguf.py
# Standalone evaluation (uses llama.cpp HTTP server)
.\unsloth_env\Scripts\python.exe training\evaluate-model.py --model training\output-v2\qwen3-4b-effect-codegen-v2
# Compare models (local vs remote)
.\unsloth_env\Scripts\python.exe training\compare-models.py
See training/README.md for detailed testing instructions.
Note: Training uses 16-bit float (no quantization) to enable easy GGUF conversion. After training, convert to GGUF format for fast llama.cpp inference.
.\unsloth_env\Scripts\python.exe training\evaluate-model.py --model training\output-v2\qwen3-4b-effect-codegen-v2
.\unsloth_env\Scripts\python.exe training\compare-models.py
This project runs on consumer hardware (RTX 4090, 24GB VRAM) thanks to Unsloth's GPU optimizations:
- 8-bit Quantization - Uses `load_in_8bit=True` with `load_in_4bit=False` for slightly better quality than 4-bit. Both flags must be set explicitly when loading 8-bit models. Only the LoRA adapters are trainable (132M parameters, 3.18% of the base model).
- Unsloth Gradient Checkpointing (`use_gradient_checkpointing = "unsloth"`) - A custom checkpointing algorithm that asynchronously offloads activations to system RAM during training, reducing VRAM usage by ~30% with only ~2% compute overhead.
- Custom Triton Kernels - Unsloth replaces standard PyTorch operations with hand-written GPU kernels in Triton. Fused RoPE and MLP kernels deliver 2–5x faster training.
- Smart Packing - Combines multiple short training samples into single tensors, eliminating wasted compute on padding tokens.
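To illustrate where the padding savings from packing come from, the sketch below greedily groups sample lengths into rows of at most `max_seq_length` tokens (first-fit decreasing). It is a simplified illustration, not Unsloth's implementation; `pack_lengths` and the token counts are hypothetical.

```python
def pack_lengths(lengths: list[int], max_seq_length: int) -> list[list[int]]:
    """First-fit-decreasing packing of token counts into rows (illustrative only)."""
    bins: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        if length > max_seq_length:
            raise ValueError(f"sample of {length} tokens exceeds max_seq_length")
        for row in bins:
            # Place the sample in the first row that still has room.
            if sum(row) + length <= max_seq_length:
                row.append(length)
                break
        else:
            # No existing row fits, so start a new one.
            bins.append([length])
    return bins


lengths = [3000, 1200, 900, 700, 400, 300]  # hypothetical token counts
packed = pack_lengths(lengths, max_seq_length=4096)
print(packed)                                # [[3000, 900], [1200, 700, 400, 300]]
print(f"rows: {len(lengths)} -> {len(packed)}")
```

Without packing, each of the six samples would occupy its own padded row; here they fit in two.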
Unsloth dynamically patches GRPOConfig at runtime. Key settings:
- `per_device_train_batch_size=4` - Must match `num_generations` to avoid config patching conflicts
- `unsloth_num_chunks=2` - Added by Unsloth for chunked processing
- `unsloth_grpo_mini_batch=4` - Added by Unsloth for mini-batch training
- Always use `from trl import GRPOConfig, GRPOTrainer` after importing Unsloth
Critical: import unsloth must be at the top of the file, before trl, transformers, peft. Importing Unsloth after TRL causes warnings and potential config patching issues.
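One way to guard against the import-order pitfall is a small static check that could run before training or in a pre-commit hook. `check_unsloth_first` is a hypothetical helper, not part of Unsloth or this repo, and it only inspects top-level import statements.

```python
import ast

# Libraries that must be imported after Unsloth so it can patch them.
PATCHED = {"trl", "transformers", "peft"}


def check_unsloth_first(source: str) -> bool:
    """Return True if `unsloth` is imported before any of the PATCHED libraries."""
    unsloth_line = None
    first_patched_line = None
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            roots = [node.module.split(".")[0]]
        else:
            continue
        for root in roots:
            if root == "unsloth" and unsloth_line is None:
                unsloth_line = node.lineno
            elif root in PATCHED and first_patched_line is None:
                first_patched_line = node.lineno
    if unsloth_line is None:
        return False
    return first_patched_line is None or unsloth_line < first_patched_line


good = "import unsloth\nfrom trl import GRPOConfig, GRPOTrainer\n"
bad = "from trl import GRPOConfig, GRPOTrainer\nimport unsloth\n"
print(check_unsloth_first(good), check_unsloth_first(bad))  # True False
```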
| Metric | SFT Stage | GRPO Stage |
|---|---|---|
| Duration | ~7.5 min | ~66.6 min |
| Final loss | 0.834 | — |
| Avg reward | — | 0.698 |
| Reward std | — | 0.056 |
| KL divergence | — | 0.0005 |
| Grad norm | — | ~0.00002 |
| frac_reward_zero_std | — | 0.375 |
Run evaluation to get per-sample metrics:
.\unsloth_env\Scripts\python.exe training\run-training-v2.py --eval-only
.\unsloth_env\Scripts\python.exe training\evaluate-model.py --model training\output-v2\qwen3-4b-effect-codegen-v2
Output is saved to training/output-v2/eval-metrics.json with:
- Per-sample scores: code_tags, effect_imports, schema_usage, exports, length
- Component percentages: what % of responses use each pattern
- Generation time: average time per sample
- Success/fail status: with timeout protection (120s per sample)
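The component percentages can be derived from the per-sample scores. A minimal sketch, assuming each sample is a dict keyed by the component names listed above and that a positive score means the pattern was present; the actual layout of `eval-metrics.json` is not documented here, so the structure and the helper name `component_percentages` are hypothetical.

```python
from collections import Counter

COMPONENTS = ["code_tags", "effect_imports", "schema_usage", "exports", "length"]


def component_percentages(samples: list[dict]) -> dict[str, float]:
    """Percentage of samples with a positive score for each component."""
    if not samples:
        return {name: 0.0 for name in COMPONENTS}
    hits = Counter()
    for sample in samples:
        for name in COMPONENTS:
            if sample.get(name, 0) > 0:
                hits[name] += 1
    return {name: 100.0 * hits[name] / len(samples) for name in COMPONENTS}


# Hypothetical per-sample scores for two responses
samples = [
    {"code_tags": 1.0, "effect_imports": 0.5, "schema_usage": 0, "exports": 0.2, "length": 0.5},
    {"code_tags": 0, "effect_imports": 0.5, "schema_usage": 0, "exports": 0, "length": -0.5},
]
print(component_percentages(samples))
```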
Compare against remote models (OpenAI-compatible endpoint):
.\unsloth_env\Scripts\python.exe training\compare-models.py
Output: training/output/comparison-results.json
- SFT is critical for GRPO: Without the SFT stage first, `frac_reward_zero_std` stays at 1.0 (no reward variance = no learning)
- GRPO `grad_norm` was very small (~0.00002): the model barely updates during GRPO; consider adjusting the learning rate
- 8-bit vs 4-bit: 8-bit loads with `load_in_4bit=False` + `load_in_8bit=True` (both flags required)
- Post-training eval: The model should not be loaded twice, as this causes memory issues
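The first point follows from how GRPO scores completions relative to each other within a group. A minimal sketch, assuming the standard group-normalized advantage (reward minus group mean, divided by group standard deviation); `group_advantages` and the epsilon value are illustrative and are not TRL's code.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion scores the same: all advantages are zero, so there is
# nothing for the policy gradient to push on.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])
print(flat)  # [0.0, 0.0, 0.0, 0.0]

# Mixed rewards produce non-zero advantages the policy can learn from.
mixed = group_advantages([0.0, 1.5, 0.5, 2.0])
print(mixed)
```

When every group looks like the first case, the fraction of zero-variance groups is 1.0 and training stalls, which is why the SFT stage needs to lift at least some completions above the baseline first.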
├── training/                                # Training work
│   ├── run-sft-training-1.py                # Stage 1: SFT training
│   ├── run-grpo-training-2.py               # Stage 2: GRPO training (requires SFT)
│   ├── test-training.py                     # Test script (--stage sft | grpo)
│   ├── README.md                            # Detailed training playbook
│   ├── evaluate-model.py                    # Standalone evaluation script
│   ├── compare-models.py                    # Local vs remote model comparison
│   ├── main-training.ipynb                  # Notebook version (reference)
│   ├── backup/                              # Original Unsloth notebook
│   └── output-v2/                           # Training outputs (gitignored)
│       ├── qwen3-4b-effect-codegen-sft/     # SFT model (~500MB)
│       └── qwen3-4b-effect-codegen-v2/      # GRPO model (~500MB)
├── extracted-code/                          # Training data (428 samples, 18MB)
│   └── effect-code-samples.json
├── examples/                                # Learning notebooks
├── literature/                              # RLHF/GRPO reference material
├── extract-code.ts                          # Data extraction script
├── LICENSE
├── README.md
└── AGENTS.md
Extract training samples from Effect repositories:
bun run extract-code.ts
# or with custom directories:
TRAINING_DIR=/path/to/repos OUTPUT_DIR=/path/to/output bun run extract-code.ts
- NVIDIA GPU with CUDA support (tested on RTX 4090, 24GB VRAM)
- Python 3.13
- CUDA 13.0
- Unsloth 2026.5.8
- PyTorch 2.10.0+cu130
The system Python (python) has a CPU-only PyTorch build and cannot train.
You MUST use the unsloth_env venv.
To enable fast llama.cpp inference, convert your trained model to GGUF format:
.\unsloth_env\Scripts\python.exe training\convert-to-gguf.py
This creates:
- `Qwen3-4B.F16.gguf` (~8 GB) - Float16 format
- `Qwen3-4B.Q8_0.gguf` (~4.3 GB) - Q8_0 quantized format (50% size reduction)
Use the Q8_0 file with llama.cpp for faster inference with minimal quality loss.
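The two file sizes follow from the storage formats. A back-of-the-envelope check, assuming roughly 4.02 billion parameters (an approximation, not a figure from this README) and ignoring file metadata and any tensors kept at higher precision: F16 stores 2 bytes per weight, while Q8_0 stores blocks of 32 int8 weights plus one 2-byte scale, or 34 bytes per 32 weights.

```python
PARAMS = 4.02e9  # assumed approximate parameter count for Qwen3-4B

f16_bytes = PARAMS * 2          # 2 bytes per weight
q8_0_bytes = PARAMS * 34 / 32   # 32 int8 weights + one fp16 scale per block

f16_gb = f16_bytes / 1e9
q8_0_gb = q8_0_bytes / 1e9
print(f"F16:  ~{f16_gb:.1f} GB")   # ~8.0 GB
print(f"Q8_0: ~{q8_0_gb:.1f} GB")  # ~4.3 GB
```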
Start the llama.cpp server:
.\run-inference.ps1
Then run evaluation:
.\unsloth_env\Scripts\python.exe training\run-training-v2.py --eval-only
- vLLM is not required for training (excluded due to Windows compatibility; vLLM requires transformers >= 4.56.0, Unsloth requires transformers <= 5.5.0)
- Training uses 16-bit float (not quantized) to enable easy GGUF conversion for llama.cpp
- Model weights (~500MB models) are hosted on Hugging Face, not in this repo
- Total training time: ~75 minutes (7.5 min SFT + 66.6 min GRPO) on RTX 4090
- wandb logging is optional; metrics are logged to the `effect-codegen-grpo` project
- LSP errors can be ignored when code runs correctly (static analyzer vs runtime patches)
MIT License - see LICENSE