Fine-tuning CodeT5+ for 4 complementary tasks on Python code.
This project fine-tunes a single CodeT5+ model to handle 4 tasks:
- Code Repair (`fix bug:`) - Fix bugs in code
- Bug Detection (`classify code:`) - Detect if code contains a bug (BUGGY/CORRECT)
- Code Summarization (`summarize code:`) - Generate code descriptions
- Code Search (`search code:`) - Search code from a description (multiple choice)
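As a minimal sketch of how one model can be steered between the four tasks, the snippet below prepends the task prefix to the payload. The helper names `build_input` and `build_search_input` are hypothetical (they are not taken from the project scripts); the layout of the search prompt mirrors the JSONL examples in this README.

```python
# Task prefixes as listed above; helper names are illustrative only.
TASK_PREFIXES = {
    "repair": "fix bug:",
    "detection": "classify code:",
    "summary": "summarize code:",
    "search": "search code:",
}


def build_input(task: str, text: str) -> str:
    """Prepend the task prefix, separated from the payload by a newline."""
    return f"{TASK_PREFIXES[task]}\n{text}"


def build_search_input(description: str, choices: list[str]) -> str:
    """Format a multiple-choice search prompt with zero-based choice indices."""
    numbered = "\n".join(f"{i}: {choice}" for i, choice in enumerate(choices))
    return build_input("search", f"{description}\n\nChoices:\n{numbered}")


print(build_input("repair", "def add(a,b): return a-b"))
print(build_search_input("Add two numbers", ["def sub(a,b)...", "def add(a,b)..."]))
```

Because the prefix is the only signal that distinguishes the tasks, it has to be identical at training and inference time.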
.
├── data/                      # Source data
│   ├── bug_dataset.json       # 402 bugs with fixes (source: TheAlgorithms/Python)
│   └── nl_pl_dataset.json     # 1479 code-docstring pairs (source: boltons, more-itertools)
│
├── final_data/                # Final datasets for training
│   ├── train.jsonl            # 1126 samples (70%)
│   ├── val.jsonl              # 241 samples (15%)
│   └── test.jsonl             # 241 samples (15%)
│
├── prepare_final_dataset.py   # Main script: creates the final datasets
│
└── README_PROJECT.md          # This file
Steps:
- Click the badge above or open `colab_finetune.ipynb` in Colab
- Go to Runtime > Change runtime type > Select GPU (T4)
- Run all cells
- Training takes ~2-3 hours on T4 GPU (free tier)
- Download trained model at the end
Advantages:
- Free GPU (Tesla T4)
- No setup required
- All dependencies pre-installed
- Save directly to Google Drive
Check your CUDA version first:
nvidia-smi
Install PyTorch (choose one):
# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# For CPU only (not recommended)
pip install torch torchvision torchaudio
Verify CUDA:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
Install the remaining dependencies:
pip install -r requirements.txt
The final dataset is already created in final_data/, but to regenerate it:
python prepare_final_dataset.py
This script:
- Loads `bug_dataset.json` and `nl_pl_dataset.json`
- Generates 4 task types
- Balances all tasks to the same number of samples (402 per task)
- Shuffles samples in batches of 4 (1 sample from each task)
- Splits into train/val/test (70/15/15)
- Saves to `final_data/` with unit tests for repair samples
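The balancing, interleaving, and splitting steps can be sketched as follows. This is an illustration under stated assumptions, not the code of `prepare_final_dataset.py`: the function name `interleave_and_split`, the seed, the rounding rule, and the shuffling of tasks inside each group of 4 are all hypothetical choices.

```python
import random


def interleave_and_split(samples_by_task, seed=42, ratios=(0.70, 0.15)):
    """Balance tasks, interleave them in groups of one sample per task, then split."""
    rng = random.Random(seed)
    tasks = sorted(samples_by_task)

    # Balance: downsample every task to the size of the smallest one.
    per_task = min(len(samples_by_task[t]) for t in tasks)
    pools = {t: rng.sample(samples_by_task[t], per_task) for t in tasks}

    # Interleave: each consecutive group holds exactly one sample from each task.
    mixed = []
    for i in range(per_task):
        group = [pools[t][i] for t in tasks]
        rng.shuffle(group)  # assumed: task order inside a group is random
        mixed.extend(group)

    # Split: train and validation sizes are rounded, the test set takes the rest.
    n_train = round(len(mixed) * ratios[0])
    n_val = round(len(mixed) * ratios[1])
    return mixed[:n_train], mixed[n_train:n_train + n_val], mixed[n_train + n_val:]


# Synthetic records using the raw per-task counts quoted in this README.
raw_counts = {"repair": 402, "detection": 804, "summary": 1479, "search": 1479}
data = {t: [{"task": t, "id": i} for i in range(n)] for t, n in raw_counts.items()}
train, val, test = interleave_and_split(data)
print(len(train), len(val), len(test))  # 1126 241 241
```

Since 1126 is not a multiple of 4, the split boundary cuts through one group, which is one plausible reason the per-task counts in each split differ by one or two.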
Each JSONL line:
{
"task": "repair|detection|summary|search",
"input": "fix bug:\n<code>",
"output": "<fixed_code>",
"tests": ["assert ..."] // Only in test.jsonl for repair task
}
Examples:
// Code Repair
{"task": "repair", "input": "fix bug:\ndef add(a,b): return a-b", "output": "def add(a,b): return a+b"}
// Bug Detection
{"task": "detection", "input": "classify code:\ndef add(a,b): return a-b", "output": "BUGGY"}
// Code Summarization
{"task": "summary", "input": "summarize code:\ndef add(a,b): return a+b", "output": "Add two numbers"}
// Code Search
{"task": "search", "input": "search code:\nAdd two numbers\n\nChoices:\n0: def sub(a,b)...\n1: def add(a,b)...", "output": "1"}
Run the fine-tuning script:
python finetune_codet5_multitask.py
Training Configuration:
- Base model: `Salesforce/codet5-base`
- Batch size: 2 per device (effective: 8 with gradient accumulation)
- Learning rate: 5e-5
- Epochs: 3
- Mixed precision (fp16): Enabled for faster training
- Checkpoints: Saved every 500 steps to `codet5_multitask_checkpoint/`
- Final model: Saved to `codet5_multitask_final/`
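To make the arithmetic behind this configuration explicit, the sketch below writes the settings as a plain dictionary whose keys follow the Hugging Face `TrainingArguments` parameter names. The value `gradient_accumulation_steps=4` is inferred from the stated effective batch size (8 / 2), a single GPU is assumed, and the 1126-sample training set size is taken from the dataset statistics in this README; none of this is copied from `finetune_codet5_multitask.py`.

```python
import math

# Keys mirror Hugging Face TrainingArguments names; values come from the list above.
train_config = {
    "output_dir": "codet5_multitask_checkpoint",
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,  # assumption: 8 effective / 2 per device
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
    "fp16": True,
    "save_steps": 500,
}

NUM_TRAIN_SAMPLES = 1126  # size of final_data/train.jsonl per this README
NUM_GPUS = 1              # assumption: single T4

effective_batch = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
    * NUM_GPUS
)
steps_per_epoch = math.ceil(NUM_TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * train_config["num_train_epochs"]

print(effective_batch, steps_per_epoch, total_steps)  # 8 141 423
```

Under these assumptions a full run has roughly 423 optimizer steps (the exact count depends on how the trainer handles the last partial batch), which is below the 500-step checkpoint interval. It may be worth checking in the training script whether intermediate checkpoints are actually written.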
After training, evaluate the model:
python evaluate_model.py
Evaluation Metrics:
- Code Search: Accuracy (exact match of choice index)
- Bug Detection: Accuracy (BUGGY vs CORRECT classification)
- Code Summarization: ROUGE-1/2/L + BLEU-4
- Code Repair: pass@1 (unit test execution) + Exact Match + BLEU-4
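For the two classification-style tasks (search and detection), accuracy is an exact string match between the generated text and the reference. The following is a minimal sketch; the function name `accuracy_with_errors` and the whitespace stripping are assumptions, not necessarily what `evaluate_model.py` does.

```python
def accuracy_with_errors(predictions, references):
    """Return exact-match accuracy plus the list of mispredicted samples."""
    errors = []
    correct = 0
    for index, (pred, ref) in enumerate(zip(predictions, references)):
        if pred.strip() == ref.strip():  # assumed: surrounding whitespace is ignored
            correct += 1
        else:
            errors.append({"index": index, "prediction": pred, "reference": ref})
    accuracy = correct / len(references) if references else 0.0
    return accuracy, errors


acc, errs = accuracy_with_errors(
    ["BUGGY", "CORRECT ", "BUGGY"],
    ["BUGGY", "CORRECT", "CORRECT"],
)
print(round(acc, 3), errs)
```

The returned error list has the same role as the per-task error files: it keeps only the samples the model got wrong, which makes manual inspection quicker.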
Output:
Results are saved to eval_outputs/:
- `search_errors.json` - Mispredicted search samples
- `detection_errors.json` - Mispredicted detection samples
- `summary_predictions.json` - All summary predictions vs references
- `repair_predictions.json` - All repair predictions with test execution results
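pass@1 Metric: The pass@1 score is the percentage of repair samples for which the generated code passes ALL of its unit tests. Because it measures functional correctness rather than textual overlap with the reference fix, it is the most meaningful of the three repair metrics.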
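A minimal sketch of how pass@1 can be computed from one generated fix per sample and its list of assertion strings (the `tests` field of the JSONL format). The function names are hypothetical, and this version runs the code in-process with `exec`, without a timeout or sandbox; a real evaluation of untrusted model output should isolate execution, for example in a subprocess with a time limit.

```python
def passes_all_tests(code: str, tests: list[str]) -> bool:
    """Run the candidate code, then every assertion, in one shared namespace."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the candidate function(s)
        for test in tests:
            exec(test, namespace)  # e.g. "assert add(2, 3) == 5"
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def pass_at_1(predictions: list[str], test_suites: list[list[str]]) -> float:
    """Fraction of samples whose single prediction passes all of its tests."""
    if not predictions:
        return 0.0
    passed = sum(passes_all_tests(c, t) for c, t in zip(predictions, test_suites))
    return passed / len(predictions)


preds = ["def add(a,b): return a+b", "def add(a,b): return a-b"]
suites = [["assert add(2, 3) == 5"], ["assert add(2, 3) == 5"]]
print(pass_at_1(preds, suites))  # 0.5
```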
| Split | Total | Repair | Detection | Summary | Search |
|---|---|---|---|---|---|
| Train | 1126 | ~281 | ~281 | ~282 | ~282 |
| Val | 241 | ~60 | ~60 | ~60 | ~61 |
| Test | 241 | ~61 | ~61 | ~60 | ~59 |
Total: 1608 samples (402 per task × 4 tasks)
Bug dataset (`data/bug_dataset.json`):
- Source: GitHub repository `TheAlgorithms/Python`
- Method: PyDriller to analyze commit history
- Validation:
  - Buggy code must FAIL tests
  - Fixed code must PASS tests
  - Tests extracted from doctests
- Result: 402 validated bugs with fixes
Code-docstring dataset (`data/nl_pl_dataset.json`):
- Sources: `boltons`, `more-itertools`, `toolz`
- Method: AST extraction + docstring parsing
- Filtering: Simple functions with clear docstrings
- Result: 1479 function-docstring pairs
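AST-based extraction of function-docstring pairs can be sketched with the standard-library `ast` module. The function name `extract_pairs`, the output keys, and the choice of the first docstring line as the summary are assumptions for illustration; the project's actual filtering rules for "simple functions with clear docstrings" are not reproduced here.

```python
import ast


def extract_pairs(source: str) -> list[dict]:
    """Collect (function source, docstring summary) pairs from a module's source."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            continue  # skip undocumented functions
        pairs.append({
            "name": node.name,
            "summary": docstring.strip().splitlines()[0],  # assumed: first line only
            "code": ast.get_source_segment(source, node),
        })
    return pairs


MODULE = '''
def add(a, b):
    """Add two numbers.

    Longer explanation that is not part of the summary.
    """
    return a + b


def helper(x):
    return x
'''

for pair in extract_pairs(MODULE):
    print(pair["name"], "->", pair["summary"])  # add -> Add two numbers.
```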
Problem: Unequal number of samples per task
- Repair: 402 samples
- Detection: 804 samples (2× repair, because each bug yields a buggy and a correct sample)
- Summary: 1479 samples
- Search: 1479 samples
Solution: Limit all tasks to 402 samples (the minimum)
Advantage: Every task contributes the same number of samples, so no single task dominates training.
- `prepare_final_dataset.py` - Main script to create the final datasets
- `data/bug_dataset.json` - Original bugs (402 samples)
- `data/nl_pl_dataset.json` - Original code-NL pairs (1479 samples)
- `final_data/train.jsonl` - Use for training (1126 samples, 70%)
- `final_data/val.jsonl` - Use for validation (241 samples, 15%)
- `final_data/test.jsonl` - Use for evaluation (241 samples, 15%)
See requirements.txt for the complete list. Key dependencies:
- PyTorch (2.0.0+) with CUDA support
- Transformers (4.30.0+) for CodeT5
- Evaluate (0.4.0+) for metrics (ROUGE, BLEU)
- pydriller for data mining (optional)
If you get OOM errors during training:
- Reduce `per_device_train_batch_size` to 1
- Increase `gradient_accumulation_steps` to maintain effective batch size
- Disable fp16: `fp16=False` in TrainingArguments
- Reduce `max_input_len` or `max_output_len` in the dataset class
If torch.cuda.is_available() returns False:
- Check NVIDIA driver: `nvidia-smi`
- Verify CUDA installation: `nvcc --version`
- Reinstall PyTorch with correct CUDA version (see Setup section)
- Check that PyTorch and CUDA versions are compatible
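The same checks can be gathered from Python in one place. This is a small diagnostic sketch (the function name `cuda_report` is hypothetical); it uses only standard PyTorch attributes and falls back gracefully when PyTorch is not installed.

```python
def cuda_report() -> dict:
    """Summarize the PyTorch/CUDA setup; safe to call without a GPU or PyTorch."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


print(cuda_report())
```

If `built_with_cuda` is `None`, the installed wheel is a CPU-only build and reinstalling with the matching index URL is the likely fix; if it is set but `cuda_available` is `False`, the driver is the more likely culprit.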
To speed up training:
- Enable fp16 mixed precision (the largest speedups need a GPU with CUDA compute capability >= 7.0, such as the T4)
- Increase batch size if GPU memory allows
- Use `torch.compile()` if PyTorch >= 2.0
- Consider using multiple GPUs with `accelerate`
- CodeT5: Salesforce/codet5-base
- Dataset Sources: TheAlgorithms/Python, boltons, more-itertools
- Evaluation: pass@1 metric for code repair with unit test execution
Academic project - Seoul National University of Science and Technology