Fine-tuning FLAN-T5-small on BoolQ and evaluating robustness under paraphrases, typos, and distractors. Includes a reproducible training and evaluation pipeline.
Key Insight: Fine-tuning small instruction-tuned models on binary QA tasks can lead to majority-class collapse, where the model learns to predict the dominant label rather than solving the task.
This project explores how small instruction-tuned language models behave when fine-tuned on binary question answering tasks.
I fine-tuned FLAN-T5-small on the BoolQ dataset and evaluated the model under several input perturbations, including typos, paraphrases, and distracting context.
The goal was to understand how well a small transformer adapts to downstream tasks and to identify potential failure modes during fine-tuning and evaluation.
Experiments use the BoolQ dataset, which consists of natural yes/no questions paired with Wikipedia passages.
Example input format:
question: Did ethanol production require more energy than it produced?
passage: Ethanol fuel production from biomass involves several energy-intensive steps...
answer yes or no
Target output:
yes
or
no
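A minimal sketch of how one BoolQ record could be turned into the prompt and target shown above. The function names are hypothetical, and joining the three parts with newlines is an assumption; the actual formatting in `train_llm.py` may differ.

```python
def format_boolq_example(question: str, passage: str) -> str:
    """Build the three-part prompt from one BoolQ record.

    Assumption: the parts are separated by newlines. The project's own
    script may use a different separator.
    """
    return (
        f"question: {question.strip()}\n"
        f"passage: {passage.strip()}\n"
        "answer yes or no"
    )


def format_target(answer: bool) -> str:
    """Map a boolean label to the text target the model is trained to emit."""
    return "yes" if answer else "no"
```

With the HuggingFace `boolq` dataset, where each record has `question`, `passage`, and a boolean `answer`, this would be called as `format_boolq_example(ex["question"], ex["passage"])` and `format_target(ex["answer"])`.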
The BoolQ validation set is imbalanced, with approximately:
| Label | Fraction |
|---|---|
| Yes | ~62% |
| No | ~38% |
This imbalance plays an important role when interpreting evaluation results.
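The effect of the imbalance can be checked directly from the labels. The sketch below (helper names are illustrative, not taken from the project code) computes label fractions and the accuracy that a constant majority-label predictor would reach.

```python
from collections import Counter


def label_fractions(labels):
    """Return the fraction of examples carrying each label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    return max(label_fractions(labels).values())
```

Any model whose validation accuracy sits close to this baseline deserves a closer look at its prediction distribution before its score is taken at face value.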
Fine-tuned model:
| Property | Value |
|---|---|
| Model | FLAN-T5-small |
| Parameters | ~80M |
| Architecture | Encoder-Decoder Transformer |
| Pretraining | Instruction-tuned FLAN mixture |
Training configuration:
| Parameter | Value |
|---|---|
| Train examples | 6000 (balanced yes/no) |
| Validation examples | 1000 |
| Learning rate | 5e-5 |
| Batch size | 16 |
| Epochs | 2 |
Training was implemented using HuggingFace Transformers.
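The balanced training set in the table above could be drawn roughly as follows. This is a sketch under assumptions: the function name, the `label_key` parameter, and the seed handling are hypothetical, and the project's actual sampling code may work differently.

```python
import random


def balanced_subset(examples, n_total, seed=0, label_key="answer"):
    """Sample n_total examples with an equal number per label.

    If n_total is not divisible by the number of labels, the result is
    slightly smaller than n_total (integer division per label).
    """
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    if not by_label:
        raise ValueError("examples must not be empty")

    per_label = n_total // len(by_label)
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    subset = []
    for label, group in by_label.items():
        if len(group) < per_label:
            raise ValueError(f"only {len(group)} examples with label {label!r}")
        subset.extend(rng.sample(group, per_label))

    rng.shuffle(subset)  # avoid presenting all of one label first
    return subset
```

Calling `balanced_subset(train_examples, 6000)` would yield 3000 examples per label, matching the configuration listed above.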
To evaluate robustness, several perturbed versions of the dataset were created.
| Dataset | Description |
|---|---|
| Clean | Original BoolQ examples |
| Typos | Character-level noise added to questions |
| Distractor | Irrelevant sentences appended to passages |
| Paraphrase | Questions rewritten with alternative phrasing |
These perturbations simulate the kinds of noisy inputs encountered by real-world NLP systems.
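Two of these perturbations are simple enough to sketch with the standard library. The functions below are illustrative assumptions, not the contents of `make_perturbations.py`: the typo variant swaps adjacent letters, which is only one possible kind of character-level noise, and the distractor variant appends one sentence from a caller-supplied pool. Paraphrasing usually needs a model or manual rewriting and is not shown.

```python
import random


def add_typos(text, rate=0.05, seed=0):
    """Swap adjacent letters with probability `rate` per eligible position."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter is not moved twice
        else:
            i += 1
    return "".join(chars)


def add_distractor(passage, distractors, seed=0):
    """Append one irrelevant sentence, chosen from `distractors`, to the passage."""
    rng = random.Random(seed)
    return passage.rstrip() + " " + rng.choice(distractors)
```

Fixing the seed per example keeps the perturbed datasets reproducible across evaluation runs.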
Balanced evaluation results (example subset):
| Model | Accuracy | Yes Predictions | No Predictions |
|---|---|---|---|
| FLAN-T5-small (base) | ~0.55 | Mixed | Mixed |
| FLAN-T5-small (fine-tuned) | ~0.50 | All | None |
Accuracy was measured on the BoolQ validation set under multiple input perturbations.
| Model | Clean | Typos | Distractor | Paraphrase |
|---|---|---|---|---|
| FLAN-T5-small (base) | 0.604 | 0.562 | 0.424 | 0.600 |
| FLAN-T5-small (fine-tuned) | 0.622 | 0.622 | 0.622 | 0.622 |
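The base model degrades under perturbation, as expected:
- typos reduce accuracy slightly (0.604 to 0.562)
- distractor sentences reduce accuracy substantially (0.604 to 0.424)
- paraphrases leave accuracy essentially unchanged (0.604 to 0.600)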
The fine-tuned model shows identical accuracy across all perturbations.
This behavior indicates that the fine-tuned model learned a constant-label prediction strategy, rather than solving the reasoning task.
Fine-tuning small instruction-tuned models on imbalanced binary QA datasets can produce degenerate decision boundaries.
In this experiment, the fine-tuned model converged to predicting the majority label ("yes") for every example.
Because the BoolQ dataset contains ~62% "yes" answers, this trivial strategy achieves:
Accuracy ≈ 0.62
As a result:
- performance appears strong on the original validation distribution
- robustness experiments show no degradation under perturbations
However, this apparent robustness is misleading: the model is not solving the task but exploiting dataset imbalance.
During training, the model sometimes converged to a degenerate strategy of always predicting "yes", the majority label in BoolQ. On the original validation distribution (~62% "yes"), this trivial strategy scores about 0.62 accuracy.
Balanced evaluation revealed the true behavior: the fine-tuned model collapsed to ~50% accuracy and predicted only the majority class.
BoolQ’s natural label imbalance allows trivial classifiers to achieve seemingly strong accuracy.
Evaluating on balanced subsets made the model’s behavior much clearer and exposed failure modes that were not visible using the original validation distribution.
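One way to make this kind of collapse visible even on the imbalanced set is to report prediction counts and per-class recall alongside accuracy. The helper below is a sketch with a hypothetical name, not the project's evaluation code.

```python
from collections import Counter


def summarize_predictions(predictions, labels):
    """Report accuracy, prediction counts, per-class recall, and balanced accuracy."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")

    correct = sum(p == y for p, y in zip(predictions, labels))
    recall = {}
    for label in sorted(set(labels)):
        support = sum(1 for y in labels if y == label)
        hits = sum(1 for p, y in zip(predictions, labels) if y == label and p == y)
        recall[label] = hits / support

    return {
        "accuracy": correct / len(labels),
        "prediction_counts": dict(Counter(predictions)),
        "recall": recall,
        # Mean of per-class recall: 0.5 for any constant predictor on two classes.
        "balanced_accuracy": sum(recall.values()) / len(recall),
    }
```

For a model that always answers "yes" on a 62/38 split, this reports accuracy 0.62 but balanced accuracy 0.5 and a recall of 0 for "no", which is the same picture the balanced subset gives.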
When evaluating with standard text generation (model.generate()), model outputs appeared reasonable.
However, likelihood-based scoring between candidate answers ("yes" vs "no") revealed that the model strongly preferred a single label.
This suggests that generative evaluation can obscure degenerate decision boundaries, especially in small models.
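Likelihood-based scoring of this kind can be sketched as follows. This is an assumed implementation, not necessarily what `eval_robustness.py` does: it relies on the fact that HuggingFace seq2seq models return the mean cross-entropy over label tokens when `labels` is passed, and the function names are hypothetical.

```python
def candidate_log_likelihoods(model, tokenizer, prompt, candidates=("yes", "no")):
    """Return the summed log-probability of each candidate answer given the prompt."""
    import torch  # imported lazily so the pure helpers below work without PyTorch

    enc = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    scores = {}
    model.eval()
    with torch.no_grad():
        for candidate in candidates:
            labels = tokenizer(candidate, return_tensors="pt").input_ids.to(model.device)
            # The loss is the mean negative log-likelihood per label token
            # (end-of-sequence token included), so scaling by the label
            # length recovers the summed log-likelihood.
            loss = model(**enc, labels=labels).loss
            scores[candidate] = -loss.item() * labels.shape[1]
    return scores


def preferred_label(scores):
    """Pick the candidate with the highest log-likelihood."""
    return max(scores, key=scores.get)


def label_margin(scores, positive="yes", negative="no"):
    """Positive values mean the model prefers `positive`; the size shows how strongly."""
    return scores[positive] - scores[negative]
```

The model and tokenizer would typically be loaded with `AutoModelForSeq2SeqLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)`. A margin that stays large and has the same sign across nearly all examples, regardless of the gold label, is a sign of the single-label preference described above.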
Experiments showed that training dynamics for small encoder-decoder models can be unstable when adapting to binary QA tasks.
Factors such as:
- dataset imbalance
- learning rate
- output tokenization
- evaluation strategy
can significantly influence the final model behavior.
Adding perturbations such as typos, paraphrases, and distracting context helps simulate real-world inputs.
However, robustness metrics can be misleading if the underlying dataset contains structural biases. Careful dataset construction and analysis are necessary for meaningful robustness evaluation.
This study focuses on a small instruction-tuned model (FLAN-T5-small).
Additional experiments could explore:
- larger models (FLAN-T5-base / large)
- different fine-tuning objectives
- classification heads instead of generative decoding
- stronger adversarial perturbations
src/
  train_llm.py
  eval_robustness.py
  make_perturbations.py
data/
  perturbed/
runs/
  model checkpoints
- Python – experiment scripting and data processing
- PyTorch – neural network training framework
- HuggingFace Transformers – loading and fine-tuning FLAN-T5
- HuggingFace Datasets – BoolQ dataset handling
- SentencePiece Tokenization – tokenization used by the T5 architecture
- Google Colab GPU (Tesla T4) – model training and experimentation
- JSON / Python data pipelines – generation of perturbed evaluation datasets
Beyond the additional experiments listed above, a further possible extension is exploring retrieval-augmented question answering systems.