LLMsForEduQG

This repo contains the replication package of our paper Towards Reliable LLM-based Exam Generation. Lessons Learned and Open Challenges in an Industrial Project accepted and the Industry Showcase Track of ASE 2025.

Requirements

Python 3
an Open AI Key to run the GPT models.
a Hugging Face API Key to run Llama, Mistral and DeepSeek models. You can adapt the code if these are locally deployed.

Installation

You can install and run the experiments with the following commands:

git clone https://github.com/rdegiovanni/LLMsForEduQG.git
cd LLMsForEduQG
pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
export OPENAI_API_KEY=...
export API_KEY_HUGGINGFACE=...

./LLMsForEduQG.sh -h

usage: LLMsForEduQG [-h] [-i INPUT] [-d DIR] [-N MAX] [-r RANDOM] [-m MODELS] [-sw PREFIX] [-p PROMPTS] [-ll] [-pl]
Testing framework to assess the performance of LLMs in generating multiple-choice education questions.
options:
-h, --help 								show this help message and exit
-i INPUT, --input INPUT 				Input .csv file with the ground truth questions.
-d DIR, --dir DIR, -o DIR, --out DIR	Output directory.
-N MAX, -n MAX, --max MAX				Maximum number of questions analysed from the dataset.
-r RANDOM, --random RANDOM				Random order of analysed questions.
-m MODELS, --models MODELS				Select the LLMs to run.
-sw PREFIX, --starts-with PREFIX		Selects all LLMs starting with the <prefix>.
-p PROMPTS, --prompts PROMPTS			Select the prompts to use.
-ll, --llms, --llm-list					List the supported LLMs.
-pl, --prompt-list						List the available prompts.

./LLMsForEduQG.sh -i datasets/SciQ_test.csv -o results/Llama3170Instruct -m "Llama3170Instruct"

Results

You can download our empirical results from the experiments folder: LLMs4EduQG_results.zip.

In the zip file, you will find two folders with the generated questions and analysis for the two datasets used in our experiments, namely, SciQ and the Canterbury Question Bank datasets. For each dataset, we provide the following information:

generated_questions.csv: raw data extracted from all LLM executions.
clean_generated_questions.csv: clean data only from the successful executions.
summary.csv: for each LLM, the average values for all the effectiveness metrics.
statistics.csv: results of the statistical tests between the different prompts.
metric.pdf: for each metric, we provide a boxplot to ease their analysis and comparison.
choices_quality_analysis.csv: analysis of the exact matches w.r.t. the ground truth for each generated question.
choices_summary.csv: summarises the exact matches results for each LLM and prompt.

Cite this paper

If you use our tool or results in your research, please cite our paper:

@inproceedings{DegiovanniCabot2025,
  author    = {Renzo Degiovanni and
	       Jordi Cabot},
  title     = {Towards Reliable LLM-based Exam Generation. Lessons Learned and Open Challenges in an Industrial Project,
  booktitle = {Industry Showcase Track of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), Seoul, South-Korea, 2025},
  publisher = {{IEEE}},
  year      = {2025},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLMsForEduQG

Requirements

Installation

Results

Cite this paper

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
LLMsForEduQG		LLMsForEduQG
datasets		datasets
experiments		experiments
LLMsForEduQG.sh		LLMsForEduQG.sh
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

LLMsForEduQG

Requirements

Installation

Results

Cite this paper

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages