ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering


📢 Latest Updates

  • [2026/04/24] 🤗 Inference code and weights released
  • [2026/04/09] 🌟 ReAG has been selected as a Highlight
  • [2026/02/21] 📚 ReAG has been accepted @ CVPR 2026

Overview

ReAG is a Reasoning-Augmented Multimodal RAG approach for Knowledge-based VQA. Standard retrieval-augmented methods often retrieve noisy or irrelevant passages, limiting answer quality. ReAG addresses this by combining coarse- and fine-grained retrieval with a critic model that filters out low-quality passages before answer generation. The model is trained with a multi-stage strategy: supervised fine-tuning as a cold start, followed by reinforcement learning to promote explicit reasoning grounded in retrieved evidence. ReAG significantly outperforms prior methods on Encyclopedic-VQA and InfoSeek.

This repository contains the inference pipeline for ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering (KB-VQA). The model checkpoints are hosted on Hugging Face.


Environment Setup

The following code was tested using:

  • Python 3.11
  • CUDA 12.6 (for GPU inference)

We use two separate environments:

  1. .venv — main inference environment (PyTorch + Transformers)
  2. evqa-eval — EVQA evaluation environment (TensorFlow stack), kept separate due to TF/PyTorch conflicts

1) Create the inference environment

python3.11 -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip

# Example: Torch wheels for CUDA 12.6
python -m pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 \
  --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
python -m pip install -r requirements.txt

If you use conda, you can install the packages from requirements.txt in an activated conda environment.
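
Optionally, you can sanity-check that the CUDA-enabled PyTorch build is active before moving on (this check is not part of the original setup steps):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"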

2) Create the EVQA evaluation environment

This environment is only needed to run the EVQA evaluation scripts. The TensorFlow dependencies conflict with the main inference stack, so this environment is kept separate.

python3.11 -m venv evqa_eval/.venv
source evqa_eval/.venv/bin/activate

python -m pip install --upgrade pip
python -m pip install -r src/retrieval_module/evqa_eval/requirements.txt
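
Optionally, a quick import check (not part of the original instructions) confirms the TensorFlow stack installed correctly:

python -c "import tensorflow as tf; print(tf.__version__)"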

Datasets

We provide pre-packaged evaluation data for both benchmarks so you can get started quickly without navigating the original dataset repositories.

Infoseek

Download the evaluation data for Infoseek here. For the full dataset, refer to the official repository.

The inference scripts expect Infoseek in JSONL format (--query_path points to a .jsonl file).

Encyclopedic-VQA

Download the evaluation data for Encyclopedic-VQA here, and the evaluation images here. For the full dataset, refer to the official repository.

The inference scripts expect EVQA in JSON format (--query_path points to a .json file).
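
To verify that the downloads match the expected formats, you can inspect the first record of each file. The filenames below are placeholders; use whatever the downloaded archives actually contain:

# Infoseek: JSONL, one query per line (placeholder filename)
head -n 1 infoseek_eval.jsonl | python -m json.tool

# EVQA: a single JSON file (placeholder filename)
python -m json.tool evqa_eval.json | head -n 40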

Knowledge Bases and FAISS Indexes

Our work uses two knowledge bases, one per benchmark. To enhance reproducibility, we provide both the knowledge bases and the pre-built FAISS indexes for the best configuration presented in the paper. Embeddings are generated using the EVA-CLIP model.

  • Infoseek — index available here
  • Encyclopedic-VQA — index available here

Inference

Once datasets and indexes are in place, unzip all archives and update the paths in the .sh scripts to match your local filesystem. We provide two ready-to-use scripts: retrieval_evqa.sh and retrieval_infoseek.sh.

These scripts are written as Slurm jobs. For local runs, remove the Slurm directives and the srun prefix while keeping the rest of the command unchanged.

Slurm (optional)

If you are on an HPC cluster with Slurm, submit the scripts directly after editing the asset paths:

sbatch retrieval_evqa.sh
sbatch retrieval_infoseek.sh
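
Job progress can then be checked with the standard Slurm tools (generic Slurm commands, not specific to this repository):

squeue -u $USER      # list your queued and running jobs
sacct -j JOBID       # accounting summary for a submitted job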

What the scripts do

Both scripts invoke retrieval.py with a common set of flags representing the full end-to-end ReAG pipeline:

  • --model_name "aimagelab/ReAG-3B"
  • --top_k 20
  • --force_reasoning + --extract_reasoning
  • --crop_query_img
  • --critic_model_name "aimagelab/ReAG-Critic"
  • --eval_passages + --yes_prob_thr 0.1
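
Putting the common flags together, a local (non-Slurm) run is roughly equivalent to the sketch below. The dataset and output paths are placeholders to adapt to your filesystem, and any additional arguments (e.g. for the knowledge base or FAISS index) should be copied from the provided .sh scripts rather than from this sketch:

python retrieval.py \
  --model_name "aimagelab/ReAG-3B" \
  --critic_model_name "aimagelab/ReAG-Critic" \
  --top_k 20 \
  --force_reasoning --extract_reasoning \
  --crop_query_img \
  --eval_passages --yes_prob_thr 0.1 \
  --query_path /path/to/evqa_eval.json \
  --output_root /path/to/outputs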

You can freely add or remove flags inside the .sh files to configure your experiment. Notable options include:

  • --few_shots — Infoseek few-shot prompting
  • --eval_passages — critic-based passage filtering
  • --use_google_lens or --use_oracle — EVQA retrieval variants

Outputs

Results are written under the --output_root directory. The script automatically constructs an experiment folder name based on the dataset, model, retrieval setup, and active flags.

Citation

If you use this code, please cite our CVPR 2026 paper:

@inproceedings{compagnoni2026reag,
  title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
  author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference},
  year={2026}
}
