MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg
🔗 Project Page 📄 Paper
The folder mustreason/ contains the annotated reasoning comprising the MUStReason dataset.
This repository contains experiment scripts for the MUStReason paper workflows.
inputs/: required CSV inputs.methods/: main prompting strategies.ablations/: LLM and decoding ablations.
You need:
inputs/sarcasm_explanation_70B.csvinputs/mustard-sarcasm-labelled.csvutterance_videos/folder (video files, e.g.*.mp4) at the root of this project folder.
For the video file requirements, download the MUStARD++ Balanced dataset from here
- Install dependencies using either
environment.ymlorrequirements.txt. - For VITA, ShareGPT4Video, and VideoGPT+, install dependencies from their respective upstream repositories, since they require additional repo-specific setup.
File: methods/zero_shot_wo_reasoning.py
Meaning:
- Single-pass classification prompt without explicit step-by-step reasoning constraints.
Run:
python methods/zero_shot_wo_reasoning.pyFile: methods/zero_shot_w_reasoning.py
Meaning:
- Single-pass classification with an explicit reasoning-oriented instruction.
Run:
python methods/zero_shot_w_reasoning.pyFile: methods/few_shot.py
Meaning:
- In-context learning with few-shot examples prepended before the target sample.
- Implemented only for models that can process multiple videos in one conversation.
Run:
python methods/few_shot.py4) MMCoT (Structured multimodal CoT inspired from: MultimodalCoT)
File: methods/mmcot.py
Meaning:
- Multi-step cue extraction (visual/audio/facial/utterance as applicable), then final reasoning + label prediction.
- For
sharegpt,video-llava,videogpt,vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.
Run:
python methods/mmcot.pyFile: methods/pragcot.py
Meaning:
- Pragmatic multi-step prompting: extract modality cues, decode dialogue intent, then perform intent-aware final reasoning.
- For
sharegpt,video-llava,videogpt,vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.
Run:
python methods/pragcot.pyInside each method file, edit:
RUN_MODELS = ["all"]to run all supported models, orRUN_MODELS = ["qwenomni"](example) to run one model.
Outputs are written under outputs/<method_name>/<model_name>/.
File: ablations/ablation_wo_face_decoding.py
Meaning:
- Keeps face decoding query step but excludes
response_4contribution from final reasoning prompt.
Run:
python ablations/ablation_wo_face_decoding.pyFile: ablations/ablation_wo_intent_decoding.py
Meaning:
- Runs intent decoding query but excludes
response_6contribution from final reasoning prompt.
Run:
python ablations/ablation_wo_intent_decoding.pyFile: ablations/llm_ablation.py
Meaning:
- Uses descriptor table (
sarcasm_explanation_70B.csv) and runs text-only LLM inference across:- Llama-3.1-8B-Instruct
- Mistral-7B-Instruct-v0.3
- Qwen3-8B
Run:
python ablations/llm_ablation.pyFor questions, feel free to contact:
Anisha Saha
📧 ansaha@mpi-inf.mpg.de
🔗 anisha0325.github.io
If you use this work in your research, please cite:
@misc{saha2025mustreasonbenchmarkdiagnosingpragmatic,
title={MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection},
author={Anisha Saha and Varsha Suresh and Timothy Hospedales and Vera Demberg},
year={2025},
eprint={2510.23727},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.23727},
}