MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg

📌 Overview

MUStReason provides reasoning-aligned annotations that enable detailed evaluation of modality perception and inference failure which are key factors for assessing and improving pragmatic reasoning in multimodal models. In addition, the paper introduces PragCoT, a pragmatic reasoning framework for developing complex reasoning abilities in Video-LMs, through the integration of information from multiple modalities and the systematic resolution of sub-problems.

Dataset

The folder mustreason/ contains the annotated reasoning comprising the MUStReason dataset.

Benchmarking Code

This repository contains experiment scripts for the MUStReason paper workflows.

Folder Structure

inputs/: required CSV inputs.
methods/: main prompting strategies.
ablations/: LLM and decoding ablations.

Required Data

You need:

inputs/sarcasm_explanation_70B.csv
inputs/mustard-sarcasm-labelled.csv
utterance_videos/ folder (video files, e.g. *.mp4) at the root of this project folder.

For the video file requirements, download the MUStARD++ Balanced dataset from here

Environment Setup

Install dependencies using either environment.yml or requirements.txt.
For VITA, ShareGPT4Video, and VideoGPT+, install dependencies from their respective upstream repositories, since they require additional repo-specific setup.

Methods

1) Zero-shot (without reasoning)

File: methods/zero_shot_wo_reasoning.py

Meaning:

Single-pass classification prompt without explicit step-by-step reasoning constraints.

Run:

python methods/zero_shot_wo_reasoning.py

2) Zero-shot (with reasoning)

File: methods/zero_shot_w_reasoning.py

Meaning:

Single-pass classification with an explicit reasoning-oriented instruction.

Run:

python methods/zero_shot_w_reasoning.py

3) Few-shot

File: methods/few_shot.py

Meaning:

In-context learning with few-shot examples prepended before the target sample.
Implemented only for models that can process multiple videos in one conversation.

Run:

python methods/few_shot.py

4) MMCoT (Structured multimodal CoT inspired from: MultimodalCoT)

File: methods/mmcot.py

Meaning:

Multi-step cue extraction (visual/audio/facial/utterance as applicable), then final reasoning + label prediction.
For sharegpt, video-llava, videogpt, vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.

Run:

python methods/mmcot.py

5) PragCoT

File: methods/pragcot.py

Meaning:

Pragmatic multi-step prompting: extract modality cues, decode dialogue intent, then perform intent-aware final reasoning.
For sharegpt, video-llava, videogpt, vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.

Run:

python methods/pragcot.py

Model Selection in Method Scripts

Inside each method file, edit:

RUN_MODELS = ["all"] to run all supported models, or
RUN_MODELS = ["qwenomni"] (example) to run one model.

Outputs are written under outputs/<method_name>/<model_name>/.

Ablations

1) Ablation without face decoding

File: ablations/ablation_wo_face_decoding.py

Meaning:

Keeps face decoding query step but excludes response_4 contribution from final reasoning prompt.

Run:

python ablations/ablation_wo_face_decoding.py

2) Ablation without intent decoding

File: ablations/ablation_wo_intent_decoding.py

Meaning:

Runs intent decoding query but excludes response_6 contribution from final reasoning prompt.

Run:

python ablations/ablation_wo_intent_decoding.py

3) LLM-only ablation (text-only multimodal descriptors)

File: ablations/llm_ablation.py

Meaning:

Uses descriptor table (sarcasm_explanation_70B.csv) and runs text-only LLM inference across:
- Llama-3.1-8B-Instruct
- Mistral-7B-Instruct-v0.3
- Qwen3-8B

Run:

python ablations/llm_ablation.py

📬 Contact

For questions, feel free to contact:

Anisha Saha
📧 ansaha@mpi-inf.mpg.de
🔗 anisha0325.github.io

📚 Citation

If you use this work in your research, please cite:

@misc{saha2025mustreasonbenchmarkdiagnosingpragmatic,
      title={MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection}, 
      author={Anisha Saha and Varsha Suresh and Timothy Hospedales and Vera Demberg},
      year={2025},
      eprint={2510.23727},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.23727}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

📌 Overview

Dataset

Benchmarking Code

Folder Structure

Required Data

Environment Setup

Methods

1) Zero-shot (without reasoning)

2) Zero-shot (with reasoning)

3) Few-shot

4) MMCoT (Structured multimodal CoT inspired from: MultimodalCoT)

5) PragCoT

Model Selection in Method Scripts

Ablations

1) Ablation without face decoding

2) Ablation without intent decoding

3) LLM-only ablation (text-only multimodal descriptors)

📬 Contact

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
ablations		ablations
inputs		inputs
methods		methods
mustreason		mustreason
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

📌 Overview

Dataset

Benchmarking Code

Folder Structure

Required Data

Environment Setup

Methods

1) Zero-shot (without reasoning)

2) Zero-shot (with reasoning)

3) Few-shot

4) MMCoT (Structured multimodal CoT inspired from: MultimodalCoT)

5) PragCoT

Model Selection in Method Scripts

Ablations

1) Ablation without face decoding

2) Ablation without intent decoding

3) LLM-only ablation (text-only multimodal descriptors)

📬 Contact

📚 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages