Skip to content

anisha0325/MUStReason

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg

🔗 Project Page 📄 Paper


📌 Overview

MUStReason provides reasoning-aligned annotations that enable detailed evaluation of modality perception and inference failure which are key factors for assessing and improving pragmatic reasoning in multimodal models. In addition, the paper introduces PragCoT, a pragmatic reasoning framework for developing complex reasoning abilities in Video-LMs, through the integration of information from multiple modalities and the systematic resolution of sub-problems.


Dataset

The folder mustreason/ contains the annotated reasoning comprising the MUStReason dataset.

Benchmarking Code

This repository contains experiment scripts for the MUStReason paper workflows.

Folder Structure

  • inputs/: required CSV inputs.
  • methods/: main prompting strategies.
  • ablations/: LLM and decoding ablations.

Required Data

You need:

  • inputs/sarcasm_explanation_70B.csv
  • inputs/mustard-sarcasm-labelled.csv
  • utterance_videos/ folder (video files, e.g. *.mp4) at the root of this project folder.

For the video file requirements, download the MUStARD++ Balanced dataset from here

Environment Setup

  • Install dependencies using either environment.yml or requirements.txt.
  • For VITA, ShareGPT4Video, and VideoGPT+, install dependencies from their respective upstream repositories, since they require additional repo-specific setup.

Methods

1) Zero-shot (without reasoning)

File: methods/zero_shot_wo_reasoning.py

Meaning:

  • Single-pass classification prompt without explicit step-by-step reasoning constraints.

Run:

python methods/zero_shot_wo_reasoning.py

2) Zero-shot (with reasoning)

File: methods/zero_shot_w_reasoning.py

Meaning:

  • Single-pass classification with an explicit reasoning-oriented instruction.

Run:

python methods/zero_shot_w_reasoning.py

3) Few-shot

File: methods/few_shot.py

Meaning:

  • In-context learning with few-shot examples prepended before the target sample.
  • Implemented only for models that can process multiple videos in one conversation.

Run:

python methods/few_shot.py

4) MMCoT (Structured multimodal CoT inspired from: MultimodalCoT)

File: methods/mmcot.py

Meaning:

  • Multi-step cue extraction (visual/audio/facial/utterance as applicable), then final reasoning + label prediction.
  • For sharegpt, video-llava, videogpt, vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.

Run:

python methods/mmcot.py

5) PragCoT

File: methods/pragcot.py

Meaning:

  • Pragmatic multi-step prompting: extract modality cues, decode dialogue intent, then perform intent-aware final reasoning.
  • For sharegpt, video-llava, videogpt, vita, prompt-2 and prompt-4 are omitted as they cannot process audio. Text utterances are provided explicitly to these models.

Run:

python methods/pragcot.py

Model Selection in Method Scripts

Inside each method file, edit:

  • RUN_MODELS = ["all"] to run all supported models, or
  • RUN_MODELS = ["qwenomni"] (example) to run one model.

Outputs are written under outputs/<method_name>/<model_name>/.

Ablations

1) Ablation without face decoding

File: ablations/ablation_wo_face_decoding.py

Meaning:

  • Keeps face decoding query step but excludes response_4 contribution from final reasoning prompt.

Run:

python ablations/ablation_wo_face_decoding.py

2) Ablation without intent decoding

File: ablations/ablation_wo_intent_decoding.py

Meaning:

  • Runs intent decoding query but excludes response_6 contribution from final reasoning prompt.

Run:

python ablations/ablation_wo_intent_decoding.py

3) LLM-only ablation (text-only multimodal descriptors)

File: ablations/llm_ablation.py

Meaning:

  • Uses descriptor table (sarcasm_explanation_70B.csv) and runs text-only LLM inference across:
    • Llama-3.1-8B-Instruct
    • Mistral-7B-Instruct-v0.3
    • Qwen3-8B

Run:

python ablations/llm_ablation.py

📬 Contact

For questions, feel free to contact:

Anisha Saha
📧 ansaha@mpi-inf.mpg.de
🔗 anisha0325.github.io


📚 Citation

If you use this work in your research, please cite:

@misc{saha2025mustreasonbenchmarkdiagnosingpragmatic,
      title={MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection}, 
      author={Anisha Saha and Varsha Suresh and Timothy Hospedales and Vera Demberg},
      year={2025},
      eprint={2510.23727},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.23727}, 
}

About

[LREC 2026] MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages