Data and code for multimodal sentence acceptability judgment.
Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin. Predicting Sentence Acceptability Judgments in Multimodal Contexts. To appear at CMCL 2026 (arXiv preprint).
Human participants and vision-language models (VLMs) rated the acceptability of English sentences on a scale from 1 (very unnatural) to 4 (very natural). Each sentence was preceded by a relevant visual context (R), an irrelevant visual context (I), or no visual context (N).
Human acceptability judgments on 75 original English sentences taken from News, Books, and Wikipedia, plus 225 back-translated versions of those sentences.
GPT-5-generated images depicting the 75 English sentences.
Sentence acceptability ratings from 7 VLMs (InternVL3-1B, InternVL3-8B, Qwen2.5-3B, Qwen2.5-7B, llava-1.5-7b, gpt-4o & gpt-4o-mini), averaged across multiple attempts (seeds) for each sentence.
Logits extracted for each sentence preceded by a relevant, irrelevant, or null visual context, for the 5 open-source VLMs, with multiple attempts (seeds) for each sentence.
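For reference, one common way to turn such extracted logits into a sentence-level score that is comparable across sentence lengths is the mean per-token log probability. This is only a sketch of that normalization; the normalization actually used in the repo may differ (e.g. SLOR or other length-penalized variants):

```python
import numpy as np

def length_normalized_logprob(token_logprobs):
    """Mean per-token log probability of one sentence.

    token_logprobs: per-token log probabilities extracted from a VLM's
    output logits for one sentence (input format is an assumption).
    Dividing by length makes scores comparable across sentence lengths.
    """
    lp = np.asarray(token_logprobs, dtype=float)
    return float(lp.sum() / lp.size)
```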
Code for sentence acceptability ratings by gpt-4o & gpt-4o-mini.
Code for sentence acceptability ratings by InternVL3-1B, InternVL3-8B, Qwen2.5-3B, Qwen2.5-7B & llava-1.5-7b.
Code for logit extractions from open-source models for each sentence following relevant, irrelevant, and null visual contexts.
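As an illustration of the core step such extraction code performs, the sketch below converts a causal LM's output logits into per-token log probabilities with NumPy. The repo's scripts presumably work with each model's own framework and tensors; the array shapes and function name here are assumptions:

```python
import numpy as np

def token_logprobs(logits, input_ids):
    """Log probability of each observed token under a causal LM.

    logits: (seq_len, vocab_size) array from one forward pass.
    input_ids: (seq_len,) integer token ids of the same sequence.
    Position t's logits predict token t+1, hence the one-step shift.
    """
    logits = np.asarray(logits, dtype=float)[:-1]   # drop the last position
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logsumexp = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    logprobs = shifted - logsumexp                  # (seq_len-1, vocab_size)
    targets = np.asarray(input_ids)[1:]             # the next tokens
    return logprobs[np.arange(targets.size), targets]
```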
Pearson and Spearman correlations between [human ratings ~ model ratings], [human ratings ~ normalized model logprobs], [model ratings ~ normalized model logprobs].
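These correlation analyses can be reproduced with `scipy.stats`; a minimal sketch (the function name and return format are illustrative, not the repo's API):

```python
from scipy.stats import pearsonr, spearmanr

def correlate(x, y):
    """Pearson (linear) and Spearman (rank) correlations between two
    series, e.g. human ratings vs. model ratings for the same sentences."""
    r, r_p = pearsonr(x, y)
    rho, rho_p = spearmanr(x, y)
    return {"pearson_r": r, "pearson_p": r_p,
            "spearman_rho": rho, "spearman_p": rho_p}
```

Spearman is computed on ranks, so it captures monotone but nonlinear agreement that Pearson misses.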
Total least squares regressions between ratings in each condition pair ([N-R], [N-I], [R-I]).
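Total least squares (orthogonal) regression minimizes perpendicular distances, treating the ratings in both conditions as noisy, unlike ordinary least squares, which assumes the x-variable is exact. A minimal SVD-based sketch of the technique (not necessarily the repo's exact implementation):

```python
import numpy as np

def tls_line(x, y):
    """Total least squares fit of y = a*x + b.

    Minimizes perpendicular distances to the line, so both x and y
    (e.g. ratings in two context conditions) are treated as noisy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([x - x.mean(), y - y.mean()])
    # first right-singular vector = direction of maximal variance
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    dx, dy = vt[0]
    a = dy / dx
    b = y.mean() - a * x.mean()
    return a, b
```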