Code and data for analysing narrative coherence in visually grounded stories from the Visual Writing Prompts (VWP) corpus. We compare human-written narratives with stories generated by five vision-language models (VLMs) using a set of metrics combined into a Narrative Coherence Score (NCS).
When running on Alvis, use the existing environment:

```bash
module purge
module load virtualenv/20.29.2-GCCcore-14.2.0
module load Python/3.13.1-GCCcore-14.2.0
source /mimer/NOBACKUP/groups/naiss2025-22-1187/coherence-tacl/envs/coherence_tacl/bin/activate
```

The pipeline has six stages: (1) data preparation, (2) story generation, (3) post-processing, (4) coreference resolution, (5) metric computation, and (6) analysis. The sections below describe each stage.
We sampled 60 visual story sequences from VWP. Download the full story images and character images:
```bash
cd data
python download_data.py --csv-file ./vwp-acl2025-subset.csv --output-dir ./sampled_60
```

Two prompt conditions are used:

- Short prompt (`data/prompts/prompt-original-w-names.txt`): follows the original VWP data-collection instructions. Human stories come from the VWP corpus; model stories are newly generated.
- Long prompt (`data/prompts/prompt-large-w-names.txt`): provides more explicit guidance emphasising tellability, coherence, and character consistency. Model stories are generated with this prompt; additional human stories were collected via Amazon Mechanical Turk.
Stories were generated for all 60 sequences using five VLMs under both prompt conditions. Each model received the story image sequence together with character images and names.
| Model | Type | Instructions |
|---|---|---|
| InternVL3-78B | Open-source | models/internvl3/README.md |
| Qwen3-VL-235B | Open-source | models/qwen3vl/README.md |
| Llama-4-Scout | Open-source | models/llama4scout/README.md |
| Claude 4.5 Sonnet | Proprietary | models/claude/claude45.ipynb |
| GPT-4o | Proprietary | models/gpt/gpt4o.ipynb |
All models used `temperature=0.6` and `max_tokens=4096`.
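For the proprietary APIs, a generation request with these decoding settings might be assembled as below. This is a hedged sketch: the message structure follows the common OpenAI-style chat format, and the helper name, model string, and image encoding are illustrative assumptions, not the repository's actual code.

```python
# Hedged sketch: building one chat-style story-generation request with the
# decoding settings stated above (temperature=0.6, max_tokens=4096).
# All field names besides those two values are assumptions.
import base64
import json

def build_request(prompt_text: str, image_bytes_list: list[bytes],
                  model: str = "gpt-4o") -> dict:
    """Assemble a request payload (hypothetical helper, not from the repo)."""
    content = [{"type": "text", "text": prompt_text}]
    for img in image_bytes_list:
        b64 = base64.b64encode(img).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.6,   # decoding settings from this README
        "max_tokens": 4096,
    }

payload = build_request("Write a story for these images.", [b"\xff\xd8fake"])
print(json.dumps(payload["messages"][0]["content"][0]))
```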
Long-prompt human stories were collected via AMT (3 descriptions per sequence, 180 total). See mturk/README.md for the full recruitment criteria, workflow, and payment details.
After generating stories with all models, process the outputs for analysis:
Step 1 — Collect model outputs into a single JSON:
```bash
cd scripts
python collect_data.py \
    --qwen3vl-out ../models/qwen3vl/out-qwen3vl-60stories/ \
    --internvl3-out ../models/internvl3/out-internvl3-60stories/ \
    --llama4-out ../models/llama4scout/out-llama4scout-60stories/ \
    --gpt4o-out ../models/gpt/out-gpt4o-60stories/ \
    --claude45-out ../models/claude/out-claude45-60stories/ \
    --human-large-csv ../notebooks/collected_60.csv \
    --human-original-csv ../data/vwp-acl2025-subset.csv \
    --output-json ../data/post-processing/collected_outputs.json
```

Step 2 — Clean and normalise story texts (remove reasoning traces, meta-commentary, and formatting artifacts; standardise `[SEP]` markers):

```bash
python clean_data.py \
    --input-json ../data/post-processing/collected_outputs.json \
    --output-json ../data/post-processing/cleaned_outputs.json
```

Coreference chains are extracted using the Link-Append model. This is a prerequisite for the coreference, character persistence, and multimodal character grounding metrics.
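The kind of normalisation `clean_data.py` performs in Step 2 can be sketched as follows. The specific patterns (reasoning-trace tags, meta-commentary prefixes, markdown artifacts) are illustrative assumptions; only the goal of standardising `[SEP]` markers is taken from this README.

```python
# Hedged sketch of Step 2-style cleaning. The exact patterns below are
# assumptions for illustration, not the repository's actual rules.
import re

def clean_story(text: str) -> str:
    # Drop <think>...</think>-style reasoning traces (assumed pattern).
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Drop leading meta-commentary like "Here is the story:" (assumed).
    text = re.sub(r"^\s*Here is (the|a) story:?\s*", "", text,
                  flags=re.IGNORECASE)
    # Remove markdown emphasis/heading artifacts.
    text = text.replace("**", "").replace("##", "")
    # Standardise segment separators: any [SEP]-like marker -> " [SEP] ".
    text = re.sub(r"\s*\[\s*SEP\s*\]\s*", " [SEP] ", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

raw = "Here is the story: **Once** upon a time.[SEP]They met. [ SEP ] The end."
print(clean_story(raw))
# -> "Once upon a time. [SEP] They met. [SEP] The end."
```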
Step 3 — Prepare LinkAppend inputs:
```bash
python prepare_linkappend_inputs.py \
    --input-json ../data/post-processing/cleaned_outputs.json \
    --output-dir ../models/linkappend/data-in
```

This creates separate JSON files for each model/prompt/seed combination.
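The per-combination splitting can be sketched as below. The record fields (`model`, `prompt`, `seed`, `story`) and the output filename scheme are assumptions; only the grouping key comes from this README.

```python
# Hedged sketch: split a combined outputs list into one JSON file per
# (model, prompt, seed) combination. Field names and the filename
# pattern are illustrative assumptions.
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def split_by_combination(records: list[dict], out_dir: Path) -> list[Path]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["model"], rec["prompt"], rec["seed"])].append(rec)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for (model, prompt, seed), recs in sorted(groups.items()):
        path = out_dir / f"{model}_{prompt}_seed{seed}.json"
        path.write_text(json.dumps(recs, indent=2))
        paths.append(path)
    return paths

records = [
    {"model": "qwen3vl", "prompt": "short", "seed": 0, "story": "..."},
    {"model": "qwen3vl", "prompt": "long", "seed": 0, "story": "..."},
    {"model": "gpt4o", "prompt": "short", "seed": 0, "story": "..."},
]
paths = split_by_combination(records, Path(tempfile.mkdtemp()))
print([p.name for p in paths])
```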
Step 4 — Run LinkAppend (SLURM job, 1 A100 GPU):
```bash
cd ../models/linkappend
sbatch linkappend-run.slurm
```

This processes all JSON files in `data-in/` and writes coreference-annotated CoNLL output to `data-out/`.
Step 5 — Convert CoNLL to JSON:
```bash
cd ../../scripts
sbatch conll2json-corefconversion.sh
```

This converts `.conll` files from `data-out/` subdirectories into jsonlines files in `data-out/conll_to_json/`.
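The conversion turns the CoNLL coreference column into mention clusters. A minimal parser for that column, as a simplified sketch (the actual script handles full multi-column CoNLL rows and document boundaries):

```python
# Hedged sketch: parse a CoNLL-style coreference column into clusters of
# (start, end) token spans. Tags look like "(0)", "(1", "1)", or "-";
# "|" separates multiple tags on one token. Simplified for illustration.
from collections import defaultdict

def conll_to_clusters(rows):
    """rows: list of (token, coref_tag) pairs, e.g. ("Anna", "(0)")."""
    open_spans = defaultdict(list)   # cluster id -> stack of start indices
    clusters = defaultdict(list)     # cluster id -> list of (start, end)
    for i, (_tok, tag) in enumerate(rows):
        if tag == "-":
            continue
        for part in tag.split("|"):
            cid = int(part.strip("()"))
            if part.startswith("("):
                open_spans[cid].append(i)
            if part.endswith(")"):
                start = open_spans[cid].pop()
                clusters[cid].append((start, i))
    return dict(clusters)

rows = [("Anna", "(0)"), ("met", "-"), ("her", "(0)"),
        ("old", "(1"), ("friend", "1)"), (".", "-")]
print(conll_to_clusters(rows))
# -> {0: [(0, 0), (2, 2)], 1: [(3, 4)]}
```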
We compute five coherence metrics, each targeting a different aspect of narrative organisation. Metric computation notebooks and scripts are in analysis/ and scripts/.
| Metric | Code |
|---|---|
| Coreference | analysis/coreference_profiles.ipynb |
| Implicit discourse relation typology | scripts/run_trainer.py, scripts/train-implicit-no-rst.sh, analysis/implicit_connectives_profiles.ipynb |
| Topic switch | scripts/topic_modelling/, analysis/topic_modelling_profiles.ipynb |
| Character persistence | scripts/character_persistence/, analysis/character_profiles.ipynb |
| Multimodal character grounding | scripts/mcg/, analysis/mcc_profile.ipynb, analysis/groovist_profile.ipynb, analysis/mci_profile.ipynb |
Each metric is computed at the story level and transformed with `tanh` before aggregation.
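The `tanh` step squashes raw story-level values into the common bounded range (-1, 1) before aggregation. A minimal sketch; the raw scale of each metric is an assumption here:

```python
# Hedged sketch: tanh maps an unbounded raw metric value into (-1, 1),
# so metrics on different scales become comparable before aggregation.
# The raw values below are made up for illustration.
import math

def squash(raw_value: float) -> float:
    return math.tanh(raw_value)

raw_scores = [0.2, 1.5, 4.0]
print([round(squash(v), 3) for v in raw_scores])
# -> [0.197, 0.905, 0.999]
```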
We classify implicit discourse relations between adjacent story segments using DeDisCo, an instruction-tuned Qwen3-4B model. See notebooks/create_inputs_for_implicit_connectives.ipynb for input preparation and notebooks/discourse_relation_types.ipynb for relation type analysis.
A single BERTopic model is trained on the combined corpus and applied under multiple topic granularities (nr_topics from 80 to 5, step 5). See scripts/topic_modelling/README.md for pipeline details, scripts/topic_modelling/technical_details.md for configuration, and notebooks/create_inputs_for_berttopic.ipynb for input preparation.
The five metrics are combined into a Narrative Coherence Score (NCS) in two variants:
- Arithmetic mean (`NCS_arith`): unweighted average across metrics.
- Geometric mean (`NCS_geom`): penalises imbalance; it is lower when a story is strong on some metrics but weak on others.
Computation: analysis/ncs_score.ipynb
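The two variants can be sketched as below. The geometric mean here assumes metric scores have been rescaled to be strictly positive; the notebook's exact rescaling is not stated in this README.

```python
# Hedged sketch of the two NCS variants. Assumption: scores are strictly
# positive (required for the geometric mean); the example values are
# made up to show the imbalance penalty.
import math

def ncs_arith(scores):
    return sum(scores) / len(scores)

def ncs_geom(scores):
    assert all(s > 0 for s in scores), "geometric mean needs positive scores"
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

balanced   = [0.66, 0.66, 0.66, 0.66, 0.66]
imbalanced = [0.9, 0.9, 0.9, 0.55, 0.05]   # same arithmetic mean (0.66)

print(round(ncs_arith(balanced), 3), round(ncs_geom(balanced), 3))
print(round(ncs_arith(imbalanced), 3), round(ncs_geom(imbalanced), 3))
```

The geometric mean drops well below 0.66 for the imbalanced profile, which is exactly the penalty `NCS_geom` is meant to apply.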
| Analysis | Location |
|---|---|
| Descriptive statistics | notebooks/cleaned_outputs_descriptive_stats.ipynb |
| MTurk collection statistics | notebooks/mturk_data_collection_statistics.ipynb |
| Perplexity (complementary probe) | notebooks/perplexity_analysis.ipynb |
| Metric robustness checks | analysis/metric_robustness/ |
| Metric exclusion sensitivity | analysis/metric_exclusion_sensitivity/ |
| Story visualisation with coreference | notebooks/visualize_stories_with_coref.ipynb, examine_stories/ |
Long-prompt human stories were screened for potential AI-generated content using a fine-tuned RoBERTa classifier with 5-fold GroupKFold cross-validation (grouped by visual sequence). Sequences with any flagged story (AI probability > 0.9) were excluded, retaining 54 of 60 sequences.
See scripts/ai-detect/README.md for details.
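The sequence-level exclusion rule can be sketched as below. The record fields are illustrative assumptions; the fine-tuned RoBERTa classifier that produces the probabilities lives in `scripts/ai-detect/` and is not shown.

```python
# Hedged sketch of the exclusion rule: a visual sequence is dropped if any
# of its stories has AI probability > 0.9. Field names are assumptions.
def retained_sequences(records, threshold=0.9):
    """records: list of {"sequence_id": ..., "ai_prob": ...} dicts."""
    flagged = {r["sequence_id"] for r in records if r["ai_prob"] > threshold}
    all_seqs = {r["sequence_id"] for r in records}
    return sorted(all_seqs - flagged)

records = [
    {"sequence_id": 1, "ai_prob": 0.12},
    {"sequence_id": 1, "ai_prob": 0.95},  # flags the whole sequence
    {"sequence_id": 2, "ai_prob": 0.40},
]
print(retained_sequences(records))
# -> [2]
```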
We compute perplexity with three open-source VLMs (Qwen3-VL, Llama 4 Scout, InternVL-3) on both prompt conditions and on additional web-scraped multimodal data. All evaluator models assign higher perplexity to human-authored texts than to model-generated texts.
- Perplexity analysis: `notebooks/perplexity_analysis.ipynb`
- Scraped data collection: `scripts/scraped_data/` (Wikipedia, Wikinews, RSS photo-essays)
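Perplexity itself is the exponential of the negative mean token log-likelihood. A minimal sketch; the per-token log-probs would come from the evaluator VLMs, and the values below are made up:

```python
# Hedged sketch: perplexity from per-token log-probabilities.
# PPL = exp(-(1/N) * sum(log p(token_i))). Example values are invented.
import math

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Tokens the model finds less predictable (lower log-probs) yield
# higher perplexity, as reported for human-authored texts.
predictable   = [-0.5, -0.4, -0.6, -0.5]
unpredictable = [-2.0, -1.5, -2.5, -1.8]
print(round(perplexity(predictable), 2), round(perplexity(unpredictable), 2))
```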