Code and data for analysing narrative coherence in visually grounded stories from the Visual Writing Prompts (VWP) corpus. We compare human-written narratives with stories generated by five vision-language models (VLMs) using a set of metrics combined into a Narrative Coherence Score (NCS).
When running on Alvis, use the existing environment:

```bash
module purge
module load virtualenv/20.29.2-GCCcore-14.2.0
module load Python/3.13.1-GCCcore-14.2.0
source /mimer/NOBACKUP/groups/naiss2025-22-1187/coherence-tacl/envs/coherence_tacl/bin/activate
```

The pipeline has six stages: (1) data preparation, (2) story generation, (3) post-processing, (4) coreference resolution, (5) metric computation, and (6) analysis. The sections below describe each stage.
We sampled 60 visual story sequences from VWP. Download the full story images and character images:
```bash
cd data
python download_data.py --csv-file ./vwp-acl2025-subset.csv --output-dir ./sampled_60
```

Two prompt conditions are used:

- Short prompt (`data/prompts/prompt-original-w-names.txt`): follows the original VWP data-collection instructions. Human stories come from the VWP corpus; model stories are newly generated.
- Long prompt (`data/prompts/prompt-large-w-names.txt`): provides more explicit guidance emphasising tellability, coherence, and character consistency. Model stories are generated with this prompt; additional human stories were collected via Amazon Mechanical Turk.
Stories were generated for all 60 sequences using five VLMs under both prompt conditions. Each model received the story image sequence together with character images and names.
| Model | Type | Instructions |
|---|---|---|
| InternVL3-78B | Open-source | models/internvl3/README.md |
| Qwen3-VL-235B | Open-source | models/qwen3vl/README.md |
| Llama-4-Scout | Open-source | models/llama4scout/README.md |
| Claude 4.5 Sonnet | Proprietary | models/claude/claude45.ipynb |
| GPT-4o | Proprietary | models/gpt/gpt4o.ipynb |
All models used `temperature=0.6` and `max_tokens=4096`.
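For the proprietary APIs, a generation request with these decoding settings might be assembled as below. This is a hedged sketch: the message structure follows the common OpenAI-style chat format, and the helper name, model string, and image encoding are illustrative assumptions, not the repository's actual code.

```python
# Hedged sketch: building one chat-style story-generation request with the
# decoding settings stated above (temperature=0.6, max_tokens=4096).
# All field names besides those two values are assumptions.
import base64
import json

def build_request(prompt_text: str, image_bytes_list: list[bytes],
                  model: str = "gpt-4o") -> dict:
    """Assemble a request payload (hypothetical helper, not from the repo)."""
    content = [{"type": "text", "text": prompt_text}]
    for img in image_bytes_list:
        b64 = base64.b64encode(img).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.6,   # decoding settings from this README
        "max_tokens": 4096,
    }

payload = build_request("Write a story for these images.", [b"\xff\xd8fake"])
print(json.dumps(payload["messages"][0]["content"][0]))
```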
Long-prompt human stories were collected via AMT (3 descriptions per sequence, 180 total). See mturk/README.md for the full recruitment criteria, workflow, and payment details.
After generating stories with all models, process the outputs for analysis:
Step 1 — Collect model outputs into a single JSON:
```bash
cd scripts
python collect_data.py \
    --qwen3vl-out ../models/qwen3vl/out-qwen3vl-60stories/ \
    --internvl3-out ../models/internvl3/out-internvl3-60stories/ \
    --llama4-out ../models/llama4scout/out-llama4scout-60stories/ \
    --gpt4o-out ../models/gpt/out-gpt4o-60stories/ \
    --claude45-out ../models/claude/out-claude45-60stories/ \
    --human-large-csv ../notebooks/collected_60.csv \
    --human-original-csv ../data/vwp-acl2025-subset.csv \
    --output-json ../data/post-processing/collected_outputs.json
```

Step 2 — Clean and normalise story texts (remove reasoning traces, meta-commentary, and formatting artifacts; standardise `[SEP]` markers):

```bash
python clean_data.py \
    --input-json ../data/post-processing/collected_outputs.json \
    --output-json ../data/post-processing/cleaned_outputs.json
```

Coreference chains are extracted using the Link-Append model. This is a prerequisite for the coreference, character persistence, and multimodal character grounding metrics.
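The kind of normalisation `clean_data.py` performs in Step 2 can be sketched as follows. The specific patterns (reasoning-trace tags, meta-commentary prefixes, markdown artifacts) are illustrative assumptions; only the goal of standardising `[SEP]` markers is taken from this README.

```python
# Hedged sketch of Step 2-style cleaning. The exact patterns below are
# assumptions for illustration, not the repository's actual rules.
import re

def clean_story(text: str) -> str:
    # Drop <think>...</think>-style reasoning traces (assumed pattern).
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Drop leading meta-commentary like "Here is the story:" (assumed).
    text = re.sub(r"^\s*Here is (the|a) story:?\s*", "", text,
                  flags=re.IGNORECASE)
    # Remove markdown emphasis/heading artifacts.
    text = text.replace("**", "").replace("##", "")
    # Standardise segment separators: any [SEP]-like marker -> " [SEP] ".
    text = re.sub(r"\s*\[\s*SEP\s*\]\s*", " [SEP] ", text)
    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

raw = "Here is the story: **Once** upon a time.[SEP]They met. [ SEP ] The end."
print(clean_story(raw))
# -> "Once upon a time. [SEP] They met. [SEP] The end."
```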
Step 3 — Prepare LinkAppend inputs:
```bash
python prepare_linkappend_inputs.py \
    --input-json ../data/post-processing/cleaned_outputs.json \
    --output-dir ../models/linkappend/data-in
```

This creates separate JSON files for each model/prompt/seed combination.
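The per-combination splitting can be sketched as below. The record fields (`model`, `prompt`, `seed`, `story`) and the output filename scheme are assumptions; only the grouping key comes from this README.

```python
# Hedged sketch: split a combined outputs list into one JSON file per
# (model, prompt, seed) combination. Field names and the filename
# pattern are illustrative assumptions.
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def split_by_combination(records: list[dict], out_dir: Path) -> list[Path]:
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for rec in records:
        groups[(rec["model"], rec["prompt"], rec["seed"])].append(rec)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for (model, prompt, seed), recs in sorted(groups.items()):
        path = out_dir / f"{model}_{prompt}_seed{seed}.json"
        path.write_text(json.dumps(recs, indent=2))
        paths.append(path)
    return paths

records = [
    {"model": "qwen3vl", "prompt": "short", "seed": 0, "story": "..."},
    {"model": "qwen3vl", "prompt": "long", "seed": 0, "story": "..."},
    {"model": "gpt4o", "prompt": "short", "seed": 0, "story": "..."},
]
paths = split_by_combination(records, Path(tempfile.mkdtemp()))
print([p.name for p in paths])
```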
Step 4 — Run LinkAppend (SLURM job, 1 A100 GPU):
```bash
cd ../models/linkappend
sbatch linkappend-run.slurm
```

This processes all JSON files in `data-in/` and writes coreference-annotated CoNLL output to `data-out/`.
Step 5 — Convert CoNLL to JSON:
```bash
cd ../../scripts
sbatch conll2json-corefconversion.sh
```

This converts `.conll` files from `data-out/` subdirectories into jsonlines files in `data-out/conll_to_json/`.
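The conversion turns the CoNLL coreference column into mention clusters. A minimal parser for that column, as a simplified sketch (the actual script handles full multi-column CoNLL rows and document boundaries):

```python
# Hedged sketch: parse a CoNLL-style coreference column into clusters of
# (start, end) token spans. Tags look like "(0)", "(1", "1)", or "-";
# "|" separates multiple tags on one token. Simplified for illustration.
from collections import defaultdict

def conll_to_clusters(rows):
    """rows: list of (token, coref_tag) pairs, e.g. ("Anna", "(0)")."""
    open_spans = defaultdict(list)   # cluster id -> stack of start indices
    clusters = defaultdict(list)     # cluster id -> list of (start, end)
    for i, (_tok, tag) in enumerate(rows):
        if tag == "-":
            continue
        for part in tag.split("|"):
            cid = int(part.strip("()"))
            if part.startswith("("):
                open_spans[cid].append(i)
            if part.endswith(")"):
                start = open_spans[cid].pop()
                clusters[cid].append((start, i))
    return dict(clusters)

rows = [("Anna", "(0)"), ("met", "-"), ("her", "(0)"),
        ("old", "(1"), ("friend", "1)"), (".", "-")]
print(conll_to_clusters(rows))
# -> {0: [(0, 0), (2, 2)], 1: [(3, 4)]}
```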
We compute five coherence metrics, each targeting a different aspect of narrative organisation. Metric computation notebooks and scripts are in analysis/ and scripts/.
| Metric | Code |
|---|---|
| Coreference | analysis/coreference_profiles.ipynb |
| Implicit discourse relation typology | scripts/run_trainer.py, scripts/train-implicit-no-rst.sh, analysis/implicit_connectives_profiles.ipynb |
| Topic switch | scripts/topic_modelling/, analysis/topic_modelling_profiles.ipynb |
| Character persistence | scripts/character_persistence/, analysis/character_profiles.ipynb |
| Multimodal character grounding | scripts/mcg/, analysis/mcc_profile.ipynb, analysis/groovist_profile.ipynb, analysis/mci_profile.ipynb |
Each metric is computed at the story level and transformed with `tanh` before aggregation.
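The `tanh` step squashes raw story-level values into the common bounded range (-1, 1) before aggregation. A minimal sketch; the raw scale of each metric is an assumption here:

```python
# Hedged sketch: tanh maps an unbounded raw metric value into (-1, 1),
# so metrics on different scales become comparable before aggregation.
# The raw values below are made up for illustration.
import math

def squash(raw_value: float) -> float:
    return math.tanh(raw_value)

raw_scores = [0.2, 1.5, 4.0]
print([round(squash(v), 3) for v in raw_scores])
# -> [0.197, 0.905, 0.999]
```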
We classify implicit discourse relations between adjacent story segments using DeDisCo, an instruction-tuned Qwen3-4B model. See notebooks/create_inputs_for_implicit_connectives.ipynb for input preparation and notebooks/discourse_relation_types.ipynb for relation type analysis.
A single BERTopic model is trained on the combined corpus and applied under multiple topic granularities (nr_topics from 80 to 5, step 5). See scripts/topic_modelling/README.md for pipeline details, scripts/topic_modelling/technical_details.md for configuration, and notebooks/create_inputs_for_berttopic.ipynb for input preparation.
The five metrics are combined into a Narrative Coherence Score (NCS) in two variants:
- Arithmetic mean (`NCS_arith`): unweighted average across metrics.
- Geometric mean (`NCS_geom`): penalises imbalance; it is lower when a story is strong on some metrics but weak on others.
Computation: analysis/ncs_score.ipynb
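The two variants can be sketched as below. The geometric mean here assumes metric scores have been rescaled to be strictly positive; the notebook's exact rescaling is not stated in this README.

```python
# Hedged sketch of the two NCS variants. Assumption: scores are strictly
# positive (required for the geometric mean); the example values are
# made up to show the imbalance penalty.
import math

def ncs_arith(scores):
    return sum(scores) / len(scores)

def ncs_geom(scores):
    assert all(s > 0 for s in scores), "geometric mean needs positive scores"
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

balanced   = [0.66, 0.66, 0.66, 0.66, 0.66]
imbalanced = [0.9, 0.9, 0.9, 0.55, 0.05]   # same arithmetic mean (0.66)

print(round(ncs_arith(balanced), 3), round(ncs_geom(balanced), 3))
print(round(ncs_arith(imbalanced), 3), round(ncs_geom(imbalanced), 3))
```

The geometric mean drops well below 0.66 for the imbalanced profile, which is exactly the penalty `NCS_geom` is meant to apply.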
| Analysis | Location |
|---|---|
| Descriptive statistics | notebooks/cleaned_outputs_descriptive_stats.ipynb |
| MTurk collection statistics | notebooks/mturk_data_collection_statistics.ipynb |
| Perplexity (complementary probe) | notebooks/perplexity_analysis.ipynb |
| Metric robustness checks | analysis/metric_robustness/ |
| Metric exclusion sensitivity | analysis/metric_exclusion_sensitivity/ |
| Story visualisation with coreference | notebooks/visualize_stories_with_coref.ipynb, examine_stories/ |
Long-prompt human stories were screened for potential AI-generated content using a fine-tuned RoBERTa classifier with 5-fold GroupKFold cross-validation (grouped by visual sequence). Sequences with any flagged story (AI probability > 0.9) were excluded, retaining 54 of 60 sequences.
See scripts/ai-detect/README.md for details.
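The sequence-level exclusion rule can be sketched as below. The record fields are illustrative assumptions; the fine-tuned RoBERTa classifier that produces the probabilities lives in `scripts/ai-detect/` and is not shown.

```python
# Hedged sketch of the exclusion rule: a visual sequence is dropped if any
# of its stories has AI probability > 0.9. Field names are assumptions.
def retained_sequences(records, threshold=0.9):
    """records: list of {"sequence_id": ..., "ai_prob": ...} dicts."""
    flagged = {r["sequence_id"] for r in records if r["ai_prob"] > threshold}
    all_seqs = {r["sequence_id"] for r in records}
    return sorted(all_seqs - flagged)

records = [
    {"sequence_id": 1, "ai_prob": 0.12},
    {"sequence_id": 1, "ai_prob": 0.95},  # flags the whole sequence
    {"sequence_id": 2, "ai_prob": 0.40},
]
print(retained_sequences(records))
# -> [2]
```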
We compute perplexity with three open-source VLMs (Qwen3-VL, Llama 4 Scout, InternVL-3) on both prompt conditions and on additional web-scraped multimodal data. All evaluator models assign higher perplexity to human-authored texts than to model-generated texts.
- Perplexity analysis: `notebooks/perplexity_analysis.ipynb`
- Scraped data collection: `scripts/scraped_data/` (Wikipedia, Wikinews, RSS photo-essays)
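Perplexity itself is the exponential of the negative mean token log-likelihood. A minimal sketch; the per-token log-probs would come from the evaluator VLMs, and the values below are made up:

```python
# Hedged sketch: perplexity from per-token log-probabilities.
# PPL = exp(-(1/N) * sum(log p(token_i))). Example values are invented.
import math

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Tokens the model finds less predictable (lower log-probs) yield
# higher perplexity, as reported for human-authored texts.
predictable   = [-0.5, -0.4, -0.6, -0.5]
unpredictable = [-2.0, -1.5, -2.5, -1.8]
print(round(perplexity(predictable), 2), round(perplexity(unpredictable), 2))
```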