GU-CLASP/coherence-driven-humans

Narrative Coherence in Image-Sequence Storytelling: Humans and Vision-Language Models

Code and data for analysing narrative coherence in visually grounded stories from the Visual Writing Prompts (VWP) corpus. We compare human-written narratives with stories generated by five vision-language models (VLMs) using a set of metrics combined into a Narrative Coherence Score (NCS).

Setup

If running on Alvis, load the modules and activate the existing environment:

module purge
module load virtualenv/20.29.2-GCCcore-14.2.0
module load Python/3.13.1-GCCcore-14.2.0
source /mimer/NOBACKUP/groups/naiss2025-22-1187/coherence-tacl/envs/coherence_tacl/bin/activate

Pipeline overview

The pipeline has six stages: (1) data preparation, (2) story generation, (3) post-processing, (4) coreference resolution, (5) metric computation, and (6) analysis. The sections below describe each stage.


1. Data preparation

We sampled 60 visual story sequences from VWP. Download the full story images and character images:

cd data
python download_data.py --csv-file ./vwp-acl2025-subset.csv --output-dir ./sampled_60

Prompt conditions

Two prompt conditions are used:

  • Short prompt (data/prompts/prompt-original-w-names.txt): Follows the original VWP data collection instructions. Human stories come from the VWP corpus; model stories are newly generated.
  • Long prompt (data/prompts/prompt-large-w-names.txt): Provides more explicit guidance emphasising tellability, coherence, and character consistency. Model stories are generated with this prompt; additional human stories were collected via Amazon Mechanical Turk.

2. Story generation

Stories were generated for all 60 sequences using five VLMs under both prompt conditions. Each model received the story image sequence together with character images and names.

| Model | Type | Instructions |
|-------|------|--------------|
| InternVL3-78B | Open-source | models/internvl3/README.md |
| Qwen3-VL-235B | Open-source | models/qwen3vl/README.md |
| Llama-4-Scout | Open-source | models/llama4scout/README.md |
| Claude 4.5 Sonnet | Proprietary | models/claude/claude45.ipynb |
| GPT-4o | Proprietary | models/gpt/gpt4o.ipynb |

All models used temperature=0.6 and max_tokens=4096.

Human data collection (long prompt)

Long-prompt human stories were collected via AMT (3 descriptions per sequence, 180 total). See mturk/README.md for the full recruitment criteria, workflow, and payment details.


3. Post-processing

After generating stories with all models, process the outputs for analysis:

Step 1 — Collect model outputs into a single JSON:

cd scripts
python collect_data.py \
    --qwen3vl-out ../models/qwen3vl/out-qwen3vl-60stories/ \
    --internvl3-out ../models/internvl3/out-internvl3-60stories/ \
    --llama4-out ../models/llama4scout/out-llama4scout-60stories/ \
    --gpt4o-out ../models/gpt/out-gpt4o-60stories/ \
    --claude45-out ../models/claude/out-claude45-60stories/ \
    --human-large-csv ../notebooks/collected_60.csv \
    --human-original-csv ../data/vwp-acl2025-subset.csv \
    --output-json ../data/post-processing/collected_outputs.json

Step 2 — Clean and normalise story texts (remove reasoning traces, meta-commentary, formatting artifacts, standardise [SEP] markers):

python clean_data.py \
    --input-json ../data/post-processing/collected_outputs.json \
    --output-json ../data/post-processing/cleaned_outputs.json
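The kind of normalisation this step performs can be sketched as follows. The function name and the specific regular expressions are illustrative assumptions, not the actual rules in clean_data.py:

```python
import re

def clean_story(text: str) -> str:
    """Illustrative cleaning pass (assumed rules, not the real clean_data.py logic)."""
    # Drop <think>...</think>-style reasoning traces, if any survived generation.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Strip markdown formatting artifacts such as bold markers.
    text = text.replace("**", "")
    # Standardise segment separators to a single spaced "[SEP]" token.
    text = re.sub(r"\s*\[\s*SEP\s*\]\s*", " [SEP] ", text)
    return text.strip()
```

The real script handles more cases (meta-commentary, per-model quirks); this only shows the shape of the transformation.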

4. Coreference resolution

Coreference chains are extracted using the Link-Append model. This is a prerequisite for the coreference, character persistence, and multimodal character grounding metrics.

Step 3 — Prepare LinkAppend inputs:

python prepare_linkappend_inputs.py \
    --input-json ../data/post-processing/cleaned_outputs.json \
    --output-dir ../models/linkappend/data-in

Creates separate JSON files for each model/prompt/seed combination.

Step 4 — Run LinkAppend (SLURM job, 1 A100 GPU):

cd ../models/linkappend
sbatch linkappend-run.slurm

Processes all JSON files in data-in/ → coreference-annotated CoNLL output in data-out/.

Step 5 — Convert CoNLL to JSON:

cd ../../scripts
sbatch conll2json-corefconversion.sh

Converts .conll files from data-out/ subdirectories into jsonlines files in data-out/conll_to_json/.


5. Coherence metric computation

We compute five coherence metrics, each targeting a different aspect of narrative organisation. Metric computation notebooks and scripts are in analysis/ and scripts/.

| Metric | Code |
|--------|------|
| Coreference | analysis/coreference_profiles.ipynb |
| Implicit discourse relation typology | scripts/run_trainer.py, scripts/train-implicit-no-rst.sh, analysis/implicit_connectives_profiles.ipynb |
| Topic switch | scripts/topic_modelling/, analysis/topic_modelling_profiles.ipynb |
| Character persistence | scripts/character_persistence/, analysis/character_profiles.ipynb |
| Multimodal character grounding | scripts/mcg/, analysis/mcc_profile.ipynb, analysis/groovist_profile.ipynb, analysis/mci_profile.ipynb |

Each metric is computed at the story level and transformed with tanh before aggregation.
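The tanh transform squashes each raw story-level value into a bounded range so that no single metric with a large scale dominates the aggregate. A minimal sketch, with illustrative raw values (not real pipeline output):

```python
import math

def squash(raw_scores):
    """Map raw story-level metric values through tanh so all metrics
    live on a comparable bounded scale before aggregation."""
    return [math.tanh(x) for x in raw_scores]

# Illustrative raw metric values for one story (assumed, not real data).
raw = [0.3, 1.5, 4.0]
squashed = squash(raw)  # monotone, bounded in (-1, 1)
```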

Implicit discourse relation classifier

We classify implicit discourse relations between adjacent story segments using DeDisCo, an instruction-tuned Qwen3-4B model. See notebooks/create_inputs_for_implicit_connectives.ipynb for input preparation and notebooks/discourse_relation_types.ipynb for relation type analysis.

Topic modelling (BERTopic)

A single BERTopic model is trained on the combined corpus and applied under multiple topic granularities (nr_topics from 80 to 5, step 5). See scripts/topic_modelling/README.md for pipeline details, scripts/topic_modelling/technical_details.md for configuration, and notebooks/create_inputs_for_berttopic.ipynb for input preparation.
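The granularity sweep described above amounts to the following schedule; the reduce/re-apply step is shown only as a comment, since it needs a trained BERTopic model and the corpus:

```python
# Topic granularities swept by the pipeline: 80, 75, ..., 5.
granularities = list(range(80, 4, -5))

# For each granularity, the single trained model would be reduced and
# re-applied, roughly (sketch; requires bertopic and a fitted model):
# for n in granularities:
#     topic_model.reduce_topics(docs, nr_topics=n)
#     topics, _ = topic_model.transform(docs)
```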


6. Narrative coherence score and analysis

The five metrics are combined into a Narrative Coherence Score (NCS) in two variants:

  • Arithmetic mean (NCS_arith): unweighted average across metrics.
  • Geometric mean (NCS_geom): penalises imbalance — lower when a story is strong on some metrics but weak on others.

Computation: analysis/ncs_score.ipynb

Additional analyses

| Analysis | Location |
|----------|----------|
| Descriptive statistics | notebooks/cleaned_outputs_descriptive_stats.ipynb |
| MTurk collection statistics | notebooks/mturk_data_collection_statistics.ipynb |
| Perplexity (complementary probe) | notebooks/perplexity_analysis.ipynb |
| Metric robustness checks | analysis/metric_robustness/ |
| Metric exclusion sensitivity | analysis/metric_exclusion_sensitivity/ |
| Story visualisation with coreference | notebooks/visualize_stories_with_coref.ipynb, examine_stories/ |

Quality control (long prompt)

Long-prompt human stories were screened for potential AI-generated content using a fine-tuned RoBERTa classifier with 5-fold GroupKFold cross-validation (grouped by visual sequence). Sequences with any flagged story (AI probability > 0.9) were excluded, retaining 54 of 60 sequences.

See scripts/ai-detect/README.md for details.
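The sequence-level exclusion rule can be sketched as follows; the probabilities and sequence ids are hypothetical (the actual classifier lives in scripts/ai-detect/):

```python
THRESHOLD = 0.9  # AI probability above which a story is flagged

def retained_sequences(ai_probs):
    """ai_probs maps sequence id -> list of per-story AI probabilities.
    A sequence is dropped if ANY of its stories is flagged."""
    return {seq for seq, probs in ai_probs.items()
            if all(p <= THRESHOLD for p in probs)}

# Hypothetical screening results for three sequences.
probs = {
    "seq-01": [0.12, 0.05, 0.30],
    "seq-02": [0.95, 0.10, 0.08],  # one flagged story -> sequence excluded
    "seq-03": [0.40, 0.88, 0.02],
}
# retained_sequences(probs) keeps seq-01 and seq-03 only.
```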

Perplexity evaluation

We compute perplexity with three open-source VLMs (Qwen3-VL, Llama 4 Scout, InternVL-3) on both prompt conditions and on additional web-scraped multimodal data. All evaluator models assign higher perplexity to human-authored texts than to model-generated texts.

  • Perplexity analysis: notebooks/perplexity_analysis.ipynb
  • Scraped data collection: scripts/scraped_data/ (Wikipedia, Wikinews, RSS photo-essays)
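As a reminder of the quantity being compared, perplexity is the exponentiated mean negative log-probability per token. A minimal sketch over assumed per-token probabilities (not real model output):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities under an evaluator VLM.
model_like = [0.5, 0.6, 0.55, 0.5]  # predictable text -> lower perplexity
human_like = [0.2, 0.1, 0.3, 0.15]  # less predictable text -> higher perplexity
```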

About

Code for "Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence"
