feat: add is_vlm param to safe_conversations_generator for multimodal data support #545
Conversation
Code Review
This pull request introduces support for Vision-Language Models (VLM) by propagating an is_vlm flag through the data loading pipeline and including image data in the conversation generator. Feedback points out a critical typo in the dictionary key name ("conversation" vs "conversations") that would break downstream processing, as well as an accidental module-level import of unittest.result. Additionally, minor formatting and indentation issues in the new logic should be addressed.
```python
if is_vlm:
    image = row.get("image", "")
    result = {
        "conversation":cleaned_convs,
        "image":image
    }
else:
    result = {"conversations": cleaned_convs}
```
There are several issues in this block that will break the VLM data loading pipeline:
- Key name mismatch: The key `"conversation"` on line 386 should be `"conversations"` (plural). The downstream preprocessing logic (e.g., in `specforge/data/preprocessing.py`) specifically expects the plural form; using the singular form here will result in a `KeyError` during training.
- Indentation: The `else` block on line 390 has an extra leading space (21 spaces instead of 20).
- Formatting: There is a missing space after the colons in the dictionary keys on lines 386 and 387.

Applying the suggested fix will ensure consistency and correct functionality.
```diff
 if is_vlm:
     image = row.get("image", "")
     result = {
-        "conversation":cleaned_convs,
-        "image":image
+        "conversations": cleaned_convs,
+        "image": image
     }
 else:
     result = {"conversations": cleaned_convs}
```
Format issues have been fixed. Please review again.
```python
import os
import re
from contextlib import contextmanager
from unittest import result  # accidental import flagged in the review summary
```
Motivation
The `safe_conversations_generator` function currently extracts only the `conversations` field from JSONL data, discarding the `image` field present in multimodal datasets (e.g., ALLaVA-4V, ShareGPT4V). When loading multimodal JSONL data and running the VLM training pipeline, this causes a `KeyError: "image"` in `preprocess_vlm_conversations` (`specforge/data/preprocessing.py`), where the function iterates over `examples["image"]` but the `image` column is missing from the dataset. This makes it impossible to load VLM training data containing image paths through the standard data loading pipeline.
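The failure mode can be reproduced with a minimal sketch. The `examples` dict below is an illustrative stand-in for a batched dataset row, not the actual `preprocess_vlm_conversations` code:

```python
# Illustrative stand-in for a batch loaded from text-only JSONL:
# the "image" column is absent because the generator dropped it.
examples = {"conversations": [["human: hi", "gpt: hello"]]}

try:
    # Mirrors a loop that assumes an "image" column is always present.
    for image_path in examples["image"]:
        pass
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'image'
```

Preserving the `image` field in the generator output makes the column available and avoids the exception.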
Modifications
specforge/utils.py
Added an `is_vlm` parameter (default `False`) to `safe_conversations_generator`
When `is_vlm=True`, the `image` field from the source data is preserved in the yielded dict; when `is_vlm=False`, no `image` key is added, keeping the schema clean for text-only datasets
scripts/train_eagle3.py
Updated both `Dataset.from_generator` calls (train & eval) to pass `is_vlm=args.is_vlm` via `gen_kwargs`
scripts/prepare_hidden_states.py
Updated the `Dataset.from_generator` call to pass `is_vlm=args.is_vlm` via `gen_kwargs`
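Taken together, the change can be sketched end to end. This is a hypothetical reconstruction from the description above, not the repository code: the real generator reads JSONL from disk, and `from_generator` below is a simplified pure-Python stand-in for `datasets.Dataset.from_generator`, which likewise forwards `gen_kwargs` as keyword arguments to the generator function:

```python
def safe_conversations_generator(rows, is_vlm=False):
    """Sketch of the modified generator: keeps "image" only when is_vlm=True."""
    for row in rows:
        cleaned_convs = row.get("conversations", [])
        if is_vlm:
            # Preserve the image path for multimodal datasets.
            yield {"conversations": cleaned_convs, "image": row.get("image", "")}
        else:
            # Text-only: no image key, so the schema stays clean.
            yield {"conversations": cleaned_convs}

def from_generator(generator, gen_kwargs=None):
    # Simplified stand-in for datasets.Dataset.from_generator:
    # forwards gen_kwargs to the generator and materializes the rows.
    return list(generator(**(gen_kwargs or {})))

rows = [{"conversations": ["human: describe", "gpt: a cat"], "image": "imgs/0.png"}]
vlm_ds = from_generator(safe_conversations_generator,
                        gen_kwargs={"rows": rows, "is_vlm": True})
text_ds = from_generator(safe_conversations_generator,
                         gen_kwargs={"rows": rows, "is_vlm": False})
print(vlm_ds[0]["image"])    # imgs/0.png
print("image" in text_ds[0])  # False
```

The training scripts use the same pattern, passing `is_vlm=args.is_vlm` inside `gen_kwargs` so the flag reaches the generator without changing its call signature elsewhere.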
Related Issues
None.
Accuracy Test
This PR only modifies data loading logic and does not affect model-side code (kernels, architecture). No accuracy impact expected.
Benchmark & Profiling
No performance impact — this PR only adds a conditional field in the data generator.
Checklist