Skip to content

feat: add is_vlm param to safe_conversations_generator for multimodal data support#545

Open
sunny-infra wants to merge 5 commits into
sgl-project:mainfrom
sunny-infra:feat/support-vlm-data-loading-from-jsonl
Open

feat: add is_vlm param to safe_conversations_generator for multimodal data support#545
sunny-infra wants to merge 5 commits into
sgl-project:mainfrom
sunny-infra:feat/support-vlm-data-loading-from-jsonl

Conversation

@sunny-infra
Copy link
Copy Markdown

Motivation

The safe_conversations_generator function currently only extracts the conversations field from JSONL data, discarding the image field that is present in multimodal datasets (e.g., ALLaVA-4V, ShareGPT4V). When loading multimodal JSONL data and running the VLM training pipeline, this causes a KeyError: "image" at preprocess_vlm_conversations ( specforge/data/preprocessing.py ), where the function iterates over examples["image"] but the image column is missing from the dataset. This makes it impossible to load VLM training data that contains image paths through the standard data loading pipeline.

Modifications

specforge/utils.py

  • Added is_vlm parameter (default False ) to safe_conversations_generator

  • When is_vlm=True , the image field from the source data is preserved in the yielded dict; when is_vlm=False , no image key is added, keeping the schema clean for text-only datasets
    scripts/train_eagle3.py

  • Updated both Dataset.from_generator calls (train & eval) to pass is_vlm=args.is_vlm via gen_kwargs
    scripts/prepare_hidden_states.py

  • Updated Dataset.from_generator call to pass is_vlm=args.is_vlm via gen_kwargs

Related Issues

None.

Accuracy Test

This PR only modifies data loading logic and does not affect model-side code (kernels, architecture). No accuracy impact expected.

Benchmark & Profiling

No performance impact — this PR only adds a conditional field in the data generator.

Checklist

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Vision-Language Models (VLM) by propagating an is_vlm flag through the data loading pipeline and including image data in the conversation generator. Feedback points out a critical typo in the dictionary key name ("conversation" vs "conversations") that would break downstream processing, as well as an accidental module-level import of unittest.result. Additionally, minor formatting and indentation issues in the new logic should be addressed.

Comment thread specforge/utils.py Outdated
Comment on lines +383 to +390
if is_vlm:
image = row.get("image", "")
result = {
"conversation":cleaned_convs,
"image":image
}
else:
result = {"conversations": cleaned_convs}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There are several issues in this block that will break the VLM data loading pipeline:

  1. Key Name Mismatch: The key "conversation" on line 386 should be "conversations" (plural). The downstream preprocessing logic (e.g., in specforge/data/preprocessing.py) specifically expects the plural form. Using the singular form here will result in a KeyError during training.
  2. Indentation: The else block on line 390 has an extra leading space (21 spaces instead of 20).
  3. Formatting: There is a missing space after the colons in the dictionary keys on lines 386 and 387.

Applying the suggested fix will ensure consistency and correct functionality.

Suggested change
if is_vlm:
image = row.get("image", "")
result = {
"conversation":cleaned_convs,
"image":image
}
else:
result = {"conversations": cleaned_convs}
if is_vlm:
image = row.get("image", "")
result = {
"conversations": cleaned_convs,
"image": image
}
else:
result = {"conversations": cleaned_convs}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format issues have been fixed. Please review again.

Comment thread specforge/utils.py Outdated
import os
import re
from contextlib import contextmanager
from unittest import result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The import from unittest import result appears to be an accidental auto-import. The variable result is used as a local dictionary within the safe_conversations_generator function, and this module-level import is unnecessary and could lead to confusion or shadowing issues.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant