feat: add is_vlm param to safe_conversations_generator for multimodal data support #545
Conversation
Code Review
This pull request introduces support for Vision-Language Models (VLM) by propagating an is_vlm flag through the data loading pipeline and including image data in the conversation generator. Feedback points out a critical typo in the dictionary key name ("conversation" vs "conversations") that would break downstream processing, as well as an accidental module-level import of unittest.result. Additionally, minor formatting and indentation issues in the new logic should be addressed.
```python
if is_vlm:
    image = row.get("image", "")
    result = {
        "conversation":cleaned_convs,
        "image":image
    }
else:
    result = {"conversations": cleaned_convs}
```
There are several issues in this block that will break the VLM data loading pipeline:
- Key name mismatch: The key `"conversation"` on line 386 should be `"conversations"` (plural). The downstream preprocessing logic (e.g., in `specforge/data/preprocessing.py`) specifically expects the plural form; using the singular form here will result in a `KeyError` during training.
- Indentation: The `else` block on line 390 has an extra leading space (21 spaces instead of 20).
- Formatting: There is a missing space after the colons in the dictionary keys on lines 386 and 387.

Applying the suggested fix will ensure consistency and correct functionality.
```diff
 if is_vlm:
     image = row.get("image", "")
     result = {
-        "conversation":cleaned_convs,
-        "image":image
+        "conversations": cleaned_convs,
+        "image": image
     }
 else:
     result = {"conversations": cleaned_convs}
```
Format issues have been fixed. Please review again.
```python
import os
import re
from contextlib import contextmanager
from unittest import result  # accidental import flagged in the review summary
```
Motivation
The `safe_conversations_generator` function currently extracts only the `conversations` field from JSONL data, discarding the `image` field present in multimodal datasets (e.g., ALLaVA-4V, ShareGPT4V). When loading multimodal JSONL data and running the VLM training pipeline, this causes a `KeyError: "image"` in `preprocess_vlm_conversations` (`specforge/data/preprocessing.py`), where the function iterates over `examples["image"]` but the `image` column is missing from the dataset. This makes it impossible to load VLM training data containing image paths through the standard data loading pipeline.
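The failure mode can be reproduced with a minimal sketch. The `examples` dict below is an illustrative stand-in for a batched dataset row, not the actual `preprocess_vlm_conversations` code:

```python
# Illustrative stand-in for a batch loaded from text-only JSONL:
# the "image" column is absent because the generator dropped it.
examples = {"conversations": [["human: hi", "gpt: hello"]]}

try:
    # Mirrors a loop that assumes an "image" column is always present.
    for image_path in examples["image"]:
        pass
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'image'
```

Preserving the `image` field in the generator output makes the column available and avoids the exception.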
Modifications
specforge/utils.py
Added an `is_vlm` parameter (default `False`) to `safe_conversations_generator`
When `is_vlm=True`, the `image` field from the source data is preserved in the yielded dict; when `is_vlm=False`, no `image` key is added, keeping the schema clean for text-only datasets
scripts/train_eagle3.py
Updated both `Dataset.from_generator` calls (train & eval) to pass `is_vlm=args.is_vlm` via `gen_kwargs`
scripts/prepare_hidden_states.py
Updated the `Dataset.from_generator` call to pass `is_vlm=args.is_vlm` via `gen_kwargs`
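Taken together, the change can be sketched end to end. This is a hypothetical reconstruction from the description above, not the repository code: the real generator reads JSONL from disk, and `from_generator` below is a simplified pure-Python stand-in for `datasets.Dataset.from_generator`, which likewise forwards `gen_kwargs` as keyword arguments to the generator function:

```python
def safe_conversations_generator(rows, is_vlm=False):
    """Sketch of the modified generator: keeps "image" only when is_vlm=True."""
    for row in rows:
        cleaned_convs = row.get("conversations", [])
        if is_vlm:
            # Preserve the image path for multimodal datasets.
            yield {"conversations": cleaned_convs, "image": row.get("image", "")}
        else:
            # Text-only: no image key, so the schema stays clean.
            yield {"conversations": cleaned_convs}

def from_generator(generator, gen_kwargs=None):
    # Simplified stand-in for datasets.Dataset.from_generator:
    # forwards gen_kwargs to the generator and materializes the rows.
    return list(generator(**(gen_kwargs or {})))

rows = [{"conversations": ["human: describe", "gpt: a cat"], "image": "imgs/0.png"}]
vlm_ds = from_generator(safe_conversations_generator,
                        gen_kwargs={"rows": rows, "is_vlm": True})
text_ds = from_generator(safe_conversations_generator,
                         gen_kwargs={"rows": rows, "is_vlm": False})
print(vlm_ds[0]["image"])    # imgs/0.png
print("image" in text_ds[0])  # False
```

The training scripts use the same pattern, passing `is_vlm=args.is_vlm` inside `gen_kwargs` so the flag reaches the generator without changing its call signature elsewhere.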
Related Issues
None.
Accuracy Test
This PR only modifies data loading logic and does not affect model-side code (kernels, architecture). No accuracy impact expected.
Benchmark & Profiling
No performance impact — this PR only adds a conditional field in the data generator.
Checklist