Pretraining Data Preparation by ilkerkesen · Pull Request #59 · sign/WeLT

ilkerkesen · 2026-02-11T15:27:56Z

Fixes

Description

This PR implements a data preparation script that lets us create custom pretraining datasets from the specified data resources (i.e., Hugging Face datasets). The training script is also adapted to work with the output produced by this script. This PR is critical because it will enable us to pretrain a model on HPC clusters without internet access.

Technical details

The script takes a Hugging Face dataset as input, streams the shuffled data, writes a compressed *.jsonl.gz shard each time it hits the maximum number of words per shard, and stops once it exceeds the maximum total number of words. While processing examples, we can set a maximum sequence length so that longer sequences are chunked, and we can optionally drop the last chunk. Additionally, I used the words-segmentation package to split raw text into units. We can also set the unit type (e.g., words vs. chars).
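The flow described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: `chunk_units`, `write_shards`, the shard file naming, and the `{"text": ...}` record layout are all assumptions, and plain whitespace splitting stands in for the words-segmentation package.

```python
import gzip
import json
from pathlib import Path


def chunk_units(units, max_seq_length, drop_remainder=False):
    """Yield chunks of at most max_seq_length units (hypothetical helper)."""
    for start in range(0, len(units), max_seq_length):
        chunk = units[start:start + max_seq_length]
        # With drop_remainder, a final chunk shorter than the limit is skipped.
        if drop_remainder and len(chunk) < max_seq_length:
            return
        yield chunk


def write_shards(texts, output_dir, prefix, units_per_shard, max_total_units,
                 max_seq_length=128, drop_remainder=False):
    """Write chunks into gzip-compressed JSONL shards and return their paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths, handle, shard_units, total_units = [], None, 0, 0
    try:
        for text in texts:
            # Assumption: whitespace split instead of the real word segmenter.
            for chunk in chunk_units(text.split(), max_seq_length, drop_remainder):
                if total_units >= max_total_units:
                    return paths
                if handle is None:
                    # Open shards lazily so no empty shard file is ever created.
                    path = output_dir / f"{prefix}-{len(paths):05d}.jsonl.gz"
                    handle = gzip.open(path, "wt", encoding="utf-8")
                    paths.append(path)
                handle.write(json.dumps({"text": " ".join(chunk)}) + "\n")
                shard_units += len(chunk)
                total_units += len(chunk)
                if shard_units >= units_per_shard:
                    # Shard is full: close it and start a new one on demand.
                    handle.close()
                    handle, shard_units = None, 0
        return paths
    finally:
        if handle is not None:
            handle.close()
```

Opening each shard only when there is something to write is one way to avoid leaving an empty file behind when no examples match.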

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow the contribution guidelines.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

AmitMY

Review

Bugs

1. Metadata file naming mismatch — tests will fail

prepare_data.py writes to {prefix}-metadata.json (e.g. wikitext-wikitext-2-raw-v1-metadata.json), but every test opens metadata.json:

with open(f"{temp_output_dir}/metadata.json") as f:  # test_prepare_data.py:53

2. Wrong dependency

pyproject.toml adds zstandard but the code uses Python's built-in gzip — zstandard is never imported or used.

3. Missing script entry point

The README documents welt-prepare-data as a CLI command, but [project.scripts] has no entry for it. The command won't exist after install.

4. README lists --max_bytes_per_word but it's not in the argparse

5. Hardcoded seed=42 in train.py

The new train_test_split call hardcodes seed=42 instead of using the training seed.

Code Duplication — Reuse Opportunities

6. ~100 lines of argparse duplicates DataTrainingArguments

prepare_data.py defines its own argparse for --dataset_name, --dataset_config, --dataset_split, --text_column, --text_template, --seed, etc. Most of these already exist in DataTrainingArguments. The entire argparse block could be replaced by extending DataTrainingArguments with the few new fields (unit_type, max_total_units, num_units_per_file, drop_remainder, shuffle_buffer_size, output_path) and using HfArgumentParser — the same pattern train.py already uses.
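To illustrate why the dataclass pattern keeps argument definitions in one place, here is a stdlib-only sketch that derives CLI flags from dataclass fields, which is the idea HfArgumentParser automates. The field names come from the list above; `PrepareDataArguments`, `parse_into_dataclass`, and every default value are hypothetical, and the real change would extend DataTrainingArguments rather than define a new class.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class PrepareDataArguments:
    # Hypothetical stand-in; defaults are illustrative only.
    output_path: str = "data"
    unit_type: str = "words"
    max_total_units: int = 1_000_000
    num_units_per_file: int = 100_000
    shuffle_buffer_size: int = 10_000
    drop_remainder: bool = False


def parse_into_dataclass(cls, argv=None):
    """Build one CLI flag per dataclass field and return a populated instance."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        if f.type is bool:
            # Boolean fields become on/off switches.
            parser.add_argument(f"--{f.name}", action="store_true", default=f.default)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return cls(**vars(parser.parse_args(argv)))


args = parse_into_dataclass(PrepareDataArguments, ["--unit_type", "chars", "--drop_remainder"])
```

Adding a new option then means adding one field to the dataclass, with no separate argparse block to keep in sync.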

7. stream_texts() duplicates dataset loading from init_datasets()

prepare_data.py reimplements HuggingFace dataset loading with streaming + text template formatting. train.py already does this in init_datasets(). Could factor out a shared function.

8. Word segmentation reimplements processor logic

stream_examples() creates its own WordsSegmentationTokenizer and manually chunks text. processor.pretokenize_dataset() already does word segmentation. Could reuse the processor's pretokenizer.

Design Issues

9. init_datasets change in train.py bypasses processing but caller doesn't know

The new preprocessed_data_path branch returns early before process_split, but the caller in train() still runs processor.pretokenize_dataset() and pack_dataset() on the result. Preprocessed data would get double-processed.

10. import glob breaks isort ordering

Added between sys and the blank line — should be before math. Could also use pathlib.Path.glob() to avoid the import.
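A small sketch of the pathlib alternative, assuming shards are matched by a `*.jsonl.gz` pattern; `list_shards` is a hypothetical helper, not a function from this PR.

```python
from pathlib import Path


def list_shards(data_dir, pattern="*.jsonl.gz"):
    """Return matching shard paths in sorted order, without importing glob."""
    # Path.glob yields entries in arbitrary order, so sort for reproducibility.
    return sorted(Path(data_dir).glob(pattern))
```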

11. Empty dataset edge case

If zero examples match, the cleanup branch (shard_units == 0 and shard_index > 0) is never entered, leaving an empty shard file and metadata claiming num_shards=1.

Summary — what to cut/reuse

| What | Lines saved (approx.) | How |
| --- | --- | --- |
| Replace argparse with extended `DataTrainingArguments` + `HfArgumentParser` | ~90 | Add 5-6 fields to dataclass, delete argparse block |
| Reuse `init_datasets` streaming path for data loading | ~20 | Extract shared `load_streaming_dataset()` |
| Reuse processor's pretokenizer | ~5 | Import from processor instead of creating new instance |
| Delete unused `zstandard` dep | 1 | Remove from pyproject.toml |
| Add missing script entry point | +1 | Add to `[project.scripts]` |

The biggest win is replacing the argparse with the dataclass pattern — it eliminates ~90 lines and keeps argument definitions in one place.

ilkerkesen · 2026-02-11T20:26:34Z

@AmitMY I made a lot of improvements. The major improvement is creating the train and validation splits in the data preparation phase. Why? We will work with multiple data resources, and this approach allows us to compute perplexity on a balanced validation split.

The minor improvements:

ShardWriter is now a context manager, used in with statements for safer handling of file operations on shards.
Whitespace characters are no longer counted when counting characters.
Addressed most of the issues mentioned in the first review.
A couple of additional refactorings.
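For readers unfamiliar with the pattern, a context-manager shard writer could look roughly like this. It is a hypothetical sketch, not the ShardWriter from this PR: the record layout, the attribute names, and the choice to delete an empty shard on exit are all assumptions.

```python
import gzip
import json
from pathlib import Path


class ShardWriter:
    """Hypothetical sketch of a shard writer usable in a with statement."""

    def __init__(self, path):
        self.path = Path(path)
        self.num_examples = 0
        self.num_chars = 0
        self._handle = None

    def __enter__(self):
        self._handle = gzip.open(self.path, "wt", encoding="utf-8")
        return self

    def write(self, text):
        self._handle.write(json.dumps({"text": text}) + "\n")
        self.num_examples += 1
        # Count characters while skipping whitespace.
        self.num_chars += sum(1 for ch in text if not ch.isspace())

    def __exit__(self, exc_type, exc, tb):
        # The file is closed even if the body of the with block raised.
        self._handle.close()
        if self.num_examples == 0:
            # Assumption: an empty shard is removed rather than kept on disk.
            self.path.unlink()
        return False  # never swallow exceptions
```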

ilkerkesen · 2026-02-12T21:37:34Z

@AmitMY I made the following improvements:

Allowed preparation per split (e.g., creating shards only for training or only for validation).
Handled multiple data resources correctly.
Implemented a verification script for the prepared data. It checks whether the number of shards matches the number reported in the corresponding -metadata.json file.
Saved language and example id information.
Adapted the welt training script to drop extra columns (language, id, etc.; the prepared data keeps them just in case).
run_clm.py now supports yaml files.
Applied some refactorings suggested by Claude.

I tested one training run for each model.
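The shard-count check mentioned above might be sketched like this. The `{prefix}-metadata.json` name and the `num_shards` key follow the first review; `verify_shards` and the `{prefix}-*.jsonl.gz` shard naming are assumptions for illustration, and the actual verification script may check more than this.

```python
import json
from pathlib import Path


def verify_shards(data_dir, prefix):
    """Return True if the shard files on disk match the count in the metadata."""
    data_dir = Path(data_dir)
    metadata = json.loads((data_dir / f"{prefix}-metadata.json").read_text())
    # Assumption: shards are named {prefix}-<index>.jsonl.gz.
    found = len(list(data_dir.glob(f"{prefix}-*.jsonl.gz")))
    return found == metadata["num_shards"]
```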

ilkerkesen added 5 commits February 11, 2026 01:11

implement pretraining data preparation script, which creates a subset…

a16808b

… of the specified data resource.

adapt welt pretraining implementation to work with the new data format

a7faae7

document how to run the data preparation script

371e8ff

make shard limit naming more flexible (100_000 -> 100_000_000)

d95038d

move the module import to top-level

2986d4d

AmitMY reviewed Feb 11, 2026

ilkerkesen added 12 commits February 11, 2026 17:02

fix bugs listed in the PR review

6843bf2

fix the issues raised in the PR review

630666e

fix lint errors

b81d46d

implement data loading for the prepared data within the package, so w…

b3667ea

…e could import it and re-use when needed.

create train / validation splits at preparation time

fba018b

separate train and validation splits at data preparation phase

6800329

make [(train/validation)]_split_units args required for data preparat…

69f85a3

…ion phase.

implement extract_text procedure to prevent duplicated text_template …

afbc16c

…processing as suggested by Claude

make ShardWriter a context manager, initiated with /with/ statements …

fee1721

…for secure handling

do not count whitespace chars while keeping statistics

d1e127c

refactor data preparation tests

3323330

rely on the words segmenter to count number of characters

fc33081

ilkerkesen added 5 commits February 12, 2026 19:51

separate split metadata files and enable preparation per split

fc19d87

also makes the language arg required for the data preparation script

implement verification for the prepared data

dc20377

apply refactorings suggested by claude

c68e192

handle mutliple resources while verying the data

0651fc8

discard extra columns in training script

3330539

AmitMY approved these changes Feb 17, 2026

AmitMY merged commit 5c0a881 into sign:main Feb 17, 2026
1 of 3 checks passed

ilkerkesen deleted the data-preparation branch February 20, 2026 16:14

AmitMY merged 22 commits into sign:main from ilkerkesen:data-preparation
