Skip to content

Pretraining Data Preparation#59

Merged
AmitMY merged 22 commits into
sign:mainfrom
ilkerkesen:data-preparation
Feb 17, 2026
Merged

Pretraining Data Preparation#59
AmitMY merged 22 commits into
sign:mainfrom
ilkerkesen:data-preparation

Conversation

@ilkerkesen

Copy link
Copy Markdown
Contributor

Fixes

Fixes #58 by @ilkerkesen

Description

This PR implements a data preparation script, which allows us to create custom pretraining datasets using the specified data resources (i.e., Hugginface dataset). The training script is also adapted to work with the output produces by this script. This PR is critical, as it will enable us to pretrain a model on HPC clusters without internet access.

Technical details

The implemented script takes a Huggingface dataset as an input, streams the shuffled data, create a compressed shard of *.jsonl.gz file when hits the maximum number of words per shard limit, and stop when exceeds the maximum number of words limit. While processing examples, we could set a maximum sequence length limit where longer sequences are chunked, also we could drop the last chunk if we want. Additionally, I used the words-segmentation package to split raw text into units. We can also set the unit type (e.g,. words vs. chars).

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow the contribution guidelines.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

@AmitMY AmitMY left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Bugs

1. Metadata file naming mismatch — tests will fail

prepare_data.py writes to {prefix}-metadata.json (e.g. wikitext-wikitext-2-raw-v1-metadata.json), but every test opens metadata.json:

with open(f"{temp_output_dir}/metadata.json") as f:  # test_prepare_data.py:53

2. Wrong dependency

pyproject.toml adds zstandard but the code uses Python's built-in gzipzstandard is never imported or used.

3. Missing script entry point

The README documents welt-prepare-data as a CLI command, but [project.scripts] has no entry for it. The command won't exist after install.

4. README lists --max_bytes_per_word but it's not in the argparse

5. Hardcoded seed=42 in train.py

The new train_test_split call hardcodes seed=42 instead of using the training seed.


Code Duplication — Reuse Opportunities

6. ~100 lines of argparse duplicates DataTrainingArguments

prepare_data.py defines its own argparse for --dataset_name, --dataset_config, --dataset_split, --text_column, --text_template, --seed, etc. Most of these already exist in DataTrainingArguments. The entire argparse block could be replaced by extending DataTrainingArguments with the few new fields (unit_type, max_total_units, num_units_per_file, drop_remainder, shuffle_buffer_size, output_path) and using HfArgumentParser — the same pattern train.py already uses.

7. stream_texts() duplicates dataset loading from init_datasets()

prepare_data.py reimplements HuggingFace dataset loading with streaming + text template formatting. train.py already does this in init_datasets(). Could factor out a shared function.

8. Word segmentation reimplements processor logic

stream_examples() creates its own WordsSegmentationTokenizer and manually chunks text. processor.pretokenize_dataset() already does word segmentation. Could reuse the processor's pretokenizer.


Design Issues

9. init_datasets change in train.py bypasses processing but caller doesn't know

The new preprocessed_data_path branch returns early before process_split, but the caller in train() still runs processor.pretokenize_dataset() and pack_dataset() on the result. Preprocessed data would get double-processed.

10. import glob breaks isort ordering

Added between sys and the blank line — should be before math. Could also use pathlib.Path.glob() to avoid the import.

11. Empty dataset edge case

If zero examples match, the cleanup branch (shard_units == 0 and shard_index > 0) is never entered, leaving an empty shard file and metadata claiming num_shards=1.


Summary — what to cut/reuse

What Lines saved (approx) How
Replace argparse with extended DataTrainingArguments + HfArgumentParser ~90 Add 5-6 fields to dataclass, delete argparse block
Reuse init_datasets streaming path for data loading ~20 Extract shared load_streaming_dataset()
Reuse processor's pretokenizer ~5 Import from processor instead of creating new instance
Delete unused zstandard dep 1 Remove from pyproject.toml
Add missing script entry point +1 Add to [project.scripts]

The biggest win is replacing the argparse with the dataclass pattern — it eliminates ~90 lines and keeps argument definitions in one place.

@ilkerkesen

Copy link
Copy Markdown
Contributor Author

@AmitMY I made a lot of improvements. The major one improvement is creating train and validation splits at data preparation phase. Why? We will work on multiple data resources, and this methodology allows us to compute perplexity on a balanced validation split.

The minor improvements,

  1. ShardWriter becomes a context manager, used with with statements for safer handling file operations on shards.
  2. Omit counting whitespace characters while counting the characters.
  3. Address most of the issues mentioned in the first review.
  4. A couple of additional refactoring.

@ilkerkesen

ilkerkesen commented Feb 12, 2026

Copy link
Copy Markdown
Contributor Author

@AmitMY I made the following improvements,

  1. I allowed preparation per split (e.g., create shards for only training or validation purpose).
  2. Handle multiple data resources correctly.
  3. I implemented a verification script for the prepared data. So, this checks whether the number of shards matches with the reported number in the corresponding -metadata.json file.
  4. Save language and example id information.
  5. Adapt the welt training script to drop extra columns (language, id, etc. keep them just in case).
  6. Now run_clm.py supports yaml files.
  7. Some refactorings suggested by claude.

I tested one training run for each model.

@AmitMY AmitMY merged commit 5c0a881 into sign:main Feb 17, 2026
1 of 3 checks passed
@ilkerkesen ilkerkesen deleted the data-preparation branch February 20, 2026 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Implement Data Preparation

2 participants