Pretraining Data Preparation#59
Conversation
… of the specified data resource.
AmitMY
left a comment
There was a problem hiding this comment.
Review
Bugs
1. Metadata file naming mismatch — tests will fail
prepare_data.py writes to {prefix}-metadata.json (e.g. wikitext-wikitext-2-raw-v1-metadata.json), but every test opens metadata.json:
with open(f"{temp_output_dir}/metadata.json") as f: # test_prepare_data.py:532. Wrong dependency
pyproject.toml adds zstandard but the code uses Python's built-in gzip — zstandard is never imported or used.
3. Missing script entry point
The README documents welt-prepare-data as a CLI command, but [project.scripts] has no entry for it. The command won't exist after install.
4. README lists --max_bytes_per_word but it's not in the argparse
5. Hardcoded seed=42 in train.py
The new train_test_split call hardcodes seed=42 instead of using the training seed.
Code Duplication — Reuse Opportunities
6. ~100 lines of argparse duplicates DataTrainingArguments
prepare_data.py defines its own argparse for --dataset_name, --dataset_config, --dataset_split, --text_column, --text_template, --seed, etc. Most of these already exist in DataTrainingArguments. The entire argparse block could be replaced by extending DataTrainingArguments with the few new fields (unit_type, max_total_units, num_units_per_file, drop_remainder, shuffle_buffer_size, output_path) and using HfArgumentParser — the same pattern train.py already uses.
7. stream_texts() duplicates dataset loading from init_datasets()
prepare_data.py reimplements HuggingFace dataset loading with streaming + text template formatting. train.py already does this in init_datasets(). Could factor out a shared function.
8. Word segmentation reimplements processor logic
stream_examples() creates its own WordsSegmentationTokenizer and manually chunks text. processor.pretokenize_dataset() already does word segmentation. Could reuse the processor's pretokenizer.
Design Issues
9. init_datasets change in train.py bypasses processing but caller doesn't know
The new preprocessed_data_path branch returns early before process_split, but the caller in train() still runs processor.pretokenize_dataset() and pack_dataset() on the result. Preprocessed data would get double-processed.
10. import glob breaks isort ordering
Added between sys and the blank line — should be before math. Could also use pathlib.Path.glob() to avoid the import.
11. Empty dataset edge case
If zero examples match, the cleanup branch (shard_units == 0 and shard_index > 0) is never entered, leaving an empty shard file and metadata claiming num_shards=1.
Summary — what to cut/reuse
| What | Lines saved (approx) | How |
|---|---|---|
Replace argparse with extended DataTrainingArguments + HfArgumentParser |
~90 | Add 5-6 fields to dataclass, delete argparse block |
Reuse init_datasets streaming path for data loading |
~20 | Extract shared load_streaming_dataset() |
| Reuse processor's pretokenizer | ~5 | Import from processor instead of creating new instance |
Delete unused zstandard dep |
1 | Remove from pyproject.toml |
| Add missing script entry point | +1 | Add to [project.scripts] |
The biggest win is replacing the argparse with the dataclass pattern — it eliminates ~90 lines and keeps argument definitions in one place.
…e could import it and re-use when needed.
…processing as suggested by Claude
…for secure handling
|
@AmitMY I made a lot of improvements. The major one improvement is creating train and validation splits at data preparation phase. Why? We will work on multiple data resources, and this methodology allows us to compute perplexity on a balanced validation split. The minor improvements,
|
also makes the language arg required for the data preparation script
|
@AmitMY I made the following improvements,
I tested one training run for each model. |
Fixes
Fixes #58 by @ilkerkesen
Description
This PR implements a data preparation script, which allows us to create custom pretraining datasets using the specified data resources (i.e., Hugginface dataset). The training script is also adapted to work with the output produces by this script. This PR is critical, as it will enable us to pretrain a model on HPC clusters without internet access.
Technical details
The implemented script takes a Huggingface dataset as an input, streams the shuffled data, create a compressed shard of
*.jsonl.gzfile when hits the maximum number of words per shard limit, and stop when exceeds the maximum number of words limit. While processing examples, we could set a maximum sequence length limit where longer sequences are chunked, also we could drop the last chunk if we want. Additionally, I used thewords-segmentationpackage to split raw text into units. We can also set the unit type (e.g,. words vs. chars).Checklist
Update index.md).mainormaster).visible errors.