Don't fail import when the gated Llama-2 tokenizer is inaccessible (c4.py) #10

Open
kylefoxaustin wants to merge 1 commit into IST-DASLab:main from kylefoxaustin:fix-c4-gated-tokenizer-import

Conversation

@kylefoxaustin

Problem

src/data/c4.py loads the gated meta-llama/Llama-2-7b-hf tokenizer at module-import time (a top-level statement). Because src/data/utils.py imports c4, the load runs whenever any dataset is imported, so a user without gated access to that repo cannot train on any dataset (wikitext, shakespeare, etc.). They hit:

huggingface_hub.errors.GatedRepoError: 403 Client Error ...
OSError: You are trying to access a gated repo.

Fix

Guard the module-level load in try/except so import never fails. c4 still loads/uses the tokenizer when that dataset is actually selected (a user training on c4 is expected to have access).
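
For readers who want to see the shape of the guard, here is a minimal sketch. It is not the diff from this PR: `try_load_tokenizer` and its `loader` parameter are hypothetical names, and `loader` stands in for whatever call c4.py makes at import time. Only the repo name comes from the description above.

```python
import warnings

# Repo name taken from the PR description.
TOKENIZER_NAME = "meta-llama/Llama-2-7b-hf"


def try_load_tokenizer(loader, name=TOKENIZER_NAME):
    """Call loader(name); return None instead of raising if the load fails.

    `loader` is a hypothetical stand-in for the function c4.py uses to
    fetch the tokenizer (an assumption; the PR text does not show it).
    """
    try:
        return loader(name)
    except Exception as err:
        # Broad on purpose: the goal is that importing the module never
        # fails, whatever error the gated repo produces.
        warnings.warn(f"Tokenizer {name!r} unavailable: {err}")
        return None
```

Calling this at module level with a loader that raises `OSError` yields `None` plus a warning, so the import completes. Code paths that really need the tokenizer (training on c4) would then have to cope with `None`; how the actual change handles that is not shown here.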

Testing

--dataset wikitext now imports and trains with no access to meta-llama/Llama-2-7b-hf.

src/data/c4.py loads the gated meta-llama/Llama-2-7b-hf tokenizer at module-import
time. Because src/data/utils.py imports c4, this fires on ANY dataset import, so a
user without gated access to that repo cannot train on any dataset (wikitext,
shakespeare, ...) -- they hit GatedRepoError / 'trying to access a gated repo'.
Guard the load in try/except so import never fails; c4 still uses the tokenizer
when actually selected.