Fix WikiText-103 download (dead metamind S3 URL -> working mirror + UA)#11

Open
kylefoxaustin wants to merge 1 commit into IST-DASLab:main from kylefoxaustin:fix-wikitext-dead-url

Conversation

@kylefoxaustin

Problem

get_wikitext_data downloads WikiText-103 from https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip. That URL now returns HTTP 301 because the metamind S3 bucket is gone, so --dataset wikitext fails with:

urllib.error.HTTPError: HTTP Error 301: Moved Permanently

Fix

Point at the Smerity mirror (https://wikitext.smerity.com/wikitext-103-raw-v1.zip), which serves the identical zip (wikitext-103-raw/wiki.{train,valid,test}.raw). That host rejects the default Python urllib User-Agent with 403, so the download is issued via urllib.request.Request with a browser UA.
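
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (build_request, download, extract, missing_files), the BROWSER_UA string, and the destination paths are all assumptions made for the example. Only the mirror URL and the zip layout come from the description above.

```python
import os
import shutil
import urllib.request
import zipfile

# Mirror URL taken from the PR description.
WIKITEXT_URL = "https://wikitext.smerity.com/wikitext-103-raw-v1.zip"

# Assumed value: any browser-like User-Agent string should serve the same purpose.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

# Zip layout stated in the PR description.
EXPECTED_FILES = (
    "wikitext-103-raw/wiki.train.raw",
    "wikitext-103-raw/wiki.valid.raw",
    "wikitext-103-raw/wiki.test.raw",
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that overrides urllib's default User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, zip_path, user_agent=BROWSER_UA):
    """Stream the response body to zip_path without holding it all in memory."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req) as resp, open(zip_path, "wb") as out:
        shutil.copyfileobj(resp, out)


def extract(zip_path, dest_dir):
    """Unpack the archive into dest_dir (created if it does not exist)."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)


def missing_files(dest_dir, expected=EXPECTED_FILES):
    """List the expected split files that are absent under dest_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(dest_dir, name))]
```

A caller might run download(WIKITEXT_URL, "wikitext-103-raw-v1.zip"), then extract("wikitext-103-raw-v1.zip", "data"), and treat a non-empty result from missing_files("data") as a failed download. The file and directory names here are placeholders.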

Testing

--dataset wikitext downloads, extracts, and tokenizes successfully on a fresh machine.

get_wikitext_data downloaded from the metamind S3 bucket, which now returns
HTTP 301 (the bucket is gone), so --dataset wikitext fails with HTTPError 301.
Point at the Smerity mirror (same zip layout) and send a browser User-Agent,
since that host rejects the default Python urllib UA with 403.