Word2Vec

A complete implementation of Word2Vec (Skip-gram and CBOW) with negative sampling, built using PyTorch.

Overview

Paper: Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)
Blog post: Read the detailed blog post explaining Word2Vec

Features

Two Model Architectures:
- Skip-gram: Predicts the context words from the center word
- CBOW (Continuous Bag of Words): Predicts the center word from its context words
Negative Sampling: Efficient training with negative samples drawn from the unigram distribution raised to the 0.75 power
Complete Pipeline:
- Text preprocessing and tokenization
- Vocabulary building with frequency filtering
- Training pair generation
- Word similarity search
- Word analogies
- t-SNE visualization of embeddings
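
As a rough illustration of the training pair generation step, and of how the two architectures differ, the sketch below builds pairs from a token list using only the standard library. The function names `skipgram_pairs` and `cbow_pairs` and the fixed symmetric window are assumptions made for this example; the code in word2vec.py may be organized differently.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 3) -> List[Tuple[str, str]]:
    """Return one (center, context) pair per context word (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens: List[str], window: int = 3) -> List[Tuple[List[str], str]]:
    """Return one (context_words, center) pair per position (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip positions with no neighbours (single-token input)
            pairs.append((context, center))
    return pairs


if __name__ == "__main__":
    tokens = "the sun is shining".split()
    print(skipgram_pairs(tokens, window=1))
    print(cbow_pairs(tokens, window=1))
```

Skip-gram yields one training example per (center, context) combination, while CBOW yields one example per position with the whole context grouped together, so the same text produces more Skip-gram examples than CBOW examples.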

Training Details

Corpus: Music lyrics from Linkin Park, Pink Floyd, The Beatles, Nirvana, Metallica, and The Doors
Vocabulary Size: 3,885 unique words
Embedding Dimension: 100
Window Size: 3
Negative Samples: 5
Learning Rate: 0.005
Epochs: 60
Device: CUDA (GPU) if available, else CPU
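
To make the negative sampling settings concrete, here is a minimal sketch of drawing 5 negatives per positive pair from word counts raised to the 0.75 power. The helper names, the toy counts, and the choice to exclude the positive words are assumptions for illustration, not the repository's actual code.

```python
import random
from typing import Dict, Iterable, List, Optional


def negative_sampling_probs(counts: Dict[str, int], power: float = 0.75) -> Dict[str, float]:
    """Normalize count ** power into a sampling distribution (hypothetical helper)."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def sample_negatives(
    probs: Dict[str, float],
    k: int = 5,
    exclude: Iterable[str] = (),
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw k negatives with replacement, skipping the words in `exclude`."""
    rng = rng or random.Random()
    excluded = set(exclude)
    words = [w for w in probs if w not in excluded]
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)


if __name__ == "__main__":
    # Made-up counts, not taken from the lyrics corpus.
    counts = {"the": 16, "love": 1, "yesterday": 1}
    probs = negative_sampling_probs(counts)
    print(probs)
    print(sample_negatives(probs, k=5, exclude={"love"}, rng=random.Random(0)))
```

With these toy counts, "the" accounts for 16/18 (about 0.89) of the raw frequency but only 0.8 of the sampling mass, because 16 ** 0.75 = 8 while the two rare words keep a weight of 1. The exponent therefore shifts some probability toward rare words.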

TODO

Implement subsampling of frequent words to improve training quality and speed
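
One possible starting point for this item is sketched below. It uses a common formulation in which a word with relative frequency f is kept with probability min(1, sqrt(t / f)). The threshold `t`, its default value, and the function names are assumptions; a corpus this small would likely need a larger threshold than the values used for very large corpora.

```python
import math
import random
from collections import Counter
from typing import List, Optional


def keep_probability(freq: float, t: float = 1e-3) -> float:
    """Probability of keeping a word with relative frequency `freq`."""
    return min(1.0, math.sqrt(t / freq))


def subsample(tokens: List[str], t: float = 1e-3, rng: Optional[random.Random] = None) -> List[str]:
    """Randomly drop frequent words; rare words are mostly or always kept."""
    rng = rng or random.Random()
    counts = Counter(tokens)
    total = len(tokens)
    return [
        word
        for word in tokens
        if rng.random() < keep_probability(counts[word] / total, t)
    ]


if __name__ == "__main__":
    # Toy token stream: one very frequent word and one rarer word.
    tokens = ["the"] * 90 + ["rare"] * 10
    kept = subsample(tokens, t=0.01, rng=random.Random(0))
    print(Counter(kept))
```

Subsampling would run before training pair generation, so dropped words neither act as centers nor appear in any context window.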
