OmarDawoud4/word2vec

Word2Vec

A complete implementation of Word2Vec (Skip-gram and CBOW) with negative sampling, built using PyTorch.

Overview

Features

  • Two Model Architectures:

    • Skip-gram: Predicts the surrounding context words from the center word
    • CBOW (Continuous Bag of Words): Predicts the center word from its surrounding context
  • Negative Sampling: Efficient training with negative samples drawn from the unigram distribution raised to the 0.75 power

  • Complete Pipeline:

    • Text preprocessing and tokenization
    • Vocabulary building with frequency filtering
    • Training pair generation
    • Word similarity search
    • Word analogies
    • t-SNE visualization of embeddings
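
The first pipeline stages and the 0.75-power negative sampler can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names (`build_vocab`, `skipgram_pairs`, `negative_sampler`), the `min_count` threshold, and the toy sentence are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import accumulate

def build_vocab(tokens, min_count=1):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(tokens)
    kept = [(w, c) for w, c in counts.most_common() if c >= min_count]
    word2id = {w: i for i, (w, _) in enumerate(kept)}
    freqs = [c for _, c in kept]  # freqs[i] is the count of word id i
    return word2id, freqs

def skipgram_pairs(ids, window=3):
    """Return (center, context) id pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        pairs.extend((center, ids[j]) for j in range(lo, hi) if j != i)
    return pairs

def negative_sampler(freqs, power=0.75, seed=0):
    """Return a function that draws ids with probability proportional to count**power."""
    cum = list(accumulate(f ** power for f in freqs))
    rng = random.Random(seed)

    def sample(k, exclude=()):
        out = []
        while len(out) < k:  # assumes the vocabulary is larger than `exclude`
            (idx,) = rng.choices(range(len(freqs)), cum_weights=cum)
            if idx not in exclude:
                out.append(idx)
        return out

    return sample

# Toy text (hypothetical), lowercased and split on whitespace.
tokens = "the band played the song and the crowd sang the song".lower().split()
word2id, freqs = build_vocab(tokens)
ids = [word2id[w] for w in tokens if w in word2id]
pairs = skipgram_pairs(ids, window=3)
sample = negative_sampler(freqs)
negatives = sample(5, exclude=set(pairs[0]))
```

Raising counts to the 0.75 power flattens the distribution, so very frequent words are drawn as negatives less often than their raw frequency would suggest, while rare words are drawn somewhat more often.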

Training Details

  • Corpus: Music lyrics from Linkin Park, Pink Floyd, The Beatles, Nirvana, Metallica, and The Doors
  • Vocabulary Size: 3,885 unique words
  • Embedding Dimension: 100
  • Window Size: 3
  • Negative Samples: 5
  • Learning Rate: 0.005
  • Epochs: 60
  • Device: CUDA (GPU) if available, else CPU
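
For reference, the per-example objective that negative sampling optimizes can be written without any framework. The sketch below is dependency-free Python rather than the project's PyTorch code; the `CONFIG` dictionary, the helper names, and the random initialization range are assumptions for illustration only.

```python
import math
import random

# Hyperparameters as listed above; the dictionary itself is illustrative.
CONFIG = {"embedding_dim": 100, "window": 3, "negatives": 5, "lr": 0.005, "epochs": 60}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    loss = -log sigmoid(center . context) - sum_k log sigmoid(-center . negative_k)
    """
    loss = -log_sigmoid(dot(center, context))
    for neg in negatives:
        loss -= log_sigmoid(-dot(center, neg))
    return loss

rng = random.Random(0)

def random_vector(dim):
    """Small random vector; the range is an arbitrary choice for this sketch."""
    return [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]

dim, k = CONFIG["embedding_dim"], CONFIG["negatives"]
center = random_vector(dim)
context = random_vector(dim)
negs = [random_vector(dim) for _ in range(k)]
example_loss = sgns_loss(center, context, negs)
```

The same function covers CBOW if `center` is replaced by the average of the context word vectors and `context` by the vector of the word being predicted.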

TODO

  • Implement subsampling of frequent words to improve training quality and speed
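
One possible shape for this item, following the subsampling rule from the original word2vec paper: each occurrence of a word with relative frequency f is kept with probability min(1, sqrt(t / f)). The function names and the threshold `t` below are assumptions; the paper suggests values around 1e-5 for very large corpora, and a corpus of this size would likely need a much larger threshold.

```python
import math
import random
from collections import Counter

def keep_probabilities(tokens, t=1e-3):
    """Per-word probability of keeping one occurrence: min(1, sqrt(t / f))."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: min(1.0, math.sqrt(t / (c / total))) for w, c in counts.items()}

def subsample(tokens, t=1e-3, seed=0):
    """Randomly drop occurrences of frequent words before generating pairs."""
    rng = random.Random(seed)
    keep = keep_probabilities(tokens, t)
    return [w for w in tokens if rng.random() < keep[w]]

# Toy corpus (hypothetical): one very frequent word and one rare word.
corpus = ["the"] * 990 + ["riff"] * 10
keep = keep_probabilities(corpus, t=0.01)
reduced = subsample(corpus, t=0.01)
```

In this toy corpus every occurrence of the rare word survives, while most occurrences of the frequent word are dropped, which shortens training and reduces the dominance of function words in the generated pairs.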

About

Implementation of the word2vec paper
