addsarah/sentiment-analysis-ml

Sentiment Analysis using Machine Learning

Project Description

This project aims to perform sentiment analysis on user reviews of an application. The process involves data scraping, preprocessing, labeling, exploratory data analysis, and developing machine learning models (LSTM, Bi-LSTM, GRU) to classify sentiment into negative, neutral, and positive categories.

Submission Requirements (as per project brief)

  • Data Scraping: Independently scrape data using Python from sources like PlayStore, X, Instagram, or e-commerce, with a minimum of 3,000 samples.
  • Data Preprocessing & Labeling: Data must undergo feature extraction and labeling before model training.
  • Model Performance: Developed models must achieve a minimum accuracy of 85% on the testing set.
  • Additional Challenge (Deep Learning): For extra points, use deep learning algorithms with training and testing accuracy above 92%, and a dataset of at least 10,000 samples covering three sentiment classes.
  • Training Schemes: Implement three different training schemes with varying combinations of algorithms, feature extraction methods, or data splitting.
  • Inference/Testing: Include an inference or testing process that yields categorical output (negative, neutral, positive) in .ipynb or .py format.
  • Deliverables: The submission must include a training notebook (.ipynb), scraping code (.py or .ipynb), requirements.txt, scraped dataset (.csv or .json), and a compressed folder (zip).
  • Notebook Execution: The submitted notebook must be pre-executed, showing all outputs.

Author

Name: Sarah Adibah

Email: sarahadibah@06gmail.com

Project Structure and Key Steps

1. Library Installation

Required libraries such as google-play-scraper, nltk, sastrawi, and tensorflow are installed to support data scraping, text preprocessing, and model development.

2. Library Import

The notebook imports pandas for data manipulation, seaborn and matplotlib for visualization, tensorflow and sklearn (scikit-learn) for machine learning, and nltk and Sastrawi for natural language processing.

3. Data Scraping

User reviews for the LinkedIn Android app (package ID com.linkedin.android) are scraped from the Google Play Store using the google-play-scraper package, which is imported as google_play_scraper. A total of 10,000 reviews are collected and saved to a CSV file.
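
The scraping step could look roughly like the sketch below. It is not the notebook's actual code: the lang and country values, the selected columns, and the helper names fetch_reviews and save_reviews are assumptions made for illustration. Only the app ID and the review count come from this README.

```python
import csv

APP_ID = "com.linkedin.android"  # app ID named in this README


def fetch_reviews(app_id=APP_ID, count=10_000, lang="id", country="id"):
    """Download up to `count` reviews (lang and country are assumptions)."""
    # Imported lazily so the CSV helper below works without the package.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        app_id, lang=lang, country=country, sort=Sort.NEWEST, count=count
    )
    return result  # list of dicts, one per review


def save_reviews(rows, path, fields=("userName", "score", "at", "content")):
    """Write the chosen fields of each review dict to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops review keys that are not in `fields`.
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example call (needs network access and google-play-scraper installed):
# save_reviews(fetch_reviews(), "linkedin_reviews.csv")
```

Keeping the download and the CSV writing in separate functions means the saved file can be regenerated or re-filtered without scraping again.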

4. Loading Dataset

The scraped data is loaded into a pandas DataFrame. Irrelevant columns are dropped, and duplicate rows are removed, resulting in a clean dataset for further processing.

5. Data Preprocessing

This section covers text preprocessing, sentiment labeling, and dataset preparation:

  • Data Cleaning: Functions are defined to remove mentions, hashtags, links, numbers, punctuation, and extra spaces from the review text.
  • Casefolding: All text is converted to lowercase.
  • Slangword Fixing: Common Indonesian slang words are replaced with their standard equivalents using a predefined dictionary.
  • Tokenizing: Text is tokenized into individual words.
  • Filtering: Stopwords (common words with little semantic value) are removed.
  • Data Labeling: Sentiment polarity (positive, negative, neutral) is determined using a lexicon-based approach. Lexicons for positive and negative words are loaded from GitHub.
  • Dataset Statistics: The distribution of sentiment classes is visualized using a pie chart and a bar chart.
  • Word Cloud: Word clouds are generated for positive, negative, and neutral reviews to visualize the most frequent words in each sentiment category.
  • One Hot Encoding: The sentiment polarity is converted into one-hot encoded columns.
  • Data Splitting: The dataset is split into training and testing sets for model development and evaluation.
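
The cleaning, slang fixing, filtering, labeling, and encoding steps above can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's code: the tiny SLANG, STOPWORDS, and lexicon dictionaries (including their weights) are made-up samples, tokenizing uses a plain whitespace split instead of nltk, and treating a score of exactly zero as neutral is an assumption.

```python
import re
import string

# Tiny illustrative samples; the project loads much larger resources.
SLANG = {"gak": "tidak", "bgt": "banget", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}
LEXICON_POSITIVE = {"bagus": 3, "mudah": 2}      # hypothetical weights
LEXICON_NEGATIVE = {"lambat": -3, "error": -4}   # hypothetical weights

# Map every punctuation character to a space so adjacent words stay separate.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Remove links, mentions, hashtags, digits, punctuation, extra spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"#\w+", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text):
    """Clean, casefold, tokenize, fix slang, and drop stopwords."""
    tokens = clean_text(text).lower().split()
    tokens = [SLANG.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]


def label_sentiment(tokens):
    """Sum lexicon weights; the sign of the total decides the class."""
    score = sum(
        LEXICON_POSITIVE.get(tok, 0) + LEXICON_NEGATIVE.get(tok, 0)
        for tok in tokens
    )
    if score > 0:
        return score, "positive"
    if score < 0:
        return score, "negative"
    return score, "neutral"


def one_hot(label, classes=("negative", "neutral", "positive")):
    """Encode a label as a 0/1 vector in the given class order."""
    return [1 if label == cls else 0 for cls in classes]
```

For example, preprocess("Aplikasinya BAGUS bgt!!") yields the tokens aplikasinya, bagus, banget, which label_sentiment scores as positive under the sample lexicon.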

6. Model Development

Deep learning models are developed for sentiment classification:

  • Tokenizing: Text data is tokenized and padded for input into neural networks.
  • Callback and Function Initialization: Custom callbacks for early stopping and a utility function for plotting model accuracy and loss are defined.
  • LSTM: A Long Short-Term Memory (LSTM) model is built, compiled, and trained.
  • Bi-LSTM: A Bidirectional LSTM (Bi-LSTM) model is built, compiled, and trained.
  • GRU: A Gated Recurrent Unit (GRU) model is built, compiled, and trained.
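
As a rough illustration of how the three recurrent models differ, the sketch below builds each one from the same template and swaps only the recurrent layer. The layer sizes, dropout rate, optimizer, and the helper name build_rnn_classifier are assumptions, not the notebook's actual architecture. The categorical cross-entropy loss is chosen because the labels are one-hot encoded.

```python
RNN_KINDS = ("lstm", "bilstm", "gru")


def build_rnn_classifier(kind, vocab_size, embed_dim=64, units=64, num_classes=3):
    """Return a compiled Keras classifier using the requested recurrent layer."""
    if kind not in RNN_KINDS:
        raise ValueError(f"kind must be one of {RNN_KINDS}, got {kind!r}")

    # Lazy import: TensorFlow is only needed once a model is actually built.
    from tensorflow.keras import Sequential, layers

    if kind == "lstm":
        recurrent = layers.LSTM(units)
    elif kind == "bilstm":
        # Reads the sequence in both directions and concatenates the outputs.
        recurrent = layers.Bidirectional(layers.LSTM(units))
    else:
        recurrent = layers.GRU(units)

    model = Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        recurrent,
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # matches one-hot encoded labels
        metrics=["accuracy"],
    )
    return model
```

Each model would then be trained with model.fit on the padded sequences and one-hot labels, passing the early-stopping callback through the callbacks argument.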

7. Model Evaluation

The trained models are evaluated based on their accuracy on both the training and testing datasets. The results are summarized in a DataFrame.

8. Model Testing

The sentiment_predict function takes new text input, preprocesses it, and predicts its sentiment using the best-performing model (GRU in this case).
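
The inference flow can be sketched as follows. The function name predict_sentiment_sketch, its parameters, and the label order are assumptions chosen for illustration. The notebook's own sentiment_predict may be organized differently. The preprocessing, vectorizing, and model steps are passed in as callables so the sketch stays independent of any particular library.

```python
LABELS = ("negative", "neutral", "positive")  # assumed one-hot column order


def predict_sentiment_sketch(text, preprocess, vectorize, predict_proba, labels=LABELS):
    """Run text through the pipeline and return (label, probability)."""
    tokens = preprocess(text)          # same cleaning steps used in training
    features = vectorize(tokens)       # e.g. integer encoding plus padding
    probabilities = list(predict_proba(features))
    if len(probabilities) != len(labels):
        raise ValueError("model output size does not match the label list")
    # Pick the class with the highest predicted probability.
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]
```

With a trained Keras model, predict_proba would typically wrap model.predict and return the probability row for the single input. The label order must match the column order used when the training labels were one-hot encoded.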

9. Requirements

The requirements.txt file is generated, listing all the Python packages and their versions used in the project.
