This project aims to perform sentiment analysis on user reviews of an application. The process involves data scraping, preprocessing, labeling, exploratory data analysis, and developing machine learning models (LSTM, Bi-LSTM, GRU) to classify sentiment into negative, neutral, and positive categories.
- Data Scraping: Independently scrape data using Python from sources such as the Google Play Store, X, Instagram, or e-commerce sites, collecting at least 3,000 samples.
- Data Preprocessing & Labeling: Data must undergo feature extraction and labeling before model training.
- Model Performance: Developed models must achieve a minimum accuracy of 85% on the testing set.
- Additional Challenge (Deep Learning): For extra points, use deep learning algorithms that reach above 92% accuracy on both the training and testing sets, with a dataset of at least 10,000 samples covering three sentiment classes.
- Training Schemes: Implement three different training schemes with varying combinations of algorithms, feature extraction methods, or data splitting.
- Inference/Testing: Include an inference or testing process that yields categorical output (negative, neutral, positive) in `.ipynb` or `.py` format.
- Deliverables: The submission must include a training notebook (`.ipynb`), scraping code (`.py` or `.ipynb`), `requirements.txt`, scraped dataset (`.csv` or `.json`), and a compressed folder (zip).
- Notebook Execution: The submitted notebook must be pre-executed, showing all outputs.
Name: Sarah Adibah
Email: sarahadibah@06gmail.com
Required libraries such as google-play-scraper, nltk, sastrawi, and tensorflow are installed to support data scraping, text preprocessing, and model development.
Essential Python libraries like pandas, seaborn, matplotlib, tensorflow, sklearn, nltk, and Sastrawi are imported for data manipulation, visualization, machine learning, and natural language processing tasks.
User reviews for the 'com.linkedin.android' application are scraped from the Google Play Store using google_play_scraper. A total of 10,000 reviews are collected and saved to a CSV file.
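The scraping step could look roughly like the sketch below. It is not the project's actual script: the `lang="id"` and `country="id"` values, the sort order, the selected fields, and the helper names `to_rows`, `scrape_reviews`, and `save_csv` are all assumptions made for illustration. Only the app ID and the 10,000 review count come from the description above.

```python
import csv

# Fields kept from each review; the key names follow the dictionaries
# returned by google_play_scraper.reviews(). The selection is an assumption.
FIELDS = ["reviewId", "content", "score", "at"]


def to_rows(raw_reviews, fields=FIELDS):
    """Keep only the selected fields from each raw review dict."""
    return [{f: r.get(f) for f in fields} for r in raw_reviews]


def scrape_reviews(app_id="com.linkedin.android", count=10_000,
                   lang="id", country="id"):
    """Fetch up to `count` reviews (lang/country are assumed values)."""
    # Imported lazily so the helpers in this sketch also work
    # when google-play-scraper is not installed.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        app_id, lang=lang, country=country, sort=Sort.NEWEST, count=count
    )
    return to_rows(result)


def save_csv(rows, path, fields=FIELDS):
    """Write the review rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


# Usage (needs network access; the file name is hypothetical):
# save_csv(scrape_reviews(), "linkedin_reviews.csv")
```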
The scraped data is loaded into a pandas DataFrame. Irrelevant columns are dropped, and duplicate rows are removed, resulting in a clean dataset for further processing.
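As a minimal sketch of this cleanup, assuming the review text lives in a `content` column and the rating in a `score` column (the column choice and the helper name `load_clean` are illustrative, not taken from the notebook):

```python
import pandas as pd


def load_clean(df, keep=("content", "score")):
    """Keep the relevant columns, then remove duplicate rows."""
    out = df[list(keep)]                       # drop every other column
    out = out.drop_duplicates().reset_index(drop=True)
    return out


# Usage (file name is hypothetical):
# df = load_clean(pd.read_csv("linkedin_reviews.csv"))
```

Selecting the columns before calling `drop_duplicates()` means two rows count as duplicates when their kept columns match, even if dropped fields such as the user name differ.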
This section covers the text preprocessing steps, followed by labeling, dataset statistics, and data preparation:
- Data Cleaning: Functions are defined to remove mentions, hashtags, links, numbers, punctuation, and extra spaces from the review text.
- Casefolding: All text is converted to lowercase.
- Slangword Fixing: Common Indonesian slang words are replaced with their standard equivalents using a predefined dictionary.
- Tokenizing: Text is tokenized into individual words.
- Filtering: Stopwords (common words with little semantic value) are removed.
- Data Labeling: Sentiment polarity (positive, negative, neutral) is determined using a lexicon-based approach. Lexicons for positive and negative words are loaded from GitHub.
- Dataset Statistics: The distribution of sentiment classes is visualized using a pie chart and a bar chart.
- Word Cloud: Word clouds are generated for positive, negative, and neutral reviews to visualize the most frequent words in each sentiment category.
- One Hot Encoding: The sentiment polarity is converted into one-hot encoded columns.
- Data Splitting: The dataset is split into training and testing sets for model development and evaluation.
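The cleaning, labeling, and encoding steps above can be sketched in plain Python as follows. The tiny dictionaries and their weights are placeholders invented for illustration; the project uses a larger slang dictionary, a full stopword list, and lexicons loaded from GitHub. The scoring rule (sum of word weights, sign decides the class) is an assumption about how the lexicon-based labeling works.

```python
import re
import string

# Illustrative placeholders only, not the project's real resources.
SLANG = {"gk": "tidak", "bgt": "banget", "yg": "yang"}
STOPWORDS = {"yang", "dan", "di", "ini"}
LEXICON_POS = {"bagus": 3, "mudah": 2}
LEXICON_NEG = {"lambat": -3, "error": -4}
CLASSES = ["negative", "neutral", "positive"]

# Map every punctuation character to a space so words do not merge.
_PUNCT = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Remove links, mentions, hashtags, numbers, punctuation, extra spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # mentions, hashtags
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = text.translate(_PUNCT)                       # punctuation
    return re.sub(r"\s+", " ", text).strip()            # extra spaces


def preprocess(text):
    """Clean, casefold, tokenize, fix slang per token, drop stopwords."""
    tokens = clean_text(text).lower().split()
    tokens = [SLANG.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def label(tokens):
    """Lexicon-based polarity: the sign of the summed word weights."""
    score = sum(LEXICON_POS.get(t, 0) + LEXICON_NEG.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def one_hot(polarity):
    """Encode a polarity label as a 0/1 vector in CLASSES order."""
    return [int(polarity == c) for c in CLASSES]
```

With these placeholder dictionaries, `preprocess("Aplikasi ini BAGUS bgt!!! @linkedin https://x.co/a1 #top 123")` returns `["aplikasi", "bagus", "banget"]`, which `label` maps to `"positive"`.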
Deep learning models are developed for sentiment classification:
- Tokenizing: Text data is tokenized and padded for input into neural networks.
- Callback and Function Initialization: Custom callbacks for early stopping and a utility function for plotting model accuracy and loss are defined.
- LSTM: A Long Short-Term Memory (LSTM) model is built, compiled, and trained.
- Bi-LSTM: A Bidirectional LSTM (Bi-LSTM) model is built, compiled, and trained.
- GRU: A Gated Recurrent Unit (GRU) model is built, compiled, and trained.
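The sketch below illustrates the ideas in this list under stated assumptions. The vocabulary and padding helpers are a plain-Python stand-in for what a Keras tokenizer and sequence padding would do (they pad at the end, which may differ from the notebook). The layer sizes, dropout rate, and the names `fit_vocab`, `encode`, and `build_model` are hypothetical. The actual architectures in the notebook may differ.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary tokens
KINDS = ("lstm", "bilstm", "gru")


def fit_vocab(token_lists, max_words=10_000):
    """Map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    common = counts.most_common(max_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(common)}


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, pad the end with PAD."""
    ids = [vocab.get(t, OOV) for t in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def build_model(kind, vocab_size, embed_dim=64, units=64, num_classes=3):
    """Build and compile one of the three recurrent classifiers."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    import tensorflow as tf  # imported lazily; only needed to build models

    layers = tf.keras.layers
    if kind == "lstm":
        recurrent = layers.LSTM(units)
    elif kind == "gru":
        recurrent = layers.GRU(units)
    else:
        recurrent = layers.Bidirectional(layers.LSTM(units))

    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        recurrent,
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # categorical_crossentropy matches one-hot encoded labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Sharing one builder across the three variants keeps the embedding and output layers identical, so differences in accuracy can be attributed to the recurrent layer.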
The trained models are evaluated based on their accuracy on both the training and testing datasets. The results are summarized in a DataFrame.
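For reference, accuracy on one-hot labels can be computed as in this small sketch. The helper names are hypothetical, and the notebook most likely relies on the accuracy metric reported by the framework. Here a plain list of dictionaries stands in for the summary DataFrame.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(y_true_onehot, y_prob):
    """Share of samples whose most probable class matches the label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true_onehot, y_prob))
    return hits / len(y_true_onehot)


def summarize(results):
    """Turn {model: (train_acc, test_acc)} into rows sorted by test accuracy."""
    rows = [
        {"model": name, "train_acc": train, "test_acc": test}
        for name, (train, test) in results.items()
    ]
    return sorted(rows, key=lambda r: r["test_acc"], reverse=True)
```

The rows returned by `summarize` can be passed directly to `pandas.DataFrame(...)` to obtain a table like the one described above.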
A function sentiment_predict is provided to take new text input, preprocess it, and predict its sentiment using the best-performing model (GRU in this case).
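The inference flow can be sketched as below. This is an illustration rather than the notebook's function: the extra parameters (`preprocess`, `encode`, `predict_proba`) are hypothetical hooks standing in for the project's preprocessing, tokenizer, and trained model, and the class order is assumed to follow the one-hot column order.

```python
# Assumed order of the one-hot encoded sentiment columns.
CLASSES = ["negative", "neutral", "positive"]


def sentiment_predict(text, preprocess, encode, predict_proba,
                      classes=CLASSES):
    """Preprocess the text, score it, and return the most likely class."""
    tokens = preprocess(text)            # same steps as used in training
    probs = predict_proba(encode(tokens))  # one probability per class
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best]
```

With a Keras model, `predict_proba` could be a small wrapper that passes a batch containing one padded sequence to `model.predict` and returns the first row of the result.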
The requirements.txt file is generated, listing all the Python packages and their versions used in the project.
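One possible way to produce such a file from inside Python is sketched here using the standard library. How the project actually generates it is not stated, so treat this as an assumption. The helper names are hypothetical. Note that the distribution name for `sklearn` is `scikit-learn`.

```python
from importlib import metadata


def requirement_lines(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            continue  # skip packages missing from this environment
    return lines


def write_requirements(packages, path="requirements.txt"):
    """Write the pinned requirement lines to a file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(requirement_lines(packages)) + "\n")


# Usage, with the libraries named in this README:
# write_requirements(["google-play-scraper", "nltk", "Sastrawi", "tensorflow",
#                     "pandas", "seaborn", "matplotlib", "scikit-learn"])
```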