Sentiment Analysis

Supervised classification of textual reviews into one of five sentiment polarities, from strong negative to strong positive.

Methodology

Text Pre-processing: The raw reviews were converted into a form suitable for feature extraction. The following steps were applied:
- Case normalisation
- Tokenisation
- Lemmatisation
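
The three pre-processing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the regular expression, the `preprocess` function, and the tiny `LEMMA_TABLE` are all assumptions made for the example. A real pipeline would normally use a dictionary-based lemmatiser rather than a hand-written table.

```python
import re
from typing import List

# Hypothetical, deliberately tiny lemma table for illustration only.
LEMMA_TABLE = {
    "was": "be",
    "were": "be",
    "is": "be",
    "batteries": "battery",
    "better": "good",
}

# Assumed token pattern: runs of letters/digits, with an optional apostrophe suffix.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def preprocess(review: str) -> List[str]:
    """Lower-case, tokenise, and lemmatise a single review."""
    lowered = review.lower()                 # case normalisation
    tokens = TOKEN_PATTERN.findall(lowered)  # tokenisation
    # Lemmatisation: map known word forms to their base form, keep the rest.
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]


print(preprocess("The batteries were BETTER than expected!"))
# ['the', 'battery', 'be', 'good', 'than', 'expected']
```
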
Feature Generation: Once the data was cleaned, relevant features were extracted from it, such as:
- Creation of N-grams
- Term frequency and inverse document frequency (TF-IDF)
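
As a rough sketch of these two feature types, the snippet below builds word n-grams and weights them by term frequency times inverse document frequency. The function names, the choice of unigrams plus bigrams (`n_max=2`), and the unsmoothed weighting `tf * log(N / df)` are assumptions for illustration. The project does not state which n-gram range or TF-IDF variant was used, and library implementations often apply smoothing.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n_max: int = 2) -> List[str]:
    """Return all contiguous n-grams of length 1..n_max, joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs: List[List[str]], n_max: int = 2) -> List[Dict[str, float]]:
    """Weight each n-gram by term frequency times inverse document frequency."""
    grams_per_doc = [ngrams(doc, n_max) for doc in docs]
    # Document frequency: number of documents that contain each n-gram.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        vectors.append({
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return vectors


vectors = tfidf([["good", "phone"], ["bad", "phone"]])
# "phone" occurs in every document, so its weight is 0;
# "good" is specific to the first review and gets a positive weight.
print(vectors[0])
```
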
Model: Logistic regression is the classifier used to determine the polarity of a review.
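
To illustrate how logistic regression extends to five classes, here is a small from-scratch softmax (multinomial logistic) regression trained with stochastic gradient descent on sparse feature dictionaries. It is a sketch under assumptions: the project does not say whether a multinomial or one-vs-rest formulation was used, nor which library or hyperparameters; `train`, `predict`, `epochs`, and `lr` are hypothetical names and values. In practice a library implementation would be the usual choice.

```python
import math
from typing import Dict, List, Tuple

LABELS = [1, 2, 3, 4, 5]  # the five polarity levels


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights: List[Dict[str, float]], bias: List[float],
                  features: Dict[str, float]) -> List[float]:
    """Class probabilities for one sparse feature vector."""
    scores = [
        bias[k] + sum(weights[k].get(f, 0.0) * v for f, v in features.items())
        for k in range(len(LABELS))
    ]
    return softmax(scores)


def train(samples: List[Dict[str, float]], labels: List[int],
          epochs: int = 200, lr: float = 0.5
          ) -> Tuple[List[Dict[str, float]], List[float]]:
    """Fit one weight dictionary and one bias per class by plain SGD."""
    weights: List[Dict[str, float]] = [dict() for _ in LABELS]
    bias = [0.0] * len(LABELS)
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            probs = predict_proba(weights, bias, features)
            for k, lab in enumerate(LABELS):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                error = probs[k] - (1.0 if lab == label else 0.0)
                bias[k] -= lr * error
                for f, v in features.items():
                    weights[k][f] = weights[k].get(f, 0.0) - lr * error * v
    return weights, bias


def predict(weights: List[Dict[str, float]], bias: List[float],
            features: Dict[str, float]) -> int:
    """Return the label with the highest predicted probability."""
    probs = predict_proba(weights, bias, features)
    return LABELS[probs.index(max(probs))]


# Toy data: one made-up feature per polarity level.
toy_x = [{"terrible": 1.0}, {"poor": 1.0}, {"okay": 1.0},
         {"good": 1.0}, {"excellent": 1.0}]
toy_y = [1, 2, 3, 4, 5]
w, b = train(toy_x, toy_y)
print([predict(w, b, x) for x in toy_x])
```
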

Datasets:

train_data.csv:

The training set consists of 650,000 product reviews.
train_label.csv:

This dataset contains the sentiment labels of the training dataset. The label set (1, 2, 3, 4, 5) refers to five polarity levels (strong negative, weak negative, neutral, weak positive, and strong positive) respectively.
test_data.csv:

The test set consists of 50,000 product reviews.
predicted_label.csv:

This dataset contains the predicted sentiment labels of the test data.
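
For reading the label files, the label scheme described for train_label.csv can be captured as a small lookup table. This is only a convenience sketch: `POLARITY_NAMES` and `describe` are hypothetical names, and the snippet makes no assumption about the column layout of the CSV files.

```python
# Mapping from numeric label to polarity level, as described for train_label.csv.
POLARITY_NAMES = {
    1: "strong negative",
    2: "weak negative",
    3: "neutral",
    4: "weak positive",
    5: "strong positive",
}


def describe(label: int) -> str:
    """Translate a numeric sentiment label into its polarity name."""
    if label not in POLARITY_NAMES:
        raise ValueError(f"unknown sentiment label: {label!r}")
    return POLARITY_NAMES[label]


print(describe(4))  # weak positive
```
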