This project predicts IMDb movie ratings based on key features such as release year, duration, and number of votes. It uses a regression model built with Python, scikit-learn, and Pandas, and follows a complete machine learning pipeline from data preprocessing to model deployment.
- Dataset: IMDb Movies India.csv (from Kaggle)
- Goal: Predict movie ratings using features like Year, Duration, and Votes
- Model Used: `SGDRegressor` (Stochastic Gradient Descent)
- Libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`
- Handled null values and cleaned feature columns (e.g., removed symbols from strings)
- Converted datatypes (`Year`, `Votes`, `Duration`)
- Removed less relevant columns (e.g., `Genre`, `Actors`)
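
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the raw string formats (`"(2019)"`, `"109 min"`, `"1,086"`), the sample rows, and the helper name `clean_movies` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw rows; the string formats are assumed for illustration only.
raw = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", "90 min", "110 min"],
    "Votes": ["1,086", "35", "8"],
    "Rating": [7.0, 4.4, None],
    "Genre": ["Drama", "Comedy", "Drama"],
    "Actors": ["A", "B", "C"],
})

def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, remove rows with nulls, and convert strings to integers."""
    df = df.drop(columns=["Genre", "Actors"])
    df = df.dropna(subset=["Year", "Duration", "Votes", "Rating"]).copy()
    for col in ["Year", "Duration", "Votes"]:
        # Remove every non-digit character (brackets, commas, units), then parse.
        digits = df[col].astype(str).str.replace(r"\D", "", regex=True)
        df[col] = pd.to_numeric(digits)
    return df.reset_index(drop=True)

cleaned = clean_movies(raw)
print(cleaned)
print(cleaned.dtypes)
```

Dropping rows with nulls is only one possible way to handle missing values; imputing them (for example with a column median) would be an alternative.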
- Used histograms, boxplots, and scatterplots to understand feature distribution
- Plotted a heatmap for correlation analysis
- Selected numerical features: `Year`, `Duration`, `Votes` as inputs (X)
- Used `Rating` as the target output (y)
- Created a pipeline with:
  - `StandardScaler` for feature scaling
  - `SGDRegressor` for regression
- Split data into training and testing sets
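
The feature selection, pipeline, and split described above might look like the sketch below. The synthetic data, the 80/20 split, the `random_state` values, and the step names `"scaler"` and `"regressor"` are assumptions for illustration; the project's actual settings are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (random values, not real IMDb data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Year": rng.integers(1950, 2024, size=n),
    "Duration": rng.integers(60, 200, size=n),
    "Votes": rng.integers(5, 50000, size=n),
})
df["Rating"] = rng.uniform(1.0, 10.0, size=n).round(1)

X = df[["Year", "Duration", "Votes"]]  # input features
y = df["Rating"]                       # target

# Assumed 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling runs inside the pipeline, so it is fit on the training data only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", SGDRegressor(random_state=42)),
])
pipeline.fit(X_train, y_train)
print(pipeline.predict(X_test[:3]))
```

Scaling matters here because stochastic gradient descent is sensitive to feature magnitudes, and `Votes` spans a much wider range than `Year` or `Duration`.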
- Evaluated using:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - R² Score
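
To make the three metrics concrete, here is a small worked example on four made-up ratings (the values are illustrative, not project results). It computes each metric with `sklearn.metrics` and again from its definition.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative true and predicted ratings (not taken from the project).
y_true = np.array([6.0, 7.0, 5.0, 8.0])
y_pred = np.array([5.5, 7.5, 5.0, 7.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities written out from their definitions.
errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))
mse_manual = np.mean(errors ** 2)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f} MSE={mse:.3f} R2={r2:.3f}")
# → MAE=0.500 MSE=0.375 R2=0.700
```

Lower MAE and MSE indicate smaller errors, while an R² closer to 1 means the model explains more of the variance in the ratings.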
- Accepted user input for new movie data
- Predicted rating using trained pipeline
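```python
import pandas as pd

# `pipeline` is the fitted StandardScaler + SGDRegressor pipeline described above.
new_input = pd.DataFrame({
    'Year': [2023],
    'Duration': [120],
    'Votes': [10000],
})
# predict() returns an array with one value per input row.
predicted_rating = pipeline.predict(new_input)
print("Predicted Rating:", predicted_rating)
```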