A machine learning project that predicts wine quality using physicochemical properties of Portuguese "Vinho Verde" red wine.
The dataset contains 1,599 samples of red wine with 11 physicochemical features and 1 quality rating:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Quality: score from 0 to 10 (converted to binary: Good ≥7, Bad <7)
- Random Forest
  - Test Accuracy: 92.5%
  - Cross-validation Score: 86.4% (±0.016)
  - Optimized with GridSearchCV
- Logistic Regression
  - Test Accuracy: 89.4%
  - Cross-validation Score: 87.0% (±0.019)
  - Optimized with GridSearchCV
- Decision Tree
  - Test Accuracy: 90.3%
  - Cross-validation Score: 79.8% (±0.060)
- Data Preprocessing: Label binarization for quality classification
- Feature Selection: Top 5 most important features identified
- Hyperparameter Optimization: GridSearchCV for model tuning
- Cross-validation: 5-fold CV for robust evaluation
- Visualization: Correlation heatmaps and feature analysis
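The binarization and GridSearchCV steps listed above can be sketched as follows. This is an illustrative example on synthetic data, not the project's actual code: the column names, parameter grid, and sample size are assumptions, and the real notebook works with the 1,599-sample Vinho Verde dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 11 physicochemical feature columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 11)),
                 columns=[f"feature_{i}" for i in range(11)])

# Label binarization: raw 0-10 quality scores -> Good (>=7) vs Bad (<7).
quality = rng.integers(3, 9, size=200)
y = (quality >= 7).astype(int)

# Hyperparameter optimization with GridSearchCV (grid is illustrative).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```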
- Alcohol (17.7% importance)
- Sulphates (11.7% importance)
- Volatile Acidity (11.0% importance)
- Citric Acid (9.7% importance)
- Density (9.3% importance)
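A ranking like the one above can be obtained from a fitted Random Forest's `feature_importances_` attribute. The sketch below uses synthetic data whose labels are driven mostly by the alcohol column, so alcohol should rank among the top features; the column names mirror the dataset, but the data and model settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ["fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
        "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
        "density", "pH", "sulphates", "alcohol"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 11)), columns=cols)
# Synthetic labels driven mostly by alcohol content.
y = (X["alcohol"] + rng.normal(scale=0.5, size=300) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
# Rank features by importance and keep the top 5.
top5 = pd.Series(rf.feature_importances_, index=cols).nlargest(5)
print(top5)
```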
```bash
# Install dependencies
pip install numpy pandas matplotlib seaborn scikit-learn

# Clone the repository
git clone https://github.com/beater35/ml-classification-wine-quality.git
cd ml-classification-wine-quality

# Load and run the Jupyter notebook
jupyter notebook classification_model_comparison_wine_quality.ipynb
```

```python
# Example prediction with the top 5 features
# Order: [alcohol, sulphates, volatile_acidity, citric_acid, density]
input_data = [10.0, 0.47, 0.65, 0.0, 0.9946]

prediction = final_rf_model.predict([input_data])
if prediction[0] == 1:
    print("Good Quality Wine")
else:
    print("Bad Quality Wine")
```

| Model | Test Accuracy | CV Score (Mean ± Std) |
|---|---|---|
| Random Forest | 92.5% | 86.4% ± 1.6% |
| Logistic Regression | 89.4% | 87.0% ± 1.9% |
| Decision Tree | 90.3% | 79.8% ± 6.0% |
```
wine-quality-classification/
├── classification_model_comparison_wine_quality.ipynb
└── README.md
```
- Data Exploration: Statistical analysis and visualization
- Preprocessing: Binary classification setup (Good/Bad quality)
- Model Training: Three different algorithms tested
- Hyperparameter Tuning: GridSearchCV optimization
- Feature Selection: Importance-based feature ranking
- Model Evaluation: Cross-validation and test accuracy
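The evaluation step above can be sketched with `cross_val_score`, which performs the 5-fold CV and returns one accuracy score per fold. The data here is a synthetic stand-in generated with `make_classification`; the real project evaluates on the wine dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 11-feature dataset as a stand-in for the wine data.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```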
The Random Forest Classifier achieved the best performance with 92.5% test accuracy after hyperparameter optimization and feature selection. The model successfully identifies wine quality based on physicochemical properties, with alcohol content being the most influential factor.
- Dataset: Kaggle – Wine Quality Dataset