This project builds a machine learning pipeline to predict farmer income using tabular data with categorical and numerical features.
The solution leverages data preprocessing, feature engineering, target encoding, skewness correction, imputation, and hyperparameter-tuned XGBoost models with GPU acceleration.
- Handles 50k+ samples with 105 features.
- Automated data preprocessing:
- Target encoding for categorical variables.
- Log transformation for skewed numerical features.
- Missing value imputation with
SimpleImputer.
- Optimized XGBoost using Optuna + K-Fold cross-validation.
- Achieved MAPE = 9.52% (≈25% improvement over baseline).
- CUDA-accelerated training with early stopping.
- Reproducible end-to-end pipeline with feature importance analysis.
- Generates a ready-to-submit
submission.csv.
├── LTF Challenge data with dictionary.xlsx # Input dataset (Train + Test sheets)
├── farmer_income_prediction.py # Main training & inference script
└── README.md # Project documentation
Install the dependencies: pip install pandas numpy scikit-learn category_encoders xgboost scipy openpyxl optuna 📊 Dataset Input file: LTF Challenge data with dictionary.xlsx
Sheets:
TrainData → contains training features + target (Target_Variable/Total Income).
TestData → contains features for inference.
ID column: FarmerID.
Load Train + Test data from Excel.
Concatenate to apply uniform preprocessing.
Encode categorical features with TargetEncoder.
Correct skewness in numerical features using log1p.
Impute missing values with mean strategy.
Train/validate with XGBoost (Optuna-tuned hyperparameters).
Evaluate with Mean Absolute Percentage Error (MAPE).
Predict on test set and save submission.csv.
Run the script:
python farmer_income_prediction.py During training:
Uses train_test_split (80/20).
Trains XGBoost with early stopping (patience = 100).
Prints validation MAPE score.
Validation performance:
MAPE ≈ 9.52% on held-out validation set.
FarmerID,PredictedIncome 10001,15432.78 10002,18293.41 ...
GPU support enabled via tree_method="hist" and device="cuda".
Log transformation is applied to target for stability.
Feature importance can be displayed for explainability.
25% improvement over baseline model.
Scalable to 10k+ unseen records.
End-to-end automated pipeline for reproducible results.