YugVarshney/Farmer-Income-Prediction

Farmer Income Prediction (XGBoost + Optuna + Cross-Validation)

This project builds a machine learning pipeline to predict farmer income using tabular data with categorical and numerical features.
The solution leverages data preprocessing, feature engineering, target encoding, skewness correction, imputation, and hyperparameter-tuned XGBoost models with GPU acceleration.


🚀 Features

  • Handles 50k+ samples with 105 features.
  • Automated data preprocessing:
    • Target encoding for categorical variables.
    • Log transformation for skewed numerical features.
    • Missing value imputation with SimpleImputer.
  • Optimized XGBoost using Optuna + K-Fold cross-validation.
  • Achieved MAPE = 9.52% (≈25% improvement over baseline).
  • CUDA-accelerated training with early stopping.
  • Reproducible end-to-end pipeline with feature importance analysis.
  • Generates a ready-to-submit submission.csv.
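As a rough illustration of the target-encoding step listed above, here is a minimal pandas sketch. It is a hand-rolled stand-in, not the `category_encoders` `TargetEncoder` the project uses (that library applies its own smoothing formula). The function name `target_encode`, the `smoothing` default, and the choice to fit on training rows only are assumptions made for this example.

```python
import pandas as pd


def target_encode(train, test, col, target, smoothing=10.0):
    """Smoothed mean encoding of `col`, fitted on training rows only.

    Hypothetical helper for illustration; the additive smoothing used here
    differs from the formula inside category_encoders.TargetEncoder.
    """
    prior = train[target].mean()  # global mean of the target
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink the per-category mean toward the global mean; categories with
    # few rows are pulled harder toward the prior.
    enc = (stats["mean"] * stats["count"] + prior * smoothing) / (
        stats["count"] + smoothing
    )
    # Missing or unseen categories fall back to the global mean.
    train_enc = train[col].map(enc).fillna(prior)
    test_enc = test[col].map(enc).fillna(prior)
    return train_enc, test_enc
```

Fitting the encoding on training rows only keeps target information from leaking into the test features; the test set has no target to learn from in any case.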

📂 Project Structure

├── LTF Challenge data with dictionary.xlsx   # Input dataset (Train + Test sheets)
├── farmer_income_prediction.py               # Main training & inference script
└── README.md                                 # Project documentation

⚙️ Requirements

Install the dependencies:

pip install pandas numpy scikit-learn category_encoders xgboost scipy openpyxl optuna

📊 Dataset

Input file: LTF Challenge data with dictionary.xlsx

Sheets:

TrainData → contains training features + target (Target_Variable/Total Income).

TestData → contains features for inference.

ID column: FarmerID.
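The sheet layout above could be loaded and stacked roughly as follows. This is a sketch under assumptions: the helper names `load_sheets` and `combine` and the `is_train` marker column are hypothetical, and the script may organise this differently.

```python
import pandas as pd

EXCEL_PATH = "LTF Challenge data with dictionary.xlsx"  # file name from the project structure
ID_COL = "FarmerID"


def load_sheets(path: str = EXCEL_PATH) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Read the TrainData and TestData sheets (requires openpyxl)."""
    sheets = pd.read_excel(path, sheet_name=["TrainData", "TestData"])
    return sheets["TrainData"], sheets["TestData"]


def combine(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Stack train and test rows so preprocessing is applied uniformly.

    The boolean `is_train` column (a hypothetical name) lets the two parts
    be separated again afterwards; the target is NaN for test rows.
    """
    return pd.concat(
        [train.assign(is_train=True), test.assign(is_train=False)],
        ignore_index=True,
        sort=False,
    )
```

After preprocessing, `combined[combined["is_train"]]` recovers the training rows and `combined[~combined["is_train"]]` the test rows.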

🛠️ Pipeline Overview

Load Train + Test data from Excel.

Concatenate to apply uniform preprocessing.

Encode categorical features with TargetEncoder.

Correct skewness in numerical features using log1p.

Impute missing values with mean strategy.

Train/validate with XGBoost (Optuna-tuned hyperparameters).

Evaluate with Mean Absolute Percentage Error (MAPE).

Predict on test set and save submission.csv.
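The skewness-correction and imputation steps above might look something like the sketch below. The skewness cutoff `SKEW_THRESHOLD`, the function name `preprocess_numeric`, and the rule of transforming only non-negative columns are assumptions for illustration; the README does not state which criterion the script uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

SKEW_THRESHOLD = 0.75  # hypothetical cutoff, not taken from the project


def preprocess_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Apply log1p to skewed numeric columns, then mean-impute missing values."""
    out = df.copy()
    skew = out[cols].skew()  # pandas skips NaN when computing skewness
    for c in cols:
        # log1p is only defined for values > -1, so restrict to non-negative columns.
        if abs(skew[c]) > SKEW_THRESHOLD and out[c].min() >= 0:
            out[c] = np.log1p(out[c])
    # Mean imputation, matching the strategy named in the pipeline overview.
    out[cols] = SimpleImputer(strategy="mean").fit_transform(out[cols])
    return out
```

Imputing after the log transform means the fill value is the mean on the transformed scale, which is less affected by extreme values than the raw mean.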

📈 Training

Run the script:

python farmer_income_prediction.py

During training:

Uses train_test_split (80/20).

Trains XGBoost with early stopping (patience = 100).

Prints validation MAPE score.
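For reference, MAPE and a K-fold evaluation loop that could feed an Optuna search can be sketched as below. The helper names (`mape`, `cv_mape`, `objective`), the search ranges, and the fold count are illustrative assumptions, not the project's actual settings, and early stopping is left out of this sketch for brevity.

```python
import numpy as np
from sklearn.model_selection import KFold


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def cv_mape(make_model, X, y, n_splits=5, seed=42):
    """Average validation MAPE of a freshly built model over K folds."""
    X, y = np.asarray(X), np.asarray(y)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X):
        model = make_model()  # new, unfitted model for every fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(mape(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


def objective(trial, X, y):
    """Optuna objective; the search ranges below are illustrative only."""
    from xgboost import XGBRegressor  # lazy import: helpers above work without it

    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "tree_method": "hist",
        "device": "cuda",  # as in the Notes; switch to "cpu" without a GPU
    }
    return cv_mape(lambda: XGBRegressor(**params), X, y)
```

With Optuna installed, a search would be started with `optuna.create_study(direction="minimize")` followed by `study.optimize(lambda t: objective(t, X, y), n_trials=...)`, since a lower MAPE is better.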

📤 Output

Validation performance:

MAPE ≈ 9.52% on held-out validation set.

Sample submission.csv:

FarmerID,PredictedIncome
10001,15432.78
10002,18293.41
...
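A file in this two-column format could be produced along these lines. The helper name `build_submission` and the rounding to two decimals are assumptions based on the sample rows, not confirmed details of the script.

```python
import pandas as pd


def build_submission(farmer_ids, predictions) -> pd.DataFrame:
    """Assemble the FarmerID / PredictedIncome table shown in the sample."""
    return pd.DataFrame(
        {
            "FarmerID": list(farmer_ids),
            # Two decimals, mirroring the sample rows (an assumption).
            "PredictedIncome": pd.Series(list(predictions), dtype=float).round(2),
        }
    )


def save_submission(farmer_ids, predictions, path="submission.csv") -> None:
    """Write the submission without the pandas index column."""
    build_submission(farmer_ids, predictions).to_csv(path, index=False)
```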

📌 Notes

GPU support is enabled via tree_method="hist" and device="cuda" (the device parameter requires XGBoost 2.0 or later).

A log transformation is applied to the target for stability.

Feature importance can be displayed for explainability.
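The log-transformed target mentioned above implies that predictions have to be mapped back to the income scale before MAPE is computed or a submission is written. A minimal sketch of that round trip follows; it assumes `log1p`/`expm1` as the transform pair (the README names `log1p` for features but does not say which log is used for the target), and `fit_predict_log_target` is a hypothetical helper name.

```python
import numpy as np


def fit_predict_log_target(model, X_train, y_train, X_new):
    """Train on log1p(y) and return predictions on the original scale.

    Works with any estimator exposing fit/predict (for example an
    XGBRegressor); the target must be greater than -1 for log1p.
    """
    y_log = np.log1p(np.asarray(y_train, dtype=float))
    model.fit(X_train, y_log)
    # expm1 is the exact inverse of log1p.
    return np.expm1(model.predict(X_new))
```

Evaluating MAPE on the back-transformed predictions keeps the reported error comparable with the income values in the data.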

🏆 Results

≈25% improvement over the baseline model (validation MAPE = 9.52%).

Scalable to 10k+ unseen records.

End-to-end automated pipeline for reproducible results.
