Farmer Income Prediction (XGBoost + Optuna + Cross-Validation)

This project builds a machine learning pipeline to predict farmer income using tabular data with categorical and numerical features.
The solution leverages data preprocessing, feature engineering, target encoding, skewness correction, imputation, and hyperparameter-tuned XGBoost models with GPU acceleration.

🚀 Features

Handles 50k+ samples with 105 features.
Automated data preprocessing:
- Target encoding for categorical variables.
- Log transformation for skewed numerical features.
- Missing value imputation with SimpleImputer.
Optimized XGBoost using Optuna + K-Fold cross-validation.
Achieved MAPE = 9.52% (≈25% improvement over baseline).
CUDA-accelerated training with early stopping.
Reproducible end-to-end pipeline with feature importance analysis.
Generates a ready-to-submit submission.csv.

📂 Project Structure

├── LTF Challenge data with dictionary.xlsx   # Input dataset (Train + Test sheets)
├── farmer_income_prediction.py               # Main training & inference script
└── README.md                                 # Project documentation

⚙️ Requirements

Install the dependencies:

pip install pandas numpy scikit-learn category_encoders xgboost scipy openpyxl optuna

📊 Dataset

Input file: LTF Challenge data with dictionary.xlsx

Sheets:

TrainData → contains training features + target (Target_Variable/Total Income).

TestData → contains features for inference.

ID column: FarmerID.

🛠️ Pipeline Overview

Load Train + Test data from Excel.

Concatenate to apply uniform preprocessing.

Encode categorical features with TargetEncoder.

Correct skewness in numerical features using log1p.

Impute missing values with mean strategy.

Train/validate with XGBoost (Optuna-tuned hyperparameters).

Evaluate with Mean Absolute Percentage Error (MAPE).

Predict on test set and save submission.csv.
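
The preprocessing steps above can be sketched as follows. This is a simplified stand-in, not the script itself: it replaces `category_encoders.TargetEncoder` with a plain per-category target mean so the example needs only pandas and scikit-learn, and the function name `preprocess`, the skewness threshold of 1.0, and the choice to fit the encoder and imputer on training rows only are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def preprocess(train, test, target, id_col, skew_threshold=1.0):
    """Return (X_train, y, X_test) after encoding, log1p, and mean imputation."""
    n_train = len(train)
    y = train[target].reset_index(drop=True)

    # Concatenate so train and test receive identical column handling.
    full = pd.concat([train.drop(columns=[target]), test], ignore_index=True)
    features = full.drop(columns=[id_col])

    cat_cols = [c for c in features.columns
                if not pd.api.types.is_numeric_dtype(features[c])]
    num_cols = [c for c in features.columns if c not in cat_cols]

    # Simplified mean target encoding, learned from training rows only;
    # unseen or missing categories fall back to the global target mean.
    for col in cat_cols:
        means = y.groupby(features[col].iloc[:n_train]).mean()
        features[col] = features[col].map(means).fillna(y.mean()).astype(float)

    # log1p on non-negative, right-skewed numeric columns
    # (skewness is measured on the training rows).
    for col in num_cols:
        train_values = features[col].iloc[:n_train]
        if features[col].min() >= 0 and train_values.skew() > skew_threshold:
            features[col] = np.log1p(features[col])

    # Mean imputation, fitted on training rows only.
    imputer = SimpleImputer(strategy="mean")
    imputer.fit(features.iloc[:n_train])
    X = pd.DataFrame(imputer.transform(features), columns=features.columns)

    return X.iloc[:n_train], y, X.iloc[n_train:].reset_index(drop=True)
```

With the sheets described in the Dataset section, a call might look like `preprocess(train_df, test_df, target="Target_Variable/Total Income", id_col="FarmerID")`. Fitting the encoder on training rows only avoids leaking target information into the test features.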

📈 Training

Run the script:

python farmer_income_prediction.py

During training:

Uses train_test_split (80/20).

Trains XGBoost with early stopping (patience = 100).

Prints validation MAPE score.

📤 Output

Validation performance:

MAPE ≈ 9.52% on held-out validation set.

Example submission.csv contents:

FarmerID,PredictedIncome
10001,15432.78
10002,18293.41
...
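
A file in this two-column layout can be produced with pandas. The sketch below is illustrative only: `build_submission` is a hypothetical helper, and the IDs and predictions are the sample values shown above rather than real model output.

```python
import pandas as pd


def build_submission(farmer_ids, predictions):
    """Pair each FarmerID with its predicted income in submission order."""
    if len(farmer_ids) != len(predictions):
        raise ValueError("farmer_ids and predictions must have the same length")
    return pd.DataFrame(
        {"FarmerID": list(farmer_ids), "PredictedIncome": list(predictions)}
    )


submission = build_submission([10001, 10002], [15432.78, 18293.41])

# index=False keeps the pandas row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write the file to disk instead of a string, call `submission.to_csv("submission.csv", index=False)`.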

📌 Notes

GPU support is enabled via tree_method="hist" and device="cuda" (the device parameter requires XGBoost 2.0 or later).

A log transformation is applied to the target for stability.

Feature importance can be displayed for explainability.
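
Because the model is trained on a log-transformed target, its predictions have to be mapped back to the income scale before MAPE is computed or the submission is written. The sketch below assumes the transform is `log1p` (the README does not say which log variant is used for the target); the helper names are hypothetical.

```python
import numpy as np


def to_log(y):
    """Forward transform applied to the target before training (assumed log1p)."""
    return np.log1p(np.asarray(y, dtype=float))


def from_log(y_log):
    """Inverse transform applied to model predictions."""
    return np.expm1(np.asarray(y_log, dtype=float))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


income = np.array([100.0, 200.0])
predicted_income = np.array([110.0, 180.0])

# Pretend the model produced these values in log space.
predicted_log = to_log(predicted_income)

# Evaluate on the original scale, after inverting the transform.
score = mape(income, from_log(predicted_log))
print(f"MAPE on income scale: {score:.2%}")
```

Computing MAPE directly on the log-space values would understate the error, since percentage differences between logarithms are much smaller than those between the incomes themselves.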

🏆 Results

25% improvement over baseline model.

Scalable to 10k+ unseen records.

End-to-end automated pipeline for reproducible results.
