An end-to-end machine learning project built on Google Vertex AI that forecasts transit ridership revenue (ROI) and predicts mode of transportation from factors such as fare, duration, and weather. This repository covers data ingestion, feature engineering, model training, and deployment pipeline orchestration.
TransportAnalytics/
├── config/ # Configuration files for pipeline, training, or GCP integration
│ └── .placeholder
├── data_ingestion/ # Scripts to fetch and upload datasets to Google Cloud
│ └── download_kaggle_and_upload_gcs.py
├── deployment/ # (Optional) Scripts for batch or real-time model predictions
│ └── .placeholder
├── notebooks/ # Jupyter notebooks for EDA and insights
│ ├── eda_mta_ridership.ipynb
│ └── eda_mode_choice.ipynb
├── pipeline/ # Vertex AI pipeline orchestration scripts
│ └── vertex_pipeline.py
├── preprocessing/ # Feature engineering scripts and processed datasets
│ ├── feature_engineering.py
│ └── merged_feature_data.csv
├── training/ # Model training scripts
│ ├── train_ridership_model.py
│ └── train_mode_classifier.py
├── .gitignore
└── README.md
- Google Cloud Project (your-gcp-projectid)
- Vertex AI API enabled
- BigQuery and Cloud Storage set up
- Service Account with Vertex AI permissions
- Python ≥ 3.8
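Part of this checklist can be verified from Python once dependencies are installed. The following is a minimal sketch, not part of the repository: the helper names `python_ok` and `missing_modules` are hypothetical, and the module list is an assumption based on the packages named in the manual `pip install` command in the setup steps.

```python
import importlib.util
import sys

# Assumed import names for the packages in the manual `pip install` line;
# adjust if requirements.txt lists something different.
REQUIRED_MODULES = ["google.cloud.aiplatform", "kfp", "pandas", "sklearn"]


def python_ok(version=None, minimum=(3, 8)):
    """Return True if the interpreter version meets the README's minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. `google.cloud`) is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.8 or newer is required")
    for name in missing_modules():
        print(f"Missing module: {name}")
```

This only covers the local environment. Whether the Vertex AI API is enabled and the service account has the right permissions still has to be checked in the Google Cloud project itself.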
- Clone the repository:
git clone https://github.com/YOUR_USERNAME/TransportAnalytics.git
cd TransportAnalytics
- Create and activate a virtual environment (required in Cloud Shell or locally):
This ensures isolated package installations and avoids permission issues, especially in Google Cloud Shell.
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
# or manually:
pip install google-cloud-aiplatform kfp pandas scikit-learn
- Download datasets from Kaggle and upload to GCS (required before feature engineering):
export KAGGLE_USERNAME=your_kaggle_username
export KAGGLE_KEY=your_kaggle_key
python data_ingestion/download_kaggle_and_upload_gcs.py
Note: This step is mandatory before running the feature engineering script.
- Run feature engineering:
python preprocessing/feature_engineering.py
- Train models locally (optional):
python training/train_ridership_model.py
python training/train_mode_classifier.py
- Compile and submit Vertex AI pipeline:
python pipeline/vertex_pipeline.py
- Deploy the pipeline using the Python SDK:
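from google.cloud import aiplatform
from google.cloud.aiplatform.pipeline_jobs import PipelineJob

# Point the SDK at your project and region.
aiplatform.init(project="your-gcp-projectid", location="your-gcp-project-location")

pipeline_job = PipelineJob(
    display_name="ridership-forecast-pipeline",
    template_path="vertex_ridership_pipeline.json",  # path to the compiled pipeline spec
    enable_caching=True,  # reuse results of unchanged steps between runs
)

# Submit the job and block until it finishes.
pipeline_job.run()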
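The local steps above have to run in a fixed order, since feature engineering depends on the ingested data. As a rough sketch under assumptions, they could be chained from one script. The script paths come from the repository layout, but `run_local_steps` and its `runner` parameter are hypothetical helpers that do not exist in the repo.

```python
import os
import subprocess
import sys

# Script paths as listed in the repository layout, in the order the steps require.
STEPS = [
    "data_ingestion/download_kaggle_and_upload_gcs.py",
    "preprocessing/feature_engineering.py",
    "training/train_ridership_model.py",
    "training/train_mode_classifier.py",
    "pipeline/vertex_pipeline.py",
]


def run_local_steps(steps=STEPS, env=None, runner=subprocess.run):
    """Run each script in order, stopping at the first failure.

    `runner` is injectable so the ordering can be checked without
    actually invoking the scripts (a hypothetical convenience).
    """
    env = dict(os.environ if env is None else env)
    # The setup steps export these before ingestion, so fail early if absent.
    for var in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
        if not env.get(var):
            raise RuntimeError(f"{var} must be set before data ingestion")
    for script in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner([sys.executable, script], check=True, env=env)
```

Calling `run_local_steps()` from the repository root inside the activated virtual environment would run all five scripts. Passing a shorter `steps` list skips the optional local training.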
- Datasets Used:
- ML Models:
  - RandomForestRegressor: Predict transit ridership revenue
  - RandomForestClassifier: Predict mode of transport based on influencing factors
- Feature Engineering Highlights:
  - Encoding categorical variables like weather and mode
  - Engineering features like fare_per_minute
  - Merging datasets on common temporal dimensions
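To make the feature engineering highlights concrete, here is a small standard-library sketch of the three ideas: one-hot encoding a categorical column, deriving fare_per_minute, and joining two datasets on a shared date key. The column names (`date`, `fare`, `duration_min`, `weather`, `ridership`), the function names, and the sample rows are all invented for illustration. The repository's feature_engineering.py may do this differently and may well use pandas instead.

```python
def one_hot(rows, column):
    """Replace `column` with 0/1 indicator fields, one per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


def add_fare_per_minute(rows):
    """Add fare / duration; trips with non-positive duration get None."""
    out = []
    for row in rows:
        duration = row["duration_min"]
        rate = row["fare"] / duration if duration > 0 else None
        out.append({**row, "fare_per_minute": rate})
    return out


def merge_on_date(trips, daily):
    """Inner-join trip rows with daily rows that share the same `date`."""
    by_date = {row["date"]: row for row in daily}
    return [
        {**trip, **by_date[trip["date"]]}
        for trip in trips
        if trip["date"] in by_date
    ]


# Illustrative data only; not taken from the project's datasets.
trips = [
    {"date": "2024-01-01", "fare": 3.0, "duration_min": 20, "weather": "rain"},
    {"date": "2024-01-02", "fare": 2.5, "duration_min": 10, "weather": "clear"},
]
daily = [{"date": "2024-01-01", "ridership": 1000}]

features = merge_on_date(add_fare_per_minute(one_hot(trips, "weather")), daily)
print(features)
```

The join here is an inner join, so trips without a matching daily row are dropped. Whether the real pipeline should keep them (a left join) depends on how the two datasets overlap in time.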
- Cloud function or batch endpoint for scoring
- BigQuery ML support
- Looker Studio dashboard for visualization
Feel free to fork this repo, submit pull requests, or create issues. Contributions are welcome!