Healthcare Prediction

Author :

Samar Krimi

Business Problem :

The goal is to predict whether a patient is likely to have a stroke. Stroke is very hard to predict, and therefore hard to prevent, because it results from many different pathophysiologies.

Source of data :

healthcare-dataset-stroke-data.csv, available on Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Attribute Information :

id: unique identifier
gender: "Male", "Female" or "Other"
age: age of the patient
hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
ever_married: "No" or "Yes"
work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
Residence_type: "Rural" or "Urban"
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" (Note: "Unknown" means that the information is unavailable for this patient)
stroke: 1 if the patient had a stroke or 0 if not

Data Description :

This healthcare dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient. This is a binary classification problem with 2 possible classes for the target (stroke): 1 if the patient had a stroke, 0 if not. These classes are highly imbalanced. The data contains 12 attributes (columns) and 5110 observations (rows).
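The class imbalance mentioned above can be checked directly from the CSV. The following is a minimal sketch using only the standard library; the helper name `count_classes` and the three inline rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def count_classes(csv_file, target="stroke"):
    """Return (number of rows, Counter of target values) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[target] for row in reader)
    return sum(counts.values()), counts


# Made-up rows for illustration only (NOT real records from the dataset).
sample = io.StringIO("id,age,stroke\n1,67,1\n2,34,0\n3,51,0\n4,45,0\n")

n_rows, counts = count_classes(sample)
# Share of each class; values are strings because csv reads text.
shares = {label: count / n_rows for label, count in counts.items()}
print(n_rows, shares)
```

With the real file, the same helper could be called on `open("healthcare-dataset-stroke-data.csv", newline="")`.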

Exploratory Data Analysis :

Numeric Feature Inspection :

Observations :

For bmi Feature vs. Target :

I would expect this feature to be a predictor of the target: knowing a patient's body mass index matters because obesity is linked to certain diseases.

This feature doesn't appear to be a predictor of the target because its correlation with the target is very low; body mass index alone is not very informative for determining whether a patient will have a stroke.

For age Feature vs. Target :

I would expect this feature to be a predictor of the target: I think stroke risk increases with age, and patients older than 45 are the most likely to have a stroke.

This feature doesn't appear to be a predictor of the target because its correlation with the target is low.

For avg_glucose_level Feature vs. Target :

I would expect this feature to be a predictor of the target: knowing a patient's average blood glucose level matters for avoiding diabetes, a serious chronic disease; the key range for average blood glucose is between 60 and 100.

This feature doesn't appear to be a predictor of the target because its correlation with the target is very low.
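The correlation between a numeric feature and a 0/1 target is the Pearson correlation (also called the point-biserial correlation in this setting). Below is a minimal standard-library sketch of that computation; the `ages` and `stroke` values are toy numbers invented for illustration, not results from the notebooks. Note that with a rare positive class this coefficient can stay small even for a feature that carries signal, so it is worth reading alongside the plots.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (NOT taken from the dataset).
ages = [25, 40, 52, 61, 67, 74, 80, 33]
stroke = [0, 0, 0, 0, 1, 0, 1, 0]

r = pearson_r(ages, stroke)
print(round(r, 3))
```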

Categorical Feature Inspection :

Observations :

gender {Male, Female} : Strokes affect male patients more than female patients.

hypertension {0 if the patient doesn't have hypertension, 1 if the patient has hypertension} : Patients without hypertension are much more likely to avoid stroke.

heart_disease {0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease} : Patients without cardiovascular disease are more likely to avoid stroke.
ever_married {No or Yes} : Patients who have never been married are less likely to have a stroke.
work_type {children, Govt_job, Never_worked, Private or Self-employed} : patients who Never_worked are undiagnosed, children don't have strokes, and patients with Private jobs are more likely to have a stroke than Self-employed patients or patients with government jobs, maybe because they are more stressed by their work schedules.
Residence_type {Rural, Urban} : Urban life is associated with stroke more than rural life.
smoking_status {formerly smoked, never smoked, smokes or Unknown} : patients who have never smoked are more likely to avoid stroke, although in some cases related to quality of life they may still have one.
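Because the categories have very different sizes, observations like the ones above are easier to compare as stroke rates per category than as raw counts. A minimal sketch of that grouping follows; the function name `stroke_rate_by_category` and the six toy records are assumptions made for illustration.

```python
from collections import defaultdict


def stroke_rate_by_category(categories, targets):
    """Return {category: share of rows with target == 1} for paired sequences."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += 1
        positives[category] += target
    return {category: positives[category] / totals[category] for category in totals}


# Toy records for illustration only (NOT taken from the dataset).
smoking_status = ["smokes", "never smoked", "never smoked", "Unknown", "smokes", "never smoked"]
stroke = [1, 0, 0, 0, 0, 1]

rates = stroke_rate_by_category(smoking_status, stroke)
print(rates)
```

The same function applies unchanged to any of the categorical columns listed above, such as work_type or Residence_type.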

Model Development

We will evaluate 4 types of models on train and test data with a classification report, a normalized confusion matrix, and ROC curves :

2 Sequential Models : LGBMClassifier, XGBClassifier

2 Ensemble Sequential Models : ensemble.AdaBoostClassifier, ensemble.GradientBoostingClassifier

2 Linear Models : linear_model.LogisticRegression, linear_model.SGDClassifier

2 Ensemble Parallel Models : ensemble.BaggingClassifier, ensemble.RandomForestClassifier

--> for LGBMClassifier & XGBClassifier, I will evaluate the default models without any additional regularization.

--> for AdaBoostClassifier & GradientBoostingClassifier, I will tune some hyperparameters with RandomizedSearchCV.

--> for LogisticRegression & SGDClassifier, I will use class weights to set the relative importance of each class, using class_weight='balanced'.
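For reference, scikit-learn documents the 'balanced' heuristic as `n_samples / (n_classes * count_of_class)`, so the rare class gets a proportionally larger weight. The sketch below reproduces that formula in plain Python; the 95/5 label split is a made-up example, not the class counts of the stroke dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up imbalanced labels for illustration: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5

weights = balanced_class_weights(labels)
print(weights)  # the minority class (1) receives the larger weight
```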

Best Model

LGBMClassifier & XGBClassifier are the best predictive models for the stroke target. We choose LGBMClassifier, with FN=0.94 (false negatives being the most problematic error), AUC=0.82 on the test set, and a perfect AUC=1 on the train set. The gap between train and test AUC suggests that the model overfits the training data.
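To make the two metrics concrete, here is a minimal standard-library sketch of a row-normalized confusion matrix and of ROC AUC computed as the share of (positive, negative) pairs ranked correctly. The labels, scores, and the 0.5 threshold are toy assumptions, not outputs of the models above, and the code assumes both classes are present.

```python
def normalized_confusion_matrix(y_true, y_pred):
    """Rows are true classes (0, 1), columns are predicted classes; each row sums to 1."""
    counts = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[true_label][predicted_label] += 1
    return [[cell / sum(row) for cell in row] for row in counts]


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

cm = normalized_confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
false_negative_rate = cm[1][0]  # true class 1 predicted as 0
print(cm, auc, false_negative_rate)
```

In the row-normalized matrix, the false-negative rate and the recall for the stroke class sit in the same row and sum to 1.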

Recommendations :

My "production" model is LGBMClassifier, which will be tuned with RandomizedSearchCV and regularized with L1/L2 penalties (Lasso, Ridge, or ElasticNet style) in order to improve its classification performance.

Strokes affect male patients more than female patients. Patients should check their blood pressure frequently, get tested regularly and treated early for cardiovascular disease, stop smoking, avoid conflicts between spouses and stressful jobs, spend more time in rural settings, and maintain a healthy quality of life.
