Samar Krimi
A healthcare prediction project to predict whether a patient is likely to have a stroke. Stroke is hard to predict, and therefore hard to prevent, because it results from many different pathophysiologies.
Dataset: healthcare-dataset-stroke-data.csv, available on Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
- id: unique identifier
- gender: "Male", "Female" or "Other"
- age: age of the patient
- hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- ever_married: "No" or "Yes"
- work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" (note: "Unknown" means that the information is unavailable for this patient)
- stroke: 1 if the patient had a stroke or 0 if not
This is a healthcare dataset used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient. This is a binary classification problem: the target, stroke, has 2 possible classes, 1 if the patient had a stroke or 0 if not. These classes are highly imbalanced. The data contains 12 attributes (columns) and 5110 observations (rows); each row represents a specific patient.
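Because the classes are highly imbalanced, it helps to check the class proportions and to keep them identical in the train and test splits. The following is a minimal sketch, assuming a stratified split with scikit-learn; the labels are synthetic (a 5% positive rate chosen for illustration, not the real dataset's figure) and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the `stroke` column: 50 positives out of 1000
# (an assumed rate for illustration, not the real dataset's figure).
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(len(y), 3))  # placeholder features

# Share of positive cases before splitting.
positive_rate = y.mean()

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"overall positive rate: {positive_rate:.3f}")
print(f"train positive rate:   {y_train.mean():.3f}")
print(f"test positive rate:    {y_test.mean():.3f}")
```

Without `stratify`, a random split of a rare class can leave the test set with too few positive cases to evaluate on.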
- bmi vs. target:
Expectation: I would expect this feature to be a predictor of the target, since knowing a patient's body mass index matters for avoiding certain obesity-related diseases.
Observation: this feature does not appear to be a predictor of the target because its correlation with the target is very low; a diagnosis based on body mass index is not very relevant for determining whether the patient will have a stroke.
- age vs. target:
Expectation: I would expect this feature to be a predictor of the target, since stroke risk increases with age and patients older than 45 are the most likely to have a stroke.
Observation: this feature does not appear to be a predictor of the target because its correlation with the target is low.
- avg_glucose_level vs. target:
Expectation: I would expect this feature to be a predictor of the target, since knowing a patient's average blood glucose level matters for avoiding diabetes, a serious chronic disease; the most important range for average blood glucose is between 60 and 100.
Observation: this feature does not appear to be a predictor of the target because its correlation with the target is very low.
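The correlation check behind these observations can be sketched as follows. This is an illustration on synthetic data, not the notebook's actual computation: the column names follow the dataset, but the values and the way the target depends on age are assumptions made up for the example. Since Pearson correlation only measures linear association, the sketch also prints the stroke rate per age band (the 45-year cut-off mirrors the expectation stated above).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; values are invented for illustration only.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.uniform(1, 90, n),
    "avg_glucose_level": rng.normal(105, 40, n),
    "bmi": rng.normal(28, 7, n),
})
# Assumed relationship: stroke probability grows slowly with age.
df["stroke"] = (rng.uniform(size=n) < df["age"] / 900).astype(int)

# Pearson correlation of each numeric feature with the binary target.
numeric_cols = ["age", "avg_glucose_level", "bmi"]
corr = df[numeric_cols].corrwith(df["stroke"])
print(corr.sort_values(ascending=False))

# Complementary view: stroke rate below and above 45 years.
age_band = pd.cut(df["age"], bins=[0, 45, 120])
rate_by_band = df.groupby(age_band, observed=True)["stroke"].mean()
print(rate_by_band)
```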
- gender {Male, Female}: stroke affects male patients more than female patients.
- hypertension {0 if the patient doesn't have hypertension, 1 if the patient has hypertension}: a patient without hypertension has a much better chance of avoiding stroke.
- heart_disease {0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease}: a patient without cardiovascular disease is more likely to avoid stroke.
- ever_married {No or Yes}: patients who have never been married are less likely to have a stroke.
- work_type {children, Govt_job, Never_worked, Private or Self-employed}: patients who Never_worked have no stroke diagnosis, children do not have strokes, and patients with private jobs are more likely to have a stroke than self-employed patients or patients with government jobs, maybe because they are more stressed by their work schedules.
- Residence_type {Rural, Urban}: urban life is associated with more strokes than rural life.
- smoking_status {formerly smoked, never smoked, smokes or Unknown}: patients who have never smoked are more likely to be spared from stroke, although in some cases related to quality of life they may still have one.
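Observations like these for the categorical features can be read from the stroke rate per category. Here is a minimal sketch with pandas; the six-row frame is invented for illustration, and `stroke_rate_by` is a hypothetical helper, not a function from the original notebook.

```python
import pandas as pd

# Tiny invented frame; only the column names follow the dataset.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "hypertension": [1, 0, 0, 1, 0, 0],
    "stroke": [1, 0, 0, 1, 0, 0],
})


def stroke_rate_by(frame, column):
    """Return the patient count and stroke rate for each category of `column`."""
    return (
        frame.groupby(column)["stroke"]
        .agg(patients="count", stroke_rate="mean")
        .sort_values("stroke_rate", ascending=False)
    )


for col in ["gender", "hypertension"]:
    print(stroke_rate_by(df, col), end="\n\n")
```

Keeping the patient count next to the rate makes it easier to see when a category is too small for its rate to be meaningful.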
We will evaluate 4 types of models on train and test data with a classification report, a normalized confusion matrix, and ROC curves:
- 2 sequential models: LGBMClassifier, XGBClassifier
- 2 ensemble sequential models: ensemble.AdaBoostClassifier, ensemble.GradientBoostingClassifier
- 2 linear models: linear_model.LogisticRegression, linear_model.SGDClassifier
- 2 ensemble parallel models: ensemble.BaggingClassifier, ensemble.RandomForestClassifier
--> for LGBMClassifier & XGBClassifier, I will evaluate the default models without any regularization.
--> for AdaBoostClassifier & GradientBoostingClassifier, I will tune some hyperparameters with RandomizedSearchCV.
--> for LogisticRegression & SGDClassifier, I will use class weights to set the relative importance of each class, using class_weight='balanced'.
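To make the effect of class_weight='balanced' concrete, the sketch below computes the weights scikit-learn assigns, using its documented formula n_samples / (n_classes * count of each class). The label vector is synthetic (95 negatives, 5 positives) and was chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rare class gets the larger weight, so each misclassified positive case costs more during training than a misclassified negative one.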
LGBMClassifier & XGBClassifier are the best predictive models for the stroke target. We choose LGBMClassifier because it scores FN=0.94 on false negatives (the most problematic error), with AUC=0.82 on the test set and a perfect AUC=1 on the train set. The gap between train and test AUC suggests some overfitting.
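The metrics quoted here (classification report, normalized confusion matrix, train and test AUC) can be produced along the following lines. This is a sketch under assumptions: the data comes from `make_classification` rather than the stroke dataset, LogisticRegression stands in for the models compared, and `evaluate` is a hypothetical helper, so the numbers it prints are unrelated to the results reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives), for illustration only.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)


def evaluate(fitted_model, X_part, y_part, label):
    """Print the report and return (row-normalized confusion matrix, ROC AUC)."""
    pred = fitted_model.predict(X_part)
    proba = fitted_model.predict_proba(X_part)[:, 1]
    # normalize="true" divides each row by the number of true cases in it,
    # so cm[1, 0] is the false negative rate and cm[1, 1] is the recall.
    cm = confusion_matrix(y_part, pred, normalize="true")
    auc = roc_auc_score(y_part, proba)
    print(f"--- {label} ---")
    print(classification_report(y_part, pred, digits=3))
    print("normalized confusion matrix:")
    print(cm.round(2))
    print(f"ROC AUC: {auc:.3f}")
    return cm, auc


cm_train, auc_train = evaluate(model, X_train, y_train, "train")
cm_test, auc_test = evaluate(model, X_test, y_test, "test")
```

Comparing `auc_train` with `auc_test` is a quick way to spot a model that fits the training data much better than unseen data.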
My "production" model is LGBMClassifier, which will be tuned with RandomizedSearchCV and regularized with Ridge (L2), Lasso (L1), or ElasticNet (both) penalties in order to improve the model's classification performance.
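A randomized hyperparameter search of this kind could look like the sketch below. To avoid an extra dependency it uses scikit-learn's GradientBoostingClassifier as a stand-in rather than LGBMClassifier, on synthetic data; the search ranges, `n_iter`, and the fold count are assumptions picked to keep the example fast. With LGBMClassifier, the L1 and L2 penalties would be searched through its `reg_alpha` and `reg_lambda` parameters.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Assumed search ranges; uniform(loc, scale) samples from [loc, loc + scale].
param_distributions = {
    "n_estimators": randint(50, 200),
    "learning_rate": uniform(0.01, 0.2),
    "max_depth": randint(2, 5),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of sampled configurations
    scoring="roc_auc",  # same metric as the AUC used for model comparison
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated AUC: {search.best_score_:.3f}")
```

Stratified folds keep the rare positive class represented in every validation fold, which matters for an imbalanced target.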
- Stroke affects male patients more than female patients. To lower their risk, patients should check their blood pressure frequently, get treated early if they have cardiovascular disease or otherwise get tested regularly, stop smoking, avoid conflicts between spouses and stressful jobs, spend more time in rural settings, and maintain a healthy quality of life.




