Conversation
This PR addresses tasks (TODO 1-4):
- Add implementation for loading the Titanic dataset, cleaning irrelevant columns, renaming `Pclass`, and creating a stratified train-test split.
- Implement a naive baseline model for accuracy comparison.
- Update environment dependencies by removing the `pandasgui` package, which is incompatible with Windows.
- Configure the project for `uv` package management.
Code Review
This pull request adds several new sections to the machine learning course, including anomaly detection methods, hyperparameter tuning with GridSearchCV, and ensemble techniques like Voting and Stacking. It also introduces a Titanic dataset analysis script and removes the pandasgui dependency. The review feedback identifies a data leakage issue in the LocalOutlierFactor implementation, suggesting the use of novelty=True for proper test set evaluation. Other recommendations include moving imports to the top level to comply with PEP 8, simplifying model selection logic using a dictionary, and adopting a majority-class baseline for more stable performance benchmarking.
```python
# Define the anomaly detection methods
methods = {
    'IsolationForest': ensemble.IsolationForest(contamination=0.1, random_state=42),
    'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20),
```
To use LocalOutlierFactor for detecting anomalies on new data (the test set), you should set novelty=True. This allows you to fit the model on the training data and then call predict on new, unseen data. The current approach of using fit_predict on the test set (in lines 317 and 325) leads to data leakage.
Suggested change:
```diff
- 'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20),
+ 'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20, novelty=True),
```
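To illustrate the suggested fix, here is a minimal self-contained sketch (on synthetic data, not the PR's Titanic features) of how `novelty=True` separates fitting from prediction:

```python
# Minimal sketch with synthetic data (not the PR's features): novelty=True
# lets LOF be fitted once on training data and then score unseen points,
# avoiding the leakage of calling fit_predict on the test set.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train = rng.normal(0, 1, size=(200, 2))            # inlier cluster
X_test = np.vstack([rng.normal(0, 1, size=(5, 2)),   # more inliers
                    [[8.0, 8.0]]])                   # an obvious outlier

lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)               # fit on training data only
y_pred = lof.predict(X_test)   # +1 = inlier, -1 = outlier
```

Without `novelty=True`, calling `predict` on a fitted `LocalOutlierFactor` raises an error, which is why the original code resorted to `fit_predict` on the test set.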
```python
if method_name == 'LocalOutlierFactor':
    method.fit(X_train_features)
    y_pred_train = method.fit_predict(X_train_features)
    y_pred_test = method.fit_predict(X_test_features)
else:
    method.fit(X_train_features)
    y_pred_train = method.predict(X_train_features)
    y_pred_test = method.predict(X_test_features)
```
```python
# Predict on mesh
if method_name == 'LocalOutlierFactor':
    Z = method.fit_predict(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = method.predict(np.c_[xx.ravel(), yy.ravel()])
```
After setting novelty=True for LocalOutlierFactor (as suggested for line 291), this block can be greatly simplified. The special if/else branch for LocalOutlierFactor is no longer necessary, and it introduces data leakage by re-fitting the model on evaluation data. A single, unified approach works for all models. Additionally, the y_pred_train variable is unused.
```python
method.fit(X_train_features)
y_pred_test = method.predict(X_test_features)
# Predict on mesh
Z = method.predict(np.c_[xx.ravel(), yy.ravel()])
```
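A runnable sketch of this unified path, using IsolationForest on synthetic data (variable names mirror the snippet above but the data is illustrative):

```python
# Sketch: once every detector exposes predict() (LOF via novelty=True),
# a single code path handles the test set and the plotting mesh alike.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train_features = rng.normal(0, 1, size=(100, 2))

method = IsolationForest(contamination=0.1, random_state=42)
method.fit(X_train_features)

# Build a mesh over the feature plane and classify every grid point
xx, yy = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
Z = method.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

`Z` can then be passed straight to a contour plot; no per-model branching is needed.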
```python
print_function_name()

from sklearn import covariance
```
According to the PEP 8 style guide, all imports should be at the top of the file. This improves readability and makes it easier to see the script's dependencies at a glance. I've noticed several imports inside functions (e.g., in todo_6, todo_7, todo_10). Please consider moving them to the top-level of the module.
References
- "Imports should usually be on separate lines. Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants." (PEP 8)
```python
if best_model_name == 'Voting (Soft)':
    best_model = voting_soft
elif best_model_name == 'Voting (Hard)':
    best_model = voting_hard
elif best_model_name == 'Stacking':
    best_model = stacking_clf
elif best_model_name == 'SVM':
    best_model = clf_svm
elif best_model_name == 'Decision Tree':
    best_model = clf_dt
elif best_model_name == 'KNN':
    best_model = clf_knn
else:  # Logistic Regression
    best_model = clf_lr
```
This if/elif chain can be simplified by using a dictionary to map model names to model objects. This makes the code more concise and easier to maintain if you add more models.
```python
model_map = {
    'Voting (Soft)': voting_soft,
    'Voting (Hard)': voting_hard,
    'Stacking': stacking_clf,
    'SVM': clf_svm,
    'Decision Tree': clf_dt,
    'KNN': clf_knn,
    'Logistic Regression': clf_lr,
}
best_model = model_map[best_model_name]
```

```python
def todo_4(y_test):
    print("\n--- TODO 4 ---")
    # Naive baseline: random guess based on random probability
    y_pred_random = np.random.choice([0, 1], size=len(y_test))
    accuracy = metrics.accuracy_score(y_test, y_pred_random)
    print(f"Random baseline accuracy: {accuracy:.4f}")
```
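One caveat on the dictionary-dispatch suggestion: plain indexing raises KeyError for any name outside the map, whereas the original else branch silently fell back to Logistic Regression. A hypothetical sketch (strings stand in for the PR's fitted classifier objects) showing how `dict.get` preserves that fallback:

```python
# Hypothetical sketch of the dictionary dispatch; strings stand in for the
# PR's fitted classifier objects (clf_svm, clf_lr, ...).
clf_svm = "svm-model"
clf_lr = "logreg-model"

model_map = {
    'SVM': clf_svm,
    'Logistic Regression': clf_lr,
}

# model_map[name] raises KeyError for unknown names; .get() keeps the
# original else-branch behaviour of defaulting to Logistic Regression.
best_model = model_map.get('SVM', clf_lr)
fallback = model_map.get('Gradient Boosting', clf_lr)
```

Whether a loud KeyError or a silent fallback is preferable depends on whether an unknown model name should ever reach this point.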
A random baseline's performance varies with each run, which isn't ideal for a stable benchmark. A more robust and standard naive baseline is to always predict the majority class from the training data. This provides a deterministic score to compare against.
Note: You'll need to update the function signature to accept y_train and update the call in main to todo_4(y_train, y_test).
Suggested change:
```diff
- def todo_4(y_test):
-     print("\n--- TODO 4 ---")
-     # Naive baseline: random guess based on random probability
-     y_pred_random = np.random.choice([0, 1], size=len(y_test))
-     accuracy = metrics.accuracy_score(y_test, y_pred_random)
-     print(f"Random baseline accuracy: {accuracy:.4f}")
+ def todo_4(y_train, y_test):
+     print("\n--- TODO 4 ---")
+     # Naive baseline: always predict the most frequent class from the training set
+     most_frequent_class = y_train.mode()[0]
+     y_pred_baseline = np.full_like(y_test, fill_value=most_frequent_class)
+     accuracy = metrics.accuracy_score(y_test, y_pred_baseline)
+     print(f"Majority class baseline accuracy: {accuracy:.4f}")
```
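Worth noting: scikit-learn ships this exact baseline as `DummyClassifier`, which avoids hand-rolling `mode()`/`np.full_like`. A minimal sketch on toy arrays (not the PR's Titanic split):

```python
# Sketch: DummyClassifier(strategy='most_frequent') implements the
# majority-class baseline; features are ignored, only y_train matters.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

y_train = np.array([0, 0, 0, 1, 1])    # majority class is 0
y_test = np.array([0, 1, 0, 0])
X_train = np.zeros((len(y_train), 1))  # dummy features, unused by the model
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)                  # predicts 0 everywhere
accuracy = metrics.accuracy_score(y_test, y_pred)  # 3 of 4 correct -> 0.75
```

Like the suggested rewrite, this is deterministic, so the benchmark score is stable across runs.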
Description
This PR addresses TODO tasks 1 through 4 for Lab 05, establishing a functional baseline for the Titanic classification experiment.
Changes
- Removed irrelevant columns (`boat`, `body`) and non-essential ones (`home.dest`).
- Renamed `Pclass` to `TicketClass` for clarity.
- Updated `pyproject.toml` to transition to `uv` for dependency management.
- Removed `pandasgui` from dependencies to resolve installation issues on Windows.

Verification
Ran the script with `uv run` to confirm successful data processing and baseline accuracy calculation.

/gemini review