
Feature/gemini cli tests#3

Open
Michal-Fularz wants to merge 3 commits into main from feature/gemini-cli-tests

Conversation

@Michal-Fularz (Member)

Description

This PR addresses TODO tasks 1 through 4 for Lab 05, establishing a functional baseline for the Titanic classification experiment.

Changes

  • Lab 05 Implementation:
    • Added data loading and inspection logic.
    • Dropped leakage-prone columns (boat, body) and non-essential ones (home.dest).
    • Renamed Pclass to TicketClass for clarity.
    • Implemented a stratified train-test split (90/10 ratio).
    • Added a naive baseline classifier for evaluating future models.
  • Environment & Build:
    • Updated pyproject.toml to transition to uv for dependency management.
    • Removed pandasgui from dependencies to resolve installation issues on Windows.
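
The preprocessing steps listed above can be sketched roughly as follows (a minimal, hedged example: the inline DataFrame is a made-up stand-in for the real Titanic CSV, and `Survived` is assumed to be the label column):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the Titanic data; column names follow the PR, values are made up
df = pd.DataFrame({
    "Pclass":    [1, 3, 3, 2, 1, 3, 2, 3, 1, 3] * 2,
    "boat":      ["5", None, None, "9", "3", None, None, None, "7", None] * 2,
    "body":      [None] * 20,
    "home.dest": ["New York"] * 20,
    "Survived":  [1, 0, 0, 1, 1, 0, 0, 0, 1, 0] * 2,
})

# Drop leakage-prone columns (boat/body record rescue outcomes) plus home.dest
df = df.drop(columns=["boat", "body", "home.dest"])

# Rename Pclass to TicketClass for clarity
df = df.rename(columns={"Pclass": "TicketClass"})

# Stratified 90/10 split keeps the survival ratio the same in both halves
X, y = df.drop(columns=["Survived"]), df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 18 2
```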

Verification

  • Ran the updated script via uv run to confirm successful data processing and baseline accuracy calculation.

/gemini review

Lab 05 tasks (TODO 1-4):
- Add implementation for loading the Titanic dataset, cleaning irrelevant columns, renaming `Pclass`, and creating a stratified train-test split.
- Implement a naive baseline model for accuracy comparison.
- Remove the `pandasgui` package from the dependencies for Windows compatibility.
- Configure the project for `uv` package management.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds several new sections to the machine learning course, including anomaly detection methods, hyperparameter tuning with GridSearchCV, and ensemble techniques like Voting and Stacking. It also introduces a Titanic dataset analysis script and removes the pandasgui dependency. The review feedback identifies a data leakage issue in the LocalOutlierFactor implementation, suggesting the use of novelty=True for proper test set evaluation. Other recommendations include moving imports to the top level to comply with PEP 8, simplifying model selection logic using a dictionary, and adopting a majority-class baseline for more stable performance benchmarking.

# Define the anomaly detection methods
methods = {
    'IsolationForest': ensemble.IsolationForest(contamination=0.1, random_state=42),
    'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20),

high

To use LocalOutlierFactor for detecting anomalies on new data (the test set), you should set novelty=True. This allows you to fit the model on the training data and then call predict on new, unseen data. The current approach of using fit_predict on the test set (in lines 317 and 325) leads to data leakage.

Suggested change
- 'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20),
+ 'LocalOutlierFactor': neighbors.LocalOutlierFactor(n_neighbors=20, novelty=True),
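
To make the reviewer's point concrete, here is a minimal, self-contained sketch of `LocalOutlierFactor` with `novelty=True` (the synthetic data and variable names are invented for the demo, not taken from the lab script):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # inliers only
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 2)),         # unseen inliers
    rng.normal(6.0, 1.0, size=(5, 2)),          # obvious outliers, far from the cluster
])

# novelty=True: fit on the training data once, then score unseen points with predict()
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train)

y_pred = lof.predict(X_test)      # +1 = inlier, -1 = outlier; no refit on test data
print((y_pred[-5:] == -1).all())  # True: the shifted cluster is flagged
```

Note that with `novelty=True`, calling `fit_predict` is disallowed by scikit-learn, which removes the temptation to refit on the test set in the first place.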

Comment on lines +314 to +327
if method_name == 'LocalOutlierFactor':
    method.fit(X_train_features)
    y_pred_train = method.fit_predict(X_train_features)
    y_pred_test = method.fit_predict(X_test_features)
else:
    method.fit(X_train_features)
    y_pred_train = method.predict(X_train_features)
    y_pred_test = method.predict(X_test_features)

# Predict on mesh
if method_name == 'LocalOutlierFactor':
    Z = method.fit_predict(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = method.predict(np.c_[xx.ravel(), yy.ravel()])

high

After setting novelty=True for LocalOutlierFactor (as suggested for line 291), this block can be greatly simplified. The special if/else logic for LocalOutlierFactor is no longer necessary and contains data leakage issues by fitting on the test set. You can use a single, unified approach for all models. Additionally, the y_pred_train variable is unused.

        method.fit(X_train_features)
        y_pred_test = method.predict(X_test_features)

        # Predict on mesh
        Z = method.predict(np.c_[xx.ravel(), yy.ravel()])


print_function_name()

from sklearn import covariance

medium

According to the PEP 8 style guide, all imports should be at the top of the file. This improves readability and makes it easier to see the script's dependencies at a glance. I've noticed several imports inside functions (e.g., in todo_6, todo_7, todo_10). Please consider moving them to the top-level of the module.

References
  1. Imports should usually be on separate lines. Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. (PEP 8)

Comment on lines +773 to +786
if best_model_name == 'Voting (Soft)':
    best_model = voting_soft
elif best_model_name == 'Voting (Hard)':
    best_model = voting_hard
elif best_model_name == 'Stacking':
    best_model = stacking_clf
elif best_model_name == 'SVM':
    best_model = clf_svm
elif best_model_name == 'Decision Tree':
    best_model = clf_dt
elif best_model_name == 'KNN':
    best_model = clf_knn
else:  # Logistic Regression
    best_model = clf_lr

medium

This if/elif chain can be simplified by using a dictionary to map model names to model objects. This makes the code more concise and easier to maintain if you add more models.

    model_map = {
        'Voting (Soft)': voting_soft,
        'Voting (Hard)': voting_hard,
        'Stacking': stacking_clf,
        'SVM': clf_svm,
        'Decision Tree': clf_dt,
        'KNN': clf_knn,
        'Logistic Regression': clf_lr,
    }
    best_model = model_map[best_model_name]

Comment on lines +50 to +55
def todo_4(y_test):
    print("\n--- TODO 4 ---")
    # Naive baseline: random guess based on random probability
    y_pred_random = np.random.choice([0, 1], size=len(y_test))
    accuracy = metrics.accuracy_score(y_test, y_pred_random)
    print(f"Random baseline accuracy: {accuracy:.4f}")

medium

A random baseline's performance varies with each run, which isn't ideal for a stable benchmark. A more robust and standard naive baseline is to always predict the majority class from the training data. This provides a deterministic score to compare against.

Note: You'll need to update the function signature to accept y_train and update the call in main to todo_4(y_train, y_test).

Suggested change
-def todo_4(y_test):
-    print("\n--- TODO 4 ---")
-    # Naive baseline: random guess based on random probability
-    y_pred_random = np.random.choice([0, 1], size=len(y_test))
-    accuracy = metrics.accuracy_score(y_test, y_pred_random)
-    print(f"Random baseline accuracy: {accuracy:.4f}")
+def todo_4(y_train, y_test):
+    print("\n--- TODO 4 ---")
+    # Naive baseline: always predict the most frequent class from the training set
+    most_frequent_class = y_train.mode()[0]
+    y_pred_baseline = np.full_like(y_test, fill_value=most_frequent_class)
+    accuracy = metrics.accuracy_score(y_test, y_pred_baseline)
+    print(f"Majority class baseline accuracy: {accuracy:.4f}")
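
As a usage note, scikit-learn already packages this exact baseline as `DummyClassifier`, which avoids hand-rolling the mode logic. A minimal sketch with made-up labels standing in for the Titanic split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# Made-up labels for the demo; majority class in y_train is 0
y_train = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
y_test = np.array([0, 1, 0, 0, 1])

# strategy="most_frequent" always predicts the training-set majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train), 1)), y_train)     # feature values are ignored
y_pred = baseline.predict(np.zeros((len(y_test), 1)))

accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Majority class baseline accuracy: {accuracy:.4f}")  # 0.6000
```

Because the prediction is deterministic given the training labels, this score is stable across runs, unlike the random-choice baseline.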
