
Feat/default models #1

Open
Reallyeasy1 wants to merge 6 commits into main from feat/default-models

Conversation

@Reallyeasy1

@Reallyeasy1 Reallyeasy1 commented Mar 3, 2026

Collaborator

Additional Notes:
The main change is in colab_package/.

You can skip notebooks/ during review: those notebooks are the source of truth from which colab_package/ is built.

Closes #3

Summary by CodeRabbit

  • New Features

    • Complete sarcasm classification experiment suite: data-prep, multiple model baselines (TF-IDF+LR, Naive Bayes, BERT/DistilBERT), training, evaluation, and error-analysis workflows; consolidated combined notebook for easy execution.
  • Documentation

    • Added comprehensive experiment plan, dataset audit, and research summary detailing methodology, metrics, split strategy, and reproducibility checklist.
  • Chores

    • Added dependency manifest and updated ignore rules to exclude local AI artifacts.

@Reallyeasy1 Reallyeasy1 self-assigned this Mar 3, 2026
@Reallyeasy1 Reallyeasy1 requested a review from SeeYangZhi March 3, 2026 08:39
@coderabbitai

coderabbitai Bot commented Mar 3, 2026


Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)
  • colab_package/sarcasm_classification.ipynb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds a full sarcasm-classification pipeline: data preparation, TF-IDF/Naive Bayes baselines, DistilBERT training, error analysis notebooks, documentation, a notebook-merging utility, dependency manifest, and .gitignore updates. All changes are new files; no public API surface modified.

Changes

| Cohort | File(s) | Summary |
|--------|---------|---------|
| Repo metadata | `/.gitignore`, `requirements.txt` | Added `claude.md` and `.claude/` to `.gitignore`; introduced a `requirements.txt` declaring ML/NLP, plotting, and Jupyter dependencies. |
| Build Utility | `build_combined.py` | New script that merges five notebooks into `colab_package/sarcasm_classification.ipynb`, normalizes cells, validates assembled code, and writes outputs with directory scaffolding. |
| Dataset & Splitting | `notebooks/01_data_preparation.ipynb` | New notebook for data discovery, validation, duplicate/null checks, creation of binary/type datasets, group-aware train/val/test splitting, leakage checks, and saving datasets/splits/visualizations to outputs. |
| TF-IDF Baseline | `notebooks/02_tfidf_lr_baseline.ipynb` | New TF-IDF + Logistic Regression notebook with GroupKFold grid search, hyperparameter grids, evaluation, confusion matrices, feature importance, and persisted metrics/predictions. |
| Naive Bayes Baseline | `notebooks/03_naive_bayes_baseline.ipynb` | New notebook implementing multiple NB pipelines (Count/Tfidf + Multinomial/ComplementNB), grid search with GroupKFold, evaluation, and artifact saving for binary and type tasks. |
| BERT Training | `notebooks/04_bert_classification.ipynb` | New PyTorch DistilBERT-based training notebook: TrainConfig dataclass, HeadlineDataset, class-weighting, training/validation/test loops with early stopping, checkpointing, and saved metrics/artifacts for binary and type tasks. |
| Error Analysis & Reports | `notebooks/05_error_analysis.ipynb` | New notebook that loads model metrics/predictions, builds comparison tables, performs error extraction and heuristic failure-mode categorization, and writes markdown reports and CSV examples. |
| Documentation | `docs/dataset_audit.md`, `docs/experiment_plan.md`, `docs/research_summary.md` | Added dataset audit, experiment plan (splitting, CV, hyperparameters, outputs), and research/methodology summary documents describing assumptions and evaluation protocols. |
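
The group-aware splitting and leakage check summarized for `notebooks/01_data_preparation.ipynb` can be sketched roughly as below. This is an illustration only, not the notebook's code: the hash-bucket assignment, the helper names (`assign_split`, `split_by_group`, `check_no_leakage`), the 80/10/10 ratios, and the example URLs are all assumptions. Only the use of `article_link` as the grouping field comes from the dataset audit.

```python
import hashlib
from collections import defaultdict


def assign_split(group_key: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a group key to a split deterministically (hypothetical helper)."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_by_group(rows, group_field="article_link"):
    """Place every row of a group in the same split."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row[group_field])].append(row)
    return splits


def check_no_leakage(splits, group_field="article_link"):
    """Raise ValueError if any group appears in more than one split."""
    seen = {}
    for name, rows in splits.items():
        for row in rows:
            owner = seen.setdefault(row[group_field], name)
            if owner != name:
                raise ValueError(
                    f"group {row[group_field]!r} is in both {owner} and {name}"
                )


# Made-up rows: 200 headlines spread over 50 article URLs.
rows = [
    {"article_link": f"https://example.com/a{i % 50}", "text": f"headline {i}"}
    for i in range(200)
]
splits = split_by_group(rows)
check_no_leakage(splits)
```

Because the split is a pure function of the group key, rows sharing an `article_link` cannot land in different splits, and rerunning the sketch reproduces the same assignment.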

Sequence Diagram

sequenceDiagram
    participant User
    participant DataPrep as 01_DataPrep
    participant TFIDF as 02_TFIDF_LR
    participant NB as 03_NaiveBayes
    participant BERT as 04_BERT
    participant Error as 05_ErrorAnalysis
    participant Storage as Outputs

    User->>DataPrep: run data prep notebook
    DataPrep->>DataPrep: validate JSONL, derive labels, create splits
    DataPrep->>Storage: save datasets & splits

    par Run classical baselines
        TFIDF->>Storage: load splits
        TFIDF->>TFIDF: GridSearchCV (GroupKFold), evaluate
        TFIDF->>Storage: save metrics & preds

        NB->>Storage: load splits
        NB->>NB: GridSearchCV (GroupKFold), evaluate
        NB->>Storage: save metrics & preds
    end

    BERT->>Storage: load splits
    BERT->>BERT: tokenize, train (early stopping), evaluate
    BERT->>Storage: save checkpoints, metrics & preds

    Storage->>Error: provide metrics & predictions
    Error->>Error: compare models, extract errors, create reports
    Error->>Storage: save reports & visuals
    Storage-->>User: artifacts available

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A tiny rabbit hops with glee,
New notebooks blooming for all to see.
Data split, models train and play,
Errors hunted, insights on display.
Hooray—sarcasm tamed, one hop away! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'Feat/default models' is vague and does not clearly convey the actual scope of changes. While the PR adds multiple notebooks, documentation, and build utilities, the title does not reflect these substantial additions. | Consider a more descriptive title that captures the main focus, such as 'Add sarcasm classification notebooks and experiment infrastructure' or similar to better represent the comprehensive changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai

coderabbitai Bot commented Mar 3, 2026


Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #2

coderabbitai Bot added a commit that referenced this pull request Mar 3, 2026
Docstrings generation was requested by @Reallyeasy1.

* #1 (comment)

The following files were modified:

* `build_combined.py`

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (3)
requirements.txt (1)

2-24: Use deterministic dependency pinning for reproducible runs.

Lines 2–24 use floating minimums only (>=), so reinstalls can drift and change model metrics over time. For this project, keep a pinned lock artifact (or fully pinned requirements.txt) and optionally keep a separate requirements.in for broad constraints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 2 - 24, The requirements.txt currently uses
permissive >= pins which allows dependency drift; replace it with a
deterministic, fully pinned requirements lock (use exact version pins like ==
for every package) or add a generated lock artifact (e.g., requirements.txt as
the locked file) while keeping a high-level requirements.in with >= constraints;
regenerate the lock via your tool of choice (pip-compile, pip freeze in a clean
env, or poetry/poetry.lock) and commit the locked requirements.txt so future
installs are reproducible and deterministic.
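
A quick way to see which entries still float is to scan the file for anything that is not an exact `==` pin. The sketch below is a simplified illustration: the helper name `unpinned_requirements` and the sample package versions are made up, and the regular expression deliberately ignores environment markers and URL requirements.

```python
import re

# A line counts as pinned only if it uses an exact "==" specifier.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not exact '==' pins (hypothetical helper)."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


# Made-up sample; the versions are placeholders, not the project's real pins.
sample = "numpy>=1.24\npandas==2.2.2\n# comment\nscikit-learn>=1.3\n"
print(unpinned_requirements(sample))
```

A check like this could run in CI against the locked file, while a tool such as pip-compile generates the pins themselves.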
build_combined.py (1)

200-210: Remove unused out_dir_val parameter from content_cells_nb.

out_dir_val is never used, which makes call sites noisier and the helper harder to read.

Suggested fix
-def content_cells_nb(cells, out_dir_var, out_dir_val):
+def content_cells_nb(cells, out_dir_var):
@@
-for c in content_cells_nb(nb02, "OUT_TFIDF", "OUT_TFIDF"):
+for c in content_cells_nb(nb02, "OUT_TFIDF"):
@@
-for c in content_cells_nb(nb03, "OUT_NB", "OUT_NB"):
+for c in content_cells_nb(nb03, "OUT_NB"):
@@
-for c in content_cells_nb(nb04, "BERT_OUT", "BERT_OUT"):
+for c in content_cells_nb(nb04, "BERT_OUT"):
@@
-for c in content_cells_nb(nb05, "REPORTS_DIR", "REPORTS_DIR"):
+for c in content_cells_nb(nb05, "REPORTS_DIR"):

Also applies to: 241-241, 249-249, 257-257, 265-265

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build_combined.py` around lines 200 - 210, The function content_cells_nb has
an unused parameter out_dir_val; remove out_dir_val from its signature and any
internal references, leaving only (cells, out_dir_var), and update every call
site that currently passes a third argument to call content_cells_nb with just
(cells, out_dir_var). Ensure any tests or helpers that referenced the old
three-arg form are updated accordingly; search for content_cells_nb usages in
this file to locate and fix all callers.
notebooks/05_error_analysis.ipynb (1)

577-581: Avoid hard-coded class-frequency claims in generated reports.

Line 580 hard-codes rhetorical_question = 3.7%, which can become incorrect when data changes and make the report stale.

Suggested fix
-    lines.append("4. **Class imbalance** (rhetorical_question = 3.7%) is a significant challenge")
+    if error_source_type is not None and "type_label" in error_source_type.columns:
+        freq = (error_source_type["type_label"] == "rhetorical_question").mean() * 100
+        lines.append(f"4. **Class imbalance** (rhetorical_question = {freq:.1f}%) is a significant challenge")
+    else:
+        lines.append("4. **Class imbalance** is a significant challenge")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/05_error_analysis.ipynb` around lines 577 - 581, The report
currently hard-codes "rhetorical_question = 3.7%" inside a lines.append string;
instead compute the class frequency from the dataset and interpolate it into the
string. Find the lines.append call that contains "Class imbalance
(rhetorical_question = 3.7%)", calculate the percentage from the label column
(e.g., count_true / total * 100) and format it (e.g., f"{pct:.1f}%") before
passing the formatted string to lines.append so the report always reflects the
current data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@build_combined.py`:
- Around line 282-289: When syntax errors are found (the errors list is
non-empty) stop the build instead of continuing: after printing the error
summary (the block that iterates over errors) call sys.exit(1) or raise
SystemExit so the script exits non‑zero and prevents subsequent writing of the
combined notebook; alternatively move the code that writes the combined notebook
so it only executes in the else branch where you currently print "syntax OK".
Refer to the errors variable and the all_cells check in build_combined.py and
ensure the combined-notebook write logic (the function or code that
serializes/writes the notebook) only runs when no syntax errors are present.
- Around line 35-138: SETUP_SRC is missing imports and DEVICE/Dataset/DataLoader
definitions needed by notebook part 04; update SETUP_SRC to import torch,
transformers (e.g., from transformers import AutoTokenizer, AutoModel, etc. as
used elsewhere), and tqdm (or tqdm.auto), and import Dataset and DataLoader from
torch.utils.data, then add DEVICE = torch.device("cuda" if
torch.cuda.is_available() else "cpu") (and set torch.manual_seed(SEED) after
SEED) so code referencing DEVICE, Dataset, DataLoader, torch, or transformers
resolves at runtime; modify the block that defines imports and SEED in SETUP_SRC
accordingly, keeping symbol names exactly as used (SETUP_SRC, SEED, DEVICE,
Dataset, DataLoader, torch, transformers, tqdm).

In `@docs/dataset_audit.md`:
- Around line 16-24: The markdown file has lint warnings: the two tables that
start with the header row "| Field | Type | Description |" need a blank line
before and after each table to satisfy MD058, and the fenced code block that
currently lacks a language tag should be updated to include an appropriate
language identifier (e.g., ```python) to satisfy MD040; locate the table blocks
and the fenced block in docs/dataset_audit.md (search for the table header
string and the triple-backtick fence) and insert the surrounding blank lines for
each table and add the language tag to the fence.

In `@docs/research_summary.md`:
- Around line 112-121: Surround the markdown table that begins with "The 6
strategy classes are **significantly imbalanced**:" (the type-distribution table
listing sarcasm, irony, satire, etc.) with a blank line before the paragraph
introducing the table and a blank line after the table end so there is an empty
line both above and below the table to satisfy markdownlint MD058.

In `@notebooks/01_data_preparation.ipynb`:
- Around line 350-353: Replace the runtime-only assert checks that verify split
integrity (e.g., "assert train_groups.isdisjoint(val_groups)...", "assert
train_groups.isdisjoint(test_groups)...", "assert
val_groups.isdisjoint(test_groups)...") with unconditional guard checks that
raise explicit exceptions (for example raise ValueError with the same message)
so these validations always run even when Python is optimized; do the same for
the other analogous assertions in the file (the checks referencing
train_groups/val_groups/test_groups around the later block) to ensure
deterministic error handling.
- Line 24: The _find_root function can pick a generic system-level directory
(e.g., a global "data" folder); update _find_root to only accept a candidate
parent as ROOT if it both contains one of the existing project markers
(outputs/, notebooks/, data/) and also looks like a repository/project root —
e.g., it contains a repo marker such as .git, pyproject.toml, setup.cfg, or
package.json — and insist the candidate is within the current project tree (one
of Path.cwd() or its parents) and not a top-level filesystem root or home
directory; change the acceptance logic in _find_root accordingly and leave the
fallback to data_file.parent unchanged (symbols to edit: _find_root, markers,
ROOT, OUT_DATASETS, OUT_SPLITS).
- Around line 191-193: Remove the redundant imports in Cell 10: delete the line
"from __future__ import annotations" (future imports must remain only at the
top-level/module start) and remove the duplicate "from urllib.parse import
urlparse" since both are already defined in the setup cell; ensure only the
setup cell retains these imports and rerun the notebook to confirm no missing
references to urlparse.
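
For the split-integrity finding above: `assert` statements are stripped when Python runs with `-O`, so an explicit check is needed if the validation must always execute. A minimal sketch follows, assuming the groups are held in plain sets of strings; the helper name `require_disjoint` and the sample group ids are hypothetical.

```python
def require_disjoint(name_a: str, a: set, name_b: str, b: set) -> None:
    """Raise ValueError if two group sets overlap; unlike assert, this runs under `python -O`."""
    overlap = a & b
    if overlap:
        examples = sorted(overlap)[:5]
        raise ValueError(
            f"Leakage: {len(overlap)} group(s) shared by {name_a} and {name_b}, e.g. {examples}"
        )


# Made-up group ids standing in for the notebook's train/val/test groups.
train_groups = {"g1", "g2", "g3"}
val_groups = {"g4"}
test_groups = {"g5", "g6"}

require_disjoint("train", train_groups, "val", val_groups)
require_disjoint("train", train_groups, "test", test_groups)
require_disjoint("val", val_groups, "test", test_groups)
```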

In `@notebooks/02_tfidf_lr_baseline.ipynb`:
- Around line 184-185: The notebook currently refits the best estimator
(best_bin / best_estimator_) on train+val and then calls evaluate(..., X_val,
y_val, ...) which yields validation metrics computed on data used during refit;
instead compute and report true holdout validation metrics either by (a)
evaluating the selected estimator before refit (use the estimator from the CV
split or GridSearchCV.best_estimator_ saved prior to refit) on X_val/y_val, or
(b) compute validation metrics from cross-validation results (e.g.,
grid.cv_results_ or the scorer on X_val with a copy of the model trained only on
train set). Concretely, locate uses of best_bin / best_estimator_ and the
evaluate(...) calls and change the workflow so evaluation on X_val/y_val happens
with the model that was not refit on train+val (or derive metrics from
cv_results_) before performing any refit on the combined training+validation
data.
- Around line 420-431: The code currently overwrites the original semantic
"type_label" by assigning y_val_t / y_test_t into val_type_copy["type_label"]
and test_type_copy["type_label"]; instead, preserve the original type_label and
store numeric IDs in dedicated fields (e.g.,
val_type_copy["true_id"]/test_type_copy["true_id"] or
val_type_copy["type_id"]/test_type_copy["type_id"]), then set
val_type_copy["predicted_id"], val_type_copy["predicted_label"],
val_type_copy["true_label"], val_type_copy["correct"] (and the corresponding
test_type_copy fields) using y_val_t/y_test_t and
y_val_pred_type/y_test_pred_type so the original string "type_label" remains
unchanged.

In `@notebooks/03_naive_bayes_baseline.ipynb`:
- Around line 178-179: The code is evaluating validation performance after
refitting the selected estimator on train+val (best_estimator_), so those scores
are no longer true holdout metrics; fix by either (A) remove/skip the
evaluate(...) calls on X_val (e.g., drop evaluate(best_model_bin, X_val, y_val,
...) and evaluate(best_model_multi, X_val, ...) after refit) or (B) preserve a
pre-refit copy of the chosen estimator and evaluate that on X_val (use
sklearn.base.clone(best_estimator_) or set GridSearchCV(refit=False) and re-fit
a cloned estimator only on train+val for final test evaluation), ensuring
references to best_estimator_, best_model_bin and best_model_multi and the
evaluate(...) calls are updated accordingly.
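
The ordering this finding asks for can be shown with a toy model. The sketch below uses a hypothetical pure-Python `MajorityClassifier` and made-up data in place of the notebook's scikit-learn pipelines; the point is only that validation metrics come from a model fit on train alone, and the train+val refit is used only for the test set.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data for illustration.
X_train, y_train = ["a", "b", "c", "d", "e"], [1, 1, 1, 1, 0]
X_val, y_val = ["f", "g"], [0, 0]
X_test, y_test = ["h", "i"], [1, 0]

# 1) The validation score comes from a model that has never seen X_val.
holdout_model = MajorityClassifier().fit(X_train, y_train)
val_acc = accuracy(y_val, holdout_model.predict(X_val))

# 2) Only afterwards refit on train+val, and report test metrics from that model.
final_model = MajorityClassifier().fit(X_train + X_val, y_train + y_val)
test_acc = accuracy(y_test, final_model.predict(X_test))

print(f"val (train-only fit): {val_acc:.2f}  test (train+val fit): {test_acc:.2f}")
```

With scikit-learn the same ordering would keep a train-only copy of the selected estimator (for example via `sklearn.base.clone`, as the comment suggests) for the validation report.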

In `@notebooks/04_bert_classification.ipynb`:
- Line 153: Guard against empty loaders before dividing by len(loader): wherever
you compute averages like total_loss / len(loader) (and similar metric sums)
first check if len(loader) == 0 and return a safe default (e.g., 0.0 or
float('nan')) or skip the averaging; implement this by computing denom =
len(loader) and if denom == 0 return 0.0 else return total_loss / denom, and
apply the same pattern to all other places that divide by len(loader) (the
total_loss / len(loader) occurrence and the other metric-averaging blocks
mentioned).
- Around line 425-429: The encoder function enc currently does direct lookup
into label2id which will raise KeyError if val/test contain labels unseen in
train; before calling enc for X_val_type/X_test_type validate that
set(val_type["type_label"]) and set(test_type["type_label"]) are subsets of
label2id.keys(), and if not either (a) raise a clear error listing unknown
labels so the user can fix the splits, or (b) map unknown labels to a default
fallback id (e.g., label2id.get(label, unknown_id)); update usages of enc and
the variables X_val_type, y_val_type, X_test_type, y_test_type accordingly and
mention label2id and enc in the check and handling logic.
- Around line 339-342: The classification_report calls for validation and test
(the calls using classification_report(y_val, val_preds,
target_names=label_names, zero_division=0) and classification_report(y_test,
test_preds, target_names=label_names, zero_division=0)) can raise ValueError
when some classes are missing; fix by passing an explicit labels list to ensure
stable multi-class evaluation—compute the label id list (e.g., label_ids =
list(range(len(label_names))) or derive from your label-to-id mapping) and add
labels=label_ids to both classification_report calls so target_names aligns with
the provided labels.
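
Two of the guards described above, the empty-loader average and the unseen-label check, need no framework code. The following is a sketch under assumptions: `safe_mean` and `encode_labels` are hypothetical helper names, the label ids are illustrative, and NaN is only one of the defaults the comment allows.

```python
def safe_mean(total: float, count: int) -> float:
    """Average that returns NaN instead of raising ZeroDivisionError on an empty loader."""
    return total / count if count else float("nan")


def encode_labels(labels, label2id):
    """Encode labels, failing early with a readable message if any are unseen in train."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Labels missing from label2id: {unknown}")
    return [label2id[label] for label in labels]


# Illustrative mapping; the real label2id is built from the training split.
label2id = {"irony": 0, "satire": 1, "sarcasm": 2}
print(encode_labels(["satire", "irony"], label2id))
print(safe_mean(0.0, 0))
```

Raising on unknown labels surfaces a bad split immediately, whereas mapping them to a fallback id would hide the problem inside the metrics.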

In `@notebooks/05_error_analysis.ipynb`:
- Around line 333-336: The formatting expression in the loop over
error_sample_type uses :20s on values that can be numeric, causing ValueError;
in the loop that reads true_l = row[true_col_type] and pred_l =
row.get(pred_col_type, "?"), cast both true_l and pred_l to strings before
applying the width format (e.g., convert via str(...) or f-string !s conversion)
so the print call print(f"  [True={...:20s}  Pred={...:20s}]
{row['text'][:90]}") never receives non-string types.
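
The failure mode here is that the `s` format code rejects non-string values, so an integer label raises `ValueError`. A minimal sketch of the suggested cast, with a hypothetical `format_error_row` helper and a made-up example row:

```python
def format_error_row(true_label, pred_label, text, width=20, max_chars=90):
    """Format one error example; str() makes the width spec safe for ints and other non-strings."""
    return (
        f"  [True={str(true_label):{width}s}  Pred={str(pred_label):{width}s}] "
        f"{str(text)[:max_chars]}"
    )


# An integer label would fail with a bare `:20s`; here it is cast first.
print(format_error_row(3, "satire", "example headline text"))
```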


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c695478 and d9b7b0c.

📒 Files selected for processing (14)
  • .gitignore
  • build_combined.py
  • colab_package/sarcasm_classification.ipynb
  • colab_package/sarcasm_pairs_step35_clean.jsonl
  • docs/dataset_audit.md
  • docs/experiment_plan.md
  • docs/research_summary.md
  • notebooks/01_data_preparation.ipynb
  • notebooks/02_tfidf_lr_baseline.ipynb
  • notebooks/03_naive_bayes_baseline.ipynb
  • notebooks/04_bert_classification.ipynb
  • notebooks/05_error_analysis.ipynb
  • notebooks/sarcasm_pairs_step35_clean.jsonl
  • requirements.txt

Comment thread build_combined.py
Comment on lines +35 to +138
SETUP_SRC = "\n".join([
    "# ============================================================",
    "# SETUP — imports, file upload, paths",
    "# ============================================================",
    "from __future__ import annotations",
    "import json, hashlib, random, os, warnings, shutil",
    "from dataclasses import dataclass",
    "from pathlib import Path",
    "from collections import Counter",
    "from urllib.parse import urlparse",
    "from typing import Optional",
    "",
    "import numpy as np",
    "import pandas as pd",
    "import matplotlib.pyplot as plt",
    "import matplotlib.gridspec as gridspec",
    "import seaborn as sns",
    "",
    "from sklearn.pipeline import Pipeline",
    "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer",
    "from sklearn.linear_model import LogisticRegression",
    "from sklearn.naive_bayes import MultinomialNB, ComplementNB",
    "from sklearn.model_selection import GridSearchCV, GroupKFold",
    "from sklearn.metrics import (",
    "    accuracy_score, precision_score, recall_score,",
    "    f1_score, classification_report, confusion_matrix,",
    ")",
    "",
    "warnings.filterwarnings('ignore')",
    "",
    "SEED = 42",
    "random.seed(SEED)",
    "np.random.seed(SEED)",
    "",
    "# ── Locate or upload the JSONL data file ─────────────────────────────────",
    'FILENAME = "sarcasm_pairs_step35_clean.jsonl"',
    "",
    "def _locate_file(filename):",
    "    candidates = []",
    "    for root in [Path.cwd()] + list(Path.cwd().parents):",
    "        for sub in [",
    '            Path("data") / "processed" / filename,',
    '            Path("data") / filename,',
    "            Path(filename),",
    "        ]:",
    "            candidates.append(root / sub)",
    "    for p in [",
    '        Path("/content") / filename,',
    '        Path("/mnt/data") / filename,',
    "    ]:",
    "        candidates.append(p)",
    "    _c = Path('/content')",
    "    for p in (_c.rglob(filename) if _c.exists() else []):",
    "        candidates.append(p)",
    "    for p in candidates:",
    "        if p.is_file():",
    "            return p",
    "    return None",
    "",
    "print(f'cwd: {Path.cwd()}')",
    "print(f'files in cwd: {[p.name for p in Path.cwd().iterdir()][:10]}')",
    "",
    "DATA_FILE = _locate_file(FILENAME)",
    "if DATA_FILE is None:",
    "    try:",
    "        from google.colab import files as _cf",
    '        print(f"Upload {FILENAME!r}:")',
    "        _up = _cf.upload()",
    "        if not _up:",
    '            raise RuntimeError("No file uploaded.")',
    "        _name = list(_up.keys())[0]",
    '        DATA_FILE = Path("/content") / FILENAME',
    "        if Path(_name) != DATA_FILE:",
    "            shutil.move(_name, str(DATA_FILE))",
    '        print(f"Saved to {DATA_FILE}")',
    "    except ImportError:",
    "        raise FileNotFoundError(",
    '            f"Cannot find {FILENAME!r}. Place it in the same folder as this notebook."',
    "        )",
    "",
    "# ── Project root + all output directories ────────────────────────────────",
    "def _find_root(data_file):",
    '    for parent in [data_file.parent] + list(data_file.parents):',
    '        if any((parent / m).exists() for m in ["outputs","notebooks","data"]):',
    "            return parent",
    "    return data_file.parent",
    "",
    "ROOT = _find_root(DATA_FILE)",
    "",
    "OUT_DATASETS = ROOT / 'outputs' / 'datasets'",
    "OUT_SPLITS = ROOT / 'outputs' / 'splits'",
    "OUT_TFIDF = ROOT / 'outputs' / 'classical' / 'tfidf_lr'",
    "OUT_NB = ROOT / 'outputs' / 'classical' / 'naive_bayes'",
    "BERT_OUT = ROOT / 'outputs' / 'bert'",
    "REPORTS_DIR = ROOT / 'outputs' / 'reports'",
    "SPLITS = OUT_SPLITS",
    "",
    "for d in [OUT_DATASETS, OUT_SPLITS, OUT_TFIDF, OUT_NB, BERT_OUT, REPORTS_DIR]:",
    "    d.mkdir(parents=True, exist_ok=True)",
    "",
    'print(f"Data : {DATA_FILE}")',
    'print(f"Root : {ROOT}")',
    'print(f"Output : {ROOT / \'outputs\'}")',
]) + "\n"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 85


🏁 Script executed:

wc -l build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 89


🏁 Script executed:

cat -n build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 13265


🏁 Script executed:

fd "04_bert_classification.ipynb"

Repository: SeeYangZhi/Project-LLMao

Length of output: 106


🏁 Script executed:

cat notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys

nb = json.load(sys.stdin)
for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 50 lines
            lines = src.split("\n")[:50]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 50:
                print("     ...")
        else:
            print(src[:200])
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 788


🏁 Script executed:

python3 << 'EOF'
import json
from pathlib import Path

nb_path = Path("notebooks/04_bert_classification.ipynb")
with open(nb_path, 'r') as f:
    nb = json.load(f)

for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 80 lines
            lines = src.split("\n")[:80]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 80:
                print("     ...")
        else:
            print(src[:300])
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 11941


Add missing torch, transformers, and tqdm imports plus DEVICE initialization to SETUP_SRC.

The combined notebook skips notebook 04's setup cell (filtered by is_setup_cell), but SETUP_SRC does not define torch, transformers imports, Dataset, DataLoader, or DEVICE. Part 4 will fail at runtime (e.g., NameError: name 'DEVICE' is not defined, NameError: name 'Dataset' is not defined).

Suggested fix
 SETUP_SRC = "\n".join([
     "import seaborn as sns",
+    "import torch",
+    "from torch import nn",
+    "from torch.utils.data import Dataset, DataLoader",
+    "from transformers import (",
+    "    AutoTokenizer,",
+    "    AutoModelForSequenceClassification,",
+    "    get_linear_schedule_with_warmup,",
+    ")",
+    "from tqdm.auto import tqdm",
     "",
     "warnings.filterwarnings('ignore')",
     "",
     "SEED = 42",
     "random.seed(SEED)",
     "np.random.seed(SEED)",
+    "torch.manual_seed(SEED)",
+    "if torch.cuda.is_available():",
+    "    torch.cuda.manual_seed_all(SEED)",
+    "    torch.backends.cudnn.deterministic = True",
+    "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')",
+    "print(f'Device: {DEVICE}')",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build_combined.py` around lines 35 - 138, SETUP_SRC is missing imports and
DEVICE/Dataset/DataLoader definitions needed by notebook part 04; update
SETUP_SRC to import torch, transformers (e.g., from transformers import
AutoTokenizer, AutoModel, etc. as used elsewhere), and tqdm (or tqdm.auto), and
import Dataset and DataLoader from torch.utils.data, then add DEVICE =
torch.device("cuda" if torch.cuda.is_available() else "cpu") (and set
torch.manual_seed(SEED) after SEED) so code referencing DEVICE, Dataset,
DataLoader, torch, or transformers resolves at runtime; modify the block that
defines imports and SEED in SETUP_SRC accordingly, keeping symbol names exactly
as used (SETUP_SRC, SEED, DEVICE, Dataset, DataLoader, torch, transformers,
tqdm).
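
A missing name like this could be caught at build time by comparing the names the setup source binds against the names the later parts rely on. The sketch below is one possible check, not part of `build_combined.py`: the helper names (`defined_names`, `missing_names`) and the short `setup_src` sample are assumptions, and only module-level imports, assignments, functions, and classes are inspected.

```python
import ast


def defined_names(source: str) -> set[str]:
    """Collect names bound at module level by imports, assignments, defs, and classes."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import x as y" binds "y".
                names.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
    return names


def missing_names(setup_source: str, required: set[str]) -> set[str]:
    """Return the required names that the setup source never binds."""
    return required - defined_names(setup_source)


# Shortened stand-in for SETUP_SRC; the source is parsed, never executed.
setup_src = "import numpy as np\nSEED = 42\n"
required = {"np", "SEED", "torch", "Dataset", "DataLoader", "DEVICE"}
print(sorted(missing_names(setup_src, required)))
```

Since the source is only parsed, the check runs without torch or transformers installed.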

Comment thread build_combined.py
Comment on lines +282 to +289
if errors:
    print(f"\n{len(errors)} syntax error(s):")
    for i, ln, msg, line in errors:
        print(f"  Cell {i:3d} line {ln}: {line!r}")
        print(f"           {msg}")
else:
    print(f"All {len(all_cells)} cells: syntax OK")


⚠️ Potential issue | 🟠 Major

Fail the build when syntax errors are detected.

Currently the script prints syntax errors but still writes the combined notebook, which can ship broken artifacts.

Suggested fix
 if errors:
     print(f"\n{len(errors)} syntax error(s):")
     for i, ln, msg, line in errors:
         print(f"  Cell {i:3d} line {ln}: {line!r}")
         print(f"           {msg}")
+    raise RuntimeError("Aborting: combined notebook contains syntax errors.")
 else:
     print(f"All {len(all_cells)} cells: syntax OK")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build_combined.py` around lines 282 - 289, When syntax errors are found (the
errors list is non-empty) stop the build instead of continuing: after printing
the error summary (the block that iterates over errors) call sys.exit(1) or
raise SystemExit so the script exits non‑zero and prevents subsequent writing of
the combined notebook; alternatively move the code that writes the combined
notebook so it only executes in the else branch where you currently print
"syntax OK". Refer to the errors variable and the all_cells check in
build_combined.py and ensure the combined-notebook write logic (the function or
code that serializes/writes the notebook) only runs when no syntax errors are
present.
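
The "validate first, write only on success" ordering described above can be sketched as follows. This is a minimal illustration, not the script's real code: `find_syntax_errors`, `build`, and the `write_notebook` callback are hypothetical names, and checking cells with `ast.parse` is an assumption about how the script validates them (cells containing IPython magics such as `%matplotlib` would need separate handling).

```python
import ast
import sys


def find_syntax_errors(cells: list[str]) -> list[tuple[int, int, str, str]]:
    """Return (cell index, line number, message, offending line) per bad cell."""
    errors = []
    for i, src in enumerate(cells):
        try:
            ast.parse(src)
        except SyntaxError as exc:
            errors.append((i, exc.lineno or 0, exc.msg, (exc.text or "").rstrip()))
    return errors


def build(cells: list[str], write_notebook) -> None:
    """Validate every cell, then write the notebook only if all of them parse."""
    errors = find_syntax_errors(cells)
    if errors:
        print(f"\n{len(errors)} syntax error(s):")
        for i, ln, msg, line in errors:
            print(f"  Cell {i:3d} line {ln}: {line!r}")
            print(f"           {msg}")
        sys.exit(1)  # non-zero exit status; the write below never runs
    print(f"All {len(cells)} cells: syntax OK")
    write_notebook(cells)
```

Exiting with a non-zero status also lets a CI job or a shell `&&` chain stop at this step instead of publishing a broken notebook.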

Comment thread docs/dataset_audit.md
Comment on lines +16 to +24
| Field | Type | Description |
|-------|------|-------------|
| `original_headline` | string | The source headline (sarcastic OR non-sarcastic depending on `type`) |
| `generated_headline` | string | The transformed headline (the opposite label) |
| `strategy` | string | Sarcasm type/strategy label (6 values, see below) |
| `type` | string | Direction of transformation: `sarcastic_to_non` or `non_to_sarcastic` |
| `model_used` | string | The LLM that generated the rewrite |
| `article_link` | URL string | Source article URL; used for grouping |

⚠️ Potential issue | 🟡 Minor

Fix markdownlint warnings in tables and code fences.

Line 16/Line 73 tables need surrounding blank lines (MD058), and the fenced block at Line 106 should include a language tag (MD040), e.g. ```python.

Also applies to: 73-76, 106-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/dataset_audit.md` around lines 16 - 24, The markdown file has lint
warnings: the two tables that start with the header row "| Field | Type |
Description |" need a blank line before and after each table to satisfy MD058,
and the fenced code block that currently lacks a language tag should be updated
to include an appropriate language identifier (e.g., ```python) to satisfy
MD040; locate the table blocks and the fenced block in docs/dataset_audit.md
(search for the table header string and the triple-backtick fence) and insert
the surrounding blank lines for each table and add the language tag to the
fence.
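
For reference, the MD058 part of this finding (blank lines around tables) can be applied mechanically. The helper below is a rough sketch under stated assumptions: `pad_tables` is a hypothetical name, it only recognises table rows that start with `|`, and it skips anything inside a triple-backtick fence. It does not address MD040, which needs a human to pick the right language tag.

```python
def pad_tables(markdown: str) -> str:
    """Insert a blank line before and after each run of pipe-table rows."""
    out: list[str] = []
    in_fence = False
    prev_is_row = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        is_row = not in_fence and line.lstrip().startswith("|")
        # Entering a table: the previous line must be blank.
        if is_row and not prev_is_row and out and out[-1].strip():
            out.append("")
        # Leaving a table: the next line must be blank.
        if prev_is_row and not is_row and line.strip():
            out.append("")
        out.append(line)
        prev_is_row = is_row
    return "\n".join(out)


doc = "Fields:\n| Field | Type |\n|-------|------|\n| `type` | string |\nNext paragraph."
print(pad_tables(doc))
```

The function is idempotent, so running it on a file that already passes MD058 leaves the file unchanged.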

Comment thread docs/research_summary.md
Comment on lines +112 to +121
The 6 strategy classes are **significantly imbalanced**:
| Strategy | Count (approx) |
|----------|---------------|
| sarcasm | ~8,699 |
| irony | ~6,102 |
| satire | ~5,224 |
| overstatement | ~3,976 |
| understatement | ~3,295 |
| rhetorical_question | ~1,037 |

⚠️ Potential issue | 🟡 Minor

Add blank lines around the type-distribution table.

The table starting at Line 113 should be surrounded by blank lines to satisfy markdownlint MD058.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 113-113: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/research_summary.md` around lines 112 - 121, Add a blank line between the
introductory sentence "The 6 strategy classes are **significantly imbalanced**:"
and the header row of the strategy-distribution table (sarcasm, irony, satire,
etc.), and make sure a blank line also follows the last table row, so the table
has an empty line both above and below it to satisfy markdownlint MD058.

"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import os; from pathlib import Path as _P; print(\"cwd:\", _P.cwd()); print(\"files:\", list(_P.cwd().iterdir())[:10])\n\nimport json, hashlib, random, os\nfrom pathlib import Path\nfrom collections import Counter\nfrom urllib.parse import urlparse\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nSEED = 42\nrandom.seed(SEED)\nnp.random.seed(SEED)\n\nFILENAME = \"sarcasm_pairs_step35_clean.jsonl\"\n\n# \u2500\u2500 Step 1: locate or upload the data file \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _locate_file(filename):\n \"\"\"Search common locations; return Path if found, else None.\"\"\"\n candidates = []\n for root in [Path.cwd()] + list(Path.cwd().parents):\n for sub in [\n Path(\"data\") / \"processed\" / filename,\n Path(\"data\") / filename,\n Path(filename),\n ]:\n candidates.append(root / sub)\n for p in [\n Path(\"/content\") / filename,\n Path(\"/mnt/data\") / filename,\n Path(\"/content/drive/MyDrive\") / filename,\n ]:\n candidates.append(p)\n _content = Path(\"/content\")\n for p in (_content.rglob(filename) if _content.exists() else []):\n candidates.append(p)\n for p in candidates:\n if p.is_file():\n return p\n return None\n\nDATA_FILE = _locate_file(FILENAME)\n\nif DATA_FILE is None:\n # \u2500\u2500 Auto-upload in Colab \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n try:\n from google.colab import files as _colab_files\n print(f\"{FILENAME!r} not found. 
Please upload it now:\")\n _uploaded = _colab_files.upload()\n if not _uploaded:\n raise RuntimeError(\"No file uploaded. Please upload the JSONL file and re-run.\")\n _fname = list(_uploaded.keys())[0]\n if not _fname.endswith(FILENAME.split('/')[-1]):\n raise ValueError(\n f\"Wrong file uploaded: {_fname!r}. Expected {FILENAME!r}.\"\n )\n DATA_FILE = Path(\"/content\") / FILENAME\n if Path(_fname) != DATA_FILE:\n import shutil\n shutil.move(_fname, str(DATA_FILE))\n print(f\"Uploaded to {DATA_FILE}\")\n except ImportError:\n # Not in Colab\n raise FileNotFoundError(\n f\"Cannot find {FILENAME!r}.\\n\"\n f\"cwd: {Path.cwd()}\\n\"\n \"Place the file at: <project_root>/data/processed/\" + FILENAME\n )\n\n# \u2500\u2500 Step 2: determine project root \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _find_root(data_file):\n \"\"\"Walk up from data_file to find the project root (contains outputs/ or notebooks/).\"\"\"\n markers = [\n Path(\"outputs\"),\n Path(\"notebooks\"),\n Path(\"data\"),\n ]\n for parent in [data_file.parent] + list(data_file.parents):\n if any((parent / m).exists() for m in markers):\n return parent\n return data_file.parent # fallback\n\nROOT = _find_root(DATA_FILE)\n\nOUT_DATASETS = ROOT / \"outputs\" / \"datasets\"\nOUT_SPLITS = ROOT / \"outputs\" / \"splits\"\nOUT_DATASETS.mkdir(parents=True, exist_ok=True)\nOUT_SPLITS.mkdir(parents=True, exist_ok=True)\n\nprint(f\"Dataset : {DATA_FILE}\")\nprint(f\"Root : {ROOT}\")\nprint(f\"Outputs : {OUT_DATASETS}\")\n"

⚠️ Potential issue | 🟠 Major

Harden root detection to avoid writing outside the project tree.

In Line 24, _find_root can match overly generic parents (e.g., a system-level data dir), which risks setting ROOT incorrectly and writing outputs in unintended locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/01_data_preparation.ipynb` at line 24, The _find_root function can
pick a generic system-level directory (e.g., a global "data" folder); update
_find_root to only accept a candidate parent as ROOT if it both contains one of
the existing project markers (outputs/, notebooks/, data/) and also looks like a
repository/project root — e.g., it contains a repo marker such as .git,
pyproject.toml, setup.cfg, or package.json — and insist the candidate is within
the current project tree (one of Path.cwd() or its parents) and not a top-level
filesystem root or home directory; change the acceptance logic in _find_root
accordingly and leave the fallback to data_file.parent unchanged (symbols to
edit: _find_root, markers, ROOT, OUT_DATASETS, OUT_SPLITS).
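
The stricter acceptance rule described above might look like the sketch below. It is an illustration rather than the notebook's code: `find_root`, `PROJECT_MARKERS`, and `REPO_MARKERS` are hypothetical names, the repo markers are the ones listed in the comment, and the optional `cwd` parameter exists only to make the function testable.

```python
from pathlib import Path
from typing import Optional

PROJECT_MARKERS = ("outputs", "notebooks", "data")
# Repo markers named in the review comment; adjust to the real project.
REPO_MARKERS = (".git", "pyproject.toml", "setup.cfg", "package.json")


def find_root(data_file: Path, cwd: Optional[Path] = None) -> Path:
    """Walk up from data_file; accept only parents that look like the repo root."""
    cwd = (cwd or Path.cwd()).resolve()
    allowed = {cwd, *cwd.parents}  # stay inside the current project tree
    forbidden = {Path(cwd.anchor), Path.home().resolve()}
    data_file = data_file.resolve()
    for parent in data_file.parents:
        if parent in forbidden or parent not in allowed:
            continue
        has_project = any((parent / m).exists() for m in PROJECT_MARKERS)
        has_repo = any((parent / m).exists() for m in REPO_MARKERS)
        if has_project and has_repo:
            return parent
    return data_file.parent  # fallback unchanged from the original
```

Requiring both kinds of marker means a stray system-level `data/` directory is no longer enough to be chosen as `ROOT`. In an environment with no repo markers at all (a bare Colab upload, for example) the function falls through to `data_file.parent`, as before.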

Comment on lines +178 to +179
"val_m_bin, y_val_pred_bin = evaluate(best_model_bin, X_val, y_val, label_names_bin, \"Val\")\n",
"test_m_bin, y_test_pred_bin = evaluate(best_model_bin, X_test, y_test, label_names_bin, \"Test\")\n",

⚠️ Potential issue | 🟠 Major

Validation evaluation is leaked after refit on train+val.

At Line 176/Line 331, best_estimator_ has already been fit on train+val, so Line 178-179 and Line 333-334 are not true holdout validation results.
Either (a) skip post-refit val scoring, or (b) tune on train-only and keep val truly held out.

Also applies to: 333-334

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/03_naive_bayes_baseline.ipynb` around lines 178 - 179, The code is
evaluating validation performance after refitting the selected estimator on
train+val (best_estimator_), so those scores are no longer true holdout metrics;
fix by either (A) removing the evaluate(...) calls on X_val after the refit
(e.g., drop evaluate(best_model_bin, X_val, y_val, ...) and
evaluate(best_model_multi, X_val, ...)) or (B) keeping val held out: fit an
unfitted copy of the chosen configuration on train only (e.g.,
sklearn.base.clone(best_estimator_), or GridSearchCV(refit=False) plus a fresh
estimator built from best_params_), evaluate that copy on X_val, and use the
train+val refit only for the final test evaluation, ensuring references to
best_estimator_, best_model_bin and best_model_multi and the evaluate(...) calls
are updated accordingly.
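
Why post-refit validation scores are inflated is easiest to see with an extreme toy model. The example below deliberately swaps the notebook's Naive Bayes pipeline for a hypothetical `MemorizingClassifier` and uses made-up headlines. It is not the project's code or data, only an illustration of the leak.

```python
from collections import Counter


class MemorizingClassifier:
    """Toy model: recalls labels of seen texts, else predicts the majority label."""

    def fit(self, texts, labels):
        self.seen = dict(zip(texts, labels))
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        return [self.seen.get(t, self.majority) for t in texts]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up headlines; 1 = sarcastic, 0 = not sarcastic.
X_train = ["area man wins", "stocks rise", "rain expected", "nation shrugs", "council meets"]
y_train = [1, 0, 0, 1, 0]
X_val = ["local dog elected mayor", "budget approved"]
y_val = [1, 0]

# Honest: the model never saw the validation rows.
honest = MemorizingClassifier().fit(X_train, y_train)
honest_val = accuracy(y_val, honest.predict(X_val))

# Leaked: refit on train+val, then "validate" on rows it was trained on.
leaked = MemorizingClassifier().fit(X_train + X_val, y_train + y_val)
leaked_val = accuracy(y_val, leaked.predict(X_val))

print(f"val accuracy, fit on train only: {honest_val:.2f}")
print(f"val accuracy, fit on train+val : {leaked_val:.2f}")
```

A real classifier memorizes less aggressively than this toy, but the direction of the bias is the same, which is why only the test split should be scored after a train+val refit.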

" scheduler.step()\n",
"\n",
" total_loss += loss.item()\n",
" return total_loss / len(loader)\n",

⚠️ Potential issue | 🟠 Major

Guard empty splits to avoid division-by-zero in epoch metrics.

Line 153 and Line 178 divide by len(loader). Any empty split (or filtered split) crashes training/evaluation.

Suggested fix
 def train_bert(
@@
 ) -> dict:
@@
+    for split_name, X, y in [
+        ("train", X_train, y_train),
+        ("val", X_val, y_val),
+        ("test", X_test, y_test),
+    ]:
+        if len(X) == 0 or len(y) == 0:
+            raise ValueError(f"{split_name} split is empty; cannot train/evaluate.")
+
     # Datasets
     train_ds = HeadlineDataset(X_train, y_train, tokenizer, cfg.max_length)

Also applies to: 178-178, 239-247

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/04_bert_classification.ipynb` at line 153, Guard against empty
loaders before dividing by len(loader): wherever you compute averages like
total_loss / len(loader) (and similar metric sums) first check if len(loader) ==
0 and return a safe default (e.g., 0.0 or float('nan')) or skip the averaging;
implement this by computing denom = len(loader) and if denom == 0 return 0.0
else return total_loss / denom, and apply the same pattern to all other places
that divide by len(loader) (the total_loss / len(loader) occurrence and the
other metric-averaging blocks mentioned).
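
The two guards discussed in this comment (fail fast on an empty split, and avoid dividing by a zero-length loader) can be sketched in plain Python. `mean_loss` and `require_non_empty` are hypothetical helper names, not functions from the notebook, and returning NaN rather than 0.0 is a design assumption so that an empty epoch cannot be mistaken for a perfect loss.

```python
import math


def mean_loss(total_loss: float, num_batches: int) -> float:
    """Average per-batch loss; NaN when the loader produced no batches."""
    if num_batches == 0:
        return math.nan
    return total_loss / num_batches


def require_non_empty(**splits) -> None:
    """Raise a clear error naming the first empty split."""
    for name, items in splits.items():
        if len(items) == 0:
            raise ValueError(f"{name} split is empty; cannot train/evaluate.")


# Usage with made-up data: check splits once, up front.
require_non_empty(train=["a", "b"], val=["c"], test=["d"])
print(mean_loss(3.0, 2))
print(mean_loss(0.0, 0))
```

Checking the splits before training gives an actionable message early, while the guarded average keeps evaluation helpers safe if a loader is filtered down to nothing later on.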

Comment on lines +339 to +342
" print(\"\\n=== Val (best checkpoint) ===\")\n",
" print(classification_report(y_val, val_preds, target_names=label_names, zero_division=0))\n",
" print(\"\\n=== Test (best checkpoint) ===\")\n",
" print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))\n",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "04_bert_classification.ipynb" -type f

Repository: SeeYangZhi/Project-LLMao

Length of output: 108


🏁 Script executed:

cat ./notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys

data = json.load(sys.stdin)
cells = data.get('cells', [])

# Flatten all code and track line numbers
line_num = 1
for cell in cells:
    if cell.get('cell_type') == 'code':
        source = cell.get('source', [])
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 330 <= line_num <= 350:
                print(f"{line_num}: {line}")
            line_num += 1

print(f"\nTotal lines processed: {line_num - 1}", file=sys.stderr)
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 788


🏁 Script executed:

jq '.cells[] | select(.cell_type == "code") | .source' ./notebooks/04_bert_classification.ipynb | head -200

Repository: SeeYangZhi/Project-LLMao

Length of output: 10283


🏁 Script executed:

jq -r '.cells[] | select(.cell_type == "code") | .source | join("")' ./notebooks/04_bert_classification.ipynb > /tmp/notebook_code.txt
grep -n "Val (best checkpoint)" /tmp/notebook_code.txt

Repository: SeeYangZhi/Project-LLMao

Length of output: 172


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 'Val (best checkpoint)' in line or 'Test (best checkpoint)' in line:
                print(f"Line {line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 178


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 320 <= line_num <= 335:
                print(f"{line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 985


🌐 Web query:

sklearn classification_report target_names labels parameter ValueError imbalanced

💡 Result:

In sklearn.metrics.classification_report, target_names must be the same length and order as the labels that the report is computed over. If you don’t pass labels=..., scikit-learn infers the label set from the classes that actually appear in y_true and/or y_pred (so in imbalanced splits, some classes may be missing), which commonly makes len(target_names) not match and triggers:

ValueError: Number of classes, X, does not match size of target_names, Y. Try specifying the labels parameter (sklearn.org)

Fix (robust for imbalanced / missing classes)

Pass both labels (all classes you want shown, in a fixed order) and matching target_names:

from sklearn.metrics import classification_report

labels = list(le.classes_)           # or [0,1,2,...] in the order you want
target_names = list(le.classes_)     # same length + same order as labels

print(classification_report(
    y_true, y_pred,
    labels=labels,
    target_names=target_names,
    zero_division=0
))
  • labels controls which classes are included. (sklearn.org)
  • target_names must “match the labels (same order)” and have shape (n_labels,). (sklearn.org)

Alternative

If you don’t need custom names, omit target_names entirely and let scikit-learn print the raw label values. (sklearn.org)

Citations:


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 'label_names' in line and ('=' in line or 'def' in line):
                print(f"{line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 1260


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 'STRATEGY_LABELS' in line:
                print(f"{line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 324


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 300 <= line_num <= 380:
                print(f"{line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 4851


🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        
        lines = source_text.split('\n')
        for line in lines:
            if 'train_type' in line or 'val_type' in line or 'test_type' in line or 'stratif' in line.lower():
                print(f"{line_num}: {line}")
            line_num += 1

EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 1145


Pass explicit label IDs to classification_report for stable multi-class eval.

Lines 328 and 330 may raise ValueError when a class is absent in y_true/y_pred (common in imbalanced splits) because target_names is set without labels.

Suggested fix
+    label_ids = list(range(len(label_names)))
     print("\n=== Val (best checkpoint) ===")
-    print(classification_report(y_val,  val_preds,  target_names=label_names, zero_division=0))
+    print(classification_report(
+        y_val, val_preds, labels=label_ids, target_names=label_names, zero_division=0
+    ))
     print("\n=== Test (best checkpoint) ===")
-    print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))
+    print(classification_report(
+        y_test, test_preds, labels=label_ids, target_names=label_names, zero_division=0
+    ))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/04_bert_classification.ipynb` around lines 339 - 342, The
classification_report calls for validation and test (the calls using
classification_report(y_val, val_preds, target_names=label_names,
zero_division=0) and classification_report(y_test, test_preds,
target_names=label_names, zero_division=0)) can raise ValueError when some
classes are missing; fix by passing an explicit labels list to ensure stable
multi-class evaluation—compute the label id list (e.g., label_ids =
list(range(len(label_names))) or derive from your label-to-id mapping) and add
labels=label_ids to both classification_report calls so target_names aligns with
the provided labels.
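
To see when the mismatch arises, it helps to compare the classes that actually occur in a split with the full list of names. The sketch below is plain Python and does not call scikit-learn. `report_labels` is a hypothetical helper, and the ordering of `label_names` is an assumption (the notebook's real label-to-id mapping may differ).

```python
def report_labels(y_true, y_pred, label_names):
    """Return explicit label ids for `labels=`, plus names of absent classes."""
    label_ids = list(range(len(label_names)))
    present = set(y_true) | set(y_pred)
    absent = [label_names[i] for i in label_ids if i not in present]
    return label_ids, absent


# Assumed ordering of the six strategy classes.
label_names = [
    "sarcasm", "irony", "satire",
    "overstatement", "understatement", "rhetorical_question",
]

# Made-up split in which the rarest class (id 5) never appears.
y_val = [0, 1, 2, 3, 4, 0]
val_preds = [0, 1, 2, 3, 3, 0]

label_ids, absent = report_labels(y_val, val_preds, label_names)
print("labels to pass:", label_ids)
print("classes absent from this split:", absent)
```

When `absent` is non-empty, the label set inferred from the data is shorter than `label_names`, which is the situation the quoted `ValueError` describes. Passing `labels=label_ids` alongside `target_names=label_names` keeps the two the same length.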

Comment on lines +425 to +429
"def enc(df): return [label2id[l] for l in df[\"type_label\"]]\n",
"\n",
"X_train_type = train_type[\"text\"].tolist(); y_train_type = enc(train_type)\n",
"X_val_type = val_type[\"text\"].tolist(); y_val_type = enc(val_type)\n",
"X_test_type = test_type[\"text\"].tolist(); y_test_type = enc(test_type)\n",

⚠️ Potential issue | 🟠 Major

Validate split labels before encoding to prevent KeyError.

Line 425 directly indexes label2id for val/test labels. If any label exists outside train labels, the notebook fails mid-run.

Suggested fix
-def enc(df): return [label2id[l] for l in df["type_label"]]
+def enc(df: pd.DataFrame, split_name: str) -> list[int]:
+    unknown = sorted(set(df["type_label"]) - set(label2id))
+    if unknown:
+        raise ValueError(f"Unknown labels in {split_name} split: {unknown}")
+    return [label2id[label] for label in df["type_label"]]
 
-X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type)
-X_val_type   = val_type["text"].tolist();   y_val_type   = enc(val_type)
-X_test_type  = test_type["text"].tolist();  y_test_type  = enc(test_type)
+X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type, "train")
+X_val_type   = val_type["text"].tolist();   y_val_type   = enc(val_type, "val")
+X_test_type  = test_type["text"].tolist();  y_test_type  = enc(test_type, "test")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/04_bert_classification.ipynb` around lines 425 - 429, The encoder
function enc currently does direct lookup into label2id which will raise
KeyError if val/test contain labels unseen in train; before calling enc for
X_val_type/X_test_type validate that set(val_type["type_label"]) and
set(test_type["type_label"]) are subsets of label2id.keys(), and if not either
(a) raise a clear error listing unknown labels so the user can fix the splits,
or (b) map unknown labels to a default fallback id (e.g., label2id.get(label,
unknown_id)); update usages of enc and the variables X_val_type, y_val_type,
X_test_type, y_test_type accordingly and mention label2id and enc in the check
and handling logic.
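
Option (a), raising a clear error, can be exercised without pandas. The version below is a sketch that works on plain lists. `encode_labels` is a hypothetical name, the labels are made up, and building `label2id` from the sorted training labels is an assumption about how the notebook constructs its mapping.

```python
def encode_labels(labels, label2id, split_name):
    """Map labels to ids, failing with a readable message on unseen labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Unknown labels in {split_name} split: {unknown}")
    return [label2id[label] for label in labels]


train_labels = ["irony", "satire", "irony", "sarcasm"]
val_labels = ["satire", "irony"]
test_labels = ["sarcasm", "overstatement"]  # "overstatement" never seen in train

# Assumed construction of the mapping from training labels only.
label2id = {label: i for i, label in enumerate(sorted(set(train_labels)))}

print(encode_labels(val_labels, label2id, "val"))
try:
    encode_labels(test_labels, label2id, "test")
except ValueError as exc:
    print(exc)
```

Raising is usually preferable to mapping unknown labels to a fallback id here, because an unseen class in val or test points to a problem in the split itself.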

Comment on lines +333 to +336
" for _, row in error_sample_type.head(15).iterrows():\n",
" true_l = row[true_col_type]\n",
" pred_l = row.get(pred_col_type, \"?\")\n",
" print(f\" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}\")"

⚠️ Potential issue | 🟠 Major

Cast labels to strings before :20s formatting.

Line 336 can crash with ValueError when pred_col_type falls back to numeric IDs (predicted) and is formatted as string width.

Suggested fix
-    for _, row in error_sample_type.head(15).iterrows():
-        true_l = row[true_col_type]
-        pred_l = row.get(pred_col_type, "?")
-        print(f"  [True={true_l:20s}  Pred={pred_l:20s}] {row['text'][:90]}")
+    for _, row in error_sample_type.head(15).iterrows():
+        true_l = str(row[true_col_type])
+        pred_l = str(row.get(pred_col_type, "?"))
+        print(f"  [True={true_l:20s}  Pred={pred_l:20s}] {row['text'][:90]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@notebooks/05_error_analysis.ipynb` around lines 333 - 336, The formatting
expression in the loop over error_sample_type uses :20s on values that can be
numeric, causing ValueError; in the loop that reads true_l = row[true_col_type]
and pred_l = row.get(pred_col_type, "?"), cast both true_l and pred_l to strings
before applying the width format (e.g., convert via str(...) or f-string !s
conversion) so the print call print(f"  [True={...:20s}  Pred={...:20s}]
{row['text'][:90]}") never receives non-string types.

@SeeYangZhi force-pushed the main branch 4 times, most recently from a51c032 to 1ac8cca (March 4, 2026 02:44)
Development

Successfully merging this pull request may close these issues.

Add Naive Bayes, TF-IDF, distillBERT classifier models for sarcasm classification

1 participant