Feat/default models by Reallyeasy1 · Pull Request #1 · SeeYangZhi/Project-LLMao

Reallyeasy1 · 2026-03-03T08:38:54Z

Additional Notes:
Main change is in colab_package/

You can ignore notebooks/; those files are the source of truth from which colab_package/ is generated.

Closes #3

Summary by CodeRabbit

New Features
- Complete sarcasm classification experiment suite: data-prep, multiple model baselines (TF-IDF+LR, Naive Bayes, BERT/DistilBERT), training, evaluation, and error-analysis workflows; consolidated combined notebook for easy execution.
Documentation
- Added comprehensive experiment plan, dataset audit, and research summary detailing methodology, metrics, split strategy, and reproducibility checklist.
Chores
- Added dependency manifest and updated ignore rules to exclude local AI artifacts.

coderabbitai · 2026-03-03T08:39:16Z

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

💤 Files selected but had no reviewable changes (1)

colab_package/sarcasm_classification.ipynb

Walkthrough

Adds a full sarcasm-classification pipeline: data preparation, TF-IDF/Naive Bayes baselines, DistilBERT training, error analysis notebooks, documentation, a notebook-merging utility, dependency manifest, and .gitignore updates. All changes are new files; no public API surface modified.

Changes

Cohort / File(s)	Summary
Repo metadata `/.gitignore`, `requirements.txt`	Added `claude.md` and `.claude/` to .gitignore; introduced a requirements.txt declaring ML/NLP, plotting, and Jupyter dependencies.
Build Utility `build_combined.py`	New script that merges five notebooks into `colab_package/sarcasm_classification.ipynb`, normalizes cells, validates assembled code, and writes outputs with directory scaffolding.
Dataset & Splitting `notebooks/01_data_preparation.ipynb`	New notebook for data discovery, validation, duplicate/null checks, creation of binary/type datasets, group-aware train/val/test splitting, leakage checks, and saving datasets/splits/visualizations to outputs.
TF-IDF Baseline `notebooks/02_tfidf_lr_baseline.ipynb`	New TF-IDF + Logistic Regression notebook with GroupKFold grid search, hyperparameter grids, evaluation, confusion matrices, feature importance, and persisted metrics/predictions.
Naive Bayes Baseline `notebooks/03_naive_bayes_baseline.ipynb`	New notebook implementing multiple NB pipelines (Count/Tfidf + Multinomial/ComplementNB), grid search with GroupKFold, evaluation, and artifact saving for binary and type tasks.
BERT Training `notebooks/04_bert_classification.ipynb`	New PyTorch DistilBERT-based training notebook: `TrainConfig` dataclass, `HeadlineDataset`, class-weighting, training/validation/test loops with early stopping, checkpointing, and saved metrics/artifacts for binary and type tasks.
Error Analysis & Reports `notebooks/05_error_analysis.ipynb`	New notebook that loads model metrics/predictions, builds comparison tables, performs error extraction and heuristic failure-mode categorization, and writes markdown reports and CSV examples.
Documentation `docs/dataset_audit.md`, `docs/experiment_plan.md`, `docs/research_summary.md`	Added dataset audit, experiment plan (splitting, CV, hyperparameters, outputs), and research/methodology summary documents describing assumptions and evaluation protocols.

Sequence Diagram

sequenceDiagram
    participant User
    participant DataPrep as 01_DataPrep
    participant TFIDF as 02_TFIDF_LR
    participant NB as 03_NaiveBayes
    participant BERT as 04_BERT
    participant Error as 05_ErrorAnalysis
    participant Storage as Outputs

    User->>DataPrep: run data prep notebook
    DataPrep->>DataPrep: validate JSONL, derive labels, create splits
    DataPrep->>Storage: save datasets & splits

    par Run classical baselines
        TFIDF->>Storage: load splits
        TFIDF->>TFIDF: GridSearchCV (GroupKFold), evaluate
        TFIDF->>Storage: save metrics & preds

        NB->>Storage: load splits
        NB->>NB: GridSearchCV (GroupKFold), evaluate
        NB->>Storage: save metrics & preds
    end

    BERT->>Storage: load splits
    BERT->>BERT: tokenize, train (early stopping), evaluate
    BERT->>Storage: save checkpoints, metrics & preds

    Storage->>Error: provide metrics & predictions
    Error->>Error: compare models, extract errors, create reports
    Error->>Storage: save reports & visuals
    Storage-->>User: artifacts available

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A tiny rabbit hops with glee,
New notebooks blooming for all to see.
Data split, models train and play,
Errors hunted, insights on display.
Hooray—sarcasm tamed, one hop away! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The PR title 'Feat/default models' is vague and does not clearly convey the actual scope of changes. While the PR adds multiple notebooks, documentation, and build utilities, the title does not reflect these substantial additions.	Consider a more descriptive title that captures the main focus, such as 'Add sarcasm classification notebooks and experiment infrastructure' or similar to better represent the comprehensive changes.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai · 2026-03-03T08:41:02Z

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #2

@Reallyeasy1

Docstrings generation was requested by @Reallyeasy1.
- #1 (comment)

The following files were modified:
- `build_combined.py`

coderabbitai

Actionable comments posted: 14

🧹 Nitpick comments (3)

requirements.txt (1)

2-24: Use deterministic dependency pinning for reproducible runs.

Lines 2–24 use floating minimums only (>=), so reinstalls can drift and change model metrics over time. For this project, keep a pinned lock artifact (or fully pinned requirements.txt) and optionally keep a separate requirements.in for broad constraints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 2 - 24, The requirements.txt currently uses
permissive >= pins which allows dependency drift; replace it with a
deterministic, fully pinned requirements lock (use exact version pins like ==
for every package) or add a generated lock artifact (e.g., requirements.txt as
the locked file) while keeping a high-level requirements.in with >= constraints;
regenerate the lock via your tool of choice (pip-compile, pip freeze in a clean
env, or poetry/poetry.lock) and commit the locked requirements.txt so future
installs are reproducible and deterministic.
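
To make the difference between floating and pinned requirements concrete, here is a minimal sketch that flags every requirement line lacking an exact `==` pin. The helper name `find_unpinned` and the sample package lines are hypothetical; the actual contents of this PR's requirements.txt are not shown in the review.

```python
def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Hypothetical sample, for illustration only.
sample = """
# ML / NLP
scikit-learn>=1.3
pandas==2.2.2
torch>=2.0  # floating minimum
"""
print(find_unpinned(sample))
```

This is a heuristic only: it does not parse environment markers or extras, so a real check would more likely rely on a lock tool such as pip-compile.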

build_combined.py (1)

200-210: Remove unused out_dir_val parameter from content_cells_nb.

out_dir_val is never used, which makes call sites noisier and the helper harder to read.

Suggested fix

-def content_cells_nb(cells, out_dir_var, out_dir_val):
+def content_cells_nb(cells, out_dir_var):
@@
-for c in content_cells_nb(nb02, "OUT_TFIDF", "OUT_TFIDF"):
+for c in content_cells_nb(nb02, "OUT_TFIDF"):
@@
-for c in content_cells_nb(nb03, "OUT_NB", "OUT_NB"):
+for c in content_cells_nb(nb03, "OUT_NB"):
@@
-for c in content_cells_nb(nb04, "BERT_OUT", "BERT_OUT"):
+for c in content_cells_nb(nb04, "BERT_OUT"):
@@
-for c in content_cells_nb(nb05, "REPORTS_DIR", "REPORTS_DIR"):
+for c in content_cells_nb(nb05, "REPORTS_DIR"):

Also applies to: 241-241, 249-249, 257-257, 265-265

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@build_combined.py` around lines 200 - 210, The function content_cells_nb has
an unused parameter out_dir_val; remove out_dir_val from its signature and any
internal references, leaving only (cells, out_dir_var), and update every call
site that currently passes a third argument to call content_cells_nb with just
(cells, out_dir_var). Ensure any tests or helpers that referenced the old
three-arg form are updated accordingly; search for content_cells_nb usages in
this file to locate and fix all callers.

notebooks/05_error_analysis.ipynb (1)

577-581: Avoid hard-coded class-frequency claims in generated reports.

Line 580 hard-codes rhetorical_question = 3.7%, which can become incorrect when data changes and make the report stale.

Suggested fix

-    lines.append("4. **Class imbalance** (rhetorical_question = 3.7%) is a significant challenge")
+    if error_source_type is not None and "type_label" in error_source_type.columns:
+        freq = (error_source_type["type_label"] == "rhetorical_question").mean() * 100
+        lines.append(f"4. **Class imbalance** (rhetorical_question = {freq:.1f}%) is a significant challenge")
+    else:
+        lines.append("4. **Class imbalance** is a significant challenge")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@notebooks/05_error_analysis.ipynb` around lines 577 - 581, The report
currently hard-codes "rhetorical_question = 3.7%" inside a lines.append string;
instead compute the class frequency from the dataset and interpolate it into the
string. Find the lines.append call that contains "Class imbalance
(rhetorical_question = 3.7%)", calculate the percentage from the label column
(e.g., count_true / total * 100) and format it (e.g., f"{pct:.1f}%") before
passing the formatted string to lines.append so the report always reflects the
current data.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@build_combined.py`:
- Around line 282-289: When syntax errors are found (the errors list is
non-empty) stop the build instead of continuing: after printing the error
summary (the block that iterates over errors) call sys.exit(1) or raise
SystemExit so the script exits non‑zero and prevents subsequent writing of the
combined notebook; alternatively move the code that writes the combined notebook
so it only executes in the else branch where you currently print "syntax OK".
Refer to the errors variable and the all_cells check in build_combined.py and
ensure the combined-notebook write logic (the function or code that
serializes/writes the notebook) only runs when no syntax errors are present.
- Around line 35-138: SETUP_SRC is missing imports and DEVICE/Dataset/DataLoader
definitions needed by notebook part 04; update SETUP_SRC to import torch,
transformers (e.g., from transformers import AutoTokenizer, AutoModel, etc. as
used elsewhere), and tqdm (or tqdm.auto), and import Dataset and DataLoader from
torch.utils.data, then add DEVICE = torch.device("cuda" if
torch.cuda.is_available() else "cpu") (and set torch.manual_seed(SEED) after
SEED) so code referencing DEVICE, Dataset, DataLoader, torch, or transformers
resolves at runtime; modify the block that defines imports and SEED in SETUP_SRC
accordingly, keeping symbol names exactly as used (SETUP_SRC, SEED, DEVICE,
Dataset, DataLoader, torch, transformers, tqdm).

In `@docs/dataset_audit.md`:
- Around line 16-24: The markdown file has lint warnings: the two tables that
start with the header row "| Field | Type | Description |" need a blank line
before and after each table to satisfy MD058, and the fenced code block that
currently lacks a language tag should be updated to include an appropriate
language identifier (e.g., ```python) to satisfy MD040; locate the table blocks
and the fenced block in docs/dataset_audit.md (search for the table header
string and the triple-backtick fence) and insert the surrounding blank lines for
each table and add the language tag to the fence.

In `@docs/research_summary.md`:
- Around line 112-121: Surround the markdown table that begins with "The 6
strategy classes are **significantly imbalanced**:" (the type-distribution table
listing sarcasm, irony, satire, etc.) with a blank line before the paragraph
introducing the table and a blank line after the table end so there is an empty
line both above and below the table to satisfy markdownlint MD058.

In `@notebooks/01_data_preparation.ipynb`:
- Around line 350-353: Replace the runtime-only assert checks that verify split
integrity (e.g., "assert train_groups.isdisjoint(val_groups)...", "assert
train_groups.isdisjoint(test_groups)...", "assert
val_groups.isdisjoint(test_groups)...") with unconditional guard checks that
raise explicit exceptions (for example raise ValueError with the same message)
so these validations always run even when Python is optimized; do the same for
the other analogous assertions in the file (the checks referencing
train_groups/val_groups/test_groups around the later block) to ensure
deterministic error handling.
- Line 24: The _find_root function can pick a generic system-level directory
(e.g., a global "data" folder); update _find_root to only accept a candidate
parent as ROOT if it both contains one of the existing project markers
(outputs/, notebooks/, data/) and also looks like a repository/project root —
e.g., it contains a repo marker such as .git, pyproject.toml, setup.cfg, or
package.json — and insist the candidate is within the current project tree (one
of Path.cwd() or its parents) and not a top-level filesystem root or home
directory; change the acceptance logic in _find_root accordingly and leave the
fallback to data_file.parent unchanged (symbols to edit: _find_root, markers,
ROOT, OUT_DATASETS, OUT_SPLITS).
- Around line 191-193: Remove the redundant imports in Cell 10: delete the line
"from __future__ import annotations" (future imports must remain only at the
top-level/module start) and remove the duplicate "from urllib.parse import
urlparse" since both are already defined in the setup cell; ensure only the
setup cell retains these imports and rerun the notebook to confirm no missing
references to urlparse.
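
The split-integrity item above could be handled along these lines. This is a sketch, not the notebook's code: the function name `check_disjoint_groups` is hypothetical, and it assumes the group identifiers are sortable values such as URL strings. Unlike `assert`, an explicit `raise` still runs when Python is started with `-O`.

```python
def check_disjoint_groups(train_groups, val_groups, test_groups):
    """Raise ValueError if any group appears in more than one split."""
    pairs = {
        "train/val": set(train_groups) & set(val_groups),
        "train/test": set(train_groups) & set(test_groups),
        "val/test": set(val_groups) & set(test_groups),
    }
    for name, overlap in pairs.items():
        if overlap:
            raise ValueError(
                f"Group leakage between {name}: {len(overlap)} shared group(s), "
                f"e.g. {sorted(overlap)[:3]}"
            )


check_disjoint_groups({"a", "b"}, {"c"}, {"d"})  # passes silently
```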

In `@notebooks/02_tfidf_lr_baseline.ipynb`:
- Around line 184-185: The notebook currently refits the best estimator
(best_bin / best_estimator_) on train+val and then calls evaluate(..., X_val,
y_val, ...) which yields validation metrics computed on data used during refit;
instead compute and report true holdout validation metrics either by (a)
evaluating the selected estimator before refit (use the estimator from the CV
split or GridSearchCV.best_estimator_ saved prior to refit) on X_val/y_val, or
(b) compute validation metrics from cross-validation results (e.g.,
grid.cv_results_ or the scorer on X_val with a copy of the model trained only on
train set). Concretely, locate uses of best_bin / best_estimator_ and the
evaluate(...) calls and change the workflow so evaluation on X_val/y_val happens
with the model that was not refit on train+val (or derive metrics from
cv_results_) before performing any refit on the combined training+validation
data.
- Around line 420-431: The code currently overwrites the original semantic
"type_label" by assigning y_val_t / y_test_t into val_type_copy["type_label"]
and test_type_copy["type_label"]; instead, preserve the original type_label and
store numeric IDs in dedicated fields (e.g.,
val_type_copy["true_id"]/test_type_copy["true_id"] or
val_type_copy["type_id"]/test_type_copy["type_id"]), then set
val_type_copy["predicted_id"], val_type_copy["predicted_label"],
val_type_copy["true_label"], val_type_copy["correct"] (and the corresponding
test_type_copy fields) using y_val_t/y_test_t and
y_val_pred_type/y_test_pred_type so the original string "type_label" remains
unchanged.

In `@notebooks/03_naive_bayes_baseline.ipynb`:
- Around line 178-179: The code is evaluating validation performance after
refitting the selected estimator on train+val (best_estimator_), so those scores
are no longer true holdout metrics; fix by either (A) remove/skip the
evaluate(...) calls on X_val (e.g., drop evaluate(best_model_bin, X_val, y_val,
...) and evaluate(best_model_multi, X_val, ...) after refit) or (B) preserve a
pre-refit copy of the chosen estimator and evaluate that on X_val (use
sklearn.base.clone(best_estimator_) or set GridSearchCV(refit=False) and re-fit
a cloned estimator only on train+val for final test evaluation), ensuring
references to best_estimator_, best_model_bin and best_model_multi and the
evaluate(...) calls are updated accordingly.

In `@notebooks/04_bert_classification.ipynb`:
- Line 153: Guard against empty loaders before dividing by len(loader): wherever
you compute averages like total_loss / len(loader) (and similar metric sums)
first check if len(loader) == 0 and return a safe default (e.g., 0.0 or
float('nan')) or skip the averaging; implement this by computing denom =
len(loader) and if denom == 0 return 0.0 else return total_loss / denom, and
apply the same pattern to all other places that divide by len(loader) (the
total_loss / len(loader) occurrence and the other metric-averaging blocks
mentioned).
- Around line 425-429: The encoder function enc currently does direct lookup
into label2id which will raise KeyError if val/test contain labels unseen in
train; before calling enc for X_val_type/X_test_type validate that
set(val_type["type_label"]) and set(test_type["type_label"]) are subsets of
label2id.keys(), and if not either (a) raise a clear error listing unknown
labels so the user can fix the splits, or (b) map unknown labels to a default
fallback id (e.g., label2id.get(label, unknown_id)); update usages of enc and
the variables X_val_type, y_val_type, X_test_type, y_test_type accordingly and
mention label2id and enc in the check and handling logic.
- Around line 339-342: The classification_report calls for validation and test
(the calls using classification_report(y_val, val_preds,
target_names=label_names, zero_division=0) and classification_report(y_test,
test_preds, target_names=label_names, zero_division=0)) can raise ValueError
when some classes are missing; fix by passing an explicit labels list to ensure
stable multi-class evaluation—compute the label id list (e.g., label_ids =
list(range(len(label_names))) or derive from your label-to-id mapping) and add
labels=label_ids to both classification_report calls so target_names aligns with
the provided labels.
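
For the unseen-label item above, option (a) might look like the following sketch. The name `encode_labels` and the two-label mapping are hypothetical stand-ins for the notebook's `enc` and `label2id`; the point is to fail once, with the full list of unknown labels, rather than with a bare KeyError on the first one.

```python
def encode_labels(labels, label2id):
    """Map string labels to ids, failing early with the full list of unknown labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Labels not seen in training: {unknown}")
    return [label2id[label] for label in labels]


# Hypothetical label set, for illustration only.
label2id = {"irony": 0, "satire": 1}
print(encode_labels(["satire", "irony"], label2id))
```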

In `@notebooks/05_error_analysis.ipynb`:
- Around line 333-336: The formatting expression in the loop over
error_sample_type uses :20s on values that can be numeric, causing ValueError;
in the loop that reads true_l = row[true_col_type] and pred_l =
row.get(pred_col_type, "?"), cast both true_l and pred_l to strings before
applying the width format (e.g., convert via str(...) or f-string !s conversion)
so the print call print(f"  [True={...:20s}  Pred={...:20s}]
{row['text'][:90]}") never receives non-string types.
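
A small sketch of the string cast described above. The helper `format_row` and the sample headline are hypothetical; the relevant fact is that the `s` format code only accepts strings, so `f"{3:20s}"` raises ValueError while `f"{str(3):20s}"` does not.

```python
def format_row(true_label, pred_label, text, width=20, max_chars=90):
    """Format one error example; labels may be strings or numeric ids."""
    return (
        f"  [True={str(true_label):{width}s}  Pred={str(pred_label):{width}s}] "
        f"{text[:max_chars]}"
    )


# Mixed label types: a string on one side, a numeric id on the other.
print(format_row("irony", 3, "Hypothetical headline used only for illustration"))
```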

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c695478 and d9b7b0c.

📒 Files selected for processing (14)

.gitignore
build_combined.py
colab_package/sarcasm_classification.ipynb
colab_package/sarcasm_pairs_step35_clean.jsonl
docs/dataset_audit.md
docs/experiment_plan.md
docs/research_summary.md
notebooks/01_data_preparation.ipynb
notebooks/02_tfidf_lr_baseline.ipynb
notebooks/03_naive_bayes_baseline.ipynb
notebooks/04_bert_classification.ipynb
notebooks/05_error_analysis.ipynb
notebooks/sarcasm_pairs_step35_clean.jsonl
requirements.txt

coderabbitai · 2026-03-03T08:45:28Z

+SETUP_SRC = "\n".join([
+    "# ============================================================",
+    "# SETUP — imports, file upload, paths",
+    "# ============================================================",
+    "from __future__ import annotations",
+    "import json, hashlib, random, os, warnings, shutil",
+    "from dataclasses import dataclass",
+    "from pathlib import Path",
+    "from collections import Counter",
+    "from urllib.parse import urlparse",
+    "from typing import Optional",
+    "",
+    "import numpy as np",
+    "import pandas as pd",
+    "import matplotlib.pyplot as plt",
+    "import matplotlib.gridspec as gridspec",
+    "import seaborn as sns",
+    "",
+    "from sklearn.pipeline import Pipeline",
+    "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer",
+    "from sklearn.linear_model import LogisticRegression",
+    "from sklearn.naive_bayes import MultinomialNB, ComplementNB",
+    "from sklearn.model_selection import GridSearchCV, GroupKFold",
+    "from sklearn.metrics import (",
+    "    accuracy_score, precision_score, recall_score,",
+    "    f1_score, classification_report, confusion_matrix,",
+    ")",
+    "",
+    "warnings.filterwarnings('ignore')",
+    "",
+    "SEED = 42",
+    "random.seed(SEED)",
+    "np.random.seed(SEED)",
+    "",
+    "# ── Locate or upload the JSONL data file ─────────────────────────────────",
+    'FILENAME = "sarcasm_pairs_step35_clean.jsonl"',
+    "",
+    "def _locate_file(filename):",
+    "    candidates = []",
+    "    for root in [Path.cwd()] + list(Path.cwd().parents):",
+    "        for sub in [",
+    '            Path("data") / "processed" / filename,',
+    '            Path("data") / filename,',
+    "            Path(filename),",
+    "        ]:",
+    "            candidates.append(root / sub)",
+    "    for p in [",
+    '        Path("/content") / filename,',
+    '        Path("/mnt/data") / filename,',
+    "    ]:",
+    "        candidates.append(p)",
+    "    _c = Path('/content')",
+    "    for p in (_c.rglob(filename) if _c.exists() else []):",
+    "        candidates.append(p)",
+    "    for p in candidates:",
+    "        if p.is_file():",
+    "            return p",
+    "    return None",
+    "",
+    "print(f'cwd: {Path.cwd()}')",
+    "print(f'files in cwd: {[p.name for p in Path.cwd().iterdir()][:10]}')",
+    "",
+    "DATA_FILE = _locate_file(FILENAME)",
+    "if DATA_FILE is None:",
+    "    try:",
+    "        from google.colab import files as _cf",
+    '        print(f"Upload {FILENAME!r}:")',
+    "        _up = _cf.upload()",
+    "        if not _up:",
+    '            raise RuntimeError("No file uploaded.")',
+    "        _name = list(_up.keys())[0]",
+    '        DATA_FILE = Path("/content") / FILENAME',
+    "        if Path(_name) != DATA_FILE:",
+    "            shutil.move(_name, str(DATA_FILE))",
+    '        print(f"Saved to {DATA_FILE}")',
+    "    except ImportError:",
+    "        raise FileNotFoundError(",
+    '            f"Cannot find {FILENAME!r}. Place it in the same folder as this notebook."',
+    "        )",
+    "",
+    "# ── Project root + all output directories ────────────────────────────────",
+    "def _find_root(data_file):",
+    '    for parent in [data_file.parent] + list(data_file.parents):',
+    '        if any((parent / m).exists() for m in ["outputs","notebooks","data"]):',
+    "            return parent",
+    "    return data_file.parent",
+    "",
+    "ROOT = _find_root(DATA_FILE)",
+    "",
+    "OUT_DATASETS  = ROOT / 'outputs' / 'datasets'",
+    "OUT_SPLITS    = ROOT / 'outputs' / 'splits'",
+    "OUT_TFIDF     = ROOT / 'outputs' / 'classical' / 'tfidf_lr'",
+    "OUT_NB        = ROOT / 'outputs' / 'classical' / 'naive_bayes'",
+    "BERT_OUT      = ROOT / 'outputs' / 'bert'",
+    "REPORTS_DIR   = ROOT / 'outputs' / 'reports'",
+    "SPLITS        = OUT_SPLITS",
+    "",
+    "for d in [OUT_DATASETS, OUT_SPLITS, OUT_TFIDF, OUT_NB, BERT_OUT, REPORTS_DIR]:",
+    "    d.mkdir(parents=True, exist_ok=True)",
+    "",
+    'print(f"Data   : {DATA_FILE}")',
+    'print(f"Root   : {ROOT}")',
+    'print(f"Output : {ROOT / \'outputs\'}")',
+]) + "\n"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 85

🏁 Script executed:

wc -l build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 89

🏁 Script executed:

cat -n build_combined.py

Repository: SeeYangZhi/Project-LLMao

Length of output: 13265

🏁 Script executed:

fd "04_bert_classification.ipynb"

Repository: SeeYangZhi/Project-LLMao

Length of output: 106

🏁 Script executed:

cat notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys

nb = json.load(sys.stdin)
for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 50 lines
            lines = src.split("\n")[:50]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 50:
                print(" ...")
        else:
            print(src[:200])
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 788

🏁 Script executed:

python3 << 'EOF'
import json
from pathlib import Path

nb_path = Path("notebooks/04_bert_classification.ipynb")
with open(nb_path, 'r') as f:
    nb = json.load(f)

for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 80 lines
            lines = src.split("\n")[:80]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 80:
                print(" ...")
        else:
            print(src[:300])
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 11941

Add missing torch, transformers, and tqdm imports plus DEVICE initialization to SETUP_SRC.

The combined notebook skips notebook 04's setup cell (filtered by is_setup_cell), but SETUP_SRC does not define torch, transformers imports, Dataset, DataLoader, or DEVICE. Part 4 will fail at runtime (e.g., NameError: name 'DEVICE' is not defined, NameError: name 'Dataset' is not defined).

Suggested fix

 SETUP_SRC = "\n".join([
     "import seaborn as sns",
+    "import torch",
+    "from torch import nn",
+    "from torch.utils.data import Dataset, DataLoader",
+    "from transformers import (",
+    "    AutoTokenizer,",
+    "    AutoModelForSequenceClassification,",
+    "    get_linear_schedule_with_warmup,",
+    ")",
+    "from tqdm.auto import tqdm",
     "",
     "warnings.filterwarnings('ignore')",
     "",
     "SEED = 42",
     "random.seed(SEED)",
     "np.random.seed(SEED)",
+    "torch.manual_seed(SEED)",
+    "if torch.cuda.is_available():",
+    "    torch.cuda.manual_seed_all(SEED)",
+    "    torch.backends.cudnn.deterministic = True",
+    "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')",
+    "print(f'Device: {DEVICE}')",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@build_combined.py` around lines 35 - 138, SETUP_SRC is missing imports and DEVICE/Dataset/DataLoader definitions needed by notebook part 04; update SETUP_SRC to import torch, transformers (e.g., from transformers import AutoTokenizer, AutoModel, etc. as used elsewhere), and tqdm (or tqdm.auto), and import Dataset and DataLoader from torch.utils.data, then add DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu") (and set torch.manual_seed(SEED) after SEED) so code referencing DEVICE, Dataset, DataLoader, torch, or transformers resolves at runtime; modify the block that defines imports and SEED in SETUP_SRC accordingly, keeping symbol names exactly as used (SETUP_SRC, SEED, DEVICE, Dataset, DataLoader, torch, transformers, tqdm).
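
One way to catch this class of problem before running the combined notebook is to check statically which names a setup source binds. The sketch below is an assumption-laden illustration, not part of build_combined.py: the helpers `defined_names` and `missing_names` are hypothetical, only top-level statements are inspected, and the setup string is an abbreviated stand-in for SETUP_SRC.

```python
import ast


def defined_names(source: str) -> set[str]:
    """Collect top-level names bound by imports, assignments, defs, and classes."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as x" binds "x"
                names.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
    return names


def missing_names(setup_source: str, required: list[str]) -> list[str]:
    """Return the required names that the setup source never binds."""
    bound = defined_names(setup_source)
    return [name for name in required if name not in bound]


# Abbreviated stand-in for SETUP_SRC; the required list mirrors the names in this comment.
setup_src = "import numpy as np\nSEED = 42\n"
required = ["np", "SEED", "torch", "DEVICE", "Dataset", "DataLoader"]
print(missing_names(setup_src, required))
```

Because `ast.parse` only parses the text, the check runs without torch or transformers installed.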

coderabbitai · 2026-03-03T08:45:28Z

+if errors:
+    print(f"\n{len(errors)} syntax error(s):")
+    for i, ln, msg, line in errors:
+        print(f"  Cell {i:3d} line {ln}: {line!r}")
+        print(f"           {msg}")
+else:
+    print(f"All {len(all_cells)} cells: syntax OK")
+


⚠️ Potential issue | 🟠 Major

Fail the build when syntax errors are detected.

Currently the script prints syntax errors but still writes the combined notebook, which can ship broken artifacts.

Suggested fix

 if errors:
     print(f"\n{len(errors)} syntax error(s):")
     for i, ln, msg, line in errors:
         print(f"  Cell {i:3d} line {ln}: {line!r}")
         print(f"           {msg}")
+    raise RuntimeError("Aborting: combined notebook contains syntax errors.")
 else:
     print(f"All {len(all_cells)} cells: syntax OK")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-if errors:
-    print(f"\n{len(errors)} syntax error(s):")
-    for i, ln, msg, line in errors:
-        print(f"  Cell {i:3d} line {ln}: {line!r}")
-        print(f"           {msg}")
-else:
-    print(f"All {len(all_cells)} cells: syntax OK")
+if errors:
+    print(f"\n{len(errors)} syntax error(s):")
+    for i, ln, msg, line in errors:
+        print(f"  Cell {i:3d} line {ln}: {line!r}")
+        print(f"           {msg}")
+    raise RuntimeError("Aborting: combined notebook contains syntax errors.")
+else:
+    print(f"All {len(all_cells)} cells: syntax OK")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@build_combined.py` around lines 282 - 289, When syntax errors are found (the errors list is non-empty) stop the build instead of continuing: after printing the error summary (the block that iterates over errors) call sys.exit(1) or raise SystemExit so the script exits non‑zero and prevents subsequent writing of the combined notebook; alternatively move the code that writes the combined notebook so it only executes in the else branch where you currently print "syntax OK". Refer to the errors variable and the all_cells check in build_combined.py and ensure the combined-notebook write logic (the function or code that serializes/writes the notebook) only runs when no syntax errors are present.
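A minimal sketch of the "exit non-zero before writing" variant described above, assuming each cell is available as a plain source string. The helper names `find_syntax_errors` and `check_or_exit` are hypothetical and not taken from `build_combined.py`; cells that contain IPython magics would need to be filtered out first, since `ast.parse` rejects them.

```python
import ast
import sys


def find_syntax_errors(cell_sources):
    """Return (cell_index, line_number, message) for every cell that fails to parse."""
    errors = []
    for i, src in enumerate(cell_sources):
        try:
            ast.parse(src)
        except SyntaxError as exc:
            errors.append((i, exc.lineno, exc.msg))
    return errors


def check_or_exit(cell_sources):
    """Print a summary and exit with status 1 if any cell has a syntax error."""
    errors = find_syntax_errors(cell_sources)
    if errors:
        print(f"\n{len(errors)} syntax error(s):")
        for i, ln, msg in errors:
            print(f"  Cell {i:3d} line {ln}: {msg}")
        sys.exit(1)  # stop here so a broken notebook is never written
    print(f"All {len(cell_sources)} cells: syntax OK")
```

Calling `check_or_exit(...)` before the code that serializes the notebook would keep a failed check from producing an artifact; raising `RuntimeError`, as in the suggested fix, has the same effect.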

coderabbitai · 2026-03-03T08:45:28Z

+| Field | Type | Description |
+|-------|------|-------------|
+| `original_headline` | string | The source headline (sarcastic OR non-sarcastic depending on `type`) |
+| `generated_headline` | string | The transformed headline (the opposite label) |
+| `strategy` | string | Sarcasm type/strategy label (6 values, see below) |
+| `type` | string | Direction of transformation: `sarcastic_to_non` or `non_to_sarcastic` |
+| `model_used` | string | The LLM that generated the rewrite |
+| `article_link` | URL string | Source article URL; used for grouping |
+


⚠️ Potential issue | 🟡 Minor

Fix markdownlint warnings in tables and code fences.

Line 16/Line 73 tables need surrounding blank lines (MD058), and the fenced block at Line 106 should include a language tag (MD040), e.g. ```python.

Also applies to: 73-76, 106-114

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/dataset_audit.md` around lines 16 - 24, The markdown file has lint warnings: the two tables that start with the header row "| Field | Type | Description |" need a blank line before and after each table to satisfy MD058, and the fenced code block that currently lacks a language tag should be updated to include an appropriate language identifier (e.g., ```python) to satisfy MD040; locate the table blocks and the fenced block in docs/dataset_audit.md (search for the table header string and the triple-backtick fence) and insert the surrounding blank lines for each table and add the language tag to the fence.
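As a rough illustration of the MD058 fix, the sketch below inserts a blank line before and after any run of lines that start with `|`. The function name is hypothetical, and the approach assumes no pipe-prefixed lines appear inside fenced code blocks; it is not a replacement for markdownlint's own fixer.

```python
def surround_tables_with_blank_lines(text):
    """Insert blank lines around Markdown tables (runs of lines starting with '|')."""
    lines = text.split("\n")
    out = []
    for idx, line in enumerate(lines):
        is_row = line.lstrip().startswith("|")
        prev_is_row = idx > 0 and lines[idx - 1].lstrip().startswith("|")
        # A table starts right after non-blank text: add a blank line before it.
        if is_row and not prev_is_row and out and out[-1].strip():
            out.append("")
        # A table is followed directly by non-blank text: add a blank line after it.
        if not is_row and prev_is_row and line.strip():
            out.append("")
        out.append(line)
    return "\n".join(out)
```

Running the function on already-compliant text leaves it unchanged, so it can be applied repeatedly.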

coderabbitai · 2026-03-03T08:45:28Z

+The 6 strategy classes are **significantly imbalanced**:
+| Strategy | Count (approx) |
+|----------|---------------|
+| sarcasm | ~8,699 |
+| irony | ~6,102 |
+| satire | ~5,224 |
+| overstatement | ~3,976 |
+| understatement | ~3,295 |
+| rhetorical_question | ~1,037 |
+


⚠️ Potential issue | 🟡 Minor

Add blank lines around the type-distribution table.

The table starting at Line 113 should be surrounded by blank lines to satisfy markdownlint MD058.

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 113-113: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/research_summary.md` around lines 112 - 121, Surround the markdown table that begins with "The 6 strategy classes are **significantly imbalanced**:" (the type-distribution table listing sarcasm, irony, satire, etc.) with a blank line before the paragraph introducing the table and a blank line after the table end so there is an empty line both above and below the table to satisfy markdownlint MD058.

coderabbitai · 2026-03-03T08:45:28Z

+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os; from pathlib import Path as _P; print(\"cwd:\", _P.cwd()); print(\"files:\", list(_P.cwd().iterdir())[:10])\n\nimport json, hashlib, random, os\nfrom pathlib import Path\nfrom collections import Counter\nfrom urllib.parse import urlparse\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nSEED = 42\nrandom.seed(SEED)\nnp.random.seed(SEED)\n\nFILENAME = \"sarcasm_pairs_step35_clean.jsonl\"\n\n# \u2500\u2500 Step 1: locate or upload the data file \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _locate_file(filename):\n    \"\"\"Search common locations; return Path if found, else None.\"\"\"\n    candidates = []\n    for root in [Path.cwd()] + list(Path.cwd().parents):\n        for sub in [\n            Path(\"data\") / \"processed\" / filename,\n            Path(\"data\") / filename,\n            Path(filename),\n        ]:\n            candidates.append(root / sub)\n    for p in [\n        Path(\"/content\") / filename,\n        Path(\"/mnt/data\") / filename,\n        Path(\"/content/drive/MyDrive\") / filename,\n    ]:\n        candidates.append(p)\n    _content = Path(\"/content\")\n    for p in (_content.rglob(filename) if _content.exists() else []):\n        candidates.append(p)\n    for p in candidates:\n        if p.is_file():\n            return p\n    return None\n\nDATA_FILE = _locate_file(FILENAME)\n\nif DATA_FILE is None:\n    # \u2500\u2500 Auto-upload in Colab \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n    try:\n        from google.colab import files as 
_colab_files\n        print(f\"{FILENAME!r} not found. Please upload it now:\")\n        _uploaded = _colab_files.upload()\n        if not _uploaded:\n            raise RuntimeError(\"No file uploaded. Please upload the JSONL file and re-run.\")\n        _fname = list(_uploaded.keys())[0]\n        if not _fname.endswith(FILENAME.split('/')[-1]):\n            raise ValueError(\n                f\"Wrong file uploaded: {_fname!r}. Expected {FILENAME!r}.\"\n            )\n        DATA_FILE = Path(\"/content\") / FILENAME\n        if Path(_fname) != DATA_FILE:\n            import shutil\n            shutil.move(_fname, str(DATA_FILE))\n        print(f\"Uploaded to {DATA_FILE}\")\n    except ImportError:\n        # Not in Colab\n        raise FileNotFoundError(\n            f\"Cannot find {FILENAME!r}.\\n\"\n            f\"cwd: {Path.cwd()}\\n\"\n            \"Place the file at: <project_root>/data/processed/\" + FILENAME\n        )\n\n# \u2500\u2500 Step 2: determine project root \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _find_root(data_file):\n    \"\"\"Walk up from data_file to find the project root (contains outputs/ or notebooks/).\"\"\"\n    markers = [\n        Path(\"outputs\"),\n        Path(\"notebooks\"),\n        Path(\"data\"),\n    ]\n    for parent in [data_file.parent] + list(data_file.parents):\n        if any((parent / m).exists() for m in markers):\n            return parent\n    return data_file.parent  # fallback\n\nROOT = _find_root(DATA_FILE)\n\nOUT_DATASETS = ROOT / \"outputs\" / \"datasets\"\nOUT_SPLITS   = ROOT / \"outputs\" / \"splits\"\nOUT_DATASETS.mkdir(parents=True, exist_ok=True)\nOUT_SPLITS.mkdir(parents=True, exist_ok=True)\n\nprint(f\"Dataset : {DATA_FILE}\")\nprint(f\"Root    : {ROOT}\")\nprint(f\"Outputs : 
{OUT_DATASETS}\")\n"


⚠️ Potential issue | 🟠 Major

Harden root detection to avoid writing outside the project tree.

In Line 24, _find_root can match overly generic parents (e.g., a system-level data dir), which risks setting ROOT incorrectly and writing outputs in unintended locations.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/01_data_preparation.ipynb` at line 24, The _find_root function can pick a generic system-level directory (e.g., a global "data" folder); update _find_root to only accept a candidate parent as ROOT if it both contains one of the existing project markers (outputs/, notebooks/, data/) and also looks like a repository/project root — e.g., it contains a repo marker such as .git, pyproject.toml, setup.cfg, or package.json — and insist the candidate is within the current project tree (one of Path.cwd() or its parents) and not a top-level filesystem root or home directory; change the acceptance logic in _find_root accordingly and leave the fallback to data_file.parent unchanged (symbols to edit: _find_root, markers, ROOT, OUT_DATASETS, OUT_SPLITS).
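One way the stricter acceptance logic could look, as a sketch only. The repo markers come from the prompt above; the function name `find_root`, the `start` parameter, and the constant names are assumptions made for illustration.

```python
from pathlib import Path

PROJECT_DIRS = ("outputs", "notebooks", "data")
REPO_MARKERS = (".git", "pyproject.toml", "setup.cfg", "package.json")


def find_root(data_file, start=None):
    """Return the nearest ancestor that looks like a project root, else data_file's folder."""
    data_file = Path(data_file).resolve()
    start = Path(start).resolve() if start is not None else Path.cwd().resolve()
    allowed = {start, *start.parents}  # only the working directory and its ancestors
    forbidden = {Path(start.anchor), Path.home().resolve()}  # never filesystem root or home
    for parent in data_file.parents:
        if parent not in allowed or parent in forbidden:
            continue
        has_project_dir = any((parent / d).is_dir() for d in PROJECT_DIRS)
        has_repo_marker = any((parent / m).exists() for m in REPO_MARKERS)
        if has_project_dir and has_repo_marker:
            return parent
    return data_file.parent  # same fallback as the original _find_root
```

Requiring both a project directory and a repo marker means a stray system-level `data/` folder no longer qualifies on its own.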

coderabbitai · 2026-03-03T08:45:28Z

+    "val_m_bin,  y_val_pred_bin  = evaluate(best_model_bin, X_val,  y_val,  label_names_bin, \"Val\")\n",
+    "test_m_bin, y_test_pred_bin = evaluate(best_model_bin, X_test, y_test, label_names_bin, \"Test\")\n",


⚠️ Potential issue | 🟠 Major

Validation evaluation is leaked after refit on train+val.

At Line 176/Line 331, best_estimator_ has already been fit on train+val, so Line 178-179 and Line 333-334 are not true holdout validation results.
Either (a) skip post-refit val scoring, or (b) tune on train-only and keep val truly held out.

Also applies to: 333-334

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/03_naive_bayes_baseline.ipynb` around lines 178 - 179, The code is evaluating validation performance after refitting the selected estimator on train+val (best_estimator_), so those scores are no longer true holdout metrics; fix by either (A) remove/skip the evaluate(...) calls on X_val (e.g., drop evaluate(best_model_bin, X_val, y_val, ...) and evaluate(best_model_multi, X_val, ...) after refit) or (B) preserve a pre-refit copy of the chosen estimator and evaluate that on X_val (use sklearn.base.clone(best_estimator_) or set GridSearchCV(refit=False) and re-fit a cloned estimator only on train+val for final test evaluation), ensuring references to best_estimator_, best_model_bin and best_model_multi and the evaluate(...) calls are updated accordingly.
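A hedged sketch of option (B): score validation with a copy fitted on train only, then refit a second copy on train+val for the final test evaluation. The helper name `score_val_then_refit` and the toy data are illustrative assumptions; the notebook's own `evaluate(...)` helper and grid-search setup are not reproduced here.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def score_val_then_refit(estimator, X_train, y_train, X_val, y_val):
    """Score on a truly held-out val split, then refit on train+val for testing."""
    # Fit a fresh copy on train only, so val is unseen when it is scored.
    val_model = clone(estimator).fit(X_train, y_train)
    val_f1 = f1_score(y_val, val_model.predict(X_val), average="macro", zero_division=0)

    # Refit another fresh copy on train+val; use this one only on the test split.
    final_model = clone(estimator).fit(
        list(X_train) + list(X_val), list(y_train) + list(y_val)
    )
    return val_f1, final_model


# Toy data, purely illustrative.
X_train = ["great job breaking it", "city council meets today",
           "oh sure, that will work", "rain expected friday"]
y_train = [1, 0, 1, 0]
X_val = ["great, that will work", "council meets friday"]
y_val = [1, 0]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
val_f1, final_model = score_val_then_refit(pipeline, X_train, y_train, X_val, y_val)
```

In the notebook, the estimator passed in would presumably carry the hyperparameters selected by the search, for example via `clone(pipeline).set_params(**search.best_params_)`.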

coderabbitai · 2026-03-03T08:45:28Z

+    "        scheduler.step()\n",
+    "\n",
+    "        total_loss += loss.item()\n",
+    "    return total_loss / len(loader)\n",


⚠️ Potential issue | 🟠 Major

Guard empty splits to avoid division-by-zero in epoch metrics.

Line 153 and Line 178 divide by len(loader). Any empty split (or filtered split) crashes training/evaluation.

Suggested fix

 def train_bert(
@@
 ) -> dict:
@@
+    for split_name, X, y in [
+        ("train", X_train, y_train),
+        ("val", X_val, y_val),
+        ("test", X_test, y_test),
+    ]:
+        if len(X) == 0 or len(y) == 0:
+            raise ValueError(f"{split_name} split is empty; cannot train/evaluate.")
+
     # Datasets
     train_ds = HeadlineDataset(X_train, y_train, tokenizer, cfg.max_length)

Also applies to: 178-178, 239-247

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/04_bert_classification.ipynb` at line 153, Guard against empty loaders before dividing by len(loader): wherever you compute averages like total_loss / len(loader) (and similar metric sums) first check if len(loader) == 0 and return a safe default (e.g., 0.0 or float('nan')) or skip the averaging; implement this by computing denom = len(loader) and if denom == 0 return 0.0 else return total_loss / denom, and apply the same pattern to all other places that divide by len(loader) (the total_loss / len(loader) occurrence and the other metric-averaging blocks mentioned).
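The two guards discussed above can be sketched without any PyTorch dependency. Both helper names are hypothetical: `mean_epoch_loss` stands in for the `total_loss / len(loader)` expression, and `require_non_empty` mirrors the fail-fast check from the suggested fix.

```python
import math


def mean_epoch_loss(batch_losses):
    """Average per-batch losses; return NaN instead of dividing by zero."""
    losses = list(batch_losses)
    if not losses:
        return math.nan  # empty loader: no meaningful average exists
    return sum(losses) / len(losses)


def require_non_empty(**splits):
    """Raise early, with a clear message, if any named split has no examples."""
    for name, items in splits.items():
        if len(items) == 0:
            raise ValueError(f"{name} split is empty; cannot train/evaluate.")
```

Failing fast with `require_non_empty(train=X_train, val=X_val, test=X_test)` is arguably the safer default, since a silent `0.0` or `NaN` loss can hide a broken split until much later in the run.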

coderabbitai · 2026-03-03T08:45:28Z

+    "    print(\"\\n=== Val (best checkpoint) ===\")\n",
+    "    print(classification_report(y_val,  val_preds,  target_names=label_names, zero_division=0))\n",
+    "    print(\"\\n=== Test (best checkpoint) ===\")\n",
+    "    print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))\n",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "04_bert_classification.ipynb" -type f

Repository: SeeYangZhi/Project-LLMao

Length of output: 108

🏁 Script executed:

cat ./notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys

data = json.load(sys.stdin)
cells = data.get('cells', [])

# Flatten all code and track line numbers
line_num = 1
for cell in cells:
    if cell.get('cell_type') == 'code':
        source = cell.get('source', [])
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 330 <= line_num <= 350:
                print(f"{line_num}: {line}")
            line_num += 1

print(f"\nTotal lines processed: {line_num - 1}", file=sys.stderr)
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 788

🏁 Script executed:

jq '.cells[] | select(.cell_type == "code") | .source' ./notebooks/04_bert_classification.ipynb | head -200

Repository: SeeYangZhi/Project-LLMao

Length of output: 10283

🏁 Script executed:

jq -r '.cells[] | select(.cell_type == "code") | .source | join("")' ./notebooks/04_bert_classification.ipynb > /tmp/notebook_code.txt
grep -n "Val (best checkpoint)" /tmp/notebook_code.txt

Repository: SeeYangZhi/Project-LLMao

Length of output: 172

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 'Val (best checkpoint)' in line or 'Test (best checkpoint)' in line:
                print(f"Line {line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 178

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 320 <= line_num <= 335:
                print(f"{line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 985

🌐 Web query:

sklearn classification_report target_names labels parameter ValueError imbalanced

💡 Result:

In sklearn.metrics.classification_report, target_names must be the same length and order as the labels that the report is computed over. If you don’t pass labels=..., scikit-learn infers the label set from the classes that actually appear in y_true and/or y_pred (so in imbalanced splits, some classes may be missing), which commonly makes len(target_names) not match and triggers:

ValueError: Number of classes, X, does not match size of target_names, Y. Try specifying the labels parameter (sklearn.org)

Fix (robust for imbalanced / missing classes)

Pass both labels (all classes you want shown, in a fixed order) and matching target_names:

from sklearn.metrics import classification_report

labels = list(le.classes_)        # or [0,1,2,...] in the order you want
target_names = list(le.classes_)  # same length + same order as labels

print(classification_report(
    y_true,
    y_pred,
    labels=labels,
    target_names=target_names,
    zero_division=0
))

labels controls which classes are included. (sklearn.org)

target_names must “match the labels (same order)” and have shape (n_labels,). (sklearn.org)

Alternative

If you don’t need custom names, omit target_names entirely and let scikit-learn print the raw label values. (sklearn.org)

Citations:

1: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai

2: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai

3: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai

4: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 'label_names' in line and ('=' in line or 'def' in line):
                print(f"{line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 1260

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 'STRATEGY_LABELS' in line:
                print(f"{line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 324

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 300 <= line_num <= 380:
                print(f"{line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 4851

🏁 Script executed:

python3 << 'EOF'
import json

with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
    nb = json.load(f)

line_num = 1
for cell in nb['cells']:
    if cell['cell_type'] == 'code':
        source = cell['source']
        if isinstance(source, list):
            source_text = ''.join(source)
        else:
            source_text = source
        lines = source_text.split('\n')
        for line in lines:
            if 'train_type' in line or 'val_type' in line or 'test_type' in line or 'stratif' in line.lower():
                print(f"{line_num}: {line}")
            line_num += 1
EOF

Repository: SeeYangZhi/Project-LLMao

Length of output: 1145

Pass explicit label IDs to classification_report for stable multi-class eval.

Lines 328 and 330 may raise ValueError when a class is absent in y_true/y_pred (common in imbalanced splits) because target_names is set without labels.

Suggested fix

+    label_ids = list(range(len(label_names)))
     print("\n=== Val (best checkpoint) ===")
-    print(classification_report(y_val,  val_preds,  target_names=label_names, zero_division=0))
+    print(classification_report(
+        y_val, val_preds, labels=label_ids, target_names=label_names, zero_division=0
+    ))
     print("\n=== Test (best checkpoint) ===")
-    print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))
+    print(classification_report(
+        y_test, test_preds, labels=label_ids, target_names=label_names, zero_division=0
+    ))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/04_bert_classification.ipynb` around lines 339 - 342, The classification_report calls for validation and test (the calls using classification_report(y_val, val_preds, target_names=label_names, zero_division=0) and classification_report(y_test, test_preds, target_names=label_names, zero_division=0)) can raise ValueError when some classes are missing; fix by passing an explicit labels list to ensure stable multi-class evaluation—compute the label id list (e.g., label_ids = list(range(len(label_names))) or derive from your label-to-id mapping) and add labels=label_ids to both classification_report calls so target_names aligns with the provided labels.
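A small self-contained illustration of the failure mode and the fix, assuming scikit-learn is installed. The three label names and the toy predictions are made up for the example; class 2 is deliberately absent from both `y_true` and `y_pred`.

```python
from sklearn.metrics import classification_report

label_names = ["sarcasm", "irony", "satire"]   # illustrative subset of class names
label_ids = list(range(len(label_names)))      # [0, 1, 2], fixed order

y_true = [0, 0, 1, 1]   # class 2 ("satire") never appears in this split
y_pred = [0, 1, 1, 1]

# With explicit labels, the absent class is reported with support 0
# instead of causing a length mismatch against target_names.
report = classification_report(
    y_true, y_pred, labels=label_ids, target_names=label_names, zero_division=0
)
print(report)
```

Without `labels=label_ids`, scikit-learn infers only two classes from the data and the same call raises `ValueError` because `target_names` has three entries.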

coderabbitai · 2026-03-03T08:45:28Z

+    "def enc(df): return [label2id[l] for l in df[\"type_label\"]]\n",
+    "\n",
+    "X_train_type = train_type[\"text\"].tolist(); y_train_type = enc(train_type)\n",
+    "X_val_type   = val_type[\"text\"].tolist();   y_val_type   = enc(val_type)\n",
+    "X_test_type  = test_type[\"text\"].tolist();  y_test_type  = enc(test_type)\n",


⚠️ Potential issue | 🟠 Major

Validate split labels before encoding to prevent KeyError.

Line 425 directly indexes label2id for val/test labels. If any label exists outside train labels, the notebook fails mid-run.

Suggested fix

-def enc(df): return [label2id[l] for l in df["type_label"]]
+def enc(df: pd.DataFrame, split_name: str) -> list[int]:
+    unknown = sorted(set(df["type_label"]) - set(label2id))
+    if unknown:
+        raise ValueError(f"Unknown labels in {split_name} split: {unknown}")
+    return [label2id[label] for label in df["type_label"]]
 
-X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type)
-X_val_type   = val_type["text"].tolist();   y_val_type   = enc(val_type)
-X_test_type  = test_type["text"].tolist();  y_test_type  = enc(test_type)
+X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type, "train")
+X_val_type   = val_type["text"].tolist();   y_val_type   = enc(val_type, "val")
+X_test_type  = test_type["text"].tolist();  y_test_type  = enc(test_type, "test")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/04_bert_classification.ipynb` around lines 425 - 429, The encoder function enc currently does direct lookup into label2id which will raise KeyError if val/test contain labels unseen in train; before calling enc for X_val_type/X_test_type validate that set(val_type["type_label"]) and set(test_type["type_label"]) are subsets of label2id.keys(), and if not either (a) raise a clear error listing unknown labels so the user can fix the splits, or (b) map unknown labels to a default fallback id (e.g., label2id.get(label, unknown_id)); update usages of enc and the variables X_val_type, y_val_type, X_test_type, y_test_type accordingly and mention label2id and enc in the check and handling logic.
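The "raise a clear error" option can be shown on plain lists, without pandas. The function name `encode_labels` is hypothetical; it plays the role of `enc` but takes the label column directly.

```python
def encode_labels(labels, label2id, split_name):
    """Map string labels to ids, failing with a readable message on unseen labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"Unknown labels in {split_name} split: {unknown}")
    return [label2id[label] for label in labels]


# label2id is built from the training labels only.
label2id = {"irony": 0, "sarcasm": 1}
print(encode_labels(["sarcasm", "irony"], label2id, "val"))
```

Raising is probably preferable to a fallback id here, because an unseen label usually signals a broken split rather than a case the model should absorb.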

coderabbitai · 2026-03-03T08:45:28Z

+    "    for _, row in error_sample_type.head(15).iterrows():\n",
+    "        true_l = row[true_col_type]\n",
+    "        pred_l = row.get(pred_col_type, \"?\")\n",
+    "        print(f\"  [True={true_l:20s}  Pred={pred_l:20s}] {row['text'][:90]}\")"


⚠️ Potential issue | 🟠 Major

Cast labels to strings before :20s formatting.

Line 336 can raise ValueError when pred_col_type falls back to the numeric-ID column (predicted): the :20s format spec only accepts strings, so an integer label cannot be formatted with it.

Suggested fix

-    for _, row in error_sample_type.head(15).iterrows():
-        true_l = row[true_col_type]
-        pred_l = row.get(pred_col_type, "?")
-        print(f"  [True={true_l:20s}  Pred={pred_l:20s}] {row['text'][:90]}")
+    for _, row in error_sample_type.head(15).iterrows():
+        true_l = str(row[true_col_type])
+        pred_l = str(row.get(pred_col_type, "?"))
+        print(f"  [True={true_l:20s}  Pred={pred_l:20s}] {row['text'][:90]}")


🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@notebooks/05_error_analysis.ipynb` around lines 333 - 336, The formatting expression in the loop over error_sample_type uses :20s on values that can be numeric, causing ValueError; in the loop that reads true_l = row[true_col_type] and pred_l = row.get(pred_col_type, "?"), cast both true_l and pred_l to strings before applying the width format (e.g., convert via str(...) or f-string !s conversion) so the print call print(f" [True={...:20s} Pred={...:20s}] {row['text'][:90]}") never receives non-string types.
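A minimal sketch of the cast-before-format fix, pulled out into a hypothetical helper `format_error_row` so it can be checked in isolation. The `width` and `max_chars` parameters are assumptions that mirror the `:20s` and `[:90]` values in the notebook line.

```python
def format_error_row(true_label, pred_label, text, width=20, max_chars=90):
    """Format one error-analysis row; labels may be strings or numeric ids."""
    true_s = str(true_label)   # cast first: the 's' format code rejects ints
    pred_s = str(pred_label)
    return f"  [True={true_s:{width}s}  Pred={pred_s:{width}s}] {text[:max_chars]}"


print(format_error_row("irony", 3, "Area man shocked to learn water is wet"))
```

The `!s` conversion (`{pred_l!s:20s}`) achieves the same result inline if a helper is not wanted.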


Reallyeasy1 added 2 commits March 3, 2026 16:34

feat: add sarcasm classification pipeline with Colab

58b7b60

fix: rename sarcasm ipynb file

d9b7b0c

Reallyeasy1 self-assigned this Mar 3, 2026

Reallyeasy1 requested a review from SeeYangZhi March 3, 2026 08:39

coderabbitai Bot added a commit that referenced this pull request Mar 3, 2026

📝 Add docstrings to feat/default-models

b0fa464

Docstrings generation was requested by @Reallyeasy1. * #1 (comment) The following files were modified: * `build_combined.py`

coderabbitai Bot mentioned this pull request Mar 3, 2026

📝 Add docstrings to feat/default-models #2

Open

coderabbitai Bot reviewed Mar 3, 2026


Reallyeasy1 added 3 commits March 3, 2026 17:10

Fix: fix ipynb formatting error in sarcasm_classification.ipynb

846de4a

Fix: test ipynb formatting error

025fe94

Fix: fix formatting of ipynb file (finally works)

8335397

Reallyeasy1 mentioned this pull request Mar 3, 2026

Add Naive Bayes, TF-IDF, distillBERT classifier models for sarcasm classification #3

Open

Feat: Finish evaluation of transformers and fix character issue in ma…

089982f

…in ipynb

SeeYangZhi force-pushed the main branch 4 times, most recently from a51c032 to 1ac8cca Compare March 4, 2026 02:44
