Feat/default models #1
Important: Review skipped
Review was skipped as selected files did not have any reviewable changes.
💤 Files selected but had no reviewable changes (1)
📝 Walkthrough
Adds a full sarcasm-classification pipeline: data preparation, TF-IDF/Naive Bayes baselines, DistilBERT training, error analysis notebooks, documentation, a notebook-merging utility, dependency manifest, and .gitignore updates. All changes are new files; no public API surface modified.
Sequence Diagram
sequenceDiagram
participant User
participant DataPrep as 01_DataPrep
participant TFIDF as 02_TFIDF_LR
participant NB as 03_NaiveBayes
participant BERT as 04_BERT
participant Error as 05_ErrorAnalysis
participant Storage as Outputs
User->>DataPrep: run data prep notebook
DataPrep->>DataPrep: validate JSONL, derive labels, create splits
DataPrep->>Storage: save datasets & splits
par Run classical baselines
TFIDF->>Storage: load splits
TFIDF->>TFIDF: GridSearchCV (GroupKFold), evaluate
TFIDF->>Storage: save metrics & preds
NB->>Storage: load splits
NB->>NB: GridSearchCV (GroupKFold), evaluate
NB->>Storage: save metrics & preds
end
BERT->>Storage: load splits
BERT->>BERT: tokenize, train (early stopping), evaluate
BERT->>Storage: save checkpoints, metrics & preds
Storage->>Error: provide metrics & predictions
Error->>Error: compare models, extract errors, create reports
Error->>Storage: save reports & visuals
Storage-->>User: artifacts available
Estimated Code Review Effort
🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Note: Docstrings generation - SUCCESS
Docstrings generation was requested by @Reallyeasy1.
* #1 (comment)
The following files were modified:
* `build_combined.py`
Actionable comments posted: 14
🧹 Nitpick comments (3)
requirements.txt (1)
2-24: Use deterministic dependency pinning for reproducible runs.
Lines 2–24 use floating minimums only (`>=`), so reinstalls can drift and change model metrics over time. For this project, keep a pinned lock artifact (or fully pinned `requirements.txt`) and optionally keep a separate `requirements.in` for broad constraints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` around lines 2 - 24, The requirements.txt currently uses permissive >= pins which allows dependency drift; replace it with a deterministic, fully pinned requirements lock (use exact version pins like == for every package) or add a generated lock artifact (e.g., requirements.txt as the locked file) while keeping a high-level requirements.in with >= constraints; regenerate the lock via your tool of choice (pip-compile, pip freeze in a clean env, or poetry/poetry.lock) and commit the locked requirements.txt so future installs are reproducible and deterministic.
build_combined.py (1)
200-210: Remove unused `out_dir_val` parameter from `content_cells_nb`.
`out_dir_val` is never used, which makes call sites noisier and the helper harder to read.
Suggested fix
-def content_cells_nb(cells, out_dir_var, out_dir_val):
+def content_cells_nb(cells, out_dir_var):
@@
-for c in content_cells_nb(nb02, "OUT_TFIDF", "OUT_TFIDF"):
+for c in content_cells_nb(nb02, "OUT_TFIDF"):
@@
-for c in content_cells_nb(nb03, "OUT_NB", "OUT_NB"):
+for c in content_cells_nb(nb03, "OUT_NB"):
@@
-for c in content_cells_nb(nb04, "BERT_OUT", "BERT_OUT"):
+for c in content_cells_nb(nb04, "BERT_OUT"):
@@
-for c in content_cells_nb(nb05, "REPORTS_DIR", "REPORTS_DIR"):
+for c in content_cells_nb(nb05, "REPORTS_DIR"):
Also applies to: 241-241, 249-249, 257-257, 265-265
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@build_combined.py` around lines 200 - 210, The function content_cells_nb has an unused parameter out_dir_val; remove out_dir_val from its signature and any internal references, leaving only (cells, out_dir_var), and update every call site that currently passes a third argument to call content_cells_nb with just (cells, out_dir_var). Ensure any tests or helpers that referenced the old three-arg form are updated accordingly; search for content_cells_nb usages in this file to locate and fix all callers.
notebooks/05_error_analysis.ipynb (1)
577-581: Avoid hard-coded class-frequency claims in generated reports.
Line 580 hard-codes `rhetorical_question = 3.7%`, which can become incorrect when data changes and make the report stale.
Suggested fix
-    lines.append("4. **Class imbalance** (rhetorical_question = 3.7%) is a significant challenge")
+    if error_source_type is not None and "type_label" in error_source_type.columns:
+        freq = (error_source_type["type_label"] == "rhetorical_question").mean() * 100
+        lines.append(f"4. **Class imbalance** (rhetorical_question = {freq:.1f}%) is a significant challenge")
+    else:
+        lines.append("4. **Class imbalance** is a significant challenge")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@notebooks/05_error_analysis.ipynb` around lines 577 - 581, The report currently hard-codes "rhetorical_question = 3.7%" inside a lines.append string; instead compute the class frequency from the dataset and interpolate it into the string. Find the lines.append call that contains "Class imbalance (rhetorical_question = 3.7%)", calculate the percentage from the label column (e.g., count_true / total * 100) and format it (e.g., f"{pct:.1f}%") before passing the formatted string to lines.append so the report always reflects the current data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@build_combined.py`:
- Around line 282-289: When syntax errors are found (the errors list is
non-empty) stop the build instead of continuing: after printing the error
summary (the block that iterates over errors) call sys.exit(1) or raise
SystemExit so the script exits non‑zero and prevents subsequent writing of the
combined notebook; alternatively move the code that writes the combined notebook
so it only executes in the else branch where you currently print "syntax OK".
Refer to the errors variable and the all_cells check in build_combined.py and
ensure the combined-notebook write logic (the function or code that
serializes/writes the notebook) only runs when no syntax errors are present.
- Around line 35-138: SETUP_SRC is missing imports and DEVICE/Dataset/DataLoader
definitions needed by notebook part 04; update SETUP_SRC to import torch,
transformers (e.g., from transformers import AutoTokenizer, AutoModel, etc. as
used elsewhere), and tqdm (or tqdm.auto), and import Dataset and DataLoader from
torch.utils.data, then add DEVICE = torch.device("cuda" if
torch.cuda.is_available() else "cpu") (and set torch.manual_seed(SEED) after
SEED) so code referencing DEVICE, Dataset, DataLoader, torch, or transformers
resolves at runtime; modify the block that defines imports and SEED in SETUP_SRC
accordingly, keeping symbol names exactly as used (SETUP_SRC, SEED, DEVICE,
Dataset, DataLoader, torch, transformers, tqdm).
In `@docs/dataset_audit.md`:
- Around line 16-24: The markdown file has lint warnings: the two tables that
start with the header row "| Field | Type | Description |" need a blank line
before and after each table to satisfy MD058, and the fenced code block that
currently lacks a language tag should be updated to include an appropriate
language identifier (e.g., ```python) to satisfy MD040; locate the table blocks
and the fenced block in docs/dataset_audit.md (search for the table header
string and the triple-backtick fence) and insert the surrounding blank lines for
each table and add the language tag to the fence.
In `@docs/research_summary.md`:
- Around line 112-121: Surround the markdown table that begins with "The 6
strategy classes are **significantly imbalanced**:" (the type-distribution table
listing sarcasm, irony, satire, etc.) with a blank line before the paragraph
introducing the table and a blank line after the table end so there is an empty
line both above and below the table to satisfy markdownlint MD058.
In `@notebooks/01_data_preparation.ipynb`:
- Around line 350-353: Replace the runtime-only assert checks that verify split
integrity (e.g., "assert train_groups.isdisjoint(val_groups)...", "assert
train_groups.isdisjoint(test_groups)...", "assert
val_groups.isdisjoint(test_groups)...") with unconditional guard checks that
raise explicit exceptions (for example raise ValueError with the same message)
so these validations always run even when Python is optimized; do the same for
the other analogous assertions in the file (the checks referencing
train_groups/val_groups/test_groups around the later block) to ensure
deterministic error handling.
- Line 24: The _find_root function can pick a generic system-level directory
(e.g., a global "data" folder); update _find_root to only accept a candidate
parent as ROOT if it both contains one of the existing project markers
(outputs/, notebooks/, data/) and also looks like a repository/project root —
e.g., it contains a repo marker such as .git, pyproject.toml, setup.cfg, or
package.json — and insist the candidate is within the current project tree (one
of Path.cwd() or its parents) and not a top-level filesystem root or home
directory; change the acceptance logic in _find_root accordingly and leave the
fallback to data_file.parent unchanged (symbols to edit: _find_root, markers,
ROOT, OUT_DATASETS, OUT_SPLITS).
- Around line 191-193: Remove the redundant imports in Cell 10: delete the line
"from __future__ import annotations" (future imports must remain only at the
top-level/module start) and remove the duplicate "from urllib.parse import
urlparse" since both are already defined in the setup cell; ensure only the
setup cell retains these imports and rerun the notebook to confirm no missing
references to urlparse.
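The assert-to-exception change requested for the split checks could look like the sketch below. `ensure_disjoint` is a hypothetical helper name and the group values are made up; the only point is that `raise ValueError` still runs under `python -O`, where `assert` statements are stripped.

```python
from itertools import combinations


def ensure_disjoint(**named_groups):
    """Raise ValueError if any two named group sets share a member.

    Unlike `assert`, this check is not removed when Python runs with -O.
    """
    for (name_a, a), (name_b, b) in combinations(named_groups.items(), 2):
        overlap = set(a) & set(b)
        if overlap:
            sample = sorted(overlap)[:5]
            raise ValueError(
                f"Group leakage between {name_a} and {name_b}: "
                f"{len(overlap)} shared group(s), e.g. {sample}"
            )


# Illustrative values only; the notebook derives its groups from article_link.
train_groups = {"a.example/1", "a.example/2"}
val_groups = {"b.example/1"}
test_groups = {"c.example/1"}
ensure_disjoint(train=train_groups, val=val_groups, test=test_groups)
```

Checking every pair in one call keeps the three train/val/test comparisons in a single place, so a later split cannot be added without being covered.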
In `@notebooks/02_tfidf_lr_baseline.ipynb`:
- Around line 184-185: The notebook currently refits the best estimator
(best_bin / best_estimator_) on train+val and then calls evaluate(..., X_val,
y_val, ...) which yields validation metrics computed on data used during refit;
instead compute and report true holdout validation metrics either by (a)
evaluating the selected estimator before refit (use the estimator from the CV
split or GridSearchCV.best_estimator_ saved prior to refit) on X_val/y_val, or
(b) compute validation metrics from cross-validation results (e.g.,
grid.cv_results_ or the scorer on X_val with a copy of the model trained only on
train set). Concretely, locate uses of best_bin / best_estimator_ and the
evaluate(...) calls and change the workflow so evaluation on X_val/y_val happens
with the model that was not refit on train+val (or derive metrics from
cv_results_) before performing any refit on the combined training+validation
data.
- Around line 420-431: The code currently overwrites the original semantic
"type_label" by assigning y_val_t / y_test_t into val_type_copy["type_label"]
and test_type_copy["type_label"]; instead, preserve the original type_label and
store numeric IDs in dedicated fields (e.g.,
val_type_copy["true_id"]/test_type_copy["true_id"] or
val_type_copy["type_id"]/test_type_copy["type_id"]), then set
val_type_copy["predicted_id"], val_type_copy["predicted_label"],
val_type_copy["true_label"], val_type_copy["correct"] (and the corresponding
test_type_copy fields) using y_val_t/y_test_t and
y_val_pred_type/y_test_pred_type so the original string "type_label" remains
unchanged.
In `@notebooks/03_naive_bayes_baseline.ipynb`:
- Around line 178-179: The code is evaluating validation performance after
refitting the selected estimator on train+val (best_estimator_), so those scores
are no longer true holdout metrics; fix by either (A) remove/skip the
evaluate(...) calls on X_val (e.g., drop evaluate(best_model_bin, X_val, y_val,
...) and evaluate(best_model_multi, X_val, ...) after refit) or (B) preserve a
pre-refit copy of the chosen estimator and evaluate that on X_val (use
sklearn.base.clone(best_estimator_) or set GridSearchCV(refit=False) and re-fit
a cloned estimator only on train+val for final test evaluation), ensuring
references to best_estimator_, best_model_bin and best_model_multi and the
evaluate(...) calls are updated accordingly.
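One way to keep validation scores as true holdout metrics, sketched on synthetic data. The notebooks' own `evaluate(...)` helper and variable names are not reproduced here; the estimator, the parameter grid and the 60/20/20 group split are assumptions made only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in for the notebook's features, labels and article groups.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
groups = np.arange(300) // 3  # 100 groups of 3 rows each
train = groups < 60
val = (groups >= 60) & (groups < 80)
test = groups >= 80

# 1) Select hyperparameters on the training split only.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=GroupKFold(n_splits=3),
    scoring="f1",
)
grid.fit(X[train], y[train], groups=groups[train])

# 2) best_estimator_ has only seen the training split, so this is a holdout score.
val_f1 = f1_score(y[val], grid.best_estimator_.predict(X[val]))

# 3) Only now fit a fresh clone on train+val and score it on the test split.
final_model = clone(grid.best_estimator_)
train_val = train | val
final_model.fit(X[train_val], y[train_val])
test_f1 = f1_score(y[test], final_model.predict(X[test]))

print(f"val F1 (train-only model): {val_f1:.3f}")
print(f"test F1 (train+val model): {test_f1:.3f}")
```

The ordering is what matters: validation is scored before any model has seen validation rows, and the refit model is only ever scored on test.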
In `@notebooks/04_bert_classification.ipynb`:
- Line 153: Guard against empty loaders before dividing by len(loader): wherever
you compute averages like total_loss / len(loader) (and similar metric sums)
first check if len(loader) == 0 and return a safe default (e.g., 0.0 or
float('nan')) or skip the averaging; implement this by computing denom =
len(loader) and if denom == 0 return 0.0 else return total_loss / denom, and
apply the same pattern to all other places that divide by len(loader) (the
total_loss / len(loader) occurrence and the other metric-averaging blocks
mentioned).
- Around line 425-429: The encoder function enc currently does direct lookup
into label2id which will raise KeyError if val/test contain labels unseen in
train; before calling enc for X_val_type/X_test_type validate that
set(val_type["type_label"]) and set(test_type["type_label"]) are subsets of
label2id.keys(), and if not either (a) raise a clear error listing unknown
labels so the user can fix the splits, or (b) map unknown labels to a default
fallback id (e.g., label2id.get(label, unknown_id)); update usages of enc and
the variables X_val_type, y_val_type, X_test_type, y_test_type accordingly and
mention label2id and enc in the check and handling logic.
- Around line 339-342: The classification_report calls for validation and test
(the calls using classification_report(y_val, val_preds,
target_names=label_names, zero_division=0) and classification_report(y_test,
test_preds, target_names=label_names, zero_division=0)) can raise ValueError
when some classes are missing; fix by passing an explicit labels list to ensure
stable multi-class evaluation—compute the label id list (e.g., label_ids =
list(range(len(label_names))) or derive from your label-to-id mapping) and add
labels=label_ids to both classification_report calls so target_names aligns with
the provided labels.
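The empty-loader guard and the unseen-label check described above can be sketched without any ML dependencies. `safe_mean` and `encode_labels` are hypothetical names, and the `label2id` mapping is illustrative rather than the notebook's real one.

```python
import math


def safe_mean(total, count):
    """Return total / count, or NaN when there were no batches to average."""
    return total / count if count else math.nan


def encode_labels(labels, label2id, split_name):
    """Map string labels to ids, failing loudly on labels unseen in training."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(
            f"{split_name} contains labels not seen in train: {unknown}"
        )
    return [label2id[label] for label in labels]


# Illustrative mapping; the notebook builds label2id from the training split.
label2id = {"irony": 0, "sarcasm": 1, "satire": 2}
print(encode_labels(["sarcasm", "irony"], label2id, "val"))  # [1, 0]
print(safe_mean(0.0, 0))  # nan
```

Raising on unknown labels, rather than mapping them to a fallback id, surfaces a broken split immediately; the fallback option mentioned in the comment would instead hide it inside the metrics.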
In `@notebooks/05_error_analysis.ipynb`:
- Around line 333-336: The formatting expression in the loop over
error_sample_type uses :20s on values that can be numeric, causing ValueError;
in the loop that reads true_l = row[true_col_type] and pred_l =
row.get(pred_col_type, "?"), cast both true_l and pred_l to strings before
applying the width format (e.g., convert via str(...) or f-string !s conversion)
so the print call print(f" [True={...:20s} Pred={...:20s}]
{row['text'][:90]}") never receives non-string types.
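A small sketch of the string-cast fix for the `:20s` format spec. The rows are invented and `format_row` is a hypothetical helper; the notebook's real column names (`true_col_type`, `pred_col_type`) are replaced by fixed keys to keep the example self-contained.

```python
# Invented rows mixing string and numeric labels, as can happen after encoding.
rows = [
    {"true": "irony", "pred": 3, "text": "Local man shocked to learn Mondays exist"},
    {"true": 1, "pred": None, "text": "Area expert confirms water is wet"},
]


def format_row(row):
    """Format one error row; str() makes the :20s width spec safe for any type."""
    true_l = str(row["true"])
    pred_l = str(row.get("pred", "?"))
    return f"  [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}"


for row in rows:
    print(format_row(row))
```

Without the `str(...)` calls, an integer label would raise `ValueError: Unknown format code 's'` on the first row.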
---
Nitpick comments:
In `@build_combined.py`:
- Around line 200-210: The function content_cells_nb has an unused parameter
out_dir_val; remove out_dir_val from its signature and any internal references,
leaving only (cells, out_dir_var), and update every call site that currently
passes a third argument to call content_cells_nb with just (cells, out_dir_var).
Ensure any tests or helpers that referenced the old three-arg form are updated
accordingly; search for content_cells_nb usages in this file to locate and fix
all callers.
In `@notebooks/05_error_analysis.ipynb`:
- Around line 577-581: The report currently hard-codes "rhetorical_question =
3.7%" inside a lines.append string; instead compute the class frequency from the
dataset and interpolate it into the string. Find the lines.append call that
contains "Class imbalance (rhetorical_question = 3.7%)", calculate the
percentage from the label column (e.g., count_true / total * 100) and format it
(e.g., f"{pct:.1f}%") before passing the formatted string to lines.append so the
report always reflects the current data.
In `@requirements.txt`:
- Around line 2-24: The requirements.txt currently uses permissive >= pins which
allows dependency drift; replace it with a deterministic, fully pinned
requirements lock (use exact version pins like == for every package) or add a
generated lock artifact (e.g., requirements.txt as the locked file) while
keeping a high-level requirements.in with >= constraints; regenerate the lock
via your tool of choice (pip-compile, pip freeze in a clean env, or
poetry/poetry.lock) and commit the locked requirements.txt so future installs
are reproducible and deterministic.
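As a rough aid for the pinning nitpick, a stdlib-only check can flag requirement lines that are not pinned with `==`. This is a sketch under simplifying assumptions: `unpinned_requirements` is a hypothetical helper, the sample manifest is invented, and environment markers, URLs and hashes are not handled.

```python
import re

# A requirement counts as pinned when it is "name==version" with nothing else.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=<>!~,\s]+$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned with an exact == version."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            bad.append(line)
    return bad


# Illustrative contents, not the project's real manifest.
sample = """
numpy==1.26.4
pandas>=2.0
scikit-learn
"""
print(unpinned_requirements(sample))  # ['pandas>=2.0', 'scikit-learn']
```

A check like this could run in CI against the locked file, while the broad `>=` constraints stay in `requirements.in`.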
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
* .gitignore
* build_combined.py
* colab_package/sarcasm_classification.ipynb
* colab_package/sarcasm_pairs_step35_clean.jsonl
* docs/dataset_audit.md
* docs/experiment_plan.md
* docs/research_summary.md
* notebooks/01_data_preparation.ipynb
* notebooks/02_tfidf_lr_baseline.ipynb
* notebooks/03_naive_bayes_baseline.ipynb
* notebooks/04_bert_classification.ipynb
* notebooks/05_error_analysis.ipynb
* notebooks/sarcasm_pairs_step35_clean.jsonl
* requirements.txt
SETUP_SRC = "\n".join([
    "# ============================================================",
    "# SETUP — imports, file upload, paths",
    "# ============================================================",
    "from __future__ import annotations",
    "import json, hashlib, random, os, warnings, shutil",
    "from dataclasses import dataclass",
    "from pathlib import Path",
    "from collections import Counter",
    "from urllib.parse import urlparse",
    "from typing import Optional",
    "",
    "import numpy as np",
    "import pandas as pd",
    "import matplotlib.pyplot as plt",
    "import matplotlib.gridspec as gridspec",
    "import seaborn as sns",
    "",
    "from sklearn.pipeline import Pipeline",
    "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer",
    "from sklearn.linear_model import LogisticRegression",
    "from sklearn.naive_bayes import MultinomialNB, ComplementNB",
    "from sklearn.model_selection import GridSearchCV, GroupKFold",
    "from sklearn.metrics import (",
    "    accuracy_score, precision_score, recall_score,",
    "    f1_score, classification_report, confusion_matrix,",
    ")",
    "",
    "warnings.filterwarnings('ignore')",
    "",
    "SEED = 42",
    "random.seed(SEED)",
    "np.random.seed(SEED)",
    "",
    "# ── Locate or upload the JSONL data file ─────────────────────────────────",
    'FILENAME = "sarcasm_pairs_step35_clean.jsonl"',
    "",
    "def _locate_file(filename):",
    "    candidates = []",
    "    for root in [Path.cwd()] + list(Path.cwd().parents):",
    "        for sub in [",
    '            Path("data") / "processed" / filename,',
    '            Path("data") / filename,',
    "            Path(filename),",
    "        ]:",
    "            candidates.append(root / sub)",
    "    for p in [",
    '        Path("/content") / filename,',
    '        Path("/mnt/data") / filename,',
    "    ]:",
    "        candidates.append(p)",
    "    _c = Path('/content')",
    "    for p in (_c.rglob(filename) if _c.exists() else []):",
    "        candidates.append(p)",
    "    for p in candidates:",
    "        if p.is_file():",
    "            return p",
    "    return None",
    "",
    "print(f'cwd: {Path.cwd()}')",
    "print(f'files in cwd: {[p.name for p in Path.cwd().iterdir()][:10]}')",
    "",
    "DATA_FILE = _locate_file(FILENAME)",
    "if DATA_FILE is None:",
    "    try:",
    "        from google.colab import files as _cf",
    '        print(f"Upload {FILENAME!r}:")',
    "        _up = _cf.upload()",
    "        if not _up:",
    '            raise RuntimeError("No file uploaded.")',
    "        _name = list(_up.keys())[0]",
    '        DATA_FILE = Path("/content") / FILENAME',
    "        if Path(_name) != DATA_FILE:",
    "            shutil.move(_name, str(DATA_FILE))",
    '        print(f"Saved to {DATA_FILE}")',
    "    except ImportError:",
    "        raise FileNotFoundError(",
    '            f"Cannot find {FILENAME!r}. Place it in the same folder as this notebook."',
    "        )",
    "",
    "# ── Project root + all output directories ────────────────────────────────",
    "def _find_root(data_file):",
    '    for parent in [data_file.parent] + list(data_file.parents):',
    '        if any((parent / m).exists() for m in ["outputs","notebooks","data"]):',
    "            return parent",
    "    return data_file.parent",
    "",
    "ROOT = _find_root(DATA_FILE)",
    "",
    "OUT_DATASETS = ROOT / 'outputs' / 'datasets'",
    "OUT_SPLITS = ROOT / 'outputs' / 'splits'",
    "OUT_TFIDF = ROOT / 'outputs' / 'classical' / 'tfidf_lr'",
    "OUT_NB = ROOT / 'outputs' / 'classical' / 'naive_bayes'",
    "BERT_OUT = ROOT / 'outputs' / 'bert'",
    "REPORTS_DIR = ROOT / 'outputs' / 'reports'",
    "SPLITS = OUT_SPLITS",
    "",
    "for d in [OUT_DATASETS, OUT_SPLITS, OUT_TFIDF, OUT_NB, BERT_OUT, REPORTS_DIR]:",
    "    d.mkdir(parents=True, exist_ok=True)",
    "",
    'print(f"Data : {DATA_FILE}")',
    'print(f"Root : {ROOT}")',
    'print(f"Output : {ROOT / \'outputs\'}")',
]) + "\n"
🧩 Analysis chain
🏁 Script executed:
fd build_combined.py
Repository: SeeYangZhi/Project-LLMao
Length of output: 85
🏁 Script executed:
wc -l build_combined.py
Repository: SeeYangZhi/Project-LLMao
Length of output: 89
🏁 Script executed:
cat -n build_combined.py
Repository: SeeYangZhi/Project-LLMao
Length of output: 13265
🏁 Script executed:
fd "04_bert_classification.ipynb"
Repository: SeeYangZhi/Project-LLMao
Length of output: 106
🏁 Script executed:
cat notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys
nb = json.load(sys.stdin)
for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 50 lines
            lines = src.split("\n")[:50]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 50:
                print(" ...")
        else:
            print(src[:200])
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 788
🏁 Script executed:
python3 << 'EOF'
import json
from pathlib import Path
nb_path = Path("notebooks/04_bert_classification.ipynb")
with open(nb_path, 'r') as f:
    nb = json.load(f)
for i, cell in enumerate(nb["cells"]):
    src = cell["source"]
    if isinstance(src, list):
        src = "".join(src)
    # Look for torch, transformers, DEVICE, Dataset, DataLoader
    keywords = ["torch", "transformers", "DEVICE", "Dataset", "DataLoader"]
    has_keyword = any(kw in src for kw in keywords)
    if has_keyword or (i < 5):  # Show first 5 cells + any with keywords
        cell_type = cell["cell_type"]
        print(f"\n=== Cell {i} ({cell_type}) ===")
        if cell_type == "code":
            # Show first 80 lines
            lines = src.split("\n")[:80]
            for j, line in enumerate(lines, 1):
                print(f"{j:3d}: {line}")
            if len(src.split("\n")) > 80:
                print(" ...")
        else:
            print(src[:300])
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 11941
Add missing torch, transformers, and tqdm imports plus DEVICE initialization to SETUP_SRC.
The combined notebook skips notebook 04's setup cell (filtered by is_setup_cell), but SETUP_SRC does not define torch, transformers imports, Dataset, DataLoader, or DEVICE. Part 4 will fail at runtime (e.g., NameError: name 'DEVICE' is not defined, NameError: name 'Dataset' is not defined).
Suggested fix
SETUP_SRC = "\n".join([
"import seaborn as sns",
+ "import torch",
+ "from torch import nn",
+ "from torch.utils.data import Dataset, DataLoader",
+ "from transformers import (",
+ " AutoTokenizer,",
+ " AutoModelForSequenceClassification,",
+ " get_linear_schedule_with_warmup,",
+ ")",
+ "from tqdm.auto import tqdm",
"",
"warnings.filterwarnings('ignore')",
"",
"SEED = 42",
"random.seed(SEED)",
"np.random.seed(SEED)",
+ "torch.manual_seed(SEED)",
+ "if torch.cuda.is_available():",
+ " torch.cuda.manual_seed_all(SEED)",
+ " torch.backends.cudnn.deterministic = True",
+ "DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')",
+ "print(f'Device: {DEVICE}')",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@build_combined.py` around lines 35 - 138, SETUP_SRC is missing imports and
DEVICE/Dataset/DataLoader definitions needed by notebook part 04; update
SETUP_SRC to import torch, transformers (e.g., from transformers import
AutoTokenizer, AutoModel, etc. as used elsewhere), and tqdm (or tqdm.auto), and
import Dataset and DataLoader from torch.utils.data, then add DEVICE =
torch.device("cuda" if torch.cuda.is_available() else "cpu") (and set
torch.manual_seed(SEED) after SEED) so code referencing DEVICE, Dataset,
DataLoader, torch, or transformers resolves at runtime; modify the block that
defines imports and SEED in SETUP_SRC accordingly, keeping symbol names exactly
as used (SETUP_SRC, SEED, DEVICE, Dataset, DataLoader, torch, transformers,
tqdm).
if errors:
    print(f"\n{len(errors)} syntax error(s):")
    for i, ln, msg, line in errors:
        print(f" Cell {i:3d} line {ln}: {line!r}")
        print(f" {msg}")
else:
    print(f"All {len(all_cells)} cells: syntax OK")
Fail the build when syntax errors are detected.
Currently the script prints syntax errors but still writes the combined notebook, which can ship broken artifacts.
Suggested fix
  if errors:
      print(f"\n{len(errors)} syntax error(s):")
      for i, ln, msg, line in errors:
          print(f" Cell {i:3d} line {ln}: {line!r}")
          print(f" {msg}")
+     raise RuntimeError("Aborting: combined notebook contains syntax errors.")
  else:
      print(f"All {len(all_cells)} cells: syntax OK")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- if errors:
-     print(f"\n{len(errors)} syntax error(s):")
-     for i, ln, msg, line in errors:
-         print(f" Cell {i:3d} line {ln}: {line!r}")
-         print(f" {msg}")
- else:
-     print(f"All {len(all_cells)} cells: syntax OK")
+ if errors:
+     print(f"\n{len(errors)} syntax error(s):")
+     for i, ln, msg, line in errors:
+         print(f" Cell {i:3d} line {ln}: {line!r}")
+         print(f" {msg}")
+     raise RuntimeError("Aborting: combined notebook contains syntax errors.")
+ else:
+     print(f"All {len(all_cells)} cells: syntax OK")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@build_combined.py` around lines 282 - 289, When syntax errors are found (the
errors list is non-empty) stop the build instead of continuing: after printing
the error summary (the block that iterates over errors) call sys.exit(1) or
raise SystemExit so the script exits non‑zero and prevents subsequent writing of
the combined notebook; alternatively move the code that writes the combined
notebook so it only executes in the else branch where you currently print
"syntax OK". Refer to the errors variable and the all_cells check in
build_combined.py and ensure the combined-notebook write logic (the function or
code that serializes/writes the notebook) only runs when no syntax errors are
present.
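The "write only when syntax is clean" ordering can be sketched as follows. `find_syntax_errors`, `build` and the `write_notebook` callback are hypothetical names, not the actual structure of `build_combined.py`, and plain `ast.parse` would reject IPython magics such as `!pip`, which a real notebook check would need to handle first.

```python
import ast
import sys


def find_syntax_errors(cell_sources):
    """Return (cell_index, line_number, message) for each cell that fails to parse."""
    errors = []
    for i, src in enumerate(cell_sources):
        try:
            ast.parse(src)
        except SyntaxError as exc:
            errors.append((i, exc.lineno, exc.msg))
    return errors


def build(cell_sources, write_notebook):
    """Write the combined notebook only if every code cell parses."""
    errors = find_syntax_errors(cell_sources)
    if errors:
        for i, ln, msg in errors:
            print(f"  Cell {i:3d} line {ln}: {msg}", file=sys.stderr)
        raise SystemExit(1)  # non-zero exit; nothing is written
    write_notebook(cell_sources)
    print(f"All {len(cell_sources)} cells: syntax OK")


# Usage with a stand-in writer that just records what it was given.
written = []
build(["x = 1", "print(x)"], written.append)
```

Because the writer is only reached after the error check, a failing build leaves no partially valid artifact behind and CI sees a non-zero exit code.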
| Field | Type | Description |
|-------|------|-------------|
| `original_headline` | string | The source headline (sarcastic OR non-sarcastic depending on `type`) |
| `generated_headline` | string | The transformed headline (the opposite label) |
| `strategy` | string | Sarcasm type/strategy label (6 values, see below) |
| `type` | string | Direction of transformation: `sarcastic_to_non` or `non_to_sarcastic` |
| `model_used` | string | The LLM that generated the rewrite |
| `article_link` | URL string | Source article URL; used for grouping |
Fix markdownlint warnings in tables and code fences.
Line 16/Line 73 tables need surrounding blank lines (MD058), and the fenced block at Line 106 should include a language tag (MD040), e.g. ```python.
Also applies to: 73-76, 106-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/dataset_audit.md` around lines 16 - 24, The markdown file has lint
warnings: the two tables that start with the header row "| Field | Type |
Description |" need a blank line before and after each table to satisfy MD058,
and the fenced code block that currently lacks a language tag should be updated
to include an appropriate language identifier (e.g., ```python) to satisfy
MD040; locate the table blocks and the fenced block in docs/dataset_audit.md
(search for the table header string and the triple-backtick fence) and insert
the surrounding blank lines for each table and add the language tag to the
fence.
The 6 strategy classes are **significantly imbalanced**:
| Strategy | Count (approx) |
|----------|---------------|
| sarcasm | ~8,699 |
| irony | ~6,102 |
| satire | ~5,224 |
| overstatement | ~3,976 |
| understatement | ~3,295 |
| rhetorical_question | ~1,037 |
Add blank lines around the type-distribution table.
The table starting at Line 113 should be surrounded by blank lines to satisfy markdownlint MD058.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 113-113: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/research_summary.md` around lines 112 - 121, Surround the markdown table
that begins with "The 6 strategy classes are **significantly imbalanced**:" (the
type-distribution table listing sarcasm, irony, satire, etc.) with a blank line
before the paragraph introducing the table and a blank line after the table end
so there is an empty line both above and below the table to satisfy markdownlint
MD058.
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "import os; from pathlib import Path as _P; print(\"cwd:\", _P.cwd()); print(\"files:\", list(_P.cwd().iterdir())[:10])\n\nimport json, hashlib, random, os\nfrom pathlib import Path\nfrom collections import Counter\nfrom urllib.parse import urlparse\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nSEED = 42\nrandom.seed(SEED)\nnp.random.seed(SEED)\n\nFILENAME = \"sarcasm_pairs_step35_clean.jsonl\"\n\n# \u2500\u2500 Step 1: locate or upload the data file \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _locate_file(filename):\n \"\"\"Search common locations; return Path if found, else None.\"\"\"\n candidates = []\n for root in [Path.cwd()] + list(Path.cwd().parents):\n for sub in [\n Path(\"data\") / \"processed\" / filename,\n Path(\"data\") / filename,\n Path(filename),\n ]:\n candidates.append(root / sub)\n for p in [\n Path(\"/content\") / filename,\n Path(\"/mnt/data\") / filename,\n Path(\"/content/drive/MyDrive\") / filename,\n ]:\n candidates.append(p)\n _content = Path(\"/content\")\n for p in (_content.rglob(filename) if _content.exists() else []):\n candidates.append(p)\n for p in candidates:\n if p.is_file():\n return p\n return None\n\nDATA_FILE = _locate_file(FILENAME)\n\nif DATA_FILE is None:\n # \u2500\u2500 Auto-upload in Colab \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\n try:\n from google.colab import files as _colab_files\n print(f\"{FILENAME!r} not found. 
Please upload it now:\")\n _uploaded = _colab_files.upload()\n if not _uploaded:\n raise RuntimeError(\"No file uploaded. Please upload the JSONL file and re-run.\")\n _fname = list(_uploaded.keys())[0]\n if not _fname.endswith(FILENAME.split('/')[-1]):\n raise ValueError(\n f\"Wrong file uploaded: {_fname!r}. Expected {FILENAME!r}.\"\n )\n DATA_FILE = Path(\"/content\") / FILENAME\n if Path(_fname) != DATA_FILE:\n import shutil\n shutil.move(_fname, str(DATA_FILE))\n print(f\"Uploaded to {DATA_FILE}\")\n except ImportError:\n # Not in Colab\n raise FileNotFoundError(\n f\"Cannot find {FILENAME!r}.\\n\"\n f\"cwd: {Path.cwd()}\\n\"\n \"Place the file at: <project_root>/data/processed/\" + FILENAME\n )\n\n# \u2500\u2500 Step 2: determine project root \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\ndef _find_root(data_file):\n \"\"\"Walk up from data_file to find the project root (contains outputs/ or notebooks/).\"\"\"\n markers = [\n Path(\"outputs\"),\n Path(\"notebooks\"),\n Path(\"data\"),\n ]\n for parent in [data_file.parent] + list(data_file.parents):\n if any((parent / m).exists() for m in markers):\n return parent\n return data_file.parent # fallback\n\nROOT = _find_root(DATA_FILE)\n\nOUT_DATASETS = ROOT / \"outputs\" / \"datasets\"\nOUT_SPLITS = ROOT / \"outputs\" / \"splits\"\nOUT_DATASETS.mkdir(parents=True, exist_ok=True)\nOUT_SPLITS.mkdir(parents=True, exist_ok=True)\n\nprint(f\"Dataset : {DATA_FILE}\")\nprint(f\"Root : {ROOT}\")\nprint(f\"Outputs : {OUT_DATASETS}\")\n" |
Harden root detection to avoid writing outside the project tree.
In Line 24, _find_root can match overly generic parents (e.g., a system-level data dir), which risks setting ROOT incorrectly and writing outputs in unintended locations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/01_data_preparation.ipynb` at line 24, The _find_root function can
pick a generic system-level directory (e.g., a global "data" folder); update
_find_root to only accept a candidate parent as ROOT if it both contains one of
the existing project markers (outputs/, notebooks/, data/) and also looks like a
repository/project root — e.g., it contains a repo marker such as .git,
pyproject.toml, setup.cfg, or package.json — and insist the candidate is within
the current project tree (one of Path.cwd() or its parents) and not a top-level
filesystem root or home directory; change the acceptance logic in _find_root
accordingly and leave the fallback to data_file.parent unchanged (symbols to
edit: _find_root, markers, ROOT, OUT_DATASETS, OUT_SPLITS).
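One way the stricter acceptance logic described above could look is sketched below. It is a minimal sketch, not the notebook's code: the name `find_root`, the `cwd` parameter, and the two marker tuples are assumptions made for illustration, and the fallback to `data_file.parent` is kept as the prompt requests.

```python
from pathlib import Path
from typing import Optional

# Hypothetical marker sets; adjust them to the actual repository layout.
DIR_MARKERS = ("outputs", "notebooks", "data")
REPO_MARKERS = (".git", "pyproject.toml", "setup.cfg", "package.json")


def find_root(data_file: Path, cwd: Optional[Path] = None) -> Path:
    """Return the nearest ancestor of data_file that looks like a project root.

    A candidate is accepted only if it contains a directory marker AND a
    repo marker, is cwd or one of cwd's ancestors, and is neither the
    filesystem root nor the home directory.
    """
    cwd = (cwd or Path.cwd()).resolve()
    data_file = data_file.resolve()
    allowed = {cwd, *cwd.parents}
    forbidden = {Path(cwd.anchor), Path.home().resolve()}

    for parent in data_file.parents:
        if parent not in allowed or parent in forbidden:
            continue
        has_dir_marker = any((parent / m).is_dir() for m in DIR_MARKERS)
        has_repo_marker = any((parent / m).exists() for m in REPO_MARKERS)
        if has_dir_marker and has_repo_marker:
            return parent
    return data_file.parent  # fallback, same as the original behaviour
```

Requiring both kinds of marker means a stray system-level `data/` directory is no longer enough to be picked as `ROOT`; if no candidate qualifies, outputs land next to the data file as before.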
| "val_m_bin, y_val_pred_bin = evaluate(best_model_bin, X_val, y_val, label_names_bin, \"Val\")\n", | ||
| "test_m_bin, y_test_pred_bin = evaluate(best_model_bin, X_test, y_test, label_names_bin, \"Test\")\n", |
Validation evaluation is leaked after refit on train+val.
At Line 176/Line 331, best_estimator_ has already been fit on train+val, so Line 178-179 and Line 333-334 are not true holdout validation results.
Either (a) skip post-refit val scoring, or (b) tune on train-only and keep val truly held out.
Also applies to: 333-334
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/03_naive_bayes_baseline.ipynb` around lines 178 - 179, The code is
evaluating validation performance after refitting the selected estimator on
train+val (best_estimator_), so those scores are no longer true holdout metrics;
fix by either (A) remove/skip the evaluate(...) calls on X_val (e.g., drop
evaluate(best_model_bin, X_val, y_val, ...) and evaluate(best_model_multi,
X_val, ...) after refit) or (B) preserve a pre-refit copy of the chosen
estimator and evaluate that on X_val (use sklearn.base.clone(best_estimator_) or
set GridSearchCV(refit=False) and re-fit a cloned estimator only on train+val
for final test evaluation), ensuring references to best_estimator_,
best_model_bin and best_model_multi and the evaluate(...) calls are updated
accordingly.
| " scheduler.step()\n", | ||
| "\n", | ||
| " total_loss += loss.item()\n", | ||
| " return total_loss / len(loader)\n", |
Guard empty splits to avoid division-by-zero in epoch metrics.
Line 153 and Line 178 divide by len(loader). Any empty split (or filtered split) crashes training/evaluation.
Suggested fix
def train_bert(
@@
) -> dict:
@@
+ for split_name, X, y in [
+ ("train", X_train, y_train),
+ ("val", X_val, y_val),
+ ("test", X_test, y_test),
+ ]:
+ if len(X) == 0 or len(y) == 0:
+ raise ValueError(f"{split_name} split is empty; cannot train/evaluate.")
+
# Datasets
train_ds = HeadlineDataset(X_train, y_train, tokenizer, cfg.max_length)
Also applies to: 178-178, 239-247
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/04_bert_classification.ipynb` at line 153, Guard against empty
loaders before dividing by len(loader): wherever you compute averages like
total_loss / len(loader) (and similar metric sums) first check if len(loader) ==
0 and return a safe default (e.g., 0.0 or float('nan')) or skip the averaging;
implement this by computing denom = len(loader) and if denom == 0 return 0.0
else return total_loss / denom, and apply the same pattern to all other places
that divide by len(loader) (the total_loss / len(loader) occurrence and the
other metric-averaging blocks mentioned).
| " print(\"\\n=== Val (best checkpoint) ===\")\n", | ||
| " print(classification_report(y_val, val_preds, target_names=label_names, zero_division=0))\n", | ||
| " print(\"\\n=== Test (best checkpoint) ===\")\n", | ||
| " print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))\n", |
🧩 Analysis chain
🏁 Script executed:
find . -name "04_bert_classification.ipynb" -type f
Repository: SeeYangZhi/Project-LLMao
Length of output: 108
🏁 Script executed:
cat ./notebooks/04_bert_classification.ipynb | python3 << 'EOF'
import json
import sys
data = json.load(sys.stdin)
cells = data.get('cells', [])
# Flatten all code and track line numbers
line_num = 1
for cell in cells:
if cell.get('cell_type') == 'code':
source = cell.get('source', [])
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 330 <= line_num <= 350:
print(f"{line_num}: {line}")
line_num += 1
print(f"\nTotal lines processed: {line_num - 1}", file=sys.stderr)
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 788
🏁 Script executed:
jq '.cells[] | select(.cell_type == "code") | .source' ./notebooks/04_bert_classification.ipynb | head -200
Repository: SeeYangZhi/Project-LLMao
Length of output: 10283
🏁 Script executed:
jq -r '.cells[] | select(.cell_type == "code") | .source | join("")' ./notebooks/04_bert_classification.ipynb > /tmp/notebook_code.txt
grep -n "Val (best checkpoint)" /tmp/notebook_code.txt
Repository: SeeYangZhi/Project-LLMao
Length of output: 172
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 'Val (best checkpoint)' in line or 'Test (best checkpoint)' in line:
print(f"Line {line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 178
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 320 <= line_num <= 335:
print(f"{line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 985
🌐 Web query:
sklearn classification_report target_names labels parameter ValueError imbalanced
💡 Result:
In sklearn.metrics.classification_report, target_names must be the same length and order as the labels that the report is computed over. If you don’t pass labels=..., scikit-learn infers the label set from the classes that actually appear in y_true and/or y_pred (so in imbalanced splits, some classes may be missing), which commonly makes len(target_names) not match and triggers:
ValueError: Number of classes, X, does not match size of target_names, Y. Try specifying the labels parameter (sklearn.org)
ValueError: Number of classes, X, does not match size of target_names, Y. Try specifying the labels parameter(sklearn.org)
Fix (robust for imbalanced / missing classes)
Pass both labels (all classes you want shown, in a fixed order) and matching target_names:
from sklearn.metrics import classification_report
labels = list(le.classes_)        # or [0,1,2,...] in the order you want
target_names = list(le.classes_)  # same length + same order as labels
print(classification_report(
    y_true, y_pred,
    labels=labels,
    target_names=target_names,
    zero_division=0
))
- labels controls which classes are included. (sklearn.org)
- target_names must “match the labels (same order)” and have shape (n_labels,). (sklearn.org)
Alternative
If you don’t need custom names, omit target_names entirely and let scikit-learn print the raw label values. (sklearn.org)
Citations:
- 1: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai
- 2: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai
- 3: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai
- 4: https://sklearn.org/stable/modules/generated/sklearn.metrics.classification_report.html?utm_source=openai
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 'label_names' in line and ('=' in line or 'def' in line):
print(f"{line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 1260
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 'STRATEGY_LABELS' in line:
print(f"{line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 324
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 300 <= line_num <= 380:
print(f"{line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 4851
🏁 Script executed:
python3 << 'EOF'
import json
with open('./notebooks/04_bert_classification.ipynb', 'r') as f:
nb = json.load(f)
line_num = 1
for cell in nb['cells']:
if cell['cell_type'] == 'code':
source = cell['source']
if isinstance(source, list):
source_text = ''.join(source)
else:
source_text = source
lines = source_text.split('\n')
for line in lines:
if 'train_type' in line or 'val_type' in line or 'test_type' in line or 'stratif' in line.lower():
print(f"{line_num}: {line}")
line_num += 1
EOF
Repository: SeeYangZhi/Project-LLMao
Length of output: 1145
Pass explicit label IDs to classification_report for stable multi-class eval.
Lines 328 and 330 may raise ValueError when a class is absent in y_true/y_pred (common in imbalanced splits) because target_names is set without labels.
Suggested fix
+ label_ids = list(range(len(label_names)))
print("\n=== Val (best checkpoint) ===")
- print(classification_report(y_val, val_preds, target_names=label_names, zero_division=0))
+ print(classification_report(
+ y_val, val_preds, labels=label_ids, target_names=label_names, zero_division=0
+ ))
print("\n=== Test (best checkpoint) ===")
- print(classification_report(y_test, test_preds, target_names=label_names, zero_division=0))
+ print(classification_report(
+ y_test, test_preds, labels=label_ids, target_names=label_names, zero_division=0
+ ))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/04_bert_classification.ipynb` around lines 339 - 342, The
classification_report calls for validation and test (the calls using
classification_report(y_val, val_preds, target_names=label_names,
zero_division=0) and classification_report(y_test, test_preds,
target_names=label_names, zero_division=0)) can raise ValueError when some
classes are missing; fix by passing an explicit labels list to ensure stable
multi-class evaluation—compute the label id list (e.g., label_ids =
list(range(len(label_names))) or derive from your label-to-id mapping) and add
labels=label_ids to both classification_report calls so target_names aligns with
the provided labels.
| "def enc(df): return [label2id[l] for l in df[\"type_label\"]]\n", | ||
| "\n", | ||
| "X_train_type = train_type[\"text\"].tolist(); y_train_type = enc(train_type)\n", | ||
| "X_val_type = val_type[\"text\"].tolist(); y_val_type = enc(val_type)\n", | ||
| "X_test_type = test_type[\"text\"].tolist(); y_test_type = enc(test_type)\n", |
Validate split labels before encoding to prevent KeyError.
Line 425 directly indexes label2id for val/test labels. If any label exists outside train labels, the notebook fails mid-run.
Suggested fix
-def enc(df): return [label2id[l] for l in df["type_label"]]
+def enc(df: pd.DataFrame, split_name: str) -> list[int]:
+ unknown = sorted(set(df["type_label"]) - set(label2id))
+ if unknown:
+ raise ValueError(f"Unknown labels in {split_name} split: {unknown}")
+ return [label2id[label] for label in df["type_label"]]
-X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type)
-X_val_type = val_type["text"].tolist(); y_val_type = enc(val_type)
-X_test_type = test_type["text"].tolist(); y_test_type = enc(test_type)
+X_train_type = train_type["text"].tolist(); y_train_type = enc(train_type, "train")
+X_val_type = val_type["text"].tolist(); y_val_type = enc(val_type, "val")
+X_test_type = test_type["text"].tolist(); y_test_type = enc(test_type, "test")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/04_bert_classification.ipynb` around lines 425 - 429, The encoder
function enc currently does direct lookup into label2id which will raise
KeyError if val/test contain labels unseen in train; before calling enc for
X_val_type/X_test_type validate that set(val_type["type_label"]) and
set(test_type["type_label"]) are subsets of label2id.keys(), and if not either
(a) raise a clear error listing unknown labels so the user can fix the splits,
or (b) map unknown labels to a default fallback id (e.g., label2id.get(label,
unknown_id)); update usages of enc and the variables X_val_type, y_val_type,
X_test_type, y_test_type accordingly and mention label2id and enc in the check
and handling logic.
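For option (b) in the prompt above, a fallback-id encoder might look like the following sketch. The function name `encode_with_fallback`, the example mapping, and the choice of `"other"` as the fallback class are hypothetical; the sketch also assumes `unknown_id` is a class index the model head already has, since mapping to an index outside the trained classes would fail later.

```python
from typing import Dict, Iterable, List, Tuple


def encode_with_fallback(
    labels: Iterable[str],
    label2id: Dict[str, int],
    unknown_id: int,
) -> Tuple[List[int], List[str]]:
    """Encode labels, mapping unseen ones to unknown_id.

    Also returns the sorted list of unseen labels so the caller can log
    them instead of absorbing them silently.
    """
    labels = list(labels)
    unseen = sorted({label for label in labels if label not in label2id})
    ids = [label2id.get(label, unknown_id) for label in labels]
    return ids, unseen


# Hypothetical mapping and split labels, for illustration only.
label2id = {"irony": 0, "hyperbole": 1, "other": 2}
ids, unseen = encode_with_fallback(
    ["irony", "satire", "hyperbole"], label2id, unknown_id=label2id["other"]
)
print(ids, unseen)
```

Option (a), raising with the list of unknown labels, is the safer default for an evaluation notebook because a fallback changes the ground truth being scored; the fallback is mainly useful when an explicit catch-all class exists.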
| " for _, row in error_sample_type.head(15).iterrows():\n", | ||
| " true_l = row[true_col_type]\n", | ||
| " pred_l = row.get(pred_col_type, \"?\")\n", | ||
| " print(f\" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}\")" |
Cast labels to strings before :20s formatting.
Line 336 can crash with ValueError when pred_col_type falls back to the numeric ID column (predicted), because the :20s format spec only accepts strings.
Suggested fix
- for _, row in error_sample_type.head(15).iterrows():
- true_l = row[true_col_type]
- pred_l = row.get(pred_col_type, "?")
- print(f" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}")
+ for _, row in error_sample_type.head(15).iterrows():
+ true_l = str(row[true_col_type])
+ pred_l = str(row.get(pred_col_type, "?"))
+ print(f" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| " for _, row in error_sample_type.head(15).iterrows():\n", | |
| " true_l = row[true_col_type]\n", | |
| " pred_l = row.get(pred_col_type, \"?\")\n", | |
| " print(f\" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}\")" | |
| for _, row in error_sample_type.head(15).iterrows(): | |
| true_l = str(row[true_col_type]) | |
| pred_l = str(row.get(pred_col_type, "?")) | |
| print(f" [True={true_l:20s} Pred={pred_l:20s}] {row['text'][:90]}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@notebooks/05_error_analysis.ipynb` around lines 333 - 336, The formatting
expression in the loop over error_sample_type uses :20s on values that can be
numeric, causing ValueError; in the loop that reads true_l = row[true_col_type]
and pred_l = row.get(pred_col_type, "?"), cast both true_l and pred_l to strings
before applying the width format (e.g., convert via str(...) or f-string !s
conversion) so the print call print(f" [True={...:20s} Pred={...:20s}]
{row['text'][:90]}") never receives non-string types.
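The `!s` conversion mentioned in the prompt is an alternative to wrapping each value in `str(...)`. A minimal sketch follows; the helper name `format_error_row` and its `width` parameter are made up for illustration and do not exist in the notebook.

```python
def format_error_row(true_label: object, pred_label: object, text: str, width: int = 20) -> str:
    """Format one error row; `!s` applies str() before the width spec is used."""
    return f"  [True={true_label!s:{width}s} Pred={pred_label!s:{width}s}] {text[:90]}"


# Works for string labels and for numeric IDs alike.
print(format_error_row("irony", 3, "example headline"))
```

Because the conversion runs before the format spec, integer IDs are padded like any other string instead of raising `ValueError`.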
Force-pushed from a51c032 to 1ac8cca
Additional Notes:
Main change is in colab_package/
You can ignore notebooks/ as they are the source of truth for the colab_package/.
Closes #3
Summary by CodeRabbit
New Features
Documentation
Chores