Expand ground truth to 76 images; analyze P/R/F vs set size and data groupings#77
Open
esma-kr wants to merge 1 commit into
…oupings

Ground truth:
- Hand-label the remaining 66 cropped_tags images (was 10) in the TagSchema shape, marking absent fields null/[]. Scorer self-check passes (321 TP).

Tooling:
- scripts/analyze_eval.py: learning curve (deterministic cumulative + Monte-Carlo mean/std over random subsets) and per-group P/R/F for natural groupings (content type, material complexity, language, fiber tier).
- scripts/tag.py: record NO_TAG_DETECTED / errors in --json output instead of dropping them, so predictions.json has one row per image and the scorer counts them as misses.

Findings (Gemini 2.5 Pro, docs/evaluation-findings.md + docs/eval-results.json):
- Whole-set overall F1 0.898 (country 0.976, materials 0.951, care 0.821).
- Growing the set barely moves the mean F1 (~0.89) but collapses its variance (std 0.11 at n=5 -> 0.03 at n=30); the original 10 overstated performance.
- Groups differ sharply: full tags 0.97 vs care-only crops 0.60.
- Dominant failure: 12 care-only crops return NO_TAG_DETECTED, causing ~92% of all care recall loss (care recall 0.745 end-to-end vs 0.973 when detected).
- Smaller errors: hallucination on low-contrast woven labels, country over-inference from vendor codes, gentle/cold enum near-misses, and a flat materials schema that penalizes multi-part garments.

Commit predictions.json / predictions.complete.json for reproducibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
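To illustrate why keeping one row per image matters for the scorer, here is a minimal sketch. It is not the code in scripts/tag.py or the project's scorer: the row keys (`image`, `status`, `care`), the care label strings, and the function names are all hypothetical, chosen only to show how a NO_TAG_DETECTED row turns into false negatives instead of silently disappearing.

```python
from typing import Dict, List, Tuple

NO_TAG = "NO_TAG_DETECTED"  # sentinel named in the commit message


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def score_care(truth: Dict[str, List[str]], rows: List[dict]) -> Tuple[float, float, float]:
    """Score the care field over prediction rows.

    Assumption: the scorer walks the prediction rows, so an image that
    has no row contributes nothing, while a NO_TAG_DETECTED row counts
    every gold label for that image as a miss.
    """
    tp = fp = fn = 0
    for row in rows:
        gold = set(truth[row["image"]])
        pred = set() if row.get("status") == NO_TAG else set(row.get("care", []))
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    return prf(tp, fp, fn)


# Hypothetical two-image example.
truth = {
    "a.jpg": ["machine_wash_cold", "do_not_bleach"],
    "b.jpg": ["tumble_dry_low"],
}
rows_dropped = [{"image": "a.jpg", "care": ["machine_wash_cold", "do_not_bleach"]}]
rows_complete = rows_dropped + [{"image": "b.jpg", "status": NO_TAG}]

recall_dropped = score_care(truth, rows_dropped)[1]
recall_complete = score_care(truth, rows_complete)[1]
```

With the undetected image dropped, recall looks perfect; with its row recorded, the missed gold label lowers recall. That is the same effect as the end-to-end vs when-detected gap reported in the findings.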
Builds on the merged evaluation framework (#76). Grows the ground-truth set and analyzes how scores behave as the set grows and across natural groupings — answering: as more ground truth is added, how do P/R/F change, and do natural groupings score similarly?
What's here
- Ground truth: hand-labeled the remaining 66 `cropped_tags` images (was 10) in the `TagSchema` shape, marking absent fields `null`/`[]`. Scorer self-check passes (321 TP, all fields 1.000).
- `scripts/analyze_eval.py`: learning curve (deterministic cumulative + Monte-Carlo mean/std over random subsets) and per-group P/R/F for four natural groupings.
- `scripts/tag.py`: records `NO_TAG_DETECTED`/errors in `--json` output instead of dropping them, so `predictions.json` has one row per image and the scorer counts them as misses.
- `docs/evaluation-findings.md` + `docs/eval-results.json`: the write-up and raw results.
- `predictions.json` / `predictions.complete.json` committed for reproducibility.

Findings (Gemini 2.5 Pro)
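- Whole-set overall F1 is 0.898 (country 0.976, materials 0.951, care 0.821). The weakest field is `care`, almost entirely on recall (0.745).
- Growing the set barely moves the mean F1 (~0.89) but collapses its variance (std 0.11 at n=5 -> 0.03 at n=30); the original 10 images overstated performance.
- Groups differ sharply: full tags 0.97 vs care-only crops 0.60.
- Dominant failure: 12 care-only crops return `NO_TAG_DETECTED`, causing ~92% of all care recall loss. Care recall is 0.745 end-to-end vs 0.973 when a tag is detected. It's a detection-gate problem, not extraction.
- Smaller errors: hallucination on low-contrast woven labels, `country` over-inference from vendor codes (e.g. `VN…`), `gentle`/`cold` enum near-misses, and a flat materials schema that penalizes multi-part garments.

Recommendations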
Loosen the tag-detection gate for care-only crops (recovers care recall to ~0.97, lifts overall to ~0.95); stop inferring country from vendor/RN codes; add garment-part structure to the materials schema; treat n≥30 as the minimum for a trustworthy number.
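The n≥30 guidance rests on the Monte-Carlo learning curve. A rough sketch of that kind of estimate follows. It is an illustration under stated assumptions, not the code in `scripts/analyze_eval.py`: it assumes micro-averaged F1, subsets sampled without replacement, and per-image `(tp, fp, fn)` counts. The function names and the toy counts are made up.

```python
import random
import statistics
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (tp, fp, fn) for one image


def micro_f1(counts: List[Counts]) -> float:
    """Micro-averaged F1 over a list of per-image counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_f1_stats(per_image: List[Counts], n: int,
                    trials: int = 500, seed: int = 0) -> Tuple[float, float]:
    """Mean and std of F1 over `trials` random subsets of size n."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    scores = [micro_f1(rng.sample(per_image, n)) for _ in range(trials)]
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy data only: a mix of clean images and one total miss.
per_image = [(5, 0, 0), (5, 0, 0), (4, 1, 1), (0, 0, 5), (3, 0, 2), (5, 1, 0)]

mean_small, std_small = subset_f1_stats(per_image, n=2)
mean_full, std_full = subset_f1_stats(per_image, n=len(per_image))
```

Small subsets swing depending on whether they happen to include the hard images, so their std is large. At the full set size every sample is the same set and the std drops to zero. Sweeping `n` from 5 upward on the real per-image counts would produce the kind of curve summarized in the findings.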
🤖 Generated with Claude Code