Expand ground truth to 76 images; analyze P/R/F vs set size and data groupings by esma-kr · Pull Request #77 · benevolentbandwidth/ecotag

esma-kr · 2026-06-10T13:45:59Z

Builds on the merged evaluation framework (#76). Grows the ground-truth set and analyzes how scores behave as the set grows and across natural groupings. It answers two questions: as more ground truth is added, how do precision, recall, and F1 (P/R/F) change, and do natural groupings score similarly?

What's here

Ground truth: 10 → 76. Hand-labeled the remaining 66 cropped_tags images in the TagSchema shape, marking absent fields null/[]. Scorer self-check passes (321 TP, all fields 1.000).
scripts/analyze_eval.py — learning curve (deterministic cumulative + Monte-Carlo mean/std over random subsets) and per-group P/R/F for four natural groupings.
scripts/tag.py — records NO_TAG_DETECTED/errors in --json output instead of dropping them, so predictions.json has one row per image and the scorer counts them as misses.
docs/evaluation-findings.md + docs/eval-results.json — the write-up and raw results. predictions.json / predictions.complete.json committed for reproducibility.
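
A minimal sketch of the two learning curves described above (deterministic cumulative, and Monte-Carlo mean/std over random subsets). This is not the code in scripts/analyze_eval.py: the function names, the per-image `(tp, fp, fn)` representation, the trial count, and the toy data are all assumptions for illustration, and subsets are assumed to be drawn without replacement.

```python
import random
import statistics
from typing import Dict, List, Tuple

# Hypothetical per-image scorer output: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def micro_f1(counts: List[Counts]) -> float:
    """Micro-averaged F1 over a list of per-image counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cumulative_curve(counts: List[Counts], sizes: List[int]) -> Dict[int, float]:
    """Deterministic curve: score the first n images in their given order."""
    return {n: micro_f1(counts[:n]) for n in sizes}


def monte_carlo_curve(
    counts: List[Counts], sizes: List[int], trials: int = 200, seed: int = 0
) -> Dict[int, Tuple[float, float]]:
    """Mean and population std of F1 over random size-n subsets (no replacement)."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        scores = [micro_f1(rng.sample(counts, n)) for _ in range(trials)]
        curve[n] = (statistics.mean(scores), statistics.pstdev(scores))
    return curve


# Toy data (made up, not the PR's results): 16 clean images and
# 4 images where nothing was predicted, so every label is a miss.
toy = [(4, 0, 0)] * 16 + [(0, 0, 4)] * 4
sizes = [5, 10, len(toy)]
cumulative = cumulative_curve(toy, sizes)
curve = monte_carlo_curve(toy, sizes)
for n in sizes:
    mean, std = curve[n]
    print(f"n={n:2d}  cumulative={cumulative[n]:.3f}  mean={mean:.3f}  std={std:.3f}")
```

With sampling without replacement, the std at the full set size is zero by construction, because every draw is the same set of images. The cumulative curve depends on image order, which is why a clean first batch can overstate the score.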

Findings (Gemini 2.5 Pro)

Whole set: overall F1 0.898 (country 0.976, materials 0.951, care 0.821). The weak field is care, almost entirely on recall (0.745).
As the set grows: the mean F1 barely moves (~0.89), but its variance collapses (std 0.11 at n=5 → 0.03 at n=30 → 0 at n=76, where every subset is the full set). The original 10 happened to be clean tags and overstated performance, so the real value of more ground truth is a trustworthy number, not a different one.
Groupings do not score alike: full tags 0.97 vs care-only crops 0.60; multilingual 0.97 vs monolingual 0.87 (a proxy for content type). Only fiber tier makes no difference: premium vs commodity fiber score the same (~0.90).
Dominant failure mode: 12 care-only crops return NO_TAG_DETECTED, causing ~92% of all care recall loss — care recall is 0.745 end-to-end vs 0.973 when a tag is detected. It's a detection-gate problem, not extraction.
Smaller error patterns: hallucination on low-contrast woven labels, country over-inference from vendor codes (e.g. VN…), gentle/cold enum near-misses, and a flat materials schema that penalizes multi-part garments.
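
The ~92% figure can be sanity-checked from the two care recall numbers alone. The sketch below is a back-of-the-envelope check, not the analysis script: the function name is hypothetical, and it assumes that images returning NO_TAG_DETECTED contribute zero true positives (consistent with the scorer counting them as misses) and that 0.973 is the recall over all remaining images.

```python
def gate_loss_share(recall_end_to_end: float, recall_when_detected: float) -> float:
    """Fraction of total recall loss attributable to images with no detection.

    Let G be all ground-truth labels and G_d those on detected images.
    All true positives come from detected images, so
        recall_end_to_end * G = recall_when_detected * G_d.
    """
    detected_label_share = recall_end_to_end / recall_when_detected  # G_d / G
    undetected_label_share = 1.0 - detected_label_share  # labels lost at the gate
    total_loss = 1.0 - recall_end_to_end  # all missed labels, as a share of G
    return undetected_label_share / total_loss


share = gate_loss_share(0.745, 0.973)
print(f"{share:.3f}")  # prints 0.919
```

Under those assumptions, about 23% of care labels sit on undetected images, and they account for roughly 92% of the missed care labels, which matches the reported figure.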

Recommendations

Loosen the tag-detection gate for care-only crops (recovers care recall to ~0.97, lifts overall to ~0.95); stop inferring country from vendor/RN codes; add garment-part structure to the materials schema; treat n≥30 as the minimum for a trustworthy number.

🤖 Generated with Claude Code

…oupings

Ground truth:
- Hand-label the remaining 66 cropped_tags images (was 10) in the TagSchema shape, marking absent fields null/[]. Scorer self-check passes (321 TP).

Tooling:
- scripts/analyze_eval.py: learning curve (deterministic cumulative + Monte-Carlo mean/std over random subsets) and per-group P/R/F for natural groupings (content type, material complexity, language, fiber tier).
- scripts/tag.py: record NO_TAG_DETECTED / errors in --json output instead of dropping them, so predictions.json has one row per image and the scorer counts them as misses.

Findings (Gemini 2.5 Pro, docs/evaluation-findings.md + docs/eval-results.json):
- Whole-set overall F1 0.898 (country 0.976, materials 0.951, care 0.821).
- Growing the set barely moves the mean F1 (~0.89) but collapses its variance (std 0.11 at n=5 -> 0.03 at n=30); the original 10 overstated performance.
- Groups differ sharply: full tags 0.97 vs care-only crops 0.60.
- Dominant failure: 12 care-only crops return NO_TAG_DETECTED, causing ~92% of all care recall loss (care recall 0.745 end-to-end vs 0.973 when detected).
- Smaller errors: hallucination on low-contrast woven labels, country over-inference from vendor codes, gentle/cold enum near-misses, and a flat materials schema that penalizes multi-part garments.

Commit predictions.json / predictions.complete.json for reproducibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

esma-kr wants to merge 1 commit into benevolentbandwidth:main from esma-kr:eval-ground-truth-analysis
