Expand ground truth to 76 images; analyze P/R/F vs set size and data groupings #77

Open: esma-kr wants to merge 1 commit into benevolentbandwidth:main from esma-kr:eval-ground-truth-analysis

@esma-kr esma-kr commented Jun 10, 2026

Contributor

Builds on the merged evaluation framework (#76). This PR grows the ground-truth set and analyzes how scores behave as the set grows and across natural groupings. It answers two questions: as more ground truth is added, how do precision, recall, and F1 (P/R/F) change, and do natural groupings score similarly?

What's here

  • Ground truth: 10 → 76. Hand-labeled the remaining 66 cropped_tags images in the TagSchema shape, marking absent fields null/[]. Scorer self-check passes (321 TP, all fields 1.000).
  • scripts/analyze_eval.py — learning curve (deterministic cumulative + Monte-Carlo mean/std over random subsets) and per-group P/R/F for four natural groupings (content type, material complexity, language, fiber tier).
  • scripts/tag.py — records NO_TAG_DETECTED/errors in --json output instead of dropping them, so predictions.json has one row per image and the scorer counts them as misses.
  • docs/evaluation-findings.md + docs/eval-results.json — the write-up and raw results. predictions.json / predictions.complete.json committed for reproducibility.
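As a rough illustration of the Monte-Carlo learning curve described above, here is a minimal sketch. It is not the code in scripts/analyze_eval.py: the function names, the per-image `(tp, fp, fn)` row shape, the micro-averaged F1, and the toy data are all assumptions made for the example.

```python
import random
import statistics


def micro_f1(rows):
    """Micro-averaged F1 from per-image (tp, fp, fn) counts (assumed row shape)."""
    tp = sum(r[0] for r in rows)
    fp = sum(r[1] for r in rows)
    fn = sum(r[2] for r in rows)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def learning_curve(rows, sizes, trials=200, seed=0):
    """For each subset size n, draw `trials` random subsets (without
    replacement) and return {n: (mean F1, population std of F1)}."""
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    curve = {}
    for n in sizes:
        scores = [micro_f1(rng.sample(rows, n)) for _ in range(trials)]
        curve[n] = (statistics.mean(scores), statistics.pstdev(scores))
    return curve


# Toy per-image counts (tp, fp, fn); illustrative only, not the PR's data.
toy_rows = [(5, 0, 0)] * 8 + [(3, 1, 2)] * 3 + [(0, 0, 4)] * 1

curve = learning_curve(toy_rows, sizes=[3, 6, len(toy_rows)])
for n, (mean, std) in curve.items():
    print(f"n={n:2d}  mean F1={mean:.3f}  std={std:.3f}")
```

In this sketch the spread is zero at the largest size by construction: a subset as large as the whole set is the whole set, so every trial scores the same.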

Findings (Gemini 2.5 Pro)

  • Whole set: overall F1 0.898 (country 0.976, materials 0.951, care 0.821). The weak field is care, almost entirely on recall (0.745).
  • As the set grows: the mean F1 barely moves (~0.89), but its variance collapses (std 0.11 at n=5 → 0.03 at n=30 → 0 at n=76, where every subset is the full set). The original 10 happened to be clean tags and overstated performance; the real value of more ground truth is a trustworthy number, not a different one.
  • Groupings differ sharply: full tags 0.97 vs care-only crops 0.60; multilingual 0.97 vs monolingual 0.87 (a proxy for content type). The exception is fiber tier, where premium and commodity fibers score the same (~0.90).
  • Dominant failure mode: 12 care-only crops return NO_TAG_DETECTED, causing ~92% of all care recall loss — care recall is 0.745 end-to-end vs 0.973 when a tag is detected. It's a detection-gate problem, not extraction.
  • Smaller error patterns: hallucination on low-contrast woven labels, country over-inference from vendor codes (e.g. VN…), gentle/cold enum near-misses, and a flat materials schema that penalizes multi-part garments.
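The per-group comparison above can be sketched as pooling true positives, false positives, and false negatives within each group and computing P/R/F on the pooled counts. This is an assumed approach for illustration, not necessarily how scripts/analyze_eval.py does it; the row fields (`content`, `tp`, `fp`, `fn`) and the toy values are hypothetical.

```python
from collections import defaultdict


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def score_by_group(rows, key):
    """Pool counts per group value, then compute (P, R, F) for each group."""
    totals = defaultdict(lambda: [0, 0, 0])
    for row in rows:
        t = totals[row[key]]
        t[0] += row["tp"]
        t[1] += row["fp"]
        t[2] += row["fn"]
    return {group: prf(*t) for group, t in totals.items()}


# Toy rows; illustrative only, not the PR's data.
toy = [
    {"content": "full_tag", "tp": 5, "fp": 0, "fn": 0},
    {"content": "full_tag", "tp": 4, "fp": 1, "fn": 0},
    {"content": "care_only", "tp": 3, "fp": 0, "fn": 1},
    # A row where nothing was extracted: every labeled value counts as a miss.
    {"content": "care_only", "tp": 0, "fp": 0, "fn": 4},
]

groups = score_by_group(toy, "content")
for name, (p, r, f) in groups.items():
    print(f"{name:10s} P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Pooling counts before dividing weights each group's score by how many labeled values it contains, so a single empty prediction in a small group can pull that group's recall down sharply.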

Recommendations

  • Loosen the tag-detection gate for care-only crops (recovers care recall to ~0.97, lifts overall to ~0.95).
  • Stop inferring country from vendor/RN codes.
  • Add garment-part structure to the materials schema.
  • Treat n≥30 as the minimum for a trustworthy number.
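To separate detection-gate losses from extraction losses, recall can be computed twice: once over all images and once over only the images where a tag was detected. The sketch below assumes each prediction row carries a hypothetical `status` field plus `tp`/`fn` counts; only the `NO_TAG_DETECTED` value comes from this PR, and the toy rows are made up.

```python
def recall(rows):
    """Recall pooled over rows; 0.0 when there is nothing to recall."""
    tp = sum(r["tp"] for r in rows)
    fn = sum(r["fn"] for r in rows)
    return tp / (tp + fn) if tp + fn else 0.0


def gate_breakdown(rows):
    """Compare end-to-end recall with recall on detected rows only, and
    report what share of all misses comes from NO_TAG_DETECTED rows."""
    detected = [r for r in rows if r["status"] != "NO_TAG_DETECTED"]
    gated = [r for r in rows if r["status"] == "NO_TAG_DETECTED"]
    total_fn = sum(r["fn"] for r in rows)
    gate_fn = sum(r["fn"] for r in gated)
    return {
        "end_to_end_recall": recall(rows),
        "detected_only_recall": recall(detected),
        "share_of_loss_from_gate": gate_fn / total_fn if total_fn else 0.0,
    }


# Toy rows; illustrative only, not the PR's data.
toy = [
    {"status": "OK", "tp": 4, "fn": 0},
    {"status": "OK", "tp": 3, "fn": 1},
    {"status": "NO_TAG_DETECTED", "tp": 0, "fn": 3},
]

report = gate_breakdown(toy)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A large gap between the two recall figures, together with a high gate share, points at the detection step rather than at field extraction.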

🤖 Generated with Claude Code

…oupings

Ground truth:
- Hand-label the remaining 66 cropped_tags images (was 10) in the TagSchema
  shape, marking absent fields null/[]. Scorer self-check passes (321 TP).

Tooling:
- scripts/analyze_eval.py: learning curve (deterministic cumulative +
  Monte-Carlo mean/std over random subsets) and per-group P/R/F for natural
  groupings (content type, material complexity, language, fiber tier).
- scripts/tag.py: record NO_TAG_DETECTED / errors in --json output instead of
  dropping them, so predictions.json has one row per image and the scorer
  counts them as misses.

Findings (Gemini 2.5 Pro, docs/evaluation-findings.md + docs/eval-results.json):
- Whole-set overall F1 0.898 (country 0.976, materials 0.951, care 0.821).
- Growing the set barely moves the mean F1 (~0.89) but collapses its variance
  (std 0.11 at n=5 -> 0.03 at n=30); the original 10 overstated performance.
- Groups differ sharply: full tags 0.97 vs care-only crops 0.60.
- Dominant failure: 12 care-only crops return NO_TAG_DETECTED, causing ~92% of
  all care recall loss (care recall 0.745 end-to-end vs 0.973 when detected).
- Smaller errors: hallucination on low-contrast woven labels, country
  over-inference from vendor codes, gentle/cold enum near-misses, and a flat
  materials schema that penalizes multi-part garments.

Commit predictions.json / predictions.complete.json for reproducibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>