Skip to content

T038 + T039: OTEL span instrumentation and per-adapter READMEs#38

Open
likith1908 wants to merge 3 commits into
mainfrom
T038-T039-otel-spans-readmes
Open

T038 + T039: OTEL span instrumentation and per-adapter READMEs#38
likith1908 wants to merge 3 commits into
mainfrom
T038-T039-otel-spans-readmes

Conversation

@likith1908

Copy link
Copy Markdown
Contributor

Summary

Implements T038 (OTEL span instrumentation) and T039 (per-adapter READMEs) as a combined PR on top of T037.

  • iris_engine/ocr/tracing.py (new): shared async context manager instrument_extract wrapping every adapter's extract() call. Records span attributes from US6, emits structured log lines on success and on each typed error. A single central module keeps all adapters consistent.
  • All five adapter extract() methods (adi, datalab, paddleocr, local, in-memory) wrapped with instrument_extract. Result attributes (ocr.total_pages, ocr.total_latency_ms, ocr.success) set after the inner await so they reflect the real outcome.
  • opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20 added to root dev deps for the test exporter only.
  • tests/contract/conftest.py (new): module-level InMemorySpanExporter + TracerProvider wired up once; span_exporter fixture clears and yields it per test.
  • C-OCR-011 skip placeholder replaced with a real parametrised test verifying span name, tenant_id, adapter_id, document_id, and ocr.success=True for all five adapters.
  • Four new README.md files (one per adapter package) covering prerequisites, environment variables, running tests, how each adapter works, and known limitations. PaddleOCR README includes a full airgapped deployment section with step-by-step instructions for pre-seeding PP-DocLayoutV3 and the doc-preprocessor sub-model.

Task reference

  • Task IDs: T038, T039
  • Workstream: 003-ocr-adapter-set

Acceptance criteria

T038

  • iris_engine/ocr/tracing.py - shared instrument_extract CM and log_extract_success helper
  • All five adapters emit span with attributes: ocr.adapter_id, ocr.tenant_id, ocr.product_slug, ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms, ocr.success, ocr.error_category (on error)
  • opentelemetry-api added to iris-engine deps; opentelemetry-sdk added to root dev deps
  • C-OCR-011 passes for all 5 adapters under make test

T039

  • ocr-adi/README.md - prerequisites, env vars, running tests, how it works, limitations
  • ocr-datalab/README.md - prerequisites, env vars, running tests, how it works, limitations
  • ocr-local/README.md - prerequisites (system binary + language packs for all platforms), env vars, running tests, how it works, limitations
  • ocr-paddleocr/README.md - prerequisites (CPU + GPU wheel), model download, full airgapped deployment (3-step), env vars, running tests, how it works, limitations

PR checklist

  • Every acceptance criterion in the task's tasks.md entry is satisfied.
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Notes for the reviewer

Deviations from the original task acceptance (flagged for awareness)

T038 - InMemoryOCREngine.extract() now wrapped by instrument_extract:
The original T038 acceptance listed only the four real adapters. InMemoryOCREngine was also wrapped so that C-OCR-011 achieves full five-adapter coverage matching every other contract clause. The in-memory engine is used as a fixture in integration and agent tests, so having it emit a real span is useful for those callers too.

T039 - PaddleOCR README airgapped section added (originally T035 note):
T035 tasks.md noted: "Full airgap requires pre-seeding; document in T039." The README covers PADDLEX_HOME to redirect the PaddleX sub-model cache and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK to skip the ModelScope connectivity check at startup - both required for a truly airgapped host. The adapter currently wires only vl_rec_model_dir (via IRIS_PADDLEOCR_OFFLINE + IRIS_PADDLEOCR_MODEL_PATH); wiring layout_detection_model_dir as a separate env var is a follow-on task for production airgapped support.

get_tracer called inside CM, not at module level

trace.get_tracer("iris.ocr", "1.0.0") is called inside instrument_extract each time rather than cached at import time. This ensures tests that configure the SDK TracerProvider before the first call get the real SDK tracer, not the no-op tracer captured at import. The cost is negligible (tracer lookup is a dict get in the provider).

IRIS_ADI_MODEL env var

Documented in the ADI README with a note that it is not yet read by the adapter directly - it will be wired through the selector/factory layer in a future workstream.

DATALAB_BASE_URL removed

Was in a previous draft of the Datalab README. Removed because nothing in the stack reads it (not in any .py file and not in .env.example).

Rebase note

This PR is stacked on #37. Once #37 merges, rebase onto main before merging this one.

@likith1908 likith1908 marked this pull request as ready for review June 16, 2026 10:06
@likith1908 likith1908 requested a review from anmolg1997 as a code owner June 16, 2026 10:06
anmolg1997
anmolg1997 previously approved these changes Jun 17, 2026

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Clean close-out of WS003 — T038 and T039 both done, and the contract suite's C-OCR-011 placeholder is now a real test.

What I verified

T038 — OTEL instrumentation (iris_engine/ocr/tracing.py)

  • Single shared async CM instrument_extract wrapping all five adapters — the right call. One central module means every adapter emits identically-shaped spans; no per-adapter drift in attribute names.
  • Span attributes match US6: adapter_id, tenant_id, product_slug, document_id, content_type set up front; total_pages / total_latency_ms / success set after the inner await so they reflect the real outcome (not optimistic pre-values).
  • Error path is thorough: typed OCRErrorocr.success=False + ocr.error_category=<ExceptionName> + record_exception + StatusCode.ERROR + a structured warning log. A separate except Exception catches the unexpected case with error_category="UnexpectedError". Both re-raise — instrumentation never swallows.
  • opentelemetry-api as an iris-engine dep, opentelemetry-sdk in root dev only (test exporter). Correct split — the SDK/exporter is a test/runtime-wiring concern, the API is what library code codes against.
  • tests/contract/conftest.py wires an InMemorySpanExporter once and the span_exporter fixture clears per test. C-OCR-011 is now a real parametrised assertion across all 5 adapters: span name, tenant_id, adapter_id, document_id, success=True. The skip is gone.

T039 — four adapter READMEs

  • One per adapter: prerequisites, env vars, running tests, how it works, limitations. The PaddleOCR README's airgapped section (pre-seeding PP-DocLayoutV3 + doc-preprocessor) is exactly the gap I flagged on #35 — it's now documented where an ops person will find it. Good follow-through.

Sensible scope deviation, well-flagged

  • Wrapping InMemoryOCREngine too (original T038 listed only the 4 real adapters) so C-OCR-011 gets full 5-adapter coverage. Agreed — the in-memory engine is a fixture for integration/agent tests, so a real span there is useful downstream. Flagged in the PR body rather than done silently.

Gates

  • Contract suite: 71 passed, 0 skipped (was 66 + 1 skip — the +5 is C-OCR-011 across 5 adapters now live).
  • Full suite: 373 passed. lint: 3 contracts kept, 0 broken (iris_engine→opentelemetry is an external dep, layers intact). mypy clean on 19 files.

One optional style note (not blocking)

The adapters do from iris_engine.ocr.tracing import instrument_extract, log_extract_success as a function-local import inside extract(). Since opentelemetry-api is now a hard iris-engine dependency, a module-top import would work and read more conventionally. Function-local is harmless (avoids any import-time cost, no cycle), just slightly unusual — fine to leave or hoist whenever you next touch these files.

Status

WS003 is complete: T030-T039 all done. With #35/#36/#37 already approved ahead of this, the OCR workstream merges as a clean six-commit-deep stack. Onto #39 (T040, LLMProvider Protocol) and #40 (T041+T042) — the start of WS004.

@likith1908 likith1908 force-pushed the T037-ocr-contract-suite branch from 193ab70 to b6c54b9 Compare June 17, 2026 10:49
@likith1908

Copy link
Copy Markdown
Contributor Author

Hoisted the instrument_extract / log_extract_success imports to module level in all 5 files (adi, datalab, local, paddleocr, in_memory). Since opentelemetry-api is now a hard iris-engine dep there is no reason to keep them function-local. All hooks green, 372 passed.

@likith1908 likith1908 changed the base branch from T037-ocr-contract-suite to main June 17, 2026 13:22
@likith1908 likith1908 dismissed anmolg1997’s stale review June 17, 2026 13:22

The base branch was changed.

Adds iris_engine.ocr.tracing with instrument_extract() async context manager
and log_extract_success(). Every adapter's extract() is wrapped: span name
ocr.extract, attributes ocr.adapter_id, ocr.tenant_id, ocr.product_slug,
ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms,
ocr.success. OCRError subtypes recorded with ocr.error_category; unexpected
exceptions re-raised with StatusCode.ERROR.

InMemoryOCREngine also wrapped so C-OCR-011 covers all 5 adapters.
conftest.py wires an InMemorySpanExporter at module level; span_exporter
fixture clears it per test. C-OCR-011 parametrised across all 5 adapters.

opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20
added to root dev deps. mypy ignore_missing_imports added for opentelemetry
(no upstream type stubs).
Four new README files - one per OCR adapter - documenting prerequisites,
environment variables, running tests, how each adapter works, and known
limitations. PaddleOCR README includes a full airgapped deployment section
covering PADDLEX_HOME and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK for
environments without internet access.
@likith1908 likith1908 force-pushed the T038-T039-otel-spans-readmes branch from 9ee71d9 to e4c4eb4 Compare June 17, 2026 13:23
@likith1908

Copy link
Copy Markdown
Contributor Author

Rebased onto main - 3 commits (T038 OTEL instrumentation, T039 READMEs, T038 tracing import hoist). Please approve to squash merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants