T038 + T039: OTEL span instrumentation and per-adapter READMEs#38
T038 + T039: OTEL span instrumentation and per-adapter READMEs#38likith1908 wants to merge 3 commits into
Conversation
anmolg1997
left a comment
There was a problem hiding this comment.
Approved. Clean close-out of WS003 — T038 and T039 both done, and the contract suite's C-OCR-011 placeholder is now a real test.
What I verified
T038 — OTEL instrumentation (iris_engine/ocr/tracing.py)
- Single shared async CM
instrument_extractwrapping all five adapters — the right call. One central module means every adapter emits identically-shaped spans; no per-adapter drift in attribute names. - Span attributes match US6: adapter_id, tenant_id, product_slug, document_id, content_type set up front; total_pages / total_latency_ms / success set after the inner await so they reflect the real outcome (not optimistic pre-values).
- Error path is thorough: typed
OCRError→ocr.success=False+ocr.error_category=<ExceptionName>+record_exception+StatusCode.ERROR+ a structured warning log. A separateexcept Exceptioncatches the unexpected case witherror_category="UnexpectedError". Both re-raise — instrumentation never swallows. opentelemetry-apias an iris-engine dep,opentelemetry-sdkin root dev only (test exporter). Correct split — the SDK/exporter is a test/runtime-wiring concern, the API is what library code codes against.tests/contract/conftest.pywires anInMemorySpanExporteronce and thespan_exporterfixture clears per test. C-OCR-011 is now a real parametrised assertion across all 5 adapters: span name, tenant_id, adapter_id, document_id, success=True. The skip is gone.
T039 — four adapter READMEs
- One per adapter: prerequisites, env vars, running tests, how it works, limitations. The PaddleOCR README's airgapped section (pre-seeding PP-DocLayoutV3 + doc-preprocessor) is exactly the gap I flagged on #35 — it's now documented where an ops person will find it. Good follow-through.
Sensible scope deviation, well-flagged
- Wrapping
InMemoryOCREnginetoo (original T038 listed only the 4 real adapters) so C-OCR-011 gets full 5-adapter coverage. Agreed — the in-memory engine is a fixture for integration/agent tests, so a real span there is useful downstream. Flagged in the PR body rather than done silently.
Gates
- Contract suite: 71 passed, 0 skipped (was 66 + 1 skip — the +5 is C-OCR-011 across 5 adapters now live).
- Full suite: 373 passed. lint: 3 contracts kept, 0 broken (iris_engine→opentelemetry is an external dep, layers intact). mypy clean on 19 files.
One optional style note (not blocking)
The adapters do from iris_engine.ocr.tracing import instrument_extract, log_extract_success as a function-local import inside extract(). Since opentelemetry-api is now a hard iris-engine dependency, a module-top import would work and read more conventionally. Function-local is harmless (avoids any import-time cost, no cycle), just slightly unusual — fine to leave or hoist whenever you next touch these files.
Status
WS003 is complete: T030-T039 all done. With #35/#36/#37 already approved ahead of this, the OCR workstream merges as a clean six-commit-deep stack. Onto #39 (T040, LLMProvider Protocol) and #40 (T041+T042) — the start of WS004.
193ab70 to
b6c54b9
Compare
|
Hoisted the |
Adds iris_engine.ocr.tracing with instrument_extract() async context manager and log_extract_success(). Every adapter's extract() is wrapped: span name ocr.extract, attributes ocr.adapter_id, ocr.tenant_id, ocr.product_slug, ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms, ocr.success. OCRError subtypes recorded with ocr.error_category; unexpected exceptions re-raised with StatusCode.ERROR. InMemoryOCREngine also wrapped so C-OCR-011 covers all 5 adapters. conftest.py wires an InMemorySpanExporter at module level; span_exporter fixture clears it per test. C-OCR-011 parametrised across all 5 adapters. opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20 added to root dev deps. mypy ignore_missing_imports added for opentelemetry (no upstream type stubs).
Four new README files - one per OCR adapter - documenting prerequisites, environment variables, running tests, how each adapter works, and known limitations. PaddleOCR README includes a full airgapped deployment section covering PADDLEX_HOME and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK for environments without internet access.
9ee71d9 to
e4c4eb4
Compare
|
Rebased onto main - 3 commits (T038 OTEL instrumentation, T039 READMEs, T038 tracing import hoist). Please approve to squash merge. |
Summary
Implements T038 (OTEL span instrumentation) and T039 (per-adapter READMEs) as a combined PR on top of T037.
iris_engine/ocr/tracing.py(new): shared async context managerinstrument_extractwrapping every adapter'sextract()call. Records span attributes from US6, emits structured log lines on success and on each typed error. A single central module keeps all adapters consistent.extract()methods (adi, datalab, paddleocr, local, in-memory) wrapped withinstrument_extract. Result attributes (ocr.total_pages,ocr.total_latency_ms,ocr.success) set after the inner await so they reflect the real outcome.opentelemetry-api>=1.20added toiris-enginedeps;opentelemetry-sdk>=1.20added to root dev deps for the test exporter only.tests/contract/conftest.py(new): module-levelInMemorySpanExporter+TracerProviderwired up once;span_exporterfixture clears and yields it per test.ocr.success=Truefor all five adapters.README.mdfiles (one per adapter package) covering prerequisites, environment variables, running tests, how each adapter works, and known limitations. PaddleOCR README includes a full airgapped deployment section with step-by-step instructions for pre-seeding PP-DocLayoutV3 and the doc-preprocessor sub-model.Task reference
T038,T039003-ocr-adapter-setAcceptance criteria
T038
iris_engine/ocr/tracing.py- sharedinstrument_extractCM andlog_extract_successhelperocr.adapter_id,ocr.tenant_id,ocr.product_slug,ocr.document_id,ocr.content_type,ocr.total_pages,ocr.total_latency_ms,ocr.success,ocr.error_category(on error)opentelemetry-apiadded toiris-enginedeps;opentelemetry-sdkadded to root dev depsmake testT039
ocr-adi/README.md- prerequisites, env vars, running tests, how it works, limitationsocr-datalab/README.md- prerequisites, env vars, running tests, how it works, limitationsocr-local/README.md- prerequisites (system binary + language packs for all platforms), env vars, running tests, how it works, limitationsocr-paddleocr/README.md- prerequisites (CPU + GPU wheel), model download, full airgapped deployment (3-step), env vars, running tests, how it works, limitationsPR checklist
tasks.mdentry is satisfied.docs-ciworkflow passes (markdown lint + tasks structural check).Notes for the reviewer
Deviations from the original task acceptance (flagged for awareness)
T038 -
InMemoryOCREngine.extract()now wrapped byinstrument_extract:The original T038 acceptance listed only the four real adapters.
InMemoryOCREnginewas also wrapped so that C-OCR-011 achieves full five-adapter coverage matching every other contract clause. The in-memory engine is used as a fixture in integration and agent tests, so having it emit a real span is useful for those callers too.T039 - PaddleOCR README airgapped section added (originally T035 note):
T035 tasks.md noted: "Full airgap requires pre-seeding; document in T039." The README covers
PADDLEX_HOMEto redirect the PaddleX sub-model cache andPADDLE_PDX_DISABLE_MODEL_SOURCE_CHECKto skip the ModelScope connectivity check at startup - both required for a truly airgapped host. The adapter currently wires onlyvl_rec_model_dir(viaIRIS_PADDLEOCR_OFFLINE+IRIS_PADDLEOCR_MODEL_PATH); wiringlayout_detection_model_diras a separate env var is a follow-on task for production airgapped support.get_tracercalled inside CM, not at module leveltrace.get_tracer("iris.ocr", "1.0.0")is called insideinstrument_extracteach time rather than cached at import time. This ensures tests that configure the SDKTracerProviderbefore the first call get the real SDK tracer, not the no-op tracer captured at import. The cost is negligible (tracer lookup is a dict get in the provider).IRIS_ADI_MODELenv varDocumented in the ADI README with a note that it is not yet read by the adapter directly - it will be wired through the selector/factory layer in a future workstream.
DATALAB_BASE_URLremovedWas in a previous draft of the Datalab README. Removed because nothing in the stack reads it (not in any
.pyfile and not in.env.example).Rebase note
This PR is stacked on #37. Once #37 merges, rebase onto
mainbefore merging this one.