Fix compatibility with docling-serve:latest (fast options, parsing, container tests) #5

Open
JamesFieldist wants to merge 1 commit into stevekoeniger:master from JamesFieldist:fix/docling-serve-latest-compatibility
@JamesFieldist

Summary

Fixes compatibility and reliability when using DoclingSharp with the current quay.io/docling-project/docling-serve:latest (used in Aspire + Testcontainers for PDF ingestion/scraping).

Root cause, reproduced manually with docker run, several curl variants, and openapi.json inspection against the exact problematic PDFs (the ABOUND FLOWABLE label with cross-page tables and the Cenex manual):

  • Serve's /v1/convert/file requires the multipart field name "files" (array). Older usage produced 422 {"detail":[{"loc":["body","files"],"msg":"Field required"}]}.
  • Default pipeline is heavy (OCR on, table_mode=accurate, picture VLM/chart/formula models). On CPU this exceeds DOCLING_SERVE_MAX_SYNC_WAIT (default 120s) → 504 "Conversion is taking too long".
  • Successful responses for /v1/convert/file are nested under "document". Parsing needed hardening.
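
The first two points can be illustrated with a minimal, language-neutral sketch in Python (not the C# code in this PR). It builds the multipart parts for /v1/convert/file with the file under the "files" field and attaches a lighter option set. The helper name `build_convert_form`, the snake_case form-field names, and the specific "fast" values are assumptions derived from the option names listed in this PR; check them against the openapi.json of the serve version in use.

```python
from typing import Dict, List, Optional, Tuple

# Assumed form-field names and values (not confirmed by this PR text):
# snake_case counterparts of DoOcr, TableMode, DoPictureDescription, etc.
FAST_OPTIONS: Dict[str, object] = {
    "do_ocr": False,
    "table_mode": "fast",
    "do_picture_description": False,
    "do_picture_classification": False,
    "do_formula_enrichment": False,
}

FilePart = Tuple[str, Tuple[str, bytes, str]]


def build_convert_form(
    pdf_name: str,
    pdf_bytes: bytes,
    options: Optional[Dict[str, object]] = None,
) -> Tuple[List[FilePart], Dict[str, str]]:
    """Return (files, data) parts for a multipart POST to /v1/convert/file."""
    opts = FAST_OPTIONS if options is None else options
    # The upload must use the field name "files"; any other name yields
    # the 422 "Field required" response described above.
    files: List[FilePart] = [("files", (pdf_name, pdf_bytes, "application/pdf"))]
    # Form values are sent as strings; booleans become "true"/"false".
    data = {
        key: (str(value).lower() if isinstance(value, bool) else str(value))
        for key, value in opts.items()
    }
    return files, data
```

With an HTTP client such as `requests`, the two return values could be passed as `files=files, data=data` to a POST against the serve base URL plus /v1/convert/file.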

Changes

  • DoclingOptions: Added DoOcr, TableMode, DoPictureDescription, DoPictureClassification, DoChartExtraction, DoFormulaEnrichment, IncludePageImages (fast defaults for labels/manuals).
  • DoclingClient.ExtractDocumentContentAsync: Sends the options on the form. Updated ParseDoclingResponse to handle the nested document shape reliably (with fallbacks).
  • README.md: New section with the full investigation (curl repro, openapi evidence, recommended fast options, MAX_SYNC_WAIT env var guidance).
  • New DoclingSharp.Tests/ + DoclingIntegrationTests.cs:
    • Spins up real quay.io/docling-project/docling-serve:latest container via Testcontainers.
    • Disposes cleanly (IAsyncLifetime + DisposeAsync).
    • Calls client with explicit fast options + minimal embedded PDF.
    • Asserts successful extraction (no 422/504, result not null).
    • (Consuming repo's richer A/B tests against ABOUND/Cenex validate full quality with the identical fixed logic.)
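
The "nested document shape with fallbacks" parsing can be sketched as follows. This is a Python illustration of the idea only, not the actual ParseDoclingResponse implementation; the function name `extract_content` and the content keys `md_content` and `text_content` are assumptions about the response body, so verify them against a real response before relying on them.

```python
import json
from typing import Any, Optional

# Assumed content keys, tried in order of preference.
CONTENT_KEYS = ("md_content", "text_content")


def extract_content(payload: Any) -> Optional[str]:
    """Return the first non-empty content string, or None if nothing usable is found."""
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except ValueError:
            return None
    if not isinstance(payload, dict):
        return None

    # Prefer the nested "document" object, then fall back to the top level
    # so that a flat response shape still parses.
    document = payload.get("document")
    candidates = [document, payload] if isinstance(document, dict) else [payload]

    for candidate in candidates:
        for key in CONTENT_KEYS:
            value = candidate.get(key)
            if isinstance(value, str) and value.strip():
                return value
    return None
```

Returning None instead of raising keeps the caller in control of how an empty or unexpected body is reported.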

All builds + tests pass cleanly (existing + new container test that fully exercises the fixes + container lifecycle).

A vendored copy of the fixed implementation now lives in the consuming project at Libraries/AI/Docling/DoclingClient.cs (powers DoclingPDFToMarkdownClient with zero external DoclingSharp dependency).

Verification

  • New integration test: container start → fixed client call (with options) → extraction → assert → dispose.
  • No more 422 (field) or 504 (timeout) on the target PDFs once fast options are used.
  • Detailed repro evidence from the docker+curl+openapi investigation is in the README.
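
The two failure modes named above can be told apart from the HTTP status and body alone. The sketch below is a hypothetical diagnostic helper (`diagnose` is not part of DoclingSharp or docling-serve) that maps each one to the remedy described in this PR; the 422 body shape is the one quoted in the Summary.

```python
from typing import Any, Optional


def diagnose(status: int, body: Any = None) -> Optional[str]:
    """Return a human-readable hint for a failed conversion, or None on success."""
    if status == 422 and isinstance(body, dict):
        details = body.get("detail")
        for item in details if isinstance(details, list) else []:
            # Matches {"loc": ["body", "files"], "msg": "Field required"}.
            if isinstance(item, dict) and list(item.get("loc") or []) == ["body", "files"]:
                return 'upload the PDF under the multipart field name "files"'
        return "request validation failed; compare the form fields with openapi.json"
    if status == 504:
        return (
            "conversion exceeded the sync wait; use the fast options "
            "or raise DOCLING_SERVE_MAX_SYNC_WAIT"
        )
    if 200 <= status < 300:
        return None
    return f"unexpected status {status}"
```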

No breaking API changes (new options additive with safe defaults).

This unblocks reliable use of local Docling as an Azure replacement in Aspire/Testcontainers setups.

(Equivalent fixes applied in the consuming Fieldist/Api repo for immediate use.)

…tion tests

- DoclingOptions now exposes DoOcr, TableMode, DoPicture* etc (with fast defaults for labels/manuals).
- DoclingClient.Extract now sends the fast options on the form (prevents 504 from MAX_SYNC_WAIT on CPU for complex PDFs like cross-page tables).
- Improved ParseDoclingResponse to explicitly handle the nested 'document' response shape from current /v1/convert/file.
- (The 'files' field was already correct in main; this hardens the full path.)
- Added DoclingSharp.Tests with Testcontainers integration test that:
  - Spins up quay.io/docling-project/docling-serve:latest container.
  - Disposes it after test (IAsyncLifetime).
  - Uses the client with explicit fast options.
  - Asserts successful extraction (no 422/504, result not null).
- Updated README with detailed notes from manual docker+curl+openapi repro on problematic PDFs (the Fieldist ABOUND label and Cenex cases), root cause, and recommended usage.
- All builds and tests (existing + new) pass.

This fixes the issues seen when using the library with the latest serve image in Aspire/Testcontainers setups: heavy default options leading to timeouts, and parsing of the current response shape.

See the vendored fixed copy in https://github.com/Fieldist/Api (Libraries/AI/Docling/DoclingClient.cs) for the exact same logic now used in production ingestion.