Fix compatibility with docling-serve:latest (fast options, parsing, container tests) #5
Open
JamesFieldist wants to merge 1 commit into
…tion tests

- DoclingOptions now exposes DoOcr, TableMode, DoPicture*, etc. (with fast defaults for labels/manuals).
- DoclingClient.Extract now sends the fast options on the form (prevents 504 from MAX_SYNC_WAIT on CPU for complex PDFs like cross-page tables).
- Improved ParseDoclingResponse to explicitly handle the nested 'document' response shape from current /v1/convert/file.
- (The 'files' field was already correct in main; this hardens the full path.)
- Added DoclingSharp.Tests with Testcontainers integration test that:
  - Spins up quay.io/docling-project/docling-serve:latest container.
  - Disposes it after test (IAsyncLifetime).
  - Uses the client with explicit fast options.
  - Asserts successful extraction (no 422/504, result not null).
- Updated README with detailed notes from manual docker+curl+openapi repro on problematic PDFs (the Fieldist ABOUND label and Cenex cases), root cause, and recommended usage.
- All builds and tests (existing + new) pass.

This fixes the issues seen when using the library with latest serve in Aspire/testcontainer setups (wrong options leading to timeouts, parsing for modern responses).

See the vendored fixed copy in https://github.com/Fieldist/Api (Libraries/AI/Docling/DoclingClient.cs) for the exact same logic now used in production ingestion.
Summary
Fixes compatibility and reliability when using DoclingSharp with the current `quay.io/docling-project/docling-serve:latest` image (used in Aspire + Testcontainers for PDF ingestion/scraping).

Root cause (reproduced manually with `docker run`, `curl` variants, and `openapi.json` inspection on the exact problematic PDFs: the ABOUND FLOWABLE label with cross-page tables and the Cenex manual):

- `/v1/convert/file` requires the multipart field name `"files"` (array). Older usage produced 422 `{"detail":[{"loc":["body","files"],"msg":"Field required"}]}`.
- Without fast options, conversion runs the heavy pipeline (`table_mode=accurate`, picture VLM/chart/formula models). On CPU this exceeds `DOCLING_SERVE_MAX_SYNC_WAIT` (default 120s) → 504 "Conversion is taking too long".
- Responses from `/v1/convert/file` are nested under `"document"`. Parsing needed hardening.

Changes

- `DoclingOptions`: Added `DoOcr`, `TableMode`, `DoPictureDescription`, `DoPictureClassification`, `DoChartExtraction`, `DoFormulaEnrichment`, `IncludePageImages` (fast defaults for labels/manuals).
- `DoclingClient.ExtractDocumentContentAsync`: Sends the options on the form. Updated `ParseDoclingResponse` to handle the nested `document` shape reliably (with fallbacks).
- `README.md`: New section with the full investigation (curl repro, openapi evidence, recommended fast options, `MAX_SYNC_WAIT` env var guidance).
- `DoclingSharp.Tests/` + `DoclingIntegrationTests.cs`:
  - Spins up a `quay.io/docling-project/docling-serve:latest` container via Testcontainers.
  - Disposes it after the test (`IAsyncLifetime` + `DisposeAsync`).

All builds + tests pass cleanly (existing + new container test that fully exercises the fixes + container lifecycle).
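
For reviewers who want a picture of the nested-`document` handling with fallbacks, here is a minimal sketch using `System.Text.Json`. It is not the PR's `ParseDoclingResponse`: the helper names `ExtractMarkdown` and `ReadString` are hypothetical, and the `md_content` key is an assumption about the payload that should be checked against `openapi.json`.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper (not the PR's ParseDoclingResponse).
// Assumption: the Markdown output lives under the key "md_content".
// Throws JsonException if the body is not valid JSON.
string? ExtractMarkdown(string json)
{
    using var doc = JsonDocument.Parse(json);
    var root = doc.RootElement;
    if (root.ValueKind != JsonValueKind.Object)
        return null; // Not an object at all: nothing to read.

    // Preferred shape: { "document": { "md_content": "..." }, ... }
    if (root.TryGetProperty("document", out var nested)
        && nested.ValueKind == JsonValueKind.Object
        && ReadString(nested, "md_content") is { } fromNested)
    {
        return fromNested;
    }

    // Fallback: flat shape with the content at the top level.
    return ReadString(root, "md_content");
}

// Returns the property value only if it exists and is a JSON string.
string? ReadString(JsonElement obj, string name) =>
    obj.TryGetProperty(name, out var prop) && prop.ValueKind == JsonValueKind.String
        ? prop.GetString()
        : null;

Console.WriteLine(ExtractMarkdown("{\"document\":{\"md_content\":\"# Label\"},\"status\":\"success\"}")); // prints "# Label"
Console.WriteLine(ExtractMarkdown("{\"md_content\":\"# Flat\"}")); // prints "# Flat"
```

Returning `null` for an unrecognized shape, rather than throwing, leaves it to the caller to decide whether a missing document is an error or an empty result.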
A vendored copy of the fixed implementation now lives in the consuming project at `Libraries/AI/Docling/DoclingClient.cs` (powers `DoclingPDFToMarkdownClient` with zero external DoclingSharp dependency).

Verification
No breaking API changes (new options additive with safe defaults).
This unblocks reliable local Docling as an Azure replacement in Aspire/testcontainer setups.
(Equivalent fixes applied in the consuming Fieldist/Api repo for immediate use.)
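
To make "additive with safe defaults" concrete, the sketch below builds the multipart form with the upload under `files` and the options as optional parameters, so existing callers that pass no options still get a fast request. This is an illustration, not DoclingSharp's API: `BuildConvertForm` and `Flag` are hypothetical, the default values are placeholders and not the PR's, and only `files` and `table_mode` are named in this PR (the other field names are assumptions to check against `openapi.json`).

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper (not DoclingSharp's API). Defaults are placeholders.
MultipartFormDataContent BuildConvertForm(
    byte[] pdf,
    string fileName,
    bool doOcr = false,
    string tableMode = "fast",
    bool doPictureDescription = false,
    bool doFormulaEnrichment = false)
{
    var form = new MultipartFormDataContent();

    // The endpoint requires the upload under the multipart field name "files".
    var file = new ByteArrayContent(pdf);
    file.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
    form.Add(file, "files", fileName);

    // Options travel as plain form fields next to the file.
    // Field names other than "table_mode" are assumptions.
    form.Add(new StringContent(Flag(doOcr)), "do_ocr");
    form.Add(new StringContent(tableMode), "table_mode");
    form.Add(new StringContent(Flag(doPictureDescription)), "do_picture_description");
    form.Add(new StringContent(Flag(doFormulaEnrichment)), "do_formula_enrichment");
    return form;
}

// Lowercase booleans, since bool.ToString() would give "True"/"False".
string Flag(bool value) => value ? "true" : "false";

// A caller that passes no options still sends the fast settings.
using var request = BuildConvertForm(new byte[] { 0x25, 0x50, 0x44, 0x46 }, "label.pdf");
foreach (var part in request)
    Console.WriteLine(part.Headers.ContentDisposition?.Name);
```

Posting the form to `/v1/convert/file` with `HttpClient.PostAsync` is left out so the sketch runs without a server.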