A FastAPI-based application that extracts layout, tables, and content from documents (PDFs) using native Docling features.
- Advanced Table Extraction: Uses TableFormer in ACCURATE mode for high-fidelity table structure recognition.
- OCR Support: Built-in OCR for scanned documents and images.
- Image Analysis: Automatic classification and description of figures and images.
- Markdown Export: Converts documents to Markdown with embedded table structures and image captions.
- Structured Data: Provides access to the underlying structured data model of the document.
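
The table and OCR features above map onto Docling pipeline options. The following is a minimal sketch of one way to configure them, not necessarily what app/main.py does: the function names build_converter and convert are hypothetical, and the module paths follow the Docling 2.x Python API as I understand it, so check them against the installed version.

```python
def build_converter():
    """Return a Docling DocumentConverter set up for accurate tables and OCR."""
    # Imported lazily because Docling pulls in heavy model dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True              # OCR for scanned pages and images
    options.do_table_structure = True  # run table structure recognition
    options.table_structure_options.mode = TableFormerMode.ACCURATE

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def convert(path: str) -> dict:
    """Convert one document and return its Markdown and structured data."""
    document = build_converter().convert(path).document
    return {
        "markdown": document.export_to_markdown(),
        "structured_data": document.export_to_dict(),
    }
```

ACCURATE mode trades speed for table fidelity; the faster alternative in Docling is TableFormerMode.FAST.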
- Clone the repository:

      git clone <repository_url>
      cd Doc_extracter
- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:

      pip install -r requirements.txt
- Copy the example environment file:

      cp .env.example .env
- Open .env and configure your credentials if you use cloud-based features (e.g., specific enrichment services). Note: main.py does not currently load .env, so you may need to export these variables in your shell or use python-dotenv if the underlying libraries require them.
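
Since main.py does not load .env itself, one option is to load it at startup. The sketch below is a standard-library stand-in for what python-dotenv's load_dotenv() does in the simple case; the helper name load_env_file is hypothetical, and it only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Read KEY=VALUE lines into os.environ and return everything that was parsed."""
    parsed = {}
    env_path = Path(path)
    if not env_path.is_file():
        return parsed  # nothing to load; rely on variables already exported
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes, if any
        parsed[key] = value
        # Variables already set in the shell win unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
    return parsed
```

Calling load_env_file() before the app is created keeps shell-exported values in precedence over the file.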
Run the FastAPI server using uvicorn:

    uvicorn app.main:app --reload

The API will be available at http://127.0.0.1:8000.
A simple client script is provided to test the extraction:

    python client_example.py <path_to_document.pdf>

Example:

    python client_example.py sample.pdf

Extracts content from an uploaded document.
- Request: multipart/form-data with a file field.
- Response: JSON object containing:
  - markdown: The extracted text in Markdown format.
  - structured_data: The raw structured data from Docling.
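
The request and response described above can be exercised with the standard library alone. This is a hedged sketch rather than the contents of client_example.py: the route path is not stated here, so "/extract" in EXTRACT_URL is a placeholder assumption, and build_multipart and extract are hypothetical helper names.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from urllib import request

# Assumption: the real route path is not given above; "/extract" is a placeholder.
EXTRACT_URL = "http://127.0.0.1:8000/extract"


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract(path: str, url: str = EXTRACT_URL) -> dict:
    """Upload a document in the file field and return the decoded JSON response."""
    doc = Path(path)
    body, content_type = build_multipart("file", doc.name, doc.read_bytes())
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, extract("sample.pdf") should return a dict whose markdown and structured_data keys match the response fields listed above.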
Visit http://127.0.0.1:8000/docs for the interactive Swagger UI.