Docling Document Extractor

A FastAPI-based application that extracts layout, tables, and content from documents (PDFs) using native Docling features.

Features

- Advanced Table Extraction: Uses TableFormer in ACCURATE mode for high-fidelity table structure recognition.
- OCR Support: Built-in OCR for scanned documents and images.
- Image Analysis: Automatic classification and description of figures and images.
- Markdown Export: Converts documents to Markdown with embedded table structures and image captions.
- Structured Data: Provides access to the underlying structured data model of the document.

Installation

Clone the repository:

git clone <repository_url>
cd Doc_extracter

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Configuration

Copy the example environment file:
```
cp .env.example .env
```
Open .env and set your credentials if you use cloud-based features (e.g., specific enrichment services). Note: main.py does not currently load .env, so if the underlying libraries need these variables, export them in your shell or load them with python-dotenv.
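
If you would rather not add python-dotenv, the sketch below shows one way to load the file with the standard library only. It is not part of this repository: `load_env_file` is a hypothetical helper, and it assumes .env contains plain `KEY=VALUE` lines with optional `#` comments. Variables already exported in the shell are left untouched unless `override=True`.

```python
import os
from pathlib import Path


def load_env_file(path=".env", override=False):
    """Load KEY=VALUE pairs from a dotenv-style file into os.environ.

    Hypothetical helper (not part of this repo). Returns the parsed pairs.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # nothing to load; leave the environment unchanged
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        # Values already set in the shell win unless override=True.
        if override or key not in os.environ:
            os.environ[key] = value
    return loaded
```

Calling `load_env_file()` before the application starts (for example, at the top of a launcher script) would make the values visible to any library that reads them from the environment.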

Usage

Starting the Server

Run the FastAPI server using uvicorn:

uvicorn app.main:app --reload

The API will be available at http://127.0.0.1:8000.
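
Model loading can delay startup, so a script that calls the API right after launching uvicorn may want to wait until the server answers. The following is a minimal sketch of such a check; `wait_for_server` is a hypothetical helper, and it assumes the default address above and that the `/docs` page returns HTTP 200 once the app is up.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:8000/docs", timeout=30.0, interval=0.5):
    """Poll url until it answers with HTTP 200 or the timeout elapses.

    Hypothetical helper (not part of this repo). Returns True on success.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```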

Using the Client

A simple client script is provided to test the extraction:

python client_example.py <path_to_document.pdf>

Example:

python client_example.py sample.pdf

API Documentation

`POST /extract`

Extracts content from an uploaded document.

Request: multipart/form-data with a file field.
Response: JSON object containing:
- markdown: The extracted text in Markdown format.
- structured_data: The raw structured data from Docling.

Visit http://127.0.0.1:8000/docs for the interactive Swagger UI.
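
The request and response described above can be exercised with the standard library alone. The sketch below is not the bundled client_example.py, and the helper names are hypothetical. It assumes the upload field is literally named `file` and that the server runs at the default address; adjust both if your setup differs.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract(path, url="http://127.0.0.1:8000/extract", field_name="file"):
    """POST a document to /extract and return the decoded JSON response.

    The field name "file" is an assumption based on the request description.
    """
    source = Path(path)
    body, content_type = build_multipart(field_name, source.name, source.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def save_result(result, stem):
    """Write the two documented response fields to <stem>.md and <stem>.json."""
    Path(f"{stem}.md").write_text(result["markdown"], encoding="utf-8")
    Path(f"{stem}.json").write_text(
        json.dumps(result["structured_data"], indent=2), encoding="utf-8"
    )
```

With the server running, `save_result(extract("sample.pdf"), "sample")` would write the Markdown to sample.md and the structured data to sample.json.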
