pdf-granite

Standalone CLI that converts PDFs to Markdown and/or HTML using Docling, with Granite Vision chart extraction, image extraction, and optional OCR. CUDA by default.

Setup

Requires Python 3.13 and uv.

cd c:/ai/code/pdf_granit
uv sync                 # installs docling + a CPU torch and all other deps

CPU setup

uv sync already installs a CPU-only torch, so no extra steps are needed. Just run with --device cpu (about 10x slower than CUDA):

uv run --no-sync pdf-granite --device cpu

--no-sync is optional on a clean CPU install, but using it consistently avoids surprises if you later add a CUDA build (see below).

GPU / CUDA setup

The base install is CPU-only. For ~10x faster conversion, install a CUDA torch build matching your driver over the synced venv (nvidia-smi shows your CUDA version; Blackwell / RTX 50-series needs CUDA 12.8+):

# cu130 shown; cu129 / cu128 also valid depending on driver
uv pip install --force-reinstall \
  --index-url https://download.pytorch.org/whl/cu130 torch==2.12.1 torchvision==0.27.1

Verify the GPU is visible:

uv run --no-sync python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device')"

Always run with --no-sync afterwards — a plain uv run / uv sync restores the venv from the lockfile and silently reverts torch to the CPU build.

Run

# All *.pdf in ./input -> Markdown in ./output (GPU)
uv run --no-sync pdf-granite

# One file, both formats
uv run --no-sync pdf-granite "input/My File.pdf" --format both

# No GPU? (about 10x slower)
uv run --no-sync pdf-granite --device cpu

# Scanned / image-based PDFs? Enable OCR (off by default)
uv run --no-sync pdf-granite --ocr
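
The first command above converts every PDF in the input folder, while the second converts only the file it is given. A minimal sketch of that selection rule follows; discover_pdfs is a hypothetical helper name, and the sorted order is an assumption rather than documented behavior.

```python
from pathlib import Path


def discover_pdfs(paths: list[str], input_dir: str = "input") -> list[Path]:
    """Pick the PDFs to convert (hypothetical helper, not the tool's code)."""
    if paths:
        # Explicit paths win over scanning the input folder.
        return [Path(p) for p in paths]
    # No paths given: take every *.pdf directly inside input_dir.
    # Sorting is an assumption made here for a stable order.
    return sorted(Path(input_dir).glob("*.pdf"))
```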

With the default --device cuda, the tool hard-fails if no GPU is visible and prints how to install a CUDA torch build.
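
The device rule can be sketched as a small pure function. This is an illustration only: resolve_device is a hypothetical name, the error text is made up, and the CPU fallback for auto is an assumption, since only the cuda hard-fail is documented here.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a --device value to the device to run on (hypothetical helper)."""
    if requested == "cpu":
        return "cpu"
    if requested == "cuda":
        if not cuda_available:
            # Mirrors the documented hard-fail for the default --device cuda.
            raise RuntimeError(
                "No CUDA GPU visible. Install a CUDA torch build "
                "or pass --device cpu."
            )
        return "cuda"
    if requested == "auto":
        # Assumption: auto quietly falls back to CPU instead of failing.
        return "cuda" if cuda_available else "cpu"
    raise ValueError(f"unknown device: {requested!r}")
```

In a real run, cuda_available would come from torch.cuda.is_available().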

OCR is off by default — born-text PDFs already carry an extractable text layer, so OCR adds time without improving output. Add --ocr only for scanned or image-based PDFs where the text is baked into pixels.

Options

| Flag | Default | Meaning |
| --- | --- | --- |
| `paths` | all `*.pdf` in `--input-dir` | Specific PDFs to convert |
| `--input-dir` | `input` | Folder scanned when no paths given |
| `--output-dir` | `output` | Output folder |
| `--format md\|html\|both` | `md` | Output format(s) |
| `--device cuda\|cpu\|auto` | `cuda` | `cuda` hard-fails without a GPU |
| `--no-charts` | off | Disable chart extraction (faster) |
| `--ocr` | off | Enable OCR for scanned/image-based PDFs (off by default) |
| `--embed-images` | off | Inline images as base64 instead of linking external files |
| `--quiet` | off | Only print the final summary |
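
For readers who prefer code to tables, here is one way the flags above could be declared with argparse. It is a sketch, not the tool's actual parser: build_parser is a hypothetical name, and only the flag names, choices, and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="pdf-granite")
    # Zero or more PDFs; an empty list means "scan --input-dir".
    parser.add_argument("paths", nargs="*")
    parser.add_argument("--input-dir", default="input")
    parser.add_argument("--output-dir", default="output")
    parser.add_argument("--format", choices=["md", "html", "both"], default="md")
    parser.add_argument("--device", choices=["cuda", "cpu", "auto"], default="cuda")
    # Boolean switches, all off unless passed.
    parser.add_argument("--no-charts", action="store_true")
    parser.add_argument("--ocr", action="store_true")
    parser.add_argument("--embed-images", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    return parser


args = build_parser().parse_args(["input/My File.pdf", "--format", "both"])
```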

Outputs per <name>.pdf: <name>.md and/or <name>.html, plus <name>_charts.csv when charts are detected.

By default images are written as separate PNGs into a <name>_artifacts/ folder next to the output and linked from it — keep that folder alongside the .md/.html when moving or sharing. Pass --embed-images to inline them as base64 for a single self-contained file instead.
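
The naming scheme can be summarized in a few lines. This sketch only computes where files would land; expected_outputs is a hypothetical helper, and it leaves out <name>_charts.csv because that file appears only when charts are detected.

```python
from pathlib import Path


def expected_outputs(
    pdf: Path, output_dir: Path, fmt: str = "md", embed_images: bool = False
) -> list[Path]:
    """List the paths a conversion may produce (hypothetical helper)."""
    suffixes = {"md": [".md"], "html": [".html"], "both": [".md", ".html"]}[fmt]
    outputs = [output_dir / f"{pdf.stem}{suffix}" for suffix in suffixes]
    if not embed_images:
        # Linked PNGs live in a sibling folder that must travel with the output.
        outputs.append(output_dir / f"{pdf.stem}_artifacts")
    return outputs
```

With --embed-images the artifacts folder drops out of the list, which is what makes the result a single self-contained file.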

Amazon Textract (comparison)

A separate pdf-textract script parses PDFs with Amazon Textract (LAYOUT + TABLES) for side-by-side comparison. It writes output/textract/<name>.md plus the raw Textract JSON. It is not part of the Docling pipeline.

Put your AWS credentials in a .env file at the project root (gitignored):

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1
S3_BUCKET=input

Install the optional AWS dependency group and run it. Sources may be local files or s3:// URIs. Multi-page PDFs go through Textract's async API, which requires the document to be in S3, so local files are first uploaded to the S3_BUCKET scratch bucket (override with --bucket):

uv sync --group textract

# PDF already in S3 (region auto-detected from the bucket):
uv run --no-sync pdf-textract s3://input/SMR-1Q26-Presentation.pdf

# Local file (uploaded to the S3_BUCKET scratch bucket from .env first):
uv run --no-sync pdf-textract input/SMR-1Q26-Presentation.pdf
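
The two source forms above differ only in whether an upload is needed first. A rough sketch of that decision, using only the standard library: plan_source is a hypothetical name, and using the bare file name as the S3 key for local files is an assumption, not something the script is documented to do.

```python
from pathlib import Path
from urllib.parse import urlparse


def plan_source(source: str, scratch_bucket: str) -> tuple[str, str, bool]:
    """Return (bucket, key, needs_upload) for a source (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme == "s3":
        # s3://bucket/key: already in S3, nothing to upload.
        return parsed.netloc, parsed.path.lstrip("/"), False
    # Anything else is treated as a local file headed for the scratch bucket.
    # Assumption: the object key is just the file name.
    return scratch_bucket, Path(source).name, True
```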

Tests

uv run --no-sync pytest -q

Offline unit tests run anywhere; the live smoke test converts the first PDF in input/ and is skipped when no CUDA GPU is present.
