A Python tool for processing poorly formatted, scanned, or unevenly spaced PDFs, particularly those in Japanese. It handles both digital and scanned documents, extracts text directly or through OCR as needed, and outputs structured JSON data.
Many Japanese organizations deal with PDFs that are:
- Scanned/Image-based - No extractable text layer
- Poorly formatted - Inconsistent spacing and alignment
- Unstructured - Mixed content without clear sections
- Low quality - Faded or compressed images requiring OCR enhancement
✨ Multi-Modal Text Extraction
- Direct PDF text extraction for digital documents
- Optical Character Recognition (OCR) for scanned PDFs
- Automatic fallback to OCR when direct extraction fails
🖼️ Image Preprocessing
- Denoising with fast non-local means filtering
- Contrast enhancement using CLAHE (Contrast Limited Adaptive Histogram Equalization)
- Automatic binarization for improved text recognition
- Deskewing support
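The source does not say which binarization method the tool uses, and its preprocessing runs on OpenCV rather than plain Python. Purely as an illustration of what "automatic binarization" usually means, the sketch below implements Otsu's method on a flat list of 8-bit grayscale values. The names `otsu_threshold` and `binarize` are hypothetical and are not part of `pdf_processor`.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form the "low" class, pixels > t the "high" class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of those pixel values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue  # no pixels in the low class yet
        if w_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]


# Synthetic "page": dark ink values followed by bright paper values
page = [30] * 50 + [40] * 50 + [200] * 50 + [220] * 50
t = otsu_threshold(page)
binary = binarize(page, t)
```

In OpenCV the same idea is available through `cv2.threshold` with the `cv2.THRESH_BINARY + cv2.THRESH_OTSU` flags.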
🌐 Multi-Language Support
- Japanese (jpn) - optimized for Japanese characters
- English (eng)
- Combined support (jpn+eng)
📊 Structured Data Extraction
- Automatic date detection (both Japanese and ISO formats)
- Email and phone number extraction
- Numerical data identification
- Document section segmentation
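The detection of dates, emails, and phone numbers can be approximated with regular expressions. The patterns below are assumptions written for this example, not the tool's actual rules; the output keys mirror the `structured_data` field names used in the sample JSON in this README, and `extract_structured` is a hypothetical helper.

```python
import re
from typing import Dict, List

DATE_RE = re.compile(
    r"\d{4}年\d{1,2}月\d{1,2}日"          # Japanese format: 2026年6月2日
    r"|(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)"   # ISO format: 2026-06-01
)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hyphenated Japanese-style numbers such as 090-1234-5678 or 03-1234-5678
PHONE_RE = re.compile(r"(?<!\d)0\d{1,4}-\d{1,4}-\d{4}(?!\d)")


def extract_structured(text: str) -> Dict[str, List[str]]:
    """Collect dates, emails, and phone numbers in order of appearance."""
    return {
        "dates": DATE_RE.findall(text),
        "email_addresses": EMAIL_RE.findall(text),
        "phone_numbers": PHONE_RE.findall(text),
    }


sample = "会議は2026年6月2日に開催します。締切: 2026-06-01。連絡先: user@example.com / 090-1234-5678"
found = extract_structured(sample)
```

Running the patterns on NFKC-normalized text is advisable, since scanned Japanese documents often contain full-width digits and hyphens that would otherwise slip past the ASCII parts of these patterns.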
📝 Text Normalization
- Unicode normalization (NFKC for Japanese)
- Whitespace cleaning and standardization
- Control character removal
- Line break preservation
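A minimal sketch of these normalization steps, using only the standard library. The function name `normalize_text` and the exact order of operations are assumptions; the tool's own implementation may differ.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    # NFKC folds full-width letters/digits and half-width katakana
    # into their standard forms (e.g. "ＡＢＣ" -> "ABC", "ｶﾀｶﾅ" -> "カタカナ").
    text = unicodedata.normalize("NFKC", text)

    # Unify line endings so line breaks survive the next steps.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Drop control characters, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )

    # Collapse runs of spaces/tabs inside each line and trim line edges.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)

    # Keep paragraph breaks, but limit them to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "ＡＢＣ\u3000１２３\x00\r\nｶﾀｶﾅ   テスト\n\n\n\n終わり "
clean = normalize_text(raw)
```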
📄 JSON Output
- Complete metadata
- Page-by-page text with confidence scores
- Structured extracted data
- Processing timestamps and file information
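Because each page carries a confidence score, the JSON output can be used to flag pages that may need manual review. The sketch below assumes the `pages`, `page_number`, and `confidence` field names shown in the sample output in this README; the helper `low_confidence_pages`, the 0.8 cutoff, and the sample values are made up for illustration.

```python
import json
from typing import Any, Dict, List


def low_confidence_pages(report: Dict[str, Any], threshold: float = 0.8) -> List[int]:
    """Return page numbers whose confidence falls below the threshold."""
    return [
        page["page_number"]
        for page in report.get("pages", [])
        if page.get("confidence", 0.0) < threshold
    ]


# Made-up report with the same shape as the tool's JSON output
report = json.loads("""
{
  "metadata": {"file_name": "document.pdf", "total_pages": 2},
  "pages": [
    {"page_number": 1, "has_images": false, "confidence": 0.95, "text": "..."},
    {"page_number": 2, "has_images": true, "confidence": 0.41, "text": "..."}
  ]
}
""")
flagged = low_confidence_pages(report)
```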
- Python 3.8+
- Tesseract OCR engine (required for OCR functionality)
Windows:

    # Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
    # Or use chocolatey:
    choco install tesseract

macOS:

    brew install tesseract

Linux (Ubuntu/Debian):

    sudo apt-get install tesseract-ocr libtesseract-dev

Linux (Fedora/CentOS):

    sudo yum install tesseract

Python dependencies:

    pip install -r requirements.txt

Basic usage from Python:

    from pdf_processor import process_pdf

    # Simple one-liner
    json_output = process_pdf(
        pdf_path="document.pdf",
        output_json_path="output.json",
        language="jpn",  # Japanese
    )

Command-line usage:

    # Process PDF and save JSON
    python pdf_processor.py input.pdf output.json jpn

    # Process PDF and print JSON to console
    python pdf_processor.py input.pdf

Advanced usage with the PDFProcessor class:

    from pdf_processor import PDFProcessor
    import json

    # Initialize processor
    processor = PDFProcessor("document.pdf", language="jpn")

    # Process the document
    result = processor.process()

    if result:
        # Access structured data
        print(f"Extracted {len(result.pages)} pages")
        print(f"Found dates: {result.structured_data['dates']}")
        print(f"Found emails: {result.structured_data['email_addresses']}")

        # Save to JSON
        json_str = processor.to_json(result, "output.json")

        # Parse JSON
        data = json.loads(json_str)

Batch processing a directory:

    from pdf_processor import process_pdf
    import os

    pdf_directory = "./pdfs"
    output_directory = "./json_outputs"
    os.makedirs(output_directory, exist_ok=True)

    for pdf_file in os.listdir(pdf_directory):
        # Match the extension case-insensitively so "SCAN.PDF" is not skipped
        if pdf_file.lower().endswith(".pdf"):
            pdf_path = os.path.join(pdf_directory, pdf_file)
            stem = os.path.splitext(pdf_file)[0]
            output_path = os.path.join(output_directory, f"{stem}.json")
            process_pdf(pdf_path, output_path, language="jpn")

The JSON output contains:

    {
      "metadata": {
        "file_name": "document.pdf",
        "file_size_bytes": 524288,
        "total_pages": 5,
        "language": "jpn",
        "processing_completed_at": "2026-06-02T08:10:30.123456"
      },
      "processing_summary": {
        "total_pages": 5,
        "processing_timestamp": "2026-06-02T08:10:30.123456",
        "file_path": "/path/to/document.pdf"
      },
      "structured_data": {
        "dates": ["2026年6月2日", "2026-06-01"],
        "numbers": ["1000", "5000", "12345"],
        "email_addresses": ["user@example.com"],
        "phone_numbers": ["090-1234-5678"],
        "sections": {
          "Section Title": "Content here...",
          "Another Section": "More content..."
        }
      },
      "pages": [
        {
          "page_number": 1,
          "has_images": false,
          "confidence": 0.95,
          "text": "Extracted text from page 1..."
        }
      ],
      "full_text": "Complete text from entire document..."
    }

- Load PDF - Open file using PyMuPDF
- Try Direct Extraction - Attempt text extraction from PDF
- Fallback to OCR - If text extraction fails/yields minimal content:
  - Convert page to image
  - Preprocess image (denoise, enhance contrast)
  - Apply Tesseract OCR with Japanese language model
- Normalize Text - Clean and standardize extracted text
- Extract Structured Data - Parse dates, emails, phone numbers, sections
- Generate JSON - Format all results as structured JSON
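The direct-extraction-then-OCR decision in the steps above can be sketched as a small function that takes the two extraction strategies as callables. The 20-character cutoff and the names `needs_ocr` and `extract_page_text` are assumptions for this example; the source only says OCR is used when extraction "fails/yields minimal content".

```python
from typing import Callable, Tuple

MIN_CHARS = 20  # assumed cutoff; the tool's real threshold is not documented here


def needs_ocr(direct_text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when direct extraction yields too few non-whitespace characters."""
    return len("".join(direct_text.split())) < min_chars


def extract_page_text(
    direct_extract: Callable[[], str],
    ocr_extract: Callable[[], str],
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Return (text, method), running OCR only when the text layer is too thin."""
    text = direct_extract()
    if not needs_ocr(text, min_chars):
        return text, "direct"
    return ocr_extract(), "ocr"


# Stand-ins for the real extractors, so the sketch runs without any PDF
ocr_calls = []


def fake_ocr() -> str:
    ocr_calls.append("ocr")
    return "請求書 2026年6月2日"


scanned = extract_page_text(lambda: "  \n", fake_ocr)
digital = extract_page_text(lambda: "This page already has a usable text layer.", fake_ocr)
```

In practice, `direct_extract` could wrap PyMuPDF's `page.get_text()` and `ocr_extract` could wrap `pytesseract.image_to_string(image, lang="jpn")` applied to the preprocessed page image. Passing callables keeps the expensive OCR step lazy, so it never runs for pages with a usable text layer.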
| Library | Purpose |
|---|---|
| PyMuPDF | PDF loading and rendering |
| Tesseract | OCR engine for scanned PDFs |
| Pillow | Image processing |
| OpenCV | Image preprocessing (denoising, contrast) |
| NumPy | Numerical operations |
- Batch Processing: Process multiple PDFs simultaneously using multiprocessing
- Resolution: Increase zoom factor in `get_pixmap()` for low-quality PDFs
- Language: Use only required languages (e.g., "jpn" instead of "jpn+eng")
- Memory: For very large PDFs, process page-by-page instead of loading entire document
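One way to apply the multiprocessing tip is to pair each input PDF with an output path and hand the pairs to a process pool. This is a sketch under assumptions: `build_jobs` and `run_batch` are hypothetical helpers, and the worker is passed in as a parameter so the example does not depend on `pdf_processor` being importable.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Any, Callable, List, Optional, Tuple


def build_jobs(pdf_dir: str, out_dir: str) -> List[Tuple[str, str]]:
    """Pair every PDF in pdf_dir with a .json path of the same stem in out_dir."""
    source, target = Path(pdf_dir), Path(out_dir)
    return [
        (str(pdf), str(target / (pdf.stem + ".json")))
        for pdf in sorted(source.iterdir())
        if pdf.is_file() and pdf.suffix.lower() == ".pdf"
    ]


def run_batch(
    worker: Callable[[str, str], Any],
    jobs: List[Tuple[str, str]],
    max_workers: Optional[int] = None,
) -> List[Any]:
    """Run worker(src, dst) for each job in separate processes.

    The worker must be a picklable, module-level callable.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, src, dst) for src, dst in jobs]
        return [future.result() for future in futures]
```

A call such as `run_batch(functools.partial(process_pdf, language="jpn"), build_jobs("./pdfs", "./json_outputs"))` would then process the directory in parallel. It should be placed under an `if __name__ == "__main__":` guard, which process pools require on platforms that spawn new interpreters. OCR is CPU-bound, so processes rather than threads are the natural fit.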
- Ensure Tesseract is installed and in system PATH
- Specify Tesseract path if needed: `pytesseract.pytesseract.tesseract_cmd = r'/path/to/tesseract'`
- Check image preprocessing settings
- Increase `get_pixmap()` zoom factor for better resolution
- Try different Tesseract PSM modes (0-13)
- Process PDFs in chunks
- Reduce image resolution (lower zoom factor)
- Use streaming approach for very large files
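The trade-off between zoom factor and memory can be estimated with simple arithmetic. PDF page sizes are measured in points (72 per inch), so a zoom factor of 1 corresponds to 72 DPI, and the pixel count grows with the square of the zoom. The sketch below is an approximation for uncompressed RGB images without an alpha channel; the helper names are hypothetical, and PyMuPDF's exact pixel rounding may differ slightly.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def zoom_for_dpi(dpi: float) -> float:
    """Zoom factor that renders a page at roughly the requested DPI."""
    return dpi / POINTS_PER_INCH


def pixmap_bytes(width_pt: float, height_pt: float, zoom: float, channels: int = 3) -> int:
    """Approximate size in bytes of an uncompressed rendered page image."""
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    return width_px * height_px * channels


A4 = (595, 842)  # A4 page size in points

at_zoom_2 = pixmap_bytes(*A4, zoom=2)  # 1190 x 1684 px, about 6.0 MB
at_zoom_3 = pixmap_bytes(*A4, zoom=3)  # 1785 x 2526 px, about 13.5 MB
```

Going from zoom 2 to zoom 3 multiplies memory per page by 2.25, which is why lowering the zoom factor is an effective fix for memory errors on large documents.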
See requirements.txt:
- PyMuPDF 1.23.8
- pytesseract 0.3.10
- Pillow 10.1.0
- opencv-python 4.8.1.78
- numpy 1.24.3
MIT License - Feel free to use and modify as needed.
Contributions welcome! Please submit issues and pull requests.
For issues, questions, or feature requests, please open a GitHub issue.