Skip to content

Greg735/archival-data-extractor

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Historical Document Extractor

Historical Document Extractor is a local OCR pipeline for extracting data from historical NSDAP membership cards (Formular A3340), powered by Qwen3.5-9B running through llama.cpp.

📋 Description

The tool automatically reads and extracts data from NSDAP personal cards, including:

  • Full name
  • Birth date and place
  • Occupation
  • Residential address
  • Membership number
  • Joining/leaving/rejoining dates
  • Region (Gau)

🛠️ Requirements

  • Python 3.8+
  • Llama.cpp server with Qwen3.5-9B
  • Libraries: requests, Pillow, tqdm, json

🚀 Installation

  1. Clone the repository
  2. Install dependencies:
    pip install requests Pillow tqdm

📦 Model Setup

Run the llama.cpp server with your selected Qwen model:

./llama-server -m "$SELECTED_MODEL" \
  --ctx-size 131072 \
  --jinja \
  --tensor-split 24,12 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.00 \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 \
  -fa on \
  -ngl 99

The server must be running at http://localhost:8080/v1/chat/completions

📝 Usage

  1. Place document images in the images/ folder
  2. Run the extractor:
    python extract.py
  3. Results are saved to extraction_results.json

📄 Example Output

{
  "name": "Aatz Heinrich",
  "birth_date": "16.10.00",
  "birth_place": "Herrensohr",
  "occupation": null,
  "residence_city": "Herrensohr",
  "street_address": "B., Bergstr.38",
  "district": "O. G.",
  "gau": "Saarpf.",
  "membership_number": "2690739",
  "joined_date": "25.8.33",
  "left_date": "1.Aug.1933",
  "rejoined_date": null,
  "footer_note": "Weitere Überweisungen umseitig!",
  "source_file": "A3340-MFKL-A0001-0076.png"
}

⚙️ Configuration

Edit SERVER_URL in extract.py if you're using a different port.

📝 License

MIT License

About

Local OCR pipeline for extracting data from historical NSDAP membership cards using Qwen3.5-9B and llama.cpp.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%