This repository provides OCR-related datasets.
Updated Jun 17, 2026
Arabic Chat with PDF is a user-friendly application that lets you interact with Arabic PDF documents. Powered by advanced language models, OCR, and vector search, it allows you to upload PDFs, ask questions, and receive accurate Arabic responses 🚀
This research fine-tunes an Arabic OCR model using Tesseract 5.0, improving text recognition accuracy through extensive data collection, preprocessing, and image generation. By leveraging advanced training techniques and data augmentation, we achieve significant reductions in word error rate (WER).
Alef-OCR-Image2Html is an OCR model designed to transform Arabic documents, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.
Official code for "Ketaba-OCR at AR-MS NakbaNLP 2026": QLoRA fine-tuning of a specialized HTR model with a Linear+Boost ensemble for Arabic manuscript recognition. 1st place per-line (CER 0.082) and 3rd place on the official leaderboard at NakbaNLP 2026 (LREC 2026).
Automatic license-plate recognition for Sudanese plates — YOLO detector + fine-tuned OCR, with a reproducible benchmark.
Optical Character Recognition, OCR pipeline, Arabic OCR, Deep Learning OCR, Computer Vision text extraction, Text recognition system, AI document processing, Multilingual OCR, Transformer OCR, OCR benchmarking, Bounding box detection, Ground truth evaluation.
Additional experimental model for NakbaNLP 2026 Shared Task (AR-MS) — LoRA/DoRA fine-tuning of Qari-OCR (Qwen2-VL-2B) for Arabic handwritten manuscript recognition on the Omar Al-Saleh Memoir Collection (1951-1965).
Nassij V3: High-accuracy Arabic PDF-to-DOCX converter with direct digital extraction (NassijScanner) and cryptographic linguistic integrity verification (Merkle proofs).
Multilingual OCR with per-region script routing for Arabic + Latin. Built for MENA documents.
OCR-first Arabic book corpus platform with citation-grade APIs
Local Python pipeline + bilingual SPA archiving the @AqmarTofan Telegram channel — Telethon, ffmpeg, EasyOCR (Arabic+English), openpyxl, Alpine.js.
Flagship Demo — Medical Handwriting OCR powered by Omni Medical Suite
Local Arabic OCR field extraction for utility bills with PaddleOCR, FastAPI, CLI, and validation.
A deep learning-based handwritten Arabic OCR system using ResNet50 + BiLSTM + Attention with CTC decoding. Achieves 96.3% character accuracy and 80% word accuracy on the IFN/ENIT dataset, featuring a PyQt6 desktop GUI for real-time inference. Supports both greedy and beam search decoding.
Arabic handwritten text recognition using a CRNN (CNN + BiLSTM) with CTC loss, trained on the KHATT dataset — includes a Gradio web demo for OCR on your own images.
This repo lists all the tools we know of for Arabic OCR text extraction.
Arabic-first Document Intelligence API platform for OCR, Arabic handwriting recognition, document extraction, validation, search, chat with documents, and structured JSON output.
Unified Medical OCR Platform — Next.js + FastAPI + Gradio + Qdrant + Redis
Arabic Plate Recognition System
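Several of the entries above report character error rate (CER) or word error rate (WER). As a rough sketch of how these metrics are commonly computed, the snippet below divides the Levenshtein edit distance between a reference and a hypothesis by the reference length. The function names are illustrative and not taken from any listed repository, and individual projects may normalize text (diacritics, punctuation, whitespace) differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # One missing character out of four reference characters.
    print(cer("كتاب", "كتب"))
    # One substituted word and one missing word out of four reference words.
    print(wer("a b c d", "a x c"))
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, since insertions count as errors.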