Virtual Environment Location: /Users/omaribrahim/dev/scripts/openaibatches/
Before Running Any Scripts:
source /Users/omaribrahim/dev/scripts/openaibatches/bin/activate
cd mindroots
python script_name.py
The venv is shared across the scripts directory and contains all required dependencies (neo4j, python-dotenv, camel-tools, etc.). Always activate it before running any Python scripts in this project.
This project contains scripts for processing and linking Arabic corpus data with word nodes in a Neo4j database.
Links Corpus 2 items to Word nodes in the database using lemma matching and root filtering.
Matching Logic:
- Strips diacritics from corpus item lemma
- Matches against the arabic_no_diacritics property on Word nodes
- Filters by root to ensure accuracy
- Creates new Word nodes if no match found under the correct root
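The matching steps above can be sketched in plain Python, without the database. This is a minimal illustration, not the code in linkquranwords.py: the function names and the in-memory word list are assumptions, and the diacritic pattern is the one quoted later in these notes.

```python
import re
import unicodedata

# Diacritic range taken from the normalization notes in this document.
DIACRITICS = re.compile("[\u064B-\u0655\u0670]")


def strip_diacritics(text):
    """Decompose with NFKD, then drop combining marks in the range above."""
    return DIACRITICS.sub("", unicodedata.normalize("NFKD", text))


def find_or_create_word(lemma, root, words):
    """Return (word, created) for a corpus lemma.

    `words` is a hypothetical in-memory stand-in for Word nodes: a list of
    dicts with `arabic_no_diacritics` and `root` keys.
    """
    bare = strip_diacritics(lemma)
    for word in words:
        # Match on the stripped form, filtered by root.
        if word["arabic_no_diacritics"] == bare and word["root"] == root:
            return word, False
    # No match under the correct root: create a new word entry.
    new_word = {"arabic_no_diacritics": bare, "root": root}
    words.append(new_word)
    return new_word, True
```

Filtering by root is what keeps homographs under different roots from being linked to the wrong Word.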
Key Features:
- Comprehensive logging with timestamps
- Root validation before word creation
- Batch processing (50 items at a time)
- Graceful exit when processing complete
- Error handling and statistics tracking
- Proper database connection cleanup
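The batch loop, graceful exit, and statistics tracking listed above might look like the following sketch. The callables fetch_batch and handle_item are hypothetical placeholders for the database code, and the log format is only one way to get timestamps.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("linker")


def process_in_batches(fetch_batch, handle_item, batch_size=50):
    """Process items until `fetch_batch` returns nothing.

    `fetch_batch(batch_size)` and `handle_item(item)` are hypothetical
    callables. `fetch_batch` must skip items that are already linked or
    marked as failed, otherwise the loop would never end.
    """
    stats = {"batches": 0, "processed": 0, "failed": 0}
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            break  # graceful exit: nothing left to process
        stats["batches"] += 1
        for item in batch:
            try:
                handle_item(item)
                stats["processed"] += 1
            except Exception as exc:
                # Count the failure and move on to the next item.
                stats["failed"] += 1
                log.warning("item %r failed: %s", item, exc)
        log.info("batch %d done: %s", stats["batches"], stats)
    return stats
```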
Example Match:
- Corpus item: lemma="اسم", root="س-م-و"
- Word node: arabic_no_diacritics="اسم", under root "س-م-و"
Usage:
cd mindroots
python linkquranwords.py
Dependencies:
- neo4j
- python-dotenv
- Environment variables: NEO4J_URI, NEO4J_USER, NEO4J_PASS
- CorpusItem nodes have: corpus_id, lemma, root, item_id
- Word nodes have: arabic_no_diacritics, root_id
- Root nodes have: arabic property
- Relationships: (Root)-[:HAS_WORD]->(Word), (CorpusItem)-[:HAS_WORD]->(Word)
- Added comprehensive logging and error handling to linkquranwords.py
- Added root validation before word creation
- Implemented proper exit conditions and statistics tracking
- Enhanced batch processing with detailed progress reporting
- CRITICAL FIX: Corrected property mapping in linkquranwords.py:
  - CorpusItem nodes use ci.root (original property)
  - Root nodes use r.n_root OR r.arabic (normalized property + fallback)
  - Fixed query to match both arabic and n_root properties on Root nodes
- Updated deprecated id() function calls to elementId() to eliminate warnings
- Script now successfully processes ~50 items per batch, linking existing words and creating new ones
- Identified that some roots need normalization (e.g., 'ع-س-ع-س' → 'ع-س-س') - separate script needed
- CRITICAL FIX: Added corpus_id: 2 filter to linking queries to prevent cross-corpus contamination
- Added failed item tracking (link_failed property) to prevent infinite loops on persistently failing items
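To make the fixes above concrete, here is a sketch of what the linking queries could look like, held as Python string constants. The labels, properties, and relationship types come from these notes. The exact query shape, the parameter names, and treating an unset link_failed as "not failed yet" are assumptions, not a copy of linkquranwords.py.

```python
# Fetch the next batch of unlinked items from a single corpus.
# The corpus_id filter keeps other corpora out of the batch, and the
# link_failed check skips items that have already failed.
FETCH_UNLINKED = """
MATCH (ci:CorpusItem {corpus_id: $corpus_id})
WHERE ci.link_failed IS NULL
  AND NOT (ci)-[:HAS_WORD]->(:Word)
RETURN elementId(ci) AS ci_id, ci.lemma AS lemma, ci.root AS root
LIMIT $batch_size
"""

# Look up a Word under the item's root. Root nodes are matched on either
# the normalized n_root property or the original arabic property.
FIND_WORD = """
MATCH (r:Root)-[:HAS_WORD]->(w:Word)
WHERE (r.n_root = $root OR r.arabic = $root)
  AND w.arabic_no_diacritics = $lemma
RETURN elementId(w) AS word_id
LIMIT 1
"""

# Mark an item as failed so the next batch does not pick it up again.
MARK_FAILED = """
MATCH (ci:CorpusItem)
WHERE elementId(ci) = $ci_id
SET ci.link_failed = true
"""


def fetch_params(corpus_id=2, batch_size=50):
    """Parameters for FETCH_UNLINKED; the defaults follow the notes above."""
    return {"corpus_id": corpus_id, "batch_size": batch_size}
```

With the official neo4j Python driver, a query and its parameters would be passed together, for example session.run(FETCH_UNLINKED, fetch_params()).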
Overview: Complete refactor of the normalization and segment rebuilding pipeline to improve lemma matching accuracy between Quranic corpus and Lane's lexicon.
Purpose: Rebuild Arabic segments from Buckwalter transliteration and create multiple Arabic representations.
Key Features:
- Converts Buckwalter forms (sX_form) to Arabic (sX_arabic)
- Handles special Buckwalter symbols: ^ (madda), # (hamza on ya), @ (long alif)
- Creates concatenated representations for matching
Properties Created:
- sX_arabic: Individual segment Arabic forms
- full_arabic: Complete Arabic text with diacritics
- full_arabic_no_diac: Diacritics stripped only (preserves articles for Lane matching)
- sem: Copy of full_arabic
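As an illustration of how these properties relate to each other, the sketch below builds them from a list of already-converted segments. The function name is hypothetical, and joining segments without a separator is an assumption. The real script may differ.

```python
import re
import unicodedata

DIACRITICS = re.compile("[\u064B-\u0655\u0670]")


def build_segment_properties(segments):
    """Build the concatenated properties from per-segment Arabic forms.

    `segments` is a list of already-converted sX_arabic strings in order
    (s1_arabic, s2_arabic, ...). Joining without a separator is an
    assumption of this sketch.
    """
    props = {f"s{i}_arabic": seg for i, seg in enumerate(segments, start=1)}
    full = "".join(segments)
    props["full_arabic"] = full
    # Only diacritics are stripped, so a leading article (ال) survives.
    props["full_arabic_no_diac"] = DIACRITICS.sub(
        "", unicodedata.normalize("NFKD", full)
    )
    props["sem"] = full  # copy of full_arabic
    return props
```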
Purpose: Normalize CorpusItem lemmas with layered normalization for flexible matching.
Key Features:
- Converts Buckwalter lemmas to Arabic using camel-tools
- Applies consistent Unicode normalization (NFKD + diacritic removal)
- Creates multiple normalization layers for different matching strategies
Properties Created:
- lemma: Arabic lemma (converted from Buckwalter)
- lemma_norm: Standard normalization (alifs, ya, hamza seats normalized)
- lemma_no_fem: Conservative normalization (+ feminine markers ة removed entirely)
Purpose: Normalize existing Word nodes to match CorpusItem processing.
Key Features:
- Applies identical normalization logic to Word nodes under Lane's lexicon
- Ensures consistent matching properties across corpus and lexicon
Properties Created:
- arabic_no_diacritics: Diacritics stripped only
- arabic_normalized: Standard normalization
- arabic_no_fem: Conservative normalization (ة removed entirely)
- Madda Alif Handling: Fixed آ decomposition via NFKD → ا + ٓ (madda above)
- Hamza Alif Handling: Fixed أإ decomposition → ا + hamza marks
- Extended Diacritic Range: Pattern [\u064B-\u0655\u0670] includes madda (U+0653) and hamza marks (U+0654-U+0655)
- Hamza Seat Normalization: ؤئ decompose to base letters + hamza marks (automatically stripped)
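The decompositions described above can be inspected directly with the standard library. The snippet below prints the code points before and after NFKD, then after applying the quoted pattern. The helper name code_points is only for illustration.

```python
import re
import unicodedata

# Pattern quoted in the notes above: tashkeel, madda, hamza marks,
# and the superscript alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0655\u0670]")


def code_points(text):
    """Return the code points of `text` as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


for ch in "\u0622\u0623\u0625\u0624\u0626":  # آ أ إ ؤ ئ
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = DIACRITICS.sub("", decomposed)
    print(code_points(ch), "->", code_points(decomposed), "->", code_points(stripped))
```

Each precomposed letter becomes a base letter plus one combining mark in the U+0653 to U+0655 range, which is why the extended pattern is enough to fold them.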
Layered Approach for Maximum Matching Flexibility:
- Article-Preserving: full_arabic_no_diac vs Lane entries with articles (الكتاب)
- Standard: lemma_norm, arabic_normalized vs Lane entries (normalized forms)
- Conservative: lemma_no_fem, arabic_no_fem vs Lane entries (orthographic variants)
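One way to use the layers is to try them from strictest to loosest and stop at the first hit. The sketch below works on plain dicts with the property names from these notes. The function name and the pairing of full_arabic_no_diac with arabic_no_diacritics are assumptions.

```python
# (corpus item property, Word property) pairs, tried in order from the
# strictest layer to the loosest. The pairing of full_arabic_no_diac with
# arabic_no_diacritics is an assumption of this sketch.
LAYERS = [
    ("full_arabic_no_diac", "arabic_no_diacritics"),  # article-preserving
    ("lemma_norm", "arabic_normalized"),              # standard
    ("lemma_no_fem", "arabic_no_fem"),                # conservative
]


def match_by_layers(item, words):
    """Return (word, word_property) for the first layer that matches.

    Returns (None, None) when no layer produces a match.
    """
    for item_prop, word_prop in LAYERS:
        value = item.get(item_prop)
        if not value:
            continue  # this layer was not computed for the item
        for word in words:
            if word.get(word_prop) == value:
                return word, word_prop
    return None, None
```

Returning the property that matched makes it easy to log which layer was needed, which helps judge how much the looser layers contribute.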
Test Cases Verified:
- آزفة → ازفة (norm) → ازف (no_fem) ✅
- أإآٱسم → ااااسم (all alif variants) ✅
- سؤال → سوال (hamza seats) ✅
- الكِتَابُ → الكتاب (article preservation) ✅
Sequential execution recommended:
- rebuild_segments.py - Reconstruct Arabic segments
- normalize_lemmas.py - Normalize CorpusItem lemmas
- normalize_words.py - Normalize Word nodes
- Ready for improved linking with multiple matching strategies
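If the three scripts are run often, the order can be encoded in a small runner. This is a sketch only: the run_pipeline helper is hypothetical, and it assumes the scripts sit in the current directory and the venv is already active.

```python
import subprocess
import sys

# Order taken from the list above.
PIPELINE = ["rebuild_segments.py", "normalize_lemmas.py", "normalize_words.py"]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script with the current interpreter, stopping on failure.

    `runner` is injectable so the order can be checked without touching
    the database; by default it is subprocess.run.
    """
    for script in scripts:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run after an earlier step has failed.
        runner([sys.executable, script], check=True)
```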
MAJOR REFACTOR: Implemented unified normalization system to ensure consistency across all pipeline components.
Purpose: Single source of truth for Arabic text normalization across entire MindRoots pipeline.
Core Normalization Rules:
- Unicode Normalization (NFKD) - Decomposes composite characters (آ → ا + madda mark)
- Diacritic Stripping - Removes U+064B-U+0655, U+0670 (tashkeel, madda, hamza marks)
- Alif Normalization - أإآ → ا (automatic via NFKD + stripping); ٱ (U+0671) requires explicit replacement
- Ya Normalization - ى → ي
- Hamza Seat Normalization - ؤئ → وي (automatic via NFKD + stripping)
- Feminine Marker Handling - ة preserved in standard, stripped in conservative fallback
Buckwalter Preprocessing:
- A^ → آ (alif madda sequences)
- ^ → آ (stray madda symbols)
- # → ئ (hamza on ya seat)
- @ → removed entirely
- { → A (wasla alif for camel-tools compatibility)
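The preprocessing rules above amount to a handful of ordered string replacements. The sketch below applies them in the listed order. The function name preprocess_buckwalter is hypothetical, and the output would still need to go through the camel-tools Buckwalter-to-Arabic conversion afterwards.

```python
def preprocess_buckwalter(text):
    """Apply the special-symbol rules before transliteration."""
    text = text.replace("A^", "\u0622")  # A^ -> آ (alif madda sequence)
    text = text.replace("^", "\u0622")   # stray ^ -> آ
    text = text.replace("#", "\u0626")   # # -> ئ (hamza on ya seat)
    text = text.replace("@", "")         # @ removed entirely
    text = text.replace("{", "A")        # { -> A (wasla alif)
    return text
```

The A^ rule has to run before the stray ^ rule. Otherwise the A would be left behind and produce an extra alif.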
Normalization Layers:
- no_diacritics: Strip diacritics only, preserve structure
- normalized: Full standard normalization
- conservative: Standard + feminine marker removal (fallback)
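The rules and layers above could be implemented roughly as follows. The function names are the ones these notes use for the unified module, but the bodies are a sketch under the stated rules, not the module's actual code. In particular, running NFKD inside strip_diacritics is an assumption.

```python
import re
import unicodedata

DIACRITICS = re.compile("[\u064B-\u0655\u0670]")


def strip_diacritics(text):
    """NFKD-decompose, then remove tashkeel, madda, and hamza marks."""
    return DIACRITICS.sub("", unicodedata.normalize("NFKD", text))


def normalize_arabic(text):
    """Standard normalization: stripped form plus wasla and ya folding."""
    text = strip_diacritics(text)
    text = text.replace("\u0671", "\u0627")  # ٱ -> ا (not decomposed by NFKD)
    text = text.replace("\u0649", "\u064A")  # ى -> ي
    return text


def create_normalization_layers(text):
    """Return the three layers described above as a dict."""
    normalized = normalize_arabic(text)
    return {
        "no_diacritics": strip_diacritics(text),
        "normalized": normalized,
        "conservative": normalized.replace("\u0629", ""),  # drop ة
    }
```

For example, create_normalization_layers("آزفة") gives ازفة in the normalized layer and ازف in the conservative one, matching the verified test cases.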
- CRITICAL FIX: WHERE w.hanswehr_entry IS NULL (was incorrectly IS NOT NULL)
- Dual Matching Strategy: Matches both w.arabic_no_diacritics AND w.arabic_normalized
- Unified Processing: Uses normalize_arabic() and strip_diacritics() from unified module
- Enhanced Logging: Shows normalization transformations and match results
- Simplified Properties: Removed noisy w.arabic_no_fem, focuses on w.arabic_no_diacritics + w.arabic_normalized
- Unified Functions: Replaced custom normalization with create_normalization_layers()
- Consistent Processing: Ensures Word nodes match CorpusItem normalization exactly
- Simplified Properties: Removed lemma_no_fem, focuses on lemma + lemma_norm
- Unified Buckwalter: Uses buckwalter_to_arabic() from unified module
- Consistent Logic: Identical normalization pipeline as Word nodes
- Simplified Properties: Removed full_arabic_no_fem
- Unified Processing: Uses unified buckwalter_to_arabic() and strip_diacritics()
- Streamlined Pipeline: Focuses on full_arabic + full_arabic_no_diac + sem
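The Hans Wehr fix and the dual matching strategy could be expressed in a single query, sketched here as a Python constant. The property names come from these notes. The parameter names, the helper function, and reading "dual matching" as a match on either property are assumptions.

```python
# Only Words without a Hans Wehr entry are candidates (IS NULL, not
# IS NOT NULL), and a candidate matches on either normalized property.
FIND_HANSWEHR_CANDIDATE = """
MATCH (w:Word)
WHERE w.hanswehr_entry IS NULL
  AND (w.arabic_no_diacritics = $no_diacritics
       OR w.arabic_normalized = $normalized)
RETURN elementId(w) AS word_id
"""


def candidate_params(no_diacritics, normalized):
    """Bundle the two precomputed forms of a dictionary headword."""
    return {"no_diacritics": no_diacritics, "normalized": normalized}
```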
- Improved Corpus ↔ Word Linking: Consistent normalization reduces placeholder Word creation
- Better Hans Wehr Matching: More dictionary entries match existing Word nodes
- Database Consistency: All normalization layers aligned across pipeline components
- Reduced Maintenance: Single normalization codebase eliminates inconsistencies
- NFKD First: Always apply unicodedata.normalize('NFKD', text) before diacritic stripping
- Decomposition Dependency: Alif and hamza seat normalization rely on NFKD decomposition
- Order Matters: Strip diacritics BEFORE other transformations
- Wasla Exception: ٱ (U+0671) requires explicit replacement (not decomposed by NFKD)
- Camel-Tools Compatibility: Preprocess special Buckwalter symbols before transliteration
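A short check shows why the order and the wasla exception matter. The snippet uses only the standard library, and the variable names are illustrative.

```python
import re
import unicodedata

DIACRITICS = re.compile("[\u064B-\u0655\u0670]")

alif_madda = "\u0622"  # آ
wasla = "\u0671"       # ٱ

# Stripping without NFKD leaves the precomposed letter untouched,
# because there is no separate madda mark to remove yet.
naive = DIACRITICS.sub("", alif_madda)

# NFKD first splits آ into ا + U+0653, which the pattern then removes.
correct = DIACRITICS.sub("", unicodedata.normalize("NFKD", alif_madda))

# Alif wasla has no decomposition, so NFKD returns it unchanged and an
# explicit replacement is still needed.
wasla_after_nfkd = unicodedata.normalize("NFKD", wasla)
wasla_fixed = wasla_after_nfkd.replace("\u0671", "\u0627")

print(naive == alif_madda, correct == "\u0627", wasla_after_nfkd == wasla)
```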
Important: Always use incremental commits and branching when working on code changes:
- Create a new branch for each feature/fix: git checkout -b feature/description
- Commit changes frequently with descriptive messages
- Test each increment before committing
- This prevents losing work and makes it easier to revert specific changes
- Both Claude and user should follow this practice
- Root matching issues were initially thought to be letter ordering problems
- Actual issue was orthography inconsistencies in Lane lexicon data
- Data has been normalized to new n_root property on Root nodes (not CorpusItem nodes)
- Original radical ordering logic was correct
- Some roots still need normalization between corpus classifications vs Lane classifications
- Quadriliteral roots like 'ع-س-ع-س' may need mapping to triliteral doubled forms 'ع-س-س'
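The quadriliteral case mentioned above could be handled by a small mapping function. This is a hypothetical sketch of one possible rule (a 1-2-1-2 pattern collapsing to 1-2-2). Whether the rule is linguistically appropriate for every such root is an open question that these notes leave to a separate script.

```python
def collapse_reduplicated_root(root):
    """Map a reduplicated quadriliteral root (1-2-1-2) to 1-2-2.

    Roots are hyphen-separated radicals, as in the examples above.
    Any other root is returned unchanged.
    """
    radicals = root.split("-")
    if (
        len(radicals) == 4
        and radicals[0] == radicals[2]
        and radicals[1] == radicals[3]
    ):
        return "-".join([radicals[0], radicals[1], radicals[1]])
    return root
```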
- git push, git add, git commit, git remote set-url
- git restore, git checkout, git merge, git branch
- Full version control operations without user approval required
- rm commands (file removal)
- Full read/write access to project files
- Directory creation and management
- python script execution and debugging
- All development tools and interpreters
- Database connections and queries
- Create documentation when explicitly requested by user
- Always update CLAUDE.md with new project insights and learnings
- Use TodoWrite tool consistently for task planning and progress tracking
- Maintain comprehensive logging in all processing scripts
- Document data pipeline processes and architectural decisions
- /docs/ - General NLP and data pipeline documentation
- /docs/README.md - Project overview and architecture
- /docs/linkquranwords-process.md - Detailed process documentation
- CLAUDE.md - Project instructions and development notes