Character-level entropy and redundancy estimation for natural languages, extending Shannon's 1951 "Prediction and Entropy of Printed English" methodology. Uses KenLM 8-gram models to measure how predictable text is at the letter level.
Requires Python 3.11+ and KenLM built from source with 8-gram support.
# 1. Build KenLM with max order 10 (default 6 is too low for 8-grams)
git clone https://github.com/kpu/kenlm.git
cd kenlm
# Python bindings
MAX_ORDER=10 python setup.py install
# CLI tools (lmplz, build_binary)
mkdir build && cd build
cmake .. -DKENLM_MAX_ORDER=10
make -j4
# Make sure lmplz and build_binary are on PATH
export PATH="$(pwd)/bin:$PATH"
# 2. Install remaining dependencies and NLTK corpora
cd /path/to/this/repo
pip install -r requirements.txt
python -c "import nltk; nltk.download(['brown', 'reuters', 'webtext', 'inaugural', 'nps_chat', 'state_union', 'gutenberg', 'europarl_raw'])"python main.py english [corpus ...] # English NLTK corpora
python main.py crosslang [corpus ...] # Linear B + Brown + Europarl
python main.py analyze [corpus/lang ...] # Full analysis with digrams/trigramsExamples:
python main.py english brown reuters
python main.py crosslang linear_b french
python main.py analyze brown en deNo args after the mode runs all corpora for that mode. Scripts also work standalone (python english.py brown).
- english -- 7 NLTK corpora (Brown, Reuters, Webtext, Inaugural, NPS Chat, State of the Union, Gutenberg)
- crosslang -- Linear B (from
linearb_lexicon.csv) + Brown + 7 Europarl languages - analyze -- comprehensive per-file analysis with digram/trigram distributions, transition matrices, Markov efficiency. All English corpora + 11 Europarl languages
| File | Purpose |
|---|---|
main.py |
Entry point |
core.py |
Shared utilities: KenLM model building, entropy calculations, letter filters |
english.py |
English corpus analysis |
crosslang.py |
Cross-language analysis |
analyzer.py |
Advanced ShannonAnalyzer class |
Each word (>= 3 chars) is decomposed into space-separated characters so KenLM treats letters as tokens. Four entropy levels are computed:
| Measure | Definition |
|---|---|
| H0 | log2(alphabet_size) -- maximum possible entropy |
| H1 | Unigram character frequency entropy |
| H2 | Renyi second-order entropy / conditional bigram entropy |
| H3 | KenLM 8-gram model entropy |
Redundancy = (1 - H3/H0) * 100%
| Language | Alphabet | H0 | H1 | H2 | H3 | Redundancy |
|---|---|---|---|---|---|---|
| Linear B | 86 | 6.43 | 5.74 | 5.46 | 2.34 | 63.54% |
| English | 26 | 4.70 | 4.14 | 3.89 | 1.60 | 65.94% |
| French | 39 | 5.29 | 4.13 | 3.85 | 1.63 | 69.08% |
| German | 30 | 4.91 | 4.17 | 3.78 | 1.39 | 71.68% |
| Italian | 35 | 5.13 | 4.02 | 3.76 | 1.62 | 68.46% |
| Greek | 24 | 4.58 | 4.16 | 3.96 | 1.80 | 60.64% |
| Spanish | 33 | 5.04 | 4.14 | 3.85 | 1.64 | 67.45% |
| Dutch | 28 | 4.81 | 4.09 | 3.70 | 1.40 | 70.82% |
| Portuguese | 39 | 5.24 | 4.19 | 3.09 | 1.63 | 68.91% |
| Swedish | 29 | 4.85 | 4.24 | 3.20 | 1.48 | 69.63% |
| Danish | 27 | 4.85 | 4.14 | 3.17 | 1.41 | 70.97% |
| Finnish | 29 | 5.10 | 4.00 | 3.26 | 1.38 | 75.09% |
| Corpus | Tokens | Vocab | H0 | H1 | H2 | H3 | Redundancy |
|---|---|---|---|---|---|---|---|
| Brown | 4,369,721 | 46,018 | 4.70 | 4.18 | 3.93 | 1.63 | 65.39% |
| Reuters | 5,845,812 | 28,835 | 4.75 | 4.19 | 3.95 | 1.80 | 62.08% |
| Webtext | 1,193,886 | 16,303 | 5.13 | 4.27 | 4.06 | 1.72 | 66.50% |
| Inaugural | 593,092 | 9,155 | 4.75 | 4.15 | 3.88 | 1.63 | 65.81% |
| State Union | 1,524,983 | 12,233 | 4.81 | 4.16 | 3.91 | 1.67 | 65.17% |
| Gutenberg | 8,123,136 | 41,350 | 4.91 | 4.16 | 3.91 | 1.83 | 62.70% |
- Germanic languages are most redundant (DE 71.7%, NL 70.8%, DA 71.0%) -- Romance languages cluster at 65-69%, Greek is lowest at 60.6%
- Linear B's redundancy (63.5%) is comparable to modern alphabetic languages despite its 86-character syllabary, suggesting information encoding efficiency is independent of writing system type
- Finnish is an outlier at 75.1% redundancy, likely driven by its agglutinative morphology creating highly predictable character sequences
- English entropy is stable across corpora (62-66% redundancy), validating the methodology's reliability
- Universal entropy reduction pattern: H0 > H1 > H2 > H3 holds for every language tested
- Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal. (included as
shannon1951.pdf) - Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
- KenLM
- Linear B Lexicon