Shannon Entropy Analysis

Character-level entropy and redundancy estimation for natural languages, extending Shannon's 1951 "Prediction and Entropy of Printed English" methodology. Uses KenLM 8-gram models to measure how predictable text is at the letter level.

Setup

Requires Python 3.11+ and KenLM built from source with 8-gram support.

# 1. Build KenLM with max order 10 (default 6 is too low for 8-grams)
git clone https://github.com/kpu/kenlm.git
cd kenlm

# Python bindings
MAX_ORDER=10 python setup.py install

# CLI tools (lmplz, build_binary)
mkdir build && cd build
cmake .. -DKENLM_MAX_ORDER=10
make -j4

# Make sure lmplz and build_binary are on PATH
export PATH="$(pwd)/bin:$PATH"

# 2. Install remaining dependencies and NLTK corpora
cd /path/to/this/repo
pip install -r requirements.txt
python -c "import nltk; nltk.download(['brown', 'reuters', 'webtext', 'inaugural', 'nps_chat', 'state_union', 'gutenberg', 'europarl_raw'])"

Usage

python main.py english [corpus ...]       # English NLTK corpora
python main.py crosslang [corpus ...]     # Linear B + Brown + Europarl
python main.py analyze [corpus/lang ...]  # Full analysis with digrams/trigrams

Examples:

python main.py english brown reuters
python main.py crosslang linear_b french
python main.py analyze brown en de

No args after the mode runs all corpora for that mode. Scripts also work standalone (python english.py brown).

Modes

english -- 7 NLTK corpora (Brown, Reuters, Webtext, Inaugural, NPS Chat, State of the Union, Gutenberg)
crosslang -- Linear B (from linearb_lexicon.csv) + Brown + 7 Europarl languages
analyze -- comprehensive per-file analysis with digram/trigram distributions, transition matrices, Markov efficiency. All English corpora + 11 Europarl languages

Project layout

File	Purpose
`main.py`	Entry point
`core.py`	Shared utilities: KenLM model building, entropy calculations, letter filters
`english.py`	English corpus analysis
`crosslang.py`	Cross-language analysis
`analyzer.py`	Advanced `ShannonAnalyzer` class

How it works

Each word (>= 3 chars) is decomposed into space-separated characters so KenLM treats letters as tokens. Four entropy levels are computed:

Measure	Definition
H0	`log2(alphabet_size)` -- maximum possible entropy
H1	Unigram character frequency entropy
H2	Renyi second-order entropy / conditional bigram entropy
H3	KenLM 8-gram model entropy

Redundancy = (1 - H3/H0) * 100%

Results

Cross-language comparison

Language	Alphabet	H0	H1	H2	H3	Redundancy
Linear B	86	6.43	5.74	5.46	2.34	63.54%
English	26	4.70	4.14	3.89	1.60	65.94%
French	39	5.29	4.13	3.85	1.63	69.08%
German	30	4.91	4.17	3.78	1.39	71.68%
Italian	35	5.13	4.02	3.76	1.62	68.46%
Greek	24	4.58	4.16	3.96	1.80	60.64%
Spanish	33	5.04	4.14	3.85	1.64	67.45%
Dutch	28	4.81	4.09	3.70	1.40	70.82%
Portuguese	39	5.24	4.19	3.09	1.63	68.91%
Swedish	29	4.85	4.24	3.20	1.48	69.63%
Danish	27	4.85	4.14	3.17	1.41	70.97%
Finnish	29	5.10	4.00	3.26	1.38	75.09%

English corpus comparison

Corpus	Tokens	Vocab	H0	H1	H2	H3	Redundancy
Brown	4,369,721	46,018	4.70	4.18	3.93	1.63	65.39%
Reuters	5,845,812	28,835	4.75	4.19	3.95	1.80	62.08%
Webtext	1,193,886	16,303	5.13	4.27	4.06	1.72	66.50%
Inaugural	593,092	9,155	4.75	4.15	3.88	1.63	65.81%
State Union	1,524,983	12,233	4.81	4.16	3.91	1.67	65.17%
Gutenberg	8,123,136	41,350	4.91	4.16	3.91	1.83	62.70%

Key findings

Germanic languages are most redundant (DE 71.7%, NL 70.8%, DA 71.0%) -- Romance languages cluster at 65-69%, Greek is lowest at 60.6%
Linear B's redundancy (63.5%) is comparable to modern alphabetic languages despite its 86-character syllabary, suggesting information encoding efficiency is independent of writing system type
Finnish is an outlier at 75.1% redundancy, likely driven by its agglutinative morphology creating highly predictable character sequences
English entropy is stable across corpora (62-66% redundancy), validating the methodology's reliability
Universal entropy reduction pattern: H0 > H1 > H2 > H3 holds for every language tested

References

Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal. (included as shannon1951.pdf)
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
KenLM
Linear B Lexicon

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Shannon Entropy Analysis

Setup

Usage

Modes

Project layout

How it works

Results

Cross-language comparison

English corpus comparison

Key findings

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analyzer.py		analyzer.py
core.py		core.py
crosslang.py		crosslang.py
english.py		english.py
linearb_lexicon.csv		linearb_lexicon.csv
main.py		main.py
mypy.ini		mypy.ini
requirements.txt		requirements.txt
shannon1951.pdf		shannon1951.pdf

Folders and files

Latest commit

History

Repository files navigation

Shannon Entropy Analysis

Setup

Usage

Modes

Project layout

How it works

Results

Cross-language comparison

English corpus comparison

Key findings

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages