This repository contains a highly rigorous, academically recognized C++ benchmarking framework for evaluating text-based steganography methods. It was explicitly developed to evaluate and compare a novel Log-based Whitespace Steganography method against 11 globally known, state-of-the-art steganography techniques.
The framework evaluates methods across multiple crucial dimensions:
- Statistical Invisibility: KL divergence, Markov Shifts.
- Capacity: Bits Per Word (BPW).
- Robustness: Against semantic and format attacks.
- Resistance: To modern steganalysis scanners (ZWC detectors, language model cross-entropy, etc.).
This framework is built for reproducibility, enabling researchers to seamlessly re-run benchmarks and generate statistically significant 95% Confidence Intervals (CI) formatted directly for academic publication (CSV/LaTeX plotting).
The benchmark categorizes text steganography into four primary domains, evaluating 11 distinct, academically verified channel models.
- Whitespace Modulation [Proposed Novel Method]: Leverages naturally occurring space runs in structured log files. It models the distribution of double/triple spaces as a covert channel, making it extremely resistant to LM-based detection.
- Zero-Width Characters: Utilizes Unicode ZWJ (U+200D), ZWNJ (U+200C), and ZWNBSP (U+FEFF) to invisibly encode binary payloads between visible characters.
- Font-based / Homoglyph Manipulation: An academic channel model mapping Latin ASCII characters to visually identical Cyrillic/Greek homoglyphs (e.g., ASCII
avs. Cyrillica).
- Synonym Substitution: Replaces target words with synonyms using parity-based splitting (e.g., odd-parity synonyms encode
1, even-parity encode0). - Syntactic Transformation: Alters the grammatical structure of sentences (e.g., converting active voice to passive voice) to encode data streams.
- Abbreviation & Spelling: Exploits regional spelling differences (e.g., "color" vs "colour") or abbreviation expansions (e.g., "ASAP" vs "as soon as possible").
- Markov Chain Generation: Generates text using N-gram probability splits (median/quartile probability routing) to encode bits during the generation process.
- LLM-based Encoding: Simulates Large Language Model (LLM) token probability distributions, utilizing arithmetic coding or Huffman tree routing over the softmax output to hide data.
- Padding: Appends specific byte paddings (e.g., trailing spaces, null bytes) structurally at the end of lines or blocks.
- Layout / Acrostic: Encodes bits in the first letters of lines (Acrostic) or in specific layout positions within a document grid.
- Field Ordering: Uses AST-like Key-Value pair reordering (e.g., swapping JSON/Log fields) to encode mathematical permutations (Factorial encoding).
To prevent mathematical anomalies (such as "0.0000" outputs for linguistic methods where byte-level changes are meaningless), this framework employs multi-level statistical metrics:
- Capacity (Theoretical BPW - Bits Per Word): Measures the maximum embedding rate. Higher is better, but usually trades off with invisibility.
- Byte-level KL Divergence: Measures the shift in raw byte/ASCII distribution. Crucial for detecting format-based anomalies.
- Token Unigram & Bigram KL Divergence: Mathematically captures distribution shifts caused by word replacement and structural reordering. Linguistic and generative methods heavily disrupt these metrics.
- 2nd-Order Markov Shift: Reflects abnormal structural continuity by analyzing transition probabilities between consecutive elements.
- Information Entropy Difference: Calculates the overall Shannon entropy shift pre- and post-embedding to detect unexpected randomness (often caused by encrypted payloads).
- Robustness Testing (L2 & L3 Attacks):
- L2 Attack (Format Cleaning): Simulates channel noise like trailing space trimming, Unicode normalization (NFKC), and multiple-space collapse.
- L3 Attack (Semantic Rewriting): Simulates advanced active wardens applying synonym fallback, paraphrasing, and AST reconstruction.
The framework runs all generated stego-texts through 6 built-in detection scanners to evaluate real-world survivability:
- ZWC Scanner: Detects zero-width character injection anomalies via regex and Unicode block analysis.
- Markup/Rich Text Scanner: Identifies injected HTML/XML tags used for fake formatting.
- Homoglyph/Unicode Scanner: Detects mixed-script attacks (e.g., Cyrillic injected into Latin text blocks).
- Trailing Space Scanner: Identifies unnatural trailing whitespace runs at line endings.
- Whitespace Statistics Scanner: Analyzes double-space ratios and maximum space run lengths against natural corpus baselines.
- Language Model Cross-Entropy (LM-CE): Evaluates semantic perplexity. Texts with forced synonyms or generative anomalies will exhibit spikes in Cross-Entropy.
To ensure strict academic rigor, this benchmark implements the following design constraints:
- Zero Data Faking: The framework will statically check the capacity of your input log file. If the capacity is insufficient to embed the required bits across all schemes, it will hard-fail rather than artificially duplicating or faking input text.
- LLM Evaluation Honesty: To maintain an agile CI/CD pipeline and reasonable repository size, the Generative LLM-based Encoding scheme uses equivalent FLOPs matrix multiplications to accurately simulate inference delays, rather than loading a multi-gigabyte neural network. All entropy, KL divergence, and robustness metrics remain mathematically accurate to the algorithm's output.
βββ src/ # Core C++ Source Code
β βββ advanced_main.cpp # CLI entry point and argument parsing
β βββ compare_module.cpp # Implementation of 11 algorithms & 6 scanners
β βββ main.cpp # Legacy entry point
βββ scripts/ # Auxiliary Python and Batch scripts
β βββ simulation.py # Python simulation & validation scripts
β βββ chi_square_rise.py # Python plotting scripts for paper figures
β βββ build.bat # Internal build references
βββ data/ # Input datasets (corpus, logs, secrets)
β βββ log.txt # Primary log corpus
β βββ genesis.json # Auxiliary structural data
βββ docs/ # Academic papers, PDF reports, and analysis
βββ results/ # Output directory (Auto-generated)
β βββ stego_outputs/ # Raw .txt outputs of all 11 algorithms (Iteration 0)
β βββ benchmark_per_iter.csv # Raw data for every single iteration
β βββ benchmark_summary.csv # Aggregated Means and 95% CIs
βββ bin/ # Compiled Executables
βββ build.bat # Windows build script (MSVC)
βββ build.sh # Linux/POSIX build script (g++)
βββ Makefile # Standard Makefile for Linux
βββ Dockerfile # Docker container definition
βββ LICENSE # Strict GPL v3.0 License
- Windows OS: Microsoft Visual Studio (MSVC) with C++17 support.
- Linux / macOS:
g++(GCC) orclang++with C++17 support. - Docker: If you prefer not to compile it locally.
Windows:
Run the provided build script from the project root in your terminal/PowerShell:
.\build.batThis will configure the MSVC environment, compile src/advanced_main.cpp, and output the executable to bin/advanced_main.exe.
Linux / macOS:
We provide a standard Makefile and a bash script. You can build it using g++ (C++17 required):
make
# OR
bash build.shThis will compile the source code and output the executable to bin/advanced_main.
The executable accepts standard command-line arguments to facilitate automated batch runs and CI generation.
On Windows:
.\bin\advanced_main.exe --option 6 --file data\log.txt --secret "YourSecretMessage" --iterations 100 --outdir resultsOn Linux / macOS:
./bin/advanced_main --option 6 --file data/log.txt --secret "YourSecretMessage" --iterations 100 --outdir resultsTo guarantee 100% reproducibility across any OS (including Windows WSL2, macOS, Linux) without dealing with C++ compilers or Geth installations, we provide a fully automated Docker container.
docker build -t stego-benchmark .Run the container to automatically execute the 11-algorithm benchmark. It will output to the results folder.
docker run --rm -v $(pwd)/results:/app/results stego-benchmarkTo run the live blockchain simulation (where geth feeds logs to the steganography engine in real-time), simply append simulate to the command:
docker run --rm -it -v $(pwd)/results:/app/results stego-benchmark simulate(Note: -it is required for interactive input of the secret message).
To add a new steganography method to the benchmark:
- Open
src/compare_module.cpp. - Implement your
embed_customandextract_customfunctions. - Define the theoretical capacity
max_capacity_custom. - Register it in the
run_academic_benchmarkpipeline at the bottom of the file:
run_academic_benchmark("Category Name", "Your Algorithm Name", embed_custom, extract_custom, max_capacity_custom);- Recompile and run. The framework will automatically calculate KL Divergence, Markov Shifts, Entropy, Robustness, and run it against all 6 steganalysis scanners.
STRICT LICENSE WARNING: GNU General Public License v3.0 (GPL-3.0)
This project is licensed under the strict GNU GPLv3 License.
- No MIT License: This codebase is explicitly NOT licensed under MIT, Apache, or BSD permissive licenses.
- Copyleft Provision: Any derivative works, modifications, or larger projects that incorporate this code must also be open-sourced under the exact same GPLv3 license.
- Commercial Use: Commercial exploitation, closed-source integration, or proprietary redistribution of this framework is strictly prohibited without explicit dual-licensing permission from the original authors.
See the full LICENSE file for exact legal details.