hejhdiss/GXD

GXD Compression Utility


A high-performance block-based compression utility with parallel processing, integrity verification, random-access capabilities, and smart algorithm selection.

Note: This is an alpha version (v0.0.0a2). APIs and file formats may change in future releases.


What's New in This Version

Auto Algorithm Selection

GXD now features an intelligent auto mode that automatically selects the best compression algorithm for each block based on data characteristics:

  • Analyzes Shannon Entropy (0.0-8.0 scale) to measure data randomness
  • Calculates zero-byte density and unique byte ratios
  • Dynamically chooses between lz4, zstd, brotli, or none per block
  • Per-block algorithm metadata stored in archive for accurate decompression
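The metrics above are straightforward to compute. The following is a minimal sketch of Shannon entropy and zero-byte density for a block of bytes; the helper names are illustrative, not GXD's actual internals:

```python
import math
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy in bits per byte, on the 0.0-8.0 scale."""
    if not block:
        return 0.0
    total = len(block)
    # Sum -p * log2(p) over the observed byte-value frequencies.
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(block).values())

def zero_density(block: bytes) -> float:
    """Fraction of zero bytes in the block (sparse-data indicator)."""
    return block.count(0) / len(block) if block else 0.0

# A run of identical bytes scores 0.0; a block containing every byte
# value equally often scores the maximum of 8.0.
```

A block of all zeros yields entropy 0.0 and zero density 1.0, which is why sparse data is routed to a fast algorithm rather than a heavyweight one.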

Enhanced Metadata Tracking

Each compressed block now includes detailed metrics:

  • Entropy value: Measures data randomness (0.0 = completely predictable, e.g. a single repeated byte; 8.0 = maximum randomness)
  • Compression time: Per-block timing for performance analysis
  • Timestamp: When each block was compressed
  • Algorithm used: Actual algorithm applied (important for auto mode)

File Attribute Preservation

GXD now preserves and restores original file attributes:

  • File permissions (mode)
  • Modification time (mtime)
  • Access time (atime)
  • User ID (uid) and Group ID (gid) on Unix systems
  • Automatic restoration on decompression

Archive Information Command

New info command provides comprehensive archive inspection:

  • View global archive metadata (version, algorithm, total blocks)
  • Display preserved file attributes
  • List block overview with compression details
  • Inspect specific block metadata by index

Smart Algorithm Analysis (algo.py)

algo.py is a predictive utility designed to analyze your data before compression. It uses Shannon Entropy and data density metrics to recommend the most efficient algorithm for your specific files, helping you avoid "data expansion" where compression actually increases file size.

Features

  • Predictive Modeling: Automatically suggests lz4, zstd, brotli, or none based on data heuristics.
  • Entropy Analysis: Calculates Shannon Entropy (0.0 to 8.0) to determine data randomness.
  • Efficiency Metrics: Tracks zero-byte density and unique byte ratios.
  • Real-time Benchmarking: Performs a test compression on blocks to report expected ratios and speeds (MB/s).

Usage

python3 algo.py input_file.bin --block-size 1mb --zstd-ratio 3

Analysis Metrics Reference

| Metric | Logic / Threshold | Recommended Action |
|---|---|---|
| High Entropy | Entropy > 7.9 | Use --algo none (data is likely already compressed/encrypted) |
| Sparse Data | Zeros > 40% or Entropy < 3.0 | Use --algo lz4 for maximum speed |
| General Data | Entropy < 6.8 | Use --algo zstd (default) |
| High Redundancy | High unique-byte density | Use --algo brotli for best ratio |

Understanding the Output

The utility provides a block-by-block summary:

  • Ratio: The percentage of the original size (e.g., 70% means 30% savings).
  • Speed: Estimated throughput in MB/s for the selected algorithm.
  • Status: Displays EXPANDED if the compressed output is larger than the source.

Community Project

GXD is a community-driven project, built by and for the community. Contributions are essential to its growth and improvement: whether you're reporting bugs, suggesting features, improving documentation, or submitting code, your input helps make GXD better for everyone.

Features

| Feature | Description |
|---|---|
| Multiple Algorithms | Zstandard, LZ4, Brotli, and uncompressed modes |
| Auto Algorithm Selection | Intelligent per-block algorithm selection based on entropy analysis |
| Parallel Processing | Multi-threaded compression/decompression using all CPU cores |
| Block-Level Integrity | SHA-256 checksums for each data block |
| Random Access | Seek and extract specific byte ranges without full decompression |
| Flexible Verification | Optional integrity checking for performance optimization |
| Text Mode | Direct UTF-8 text output to stdout |
| File Attribute Preservation | Maintains permissions, timestamps, and ownership |
| Archive Inspection | View metadata and block details without extraction |
| Progress Tracking | Visual progress bars with tqdm (fallback to simple indicators) |
| Entropy Tracking | Per-block entropy metrics for compression analysis |

Requirements

| Category | Dependencies |
|---|---|
| Core | Python 3.6+ |
| Optional | zstandard (Zstandard compression), lz4 (LZ4 compression), brotli (Brotli compression), tqdm (progress bars) |
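Because every compression backend is optional, a tool like this typically probes for each library at startup and only offers the algorithms that imported successfully. A sketch of that guarded-import pattern (not GXD's actual code):

```python
# Probe each optional backend; "none" (store uncompressed) always works.
AVAILABLE = {"none": True}

try:
    import zstandard  # pip install zstandard
    AVAILABLE["zstd"] = True
except ImportError:
    AVAILABLE["zstd"] = False

try:
    import lz4.frame  # pip install lz4
    AVAILABLE["lz4"] = True
except ImportError:
    AVAILABLE["lz4"] = False

try:
    import brotli  # pip install brotli
    AVAILABLE["brotli"] = True
except ImportError:
    AVAILABLE["brotli"] = False
```

With this table in hand, the CLI can reject an --algo choice whose library is missing with a clear error instead of a raw ImportError.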

Installation

# Install all optional dependencies
pip install zstandard lz4 brotli tqdm

# Or install selectively
pip install zstandard tqdm  # Minimal recommended setup

Basic Usage

Compress a File

python gxd.py compress input.bin output.gxd

Compress with Auto Algorithm Selection

python gxd.py compress input.bin output.gxd --algo auto

Decompress a File

python gxd.py decompress input.gxd -o output.bin

View Archive Information

python gxd.py info input.gxd

Extract Specific Range

python gxd.py seek input.gxd --offset 1mb --length 512kb -o chunk.bin

Command Reference

Compression Options

| Option | Values | Default | Description |
|---|---|---|---|
| --algo | auto, zstd, lz4, brotli, none | zstd | Compression algorithm to use |
| --block-size | 512kb, 1mb, 2mb, etc. | 1024kb | Size of data blocks |
| --zstd-ratio | 1-22 | 3 | Zstandard compression level (only applies when using zstd) |
| --threads | 1-128 | All CPU cores | Number of parallel threads |
| --block-verify | - | Enabled | Enable SHA-256 per-block integrity checks |
| --no-verify | - | - | Disable all integrity checks for faster performance |

Important CLI Behavior Notes:

  1. Auto Algorithm Mode: When using --algo auto, GXD analyzes each block's entropy and data characteristics to select the optimal algorithm (lz4, zstd, brotli, or none). The chosen algorithm is stored per-block in the archive metadata.

  2. Algorithm-Specific Parameters: The --zstd-ratio parameter only affects compression when using the zstd algorithm. If you specify a different algorithm with --zstd-ratio, the tool will display a warning and ignore the ratio parameter.

    Example:

    # This will show a warning that --zstd-ratio is being ignored
    python gxd.py compress input.txt output.gxd --algo lz4 --zstd-ratio 10

    Output: [!] Warning: --zstd-ratio (10) is ignored when using algorithm 'lz4'. it only applies to 'zstd'.

  3. Size Parsing: Invalid size formats will cause the program to exit with an error message. Valid formats include: 1024 (bytes), 512kb, 1mb, 2gb.

  4. Block Size Validation: Block size must be greater than 0, otherwise the program will exit with an error.
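The size-parsing and validation rules in notes 3 and 4 can be sketched as follows (a hypothetical helper, not GXD's actual implementation):

```python
import re
import sys

_UNITS = {"": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

def parse_size(text: str) -> int:
    """Parse '1024', '512kb', '1mb', or '2gb' into a byte count.

    Exits with an error message on an invalid format or a
    non-positive result, matching the CLI behavior described above.
    """
    m = re.fullmatch(r"(\d+)\s*(kb|mb|gb)?", text.strip().lower())
    if not m:
        sys.exit(f"[!] Invalid size format: {text!r}")
    size = int(m.group(1)) * _UNITS[m.group(2) or ""]
    if size <= 0:
        sys.exit("[!] Block size must be greater than 0")
    return size
```

Exiting via sys.exit with a message (rather than raising) mirrors the documented behavior: invalid input terminates the program with an error rather than a traceback.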

Decompression Options

| Option | Description |
|---|---|
| -o, --output | Path for the restored file (default: same as input minus .gxd) |
| --text | Print decompressed data as UTF-8 text to stdout |
| --threads | Number of parallel threads (default: all CPU cores) |
| --block-verify | Verify integrity using SHA-256 block hashes (enabled by default) |
| --no-verify | Disable integrity checks for maximum speed |

Note: Decompression automatically detects per-block algorithms when using auto mode archives.

Info Options

| Option | Description |
|---|---|
| --block | Display detailed metadata for a specific block (1-based index) |
| --threads | Number of threads (default: all CPU cores) |

Seek Options

| Option | Description |
|---|---|
| -o, --output | Path to save the extracted chunk (default: stdout) |
| --offset | Byte offset to start reading (e.g., 0, 1mb, 512kb) |
| --length | Number of bytes to extract (e.g., 100, 2mb; default: until EOF) |
| --text | Print extracted chunk as UTF-8 text to stdout |
| --threads | Number of parallel threads (default: all CPU cores) |
| --block-verify | Verify hashes of accessed blocks (enabled by default) |
| --no-verify | Disable integrity checks |

Size Notation

| Format | Description |
|---|---|
| 1024 | Bytes |
| 512kb | Kilobytes |
| 10mb | Megabytes |
| 1gb | Gigabytes |

File Format

GXD uses a custom archive format:

[MAGIC: "GXDINC"]
[Compressed Block 1]
[Compressed Block 2]
...
[Compressed Block N]
[JSON Metadata]
[Metadata Length: 8 bytes]
[MAGIC: "GXDINC"]

Metadata Structure

{
  "version": "0.0.0a2",
  "algo": "auto",
  "global_hash": "sha256_hash_of_original_file",
  "file_attr": {
    "mode": 33188,
    "mtime": 1703347200.0,
    "atime": 1703347200.0,
    "uid": 1000,
    "gid": 1000
  },
  "blocks": [
    {
      "id": 0,
      "start": 6,
      "size": 12345,
      "orig_size": 1048576,
      "hash": "block_sha256_hash",
      "algo": "zstd",
      "entropy": 5.8234,
      "time": 0.023456,
      "timestamp": 1703347200.123
    }
  ]
}

Metadata Fields Explained

| Field | Description |
|---|---|
| version | GXD format version |
| algo | Global algorithm setting (can be "auto") |
| global_hash | SHA-256 hash of the complete original file |
| file_attr | Preserved file system attributes |
| blocks[].algo | Actual algorithm used for each block |
| blocks[].entropy | Shannon entropy value (0.0-8.0) |
| blocks[].time | Compression time in seconds |
| blocks[].timestamp | Unix timestamp when the block was compressed |

Code Signing and Verification

The project includes a digital signature tool for verifying script integrity.

| Command | Description |
|---|---|
| python signer.py sign gxd.py | Sign a Python file with the default author |
| python signer.py sign gxd.py --author "Your Name" | Sign with a custom author |
| python signer.py verify gxd.py | Verify a signed file's integrity |

Testing

Run the comprehensive test suite:

python test.py

Test Coverage

| Test | Description |
|---|---|
| Full cycle permutations | Compression/decompression for all algorithms |
| Corrupt footer magic | Detection of tampered magic bytes |
| File truncation | Handling of incomplete files |
| Checksum mismatch | Detection of corrupted data blocks |
| Unsupported algorithm | Handling of invalid metadata |
| Text mode verification | UTF-8 output functionality |
| Seek with corruption | Random access error handling |

Performance Guidelines

Algorithm Selection

| Algorithm | Speed | Compression Ratio | Best For |
|---|---|---|---|
| auto | Adaptive | Optimized | Mixed data types (recommended for varied content) |
| zstd | Balanced | Good | General purpose (default) |
| lz4 | Fastest | Lower | Maximum speed |
| brotli | Slower | Best | Maximum compression |
| none | N/A | None | Integrity verification only |

Auto Mode Behavior

The auto algorithm selection follows these rules per block:

  • Entropy > 7.9: Uses none (data is already compressed/encrypted)
  • Zero ratio > 40% OR Entropy < 3.0: Uses lz4 (sparse data, prioritize speed)
  • Entropy < 6.8: Uses zstd (compressible data, good balance)
  • Otherwise: Uses brotli (high redundancy, maximize compression)
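These rules are a simple ordered cascade; a sketch (the function name is illustrative, not GXD's actual internals):

```python
def choose_algorithm(entropy: float, zero_ratio: float) -> str:
    """Apply the auto-mode decision rules, in order."""
    if entropy > 7.9:
        return "none"    # already compressed/encrypted; skip recompression
    if zero_ratio > 0.40 or entropy < 3.0:
        return "lz4"     # sparse data: prioritize speed
    if entropy < 6.8:
        return "zstd"    # compressible data: good balance
    return "brotli"      # high redundancy at high entropy: maximize ratio
```

The order matters: a block that is both sparse and mid-entropy hits the lz4 rule first, so the cheaper algorithm wins whenever speed and ratio would tie.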

Block Size Recommendations

| Block Size | Compression Ratio | Random Access | Use Case |
|---|---|---|---|
| 512KB-1MB | Lower | Excellent | Frequent random access |
| 1MB (default) | Balanced | Good | General purpose |
| 2-4MB | Better | Lower | Large sequential files |

Threading

| Setting | Description |
|---|---|
| Default | Uses all available CPU cores |
| Custom | Use --threads N to limit resource usage |

Verification Options

| Option | Performance | Security |
|---|---|---|
| --block-verify | Slower | High integrity checking |
| --no-verify | Fastest | No integrity verification |

Security Features

| Feature | Description |
|---|---|
| SHA-256 Integrity Checks | Per-block and global file hashing |
| Tamper Detection | Automatic detection of corrupted or modified archives |
| Metadata Validation | Structural integrity verification |
| Digital Signatures | Optional source-code signing with signer.py |
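Per-block SHA-256 checking reduces to hashing each fixed-size slice and comparing against the stored digest. A minimal sketch of that pattern (illustrative helpers, not GXD's actual code):

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 1024 * 1024):
    """Yield (block_index, sha256_hex) for each block of the input."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield offset // block_size, hashlib.sha256(block).hexdigest()

def verify_block(block: bytes, expected_hash: str) -> bool:
    """Compare a block against its stored hash to detect corruption."""
    return hashlib.sha256(block).hexdigest() == expected_hash
```

Because each block carries its own hash, a single corrupted block can be reported by index without rehashing the entire archive, which is also what makes verified random access (the seek command) practical.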

Examples

Compress with Auto Algorithm Selection

python gxd.py compress mixed_data.bin output.gxd \
  --algo auto \
  --block-size 1mb \
  --threads 8

Compress a Large Dataset

python gxd.py compress dataset.bin dataset.gxd \
  --algo zstd \
  --block-size 2mb \
  --zstd-ratio 10 \
  --threads 16

View Archive Metadata

# View general archive info
python gxd.py info data.gxd

# View specific block details
python gxd.py info data.gxd --block 5

Extract Log File Range

# Get last 100KB of a compressed log file
python gxd.py seek app.log.gxd \
  --offset 9.9mb \
  --length 100kb \
  --text

Quick Archive Verification

# Verify integrity without keeping the output (block verification is on by default)
python gxd.py decompress data.gxd -o /dev/null

Decompress with Attribute Restoration

# Original file attributes will be automatically restored
python gxd.py decompress archive.gxd -o restored.bin

Contributing

GXD is a community-driven project - your contributions are what make it thrive! Whether you're fixing bugs, adding features, improving documentation, or sharing ideas, every contribution matters and is greatly appreciated.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install dependencies: pip install zstandard lz4 brotli tqdm
  4. Make your changes and test them: python test.py
  5. Sign your code (optional): python signer.py sign your_file.py
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Contribution Ideas

Area Ideas
Features New compression algorithms, improved performance optimizations, enhanced auto-selection logic
Documentation Tutorials, use case examples, translations
Testing Additional test cases, platform-specific testing, auto mode validation
Bug Reports Issue identification, reproduction steps
Code Quality Refactoring, type hints, performance profiling

All contributions, no matter how small, help improve GXD for the entire community.

License

GXD Compression Utility
Copyright (C) 2025 @hejhdiss (Muhammed Shafin p)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

See LICENSE.txt for the full license text.

Author

@hejhdiss (Muhammed Shafin p)

Acknowledgments

  • Built with Python's ProcessPoolExecutor for parallel processing
  • Compression powered by Zstandard, LZ4, and Brotli libraries
  • Progress visualization by tqdm
  • Smart algorithm selection using Shannon Entropy analysis

Development Status & Updates

Important Notice: This project is maintained as a personal/community effort. The author is not committed to regular updates, and future releases may or may not come depending on time, interest, and community needs. This is an alpha release (v0.0.0a2) provided as-is.

  • Update Schedule: No guaranteed timeline for new features or bug fixes
  • Stability: Current version is functional but APIs and file formats may change
  • Community Contributions: Highly encouraged and may be the primary driver of future development
  • Support: Best-effort basis only

If you need guaranteed maintenance or specific features, consider forking the project or contributing directly. Community feedback and contributions are welcome and may help shape future development.
