
Hypersets

Efficient SQL interface for HuggingFace datasets using DuckDB.

Hypersets is a library for working with massive datasets without downloading them entirely. Query terabytes of data using simple SQL while downloading only what you need.

Hypersets is currently in pre-alpha. Use at your own risk.

✨ Features

  • πŸš€ Fast metadata retrieval - Get dataset info without downloading
  • πŸ’Ύ Memory-only operation - No disk caching unless requested
  • 🎯 Efficient querying - SQL interface with DuckDB optimization
  • πŸ“Š Download tracking - See exactly how much data you're saving
  • 🧠 Smart caching - Avoid repeated API calls
  • πŸ”„ Multiple formats - Output as pandas DataFrame or HuggingFace Dataset
  • ⚑ Rate limit handling - Built-in exponential backoff for 429 errors
  • πŸ›‘οΈ Proper error handling - Clear exceptions for common issues

🚦 Validation Status

What has been tested and confirmed so far:

  • Dataset info retrieval: Fast YAML frontmatter parsing
  • Efficient querying: DuckDB SQL with HTTP optimization and 429 retry logic
  • Smart caching: 1000x+ speedup on repeated calls
  • Download tracking: 99.9% data savings demonstrated on real datasets (0.04 GB downloaded from a 59 GB dataset for simple operations)
  • Multiple formats: pandas DataFrame and HuggingFace Dataset support
  • Error handling: Proper exceptions and retry logic for production use
  • Memory efficiency: Handles TB-scale datasets using only MBs to GBs of RAM and bandwidth

πŸ“¦ Installation

pip install hypersets

🎯 Quick Start

import hypersets as hs

# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly") 
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")

# Query with SQL - only downloads what's needed
result = hs.query(
    "SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")

πŸš€ Core API

Dataset Information

# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")

# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}")  # First 10

# Clear cached metadata
hs.clear_cache()

SQL Querying

# Basic querying
result = hs.query(
    "SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Aggregation queries
count = hs.count(
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en"
)
print(f"Total articles: {count:,}")

# Advanced analytics
stats = hs.query(
    """
    SELECT 
        COUNT(*) as total_articles,
        AVG(LENGTH(text)) as avg_length,
        MAX(LENGTH(text)) as max_length
    FROM dataset
    """,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

Sampling & Exploration

# Random sampling with DuckDB optimization
sample = hs.sample(
    n=1000,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url", "LENGTH(text) as text_length"]
)

# Quick data preview
preview = hs.head(
    n=5,
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en",
    columns=["title", "url"]
)

# Schema inspection
schema = hs.schema(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")

Output Formats

result = hs.query("SELECT * FROM dataset LIMIT 100", ...)

# As pandas DataFrame
df = result.to_pandas()
print(df.head())

# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)

# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")

Download Tracking

# Enable download tracking to see data savings
result = hs.query(
    "SELECT title FROM dataset LIMIT 1000",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    track_downloads=True
)

# Check savings
if result.download_stats:
    stats = result.download_stats
    print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
    print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
    print(f"Savings: {stats.savings_percentage:.1f}%")

πŸ“ Examples

Explore our comprehensive examples to see Hypersets in action:

πŸƒ Quick Demo

python examples/demo.py

Complete feature demonstration - Shows all Hypersets capabilities with real datasets.

πŸ“š Basic Usage

python examples/basic_usage.py  

Learn the fundamentals - Dataset info, querying, sampling, caching, and output formats.

πŸ”¬ Advanced Queries

python examples/advanced_queries.py

Sophisticated analytics - Text analysis, pattern matching, quality metrics, and performance optimization.

πŸ—οΈ Architecture

Hypersets consists of four core components:

  1. Dataset Info Retriever - Discovers parquet files, configs, and schema from YAML frontmatter
  2. DuckDB Mount System - Mounts remote parquet files as virtual tables with HTTP optimization
  3. Query Interface - Clean API with SQL support, download tracking, and multiple output formats
  4. Smart Caching - TTL-based caching of dataset metadata to avoid repeated API calls

All components include proper 429 rate limit handling with exponential backoff.
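
To make the mount-and-retry behavior concrete, here is a minimal sketch of the general pattern, not Hypersets' internal code: DuckDB's httpfs extension exposes a remote parquet file as a virtual table named dataset, and a hypothetical retry loop backs off exponentially when the server returns HTTP 429. The parquet URL, the query_with_backoff helper, and the max_retries parameter are illustrative placeholders.

import time
import duckdb

# Illustrative parquet URL on the HuggingFace Hub (placeholder path).
PARQUET_URL = "https://huggingface.co/datasets/omarkamali/wikipedia-monthly/resolve/main/example.parquet"

def query_with_backoff(sql: str, max_retries: int = 5):
    """Run SQL against a remote parquet file, retrying on rate limits."""
    con = duckdb.connect()                        # in-memory database, no disk cache
    con.execute("INSTALL httpfs; LOAD httpfs;")   # enable HTTP(S) range reads
    # Expose the remote file as a virtual table named 'dataset'.
    con.execute(f"CREATE VIEW dataset AS SELECT * FROM read_parquet('{PARQUET_URL}')")
    for attempt in range(max_retries):
        try:
            return con.execute(sql).fetchdf()     # returns a pandas DataFrame
        except duckdb.Error as exc:
            if "429" in str(exc) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)          # exponential backoff: 1s, 2s, 4s, ...
            else:
                raise

df = query_with_backoff("SELECT COUNT(*) AS n FROM dataset")

Because DuckDB only fetches the byte ranges it needs (footer, column chunks that the query touches), selecting a few columns with a LIMIT downloads a small fraction of the file rather than the whole dataset.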

πŸ”§ Advanced Configuration

Memory Management

# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
    "SELECT * FROM dataset LIMIT 1000",
    dataset="large/dataset",
    memory_limit="8GB"  # Increase for large datasets
)

# For extremely large datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10000", 
    dataset="massive/dataset",
    memory_limit="16GB",  # More memory
    threads=8             # More threads
)

# Memory-efficient column selection
result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # Only select needed columns
    dataset="large/dataset",
    memory_limit="2GB"  # Can use less memory
)

Memory Limit Guidelines:

  • Default (4GB): Good for most datasets up to ~50GB
  • 8GB: For large datasets (50-200GB) or complex queries
  • 16GB+: For massive datasets (200GB+) or heavy aggregations
  • Column selection: Always select only needed columns for better memory efficiency

Custom Caching

# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600)  # 1 hour

# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)

Authentication

# Use HuggingFace token for private datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10",
    dataset="private/dataset",
    token="hf_your_token_here"
)

Performance Tuning

# Optimize for your use case
result = hs.query(
    "SELECT * FROM dataset USING SAMPLE 10000",
    dataset="large/dataset",
    memory_limit="6GB",    # Adequate memory
    threads=4,            # Balanced parallelism  
    track_downloads=True  # Monitor efficiency
)

# For aggregation-heavy workloads
stats = hs.query(
    """
    SELECT 
        category,
        COUNT(*) as count,
        AVG(LENGTH(text)) as avg_length
    FROM dataset 
    GROUP BY category
    """,
    dataset="large/dataset",
    memory_limit="12GB",  # More memory for grouping
    threads=8            # More threads for aggregation
)

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run tests: pytest tests/
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

  • DuckDB for incredible SQL analytics on remote data
  • Parquet for providing the de facto standard for columnar data storage
  • HuggingFace for democratizing access to datasets
  • The open source community for inspiration and feedback

Contributors

Omar Kamali

πŸ“ Citation

If you use Hypersets in your research, please cite:

@misc{hypersets,
    title={Hypersets: Efficient dataset transfer, querying and transformation},
    author={Omar Kamali},
    year={2025},
    url={https://github.com/omarkamali/hypersets},
    note={Project developed under Omneity Labs}
}

πŸš€ Ready to query terabytes of data efficiently? Start with examples/demo.py to see Hypersets in action!
