dvla/sifft

SIFFT

File ingestion pipelines are repetitive and fragile. SIFFT (Spark Ingestion Framework For Tables) is a Python library that aims to:

  • Make file reading format-agnostic with automatic delimiter and header detection
  • Validate data quality with CSVW metadata and constraint checking before it enters your tables
  • Provide consistent error handling across CSV, Excel, TSV, and pipe-delimited files
  • Support schema evolution and merge/upsert operations for Delta tables
  • Reduce boilerplate code for common Spark ingestion patterns
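To illustrate the first bullet, delimiter and header auto-detection can be approximated with the standard library's `csv.Sniffer`. This is a minimal sketch of the idea, not SIFFT's actual implementation:

```python
import csv

def sniff_dialect(sample: str) -> tuple[str, bool]:
    """Guess the delimiter and whether the first row is a header."""
    sniffer = csv.Sniffer()
    # Restrict guessing to the delimiters SIFFT's docs mention: comma, tab, pipe.
    dialect = sniffer.sniff(sample, delimiters=",\t|")
    has_header = sniffer.has_header(sample)
    return dialect.delimiter, has_header

# A pipe-delimited sample with a header row.
sample = "id|name|score\n1|alice|0.9\n2|bob|0.7\n"
delimiter, has_header = sniff_dialect(sample)
```

A production detector would also need to handle quoting, encodings, and ragged rows, which is where a framework earns its keep over the stdlib heuristic.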

Install

pip install sifft              # Without PySpark (use your Databricks runtime)
pip install sifft[pyspark3]    # With PySpark 3.5 + Delta Lake 3.x
pip install sifft[pyspark4]    # With PySpark 4.x + Delta Lake 4.x

Quick Start

from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)

# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    assert report.valid, f"{len(report.violations)} violations"

# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
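The `validate_csvw_constraints` call above checks a DataFrame against CSVW metadata. To show the kind of checks involved without a Spark session, here is a toy pure-Python version over lists of dicts; the function name and the `columns` spec format are illustrative, not SIFFT's API:

```python
import re

def check_constraints(rows: list[dict], columns: dict) -> list[str]:
    """Toy CSVW-style validation: required, pattern, and minimum per column."""
    violations = []
    for i, row in enumerate(rows):
        for name, spec in columns.items():
            value = row.get(name)
            if spec.get("required") and value in (None, ""):
                violations.append(f"row {i}: {name} is required")
                continue
            if value is None:
                continue
            pattern = spec.get("pattern")
            if pattern and not re.fullmatch(pattern, str(value)):
                violations.append(f"row {i}: {name} fails pattern {pattern}")
            minimum = spec.get("minimum")
            if minimum is not None and float(value) < minimum:
                violations.append(f"row {i}: {name} below minimum {minimum}")
    return violations

rows = [{"id": "1", "score": "0.9"}, {"id": "", "score": "-2"}]
columns = {
    "id": {"required": True, "pattern": r"\d+"},
    "score": {"minimum": 0},
}
violations = check_constraints(rows, columns)  # two violations, both in row 1
```

The real library additionally handles unique, enum, and primary-key constraints, and runs them as Spark operations rather than a Python loop.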

Features

  • File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
  • Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
  • Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
  • File Management — Safe move and list operations across local filesystems, S3, Azure, and GCS.
  • Extensibility — Custom format handlers, write modes, and constraint validators.
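Checksum-based deduplication, mentioned under File Processing, boils down to hashing file contents and skipping anything already seen. A self-contained sketch under that assumption (again, not SIFFT's internal code):

```python
import hashlib
import tempfile
from pathlib import Path

def file_checksum(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def new_files(paths: list[Path], seen: set[str]) -> list[Path]:
    """Keep only files whose content checksum has not been ingested before."""
    fresh = []
    for path in paths:
        checksum = file_checksum(path)
        if checksum not in seen:
            seen.add(checksum)
            fresh.append(path)
    return fresh

# Demo: b.csv duplicates a.csv byte-for-byte, so it is skipped.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.csv").write_text("x,y\n1,2\n")
(tmp / "b.csv").write_text("x,y\n1,2\n")
(tmp / "c.csv").write_text("x,y\n3,4\n")
fresh = new_files(sorted(tmp.glob("*.csv")), set())
```

Persisting the `seen` set (for example, in a Delta table keyed by checksum) is what makes this safe across pipeline runs rather than within one.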

Documentation

Compatibility

  • Python 3.10+
  • PySpark 3.5.x, 4.0.x, or 4.1.x
  • Databricks / Unity Catalog compatible

Development

just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands

Design Decisions

Architecture Decision Records are in design_decisions/.

License

MIT — see LICENSE.

Contributors

  • Iwan Dyke
  • Fahad Khan
  • Michal Poreba
