# SIFFT

File ingestion pipelines are repetitive and fragile. SIFFT (Spark Ingestion Framework For Tables) is a Python library that aims to:
- Make file reading format-agnostic with automatic delimiter and header detection
- Validate data quality with CSVW metadata and constraint checking before it enters your tables
- Provide consistent error handling across CSV, Excel, TSV, and pipe-delimited files
- Support schema evolution and merge/upsert operations for Delta tables
- Reduce boilerplate code for common Spark ingestion patterns
## Installation

```bash
pip install sifft            # Without PySpark (use your Databricks runtime)
pip install sifft[pyspark3]  # With PySpark 3.5 + Delta Lake 3.x
pip install sifft[pyspark4]  # With PySpark 4.x + Delta Lake 4.x
```

## Quick Start

```python
from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions
spark = SparkSession.builder.appName("Pipeline").getOrCreate()
# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)
# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    assert report.valid, f"{len(report.violations)} violations"
# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
```

## Features

- File Processing — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
- Data Validation — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys (see the metadata sketch after this list).
- Table Writing — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution (a merge sketch follows this list).
- File Management — Safe move/list across local, S3, Azure, GCS.
- Extensibility — Custom format handlers, write modes, and constraint validators.
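For the validation feature, CSVW metadata is a JSON sidecar file as defined by the W3C CSVW specification (by the spec's default naming convention, the sidecar for `data.csv` is `data.csv-metadata.json`). The sketch below uses illustrative column names and only core-spec constraints; the full vocabulary SIFFT checks (e.g. unique, enum) is documented in the CSVW Metadata reference.

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "data.csv",
  "tableSchema": {
    "columns": [
      {"name": "id", "datatype": "integer", "required": true},
      {"name": "email", "datatype": {"base": "string", "format": "[^@]+@[^@]+"}},
      {"name": "age", "datatype": {"base": "integer", "minimum": 0, "maximum": 120}}
    ],
    "primaryKey": "id"
  }
}
```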
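The quick start above appends; merge/upsert is selected through `TableWriteOptions` as well. A minimal sketch follows, assuming a `merge` mode name and a `merge_keys` parameter: both names are hypothetical, not confirmed SIFFT API, so check the Table Writing documentation for the actual option names.

```python
from table_writing import write_table, TableWriteOptions

# Hypothetical sketch: mode="merge" and merge_keys are assumed names,
# not confirmed SIFFT API; see the Table Writing docs for the real options.
opts = TableWriteOptions(format="delta", mode="merge", merge_keys=["id"])

# Upsert: rows matching on "id" are updated, new rows are inserted.
write_table(result.dataframe, "catalog.schema.target", spark, opts)
```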
## Documentation

- Getting Started — installation, requirements, full quick start
- File Processing — reading files, tracking, deduplication
- Data Validation — schema validation and CSVW constraints
- Table Writing — write modes, formats, schema evolution
- File Management — safe move/list with cloud support
- CSVW Metadata — metadata format and constraint reference
- Extensibility — custom handlers and validators
- AutoLoader Integration — using SIFFT with Databricks AutoLoader
## Requirements

- Python 3.10+
- PySpark 3.5.x, 4.0.x, or 4.1.x
- Databricks / Unity Catalog compatible
## Development

```bash
just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands
```

Architecture Decision Records are in `design_decisions/`.
## License

MIT — see LICENSE.
## Contributors

- Iwan Dyke
- Fahad Khan
- Michal Poreba