Data Index

A pipeline that ingests CF-compliant NetCDF files from S3, extracts metadata, and stores it for discovery and later analysis.

Pipeline

[ INJECTED DEPENDENCIES ]
  ├── InventorySource
  ├── BatchPartitioner
  ├── FileFetcher
  ├── MetadataExtractor
  ├── StructuredSink
  └── UnstructuredSink
         │
         ▼
 ┌────────────────────────────────────────────────────────┐
 │ Orchestrator (Prefect Flow)                            │
 ├────────────────────────────────────────────────────────┤
 │                                                        │
 │  1. [ Sinks.provision() ]                              │
 │                                                        │
 │  2. [ InventorySource.inventory() ] ──► Full Corpus    │
 │                                           │            │
 │  3. [ BatchPartitioner.partition() ] ◄────┘            │
 │            │                                           │
 │            ▼                                           │
 │     [ Split Batches ]                                  │
 │            │                                           │
 └────────────┼───────────────────────────────────────────┘
              │
              │ Dispatch concurrent workers
              ▼
 ┌──────────────────────────────────────────────────────────┐
 │ Concurrently Executed Batch Process                      │
 ├──────────────────────────────────────────────────────────┤
 │                                                          │
 │    extract() ◄───────── [ FileFetcher.fetch() ]          │
 │        │                                                 │
 │        ▼                                                 │
 │   transform() ◄──────── [ MetadataExtractor.extract() ]  │
 │        │                                                 │
 │        ▼                                                 │
 │  ExtractionResult(structured + unstructured + status)    │
 │                                                          │
 │        │                                                 │
 │        ▼                                                 │
 │     load()                                               │
 │        ├──► [ StructuredSink.sink() ]   ──► store        │
 │        └──► [ UnstructuredSink.sink() ] ──► store        │
 │                                                          │
 └──────────────────────────────────────────────────────────┘
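
The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the method names (`provision`, `inventory`, `partition`, `fetch`, `extract`, `sink`) come from the diagram, but every signature, the `ExtractionResult` fields, the in-memory stand-ins, and the use of a thread pool in place of Prefect's concurrent dispatch are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ExtractionResult:
    key: str
    structured: dict   # assumed shape: flat metadata fields
    unstructured: str  # assumed shape: free-form metadata text
    status: str        # assumed values: "ok" or "error"


# Hypothetical interfaces for the injected dependencies.
class InventorySource(Protocol):
    def inventory(self) -> list[str]: ...

class BatchPartitioner(Protocol):
    def partition(self, keys: list[str]) -> list[list[str]]: ...

class FileFetcher(Protocol):
    def fetch(self, key: str) -> bytes: ...

class MetadataExtractor(Protocol):
    def extract(self, key: str, payload: bytes) -> ExtractionResult: ...

class Sink(Protocol):
    def provision(self) -> None: ...
    def sink(self, results: list[ExtractionResult]) -> None: ...


def process_batch(batch, fetcher, extractor, structured_sink, unstructured_sink):
    """Run extract -> transform -> load for one batch and return its size."""
    payloads = [(key, fetcher.fetch(key)) for key in batch]        # extract()
    results = [extractor.extract(k, p) for k, p in payloads]       # transform()
    structured_sink.sink(results)                                  # load()
    unstructured_sink.sink(results)
    return len(results)


def orchestrate(source, partitioner, fetcher, extractor,
                structured_sink, unstructured_sink, max_workers=4):
    """Provision sinks, inventory the corpus, partition it, dispatch batches."""
    structured_sink.provision()
    unstructured_sink.provision()
    batches = partitioner.partition(source.inventory())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(
            lambda b: process_batch(b, fetcher, extractor,
                                    structured_sink, unstructured_sink),
            batches,
        )
        return sum(counts)


# In-memory stand-ins, for illustration only.
class ListSource:
    def __init__(self, keys):
        self.keys = keys
    def inventory(self):
        return list(self.keys)

class FixedSizePartitioner:
    def __init__(self, size):
        self.size = size
    def partition(self, keys):
        return [keys[i:i + self.size] for i in range(0, len(keys), self.size)]

class FakeFetcher:
    def fetch(self, key):
        return key.encode()  # pretend the object body is just its key

class FakeExtractor:
    def extract(self, key, payload):
        return ExtractionResult(key, {"size_bytes": len(payload)},
                                payload.decode(), "ok")

@dataclass
class MemorySink:
    provisioned: bool = False
    rows: list = field(default_factory=list)
    def provision(self):
        self.provisioned = True
    def sink(self, results):
        self.rows.extend(results)
```

Because every collaborator is passed in, the same `orchestrate` function can run against in-memory fakes in a test and against real S3 and table-backed implementations in production, which is the point of the injected-dependencies layout shown above.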

Running locally

Start a local Prefect server:

uv run prefect server start

Run a local test against a sampled inventory:

uv run cluster-local

Run against AWS Fargate (builds and pushes a Docker image to ECR):

uv run cluster-fargate

Reading results

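from pyiceberg.catalog.sql import SqlCatalog
import polars

# Open the SQLite-backed Iceberg catalog under .load/orchestrate-test.
catalog = SqlCatalog(
    "data-index",
    uri="sqlite:///.load/orchestrate-test/catalog.db",
    warehouse=".load/orchestrate-test",
)

# Scan the ("structured-metadata", "test") table into Arrow,
# then convert it to a polars DataFrame.
df = polars.from_arrow(
    catalog.load_table(("structured-metadata", "test")).scan().to_arrow()
)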

Development

uv sync --group dev
uv run pre-commit install   # install ruff check + format hooks
uv run pytest tests/ -v
