Beacon — ARCO Data Lakehouse Query Engine

Beacon is a lightweight, high-performance data lakehouse query engine for scientific data. It lets you discover, read, transform, and serve large collections of array and tabular datasets in place — no copying into a warehouse, no rigid ETL pipeline. Point Beacon at a directory or object-storage bucket of files and query them directly over HTTP, with results streamed back in the format you ask for.

It is built on Apache Arrow and Apache DataFusion, so queries run on a columnar, vectorized engine while reading native scientific formats such as NetCDF, Zarr, Parquet, and ODV. Beacon is developed by MARIS for serving Analysis-Ready, Cloud-Optimized (ARCO) marine and environmental data, but nothing about it is domain-specific.

Why Beacon

  • Query files where they live. Read NetCDF, Zarr, Parquet, ODV, CSV and more directly from a local volume or S3-compatible object store — no ingestion step.
  • One API, many formats in and out. Send a SQL or JSON query, choose your output format (Parquet, CSV, NetCDF, GeoParquet, Arrow IPC, ODV) and stream the result.
  • Built for scale. Columnar execution, predicate/projection pushdown, and schema caching on top of Arrow + DataFusion.
  • Self-describing. A built-in OpenAPI/Swagger UI documents every endpoint, and discovery endpoints expose available datasets, tables, columns, and functions.

Features

  • Input formats: Parquet, NetCDF, Zarr, ODV, CSV, Arrow IPC, GeoTIFF, and the native Beacon Binary Format (BBF).
  • Output formats: Parquet, GeoParquet, NetCDF, ND-NetCDF, CSV, Arrow IPC, and ODV.
  • Two query interfaces: a structured JSON query API and read-only raw SQL (enabled by default; toggle with BEACON_ENABLE_SQL).
  • Arrow Flight SQL endpoint for high-throughput clients (enabled by default).
  • Storage backends: local filesystem and S3-compatible object storage, with optional change-event watching.
  • Interactive API docs via Swagger UI (/swagger) and Scalar (/scalar).

Concepts

  • Datasets — the raw source files you make available to Beacon (e.g. .nc, .zarr, .parquet, .csv). Drop them into the mounted datasets directory and Beacon discovers them automatically.
  • Tables — named, queryable collections defined over one or more datasets, stored in the tables directory. A configurable default table (BEACON_DEFAULT_TABLE) is queried when no source is specified.
  • Source functions — table functions such as read_netcdf(...), read_parquet(...), and read_csv(...) let a query read specific files directly, without first defining a table.
  • Query engine — every request is parsed into a DataFusion logical plan and executed on the Arrow columnar engine, then encoded into the requested output format and streamed back.

See the documentation for the full data model.
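
As a rough illustration of the source functions described above, the sketch below assembles a `read_netcdf([...])` call from a list of paths, following the syntax used in this README's SQL example. The helper names (`sql_string_literal`, `read_netcdf_call`) are hypothetical and not part of Beacon, and escaping a single quote by doubling it is standard SQL that is assumed, not confirmed, to be accepted here.

```python
def sql_string_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def read_netcdf_call(paths):
    """Render read_netcdf(['a.nc', 'b.nc']) for the given dataset paths."""
    if not paths:
        raise ValueError("at least one path is required")
    quoted = ", ".join(sql_string_literal(path) for path in paths)
    return f"read_netcdf([{quoted}])"


# Hypothetical usage: the column names and paths are illustrative.
source = read_netcdf_call(["data/2020.nc", "data/2021.nc"])
sql = f"SELECT TEMP, PSAL FROM {source}"
print(sql)
```

Building the call from a list keeps path quoting in one place, which helps when the file list is generated rather than typed by hand.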

Quick start (Docker)

services:
  beacon:
    image: ghcr.io/maris-development/beacon:latest
    container_name: beacon
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      - BEACON_HOST=0.0.0.0
      - BEACON_PORT=8080
      - BEACON_ADMIN_USERNAME=admin
      - BEACON_ADMIN_PASSWORD=securepassword
      - BEACON_VM_MEMORY_SIZE=4096
      - BEACON_DEFAULT_TABLE=default
      - BEACON_LOG_LEVEL=INFO
    volumes:
      - ./datasets:/beacon/data/datasets
      - ./tables:/beacon/data/tables

Start it with docker compose up -d, then open the interactive API docs at http://localhost:8080/swagger/.

Add data by placing files (e.g. .nc, .zarr, .parquet, .csv) into ./datasets — the container discovers them through the mounted volume.
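
If you script the quick start, it can help to wait until the server answers before sending queries. The sketch below polls a URL until it returns HTTP 200, using only the Python standard library. The function name `wait_until_up` and its parameters are hypothetical, and using the Swagger page as a readiness signal is an assumption; Beacon may offer a dedicated health endpoint that this README does not mention.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, timeout_s=60.0, interval_s=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses.

    Returns True if the server answered in time, False otherwise.
    `opener` is injectable so the function can be exercised without a server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the compose file above, `wait_until_up("http://localhost:8080/swagger/")` would be the call to make after `docker compose up -d`; adjust the host and port if you changed them.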

Prefer a native build or a non-Docker install? See the installation guide.

Query examples

Both examples below post to the same endpoint and stream back a file in the requested output format.

SQL

SQL is read-only and enabled by default. Set BEACON_ENABLE_SQL=false to disable it.

POST http://localhost:8080/api/query
Content-Type: application/json

{
  "sql": "SELECT TEMP, PSAL, LONGITUDE, LATITUDE FROM read_netcdf(['data/2020.nc', 'data/2021.nc']) WHERE time > '2020-01-01T00:00:00'",
  "output": { "format": "parquet" }
}
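
The request above can be sent from Python with the standard library alone. The following is a minimal sketch, assuming the query endpoint needs no authentication and that the response body is the result file itself, as this README describes. The function names `build_sql_request` and `run_sql_to_file` are hypothetical.

```python
import json
import shutil
import urllib.request


def build_sql_request(base_url, sql, output_format):
    """Build the POST request for the /api/query endpoint shown above."""
    body = {"sql": sql, "output": {"format": output_format}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql_to_file(base_url, sql, output_format, destination):
    """Send the query and stream the response body to `destination`."""
    request = build_sql_request(base_url, sql, output_format)
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
```

For example, `run_sql_to_file("http://localhost:8080", sql, "parquet", "result.parquet")` would write the streamed result to disk without holding it all in memory.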

JSON

The JSON query API is always available and needs no extra configuration.

POST http://localhost:8080/api/query
Content-Type: application/json

{
  "query_parameters": [
    { "column_name": "TEMP", "alias": "temperature" },
    { "column_name": "PSAL", "alias": "salinity" },
    { "column_name": "TIME" },
    { "column_name": "LONGITUDE" },
    { "column_name": "LATITUDE" }
  ],
  "filters": [
    { "for_query_parameter": "temperature", "min": -2, "max": 35 },
    { "for_query_parameter": "salinity", "min": 30, "max": 42 },
    {
      "and": [
        { "for_query_parameter": "LONGITUDE", "min": -20, "max": 20 },
        { "for_query_parameter": "LATITUDE", "min": 40, "max": 65 }
      ]
    }
  ],
  "from": {
    "netcdf": { "paths": ["data/2020.nc", "data/2021.nc"] }
  },
  "output": { "format": "csv" }
}

The response is a streamed file in the chosen output.format (here, CSV). See the query reference for the full schema, all source types, and every output format.
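
When queries are generated programmatically, small helpers can keep the JSON body consistent. The sketch below rebuilds the example above from plain Python values. The helper names (`select`, `range_filter`, `build_json_query`) are hypothetical, and the field names are copied from the example, not from the full query schema. Note that, as in the example, a filter refers to a column by its alias when one is defined.

```python
import json


def select(column_name, alias=None):
    """One entry of "query_parameters"; the alias is optional."""
    entry = {"column_name": column_name}
    if alias is not None:
        entry["alias"] = alias
    return entry


def range_filter(name, minimum, maximum):
    """A min/max filter on a selected column (or on its alias)."""
    return {"for_query_parameter": name, "min": minimum, "max": maximum}


def build_json_query(columns, filters, netcdf_paths, output_format):
    """Assemble a query body that reads from a list of NetCDF files."""
    return {
        "query_parameters": columns,
        "filters": filters,
        "from": {"netcdf": {"paths": list(netcdf_paths)}},
        "output": {"format": output_format},
    }


query = build_json_query(
    columns=[
        select("TEMP", alias="temperature"),
        select("PSAL", alias="salinity"),
        select("TIME"),
        select("LONGITUDE"),
        select("LATITUDE"),
    ],
    filters=[
        range_filter("temperature", -2, 35),
        range_filter("salinity", 30, 42),
        {"and": [range_filter("LONGITUDE", -20, 20), range_filter("LATITUDE", 40, 65)]},
    ],
    netcdf_paths=["data/2020.nc", "data/2021.nc"],
    output_format="csv",
)
print(json.dumps(query, indent=2))
```

The printed body can be posted to `/api/query` with `Content-Type: application/json`, exactly like the hand-written example.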

Configuration

Beacon is configured entirely through BEACON_* environment variables. The most common ones:

| Variable | Default | Description |
| --- | --- | --- |
| BEACON_HOST | 0.0.0.0 | Address the HTTP server binds to. |
| BEACON_PORT | 5001 | HTTP server port. |
| BEACON_ADMIN_USERNAME | beacon-admin | Admin username for management endpoints. |
| BEACON_ADMIN_PASSWORD | beacon-password | Admin password; change this in production. |
| BEACON_LOG_LEVEL | info | Log verbosity (trace, debug, info, warn, error). |
| BEACON_VM_MEMORY_SIZE | 8192 | Working memory (MB) available to the query engine. |
| BEACON_DEFAULT_TABLE | default | Table queried when a request specifies no source. |
| BEACON_WORKER_THREADS | 8 | Number of query worker threads. |
| BEACON_ENABLE_SQL | true | Enable the read-only raw SQL query interface. |
| BEACON_TABLE_SYNC_INTERVAL_SECS | 300 | How often (in seconds) tables are re-synced from disk. |
| BEACON_FLIGHT_SQL_ENABLE | true | Enable the Arrow Flight SQL endpoint. |
| BEACON_FLIGHT_SQL_PORT | 32011 | Arrow Flight SQL port. |

S3-compatible storage, CORS, NetCDF caching, and Flight SQL authentication have their own BEACON_* settings — see the configuration reference for the complete list.
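
Before deploying, it can be useful to see which values a given environment would end up with. The sketch below overlays an environment mapping on the defaults listed above and flags an unchanged admin password. It mirrors only this README's table; it is not how Beacon itself parses its configuration, and the names `DEFAULTS`, `effective_settings`, and `uses_default_admin_password` are hypothetical.

```python
# Defaults copied from the configuration table above (all values as strings,
# since environment variables are strings).
DEFAULTS = {
    "BEACON_HOST": "0.0.0.0",
    "BEACON_PORT": "5001",
    "BEACON_ADMIN_USERNAME": "beacon-admin",
    "BEACON_ADMIN_PASSWORD": "beacon-password",
    "BEACON_LOG_LEVEL": "info",
    "BEACON_VM_MEMORY_SIZE": "8192",
    "BEACON_DEFAULT_TABLE": "default",
    "BEACON_WORKER_THREADS": "8",
    "BEACON_ENABLE_SQL": "true",
    "BEACON_TABLE_SYNC_INTERVAL_SECS": "300",
    "BEACON_FLIGHT_SQL_ENABLE": "true",
    "BEACON_FLIGHT_SQL_PORT": "32011",
}


def effective_settings(environ):
    """Overlay the documented BEACON_* variables found in `environ` on the defaults."""
    settings = dict(DEFAULTS)
    settings.update({key: value for key, value in environ.items() if key in DEFAULTS})
    return settings


def uses_default_admin_password(environ):
    """True when the admin password is still the documented default."""
    password = effective_settings(environ)["BEACON_ADMIN_PASSWORD"]
    return password == DEFAULTS["BEACON_ADMIN_PASSWORD"]
```

Passing `os.environ` (after `import os`) checks the current process environment; passing a plain dict is handy for checking a compose file's `environment` section in a test.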

Contributing

Beacon is a Rust workspace. To build and test from source:

git clone https://github.com/maris-development/beacon.git
cd beacon
cargo build --release
cargo test

Issues and pull requests are welcome on GitHub. For larger changes, please open an issue first to discuss the approach.

License

Beacon is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for the full text.
