gpfdist: add HDFS file reading via libhdfs3 and LZO decompression support #1

Closed
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1778554263-gpfdist-hdfs-lzo-support

@devin-ai-integration

What does this PR do?

Add support for reading LZO-compressed files from remote HDFS clusters through gpfdist, integrating libhdfs3 for HDFS I/O and liblzo2 for LZOP format decompression.

HDFS integration (--with-libhdfs3):

  • Parse hdfs://host:port/path URIs to connect to HDFS namenode via libhdfs3
  • Stream file data through the existing gfile read abstraction layer
  • Transparent to the compression layer — gz/bz2/zstd/lzo all work over HDFS
  • Read-only support; write operations return an error for HDFS paths
  • Modifies read_and_retry() to dispatch HDFS reads via hdfsRead(), so all existing compression handlers automatically work with HDFS files
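
The URI handling described above could look roughly like the following. This is a minimal sketch, not the PR's code: `hdfs_uri_t` and `parse_hdfs_uri()` are hypothetical names, and treating a missing port as 0 (leaving the default to the caller) is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the pieces of an hdfs://host:port/path URI. */
typedef struct {
    char host[256];
    int  port;          /* 0 means "not given"; the caller picks a default */
    char path[1024];
} hdfs_uri_t;

/*
 * Parse "hdfs://host[:port]/path" into *out.
 * Returns 0 on success, -1 if the string is not a well-formed HDFS URI.
 */
int parse_hdfs_uri(const char *uri, hdfs_uri_t *out)
{
    static const char prefix[] = "hdfs://";
    const size_t prefix_len = sizeof(prefix) - 1;

    if (uri == NULL || out == NULL || strncmp(uri, prefix, prefix_len) != 0)
        return -1;

    const char *authority = uri + prefix_len;
    const char *slash = strchr(authority, '/');
    if (slash == NULL || slash == authority)
        return -1;                      /* need both a host and a path */

    /* Look for ':' only inside the authority part, not in the path. */
    const char *colon = memchr(authority, ':', (size_t)(slash - authority));
    size_t host_len = (size_t)((colon ? colon : slash) - authority);
    if (host_len == 0 || host_len >= sizeof(out->host))
        return -1;

    memcpy(out->host, authority, host_len);
    out->host[host_len] = '\0';

    out->port = 0;
    if (colon != NULL) {
        char *end = NULL;
        long port = strtol(colon + 1, &end, 10);
        /* The digits must run exactly up to the first '/'. */
        if (end != slash || port <= 0 || port > 65535)
            return -1;
        out->port = (int) port;
    }

    if (strlen(slash) >= sizeof(out->path))
        return -1;
    strcpy(out->path, slash);
    return 0;
}
```

With a parse like this, the host and port would presumably be handed to the library's connect call and the path to its open call; a path that does not start with `hdfs://` falls through to the existing local-file code.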

LZO decompression (--with-liblzo2):

  • Full LZOP format parser (9-byte magic, versioned header, block-level streaming)
  • Supports LZOP flags: ADLER32/CRC32 checksums, extra fields, filters
  • Block-level decompression using lzo1x_decompress_safe()
  • Automatic detection via .lzo file extension
  • Read-only support matching existing bz2 behavior
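
To make the LZOP framing concrete, here is a small sketch of extension detection, the magic check, and block-header parsing. The function and struct names are hypothetical, and the flag values and checksum order are taken from memory of lzop's source, so they should be verified against it before being relied on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 9-byte magic that starts an LZOP file. */
static const unsigned char LZOP_MAGIC[9] = {
    0x89, 'L', 'Z', 'O', 0x00, 0x0d, 0x0a, 0x1a, 0x0a
};

/* Header flag bits (assumed from lzop's source; verify before use). */
#define F_ADLER32_D 0x00000001u  /* Adler-32 of the uncompressed block */
#define F_ADLER32_C 0x00000002u  /* Adler-32 of the compressed block */
#define F_CRC32_D   0x00000100u  /* CRC-32 of the uncompressed block */
#define F_CRC32_C   0x00000200u  /* CRC-32 of the compressed block */

/* Returns 1 if name has a non-empty stem followed by ".lzo", else 0. */
int has_lzo_extension(const char *name)
{
    size_t n = strlen(name);
    return n > 4 && strcmp(name + n - 4, ".lzo") == 0;
}

/* Returns 1 if buf starts with the LZOP magic, else 0. */
int has_lzop_magic(const unsigned char *buf, size_t len)
{
    return len >= sizeof(LZOP_MAGIC) &&
           memcmp(buf, LZOP_MAGIC, sizeof(LZOP_MAGIC)) == 0;
}

/* LZOP stores multi-byte integers big-endian. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

typedef struct {
    uint32_t dst_len;     /* uncompressed size; 0 marks end of stream */
    uint32_t src_len;     /* compressed size */
    size_t   header_len;  /* bytes before the block data, checksums included */
} lzop_block_t;

/*
 * Parse one block header from buf. Returns 0 on success, -1 if buf is
 * too short or the sizes are inconsistent.
 */
int lzop_parse_block(const unsigned char *buf, size_t len,
                     uint32_t flags, lzop_block_t *blk)
{
    size_t need = 4;

    if (len < need)
        return -1;
    blk->dst_len = read_be32(buf);
    blk->src_len = 0;
    blk->header_len = need;
    if (blk->dst_len == 0)
        return 0;                       /* end-of-stream marker */

    need = 8;
    if (len < need)
        return -1;
    blk->src_len = read_be32(buf + 4);
    if (blk->src_len == 0 || blk->src_len > blk->dst_len)
        return -1;

    if (flags & F_ADLER32_D) need += 4;
    if (flags & F_CRC32_D)   need += 4;
    /* Compressed-data checksums are present only if the block shrank. */
    if (blk->src_len < blk->dst_len) {
        if (flags & F_ADLER32_C) need += 4;
        if (flags & F_CRC32_C)   need += 4;
    }
    if (len < need)
        return -1;
    blk->header_len = need;
    return 0;
}
```

When `src_len` equals `dst_len` the block is stored uncompressed and can be copied through; otherwise the `src_len` bytes after the header are what a call such as `lzo1x_decompress_safe()` would be given, with `dst_len` as the output capacity.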

Build system:

  • New configure options: --with-liblzo2 and --with-libhdfs3 (both off by default)
  • Library and header checks in configure.ac
  • Makefile and CMakeLists.txt updated for conditional linking
  • New macros: HAVE_LIBLZO2, USE_LIBHDFS3 in pg_config.h
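
As an illustration of how the two macros might gate behavior at compile time, the sketch below rejects paths that the current build cannot serve. `unsupported_reason()` and its messages are hypothetical; only the macro names and configure flags come from this PR.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if path ends with suffix, else 0. */
static int ends_with(const char *path, const char *suffix)
{
    size_t n = strlen(path), m = strlen(suffix);
    return n >= m && strcmp(path + n - m, suffix) == 0;
}

/*
 * Returns NULL if this build can serve path, otherwise a message
 * naming the configure flag that was left out.
 */
const char *unsupported_reason(const char *path)
{
#ifndef USE_LIBHDFS3
    if (strncmp(path, "hdfs://", 7) == 0)
        return "HDFS paths require a build configured --with-libhdfs3";
#endif
#ifndef HAVE_LIBLZO2
    if (ends_with(path, ".lzo"))
        return "LZO files require a build configured --with-liblzo2";
#endif
    return NULL;
}
```

In a default build neither macro is defined, so both checks are compiled in and such paths are refused up front; with both flags enabled the function reduces to `return NULL`.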

Type of Change

  • New feature (non-breaking change)

Architecture

The implementation is split into two layers:

  1. I/O backend layer: read_and_retry() in gfile.c now checks fd->is_hdfs and dispatches the read to hdfsRead() when the flag is set. Because every compression handler (gz, bz2, zstd, lzo) reads through read_and_retry(), all of them work over HDFS without handler-specific changes.

  2. Compression layer: LZO decompression follows the same pattern as existing gz/bz2/zstd handlers — lzo_file_open(), lzo_file_read(), lzo_file_close() with the lzolib_stuff struct managing decompression state.
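
The dispatch in the I/O backend layer can be sketched as follows. This is an illustration under stated assumptions, not the code in gfile.c: `sketch_file` is a pared-down stand-in for gfile_t, and `fake_hdfs_read()` replaces hdfsRead() with an in-memory buffer so the example runs without an HDFS cluster.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for an open HDFS file: a memory buffer, not a real handle. */
typedef struct {
    const char *data;
    size_t      len;
    size_t      pos;
} fake_hdfs_file;

/* Stand-in for hdfsRead(): copies up to len bytes, returns 0 at EOF. */
static ssize_t fake_hdfs_read(fake_hdfs_file *f, void *buf, size_t len)
{
    size_t left = f->len - f->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (ssize_t) n;
}

/* Hypothetical, pared-down file handle; gfile_t has many more fields. */
typedef struct {
    int             is_hdfs;
    int             fd;     /* local descriptor, used when !is_hdfs */
    fake_hdfs_file *hdfs;   /* used when is_hdfs */
} sketch_file;

/*
 * Single read entry point shared by all compression handlers.
 * Picks the backend from is_hdfs; local reads retry on EINTR.
 */
ssize_t sketch_read_and_retry(sketch_file *f, void *buf, size_t len)
{
    ssize_t n;

    if (f->is_hdfs)
        return fake_hdfs_read(f->hdfs, buf, len);

    do {
        n = read(f->fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

A decompressor that only ever calls `sketch_read_and_retry()` never learns where the bytes came from, which is the property the two-layer split relies on.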

Key files changed:

  • src/include/fstream/gfile.h — Added LZO_COMPRESSION enum, HDFS fields in gfile_t
  • src/backend/utils/misc/fstream/gfile.c — HDFS read dispatch, LZO decompression, HDFS URI parsing
  • src/bin/gpfdist/Makefile — Conditional LZO/HDFS linking
  • src/bin/gpfdist/CMakeLists.txt — CMake support for LZO/HDFS
  • configure.ac — New --with-liblzo2 and --with-libhdfs3 options
  • src/Makefile.global.in — Propagate new configure variables
  • src/include/pg_config.h.in — New HAVE_LIBLZO2 and USE_LIBHDFS3 macros

Test Plan

  • Both features are guarded by --with-liblzo2 and --with-libhdfs3 configure flags (off by default), so existing builds are unaffected
  • LZO decompression can be tested by creating .lzo files with lzop and serving them via gpfdist
  • HDFS integration requires a running HDFS cluster with libhdfs3 installed

Impact

User-facing changes:

  • New gpfdist capability: reading .lzo compressed files (when built with --with-liblzo2)
  • New gpfdist capability: reading files from HDFS via hdfs:// URIs (when built with --with-libhdfs3)

Dependencies:

  • Optional: liblzo2 (for LZO decompression)
  • Optional: libhdfs3 (for HDFS file access)

Checklist

Additional Context

Both features are opt-in via configure flags and do not affect the default build. The LZO implementation handles the standard LZOP file format used by Hadoop's LzopCodec.

Link to Devin session: https://app.devin.ai/sessions/28a64684ce4a4f2b9d997bd16e447cce
Requested by: @ZTE-EBASE

…port

Add support for reading LZO-compressed files from remote HDFS clusters
through gpfdist, integrating libhdfs3 for HDFS I/O and liblzo2 for
LZOP format decompression.

HDFS integration (--with-libhdfs3):
- Parse hdfs:// URIs to connect to HDFS namenode via libhdfs3
- Stream file data through the existing gfile read abstraction
- Transparent to compression layer (gz/bz2/zstd/lzo all work over HDFS)
- Read-only support; write operations return an error for HDFS paths

LZO decompression (--with-liblzo2):
- Full LZOP format parser (header + block-level streaming decompression)
- Supports LZOP flags: ADLER32/CRC32 checksums, extra fields, filters
- Block-level decompression using lzo1x_decompress_safe()
- Automatic detection via .lzo file extension
- Read-only support matching existing bz2 behavior

Build system:
- New configure options: --with-liblzo2 and --with-libhdfs3 (both off by default)
- Library and header checks in configure.ac
- Makefile and CMakeLists.txt updated for conditional linking
- New macros: HAVE_LIBLZO2, USE_LIBHDFS3 in pg_config.h

Co-Authored-By: EBASE.Mars@zte.com.cn <EBASE.Mars@zte.com.cn>

@github-actions github-actions Bot left a comment

Hi, @devin-ai-integration[bot] welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!

Repository owner deleted a comment from devin-ai-integration[bot] on Jun 1, 2026
@ZTE-EBASE closed this on Jun 1, 2026
@ZTE-EBASE deleted the devin/1778554263-gpfdist-hdfs-lzo-support branch on June 1, 2026 07:35