gpfdist: add HDFS file reading via libhdfs3 and LZO decompression support #1
Closed
devin-ai-integration[bot] wants to merge 1 commit into
…port

Add support for reading LZO-compressed files from remote HDFS clusters through gpfdist, integrating libhdfs3 for HDFS I/O and liblzo2 for LZOP format decompression.

HDFS integration (--with-libhdfs3):
- Parse hdfs:// URIs to connect to HDFS namenode via libhdfs3
- Stream file data through the existing gfile read abstraction
- Transparent to compression layer (gz/bz2/zstd/lzo all work over HDFS)
- Read-only support; write operations return an error for HDFS paths

LZO decompression (--with-liblzo2):
- Full LZOP format parser (header + block-level streaming decompression)
- Supports LZOP flags: ADLER32/CRC32 checksums, extra fields, filters
- Block-level decompression using lzo1x_decompress_safe()
- Automatic detection via .lzo file extension
- Read-only support matching existing bz2 behavior

Build system:
- New configure options: --with-liblzo2 and --with-libhdfs3 (both off by default)
- Library and header checks in configure.ac
- Makefile and CMakeLists.txt updated for conditional linking
- New macros: HAVE_LIBLZO2, USE_LIBHDFS3 in pg_config.h

Co-Authored-By: EBASE.Mars@zte.com.cn <EBASE.Mars@zte.com.cn>
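To make the "parse hdfs:// URIs" step concrete, here is a minimal sketch of splitting such a URI into host, port, and path. It is not the code from this commit: the `hdfs_uri_parts` struct and the `parse_hdfs_uri()` function are hypothetical names chosen for illustration, and the fixed buffer sizes and the "port 0 means unspecified" convention are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the pieces of an hdfs:// URI (not a type from the PR). */
typedef struct {
    char host[256];
    int  port;          /* 0 when the URI carries no explicit port (assumption) */
    char path[1024];
} hdfs_uri_parts;

/* Returns 0 on success, -1 if uri is not of the form hdfs://host[:port]/path. */
static int parse_hdfs_uri(const char *uri, hdfs_uri_parts *out)
{
    static const char scheme[] = "hdfs://";
    const size_t scheme_len = sizeof(scheme) - 1;
    const char *authority, *slash, *colon;
    size_t host_len;

    assert(uri != NULL && out != NULL);
    if (strncmp(uri, scheme, scheme_len) != 0)
        return -1;

    authority = uri + scheme_len;
    slash = strchr(authority, '/');
    if (slash == NULL || slash == authority)
        return -1;                  /* need a non-empty host and a path */

    /* Search for ':' only inside the authority, never in the path. */
    colon = memchr(authority, ':', (size_t)(slash - authority));
    host_len = (size_t)((colon ? colon : slash) - authority);
    if (host_len == 0 || host_len >= sizeof(out->host))
        return -1;
    if (strlen(slash) >= sizeof(out->path))
        return -1;

    memcpy(out->host, authority, host_len);
    out->host[host_len] = '\0';
    strcpy(out->path, slash);       /* length was checked above */

    out->port = 0;
    if (colon != NULL) {
        char *end;
        long port = strtol(colon + 1, &end, 10);
        /* The digits must run exactly up to the first '/' of the path. */
        if (end == colon + 1 || end != slash || port < 1 || port > 65535)
            return -1;
        out->port = (int)port;
    }
    return 0;
}
```

In a real build the resulting host and port would presumably be handed to the HDFS client's connect call (for example `hdfsConnect()` in the libhdfs-style C API) and the path to its open call; that wiring is omitted here because it needs the library at link time.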
Hi, @devin-ai-integration[bot] welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!
What does this PR do?
Add support for reading LZO-compressed files from remote HDFS clusters through gpfdist, integrating libhdfs3 for HDFS I/O and liblzo2 for LZOP format decompression.
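The two features are independent axes: where the bytes come from (local file or HDFS) and how they are decoded (plain or LZO). A rough sketch of that classification, assuming the scheme prefix selects the backend and the file extension selects the codec, might look like the following. The enum and function names are hypothetical and do not come from the PR.

```c
#include <assert.h>
#include <string.h>

typedef enum { BACKEND_LOCAL, BACKEND_HDFS } io_backend;   /* hypothetical */
typedef enum { CODEC_NONE, CODEC_LZO } codec_kind;         /* hypothetical */

/* Returns 1 if s ends with suffix, 0 otherwise. */
static int has_suffix(const char *s, const char *suffix)
{
    size_t n = strlen(s);
    size_t m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* The backend is chosen from the URI scheme alone. */
static io_backend pick_backend(const char *path)
{
    assert(path != NULL);
    return strncmp(path, "hdfs://", 7) == 0 ? BACKEND_HDFS : BACKEND_LOCAL;
}

/* The codec is chosen from the file extension alone, whatever the backend. */
static codec_kind pick_codec(const char *path)
{
    assert(path != NULL);
    return has_suffix(path, ".lzo") ? CODEC_LZO : CODEC_NONE;
}
```

Because neither function looks at what the other decides, a path such as `hdfs://nn:8020/a/b.lzo` yields both the HDFS backend and the LZO codec, which is the combination this PR targets.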
HDFS integration (--with-libhdfs3):
- Parse hdfs://host:port/path URIs to connect to HDFS namenode via libhdfs3
- Stream file data through the existing gfile read abstraction layer
- Modify read_and_retry() to dispatch HDFS reads via hdfsRead(), so all existing compression handlers automatically work with HDFS files

LZO decompression (--with-liblzo2):
- Block-level decompression using lzo1x_decompress_safe()
- Automatic detection via .lzo file extension

Build system:
- New configure options: --with-liblzo2 and --with-libhdfs3 (both off by default)
- Library and header checks in configure.ac
- New macros: HAVE_LIBLZO2, USE_LIBHDFS3 in pg_config.h

Type of Change
Architecture
The implementation uses a clean two-layer design:
- I/O backend layer: read_and_retry() in gfile.c is modified to check fd->is_hdfs and dispatch reads to hdfsRead() when active. This means all compression handlers (gz, bz2, zstd, lzo) automatically work over HDFS without any changes.
- Compression layer: LZO decompression follows the same pattern as the existing gz/bz2/zstd handlers: lzo_file_open(), lzo_file_read(), and lzo_file_close(), with the lzolib_stuff struct managing decompression state.

Key files changed:
- src/include/fstream/gfile.h: Added LZO_COMPRESSION enum, HDFS fields in gfile_t
- src/backend/utils/misc/fstream/gfile.c: HDFS read dispatch, LZO decompression, HDFS URI parsing
- src/bin/gpfdist/Makefile: Conditional LZO/HDFS linking
- src/bin/gpfdist/CMakeLists.txt: CMake support for LZO/HDFS
- configure.ac: New --with-liblzo2 and --with-libhdfs3 options
- src/Makefile.global.in: Propagate new configure variables
- src/include/pg_config.h.in: New HAVE_LIBLZO2 and USE_LIBHDFS3 macros

Test Plan
- … --with-liblzo2 and --with-libhdfs3 configure flags (off by default), so existing builds are unaffected
- … .lzo files with lzop and serving them via gpfdist

Impact

User-facing changes:
- … .lzo compressed files (when built with --with-liblzo2)
- … hdfs:// URIs (when built with --with-libhdfs3)

Dependencies:
- liblzo2 (for LZO decompression)
- libhdfs3 (for HDFS file access)

Checklist
Additional Context
Both features are opt-in via configure flags and do not affect the default build. The LZO implementation handles the standard LZOP file format used by Hadoop's LzopCodec.
Link to Devin session: https://app.devin.ai/sessions/28a64684ce4a4f2b9d997bd16e447cce
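For readers unfamiliar with the LZOP container mentioned under Additional Context: an LZOP file begins with a fixed 9-byte magic sequence before its header fields. The sketch below checks only that magic. It is an illustration of the container format, not a description of this PR, which detects LZO input by the .lzo extension; the function name `looks_like_lzop()` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 9-byte signature that opens every lzop-format file. */
static const unsigned char LZOP_MAGIC[9] = {
    0x89, 0x4c, 0x5a, 0x4f, 0x00, 0x0d, 0x0a, 0x1a, 0x0a
};

/* Returns 1 if buf starts with the lzop magic, 0 otherwise (including short input). */
static int looks_like_lzop(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= sizeof(LZOP_MAGIC) &&
           memcmp(buf, LZOP_MAGIC, sizeof(LZOP_MAGIC)) == 0;
}
```

A content check like this could complement extension-based detection when a file is misnamed, at the cost of reading the first bytes before choosing a handler.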
Requested by: @ZTE-EBASE