Skip to content

NeosZhang/LoongForge

Β 
Β 

Repository files navigation

README

LoongForge

A modular, scalable, and highly efficient training framework for language, multimodal, and embodied models.

DocumentationLicenseStarsIssues

πŸ“– About

πŸ‰ The Baige "Loong" Family: LoongForge (alongside LoongFlow) is part of the Loong open-source series from Baidu's Baige AI infrastructure platform, named after the traditional Chinese loong boat.

LoongForge is a training framework for large-scale transformer models across diverse modalities and architectures. It supports key stages of the training pipeline, including pre-training, continued pre-training, and supervised fine-tuning (SFT). Built upon Megatron-LM with significant enhancements, LoongForge delivers an efficient, easy-to-use, and highly extensible solution for model training.

  • πŸš€ Comprehensive Model Coverage: Natively supports mainstream model architectures including LLMs (Large Language Models), VLMs (Vision-Language), VLAs (Vision-Language-Action), and Diffusion Models. Its flexible composition abstraction makes adding new multi-modal variants effortless.
  • ⚑ Performance-Driven Optimization: Provides advanced optimizations in parallelism and memory management, significantly reducing training costs and accelerating model development.
  • πŸ§ͺ Heterogeneous Hardware Support: Provides native, high-performance support for both NVIDIA GPUs and Kunlun XPUs, ensuring seamless migration and stable training at scale across diverse hardware clusters.

πŸ”₯ Latest News

  • [2026/04] πŸŽ‰ Initial release of the LoongForge framework!

✨ Key Features

  • Flexible Composition: A configuration-driven approach to assemble VLMs using interchangeable ViT and LLM components.
  • Heterogeneous Parallelism: Enables assigning independent configurationsβ€”such as Tensor/Data Parallel sizes and recomputation layersβ€”to different model components (e.g., Vision Encoder vs. LLM) for optimal throughput and memory efficiency.
  • Decoupled Encoder-Decoder Training: Separates vision encoder and LLM into independent tasks, eliminating encoder-induced pipeline bubbles and preventing ViT computation from blocking LLM throughput.
  • DP Load Balancing: Leverages a load-aware data redistribution algorithm to optimize data parallel imbalances caused by data packing, improving multi-node scaling efficiency.
  • MoE A2A Optimization: Overlaps All2All communication, activation offloading, and computation to optimize memory usage and communication in large-scale MoE models, achieving lower memory footprint than upstream Megatron-LM.
  • Custom Fused Operators: High-performance fused operators like FusedDSA, which integrates flashmla and indexer forward operators with custom backward operators (essential for training) to accelerate DSA model training.
  • Adaptive FP8 precision: End-to-end FP8 training support for both LLMs and VLMs, further enhanced with adaptive FP8 that automatically determines whether to enable FP8 per operator based on GEMM shape and computational efficiency to maximize training performance.
  • Flexible Checkpoint Conversion: Supports both offline bidirectional Megatron ↔ HuggingFace weight conversion and native online HuggingFace checkpoint load/save, eliminating format barriers throughout the training workflow.
  • Versatile Pipeline & Tools: Out-of-the-box support for Pretrain, MidTrain, SFT, and LoRA, with built-in tools for dataset processing such as format conversion and packing.
  • Heterogeneous Hardware: Supports training on both NVIDIA GPUs and Kunlun XPUs via a minimally intrusive plugin design.

(πŸ””πŸ””πŸ”” Please refer to our Official Documentation for detailed tutorials.)

πŸš€ Ongoing & Upcoming

  • Expanded foundation model support (e.g., Kimi 2.6).
  • Expanded support for embodied AI models, including DreamZero, and LingBot VA.
  • Further performance acceleration for diffusion models such as WAN.
  • Further enhanced kernel performance.
  • Optimized Full Heterogeneous DP memory overhead and improved parallelism strategy compatibility.
  • Advanced MoE load-balancing strategies.
  • Support for INT4 quantization-aware training, such as the approach proposed by Kimi 2.5.
  • Enhanced long-sequence training with optimization techniques such as ChunkPipe scheduling and Context Parallelism (CP) to improve throughput and scalability.
  • Real-world application of MTP scaling to improve speculative decoding acceptance rates.
  • ...

πŸ› οΈ Getting Started

Installation (Docker Quick Start)

For detailed installation steps (source install, XPU, development setup), see the Installation Guide. Below is a quick start using Docker:

git clone --recurse-submodules https://github.com/baidu-baige/LoongForge.git

docker build --build-arg COMPILE_ENV=hopper --build-arg ENABLE_LEROBOT=false -t loongforge:latest -f ./LoongForge/docker/Dockerfile .

Tutorials:

Kunlun XPU Platform:

πŸ›οΈ Supported Models

LoongForge supports a massive array of state-of-the-art models. Check out configs/models/ for YAML configurations and examples/ for launch scripts.

Modality Architectures Models
LLM DeepSeek-V2 deepseek-v2-lite, deepseek-v2
DeepSeek-V3 deepseek-v3, deepseek-v32
LLaMA2 llama2-7b, llama2-13b, llama2-70b
LLaMA3 llama3-8b, llama3-70b
LLaMA3.1 llama3.1-8b, llama3.1-70b, llama3.1-405b
Qwen qwen-1.8b, qwen-7b, qwen-14b, qwen-72b
Qwen1.5 qwen1.5-0.5b, qwen1.5-1.8b, qwen1.5-4b, qwen1.5-7b, qwen1.5-14b, qwen1.5-32b, qwen1.5-72b
Qwen2 qwen2-0.5b, qwen2-1.5b, qwen2-7b, qwen2-72b
Qwen2.5 qwen2.5-0.5b, qwen2.5-1.5b, qwen2.5-3b, qwen2.5-7b, qwen2.5-14b, qwen2.5-32b, qwen2.5-72b
Qwen3 qwen3-0.6b, qwen3-1.7b, qwen3-4b, qwen3-8b, qwen3-14b, qwen3-32b, qwen3-30b-a3b, qwen3-235b-a22b, qwen3-480b-a35b, qwen3-coder-30b-a3b
Qwen3-Next qwen3-next-80b-a3b
MiniMax minimax-m2.1, minimax-m2.5
MIMO mimo-7b
GLM glm5
VLM Qwen2.5-VL qwen2.5-vl-3b, qwen2.5-vl-7b, qwen2.5-vl-32b, qwen2.5-vl-72b
Qwen3-VL qwen3-vl-30b-a3b, qwen3-vl-235b-a22b
Qwen3.5 qwen3.5-0.8b, qwen3.5-2b, qwen3.5-4b, qwen3.5-9b, qwen3.5-27b, qwen3.5-35B-A3B, qwen3.5-122B-A10B, qwen3.5-397B-A17B
Qwen3.6 qwen3.6-27B, qwen3.6-35B-A3B
ERNIE4.5-VL ernie4.5vl-28b-a3b
LLaVA-OneVision-1.5 llava-onevision-1.5-4B
InternVL2.5 internvl2.5-8b, internvl2.5-26b, internvl2.5-38b, internvl2.5-78b
InternVL3.5 internvl3.5-8b, internvl3.5-14b, internvl3.5-38b, internvl3.5-30b-a3b, internvl3.5-241b-a28b
CustomCombinedModel Flexible ViT + LLM backbone configuration (example)
Diffusion WAN2.2 wan2.2_i2v_a14b
VLA Pi pi0.5

πŸ—οΈ Architecture Overview

LoongForge/
β”œβ”€β”€ loongforge/                   # Core training framework
β”‚   β”œβ”€β”€ train/                    # Training entry points & trainers
β”‚   β”‚   β”œβ”€β”€ pretrain/             # Pretrain implementations (LLM, VLM)
β”‚   β”‚   β”œβ”€β”€ sft/                  # SFT implementations (LLM, VLM, InternVL, ERNIE)
β”‚   β”‚   β”œβ”€β”€ diffusion/            # Diffusion model trainers (WAN)
β”‚   β”‚   └── embodied/             # Embodied AI model trainers (Pi0.5, GR00T)
β”‚   β”œβ”€β”€ models/                   # Unified model abstractions
β”‚   β”‚   β”œβ”€β”€ foundation/           # LLM backbones (LLaMA, Qwen, DeepSeek, InternLM, MiniMax, MIMO, GLM)
β”‚   β”‚   β”œβ”€β”€ encoder/              # Vision encoders (ViT, Qwen-VL, InternVL, ERNIE4.5-VL, LLaVA-OV)
β”‚   β”‚   β”œβ”€β”€ omni_models/          # Multi-modal composition (encoder + projector + decoder)
β”‚   β”‚   β”œβ”€β”€ diffusion/            # Diffusion models (WAN)
β”‚   β”‚   β”œβ”€β”€ embodied/             # Embodied AI models (Pi0.5, GR00T N1.6)
β”‚   β”‚   └── common/               # Shared layers and utilities
β”‚   β”œβ”€β”€ data/                     # Data pipelines
β”‚   β”‚   β”œβ”€β”€ dp_balance/           # Data-parallel load balancing
β”‚   β”‚   β”œβ”€β”€ multimodal/           # Multi-modal data processing
β”‚   β”‚   └── video/                # Video data processing
β”‚   β”œβ”€β”€ tokenizer/                # Tokenizer modules
β”‚   └── utils/                    # Utility functions (config map, constants, etc.)
β”œβ”€β”€ third_party/
β”‚   └── Loong-Megatron/           # Patched Megatron-LM (git submodule)
β”œβ”€β”€ configs/                      # Hydra-based YAML configurations
β”‚   β”œβ”€β”€ models/                   # Model architecture configs
β”‚   └── data/                     # Data configs
β”œβ”€β”€ examples/                     # GPU launch scripts (pretrain / sft / checkpoint_convert)
β”œβ”€β”€ examples_xpu/                 # Kunlun XPU launch scripts
β”œβ”€β”€ tools/                        # Utility tools
β”‚   β”œβ”€β”€ convert_checkpoint/       # HuggingFace ↔ Megatron checkpoint conversion
β”‚   β”œβ”€β”€ data_preprocess/          # Data preprocessing utilities
β”‚   └── dist_checkpoint/          # Distributed checkpoint utilities
β”œβ”€β”€ ops/                          # Custom fused CUDA/C++ operators (sparse MLA, lightning indexer)
β”œβ”€β”€ patches/                      # TransformerEngine patch files
β”œβ”€β”€ docker/                       # Dockerfiles (GPU & XPU)
β”œβ”€β”€ tests/                        # E2E test suite (YAML-driven)
└── docs/                         # Documentation

🌟 Powered by LoongForge

Open-Source Ecosystem:

Enterprise Scale & Performance:

Before becoming an open-source project, LoongForge had already empowered numerous enterprise use cases with its robust training acceleration and scaling capabilities:

  • Powers proprietary large-scale models across diverse industries, including Education, Code Generation, and Embodied AI.
  • Typically achieves a 30%+ average speedup over standard customer baselines through systemic optimizations.
  • Seamlessly supports ultra-large cluster training scaling up to 5,000 XPUs.

🀝 Contributing

We heartily welcome community contributions! Whether it's reporting bugs, proposing features, or submitting code, please read our Contributing Guidelines before submitting a Pull Request.

πŸ“„ License

LoongForge is released under the Apache License 2.0.

Some files in this repository are derived from third-party open-source projects. Please refer to the specific file headers for their respective copyright, license notices, and attribution requirements.

πŸ“ Citation

If you find LoongForge helpful in your research or production, please consider citing our repository:

@software{LoongForge2026,
      title={LoongForge: A modular, scalable, and highly efficient training framework for language, multimodal, and embodied models.}, 
      author={{The LoongForge Authors}},
      year={2026},
      url={https://github.com/baidu-baige/LoongForge},
}

πŸ™ Acknowledgments

LoongForge is built upon NVIDIA's Megatron-LM. During development, we also referenced and drew inspiration from several excellent open-source projects, including but not limited to Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.

About

A modular, scalable, and highly efficient training framework for language, multimodal, and embodied models.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 92.3%
  • Shell 7.5%
  • Dockerfile 0.2%