LLaMA-Factory x MCoreAdapter
To fully leverage Megatron-Core's parallel computing and improve training efficiency for MoE models, we combined the MCoreAdapter provided by the ROLL team with LLaMA-Factory's data pipeline and the Megatron Trainer backend to build a new model training workflow.
Quick Start
1. Environment Installation
pip
docker (Recommended)
Refer to the Dockerfile for building.
2. Start Test
Single Node 8*80GB
Multi Node 16*80GB
Benchmarks
We provide experiments for both multimodal and text MoE models. Refer to this GitHub issue for details.
Weight conversion (mcore2hf)
You need to merge the MCore-format checkpoints saved during training into Hugging Face safetensors using the conversion script.
3. Megatron Strategy Configuration
Understanding Megatron's parallelism and optimization parameters is crucial for efficient training. Here's a detailed explanation of key configuration options:
3.1 Parallelism Strategy
- `tensor_model_parallel_size` (TP): Splits individual weight matrices across GPUs. Useful for very large models that don't fit on a single GPU. Increases communication overhead, so use moderately (typically 2-8).
  - Recommendation: Start with 1; increase only if the model doesn't fit in memory
- `pipeline_model_parallel_size` (PP): Divides model layers across GPUs in a pipeline fashion. Reduces memory per GPU but may cause pipeline bubbles.
  - Recommendation: Use powers of 2 (2, 4, 8); set `gradient_accumulation_steps` as a multiple of PP to minimize bubbles
- `expert_model_parallel_size` (EP): Distributes MoE experts across GPUs. Essential for large MoE models.
  - Recommendation: For MoE models, typically set to 2-8 depending on expert count
- `context_parallel_size` (CP): Splits the sequence dimension for very long contexts. Useful when training with context length > 32k.
  - Recommendation: Use for ultra-long sequences; typically 1, 2, or 4
- `virtual_pipeline_model_parallel_size` (VPP): Creates virtual pipeline stages to reduce pipeline bubbles by interleaving forward/backward passes.
  - Recommendation: Set to 2-4 when using PP to improve efficiency
- `sequence_parallel`: Distributes sequence-level computations (LayerNorm, Dropout) across the TP group. Reduces memory when TP > 1.
  - Recommendation: Enable when `tensor_model_parallel_size > 1`
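To make this concrete, here is a minimal sketch of how these options could appear in a training config, assuming a 16-GPU MoE run and that the keys are written exactly as named above (the values are illustrative placeholders, not tuned recommendations):

```yaml
# Illustrative parallelism settings for 16 GPUs (2 nodes x 8 GPUs).
# pp * tp * ep * cp = 8, which divides ws = 16 evenly (see section 4.3).
tensor_model_parallel_size: 2            # TP: split weight matrices across 2 GPUs
pipeline_model_parallel_size: 2          # PP: split layers into 2 pipeline stages
virtual_pipeline_model_parallel_size: 2  # VPP: interleave stages to shrink bubbles
expert_model_parallel_size: 2            # EP: spread MoE experts over 2 GPUs
context_parallel_size: 1                 # CP: raise to 2-4 only for >32k contexts
sequence_parallel: true                  # valid because tensor_model_parallel_size > 1
```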
3.2 Memory Optimization
- `recompute_granularity`: Trades computation for memory by recomputing activations during the backward pass.
  - `full`: Recomputes the entire transformer layer (maximum memory saving)
  - `selective`: Recomputes only attention (balanced trade-off)
  - Recommendation: Use `selective` first; switch to `full` if still OOM
- `moe_layer_recompute`: Checkpoints MoE layers to save activation memory for MoE models.
  - Recommendation: Enable for large MoE models when memory is tight
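For example, the corresponding entries in the same sketch config (again assuming the keys match the names above; enable them progressively as memory pressure requires):

```yaml
# Illustrative memory-saving settings: trade recomputation for activation memory.
recompute_granularity: selective   # try selective first; switch to full if still OOM
moe_layer_recompute: true          # checkpoint MoE layers when activation memory is tight
```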
3.3 Performance Optimization
- `moe_token_dispatcher_type`: Determines how tokens are routed to experts.
  - `alltoall`: Better performance for most cases (recommended)
  - `allgather`: Alternative for specific network topologies
  - Recommendation: Use `alltoall` for better throughput
- `moe_grouped_gemm`: Groups expert computations for better GPU utilization.
  - Recommendation: Always enable (`true`) for MoE models
- `moe_shared_expert_overlap`: Overlaps shared expert computation with communication.
  - Recommendation: Enable to hide communication latency in MoE models
- `overlap_grad_reduce`: Overlaps gradient reduce-scatter with backward computation in the distributed optimizer.
  - Recommendation: Enable when using `use_distributed_optimizer: true` for better throughput
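Putting these together, a sketch of the throughput-oriented part of the config might look as follows (key names are taken from the list above and from section 4.2; the values mirror the recommendations but are not guaranteed optima for every setup):

```yaml
# Illustrative MoE and communication-overlap settings.
moe_token_dispatcher_type: alltoall  # usually faster than allgather
moe_grouped_gemm: true               # group expert GEMMs for better GPU utilization
moe_shared_expert_overlap: true      # overlap shared-expert compute with communication
use_distributed_optimizer: true      # the overlap options below are meant to be used with this
overlap_grad_reduce: true            # overlap gradient reduce-scatter with the backward pass
overlap_param_gather: true           # overlap parameter all-gather (see section 4.2)
```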
4. Tips & Precautions
4.1 Global Batch Size calculation differences
While using Megatron for training, note the subtle difference in how global batch size is calculated compared to previous setups:
Parameter definitions:
- `bs`: per_device_train_batch_size
- `ga`: gradient_accumulation_steps
- `ws`: WORLD_SIZE
- `pp`: pipeline_model_parallel_size
- `tp`: tensor_model_parallel_size
- `ep`: expert_model_parallel_size
- `cp`: context_parallel_size
Formula comparison:
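Based on the definitions above and the explanation below, the relationship can be summarized as:

- Pure data parallelism (previous setups): `global_batch_size = bs * ga * ws`
- Megatron backend: `global_batch_size = bs * ga * ws / (pp * tp * ep * cp)`, i.e. `bs * ga * dp`, where `dp` is the effective data-parallel size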
Understanding the difference:
The key insight is that Megatron's parallelism strategies (PP, TP, EP, CP) partition the available GPUs, so the effective data parallel size is reduced by these factors. Only the remaining GPUs contribute to data parallelism, which directly affects the global batch size.
Example:
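As an illustration (hypothetical numbers): with `ws = 16`, `bs = 1`, `ga = 8`, `tp = 2`, `pp = 2`, `ep = 2`, `cp = 1`, the effective data-parallel size is `16 / (2 * 2 * 2 * 1) = 2`, so the global batch size is `1 * 8 * 2 = 16`, whereas pure data parallelism with the same `bs`, `ga`, and `ws` would give `1 * 8 * 16 = 128`.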
4.2 Performance optimization
- GPU memory optimization: enabling `--use_distributed_optimizer` and `--overlap_param_gather` significantly reduces GPU memory usage
- Communication optimization: use `--overlap_grad_reduce` to overlap gradient communication with computation
- MoE optimization: for MoE models, prefer `--moe_token_dispatcher_type alltoall` and `--moe_grouped_gemm true` for better performance
- Parallel optimization: set `gradient_accumulation_steps` to an integer multiple of PP
- Long context optimization: enable `context_parallel_size` (typically 2-4) when training with very long sequences (>32k tokens) to distribute sequence computation and reduce memory pressure
4.3 Troubleshooting
- OOM Errors: reduce `per_device_train_batch_size` or `gradient_accumulation_steps`, or enable context parallelism for long sequences, and check whether `use_distributed_optimizer` is enabled
- Communication timeouts: check network connectivity, `master_addr` and `master_port`
- Parallel settings: ensure `pp * tp * ep * cp` divides `ws` evenly
- Small global batch size: if your global batch size becomes too small due to high parallelism (PP/TP/EP/CP), consider increasing `gradient_accumulation_steps` or reducing parallelism degrees where possible