LlamaFactory + MindSpore HyperParallel
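We integrated HyperParallel, the parallel training library from the MindSpore community, into LlamaFactory as an FSDP2 backend. It supports Ascend NPUs and NVIDIA GPUs, and you enable it by adding a single configuration line to an existing FSDP2 workflow.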
Quick Start
1. Environment Setup
pip
```bash
# Install HyperParallel
git clone https://gitcode.com/mindspore/hyper-parallel
cd hyper-parallel
pip install -e .
# Return to the parent directory so LlamaFactory is not cloned inside hyper-parallel
cd ..
# Install LlamaFactory
git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e ".[torch,metrics]" --no-build-isolation
# Install PyTorch
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1
# Optional: install torch-npu for Ascend NPU support
pip install torch-npu==2.7.1
```
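After installation it can help to confirm that the pinned versions are the ones actually present. The following is a minimal sketch that reads installed distribution metadata with the standard library only. The helper name `installed_versions` is hypothetical, and the distribution names of the two editable installs (HyperParallel and LlamaFactory) are not listed because they are not stated above; add whatever names `pip list` shows.

```python
from importlib import metadata

# PyPI distribution names taken from the pip commands above.
DEFAULT_PACKAGES = ["torch", "torchvision", "torchaudio", "torch-npu"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(DEFAULT_PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

A missing `torch-npu` entry is expected on NVIDIA GPU machines, since that package is optional.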
2. Configuration
HyperParallel training requires two configuration files: an Accelerate FSDP2 config and a LlamaFactory training config.
2.1 Accelerate FSDP2 Config
Use the existing examples/accelerate/fsdp2_config.yaml or create your own:
```yaml
# examples/accelerate/fsdp2_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16  # or fp16
num_machines: 1  # the number of nodes
num_processes: 2  # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
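The fields that matter most for this workflow are distributed_type, fsdp_version, and the process counts. As a rough sketch, the consistency of those fields can be checked on an already parsed config dictionary. The function `check_fsdp2_config` and its `devices_per_node` parameter are hypothetical and reflect only a reading of the example above; this is not Accelerate's own validation.

```python
def check_fsdp2_config(cfg, devices_per_node):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    if cfg.get("distributed_type") != "FSDP":
        problems.append("distributed_type should be FSDP")
    if cfg.get("fsdp_config", {}).get("fsdp_version") != 2:
        problems.append("fsdp_config.fsdp_version should be 2")
    # num_processes counts devices across all nodes, per the comment in the YAML.
    expected = cfg.get("num_machines", 1) * devices_per_node
    if cfg.get("num_processes") != expected:
        problems.append(
            f"num_processes is {cfg.get('num_processes')}, expected {expected}"
        )
    return problems


if __name__ == "__main__":
    example = {
        "distributed_type": "FSDP",
        "fsdp_config": {"fsdp_version": 2},
        "num_machines": 1,
        "num_processes": 2,
    }
    print(check_fsdp2_config(example, devices_per_node=2))
```

For a multi-node run, num_processes would need to grow with num_machines, for example 16 for two nodes with eight devices each.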
2.2 LlamaFactory Training Config
Create a training YAML that contains use_hyper_parallel: true:
```yaml
# examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml
### model
model_name_or_path: Qwen/Qwen3-VL-30B-A3B-Instruct
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
use_v1_kernels: true
flash_attn: fa2
### method
stage: sft
do_train: true
finetuning_type: full
disable_gradient_checkpointing: false
### HyperParallel
use_hyper_parallel: true
### dataset
dataset: llava_1k_en, llava_1k_zh
template: qwen3_vl
cutoff_len: 1024
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/Qwen3-VL-30B-A3B-Instruct/full/sft
logging_steps: 1
save_steps: 500
max_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: none
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
seed: 1234
```
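The batch and schedule settings above interact with num_processes from the Accelerate config. The sketch below works through that arithmetic, assuming the usual data-parallel convention that the global batch is the per-device batch times the accumulation steps times the process count, and that warmup steps are the warmup ratio times max_steps, rounded up. The function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_processes):
    """Samples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_processes


def warmup_steps(max_steps, warmup_ratio):
    """Number of warmup steps, rounded up."""
    return math.ceil(max_steps * warmup_ratio)


if __name__ == "__main__":
    batch = effective_batch_size(per_device_batch=2, grad_accum_steps=1, num_processes=2)
    print("effective batch size:", batch)                 # 2 * 1 * 2 = 4
    print("samples seen in 500 steps:", batch * 500)      # 4 * 500 = 2000
    print("warmup steps:", warmup_steps(500, 0.1))        # 50
```

With the example values, each optimizer step sees 4 samples, and the first 50 of the 500 steps are warmup.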
3. Launch Training
```bash
cd LlamaFactory
# Option 1: add use_hyper_parallel: true to the YAML config
accelerate launch \
  --config_file examples/accelerate/fsdp2_config.yaml \
  src/train.py examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml
# Option 2: append --use_hyper_parallel True on the command line, no YAML change needed
accelerate launch \
  --config_file examples/accelerate/fsdp2_config.yaml \
  src/train.py examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml \
  --use_hyper_parallel True
```
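When training is started from a script rather than a terminal, the same two launch variants can be assembled programmatically. This is a small sketch that only builds the argument list shown above; the helper `build_launch_command` and its `force_hyper_parallel` parameter are hypothetical names, and the result would be passed to something like `subprocess.run` from inside the LlamaFactory directory.

```python
import shlex


def build_launch_command(accelerate_config, train_yaml, force_hyper_parallel=False):
    """Build the accelerate launch argument list for a LlamaFactory training YAML."""
    cmd = [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        "src/train.py", train_yaml,
    ]
    if force_hyper_parallel:
        # Option 2: enable HyperParallel from the command line instead of the YAML.
        cmd += ["--use_hyper_parallel", "True"]
    return cmd


if __name__ == "__main__":
    cmd = build_launch_command(
        "examples/accelerate/fsdp2_config.yaml",
        "examples/ascend/qwen3vlmoe_full_sft_fsdp2.yaml",
        force_hyper_parallel=True,
    )
    print(shlex.join(cmd))
```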
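4. Checkpoints and Export
HyperParallel checkpoints are saved in the standard HuggingFace format. No extra weight conversion is needed, and they can be loaded directly with from_pretrained().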
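Before passing an output directory to from_pretrained(), a quick sanity check can confirm it looks like a HuggingFace-format checkpoint. The sketch below assumes the common layout of a config.json next to one or more weight files (`*.safetensors` or `pytorch_model*.bin`); the helper name is hypothetical, and the exact files written by a given run may differ.

```python
from pathlib import Path


def looks_like_hf_checkpoint(directory):
    """Heuristic: config.json plus at least one weight file in the directory."""
    path = Path(directory)
    if not (path / "config.json").is_file():
        return False
    weight_patterns = ("*.safetensors", "pytorch_model*.bin")
    return any(any(path.glob(pattern)) for pattern in weight_patterns)


if __name__ == "__main__":
    # output_dir from the training YAML above
    print(looks_like_hf_checkpoint("saves/Qwen3-VL-30B-A3B-Instruct/full/sft"))
```

A directory that passes this check can be given as the path argument of from_pretrained().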
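5. Notes
- HyperParallel currently supports the sft stage with finetuning_type: full.
- Accelerate FSDP2 settings (mixed precision, memory optimizations, and so on) work as usual; see the Accelerate FSDP documentation for details.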
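The support constraint in the first note can be expressed as a small pre-flight check on a parsed training config. This is an illustrative sketch, not LlamaFactory's own validation; the function name `hyper_parallel_config_issues` is hypothetical.

```python
def hyper_parallel_config_issues(cfg):
    """Return reasons why a training config falls outside the supported combination."""
    if not cfg.get("use_hyper_parallel", False):
        return []  # HyperParallel is off, so the constraint does not apply
    issues = []
    if cfg.get("stage") != "sft":
        issues.append("stage must be sft")
    if cfg.get("finetuning_type") != "full":
        issues.append("finetuning_type must be full")
    return issues


if __name__ == "__main__":
    cfg = {"use_hyper_parallel": True, "stage": "sft", "finetuning_type": "lora"}
    print(hyper_parallel_config_issues(cfg))
```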