MindSpore Community · HyperParallel SuperNode Parallel Library

Version: v1.0 | Updated: 2026-03-30

Vision

HyperParallel is a new supernode parallel training architecture proposed by the MindSpore Community, dedicated to simplifying Ascend supernode programming and unlocking computing potential. We aim to collaborate with the LlamaFactory ecosystem to provide an easy-to-use, high-performance distributed training solution. Our goal is to enable every developer to efficiently train large models on Ascend NPU and NVIDIA GPU, lowering the barrier and cost of large model training.

This roadmap outlines the development direction of the LlamaFactory and MindSpore HyperParallel community collaboration, covering parallel capability expansion, hardware optimization, backend support, and more.

Roadmap Overview

2026 Q2                    2026 Q3                    2026 Q4

┌─────────────┐          ┌─────────────┐          ┌─────────────┐
│   Phase 1   │          │   Phase 2   │          │   Phase 3   │
│ Capability  │   ───►   │  Hardware   │   ───►   │   Backend   │
│  Expansion  │          │  Deepening  │          │  Diversity  │
└─────────────┘          └─────────────┘          └─────────────┘

    ├─ TP/EP/CP Hybrid         ├─ High-Dim TP             ├─ MindSpore Backend
    ├─ More Model Coverage     ├─ HyperMPMD 3-Level       ├─ Graph-Kernel Fusion
    └─ Larger Model Scale      └─ HyperOffload UD-Chain   └─ More Training Stages

Phase 1: Parallel Capability Expansion (2026 Q2)

Goal: Extend multi-dimensional hybrid parallel capabilities including TP (Tensor Parallelism), EP (Expert Parallelism), and CP (Context Parallelism) to support larger-scale model training.

| Feature | Description | Priority | Status |
|---|---|---|---|
| TP-EP Hybrid | Support TP+EP combined parallelism for MoE models | P0 | Validating |
| CP Long Sequence | Context parallelism to break memory limits for ultra-long sequences | P0 | Validating |
| 3D Parallel (DP-TP-PP) | Full 3D hybrid parallelism for 100B+ parameter models | P1 | Validating |
| Ascend-Affinity Offload | NPU-affinity multi-level intelligent memory offload strategies | P2 | In Development |

Key Technical Points:

  • Unified declarative parallel strategy configuration interface
  • Efficient communication primitives and scheduling algorithms
  • Ascend-affinity parallel and memory strategies
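To make the "unified declarative parallel strategy configuration interface" concrete, here is a minimal sketch of what such a declarative interface could look like. The class and field names (`ParallelStrategy`, `dp`, `tp`, etc.) are hypothetical illustrations for this roadmap, not HyperParallel's actual API.

```python
from dataclasses import dataclass

# Hypothetical declarative parallel-strategy config; all names are invented
# for illustration and are not HyperParallel's real interface.
@dataclass
class ParallelStrategy:
    dp: int = 1   # data-parallel degree
    tp: int = 1   # tensor-parallel degree
    pp: int = 1   # pipeline-parallel degree
    ep: int = 1   # expert-parallel degree (MoE)
    cp: int = 1   # context-parallel degree (long sequences)

    def world_size(self):
        # EP groups are typically carved out of the DP/TP ranks,
        # so EP is excluded from the device-count product here.
        return self.dp * self.tp * self.pp * self.cp

# Example: a DP-TP-PP layout with CP=2 for long sequences on 64 devices.
strategy = ParallelStrategy(dp=4, tp=4, pp=2, ep=2, cp=2)
print(strategy.world_size())  # → 64
```

A declarative config of this shape lets the framework validate the strategy (e.g., that the product matches the cluster size) before launching any ranks.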

Phase 2: Ascend Hardware Deep Optimization (2026 Q3)

2.1 High-Dimensional Tensor Parallelism (High-Dimensional TP)

Goal: Extend high-dimensional TP and other Ascend-affinity parallel features to improve training efficiency and generalization on Atlas A5/A3/A2.

| Feature | Description | Hardware | Expected Benefit |
|---|---|---|---|
| 2D-TP | Two-dimensional tensor parallelism that reduces communication overhead; benefits grow as the TP degree increases | A5/A3 | Communication reduced by 30%+ (over 40% at TP ≥ 8) |
| TP-PP Hybrid | TP combined with pipeline parallelism | A5/A3/A2 | Memory optimization of 20%+ |

Note: The communication savings of high-dimensional TP grow with the TP degree. When TP ≥ 8, the All-Reduce communication volume of traditional 1D-TP becomes a major bottleneck; 2D-TP splits the communication across two dimensions, reducing per-device communication volume by over 40% compared to 1D-TP.
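The 40%+ figure can be sanity-checked with a back-of-envelope cost model. It assumes a ring all-reduce of a message of size M over p ranks moves 2(p−1)/p·M bytes per device, and that 2D-TP on an r×c grid replaces the single all-reduce over all 8 ranks with two smaller collectives on 1/c- and 1/r-sized shards. This is an illustrative simplification, not HyperParallel's exact communication schedule.

```python
# Per-device traffic of a ring all-reduce of `msg` bytes over p ranks.
def ring_allreduce_volume(msg, p):
    return 2 * (p - 1) / p * msg

M = 1.0                                   # normalized activation size
v1d = ring_allreduce_volume(M, 8)         # 1D-TP: one all-reduce over all 8 ranks
r, c = 4, 2                               # 2D-TP: a 4x2 device grid
# 2D-TP: one collective per grid dimension, each on a smaller shard.
v2d = ring_allreduce_volume(M / c, r) + ring_allreduce_volume(M / r, c)
print(v1d, v2d, round(1 - v2d / v1d, 2))  # → 1.75 1.0 0.43
```

Under these assumptions the per-device volume drops from 1.75·M to 1.0·M, a ~43% reduction, consistent with the "over 40%" claim above.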

2.2 MPMD Multi-Core Parallel Optimization (HyperMPMD)

Goal: Leverage fine-grained MPMD (Multiple Program Multiple Data) parallelism to address computational load imbalance in MoE, multimodal, and reinforcement learning scenarios, fully utilizing the peer-to-peer interconnect architecture of Ascend supernodes.

HyperMPMD provides MPMD capabilities across three dimensions:

Dimension 1: Intra-sub-model Core-Level Concurrency

Leveraging the heterogeneous multi-core AICube/AIVector architecture on Ascend NPUs to achieve fine-grained compute-communication pipelining within a single card, addressing the communication masking challenge in MoE architectures.

| Feature | Description | Hardware | Expected Benefit |
|---|---|---|---|
| On-chip Multi-core MPMD | AICube handles matrix ops while AIVector handles communication preprocessing, pipelined in parallel | A5/A3 | Communication masking ratio raised from 60% to 90% |

Dimension 2: Inter-sub-model Concurrency Balancing

Decoupling heterogeneous sub-modules (e.g., text/image/audio encoders in multimodal models) into independent concurrent subgraph tasks, eliminating pipeline bubbles through dynamic scheduling.
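As a toy illustration of this dimension (not HyperMPMD's actual scheduler), the pattern of running heterogeneous sub-model subgraphs as independent concurrent tasks can be sketched with plain `asyncio`; the encoder names and step counts are invented stand-ins.

```python
import asyncio

# Hypothetical encoder stubs standing in for text/image/audio sub-models.
async def run_encoder(name, steps):
    for _ in range(steps):
        await asyncio.sleep(0)  # stand-in for one kernel launch / micro-step
    return name

async def run_all():
    # The slowest encoder no longer serializes the others: all three subgraph
    # tasks are scheduled concurrently and gathered once every one finishes.
    return await asyncio.gather(
        run_encoder("text", 3),
        run_encoder("image", 5),
        run_encoder("audio", 2),
    )

print(asyncio.run(run_all()))  # → ['text', 'image', 'audio']
```

The real system would additionally balance device placement dynamically, but the structural idea is the same: sub-models become independently schedulable tasks rather than stages of one serialized graph.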

Dimension 3: Cross-model Concurrent Scheduling

Integrating the MPMD runtime’s Single Controller mode to enable model-level concurrency within the supernode’s pooled computing resources, supporting the asynchronous architecture of reinforcement learning.

Expected Benefits:

  • Communication masking ratio from 60% → 90%
  • Eliminate 10-40% pipeline bubbles in multimodal/MoE scenarios
  • Overall training performance improvement of approximately 15%, cluster resource utilization improvement of 15%+
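The relationship between the masking ratio and end-to-end speedup can be illustrated with a toy model. Assuming exposed communication equal to 30% of compute time per step (an invented figure, not measured data), raising the masking ratio from 60% to 90% alone yields roughly a 9% speedup; the remaining gains in the ~15% figure would come from bubble elimination and scheduling.

```python
# Back-of-envelope model: step time = compute + unmasked fraction of comm.
compute, comm = 1.0, 0.3            # comm assumed to be 30% of compute (illustrative)
step_60 = compute + (1 - 0.60) * comm   # 60% masking
step_90 = compute + (1 - 0.90) * comm   # 90% masking
speedup = step_60 / step_90 - 1
print(round(step_60, 2), round(step_90, 2), round(speedup, 3))  # → 1.12 1.03 0.087
```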

2.3 Intelligent Memory Offloading (HyperOffload)

Goal: Based on Use-Definition (UD) chain analysis, elevate remote memory access to a first-class operation in the computation graph, achieving deterministic global memory planning and compute-communication overlap, fully releasing the potential of the supernode’s hierarchical storage pool.

Technical Approach: HyperOffload performs global lifetime analysis of tensor definition points and use points through the compiler’s Use-Definition chain, precisely identifying the optimal offload/prefetch timing for each tensor. It goes beyond traditional weight-only offloading to enable deep hierarchical management of KV Cache, intermediate activations, and optimizer states throughout the training and inference pipeline. Through a UD chain-driven unified logical view, it automatically detects bandwidth differences between HBM and DDR based on hardware topology, seamlessly scheduling massive tensors across storage tiers.
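The UD-chain analysis described above can be sketched in miniature. The code below is an illustrative simplification, not HyperOffload's actual compiler pass: each schedule entry records a step index, an op name, the tensors it defines, and the tensors it uses; any tensor whose next use is far from its previous use or definition becomes an offload/prefetch candidate.

```python
# Toy UD-chain-driven offload planner (illustrative simplification).
def plan_offload(schedule, gap_threshold=2):
    defined, uses = {}, {}
    for step, _op, defs, used in schedule:
        for t in defs:
            defined[t] = step
        for t in used:
            uses.setdefault(t, []).append(step)
    plan = []
    for t, use_steps in uses.items():
        prev = defined.get(t, use_steps[0])
        for u in use_steps:
            if u - prev > gap_threshold:
                # offload t right after step `prev`, prefetch before step `u`
                plan.append((t, prev, u))
            prev = u
    return plan

# Forward activations a1..a3 are defined early but re-used only in the backward
# pass several steps later, so all three become offload candidates.
schedule = [
    (0, "fwd1", ["a1"], []),
    (1, "fwd2", ["a2"], ["a1"]),
    (2, "fwd3", ["a3"], ["a2"]),
    (7, "bwd3", [], ["a3"]),
    (8, "bwd2", [], ["a2"]),
    (9, "bwd1", [], ["a1"]),
]
print(plan_offload(schedule))  # → [('a1', 1, 9), ('a2', 2, 8), ('a3', 2, 7)]
```

Because the plan is derived statically from the graph, the offload/prefetch points are deterministic, which is what enables the global memory planning and compute-communication overlap described above.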

Expected Benefits (based on HyperOffload paper experimental data):

Training Scenarios:

| Model | Hardware Config | Baseline | + HyperOffload | Performance Change |
|---|---|---|---|---|
| LLaMA-8B | 8×Ascend 910C | 5.2 s/step | 4.08 s/step | ~20% improvement |
| DeepSeek-V3 | 8×Ascend 910C | 2.5 s/step | 2.19 s/step | ~12% improvement |

Phase 3: MindSpore Backend Support (2026 Q4)

Goal: LlamaFactory officially supports the MindSpore backend, enabling AKG, DVM, and other MindSpore-exclusive deep graph-kernel fusion optimizations to further unlock Ascend NPU computing power.
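As a toy illustration of what graph-kernel fusion buys (not MindSpore's actual AKG/DVM passes): two separate elementwise "kernels" each read and write a full tensor and materialize an intermediate, while the fused kernel touches memory once per element and needs no intermediate buffer.

```python
def unfused(x):
    tmp = [v * 2.0 for v in x]         # kernel 1 writes an intermediate buffer
    return [v + 1.0 for v in tmp]      # kernel 2 re-reads that intermediate

def fused(x):
    return [v * 2.0 + 1.0 for v in x]  # one kernel, no intermediate traffic

print(fused([1.0, 2.0, 3.0]))  # → [3.0, 5.0, 7.0]
```

On real hardware the saving is memory bandwidth and kernel-launch overhead; the compiler proves the two results are identical and rewrites the graph automatically.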

Community Collaboration Plan

Collaboration with LlamaFactory Community

| Area | Details | Responsible |
|---|---|---|
| Code Integration | LlamaFactory officially supports the MindSpore backend, integrating HyperParallel's parallel capabilities | Co-built |
| Documentation | Add a MindSpore backend user guide to the official LlamaFactory documentation | Co-built |
| Issue Handling | Establish a joint issue-handling mechanism | Co-built |
| Version Sync | Ensure HyperParallel and LlamaFactory version compatibility | Co-built |

Contact Us

Appendix: Glossary

| Term | Full Name | Description |
|---|---|---|
| TP | Tensor Parallelism | Splits individual weight tensors across devices |
| EP | Expert Parallelism | Distributes MoE experts across devices |
| CP | Context Parallelism | Parallelism along the sequence (context) dimension for long sequences |
| DP | Data Parallelism | Replicates the model and splits the data batch across devices |
| PP | Pipeline Parallelism | Splits model layers into pipeline stages across devices |
| FSDP | Fully Sharded Data Parallel | Data parallelism that shards parameters, gradients, and optimizer states |
| SPMD | Single Program Multiple Data | All devices run the same program on different data shards |
| MPMD | Multiple Program Multiple Data | Devices may run different programs concurrently |
| HCCL | Huawei Collective Communication Library | Collective communication library for Ascend devices |