<aside> 📌 This page is maintained by the DML (Distributed Machine Learning) group. It presents the work we have completed and the work we plan to do. Our research interests include gradient compression algorithms and training systems for large language models (LLMs). We are looking for collaborators on DML 😊. Please feel free to reach out to Zhi Wang (mail: [email protected]) or Rongwei Lu (mail: [email protected]) for further information or potential collaboration opportunities.
</aside>
Optimization and Aggregation Algorithms for Distributed Communication
We use gradient compression strategies to reduce communication volume, asynchronous training algorithms to mitigate the impact of end-to-end latency on training, and hierarchical aggregation algorithms that incorporate regional characteristics based on the hardware and network topology, in order to achieve efficient cross-domain distributed training.
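As a minimal illustration of the hierarchical-aggregation idea (not the exact algorithm we use), the sketch below averages gradients within each region first, so that only one tensor per region crosses the slower inter-region links. The region names and the equal-weight averaging are assumptions made for the example.

```python
import torch

def hierarchical_aggregate(gradients_by_region):
    """Two-level aggregation: average inside each region over cheap local links,
    then average the per-region results over the expensive cross-domain links."""
    regional_means = [torch.stack(grads).mean(dim=0)
                      for grads in gradients_by_region.values()]
    return torch.stack(regional_means).mean(dim=0)

# Toy example with two regions of unequal size; a real system would weight the
# cross-region step by sample counts and overlap it with communication.
grads = {
    "region_a": [torch.randn(10) for _ in range(4)],
    "region_b": [torch.randn(10) for _ in range(2)],
}
global_grad = hierarchical_aggregate(grads)
```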
Resource Scheduling and Optimization
We design efficient resource scheduling strategies for pipeline and expert parallelism.
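One simple view of the pipeline-parallel side of this problem is partitioning consecutive layers into stages so that the slowest stage, which bounds throughput, is as light as possible. The greedy splitting rule and the per-layer cost numbers below are illustrative assumptions, not our scheduler.

```python
def balance_pipeline_stages(layer_costs, num_stages):
    """Split consecutive layers into pipeline stages with roughly equal cost.
    layer_costs: profiled forward+backward time per layer (arbitrary units)."""
    total = sum(layer_costs)
    boundaries, acc, stage = [], 0.0, 1
    for i, cost in enumerate(layer_costs):
        acc += cost
        layers_left = len(layer_costs) - i - 1
        stages_left = num_stages - stage
        # Close the current stage once its share of the total cost is reached,
        # or when exactly one layer remains for each stage still to be opened.
        if stage < num_stages and layers_left >= stages_left and \
                (acc >= total * stage / num_stages or layers_left == stages_left):
            boundaries.append(i + 1)
            stage += 1
    bounds = [0] + boundaries + [len(layer_costs)]
    return [list(range(bounds[s], bounds[s + 1])) for s in range(num_stages)]

# Example: 8 transformer blocks with hypothetical profiled costs, 4 pipeline stages.
print(balance_pipeline_stages([1.0, 1.0, 1.2, 1.3, 1.5, 1.5, 1.8, 2.0], num_stages=4))
# -> [[0, 1, 2], [3, 4], [5, 6], [7]]
```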
Automatic Execution Planning for Deep Neural Networks
On large and complex computing clusters, we search for optimal execution-planning strategies for a wide range of deep neural networks across multiple parallelism dimensions, reducing training and debugging costs and lowering the expertise barrier.
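A stripped-down version of such a search, under a deliberately toy cost model, is sketched below: it enumerates every (data, tensor, pipeline) parallel degree that fits the GPU count and keeps the cheapest plan. The cost terms and coefficients are placeholders; a real planner would use profiled compute, memory, and network measurements.

```python
from itertools import product

def search_parallel_plan(num_gpus, num_layers, comm_cost=1.0):
    """Enumerate (data, tensor, pipeline) parallel degrees and pick the cheapest
    plan under a toy analytical cost model."""
    best_plan, best_cost = None, float("inf")
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp != num_gpus or num_layers % pp != 0:
            continue
        compute   = num_layers / (dp * tp * pp)        # compute share per GPU for a fixed global batch
        tp_comm   = comm_cost * (tp - 1) * num_layers  # per-layer tensor-parallel all-reduce
        pp_bubble = comm_cost * (pp - 1)               # pipeline fill/drain bubble
        dp_sync   = comm_cost * (dp - 1) * 0.1         # gradient sync once per step
        cost = compute + tp_comm + pp_bubble + dp_sync
        if cost < best_cost:
            best_plan, best_cost = (dp, tp, pp), cost
    return best_plan, best_cost

# Example: plan a 32-layer model onto 16 GPUs (toy numbers only).
print(search_parallel_plan(num_gpus=16, num_layers=32))
```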
<aside> ✏️ Gradient compression algorithms are widely used in DML and can effectively alleviate the communication bottleneck, but traditional gradient compression algorithms struggle in non-IID scenarios. To address the accuracy degradation in non-IID scenarios, we propose DAGC, a data-aware adaptive gradient compression algorithm. To address the failure of the traditional hard-threshold compressor in federated learning, we propose γ-FedHT, a stepsize-aware adaptive hard-threshold compressor. In asynchronous federated learning, conventional solutions cannot jointly optimize local updates and communication, so we propose FedLuck, which improves convergence speed by jointly adjusting the local update frequency and the gradient compression rate. For more details, please refer to the following.
</aside>
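To make the hard-threshold idea concrete, the sketch below transmits only the gradient entries whose magnitude exceeds a threshold. The adaptive-threshold function is a hypothetical placeholder showing where a stepsize-aware rule such as γ-FedHT's would plug in; it is not the rule from the paper.

```python
import torch

def hard_threshold_compress(grad: torch.Tensor, threshold: float):
    """Plain hard-threshold sparsification: transmit only entries with |g_i| >= threshold."""
    mask = grad.abs() >= threshold
    return grad[mask], mask.nonzero(as_tuple=False), grad.shape

def adaptive_threshold(base_threshold: float, lr: float, base_lr: float) -> float:
    """Hypothetical stepsize-aware schedule: shrink the threshold as the learning
    rate decays, since gradient magnitudes also shrink late in training.
    This scaling is an illustrative placeholder, not the γ-FedHT rule."""
    return base_threshold * (lr / base_lr)
```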
<aside> ✏️ The emerging concern about data privacy and security has motivated federated learning, which allows nodes to synchronize only their locally trained models instead of their original data. Conventional federated learning architectures, inherited from the parameter-server design, rely on highly centralized topologies and on the assumption of large node-to-server bandwidth. However, in real-world federated learning scenarios, the network capacities between nodes are highly non-uniformly distributed and smaller than those in a datacenter, making it challenging for conventional federated learning approaches to use them efficiently. In this work, we propose model-segment-level decentralized federated learning to tackle this problem. In particular, we propose a segmented gossip approach, which not only makes full use of node-to-node bandwidth but also achieves good training convergence.
</aside>
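As a rough, framework-free sketch of the segmented gossip idea (the segment count, uniform peer sampling, and plain averaging are simplifications, not the exact protocol), each worker splits its parameter vector into segments and mixes every segment with a different randomly chosen peer, so no single link carries the whole model.

```python
import random

def segmented_gossip_mix(local_params, peer_params, num_segments=4, rng=random):
    """Mix each parameter segment with the corresponding segment of a randomly
    chosen peer, spreading traffic across many node-to-node links.
    local_params: list of floats; peer_params: list of other workers' parameter lists."""
    n = len(local_params)
    bounds = [n * s // num_segments for s in range(num_segments + 1)]
    mixed = list(local_params)
    for s in range(num_segments):
        peer = rng.choice(peer_params)          # pull this segment from one peer
        for i in range(bounds[s], bounds[s + 1]):
            mixed[i] = 0.5 * (local_params[i] + peer[i])
    return mixed

# Toy usage: 3 peers, 8-dimensional "model".
me = [1.0] * 8
others = [[0.0] * 8, [2.0] * 8, [4.0] * 8]
print(segmented_gossip_mix(me, others))
```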