Efficient and Scalable Distributed LLM Training: Hiding Communication Overhead

Abstract

Training Large Language Models (LLMs) is often inefficient due to high communication overhead, frequently resulting in sub-50% Model FLOPs Utilization (MFU). In this talk, I will discuss how to build a cost-efficient and scalable machine learning system, using DHelix as an example. Inspired by the DNA double-helix structure, DHelix improves efficiency through Strand Interleaving (SI), which interleaves the forward and backward passes of different micro-batches so that communication in one strand overlaps computation in the other. SI integrates with existing parallelism strategies, and a model folding design extends it to pipeline parallelism. Experiments on Llama, GPT, and Phi MoE models across NVIDIA A40, A800, and H100 clusters demonstrate up to 58% MFU on A40 and 71% on A800, significantly outperforming state-of-the-art methods. I will explore DHelix's design, its optimization techniques, and its broader impact on distributed LLM training.
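The interleaving idea can be pictured with a toy scheduler: two micro-batch "strands" run one phase apart, so whenever one strand is in a communication phase the other is computing. The sketch below is a conceptual illustration only; it assumes a simplified four-phase micro-batch timeline (forward compute, all-gather, backward compute, reduce-scatter) and does not reflect DHelix's actual operator-level scheduling.

    # Conceptual sketch of strand interleaving (not DHelix's real code).
    # Phase names and the four-phase timeline are illustrative assumptions.
    from dataclasses import dataclass

    # Alternating phases of one micro-batch: compute (C) and communicate (X).
    PHASES = ["C:fwd", "X:all-gather", "C:bwd", "X:reduce-scatter"]

    @dataclass
    class Strand:
        name: str
        step: int = 0  # index into PHASES; each strand starts at a different offset

        def phase(self) -> str:
            return PHASES[self.step % len(PHASES)]

        def advance(self) -> None:
            self.step += 1

    def interleave(ticks: int = 8) -> None:
        # Strand B starts one phase behind Strand A, so A's communication
        # phases line up with B's computation phases and vice versa.
        a, b = Strand("A", step=0), Strand("B", step=1)
        for t in range(ticks):
            pa, pb = a.phase(), b.phase()
            overlap = pa.startswith("C") != pb.startswith("C")
            print(f"tick {t}: A={pa:<16} B={pb:<16} comm/compute overlap={overlap}")
            a.advance()
            b.advance()

    if __name__ == "__main__":
        interleave()

Running the sketch prints a timeline in which every tick pairs one strand's communication with the other strand's computation, which is the overlap that SI exploits to hide communication cost; DHelix's real scheduler operates at the level of individual operators and collectives rather than whole passes.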

Date
Feb 12, 2025 2:00 PM — 3:00 PM
Event
Weekly Talk
Location
COM3-B1-15 - Meeting Rm 92