Looking at AI Training Cluster Networking Through the Llama 3.1 Paper

Published: 2024-08-05 12:54:21 | Author: 小麦杂记

Time for another quick post.

Last week Meta released the Llama 3.1 8B/70B models along with the largest 405B pretrained model, and also published the paper. The paper spends very little space on the network, so I'll just quote the original text.

Network.

Llama 3 405B used RDMA over Converged Ethernet (RoCE) fabric based on the Arista 7800 and Minipack2 Open Compute Project OCP rack switches. Smaller models in the Llama 3 family were trained using Nvidia Quantum2 InfiniBand fabric. Both RoCE and InfiniBand clusters leverage 400 Gbps interconnects between GPUs. Despite the underlying network technology differences between these clusters, we tune both of them to provide equivalent performance for these large training workloads. We elaborate further on our RoCE network since we fully own its design.

• Network topology.

Our RoCE-based AI cluster comprises 24K GPUs connected by a three-layer Clos network (Lee et al., 2024).

At the bottom layer, each rack hosts 16 GPUs split between two servers and connected by a single Minipack2 top-of-the-rack (ToR) switch.

In the middle layer, 192 such racks are connected by Cluster Switches to form a pod of 3,072 GPUs with full bisection bandwidth, ensuring no oversubscription.

At the top layer, eight such pods within the same datacenter building are connected via Aggregation Switches to form a cluster of 24K GPUs.

However, network connectivity at the aggregation layer does not maintain full bisection bandwidth and instead has an oversubscription ratio of 1:7.

Our model parallelism methods and training job scheduler (Choudhury et al., 2024) are all optimized to be aware of network topology, aiming to minimize network communication across pods.

• Load balancing.

LLM training produces fat network flows that are hard to load balance across all available network paths using traditional methods such as Equal-Cost Multi-Path (ECMP) routing. To address this challenge, we employ two techniques.

First, our collective library creates 16 network flows between two GPUs, instead of just one, thereby reducing the traffic per flow and providing more flows for load balancing.

Second, our Enhanced-ECMP (E-ECMP) protocol effectively balances these 16 flows across different network paths by hashing on additional fields in the RoCE header of packets.

• Congestion control.

We use deep-buffer switches in the spine (Gangidi et al., 2024) to accommodate transient congestion and buffering caused by collective communication patterns. This setup helps limit the impact of persistent congestion and network back pressure caused by slow servers, which is common in training. Finally, better load balancing through E-ECMP significantly reduces the chance of congestion. With these optimizations, we successfully run a 24K GPU cluster without traditional congestion control methods such as Data Center Quantized Congestion Notification (DCQCN).

The gist: Llama 3.1 405B was trained on a RoCE network of Meta's own design, built mainly from Arista 7800 and Minipack2 switches.

The whole network consists of 8 pods, each containing 384 GPU servers (3,072 H100 GPUs), for an overall scale of 24K GPUs. The Llama 3.1 405B training run actually used only 16K GPUs, roughly 5 of those pods.
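To make the scale concrete, here is a quick back-of-the-envelope check of the counts quoted above (pure arithmetic on the figures from the paper; the "16K" figure for the 405B run is taken at face value):

```python
# Back-of-the-envelope check of the cluster scale described in the paper.
gpus_per_server = 8
servers_per_rack = 2
gpus_per_rack = gpus_per_server * servers_per_rack   # 16 GPUs behind one ToR
racks_per_pod = 192
gpus_per_pod = gpus_per_rack * racks_per_pod         # 3,072 GPUs per pod
pods = 8
total_gpus = gpus_per_pod * pods                     # 24,576 GPUs ("24K")

llama_405b_gpus = 16 * 1024                          # "16K" GPUs used for the 405B run
print(total_gpus, round(llama_405b_gpus / gpus_per_pod, 1))   # 24576, ~5.3 pods
```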

In Meta's actual deployment, each rack holds two 8-GPU servers plus one ToR switch, which matches the diagram Meta published in 2023.


The network structure is the classic three-layer Clos architecture. The paper calls the three layers the bottom layer, middle layer and top layer, and the corresponding switches ToR switches, Cluster switches and Aggregation switches.

The bottom layer, i.e. the ToR layer, uses the Minipack2 switch. The Minipack2 spec can be found on the OCP website: roughly, a 4U chassis with 8 port interface module (PIM) slots. The overall design hasn't changed much from Minipack1, but it uses Broadcom's TH4 chip, pushing the switching capacity up to 25.6 Tbps, with upgraded port speeds and counts compared to Minipack1.

When used for 400G (QSFP-DD) connectivity, it supports at most 64 ports, i.e. up to eight 8-port 400G interface cards.

Since Meta puts only two 8-GPU servers in each rack, there are just 16 GPUs with 16 corresponding 400G NICs. With one Minipack2 per rack, 16 × 400G links downstream and 16 × 400G upstream, only 32 of the 400G ports are actually used, leaving half the port capacity idle.
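A rough port budget for one ToR follows directly from that layout (this is just counting, not Meta's actual configuration):

```python
# Port budget for a single Minipack2 ToR in this deployment.
tor_400g_ports = 64            # 25.6 Tbps / 400 Gbps, i.e. 8 PIMs x 8 QSFP-DD ports
gpus_per_rack = 2 * 8          # 2 servers x 8 GPUs, one 400G NIC each
downlinks = gpus_per_rack      # 16 x 400G toward the GPU NICs
uplinks = 16                   # 16 x 400G toward the 16 cluster switches
used = downlinks + uplinks     # 32 ports in use
spare = tor_400g_ports - used  # 32 ports, half the capacity, left idle
print(used, spare)
```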

The middle layer and top layer naturally use Arista 7800 switches. Browsing Arista's site, my guess is the 7808 model: every middle-layer switch in a pod has to connect to all 192 Minipack2s in that pod, so the 7804, which only supports 192 × 400G ports, clearly wouldn't be enough.

Each pod has 16 middle-layer switches, each with 192 × 400G downlinks. According to the paper, a 7:1 oversubscription ratio is applied between the middle layer and the top layer. Given that each 7800 line card supports 36 × 400G ports, my guess is that each line card actually carries 28 downlinks and 4 uplinks, which works out to 224 downlinks and 32 uplinks per chassis, i.e. 256 × 400G ports in use.
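Following that guessed 28-down / 4-up split per line card (again my speculation, not something the paper states), the per-chassis numbers work out as below:

```python
# Per-chassis port layout for a cluster (middle-layer) switch,
# assuming the guessed 28-down / 4-up split on each 36x400G line card.
ports_per_line_card = 36
line_cards = 8                          # a 7808 chassis has 8 line-card slots
down_per_card, up_per_card = 28, 4      # 7:1 oversubscription per card
downlinks = down_per_card * line_cards  # 224 x 400G toward the ToRs
uplinks = up_per_card * line_cards      # 32 x 400G toward the aggregation layer
print(downlinks, uplinks, downlinks / uplinks)   # 224 32 7.0
```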

As for why each pod has 16 7808s: as the paper says, during collective communication the library creates 16 flows between every pair of GPUs, each hashed onto a different network path, to make ECMP work better.

At the top-layer switches, connections are needed to all 8 pods, each with 16 middle-layer switches carrying 32 × 400G uplinks apiece, i.e. 8 × 16 × 32 = 4096 × 400G ports in total. That can also be covered by 16 7808s, again with 256 ports in use per chassis.
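The same counting for the aggregation layer:

```python
# Aggregation-layer port count for the full 24K-GPU cluster.
pods = 8
cluster_switches_per_pod = 16
uplinks_per_cluster_switch = 32     # from the guessed 7:1 split above
total_uplinks = pods * cluster_switches_per_pod * uplinks_per_cluster_switch  # 4096
aggregation_switches = 16
ports_per_agg_switch = total_uplinks // aggregation_switches                  # 256
print(total_uplinks, ports_per_agg_switch)
```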


As for why Meta builds the spine from Arista 7808s rather than Minipack2s: that is the deep-buffer advantage mentioned in the paper. Minipack2 uses the TH4, which has only a hundred-odd MB of on-chip SRAM for buffering, whereas the 7808 uses the Jericho2 chip with 8 GB of external HBM2 as a deep buffer. In the AI training scenario this even removes the need for traditional DCQCN.
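To put the two buffer sizes in perspective, here is a rough comparison of how long each could absorb a single full-rate 400G burst (pure arithmetic; it ignores buffer carving, sharing policies and everything else that matters in practice):

```python
# Very rough burst-absorption comparison; ignores buffer carving and sharing.
line_rate_bps = 400e9                          # one 400G port
buffers = {
    "TH4 (Minipack2)": 100e6,                  # on the order of 100 MB of on-chip SRAM
    "Jericho2 (7808)": 8e9,                    # 8 GB of external HBM2
}
for name, buf_bytes in buffers.items():
    ms = buf_bytes * 8 / line_rate_bps * 1e3
    print(f"{name}: ~{ms:.0f} ms of a full-rate 400G burst")
```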

And the classic problem of fat flows in AI training traffic making ECMP hard to do well is sidestepped by modifying the collective communication library so that every GPU pair runs 16 concurrent flows, with no need for flowlet-based dynamic load balancing or similar tricks in the underlying network.
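To illustrate the multi-flow trick: RoCEv2 traffic is UDP-encapsulated, so if the collective library gives each of the 16 sub-flows its own UDP source port (or some other header field the switch hashes on), an ECMP-style hash spreads one logical GPU-to-GPU transfer across many paths. A minimal sketch of the idea (illustrative only; the hash and the field choice are stand-ins, not Meta's E-ECMP):

```python
# Minimal sketch: one fat flow hashes to one path, 16 sub-flows spread across many.
import hashlib

def ecmp_path(src_ip: str, dst_ip: str, udp_sport: int, udp_dport: int, num_paths: int) -> int:
    """Pick an equal-cost path by hashing header fields, as an ECMP switch would."""
    key = f"{src_ip}|{dst_ip}|{udp_sport}|{udp_dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % num_paths

ROCE_V2_DPORT = 4791      # standard RoCEv2 UDP destination port
NUM_FLOWS = 16            # flows the collective library opens per GPU pair
paths = [ecmp_path("10.0.0.1", "10.0.1.1", 49152 + i, ROCE_V2_DPORT, num_paths=16)
         for i in range(NUM_FLOWS)]
print(sorted(paths))      # 16 sub-flows land on many different paths instead of one
```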

For the underlying network, the simpler the better.



