Deploying the Full-Strength DeepSeek-R1-671B on A800 Servers

Published: 2025-03-15 11:22:56 | Author: 蒋小颖乱侃

1. Servers and Configuration

1.1 Server List

This deployment of the full-strength DeepSeek-R1-671B uses 4 physical servers, each with 8 NVIDIA A800 GPUs. The hardware details are as follows.

[Image: hardware configuration table]
#10000M network
#10.119.165.139 deepseek1
#10.119.165.138 deepseek2
#10.119.165.140 deepseek3
#10.119.165.141 deepseek4
#IB network
10.119.85.141 deepseek1
10.119.85.138 deepseek2
10.119.85.140 deepseek3
10.119.85.139 deepseek4

(1) The system disk on each physical server is currently two SATA drives in RAID1; per the relevant guidance, two SSDs of about 1T each in RAID1 are recommended for installing the operating system.

(2) The model files are currently stored on a mount point backed by the NVMe drive.

(3) For service access and inter-node communication, use the IB network wherever possible.

(4) For GPU communication within a node (8 GPUs) and between nodes, NVLink and NVSwitch must be configured properly (quick verification commands follow this list).

(5) Some items in the table above are not actually used here, for example the 10GbE NICs.

(6) The deepseek user used below has passwordless sudo. You could also operate as root, but working directly as root is not recommended.
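
A minimal sanity pass over these prerequisites might look like the following sketch (assuming the host entries below are already in /etc/hosts, and that the standard InfiniBand userland providing ibstat is installed):

#Reachability of all nodes over the IB network
for h in deepseek1 deepseek2 deepseek3 deepseek4; do
    ping -c1 -W1 $h >/dev/null && echo "$h reachable"
done
#NVLink links up on this node?
nvidia-smi nvlink --status | head
#IB ports active and at the expected rate?
ibstat | grep -E 'State|Rate'
#Passwordless sudo working for the deepseek user?
sudo -n true && echo "passwordless sudo OK"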

1.2 Software Version Summary

Physical server OS: Ubuntu 22.04.4 LTS (x86_64)
NVIDIA driver version: 550.90.07
CUDA runtime version: 12.1.105 (inside the node container), V12.4.99 (on the physical servers)
nvidia-fabricmanager version: 550.90.07
NVLink: 3.0
NVSwitch: 2.0

PyTorch version: 2.5.1+cu124
CUDA used to build PyTorch: 12.4
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Is CUDA available: True

CUDA_MODULE_LOADING set to: LAZY
Is XNNPACK available: True
CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz, 112 cores
numpy==1.26.4
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
triton==3.1.0

2. Preparation

All of the preparation steps below must be performed on every server.

2.1 Python Environment and Model Files

2.1.1 Preparing the Python Environment

#Add hostname-to-IP mapping entries
deepseek@deepseek1:~$ sudo vi /etc/hosts
#Append the following
#10000M network
#10.119.165.139 deepseek1
#10.119.165.138 deepseek2
#10.119.165.140 deepseek3
#10.119.165.141 deepseek4
#IB network
10.119.85.141 deepseek1
10.119.85.138 deepseek2
10.119.85.140 deepseek3
10.119.85.139 deepseek4


#Configure the pip package index (Aliyun mirror)
deepseek@deepseek1:~$ sudo mkdir ~/.pip
deepseek@deepseek1:~$ sudo vi ~/.pip/pip.conf 
deepseek@deepseek1:~$ sudo cat ~/.pip/pip.conf 
[global]
trusted-host = mirrors.aliyun.com
index-url = https://mirrors.aliyun.com/pypi/simple
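
Once a pip is available (for example inside the conda environment created below), you can confirm the mirror is in effect; a quick check:

#Show the effective pip configuration
pip config list
#Expected output includes: global.index-url='https://mirrors.aliyun.com/pypi/simple'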


#Download Anaconda3-2024.06-1-Linux-x86_64.sh and install Anaconda
deepseek@deepseek1:~/installPkgs$ wget --user-agent="Mozilla" https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
deepseek@deepseek1:~/installPkgs$ sudo bash Anaconda3-2024.06-1-Linux-x86_64.sh -p /home/deepseek/anaconda3
###The Anaconda install directory is /home/deepseek/anaconda3
...
You can undo this by running `conda init --reverse $SHELL`? [yes|no]
[no] >>> yes
...

#Reload ~/.bashrc
deepseek@deepseek1:~/installPkgs$ source ~/.bashrc
#List all conda-managed virtual Python environments
(base) deepseek@deepseek1:~/installPkgs$ conda env list
# conda environments:
#
base                  *  /home/deepseek/anaconda3

#Prepare the python3/pip3 environment (create a conda virtual environment)
(base) deepseek@deepseek1:~/installPkgs$ conda create -n self-llm python=3.12 
##List all conda-managed virtual Python environments again
(base) deepseek@deepseek1:~/installPkgs$ conda env list
# conda environments:
#
self-llm                 /home/deepseek/.conda/envs/self-llm
base                  *  /home/deepseek/anaconda3

#Activate the self-llm virtual environment; subsequent commands run inside it
(base) deepseek@deepseek1:~/installPkgs$ conda activate self-llm
(self-llm) deepseek@deepseek1:~/installPkgs$ 

#First install ModelScope with the following command
(self-llm) deepseek@deepseek1:~/installPkgs$ pip install modelscope

2.1.2 Preparing the Model Files

###DeepSeek-R1-671B is an FP8 model and needs GPUs that support the FP8 data type. The NVIDIA A800 (design-wise essentially an A100 with reduced specs) does not support FP8,
###so to deploy DeepSeek-R1-671B on A800s, download the BF16 version of the model files instead
###Method 4 (the one that succeeded)
The BF16 version of DeepSeek-R1 consists of .safetensors files directly usable by vLLM: https://modelscope.cn/models/unsloth/DeepSeek-R1-BF16/files
Total size: about 1341.32 GB
#deepseek1 was still downloading the files for method 2, so this method was first carried out on deepseek2 (the download process on deepseek1 was consuming so much bandwidth that it slowed the deepseek2 download, so it was killed)
#/dev/nvme0n1 is the NVMe drive, 3.5T
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo parted /dev/nvme0n1 -s -- mklabel gpt mkpart DATA01 1 -1 
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo mkfs.ext4 /dev/nvme0n1p1 
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo mount /dev/nvme0n1p1 /mnt/
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo tail -1 /etc/mtab
/dev/nvme0n1p1 /mnt ext4 rw,relatime,stripe=32 0 0
#Append the line above to the end of /etc/fstab
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo vi /etc/fstab
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo rm -rf /mnt/lost+found
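
A hedged aside: mounting by UUID is more robust than by device name, since NVMe enumeration order can change across reboots. A sketch of the alternative /etc/fstab entry:

#Print the partition UUID, then reference it in /etc/fstab instead of /dev/nvme0n1p1
sudo blkid /dev/nvme0n1p1
#Example /etc/fstab line (<your-uuid> is a placeholder for the UUID printed above):
#UUID=<your-uuid> /mnt ext4 defaults 0 2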
###Download the BF16 version of the DeepSeek-R1 model
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo mkdir -p /mnt/hub/models/unsloth/DeepSeek-R1-BF16
(self-llm) deepseek@deepseek2:~/installPkgs$ sudo /home/deepseek/anaconda3/envs/self-llm/bin/modelscope download --model unsloth/DeepSeek-R1-BF16 --local_dir /mnt/hub/models/unsloth/DeepSeek-R1-BF16
Downloading Model to directory: /mnt/hub/models/unsloth/DeepSeek-R1-BF16
Downloading [config.json]: 100%|█████████████████████████████████████| 1.57k/1.57k [00:00<00:00, 6.75kB/s]
...
###After the download completes, the model files must be synced to the same directory on every server (one way is sketched below)
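
One way to do the sync is rsync over SSH (a sketch; it assumes passwordless SSH between the nodes and the same /mnt layout everywhere):

#Push the downloaded model files from deepseek2 to the other three servers
for h in deepseek1 deepseek3 deepseek4; do
    rsync -a --progress /mnt/hub/models/unsloth/DeepSeek-R1-BF16/ \
        ${h}:/mnt/hub/models/unsloth/DeepSeek-R1-BF16/
done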

2.2 Inspecting the Servers and GPUs

There are 4 physical servers; the disk layout on each is shown below. The system disk is about 893G; in addition there are six more 893G SATA drives and one 3.5T NVMe drive.

deepseek@deepseek1:~$ lsblk 
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0  63.9M  1 loop /snap/core20/2318
loop1     7:1    0    87M  1 loop /snap/lxd/28373
loop2     7:2    0  38.8M  1 loop /snap/snapd/21759
loop3     7:3    0  44.4M  1 loop /snap/snapd/23545
loop4     7:4    0  63.7M  1 loop /snap/core20/2496
loop5     7:5    0  89.4M  1 loop /snap/lxd/31333
sda       8:0    0 893.8G  0 disk 
sdb       8:16   0 893.8G  0 disk 
├─sdb1    8:17   0   550M  0 part /boot/efi
├─sdb2    8:18   0     8M  0 part 
├─sdb3    8:19   0 893.1G  0 part /
└─sdb4    8:20   0    65M  0 part 
sdc       8:32   0 893.8G  0 disk 
sdd       8:48   0 893.8G  0 disk 
sde       8:64   0 893.8G  0 disk 
sdf       8:80   0 893.8G  0 disk 
sdg       8:96   0 893.8G  0 disk 
nvme0n1 259:0    0   3.5T  0 disk 

Each physical server has 8 NVIDIA A800-SXM4-80GB GPUs. The GPU model, driver version, and CUDA version are shown below.

deepseek@deepseek1:~$ nvidia-smi 
Fri Feb 21 09:25:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  |   00000000:3D:00.0 Off |                    0 |
| N/A   33C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  |   00000000:42:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          On  |   00000000:61:00.0 Off |                    0 |
| N/A   30C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          On  |   00000000:67:00.0 Off |                    0 |
| N/A   33C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A800-SXM4-80GB          On  |   00000000:AD:00.0 Off |                    0 |
| N/A   32C    P0             57W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A800-SXM4-80GB          On  |   00000000:B1:00.0 Off |                    0 |
| N/A   29C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A800-SXM4-80GB          On  |   00000000:D0:00.0 Off |                    0 |
| N/A   30C    P0             62W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A800-SXM4-80GB          On  |   00000000:D3:00.0 Off |                    0 |
| N/A   32C    P0             60W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The GPU driver, CUDA, and nvidia-fabricmanager versions must all match one another.

Here the driver, CUDA, and nvidia-fabricmanager versions are 550.90.07, V12.4.99, and 550.90.07 respectively.
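
One way to read all three versions in one place (a sketch; nvcc and nv-fabricmanager only exist once sections 2.3 and 2.5 below are completed):

#Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
#CUDA toolkit version
nvcc -V | grep release
#Fabric manager version
nv-fabricmanager -v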

2.3 Installing CUDA

CUDA installation can follow the official NVIDIA documentation: cuda-installation-guide-linux

#The servers came with CUDA 11.5.119 preinstalled
###To install nvcc via apt (on Ubuntu 22.04 LTS this installs CUDA 11.5.119): sudo apt install nvidia-cuda-toolkit
#Check the current nvcc version
deepseek@deepseek1:~/installPkgs$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

##Uninstall the existing CUDA 11.5.119:
deepseek@deepseek1:~/installPkgs$ sudo apt remove nvidia-cuda-toolkit

Switch to CUDA 12.4. Visit the CUDA installer download page on the NVIDIA site and step through the selectors to obtain the download and install commands:

[Image: CUDA 12.4.0 download selector on the NVIDIA site]
#Download the installer
(self-llm) deepseek@deepseek1:~/installPkgs$ wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run

#Run the installer
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo sh cuda_12.4.0_550.54.14_linux.run
#Type "accept" at the prompt to agree to the license terms, then make the selections below
[Installer screens: deselect the Driver component and install only the CUDA Toolkit]

When the installer finishes, it prints the following:

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-12.4/

Please make sure that
 -   PATH includes /usr/local/cuda-12.4/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.4/lib64, or, add /usr/local/cuda-12.4/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.4/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 550.00 is required for CUDA 12.4 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log
#The installer created the symlink /usr/local/cuda pointing to /usr/local/cuda-12.4/; this symlink is used from here on
(self-llm) deepseek@deepseek1:~/installPkgs$ ls -al /usr/local/cuda
lrwxrwxrwx 1 root root 21 Feb 21 17:55 /usr/local/cuda -> /usr/local/cuda-12.4/

#Update the PATH and LD_LIBRARY_PATH environment variables as the installer summary suggests
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo vi /etc/profile
###Append the following at the end
export PATH=$PATH:/usr/local/cuda/bin/
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/

#Open a new terminal session, or reload the file, to pick up the two environment variables added above
#To reload: source /etc/profile
#Below, a new terminal session is used to continue
(base) deepseek@deepseek1:~$ conda activate self-llm
(self-llm) deepseek@deepseek1:~$ echo $PATH
/home/deepseek/anaconda3/envs/self-llm/bin:/home/deepseek/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin
(self-llm) deepseek@deepseek1:~$ echo $LD_LIBRARY_PATH  
/usr/local/cuda/lib64/
#The CUDA version in use is now V12.4.99
(self-llm) deepseek@deepseek1:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
(self-llm) deepseek@deepseek1:~$ 

2.4 Intra-Node GPU Interconnect: NVLink

Within a node, the 8 NVIDIA A800-SXM4-80GB GPUs are interconnected via NVLink. Matching the A800's Ampere architecture, the NVLink generation is NVLink 3.0 and the NVSwitch generation is NVSwitch 2.0.

Each NVLink 3.0 link has 8 lanes and provides 25 GB/s of unidirectional bandwidth. A full A100 exposes 12 links, giving 25 GB/s * 12 = 300 GB/s unidirectional (600 GB/s bidirectional); the A800 variant is limited to 8 links, giving 200 GB/s unidirectional and 400 GB/s bidirectional, consistent with the 8 links per GPU reported below.

deepseek@deepseek1:~$ nvidia-smi nvlink --status
GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-f275597c-05e4-f7e8-35bc-a3ab26194262)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 1: NVIDIA A800-SXM4-80GB (UUID: GPU-56afa4fb-5618-6b47-7861-51c811bb87d8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 2: NVIDIA A800-SXM4-80GB (UUID: GPU-9039f623-9c6f-bc1a-f7d4-fd706a7cd7f5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 3: NVIDIA A800-SXM4-80GB (UUID: GPU-3029201c-2230-e6d4-290b-267c7c8adb03)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 4: NVIDIA A800-SXM4-80GB (UUID: GPU-54d02ccd-6b42-8ec6-1b2a-624974742a62)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 5: NVIDIA A800-SXM4-80GB (UUID: GPU-13254428-498a-5ec1-5dd5-d2c46dd29c36)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 6: NVIDIA A800-SXM4-80GB (UUID: GPU-5442c93d-d690-3603-f0c3-c0592cf3797c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
GPU 7: NVIDIA A800-SXM4-80GB (UUID: GPU-bfe7f980-f583-3778-9779-85c7ebbb9432)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
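
A quick way to confirm the per-GPU link counts from that output (a small awk sketch; on A800 expect 8 links per GPU):

nvidia-smi nvlink --status | awk '/^GPU/{g=$1" "$2} /Link [0-9]/{n[g]++} END{for (k in n) print k, n[k], "links up"}'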

2.5 Installing and Verifying nvidia-fabricmanager

2.5.1 Installing nvidia-fabricmanager

The A800 GPUs support NVLink & NVSwitch, so the nvidia-fabricmanager service matching the driver version must additionally be installed for the GPU cards to interconnect. With the service installed, the GPU instances work normally. The installation steps follow.

#Install the new cuda-keyring package (this creates /etc/apt/sources.list.d/cuda-ubuntu2204-x86_64.list)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
#Install nvidia-fabricmanager-550=550.90.07-1
version=550.90.07
main_version=$(echo $version | awk -F '.' '{print $1}')
sudo apt-get update
sudo apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*
#Remove components no longer needed
sudo apt autoremove

#Check the installed nvidia-fabricmanager version
deepseek@deepseek1:~$ dpkg -l | grep nvidia-fabricmanager
ii  nvidia-fabricmanager-550               550.90.07-1                                              amd64        Fabric Manager for NVSwitch based systems.
deepseek@deepseek1:~$ nv-fabricmanager -v
Fabric Manager version is : 550.90.07
#Enable nvidia-fabricmanager at boot, then start it
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager.service
#Check the nvidia-fabricmanager service
sudo systemctl status nvidia-fabricmanager
#Note the line "Successfully configured all the available GPUs and NVSwitches to route NVLink traffic"
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-02-20 18:11:52 CST; 21h ago
   Main PID: 3697 (nv-fabricmanage)
      Tasks: 19 (limit: 629145)
     Memory: 22.4M
        CPU: 36.002s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3697 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Feb 20 18:11:40 deepseek1 systemd[1]: Starting NVIDIA fabric manager service...
Feb 20 18:11:41 deepseek1 nv-fabricmanager[3697]: Connected to 1 node.
Feb 20 18:11:52 deepseek1 nv-fabricmanager[3697]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Feb 20 18:11:52 deepseek1 systemd[1]: Started NVIDIA fabric manager service.

2.5.2 GPU Topology

deepseek@deepseek1:~/installPkgs$ nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    SYS     SYS     NODE    0-27,56-83      0               N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    SYS     SYS     NODE    0-27,56-83      0               N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    PXB     SYS     SYS     NODE    0-27,56-83      0               N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    PXB     SYS     SYS     NODE    0-27,56-83      0               N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     PXB     NODE    SYS     28-55,84-111    1               N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     PXB     NODE    SYS     28-55,84-111    1               N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     NODE    PXB     SYS     28-55,84-111    1               N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     NODE    PXB     SYS     28-55,84-111    1               N/A
NIC0    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     NODE
NIC1    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     NODE
NIC2    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS      X      NODE    SYS
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     NODE     X      SYS
NIC4    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_bond_0

As the matrix shows, GPU0 and every other GPU (GPU1, for example) are connected via NV8, i.e., a bonded set of 8 NVLinks.

2.6 High-Performance Networking: InfiniBand

InfiniBand (hereafter the IB network) is a very capable communication protocol; its name literally means "infinite bandwidth". It is a network technology designed for high-performance computing (HPC) and hyperscale data centers, whose core strengths are sub-microsecond latency and very high bandwidth.

Because of this low latency and high bandwidth, IB networks are used in many HPC projects. The IB network here uses Mellanox IB NICs (ConnectX-6 HDR at 200 Gb/s, per the output below), with dedicated IB switches and the UFM controller software handling communication and management. IB achieves network isolation through Partition Keys: different tenants' IB networks are isolated by different Partition Keys, analogous to Ethernet VLANs. In bare-metal (BMS) scenarios, the IB network supports both RDMA and IPoIB.

#First check for InfiniBand controllers
(self-llm) deepseek@deepseek1:~/installPkgs$ lspci | grep -i infiniband
44:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
68:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c2:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
db:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

With the IB NICs installed and enabled on the system, the command ibdev2netdev -v shows the state of each IB port.

Note the last entry, bond1: it is a bond on the Mellanox NIC (per the output below, actually the ConnectX-5 25GbE Ethernet card rather than one of the IB HCAs), and it is the interface used later.

#Device vs. network interface: mlx5_2 is the device name, ibp68s0 is the interface name
(self-llm) deepseek@deepseek2:~$ sudo ibdev2netdev -v
0000:44:00.0 mlx5_2 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                           fw 20.35.3006 port 1 (ACTIVE) ==> ibp68s0 (Up)
0000:68:00.0 mlx5_3 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                           fw 20.35.3006 port 1 (ACTIVE) ==> ibp104s0 (Up)
0000:c2:00.0 mlx5_4 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                           fw 20.35.3006 port 1 (ACTIVE) ==> ibp194s0 (Up)
0000:db:00.0 mlx5_5 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                           fw 20.35.3006 port 1 (ACTIVE) ==> ibp219s0 (Up)
0000:17:00.0 mlx5_bond_0 (MT4119 - MCX562A-ACAB) ConnectX-5 EN network interface card for OCP 3.0, with host management, 25GbE Dual-port SFP28, Pull Tab bracket                                                                        fw 16.35.3006 port 1 (ACTIVE) ==> bond1 (Up)
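
To sanity-check raw IB bandwidth between two nodes before moving on, the perftest tools can be used (a sketch; assumes the perftest package is installed on both nodes):

#On deepseek1 (server side), listen on one HCA:
ib_write_bw -d mlx5_2
#On deepseek2 (client side), point at deepseek1:
ib_write_bw -d mlx5_2 deepseek1
#A healthy HDR (200Gb/s) link should report close to line rate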

2.7 Preparing the Inference Engine

The inference engine used here is vLLM, a fast and easy-to-use library for LLM inference and serving, designed to improve the efficiency and performance of running models. Through memory optimizations and inference acceleration, it can run large language models efficiently even in resource-constrained environments. It is the same category of open-source product as Ollama.

According to the official vLLM deployment docs, vLLM supports installation via Python or via Docker. Since this document runs vLLM processes as a cluster across multiple servers, Docker is used to install and run vLLM, to guarantee a consistent execution environment (model file paths, Python environment, and so on) on all servers.

2.7.1 Installing Docker

sudo apt-get update
#Install apt prerequisites for fetching repositories over HTTPS
sudo apt-get -qy install apt-transport-https ca-certificates curl gnupg-agent software-properties-common

#Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
#The command above may report "gpg: no valid OpenPGP data found."; one workaround is:
sudo gpg --keyserver keyserver.ubuntu.com --recv 7EA0A9C3F273FCD8
sudo gpg --export --armor 7EA0A9C3F273FCD8 |sudo apt-key add -

#Add the stable repository (appended to /etc/apt/sources.list)
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

sudo apt-get update
#List installable docker-ce versions
sudo apt-cache policy docker-ce

#Install a specific version
sudo apt-get -qy install docker-ce=5:28.0.0-1~ubuntu.22.04~jammy
#Check the Docker version
sudo docker --version

2.7.2 Configuring Docker to Use NVIDIA GPUs

#Set up the repository and GPG key
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
#Install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2
#Edit the Docker daemon config file /etc/docker/daemon.json
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo vi /etc/docker/daemon.json 
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    } 
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "dns": [
    "223.5.5.5",
    "223.6.6.6",
    "8.8.8.8"
  ],
  "log-opts": {
    "max-file": "5",
    "max-size": "50m"
  },
  "registry-mirrors": [
    "https://registry.aliyuncs.com",
    "https://registry.docker-cn.com",
    "https://docker.chenby.cn",
    "https://docker.registry.cyou",
    "https://docker-cf.registry.cyou",
    "https://dockercf.jsdelivr.fyi",
    "https://docker.jsdelivr.fyi",
    "https://dockertest.jsdelivr.fyi",
    "https://dockerproxy.com",
    "https://docker.m.daocloud.io",
    "https://docker.nju.edu.cn",
    "https://docker.mirrors.sjtug.sjtu.edu.cn",
    "https://docker.mirrors.ustc.edu.cn",
    "https://mirror.iscas.ac.cn",
    "https://docker.rainbond.cc"
  ]
}

#Reload the config file and restart the Docker daemon
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo systemctl daemon-reload && sudo systemctl restart docker
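#If the restart fails, a malformed daemon.json is the usual culprit; a quick validation sketch using the python3 already on the hosts:
sudo python3 -m json.tool /etc/docker/daemon.json >/dev/null && echo "daemon.json OK"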
#Show detailed Docker configuration
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo docker info
Client: Docker Engine - Community
 Version:    28.0.0
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 ...

2.7.3 Pulling the vllm-openai Image

#Instead of the image on Docker Hub, use the mirror hosted on Huawei Cloud SWR
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.7.2
(self-llm) deepseek@deepseek1:~/installPkgs$ sudo docker images
REPOSITORY                                                            TAG       IMAGE ID       CREATED       SIZE
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai   v0.7.2    f78c8f2f8ad5   2 weeks ago   16.5GB
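
With the image pulled and the NVIDIA runtime set as the default, a quick smoke test that containers can see all GPUs (a sketch; --entrypoint overrides the image's vLLM server entrypoint):

sudo docker run --rm --gpus all --entrypoint nvidia-smi \
    swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.7.2 -L
#Expect 8 lines, one per NVIDIA A800-SXM4-80GB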

2.7.4 Cloning the vLLM Code Repository

(self-llm) deepseek@deepseek1:~$ mkdir code_repos
(self-llm) deepseek@deepseek1:~$ cd code_repos/
(self-llm) deepseek@deepseek1:~/code_repos$ git clone https://github.com/vllm-project/vllm.git
(self-llm) deepseek@deepseek1:~/code_repos$ cd vllm/

3. Deployment, Verification, and Use

3.1 Running and Verifying the vLLM Inference Engine

Reference: Running vLLM on multiple nodes

3.1.1 Starting the node Container on Each Node

Start the node container on each node, to create the Ray cluster or add nodes to it.

vLLM cluster head node

#For the vLLM cluster head node, proceed as follows
###The head-node launch command has this general shape:
#bash run_cluster.sh \
#    vllm/vllm-openai \
#    ip_of_head_node \
#    --head \
#    /path/to/the/huggingface/home/in/this/node \
#    -e VLLM_HOST_IP=ip_of_this_node
#The command below starts child processes in the current shell and blocks forever, so this shell must not be closed or killed; if the launching shell on any node dies, the Ray cluster breaks
#Make sure /mnt contains the directory hub/models/unsloth/DeepSeek-R1-BF16/ holding every file of the DeepSeek-R1-BF16 model repository downloaded from ModelScope
(self-llm) deepseek@deepseek1:~/code_repos$ sudo bash examples/online_serving/run_cluster.sh \
    swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.7.2 \
    10.119.85.141 \
    --head \
    /mnt \
    --cap-add SYS_ADMIN \
    -e VLLM_HOST_IP=10.119.85.141 \
    --privileged -e NCCL_IB_HCA=mlx5 \
    -e NCCL_P2P_LEVEL=NVL \
    -e NCCL_IB_GID_INDEX=3 \
    -e NCCL_IB_DISABLE=0 \
    -e NCCL_DEBUG=INFO \
    -e NCCL_SOCKET_IFNAME=bond1

Some of the relevant parameters are explained below (see the official NCCL docs: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html):

  • NCCL_IB_HCA: selects which IB Host Channel Adapters NCCL may use (mlx5 here prefix-matches all mlx5* devices).
  • NCCL_P2P_LEVEL: fine-grained control over when peer-to-peer (P2P) transport is used between GPUs; an enum with values such as LOC and NVL.

  • NCCL_IB_GID_INDEX: defines the global ID index used in RoCE mode. The available values can be listed with show_gids:

(self-llm) deepseek@deepseek3:~$ show_gids              
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_2  1       0       fe80:0000:0000:0000:a088:c203:0014:e6e2                 v1
mlx5_3  1       0       fe80:0000:0000:0000:a088:c203:0014:e7ea                 v1
mlx5_4  1       0       fe80:0000:0000:0000:a088:c203:0014:e9f6                 v1
mlx5_5  1       0       fe80:0000:0000:0000:946d:ae03:00ed:119e                 v1
mlx5_bond_0     1       0       fe80:0000:0000:0000:1270:fdff:fec2:6294                 v1      bond1
mlx5_bond_0     1       1       fe80:0000:0000:0000:1270:fdff:fec2:6294                 v2      bond1
mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:0a77:558c 10.119.85.140   v1      bond1
mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:0a77:558c 10.119.85.140   v2      bond1
n_gids_found=8
  • NCCL_IB_DISABLE: set to 1 to disable the IB transport and fall back to IP sockets (0 here keeps IB enabled).
  • NCCL_DEBUG: NCCL's debug logging level (INFO here).
  • NCCL_SOCKET_IFNAME: the IP interface to use for communication. See: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname

  • NCCL_NET_GDR_LEVEL: fine-grained control over when GPUDirect RDMA is used between the NIC and the GPU.

#On success, the output looks roughly like this
2025-02-23 22:04:33,124 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-02-23 22:04:33,125 INFO scripts.py:865 -- Local node IP: 10.119.165.139
2025-02-23 22:04:34,147 SUCC scripts.py:902 -- --------------------
2025-02-23 22:04:34,147 SUCC scripts.py:903 -- Ray runtime started.
2025-02-23 22:04:34,147 SUCC scripts.py:904 -- --------------------
2025-02-23 22:04:34,147 INFO scripts.py:906 -- Next steps
2025-02-23 22:04:34,148 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-02-23 22:04:34,148 INFO scripts.py:912 --   ray start --address='10.119.165.139:6379'
2025-02-23 22:04:34,148 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-02-23 22:04:34,148 INFO scripts.py:923 -- import ray
2025-02-23 22:04:34,148 INFO scripts.py:924 -- ray.init()
2025-02-23 22:04:34,148 INFO scripts.py:936 -- To submit a Ray job using the Ray Jobs CLI:
2025-02-23 22:04:34,148 INFO scripts.py:937 --   RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
2025-02-23 22:04:34,148 INFO scripts.py:946 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
2025-02-23 22:04:34,148 INFO scripts.py:950 -- for more information on submitting Ray jobs to the Ray cluster.
2025-02-23 22:04:34,148 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-02-23 22:04:34,148 INFO scripts.py:956 --   ray stop
2025-02-23 22:04:34,148 INFO scripts.py:959 -- To view the status of the cluster, use
2025-02-23 22:04:34,148 INFO scripts.py:960 --   ray status
2025-02-23 22:04:34,148 INFO scripts.py:964 -- To monitor and debug Ray, view the dashboard at 
2025-02-23 22:04:34,148 INFO scripts.py:965 --   127.0.0.1:8265
2025-02-23 22:04:34,148 INFO scripts.py:972 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2025-02-23 22:04:34,148 INFO scripts.py:1076 -- --block
2025-02-23 22:04:34,148 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-02-23 22:04:34,148 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

vLLM cluster worker nodes

#For the vLLM cluster worker nodes, proceed as follows
#This deployment uses 4 servers: deepseek1 is the head node; deepseek2, deepseek3, and deepseek4 are all workers
#so the same steps below must also be run on deepseek3 and deepseek4

##The worker-node launch command has this general shape:
#bash run_cluster.sh \
#    vllm/vllm-openai \
#    ip_of_head_node \
#    --worker \
#    /path/to/the/huggingface/home/in/this/node \
#    -e VLLM_HOST_IP=ip_of_this_node

#The command below also blocks forever in the current shell; if the launching shell on any node is closed or killed, the Ray cluster will misbehave
#Note: change VLLM_HOST_IP to the IP of the specific node
##10000M network
#10.119.165.139 deepseek1
#10.119.165.138 deepseek2
#10.119.165.140 deepseek3
#10.119.165.141 deepseek4
##IB network
#10.119.85.141 deepseek1
#10.119.85.138 deepseek2
#10.119.85.140 deepseek3
#10.119.85.139 deepseek4
(self-llm) deepseek@deepseek2:~/code_repos/vllm$ sudo bash examples/online_serving/run_cluster.sh \
    swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/vllm/vllm-openai:v0.7.2 \
    10.119.85.141 \
    --worker \
    /home/deepseek/.cache/modelscope \
    --cap-add SYS_ADMIN \
    -e VLLM_HOST_IP=10.119.85.138 \
    --privileged -e NCCL_IB_HCA=mlx5 \
    -e NCCL_P2P_LEVEL=NVL \
    -e NCCL_IB_GID_INDEX=3 \
    -e NCCL_IB_DISABLE=0 \
    -e NCCL_DEBUG=INFO \
    -e NCCL_SOCKET_IFNAME=bond1

#On success, the output is as follows
[2025-02-23 22:06:55,915 W 1 1] global_state_accessor.cc:429: Retrying to get node with node ID de9745473ab8d1adb62391a32ea4254142ca109b2520c761285bc1a0
2025-02-23 22:06:55,851 INFO scripts.py:1047 -- Local node IP: 10.119.85.138
2025-02-23 22:06:56,937 SUCC scripts.py:1063 -- --------------------
2025-02-23 22:06:56,937 SUCC scripts.py:1064 -- Ray runtime started.
2025-02-23 22:06:56,937 SUCC scripts.py:1065 -- --------------------
2025-02-23 22:06:56,937 INFO scripts.py:1067 -- To terminate the Ray runtime, run
2025-02-23 22:06:56,938 INFO scripts.py:1068 --   ray stop
2025-02-23 22:06:56,938 INFO scripts.py:1076 -- --block
2025-02-23 22:06:56,938 INFO scripts.py:1077 -- This command will now block forever until terminated by a signal.
2025-02-23 22:06:56,938 INFO scripts.py:1080 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

The steps above start a container named node on each server. In the words of the vLLM documentation, this creates "a ray cluster of containers"; each of the four node containers is a member of that cluster.

3.1.2 Inspecting the Ray Cluster

#Enter the node container on any node in the cluster (every node runs one node container)
#For example, on deepseek1, open another terminal window and run:
(base) deepseek@deepseek1:~$ sudo docker exec -it node /bin/bash
#The following runs inside the node container
root@deepseek1:/vllm-workspace# ray status
======== Autoscaler status: 2025-02-23 22:08:18.818774 ========
Node status
---------------------------------------------------------------
Active:
 1 node_5a3c4cb16e576f2afb9e2c612f5052ab28c46014fcfa7d7b501eb35c
 1 node_de9745473ab8d1adb62391a32ea4254142ca109b2520c761285bc1a0
 1 node_f2f75b7120e56b71d416d9af7301c3ad78ead0a19a6bf810a35e016f
 1 node_d0f9e04c9e3425460679f6fc59d274be1547d81f6a5f1d3dbd96c0fb
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/448.0 CPU
 0.0/32.0 GPU
 0B/3.89TiB memory
 0B/38.91GiB object_store_memory

Demands:
 (no resource demands)

##The Ray cluster has 4 nodes, all Active; in total: 448 CPUs, 32 GPUs, 3.89 TiB of memory
root@deepseek1:/vllm-workspace# 
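
The same totals can also be read programmatically from inside the node container (a sketch using Ray's public API):

python3 -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"
#Expect something like {'CPU': 448.0, 'GPU': 32.0, ...}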

3.1.3 Collecting Environment Information

root@deepseek1:/vllm-workspace# python3 /usr/local/lib/python3.12/dist-packages/torch/utils/collect_env.py

3.1.4 Checking Cross-Node GPU Communication

Reference: sanity check script

Create test.py inside the node container:

root@deepseek1:/vllm-workspace# cat test.py
# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# keep the code for backward compatibility, because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()

Testing on a single node

#--nproc-per-node is the number of GPUs to use on this node
#NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
#NCCL_P2P_DISABLE=0 
NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py

If test.py runs successfully and communication is healthy, the output looks like this:

...
deepseek1:1239:1239 [3] NCCL INFO Connected all trees
deepseek1:1239:1239 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1239:1239 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1240:1240 [4] NCCL INFO Connected all trees
deepseek1:1240:1240 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1240:1240 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1241:1241 [5] NCCL INFO Connected all trees
deepseek1:1241:1241 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1241:1241 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1243:1243 [7] NCCL INFO Connected all trees
deepseek1:1243:1243 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1243:1243 [7] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1242:1242 [6] NCCL INFO Connected all trees
deepseek1:1242:1242 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
deepseek1:1242:1242 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
deepseek1:1236:1236 [0] NCCL INFO ncclCommInitRank comm 0xb606410 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1238:1238 [2] NCCL INFO ncclCommInitRank comm 0xb88e340 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 61000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1240:1240 [4] NCCL INFO ncclCommInitRank comm 0xad28600 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId ad000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1242:1242 [6] NCCL INFO ncclCommInitRank comm 0xb29e250 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId d0000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1237:1237 [1] NCCL INFO ncclCommInitRank comm 0xb60f020 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 42000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1241:1241 [5] NCCL INFO ncclCommInitRank comm 0xc2f2570 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId b1000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1239:1239 [3] NCCL INFO ncclCommInitRank comm 0xac2c590 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 67000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
deepseek1:1243:1243 [7] NCCL INFO ncclCommInitRank comm 0xbd1b050 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d3000 commId 0x3c2a650fc7de9a50 - Init COMPLETE
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
deepseek1:1236:1299 [0] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek1:1237:1286 [1] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek1:1241:1291 [5] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek1:1240:1293 [4] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek1:1239:1295 [3] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek1:1242:1288 [6] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek1:1243:1287 [7] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek1:1238:1290 [2] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek1:1236:1400 [0] NCCL INFO comm 0x8265490 rank 0 nranks 8 cudaDev 0 busId 3d000 - Abort COMPLETE
deepseek1:1243:1405 [7] NCCL INFO comm 0x897f4f0 rank 7 nranks 8 cudaDev 7 busId d3000 - Abort COMPLETE
deepseek1:1237:1403 [1] NCCL INFO comm 0x828e430 rank 1 nranks 8 cudaDev 1 busId 42000 - Abort COMPLETE
deepseek1:1240:1402 [4] NCCL INFO comm 0x79a3f10 rank 4 nranks 8 cudaDev 4 busId ad000 - Abort COMPLETE
deepseek1:1241:1401 [5] NCCL INFO comm 0x8f75840 rank 5 nranks 8 cudaDev 5 busId b1000 - Abort COMPLETE
deepseek1:1242:1404 [6] NCCL INFO comm 0x7f1d4b0 rank 6 nranks 8 cudaDev 6 busId d0000 - Abort COMPLETE
deepseek1:1238:1407 [2] NCCL INFO comm 0x850c0d0 rank 2 nranks 8 cudaDev 2 busId 61000 - Abort COMPLETE
deepseek1:1239:1406 [3] NCCL INFO comm 0x78b0310 rank 3 nranks 8 cudaDev 3 busId 67000 - Abort COMPLETE

Testing across multiple nodes

###Method 1)
#--nnodes is the number of nodes
#--nproc-per-node is the number of GPUs to use on each node
#--rdzv_endpoint is the IP of the Ray cluster head node (it must be reachable from every node; since the Ray cluster is already up, it is)
#Run the identical command in the node container on all 4 servers
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --rdzv_backend=c10d --rdzv_endpoint=10.119.85.141 test.py
#The test only proceeds and produces output once the command is running in all 4 containers


##Method 2)
#Run one of the following commands, one per server, in each node container:
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --node-rank 0 --master_addr 10.119.85.141 test.py
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --node-rank 1 --master_addr 10.119.85.141 test.py
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --node-rank 2 --master_addr 10.119.85.141 test.py
NCCL_DEBUG=TRACE torchrun --nnodes 4 --nproc-per-node=8 --node-rank 3 --master_addr 10.119.85.141 test.py
#The test only proceeds and produces output once the command is running in all 4 containers

If test.py runs successfully and communication is healthy, the output in one of the node containers looks like this:

...
INFO 02-24 00:01:07 pynccl.py:69] vLLM is using nccl==2.21.5
deepseek2:460:460 [4] NCCL INFO Using non-device net plugin version 0
deepseek2:460:460 [4] NCCL INFO Using network IB
deepseek2:459:459 [3] NCCL INFO ncclCommInitRank comm 0xb0c19c0 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 67000 commId 0x8da567a0342af828 - Init START
deepseek2:458:458 [2] NCCL INFO ncclCommInitRank comm 0xb7b98b0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 61000 commId 0x8da567a0342af828 - Init START
deepseek2:457:457 [1] NCCL INFO ncclCommInitRank comm 0xc8103e0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8da567a0342af828 - Init START
deepseek2:460:460 [4] NCCL INFO ncclCommInitRank comm 0xc1ff420 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId ad000 commId 0x8da567a0342af828 - Init START
deepseek2:456:456 [0] NCCL INFO ncclCommInitRank comm 0xbd9b580 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x8da567a0342af828 - Init START
deepseek2:461:461 [5] NCCL INFO ncclCommInitRank comm 0xc041a00 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId b1000 commId 0x8da567a0342af828 - Init START
deepseek2:463:463 [7] NCCL INFO ncclCommInitRank comm 0xc41ace0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId d3000 commId 0x8da567a0342af828 - Init START
deepseek2:462:462 [6] NCCL INFO ncclCommInitRank comm 0xb794080 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId d0000 commId 0x8da567a0342af828 - Init START
deepseek2:463:463 [7] NCCL INFO Setting affinity for GPU 7 to ffff,fff00000,00ffffff,f0000000
deepseek2:463:463 [7] NCCL INFO NVLS multicast support is not available on dev 7
deepseek2:459:459 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
deepseek2:459:459 [3] NCCL INFO NVLS multicast support is not available on dev 3
deepseek2:460:460 [4] NCCL INFO Setting affinity for GPU 4 to ffff,fff00000,00ffffff,f0000000
deepseek2:460:460 [4] NCCL INFO NVLS multicast support is not available on dev 4
deepseek2:461:461 [5] NCCL INFO Setting affinity for GPU 5 to ffff,fff00000,00ffffff,f0000000
deepseek2:461:461 [5] NCCL INFO NVLS multicast support is not available on dev 5
deepseek2:456:456 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
deepseek2:456:456 [0] NCCL INFO NVLS multicast support is not available on dev 0
deepseek2:457:457 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
deepseek2:457:457 [1] NCCL INFO NVLS multicast support is not available on dev 1
deepseek2:458:458 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
deepseek2:458:458 [2] NCCL INFO NVLS multicast support is not available on dev 2
deepseek2:462:462 [6] NCCL INFO Setting affinity for GPU 6 to ffff,fff00000,00ffffff,f0000000
deepseek2:462:462 [6] NCCL INFO NVLS multicast support is not available on dev 6
deepseek2:463:463 [7] NCCL INFO comm 0xc41ace0 rank 15 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0
deepseek2:462:462 [6] NCCL INFO comm 0xb794080 rank 14 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0
deepseek2:457:457 [1] NCCL INFO comm 0xc8103e0 rank 9 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0
deepseek2:461:461 [5] NCCL INFO comm 0xc041a00 rank 13 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0
deepseek2:460:460 [4] NCCL INFO comm 0xc1ff420 rank 12 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0
deepseek2:458:458 [2] NCCL INFO comm 0xb7b98b0 rank 10 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0
deepseek2:462:462 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13
deepseek2:463:463 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] 8/-1/-1->15->14
deepseek2:459:459 [3] NCCL INFO comm 0xb0c19c0 rank 11 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0
deepseek2:456:456 [0] NCCL INFO comm 0xbd9b580 rank 8 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0
deepseek2:462:462 [6] NCCL INFO P2P Chunksize set to 131072
deepseek2:461:461 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12
deepseek2:463:463 [7] NCCL INFO P2P Chunksize set to 131072
deepseek2:457:457 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/16/-1->9->8 [3] -1/-1/-1->9->8
deepseek2:460:460 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11
deepseek2:461:461 [5] NCCL INFO P2P Chunksize set to 131072
deepseek2:458:458 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->19 [2] 11/-1/-1->10->9 [3] 11/6/-1->10->26
deepseek2:457:457 [1] NCCL INFO P2P Chunksize set to 131072
deepseek2:459:459 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/18/-1->11->10
deepseek2:458:458 [2] NCCL INFO P2P Chunksize set to 131072
deepseek2:460:460 [4] NCCL INFO P2P Chunksize set to 131072
deepseek2:459:459 [3] NCCL INFO P2P Chunksize set to 131072
deepseek2:456:456 [0] NCCL INFO Trees [0] 9/-1/-1->8->17 [1] 9/-1/-1->8->15 [2] 9/2/-1->8->24 [3] 9/-1/-1->8->15
deepseek2:456:456 [0] NCCL INFO P2P Chunksize set to 131072
deepseek2:460:460 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 02/0 : 13[5] -> 12[4] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 16[0] [send] via NET/IB/0/GDRDMA
deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 18[2] [send] via NET/IB/1/GDRDMA
deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 16[0] [send] via NET/IB/0/GDRDMA
deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 18[2] [send] via NET/IB/1/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 3[3] -> 8[0] [receive] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 7[7] -> 10[2] [receive] via NET/IB/1/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 3[3] -> 8[0] [receive] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 7[7] -> 10[2] [receive] via NET/IB/1/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 03/0 : 15[7] -> 14[6] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Connected all rings
deepseek2:459:459 [3] NCCL INFO Connected all rings
deepseek2:460:460 [4] NCCL INFO Connected all rings
deepseek2:458:458 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 02/0 : 10[2] -> 11[3] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/CUMEM/read
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 11[3] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 00/0 : 12[4] -> 13[5] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Connected all rings
deepseek2:457:457 [1] NCCL INFO Connected all rings
deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 9[1] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 01/0 : 12[4] -> 13[5] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 02/0 : 11[3] -> 12[4] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 01/0 : 8[0] -> 9[1] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 02/0 : 12[4] -> 13[5] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 12[4] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/CUMEM/read
deepseek2:460:460 [4] NCCL INFO Channel 03/0 : 12[4] -> 13[5] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 03/0 : 8[0] -> 9[1] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 18[2] -> 11[3] [receive] via NET/IB/1/GDRDMA
deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 2[2] -> 8[0] [receive] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 6[6] -> 10[2] [receive] via NET/IB/1/GDRDMA
deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 16[0] -> 9[1] [receive] via NET/IB/0/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 8[0] -> 17[1] [send] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 10[2] -> 19[3] [send] via NET/IB/1/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 24[0] -> 8[0] [receive] via NET/IB/0/GDRDMA
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 24[0] [send] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 26[2] -> 10[2] [receive] via NET/IB/1/GDRDMA
deepseek2:461:461 [5] NCCL INFO Connected all rings
deepseek2:463:463 [7] NCCL INFO Connected all rings
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 26[2] [send] via NET/IB/1/GDRDMA
deepseek2:462:462 [6] NCCL INFO Connected all rings
deepseek2:456:456 [0] NCCL INFO Channel 00/0 : 17[1] -> 8[0] [receive] via NET/IB/0/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 01/0 : 19[3] -> 10[2] [receive] via NET/IB/1/GDRDMA
deepseek2:458:458 [2] NCCL INFO Channel 03/0 : 10[2] -> 6[6] [send] via NET/IB/1/GDRDMA
deepseek2:459:459 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/CUMEM/read
deepseek2:459:459 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 00/0 : 9[1] -> 8[0] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/CUMEM/read
deepseek2:457:457 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 02/0 : 13[5] -> 14[6] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 01/0 : 14[6] -> 15[7] via P2P/CUMEM/read
deepseek2:461:461 [5] NCCL INFO Channel 03/0 : 13[5] -> 14[6] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 02/0 : 14[6] -> 15[7] via P2P/CUMEM/read
deepseek2:462:462 [6] NCCL INFO Channel 03/0 : 14[6] -> 15[7] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/CUMEM/read
deepseek2:463:463 [7] NCCL INFO Channel 03/0 : 15[7] -> 8[0] via P2P/CUMEM/read
deepseek2:456:456 [0] NCCL INFO Channel 02/0 : 8[0] -> 2[2] [send] via NET/IB/0/GDRDMA
deepseek2:462:462 [6] NCCL INFO Connected all trees
deepseek2:462:462 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:462:462 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:461:461 [5] NCCL INFO Connected all trees
deepseek2:461:461 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:461:461 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:460:460 [4] NCCL INFO Connected all trees
deepseek2:460:460 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:460:460 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:458:458 [2] NCCL INFO Connected all trees
deepseek2:459:459 [3] NCCL INFO Connected all trees
deepseek2:458:458 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:458:458 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:459:459 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:459:459 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:463:463 [7] NCCL INFO Connected all trees
deepseek2:463:463 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:463:463 [7] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:457:457 [1] NCCL INFO Connected all trees
deepseek2:457:457 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:457:457 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:456:456 [0] NCCL INFO Connected all trees
deepseek2:456:456 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
deepseek2:456:456 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 1 p2p channels per peer
deepseek2:457:457 [1] NCCL INFO ncclCommInitRank comm 0xc8103e0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:459:459 [3] NCCL INFO ncclCommInitRank comm 0xb0c19c0 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 67000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:462:462 [6] NCCL INFO ncclCommInitRank comm 0xb794080 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId d0000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:461:461 [5] NCCL INFO ncclCommInitRank comm 0xc041a00 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId b1000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:458:458 [2] NCCL INFO ncclCommInitRank comm 0xb7b98b0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 61000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:463:463 [7] NCCL INFO ncclCommInitRank comm 0xc41ace0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId d3000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:460:460 [4] NCCL INFO ncclCommInitRank comm 0xc1ff420 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId ad000 commId 0x8da567a0342af828 - Init COMPLETE
deepseek2:456:456 [0] NCCL INFO ncclCommInitRank comm 0xbd9b580 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 3d000 commId 0x8da567a0342af828 - Init COMPLETE
#each of the 8 local ranks prints the two messages below; the interleaved output is condensed here
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!

deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek2:457:552 [1] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek2:461:549 [5] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek2:459:551 [3] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek2:463:547 [7] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 6
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 5
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 4
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek2:456:544 [0] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 1
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek2:458:548 [2] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 3
deepseek2:462:546 [6] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek2:460:550 [4] NCCL INFO [Service thread] Connection closed by localRank 7
deepseek2:456:660 [0] NCCL INFO comm 0x8a11620 rank 8 nranks 32 cudaDev 0 busId 3d000 - Abort COMPLETE
deepseek2:458:664 [2] NCCL INFO comm 0x8430430 rank 10 nranks 32 cudaDev 2 busId 61000 - Abort COMPLETE
deepseek2:462:658 [6] NCCL INFO comm 0x84085f0 rank 14 nranks 32 cudaDev 6 busId d0000 - Abort COMPLETE
deepseek2:460:661 [4] NCCL INFO comm 0x8e5a270 rank 12 nranks 32 cudaDev 4 busId ad000 - Abort COMPLETE
deepseek2:463:662 [7] NCCL INFO comm 0x9090570 rank 15 nranks 32 cudaDev 7 busId d3000 - Abort COMPLETE
deepseek2:461:657 [5] NCCL INFO comm 0x8cb26b0 rank 13 nranks 32 cudaDev 5 busId b1000 - Abort COMPLETE
deepseek2:457:659 [1] NCCL INFO comm 0x948a0f0 rank 9 nranks 32 cudaDev 1 busId 42000 - Abort COMPLETE
deepseek2:459:663 [3] NCCL INFO comm 0x7d3d650 rank 11 nranks 32 cudaDev 3 busId 67000 - Abort COMPLETE
deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:462:632 [32661] NCCL INFO [Service thread] Connection closed by localRank 0
deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 2
deepseek2:460:636 [32679] NCCL INFO [Service thread] Connection closed by localRank 6
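The [send]/[receive] entries above show that cross-node traffic goes over NET/IB with GPUDirect RDMA (GDRDMA), while intra-node traffic uses P2P/CUMEM, i.e. NVLink/NVSwitch. If in doubt, the state of the IB HCAs NCCL picked (NET/IB/0 and NET/IB/1 above) can be checked on the host; a quick sketch, assuming the infiniband-diags package is installed:

#list each IB adapter and its port state; "State: Active" and a high Rate are expected
deepseek@deepseek1:~$ ibstat | grep -E "^CA |State:|Rate:"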

3.1.5 Start the vllm inference engine process

Launch the process

###The command to start the vllm process looks like the following (more flags are available; see "vllm serve --help")
#vllm serve /path/to/the/model/in/the/container \
#     --tensor-parallel-size 8 \
#     --pipeline-parallel-size 2

#The vLLM docs say "vllm serve ..." can be run from the node container on any node in the cluster, but running it on the Ray head node fails with "No available node types can fulfill resource request", so the author runs it on a Ray worker node instead
#For example, on the deepseek2 server, open another terminal window and enter the node container (the container session opened in the "check the Ray cluster" step can also be reused)
(base) deepseek@deepseek2:~$ sudo docker exec -it node /bin/bash
#Run the following inside the node container to start the vllm process. It prints a lot of output; excerpts follow
root@deepseek2:/vllm-workspace# vllm serve /root/.cache/huggingface/hub/models/unsloth/DeepSeek-R1-BF16/ \
    --served-model-name DeepSeek-R1-671B \
    --enable-prefix-caching \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32768 \
    --trust-remote-code \
    --port 8000 \
    --dtype auto \
    --api-key zY0MrQwXV9Oo3+++
#For the meaning of each flag, see "vllm serve --help" or https://docs.vllm.ai/en/latest/serving/engine_args.html
#Here --tensor-parallel-size is the number of GPUs per node (8) and --pipeline-parallel-size is the number of nodes (4), so 8 x 4 = 32 GPUs are used in total
#The output is very long; excerpts follow (including the final lines)
...
Loading safetensors checkpoint shards:  96% Completed | 157/163 [00:33<00:01,  3.41it/s]
Loading safetensors checkpoint shards:  97% Completed | 158/163 [00:33<00:01,  3.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:33<00:00,  4.85it/s]

INFO 02-24 04:45:42 model_runner.py:1115] Loading model weights took 35.4806 GB
(RayWorkerWrapper pid=1124) INFO 02-24 04:45:44 model_runner.py:1115] Loading model weights took 35.4806 GB
(RayWorkerWrapper pid=7564, ip=10.119.165.139) INFO 02-24 04:45:08 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 30x across cluster]
(RayWorkerWrapper pid=1138) INFO 02-24 04:45:08 cuda.py:161] Using Triton MLA backend. [repeated 31x across cluster]
(RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 utils.py:950] Found nccl from library libnccl.so.2 [repeated 31x across cluster]
(RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 31x across cluster]
(RayWorkerWrapper pid=1149) NCCL version 2.21.5+cuda12.4 [repeated 7x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/IPC/read [repeated 43x across cluster]
(RayWorkerWrapper pid=1116)  09/0 : 3[3] -> 2[2] via P2P/IPC/read [repeated 2x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO Connected all trees [repeated 6x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 [repeated 6x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer [repeated 6x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so [repeated 6x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO TUNER/Plugin: Using internal tuner plugin. [repeated 6x across cluster]
(RayWorkerWrapper pid=1149) deepseek2:1149:1149 [7] NCCL INFO ncclCommInitRank comm 0xd15faf0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d3000 commId 0x7f39c29d5bac68b4 - Init COMPLETE [repeated 6x across cluster]
(RayWorkerWrapper pid=1115) Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read [repeated 2x across cluster]
(RayWorkerWrapper pid=973, ip=10.119.85.140) INFO 02-24 04:45:08 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_3a784ba1'), local_subscribe_port=56925, remote_subscribe_port=None) [repeated 2x across cluster]
(RayWorkerWrapper pid=976, ip=10.119.85.140) INFO 02-24 04:45:08 model_runner.py:1110] Starting to load model /root/.cache/huggingface/hub/models/unsloth/DeepSeek-R1-BF16/... [repeated 30x across cluster]
(RayWorkerWrapper pid=978, ip=10.119.85.140) INFO 02-24 04:46:17 model_runner.py:1115] Loading model weights took 42.8992 GB [repeated 7x across cluster]
...
...
...
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO P2P Chunksize set to 131072
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 2[1] -> 3[1] [receive] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 2[1] -> 3[1] [receive] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[1] [send] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[1] [send] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31505 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Connected all rings
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[1] [receive] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 01/0 : 3[1] -> 1[1] [send] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Channel 00/0 : 3[1] -> 2[1] [send] via NET/IB/1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO Connected all trees
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:7565 [1] NCCL INFO ncclCommInitRank comm 0x1179c2d0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 42000 commId 0x8b115fa040bc44fc - Init COMPLETE
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Using non-device net plugin version 0
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Using network IB
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO ncclCommInitRank comm 0x30ca9860 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 42000 commId 0xd37e8e722df3123f - Init START
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO NVLS multicast support is not available on dev 1
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO comm 0x30ca9860 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO P2P Chunksize set to 524288
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read
(RayWorkerWrapper pid=7565, ip=10.119.165.139) deepseek1:7565:31563 [1] NCCL INFO Channel 02/0 : 1[
(RayWorkerWrapper pid=7587, ip=10.119.165.139) ] NCCL INFO Connected all rings
(RayWorkerWrapper pid=7591, ip=10.119.165.139)  Channel 09/0 : 7[7] -> 6[6] via P2P/IPC/read
(RayWorkerWrapper pid=7580, ip=10.119.165.139) 4[4] -> 3[3] via P2P/IPC/read
(RayWorkerWrapper pid=7580, ip=10.119.165.139) deepseek1:7580:31567 [4] NCCL I
(RayWorkerWrapper pid=7591, ip=10.119.165.139) 
(RayWorkerWrapper pid=7591, ip=10.119.165.139) deepseek1:759
(RayWorkerWrapper pid=7580, ip=10.119.165.139) NFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read
(RayWorkerWrapper pid=7580, ip=10.119.165.139) deepseek1:7580:31594 [4] NCCL INFO Channel 15
INFO 02-24 17:15:16 executor_base.py:110] # CUDA blocks: 68440, # CPU blocks: 13107
...
...
...
Capturing CUDA graph shapes:  43%|████▎     | 15/35 [00:06<00:09,  2.18it/s]
Capturing CUDA graph shapes:  91%|█████████▏| 32/35 [00:13<00:01,  2.46it/s]
Capturing CUDA graph shapes:  46%|████▌     | 16/35 [00:07<00:08,  2.18it/s]
Capturing CUDA graph shapes:  94%|█████████▍| 33/35 [00:13<00:00,  2.46it/s]
Capturing CUDA graph shapes:  49%|████▉     | 17/35 [00:07<00:08,  2.19it/s]
Capturing CUDA graph shapes:  97%|█████████▋| 34/35 [00:13<00:00,  2.47it/s]
Capturing CUDA graph shapes:  51%|█████     | 18/35 [00:08<00:07,  2.13it/s]
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:14<00:00,  2.47it/s]
Capturing CUDA graph shapes:  63%|██████▎   | 22/35 [00:10<00:06,  2.07it/s]
Capturing CUDA graph shapes:  89%|████████▊ | 31/35 [00:13<00:01,  2.36it/s] [repeated 24x across cluster]
Capturing CUDA graph shapes:  66%|██████▌   | 23/35 [00:10<00:05,  2.08it/s]
(RayWorkerWrapper pid=974, ip=10.119.85.140) INFO 02-24 04:48:08 model_runner.py:1562] Graph capturing finished in 64 secs, took 1.19 GiB
Capturing CUDA graph shapes:  71%|███████   | 25/35 [00:11<00:04,  2.04it/s]
(RayWorkerWrapper pid=7561, ip=10.119.165.139) INFO 02-24 04:48:09 custom_all_reduce.py:226] Registering 4480 cuda graph addresses [repeated 16x across cluster]
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:15<00:00,  2.19it/s]
INFO 02-24 04:48:13 custom_all_reduce.py:226] Registering 4340 cuda graph addresses
(RayWorkerWrapper pid=7564, ip=10.119.165.139) INFO 02-24 04:48:13 model_runner.py:1562] Graph capturing finished in 68 secs, took 1.19 GiB [repeated 23x across cluster]
(RayWorkerWrapper pid=1138) INFO 02-24 04:48:18 custom_all_reduce.py:226] Registering 4340 cuda graph addresses [repeated 14x across cluster]
INFO 02-24 04:48:18 model_runner.py:1562] Graph capturing finished in 73 secs, took 1.16 GiB
INFO 02-24 04:48:18 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 84.43 seconds
INFO 02-24 04:48:18 api_server.py:756] Using supplied chat template:
INFO 02-24 04:48:18 api_server.py:756] None
INFO 02-24 04:48:18 launcher.py:21] Available routes are:
INFO 02-24 04:48:18 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-24 04:48:18 launcher.py:29] Route: /health, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /ping, Methods: GET, POST
INFO 02-24 04:48:18 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /version, Methods: GET
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /score, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-24 04:48:18 launcher.py:29] Route: /invocations, Methods: POST
INFO:     Started server process [12020]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
image-20250224205008780
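Once Uvicorn reports it is listening, a quick readiness check is to hit the /health route from the route list above. A minimal sketch, assuming /health is served without the API key (in vLLM the key normally guards only the /v1 endpoints):

#/health returns HTTP 200 once the engine is up; -w prints only the status code
(self-llm) deepseek@deepseek2:~$ curl -s -o /dev/null -w "%{http_code}\n" http://10.119.85.138:8000/health
200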

3.1.6 Check the Ray cluster status again

root@deepseek1:/vllm-workspace# ray status
======== Autoscaler status: 2025-02-24 06:09:38.499412 ========
Node status
---------------------------------------------------------------
Active:
 1 node_78c610008905942ec65274e7c7ce990d1a554e9627512bf633c15c28
 1 node_0aee3e0efd8a7f95dfa4205cd692f7f08e7d665b779a0facf0d3201d
 1 node_8407508c22842dea6182c7accc2565b42daf4c7b051f1d37a4629258
 1 node_3ec95e5ee056a4484f0f81cc518716edd4d2bfd98ffa771b024edc27
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/448.0 CPU
 32.0/32.0 GPU (32.0 used of 32.0 reserved in placement groups)
 0B/3.89TiB memory
 0B/38.91GiB object_store_memory

Demands:
 (no resource demands)
 
#All 32 GPUs are now in use
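Besides ray status, GPU usage can also be confirmed directly on each host. A small sketch, assuming the vLLM container is named node as above and nvidia-smi is available inside it:

#query per-GPU utilization and memory from inside the node container
(base) deepseek@deepseek2:~$ sudo docker exec node nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
#run the same on deepseek1..deepseek4; all 8 GPUs per host should report high memory usage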

3.1.7 Production Metrics

(self-llm) deepseek@deepseek2:~$ curl -s http://10.119.85.138:8000/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 37427.0
python_gc_objects_collected_total{generation="1"} 14232.0
python_gc_objects_collected_total{generation="2"} 16818.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 3033.0
python_gc_collections_total{generation="1"} 267.0
python_gc_collections_total{generation="2"} 315.0
...
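The python_gc_* series above come from the Prometheus Python client; the inference metrics matter more for monitoring. A quick filter, assuming (as in current vLLM releases) the engine-level series are exported under the vllm: prefix:

#show only vLLM's own metrics, e.g. request counts and token throughput counters
(self-llm) deepseek@deepseek2:~$ curl -s http://10.119.85.138:8000/metrics | grep "^vllm:" | head -n 20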

3.1.8 OpenAI API endpoint test

#10.119.85.138 is the IP of the deepseek2 node's IB NIC
(self-llm) deepseek@deepseek2:~$ curl 10.119.85.138:8000/v1/models -H "Authorization: Bearer zY0MrQwXV9Oo3+++" | jq
#Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   523  100   523    0     0   105k      0 --:--:-- --:--:-- --:--:--  127k
{
  "object": "list",
  "data": [
    {
      "id": "DeepSeek-R1-671B",
      "object": "model",
      "created": 1740405511,
      "owned_by": "vllm",
      "root": "/root/.cache/huggingface/hub/models/unsloth/DeepSeek-R1-BF16/",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-ced685e8156b4618b593580109205165",
          "object": "model_permission",
          "created": 1740405511,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

Meanwhile, the terminal window where the vllm serve command is running shows output like the following:

image-20250224215955409

3.1.9 Verifying the service

(self-llm) deepseek@deepseek2:~$ curl -X POST "http://10.119.85.138:8000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer zY0MrQwXV9Oo3+++"  -d '{ "model": "DeepSeek-R1-671B", "messages": [{"role": "user", "content": "你好"}]}'
    
(self-llm) deepseek@deepseek2:~$ curl -X POST "http://10.119.85.138:8000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer zY0MrQwXV9Oo3+++"  -d '{ "model": "DeepSeek-R1-671B", "messages": [{"role": "user", "content": "请证明勾股定理"}]}'
#Response
{"id":"chatcmpl-11ae1ddf321343af848b5c683e67b72d","object":"chat.completion","created":1740411348,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\n嗯,用户让我证明勾股定理。勾股定理是数学里非常基础但又重要的定理,肯定有很多不同的证明方法。先回忆一下,勾股定理是说在直角三角形中,斜边的平方等于两条直角边的平方和,也就是a² + b² = c²。现在得选一
种合适的证明方式,可能是几何的或者代数的。\n\n首先想到的是几何证明中的拼接法,通过将四个直角三角形拼成一个大的正方形,然后比较面积。我要不要试试这个方法?比如说,四个全等的直角三角形,设它们的直角边为a和b,
斜边为c,拼起来的话中间应该会形成一个边长为(a+b)的正方形,中间的空隙可能是小正方形,边长是c或者别的?\n\n不对,应该会形成一个边长为c的正方形,或者这个?等一下,可能得仔细画个图想象一下。假设我们把四个三角
形每个的直角边朝外,那斜边就会组成里面的一个正方形,这时候那个正方形的边长应该是a - b吗?或者可能是其他情况?\n\n另一个方法是代数证明,利用相似三角形或者欧几里得的证明方式。我记得欧几里得在《几何原本》里的证明比较复杂,可能不太直观,适合进阶的学生。但普通学生应该更容易接受面积拼接法,或者是利用代数展开。还有一种方法是总统证明法,据说是美国第20任总统加菲尔德想到的一种梯形面积法,对吗?\n\n比如说,画一个梯形,由
两个直角三角形和一个等腰直角三角形组成,计算面积两种方式。可能这个更简单。先复习一下这种方法的具体步骤。总面积的公式应该等于梯形的高乘以上底加下底的一半,对吗?\n\n或者回到基本的拼图方法。四个直角三角形,边
长为a和b,斜边c,将它们放在一个大正方形里面,这样有两种不同的拼法,分别形成边长为a + b的大正方形,一种是四个三角形围成一个边长为c的正方形,另一种则是中间留有边长为a和b的小正方形。\n\n这时候整个大正方形的面积有两种表达式,一种等于(a + b)²,另一种等于4个三角形的面积加上中间的正方形面积,也就是4*(1/2 ab) + c²。另一边,同一个大正方形的面积也可以通过另一种拼法得到,这时候中间可能是两个小正方形,边长分别是a和b,所以
总面积是4*(1/2 ab) + a² + b²。这样等式两边相等的话就能得到a² + b² = c²。这可能就是常见的拼图法证明。\n\n或者考虑用相似三角形的比例来推导。在直角三角形中,画一条高,由此产生两个小三角形,跟原来的大三角形相似
。根据相似三角形的边长比例,各个边的关系可以得出勾股定理。\n\n比如,设原三角形ABC,C为直角,从C向AB边作高,交于D点。这样,AD和DB分别是两个小三角形的边,之后利用比例关系得出AC² = AD * AB,BC² = DB * AB,相加
后得出AC² + BC² = AB²(AD + DB)= AB²,这样就得到结论。这可能是欧几里得的方法。\n\n不过,可能用户需要的是更直观的证明方法,比如面积法,可能适合更多人理解。像用四个直角三角形围成一个大的正方形,中间形成的部分是c²的面积,然后另外方式拼接得到a² + b²的总面积,两者相等。\n\n这个时候应该详细画出这个图形。但因为是文字描述,我需要解释清楚:将四个直角三角形以边长为a和b的直角边向外,组成一个大正方形,边长为a + b,中间的
区域则是一个边长为c的正方形。大正方形的面积等于四个三角形的面积再加上中间正方形的面积。也就是(a + b)² = 4*(1/2 ab) + c²。展开左边得到a² + 2ab + b² = 2ab + c²,两边减去2ab即得a² + b² = c²,得证。\n\n或者,换
一种方式拼凑,以直角边为排列方向,或许中间形成的区域会是两个小正方形,边长a和b,这可能导致同样的结论。但上述的证明应该更直接,步骤更简单。\n\n另一个可能是代数计算法,比如构造一个边长为c的正方形,分割成四个三角形和一个中间的小正方形,不过这可能需要不同的分割方式。这时同样是利用面积平衡去推导。\n\n确定之后,我觉得用四个三角形的方法,也就是所谓的“面积拼图法”应该是最直接的方式。这个方法的步骤比较清晰,而且代数运算
简单,只有几步展开,容易理解。\n\n另外,要注意用户的疑问可能是什么。这些证明方法是不是正确的?有没有逻辑漏洞?比如在拼接过程中,是否真能组成一个正方形?是否存在对应的几何条件是否符合,例如直角是否拼接正确,
边长是否确实a + b等等。\n\n所以需要在证明过程中详细说明图形是如何构造的,确保各个边对的长度相等,角度正确,这样才能保证中间区域是正方形。例如,四个直角三角形的直角边a和b交替排列,使得各外角都是直角,从而外框是一个边长为a + b的正方形。而中间的部分因为每个边都是斜边c,每个角都是原来的直角拼接后的角度,所以中间肯定是边长为c的正方形。\n\n这样的话,整个过程在几何上是严密的,可以放心使用。\n\n总结一下,我打算使用面积拼接法来证明勾股定理,详细描述如何将四个直角三角形拼成一个大正方形,然后通过两种不同的面积计算方式得到方程,从而解出a² + b² = c²。这种方法直观且步骤清晰,适合各种数学基础的用户理解。同时,提醒自己需要检查各
个步骤是否正确,例如正方形的构造是否正确,面积计算是否有误,代数推导是否严谨。\n</think>\n\n勾股定理表明,在直角三角形中,斜边长(\\( c \\))的平方等于两直角边(\\( a \\) 和 \\( b \\))的平方和,即  \n\\[\na^2 + b^2 = c^2.\n\\]  \n**证明**(面积拼接法):  \n\n1. **构造图形**:  \n   取四个全等的直角三角形,其直角边分别为 \\( a \\) 和 \\( b \\),斜边为 \\( c \\)。将它们以直角相接的方式拼接成一个大正方形,如图1
所示。  \n   - 外框大正方形的边长为 \\( a + b \\)。  \n   - 中间空出的区域是一个边长为 \\( c \\) 的小正方形。  \n\n   ![图1:四个直角三角形构成边长为 \\( a + b \\) 的大正方形,中间空出边长为 \\( c \\) 的小正方形](https://i.imgur.com/yyNUyzB.png)  \n\n2. **计算总面积**:  \n   外框大正方形的面积可表示为:  \n   \\[\n   (a + b)^2 = a^2 + 2ab + b^2.\n   \\]  \n\n3. **另一种方式的总面积**:  \n   总面积也可以看作四个直角三角形的面积加上中间小正方形的面积:  \n   \\[\n   \\text{总面积} = 4 \\times \\left( \\frac{1}{2}ab \\right) + c^2 = 2ab + c^2.\n   \\]  \n\n4. **联立方程**:  \n   将两种表达式联立:  \n   \\[\n   a^2 + 2ab + b^2 = 2ab + c^2.\n   \\]  \n   两边减去 \\( 2ab \\),得到:  \n   \\[\n   a^2 + b^2 = c^2.\n   \\]  \n\n**结论**:在直角三角形中,斜边的平方等于两直角边的平方和。此证法通过几何构造与代数运算的结合
,直观展示了勾股定理的必然性。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":1606,"completion_tokens":1598,"prompt_tokens_details":null},"prompt_logprobs":null}
#While the question is being answered, the vllm serve window shows the following, reporting the average token generation throughput
INFO 02-24 17:21:12 metrics.py:455] Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-24 17:21:17 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-24 17:21:22 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
...
#and sometimes even higher throughput
INFO 02-24 23:32:00 metrics.py:455] Avg prompt throughput: 442.9 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 02-24 23:32:05 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 102.4 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 02-24 23:32:07 async_llm_engine.py:179] Finished request chatcmpl-03add50cba264c84afe98fd6cce9907f.
INFO 02-24 23:32:10 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
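The raw JSON response above is hard to read. Since jq is already in use (see 3.1.8), the generated text can be extracted directly; a small sketch based on the response structure shown above (add "stream": true to the request body for incremental SSE output instead):

#print only the assistant's generated text from the chat completion response
(self-llm) deepseek@deepseek2:~$ curl -s -X POST "http://10.119.85.138:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer zY0MrQwXV9Oo3+++" \
    -d '{ "model": "DeepSeek-R1-671B", "messages": [{"role": "user", "content": "你好"}]}' \
    | jq -r '.choices[0].message.content'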
#apt install nvtop
(self-llm) deepseek@deepseek1:~/installPkgs$ nvtop
#Below is the output of the `nvtop` command
image-20250224233912301

3.2 Run OpenWebUI

3.2.1 Pull the image

#Pull the container image
#sudo docker pull ghcr.io/open-webui/open-webui:v0.5.10
sudo docker pull ghcr.io/open-webui/open-webui:v0.5.16

#If the pull fails or is too slow, the following mirror (kept in sync with the upstream image) can be used instead; it is the recommended option
#sudo docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/open-webui/open-webui:v0.5.10
sudo docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/open-webui/open-webui:v0.5.16
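If the mirror is used, it can optionally be retagged to the upstream name so that later commands work with either image; a small convenience sketch:

#retag the Huawei Cloud mirror to the original ghcr.io name (optional)
sudo docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/open-webui/open-webui:v0.5.16 \
    ghcr.io/open-webui/open-webui:v0.5.16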

3.2.2 Start the open-webui container

(self-llm) deepseek@deepseek2:~$ sudo docker run -d --name open-webui \
    --hostname=open-webui \
    --volume=open-webui:/app/backend/data \
    --workdir=/app/backend \
    -p 18080:8080 \
    --restart always \
    --runtime=nvidia \
    --env=OPENAI_API_BASE_URL=http://10.119.85.138:8000/v1 \
    --env=OPENAI_API_KEY=zY0MrQwXV9Oo3+++ \
    --env=ENABLE_OLLAMA_API=false \
    --env=ENABLE_SIGNUP=true \
    swr.cn-north-4.myhuaweicloud.com/ddn-k8s/ghcr.io/open-webui/open-webui:v0.5.16 \
    bash start.sh
#OPENAI_API_BASE_URL points at the OpenAI-compatible API endpoint exposed by vllm
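Before waiting on the UI, it is worth confirming that the container can reach the vLLM endpoint. A minimal sketch, assuming curl exists inside the image (it normally does, since the image's own health check uses it):

#list models from inside the open-webui container to verify connectivity to vLLM
(self-llm) deepseek@deepseek2:~$ sudo docker exec open-webui curl -s http://10.119.85.138:8000/v1/models \
    -H "Authorization: Bearer zY0MrQwXV9Oo3+++"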


#Wait until the open-webui container's status becomes healthy
(self-llm) deepseek@deepseek2:~$ sudo watch docker ps -a
#Follow the open-webui container logs in real time
(self-llm) deepseek@deepseek2:~$ sudo docker logs open-webui -f
Loading WEBUI_SECRET_KEY from file, not provided as an environment variable.
Generating WEBUI_SECRET_KEY
Loading WEBUI_SECRET_KEY from .webui_secret_key
/app/backend/open_webui
/app/backend
/app
Running migrations
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [open_webui.env] 'ENABLE_SIGNUP' loaded from the latest database entry
INFO  [open_webui.env] 'DEFAULT_LOCALE' loaded from the latest database entry
INFO  [open_webui.env] 'DEFAULT_PROMPT_SUGGESTIONS' loaded from the latest database entry
WARNI [open_webui.env] 

WARNING: CORS_ALLOW_ORIGIN IS SET TO '*' - NOT RECOMMENDED FOR PRODUCTION DEPLOYMENTS.

INFO  [open_webui.env] Embedding model set: sentence-transformers/all-MiniLM-L6-v2
WARNI [langchain_community.utils.user_agent] USER_AGENT environment variable not set, consider setting it to identify your requests.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)  #once this log line appears, the service is ready

3.2.3 Access the open-webui UI

#10.119.85.138 is the IP of the deepseek2 node's IB NIC
(self-llm) deepseek@deepseek2:~$ curl http://10.119.85.138:18080
#Or open the same address in a browser. The first user to register becomes the administrator by default. Register, log in, and start asking questions
image-20250225154520858

