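How a veteran programmer used AI to help write a performance-testing script and saved a lot of debugging time. Key points: 1. An account of developing a performance-testing script with AI assistance 2. Script parameters and usage examples 3. Sample benchmark output and analysis of the performance metrics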
Here is a usage example of the finished script:
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --api_key "any-string-if-no-apikey-on-server" \
  --concurrency 1 \
  --prompt "Tell me a story" \
  --max_tokens 100 \
  --duration_seconds 30
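What the parameters mean:
--url: base URL of the inference service; the path ends with /v1
--model: model name on the inference service
--concurrency: number of concurrent requests; when this is set, --concurrencys is ignored
--prompt: the user question sent with each request
--max_tokens: maximum number of tokens
--duration_seconds: how long to run, in seconds
--concurrencys: comma-separated list of concurrency levels, e.g. 1,5,10,15,20,30; only takes effect when --concurrency is not set. The script runs one batch per listed level, with a 5-second pause between consecutive batches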
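The precedence between --concurrency and --concurrencys, and the pause between batches, can be sketched as follows. This is a minimal illustration, not the actual source of simple-bench-to-api.py: the function names `build_parser`, `resolve_levels`, and `run_sweep`, and all default values, are assumptions made for the example.

```python
import argparse
import time
from typing import Callable, List


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the usage example; the defaults are assumptions.
    p = argparse.ArgumentParser(description="Sketch of the benchmark CLI")
    p.add_argument("--url", required=True)
    p.add_argument("--model", required=True)
    p.add_argument("--api_key", default="none")
    p.add_argument("--prompt", default="Tell me a story")
    p.add_argument("--max_tokens", type=int, default=100)
    p.add_argument("--duration_seconds", type=int, default=30)
    p.add_argument("--concurrency", type=int, default=None)
    p.add_argument("--concurrencys", default=None)
    return p


def resolve_levels(args: argparse.Namespace) -> List[int]:
    """Return the concurrency levels to run, in order."""
    if args.concurrency is not None:
        # --concurrency wins; --concurrencys is ignored.
        return [args.concurrency]
    if args.concurrencys:
        return [int(x) for x in args.concurrencys.split(",") if x.strip()]
    raise SystemExit("set --concurrency or --concurrencys")


def run_sweep(levels: List[int],
              run_batch: Callable[[int], dict],
              pause_seconds: float = 5.0,
              sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Run one batch per level, pausing between consecutive batches."""
    results = []
    for i, level in enumerate(levels):
        if i > 0:
            sleep(pause_seconds)  # the 5-second gap between batches
        results.append(run_batch(level))
    return results
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without actually waiting.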
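Sample output:
Benchmark result for a single concurrency level:
Benchmark results:
Concurrency: 1
Total requests: 9
Success rate: 100.00%
Average latency: 3.3685s
Max latency: 3.4171s
Min latency: 3.3369s
Average time to first token: 0.0767s
P90 latency: 3.3893s
P95 latency: 3.4032s
P99 latency: 3.4143s
Total generated tokens: 918
Min per-channel throughput: 30.71 tokens/s
Max per-channel throughput: 31.26 tokens/s
Average per-channel throughput: 30.99 tokens/s
Overall throughput: 30.24 tokens/s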
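A few of these concepts need explaining:
"Latency": the time from sending a request to receiving its last token or character.
"P90 latency": the 90th-percentile latency. Sort the latencies in ascending order, take the largest value within the lowest 90% and the value right after it, and linearly interpolate between the two. It means 90% of requests complete within this time. P95 and P99 latency are defined the same way.
"Time to first token": the time from sending a request to receiving the first returned character (or token).
"Per-channel throughput": I made this name up to set it apart from the other metrics, since I could not find an established term for it. It is the token generation speed seen by each concurrent user/channel after the first token has arrived; the time to first token is excluded. That is, a channel's throughput = tokens generated on that channel / generation time excluding the time to first token. In my view, this metric together with the average time to first token reflects what a user actually experiences.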
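These three per-request quantities can be derived from the send time and the arrival time of each token. The sketch below is an illustration under assumptions: `RequestMetrics` and `measure` are hypothetical names, and it presumes the client records one timestamp per streamed token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestMetrics:
    latency: float     # send -> last token, seconds
    ttft: float        # send -> first token, seconds
    throughput: float  # tokens/s, excluding the time to first token


def measure(sent_at: float, token_times: List[float]) -> RequestMetrics:
    """Derive metrics from the send time and each token's arrival time."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at
    latency = token_times[-1] - sent_at
    gen_time = latency - ttft  # generation time after the first token
    throughput = len(token_times) / gen_time if gen_time > 0 else 0.0
    return RequestMetrics(latency, ttft, throughput)
```

The sample output is consistent with this definition: 918 tokens over 9 requests is 102 tokens per request, and 102 / (3.3685 - 0.0767) is about 31 tokens/s, in line with the reported average per-channel throughput of 30.99 tokens/s.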
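What each metric means:
Average time to first token: the mean time to first token across all channels
Min per-channel throughput: the throughput of the slowest of all concurrent channels
Max per-channel throughput: the throughput of the fastest of all concurrent channels
Average per-channel throughput: the mean throughput across all concurrent channels
Overall throughput: total tokens generated by all channels during the test / time from the start to the end of the test
P90 latency: 3.3893s: 90% of requests have a latency below this value
P95 latency: 3.4032s: 95% of requests have a latency below this value
P99 latency: 3.4143s: 99% of requests have a latency below this value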
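The percentile rule described above (sort, then interpolate linearly between the two neighbouring values) can be written in a few lines. This is a sketch of that rule, not necessarily the script's own code; the function name `percentile` is an assumption. The rank formula used here is the same one NumPy's `numpy.percentile` applies by default.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0  # fractional rank in the sorted list
    lo = int(pos)                      # index of the lower neighbour
    hi = min(lo + 1, len(data) - 1)    # index of the upper neighbour
    frac = pos - lo                    # distance from lower to upper
    return data[lo] + (data[hi] - data[lo]) * frac
```

With only 9 requests, as in the sample output, P95 and P99 both fall between the two slowest requests, so they say little beyond the maximum; longer runs give more meaningful tail figures.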
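Next, I benchmark DeepSeek's distilled models based on qwen2.5-7B and qwen2.5-32B, each deployed on a single RTX 4090 (24 GB of VRAM).
Model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Quantization: none
Deployment: k8s + vLLM 0.7.3
Load generator location: the same machine as the inference service, which removes the effect of remote network overhead.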
Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
Client benchmark command:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 100 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
Benchmark results (the script aggregates the results for each concurrency level into a Markdown table):
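From this summary, as concurrency goes from 1 to 200:
Average time to first token (avg_ttft) rises from 0.0363s to 0.6999s, which is still fast.
Average per-channel throughput (avg) drops from 60.14 tokens/s to 23.27 tokens/s, so at 200 concurrent requests each user's generation speed is roughly 39% of the single-request speed (about 2.6 times slower), as expected.
Overall throughput rises from 58.76 tokens/s to 3730.98 tokens/s; the higher the concurrency, the more slowly overall throughput grows, which matches the usual pattern.
Average latency (avg_latency) rises from 1.6825s to 4.9803s, which is quite acceptable.
Overall, a single RTX 4090 running the R1-7B model stays smooth up to 200 concurrent requests.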
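To put numbers on "grows more slowly", the two endpoints quoted above can be turned into a speedup and a scaling efficiency. The helper name `scaling_summary` is made up for this sketch, and it uses only the concurrency-1 and concurrency-200 figures from the text.

```python
def scaling_summary(c_low, c_high, overall_low, overall_high,
                    per_user_low, per_user_high):
    """Compare two concurrency levels of one benchmark run."""
    speedup = overall_high / overall_low          # overall throughput gain
    efficiency = speedup / (c_high / c_low)       # 1.0 would be linear scaling
    per_user_ratio = per_user_high / per_user_low # per-channel speed retained
    return speedup, efficiency, per_user_ratio


speedup, efficiency, per_user_ratio = scaling_summary(
    1, 200, 58.76, 3730.98, 60.14, 23.27)
print(f"{speedup:.1f}x overall throughput, "
      f"{efficiency:.0%} of linear scaling, "
      f"{per_user_ratio:.0%} of single-user speed")
```

So 200 times the concurrency yields roughly 63 times the overall throughput, while each user keeps a bit under 40% of the single-user generation speed.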
Server log:
# Concurrency 1
INFO 03-03 03:59:21 metrics.py:455] Avg prompt throughput: 1.3 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:26 metrics.py:455] Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 59.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:31 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:36 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:41 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:46 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 03:59:51 metrics.py:455] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 59.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
# Concurrency 10
INFO 03-03 03:59:57 metrics.py:455] Avg prompt throughput: 3.4 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:02 metrics.py:455] Avg prompt throughput: 50.3 tokens/s, Avg generation throughput: 561.9 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:07 metrics.py:455] Avg prompt throughput: 53.9 tokens/s, Avg generation throughput: 553.2 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:12 metrics.py:455] Avg prompt throughput: 54.0 tokens/s, Avg generation throughput: 553.3 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:17 metrics.py:455] Avg prompt throughput: 54.0 tokens/s, Avg generation throughput: 554.3 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:22 metrics.py:455] Avg prompt throughput: 39.6 tokens/s, Avg generation throughput: 557.7 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:27 metrics.py:455] Avg prompt throughput: 50.3 tokens/s, Avg generation throughput: 556.0 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
# Concurrency 50
INFO 03-03 04:00:32 metrics.py:455] Avg prompt throughput: 6.1 tokens/s, Avg generation throughput: 50.9 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:37 metrics.py:455] Avg prompt throughput: 172.2 tokens/s, Avg generation throughput: 1967.8 tokens/s, Running: 49 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 5.5%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:42 metrics.py:455] Avg prompt throughput: 180.0 tokens/s, Avg generation throughput: 1775.5 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:47 metrics.py:455] Avg prompt throughput: 178.7 tokens/s, Avg generation throughput: 1923.5 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:52 metrics.py:455] Avg prompt throughput: 181.2 tokens/s, Avg generation throughput: 1939.9 tokens/s, Running: 47 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:00:58 metrics.py:455] Avg prompt throughput: 186.5 tokens/s, Avg generation throughput: 1942.9 tokens/s, Running: 50 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:03 metrics.py:455] Avg prompt throughput: 172.2 tokens/s, Avg generation throughput: 1946.6 tokens/s, Running: 45 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.8%, CPU KV cache usage: 0.0%.
# Concurrency 100
INFO 03-03 04:01:10 metrics.py:455] Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 321.9 tokens/s, Running: 62 reqs, Swapped: 0 reqs, Pending: 38 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:15 metrics.py:455] Avg prompt throughput: 352.8 tokens/s, Avg generation throughput: 3194.2 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.0%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:20 metrics.py:455] Avg prompt throughput: 190.3 tokens/s, Avg generation throughput: 2817.9 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:25 metrics.py:455] Avg prompt throughput: 352.9 tokens/s, Avg generation throughput: 3163.5 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:30 metrics.py:455] Avg prompt throughput: 197.2 tokens/s, Avg generation throughput: 2960.1 tokens/s, Running: 13 reqs, Swapped: 0 reqs, Pending: 2 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:35 metrics.py:455] Avg prompt throughput: 358.0 tokens/s, Avg generation throughput: 3230.8 tokens/s, Running: 100 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:40 metrics.py:455] Avg prompt throughput: 185.1 tokens/s, Avg generation throughput: 2847.4 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
# Concurrency 150
INFO 03-03 04:01:47 metrics.py:455] Avg prompt throughput: 2.7 tokens/s, Avg generation throughput: 25.5 tokens/s, Running: 39 reqs, Swapped: 0 reqs, Pending: 22 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:52 metrics.py:455] Avg prompt throughput: 535.9 tokens/s, Avg generation throughput: 3599.7 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:01:57 metrics.py:455] Avg prompt throughput: 268.1 tokens/s, Avg generation throughput: 3613.7 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 9.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:02 metrics.py:455] Avg prompt throughput: 273.3 tokens/s, Avg generation throughput: 3520.0 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:07 metrics.py:455] Avg prompt throughput: 280.3 tokens/s, Avg generation throughput: 3703.8 tokens/s, Running: 150 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:12 metrics.py:455] Avg prompt throughput: 278.9 tokens/s, Avg generation throughput: 3685.7 tokens/s, Running: 11 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:17 metrics.py:455] Avg prompt throughput: 406.6 tokens/s, Avg generation throughput: 3176.2 tokens/s, Running: 81 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 2.0%, CPU KV cache usage: 0.0%.
# Concurrency 200
INFO 03-03 04:02:25 metrics.py:455] Avg prompt throughput: 13.0 tokens/s, Avg generation throughput: 867.2 tokens/s, Running: 58 reqs, Swapped: 0 reqs, Pending: 44 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:30 metrics.py:455] Avg prompt throughput: 472.7 tokens/s, Avg generation throughput: 4012.6 tokens/s, Running: 157 reqs, Swapped: 0 reqs, Pending: 29 reqs, GPU KV cache usage: 2.7%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:35 metrics.py:455] Avg prompt throughput: 244.4 tokens/s, Avg generation throughput: 3969.8 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:40 metrics.py:455] Avg prompt throughput: 439.9 tokens/s, Avg generation throughput: 4022.8 tokens/s, Running: 93 reqs, Swapped: 0 reqs, Pending: 32 reqs, GPU KV cache usage: 1.6%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:45 metrics.py:455] Avg prompt throughput: 258.5 tokens/s, Avg generation throughput: 3944.5 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 20 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:50 metrics.py:455] Avg prompt throughput: 405.5 tokens/s, Avg generation throughput: 4000.0 tokens/s, Running: 68 reqs, Swapped: 0 reqs, Pending: 22 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
INFO 03-03 04:02:55 metrics.py:455] Avg prompt throughput: 320.2 tokens/s, Avg generation throughput: 4016.7 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
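GPU KV cache usage stays low throughout (it peaks at 14.1% in the log above, at 150 concurrent requests), so the cache is far from fully used and the system still has headroom.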
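A statement about cache usage is easy to check directly against the log. The sketch below extracts generation throughput, running requests, and GPU KV cache usage from lines in the format shown above; the regular expression is written for this vLLM 0.7.3 log layout only, and `parse_metrics` and `peak_kv_usage` are hypothetical helper names.

```python
import re
from typing import Iterable, List, Tuple

# Matches the metrics.py lines shown in the log excerpt above.
LINE_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, .*?"
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def parse_metrics(lines: Iterable[str]) -> List[Tuple[float, int, float]]:
    """Return (generation tokens/s, running requests, GPU KV cache %) per line."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:  # skip lines that are not metrics lines
            rows.append((float(m["gen"]), int(m["running"]), float(m["kv"])))
    return rows


def peak_kv_usage(lines: Iterable[str]) -> float:
    """Highest GPU KV cache usage (percent) seen in the log."""
    rows = parse_metrics(lines)
    if not rows:
        raise ValueError("no metrics lines found")
    return max(kv for _, _, kv in rows)
```

Feeding it the server log, for example via `peak_kv_usage(open("vllm.log"))` with a file name of your choosing, gives the peak usage for the whole run.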
Steady-state resource usage:
# Concurrency 1
|=========================================+======================+======================|
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 30% 50C P2 291W / 450W | 22736MiB / 24564MiB | 94% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 10
|=========================================+======================+======================|
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 30% 56C P2 297W / 450W | 22736MiB / 24564MiB | 89% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 50
|=========================================+======================+======================|
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 30% 57C P2 298W / 450W | 22736MiB / 24564MiB | 84% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 150
|=========================================+======================+======================|
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 46% 61C P2 342W / 450W | 22736MiB / 24564MiB | 83% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 200
|=========================================+======================+======================|
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:E1:00.0 Off | Off |
| 35% 60C P2 323W / 450W | 22736MiB / 24564MiB | 78% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
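This is as expected: the higher the concurrency, the more overhead goes to scheduling and context switching, and the lower the GPU compute utilization (it falls from 94% at concurrency 1 to 78% at concurrency 200).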
Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
Client:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 1024 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
The server is unchanged; the client's max_tokens goes from the 100 used in parameter combination 1 to 1024. Benchmark results:
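With max_tokens at 100, average latency at concurrency 1 was only about 1.7 seconds. With max_tokens raised roughly 10 times to 1k (1024), average latency at concurrency 1 becomes 16 to 17 seconds, also about 10 times the earlier value, as expected.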
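The roughly tenfold jump follows from a simple model: latency is about the time to first token plus the number of generated tokens divided by the per-channel generation speed. The sketch below applies that model to the concurrency-1 figures reported in this article; `estimated_latency` is a name invented here, and the model assumes every request generates exactly max_tokens tokens.

```python
def estimated_latency(max_tokens: int, tokens_per_s: float, ttft_s: float) -> float:
    """Latency estimate: time to first token plus steady generation time."""
    return ttft_s + max_tokens / tokens_per_s


# Concurrency-1 figures from the two runs (max_tokens 100 and 1024).
short_run = estimated_latency(100, 60.14, 0.0363)
long_run = estimated_latency(1024, 60.18, 0.0404)
print(f"max_tokens=100:  estimated {short_run:.2f}s, measured 1.6825s")
print(f"max_tokens=1024: estimated {long_run:.2f}s, measured 16.3665s")
```

The estimates come out near 1.7s and 17.1s. The measured 16.3665s is a little lower than the estimate, which would be consistent with some responses finishing before reaching max_tokens, though that is a guess.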
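With max_tokens at 1k, the trends from 1 to 200 concurrent requests are essentially the same as with 100 tokens:
Average time to first token (avg_ttft) rises from 0.0404s to 0.7275s, which is still fast.
Average per-channel throughput (avg) drops from 60.18 tokens/s to 20.89 tokens/s, so at 200 concurrent requests each user's generation speed is roughly 35% of the single-request speed (about 2.9 times slower), as expected.
Overall throughput rises from 59.95 tokens/s to 3157.15 tokens/s; the higher the concurrency, the more slowly overall throughput grows, which matches the usual pattern.
Average latency (avg_latency) rises from 16.3665s to 49.7491s, as expected.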
Server log: omitted
As concurrency goes from 1 to 200, GPU KV cache usage grows from single digits to 99.9%.
Steady-state resource usage: omitted
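Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 512 \
--max-model-len 65536
Client:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Tell me a story" \
  --max_tokens 16384 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
The server is unchanged; the client's max_tokens is changed to 16k (16384). Benchmark results:
With max_tokens at 16k, throughput barely rises between 150 and 200 concurrent requests. Average latency is 66 seconds, and P99 latency is already fairly high. At 16k, the 7B model still delivers stable service up to 100 concurrent requests. Note that a few colleagues were also calling the service while I was testing, so some of the results may not be accurate.
Server log: omitted
Resource usage: omitted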
Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
Client:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 100 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
On the server, max-num-seqs is set to 256 and max-model-len to 103632; on the client, max_tokens is set to 100.
Benchmark results:
Server log: omitted
Resource usage: omitted
Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
Client:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 1024 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
The server is unchanged; the client's max_tokens is set to 1024. Benchmark results:
Server log: omitted
Steady-state resource usage: omitted
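Server:
--gpu_memory_utilization 0.95 \
--max-num-seqs 256 \
--max-model-len 103632
Client:
python3 simple-bench-to-api.py \
  --url http://10.96.2.221:7869/v1 \
  --model DeepSeek-R1-Distill-Qwen-7B \
  --concurrencys 1,10,50,100,150,200 \
  --prompt "Introduce the history of China" \
  --max_tokens 16384 \
  --api_key "<API key configured on the server>" \
  --duration_seconds 30
The server is unchanged; the client's max_tokens is set to 16384 (16k). Benchmark results:
With the same max_tokens of 16k, after the server parameters were adjusted the throughput at 200 concurrent requests is 934 tokens/s, a bit over 17% higher than the 794 tokens/s before the adjustment, and P99 latency drops by nearly half. The cause is still being analyzed; it is also possible that the two runs were simply subject to different amounts of interference.
Server log: omitted
Resource usage: omitted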
Model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Deployment: k8s + ollama, ollama version 0.5.10
Quantization: the official ollama model deepseek-r1:32b, a 4-bit (Q4_K_M) quantized GGUF build of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Load generator location: the same machine as the inference service, which removes the effect of remote network overhead.
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 100 \
  --api_key "<API key set on the server; any string if none>" \
  --duration_seconds 30
Resource usage:
# Concurrency 1
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 30% 49C P2 340W / 450W | 23012MiB / 24564MiB | 79% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 5
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 30% 53C P2 379W / 450W | 23012MiB / 24564MiB | 75% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 10
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 42% 61C P2 372W / 450W | 23012MiB / 24564MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 15
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 53% 58C P2 356W / 450W | 23012MiB / 24564MiB | 72% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 20
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 53% 61C P2 378W / 450W | 23012MiB / 24564MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 30
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 55% 62C P2 382W / 450W | 23012MiB / 24564MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 1024 \
  --api_key "<API key set on the server; any string if none>" \
  --duration_seconds 30
Resource usage:
# Concurrency 1
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 30% 52C P2 333W / 450W | 23012MiB / 24564MiB | 87% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 5
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 33% 57C P2 370W / 450W | 23012MiB / 24564MiB | 79% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 10
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 53% 60C P2 376W / 450W | 23012MiB / 24564MiB | 77% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 15
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 54% 63C P2 385W / 450W | 23012MiB / 24564MiB | 78% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 20
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 65% 62C P2 376W / 450W | 23012MiB / 24564MiB | 77% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
# Concurrency 30
|=========================================+======================+======================|
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:41:00.0 Off | Off |
| 63% 59C P2 371W / 450W | 23012MiB / 24564MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
python3 simple-bench-to-api.py \
  --url http://10.96.0.188:11434/v1 \
  --model deepseek-r1:32b \
  --concurrencys 1,5,10,15,20,30 \
  --prompt "Introduce the history of China" \
  --max_tokens 16384 \
  --api_key "<API key set on the server; any string if none>" \
  --duration_seconds 30
Resource usage: omitted