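A practical look at training large AI models efficiently on a single RTX 4090. Key points: 1. LoRA fine-tuning of the DeepSeek-R1-32B model on a single 4090; 2. The key idea behind training the Qwen2.5 base model with GRPO on a single 4090; 3. The complete run of training the Qwen base model, with test results.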
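Two earlier articles covered hands-on training on a single 4090:
B: Reproducing the key idea of Deepseek-R1 by training the Qwen2.5 base model with GRPO on a single 4090
Both went straight to the experiments without much explanation, so here is some brief background: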
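A: LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-32B, the model obtained by distilling Deepseek-R1 into Qwen2.5-32B. LoRA is one specific method of PEFT (Parameter-Efficient Fine-Tuning). Put simply, the model's original weights are frozen and only a small number of added low-rank adapter weights are trained on a domain-specific dataset to improve results. The advantage is that it saves resources. Combined with unsloth's optimizations and int4 quantization, a 4090 with only 24 GB of VRAM can train this model, whose weight files alone take up 62 GB.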
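As a rough illustration of why int4 quantization plus LoRA makes this fit, the sketch below estimates weight memory at two precisions and the number of extra trainable parameters LoRA adds to one weight matrix. The parameter count of 32e9 is the nominal "32B" rather than the exact figure, and the 5120 x 5120 matrix with rank 16 is a hypothetical size chosen only for illustration. Quantization scales, activations, optimizer state, and the KV cache are all ignored.

```python
# Back-of-the-envelope weight memory for a nominal 32B-parameter model.
# Rough figures only: quantization scales, activations, optimizer state
# and the KV cache are not counted.

GIB = 1024 ** 3

def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / GIB

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Extra trainable parameters LoRA adds to one d_out x d_in matrix.

    LoRA keeps W frozen and learns B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + B @ A.
    """
    return rank * (d_in + d_out)

params = 32e9  # assumption: nominal "32B"; the exact count differs slightly
print(f"16-bit weights: {weight_gib(params, 16):.1f} GiB")
print(f"int4 weights:   {weight_gib(params, 4):.1f} GiB")

# Hypothetical 5120 x 5120 projection with rank 16 (illustrative sizes only)
full = 5120 * 5120
extra = lora_params(5120, 5120, 16)
print(f"LoRA adds {extra:,} trainable params next to {full:,} frozen ones")
```

Under these assumptions the 16-bit weights alone are far beyond 24 GB, while the int4 weights fit with room left for the adapters and activations.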
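B: trains a reasoning model starting from the original Qwen base model, Qwen2.5-3B. It applies the key method behind Deepseek-R1: reinforcement learning with the GRPO algorithm driven by simple reward functions, so that the model develops better reasoning ability. The weight updates still go through LoRA adapters (the training log below reports 59,867,136 trainable parameters), but the demands on VRAM and compute are much higher than for supervised fine-tuning, because every training step has to generate a group of completions per prompt and a vLLM inference engine shares the card with the trainer. Even with all the optimizations unsloth stacks on top, only a 3B model can be trained. With a 7B model the VRAM overflows in practice; this is still being optimized.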
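The core of GRPO is that each completion is scored relative to the other completions sampled for the same prompt, so no separate value model is needed. A minimal sketch of that group-relative advantage is shown below. The function name, the reward values, the use of the sample standard deviation, and the `eps` constant are all assumptions for illustration, not the training library's actual implementation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    # Assumption: sample standard deviation; a group of one has no spread.
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical total rewards for 4 completions sampled from one prompt
rewards = [3.0, 0.5, 0.5, 0.0]
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.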
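In the previous article, "Reproducing the key idea of Deepseek-R1 by training the Qwen2.5 base model with GRPO on a single 4090", the maximum number of training steps, max_steps, was set to only 250 to finish the test quickly. Afterwards the max_steps setting was removed so that the trainer computes the step count from the amount of data, and a complete run gave the following results:
Total steps: 22419
Epochs: 3.0
Training time: 17.3 hours in total (62352.0686 seconds)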
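These totals can be cross-checked against the figures printed in the training log further down (7,473 examples, 3 epochs, batch size 1, gradient accumulation 1). The sketch below is simplified arithmetic, not the trainer's own code; with a batch size of 1 the rounding details do not matter.

```python
import math

num_examples = 7473      # "Num examples" in the training log
num_epochs = 3           # "Num Epochs"
per_device_batch = 1     # "Batch size per device"
grad_accum = 1           # "Gradient Accumulation steps"
num_gpus = 1             # "Num GPUs"

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

runtime_s = 62352.0686   # "train_runtime" in the final log line
print(total_steps)                        # 22419
print(round(runtime_s / 3600, 1))         # 17.3 (hours)
print(round(runtime_s / total_steps, 2))  # 2.78 (seconds per step)
print(round(total_steps / runtime_s, 2))  # 0.36 (steps per second)
```

The last value matches the `train_steps_per_second` reported at the end of training.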
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:01:00.0 Off | Off |
| 30% 56C P2 251W / 450W | 18142MiB / 24564MiB | 93% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Log at the start of training:
INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00, 1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\\ /| Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \ Batch size per device = 1 | Gradient Accumulation steps = 1
\ / Total batch size = 1 | Total steps = 22,419
"-____-" Number of trainable parameters = 59,867,136
0%| | 5/22419 [00:14<18:12:54, 2.93s/it]-------------------- Question:
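The vLLM memory lines in this log can be reproduced with simple arithmetic, which helps when tuning `gpu_memory_utilization`. The sketch below uses only the numbers printed above, plus one assumption: a block size of 16 tokens per KV cache block. Results land within a few hundredths of a GiB of the logged values because the log rounds its inputs.

```python
total_gib = 23.65            # total_gpu_memory from the log
utilization = 0.59           # gpu_memory_utilization from the log
weights_gib = 5.77           # model weights
non_torch_gib = 0.08         # non_torch_memory
activation_peak_gib = 1.23   # PyTorch activation peak memory

# Memory vLLM may use, and what is left over for the KV cache
budget = total_gib * utilization
kv_cache = budget - weights_gib - non_torch_gib - activation_peak_gib
print(f"vLLM budget: {budget:.2f} GiB, KV cache: {kv_cache:.2f} GiB")

gpu_blocks = 12541           # "# CUDA blocks" from the log
block_size = 16              # assumption: tokens per KV cache block
tokens_per_request = 512
concurrency = gpu_blocks * block_size / tokens_per_request
print(f"max concurrency: {concurrency:.2f}x")
```

The remaining roughly 40% of the card is what the trainer itself has to work with, which is consistent with the 18142MiB total usage shown by nvidia-smi.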
Log at the end of training:
s/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>
Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.
</reasoning>
<answer>
299
</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}
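The reward names in this log (`strict_format_reward_func`, `int_reward_func`, `correctness_reward_func`) suggest rule-based checks on the response format and the extracted answer. A minimal sketch of what such checks could look like follows. The function bodies, the regular expressions, and the weights (0.5, 0.5, 2.0) are assumptions chosen to be consistent with the averages in the log, not the code used in the actual run.

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)
STRICT_RE = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$",
    re.DOTALL,
)

def extract_answer(text):
    """Return the text inside <answer>...</answer>, or '' if absent."""
    m = ANSWER_RE.search(text)
    return m.group(1).strip() if m else ""

def strict_format_reward(text, weight=0.5):
    """Reward responses that follow the exact tag layout."""
    return weight if STRICT_RE.match(text) else 0.0

def int_reward(text, weight=0.5):
    """Reward responses whose extracted answer is a plain integer."""
    return weight if extract_answer(text).isdigit() else 0.0

def correctness_reward(text, gold, weight=2.0):
    """Reward responses whose extracted answer equals the reference."""
    return weight if extract_answer(text) == gold else 0.0

response = (
    "<reasoning>\n380 - 57 = 323, then 323 - 24 = 299.\n</reasoning>\n"
    "<answer>\n299\n</answer>\n"
)
total = (
    strict_format_reward(response)
    + int_reward(response)
    + correctness_reward(response, "299")
)
print(extract_answer(response), total)  # 299 3.0
```

Because each check is a cheap deterministic rule, no learned reward model is needed; the per-step values in the log are averages over the completions sampled for that step.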
Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (i.e., points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:
import random
n = 10**8 # Number of random points to generate
count = 0 # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi = %f" % pi)
<reasoning>
We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.
</reasoning>
<answer>
3.14
</answer>
As shown, the trained model's answer follows the required format, is fairly concise, and already demonstrates reasoning ability.
9.919 is smaller than 9.92.
】,【】,【】,【】,【】,【】,...
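(This time the original model answered correctly. The blank characters after the answer are probably padding; because this is not an instruct model, the formatting tends to be messy.)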
<reasoning>
9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.
</reasoning>
<answer>
9.92
</answer>