GRPO-Training a Qwen Base Model with DeepSeek-R1's Ideas on a Single 4090: The Full Run

Published: 2025-02-23 08:28:45 · Views: 1525 · Source: 特沃兹道
Editor's note

A practical way to train large models efficiently on a single RTX 4090.

Key points:
1. LoRA fine-tuning of the DeepSeek-R1-32B distill on a single 4090
2. The key ideas behind training a Qwen2.5 base model with GRPO on a single 4090
3. The complete training run for the Qwen base model, with test results

杨芳贤
Founder of 53AI / Tencent Cloud (TVP) Most Valuable Expert

My two earlier hands-on articles on single-4090 training:

A. Fine-tuning DeepSeek-R1-32B on a single 4090 with LoRA

B. Reproducing DeepSeek-R1's key ideas by training a Qwen2.5 base model with GRPO on a single 4090

Both were written in a straight-to-the-point, just-get-it-done style, so here is a bit of background:

A: LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-32B, the model DeepSeek-R1 distilled into Qwen2.5-32B. LoRA is one concrete method within PEFT (Parameter-Efficient Fine-Tuning): put simply, you freeze most of the model's weights and train only a small set of added weights on a domain-specific dataset to improve performance there. The big advantage is resource efficiency. With unsloth's optimizations and int4 quantization on top, even a model whose weight files total 62 GB can be trained within a 4090's limited 24 GB of VRAM.
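For the curious, such a setup looks roughly like this with unsloth. A minimal sketch: the model repo name and hyperparameters here are illustrative assumptions, not the exact values used in article A.

from unsloth import FastLanguageModel

# Load the distilled model in 4-bit so its weights fit in 24 GB of VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Qwen-32B-bnb-4bit",  # assumed 4-bit repo
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters: the base weights stay frozen; only the small
# low-rank matrices (well under 1% of all parameters) are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # extra VRAM savings
)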

B: Training a reasoning model from the raw Qwen2.5-3B base model, applying exactly DeepSeek-R1's key method: reinforcement learning with simple reward functions plus the GRPO algorithm, so that the model acquires better reasoning ability. Here all of the original model's weights take part in training, so the demands on VRAM and compute are far higher. Even with every optimization buff unsloth stacks on, only a 3B model fits; in my tests a 7B model blows past the available VRAM, and I am still working on that.
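In code, the method boils down to a handful of rule-based reward functions handed to trl's GRPOTrainer. A minimal sketch, assuming the model, tokenizer, and dataset from the previous article's setup; the function mirrors the correctness_reward_func name that appears in the training logs below, but the body here is illustrative:

from trl import GRPOTrainer

def extract_answer(text):
    # Illustrative helper: pull out the text between <answer> tags.
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward_func(prompts, completions, answer, **kwargs):
    # 2.0 when the extracted answer equals the reference answer, else 0.
    # trl passes extra dataset columns (here `answer`) in as kwargs.
    responses = [completion[0]["content"] for completion in completions]
    return [2.0 if extract_answer(r) == a else 0.0
            for r, a in zip(responses, answer)]

trainer = GRPOTrainer(
    model=model,                    # Qwen2.5-3B loaded via unsloth
    processing_class=tokenizer,
    reward_funcs=[correctness_reward_func],   # the real run adds format rewards too
    args=training_args,             # a GRPOConfig; see the step-count sketch below
    train_dataset=dataset,          # GSM8K-style questions with reference answers
)
trainer.train()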


In the previous article, Reproducing DeepSeek-R1's key ideas by training a Qwen2.5 base model with GRPO on a single 4090, I set the maximum number of training steps (max_steps) to only 250 in order to finish the test quickly. I have since removed the max_steps setting, let the trainer work out the step count from the dataset size itself, and run the whole thing end to end. The results:


  • Total steps: 22,419

  • Epochs: 3.0

  • Training time: 17.3 hours total (62,352.0686 seconds)
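For reference, the only config change from the 250-step run is dropping max_steps; 3 epochs is simply the Hugging Face TrainingArguments default that takes over once max_steps no longer caps the run. The step count then follows from the data: 7,473 examples × 3 epochs / (batch size 1 × gradient accumulation 1) = 22,419 steps, matching the log. A sketch (all other arguments stay as in the previous article):

from trl import GRPOConfig

# Previously: max_steps=250 capped the run for a quick smoke test.
# Without it the trainer derives the step count from the dataset:
# 7,473 examples x 3 epochs / (batch 1 x grad-accum 1) = 22,419 steps.
training_args = GRPOConfig(
    num_train_epochs=3,              # the TrainingArguments default
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    output_dir="outputs",
)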


Peak resource usage:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
| 30%   56C    P2             251W / 450W |  18142MiB / 24564MiB |     93%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Training start log:

INFO 02-18 09:44:59 model_runner.py:1115] Loading model weights took 5.7701 GB
INFO 02-18 09:44:59 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-18 09:45:00 worker.py:267] Memory profiling takes 1.43 seconds
INFO 02-18 09:45:00 worker.py:267] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.59) = 13.96GiB
INFO 02-18 09:45:00 worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 6.89GiB.
INFO 02-18 09:45:01 executor_base.py:110] # CUDA blocks: 12541, # CPU blocks: 10922
INFO 02-18 09:45:01 executor_base.py:115] Maximum concurrency for 512 tokens per request: 391.91x
INFO 02-18 09:45:04 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:21<00:00,  1.45it/s]
INFO 02-18 09:45:26 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.15 GiB
INFO 02-18 09:45:26 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 27.19 seconds
Unsloth 2025.2.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 22,419
 "-____-"     Number of trainable parameters = 59,867,136
  0%|          | 5/22419 [00:14<18:12:54,  2.93s/it]
-------------------- Question:

Training end log:

s/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.25, 'rewards/int_reward_func': 0.25, 'rewards/correctness_reward_func': 1.0, 'reward': 1.7916667461395264, 'reward_std': 1.8719420433044434, 'kl': 0.4413377642631531, 'epoch': 3.0}
-------------------- Question:
Nellie had 380 legos, but she lost 57 of them and gave her sister 24 legos. How many legos does she have now?
Answer:
299
Response:
<reasoning>Nellie had 380 legos initially. She lost 57 legos, so she now has 380 - 57 = 323 legos. She then gave her sister 24 legos, so she now has 323 - 24 = 299 legos.</reasoning><answer>299</answer>
Extracted:
299
{'loss': 0.0023, 'grad_norm': 0.38597264885902405, 'learning_rate': 0.0, 'completion_length': 87.5, 'rewards/xmlcount_reward_func': 0.5, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.5, 'rewards/int_reward_func': 0.5, 'rewards/correctness_reward_func': 1.6666667461395264, 'reward': 3.1666667461395264, 'reward_std': 0.8164965510368347, 'kl': 0.058432161808013916, 'epoch': 3.0}
{'train_runtime': 62352.0686, 'train_samples_per_second': 0.36, 'train_steps_per_second': 0.36, 'train_loss': 0.006079863988740294, 'epoch': 3.0}
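The reward keys in these dicts (xmlcount, soft/strict format, int, correctness) each correspond to one small rule-based function. Two hedged sketches, assuming the <reasoning>/<answer> template visible in the sample response; the 0.5 maxima line up with the per-function values logged above, but the bodies are illustrative:

import re

def strict_format_reward_func(completions, **kwargs):
    # 0.5 when the completion follows the exact newline-separated
    # <reasoning>...</reasoning> <answer>...</answer> template, else 0.
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]

def int_reward_func(completions, **kwargs):
    # 0.5 when the extracted answer is a bare integer, as GSM8K-style answers are.
    responses = [completion[0]["content"] for completion in completions]
    extracted = [r.split("<answer>")[-1].split("</answer>")[0].strip()
                 for r in responses]
    return [0.5 if e.isdigit() else 0.0 for e in extracted]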

Test question 1: Calculate pi

The original model's answer:
Here is a possible way to calculate pi using a Monte Carlo simulation. Let's say we want to estimate pi with an accuracy of 5 decimal places (0.00001). We can do this by generating a large number of random points in the range [-1, 1] for both x and y coordinates. We can then count the number of points that fall within the circle of radius 1 centered at the origin (i.e., points that satisfy x^2 + y^2 < 1). The ratio of the number of points that fall within the circle to the total number of points will be an estimate of pi/4. To get an estimate of pi with an accuracy of 5 decimal places, we would need to generate approximately 10^8 random points. Here is some sample Python code that implements this method:
import random

n = 10**8  # Number of random points to generate
count = 0  # Number of points that fall within the circle
for i in range(n):
    x = random.uniform(-1, 1)
    y = random.uniform(-1, 1)
    if x**2 + y**2 < 1:
        count += 1
pi = count / n * 4
print("pi = %f" % pi)
This code generates 10^8 random points and counts the number of points that fall within the circle. It then calculates an estimate of pi and prints the result. Note that this method is not particularly fast or efficient, and other methods (such as the Chudnovsky algorithm) exist for computing pi with greater accuracy and speed.
The trained model's answer:

<reasoning>

We can use the Monte Carlo method to approximate pi. In this method, we generate random points within a square and count how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an approximation of pi.

</reasoning>

<answer>

3.14

</answer>

As you can see, the trained model's answer follows the required format and is quite concise; the model already shows reasoning ability.
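This format does not appear out of thin air: it is drilled into the base model by the format rewards above plus a system prompt that spells out the template. Likely something along these lines (the exact wording is an assumption):

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""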

Test question 2: Which is bigger? 9.919 or 9.92?
The original model's answer:

9.919 is smaller than 9.92.

】,【】,【】,【】,【】,【】,...

(This time the original model got the answer right; the junk characters that follow are presumably pre-filled padding. Since this is a base model rather than an instruct model, its output formatting is messy.)

The trained model's answer:

<reasoning>

9.919 is equivalent to 9.91900, while 9.92 is equivalent to 9.92000. Since 9.92000 is greater than 9.91900, 9.92 is bigger.

</reasoning>

<answer>

9.92

</answer>

As you can see, the trained model's answer matches the required format exactly, shows its reasoning process, and is concise and accurate. This is a clear improvement over the earlier 250-step run.
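To reproduce this comparison yourself, generate once from the plain base model and once from the trained weights. A sketch with unsloth, where SYSTEM_PROMPT is the template string sketched earlier; the checkpoint directory is a hypothetical placeholder, and the prompt is plain text because the base model has no chat template:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",  # or a trained dir, e.g. "outputs/checkpoint-22419" (hypothetical)
    max_seq_length=1024,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)   # switch unsloth into generation mode

prompt = SYSTEM_PROMPT + "\n\nWhich is bigger? 9.919 or 9.92?\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))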
