Although an LLM (Large Language Model) uses a loss function to guide learning in both the pretraining and alignment stages, the two stages differ significantly in how the loss is designed and what it aims to optimize.
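As a baseline, pretraining a decoder-only LLM optimizes a causal language modeling objective: next-token cross-entropy computed over every position of unlabeled text. A minimal sketch of that loss (the function name and shapes are illustrative, not taken from any particular library):

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy used during pretraining (illustrative sketch).

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids of the raw text
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = input_ids[..., 1:].contiguous()
    # Every token is scored -- there is no prompt/response masking here.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```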
In the supervised fine-tuning (alignment) stage, the loss is a token-level cross-entropy over labeled targets, optionally with label smoothing. Hugging Face `transformers` implements this in its `LabelSmoother` utility, which also skips positions whose label equals `ignore_index` (-100):

```python
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class LabelSmoother:
    """
    Adds label-smoothing on a pre-computed output from a Transformers model.

    Args:
        epsilon (`float`, *optional*, defaults to 0.1):
            The label smoothing factor.
        ignore_index (`int`, *optional*, defaults to -100):
            The index in the labels to ignore when computing the loss.
    """

    epsilon: float = 0.1
    ignore_index: int = -100

    def __call__(self, model_output, labels, shift_labels=False):
        logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
        if shift_labels:
            logits = logits[..., :-1, :].contiguous()
            labels = labels[..., 1:].contiguous()

        log_probs = -nn.functional.log_softmax(logits, dim=-1)
        if labels.dim() == log_probs.dim() - 1:
            labels = labels.unsqueeze(-1)

        padding_mask = labels.eq(self.ignore_index)
        # In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask
        # will ignore them in any case.
        labels = torch.clamp(labels, min=0)
        nll_loss = log_probs.gather(dim=-1, index=labels)
        # works for fp16 input tensor too, by internally upcasting it to fp32
        smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

        nll_loss.masked_fill_(padding_mask, 0.0)
        smoothed_loss.masked_fill_(padding_mask, 0.0)

        # Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):
        num_active_elements = padding_mask.numel() - padding_mask.long().sum()
        nll_loss = nll_loss.sum() / num_active_elements
        smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
        return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss
```
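A toy usage sketch (shapes and values are invented for illustration; in practice `Trainer` invokes `LabelSmoother` automatically when `label_smoothing_factor > 0`):

```python
import torch

# Toy batch: 1 sequence, 4 positions, vocabulary of 5 tokens.
logits = torch.randn(1, 4, 5)
labels = torch.tensor([[-100, -100, 2, 3]])  # prompt positions masked with -100

smoother = LabelSmoother(epsilon=0.1)
loss = smoother({"logits": logits}, labels, shift_labels=True)
print(loss)  # scalar tensor; only the unmasked, shifted positions contribute
```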
On the evaluation side of SFT, trainers that generate with `predict_with_generate` also need to strip the prompt from the generated tokens and align prompt/label lengths. The following `prediction_step` override from a `Seq2SeqTrainer` subclass does exactly that:

```python
# Methods of a Seq2SeqTrainer subclass; the surrounding class definition and
# imports (typing, typing_extensions.override, torch) are omitted here.
@override
def prediction_step(
    self,
    model: "torch.nn.Module",
    inputs: Dict[str, Union["torch.Tensor", Any]],
    prediction_loss_only: bool,
    ignore_keys: Optional[List[str]] = None,
) -> Tuple[Optional[float], Optional["torch.Tensor"], Optional["torch.Tensor"]]:
    r"""
    Removes the prompt part in the generated tokens.

    Subclass and override to inject custom behavior.
    """
    labels = inputs["labels"] if "labels" in inputs else None
    if self.args.predict_with_generate:
        assert self.tokenizer.padding_side == "left", "This method only accepts left-padded tensor."
        labels = labels.detach().clone() if labels is not None else None  # backup labels
        prompt_len, label_len = inputs["input_ids"].size(-1), inputs["labels"].size(-1)
        if prompt_len > label_len:
            inputs["labels"] = self._pad_tensors_to_target_len(inputs["labels"], inputs["input_ids"])
        if label_len > prompt_len:  # truncate the labels instead of padding the inputs (llama2 fp16 compatibility)
            inputs["labels"] = inputs["labels"][:, :prompt_len]

    loss, generated_tokens, _ = super().prediction_step(  # ignore the returned labels (may be truncated)
        model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
    )
    if generated_tokens is not None and self.args.predict_with_generate:
        generated_tokens[:, :prompt_len] = self.tokenizer.pad_token_id
        generated_tokens = generated_tokens.contiguous()

    return loss, generated_tokens, labels

def _pad_tensors_to_target_len(self, src_tensor: "torch.Tensor", tgt_tensor: "torch.Tensor") -> "torch.Tensor":
    r"""
    Pads the tensor to the same length as the target tensor.
    """
    assert self.tokenizer.pad_token_id is not None, "Pad token is required."
    padded_tensor = self.tokenizer.pad_token_id * torch.ones_like(tgt_tensor)
    padded_tensor[:, -src_tensor.shape[-1] :] = src_tensor  # adopt left-padding
    return padded_tensor.contiguous()  # in contiguous memory
```
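A standalone sketch of the left-padding behavior in `_pad_tensors_to_target_len`, using a made-up pad token id (this is not a call into the trainer above):

```python
import torch

pad_token_id = 0  # hypothetical pad id

src = torch.tensor([[5, 6, 7]])        # e.g. labels of length 3
tgt = torch.tensor([[1, 2, 3, 4, 5]])  # e.g. input_ids of length 5

padded = pad_token_id * torch.ones_like(tgt)
padded[:, -src.shape[-1]:] = src       # right-align the source, i.e. left-padding
print(padded)                          # tensor([[0, 0, 5, 6, 7]])
```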
| Aspect | Pretraining | Alignment (SFT) |
|---|---|---|
| Objective | Learn general language representations | Transfer to specific tasks |
| Data | Massive unlabeled data | High-quality labeled data |
| Loss function | Self-supervised objectives (MLM, CLM) | Supervised objectives (cross-entropy, MSE) |
| Loss behavior | Larger values; focused on language understanding | Smaller values; focused on task performance |
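In code, the most practical difference behind that table is label masking: SFT typically sets the prompt tokens' labels to `ignore_index = -100` so that only the response is scored, whereas pretraining scores every token. A minimal sketch of building such SFT labels (token ids are invented for illustration):

```python
import torch

IGNORE_INDEX = -100  # same ignore_index convention as LabelSmoother above


def build_sft_labels(prompt_ids: list, response_ids: list) -> torch.Tensor:
    """Mask the prompt so only response tokens contribute to the SFT loss."""
    return torch.tensor([[IGNORE_INDEX] * len(prompt_ids) + list(response_ids)])


print(build_sft_labels([11, 12, 13], [21, 22]))
# tensor([[-100, -100, -100, 21, 22]])
```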
Note that these are only the common differences; real setups can be more nuanced. For example, some pretraining recipes also use small amounts of labeled data, and some alignment methods also rely on self-supervised objectives.
In short, the loss design of both the pretraining and alignment stages matters: together they determine the final performance of the LLM.