
Training Llama 2 as an Embedding Model

Published: 2024-05-10 20:16:15

Author: 大魏分享


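1. How LLM2Vec works

The basics of causal language model training

When we train a language model to generate or predict text, a key concept is the causal language model. In short, if we give the model an input such as "The cat is", we want it to predict a plausible next word, such as "sleeping".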

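What "causal" means in training

In causal language model training, "causal" means that when the model predicts the next word, it relies only on the words that come before it and cannot see the words that follow.

For example, when processing the sentence "The cat is sleeping in the kitchen", the attention mask ensures that each prediction takes only the preceding words into account. This is enforced by applying a special mask during training, so that each position "sees" only the words up to the current position and none of the words after it.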
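To make the mask concrete, here is a minimal sketch in plain Python. It treats each word as one token, which is a simplification (real tokenizers split text into subwords), and the helper name `causal_mask` is made up for illustration.

```python
def causal_mask(n):
    """Return an n x n matrix; entry [i][j] is 1 if position i may attend to position j."""
    # Position i may look at itself and at everything before it, never at later positions.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["The", "cat", "is", "sleeping", "in", "the", "kitchen"]
mask = causal_mask(len(tokens))

for tok, row in zip(tokens, mask):
    visible = [t for t, m in zip(tokens, row) if m]
    print(f"{tok:>9}: {row}  sees {visible}")
```

The result is a lower-triangular matrix: the row for "sleeping" has ones for "The", "cat", "is", and "sleeping", and zeros for everything after it.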

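How bidirectional understanding relates to generating embeddings

To understand and process natural language well, we need the model to produce text embeddings, that is, to convert words, sentences, or documents into vectors that capture their semantic features.

Bidirectional understanding matters here because it lets the model take the context on both sides of each word into account, which yields more accurate embeddings. This improves embedding quality and also strengthens the model's ability to capture fine-grained semantics.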
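One common way to turn per-token vectors into a single text embedding is mean pooling, which is also the pooling mode passed to the LLM2Vec wrapper later in this article. The sketch below is a plain-Python illustration with made-up 2-dimensional vectors; the function name `mean_pool` is hypothetical, not part of any library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for k in range(dim):
                total[k] += vec[k]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]


# Toy example: two real tokens followed by one padding position.
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

embedding = mean_pool(token_vectors, attention_mask)
print(embedding)  # [2.0, 1.0]
```

Because every token vector contributes equally to the average, the quality of the pooled embedding depends on each token having seen the whole sentence, which is where bidirectional attention helps.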

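How LLM2Vec enables bidirectional understanding (https://github.com/McGill-NLP/llm2vec)

Converting the LLM into a bidirectional model:

Problem: the original large language models (such as the GPT series) are unidirectional and consider only the preceding context.
Solution: LLM2Vec replaces the causal attention mask with an all-ones matrix, so that every token can attend to both the preceding and the following context. This turns the model from unidirectional into bidirectional.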
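The effect of swapping the mask can be seen in a toy single-head attention computation. This is only a sketch: the scores are set to zero so that attention becomes a plain average over visible positions, and the `attention` helper is written for illustration rather than taken from LLM2Vec.

```python
import math


def attention(scores, values, mask):
    """Masked softmax attention for one head; scores[i][j] is query i's raw score for key j."""
    dim = len(values[0])
    outputs = []
    for i, row in enumerate(scores):
        # Masked-out positions get zero weight before normalization.
        weights = [math.exp(s) if mask[i][j] else 0.0 for j, s in enumerate(row)]
        z = sum(weights)
        weights = [w / z for w in weights]
        outputs.append([sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)])
    return outputs


n = 3
scores = [[0.0] * n for _ in range(n)]   # uniform scores: attention is a plain average
values = [[1.0], [2.0], [3.0]]           # one made-up value per token

causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
full = [[1] * n for _ in range(n)]       # the all-ones mask

causal_out = attention(scores, values, causal)
full_out = attention(scores, values, full)

print("causal:", causal_out)
print("full:  ", full_out)
```

With the causal mask the first token can only average over itself, so its output is 1.0. With the all-ones mask it averages over all three tokens and lands at roughly 2.0, meaning information from later tokens now reaches earlier positions.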

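Training with the MNTP objective:

Problem: switching to bidirectional attention alone may not be enough to produce high-quality embeddings.
Solution: introduce the masked next token prediction (MNTP) objective, which combines next-token prediction with masked language modeling. Some input tokens are masked, and the model learns to predict each masked token from the representation at the position just before it.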
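The following sketch shows how an MNTP training example could be laid out: a masked token at position i becomes the target at position i-1. It is a conceptual illustration on word-level tokens, not the data collator from the llm2vec repository. The `MASK` symbol, the `IGNORE` marker, and the function name are all assumptions made for this example.

```python
import random

MASK = "_"      # hypothetical stand-in for the mask token
IGNORE = None   # positions that contribute nothing to the loss


def make_mntp_example(tokens, mask_prob, rng):
    """Mask tokens at random; the label for a masked token i is placed at position i - 1."""
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)
    for i in range(1, len(tokens)):      # position 0 has no previous position to predict from
        if rng.random() < mask_prob:
            inputs[i] = MASK
            labels[i - 1] = tokens[i]    # the loss is read at i - 1, the target is the original token i
    return inputs, labels


tokens = ["The", "cat", "is", "sleeping", "in", "the", "kitchen"]
inputs, labels = make_mntp_example(tokens, mask_prob=0.3, rng=random.Random(0))

for position, (inp, lab) in enumerate(zip(inputs, labels)):
    print(position, inp, "->", lab)
```

Placing the label one position earlier keeps the objective aligned with what the decoder was pretrained to do (predict the next token), while the masking forces it to use context from both sides.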

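Unsupervised contrastive learning (SimCSE):

Problem: even with bidirectional attention and MNTP training, embedding quality still has room to improve.
Solution: apply contrastive learning. The same text is encoded twice with different dropout masks, and the model is trained to pull the two resulting representations together while pushing them away from the representations of other texts in the batch. This further improves generalization and embedding quality.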
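The contrastive objective can be sketched as an InfoNCE loss with in-batch negatives. The code below is a plain-Python illustration on made-up 2-dimensional vectors. The temperature of 0.05 is an assumption (a value commonly used with SimCSE), and in real training the two views would come from two forward passes of the same text with dropout enabled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def simcse_loss(view_a, view_b, temperature=0.05):
    """InfoNCE with in-batch negatives: view_a[i] should match view_b[i] and no other row."""
    losses = []
    for i, a in enumerate(view_a):
        logits = [cosine(a, b) / temperature for b in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two texts, each encoded twice; the small differences mimic dropout noise.
view_a = [[1.0, 0.1], [0.1, 1.0]]
view_b = [[0.9, 0.2], [0.0, 1.1]]

aligned = simcse_loss(view_a, view_b)
shuffled = simcse_loss(view_a, list(reversed(view_b)))
print(f"aligned pairs:  {aligned:.4f}")
print(f"shuffled pairs: {shuffled:.4f}")
```

The loss is small when each vector is closest to its own second view and large when the pairs are mismatched, which is the signal that pushes different texts apart in embedding space.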

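2. Training code

import torch
from llm2vec import LLM2Vec

# Load the base model through the LLM2Vec wrapper.
l2v = LLM2Vec.from_pretrained(
    # "meta-llama/Meta-Llama-3-8B",
    "meta-llama/Llama-2-7b-hf",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Save the wrapped model to a local directory.
l2v.save("Llama-2-7b-Emb")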

!git clone https://github.com/McGill-NLP/llm2vec.git

JSON_CONFIG = '''{
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-103-raw-v1",
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "do_train": true,
    "do_eval": true,
    "max_seq_length": 512,
    "mask_token_type": "blank",
    "data_collator_type": "all_mask",
    "mlm_probability": 0.8,
    "overwrite_output_dir": true,
    "output_dir": "Llama-2-7B-llm2vec-MNTP-Emb",
    "evaluation_strategy": "steps",
    "eval_steps": 100,
    "save_steps": 200,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2"
}'''

# Write the MNTP training configuration to disk.
with open("mntp_config.json", "w") as f:
    f.write(JSON_CONFIG)

!python llm2vec/experiments/run_mntp.py mntp_config.json
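A quick back-of-the-envelope check shows what this configuration implies for the effective batch size. The sketch copies only the relevant fields from the config above. The number of devices is an assumption (the article does not state it), and so is the reading of `stop_after_n_steps` as a count of optimizer steps.

```python
import json

# Only the fields needed for the arithmetic, copied from the training config.
config = json.loads('''{
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 16,
    "max_seq_length": 512,
    "stop_after_n_steps": 1000
}''')

num_devices = 1  # assumption: a single GPU

# Sequences that contribute to one optimizer update.
sequences_per_step = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_devices
)
# Upper bound on tokens per update (sequences are truncated to max_seq_length).
max_tokens_per_step = sequences_per_step * config["max_seq_length"]
# Assumption: stop_after_n_steps counts optimizer updates.
total_sequences = sequences_per_step * config["stop_after_n_steps"]

print(sequences_per_step)    # 128
print(max_tokens_per_step)   # 65536
print(total_sequences)       # 128000
```

Under these assumptions each update sees 128 sequences, so a small per-device batch of 8 still yields a reasonably large effective batch thanks to gradient accumulation.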

[Figure: resource utilization during training]

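Once training finishes, you can use the trained embedding model for validation.

In fact, Hugging Face already hosts embedding models trained by others. If you have a specialized corpus, you can train your own; otherwise a ready-made model works fine. Pick one with many downloads and likes: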

from llm2vec import LLM2Vec

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel

# Loading base Mistral model, along with custom code that enables bidirectional connections in decoder-only LLMs.
tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp")
config = AutoConfig.from_pretrained("McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

# Loading MNTP (Masked Next Token Prediction) model.
model = PeftModel.from_pretrained(
    model,
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
)

# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

# Encoding queries using instructions
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

print(cos_sim)

tensor([[0.6266, 0.4201],
        [0.3415, 0.5245]])
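To read this matrix: each row corresponds to a query and each column to a document, because the queries are multiplied by the transposed documents. The short sketch below copies the printed values into a plain list and picks the best document per query; the short document labels are made up for readability.

```python
# Values copied from the printed tensor: rows are queries, columns are documents.
cos_sim = [
    [0.6266, 0.4201],
    [0.3415, 0.5245],
]
queries = ["how much protein should a female eat", "summit define"]
documents = ["protein guideline passage", "summit definition passage"]  # illustrative labels

best_docs = []
for query, row in zip(queries, cos_sim):
    # Index of the highest-scoring document for this query.
    best = max(range(len(row)), key=row.__getitem__)
    best_docs.append(best)
    print(f"{query!r} -> {documents[best]!r} (score {row[best]})")
```

Both queries rank their matching passage first (0.6266 vs. 0.4201 and 0.5245 vs. 0.3415), which is what a working retrieval embedding should do.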

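Two final notes:

The base model behind the embedding model can be entirely different from the LLM used in RAG. For example, you can train an embedding model with LLM2Vec on top of Mistral and still use ChatGPT-4 as the LLM in your RAG pipeline.
If some vocabulary is missing from the training dataset (for example, the training set contains no Chinese), can the trained model still embed Chinese text?

Yes, but the model must have vocabulary tokens from which the words can be built. For Chinese, if the vocabulary does not include the characters used by the words you want to embed, the model falls back to the embedding of the unknown (UNK) token. (Tokenizers with byte fallback, such as the SentencePiece tokenizer used by Llama 2, split unseen characters into byte tokens rather than UNK, but those byte sequences carry little meaning unless the model saw the language during training.) In other words, either the base model itself contains the tokens, or the corpus used for LLM2Vec training contains Chinese. Choosing a base model with multilingual support is clearly key. The training set I used in this experiment is a public, English-only dataset: https://huggingface.co/datasets/wikitext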
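The difference between a tokenizer that maps unseen characters to UNK and one that falls back to raw bytes can be illustrated with two toy character-level tokenizers. Both functions and the tiny vocabulary are hypothetical and are far simpler than a real subword tokenizer.

```python
UNK = "<unk>"


def tokenize_with_unk(text, vocab):
    """Every character outside the vocabulary collapses to the same UNK token."""
    return [ch if ch in vocab else UNK for ch in text]


def tokenize_with_byte_fallback(text, vocab):
    """Characters outside the vocabulary are split into their UTF-8 bytes."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


vocab = set("abcdefghijklmnopqrstuvwxyz ")  # toy English-only vocabulary
text = "cat 猫"

unk_tokens = tokenize_with_unk(text, vocab)
byte_tokens = tokenize_with_byte_fallback(text, vocab)

print(unk_tokens)   # ['c', 'a', 't', ' ', '<unk>']
print(byte_tokens)  # ['c', 'a', 't', ' ', '<0xE7>', '<0x8C>', '<0xAB>']
```

With UNK, all unseen characters become indistinguishable. With byte fallback the information survives, but the model only produces useful embeddings for such sequences if it has seen enough text in that language during training.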