

A Practitioner's Guide to Retrieval Augmented Generation (RAG)

How basic techniques can be used to build powerful applications with LLMs...


Cameron R. Wolfe, Ph.D.

Feb 05, 2024

The recent surge of interest in generative AI has led to a proliferation of AI assistants that can be used to solve a variety of tasks, including anything from shopping for products to searching for relevant information. All of these interesting applications are powered by modern advancements in large language models (LLMs), which are trained over vast amounts of textual information to amass a sizable knowledge base. However, LLMs have a notoriously poor ability to retrieve and manipulate the knowledge that they possess, which leads to issues like hallucination (i.e., generating incorrect information), knowledge cutoffs, and poor understanding of specialized domains. Is there a way that we can improve an LLM’s ability to access and utilize high-quality information?

“If AI assistants are to play a more useful role in everyday life, they need to be able not just to access vast quantities of information but, more importantly, to access the correct information.” - source

The answer to the above question is a definitive “yes”. In this overview, we will explore one of the most popular techniques for injecting knowledge into an LLM—retrieval augmented generation (RAG). Interestingly, RAG is both simple to implement and highly effective at integrating LLMs with external data sources. As such, it can be used to improve the factuality of an LLM, supplement the model’s knowledge with more recent information, or even build a specialized model over proprietary data without the need for extensive finetuning.

What is Retrieval Augmented Generation?



In context learning adapts a single foundation model to solve many tasks via a prompting approach

Before diving in to the technical content of this overview, we need to build a basic understanding of retrieval augmented generation (RAG), how it works, and why it is useful. LLMs contain a lot of knowledge within their pretrained weights (i.e., parametric knowledge) that can be surfaced by prompting the model and generating output. However, these models also have a tendency to hallucinate—or generate false information—indicating that the parametric knowledge possessed by an LLM can be unreliable. Luckily, LLMs have the ability to perform in context learning (depicted above), defined as the ability to leverage information within the prompt to produce a better output1. With RAG, we augment the knowledge base of an LLM by inserting relevant context into the prompt and relying upon the in context learning abilities of LLMs to produce better output by using this context.

The Structure of a RAG Pipeline


“A RAG process takes a query and assesses if it relates to subjects defined in the paired knowledge base. If yes, it searches its knowledge base to extract information related to the user’s question. Any relevant context in the knowledge base is then passed to the LLM along with the original query, and an answer is produced.” - source

Given an input query, we normally respond to this query with an LLM by simply ingesting the query (possibly as part of a prompt template) and generating a response with the LLM. RAG modifies this approach by combining the LLM with a searchable knowledge base. In other words, we first use the input query to search for relevant information within an external dataset. Then, we add the info that we find to the model’s prompt when generating output, allowing the LLM to use this context (via its in context learning abilities) to generate a better and more factual response; see below. By combining the LLM with a non-parametric data source, we can feed the model correct, specific, and up-to-date information.

Adding relevant data to an LLM’s prompt in RAG

Cleaning and chunking. RAG requires access to a dataset of correct and useful information to augment the LLM’s knowledge base, and we must construct a pipeline that allows us to search for relevant data within this knowledge base. However, the external data sources that we use for RAG might contain data in a variety of different formats (e.g., pdf, markdown, and more). As such, we must first clean the data and extract the raw textual information from these heterogeneous data sources. Once this is done, we can “chunk” the data, or split it into sets of shorter sequences that typically contain around 100-500 tokens; see below.

Data preprocessing (cleaning and chunking) for RAG

The goal of chunking is to split the data into units of retrieval (i.e., pieces of text that we can retrieve as search results). An entire document could be too large to serve as a unit of retrieval, so we must split this document into smaller chunks. The most common chunking strategy is a fixed-size approach, which breaks longer texts into shorter sequences that each contain a fixed number of tokens. However, this is not the only approach! Our data may be naturally divided into chunks (e.g., social media posts or product descriptions on an e-commerce store) or contain separators that allow us to use a variable-size chunking strategy.
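To make this concrete, below is a minimal sketch of fixed-size chunking with a small token overlap between consecutive chunks. The tiktoken tokenizer, the 256-token chunk size, and the 32-token overlap are all illustrative choices; any tokenizer works, and the chunk size is a hyperparameter we should tune for our application.

```python
import tiktoken  # assumed available; any tokenizer (or whitespace splitting) works similarly

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into fixed-size chunks of chunk_size tokens with a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), chunk_size - overlap):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # the final window reached the end of the text
    return chunks
```

A variable-size strategy would instead split on natural separators (e.g., headings, paragraph breaks, or individual posts/products) before enforcing a token budget.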

Searching over chunks. Once we have cleaned our data and separated it into searchable chunks, we must build a search engine for matching input queries to chunks! Luckily, we have covered the topic of AI-powered search extensively in a prior overview. All of these concepts can be repurposed to build a search engine that can accurately match input queries to textual chunks in RAG.

First, we will want to build a dense retrieval system by i) using an embedding model2 to produce a corresponding vector representation for each of our chunks and ii) indexing all of these vector representations within a vector database. Then, we can embed the input query using the same embedding model and perform an efficient vector search to retrieve semantically-related chunks; see above.

A simple framework for AI-powered search
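Below is a rough sketch of that dense retrieval setup using an off-the-shelf bi-encoder and a flat vector index. The sentence-transformers model name and the use of faiss are common choices rather than requirements; a managed vector database would play the same role as the index here.

```python
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["first chunk of text ...", "second chunk ...", "third chunk ..."]

# i) Embed every chunk with a bi-encoder (model choice is illustrative).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embs = embedder.encode(chunks, normalize_embeddings=True)

# ii) Index the embeddings; with normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(chunk_embs.shape[1])
index.add(chunk_embs)

# Embed the query with the same model and retrieve the top-k semantically related chunks.
query_emb = embedder.encode(["What is retrieval augmented generation?"], normalize_embeddings=True)
scores, ids = index.search(query_emb, 2)
retrieved = [chunks[i] for i in ids[0]]
```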

Many RAG applications use pure vector search to find relevant textual chunks, but we can create a much better retrieval pipeline by re-purposing existing approaches from AI-powered search. Namely, we can augment dense retrieval with a lexical (or keyword-based) retrieval component, forming a hybrid search algorithm. Then, we can add a fine-grained re-ranking step—either with a cross-encoder or a less expensive component (e.g., ColBERT [10])—to sort candidate chunks based on relevance; see above for a depiction.
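As a sketch of the re-ranking step, the snippet below re-scores a handful of candidate chunks with a pretrained cross-encoder; the specific checkpoint is only an example, and a cheaper late-interaction model such as ColBERT [10] could be substituted.

```python
from sentence_transformers import CrossEncoder

query = "How does RAG reduce hallucinations?"
candidates = ["candidate chunk A ...", "candidate chunk B ...", "candidate chunk C ..."]

# A cross-encoder scores each (query, chunk) pair jointly, which is slower than a
# bi-encoder but gives a finer-grained relevance estimate (model name is illustrative).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in candidates])

# Sort the candidates by relevance score, highest first.
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
```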

More data wrangling. After retrieval, we might perform additional data cleaning on each textual chunk to compress the data or emphasize key information. For example, some practitioners add an extra processing step after retrieval that passes textual chunks through an LLM for summarization or reformatting prior to feeding them to the final LLM—this approach is common in LangChain. Using this approach, we can pass a compressed version of the textual information into the LLM’s prompt instead of the full document, thus saving costs.

Do we always search for chunks? Within RAG, we usually use search algorithms to match input queries to relevant textual chunks. However, there are several different algorithms and tools that can be used to power RAG. For example, practitioners have recently explored connecting LLMs to graph databases, forming a RAG system that can search for relevant information via queries to a graph database (e.g., Neo4J); see here. Similarly, researchers have found synergies between LLMs and recommendation systems [14], as well as directly connected LLMs to search APIs like Google or Serper for accessing up-to-date information.

Generating output with RAG

Generating the output. Once we have retrieved relevant textual chunks, the final step of RAG is to insert these chunks into a language model’s prompt and generate an output; see above. RAG comprises the full end-to-end process of ingesting an input query, finding relevant textual chunks, concatenating this context with the input query3, and using an LLM to generate an output based on the combined input. As we will see, such an approach has a variety of benefits.
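To make the end-to-end flow concrete, here is a minimal sketch of this final step: the retrieved chunks are stuffed into a prompt template and an LLM is asked to answer from that context. The OpenAI client and model name are used purely for illustration (any chat-style LLM works), and the template wording is a hypothetical example rather than a recommended prompt.

```python
from openai import OpenAI

def answer_with_rag(query: str, retrieved_chunks: list[str]) -> str:
    # Concatenate the retrieved chunks into a single context block.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    client = OpenAI()  # assumes an API key is configured in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```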

The Benefits of RAG

“RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations.” - from [8]

Implementing RAG allows us to specialize an LLM over a knowledge base of our choosing. Compared to other knowledge injection techniques—finetuning (or continued pretraining) is the primary alternative—RAG is both simpler to implement and computationally cheaper. As we will see, RAG also produces much better results compared to continued pretraining! However, implementing RAG still requires extra effort compared to just prompting a pretrained LLM, so we will briefly cover here the core benefits of RAG that make it worthwhile.

Reducing hallucinations. The primary reason that RAG is so commonly-used in practice is its ability to reduce hallucinations (i.e., generation of false information by the LLM). While LLMs tend to produce incorrect information when relying upon their parametric knowledge, the incorporation of RAG can drastically reduce the frequency of hallucinations, thus improving the overall quality of any LLM application and building more trust among users. Plus, RAG provides us with direct references to data that is used to generate information within the model’s output. We can easily provide the user with references to this information so that the LLM’s output can be verified against the actual data; see below.

User verification of context and output within RAG applications

Access to up-to-date information. When relying upon parametric knowledge, LLMs typically have a knowledge cutoff date. If we want to make this knowledge cutoff more recent, we would have to continually train the LLM over new data, which can be expensive. Plus, recent research has shown that finetuning tends to be ineffective at injecting new knowledge into an LLM—most information is learned during pretraining [7, 15]. With RAG, however, we can easily augment the LLM’s output and knowledge base with accurate and up-to-date information.

Data security. When we add data into an LLM’s training set, there is always a chance that the LLM will leak this data within its output. Recently, researchers have shown that LLMs are prone to data extraction attacks that can discover the contents of an LLM’s pretraining dataset via prompting techniques. As such, including proprietary data within an LLM’s training dataset is a security risk. However, we can still specialize an LLM to such data using RAG, which mitigates the security risk by never actually training the model over proprietary data.

“Retrieval-augmented generation gives models sources they can cite, like footnotes in a research paper, so users can check any claims. That builds trust.” - source

Ease of implementation. Finally, one of the biggest reasons to use RAG is the simple fact that the implementation is quite simple compared to alternatives like finetuning. The core ideas from the original RAG paper [1] can be implemented in only five lines of code, and there is no need to train the LLM itself. Rather, we can focus our finetuning efforts on improving the quality of the smaller, specialized models that are used for retrieval within RAG, which is much cheaper/easier.

From the Origins of RAG to Modern Usage


Many of the ideas used by RAG are derived from prior research on the topic of question answering. Interestingly, however, the original proposal of RAG in [1] was largely inspired (as revealed by the author of RAG) by a single paper [16] that augments the language model pretraining process with a similar retrieval mechanism. Namely, RAG was inspired by a “compelling vision of a trained system that had a retrieval index in the middle of it, so it could learn to generate any text output you wanted (source)”. Within this section, we will outline the origins of RAG and how this technique has evolved to be used in modern LLM applications.

**Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [1]**

(from [1])

RAG was first proposed in [1]—in 2020, when LLMs were less explored and Seq2Seq models were extremely popular—to help with solving knowledge-intensive tasks, or tasks that humans cannot solve without access to an external knowledge source. As we know, pretrained language models possess a lot of information within their parameters, but they have a notoriously poor ability to access and manipulate this knowledge base4. For this reason, the performance of language model-based systems was far behind that of specialized, extraction-based methods at the time of RAG’s proposal. Put simply, researchers were struggling to find an efficient and simple method of expanding the knowledge base of a pretrained model.

“The retriever provides latent documents conditioned on the input, and the seq2seq model then conditions on these latent documents together with the input to generate the output.” - from [1]

How can RAG help? The idea behind RAG is to improve a pretrained language model’s ability to access and use knowledge by connecting it with a non-parametric memory store—typically a set of documents or textual data over which we can perform retrieval; see below. Using this approach, we can dynamically retrieve relevant information from our datastore when generating output with the model. Not only does this approach provide extra (factual) context to the model, but it also allows us (i.e., the people using/training the model) to examine the results of retrieval and gain more insight into the LLM’s problem-solving process. In comparison, the generations of a pretrained language model are largely a black box!

RAG integrates LLMs with a searchable knowledge base

The pretrained model in [1] is actually finetuned using this RAG setup. As such, the RAG strategy proposed in [1] is not simply an inference-time technique for improving factuality. Rather, it is a general-purpose finetuning recipe that allows us to connect pretrained language models with external information sources.

Details on the setup. Formally, RAG considers an input sequence x (i.e., the prompt) and uses this input to retrieve documents z (i.e., the text chunks), which are used as context when generating a target sequence y. For retrieval, authors in [1] use the dense passage retrieval (DPR) model [2]5, a pretrained bi-encoder that uses separate BERT models to encode queries (i.e., query encoder) and documents (i.e., document encoder); see below. For generation, a pretrained BART model [3] is used. BART is an encoder-decoder (Seq2Seq) language model that is pretrained using a denoising objective6. Both the retriever and the generator in [1] are based upon pretrained models, which makes finetuning optional—the RAG setup already possesses the ability to retrieve and leverage knowledge via its pretrained components.

DPR bi-encoder setup (from [1])

The data used for RAG in [1] is a Wikipedia dump that is chunked into sequences of 100 tokens. The chunk size used for RAG is a hyperparameter that must be tuned depending upon the application. Each chunk is converted to a vector embedding using DPR’s pretrained document encoder. Using these embeddings, we can build an index for efficient vector search and retrieve relevant chunks when given a sequence of text (e.g., a prompt or message) as input.
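For reference, a rough sketch of embedding chunks and queries with the pretrained DPR encoders (as released on HuggingFace) is shown below; this is just one way to reproduce the setup from [1], and the checkpoint names are the publicly available single-NQ models.

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# DPR uses separate BERT-based encoders for documents (chunks) and for queries.
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

chunks = ["a ~100-token chunk of Wikipedia text ...", "another chunk ..."]
with torch.no_grad():
    # Each chunk and each query is mapped to a single dense vector (the pooled output).
    chunk_inputs = ctx_tok(chunks, padding=True, truncation=True, return_tensors="pt")
    chunk_embs = ctx_enc(**chunk_inputs).pooler_output
    query_inputs = q_tok("who wrote the great gatsby?", return_tensors="pt")
    query_emb = q_enc(**query_inputs).pooler_output

# DPR is trained with an inner-product objective, so chunks are scored by dot product.
scores = query_emb @ chunk_embs.T
```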

Training with RAG. The dataset used to train the RAG model in [1] contains pairs of input queries and desired responses. When training the model in [1], we first embed the input query using the query encoder of DPR and perform a nearest neighbor search within the document index to return the K most similar textual chunks. From here, we can concatenate a textual chunk with the input query and pass this concatenated input to BART to generate an output; see below.

(from [1, 3])

The model in [1] only takes a single document as input when generating output with BART. As such, we must marginalize over the top K documents when generating text, meaning that we predict a distribution over generated text using each individual document. In other words, we run a forward pass of BART with each of the different documents used as input. Then, we take a weighted sum over the model’s outputs (i.e., each output is a probability distribution over generated text) based upon the probability of the document used as input. This document probability is derived from the retrieval score (e.g., cosine similarity) of the document. In [1], two methods of marginalizing over documents are proposed:

  • RAG-Sequence: the same document is used to predict each target token.

  • RAG-Token: each target token is predicted with a different document.
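Written out, the two marginalization schemes from [1] look roughly as follows, where $p_\eta(z \mid x)$ is the retriever's score for document $z$ given query $x$ and $p_\theta$ is the BART generator:

```latex
% RAG-Sequence: a single retrieved document conditions the entire output sequence.
p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}K(x)} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: the marginalization over documents happens separately at every token.
p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}K(x)} p_\eta(z \mid x) \, p_\theta(y_i \mid x, z, y_{1:i-1})
```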

At inference time, we can generate an output sequence using either of these approaches using a modified form of beam search. To train the model, we simply use a standard language modeling objective that maximizes the log probability of the target output sequence. Notably, the RAG approach proposed in [1] only trains the DPR query encoder and the BART generator, leaving the document encoder fixed. This way, we can avoid having to constantly rebuild the vector search index used for retrieval, which would be expensive.

How does it perform? The RAG formulation proposed in [1] is evaluated across a wide variety of knowledge-intensive NLP tasks. On these datasets, the RAG formulation is compared to:

  • Extractive methods: operate by predicting an answer in the form of a span of text from a retrieved document.

  • Closed-book methods: operate by generating an answer to a question without any associated retrieval mechanism.

(from [1])

As shown in the tables above, RAG sets new state-of-the-art performance on open domain question answering tasks (left table), outperforming both extractive and Seq2Seq models. Interestingly, RAG even outperforms baselines that use a cross-encoder-style retriever for documents. Compared to extractive approaches, RAG is more flexible, as questions can still be answered even when they are not directly present within any of the retrieved documents.

“RAG combines the generation flexibility of the closed-book (parametric only) approaches and the performance of open-book retrieval-based approaches.” - from [1]

On abstractive question answering tests, RAG achieves near state-of-the-art performance. Unlike RAG, baseline techniques are given access to a gold passage that contains the answer to each question, and many questions are quite difficult to answer without access to this information (i.e., necessary information might not be present in Wikipedia). Despite this deficit, RAG tends to generate responses that are more specific, diverse, and factually grounded.

Using RAG in the Age of LLMs


The modern RAG pipeline

Although RAG was originally proposed in [1], this strategy—with some minor differences—is still heavily used today to improve the factuality of modern LLMs. The structure of RAG used for LLMs is shown within the figure above. The main differences between this approach and that of [1] are the following:

  • Finetuning is optional and oftentimes not used. Instead, we rely upon the in context learning abilities of the LLM to leverage the retrieved data.

  • Due to the large context windows present in most LLMs, we can pass several documents into the model’s input at once when generating a response7.

Going further, the RAG approach in [1] uses purely vector search (with a bi-encoder) to retrieve document chunks. However, there is no reason that we have to use pure vector search! Put simply, the document retrieval mechanism used for RAG is just a search engine. So, we can apply everything we know about AI-powered search to craft the best RAG pipeline possible!

“Giving your LLM access to a database it can write to and search across is very useful, but it’s ultimately best conceptualized as giving an agent access to a search engine, versus actually having more memory.” - source

Within this section, we will go over more recent research that builds upon work in [1] and applies this RAG framework to modern, generative (decoder-only) LLMs. As we will see, RAG is highly impactful in this domain due to the emergent ability of LLMs to perform in context learning. Namely, we can inject knowledge into an LLM by just including relevant information in the prompt!

(from [4])

How Context Affects Language Models' Factual Predictions [4]. Pretrained LLMs have factual information encoded within their parameters, but there are limitations with leveraging this knowledge base—pretrained LLMs tend to struggle with storing and extracting (or manipulating) knowledge in a reliable fashion. Using RAG, we can mitigate these issues by injecting reliable and relevant knowledge directly into the model’s input. However, existing approaches—including work in [1]—use a supervised approach for RAG, where the model is directly trained to leverage this context. In [4], authors explore an unsupervised approach for RAG that leverages a pretrained retrieval mechanism and generator, finding that the benefit of RAG is still large when no finetuning is performed; see above.

“Supporting a web scale collection of potentially millions of changing APIs requires rethinking our approach to how we integrate tools.” - from [5]

Gorilla: Large Language Models Connected with Massive APIs [5]. Combining language models with external tools is a popular topic in AI research. However, these techniques usually teach the underlying LLM to leverage a small, fixed set of potential tools (e.g., a calculator or search engine) to solve problems. In contrast, authors in [5] develop a retrieval-based finetuning strategy to train an LLM, called Gorilla, to use over 1,600 different deep learning model APIs (e.g., from HuggingFace or TensorFlow Hub) for problem solving; see below.

(from [5])

First, the documentation for all of these different deep learning model APIs is downloaded. Then, a self-instruct [6] approach is used to generate a finetuning dataset that pairs questions with an associated response that leverages a call to one of the relevant APIs. From here, the model is finetuned over this dataset in a retrieval-aware manner, in which a pretrained information retrieval system is used to retrieve the documentation of the most relevant APIs for solving each question. This documentation is then passed into the model’s prompt when generating output, thus teaching the model to leverage the documentation of retrieved APIs when solving a problem and generating API calls; see below.

(from [5])

Unlike most RAG applications, Gorilla is actually finetuned to better leverage its retrieval mechanism. Interestingly, such an approach allows the model to adapt to real-time changes in an API’s documentation at inference time and even enables the model to generate fewer hallucinations by leveraging relevant documentation.

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs [7]. In [7], authors study the concept of knowledge injection, which refers to methods of incorporating information from an external dataset into an LLM’s knowledge base.  Given a pretrained LLM, the two basic ways that we can inject knowledge into this model are i) finetuning (i.e., continued pretraining) and ii) RAG.

(from [7])

We see in [7] that RAG far outperforms finetuning with respect to injecting new sources of information into an LLM’s responses; see below. Interestingly, combining finetuning with RAG does not consistently outperform RAG alone, thus revealing the impact of RAG on the LLM’s factuality and response quality.

(from [7])

RAGAS: Automated Evaluation of Retrieval Augmented Generation [8]. RAG is an effective tool for LLM applications. However, the approach is difficult to evaluate, as there are many dimensions of “performance” that characterize an effective RAG pipeline:

  • The ability to identify relevant documents.

  • Properly exploiting data in the documents via in context learning.

  • Generating a high-quality, grounded output.

RAG is not just a retrieval system, but rather a multi-step process of finding useful information and leveraging this information to generate better output with LLMs. In [8], authors propose an approach, called Retrieval Augmented Generation Assessment (RAGAS), for evaluating these complex RAG pipelines without any human-annotated datasets or reference answers. In particular, three classes of metrics are used for evaluation:

  1. Faithfulness: the answer is grounded in the given context.

  2. Answer relevance: the answer addresses the provided question.

  3. Context relevance: the retrieved context is focused and contains as little irrelevant information as possible.

Together, these metrics—as claimed by authors in [8]—holistically characterize the performance of any RAG pipeline. Additionally, we can evaluate each of these metrics in an automated fashion by prompting powerful foundation models like ChatGPT or GPT-4. For example, faithfulness is evaluated in [8] by prompting an LLM to extract a set of factual statements from the generated answer, then prompting an LLM again to determine if each of these statements can be inferred from the provided context; see below. Answer and context relevance are evaluated similarly (potentially with some added tricks based on embedding similarity8).

Evaluating RAG faithfulness (from [8])
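A bare-bones sketch of this faithfulness check is shown below. The llm argument stands in for whichever judge model is used (e.g., a call to GPT-4), and the prompts are paraphrased for illustration rather than copied from [8].

```python
from typing import Callable

def faithfulness(question: str, answer: str, context: str, llm: Callable[[str], str]) -> float:
    """Estimate RAGAS-style faithfulness: the fraction of answer statements grounded in the context."""
    # Step 1: ask the judge LLM to break the answer into individual factual statements.
    reply = llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "List the individual factual statements made in this answer, one per line."
    )
    statements = [s.strip() for s in reply.splitlines() if s.strip()]
    if not statements:
        return 0.0

    # Step 2: ask the judge whether each statement can be inferred from the retrieved context.
    supported = 0
    for statement in statements:
        verdict = llm(
            f"Context:\n{context}\n\nStatement: {statement}\n"
            "Can this statement be inferred from the context above? Answer Yes or No."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(statements)
```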

Notably, the RAGAS toolset is not just a paper. These tools, which are now quite popular among LLM practitioners, have been implemented and openly released online. The documentation of RAGAS tools is provided at the link below.

RAGAS Docs

Practical Tips for RAG Applications


Although a variety of papers have been published on the topic of RAG, this technique is most popular among practitioners. As a result, many of the best takeaways for how to successfully use RAG are hidden within blog posts, discussion forums, and other non-academic publications. Within this section, we will capture some of this domain knowledge by outlining the most important practical lessons of which one should be aware when building a RAG application.

RAG is a Search Engine!


When applying RAG in practical applications, we should realize that the retrieval pipeline used for RAG is just a search engine! Namely, the same retrieval and ranking techniques that have been used by search engines for years can be applied by RAG to find more relevant textual chunks. From this realization, there are several practical tips that can be derived for improving RAG.

Don’t just use vector search. Many RAG systems purely leverage dense retrieval for finding relevant textual chunks. Such an approach is quite simple, as we can just i) generate an embedding for the input prompt and ii) search for related chunks in our vector database. However, semantic search has a tendency to yield false positives and may have noisy results. To solve this, we should perform hybrid retrieval using a combination of vector and lexical search—just like a normal (AI-powered) search engine! The approach to vector search does not change, but we can perform a parallel lexical search by:

  1. Extracting keywords from the input prompt9.

  2. Performing a lexical search with these keywords.

  3. Taking a weighted combination of results from lexical/vector search (see the sketch below).
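Here is a minimal sketch of that weighted combination, using BM25 (via the rank_bm25 package) for the lexical side and a bi-encoder for the vector side; the equal 0.5/0.5 weighting and the naive whitespace keyword extraction are just starting points that we would tune for a real application.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = ["chunk about pricing ...", "chunk about the return policy ...", "chunk about shipping ..."]
query = "what is the return policy?"

# Lexical scores: BM25 over (naively) whitespace-tokenized chunks and query keywords.
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
lexical = np.array(bm25.get_scores(query.lower().split()))

# Vector scores: cosine similarity from a bi-encoder (same style of model as before).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embs = embedder.encode(chunks, normalize_embeddings=True)
query_emb = embedder.encode(query, normalize_embeddings=True)
semantic = chunk_embs @ query_emb

# Normalize each score list to [0, 1] and take a weighted combination.
def rescale(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * rescale(lexical) + 0.5 * rescale(semantic)
ranked = [chunks[i] for i in np.argsort(-hybrid)]
```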

By performing hybrid search, we make our RAG pipeline more robust and reduce the frequency of irrelevant chunks in the model’s context. Plus, adopting keyword-based search allows us to perform clever tricks like promoting documents with important keywords, excluding documents with negative keywords, or even augmenting documents with synthetically-generated data for better matching!

Optimizing the RAG pipeline. To improve our retrieval system, we need to collect metrics that allow us to evaluate its results similarly to any normal search engine. One way this can be done is by displaying the textual chunks used for certain generations to the end user similarly to a citation, such that the user can use the information retrieved by RAG to verify the factual correctness of the model’s output. As part of this system, we could then prompt the user to provide binary feedback (i.e., thumbs up or thumbs down) as to whether the information was actually relevant; see below. Using this feedback, we can evaluate the results of our retrieval system using traditional search metrics (e.g., DCG or nDCG), test changes to the system via AB tests, and iteratively improve our results.

(from [17])
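As a worked example of scoring retrieval with this kind of feedback, the snippet below computes DCG and nDCG for a single query from binary relevance labels (1 for a thumbs up, 0 for a thumbs down) over the ranked list of retrieved chunks; this is standard search-metric arithmetic rather than anything RAG-specific.

```python
import math

def dcg(relevances: list[int]) -> float:
    # Discounted cumulative gain: relevant results count for less the lower they are ranked.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances: list[int]) -> float:
    # Normalize by the DCG of the ideal ordering (all relevant results ranked first).
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Binary thumbs-up/down feedback on the top-5 retrieved chunks, in ranked order.
feedback = [1, 0, 1, 0, 0]
print(ndcg(feedback))  # ~0.92 for this example
```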

Evaluations for RAG must go beyond simply verifying the results of retrieval. Even if we retrieve the perfect set of context to include within the model’s prompt, the generated output may still be incorrect. To evaluate the generation component of RAG, the AI community relies heavily upon automated metrics such as RAGAS [8] or LLM as a Judge [9]10, which perform evaluations by prompting LLMs like GPT-4; see here for more details. These techniques seem to provide reliable feedback on the quality of generated output. To successfully apply RAG in practice, however, it is important that we evaluate all parts of the end-to-end RAG system—including both retrieval and generation—so that we can reliably benchmark improvements that are made to each component.

Improving over time. Once we have built a proper retrieval pipeline and can evaluate the end-to-end RAG system, the last step of applying RAG is to perform iterative improvements using a combination of better models and data. There are a variety of improvements that can be investigated, including (but not limited to):

  • Adding ranking to the retrieval pipeline, either using a cross-encoder or a hybrid model that performs both retrieval and ranking (e.g., ColBERT [10]).

  • Finetuning the embedding model for dense retrieval over human-collected relevance data (i.e., pairs of input prompts with relevant/irrelevant passages).

  • Finetuning the LLM generator over examples of high-quality outputs so that it learns to better follow instructions and leverage useful context.

  • Using LLMs to augment either the input prompt or the textual chunks with extra synthetic data to improve retrieval.

For each of these changes, we can measure their impact over historical data in an offline manner. To truly understand whether they positively impact the RAG system, however, we should rely upon online AB tests that compare metrics from the new and improved system to the prior system in real-time tests with humans.

Optimizing the Context Window


Successfully applying RAG is not just a matter of retrieving the correct context—prompt engineering plays a massive role. Once we have the relevant data, we must craft a prompt that i) includes this context and ii) formats it in a way that elicits a grounded output from the LLM. Within this section, we will investigate a few strategies for crafting effective prompts with RAG to gain a better understanding of how to properly include context within a model’s prompt.

RAG needs a larger context window. During pretraining, an LLM sees input sequences of a particular length. This choice of sequence length during pretraining becomes the model’s context length. Recently, we have seen a trend in AI research towards the creation of LLMs with longer context lengths11. See, for example, MPT-StoryWriter-65K, Claude-2.1, or GPT-4-Turbo, which have context lengths of 65K, 200K, and 128K, respectively. For reference, the Great Gatsby (i.e., an entire book!) only contains ~70K tokens. Although not all LLMs have a large context window, RAG requires a model with a large context window so that we can include a sufficient number of textual chunks in the model’s prompt.

Maximizing diversity. Once we’ve been sure to select an LLM with a sufficiently large context length, the next step in applying RAG is to determine how to select the best context to include in the prompt. Although the textual chunks to be included are selected by our retrieval pipeline, we can optimize our prompting strategy by adding a specialized selection component12 that sub-selects the results of retrieval. Selection does not change the retrieval process of RAG. Rather, selection is added to the end of the retrieval pipeline—after relevant chunks of text have already been identified and ranked—to determine how documents can best be sub-selected and ordered within the resulting prompt.

One popular selection approach is a diversity ranker, which can be used to maximize the diversity of textual chunks included in the model’s prompt by performing the following steps:

  1. Use the retrieval pipeline to generate a large set of documents that could be included in the model’s prompt.

  2. Select the document that is most similar to the input (or query), as determined by embedding cosine similarity.

  3. For each remaining document, select the document that is least similar to the documents that are already selected13.

Notably, this strategy solely optimizes for the diversity of selected context, so it is important that we apply this selection strategy after a set of relevant documents has been identified by the retrieval pipeline. Otherwise, the diversity ranker would select diverse, but irrelevant, textual chunks to include in the context.
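A minimal sketch of this greedy diversity ranker is shown below. It assumes we already have L2-normalized embeddings for the query and for the chunks that the retrieval pipeline deemed relevant, and it interprets "least similar" as having the smallest maximum similarity to any already-selected chunk.

```python
import numpy as np

def diversity_select(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int) -> list[int]:
    """Greedily pick k chunk indices: start with the chunk most similar to the query, then
    repeatedly add the chunk least similar to the chunks selected so far.
    Embeddings are assumed L2-normalized, so dot product equals cosine similarity."""
    selected = [int(np.argmax(chunk_embs @ query_emb))]
    while len(selected) < min(k, len(chunk_embs)):
        remaining = [i for i in range(len(chunk_embs)) if i not in selected]
        # For each remaining chunk, find its highest similarity to any selected chunk...
        closeness = [max(float(chunk_embs[i] @ chunk_embs[j]) for j in selected) for i in remaining]
        # ...and add the chunk whose closest selected neighbor is farthest away.
        selected.append(remaining[int(np.argmin(closeness))])
    return selected
```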

Lost in the middle selection for RAG

Optimizing context layout. Despite increases in context lengths, recent research indicates that LLMs struggle to capture information in the middle of a large context window [11]. Information at the beginning and end of the context window is captured most accurately, causing certain data to be “lost in the middle”. To solve this issue, we can adopt a selection strategy that is more mindful of where context is placed in the prompt. In particular, we can take the relevant textual chunks from our retrieval pipeline and iteratively place the most relevant chunks at the beginning and end of the context window; see below. Such an approach avoids inserting textual chunks in order of relevance, choosing instead to place the most relevant chunks at the beginning and end of the prompt.
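One simple way to implement this layout, sketched below, is to take the chunks already sorted by relevance and interleave them so that the most relevant chunks land at the two ends of the context window while the least relevant ones sit in the middle.

```python
def reorder_for_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the beginning and end of the prompt, pushing the
    least relevant chunks toward the middle, where they are most likely to be ignored."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):  # input is sorted, most relevant first
        if i % 2 == 0:
            front.append(chunk)    # ranks 1, 3, 5, ... fill the front, in order
        else:
            back.insert(0, chunk)  # ranks 2, 4, 6, ... fill the back, from the end inward
    return front + back

# Chunks ranked 1-5 by relevance end up ordered [1, 3, 5, 4, 2] in the prompt.
print(reorder_for_context(["c1", "c2", "c3", "c4", "c5"]))
```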

Data Cleaning and Formatting


In most RAG applications, our model will be retrieving textual information from many different sources. For example, an assistant that is built to discuss the details of a codebase with a programmer may pull information from the code itself, documentation pages, blog posts, user discussion threads, and more. In this case, the data being used for RAG has a variety of different formats that could lead to artifacts (e.g., logos, icons, special symbols, and code blocks) within the text that have the potential to confuse the LLM when generating output. In order for the application to function properly, we must extract, clean, and format the text from each of these heterogeneous sources. Put simply, there’s a lot more to preprocessing data for RAG than just splitting textual data into chunks!

(from [12])

Performance impact. If text is not extracted properly from each knowledge source, the performance of our RAG application will noticeably deteriorate! On the flip side, cleaning and formatting data in a standardized manner will noticeably improve performance. As shown in this blog post, investing into proper data preprocessing for RAG has several benefits (see above):

  • 20% boost in the correctness of LLM-generated answers.

  • 64% reduction in the number of tokens passed into the model14.

  • Noticeable improvement in overall LLM behavior.

“We wrote a quick workflow that leveraged LLM-as-judge and iteratively figured out the cleanup code to remove extraneous formatting tokens from Markdown files and webpages.” - from [12]

Data cleaning pipeline. The details of any data cleaning pipeline for RAG will depend heavily upon our application and data. To craft a functioning data pipeline, we should i) observe large amounts of data within our knowledge base, ii) visually inspect whether unwanted artifacts are present, and iii) amend issues that we find by adding changes to the data cleaning pipeline. Although this approach isn’t flashy or cool, any AI/ML practitioner knows that 90% of time building an application will be spent observing and working with data.
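As a small example of the kind of amendment this loop produces, the sketch below strips a few common artifacts (HTML tags, markdown images and links, excess whitespace) before chunking; the specific rules here are hypothetical and would be grown iteratively from inspecting our own data.

```python
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                    # drop HTML tags
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)      # drop markdown images (e.g., logos)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # keep link text, drop the URL
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                 # collapse excessive blank lines
    return text.strip()

raw = "<div># Returns\n\n\nSee [policy](https://example.com) ![logo](img.png)</div>"
print(clean_text(raw))  # prints "# Returns", a blank line, then "See policy"
```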

If we aren’t interested in manually inspecting data and want a sexier approach, we can automate the process of creating a functional data preprocessing pipeline by using LLM-as-a-Judge [9] to iteratively construct the code for cleaning up and properly formatting data. Such an approach was recently shown to retain useful information, remove formatting errors, and drastically reduce the average size of documents [12]. See here for the resulting data preprocessing script and below for an example of a reformatted document after cleanup.

Textual chunk before and after data cleaning (from [12])

Further Practical Resources for RAG


As previously mentioned, some of the best resources for learning about RAG are not published within academic journals or conferences. There are a variety of blog posts and practical write ups that have helped me to gain insight for how to better leverage RAG. Some of the most notable resources are outlined below.

  • What is Retrieval Augmented Generation? [link]

  • Building RAG-based LLM Applications for Production [link]

  • Best Practices for LLM Evaluation of RAG Applications [link]

  • Building Conversational Search with RAG at Vespa [link]

  • RAG Finetuning with Ray and HuggingFace [link]

Closing Thoughts


At this point, we should have a comprehensive grasp of RAG, its inner workings, and how we can best approach building a high-performing LLM application using RAG. Both the concept and implementation of RAG are simple, which—when combined with its impressive performance—is what makes the technique so popular among practitioners. However, successfully applying RAG in practice involves more than putting together a minimal functioning pipeline with pretrained components. Namely, we must refine our RAG approach by:

  1. Creating a high-performing hybrid retrieval algorithm (potentially with a re-ranking component) that can accurately identify relevant textual chunks.

  2. Constructing a functional data preprocessing pipeline that properly formats data and removes harmful artifacts before the data is used for RAG.

  3. Finding the correct prompting strategy that allows the LLM to reliably incorporate useful context when generating output.

  4. Putting detailed evaluations in place for both the retrieval pipeline (i.e., using traditional search metrics) and the generation component (using RAGAS or LLM-as-a-judge [8, 9]).

  5. Collecting data over time that can be used to improve the RAG pipeline’s ability to discover relevant context and generate useful output.

Going further, creating a robust evaluation suite allows us to improve each of the components listed above by quantitatively testing (via offline metrics or an AB test) iterative improvements to our RAG pipeline, such as a modified retrieval algorithm or a finetuned component of the system. As such, our approach to RAG should mature (and improve!) over time as we test and discover new ideas.

Bibliography

[1] Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.

[2] Karpukhin, Vladimir, et al. "Dense passage retrieval for open-domain question answering." arXiv preprint arXiv:2004.04906 (2020).

[3] Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).

[4] Petroni, Fabio, et al. "How context affects language models' factual predictions." arXiv preprint arXiv:2005.04611 (2020).

[5] Patil, Shishir G., et al. "Gorilla: Large language model connected with massive APIs." arXiv preprint arXiv:2305.15334 (2023).

[6] Wang, Yizhong, et al. "Self-instruct: Aligning language model with self generated instructions." arXiv preprint arXiv:2212.10560 (2022).

[7] Ovadia, Oded, et al. "Fine-tuning or retrieval? Comparing knowledge injection in LLMs." arXiv preprint arXiv:2312.05934 (2023).

[8] Es, Shahul, et al. "RAGAS: Automated evaluation of retrieval augmented generation." arXiv preprint arXiv:2309.15217 (2023).

[9] Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).

[10] Khattab, Omar, and Matei Zaharia. "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT." Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020.

[11] Liu, Nelson F., et al. "Lost in the middle: How language models use long contexts." arXiv preprint arXiv:2307.03172 (2023).

[12] Leng, Quinn, et al. "Announcing MLflow 2.8 LLM-as-a-judge metrics and Best Practices for LLM Evaluation of RAG Applications, Part 2." https://www.databricks.com/blog/announcing-mlflow-28-llm-judge-metrics-and-best-practices-llm-evaluation-rag-applications-part (2023).

[13] Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.

[14] Wang, Yan, et al. "Enhancing recommender systems with large language model reasoning graphs." arXiv preprint arXiv:2308.10835 (2023).

[15] Zhou, Chunting, et al. "LIMA: Less is more for alignment." arXiv preprint arXiv:2305.11206 (2023).

[16] Guu, Kelvin, et al. "Retrieval augmented language model pre-training." International Conference on Machine Learning. PMLR, 2020.

[17] Glaese, Amelia, et al. "Improving alignment of dialogue agents via targeted human judgements." arXiv preprint arXiv:2209.14375 (2022).

1

Interestingly, in context learning is an emergent capability of LLMs, meaning that it is most noticeable in larger models. In context learning ability was first demonstrated by the impressive few-shot learning capabilities of GPT-3 [13].

2

In nearly all cases, we will use an encoder-only embedding model (e.g., BERT, sBERT, ColBERT, etc.) for vector search. However, recent research has indicated that decoder-only models (i.e., the architecture used for most modern, generative LLMs) can produce high-quality embeddings as well!

3

We can also explore other ways of adding context to the query, such as by creating a more generic prompt template.

4

For more information, check out recent research on the reversal curse and knowledge manipulation within LLMs. These models oftentimes struggle to perform even simple manipulations (e.g., reversal) of factual relationships within their knowledge base.

5

The original RAG paper purely uses vector search (with a bi-encoder) to retrieve relevant documents.

6

The denoising objective used by BART considers several perturbations to the original sequence of text, such as token masking/deletion, masking entire sequences of tokens, permuting sentences in a document, or even rotating a sequence about a chosen token. Given the permuted input, the BART model is trained to reconstruct the original sequence of text during pretraining.

7

The number of textual chunks that we actually pass into the model’s prompt is dependent upon several factors, such as i) the model’s context window, ii) the chunk size, and iii) the application we are solving.

8

Context relevance follows a simple approach of prompting an LLM to determine whether sentences from the retrieved context are actually relevant or not. For answer relevance, however, we prompt an LLM to generate potential questions associated with the generated answer, then we take the average cosine similarity between the embeddings of these questions and the actual question as the final score.

9

This can be done via traditional query understanding techniques, or we can simply prompt an LLM to generate a list of keywords associated with the input.

10

LLMs can effectively evaluate unstructured outputs (semi-)reliably and at a low cost. However, human feedback remains the gold standard for evaluating an LLM’s output.

11

Plus, there has been a ton of research on extending the context length of existing, pretrained LLMs or making them more capable of handling longer inputs; e.g., ALiBi, RoPE, Self Extend, LongLoRA, and more.

12

Here, I call this step “selection” rather than ranking as to avoid confusion with re-ranking within search, which sorts documents based on textual relevance. Selection refers to the process of deciding the order of documents as they are inserted into the model’s prompt, and textual relevance is assumed to already be known at this step.

13

This is a greedy approach for selecting the most diverse subset of documents. The resulting set is not optimal in terms of diversity, but this efficient approximation does a good job of constructing a diverse set of documents in practice.

14

The cost reduction is due to a reduction in the average size of textual chunks after artifacts and unnecessary components are removed from the text.


