
RAG in Practice: Inaccurate Retrieval Results? Try These Three Practical Methods

Published: 2024-10-11 08:31:08

Author: 风叔云


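Preface

In "RAG in Practice: Building a Minimum Viable RAG System", Feng Shu described the implementation framework of a RAG system in detail and showed how to build the most basic Naive RAG.

In the previous four articles, Feng Shu covered optimizations for the Indexing, Query Translation, Routing, and Query Construction stages.

This article focuses on Retrieval. Feng Shu explains in detail how to improve the results a RAG system retrieves, and with them the accuracy of the LLM's answers.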

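At retrieval time, the embedding model converts the user's question into a vector. The system then searches the vector database for knowledge text or past conversation records that are semantically similar to the question vector and returns them.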
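To make the retrieval step concrete, here is a minimal, self-contained sketch. It does not use a real embedding model or vector database. The toy `embed` function, a bag-of-words term counter, is a hypothetical stand-in for the embedding model, and a plain Python list plays the part of the vector database. Similarity is measured with cosine similarity.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical toy "embedding": a bag-of-words term-count vector.
    # A real system would call an embedding model here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[tuple[str, float]]:
    # Embed the question, score every stored chunk, return the top-k most similar.
    q = embed(question)
    index = [(chunk, embed(chunk)) for chunk in chunks]  # stands in for the vector DB
    scored = [(chunk, cosine(q, vec)) for chunk, vec in index]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]

chunks = [
    "Agent memory stores past interactions for later recall.",
    "The weather today is sunny.",
    "Long-term memory lets an agent recall facts across sessions.",
]
for chunk, score in retrieve("How does agent memory work?", chunks):
    print(f"{score:.3f}  {chunk}")
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index. The scoring idea stays the same.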

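In Naive RAG, every retrieved chunk is passed straight to the LLM to generate the answer. This causes several problems: content in the middle of a long context gets lost ("lost in the middle"), noise makes up a large share of the context, and the context-length limit is easily reached.

Below, we walk through the source code for three ways to improve retrieval accuracy: Reranking, Refinement (compression), and Corrective RAG.

The link to the full source code is given at the end of the article.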

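1. Rerank (Reranking)

As the name suggests, reranking reorders the retrieved results according to some rule or logic. The goal is to put the more relevant or more accurate results first and so improve retrieval quality.

There are two main types of reranking: statistics-based scoring and deep-learning-based reranking.

Statistics-based reranking merges candidate lists from multiple sources and re-scores every result. It does this either with a weighted sum of the scores from each recall path or with the Reciprocal Rank Fusion (RRF) algorithm. The method is simple to compute, cheap, and fast. That makes it widely used in latency-sensitive traditional retrieval systems, such as internal knowledge-base search and e-commerce customer-service search.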
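Besides RRF, the paragraph mentions fusing multi-path recall results with weighted scores. Here is a minimal sketch of that option. It assumes that each recall path returns `(doc_id, raw_score)` pairs. It also assumes that scores are min-max normalized per path, which is one common choice and not something the article prescribes. The path names and weights are hypothetical.

```python
def min_max_normalize(results: list[tuple[str, float]]) -> dict[str, float]:
    # Rescale one path's scores to [0, 1] so that different retrievers are comparable.
    if not results:
        return {}
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {doc: 1.0 for doc, _ in results}
    return {doc: (s - lo) / (hi - lo) for doc, s in results}

def weighted_fusion(paths: dict[str, list[tuple[str, float]]],
                    weights: dict[str, float]) -> list[tuple[str, float]]:
    # Sum weight * normalized score across all recall paths. A document
    # missing from a path contributes nothing for that path.
    fused: dict[str, float] = {}
    for name, results in paths.items():
        for doc, score in min_max_normalize(results).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Hypothetical results from a keyword path (BM25-like scores) and a vector path (cosine).
paths = {
    "bm25": [("doc_a", 12.0), ("doc_b", 8.0), ("doc_c", 2.0)],
    "vector": [("doc_b", 0.91), ("doc_c", 0.85), ("doc_d", 0.40)],
}
print(weighted_fusion(paths, weights={"bm25": 0.3, "vector": 0.7}))
```

Normalization matters here because keyword scores and cosine similarities live on different scales. RRF sidesteps the issue by using only ranks.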

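The article "RAG in Practice: Five Advanced Methods for Optimizing Query Translation, So the LLM Truly Understands User Intent" used the reciprocal_rank_fusion function from RAG Fusion. That function is exactly this kind of statistics-based reranking. Here it is again for review: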

from langchain_core.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):
    """Reciprocal rank fusion that takes multiple lists of ranked documents
    and an optional parameter k used in the RRF formula."""
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}
    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Serialize the document to a string to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in fused_scores, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Update the document's score using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Sort the documents by fused score in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    # Return the reranked results as a list of (document, fused score) tuples
    return reranked_results

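Deep-learning-based reranking is usually done with a cross-encoder reranker. This is a specially trained neural network that reads the question and a document together, which lets it judge their relevance more accurately. It outputs a semantic relevance score for each question–document pair. The score depends only on the text of the question and the document, not on the document's position in the retrieval results. The approach is more accurate, but it also costs more and responds more slowly. It is best suited to scenarios that demand very high retrieval precision, such as medical consultation.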
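The key property described above is that each (question, document) pair is scored on its own text, regardless of retrieval order. Here is a minimal sketch of that reranking loop. The scorer is pluggable. `toy_pair_score` is a hypothetical stand-in based on term overlap, used so the example runs offline. In a real system the scorer would be a cross-encoder model, for example the `predict` method of sentence-transformers' `CrossEncoder` applied to (query, document) pairs.

```python
import re
from typing import Callable

def toy_pair_score(query: str, doc: str) -> float:
    # Hypothetical stand-in for a cross-encoder: scores the (query, doc) pair from
    # their text alone, here as the fraction of query terms that appear in the doc.
    q_terms = set(re.findall(r"\w+", query.lower()))
    d_terms = set(re.findall(r"\w+", doc.lower()))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[tuple[str, float]]:
    # Score every (query, doc) pair independently. The incoming order of `docs`
    # does not affect any score; it only breaks ties, because sort() is stable.
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

retrieved = [
    "Paris is the capital of France.",
    "Aspirin can reduce fever and relieve mild pain.",
    "Common side effects of aspirin include stomach upset.",
]
print(rerank("aspirin side effects", retrieved, toy_pair_score, top_n=2))
```

A real cross-encoder runs one forward pass per pair. That per-pair cost is why it is usually applied only to a short candidate list from a cheaper first-stage retriever.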

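We can also rerank with Cohere's Rerank model. Note that Cohere is a commercial API provider, not an open-source tool. Its Rerank models are served through Cohere's hosted API and require an API key, and LangChain wraps them as CohereRerank.

Usage is very simple, as the following code shows: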

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # requires the langchain-cohere package and COHERE_API_KEY

# `vectorstore` and `question` are assumed to be defined earlier in the pipeline
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Re-rank the 10 retrieved candidates with Cohere's hosted Rerank model
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(question)

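2. Refinement (Compression)

Refinement (compression) means the retrieved chunks are not fed straight into the LLM. Irrelevant content is removed first so that only the important context remains. This shortens the overall prompt and reduces the interference that redundant information causes for the LLM.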
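The idea can be sketched without LangChain. In this minimal example, each retrieved chunk is cut down to the sentences that share terms with the question, and chunks with nothing relevant left are dropped. The stopword list and the keyword-overlap test are hypothetical simplifications. Real compressors typically ask an LLM to decide what is relevant.

```python
import re

# Hypothetical stopword list so that words like "the" do not count as overlap.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "who", "did", "was", "is", "he", "it"}

def terms(text: str) -> set[str]:
    # Lowercased word set minus stopwords.
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def compress(question: str, docs: list[str], min_overlap: int = 1) -> list[str]:
    q = terms(question)
    compressed = []
    for doc in docs:
        # Split each chunk into sentences and keep only those relevant to the question.
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        kept = [s for s in sentences if len(q & terms(s)) >= min_overlap]
        if kept:  # a chunk with nothing relevant left is dropped entirely
            compressed.append(" ".join(kept))
    return compressed

docs = [
    "The president spoke about the economy. It rained in the afternoon. "
    "He also nominated a judge to the Supreme Court.",
    "The weather was cold. Traffic was heavy downtown.",
]
question = "Who did the president nominate to the Supreme Court?"
out = compress(question, docs)
print(out)
print(sum(map(len, docs)), "->", sum(map(len, out)), "characters")
```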

LangChain provides a basic contextual compression retriever for this, called ContextualCompressionRetriever. It wraps a base retriever and a document compressor.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

# `retriever` is the base retriever built earlier (e.g. vectorstore.as_retriever())
llm = OpenAI(temperature=0)
# LLMChainExtractor asks the LLM to extract only the parts of each document relevant to the query
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke("What did the president say about Ketanji Jackson Brown")

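LLMChainFilter is a slightly simpler but more robust compressor. It uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without modifying the document contents.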

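from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

def pretty_print_docs(docs):
    # Helper: print each returned document, separated by a divider line
    print(f"\n{'-' * 100}\n".join(f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)))

# Reuses `llm` and `retriever` from the previous snippet
_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)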

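3. Corrective RAG

Corrective-RAG (CRAG) is a RAG strategy that adds self-reflection, or self-grading, of the retrieved documents.

CRAG improves generation with a lightweight "retrieval evaluator". The evaluator returns a confidence score for each retrieved document, and those scores determine which retrieval action is triggered. For example, based on the confidence scores, the evaluator can classify the retrieval into one of three buckets: Correct, Ambiguous, or Incorrect.

If every retrieved document's confidence score falls below a threshold, the retrieval is deemed "Incorrect". This triggers an action that brings in a new knowledge source, such as web search, to preserve the quality of the generated answer.

If at least one retrieved document's confidence score is above a threshold, the retrieval is deemed "Correct". This triggers knowledge refinement on the retrieved documents. Knowledge refinement splits the documents into "knowledge strips", scores each strip for relevance, and recombines the most relevant strips into the internal knowledge used for generation.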
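The decision logic above can be sketched as follows. This is a minimal illustration, not the CRAG reference implementation. The evaluator's confidence scores are passed in directly. The two thresholds, `upper` and `lower`, are hypothetical values, and `web_search` is a hypothetical stub. Two separate thresholds are what leave room for the Ambiguous bucket. Following the CRAG paper's description, the sketch handles Ambiguous by combining refined internal knowledge with external search results. Knowledge refinement splits documents into sentence-level strips and keeps the best-scoring ones. A toy term-overlap scorer stands in for the evaluator here.

```python
import re

# Hypothetical stopword list for the toy strip scorer.
STOP = {"the", "a", "an", "of", "to", "in", "is", "and", "how", "does", "what", "at"}

def terms(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower())) - STOP

def classify(scores: list[float], upper: float = 0.7, lower: float = 0.3) -> str:
    # Correct: at least one document is confidently relevant.
    if any(s > upper for s in scores):
        return "correct"
    # Incorrect: every document falls below the lower threshold.
    if all(s < lower for s in scores):
        return "incorrect"
    # Anything in between is ambiguous.
    return "ambiguous"

def refine(question: str, docs: list[str], top_k: int = 2) -> str:
    # Split documents into sentence-level "knowledge strips", score each strip,
    # drop zero-score strips, and recombine the top_k into the internal knowledge.
    q = terms(question)
    strips = [s for d in docs for s in re.split(r"(?<=[.!?])\s+", d.strip()) if s]
    scored = [(len(q & terms(s)), s) for s in strips]
    best = sorted((x for x in scored if x[0] > 0), key=lambda x: x[0], reverse=True)[:top_k]
    return " ".join(s for _, s in best)

def web_search(question: str) -> str:
    # Hypothetical stub for an external knowledge source such as a web search API.
    return f"[web results for: {question}]"

def corrective_context(question: str, docs: list[str], scores: list[float]) -> str:
    # Trigger a different action depending on the evaluator's bucket.
    label = classify(scores)
    if label == "correct":
        return refine(question, docs)
    if label == "incorrect":
        return web_search(question)
    return refine(question, docs) + "\n" + web_search(question)

docs = [
    "Agent memory stores past interactions. The office closes at 6pm.",
    "Short-term memory holds the current conversation.",
]
print(corrective_context("How does agent memory work?", docs, [0.9, 0.5]))
print(corrective_context("How does agent memory work?", docs, [0.1, 0.2]))
```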

So the key to Corrective RAG is the design of the "retrieval evaluator". Here is an example implementation of one:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field  # langchain_core.pydantic_v1 is deprecated in recent LangChain

# Data model
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""

    binary_score: str = Field(
        description="Documents are relevant to the question, 'yes' or 'no'"
    )

# LLM with function call
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeDocuments)

# Prompt
system = """You are a grader assessing relevance of a retrieved document to a user question. \n
    If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant. \n
    Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question."""
grade_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
    ]
)

retrieval_grader = grade_prompt | structured_llm_grader
question = "agent memory"
docs = retriever.invoke(question)  # `retriever` is assumed to be defined earlier
doc_txt = docs[1].page_content
print(retrieval_grader.invoke({"question": question, "document": doc_txt}))
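The evaluator code grades a single document (`docs[1]`). To act on the grades, one option is to grade every retrieved document, keep those graded "yes", and fall back to web search when none pass. This matches the "Incorrect" rule described earlier. With a binary yes/no grader, the three CRAG buckets effectively collapse into two. In the sketch below, `FakeGrader` is a hypothetical offline stand-in for `retrieval_grader`. It only mimics the `.invoke({...})` call and the `binary_score` field. With the real retriever, you would pass `d.page_content` for each returned document.

```python
from dataclasses import dataclass

@dataclass
class Grade:
    # Mirrors the `binary_score` field of the GradeDocuments model.
    binary_score: str

class FakeGrader:
    # Hypothetical stand-in for `retrieval_grader`: answers "yes" when the
    # document mentions any word of the question.
    def invoke(self, inputs: dict) -> Grade:
        words = inputs["question"].lower().split()
        hit = any(w in inputs["document"].lower() for w in words)
        return Grade("yes" if hit else "no")

def grade_documents(grader, question: str, documents: list[str]) -> tuple[list[str], bool]:
    # Grade each document and keep only those judged relevant.
    relevant = [
        doc for doc in documents
        if grader.invoke({"question": question, "document": doc}).binary_score == "yes"
    ]
    # No relevant document at all: treat retrieval as "Incorrect" and fall back to web search.
    needs_web_search = not relevant
    return relevant, needs_web_search

docs = ["Agent memory can be short-term or long-term.", "Bananas are rich in potassium."]
print(grade_documents(FakeGrader(), "agent memory", docs))
print(grade_documents(FakeGrader(), "quantum tunneling", docs))
```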