Explore how Microsoft's GraphRAG revolutionizes AI retrieval with knowledge graphs. Key topics: 1. What GraphRAG is and how it improves RAG retrieval 2. How knowledge graphs evolved and why they matter for RAG 3. A hands-on comparison of baseline RAG and GraphRAG on The Hound of the Baskervilles
The retrieval step of most RAG techniques relies on our ability to fetch relevant documents from a vector database. Even though most vector databases implement efficient similarity algorithms, and even with the best chunking, ranking, and re-ranking strategies, baseline RAG still struggles when the query does not closely match the source text (i.e., when the query is expected to be semantically similar to the corpus). It also cannot answer global questions or draw high-level conclusions, because it does not understand the deeper meaning behind the relationships between entities in the data. Other RAG techniques mostly focus on what to do when the corpus is larger than the LLM's context window, and advanced RAG systems add pre-retrieval and post-retrieval strategies along with techniques such as query expansion to improve a given query. These are exactly the problems Microsoft's GraphRAG sets out to solve.
In this article we will look at what knowledge graphs are, how they evolved, and why RAG needs them. We will implement baseline RAG and GraphRAG on Sir Arthur Conan Doyle's popular novel The Hound of the Baskervilles (https://www.gutenberg.org/cache/epub/2852/pg2852.txt) (Project Gutenberg license), compare the results to see how each strategy performs, and finally cover some things to watch out for as well as the drawbacks of GraphRAG.
RAG, or Retrieval-Augmented Generation, is a technique that enhances an LLM's responses by incorporating relevant information from external sources, augmenting the model's pre-trained parametric memory without retraining the model itself.
Baseline RAG
The code below shows a baseline RAG implementation. An embedding model converts the novel into vectors, which are stored in Azure AI Search. This lets us run a hybrid search to fetch the top k (=3) results most semantically relevant to any query. We then pass those results, together with the query, to the LLM to generate an appropriate response.
import os

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION"),
)
vector_store = AzureSearch(
    index_name="hound-of-baskervilles",
    embedding_function=embeddings.embed_query,
    azure_search_endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_SEARCH_KEY"),
)

# book_path points at the downloaded text of the novel
loader = TextLoader(book_path, encoding="utf-8")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
vector_store.add_documents(documents=docs)

def retrieve(query):
    docs = vector_store.hybrid_search(query=query, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)
    print(context)
    return context

rag_prompt = ChatPromptTemplate.from_messages([
    ("human", """You are an assistant for question-answering tasks.
Only use the following pieces of context to answer the question.
If you don't know the answer or the answer can't be derived from the given context, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""),
])

def get_response(query):
    messages = rag_prompt.invoke({"context": retrieve(query), "question": query})
    response = llm.invoke(messages)
    return response.content

get_response("Who is Sir Henry Baskerville?")
get_response("Who is Sir Henry Baskerville?")
Output:
Sir Henry Baskerville is the young baronet,
about thirty years old, who is the last of the Baskerville family.
He is the nephew of the late Sir Charles Baskerville.
He has recently arrived in England from Central America.
If you need a quick refresher on RAG, check out my article:
RAG 101: An introduction to the what, why, and how of naive RAG:
(https://ai.gopubby.com/rag-101-introduction-aa1138b1dcf3)
Every RAG approach has two steps: retrieval and generation. The generation step is more or less the same across approaches: we use an LLM to summarize the data obtained from the retrieval step. Retrieval is where most strategies differ. This step is critical, because if we fail to supply the LLM with the right documents, the generated response suffers as well. There is therefore a direct correlation between how well the retrieval step performs and the overall quality of the RAG response.
Retrieval is essentially a search over our documents, so in effect we are building a search engine just for our private data. It is no surprise, then, that retrieval strategies have followed a trajectory similar to traditional search engines.
• Keyword search: the earliest approach looks for exact occurrences of keywords in the data. Even synonyms or variants of a keyword may go unrecognized, so the search text must match the words in the corpus exactly. At its core is the inverted index, a data structure that holds a vocabulary of every unique word in the dataset along with the positions where each word appears in the documents. Word frequencies can be stored as well, making it possible to retrieve documents matching a query as fast as possible.
• Best-match algorithms: these ranking algorithms use term frequency, document frequency, and document length to compute a score for each document against a query. BM25 is the most widely used ranking function; it addresses several pain points that plain keyword search cannot, improving the quality of the retrieved documents.
• PageRank: originally developed by Google, it assigns a score to each page based on its incoming and outgoing links and the quality of its content. Although contextual relevance takes a slight back seat and the recency of a page matters heavily in ranking, it was the logical next step for ranking millions of documents. Document ranking and re-ranking are also common strategies for improving the documents retrieved in baseline RAG.
• Semantic search: until this point, every function/algorithm/technique focused only on which words to search for rather than truly understanding the user's query. Semantic search interprets the intent behind the query and takes the relationships between words into account. All search indexers now offer pure keyword, semantic, and hybrid search; the last, as the name suggests, considers both keywords and the semantic meaning of the query.
• Knowledge graphs: although knowledge graphs existed for decades before being used in search, the evolution of search has, intentionally or not, always been about building a connected graph. PageRank and semantic search likewise rely heavily on graph-like data structures. In the Google Knowledge Graph used to enrich search results (https://en.wikipedia.org/wiki/Google_Knowledge_Graph), a knowledge graph is represented as a directed labeled graph where nodes represent entities and edges represent relationships between nodes. When a user query matches a particular node or edge, the highly connected neighboring nodes/edges can be pulled in as well, adding more context to the response.
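The BM25 scoring described above can be sketched in a few lines of Python. This is a minimal illustration of the formula (with the usual free parameters k1 and b), not the tuned implementation a real search engine would use:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # document frequency: how many documents contain each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)                          # term frequencies in this doc
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the hound of the baskervilles".split(),
    "sherlock holmes and doctor watson".split(),
]
print(bm25_scores(["hound", "baskervilles"], docs))
```

Only the first document contains the query terms, so it receives a positive score while the second scores zero.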
It was only natural, then, that the retrieval step of RAG also evolved toward a knowledge graph stage.
A knowledge graph is a semantic representation of the relationships between entities. It gives unstructured data a structure so that machines can understand how entities relate to one another and what properties they share. Every entity in a knowledge graph is represented as a node, and every relationship between nodes as an edge. A node can be any object: a person, place, organization, or event. An edge can be any property or attribute by which two nodes are related.
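The node/edge view above maps directly onto a simple data structure. A minimal sketch using plain Python containers (a real system would use a graph database or a library such as NetworkX; the entities here are illustrative picks from the novel):

```python
# Nodes: entities with a type. Edges: labeled relationships between nodes.
nodes = {
    "Sherlock Holmes": {"type": "Person"},
    "James Mortimer": {"type": "Person"},
    "Baskerville Hall": {"type": "Place"},
}
edges = [
    ("James Mortimer", "Sherlock Holmes", "consults"),
    ("James Mortimer", "Baskerville Hall", "practices near"),
]

def neighbors(entity):
    """Return every (related_entity, relation) pair touching `entity`."""
    out = []
    for src, dst, rel in edges:
        if src == entity:
            out.append((dst, rel))
        elif dst == entity:
            out.append((src, rel))
    return out

print(neighbors("James Mortimer"))
```

When a query matches "James Mortimer", this one-hop traversal is what lets connected nodes contribute extra context to the response.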
Traditionally, building a knowledge graph was a manual, time-consuming task. It required close collaboration between subject-matter experts and data scientists, who reviewed large volumes of data to understand how entities correspond to one another. With the advent of LLMs, however, we can automate almost every step of knowledge graph construction. As we saw earlier, the most important components of any knowledge graph are its nodes. Let's write a simple prompt to extract names, places, organizations, and events.
Let's try a basic prompt to get names, places, organizations, and events. Ultimately we want to give users the flexibility to specify their own entity types, since these are highly domain-specific and will greatly improve the extraction process.
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

entity_extraction_prompt = """
Given a document identify all entities from the text.
Entity types can be one of: Name, Place, Organization or Event
Document: {text}"""

def extract_entites(text):
    prompt = ChatPromptTemplate.from_messages([entity_extraction_prompt])
    messages = prompt.invoke({"text": text})
    response = llm.invoke(messages)
    return response.content
Using the first paragraph of the novel as input, we get the following identified entities.
extract_entites("""
Mr. Sherlock Holmes, who was usually very late in the mornings,
save upon those not infrequent occasions when he was up all
night, was seated at the breakfast table. I stood upon the
hearth-rug and picked up the stick which our visitor had left
behind him the night before. It was a fine, thick piece of wood,
bulbous-headed, of the sort which is known as a "Penang lawyer."
Just under the head was a broad silver band nearly an inch
across. "To James Mortimer, M.R.C.S., from his friends of the
C.C.H.," was engraved upon it, with the date "1884." It was just
such a stick as the old-fashioned family practitioner used to
carry—dignified, solid, and reassuring.
""")
Output:
Entities identified in the text:
- Name: Sherlock Holmes
- Name: James Mortimer
- Organization: M.R.C.S.
- Organization: C.C.H.
- Event: 1885
Not bad for a basic prompt. Let's also add entity_types as a variable and provide a few-shot examples specifying the structure, to get the output in the desired format. This is exactly what GraphRAG does. In the GraphRAG code, entities and their relationships are extracted with a single prompt:
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
4. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: ORGANIZATION,PERSON
Text:
The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
######################
Output:
("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
{record_delimiter}
("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution)
{record_delimiter}
("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
{record_delimiter}
("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9)
{completion_delimiter}
######################
Example 2:
Entity_types: ORGANIZATION
Text:
TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.
TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
######################
Output:
("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
{record_delimiter}
("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal)
{record_delimiter}
("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5)
{completion_delimiter}
######################
Example 3:
Entity_types: ORGANIZATION,GEO,PERSON
Text:
Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.
The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.
The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.
They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.
The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
######################
Output:
("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages)
{record_delimiter}
("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages)
{record_delimiter}
("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages)
{record_delimiter}
("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held)
{record_delimiter}
("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara)
{record_delimiter}
("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia)
{record_delimiter}
("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison)
{record_delimiter}
("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia)
{record_delimiter}
("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage)
{record_delimiter}
("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage)
{record_delimiter}
("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2)
{record_delimiter}
("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2)
{completion_delimiter}
######################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
We get the output in the specified format:
'("entity"<|>SHERLOCK HOLMES<|>PERSON<|>Mr. Sherlock Holmes is a detective who often stays up all night and was seated at the breakfast table in the morning)\n##\n("entity"<|>JAMES MORTIMER<|>PERSON<|>James Mortimer is the owner of the stick, engraved "To James Mortimer, M.R.C.S., from his friends of the C.C.H.," with the date "1884")\n##\n("entity"<|>C.C.H.<|>ORGANIZATION<|>C.C.H. is an organization whose friends gifted James Mortimer a stick)\n|COMPLETE|'
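The delimiter format above ({tuple_delimiter} rendered as <|>, {record_delimiter} as ##, and |COMPLETE| closing the output) is straightforward to parse back into structured records. A hedged sketch of such a parser; the delimiters are configurable in GraphRAG, and these are simply the values visible in the output above:

```python
def parse_records(raw, tuple_delim="<|>", record_delim="##", done="|COMPLETE|"):
    """Split delimiter-formatted LLM output into (kind, *fields) tuples,
    where kind is 'entity' or 'relationship'."""
    raw = raw.replace(done, "")
    records = []
    for rec in raw.split(record_delim):
        rec = rec.strip()
        # records look like ("entity"<|>NAME<|>TYPE<|>DESCRIPTION)
        if not (rec.startswith('("') and rec.endswith(")")):
            continue
        fields = rec[1:-1].split(tuple_delim)
        kind = fields[0].strip('"')
        records.append((kind, *fields[1:]))
    return records

raw = ('("entity"<|>SHERLOCK HOLMES<|>PERSON<|>A detective)\n##\n'
       '("relationship"<|>SHERLOCK HOLMES<|>JAMES MORTIMER'
       '<|>Mortimer consults Holmes<|>8)\n|COMPLETE|')
for kind, *fields in parse_records(raw):
    print(kind, fields)
```

Each parsed tuple can then be loaded as a node or edge of the graph.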
To showcase the real value of GraphRAG, let's remove the last two chapters of the book and then try to answer questions with both baseline RAG and GraphRAG. To get started with GraphRAG, let's index the book into parquet files using the CLI command, and then load them into memory.
import requests
import os
from dotenv import find_dotenv, load_dotenv

book_path = "graphrag/input/book.txt"
response = requests.get("https://www.gutenberg.org/cache/epub/2852/pg2852.txt")
if response.status_code == 200:
    os.makedirs("graphrag/input", exist_ok=True)
    with open(book_path, "wb") as file:
        file.write(response.content)
    print("File saved successfully")
else:
    print("Failed to fetch the file")

load_dotenv(find_dotenv())

!python -m graphrag index --root ./graphrag
Things to note
1. Although the default prompt extracts basic entities such as people, places, and organizations, it is best to provide few-shot examples so that additional entities relevant to your own domain are extracted.
2. Although the GraphRAG paper justifies its use of longer context by adding a follow-up prompt that first answers yes/no on whether many entities were missed and then asks for them, there is still no guarantee that the LLM will identify all entities, especially those that are not nouns or that consist of multiple words.
3. There is no explicit entity de-duplication step. Even if an entity is misread and duplicated as a separate node, the duplicates end up relatively close to each other and tightly connected. And since GraphRAG generates community summaries, the LLM eventually picks up on this and produces an appropriate summary.
4. To spread information across the whole graph, GraphRAG shuffles the community reports (dropping reports with a weight of 0 and sorting the rest by relevance score). Only the top n reports that fit in the context window are used; there is no guarantee the same reports will surface every time (when they share the same score), so although the conclusion to a question may be the same, the source reports can be entirely different.
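The selection step in point 4 (drop zero-weight reports, sort by relevance, keep only what fits the context budget) can be sketched as follows. This is an illustrative reconstruction, not GraphRAG's actual code; the field names and the whitespace token counter are hypothetical:

```python
def select_reports(reports, max_tokens,
                   count_tokens=lambda r: len(r["text"].split())):
    """Drop zero-weight reports, sort the rest by relevance score, then
    greedily pack reports until the context-window budget is exhausted."""
    candidates = [r for r in reports if r["weight"] > 0]
    candidates.sort(key=lambda r: r["score"], reverse=True)  # ties: arbitrary order
    selected, used = [], 0
    for r in candidates:
        cost = count_tokens(r)
        if used + cost > max_tokens:
            break
        selected.append(r)
        used += cost
    return selected

reports = [
    {"text": "community A summary", "weight": 1, "score": 0.9},
    {"text": "community B summary", "weight": 0, "score": 0.8},  # dropped: zero weight
    {"text": "community C summary", "weight": 1, "score": 0.7},
]
print([r["text"] for r in select_reports(reports, max_tokens=6)])
```

Because reports with equal scores can land in any order before truncation, two runs can feed different reports to the LLM, which is exactly the non-determinism point 4 describes.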
Drawbacks of GraphRAG
1. Indexing is very expensive, as is evident from the number of LLM calls it makes, and ultimately querying also costs relatively more than baseline RAG.
2. When retrieving documents, it is common practice to apply filters on metadata to exclude documents that are irrelevant in the user's context or that the user is not permitted to access. This acts as a security layer: users can only search through documents they are authorized to see. But because GraphRAG performs local summarization, deciding which entities, relationships, or community reports need RBAC (role-based access control), and how to build the graph differently for different roles, can become much more complicated.
3. As long as the text does not change, its embedding is always the same. So no matter how many times we regenerate embeddings for the same documents, baseline RAG will give the same answer to the same query (or at least the retriever will always return the same documents). GraphRAG, on the other hand, depends on LLM output for entity recognition, relationship labeling, and the generation of community summaries and reports, so regenerating the graph over the same corpus has a high chance of producing different answers to the same query.
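The metadata-filter security layer described in point 2 is simple in baseline RAG. A minimal sketch (the document schema and role names here are hypothetical) of filtering retrieved documents down to those the user's roles may access:

```python
# Hypothetical retrieved documents, each carrying an access-control list
# in its metadata.
docs = [
    {"content": "Q3 revenue figures", "roles": {"finance"}},
    {"content": "Public product FAQ", "roles": {"finance", "support", "public"}},
]

def filter_by_role(retrieved, user_roles):
    """Security layer: keep only documents whose ACL intersects the
    user's roles."""
    return [d for d in retrieved if d["roles"] & user_roles]

print([d["content"] for d in filter_by_role(docs, {"support"})])
```

With GraphRAG there is no such per-document boundary to filter on: entities and community reports blend information from many source documents, which is why RBAC becomes the harder design problem the text describes.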