Microsoft's GraphRAG in Practice
Published: 2024-07-12 13:58:24 | Source: 知识图谱科技 (Knowledge Graph Technology)


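Foreword

Microsoft's official open-sourcing of GraphRAG (see our post "Big news: Microsoft officially announces the open-sourcing of GraphRAG on GitHub") sparked lively discussion in the industry, and that post was read more than ten thousand times in a single day. We followed up with two further analyses: "Understanding open-source GraphRAG: Microsoft's AI-driven approach to knowledge discovery" and "Revealing Microsoft's open-source RAG strategy: GraphRAG". Today we share an excerpt of a hands-on introduction so that we can learn together. We are also deploying GraphRAG in the pharmaceutical and industrial domains to address what generic RAG lacks: semantic understanding of industry context, precise question answering, and traceability. We look forward to exchanging ideas and working together.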

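Summary

Trying out Microsoft's Graph RAG – baeke.info

This article discusses Microsoft's GraphRAG approach and compares it with baseline RAG. GraphRAG uses a knowledge graph and community summaries to answer global questions, which makes query-focused summarization more effective.

Key points:

- Graph RAG improves responses to global queries through a knowledge graph and community summaries.

- The Graph RAG pipeline involves entity extraction, graph creation, and community summary generation.

- Baseline RAG struggles with global questions unless the dataset itself describes its themes.

- The implementation includes Python-based tools for indexing and querying.

- Graph RAG has a higher computational cost but provides nuanced query responses.

- Modifying the key prompts is essential for good results.

- The system supports both local and global search.

Source:

https://blog.baeke.info/2024/07/07/trying-out-graph-rag/

Main text: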

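When we build applications on top of LLMs such as OpenAI's gpt-4o, we often use the RAG pattern. RAG stands for retrieval-augmented generation. You use it to let an LLM answer questions about data it has never seen. To answer a question, you retrieve relevant information and hand it to the LLM to generate the answer.

The diagram below shows the data ingestion and query parts at a high level, using gpt-4 and a vector database in Azure, Azure AI Search.

Above, our documents are split into chunks and vectorized. The vectors are stored in Azure AI Search. Vectors allow us to find text chunks that are similar to the user's query. When the user enters a question, we vectorize the question, find similar vectors, and hand the top n matches to the LLM. The text chunks that were found are placed in the prompt together with the original question. See this page for more information about vectors.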
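The retrieval step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `embed` function is a toy bag-of-words vector standing in for a real embedding model, the chunks live in a plain list rather than in Azure AI Search, and all function names are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_n: int = 2) -> list[str]:
    """Return the top_n chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_n]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved chunks and the original question in one prompt."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


chunks = [
    "Winston works at the Ministry of Truth",
    "Julia rents a room with Winston",
    "Oceania is at war with Eastasia",
]
question = "Which ministry does Winston work at"
top = retrieve(question, chunks, top_n=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` is what would be sent to the LLM. A real system would replace `embed` with calls to an embedding model and `retrieve` with a vector database query.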

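Note that the above is the basic scenario in its simplest form. You can optimize this process in many ways, in both the indexing and the retrieval phase. See the "RAG From Scratch" series on YouTube for a deeper look.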

https://youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&si=ZQdvbm-_9mUYWq8D

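Limitations of baseline RAG

Although you can get quite far with basic RAG, it is not very good at answering global questions about an entire dataset. If you ask "What are the themes in the dataset?", it will be hard to find text chunks relevant to the question unless the dataset itself describes its themes. In essence, this is a query-focused summarization task as opposed to an explicit retrieval task.

In the paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization" (covered in our earlier post "Big news: Microsoft publishes the GraphRAG paper and will soon open-source the project"), Microsoft proposes a solution based on knowledge graphs and intermediate community summaries to answer these global questions more effectively.

If you are not sure about the difference between baseline RAG and GraphRAG, watch this YouTube video, which explains it in more detail.

Differences between baseline RAG and Graph RAG

Getting started: creating an index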

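Microsoft has a Python-based open-source implementation of Graph RAG that supports both local and global queries. We will discuss local queries later and focus on global queries for now. See the GitHub repository for more information: https://github.com/microsoft/graphrag

If you have Python on your local machine, it is easy to try out:

  • Create a folder and create a Python virtual environment in it

  • Make sure the Python environment is active and run pip install graphrag

  • In the folder, create a folder named "input" and put some text files with your content in it

  • From the folder that contains the input folder, run the following command: python -m graphrag.index --init --root .

  • This creates a .env file and a settings.yaml file.

  • In the .env file, enter your OpenAI key. This can also be an Azure OpenAI key. Azure OpenAI requires additional settings in the settings.yaml file: api_base, api_version, deployment_name.

  • I used OpenAI directly and changed the model in the settings.yaml file. Find the "model" setting and set it to gpt-4o.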
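If you prefer to script the folder preparation, the following Python sketch creates the "input" folder and assembles the init command quoted in the steps above as an argument list. The helper names `prepare_workspace` and `init_command` are hypothetical and exist only for this example; the script does not run graphrag itself.

```python
import tempfile
from pathlib import Path


def prepare_workspace(root: Path, documents: dict[str, str]) -> Path:
    """Create <root>/input and write each document as a UTF-8 text file."""
    input_dir = root / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        (input_dir / name).write_text(text, encoding="utf-8")
    return input_dir


def init_command(root: Path) -> list[str]:
    """The init command from the steps above, as an argument list for subprocess.run."""
    return ["python", "-m", "graphrag.index", "--init", "--root", str(root)]


# Usage: build a throwaway workspace and assemble the command without running it.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = prepare_workspace(Path(tmp), {"sample.txt": "sample text"})
    created = sorted(p.name for p in input_dir.iterdir())
    command = init_command(Path(tmp))
```

Passing `command` to `subprocess.run` inside the activated virtual environment would be equivalent to typing the command by hand.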

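You can now run the indexing pipeline. Before you do, be aware that this makes a large number of LLM calls, depending on the amount of data you put in the input folder. In my test, with 800KB of text data, indexing cost between 10 and 15 euros. Here is the command: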

python -m graphrag.index --root .

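To see what happens, look at the diagram below:

Above, let's go from top to bottom, leaving out the user query part for now:

  • The source documents in the input folder are split into chunks of 300 tokens with an overlap of 100 tokens. Microsoft uses the cl100k_base tokenizer, which is the one used by gpt-4, not gpt-4o (gpt-4o uses o200k_base). That should not make a difference. You can adjust the chunk size and overlap. With a larger chunk size, fewer LLM calls are made in later steps, but element extraction may be less precise.

  • With the help of gpt-4o, elements are extracted from each chunk. These elements are the entities and the relationships between entities in the graph that is being built. In addition, claims about the entities are extracted. The paper and the diagram above use the term covariates for these. This is an expensive operation if there is a lot of data in the input folder.

  • Text descriptions of the elements are generated.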
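The chunking in the first step can be pictured as a sliding window over the token sequence. The sketch below is an assumption-laden illustration, not GraphRAG's own code: it works on a list of placeholder tokens rather than real cl100k_base tokens, and `chunk_tokens` is a name invented for this example.

```python
def chunk_tokens(tokens: list[str], size: int = 300, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# 700 placeholder tokens: windows start at 0, 200 and 400.
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
```

With a size of 300 and an overlap of 100, each new chunk starts 200 tokens after the previous one, so the number of chunks (and therefore extraction calls) grows roughly with the token count divided by 200.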

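After these steps, a graph has been built that contains all the entities, relationships, claims, and element descriptions that gpt-4o could find. But the process does not stop there. To support global queries, the following happens:

  • Communities are detected in the graph. A community is a group of closely related entities. They are detected with the Leiden algorithm. In my small dataset, about 250 communities were detected.

  • For each community, a community summary is created with gpt-4o and stored. These summaries can later be used for global queries.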
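To make the idea of communities and per-community input more concrete, here is a small Python sketch on invented toy data. It deliberately does not implement Leiden: connected components are used as a crude stand-in so that the example stays dependency-free, and the function names are hypothetical. The point is only to show how entities get grouped and how the text for each group could be collected before an LLM summarizes it.

```python
from collections import defaultdict

# Toy relationships (source entity, target entity, description), invented for illustration.
relationships = [
    ("WINSTON", "JULIA", "Winston and Julia have a secret affair"),
    ("WINSTON", "O'BRIEN", "O'Brien interrogates Winston"),
    ("OCEANIA", "EASTASIA", "Oceania is at war with Eastasia"),
]


def build_adjacency(edges):
    """Undirected adjacency map: entity -> set of neighbouring entities."""
    adjacency = defaultdict(set)
    for source, target, _ in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def connected_components(adjacency):
    """Group entities reachable from each other (crude stand-in for Leiden)."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def community_context(group, edges):
    """Collect the relationship descriptions an LLM would be asked to summarize."""
    return [desc for source, target, desc in edges if source in group and target in group]


communities = connected_components(build_adjacency(relationships))
contexts = [community_context(group, relationships) for group in communities]
```

Unlike this stand-in, Leiden can split a single connected graph into several densely linked groups, which is why a real dataset yields many more communities than connected components.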

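For all of the above to work, a lot of LLM calls have to be made. The prompts that are used can be found in the prompts folder:


You can, and probably should, modify these prompts to match the domain of your documents. The entity extraction prompt contains examples that teach the LLM which entities to extract. By default, entities such as people, places, and organizations are detected. But if you mainly work with construction projects, buildings, bridges, and construction materials, the prompt should be adjusted accordingly. The quality of the answers depends heavily on these adjustments.

In addition to the graph, the solution uses the open-source project LanceDB to store an embedding for each text chunk. The database contains a single table with four fields: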

  • id: unique id for the chunk

  • text: the text in the chunk

  • vector: the vector of the chunk; by default the text-embedding-3-small model is used

  • attributes: e.g., {"title": "\"title here\""}

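The graph and related data are stored in Parquet files in an artifacts folder, which sits inside another, timestamped folder. For example:

Parquet files containing the graph structure

If you have a Parquet viewer, you can inspect the create_final_entities.parquet file to see the detected entities. You will find entity types such as ORGANIZATION, PERSON, GEO, EVENT, and CONCEPT. Each entity has a description and links back to the identifiers of its text units. Text units are the chunks.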
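Without a Parquet viewer, a few lines of Python can give a quick overview of the entity types. The sketch below works on a list of plain dictionaries so that it runs without extra packages. With pandas and a Parquet engine installed, such records could be obtained with `pandas.read_parquet(path).to_dict("records")`. The column names `name`, `type`, and `description` are assumptions based on the description above, and the sample rows are hand-written.

```python
from collections import Counter


def count_entity_types(records: list[dict]) -> Counter:
    """Count how many entities were extracted per entity type."""
    return Counter(record.get("type", "UNKNOWN") for record in records)


# Hand-written rows standing in for the contents of create_final_entities.parquet.
sample = [
    {"name": "WINSTON", "type": "PERSON", "description": "Outer Party member"},
    {"name": "OCEANIA", "type": "GEO", "description": "Superstate"},
    {"name": "JULIA", "type": "PERSON", "description": "Party member"},
]
type_counts = count_entity_types(sample)
```

A skewed distribution, for example almost everything typed as CONCEPT, can be a hint that the entity extraction prompt needs to be adapted to the domain.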

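Note that if you want the graph in GraphML format, set snapshots.graphml to true in settings.yaml. Your artifacts folder will then contain .graphml files. You can load these in a viewer such as Gephi:

If you already ran the indexer without this setting, you can simply run it again. Graph RAG has a caching layer, so you will not incur costs when you rerun the indexer to generate the .graphml files.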

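Global Search

Now let's run a global query. A global query uses the generated community summaries to answer the question. Intermediate answers are used to generate the final answer.

A global query is not a single LLM call but several. Compared to a typical similarity search that uses 3 to 5 retrieved chunks, the total token cost is relatively high. It is not unusual to see more than 10 LLM calls with 200K tokens. You can control token usage in settings.yaml. See the global_search settings at the bottom.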
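The "intermediate answers, then a final answer" flow is essentially a map-reduce over the community summaries. The following Python sketch shows that shape with a stub in place of the model. It is a simplification under stated assumptions: `global_search`, `fake_llm`, the prompt wording, and the batch size are all invented for this example and do not reflect GraphRAG's actual prompts or settings.

```python
from typing import Callable


def global_search(question: str, summaries: list[str], llm: Callable[[str], str],
                  batch_size: int = 2) -> str:
    """Map: answer per batch of community summaries. Reduce: merge the partial answers."""
    partial_answers = []
    for i in range(0, len(summaries), batch_size):
        batch = "\n".join(summaries[i:i + batch_size])
        partial_answers.append(llm(f"Context:\n{batch}\n\nQuestion: {question}"))
    merged = "\n".join(partial_answers)
    return llm(f"Combine these partial answers:\n{merged}\n\nQuestion: {question}")


calls = []


def fake_llm(prompt: str) -> str:
    """Stub that records each prompt instead of calling a real model."""
    calls.append(prompt)
    return f"answer-{len(calls)}"


summaries = ["Community about surveillance", "Community about war", "Community about Newspeak"]
final = global_search("What are the top themes?", summaries, fake_llm)
```

Three summaries in batches of two lead to two map calls plus one reduce call. With hundreds of community summaries, the number of calls and tokens grows accordingly, which is where the higher cost comes from.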

To run a global query, use the command below. I have the book 1984 in my corpus, so let's ask for the themes:

python -m graphrag.query \
--root . \
--method global \
"What are the top themes in 1984?"

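Note that the LLM could answer this question on its own, since it already knows about 1984. But that is not the point here. The answer you get will come from the index: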

### Top Themes in *1984*

 

**1. Totalitarianism and Absolute Control**

The Party in Oceania exercises absolute control over its citizens, manipulating historical records, maintaining a state of perpetual conflict, and enforcing oppressive policies through purges, arrests, tortures, imprisonments, and vaporizations [Data: Reports (211)]. The regime's control extends to every aspect of life, including personal relationships, sexual behavior, and even thoughts, enforced by the Thought Police and the omnipresent Telescreen [Data: Reports (141, 183, 223, 185, 191, 181, 151, 116, 186)].


**2. Surveillance and Psychological Manipulation**

Surveillance is a pervasive theme, with the Telescreen and the Thought Police constantly monitoring individuals to ensure compliance and suppress dissent [Data: Reports (52, 141, 183, 223, 185, 191, 181, 151, 116, 186)]. The Party uses psychological manipulation techniques such as doublethink, Newspeak, and the alteration of historical records to control and limit the range of thought among its citizens [Data: Reports (141, 175, 183, 181, 215, 149, 179, 177, 178, 165, 58)].


**3. Manipulation of Truth and Reality**

The Party manipulates historical records and information to maintain its narrative and control over reality. This is evident in Winston's work at the Ministry of Truth and the concept of doublethink, which allows individuals to accept two contradictory beliefs simultaneously [Data: Reports (141, 175, 183, 181, 215, 149, 179, 177, 178, 165, 58)]. The Memory Hole is used to destroy documents, ensuring that information is permanently erased, highlighting the Party's control over historical narratives [Data: Reports (52)].


**4. Continuous Warfare**

Oceania is engaged in a perpetual state of war with Eastasia and Eurasia. This continuous warfare serves to maintain the Party's control over the population and justify its oppressive policies [Data: Reports (211, 178, 214)]. The war consumes surplus goods, maintains a hierarchical structure, and impacts the mental states of individuals, ensuring that the populace remains divided and distracted [Data: Reports (211, 178, 214, 208)].


**5. Rebellion and Resistance**

Despite the oppressive regime, themes of rebellion and resistance are present, as seen through characters like Winston and Julia. Winston's clandestine activities, such as his relationship with Julia and his hope in the Proles, reflect his desire to rebel against the Party's oppressive regime [Data: Reports (141, 183, 151, 182, 101, 186)]. The Brotherhood represents an underground network aiming to overthrow the state of Oceania, highlighting the existence of resistance within the oppressive regime [Data: Reports (211, 64, 152)].


**6. Class Struggle and Social Hierarchy**

The distinction between the Inner Party, Outer Party, and Proles highlights the rigid social hierarchy within Oceania. The Party's control over the Proles and the limited privileges of the Outer Party members emphasize the theme of class struggle [Data: Reports (141, 66, 185, 211, 182, 124)]. The Proles, despite their subjugation, are seen as a potential source of hope and rebellion [Data: Reports (182, 64)].


**7. Language as a Tool of Control**

Newspeak, a language created by The Party, is designed to limit the range of thought and eliminate any possibility of rebellious thoughts. By controlling language, The Party effectively controls thought, making it impossible for individuals to conceptualize or express subversive ideas [Data: Reports (177, 179, 181, 149, 114)].


**8. Fear and Psychological Torture**

The Party uses fear and psychological torture to maintain control, exemplified by the use of Room 101, where individuals are subjected to their worst fears. Winston's torture with the cage containing rats demonstrates the regime's ability to break individuals' spirits [Data: Reports (169, 145, 224, 220)].

 

These themes collectively illustrate the oppressive nature of the regime in *1984* and its impact on personal freedom, truth, and reality. The Party's extensive use of surveillance, psychological manipulation, and control over information ensures its dominance and the suppression of any potential dissent.

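Local Search

Local search generates answers by combining data from the knowledge graph with text chunks from the original documents. This method is suited to questions that require an understanding of specific entities in the documents (for example, "Who does Winston interact with?"). The following notebook has more information about doing a local search with Python: https://github.com/microsoft/graphrag/blob/a22003c302bf4ffeefec76a09533acaf114ae7bb/examples_notebooks/local_search.ipynb#L43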
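One way to picture how graph data and text chunks are combined is the sketch below: pick the entity that best matches the question, add its neighbours in the graph, and gather the text chunks linked to those entities. Everything here is a toy assumption made for illustration. The data is invented, word overlap stands in for embedding similarity, and `local_context` is not a GraphRAG function.

```python
# Toy graph data invented for illustration.
entities = {
    "WINSTON": {"description": "Outer Party member who works at the Ministry of Truth", "chunks": [0, 1]},
    "JULIA": {"description": "Party member who rents a room for secret meetings", "chunks": [1]},
    "EASTASIA": {"description": "Superstate at war with Oceania", "chunks": [2]},
}
edges = [("WINSTON", "JULIA")]
text_chunks = [
    "Winston rewrote the records.",
    "Winston met Julia in secret.",
    "War with Eastasia went on.",
]


def overlap(a: str, b: str) -> int:
    """Word-overlap score, a stand-in for embedding similarity."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def local_context(question: str) -> dict:
    """Pick the best-matching entity, add its graph neighbours, gather linked chunks."""
    entry = max(entities, key=lambda n: overlap(question, n + " " + entities[n]["description"]))
    neighbours = {b if a == entry else a for a, b in edges if entry in (a, b)}
    selected = [entry] + sorted(neighbours)
    chunk_ids = sorted({c for name in selected for c in entities[name]["chunks"]})
    return {"entities": selected, "chunks": [text_chunks[i] for i in chunk_ids]}


context = local_context("Who is Winston")
```

The returned entities and chunks would then be placed in the prompt, so the answer can draw on both the graph descriptions and the original text.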

To run a local search from the CLI, use the following command:

python -m graphrag.query \
--root . \
--method local \
"Who is Winston? Who does he interact with and why?"

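Local search uses a similarity search to find entities in the graph that serve as entry points. Once these entry points are found, the graph is used to find further entities of interest. Local search requires the LanceDB embedding store; global search does not. The answer to the question above: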

### Who is Winston?

 

Winston is a complex character from George Orwell's novel "Nineteen Eighty-Four." He is a member of the Outer Party and works at the Ministry of Truth, where his job involves altering historical records to fit the Party's propaganda. Winston is deeply resentful of the Party and Big Brother, harboring a sense of intellectual inferiority and mental domination by figures like O'Brien. His life is marked by intense fear, paranoia, and physical and emotional suffering, particularly during his interactions with O'Brien, who tortures and interrogates him [Data: Entities (8, 3)].

 

Winston's reflections on reality, power, and the nature of suffering are central to his character. He contemplates the symbolic meaning of chess, the principles of Ingsoc, and the dominion of The Party. His dreams and memories, including the tragic death of his mother and sister, the bells of St. Clement's, and his early childhood, provide a poignant backdrop to his internal struggles [Data: Entities (8)].

 

### Interactions and Relationships

 

#### Julia

Winston's most significant relationship is with Julia, a fellow Party member. Despite the risks, they engage in a love affair, renting a room from Mr. Charrington for their secret meetings. Julia represents a source of intimacy and rebellion for Winston, as they navigate their dangerous liaison under the watchful eyes of The Party [Data: Entities (8)].

 

#### O'Brien

O'Brien is another crucial figure in Winston's life. Initially, Winston feels a sense of connection and admiration towards O'Brien, hoping that his political orthodoxy is not perfect. However, O'Brien ultimately becomes his torturer, subjecting Winston to severe psychological and physical pain. Despite this, Winston experiences moments of connection and even a peculiar intimacy with O'Brien [Data: Entities (8)].

 

#### Mr. Charrington

Mr. Charrington is the shop owner who rents a room to Winston and Julia for their secret meetings. Initially, he appears discreet and non-judgmental, but later reveals a more authoritative and alert persona, indicating his role in the Party's surveillance [Data: Entities (317)].

 

#### Other Characters

Winston also interacts with various other characters, such as Syme, Parsons, and the old man in the pub. These interactions reveal his curiosity about the past and the changes brought about by The Party. For instance, Syme is a colleague who discusses the principles of Newspeak with Winston, while Parsons is a fellow employee at the Ministry of Truth [Data: Entities (8, 83)].

 

### Conclusion

 

Winston is a deeply reflective and observant character, constantly grappling with the oppressive nature of The Party and his own internal conflicts. His interactions with Julia, O'Brien, Mr. Charrington, and others provide a multifaceted view of his struggles and the dystopian world he inhabits. Through these relationships, Winston's character is fleshed out, revealing the complexities of life under totalitarian rule.

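Note that the output contains references to the entities that were found. For example, the section about Mr. Charrington cites entity 317. In the Gephi Data Laboratory, we can easily find that entity by its human_readable_id:

When you build an application, the user interface can provide links to the entities for further inspection.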
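As a starting point for such links, an application could extract the cited ids from the answer text. The sketch below is a minimal example that assumes citations follow the single-dataset form seen in the answers above, such as [Data: Entities (8, 3)] or [Data: Reports (211)]. The regular expression and the function name are illustrative and would need extending for other citation shapes.

```python
import re

# Matches citations like "[Data: Entities (8, 3)]" and captures the dataset name and id list.
CITATION = re.compile(r"\[Data: (\w+) \(([\d,\s]+)\)\]")


def extract_citations(answer: str) -> dict[str, list[int]]:
    """Map each cited dataset (e.g. 'Entities', 'Reports') to the sorted ids cited."""
    found: dict[str, set[int]] = {}
    for dataset, ids in CITATION.findall(answer):
        found.setdefault(dataset, set()).update(int(i) for i in ids.split(","))
    return {name: sorted(values) for name, values in found.items()}


answer = (
    "Mr. Charrington rents a room to Winston and Julia [Data: Entities (317)]. "
    "Winston works at the Ministry of Truth [Data: Entities (8, 3)]."
)
citations = extract_citations(answer)
```

Each id in `citations["Entities"]` could then be matched against human_readable_id to render a link to the entity's details.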

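Conclusion

Retrieval-augmented generation (RAG) has become a powerful technique for improving a language model's ability to answer questions about a specific dataset. Baseline RAG does well at answering specific queries by retrieving relevant text chunks, but it struggles with global questions that require an understanding of the entire dataset. To address this limitation, Microsoft introduced GraphRAG, an approach that uses knowledge graphs and community summaries to answer global queries more effectively.

GraphRAG's indexing process involves chunking documents, extracting entities and relationships, building a graph, and generating community summaries. This allows more nuanced and context-aware responses to both local and global queries. GraphRAG has clear advantages for complex questions about a whole dataset, but it comes with higher computational cost and requires careful prompt engineering to get the best results. As the field of AI continues to evolve, techniques like GraphRAG are an important step toward more comprehensive and insightful information retrieval and generation systems.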


