
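Two open-source tools for LLM-based automatic knowledge graph construction, plus AutoKG's lightweight keyword KG construction and hybrid augmented question answering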

Published: 2024-06-10 12:00:20  Views: 6674

Author: 老刘说NLP


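Today is Monday, June 10, 2024. Beijing, sunny.

It is the Dragon Boat Festival, so best wishes to everyone for the holiday.

Let's look at the open-source projects available for LLM-based automatic knowledge graph construction and pick two of the more interesting ones. After that, we look at AutoKG's approach to automatic knowledge graph construction and hybrid augmented question answering, which is also worth a read.

Separately, on combining knowledge graphs with LLMs, there are ongoing efforts to organize the literature; a good one to look at is https://github.com/liuhuanyong/KG-LLM-Papers

These are offered for reference and further thought.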

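1. llmgraph: a tool for automatic knowledge graph construction

An automated tool that uses large language models (LLMs) to create knowledge graphs. It extracts entity knowledge starting from a Wikipedia page and produces graphs in GraphML, GEXF, and HTML formats.

Given a source entity, it creates a knowledge graph, using ChatGPT (or another specified LLM) to extract knowledge. It outputs the graph in HTML, GraphML, and GEXF formats, and custom prompts let it support many entity types and relationships. Judging from the results, though, it looks more like it simply draws an edge between entities based on their similarity.

The core mechanism is to call the LLM with a prompt and let it generate the neighbors. The prompt is: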

You are knowledgeable about {knowledgeable_about}. List, in json array format, the top {top_n} {entities} most like '{{entity_root}}' with Wikipedia link, reasons for similarity, similarity on scale of 0 to 1. Format your response in json array format as an array with column names: 'name', 'wikipedia_link', 'reason_for_similarity', and 'similarity'. Example response: {{{{"name": "Example {entity}","wikipedia_link": "https://en.wikipedia.org/wiki/Example_{entity_underscored}","reason_for_similarity": "Reason for similarity","similarity": 0.5}}}}
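
To make the mechanism concrete, here is a minimal sketch of the expansion loop that such a prompt implies: ask for the entities most similar to a root, add an edge per answer, and repeat for the new entities up to a fixed depth. This is not llmgraph's actual code. The shortened template, the fake_llm stand-in (which returns canned JSON instead of calling a model), and the meaning given to levels are all assumptions made for illustration.

```python
import json
from collections import deque

# Shortened version of the prompt above (an assumption, not the tool's template).
PROMPT = (
    "You are knowledgeable about {knowledgeable_about}. List, in json array format, "
    "the top {top_n} {entities} most like '{entity_root}' with Wikipedia link, "
    "reasons for similarity, similarity on scale of 0 to 1."
)

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned JSON array."""
    canned = {
        "Knowledge graph": [("Ontology", 0.8), ("Semantic network", 0.7)],
        "Ontology": [("Taxonomy", 0.6)],
    }
    root = prompt.split("most like '")[1].split("'")[0]
    return json.dumps(
        [{"name": name, "similarity": sim} for name, sim in canned.get(root, [])]
    )

def build_edges(root: str, levels: int, ask=fake_llm) -> list[tuple[str, str, float]]:
    """Breadth-first expansion: query once per entity, up to `levels` hops deep."""
    edges, seen = [], {root}
    queue = deque([(root, 1)])
    while queue:
        entity, depth = queue.popleft()
        prompt = PROMPT.format(knowledgeable_about="computer science", top_n=5,
                               entities="concepts", entity_root=entity)
        for item in json.loads(ask(prompt)):
            # Every answer becomes a similarity-weighted edge from the queried entity.
            edges.append((entity, item["name"], item["similarity"]))
            if item["name"] not in seen and depth < levels:
                seen.add(item["name"])
                queue.append((item["name"], depth + 1))
    return edges

edges = build_edges("Knowledge graph", levels=2)
for source, target, similarity in edges:
    print(source, "->", target, similarity)
```

Because each edge is just "the model says these two are similar", the resulting graph is a similarity network rather than a graph of typed relations, which matches the impression noted above.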

Example:

llmgraph concepts-general "https://en.wikipedia.org/wiki/Knowledge_graph" --levels 4

Resulting graph: https://blog.infocruncher.com/html/llmgraph/concepts-general_knowledge-graph_v1.0.0_level4_fully_connected.html

Example:

llmgraph company "https://en.wikipedia.org/wiki/OpenAI" --levels 4

Resulting graph: https://blog.infocruncher.com/html/llmgraph/company_openai_v1.0.0_level4_fully_connected.html

More detailed configuration: https://github.com/dylanhogg/llmgraph/blob/main/llmgraph/prompts.yaml

Repository: https://github.com/dylanhogg/llmgraph

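2. Graph Maker: a tool for automatic knowledge graph construction

The Graph Maker is a Python library that converts any text into a knowledge graph, given an ontology. It can be installed with pip install knowledge-graph-maker; the repository itself is set up with Poetry (poetry config --local virtualenvs.in-project true, then poetry install), after which the steps below can be run.

It works as follows:

1. Define the ontology of the graph

The ontology is a model that defines which entity types and relationships should be identified and linked in the knowledge graph. It is a pydantic model in which you define entity labels and relationships.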

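# Import path assumed from the package name knowledge-graph-maker.
from knowledge_graph_maker import GraphMaker, Ontology, OpenAIClient

ontology = Ontology(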
    # labels of the entities to be extracted. Can be a string or an object, like the following.
    labels=[
        {"Person": "Person name without any adjectives, Remember a person may be references by their name or using a pronoun"},
        {"Object": "Do not add the definite article 'the' in the object name"},
        {"Event": "Event event involving multiple people. Do not include qualifiers or verbs like gives, leaves, works etc."},
        "Place",
        "Document",
        "Organisation",
        "Action",
        {"Miscellanous": "Any important concept can not be categorised with any other given label"},
    ],
    # Relationships that are important for your application.
    # These are more like instructions for the LLM to nudge it to focus on specific relationships.
    # There is no guarantee that only these relationships will be extracted, but some models do a good job overall at sticking to these relations.
    relationships=[
        "Relation between any pair of Entities",
        ],
)

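2. Split the text into chunks

Because an LLM's context window is limited, the text has to be split into suitably sized chunks so that the graph can be built chunk by chunk. The chunk size depends on the model's context window and on the number of tokens the prompt itself consumes.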
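
As a rough illustration of this step, the sketch below splits text into overlapping chunks. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; the function name and the default sizes are hypothetical, and a real pipeline would use the model's tokenizer and leave room for the prompt.

```python
def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample, max_words=4, overlap=1)
print(chunks)
```

The overlap keeps an entity mentioned at a chunk boundary visible in both neighboring chunks, at the cost of a few repeated tokens per call.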

3. Convert the chunks into documents

from pydantic import BaseModel

# Pydantic document model
class Document(BaseModel):
    text: str
    metadata: dict

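A document is a pydantic model containing text and metadata. The metadata can carry context for the relationships, such as page number, chapter, or article name, which helps put the extracted relationships in context.

4. Choose an LLM client

You can use a Groq model or an OpenAI model as the LLM client. You can also define your own LLM client and pass it to the graph maker.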

llm = OpenAIClient(model="gpt-3.5-turbo", temperature=0.1, top_p=0.5)

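5. Run the graph maker

The graph maker takes a list of documents as input and iterates over them, creating a subgraph for each document. The final output is the complete graph across all documents.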

graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=False)
graph = graph_maker.from_documents(list(docs), delay_s_between=10)
print("Total number of edges:", len(graph))

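6. Save to Neo4j (optional)

The graph can be saved to a Neo4j database, for building RAG applications, running network algorithms, or visualizing the graph with Bloom.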

neo4j_graph = Neo4jGraphModel(edges=graph, create_indices=False)
neo4j_graph.save()

A concrete example: https://github.com/rahulnyk/graph_maker/blob/main/graph_maker_example.ipynb. Project: https://github.com/rahulnyk/graph_maker. PyPI package: https://pypi.org/project/knowledge-graph-maker/

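3. AutoKG: automatic knowledge graph construction and hybrid augmented question answering

The further you go, however, the clearer it becomes that this kind of ontology-based knowledge graph construction is expensive. A lightweight alternative is to step back and build the graph with retrieval in mind.

For example, "AutoKG: Efficient Automated Knowledge Graph Generation for Language Models" (https://arxiv.org/pdf/2311.14740) proposes an approach with two stages.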

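1. Keyword graph construction stage

Keywords are extracted from the knowledge base and a graph is built over them. Each edge carries a positive integer weight indicating how strongly the two connected keywords are associated across the whole corpus.

[Figure: implementation workflow]

For a knowledge base made up of text blocks, keywords are extracted with an unsupervised clustering algorithm assisted by an LLM. The algorithm takes the text blocks and their embedding vectors as input and runs with predefined parameters.

[Figure: the corresponding prompt]

In the graph construction stage, the relationship weight between each pair of keywords is then estimated with graph Laplace learning. A graph over the text blocks is built first; the graph Laplace learning algorithm then diffuses label values along that existing graph structure, which establishes the association weights between keywords. [Figure: the algorithm]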
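
The idea of diffusing labels over the text-block graph can be sketched in a few lines. The code below is a simplified stand-in, not AutoKG's implementation: it fixes a few seed blocks per keyword (1.0 for related, 0.0 for unrelated), lets every other block repeatedly take the weighted average of its neighbors (an iterative solution of the graph Laplace equation), and then, as an assumption about how an integer weight can be obtained, counts the blocks that two keywords are both associated with. The toy graph, the seed choices, and the 0.4 threshold are all hypothetical.

```python
def laplace_learning(weights, labels, iterations=200):
    """Harmonic label propagation on a weighted graph.

    weights: symmetric adjacency matrix (list of lists) over text blocks.
    labels:  {block_index: fixed_value} for the seed blocks.
    Unlabeled blocks converge to the weighted average of their neighbours.
    """
    n = len(weights)
    u = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            if i in labels:
                continue  # seed values stay fixed
            total = sum(weights[i])
            if total > 0:
                u[i] = sum(w * u[j] for j, w in enumerate(weights[i])) / total
    return u

# Toy chain of five text blocks: 0 - 1 - 2 - 3 - 4, unit similarity between neighbours.
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]

# Hypothetical seeds: keyword A is tied to block 0, keyword B to block 4.
seeds = {"A": {0: 1.0, 4: 0.0}, "B": {4: 1.0, 0: 0.0}}

# A block counts as associated with a keyword when its diffused score exceeds 0.4.
associated = {
    kw: {i for i, score in enumerate(laplace_learning(W, labels)) if score > 0.4}
    for kw, labels in seeds.items()
}

# Integer edge weight between the two keywords: number of shared text blocks.
weight_ab = len(associated["A"] & associated["B"])
print(associated, weight_ab)
```

On this chain the scores fall off linearly from each seed, so the two keywords meet only at the middle block and the edge weight is small; keywords that share many blocks would get a correspondingly larger weight.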

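For this stage, see create_KG.ipynb (https://github.com/wispcarey/AutoKG/blob/main/create_KG.ipynb), which shows how to extract keywords from selected papers and generate a knowledge graph from them.

2. Hybrid graph and text search

Using the graph structure, the authors design a hybrid search scheme that runs a text search based on vector similarity together with a graph-based search for strongly associated keywords. Everything retrieved is merged into the prompt to improve the model's response.

The algorithm is fairly clear. It is a multi-stage search that combines direct text-block search with keyword-based search guided by the knowledge graph:

First, an initial search finds the text blocks whose embeddings are closest to the embedding of the given query.

Then, it turns to the knowledge graph and identifies the keywords closest to the query, along with the text blocks associated with those keywords.

Finally, using the weight matrix of the knowledge graph, it identifies additional keywords that are most strongly connected to the keywords found earlier and retrieves their associated text blocks. The result is not only a set of text blocks highly relevant to the query but also a set of keywords closely connected to it.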
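
The three stages above can be sketched as follows. This is a toy illustration under stated assumptions, not the AutoKG code: embeddings are hand-written 2-D vectors, similarity is plain cosine, and the function and parameter names (hybrid_search, k_blocks, k_keywords, k_neighbours) are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k):
    """Return the k keys of `vectors` whose embeddings are most similar to `query`."""
    return sorted(vectors, key=lambda key: cosine(query, vectors[key]), reverse=True)[:k]

def hybrid_search(query, block_vecs, keyword_vecs, keyword_blocks, kg_weights,
                  k_blocks=1, k_keywords=1, k_neighbours=1):
    # Stage 1: text blocks closest to the query embedding.
    blocks = top_k(query, block_vecs, k_blocks)
    # Stage 2: keywords closest to the query embedding.
    keywords = top_k(query, keyword_vecs, k_keywords)
    # Stage 3: extra keywords with the largest edge weights to those keywords.
    for kw in list(keywords):
        neighbours = kg_weights.get(kw, {})
        for other in sorted(neighbours, key=neighbours.get, reverse=True)[:k_neighbours]:
            if other not in keywords:
                keywords.append(other)
    # Collect the text blocks attached to every selected keyword.
    for kw in keywords:
        for block in sorted(keyword_blocks.get(kw, ())):
            if block not in blocks:
                blocks.append(block)
    return blocks, keywords

block_vecs = {0: (1.0, 0.0), 1: (0.8, 0.6), 2: (0.0, 1.0)}
keyword_vecs = {"ontology": (0.9, 0.1), "schema": (0.5, 0.5), "embedding": (0.1, 0.9)}
keyword_blocks = {"ontology": {0, 1}, "schema": {2}, "embedding": {2}}
kg_weights = {"ontology": {"schema": 3, "embedding": 1}}

blocks, keywords = hybrid_search((1.0, 0.1), block_vecs, keyword_vecs,
                                 keyword_blocks, kg_weights)
print(blocks, keywords)
```

Block 2 is far from the query in embedding space and would be missed by vector search alone; it is reached only through the strongly weighted graph edge from "ontology" to "schema", which is the point of the hybrid scheme.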

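The retrieved content is placed directly into the prompt for the LLM to generate its answer. [Figure: the prompt]

The corresponding implementation is in chat_with_KG.ipynb (https://github.com/wispcarey/AutoKG/blob/main/chat_with_KG.ipynb), which provides an example of question answering with the constructed knowledge graph. [Figure: sample results]

Repository: https://github.com/wispcarey/AutoKG/blob/main/readme.md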

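Summary

This article covered two open-source projects for LLM-based automatic knowledge graph construction, or three in practice. Their approach is, at its core, still LLM-driven extraction against a predefined graph schema.

Beyond these, there is also dspy-neo4j-knowledge-graph, which uses DSPy, Neo4j, and OpenAI's GPT-4 to extract entities and relationships and build a knowledge graph. Given a passage or chunk of text, the application uses the DSPy library and GPT-4 to extract entities and relationships and to generate Cypher statements, which are then run in Neo4j to create the graph. Repository: https://github.com/chrisammon3000/dspy-neo4j-knowledge-graph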
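
To give a feel for the last step of such a pipeline, here is a minimal sketch that turns already-extracted (head, relation, tail) triples into Cypher MERGE statements. It is not taken from dspy-neo4j-knowledge-graph: the to_cypher helper, the Entity label, and the name property are assumptions for illustration, and in that project the Cypher is generated by the LLM rather than by a template.

```python
import re

def to_cypher(triples):
    """Render (head, relation, tail) triples as Cypher MERGE statements."""
    statements = []
    for head, relation, tail in triples:
        # Relationship types cannot contain spaces, so normalise to UPPER_SNAKE_CASE.
        rel_type = re.sub(r"\W+", "_", relation.strip()).upper()
        # Escape single quotes so the names stay valid string literals.
        h = head.replace("'", "\\'")
        t = tail.replace("'", "\\'")
        statements.append(
            f"MERGE (a:Entity {{name: '{h}'}}) "
            f"MERGE (b:Entity {{name: '{t}'}}) "
            f"MERGE (a)-[:{rel_type}]->(b)"
        )
    return statements

for statement in to_cypher([("OpenAI", "developed", "GPT-4"),
                            ("GPT-4", "is a", "language model")]):
    print(statement)
```

MERGE is used instead of CREATE so that running the statements for several text chunks does not duplicate an entity that appears more than once. When executing against a real database, passing the names as query parameters is safer than string interpolation.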