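The paper points out that earlier methods that use an LLM to generate triplets require a predefined schema. When the schema is large or complex, it can easily exceed the LLM's context window. In some cases, moreover, no fixed predefined schema is available at all.
To address knowledge graph construction under these constraints, the paper proposes a three-stage framework called Extract-Define-Canonicalize (EDC): open information extraction first, then schema definition, and finally canonicalization.
1. Open Information Extraction: An LLM performs open information extraction. Given a few-shot prompt, it identifies and extracts relational triplets ([Subject, Relation, Object]) from the input text without relying on any specific schema.
Example OIE prompt: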
Given a piece of text, extract relational triplets in
the form of [Subject, Relation, Object] from it.
Here are some examples:
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3
has a diesel-electric transmission.
Triplets: [[‘ALCO RS-3’, ‘powerType’, ‘Diesel-electric transmission’], [‘ALCO RS-3’, ‘length’,
‘17068.8 (millimetres)’]] ...
Now please extract triplets from the following
text: Alan Shepard was born on Nov 18, 1923
and selected by NASA in 1959. He was a member of the Apollo 14 crew.
Extracted triplets: [‘Alan Shepard’, ‘bornOn’, ‘Nov 18, 1923’], [‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’]
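The extraction stage can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the helper names `build_oie_prompt` and `parse_triplets` are hypothetical, the call to the LLM itself is left out, and the parser assumes the model answers with Python-style lists like the ones shown above.

```python
import ast

# Few-shot example taken from the prompt above.
FEW_SHOT = [
    (
        "The 17068.8 millimeter long ALCO RS-3 has a diesel-electric transmission.",
        [["ALCO RS-3", "powerType", "Diesel-electric transmission"],
         ["ALCO RS-3", "length", "17068.8 (millimetres)"]],
    ),
]


def build_oie_prompt(text, examples=FEW_SHOT):
    """Assemble a few-shot open information extraction prompt (hypothetical helper)."""
    lines = [
        "Given a piece of text, extract relational triplets in the form of "
        "[Subject, Relation, Object] from it.",
        "Here are some examples:",
    ]
    for i, (ex_text, ex_triplets) in enumerate(examples, start=1):
        lines += [f"Example {i}:", f"Text: {ex_text}", f"Triplets: {ex_triplets}"]
    lines.append(f"Now please extract triplets from the following text: {text}")
    return "\n".join(lines)


def parse_triplets(reply):
    """Parse a reply such as "[['a', 'r', 'b'], ...]" into a list of 3-tuples.

    Assumes Python-style list syntax; returns [] when nothing can be parsed.
    """
    reply = reply.replace("‘", "'").replace("’", "'")  # normalise curly quotes
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = ast.literal_eval(reply[start:end + 1])
    except (ValueError, SyntaxError):
        return []
    # A single flat triplet: wrap it so the loop below sees a list of triplets.
    if parsed and all(isinstance(item, str) for item in parsed):
        parsed = [parsed]
    return [tuple(t) for t in parsed if isinstance(t, (list, tuple)) and len(t) == 3]
```

The parser accepts both a nested list and a comma-separated sequence of lists, since the examples in this article show both shapes. Entity names that contain apostrophes would break it, so a production system would likely ask the model for JSON instead.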
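2. Schema Definition: The LLM is prompted to write natural-language definitions for the schema components (such as entity types and relation types) that appear in the extracted triplets. These definitions are passed to the next stage as side information for canonicalization.
Example Schema Definition prompt: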
Given a piece of text and a list of relational triplets
extracted from it, write a definition for each relation present.
Example 1:
Text: The 17068.8 millimeter long ALCO RS-3
has a diesel-electric transmission.
Triplets: [[‘ALCO RS-3’, ‘powerType’, ‘Diesel-electric transmission’], [‘ALCO RS-3’, ‘length’,
‘17068.8 (millimetres)’]]
Definitions:
powerType: The subject entity uses the type of
power or energy source specified by the object
entity.
...
Now write a definition for each relation present
in the triplets extracted from the following text:
Text: Alan Shepard was an American who was
born on Nov 18, 1923 in New Hampshire, was
selected by NASA in 1959, was a member of the
Apollo 14 crew and died in California
Triplets: [[‘Alan Shepard’, ‘bornOn’, ‘Nov 18,
1923’], [‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’]]
Result: (bornOn: The subject entity was born on the date specified by the object entity.) and (participatedIn: The subject entity took part in the event or mission specified by the object entity.)
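The definitions returned in this stage need to be attached to the relations they describe before they can be embedded. A minimal sketch follows, assuming the model replies with one `relation: definition` pair per line; the helper name `parse_definitions` is hypothetical and not taken from the paper.

```python
def parse_definitions(reply, triplets):
    """Map each extracted relation to the definition found in the LLM reply.

    Assumes one "relation: definition" pair per line. Returns the mapping
    together with a sorted list of relations that received no definition.
    """
    relations = {relation for _, relation, _ in triplets}
    definitions = {}
    for line in reply.splitlines():
        name, sep, definition = line.partition(":")
        name = name.strip()
        # Keep only lines that define a relation we actually extracted.
        if sep and name in relations and definition.strip():
            definitions[name] = definition.strip()
    missing = sorted(relations - set(definitions))
    return definitions, missing
```

Returning the missing relations makes it easy to re-prompt for them rather than silently canonicalizing a relation that has no definition.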
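3. Schema Canonicalization: The third stage refines the open knowledge graph (KG) into a canonical form, removing redundancy and ambiguity. Each schema component's definition is first vectorized with a sentence transformer to create an embedding. Canonicalization then proceeds in one of two ways, depending on whether a target schema is available.
Example Schema Canonicalization prompt: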
Given a piece of text, a relational triplet extracted
from it, and the definition of the relation in it,
choose the most appropriate relation to replace it
in this context if there is any.
Text: Alan Shepard was born on Nov 18, 1923
and selected by NASA in 1959. He was a member
of the Apollo 14 crew.
Triplets: [‘Alan Shepard’, ‘participatedIn’,
‘Apollo 14’]
Definition of ‘participatedIn’: The subject entity took part in the event or mission specified by the
object entity.
Choices:
A. ‘mission’: The subject entity participated in
the event or operation specified by the object entity.
B. ‘season’: The subject entity participated in the
season of a series specified by the object entity.
...
F. None of the above
Result: [‘Alan Shepard’, ‘birthDate’, ‘Nov 18, 1923’], [‘Alan Shepard’, ‘mission’, ‘Apollo 14’], which together form the canonicalized knowledge graph.
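The retrieval step that produces the multiple-choice list can be sketched as follows. This is an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for the sentence transformer so that the example runs without extra dependencies, the `birthDate` definition in `TARGET_SCHEMA` is made up for the example, and the function names and the value of `k` are hypothetical.

```python
import math
import re
from collections import Counter

# Small target schema; 'mission' and 'season' come from the prompt above,
# the 'birthDate' definition is an assumption made for this example.
TARGET_SCHEMA = {
    "mission": "The subject entity participated in the event or operation "
               "specified by the object entity.",
    "season": "The subject entity participated in the season of a series "
              "specified by the object entity.",
    "birthDate": "The subject entity was born on the date specified by the "
                 "object entity.",
}


def embed(text):
    """Toy embedding: a bag-of-words count vector (NOT a sentence transformer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_candidates(definition, target_schema, k=2):
    """Return the k target relations whose definitions are closest to `definition`."""
    query = embed(definition)
    ranked = sorted(
        target_schema,
        key=lambda rel: cosine(query, embed(target_schema[rel])),
        reverse=True,
    )
    return ranked[:k]


def build_choices(candidates, target_schema):
    """Format candidates as lettered options, ending with 'None of the above'."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [
        f"{letters[i]}. '{rel}': {target_schema[rel]}"
        for i, rel in enumerate(candidates)
    ]
    lines.append(f"{letters[len(candidates)]}. None of the above")
    return "\n".join(lines)
```

For the definition of `participatedIn` shown above, this toy similarity ranks `mission` first. The final "None of the above" option matters: it lets the LLM reject every retrieved candidate instead of forcing a poor match.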
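EDC+R extends EDC with an additional iterative refinement step that further improves the quality of the knowledge graph. The process resembles RAG: the prompt for the initial extraction stage is augmented with the previously extracted triplets and the relevant parts of the schema. The goal is to use the data produced by an EDC pass to improve the quality of the extracted triplets.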
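One way to picture the refinement step is as a prompt builder that adds hints to the extraction prompt. The sketch below is an assumption-laden illustration: the function name `build_refinement_prompt` and the exact hint wording are hypothetical and are not the prompt used in the paper.

```python
def build_refinement_prompt(text, previous_triplets, candidate_relations):
    """Build an extraction prompt enriched with hints from a previous EDC pass.

    previous_triplets: triplets produced by the earlier pass for this text.
    candidate_relations: mapping of relation name -> definition, e.g. the
        relations returned by a schema retriever.
    The hint wording below is illustrative only.
    """
    triplet_lines = "\n".join(f"- {list(t)}" for t in previous_triplets)
    relation_lines = "\n".join(
        f"- {name}: {definition}"
        for name, definition in candidate_relations.items()
    )
    return (
        "Given a piece of text, extract relational triplets in the form of "
        "[Subject, Relation, Object] from it.\n"
        f"Text: {text}\n"
        "Triplets extracted in the previous round (may be incomplete or wrong):\n"
        f"{triplet_lines}\n"
        "Relations that may be relevant to this text:\n"
        f"{relation_lines}\n"
        "Use the hints above only where the text supports them."
    )
```

Keeping the hints clearly separated from the text makes it easier for the model to treat them as suggestions rather than as facts to copy.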
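The refinement process consists of two main elements:
Role of the Schema Retriever: the Schema Retriever is trainable. It projects schema components and the input text into a vector space so that cosine similarity captures their relevance, that is, how likely a schema component is to appear in the input text.
The training dataset consists of pairs of texts and their corresponding defined relations. An embedding model is fine-tuned to distinguish the correct relations associated with a given text from unrelated ones.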
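The article does not state which training loss is used. As one plausible illustration of "distinguishing the correct relation from unrelated ones", the sketch below computes an InfoNCE-style contrastive loss over cosine similarities for a single training example. The loss choice, the function names, and the `temperature` value are all assumptions, and real fine-tuning would run this objective inside a deep learning framework rather than on plain Python lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(text_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE-style loss for one (text, correct relation) pair.

    The loss is small when the text is closer to its correct relation
    (positive) than to the unrelated relations (negatives).
    """
    sims = [cosine(text_vec, positive_vec)]
    sims += [cosine(text_vec, neg) for neg in negative_vecs]
    logits = [s / temperature for s in sims]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability assigned to the positive relation.
    return log_denominator - logits[0]
```

Minimizing this quantity pulls the embedding of a text toward the definitions of its correct relations and pushes it away from the others, which is what lets cosine similarity act as a relevance score at retrieval time.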