Document and Node in LlamaIndex

Published: 2024-04-27 18:22:39  Views: 4209

Author: PyTorch研习社

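Document and Node are the core abstractions in LlamaIndex.

A Document is a generic container for any data source, such as a PDF, the output of an API, or data retrieved from a database. Documents can be constructed manually or created automatically by the various Readers that LlamaIndex provides. By default, a Document stores text along with a few other attributes:

metadata: a dictionary of annotations that can be attached to the text.
relationships: a dictionary containing relationships to other Documents/Nodes.

A Node represents a "chunk" of a source Document. Like Documents, Nodes carry metadata and relationships attributes.

Nodes are first-class citizens in LlamaIndex. We can define Nodes and all of their attributes directly, or we can parse Documents into Nodes with LlamaIndex's NodeParser classes. By default, every Node derived from a Document inherits that Document's metadata (for example, a "file_name" field recorded on the Document is propagated to every Node).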

Document

By default, every reader returns Document objects from its load_data method.

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()

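Alternatively, we can create Document objects directly:

from llama_index.core import Document
# text1, text2, ... are placeholders for your own strings
text_list = [text1, text2, ...]
documents = [Document(text=t) for t in text_list]

To speed up prototyping and development, we can also quickly create a Document from some default text:

document = Document.example()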

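Document metadata

An important attribute of a Document is metadata, a dictionary that stores information such as the file name and file category. Nodes created from a Document inherit its metadata. The metadata is also used later, when generating embeddings and when calling the LLM.

Note: some vector databases require every key in the metadata dictionary to be a string and every value to be a str, float, or int.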
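The constraint above can be checked before any data is sent to a vector store. The following is a minimal sketch in plain Python; the helper name find_invalid_metadata and the sample metadata are hypothetical and are not part of LlamaIndex.

```python
def find_invalid_metadata(metadata):
    """Return (key, reason) pairs for entries that break the str -> str/float/int rule."""
    problems = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append((key, "key is not a string"))
        # Note: bool is a subclass of int in Python, so True/False pass this check
        elif not isinstance(value, (str, float, int)):
            problems.append((key, f"unsupported value type {type(value).__name__}"))
    return problems


# Hypothetical metadata: the list value under "tags" is the only offending entry
metadata = {"file_name": "report.pdf", "page": 3, "tags": ["finance", "q1"]}
problems = find_invalid_metadata(metadata)
print(problems)
```

A check like this could run right after loading, so that offending values (lists, nested dictionaries, None) are flattened or dropped before indexing.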

There are several ways to set the metadata dictionary:

1. In the Document constructor:

document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)

2. After the Document object has been created:

document.metadata = {"filename": "<doc_file_name>"}

3. Automatically from the file name, using SimpleDirectoryReader with the file_metadata hook. The hook is run automatically for each Document and sets its metadata attribute:

from llama_index.core import SimpleDirectoryReader
filename_fn = lambda filename: {"file_name": filename}
# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader("./data", file_metadata=filename_fn).load_data()

Document ID

The doc_id attribute stores the unique identifier of a Document. Once an index has been built on top of Documents, doc_id lets us refresh the corresponding index entries when a Document's content changes later. When using SimpleDirectoryReader, we can have doc_id set to the full path of each file:

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
print([x.doc_id for x in documents])

Of course, it can also be set to any other value:

document.doc_id = "My new document id!"
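To illustrate why a stable doc_id matters for refreshing, here is a simplified stand-in for an index, written in plain Python. It is not LlamaIndex's refresh logic; the function names, the hash-based change detection, and the file paths are assumptions made for this sketch.

```python
import hashlib


def content_hash(text):
    """Fingerprint the text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(index_state, documents):
    """Update index_state in place; return one bool per document (True = re-indexed).

    index_state maps doc_id -> hash of the text that was last indexed.
    documents is a list of (doc_id, text) pairs.
    """
    refreshed = []
    for doc_id, text in documents:
        new_hash = content_hash(text)
        changed = index_state.get(doc_id) != new_hash
        if changed:
            index_state[doc_id] = new_hash  # a real index would re-embed the text here
        refreshed.append(changed)
    return refreshed


state = {}
# First pass: both documents are new, so both are indexed
first = refresh(state, [("data/a.txt", "hello"), ("data/b.txt", "world")])
# Second pass: only b.txt changed, so only b.txt is re-indexed
second = refresh(state, [("data/a.txt", "hello"), ("data/b.txt", "world, updated")])
print(first, second)
```

Because the IDs stay the same across loads (here, the file paths), unchanged documents can be skipped and only the modified ones are processed again.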

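Advanced metadata management

By default, all metadata is used both when generating embeddings and when calling the LLM to generate an answer. A metadata dictionary usually contains many keys, and we may not want the LLM to read all of them when generating an answer, the file name for instance. The file name can still be useful when generating embeddings, however, because it may carry important information. We can specify directly which metadata keys the LLM should not read, by listing them in the Document's excluded_llm_metadata_keys attribute.

For example, with the file name excluded in this way, the LLM sees only the category metadata and the file content.

We can likewise customize which metadata keys are hidden at the embedding stage, through the excluded_embed_metadata_keys attribute.

We now know that both the metadata and the text content are sent to the embedding model and the LLM, but in what format are they combined? This is controlled by three attributes of the Document: metadata_seperator (the spelling used by the library at the time of writing; newer releases may name it metadata_separator), metadata_template, and text_template. As far as the LlamaIndex documentation of that period states, their defaults are a newline, "{key}: {value}", and "{metadata_str}\n\n{content}" respectively.

All of these attributes can be changed; the values above are only the defaults.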
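The way metadata and text are combined can be pictured with a small plain-Python sketch. This is a simplified stand-in, not LlamaIndex's own code: render_content is a hypothetical helper, the default template values are assumed to match the library's documented defaults, and the sample file name and text are made up.

```python
def render_content(text, metadata, excluded_keys=(),
                   metadata_template="{key}: {value}",
                   metadata_separator="\n",
                   text_template="{metadata_str}\n\n{content}"):
    """Format metadata and text into the single string a model would receive."""
    # Drop the keys that should stay hidden from this consumer
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    # Render each remaining key/value pair, then join the pairs with the separator
    metadata_str = metadata_separator.join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    return text_template.format(metadata_str=metadata_str, content=text)


metadata = {"file_name": "super_secret_document.txt", "category": "finance"}
text = "This is a super-customized document"

# What the LLM would see: file_name is hidden from it
llm_view = render_content(text, metadata, excluded_keys=["file_name"])
# What the embedding model would see: nothing is hidden
embed_view = render_content(text, metadata)
print(llm_view)
print("---")
print(embed_view)
```

Keeping separate exclusion lists for the LLM and for the embedding model lets the same Document present a different view to each consumer without duplicating the data.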

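Node

Nodes are first-class citizens in LlamaIndex. A Node represents a "chunk" of a source Document (a text chunk, an image, and so on). Nodes also contain metadata and relationship information linking them to other Nodes and to index structures.

We can parse Documents into Nodes with a NodeParser class:

from llama_index.core.node_parser import SentenceSplitter
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)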
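To make the idea of parsing and metadata inheritance concrete, here is a toy splitter in plain Python. It is only a sketch of the concept and does not reproduce the algorithm SentenceSplitter uses; the Chunk class, split_into_chunks, and the naive split on ". " are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)
    source_id: str = ""  # ID of the document this chunk came from


def split_into_chunks(doc_id, text, metadata, sentences_per_chunk=2):
    """Split text on '. ' and give every chunk its own copy of the metadata."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        chunk_text = ". ".join(sentences[start:start + sentences_per_chunk])
        # copy the dict so editing one chunk's metadata does not affect the others
        chunks.append(Chunk(text=chunk_text, metadata=dict(metadata), source_id=doc_id))
    return chunks


chunks = split_into_chunks(
    doc_id="doc-1",
    text="First sentence. Second sentence. Third sentence",
    metadata={"file_name": "notes.txt"},
)
for chunk in chunks:
    print(chunk.source_id, chunk.metadata, repr(chunk.text))
```

Each chunk keeps a pointer back to its source document and its own copy of the metadata, which is the property that lets retrieved chunks be traced to the file they came from.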

Alternatively, we can define Nodes and all of their attributes directly:

from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")
# set relationships
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node1.node_id)
nodes = [node1, node2]

The RelatedNodeInfo class can also carry additional metadata if needed:

node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=node1.node_id, metadata={"key": "val"})
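One use of next/previous relationships is to recover the reading order of chunks starting from any node. The sketch below models nodes as plain dictionaries rather than TextNode objects; walk_forward and the record layout are hypothetical and serve only to show the traversal.

```python
def walk_forward(nodes_by_id, start_id):
    """Follow 'next' links from start_id and return the node IDs in reading order."""
    ordered = []
    seen = set()
    current = start_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        ordered.append(current)
        current = nodes_by_id[current]["relationships"].get("next")
    return ordered


# Hypothetical node records: each stores its text and its links to neighbours
nodes_by_id = {
    "n1": {"text": "chunk one", "relationships": {"next": "n2"}},
    "n2": {"text": "chunk two", "relationships": {"previous": "n1", "next": "n3"}},
    "n3": {"text": "chunk three", "relationships": {"previous": "n2"}},
}

order = walk_forward(nodes_by_id, "n1")
print(order)
print(" | ".join(nodes_by_id[node_id]["text"] for node_id in order))
```

Storing only the neighbour's ID, rather than the neighbour itself, keeps each record small and lets the nodes live in any key-value store.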

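Every Node has a node_id attribute, which is generated automatically if it is not specified manually. This ID can be used for a variety of purposes, including updating Nodes in storage and defining relationships between Nodes (through IndexNode).

print(node.node_id)
node.node_id = "My new node_id!"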
