
The Fourth Level of RAG Text Splitting: Semantic Splitting Based on Embedding Models

Published: 2024-08-30 06:21:42  Views: 2376

Author: 哎呀AIYA


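An earlier article divided text splitting into five levels and covered the implementation of the first three, along with some background. Starting with this article, we move on to the fourth level, semantic splitting. This installment covers semantic splitting based on an embedding model.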

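The five levels of text splitting:

Level 1: Character Splitting - simple splitting by character length
Level 2: Recursive Character Text Splitting - split on separators, then merge recursively
Level 3: Document Specific Splitting - splitting tailored to the document format (PDF, Python, Markdown)
Level 4: Semantic Splitting - splitting by meaning
Level 5: Agentic Splitting - automatic splitting driven by an agent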

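This splitter works by deciding where to break between sentences. It measures the embedding difference between each pair of adjacent sentences, and when that difference exceeds a threshold, the text is split at that point. The walkthrough that follows shows how to set it up: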
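The idea described above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not the library's implementation: the function names, the hand-written 2-D vectors, and the threshold of 0.3 are all hypothetical, and a real embedding model would return much longer vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_by_distance(sentences, vectors, threshold):
    """Group consecutive sentences, starting a new chunk whenever the
    distance between neighbouring sentence vectors exceeds the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append([])  # large semantic jump: open a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]


# Hypothetical toy "embeddings": two sentences about cats, two about markets.
sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bonds rallied."]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

chunks = split_by_distance(sentences, vectors, threshold=0.3)
print(chunks)
# -> ['Cats purr. Cats nap a lot.', 'Stocks fell today. Bonds rallied.']
```

The only large distance is between the second and third sentences, so that is where the text is cut.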

Building the semantic splitting pipeline

Loading the data

# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

Creating the splitter

To instantiate SemanticChunker, we must specify an embedding model. Below we use OpenAIEmbeddings, but you can also plug in your own model.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())
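As a rough illustration of "using your own model", the sketch below defines a toy embedding class. Everything in it is hypothetical (the class name HashingEmbeddings, the hashing scheme, the dimension of 64), and it produces no real semantic signal. It only shows the shape of the interface, under the assumption that the splitter needs just the embed_documents and embed_query methods of LangChain's embeddings interface.

```python
import hashlib
import math


class HashingEmbeddings:
    """Toy stand-in for an embedding model (hypothetical, illustration only)."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, text):
        # Count tokens into hashed buckets, then L2-normalise the vector.
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts):
        """Embed a list of texts, returning one vector per text."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        """Embed a single text."""
        return self._embed(text)


embedder = HashingEmbeddings(dim=64)
vectors = embedder.embed_documents(["hello world", "hello there"])
print(len(vectors), len(vectors[0]))
```

If that assumption holds, an instance could be passed in place of OpenAIEmbeddings(), as in SemanticChunker(HashingEmbeddings()). In real use, subclassing langchain_core.embeddings.Embeddings and wrapping an actual model is the safer route.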

Splitting the text

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

That completes embedding-based semantic splitting. Next, we look at the parameters that control it:

Splitting methods

How do we control the granularity of the split? There are several ways to determine the threshold, and they are selected with the breakpoint_threshold_type keyword argument.

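Percentile

The default splitting method is based on percentiles. All differences between adjacent sentences are computed, and a split is made wherever a difference is greater than the X-th percentile.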

text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

print(len(docs))

# 26
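To make the percentile rule concrete, here is a minimal pure-Python sketch. The helper names and the sample distances are hypothetical, the percentile uses linear interpolation, and the default of q=95 is an assumption rather than something this article states.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_breakpoints(distances, q=95):
    """Return indices i where the gap between sentence i and i + 1
    is larger than the q-th percentile of all gaps."""
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between adjacent sentences; two clear topic shifts.
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.13, 0.10, 0.12, 0.11]
breakpoints = percentile_breakpoints(distances, q=80)
print(breakpoints)  # -> [2, 5]
```

Lowering q produces more breakpoints and therefore smaller chunks, while raising it produces fewer and larger ones.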

Standard deviation

In this method, a split is made wherever a difference is more than X standard deviations above the mean.

text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

print(len(docs))

# 4
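The standard deviation rule can be sketched the same way. This is an illustration under assumptions: the threshold is taken to be the mean plus X population standard deviations, and the function name, sample distances, and num_std=2.0 are hypothetical.

```python
import statistics


def stddev_breakpoints(distances, num_std=3.0):
    """Return indices whose distance exceeds mean + num_std * std."""
    threshold = statistics.mean(distances) + num_std * statistics.pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances: nine small gaps and one obvious outlier.
distances = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10, 0.90]
breakpoints = stddev_breakpoints(distances, num_std=2.0)
print(breakpoints)  # -> [9]
```

Because only strong outliers clear this kind of threshold, the method tends to produce few, large chunks, which fits the count of 4 above compared with 26 for the percentile method.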

Interquartile range

In this method, the interquartile range of the distances is used to decide where to split chunks.

text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

print(len(docs))

# 25
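One plausible reading of the interquartile rule is sketched below: the threshold is the mean distance plus a multiple of the interquartile range (Q3 - Q1). The exact formula, the scale factor of 1.5, the helper names, and the sample distances are all assumptions made for illustration.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def iqr_breakpoints(distances, scale=1.5):
    """Return indices whose distance exceeds mean + scale * (Q3 - Q1)."""
    iqr = percentile(distances, 75) - percentile(distances, 25)
    threshold = statistics.mean(distances) + scale * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between adjacent sentences; two clear topic shifts.
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.75, 0.13, 0.10, 0.12, 0.11]
breakpoints = iqr_breakpoints(distances)
print(breakpoints)  # -> [2, 5]
```

The interquartile range ignores the extreme values themselves, so a few very large distances widen the threshold less than they would under the standard deviation rule.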

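Gradient

In this method, the gradient of the distances is used together with the percentile method to split chunks. It is useful when chunks are highly correlated with each other or specific to one domain. The idea is to apply anomaly detection to the gradient array, which widens the distribution and makes boundaries easier to identify in semantically dense data.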

text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="gradient"
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
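The gradient idea can be sketched as follows: take a numerical gradient of the distance sequence, then apply the percentile rule to the gradient values instead of the raw distances. This is an illustrative reading under assumptions. The function names, the central-difference gradient with unit spacing, the sample distances, and q=90 are hypothetical choices.

```python
def gradient(values):
    """Numerical gradient with unit spacing: central differences inside,
    one-sided differences at the two ends (needs at least two values)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    grads = [values[1] - values[0]]
    for i in range(1, n - 1):
        grads.append((values[i + 1] - values[i - 1]) / 2.0)
    grads.append(values[-1] - values[-2])
    return grads


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def gradient_breakpoints(distances, q=95):
    """Return indices where the gradient of the distances exceeds
    the q-th percentile of all gradient values."""
    grads = gradient(distances)
    threshold = percentile(grads, q)
    return [i for i, g in enumerate(grads) if g > threshold]


# Hypothetical distances from a single-domain text: every gap is similar,
# so the one boundary shows up as a bump rather than a huge absolute value.
distances = [0.30, 0.31, 0.30, 0.32, 0.45, 0.31, 0.30, 0.31]
breakpoints = gradient_breakpoints(distances, q=90)
print(breakpoints)  # -> [3]
```

In this toy input the largest gradient sits at index 3, on the rising edge of the bump at index 4, so the rule reacts to how quickly the distance changes rather than to its absolute size.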