
SimpleDirectoryReader in LlamaIndex

Published: 2024-04-28 18:06:06

Author: PyTorch研习社


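The simplest and most direct way to load files from the local filesystem into LlamaIndex is SimpleDirectoryReader.

By default, SimpleDirectoryReader tries to read every file it finds and treats them all as text. Beyond plain text, it explicitly supports the following file types, which are detected automatically from the file extension: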

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB e-book format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint

For JSON files, we need to install the JSON reader package:

pip install llama-index-readers-json

Then import it:

from llama_index.readers.json import JSONReader

The simplest usage is to pass a directory path to the input_dir parameter. SimpleDirectoryReader then reads every file in that directory whose format it supports:

from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()

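When loading many files from a directory, the documents can also be loaded in parallel. Note that multiprocessing behaves differently on Windows than on Linux/MacOS: Windows users may see little or no performance gain, while Linux/MacOS users loading exactly the same set of files do see a gain.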

...
documents = reader.load_data(num_workers=4)
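The value 4 above is only an example. As a minimal sketch, the worker count could be derived from the number of files and the available CPUs. The helper names `pick_num_workers` and `load_in_parallel` and the cap of 8 are assumptions made for illustration and are not part of LlamaIndex.

```python
import os
from pathlib import Path


def pick_num_workers(num_files: int, cap: int = 8) -> int:
    """Hypothetical helper: choose a value for load_data(num_workers=...)."""
    cpu_count = os.cpu_count() or 1
    # Never ask for more workers than there are files, CPUs, or the chosen cap.
    return max(1, min(num_files, cpu_count, cap))


def load_in_parallel(input_dir: str):
    """Load a directory using a worker count derived from its contents."""
    # Imported lazily so pick_num_workers can be used without llama-index installed.
    from llama_index.core import SimpleDirectoryReader

    # Count only the files directly inside input_dir (non-recursive).
    num_files = sum(1 for p in Path(input_dir).iterdir() if p.is_file())
    reader = SimpleDirectoryReader(input_dir=input_dir)
    return reader.load_data(num_workers=pick_num_workers(num_files))
```

Whether parallel loading actually helps depends on the platform and on the file types involved, so it is worth timing both variants on real data.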

Reading files from subdirectories

Set recursive=True to also read files in subdirectories:

SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)

Iterating over files

The iter_data() method lets you process the files one at a time as they are loaded:

reader = SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
all_docs = []
for docs in reader.iter_data():
    # <do something with the documents per file>
    all_docs.extend(docs)
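One practical use of per-file iteration is stopping early. The sketch below collects documents until a limit is reached, on the assumption that iter_data() yields one list of documents per file lazily, so files after the stopping point are not read. `load_up_to` is a hypothetical helper name; it accepts any iterable of lists.

```python
from typing import Iterable, List


def load_up_to(batches: Iterable[list], max_docs: int) -> List:
    """Collect documents file by file, stopping once max_docs is reached."""
    collected: list = []
    for docs in batches:  # one list of documents per file
        collected.extend(docs)
        if len(collected) >= max_docs:
            break  # stop consuming the iterator; later files are skipped
    return collected[:max_docs]
```

With a reader in hand, this would be called as `load_up_to(reader.iter_data(), max_docs=50)`.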

Specifying which files to read

To read only specific files, pass their paths as a list:

SimpleDirectoryReader(input_files=["path/to/file1", "path/to/file2"])

Use the exclude parameter to skip specific files in a directory; all other files are still read:

SimpleDirectoryReader(input_dir="path/to/directory", exclude=["path/to/file1", "path/to/file2"])

Use the required_exts parameter to specify which file types to read; files of any other type are skipped:

SimpleDirectoryReader(input_dir="path/to/directory", required_exts=[".pdf", ".docx"])

You can also limit the number of files to read:

SimpleDirectoryReader(input_dir="path/to/directory", num_files_limit=100)
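When the built-in filters are not enough, the file list can be built by hand and passed through input_files. The following is a minimal sketch using only the standard library. `select_files` is a hypothetical helper, and its matching rules (exact suffix match, fnmatch patterns on the file name) are assumptions chosen for this example, not a description of how SimpleDirectoryReader filters internally.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import List, Optional, Sequence


def select_files(
    input_dir: str,
    required_exts: Optional[Sequence[str]] = None,
    exclude: Sequence[str] = (),
    limit: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: build an explicit file list for input_files."""
    selected = []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        # Keep only the requested extensions, if any were given.
        if required_exts is not None and path.suffix not in required_exts:
            continue
        # Drop files whose name matches any exclude pattern.
        if any(fnmatch(path.name, pattern) for pattern in exclude):
            continue
        selected.append(str(path))
    return selected if limit is None else selected[:limit]
```

The result can then be handed to the reader, for example `SimpleDirectoryReader(input_files=select_files("path/to/directory", required_exts=[".pdf"]))`.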

File encoding

By default, files are expected to be UTF-8 encoded, but a different encoding can be specified with the encoding parameter:

SimpleDirectoryReader(input_dir="path/to/directory", encoding="latin-1")
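Before choosing an encoding, it can help to find out which files are not valid UTF-8 in the first place. This is a small standard-library sketch; `find_non_utf8_files` is a hypothetical helper and is independent of LlamaIndex.

```python
from pathlib import Path
from typing import List


def find_non_utf8_files(input_dir: str) -> List[str]:
    """Return files under input_dir whose bytes do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(str(path))
    return bad
```

Binary formats such as PDF or PNG will usually be reported as well, so this check is only meaningful for directories of plain-text files.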

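Extracting metadata

The file_metadata parameter takes a function that is called for each file, extracts its metadata, and has that metadata attached to the resulting Document objects:

def get_meta(file_path):
    return {"foo": "bar", "file_path": file_path}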

SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)

The function should take a single argument (here, the file path) and return a dictionary of metadata.
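A slightly more realistic metadata function might read basic facts from the filesystem. The sketch below assumes the files are local, since it relies on os.stat. The name `get_file_meta` and the chosen keys are illustrative, not something LlamaIndex requires.

```python
import os
from datetime import datetime, timezone


def get_file_meta(file_path: str) -> dict:
    """Return name, size, and last-modified date for a local file."""
    stat = os.stat(file_path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(file_path),
        "file_size": stat.st_size,  # size in bytes
        "last_modified": modified.strftime("%Y-%m-%d"),
    }
```

It is passed in the same way as get_meta: `SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_file_meta)`.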

Extending to other file types

First, subclass BaseReader to implement a class that can read the new file type, then pass a dictionary mapping file extensions to reader instances through the file_extractor parameter. For example, to add custom support for .myfile files:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document

class MyFileReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            text = f.read()
        # load_data returns a list of Document objects
        return [Document(text=text + "Foobar", extra_info=extra_info or {})]

reader = SimpleDirectoryReader(input_dir="./data", file_extractor={".myfile": MyFileReader()})
documents = reader.load_data()

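A BaseReader subclass should read the file and return a list of Document objects.

Note: this mapping overrides the default file extractor for each file type we specify, so if we still want the default handling for those types, we need to add it back ourselves.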

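External filesystems

With the fs parameter, we can traverse a remote filesystem.

The fs argument can be any filesystem object that implements the fsspec protocol. fsspec has open-source implementations for a variety of remote filesystems, including AWS S3, Azure Blob and DataLake, Google Drive, SFTP, and others.

For example, to read from an S3 filesystem: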

from s3fs import S3FileSystem
s3_fs = S3FileSystem(key="...", secret="...")
bucket_name = "my-document-bucket"
reader = SimpleDirectoryReader(
    input_dir=bucket_name,
    fs=s3_fs,
    recursive=True,  # recursively searches all subdirectories
)
documents = reader.load_data()