
How to Configure GraphRAG to Process CSV Files

Published: 2024-07-29 15:10:24

Author: 深入LLM Agent应用开发


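Readers in the group chat often ask how to get GraphRAG to process CSV files. If you only fill in the generated settings.yaml template, indexing will not succeed. For example, a configuration like this is not enough: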

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv"

Why not? Let's find out.

I have set up a discussion group for LLM Agent applications and GraphRAG. If you would like to join, reply "加群" (join group) to the official account.

1. Configuring CSV file input

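GraphRAG's index input code is located at graphrag/index/config/input.py. It currently supports loading CSV files and plain-text (txt) files, so if you want to load something like PDF, this is where the corresponding code would have to be added. Back to the topic: let's look at the code in csv.py.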

async def load_file(path: str, group: dict | None) -> pd.DataFrame:
    ....
    # Generate an id for each row when the file has no "id" column
    if "id" not in data.columns:
        data["id"] = data.apply(lambda x: gen_md5_hash(x, x.keys()), axis=1)
    # Read the configured source column and store it as the "source" column
    if csv_config.source_column is not None and "source" not in data.columns:
        ...
        else:
            data["source"] = data.apply(
                lambda x: x[csv_config.source_column], axis=1
            )
    # Read the configured text column and store it as the "text" column
    if csv_config.text_column is not None and "text" not in data.columns:
        ...
        else:
            data["text"] = data.apply(lambda x: x[csv_config.text_column], axis=1)
    # Read the configured title_column and store it as the "title" column
    if csv_config.title_column is not None and "title" not in data.columns:
        ...
            data["title"] = data.apply(lambda x: x[csv_config.title_column], axis=1)
    # Read the configured timestamp_column and parse it into "timestamp"
    if csv_config.timestamp_column is not None:
        ...
        else:
            data["timestamp"] = pd.to_datetime(
                data[csv_config.timestamp_column], format=fmt
            )
    return data
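To make the mapping in this excerpt concrete, here is a minimal standalone sketch of the same idea in plain pandas. It is not GraphRAG's code: `md5_of_row` and `map_columns` are hypothetical helpers written for illustration, and the sample rows are made up. The elided branches of the original (`...`) are not reproduced.

```python
import hashlib
from typing import Optional

import pandas as pd


def md5_of_row(row: pd.Series) -> str:
    """Hypothetical stand-in for gen_md5_hash: hash every cell of the row."""
    joined = "".join(str(row[key]) for key in row.keys())
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def map_columns(
    data: pd.DataFrame,
    source_column: Optional[str] = None,
    text_column: Optional[str] = None,
    title_column: Optional[str] = None,
) -> pd.DataFrame:
    """Copy the configured columns into id/source/text/title columns."""
    data = data.copy()
    if "id" not in data.columns:
        data["id"] = data.apply(md5_of_row, axis=1)
    # A column is only filled in when it is configured and not already present
    if source_column is not None and "source" not in data.columns:
        data["source"] = data[source_column]
    if text_column is not None and "text" not in data.columns:
        data["text"] = data[text_column]
    if title_column is not None and "title" not in data.columns:
        data["title"] = data[title_column]
    return data


# Made-up sample data with the column names used later in this article
df = pd.DataFrame(
    {
        "Source": ["blog", "wiki"],
        "Text": ["GraphRAG reads CSV.", "Each row is one document."],
    }
)
mapped = map_columns(
    df, source_column="Source", text_column="Text", title_column="Source"
)
print(mapped[["id", "source", "text", "title"]])
```

The point of the sketch is that without a configured text column (or a column literally named text in the file), nothing fills in the text column, which is why the bare template configuration is not enough.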

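So to process a CSV file, we have to tell GraphRAG through configuration which columns hold the text, title, source, and timestamp. Alternatively, you can edit the CSV file itself so that it already contains columns with those names. What options are available in the configuration?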

type: The type of input to use. Options are file or blob.
file_type: The file type field discriminates between the different input types. Options are csv and text.
base_dir: The base directory to read the input files from. This is relative to the config file.
file_pattern: A regex to match the input files. The regex must have named groups for each of the fields in the file_filter.
post_process: A DataShaper workflow definition to apply to the input before executing the primary workflow.
source_column (type: csv only): The column containing the source/author of the data
text_column (type: csv only): The column containing the text of the data
timestamp_column (type: csv only): The column containing the timestamp of the data
timestamp_format (type: csv only): The format of the timestamp
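As a small illustration of the timestamp options: in the excerpt above, the timestamp column is parsed with pd.to_datetime(..., format=fmt). The sketch below assumes that fmt is the timestamp_format string and uses a made-up column named Date; `format_matches` is a hypothetical helper, not part of GraphRAG.

```python
import pandas as pd


def format_matches(values: pd.Series, fmt: str) -> bool:
    """Return True when every value parses with the given strftime format."""
    try:
        pd.to_datetime(values, format=fmt)
    except ValueError:
        return False
    return True


# Made-up sample column; "Date" is not a name GraphRAG requires
df = pd.DataFrame({"Date": ["2024-07-29 15:10:24", "2024-07-30 08:00:00"]})
fmt = "%Y-%m-%d %H:%M:%S"

df["timestamp"] = pd.to_datetime(df["Date"], format=fmt)
print(df["timestamp"].dt.year.tolist())

# A format that does not describe the data fails to parse
print(format_matches(df["Date"], "%d/%m/%Y"))
```

Checking the format against a few rows like this before indexing is cheaper than discovering a parse error partway through a run.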

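If you need a timestamp column, you must also set timestamp_format to tell GraphRAG how to parse it; the parsing code is shown above.

So for a CSV file that has a Text column and a Source column, the following configuration is all we need: it uses Text as the text column, Source as the source column, and Source again as the title column.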

input:
  type: file # or blob
  file_type: csv # or text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.csv"
  source_column: Source
  text_column: Text
  title_column: Source
2. Start indexing

poetry run poe index --root .

Indexing then runs to completion.

3. Testing

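Now for testing. I recently built a streaming server for GraphRAG and modified parts of the GraphRAG code so that it starts returning output within about a second. Compared with querying from the command line, where waits of more than ten seconds were common, the experience is noticeably smoother.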

Start the web service, then download cherry-studio and configure the API endpoint and model.

python -m uvicorn webserver.main:app --reload --port 20213

4. Summary

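This post covered how to configure CSV file input for GraphRAG and then ran query tests through a self-written web service, with a smooth experience. In the next post, I will cover how to implement streaming output with near-instant query responses, along with the UI configuration.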

Reference links:

cherry-studio: https://cherry-ai.com/
