The demand for automating document processing keeps growing, and handling PDF documents effectively has become a key task. With the rapid advance of artificial intelligence, large language models (LLMs) such as ChatGPT have achieved remarkable results in natural language processing, and automated document processing is one of the biggest beneficiaries of this shift. Traditional text-based approaches, however, run into trouble with PDFs, especially with non-text elements such as images and tables. In this post, we look at how to use Gemini to build an AI pipeline for PDF documents that extracts information accurately and efficiently.
PDF (Portable Document Format) is widely used because it guarantees that a document looks the same across platforms and devices. A PDF is a collection of characters, images, and lines with precise coordinates; it has no inherent "text" structure and is designed to be viewed as-is rather than processed as text. As a result, purely text-based approaches lose much of the layout and visual information, and with it important context.
For example, the tables, charts, and images in a PDF often carry data and visual cues that are essential for understanding the document. Traditional text-processing tools cannot reliably extract or interpret them, which leads to incomplete or misleading results.
Multimodal large language models emerged to close this gap. Gemini is one of them: it can process multiple modalities, including text, code, and images. This makes a much simpler approach to PDFs possible, in which a single model handles every task.
Compared with purely text-based methods, Gemini can understand page layout, identify tables, images, and text blocks, and convert them into formats usable by downstream tasks. This improves the accuracy of document processing and greatly simplifies the design and implementation of the pipeline.
The pdf2image library extracts each page of the PDF as a PIL image, and each image is then encoded as a Base64 JPEG so it can be attached to an LLM request. This step ensures that document pages reach the model in a format it can process and lays the groundwork for the segmentation and summarization that follow. For example, when processing a financial report full of charts, this step converts every page into an image while preserving the integrity and clarity of the charts.

from document_ai_agents.document_utils import extract_images_from_pdf
from document_ai_agents.image_utils import pil_image_to_base64_jpeg
from pathlib import Path

class DocumentParsingAgent:
    @classmethod
    def get_images(cls, state):
        """
        Extract pages of a PDF as Base64-encoded JPEG images.
        """
        assert Path(state.document_path).is_file(), "File does not exist"
        # Extract images from PDF
        images = extract_images_from_pdf(state.document_path)
        assert images, "No images extracted"
        # Convert images to Base64-encoded JPEG
        pages_as_base64_jpeg_images = [pil_image_to_base64_jpeg(x) for x in images]
        return {"pages_as_base64_jpeg_images": pages_as_base64_jpeg_images}
Custom Pydantic classes such as LayoutElements and DetectedLayoutItem define the output structure, specifying the type of each layout element (table, figure, text block, and so on) and its corresponding summary.

from pydantic import BaseModel, Field
from typing import Literal
import json
import google.generativeai as genai
from langchain_core.documents import Document

class DetectedLayoutItem(BaseModel):
    """
    Schema for each detected layout element on a page.
    """
    element_type: Literal["Table", "Figure", "Image", "Text-block"] = Field(
        ...,
        description="Type of detected item. Examples: Table, Figure, Image, Text-block.",
    )
    summary: str = Field(..., description="A detailed description of the layout item.")

class LayoutElements(BaseModel):
    """
    Schema for the list of layout elements on a page.
    """
    layout_items: list[DetectedLayoutItem] = []

class FindLayoutItemsInput(BaseModel):
    """
    Input schema for processing a single page.
    """
    document_path: str
    base64_jpeg: str
    page_number: int

class DocumentParsingAgent:
    def __init__(self, model_name="gemini-1.5-flash-002"):
        """
        Initialize the LLM with the appropriate schema.
        """
        layout_elements_schema = prepare_schema_for_gemini(LayoutElements)
        self.model_name = model_name
        self.model = genai.GenerativeModel(
            self.model_name,
            generation_config={
                "response_mime_type": "application/json",
                "response_schema": layout_elements_schema,
            },
        )

    def find_layout_items(self, state: FindLayoutItemsInput):
        """
        Send a page image to the LLM for segmentation and summarization.
        """
        messages = [
            f"Find and summarize all the relevant layout elements in this PDF page in the following format: "
            f"{LayoutElements.schema_json()}. "
            f"Tables should have at least two columns and at least two rows. "
            f"The coordinates should overlap with each layout item.",
            {"mime_type": "image/jpeg", "data": state.base64_jpeg},
        ]
        # Send the prompt to the LLM
        result = self.model.generate_content(messages)
        data = json.loads(result.text)
        # Convert the JSON output into documents
        documents = [
            Document(
                page_content=item["summary"],
                metadata={
                    "page_number": state.page_number,
                    "element_type": item["element_type"],
                    "document_path": state.document_path,
                },
            )
            for item in data["layout_items"]
        ]
        return {"documents": documents}
Each page is dispatched in parallel to the find_layout_items function. This matters most for large, multi-page PDFs, where parallelism significantly shortens processing time. For example, when parsing an academic book several hundred pages long, parallel processing makes full use of the available compute and finishes page segmentation and summarization quickly.

from langgraph.types import Send
class DocumentParsingAgent:
    @classmethod
    def continue_to_find_layout_items(cls, state):
        """
        Generate tasks to process each page in parallel.
        """
        return [
            Send(
                "find_layout_items",
                FindLayoutItemsInput(
                    base64_jpeg=base64_jpeg,
                    page_number=i,
                    document_path=state.document_path,
                ),
            )
            for i, base64_jpeg in enumerate(state.pages_as_base64_jpeg_images)
        ]
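The graph built below runs over a DocumentLayoutParsingState, which the article does not show. A minimal sketch of what such a state model could look like, with field names inferred from how the state is used in the surrounding code; treating documents as an accumulating list via Annotated/operator.add is an assumption about how the parallel page results get merged:

import operator
from typing import Annotated

from langchain_core.documents import Document
from pydantic import BaseModel

class DocumentLayoutParsingState(BaseModel):
    # Path of the PDF to parse.
    document_path: str
    # One Base64-encoded JPEG per page, filled in by get_images.
    pages_as_base64_jpeg_images: list[str] = []
    # Layout summaries returned by find_layout_items; Annotated with operator.add
    # so LangGraph merges the outputs of the parallel page tasks into one list.
    documents: Annotated[list[Document], operator.add] = []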
from langgraph.graph import StateGraph, START, END

class DocumentParsingAgent:
    def build_agent(self):
        """
        Build the agent workflow using a state graph.
        """
        builder = StateGraph(DocumentLayoutParsingState)
        # Add nodes for image extraction and layout item detection
        builder.add_node("get_images", self.get_images)
        builder.add_node("find_layout_items", self.find_layout_items)
        # Define the flow of the graph
        builder.add_edge(START, "get_images")
        builder.add_conditional_edges("get_images", self.continue_to_find_layout_items)
        builder.add_edge("find_layout_items", END)
        self.graph = builder.compile()
if __name__ == "__main__":
    _state = DocumentLayoutParsingState(
        document_path="path/to/document.pdf"
    )
    agent = DocumentParsingAgent()
    # Step 1: Extract images from PDF
    result_images = agent.get_images(_state)
    _state.pages_as_base64_jpeg_images = result_images["pages_as_base64_jpeg_images"]
    # Step 2: Process the first page (as an example)
    result_layout = agent.find_layout_items(
        FindLayoutItemsInput(
            base64_jpeg=_state.pages_as_base64_jpeg_images[0],
            page_number=0,
            document_path=_state.document_path,
        )
    )
    # Display the results
    for item in result_layout["documents"]:
        print(item.page_content)
        print(item.metadata["element_type"])
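The example above calls each node by hand for a single page. Once build_agent has been called, the compiled LangGraph graph can also be run end to end over every page; a short sketch assuming the state model described earlier:

if __name__ == "__main__":
    agent = DocumentParsingAgent()
    agent.build_agent()
    # Run the whole pipeline: get_images fans out into one find_layout_items
    # task per page, and LangGraph merges the per-page summaries.
    result = agent.graph.invoke(
        DocumentLayoutParsingState(document_path="path/to/document.pdf")
    )
    print(len(result["documents"]), "layout summaries extracted")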
A vector database such as ChromaDB indexes the summaries produced by Agent 1. Along with the summary text, the index keeps important metadata such as the document path and page number, so results can be traced back and cited later. When a user asks for specific information, this metadata makes it possible to locate the relevant pages quickly and supply accurate context. Before indexing, the agent checks whether the document has already been indexed, avoiding duplicate work.
class DocumentRAGAgent:
    def index_documents(self, state: DocumentRAGState):
        """
        Index the parsed documents into the vector store.
        """
        assert state.documents, "Documents should have at least one element"
        # Check if the document is already indexed
        if self.vector_store.get(where={"document_path": state.document_path})["ids"]:
            logger.info(
                "Documents for this file are already indexed, exiting this node"
            )
            return  # Skip indexing if already done
        # Add parsed documents to the vector store
        self.vector_store.add_documents(state.documents)
        logger.info(f"Indexed {len(state.documents)} documents for {state.document_path}")
class DocumentRAGAgent:
    def answer_question(self, state: DocumentRAGState):
        """
        Retrieve relevant chunks and generate a response to the user's question.
        """
        # Retrieve the top-k relevant documents based on the query
        relevant_documents: list[Document] = self.retriever.invoke(state.question)
        # Retrieve corresponding page images (avoid duplicates)
        images = list(
            set(
                [
                    state.pages_as_base64_jpeg_images[doc.metadata["page_number"]]
                    for doc in relevant_documents
                ]
            )
        )
        logger.info(f"Responding to question: {state.question}")
        # Construct the prompt: combine images, relevant summaries, and the question
        messages = (
            [{"mime_type": "image/jpeg", "data": base64_jpeg} for base64_jpeg in images]
            + [doc.page_content for doc in relevant_documents]
            + [
                f"Answer this question using the context images and text elements only: {state.question}",
            ]
        )
        # Generate the response using the LLM
        response = self.model.generate_content(messages)
        return {"response": response.text, "relevant_documents": relevant_documents}