
Building a Document AI Pipeline for PDFs with Gemini: Principles, Implementation, and Applications (with Code)

Published: 2024-12-19 08:16:56

Author: 大模型之路


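Demand for automated document processing keeps growing, and handling PDF documents effectively has become a key task (see also: "ParseStudio: Simplifying PDF Document Parsing with a Unified Syntax"). With the rapid progress of artificial intelligence, large language models (LLMs) such as ChatGPT have achieved notable results in natural language processing, and automated document processing is one of the areas that has benefited most. However, traditional text-based processing runs into several problems with PDFs, chiefly the difficulty of handling non-text elements such as images and tables. This article looks at how to build an AI pipeline for PDF documents with Gemini, with the goal of efficient and accurate document processing and information extraction.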

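1. Challenges of PDF Document Processing

PDF (Portable Document Format) is a widely used document format designed to keep a document consistent and readable across platforms and devices. A PDF is a collection of characters, images, and lines together with their exact coordinates. It has no inherent "text" structure; it is meant to be viewed as-is rather than processed as text. As a result, a text-only approach to PDFs (see also: "Exploring Docling: An Efficient and Secure PDF Parsing Tool") discards much of the layout and most of the visual elements, and with them important context and information.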

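For example, tables, charts, and images in a PDF often carry important data and visual cues that are essential for understanding the document. Traditional text processing tools cannot extract or interpret this information effectively, which leads to incomplete or misread results.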
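To make the point concrete, here is a minimal sketch in plain Python. The `fragments` list and both helper functions are hypothetical and not part of any PDF library; they only mimic the idea that a PDF stores positioned pieces of content rather than structured text.

```python
# Hypothetical positioned fragments, as (x, y, text). A PDF stores content in
# a similar spirit: pieces of text plus coordinates, with no table structure.
fragments = [
    (300, 700, "Revenue"),
    (100, 700, "Quarter"),
    (100, 680, "Q1"),
    (300, 680, "120"),
    (100, 660, "Q2"),
    (300, 660, "135"),
]


def naive_text(items):
    """Concatenate fragments in storage order, ignoring their coordinates."""
    return " ".join(text for _, _, text in items)


def rows_by_position(items):
    """Rebuild rows by grouping on y (top to bottom) and sorting on x."""
    rows = {}
    for x, y, text in items:
        rows.setdefault(y, []).append((x, text))
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items(), reverse=True)
    ]


print(naive_text(fragments))
print(rows_by_position(fragments))
```

The flat string loses the column alignment, while the position-aware version recovers the rows. Real documents are far messier than this toy case (merged cells, multi-column pages, figures), which is the gap a layout-aware model is meant to close.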

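2. Advantages of Gemini as a Multimodal LLM

Multimodal large language models were developed to address these challenges. Gemini is one of them: it can process several modalities, including text, code, and images. This makes a simpler approach to PDF processing possible, in which a single model handles every task.

Compared with traditional text-based methods, Gemini can understand page layout, identify tables, images, and text blocks, and convert them into a format usable by downstream tasks. This improves the accuracy of document processing and also simplifies the design and implementation of the pipeline.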

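3. Steps for Building the Document AI Pipeline

(1) Page Segmentation and Summarization (Agent 1)

Extracting PDF pages as images

Use the pdf2image library to extract each page of the PDF as a PIL image, then encode each image as Base64 so that it can be added to an LLM request. This step puts the pages into a format the model can consume and lays the groundwork for the segmentation and summarization that follow. For example, when processing a financial report PDF that contains many charts, this step converts every page into an image while keeping the charts complete and legible (see also: "MinerU: An Open-Source Solution for Accurate PDF Parsing").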

from pathlib import Path

from document_ai_agents.document_utils import extract_images_from_pdf
from document_ai_agents.image_utils import pil_image_to_base64_jpeg


class DocumentParsingAgent:
    @classmethod
    def get_images(cls, state):
        """
        Extract pages of a PDF as Base64-encoded JPEG images.
        """
        assert Path(state.document_path).is_file(), "File does not exist"
        # Extract images from PDF
        images = extract_images_from_pdf(state.document_path)
        assert images, "No images extracted"
        # Convert images to Base64-encoded JPEG
        pages_as_base64_jpeg_images = [pil_image_to_base64_jpeg(x) for x in images]
        return {"pages_as_base64_jpeg_images": pages_as_base64_jpeg_images}
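The two helpers imported above come from the author's `document_ai_agents` package, and their bodies are not shown in the article. The following is a minimal sketch of what such helpers might look like, assuming pdf2image's `convert_from_path` for rendering and a PIL-style image object for encoding. The `dpi` and `quality` defaults are illustrative assumptions, not values taken from the original.

```python
import base64
import io


def extract_images_from_pdf(pdf_path, dpi=150):
    """Render every page of a PDF to a PIL image (needs pdf2image and poppler)."""
    # Imported lazily so the encoding helper below works without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def pil_image_to_base64_jpeg(image, quality=85):
    """Encode a PIL-style image as a Base64 string of its JPEG bytes."""
    buffer = io.BytesIO()
    # JPEG has no alpha channel, so normalise to RGB before saving.
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

A higher `dpi` keeps small table text legible at the cost of larger requests, so the value is worth tuning per document type.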

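Segmenting and summarizing with an LLM

The Base64-encoded images are sent to a multimodal LLM such as Gemini. Defining explicit input and output structures helps the model identify the different layout elements on a page and produce a summary for each one. For a page that contains a table, the model can recognize the table structure and summarize its content; for a chart, it can generate a descriptive summary from the image. Concretely, the output structure is defined through the custom classes LayoutElements and DetectedLayoutItem, which specify the type of each layout element (table, figure, image, or text block) and its summary.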

import json
from typing import Literal

import google.generativeai as genai
from langchain_core.documents import Document
from pydantic import BaseModel, Field

# Note: prepare_schema_for_gemini is a helper from the author's
# document_ai_agents package; its import is not shown in the original.


class DetectedLayoutItem(BaseModel):
    """
    Schema for each detected layout element on a page.
    """
    element_type: Literal["Table", "Figure", "Image", "Text-block"] = Field(
        ...,
        description="Type of detected item. Examples: Table, Figure, Image, Text-block."
    )
    summary: str = Field(..., description="A detailed description of the layout item.")


class LayoutElements(BaseModel):
    """
    Schema for the list of layout elements on a page.
    """
    layout_items: list[DetectedLayoutItem] = []


class FindLayoutItemsInput(BaseModel):
    """
    Input schema for processing a single page.
    """
    document_path: str
    base64_jpeg: str
    page_number: int


class DocumentParsingAgent:
    def __init__(self, model_name="gemini-1.5-flash-002"):
        """
        Initialize the LLM with the appropriate schema.
        """
        layout_elements_schema = prepare_schema_for_gemini(LayoutElements)
        self.model_name = model_name
        self.model = genai.GenerativeModel(
            self.model_name,
            generation_config={
                "response_mime_type": "application/json",
                "response_schema": layout_elements_schema,
            },
        )

    def find_layout_items(self, state: FindLayoutItemsInput):
        """
        Send a page image to the LLM for segmentation and summarization.
        """
        messages = [
            f"Find and summarize all the relevant layout elements in this PDF page in the following format: "
            f"{LayoutElements.schema_json()}. "
            f"Tables should have at least two columns and at least two rows. "
            f"The coordinates should overlap with each layout item.",
            {"mime_type": "image/jpeg", "data": state.base64_jpeg},
        ]
        # Send the prompt to the LLM
        result = self.model.generate_content(messages)
        data = json.loads(result.text)

        # Convert the JSON output into documents
        documents = [
            Document(
                page_content=item["summary"],
                metadata={
                    "page_number": state.page_number,
                    "element_type": item["element_type"],
                    "document_path": state.document_path,
                },
            )
            for item in data["layout_items"]
        ]
        return {"documents": documents}
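The code above passes the model output straight to `json.loads` and indexes into the result, so a malformed or partial response would raise. As a hedged illustration of a more defensive step, the sketch below parses a response and keeps only items that match the schema. The function name `parse_layout_items` is hypothetical and is not part of the original code.

```python
import json

# The element types allowed by the DetectedLayoutItem schema.
ALLOWED_TYPES = {"Table", "Figure", "Image", "Text-block"}


def parse_layout_items(raw_text):
    """Return the valid layout items from a model response, or [] if unusable."""
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("layout_items", [])
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        element_type = item.get("element_type")
        summary = item.get("summary")
        # Skip items with an unknown type or an empty or missing summary.
        if not isinstance(element_type, str) or element_type not in ALLOWED_TYPES:
            continue
        if not isinstance(summary, str) or not summary.strip():
            continue
        valid.append({"element_type": element_type, "summary": summary})
    return valid
```

Since the classes above are already Pydantic models, validating the response with `LayoutElements` would be a natural alternative; the plain-Python version is used here only to keep the sketch dependency-free.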

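Processing pages in parallel

To speed things up, the pages of the document are processed in parallel. A list of tasks is created, and each page is sent as an independent task to the find_layout_items function. This matters most for large multi-page PDFs, where it can shorten total processing time considerably. For example, when processing an academic book of several hundred pages, page segmentation and summarization can run concurrently instead of one page at a time.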

from langgraph.types import Send


class DocumentParsingAgent:
    @classmethod
    def continue_to_find_layout_items(cls, state):
        """
        Generate tasks to process each page in parallel.
        """
        return [
            Send(
                "find_layout_items",
                FindLayoutItemsInput(
                    base64_jpeg=base64_jpeg,
                    page_number=i,
                    document_path=state.document_path,
                ),
            )
            for i, base64_jpeg in enumerate(state.pages_as_base64_jpeg_images)
        ]
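The fan-out pattern itself does not depend on LangGraph. As a rough illustration, the sketch below applies the same idea with the standard library's `ThreadPoolExecutor` instead of `Send`. Here `summarize_page` is a hypothetical stand-in for `find_layout_items` and makes no LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_page(page_number, base64_jpeg):
    """Stand-in for find_layout_items; a real version would call the LLM."""
    return {"page_number": page_number, "size": len(base64_jpeg)}


def process_pages(pages, max_workers=4):
    """Process all pages concurrently and return results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(summarize_page, i, page) for i, page in enumerate(pages)
        ]
        # Reading the futures in submission order keeps page order stable.
        return [future.result() for future in futures]


results = process_pages(["aaaa", "bb", "cccccc"])
print(results)
```

Threads suit this workload because each task mostly waits on a network call. In practice `max_workers` would also need to respect the API's rate limits.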

The complete workflow:

from langgraph.graph import StateGraph, START, END


class DocumentParsingAgent:
    def build_agent(self):
        """
        Build the agent workflow using a state graph.
        """
        builder = StateGraph(DocumentLayoutParsingState)

        # Add nodes for image extraction and layout item detection
        builder.add_node("get_images", self.get_images)
        builder.add_node("find_layout_items", self.find_layout_items)
        # Define the flow of the graph
        builder.add_edge(START, "get_images")
        builder.add_conditional_edges("get_images", self.continue_to_find_layout_items)
        builder.add_edge("find_layout_items", END)

        self.graph = builder.compile()
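The graph refers to `DocumentLayoutParsingState`, whose definition is not shown in the article. Because every parallel `find_layout_items` branch returns its own `{"documents": [...]}`, the state needs some rule for combining those lists. LangGraph lets a state field declare a reducer for this, for example `Annotated[list, operator.add]`. The sketch below imitates that merge in plain Python; `ParsingState` and `merge_branch_outputs` are hypothetical names, and the field layout is an assumption inferred from how the state is used elsewhere in the code.

```python
import operator
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class ParsingState:
    """Hypothetical stand-in for DocumentLayoutParsingState."""

    document_path: str
    pages_as_base64_jpeg_images: list = field(default_factory=list)
    documents: list = field(default_factory=list)


def merge_branch_outputs(state, branch_outputs):
    """Append the 'documents' list of every branch, as operator.add would."""
    new_docs = reduce(
        operator.add, (out["documents"] for out in branch_outputs), []
    )
    state.documents = state.documents + new_docs
    return state


state = ParsingState(document_path="path/to/document.pdf")
merge_branch_outputs(
    state,
    [
        {"documents": ["page 0 table"]},
        {"documents": ["page 1 figure", "page 1 text"]},
    ],
)
print(state.documents)
```

Without a reducer of this kind, concurrent branches writing to the same field would conflict rather than accumulate.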

Test run

if __name__ == "__main__":
    _state = DocumentLayoutParsingState(
        document_path="path/to/document.pdf"
    )
    agent = DocumentParsingAgent()

    # Step 1: Extract images from PDF
    result_images = agent.get_images(_state)
    _state.pages_as_base64_jpeg_images = result_images["pages_as_base64_jpeg_images"]

    # Step 2: Process the first page (as an example)
    result_layout = agent.find_layout_items(
        FindLayoutItemsInput(
            base64_jpeg=_state.pages_as_base64_jpeg_images[0],
            page_number=0,
            document_path=_state.document_path,
        )
    )
    # Display the results
    for item in result_layout["documents"]:
        print(item.page_content)
        print(item.metadata["element_type"])

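(2) Embedding and Contextual Retrieval (Agent 2)

Indexing the segmented documents

The summaries produced by Agent 1 are indexed in a vector database such as ChromaDB. Indexing stores the summary text and also keeps important metadata such as the document path and page number, so that results can be retrieved and cited later. When a user asks for specific information, this metadata makes it possible to locate the relevant pages quickly and supply accurate context. Before indexing, the agent checks whether the document has already been indexed, which avoids duplicate work.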

import logging

# Assumption: the original does not show how `logger` is defined, so the
# standard-library logger is used here. `self.vector_store` is expected to be
# a vector store (such as ChromaDB) created elsewhere in the class (not shown).
logger = logging.getLogger(__name__)


class DocumentRAGAgent:
    def index_documents(self, state: DocumentRAGState):
        """
        Index the parsed documents into the vector store.
        """
        assert state.documents, "Documents should have at least one element"
        # Check if the document is already indexed
        if self.vector_store.get(where={"document_path": state.document_path})["ids"]:
            logger.info(
                "Documents for this file are already indexed, exiting this node"
            )
            return  # Skip indexing if already done
        # Add parsed documents to the vector store
        self.vector_store.add_documents(state.documents)
        logger.info(f"Indexed {len(state.documents)} documents for {state.document_path}")
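The "check before indexing" step can be seen in isolation with a toy store. The sketch below is only an illustration: `InMemoryStore` and `index_once` are hypothetical, and a real vector store would also compute and hold embeddings.

```python
class InMemoryStore:
    """Toy store keyed by integer id; real stores also hold embeddings."""

    def __init__(self):
        self._records = {}

    def get_ids(self, document_path):
        """Return the ids of all records that belong to document_path."""
        return [
            record_id
            for record_id, record in self._records.items()
            if record["metadata"]["document_path"] == document_path
        ]

    def add(self, records):
        """Store each record under the next free integer id."""
        for record in records:
            self._records[len(self._records)] = record


def index_once(store, document_path, records):
    """Index records unless document_path is already present.

    Returns the number of records that were actually added.
    """
    if store.get_ids(document_path):
        return 0
    store.add(records)
    return len(records)


store = InMemoryStore()
records = [
    {"summary": "Revenue table", "metadata": {"document_path": "a.pdf", "page_number": 0}},
    {"summary": "Growth chart", "metadata": {"document_path": "a.pdf", "page_number": 1}},
]
print(index_once(store, "a.pdf", records))
print(index_once(store, "a.pdf", records))
```

Keying the check on the document path means a file whose content changes under the same path would be skipped. Storing a content hash in the metadata is one way to handle that case.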

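Handling user questions

When a user asks a question, Agent 2 first retrieves the document chunks most relevant to the question from the vector database. It then uses the page numbers of those chunks to fetch the corresponding page images and combines the images with the matching summaries into the context. For example, if the user asks for a detailed explanation of a concept in the document, Agent 2 can find the chunks that mention the concept together with the images of the pages they appear on, which gives the LLM the full context. Finally, the combined context and the user's question are sent to Gemini to generate an accurate, targeted answer.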

class DocumentRAGAgent:
    def answer_question(self, state: DocumentRAGState):
        """
        Retrieve relevant chunks and generate a response to the user's question.
        """
        # Retrieve the top-k relevant documents based on the query
        relevant_documents: list[Document] = self.retriever.invoke(state.question)

        # Retrieve corresponding page images. dict.fromkeys removes duplicates
        # like set() does, but keeps the retrieval order so the prompt is
        # deterministic.
        images = list(
            dict.fromkeys(
                state.pages_as_base64_jpeg_images[doc.metadata["page_number"]]
                for doc in relevant_documents
            )
        )
        logger.info(f"Responding to question: {state.question}")
        # Construct the prompt: Combine images, relevant summaries, and the question
        messages = (
            [{"mime_type": "image/jpeg", "data": base64_jpeg} for base64_jpeg in images]
            + [doc.page_content for doc in relevant_documents]
            + [
                f"Answer this question using the context images and text elements only: {state.question}",
            ]
        )
        # Generate the response using the LLM
        response = self.model.generate_content(messages)
        return {"response": response.text, "relevant_documents": relevant_documents}
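The retrieval step is hidden behind `self.retriever.invoke`. As a rough illustration of what "most relevant" means, the sketch below ranks toy vectors by cosine similarity and collects the distinct pages of the top hits. The vectors, field names, and functions are hypothetical; a real vector store computes embeddings itself and may use a different distance metric.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed, k=2):
    """Return the k indexed items most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


def pages_for(hits):
    """Distinct page numbers of the hits, in ranking order."""
    return list(dict.fromkeys(hit["page_number"] for hit in hits))


# Toy two-dimensional "embeddings" standing in for real ones.
indexed = [
    {"page_number": 0, "summary": "revenue table", "vector": [1.0, 0.0]},
    {"page_number": 1, "summary": "team photo", "vector": [0.0, 1.0]},
    {"page_number": 0, "summary": "revenue chart", "vector": [0.9, 0.1]},
]

hits = top_k([1.0, 0.0], indexed, k=2)
print([hit["summary"] for hit in hits])
print(pages_for(hits))
```

Both top hits sit on the same page, so only one page image would be attached to the prompt. This is the situation the deduplication step in `answer_question` handles.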

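With a multimodal large language model such as Gemini, we can build an efficient and comprehensive document AI pipeline for PDF documents (see also: "PymuPDF4llm: A Revolution in PDF Extraction"). The pipeline overcomes the limits of traditional text-only processing: by understanding page layout, tables, images, and text blocks, it produces accurate and complete document processing results.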
