
A Step-by-Step Guide to Building a General-Purpose LLM Agent

Published: 2025-03-25 04:11:20

Author: PyTorch研习社


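A high-level overview of an LLM Agent:

Why build a general-purpose Agent? Because it is an excellent tool for prototyping your use case, and it lays the groundwork for designing your own custom Agent architecture.

Before we dive in, here is a brief introduction to LLM Agents. Feel free to skip this part.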

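What is an LLM Agent?

An LLM Agent is a program whose execution logic is controlled by its underlying model.

From a standalone LLM to an Agentic system:

What sets an LLM Agent apart from approaches such as few-shot prompting or fixed workflows is its ability to define and adapt, on its own, the steps required to carry out a user's query.

Given a set of tools (such as code execution or web search), the Agent can decide which tool to use and how to use it, and then iterate based on the output.

This adaptability lets the system handle a wide range of use cases with minimal configuration.

Agentic architectures lie on a spectrum, from the reliability of fixed workflows to the flexibility of autonomous Agents.

For example, a fixed workflow such as RAG (Retrieval-Augmented Generation) can be enhanced with a self-reflection loop, so the program can iterate when its initial response falls short. Conversely, a ReAct Agent can use fixed workflows as tools, which gives an approach that is both flexible and structured. Ultimately, the choice of architecture depends on the use case and on the trade-off between reliability and flexibility.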

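Building a General-Purpose LLM Agent from Scratch!

Step 1: Choose the right LLM

Choosing the right model is critical to getting the performance you want. You need to weigh several factors, such as licensing, cost, and language support.

For an LLM Agent, the most important consideration is how the model performs on key tasks such as code generation, tool calling, and reasoning. Relevant benchmarks include: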

MMLU (Massive Multitask Language Understanding): for evaluating reasoning
Berkeley Function-Calling Leaderboard: for evaluating tool selection and tool calling
HumanEval and BigCodeBench: for evaluating coding ability

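Another key factor is the model's context window size. Agentic workflows can consume a lot of tokens, sometimes more than 100K, so a larger context window is a real advantage.

Models to consider (as of March 1, 2025)

Closed-source models: GPT-4.5, Claude 3.7

Open-source models: Qwen 2.5, DeepSeek R1, Llama 3.2

In general, larger models perform better, but smaller models that can run locally are still a solid option. With a small model, the Agent will likely be limited to simpler scenarios and to one or two basic tools.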

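Step 2: Define the Agent's control logic (i.e., its communication structure)

The main difference between a plain LLM and an Agent lies in the system prompt. In the LLM context, the system prompt is a set of instructions and contextual information given to the model before it processes the user's query.

The Agent's expected behavior can be encoded in the system prompt, which defines its Agentic behavior patterns. These patterns can be tailored to your needs. Common Agentic patterns include: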

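Tool Use: the Agent decides when to route the query to a suitable tool and when to answer from its own knowledge.
Reflection: the Agent reviews and corrects its answer before responding to the user. A reflection step can be added to most LLM systems.
Reason-then-Act (ReAct): the Agent reasons step by step about how to solve the query, performs an action, observes the result, and decides whether to take another action or give an answer.
Plan-then-Execute: the Agent first breaks the task into sub-steps (if needed) and then executes them one by one.

Of these, ReAct and Plan-then-Execute are the most common starting points for building a general-purpose single Agent.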
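
To make one of these patterns concrete, here is a minimal sketch of a Reflection loop. It assumes the model is exposed as a plain `llm(prompt) -> str` callable; the prompt wording, the "OK" convention, and `fake_llm` are all hypothetical choices made for illustration, not part of any framework.

```python
from typing import Callable


def reflect_and_answer(llm: Callable[[str], str], query: str, max_rounds: int = 2) -> str:
    """Draft an answer, ask the model to critique it, and revise if needed."""
    draft = llm(f"Answer the question.\nQuestion: {query}")
    for _ in range(max_rounds):
        critique = llm(
            "Review the draft answer. Reply 'OK' if it is correct and complete, "
            f"otherwise list the problems.\nQuestion: {query}\nDraft: {draft}"
        )
        if critique.strip().upper() == "OK":
            break  # the model is satisfied with its own draft
        draft = llm(
            f"Revise the draft to fix these problems.\nQuestion: {query}\n"
            f"Draft: {draft}\nProblems: {critique}"
        )
    return draft


# Hypothetical stand-in for a real model call, used only to make the sketch runnable.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Answer"):
        return "Paris is the capital of Germany."
    if prompt.startswith("Review"):
        return "OK" if "France" in prompt.split("Draft:")[1] else "Wrong country."
    return "Paris is the capital of France."


print(reflect_and_answer(fake_llm, "What is Paris the capital of?"))
```

The cap on `max_rounds` is a deliberate choice: without it, a model that never approves its own draft would loop indefinitely.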

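To implement these behaviors effectively, you will need prompt engineering, and possibly structured generation as well. The core idea of structured generation is to steer the LLM's output toward a specific format or schema, so that the Agent's replies are consistent and follow the expected communication style.

Example: a system prompt excerpt for a ReAct-style Agent from the Bee Agent Framework: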
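
One lightweight way to approximate structured generation is to validate the model's reply and re-prompt when it does not match. The sketch below does exactly that with the standard library; note that it is validate-and-retry, not constrained decoding. The `llm(prompt) -> str` callable and the two-field schema in `REQUIRED_KEYS` are assumptions made for illustration.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> expected Python type.
REQUIRED_KEYS = {"thought": str, "action": str}


def validate(raw: str) -> dict:
    """Parse raw model text as JSON and check it against REQUIRED_KEYS."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return data


def generate_structured(llm: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Call the model, re-prompting with the validation error until the reply is valid."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = llm(message)
        try:
            return validate(raw)
        except ValueError as err:
            message = f"{prompt}\nYour last reply was invalid ({err}). Reply with JSON only."
    raise ValueError("model did not produce valid structured output")
```

Feeding the validation error back into the next prompt gives the model a concrete hint about what to fix, which tends to matter most for smaller models.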

# Communication structure
You communicate only in instruction lines. The format is: "Instruction: expected output". You must only use these instruction lines and must not enter empty lines or anything else between instruction lines.
You must skip the instruction lines Function Name, Function Input and Function Output if no function calling is required.

Message: User's message. You never use this instruction line.
Thought: A single-line plan of how to answer the user's message. It must be immediately followed by Final Answer.
Thought: A single-line step-by-step plan of how to answer the user's message. You can use the available functions defined above. This instruction line must be immediately followed by Function Name if one of the available functions defined above needs to be called, or by Final Answer. Do not provide the answer here.
Function Name: Name of the function. This instruction line must be immediately followed by Function Input.
Function Input: Function parameters. Empty object is a valid parameter.
Function Output: Output of the function in JSON format.
Thought: Continue your thinking process.
Final Answer: Answer the user or ask for more information or clarification. It must always be preceded by Thought.

## Examples
Message: Can you translate "How are you" into French?
Thought: The user wants to translate a text into French. I can do that.
Final Answer: Comment vas-tu?


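Step 3: Define the Agent's core instructions

We tend to assume that LLMs come with many capabilities out of the box, but some of them may not suit your needs. To get the performance you want from the Agent, spell out in the system prompt which capabilities should be enabled and which should be disabled.

Instructions you may need to define include: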

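Agent name and role: what the Agent is called and what it is responsible for.
Tone and conciseness: should the Agent sound formal or casual? Should it be brief or detailed?
When to use tools: when should it rely on external tools, and when should it answer from the LLM's own knowledge?
Error handling: what should the Agent do if a tool call fails?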
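
The four kinds of instruction above can be kept as data and rendered into the system prompt, which makes them easy to tweak while prototyping. This is only a sketch: the `AgentInstructions` class, its field names, and the rendered layout are hypothetical and do not come from the Bee Agent Framework or any other library.

```python
from dataclasses import dataclass, field


@dataclass
class AgentInstructions:
    """Hypothetical container for the core instructions of an Agent."""
    name: str
    role: str
    tone: str
    tool_policy: str
    error_policy: str
    notes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the instructions as a system prompt section."""
        lines = [
            "# Instructions",
            f"You are {self.name}, {self.role}.",
            f"Tone: {self.tone}",
            f"Tool use: {self.tool_policy}",
            f"On tool failure: {self.error_policy}",
        ]
        if self.notes:
            lines.append("# Notes")
            lines.extend(f"- {note}" for note in self.notes)
        return "\n".join(lines)


instructions = AgentInstructions(
    name="Bee",
    role="a helpful research assistant",
    tone="friendly and concise",
    tool_policy="use functions for factual or historical information",
    error_policy="try another function or a different input before giving up",
    notes=["If you don't know the answer, say that you don't know."],
)
print(instructions.render())
```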

Example: part of the Bee Agent Framework's instructions:

# Instructions
User can only see the Final Answer, all answers must be provided there.
You must always use the communication structure and instructions defined above. Do not forget that Thought must be a single-line immediately followed by Final Answer.
You must always use the communication structure and instructions defined above. Do not forget that Thought must be a single-line immediately followed by either Function Name or Final Answer.
Functions must be used to retrieve factual or historical information to answer the message.
If the user suggests using a function that is not available, answer that the function is not available. You can suggest alternatives if appropriate.
When the message is unclear or you need more information from the user, ask in Final Answer.

# Your capabilities
Prefer to use these capabilities over functions.
- You understand these languages: English, Spanish, French.
- You can translate and summarize, even long documents.

# Notes
- If you don't know the answer, say that you don't know.
- The current time and date in ISO format can be found in the last message.
- When answering the user, use friendly formats for time and date.
- Use markdown syntax for formatting code snippets, links, JSON, tables, images, files.
- Sometimes, things don't go as planned. Functions may not provide useful information on the first few tries. You should always try a few different approaches before declaring the problem unsolvable.
- When the function doesn't give you what you were asking for, you must either use another function or a different function input.
  - When using search engines, you try different formulations of the query, possibly even in a different language.
- You cannot do complex calculations, computations, or data manipulations without using functions.


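Step 4: Define and optimize the core tools

Tools are what give an Agent its power. With a small set of well-designed tools, you can cover a wide range of functionality. Key tools include:
✅ Code execution
✅ Web search
✅ File reading
✅ Data analysis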

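Each tool should come with the following definition, included as part of the system prompt:

Tool name: a name that clearly conveys what the tool does.
Tool description: explains what the tool is for and when to use it, which helps the Agent pick the right tool.
Tool input schema: defines the input parameters, including which are required or optional, their types, and any constraints.
Tool execution: how the tool is run and how the Agent should invoke it.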
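
As a framework-free illustration of these four parts, the sketch below bundles a name, a description, a JSON-Schema-style input schema, and a run function into one object, and renders the first three for the system prompt. The `Tool` class, the `to_prompt` layout, and the `word_count` tool are hypothetical examples rather than an existing API.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """Hypothetical tool definition: what the Agent sees plus how to run it."""
    name: str
    description: str
    input_schema: dict[str, Any]  # JSON-Schema-style description of the parameters
    run: Callable[..., str]       # executed by the application, not by the LLM

    def to_prompt(self) -> str:
        """Render the parts of the definition that belong in the system prompt."""
        return (
            f"Function Name: {self.name}\n"
            f"Description: {self.description}\n"
            f"Parameters: {json.dumps(self.input_schema)}"
        )


def word_count(text: str) -> str:
    return str(len(text.split()))


word_count_tool = Tool(
    name="word_count",
    description="Counts the words in a piece of text. Use it when the user asks how long a text is.",
    input_schema={
        "type": "object",
        "properties": {"text": {"type": "string", "description": "Text to count"}},
        "required": ["text"],
    },
    run=word_count,
)

print(word_count_tool.to_prompt())
print(word_count_tool.run(text="tools give agents their power"))
```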

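Example: the Arxiv tool from the Langchain community. Below is part of its implementation, which wraps the Arxiv API; the tool can be used to retrieve papers in physics, mathematics, computer science, and other fields: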

# Imports added so the excerpt stands on its own
from typing import Optional, Type

from langchain_core.callbacks import CallbackManagerForToolRun
from langchain_core.tools import BaseTool
from pydantic import BaseModel, Field

from langchain_community.utilities.arxiv import ArxivAPIWrapper


class ArxivInput(BaseModel):
    """Input for the Arxiv tool."""

    query: str = Field(description="search query to look up")


class ArxivQueryRun(BaseTool):  # type: ignore[override, override]
    """Tool that searches the Arxiv API."""

    name: str = "arxiv"
    description: str = (
        "A wrapper around Arxiv.org "
        "Useful for when you need to answer questions about Physics, Mathematics, "
        "Computer Science, Quantitative Biology, Quantitative Finance, Statistics, "
        "Electrical Engineering, and Economics "
        "from scientific articles on arxiv.org. "
        "Input should be a search query."
    )
    api_wrapper: ArxivAPIWrapper = Field(default_factory=ArxivAPIWrapper)  # type: ignore[arg-type]
    args_schema: Type[BaseModel] = ArxivInput

    def _run(
        self,
        query: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> str:
        """Use the Arxiv tool."""
        return self.api_wrapper.run(query)

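In some cases, you may need to optimize a tool to get better performance, for example:

Use prompt engineering to adjust the tool's name or description so the Agent matches it to the right queries.
Set up advanced configuration to handle common errors.
Filter the tool's output so the results meet expectations.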
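
The last two optimizations can often be applied without touching the tool itself, by wrapping it. The sketch below retries a failing tool, turns a persistent failure into a message the LLM can react to, and truncates oversized output. The name `harden_tool`, its defaults, and the wording of the error message are assumptions chosen for illustration.

```python
from typing import Callable


def harden_tool(
    tool_fn: Callable[..., str], max_chars: int = 2000, retries: int = 1
) -> Callable[..., str]:
    """Wrap a tool with retries, error reporting, and output truncation."""

    def wrapped(**params) -> str:
        last_error = None
        for _ in range(retries + 1):
            try:
                output = str(tool_fn(**params))
                break  # success: stop retrying
            except Exception as err:  # broad on purpose: any tool failure is reported
                last_error = err
        else:
            # Every attempt failed; return text the LLM can act on instead of raising.
            return f"Tool error: {last_error}. Try different input or another tool."
        if len(output) > max_chars:
            output = output[:max_chars] + "\n[output truncated]"
        return output

    return wrapped


def flaky_search(query: str) -> str:
    raise TimeoutError("search backend timed out")


safe_search = harden_tool(flaky_search)
print(safe_search(query="llm agents"))
```

Returning the error as text, rather than raising, keeps the Agent in the loop so it can try a different query or another tool.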

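Step 5: Set a memory management strategy

An LLM's context window is limited, and it determines how much the model can "remember". Multi-turn conversations, long tool outputs, and additional context can all fill the window quickly, so a sound memory management strategy is essential.

In the context of an Agent, memory is the system's ability to store, recall, and use information from past interactions. This lets the Agent maintain context over time, improve its responses based on earlier exchanges, and offer a more personalized experience.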

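Common memory management strategies

1️⃣ Sliding memory: keep the most recent k conversation turns and discard older ones.
2️⃣ Token memory: keep only the most recent n tokens and discard the rest.
3️⃣ Summarized memory: after each turn, use the LLM to summarize the conversation, then discard the individual messages.
4️⃣ Key moment storage: have the LLM identify key facts and store them in long-term memory, so the Agent can "remember" important information and give the user a more personalized experience.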
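
The first two strategies are simple enough to sketch in a few lines. Two assumptions to keep in mind: the token counter below just splits on whitespace, whereas a real system would use the model's own tokenizer, and `TokenMemory` drops whole messages from the oldest end rather than cutting at an exact token boundary. The class names are hypothetical.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the most recent k conversation turns."""

    def __init__(self, k: int):
        self.turns = deque(maxlen=k)  # deque discards the oldest turn automatically

    def add(self, user: str, agent: str) -> None:
        self.turns.append((user, agent))

    def context(self) -> list:
        return list(self.turns)


class TokenMemory:
    """Drops the oldest messages until the total fits within max_tokens."""

    def __init__(self, max_tokens: int, count_tokens=lambda text: len(text.split())):
        self.max_tokens = max_tokens
        self.count_tokens = count_tokens  # crude whitespace estimate by default
        self.messages = deque()

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Always keep at least the newest message, even if it alone exceeds the budget.
        while self.total() > self.max_tokens and len(self.messages) > 1:
            self.messages.popleft()

    def total(self) -> int:
        return sum(self.count_tokens(m) for m in self.messages)

    def context(self) -> list:
        return list(self.messages)


window = SlidingWindowMemory(k=2)
for i in range(3):
    window.add(f"question {i}", f"answer {i}")
print(window.context())
```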

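So far, we have covered the first five steps of building an Agent:

✅ Step 1: Choose the right LLM
✅ Step 2: Define the control logic (a behavior pattern such as ReAct or Plan-then-Execute, enforced through prompt engineering)
✅ Step 3: Write the core instructions (role, style, tool-use policy, and so on)
✅ Step 4: Define and optimize the core tools (such as web search or database queries)
✅ Step 5: Set a memory management strategy (to prevent context overflow)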

So what happens if we now let the LLM handle a user query directly?

For example, you might get:

User Message: Extract key insights from this dataset
Files: bill-of-materials.csv
Thought: First, I need to inspect the columns of the dataset and provide basic data statistics.
Function Name: Python
Function Input: {"language":"python","code":"import pandas as pd\n\ndataset = pd.read_csv('bill-of-materials.csv')\n\nprint(dataset.columns)\nprint(dataset.describe())","inputFiles":["bill-of-materials.csv"]}
Function Output:


At this point, what the Agent has produced is raw text. So how do we get it to carry out the next step? This is where parsing and orchestration come in.

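Step 6: Parse the Agent's raw output

A parser is a function that converts raw data into a format the application can understand, such as an object with attributes.

For the Agent we are building, the parser needs to recognize the communication structure defined in Step 2 and return structured output (such as JSON). This makes it easier for the application to process and execute the Agent's next action.

Note: some model providers (such as OpenAI) support parsable output by default. For other models, especially open-source ones, you may need to configure this manually.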
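
For the instruction-line format from Step 2, a parser can be as small as the sketch below. It assumes each instruction fits on a single line and that Function Input is valid JSON; the returned keys (`action`, `tool_name`, `tool_params`, `answer`) are chosen to match what the orchestrator in Step 7 reads. The function name `parse_agent_output` is hypothetical.

```python
import json
import re

INSTRUCTION_KEYS = ("Thought", "Function Name", "Function Input", "Function Output", "Final Answer")
# Matches lines of the form "<Instruction>: <value>" for the known instructions only.
LINE_PATTERN = re.compile("^(" + "|".join(INSTRUCTION_KEYS) + r"):\s*(.*)$")


def parse_agent_output(raw: str) -> dict:
    """Turn raw instruction lines into a structured action for the orchestrator."""
    fields = {}
    for line in raw.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)  # a later line overrides an earlier one

    if "Final Answer" in fields:
        return {"action": "return_answer", "answer": fields["Final Answer"]}
    if "Function Name" in fields:
        return {
            "action": "tool_call",
            "tool_name": fields["Function Name"],
            # An empty or missing Function Input is treated as an empty object.
            "tool_params": json.loads(fields.get("Function Input") or "{}"),
        }
    return {"action": "unknown", "raw": raw}


raw_output = (
    "Thought: First, I need to inspect the columns of the dataset.\n"
    "Function Name: Python\n"
    'Function Input: {"language": "python", "code": "print(1)"}\n'
    "Function Output:"
)
print(parse_agent_output(raw_output))
```

Anything that matches neither branch comes back as `unknown`, so the application can decide whether to re-prompt the model or surface an error.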

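Step 7: Orchestrate the Agent's next action

The final step is to set up the orchestration logic, which determines what happens after the LLM produces an output. Depending on that output, you will either:

Execute a tool call (for example, run Python code or call an API).
Return an answer, that is, give the user the final response or ask for additional information needed to complete the task.

If a tool call is triggered, the tool's output is sent back to the LLM as part of its working memory. The LLM then decides what to do with the new information: make another tool call or return an answer to the user.

Here is what this orchestration logic looks like in code: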

def orchestrator(llm_agent, llm_output, tools, user_query):
    """
    Orchestrates the response based on LLM output and iterates if necessary.

    Parameters:
    - llm_agent (callable): The LLM agent function for processing tool outputs.
    - llm_output (dict): Initial output from the LLM, specifying the next action.
    - tools (dict): Dictionary of available tools with their execution methods.
    - user_query (str): The original user query.

    Returns:
    - str: The final response to the user.
    """
    while True:
        action = llm_output.get("action")

        if action == "tool_call":
            # Extract tool name and parameters
            tool_name = llm_output.get("tool_name")
            tool_params = llm_output.get("tool_params", {})

            if tool_name in tools:
                try:
                    # Execute the tool
                    tool_result = tools[tool_name](**tool_params)
                    # Send tool output back to the LLM agent for further processing
                    llm_output = llm_agent({"tool_output": tool_result})
                except Exception as e:
                    return f"Error executing tool '{tool_name}': {str(e)}"
            else:
                return f"Error: Tool '{tool_name}' not found."

        elif action == "return_answer":
            # Return the final answer to the user
            return llm_output.get("answer", "No answer provided.")

        else:
            return "Error: Unrecognized action type from LLM output."
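
One thing worth considering when prototyping is a cap on the number of iterations, since a model that keeps requesting tools would otherwise loop forever. The sketch below is a variant of the same orchestration logic with such a cap, followed by a tiny end-to-end run. `run_agent`, `max_steps`, `stub_agent`, and the `add` tool are hypothetical names used only for illustration.

```python
from typing import Callable


def run_agent(
    llm_agent: Callable[[dict], dict], first_output: dict, tools: dict, max_steps: int = 5
) -> str:
    """Same orchestration logic, but bounded to at most max_steps LLM outputs."""
    llm_output = first_output
    for _ in range(max_steps):
        action = llm_output.get("action")
        if action == "return_answer":
            return llm_output.get("answer", "No answer provided.")
        if action != "tool_call":
            return "Error: Unrecognized action type from LLM output."

        tool_name = llm_output.get("tool_name")
        if tool_name not in tools:
            return f"Error: Tool '{tool_name}' not found."
        try:
            result = tools[tool_name](**llm_output.get("tool_params", {}))
        except Exception as err:
            return f"Error executing tool '{tool_name}': {err}"
        # Feed the tool result back so the agent can choose its next action.
        llm_output = llm_agent({"tool_output": result})
    return "Error: step limit reached without a final answer."


# Hypothetical stand-in for a real LLM agent: it answers as soon as it sees a tool result.
def stub_agent(message: dict) -> dict:
    return {"action": "return_answer", "answer": f"The sum is {message['tool_output']}."}


tools = {"add": lambda a, b: a + b}
first = {"action": "tool_call", "tool_name": "add", "tool_params": {"a": 2, "b": 3}}
print(run_agent(stub_agent, first, tools))
```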

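And that's it! You have now built a system that can handle a wide variety of scenarios, from competitive analysis and in-depth research to automating complex workflows.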

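Where do Multi-Agent systems come in?

Although the current generation of LLMs is powerful, they still have a core limitation: they struggle with information overload.

If there is too much context, or the tools are too complex, the model can become overloaded and its performance degrades. A single general-purpose Agent will hit this ceiling sooner or later, especially when it consumes tokens heavily.

For some use cases, a Multi-Agent setup may make more sense. Splitting the task across multiple Agents reduces the context each individual LLM has to handle, which improves overall efficiency.

That said, starting with a single Agent is still an excellent choice, especially at the prototyping stage. It helps you test your use case quickly and find where the system's bottlenecks are.
Along the way, you can: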