Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing research on tool-augmented LLMs has focused mainly on the breadth of tool coverage and the flexibility of adding new tools. However, the study finds that existing LLMs (including GPT-4 and open-source LLMs fine-tuned specifically for tool use) use the tools they were trained on with a correctness rate of only 30% to 60%, far from the reliability needed for practical use.
Inspired by successful tool-use behavior in biological systems, the paper proposes Simulated Trial and Error (STE), a method for improving tool learning in LLMs. It comprises three key mechanisms: trial and error, imagination, and memory. The LLM's "imagination" is used to simulate plausible scenarios for using a tool; the LLM then interacts with the tool and learns from the execution feedback. Short-term and long-term memory are employed to improve the depth and the breadth of exploration, respectively.
Exploration Phase
To make exploration effective, comprehensive, and diverse, STE integrates three core design components: iterative self-refinement, short-term memory, and long-term memory.
An example of exploration with simulated trial and error, highlighting the memory mechanisms. Each episode starts from the API specification (in the first trial only), followed by a series of trials dynamically appended to short-term memory. Long-term memory is loaded into the context at the start of each trial so that the LLM can progressively imagine new scenarios, and is unloaded afterwards.
Iterative self-refinement [top left of the figure above]: the LLM learns from execution feedback, continually adjusting its output until it judges that the API call has returned sufficient information, or a predefined maximum number of calls is reached.
Short-term memory [left of the figure above]: the LLM is given a short-term memory containing the exploration trajectory of recent trials. This lets the model learn from recent successes and failures and explore the API more deeply in subsequent trials.
Long-term memory [right of the figure above]: to support long-horizon progressive learning, the LLM is also equipped with a long-term memory recording previously explored queries and whether they were satisfied successfully. The long-term memory is loaded into the context at the start of each new trial, guiding the model to imagine new scenarios distinct from those already explored.
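The interplay of the three components can be sketched as a simple loop. This is a hypothetical minimal sketch, not the paper's implementation; `llm` and `call_api` are stubs standing in for a real chat model and a real tool executor:

```python
import random

def llm(prompt):
    """Stub standing in for a real LLM call (assumption: any chat API)."""
    if "Synthesize" in prompt:
        return f"query-{random.randint(0, 999)}"
    # Pretend the model finishes once it has seen one observation.
    if "Observation" in prompt:
        return "Final Answer: done"
    return 'Action: some_api\nAction Input: {"key": "value"}'

def call_api(action_text):
    """Stub tool executor returning a canned observation."""
    return "Observation: ok"

def explore_api(api_spec, num_episodes=3, max_calls=4):
    long_term_memory = []          # (query, success) across episodes
    for _ in range(num_episodes):
        # Long-term memory is in context only while imagining the query,
        # steering the model toward scenarios it has not tried yet.
        query = llm(f"{api_spec}\nPast queries: {long_term_memory}\n"
                    "Synthesize a NEW user query:")
        short_term_memory = []     # ReAct trajectory for this episode
        step = ""
        for _ in range(max_calls):  # iterative self-refinement
            step = llm(f"{api_spec}\nQuery: {query}\n"
                       f"Trajectory: {short_term_memory}")
            if step.startswith("Final Answer:"):
                break
            short_term_memory.append((step, call_api(step)))
        long_term_memory.append((query, step.startswith("Final Answer:")))
    return long_term_memory

memory = explore_api("<API spec>")
```

With the stubs above, every episode terminates after one simulated API call; in the real system the inner loop runs until the model emits a final answer or hits the call budget.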
The prompt template used in STE's exploration phase, covering iterative reflection (Observation + Thought), short-term memory (ReAct), and long-term memory (history queries):
Your task is to answer the user's query as best you can. You have access to the following tools which you can use via API call to help with your response:
{api_descriptions}
Now you have the chance to explore the available APIs. You can do this by 1) synthesizing some natural user query that calling the API could help, and 2) trying to respond to the user query with the help of the APIs. Here, you can focus on queries that only require calling the API once.
Now, first input your synthesized user query. You should make the query natural - for example, try to avoid using the provided API descriptions or API names in the query, as the user does not know what APIs you have access to. Also try to make the query as specific as possible. Input just the user query alone; do NOT solve the query for now.
User Query:
=========
Now, try to respond to the query using the available APIs.
The format you use the API is by specifying 1) Action: the API function name you'd like to call 2) Action Input: the input parameters of the API call in a json string format. The result of the API call will be returned starting with "Observation:". Remember that you should only perform a SINGLE action at a time, do NOT return a list of multiple actions.
Reminder:
1) the only values that should follow "Action:" are: {api_names}
2) use the following json string format for the API arguments:
Action Input:
{{
"key_1": "value_1",
...
"key_n": "value_n",
}}
Remember to ALWAYS use the following format:
Thought: you should always think about what to do next
Action: the API function name
Action Input: the input parameters of the API call in json string format
Observation: the return result of the API call. This is what I will provide you with; you do not need to repeat it in your response.
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the response to the user query
Begin! Remember that your response should never start with "Observation:" since that is what I will provide you with. Once you have enough information, please immediately use \nThought: I now know the final answer\nFinal Answer:
User Query (the same you just synthesized): {query}
=========
Now you know a bit more about the API. You can synthesize another user query to explore the API a bit further and consolidate your understanding of the API, based on things that you discovered about this API. Again, just input the user query alone; do NOT solve the query for now.
User Query:
=========
Now try to solve the query using the API. Remember to follow the same format, i.e.,\nThought:\nAction:\nAction Input:\nObservation:\nFinal Answer:\n.
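A model following this template emits steps in the Thought/Action/Action Input format, which the driver must parse before it can execute the call. A minimal hypothetical parser (not from the paper's repository) might look like:

```python
import json
import re

def parse_react_step(text):
    """Extract the Action name and JSON arguments from one model step
    in the template's format. Returns None for a final-answer step."""
    if "Final Answer:" in text:
        return None
    action = re.search(r"Action:\s*(\S+)", text).group(1)
    # The arguments span from "Action Input:" to the end of the step;
    # a trailing comma before "}" (as in the template) is tolerated.
    raw = text.split("Action Input:", 1)[1].strip()
    raw = re.sub(r",\s*}", "}", raw)
    return action, json.loads(raw)

step = ('Thought: I should look up the weather.\n'
        'Action: get_weather\n'
        'Action Input:\n{\n"city": "Paris"\n}')
name, args = parse_react_step(step)
```

The execution result would then be fed back to the model as the next "Observation:" turn.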
Exploitation Phase
The trials obtained in the exploration phase are used to enhance the LLM's tool-use ability via in-context learning (ICL) or fine-tuning. For each trial, the synthesized user query, the LLM's last API call and its execution result, and the final response are extracted. GPT-4 is then used to judge the validity of each example, and valid examples are paraphrased for each new API to maintain balance and increase linguistic variety.
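The extraction-and-filtering step can be sketched as follows. `judge` is a stub standing in for the GPT-4 filter, and the trial fields mirror the `{query}`, `{chains}`, and `{final_ans}` slots of the filtering template:

```python
def judge(query, chains, final_ans):
    """Stub for the GPT-4 validity filter; a real system would fill the
    filtering prompt and parse its three Yes/No judgments."""
    return "error" not in final_ans.lower()

def build_examples(trials):
    """Keep only trials the judge deems valid, formatted as
    (query, last API call + result, final answer) training examples."""
    examples = []
    for t in trials:
        if judge(t["query"], t["chains"], t["answer"]):
            examples.append({"query": t["query"],
                             "last_call": t["chains"][-1],
                             "answer": t["answer"]})
    return examples

trials = [
    {"query": "Weather in Paris?", "chains": [("get_weather", "22C")],
     "answer": "It is 22C in Paris."},
    {"query": "Weather on Mars?", "chains": [("get_weather", "404")],
     "answer": "API error: location not found."},
]
kept = build_examples(trials)
```

Here the second trial is dropped because its final response signals a failed call, matching criterion 1 of the filter below.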
The filtering prompt template used in STE's exploitation phase to judge the validity of each example:
An assistant is trying to respond to the user query with the help of some APIs. The APIs that the assistant has access to are as follows:
{api_descriptions}
Now, your task is to evaluate how well the assistant did the job. Check carefully the following aspects of the assistant's response:
1) whether the response answers the user's query in an informative way. For example, if the API calls are unsuccessful and the agent can't find the answer to the request, you should say "No."
2) whether the response is faithful with respect to the execution results of the API calls. The response should not include information that cannot be supported by the API call feedback
3) whether the assistant used the API calls appropriately. For example, the assistant should always use relevant API calls for queries about up-to-date information or complex calculations
For each of the three aspects, you should say "Yes" or "No" indicating whether the assistant did a good job in that aspect, and explain the reason behind your judgment. Your output should follow the format below, where "<explanation>" should be your actual explanation for the corresponding judgment:
1) Yes/No. <explanation>
2) Yes/No. <explanation>
3) Yes/No. <explanation>
Now, the user query is: "{query}"
The assistant's API calls and the corresponding execution results are:
{chains}
The assistant's final response is:
----
{final_ans}
----
Now, your evaluation is (remember to follow the previous format):
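The filter's output follows the numbered Yes/No format above. A small hypothetical parser that keeps an example only when all three judgments are "Yes":

```python
import re

def parse_filter_output(text):
    """Parse the three 'N) Yes/No. <explanation>' lines produced by the
    filter template; an example is kept only if all three are 'Yes'."""
    verdicts = re.findall(r"^\d\)\s*(Yes|No)\b", text, flags=re.M)
    return len(verdicts) == 3 and all(v == "Yes" for v in verdicts)

output = ("1) Yes. The response answers the query.\n"
          "2) Yes. It matches the execution results.\n"
          "3) No. The assistant never called the API.")
keep = parse_filter_output(output)
```

Requiring all three criteria to pass is an assumption consistent with the template's intent; the paper may weight the criteria differently.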
Comprehensive experiments on ToolBench show that STE significantly improves LLMs' tool learning under both in-context learning and fine-tuning settings, delivering a 46.7% boost for Mistral-Instruct-7B and enabling it to surpass GPT-4.
Overall tool-use performance. STE is effective for both in-context learning and fine-tuning. The best overall results are in bold, and the best result in each setting is underlined.
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
https://arxiv.org/pdf/2403.04746.pdf
https://github.com/microsoft/simulated-trial-and-error