我要投稿

Aguvis：提升的不仅是 UI Agent 的规划推理能力

发布日期：2024-12-15 05:37:01 浏览次数： 2212 作者：CraftWarmAI

Home^[1] | GitHub^[2] | Twitter^[3] | Youtube^[4] | Bilibili^[5]

本文介绍来自 HKU & Salesforce 的 Aguvis。如我之前所说，这篇论文（数据、代码都会开源）至少值 2 个算法工程师 1 个月的工资。论文里面有很多细节都值得深挖，属于外行看热闹，内行看门道的那种。

本文是视频 UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架^[6] 对应的文字版，建议与视频对照着看。

Aguvis 相关资料：

[2412.04454] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction^[7], HKU & Salesforce
https://aguvis-project.github.io^[8]
【视频分享】UI Agent 论文分享：Aguvis-来自 HKU & Salesforce 的大一统训练数据和训练框架^[9]

Aguvis 这个词应该是作者造的，没查到什么意思。发现这个工作的作者跟 OS-Copilot^[10] 还有耦合，而 OS-Copilot^[11] 跟 OS-Atlas^[12] 是相同的一作。

Aguvis 基于 Qwen2-VL-7B 和 Qwen2-VL-72B 进行全量微调（只 freeze ViT 部分），设置最大序列长度为 8192，max pixels 为 1280 x 720。

本文主要贡献：

生成了 IM（observation、thought、low-level instruction）数据，相当于 planning & reasoning 数据，用于第二阶段的模型微调。验证了 IM 数据能大幅提升模型的效果
构建了统一的 grounding 和 reasoning 大数据集，数据即将开源

利用 pyautogui 统一了不同平台的动作空间，这样来自不同平台的数据可以统一使用

训练数据使用 grounding packing strategy 方法，把训练效率提升了 5 倍

把多个单轮的 grounding 任务合成一个多轮的单个任务

统一了 grounding 和 planning & reasoning 2 个训练阶段的数据格式

论文详解

比较标准的两阶段训练方式。第一阶段主要针对 grounding 能力，第二阶段主要针对 planning & reasoning 能力。

Inner Monologue（内心独白，简称 IM）包括 3 个部分：

1. observation description
2. internal reasoning (thought)
3. low-level action instruction

决策过程可以分为 2 步完成：Planner 生成 IM 内容，然后 Grounder 按照产生具体的 grounding 信息。

可插拔的动作空间

把动作执行统一成了函数调用（可以借力 base 模型的 function call 能力）：

类似函数调用的方式在 prompt 中告知有哪些函数是可调用的。

Aguvis Collection 数据集

Aguvis Collection 数据集是作者汇总其他数据集构建的训练数据集；包括以下 2 部分，顾名思义，对应上面的两阶段训练；后续会开源

1. grounding split：作者把以下数据集中的 Meta 信息都统一成 pyautogui 命令格式的数据

2. planning & reasoning split

"Thanks to our detailed inner monologue trajectory data, we implement a reasoning mixture approach, where the model is exposed to various levels of cognitive complexity, from straightforward low-level action instructions to full inner monologues that include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. This diversity in reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

Grounding Stage

以下是 grounding 阶段训练使用的数据格式：

⁉️ 疑问：
1. 对于 grounding 数据，Prompt 中的 overall_goal 和 previous_actions 分别是什么？
2. <|diff_marker|> 这个标记的用途是什么？
模型可以利用这个标记来识别需要关注的特定部分，从而生成更加相关和准确的内容。例如，在进行内容编辑或补全时，模型能够基于此标记理解上下文中的变化。

Grounding Packing Strategy

效率提升了 5 倍，效果还稍微有点提升。

reduces overall GPU hours from 6 hours to 1 hour. Moreover, this strategy even marginally improve the performance of ScreenSpot website split from 73.3 to 76.8.
可以在 16 个节点的机器上花费 2 天微调 72B VLM。

⛔ "We train AGUVIS on a cluster of H100-80G GPUs: AGUVIS-7Buses8 nodesand completes the grounding training within5 hoursandplanning & reasoning trainingwithin1 hour.AGUVIS-72B uses 16 nodesand completes the grounding training within30 hoursandplanning & reasoning trainingwithin6 hours."

Planning & Reasoning Stage

IM 是用户自己通过 GPT-4o 构造出来的。

使用 GPT-4o 生成 planning & reasoning 数据，以下是 prompt 和示例：

上面获得的增强数据需要满足以下条件才被认为是成功的：

Match the action type and action target elements of the ground truth
Correctly describe the step’s intention
Establish a clear connection between the step’s intention and the overall goal
Assist the agent in successfully completing the task

在抽样的数据当中，作者发现 86.7％ 展现出了与真实动作和总体目标的动作意图相一致的中间推理。剩下的 7.8％ 的案例受到数据集噪声的影响（任务中的不相关或不必要动作），5.5％ 的案例则是由于在干净数据下对动作意图的误读。

作者分析发现，训练数据中的非必要动作可能致使 VLM 无法在这些多余动作和总体目标之间建立关联，最终造成不正确的推理和规划。

以下是此阶段训练使用的数据格式：

<|recipient|>all：预测 IM；<|recipient|>os：预测具体动作

作为对比，以下是上面给出的 Grounding 阶段的数据格式：

一些注意点：

planning 阶段的具体动作选择，形式上和 grounding 阶段是一样的
"Thanks to our detailed inner monologue trajectory data, we implement a reasoning mixture approach, where the model is exposed to various levels of cognitive complexity, from straightforward low-level action instructions to full inner monologues that include observation descriptions, thoughts, and detailed action plans. By dynamically adjusting the complexity of these trajectories, we train the model to be adaptable, fostering step-by-step reasoning and high-level decision-making abilities. This diversity in reasoning ensures that the model can handle a wide range of tasks with nuanced understanding and precision."

第二阶段的训练数据中，也混合了 low-level instructions 数据？

Enforced Plan & Self Plan

<|recipient|>all：预测 IM；<|recipient|>os：预测具体动作

Enforced Plan: employ the <|recipient|>all\nThought prompt to compel the model to first generate a planning phase, and then a pyautogui command.

Self Plan: do not add any word after <|recipient|>, so the model can choose to generate os to directly produce a pyautogui command, or generate all to first create natural language reasoning and then generate a pyautogui command.