
Large Action Models: From Inception to Implementation

Lu Wang∗, Fangkai Yang∗, Chaoyun Zhang∗ (Microsoft); Junting Lu†, Jiaxu Qian† (Peking University); Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin (Microsoft); Qisheng Su† (Peking University); Jiayi Ye† (Zhejiang University); Yudi Zhang† (Eindhoven University of Technology); Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang (Microsoft)

∗ These authors contributed equally to this work. For inquiries, please contact: {wlu, fangkaiyang, chaoyun.zhang}@microsoft.com.
† Work done during internship at Microsoft.

arXiv:2412.10047v1 [cs.AI] 13 Dec 2024
ABSTRACT

As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence.

In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications.

The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.

1 INTRODUCTION

In recent years, large language models (LLMs) have demonstrated remarkable advancements across a range of natural language processing (NLP) tasks [4, 69, 77]. These models, often incorporating multiple modalities such as language, vision, and speech, have become foundational in numerous AI-driven applications [26, 55, 61, 66]. Their success is evident in systems like question answering in conversational agents [43], code generation in GitHub Copilot [81], and improved search capabilities in platforms like Bing [62]. The key strengths of LLMs—namely their vast knowledge, ability to support multimodal inputs, and capacity for human-like responses—have propelled them to the forefront of AI research [45]. Their capability to generalize via zero-shot learning has further expanded the horizons of what AI systems can achieve, making significant contributions to the productivity of both everyday tasks and specialized professional activities. These innovations mark an important milestone on the path toward artificial general intelligence (AGI) [14].

However, while LLMs excel in generating intricate textual responses, they are often constrained by their inability to directly interact with or manipulate the physical world [65]. In many real-world applications, intelligent systems need to perform tasks that go beyond conversational exchanges—tasks that involve tangible actions [16]. The maxim “actions speak louder than words” [50] underscores the limitations of purely text-based interactions, as users increasingly expect intelligent agents to go beyond passive responses and engage in real-world actions. For instance, a truly transformative AI assistant could automate tasks in software applications, manage household chores, or even engage with children in meaningful ways. The realization of such capabilities would mark a revolutionary shift in how we integrate AI into our daily lives, enabling widespread automation and augmenting human capabilities across diverse environments [54].

Achieving this vision requires LLMs to extend their expertise from language processing to action generation. However, this transition is not straightforward. While leading LLMs from industry giants have demonstrated impressive performance in language-based tasks, they encounter substantial limitations when tasked with action generation [79]. Completing a task in the real world involves a sequence of complex steps: accurately understanding user intent, devising a plan, and executing the necessary actions [28].
Figure 1: The transition from LLMs to LAMs.

Current LLMs may excel at understanding and planning in textual form but often fall short when required to produce actionable outputs. This is particularly true in scenarios that demand precise task decomposition, long-term planning [12, 88], and the coordination of multi-step actions [63]. Furthermore, LLMs are generally optimized for broad, general-purpose tasks rather than tailored for specific scenarios or environments. This lack of specialization can result in suboptimal performance, especially when interacting with unfamiliar or dynamic environments where adaptive and robust action sequences are essential [39].

These limitations highlight a significant gap in the ability of LLMs to transition from passive understanding to active, real-world engagement. To address these challenges, the development of Large Action Models (LAMs) represents a transformative shift in AI capabilities [20]. Unlike traditional LLMs that primarily focus on text generation and response, LAMs are designed to perform actions in both physical and digital environments. These models are capable of interpreting user intentions from diverse data inputs, automating complex processes, planning for task completion, and interacting with the world via agents. This evolution marks a critical step toward a future where intelligent systems not only comprehend human language but can also translate that understanding into tangible, meaningful actions [85].

LAMs are often built upon the foundation of LLMs, but the transition from LLMs to LAMs is neither straightforward nor seamless, as shown in Figure 1. The process of transforming an LLM into a functional LAM involves multiple intricate stages, each requiring substantial effort and expertise. First, it is essential to collect comprehensive datasets that capture user requests, environmental states, and corresponding actions [11]. These data serve as the basis for training or fine-tuning LLMs to perform actions rather than merely generate text. This stage involves the integration of advanced training techniques that enable the model to understand and execute actions within specific environments [21]. Once the LAM has been trained, it must be incorporated into an agent system that can effectively interact with its environment. This system typically includes components for gathering observations, utilizing tools, maintaining memory, and implementing feedback loops. These components are critical for ensuring that the LAM can not only execute actions but also adapt its behavior based on real-time feedback and evolving situations [83]. The integration of these elements enhances the LAM’s capacity to perform tasks autonomously, interact meaningfully with its surroundings, and make decisions that are grounded in the context of its environment.

A final but crucial step in the development of LAMs is evaluation [73]. Before deploying a LAM for real-world applications, it is imperative to rigorously assess its reliability, robustness, and safety. Unlike LLMs, which may be limited to generating text-based outputs, LAMs have the capacity to directly affect their environment through actions. This introduces new risks, as incorrect or inappropriate actions could have significant consequences. Therefore, thorough evaluation processes are essential to ensure that both the LAM and its accompanying agent are capable of making reliable decisions while minimizing potential risks. These evaluations often involve testing the model in a variety of scenarios to ensure that it can generalize across different environments and tasks, as well as effectively handle unexpected situations.

Given the complexity involved in developing LAMs, the purpose of this paper is to provide a comprehensive understanding of LAMs and guide practitioners in transforming an LLM into a functional LAM for real-world applications. To this end, we first present an overview of LAMs, clarifying their distinctions from traditional LLMs and discussing their unique characteristics. By offering this foundational knowledge, we aim to give readers a clear conceptual understanding of LAMs, enabling them to grasp the broader implications of their development and use.

Next, we delve into the practical process of obtaining a LAM from scratch. Using a Graphical User Interface (GUI) agent on Windows
OS as an example, we provide a detailed, step-by-step exploration of the entire pipeline—beginning with data collection and preparation, followed by model training, integration, and grounding. This includes how to prepare datasets that capture user requests, environmental states, and actions, as well as how to fine-tune LLMs to generate executable actions rather than text responses. We also demonstrate how to integrate a trained LAM into an agent system, equipping it with tools, memory, and feedback mechanisms to enable dynamic interaction with its environment. The final stages focus on rigorous evaluation, ensuring that the LAM is robust, safe, and capable of handling real-world tasks. While this paper uses the Windows OS as a case study, the methodology outlined can be adapted to other environments, providing a generalizable workflow for obtaining functional LAMs. Finally, we address several limitations and challenges faced by LAMs in both research and industry. While LAMs represent a significant advancement over traditional LLMs, they are still in an early stage of development and present substantial areas for improvement. Issues such as privacy concerns, latency, safety risks, scalability, and ethical considerations all pose challenges that must be addressed for LAMs to be fully realized as practical tools.

The emergence of LAMs represents not merely an incremental advancement over LLMs, but a fundamental shift from passive language processing to active, real-world engagement. By executing actions, LAMs can interact dynamically with both digital and physical environments, marking a transformative milestone in the broader pursuit of AGI. We envision this paper as a foundational guide to LAMs, offering both theoretical insights and practical, actionable steps for creating and deploying LAMs in real-world scenarios.

2 LARGE ACTION MODELS 101

Large Action Models (LAMs) represent a significant advancement in artificial intelligence, extending the capabilities of Large Language Models (LLMs) [85]. While LLMs are proficient at generating human-like text based on user inputs, LAMs go beyond text generation by performing actions in both physical and digital environments [82]. These models interpret user intentions from various data forms, automate entire processes as per user requirements, plan for task completion, and interact with the world. This evolution signifies a shift from mere language interaction to action sequences that are grounded in real-world contexts.

2.1 Large Language Models

LLMs are neural networks with billions to hundreds of billions of parameters, trained on extensive text corpora to address general-purpose language tasks [31, 40, 75, 84, 92]. These models demonstrate exceptional capabilities in natural language understanding and generation, allowing them to perform complex tasks such as answering questions [27], generating code [86], and providing human-like textual responses [10] with minimal task-specific training, known as zero-shot [69] or few-shot [4] learning. Unlike traditional language models, which required extensive task-specific data and training, LLMs leverage their vast knowledge base to generalize across diverse tasks with minimal supervision.

While LLMs possess significant language understanding and generation capabilities, they are primarily limited to generating text-based outputs. They excel at interacting with users and generating text, but they lack the ability to directly interface with environments to execute actions. This limitation restricts their applicability in scenarios that require tangible interaction with digital or physical environments.

To extend their utility, LLMs are often embedded within agent frameworks [65]. These agent systems augment LLMs, enabling them to interact with dynamic environments by collecting data from various sources [72], structuring it into meaningful inputs [32], and prompting the LLM for inference [72]. The agent then interprets the model’s output—whether in the form of code [67] or tool-based actions [54]—and grounds it within the environment by executing actions and collecting feedback [58]. Agents equipped with LLMs typically function in a loop, continuously gathering environmental information, using LLM inference to form plans, executing those plans, and refining future actions based on feedback. This iterative process can incorporate external memory systems, enabling the agent to track historical actions and environmental states, further improving the decision-making process over time [24, 89].
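This observe–infer–act loop can be stated compactly. The sketch below is a generic, minimal illustration in Python, not the interface of UFO or any particular agent framework; the Decision type and the observe, execute, and infer callables are hypothetical placeholders:

# Minimal sketch of an LLM-driven agent loop (hypothetical interfaces).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str          # e.g., 'click(on=Button("New"), how="left")'
    is_done: bool = False

def run_agent(task: str,
              observe: Callable[[], str],
              execute: Callable[[str], str],
              infer: Callable[[str], Decision],
              max_steps: int = 30) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []      # simple external memory
    for _ in range(max_steps):
        obs = observe()                      # gather environmental information
        prompt = f"Task: {task}\nHistory: {history}\nObservation: {obs}"
        decision = infer(prompt)             # model inference forms the next step
        if decision.is_done:                 # model judges the task complete
            break
        feedback = execute(decision.action)  # ground the action, collect feedback
        history.append((decision.action, feedback))
    return history

The loop body mirrors the cycle described above: the memory (here a plain list) carries historical actions and environmental states into each new prompt.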
2.2 From LLMs to LAMs

LAMs build upon the foundational capabilities of LLMs but are specifically optimized for action-oriented tasks. They are designed to perform actions in both physical and digital environments, interpreting user intentions from various data forms, automating processes as per user requirements, planning for task completion, and interacting with the world [82]. This evolution signifies a shift from passive language interaction to generating action sequences that are grounded in real-world contexts.

An illustrative example is shown in Figure 2. An LLM can comprehend a user’s request to purchase a jacket and generate a detailed textual plan or recommendation, but it cannot autonomously complete the transaction on a website. In contrast, a LAM leverages this foundational understanding to generate action sequences that directly interact with the website, completing the request on the user’s behalf. This ability to transition from understanding to execution bridges the gap between the model and real-world applications, moving beyond mere language output to tangible outcomes.

Figure 2: The objective difference between LLMs and LAMs.

Furthermore, due to their specialization in specific domains or tasks, LAMs can be smaller in scale compared to general-purpose LLMs while achieving comparable or superior performance within their operational scope. By focusing on a narrower range of tasks, LAMs prioritize efficiency and effectiveness, leveraging targeted data and optimized architectures to reduce computational overhead without sacrificing capability. This specialization not only makes LAMs more practical for deployment in real-world applications but also opens opportunities for developing lightweight models that can operate in resource-constrained environments.

The evolution from LLMs to LAMs is achieved through specialized training and integration with agent systems. These systems enable LAMs to translate their inferences into real-world actions, bridging the gap between understanding and execution. Thus, LAMs not only enhance the functionality of LLMs but also redefine their applicability in real-world scenarios.

2.3 Key Characteristics of LAMs

LAMs are distinguished by advanced capabilities that enable them to perform complex tasks effectively. These characteristics include:

2.3.1 Interpretation of User Intentions. A fundamental capability of LAMs is the ability to accurately interpret user intentions from diverse forms of input. These inputs may include natural language requests, voice commands, images, or videos, such as device screenshots or instructional videos [8]. User inputs are often abstract or implicit [6], requiring LAMs to leverage their internal knowledge and complementary information to discern the true intent behind the input. This process involves understanding nuances, disambiguating instructions, and inferring unstated objectives. LAMs must translate these user intentions into actionable plans and steps, facilitating subsequent interactions with the environment to fulfill the user’s objectives. This requires a robust foundation in LLMs, particularly those with multi-round conversational capabilities [57], enhancing LAMs’ proficiency in engaging with users to accurately understand and execute their requests.

2.3.2 Action Generation. The hallmark feature of LAMs is their capacity for action generation grounded in the environment. LAMs translate user intentions into actionable steps that can be executed within specific contexts. These actions can take various forms: operations on graphical user interface (GUI) elements, API calls for software applications, physical manipulations performed by robots, invoking other AI agents or models, or autonomously generating code or combining meta-actions [5]. By incorporating detailed knowledge of the environment, including available actions, system states, and expected inputs, LAMs can select appropriate actions and apply them correctly to meet user requests. This involves not only executing predefined actions but also adapting to new situations by generating novel action sequences when necessary.

2.3.3 Dynamic Planning and Adaptation. LAMs exhibit a sophisticated capability for dynamic planning and adaptation, which is crucial for handling complex user requests that span multiple steps [19]. They can decompose a complex task into several subtasks, each further broken down into specific action steps. This hierarchical planning enables LAMs to approach task execution with a forward-looking perspective, anticipating future requirements and potential obstacles. Moreover, as the execution of each action alters the state of the environment, LAMs will react to these changes, adapting and revising their plans and actions accordingly [58]. This flexibility ensures robustness in dynamic scenarios where deviations from initial expectations are common. For instance, if an unexpected error occurs or a resource becomes unavailable, a LAM can replan and adjust its actions to still achieve the desired outcome.

2.3.4 Specialization and Efficiency. LAMs are fine-tuned for executing specialized sequences of actions within specific environments [8]. By focusing on particular domains, LAMs achieve a high degree of accuracy and adaptability, outperforming general-purpose LLMs in targeted applications. This specialization allows LAMs to encode comprehensive knowledge about the environment deeply into their architecture, including available actions, system constraints, and contextual nuances. As a result, LAMs can operate more efficiently, reducing computational overhead and improving response times. Furthermore, since LAMs are expected to complete actionable tasks within a more limited scope, their scale can be smaller compared to general-purpose LLMs while achieving a comparable level of performance within that specific domain. This makes LAMs more practical for deployment in real-world applications, including resource-constrained environments such as edge devices or local systems.

2.3.5 Summary. In summary, LAMs transcend the basic functionality of converting user requests into a series of steps by comprehending the underlying logic that interconnects and contextualizes these actions. They understand sequence dependencies—why certain steps must precede or follow others—and recognize when to adapt the plan to accommodate changing circumstances. LAMs extend AI systems into the realm of actionable intelligence. This significantly enhances their ability to autonomously perform complex, real-world tasks, making them invaluable in applications requiring precise interaction and manipulation within defined operational contexts.

2.4 From Inception to Implementation

LAMs have the potential to significantly extend the impact of LLMs by enabling tangible interactions with real-world environments. To harness this potential, an LAM must be developed from the ground up and deployed within a real-world application, allowing it to operate effectively in a physical environment. This process involves five critical steps, as shown in Figure 3:

(1) Data Collection and Preparation (Section 3): The first step involves gathering and curating the necessary data for the specific use case. This includes not only user queries but also environmental context, potential actions, and any other relevant data required to train the LAM effectively. The data must undergo cleaning and pre-processing before it is used for training or fine-tuning a LAM.

(2) Model Training (Section 4): Using the prepared data, the next step is to train the LAM. This training process can involve various techniques such as supervised fine-tuning and reinforcement learning to ensure the model can perform the desired actions accurately and efficiently.

(3) Offline Evaluation (Section 5): After obtaining the LAM, we evaluate its performance using an offline dataset to verify its reliability in a controlled, static environment.

(4) Integration and Grounding (Section 6): The LAM is integrated into an agent framework that serves as its operational platform. This involves grounding the model with the ability to interact with external tools, maintain memory, and interface with the environment. By equipping the LAM with these capabilities, it becomes capable of making meaningful impacts in the physical world.

(5) Online Evaluation (Section 7): Finally, the performance of the LAM must be rigorously evaluated in the real environment from multiple perspectives, including accuracy, efficiency, and effectiveness in completing tasks. This step is crucial to ensure that the LAM functions as intended and meets the desired operational standards.

Figure 3: The process pipeline for LAM development and implementation.

Through these steps, LAMs can be effectively developed and deployed to bring LLMs’ capabilities into real-world applications, enabling them to interact with and manipulate the physical environment, thereby making a tangible impact.

In the following sections, we use the Windows GUI agent UFO [83]∗ as a case study to illustrate the process of building a robust LAM from the ground up. This LAM will serve as the core inference engine for UFO, enabling it to autonomously fulfill user requests within the Windows OS environment. While this example focuses on a Windows GUI agent, the outlined steps can be adapted for developing LAMs in other scenarios or for different applications.

∗ https://github.com/microsoft/UFO

3 DATA COLLECTION AND PREPARATION

Data is a cornerstone in training LLMs, where high-quality data significantly enhances their performance [35, 68]. Similarly, LAMs require well-prepared, high-quality action-oriented data during the supervised fine-tuning phase. Off-the-shelf LLMs often face challenges when interacting with real-world environments. These difficulties typically arise from either a lack of domain-specific knowledge or the generation of hallucinated outputs that fail to be actionable. To mitigate these issues, we adopt a two-phase data collection approach: task-plan collection and task-action collection, as shown in Figure 4. Specifically:

(1) Task-Plan Data Collection: In this phase, we collect data consisting of tasks and their corresponding plans. Tasks are user requests expressed in natural language, while plans are detailed, step-by-step procedures designed to fulfill these requests. For example, a task such as “How to change the font size in Word?” would have a corresponding plan outlining the steps required to complete the task. This data is used to fine-tune the model to generate effective plans and improve its high-level reasoning and planning capabilities. However, task-plan data cannot be directly executed in the environment, requiring the following data conversion phase.

(2) Task-Action Data Collection: In this phase, the task-plan data is converted into task-action data, which includes tasks, plans, and the associated action sequences needed to execute those plans. Tasks and plans are refined to become more concrete and grounded within a specific environment. Action sequences are generated at this stage, such as select_text(text="hello") or click(on=Button("20"), how="left", double=False), which represent actionable instructions capable of directly interacting with the environment. This enriched data provides the necessary granularity for training an LAM to perform reliable and accurate task executions in real-world scenarios.

The task-plan data aims at enhancing the model’s high-level planning capabilities, allowing it to generate detailed, step-by-step plans based on user requests. Meanwhile, the task-action data focuses on refining the model’s ability to execute these plans by converting each planned step into a concrete, executable step or sequence while considering environmental feedback. The data collection and preparation pipeline ensures that the model is capable of both high-level planning and low-level action execution, thereby bridging the gap between natural language plans and executable actions.

In the following sections, we detail the methodologies employed for data collection, pre-processing, and integration of task-plan and task-action data. We illustrate how these steps enable the LLM-to-LAM transformation.
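To make the two record types concrete, the following sketch shows the shape of one task-plan sample and one task-action step as Python literals. The field names (task, plan, step, controlLabel, controlText, function, args) mirror the examples shown later in Figures 4 and 7; the specific values are illustrative only:

# Illustrative shapes of the two data types (values are made up).
task_plan_sample = {
    "task_id": "word_001",
    "task": "How to highlight text in Word?",
    "plan": [
        "1. Select the text you want to highlight.",
        "2. Click the highlight button.",
    ],
}

task_action_step = {
    "step": "click the highlight button",
    "controlLabel": "37",                   # ID of the target UI control
    "controlText": "Text Highlight Color",  # visible text of the control
    "function": "click_input",              # executor function to invoke
    "args": {"button": "left", "double": False},
}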
Figure 4: The two-phase data collection and preparation process.

Figure 5: The pipeline to construct the task-plan data.

3.1 Task-Plan Data

Figure 5 outlines a multi-step pipeline for collecting and processing task-plan data, essential for training LAMs. The process begins with gathering raw data from diverse sources, including application documentation, WikiHow, and historical search queries. This is followed by structured pre-processing to ensure that the data is high-quality and relevant to specific tasks.

3.1.1 Data Sources.

(1) Application Documentation: Documentation and usage manuals for software applications provide authoritative task descriptions. These resources, maintained by product teams, are considered highly reliable. Relevant documentation, such as M365 documentation†, is crawled, with outdated or inaccessible pages being filtered out. The HTML content is converted into markdown format, and GPT-4o is used to extract task-plan pairs in the desired structured format.

(2) WikiHow: WikiHow‡ hosts a wide range of how-to articles, including application-specific operational guides. Webpages related to Windows platform applications are crawled, and GPT-4o extracts task and plan components, ensuring the resulting data aligns with the desired structured format.

(3) Historical Search Queries: Search engine logs provide insight into real user demands, addressing gaps not covered by formal documentation. From Bing search logs, a 1% sample of queries mentioning application names (e.g., Word, Excel, PowerPoint) from the past year was taken.

† https://learn.microsoft.com/en-us/microsoft-365/?view=o365-worldwide
‡ https://www.wikihow.com/Main-Page

3.1.2 Data Extraction and Pre-Processing. The initial step in processing raw data involves parsing to extract task-relevant content while filtering out unnecessary or irrelevant information. This includes removing non-English entries, samples that are excessively short or long based on predefined heuristics, and data unrelated to actionable tasks (e.g., content focused on smartphone operations). The filtered data is then standardized into a unified format for further processing.

3.1.3 Data Construction. To create structured JSON samples, GPT-4o is employed to extract and format tasks along with their associated plans. For historical search queries, synthetic data is generated to enrich the raw input, addressing the common issue of insufficient context. GPT-4o reformulates these queries into complete, sentence-like user requests, ensuring consistency across all data sources and facilitating effective downstream processing.

The resulting dataset contains structured JSON samples, with each entry including a unique task identifier (task_id), the task description (task), and a step-by-step plan (plan). An example is shown below:

{
  "task_id": "word_032",
  "task": "Add a border to a page in Word",
  "plan": [
    "1. Go to Design > Page Borders.",
    "2. Make selections for how you want the border to look.",
    "3. To adjust the distance between the border and the edge of the page, select Options. Make your changes and select OK.",
    "4. Select OK."
  ]
}

With the above process, we initially collected a total of 29,182 task-plan data samples.
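As a rough illustration of the pre-processing in Section 3.1.2, the sketch below applies the three filters plus deduplication to raw text samples. The length bounds and the English-ratio heuristic are assumptions made for illustration; the paper states only that predefined heuristics are used:

# Minimal sketch of the pre-processing filters (thresholds are assumed).
import re

MIN_WORDS, MAX_WORDS = 3, 200          # assumed length bounds
NON_ACTIONABLE = re.compile(r"\b(iphone|android|smartphone)\b", re.IGNORECASE)

def looks_english(text: str) -> bool:
    # Crude heuristic: most characters should be ASCII.
    return sum(ch.isascii() for ch in text) / max(len(text), 1) > 0.9

def keep_sample(text: str) -> bool:
    n_words = len(text.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False                    # excessively short or long
    if not looks_english(text):
        return False                    # non-English entry
    if NON_ACTIONABLE.search(text):
        return False                    # unrelated to desktop tasks
    return True

def deduplicate(samples: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for s in samples:
        key = " ".join(s.lower().split())  # normalize whitespace and case
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique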
3.1.4 Data Evolving. With the initial dataset processed, we employ data augmentation techniques to enhance its diversity and complexity. Inspired by WizardLM [74] and AgentGen [23], we use GPT-4o to evolve the raw tasks to generate new task-plan pairs, improving the model’s ability to follow instructions and handle more complex tasks.

The data evolving process generates new tasks from the original ones by introducing additional complexity, constraints, or steps while preserving relevance. The guidelines for task evolution are as follows:

– The evolved task must be executable step-by-step on a Windows OS or application.
– The evolved task should include additional requirements, increasing its complexity without exceeding 20 extra words.
– The evolved task must remain concise and related to the original task.

For each evolved task, GPT-4o generates a corresponding plan adhering to the following guidelines:

– The plan must provide correct and actionable steps for Windows environments or applications.
– The plan should be concise and highlight critical action objects using bold emphasis.

This augmentation process results in a richer dataset where tasks become progressively more challenging, and plans incorporate domain-specific knowledge. For example:

Raw task: Create a drop-down list in Excel for Office 365.
Evolved task: Create a dependent drop-down list in Excel for Office 365, where selecting an item from the first list filters options in the second list.
Evolved plan:
– Prepare your data by organizing it into two columns. The first column contains items for the primary drop-down list, and the second column contains items for the dependent list.
– Name your ranges for the first and second lists.
– Create the primary drop-down list using Data Validation.
– Use the INDIRECT function to create the dependent drop-down list linked to the first selection.
– ···

Using data augmentation, we increased the original task-plan dataset by 150%, generating a larger pool of samples. This augmentation significantly enhances the diversity and complexity of the dataset, allowing the model to learn from a broader range of scenarios and develop robust planning capabilities. The augmented data introduces more challenging tasks and detailed plans, further enriching the training process and enabling the LAM to handle complex real-world applications effectively.

3.2 Task-Action Data

The task-plan data collected in the previous stage provides high-level, step-by-step plans for resolving user-requested tasks, serving as general guidelines. However, these plans are textual and not directly executable in a real-world environment. For instance, a task-plan data sample for the task “Highlight text in document” outlines the necessary steps but does not translate into actionable instructions for interacting with the application’s GUI. This gap highlights the need for actionable task-action data to bridge the divide between planning and execution. To enable LAMs to produce actionable outputs, we generate task-action data derived from the previously collected task-plan data. Task-action data captures the granular interactions required to complete a task in the application environment, including GUI navigation, button clicks, and responding to environmental feedback.

Traditional approaches for action data collection often involve manual or agent-based annotation for each task, which is both costly and labor-intensive. To address these limitations, we propose an efficient, fully automated, and low-cost pipeline that leverages LLMs and real-world application interactions. This pipeline consists of four stages, as depicted in Figure 6: Instantiation, Execution, Evaluation, and Post-Processing. Specifically,

(1) Instantiation: In this stage, the task-plan data is transformed into an executable trajectory. Using an LLM, each task is instantiated with specific operational objects, and the related high-level plan is instantiated into a concrete sequence of actions that can be directly executed in the application environment.

(2) Execution: The instantiated trajectory is then executed within the real-world application environment. During this stage, the system interacts with the application’s GUI to carry out the specified actions. For example, the instantiated trajectory for highlighting text would involve selecting the appropriate text, navigating to the highlight tool, and applying the highlight. The result of this execution is the captured executed trajectory, including any feedback or environmental changes observed during the process.

(3) Evaluation: Once the execution is complete, the trajectory is evaluated for correctness using an LLM. The evaluation stage verifies whether the executed trajectory successfully accomplishes the intended task. This involves comparing the observed outcomes with the expected results outlined in the task-plan data. Tasks that fail to meet the criteria are flagged for review, while successful executions are retained for further processing.

(4) Post-Processing: In the final stage, successful task-action trajectories undergo post-processing to ensure consistency, completeness, and readiness for training. This includes refining the data format, ensuring compatibility with the training pipeline, and annotating the data with relevant metadata (e.g., task IDs, execution time, and step-by-step feedback). The post-processed task-action data is then added to the training dataset, enabling the LAM to learn from real-world interactions.

The pipeline minimizes human intervention and reduces the number of LLM calls required, significantly improving scalability and efficiency.

3.2.1 Instantiation. The task-plan data are primarily collected from help documents or public websites, creating a gap between the generalized task-plan data and the specific requirements needed for execution within a particular environment. A common issue is the lack of specificity. For instance, the task “highlight text in document” does not specify actionable objects, such as “which text”
Figure 6: The pipeline of task-action data conversion and collection.

Figure 7: An example of task instantiation.


or “which document”. This lack of detail poses significant challenges in executing tasks within real-world applications.

To address this problem, we instantiate the task-plan data to impute target objects and related functions. First, we prepare template Word files to serve as specific targets for the actions. These template files include various Word components such as paragraphs, tables, and figures. Each template file is accompanied by a description indicating its content, providing context for grounding actions. Several sample template files can be found in Appendix A.

Given a task-plan data sample, the task description is matched with the template file descriptions to select an appropriate template file as the target for actions. GPT-4 is then prompted to instantiate the task-plan with target objects present in the selected template file (detailed prompts can be found in Appendix B.1). Simultaneously, we filter relevant functions from the available function pool using the task description, allowing the instantiation process to populate the task-action data with specific functions and their input parameters.

As a result of this process, the task description becomes more concrete and grounded in a specific environment, while the corresponding action sequences needed to complete the task are generated. Figure 7 provides an example of the instantiation process. Notably, the task-action data is not directly generated with GPT-4 due to the risk of hallucinations. Instead, instantiating grounded task-plan data ensures the generation of more reliable and faithful step-by-step actions.

3.2.2 Execution. To ensure that the steps in the instantiated task-plan data are accurate and truly actionable, the execution stage verifies the action sequence by matching control items with the real application environment and performing the specified actions. This process validates the task-action data, ensuring its correctness and compatibility with the application GUI.

For instance, as shown in Figure 7, the control item “Text Highlight Color” with its associated control label is retrieved using the action text “Highlight” from the control item pool. The corresponding task-action data is then executed in the application without further intervention from the LLM. During execution, if an error occurs (e.g., a mismatch between the predicted control item and the actual environment), the instantiated task is discarded. Conversely, if all actions in the task execute successfully, the action-validated task is forwarded to the evaluation stage described in the following section. Additionally, screenshots of the application environment are captured after each step in the execution process, forming a detailed trajectory to assist in subsequent evaluation.

It is important to note that the instantiated task-action data is not guaranteed to be valid. Since the data is generated through a single GPT-4 call based on task-plan data, it lacks the step-by-step refinement that might be necessary for certain tasks. In some cases, execution results from previous steps are required to instantiate subsequent steps accurately. In such scenarios, the one-call instantiated task-action data may fail in validation and is removed from the dataset. This execution stage bridges the gap between planning and action, ensuring that task-action data is actionable, robust, and aligned with real-world application requirements.

3.2.3 Evaluation. Even if the task-action data is successfully executed in the real application without errors, further evaluation is required to ensure its validity. Some tasks may be incorrectly instantiated from the task-plan data, resulting in trajectories that, while executable, do not fulfill the original task description. Similarly, the executed results might fail to align with the intended task outcomes. For evaluation, we utilize the instantiated task along with its execution trajectory, which includes:

– Consecutive actions performed during execution.
– Screenshots captured before and after each action.
– Environmental changes observed between the initial and final states§.

§ More specifically, we compare the .xml files, which are the underlying data representation of Microsoft Word.

Using this comprehensive trajectory, we prompt GPT-4o to evaluate whether the executed task aligns with the original task description and achieves successful completion. The evaluation considers both the sequence of actions and the resulting application state. The process assigns a “task-complete” key to indicate the outcome as
“yes,” “no,” or “unsure.” If the task is evaluated as “yes,” the trajectory is deemed successful; otherwise, it is classified as a failure. The detailed prompt used for this evaluation is provided in Appendix B.2. This evaluation step ensures that only valid, accurate task-action data is included in the training dataset, contributing to the reliability and robustness of the LAM.

3.2.4 Post-Processing. As noted in Section 3.2.2, a trajectory was recorded during the execution process. This trajectory includes:

– Screenshots captured at each step.
– Environment states before and after each action.
– Plans and corresponding actions for every step.

During the post-processing stage, these trajectories are combined with the original task requests to generate synthetic step-wise training data. The resulting data format uses the task request as input and the LAM’s plan and actions as output. This structured format is critical for training LAMs to map task requests to actionable sequences effectively. The detailed template for the data format can be found in Appendix C.

4 MODEL TRAINING

Our objective is to develop an LAM from scratch that can map user inputs to appropriate plans and executable actions, ultimately enabling complex task completion. To achieve this, we adopt a staged training strategy consisting of four phases, each building upon the previous one. As illustrated in Figure 8, these phases guide the model from learning structured task plans, to imitating expert demonstrations, to self-boosting from its own successes, and finally leveraging reward-based optimization. Throughout these stages, the model progressively evolves from LAM1 to LAM4.

Figure 8: The overview of the LAM training pipeline.

At a high level, Phase 1: Task-Plan Pretraining provides a strong foundation by teaching the model to generate coherent, step-by-step plans for various tasks. Phase 2: Learning from Experts then introduces action trajectories labeled by GPT-4o, enabling LAM2 to align its plan generation with actionable steps. However, relying solely on expert successes limits diversity and adaptability. To address this, Phase 3: Self-Boosting Exploration encourages the model to tackle tasks that even GPT-4o failed to solve, autonomously generating new success cases and evolving into LAM3. Finally, Phase 4: Learning from a Reward Model incorporates reinforcement learning (RL) principles, allowing LAM4 to learn from both successes and failures, refining its decision-making in complex, previously unseen scenarios. Table 1 summarizes the data used in each phase. Each phase uses different training objectives, namely (i) task-plan pretraining (Phase 1) and (ii) decision-making training (Phases 2–4), as detailed in Appendix E.

4.1 Phase 1: Task-Plan Pretraining

The initial stage focuses on imparting a broad understanding of how tasks can be decomposed into logical steps. We start with Mistral-7B [25] as the base model. A total of 76,672 task-plan pairs $(t_i, P_i)$ are collected from various sources, including application help documentation, WikiHow, and historical search queries. Of these, 29,182 pairs are sourced directly, while 47,490 are generated via data evolution techniques (as described in Section 3.1.4), enriching the dataset with more complex and diverse tasks.
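To make the training format concrete, the sketch below shows how a single task-plan pair $(t_i, P_i)$ might be serialized into an input/output example for supervised fine-tuning, with the loss later applied only to the output (plan) tokens. The prompt wording is a hypothetical illustration; the actual templates are given in the paper’s appendices:

# Hypothetical serialization of a task-plan pair (t_i, P_i) into an SFT
# example; the real prompt template differs.
def to_sft_example(task: str, plan: list[str]) -> dict[str, str]:
    prompt = f"Task: {task}\nPlan:"   # model input (loss-masked)
    target = "\n".join(plan)          # model output (loss applied here)
    return {"input": prompt, "output": target}

example = to_sft_example(
    "Add a border to a page in Word",
    ["1. Go to Design > Page Borders.",
     "2. Make selections for how you want the border to look.",
     "3. Select OK."],
)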
Table 1: Training data summary for each phase of LAM training.

Model        | Data Type                       | Data Source                                                                  | Input → Output Format | Data Size
LAM1         | Task-Plan Pairs                 | Application documentation, WikiHow, historical search queries, evolved data  | t_i → P_i             | 76,672 tasks
LAM2         | Task-Action Trajectories        | GPT-4o                                                                       | s_t → a_t             | 2,192 trajectories
LAM3         | Task-Action Trajectories        | LAM2 + GPT-4o                                                                | s_t → a_t             | 2,688 trajectories
LAM4         | Task-Action-Reward Trajectories | RM + LAM3                                                                    | (s_t, r_t) → a_t      | 1,788 trajectories
Reward Model | Task-Action-Reward Trajectories | GPT-4o + LAM3                                                                | (s_t, a_t) → r_t      | 4,476 trajectories

In this phase, LAM1 is trained via supervised fine-tuning (SFT) to predict the correct plan sequence $P_i$ for a given task $t_i$:

$$\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_1}) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{CE}}\big(P_i^{\mathrm{pred}}, P_i^{\mathrm{true}}\big).$$

Here, $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss, and $N$ is the number of tasks. Although no actions are generated at this stage, LAM1 gains a robust task-agnostic planning capability. This knowledge will prove critical in guiding the model’s action execution in later phases, ensuring that the agent understands the logical structure of tasks before attempting to perform them.

4.2 Phase 2: Learning from Experts

While LAM1 can produce structured plans, it lacks the ability to execute them. In Phase 2, we introduce expert-labeled task-action trajectories from GPT-4o (Section 3.2) to teach the model how to perform actions. The illustrative application in this paper is the Microsoft Word environment, where we have 2,192 successful expert trajectories. Each trajectory consists of a sequence of state-action pairs $(s_t, a_t)$, representing observed UI states and the corresponding actions to progress the task.

We split these 2,192 trajectories into a training set of 1,757 and a test set of 435 trajectories, providing a total of 3,959 steps for training. By applying imitation learning to LAM1 on these successful action sequences, we obtain LAM2. The objective is to minimize:

$$\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_2}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{\mathrm{CE}}\big(\mathrm{LAM}_{\theta_2}(s_t), a_t\big),$$

where $N$ is the number of trajectories and $T_i$ is the number of steps in trajectory $i$. By imitating the expert’s policy, LAM2 transforms from a passive planner into a model capable of executing actions aligned with its plans, grounding its reasoning in the real application environment.

4.3 Phase 3: Self-Boosting Exploration

Up to Phase 2, LAM2 only learns from successful trajectories provided by GPT-4o. This limits diversity and adaptability, as the model never sees how to handle situations that even GPT-4o could not deal with. To overcome this limitation, Phase 3 introduces self-boosting exploration.

Here, we revisit failed GPT-4o trajectories, i.e., tasks that GPT-4o did not complete successfully, and let LAM2 attempt them. Using the ReAct mechanism [58, 80], LAM2 interacts with the environment and tries alternative strategies for these challenging tasks. From these attempts, we sampled 2,284 GPT-4o failed tasks and then collected 496 newly successful trajectories generated by LAM2 itself. These self-labeled successes, combined with the original 2,192 GPT-4o successes, form an augmented dataset.

We then fine-tune LAM2 on this enriched data, yielding LAM3:

$$\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_3}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{\mathrm{CE}}\big(\mathrm{LAM}_{\theta_3}(s_t), a_t\big).$$

This self-boosting step allows the model to learn from its own newly discovered solutions, overcoming previous limitations and improving adaptability. By leveraging planning knowledge from Phase 1 and expert strategies from Phase 2, LAM3 becomes more resourceful, even in scenarios with sparse or absent expert guidance.

4.4 Phase 4: Learning from a Reward Model

Despite the improvements, Phases 1–3 focus on successes or expert-like behavior. They offer limited insights into intermediate decision quality and fail to exploit learning opportunities presented by failed attempts. In Phase 4, we integrate reinforcement learning (RL) to address these shortcomings.

To this end, we design a two-stage approach, where we first build a reward model (RM) using LAM3 as the base model, with an additional output layer added to produce scalar values representing the quality of actions. Using the trained RM, we fine-tune LAM4 in an offline RL setting. Here, the model refines its policy without additional environmental interactions, leveraging previously collected trajectories to learn from failures and improve action selection.

4.4.1 Reward Model Training. First, we train a reward model (RM) on both LAM3’s successful (496) and failed (1,788) trajectories and GPT-4o’s successful trajectories (2,192) gathered in previous phases. All steps in successful trajectories are assigned a reward of +1, and all steps in failed trajectories a reward of −1. This uniform, binary labeling of outcomes ensures the RM consistently captures overall trajectory quality. Formally:

$$r_t = \mathrm{RM}(s_t, a_t; \phi),$$

where $\phi$ denotes the RM parameters, and $r_t \in \{+1, -1\}$ is the assigned reward. The RM is trained via mean squared error (MSE) to approximate these ground-truth rewards.

The training dataset for the RM includes both failed and successful task-action trajectories generated by LAM3, as well as the successful trajectories from the collected task-action data. All steps in successful trajectories receive a reward of +1, while every step in failed trajectories is assigned a reward of −1. This uniform labeling strategy ensures that the RM consistently reflects overall trajectory quality and effectively guides policy optimization.
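A compact sketch of such a reward model is given below: a scalar value head on top of the base model’s final hidden state, trained with MSE against the ±1 step labels. The Hugging Face-style forward signature and the last-token pooling are assumptions made for illustration, not the paper’s implementation:

# Sketch of a reward model: base LM + scalar head, MSE on +/-1 labels.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, hidden_size: int):
        super().__init__()
        self.base = base_model                       # e.g., the LAM3 backbone
        self.value_head = nn.Linear(hidden_size, 1)  # scalar action-quality score

    def forward(self, input_ids, attention_mask):
        out = self.base(input_ids=input_ids,
                        attention_mask=attention_mask,
                        output_hidden_states=True)
        last_hidden = out.hidden_states[-1]          # (batch, seq, hidden)
        pooled = last_hidden[:, -1, :]               # last-token representation
        return self.value_head(pooled).squeeze(-1)   # estimate of r_t

def rm_loss(pred_rewards: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # labels: +1 for steps of successful trajectories, -1 for failed ones
    return nn.functional.mse_loss(pred_rewards, labels)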
4.4.2 Optimizing with Offline PPO [56]. Armed with the RM to evaluate intermediate actions, we fine-tune LAM4 via offline PPO (Proximal Policy Optimization). This stage focuses on the 1,788 failure trajectories collected during Phase 3, providing a unique opportunity to learn from mistakes. The training objective of PPO is:

$$\mathcal{L}_{\mathrm{PPO}}(\mathrm{LAM}_{\theta_4}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \min\!\left( \frac{\mathrm{LAM}_{\theta_4}(a_t \mid s_t)}{\mathrm{LAM}_{\theta_4^{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t,\ \mathrm{clip}\!\left( \frac{\mathrm{LAM}_{\theta_4}(a_t \mid s_t)}{\mathrm{LAM}_{\theta_4^{\mathrm{old}}}(a_t \mid s_t)},\ 1-\epsilon,\ 1+\epsilon \right) \hat{A}_t \right),$$

where $\hat{A}_t$ denotes the advantage derived from RM-generated rewards, and $\epsilon$ is a clipping parameter to ensure stable updates.

By incorporating signals from both successes and failures, LAM4 gains a deeper understanding of action quality. This RL-based fine-tuning helps the model generalize to complex, previously unseen scenarios, ensuring more robust and reliable decision-making.
4.5 Summary
The four-phase training pipeline incrementally builds a fully capa- 5.2 Task-Plan Pretraining Results (Phase 1)
ble LAM. Phase 1 imparts a fundamental planning ability, Phase 5.2.1 Evaluation Metrics. We evaluate LAM1 on its ability to gener-
2 incorporates expert knowledge for action execution, Phase 3 ate task plans. We use three metrics for this evaluation: (i) Task Suc-
empowers the model to generate and learn from new successes, cess Rate (TSR), measuring whether the predicted plan matches
and Phase 4 leverages rewards from both successes and failures to the ground truth at the task level; (ii) Step Precision, evaluating
optimize decision-making. By combining static knowledge with the proportion of predicted plan steps that appear in the ground
expert demonstrations, self-guided exploration, and reward-based truth; and (iii) Step Recall, assessing the proportion of ground
refinement, we transform a general-purpose language model into a truth plan steps that are correctly predicted.
versatile LAM. This progressive training strategy ensures a robust, To compute these metrics, we leverage GPT-4o to compare each
adaptive model ready to handle diverse and complex tasks. step of the LAM1 output with the corresponding ground truth steps.
The counts of matched steps are then used to calculate the final
5 OFFLINE EVALUATIONS evaluation metrics. Detailed prompt information for the evaluation
can be found in Appendix D.
The offline evaluation results of Task-Plan Pretraining Results
(Phase 1) and Task-Action Results (Phases 2–4) will be pre- 5.2.2 Performance of LAM1 on Planning. Table 2 presents the per-
sented in this section. Offline evaluation allows us to systematically formance of LAM1 in planning prediction across 15,334 tasks on
assess the performance of LAM1 and subsequent phases (LAM2 , Windows OS, utilizing the dataset detailed in Section 3.1. LAM1
LAM3 , and LAM4 ) without interacting with the environment. This achieves a TSR of 82.2%, which is comparable to GPT-4o’s TSR
setup effectively provides a controlled and reproducible framework of 84.5%. While GPT-4o demonstrates a slightly higher TSR, it ex-
for comparing task success rates, precision, and recall metrics across hibits a lower Step Precision of 28.2%, indicating inefficiencies in
models. its planning by generating additional unnecessary steps. In con-
trast, LAM1 achieves a higher Step Precision, reflecting its ability
5.1 Experiment Setup to produce more efficient and accurate plans. This superior preci-
5.1.1 SFT Training (Phase 1, 2, 3). For supervised fine-tuning (SFT), sion is attributed to LAM1 ’s training regimen, which incorporates
the learning rate is set to 2 × 10 −5 with cosine decay and 2 warmup domain-specific knowledge through task-plan pretraining.
steps. The batch size is 16, and the training is conducted for 3 epochs Additionally, the baseline Mistral-7B model, without any fine-
on the training data. Loss is calculated only for the target tokens tuning, performs inadequately with a TSR of 0.0%, Step Precision
rather than the full input sequence, optimizing the efficiency of the of 0.1%, and Step Recall of 0.5%. These stark results underscore
fine-tuning process. The training is performed on 8 × A100 80G the critical importance of task-plan pretraining in transforming a
NVIDIA GPUs. general-purpose language model into a competent task planner.
Overall, the evaluation highlights that while general-purpose
5.1.2 Reward Training (Phase 4). Reward scores are normalized to models like GPT-4o can achieve high success rates, their lower step
the range [0, 1] using sigmoid function. We employ the LoRA (Low- precision suggests a propensity for overcomplicating plans. In con-
Rank Adaptation) method [22] to train the reward model (RM). The trast, specialized models like LAM1 not only maintain competitive
LoRA parameters include rank of 8, LoRA alpha of 32, and LoRA success rates but also generate more streamlined and accurate ac-
dropout of 0.1. The task type is sequence classification. The training tion sequences. This validates the effectiveness of targeted training
process uses learning rate of 2 × 10 −5 with linear decay, optimized approaches in enhancing planning capabilities and demonstrates
11
Table 3: Offline performance comparison across different models and metrics on decision making.

Metric LAM1 LAM2 LAM3 LAM4 GPT-4o (Text-only) GPT-4o Mini (Text-only)
Object Acc (%) 39.4 85.6 87.4 87.8 73.2 74.6
Operation Acc (%) 59.9 97.3 97.7 97.7 94.2 91.5
Status Acc (%) 32.7 97.8 98.2 99.0 52.1 67.4
Step Success Rate (SSR) (%) 33.0 83.6 85.9 86.2 68.8 73.4
Task Success Rate (TSR) (%) 35.6 76.8 79.3 81.2 67.2 62.3

the necessity of task-plan pretraining for developing reliable and training process relies on progressively collected data and incre-
efficient task planners. mental refinements tailored to each phase.
The step-by-step training strategy explains these gains. In Phase 1
5.3 Task-Action Results (Phases 2–4) (LAM1 ), task-plan pretraining establishes a foundational under-
standing of task structures, resulting in a modest increase in TSR.
5.3.1 Evaluation Metrics. To assess the performance of agents in
In Phase 2 (LAM2 ), imitation learning on GPT-4o-labeled success
completing tasks, we employ five primary metrics: Object Ac-
trajectories imparts efficient execution strategies, driving a signifi-
curacy (Object Acc.), Operation Accuracy (Operation Acc.),
cant jump in TSR from 35.6% to 76.8%. Phase 3 (LAM3 ) introduces
Status Accuracy (Status Acc.), Step Success Rate (SSR), and
self-boosting exploration, where LAM autonomously tackles cases
Task Success Rate (TSR). The definitions and calculation methods
previously failed by GPT-4o. This yields an additional increase in
for these metrics are detailed below:
TSR to 79.3%. Finally, in Phase 4 (LAM4 ), reward-guided fine-tuning
(1) Object Accuracy (Object Acc.): This metric measures the refines decision-making based on sparse feedback, further elevating
accuracy of selecting the correct control object for each task TSR to 81.2%.
step. The predicted object is compared with the set of acceptable An important outcome is that the LAM framework enables the
objects defined in the ground truth. It evaluates the agent’s model to surpass GPT-4o, despite GPT-4o providing initial annota-
ability to correctly identify and interact with the appropriate tions. Through targeted data collection and progressive refinement,
UI elements. LAM not only assimilates the strengths of GPT-4o, but also learns
(2) Operation Accuracy (Operation Acc.): For operations such from its failures to develop more robust and adaptable policies. The
as Click, Type, or Select Option, this metric evaluates the ReAct mechanism plays a crucial role here, allowing LAM2 and
correctness of the predicted action. It ensures that the agent beyond to gather new success trajectories from challenging tasks,
performs the correct operation as specified in the ground truth. thereby enhancing its policy and overall performance.
(3) Status Accuracy (Status Acc.): This metric assesses whether In summary, the phased training approach and judicious data uti-
the agent correctly identifies the task’s completion status based lization enable LAM to excel where a state-of-the-art LLM (GPT-4o)
on its predictions. It evaluates the agent’s understanding of the falls short. This highlights the effectiveness of the LAM framework
overall progression and whether the task is marked as finished in crafting agents that are both data-efficient and capable of execut-
appropriately. ing complex, multi-step tasks with high accuracy and reliability.
(4) Step Success Rate (SSR): A step is considered successful only
if the selected object, predicted operation, and predicted status
are all correct. This metric evaluates each step of the task inde- 6 INTEGRATION AND GROUNDING
pendently by comparing the predicted outputs with the ground Once the LAM is trained, we integrate it into the GUI agent UFO [83],
truth action history. enabling the model’s predicted actions to be grounded and ex-
(5) Task Success Rate (TSR): A task is considered successful only ecutable within the Windows OS environment. The UFO agent
if all steps within the task are successful, making this a stringent accepts user requests in natural language and completes tasks by
evaluation metric. This metric provides a holistic measure of the interacting with the UI controls of Windows applications.
agent’s ability to complete complex, multi-step tasks accurately.
These metrics collectively cover various aspects of agent perfor-
mance, including precision in object selection, operation execution,
6.1 LAM Agent In a Nutshell
task understanding, and overall task completion. By combining In UFO, the LAM serves as the inference engine within the Ap-
step-level and task-level evaluations, they provide a comprehensive pAgent, enabling efficient and accurate task completion. Figure 9
assessment of the agent’s effectiveness in real-world task execution. illustrates the architecture of the AppAgent. UFO, equipped with
LAMs, is designed for interactive engagement with Windows ap-
5.3.2 Performance on Decision Making. Table 3 summarizes the plications. For simplicity, we focus on automating tasks within
results on 435 tasks of the Word Application. The four-phase LAM Microsoft Word, a widely used productivity tool with a sophisti-
training framework demonstrates incremental and cumulative im- cated GUI and diverse functionalities, making it an ideal testbed
provements in task completion. Notably, LAM4 achieves a TSR for training and evaluating LAM.
of 81.2%, outperforming both GPT-4o (67.2%) and GPT-4o-mini During each inference step, the agent collects critical contex-
(62.3%). This performance gap is substantial, considering that LAM’s tual information from the application environment, which is then
12
AppAgent from a predefined list. The function calls inferred by LAM are lim-
Memory ited to pre-defined operations, such as mouse and keyboard actions,
action t-1
...
action t-n plan t-1 as well as APIs specific to Word-related tasks. Once inferred, these
operations are parsed and executed within the environment.
Environment
Action
Request Sequence Grounding 6.4 Action Execution
Action UFO employs a control interactor to ground the action strings
Feedback ...
LAM Executor generated by LAMs, translating them into tangible impacts within
Env. State Data the target application. Each action typically consists of two key
Input Collection components:
{“type”: Botton, “title”: “New”,“position”: [0.45, 0.78] } (1) Control Element: This refers to the specific UI control
{“type”: Edit, “title”: “Document”,“position”: [0.87, 0.43] }
{“type”: Botton, “title”: “Design”“position”: [0.25, 0.21] } within the application that will receive the action, such as a
{“type”: ComboBox, “title”: “SaveAs”“position”: [0.67, 0.32] }
button, text box, or scroll bar.
Screenshots UI Information
(2) Function Call: This represents the operation to be per-
formed on the control element, such as a mouse click, key-
Figure 9: The overall architecture of the AppAgent employed board input, or invocation of native APIs.
in UFO.
By combining the control element and its associated function call,
UFO executes the inferred actions within the application.
passed to the LAM for decision-making. The LAM performs plan-
ning, orchestrates actions, and infers the necessary steps to fulfill 6.5 Memory
the user request. These inferred actions are grounded in the envi- UFO maintains additional information in its memory to assist LAMs
ronment by mapping them to predefined tools and function calls in making more informed and accurate decisions. This memory
used by the agent, such as mouse clicks, keyboard inputs, or API includes:
calls. This process iterates, with LAM continuously adjusting its (1) Historical Actions: A log of action trajectories and their ex-
plan based on real-time feedback from the environment, until the ecution results from the initial step onwards. This helps LAM
task is completed. Additionally, the agent maintains a memory that understand the current system state and aids in exploring
logs historical actions and plans, providing essential context for the the next steps based on prior actions.
LAM to make more informed and adaptive decisions as the task pro- (2) Previous Plan: The textual planning for future actions, gen-
gresses. This integration ensures that UFO can efficiently manage erated by LAM in the previous step. This serves as a refer-
and complete complex, real-world tasks in Windows environments. ence for guiding the current and future actions, ensuring
consistency across steps.
6.2 Environment
This memory is fed into LAM at each decision point, allowing for
The UFO agent leverages the LAM to interact with applications in more effective decision-making. By maintaining a comprehensive
the Windows environment. At each decision step, UFO employs record of past actions and plans, LAMs can better understand what
the UI Automation (UIA) API [13] to inspect all actionable con- has been accomplished, what remains to be done, and the outcomes
trols within the target Windows application, retrieving contextual of previous actions. This situational awareness enhances LAMs’
information for each control¶ . This information is passed to the ability to complete user requests more effectively and efficiently.
LAM for control selection and action inference. The control data is
structured as a list of dictionaries, where each control is assigned a 7 ONLINE EVALUATIONS
numerical index (as a label), along with its title and control type,
With the integration of the Windows GUI agent UFO, we evaluate
allowing the LAM to make informed decisions regarding control
the performance of the LAM in real-world environments. The eval-
selection and the corresponding action. This input format mirrors
uation process and results are detailed in the following subsections.
the structure used during offline data collection for consistency in
training and execution.
7.1 Testing Dataset
6.3 LAM Inference The online performance of LAM is evaluated on the same set of 435
test requests used during LAM training. The testing environments,
Using the environmental observations of application control in-
specifically the Word document templates corresponding to each
formation, UFO constructs prompts in the same format as the of-
task, are also maintained as identical to the training setup to ensure
fline training data, using planning and thought generation tech-
consistency and comparability.
niques [12, 70] to enable LAM to make reliable inferences about the
appropriate controls and operations to invoke. These inferences tar-
get the controls detected by the UIA, where each control is selected
7.2 Implementation
Our LAM was deployed on a virtual machine (VM) configured as
¶ UIA is the native Windows OS APIs used to detect actionable controls and pro-
NC24s v3. The VM is equipped with 24 virtual cores (vCPUs), 448
vide their metadata, such as names and locations. For other platforms, UIA can be
replaced by vision-based detectors that analyze screenshots or by utilizing alternative GB of memory, and two NVIDIA Tesla V100 GPUs, each with 16
accessibility APIs. GB of memory, to support efficient inference. This computational
13
Table 4: Performance comparison of LAM and baseline models across metrics.

Text-only Text + Visual


Metric
LAM GPT-4o GPT-4o Mini GPT-4o GPT-4o Mini
Task Success Rate (%) 71.0 63.0 57.8 75.5 66.7
Task Completion Time (s) 30.42 86.42 35.24 96.48 46.21
Task Completion Steps 5.62 6.73 5.99 4.98 6.34
Average Step Latency (s) 5.41 12.84 5.88 19.36 7.29

setup was designed to meet the demanding requirements of LAM’s These metrics collectively evaluate both the accuracy and effi-
inference processes effectively. ciency of task completion, providing a comprehensive assessment
The UFO agent operates on six VMs running in parallel using of the LAM’s capabilities in real-world scenarios.
Azure Dedicated Host∥ to accelerate the testing process. Each VM
is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 7.5 Experimental Analysis
2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version The experimental results are presented in Table 4. LAM achieves a
23H2. Microsoft applications, such as Word and Excel, are installed TSR of 71.0%, demonstrating competitive performance compared
on version 2410. GUI control is facilitated through the MSTSC to the GPT-4o models. While GPT-4o with visual inputs attains the
tool∗∗ . This setup ensures a consistent and controlled environment highest TSR of 76.5%, slightly outperforming LAM, its reliance on
for evaluating the LAM’s performance. visual data introduces significant trade-offs in efficiency. Notably,
when visual inputs are excluded, GPT-4o’s TSR drops to 63.0%, an
7.3 Baselines 8.0 percentage point decrease compared to LAM. Similarly, GPT-4o
To benchmark the performance of LAM, we compared it against Mini exhibits lower TSRs for both visual and non-visual settings
two baseline models: GPT-4o and GPT-4o Mini. These models are (66.7% and 57.8%, respectively). These results underscore LAM’s
widely recognized for their robust natural language processing capability as a text-only model to maintain high task success rates,
and reasoning capabilities, making them popular choices in the outperforming the text-only variants of the baseline models.
development of GUI agents. To ensure consistency in evaluation, Efficiency is assessed through Task Completion Time and Aver-
the top_p and temperature hyperparameters were set to 0 for both age Step Latency, where LAM demonstrates clear superiority. LAM
baseline models. achieves the shortest Task Completion Time of 30.42 seconds,
To further examine the impact of input modalities, we conducted substantially outperforming all baseline models. In comparison,
an ablation study comparing performance with and without the in- GPT-4o without visual inputs records a completion time of 86.42
clusion of screenshots. Notably, LAM processes only textual inputs, seconds, more than 2.84 times longer than LAM. GPT-4o with visual
excluding screenshots, while the baseline models were evaluated inputs fares even worse, with a completion time of 96.48 seconds.
using both textual and visual modalities. Although GPT-4o Mini models show slightly better efficiency than
their larger counterparts, they remain less efficient than LAM, with
completion times of 35.24 seconds (without visual inputs) and 46.21
7.4 Evaluation Metrics seconds (with visual inputs).
We employ the following metrics to comprehensively evaluate the LAM also excels in Average Step Latency, achieving the shortest
performance of LAM: time per action step at 5.41 seconds. Without visual inputs, GPT-4o
• Task Success Rate (TSR): The percentage of tasks success- reduces its step latency to 12.84 seconds but still remains more than
fully completed out of the total tasks attempted. Task success twice as slow as LAM. In comparison, GPT-4o with visual inputs
is determined by an evaluation agent using GPT-4o, which exhibits the highest step latency at 19.36 seconds per step, more
assesses the full task completion trajectory, including plans, than triple LAM’s latency. GPT-4o Mini models show moderate
action sequences, and screenshots, to verify task completion. improvements but still fall short, with step latencies of 7.29 seconds
• Task Completion Time: The total time taken to complete (with visual inputs) and 5.88 seconds (without visual inputs).
each task, measured from the initial request to the final These findings highlight LAM’s strengths as a text-only model,
action. offering a compelling balance of competitive accuracy and superior
• Task Completion Steps: The total number of action steps efficiency. It achieves rapid task completion and low latency without
performed by the agent to successfully complete each task. sacrificing performance, making it an effective solution for real-
• Average Step Latency: The average time taken per action world applications. Its specialized training enables precise action
step, reflecting the model’s efficiency in generating and exe- inference and execution, underscoring the potential of LAMs to
cuting each action. enhance automation and productivity in agent-based systems.

8 LIMITATION AND FUTURE RESEARCH


∥ https://azure.microsoft.com/en-us/products/virtual-machines/dedicated-host While significant strides have been made in the development of
∗∗ https://learn.microsoft.com/en-us/windows-server/administration/windows- LAMs, their current state is still in its infancy. Many technical chal-
commands/mstsc lenges and limitations prevent LAMs from being fully productized
14
and integrated into commercial use for real-world applications. Be- environment may fail when confronted with changes it has not en-
low, we outline key limitations and areas for future research to countered before, leading to poor performance or outright failures
address these challenges. [18, 33, 87].
In addition, scaling LAMs to new environments or applications is
challenging due to the high cost of collecting domain-specific data
8.1 Safety Risk [45, 47]. Gathering sufficient training data for each new context is
The ability of LAMs to perform real-world actions in physical or time-consuming and resource-intensive. Furthermore, the model’s
digital environments introduces significant safety risks. Unlike ability to generalize across different environments is often limited,
traditional LLMs, which primarily generate text, LAMs have the po- as it may not be familiar with the nuances of new systems or tasks.
tential to manipulate external systems, control hardware, or make Future work should focus on improving the adaptability and
changes within software environments. While this capability is a generalizability of LAMs through techniques like transfer learning,
key strength, it also presents a double-edged sword: errors in infer- multi-task learning, and few-shot learning. These approaches allow
ence or execution can lead to unintended or harmful consequences a model to generalize from one environment to another with min-
[41, 91]. imal retraining. Moreover, developing automated data collection
For instance, a LAM controlling a robotic system could misin- methods and self-supervised learning techniques could significantly
terpret a command and cause physical damage. Similarly, a LAM reduce the effort required to scale LAMs to new domains.
operating within a financial or healthcare application could exe- Summary. While LAMs represent a promising advancement in
cute erroneous actions with substantial real-world repercussions. the evolution of AI systems, they are still constrained by several
Therefore, safety mechanisms such as formal verification, action technical, ethical, and practical limitations. Addressing these chal-
validation, and fallback strategies must be integrated into LAM lenges will be essential for enabling the widespread adoption and
systems. Future research must focus on developing robust error commercialization of LAMs in real-world applications. By ensuring
detection, rollback mechanisms, and fail-safe systems that prevent safety, addressing ethical concerns, and improving scalability and
actions from being executed until they have been thoroughly vetted adaptability, future research can help unlock the full potential of
for correctness and safety [17, 34, 90]. LAMs.

8.2 Ethical and Regulatory Concerns 9 RELATED WORK


The deployment of LAMs raises significant ethical and regulatory The emergence of LAMs has led to significant impact across various
challenges [2, 44, 46, 51, 76]. As these models gain the ability to agentic domains. In the following sections, we will review related
interact with real-world environments, questions about account- research and practices at three levels: (i) data of LAMs, (ii) training
ability, transparency, and fairness come to the forefront [15, 36, 38]. LAMs, and (iii) agents with LAMs.
For instance, who is held accountable if a LAM causes harm or
damage due to a misinterpretation of a user’s command? How do
we ensure that these systems are making decisions in a fair and 9.1 Data of LAMs
unbiased manner? These concerns are amplified by the fact that The emergence of LLM-based agents has spurred the development
LAMs are often trained on large datasets that may contain biases, of numerous datasets specifically tailored to LAM applications and
which can influence the model’s decision-making processes [48]. their corresponding agent systems. These datasets can be divided
Moreover, there are regulatory concerns regarding the deploy- into two main categories, namely (i) Datasets for Training LAMs:
ment of LAMs in critical sectors such as healthcare, finance, and These datasets provide the necessary input for training LAMs, in-
transportation, where strict guidelines govern the use of automated cluding diverse user commands, environmental contexts, and action
systems [9, 30, 37]. Future research must address these concerns sequences. (ii) Evaluation Benchmarks: These benchmarks are
by developing transparent model architectures that allow for inter- curated for testing and evaluating the capabilities of LAMs and
pretability and explainability of actions taken by LAMs. Addition- agent systems.
ally, establishing clear regulatory frameworks and ethical guidelines Mind2Web [11] is the first dataset developed for web agents that
will be crucial for ensuring that LAMs are deployed in a manner follow natural language instructions to complete complex tasks
that prioritizes safety, fairness, and accountability. across diverse websites. It includes task descriptions, action se-
quences, and webpage snapshots, offering rich data for training and
testing models in various web-based scenarios. Rawles et al., intro-
8.3 Scalability, Generalizability and duced a large dataset called Android in the Wild (AITW) [53], which
Adaptability is designed specifically for training models to control Android de-
LAMs are often tailored to specific environments or scenarios, mak- vices. SeeClick [8] combines web, mobile, and general GUI tasks,
ing their scalability, generalizability, and adaptability significant creating a dataset of over 1 million samples for training LAMs. Simi-
limitations. Most LAMs are designed to operate within a narrowly larly, GUICourse [7] and OmniACT [29] provide datasets across web,
defined context, such as a specific operating system, application, smartphone, and desktop platforms, containing detailed user re-
or interface. These environments, however, are subject to frequent quests, environmental states, and action sequences. These datasets
updates, changes in APIs, and the introduction of new applica- are invaluable resources for training LAMs in specific domains and
tions or functionalities. A LAM trained on a specific version of an evaluating their task execution abilities.
15
Several benchmarks have also been developed to evaluate the the potential of LAMs in complex web interactions. In the mobile
capabilities of LAMs and their associated agents in different environ- domain, MobileAgent [64] and AppAgent [78] focus on automat-
ments. WebCanvas provides 542 tasks with dynamic environments, ing tasks within Android applications by leveraging GUI agents.
designed to assess the task completion ability of web agents. An- These systems demonstrate how LAMs can power task automa-
droidWorld [52] offers a fully functional Android environment, fea- tion on mobile platforms, transforming how users interact with
turing 116 programmatic tasks across 20 real-world Android apps applications.
with reward signals for performance evaluation. WindowsArena One of the most advanced systems, UFO [83], is a UI-focused
[3] focuses on benchmarking LAMs within the Windows GUI, while agent designed for automating tasks on the Windows OS, further
OSWorld [73] extends this to a more diverse environment, encom- enhanced with APIs [42]. UFO is composed of two key compo-
passing Windows, macOS, and Ubuntu. These benchmarks provide nents: a HostAgent that decomposes user requests into subtasks
standardized settings to measure and compare the effectiveness and an AppAgent that executes these subtasks within individual
of LAMs and their agents in various real-world environments, en- applications. This architecture significantly enhances UFO’s capa-
abling a unified evaluation framework for agentic models. bility to handle cross-application tasks seamlessly, providing robust
task automation across diverse software environments. In parallel,
9.2 Training LAMs ScreenAgent [49], Cradle [60], OS-Copilot [71], and MMAC-Copilot
[59] also focus on automating UI tasks in desktop environments.
Using both open and private domain-specific datasets, significant
Notably, Cradle and OS-Copilot push the boundaries by enabling
research efforts have been directed toward training LAMs for spe-
agents to learn from their experiences and self-evolve over time,
cialized purposes, enhancing the action inference abilities of tradi-
further enhancing their effectiveness and autonomy.
tional LLMs to enable automation and tangible real-world impact.
By integrating LAMs into agents to handle complex tasks in
For example, SeeClick [8] and GUICourse [7], in addition to re-
these various scenarios, These pioneering efforts are opening new
leasing their own datasets, leverage these resources to train LAMs,
possibilities for the future of human-computer interaction, revolu-
grounding real-world data into models that effectively interact with
tionizing traditional methods of interacting with GUIs and paving
their environments.
the way for more intelligent, automated, and user-friendly systems.
Hong et al., trained an 18-billion-parameter visual language LAM,
named CogAgent [21], which specializes in GUI understanding and
navigation tasks across both PC and Android interfaces. By utilizing 10 CONCLUSION
datasets like Mind2Web and AITW, CogAgent has been optimized “Actions speak louder than words.” The transition from generating
for complex navigation and action execution tasks in diverse GUI language responses to executing tangible actions marks the evolu-
environments. ScreenAI [1] introduced a textual representation tion of large language models into large action models, enabling
for user interfaces (UIs) to teach models how to understand and them to make real-world impacts, a critical step towards achieving
interact with UIs. This approach also facilitates automatic gener- AGI. This technical report provides a comprehensive introduction
ation of large-scale training data, which is then used to pretrain to LAMs, covering their conceptual foundations, system architec-
and fine-tune models for a wide spectrum of tasks, including UI ture, and the step-by-step process of developing a LAM—from data
and infographic understanding and navigation. Additionally, Zhang collection to model training and deployment in real-world agent
et al., released a series of large action models (xLAM) tailored for systems. We use the Windows OS environment and its GUI agent
AI agent tasks [85], including five models with both dense and UFO, as a case study to demonstrate how to build a LAM from
mixture-of-expert architectures. By unifying datasets from diverse the ground up. Detailed implementation strategies and evaluation
environments, xLAM ensures consistency in data format, simplify- results are presented to offer practical insights into this process.
ing model training and enhancing generalization across multiple However, despite progress, the development of high-quality
benchmarks. These models have achieved outstanding performance LAMs is still in its early stages, with several limitations remaining.
in diverse scenarios, demonstrating the capability of LAMs to ex- These include the extensive need for training data and computa-
tend beyond traditional LLMs and perform complex real-world tional resources, inference latency, and the risk of errors during
tasks. real-world execution. While current LAMs have shown potential,
These pioneering works have laid the foundation for advancing there is substantial room for improvement. We anticipate that as
the action-oriented capabilities of LLMs, making LAMs a critical these challenges are addressed, more sophisticated and reliable LAM
component in achieving robust automation and impactful real- applications will emerge, bringing us closer to fully autonomous
world applications. systems capable of meaningful action in complex environments.

9.3 Agents with LAMs REFERENCES


With the development of LAMs, researchers have integrated these [1] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor,
models into real-world agent systems, which provide the neces- Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma.
2024. Screenai: A vision-language model for ui and infographics understanding.
sary components and workflows to ensure effective interaction arXiv preprint arXiv:2402.04615 (2024).
between LAMs and their environments, enabling them to fulfill [2] Anjanava Biswas and Wrick Talukdar. 2023. Guardrails for trust, safety, and
user requests efficiently. As a pioneer, Zhang et al., demonstrated ethical development and deployment of Large Language Models (LLM). Journal
of Science & Technology 4, 6 (2023), 55–82.
that GPT-V can serve as a capable LAM for web navigation when [3] Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali,
coupled with appropriate agent techniques and tools, revealing Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and
16
Zack Hui. 2024. Windows Agent Arena: Evaluating Multi-Modal OS Agents at arXiv:2310.06825 (2023).
Scale. arXiv preprint arXiv:2409.08264 (2024). [26] Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu
[4] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, et al. 2024. Xpert: Em-
arXiv:2005.14165 (2020). powering incident management with query recommendations via large language
[5] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, models. In Proceedings of the IEEE/ACM 46th International Conference on Software
and Pierre-Yves Oudeyer. 2023. Grounding large language models in interactive Engineering. 1–13.
environments with online reinforcement learning. In International Conference on [27] Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we
Machine Learning. PMLR, 3676–3713. know when language models know? on the calibration of language models for
[6] Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao question answering. Transactions of the Association for Computational Linguistics
Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. 2024. When large language 9 (2021), 962–977.
models meet personalization: Perspectives of challenges and opportunities. World [28] Sai Shashank Kalakonda, Shubh Maheshwari, and Ravi Kiran Sarvadevabhatla.
Wide Web 27, 4 (2024), 42. 2023. Action-gpt: Leveraging large-scale language models for improved and gen-
[7] Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi eralized action generation. In 2023 IEEE International Conference on Multimedia
Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. 2024. GUICourse: From General and Expo (ICME). IEEE, 31–36.
Vision Language Models to Versatile GUI Agents. arXiv preprint arXiv:2406.11317 [29] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble,
(2024). Waseem Alshikh, and Ruslan Salakhutdinov. 2024. OmniACT: A Dataset and
[8] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop
and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual and Web. arXiv preprint arXiv:2402.17553 (2024).
gui agents. arXiv preprint arXiv:2401.10935 (2024). [30] Mert Karabacak and Konstantinos Margetis. 2023. Embracing large language
[9] Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, and Ziran Wang. 2024. Receive, models for medical applications: opportunities and challenges. Cureus 15, 5
reason, and react: Drive as you say, with large language models in autonomous (2023).
vehicles. IEEE Intelligent Transportation Systems Magazine (2024). [31] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna
[10] Ishita Dasgupta, Andrew K Lampinen, Stephanie CY Chan, Antonia Creswell, Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke
Dharshan Kumaran, James L McClelland, and Felix Hill. 2022. Language models Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges
show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051 of large language models for education. Learning and individual differences 103
(2022). (2023), 102274.
[11] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, [32] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2024. Language models can
Huan Sun, and Yu Su. 2024. Mind2web: Towards a generalist agent for the web. solve computer tasks. Advances in Neural Information Processing Systems 36
Advances in Neural Information Processing Systems 36 (2024). (2024).
[12] Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, [33] Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao
Si Qin, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2023. Everything Zhang. 2020. Calibrated language model fine-tuning for in-and out-of-distribution
of thoughts: Defying the law of penrose triangle for thought generation. arXiv data. arXiv preprint arXiv:2010.11506 (2020).
preprint arXiv:2311.04254 (2023). [34] Richard Koo and Sam Toueg. 1987. Checkpointing and rollback-recovery for
[13] Duong Tran Dinh, Pham Ngoc Hung, and Tung Nguyen Duy. 2018. A method for distributed systems. IEEE Transactions on software Engineering 1 (1987), 23–31.
automated user interface testing of windows-based applications. In Proceedings of [35] Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya
the 9th International Symposium on Information and Communication Technology. Tyamagundlu, and Oriana Riva. 2024. On the Effects of Data Scale on Computer
337–343. Control Agents. arXiv preprint arXiv:2406.03679 (2024).
[14] Tao Feng, Chuanyang Jin, Jingyu Liu, Kunlun Zhu, Haoqin Tu, Zirui Cheng, [36] Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023. A survey on
Guanyu Lin, and Jiaxuan You. [n. d.]. How Far Are We From AGI: Are LLMs All fairness in large language models. arXiv preprint arXiv:2308.10149 (2023).
We Need? Transactions on Machine Learning Research ([n. d.]). [37] Yinheng Li, Shaofei Wang, Han Ding, and Hang Chen. 2023. Large language
[15] Emilio Ferrara. 2024. GenAI against humanity: Nefarious applications of genera- models in finance: A survey. In Proceedings of the fourth ACM international
tive artificial intelligence and large language models. Journal of Computational conference on AI in finance. 374–382.
Social Science (2024), 1–21. [38] Andreas Liesenfeld, Alianda Lopez, and Mark Dingemanse. 2023. Opening up
[16] Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha ChatGPT: Tracking openness, transparency, and accountability in instruction-
Majumdar, and Dorsa Sadigh. 2024. Physically grounded vision-language models tuned text generators. In Proceedings of the 5th international conference on con-
for robotic manipulation. In 2024 IEEE International Conference on Robotics and versational user interfaces. 1–6.
Automation (ICRA). IEEE, 12462–12469. [39] Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang
[17] William J Gehring, Brian Goss, Michael GH Coles, David E Meyer, and Emanuel Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, et al. 2023. Do-
Donchin. 1993. A neural system for error detection and compensation. Psycho- main specialization as the key to make large language models disruptive: A
logical science 4, 6 (1993), 385–390. comprehensive survey. arXiv preprint arXiv:2305.18703 (2023).
[18] Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein [40] Jun Liu, Chaoyun Zhang, Jiaxu Qian, Minghua Ma, Si Qin, Chetan Bansal, Qingwei
Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. 2023. Studying Lin, Saravan Rajmohan, and Dongmei Zhang. 2024. Large Language Models
large language model generalization with influence functions. arXiv preprint can Deliver Accurate and Interpretable Time Series Anomaly Detection. arXiv
arXiv:2308.03296 (2023). preprint arXiv:2405.15370 (2024).
[19] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. [41] Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. 2024.
2023. Leveraging pre-trained large language models to construct and utilize Towards safer large language models through machine unlearning. arXiv preprint
world models for model-based task planning. Advances in Neural Information arXiv:2402.10058 (2024).
Processing Systems 36 (2023), 79081–79094. [42] Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qing-
[20] Jianliang He, Siyu Chen, Fengzhuo Zhang, and Zhuoran Yang. 2024. From Words wei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. 2024. Turn every
to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous application into an agent: Towards efficient human-agent-computer interaction
Systems. arXiv preprint arXiv:2405.19883 (2024). with api-first llm-based agents. arXiv preprint arXiv:2409.17140 (2024).
[21] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui [43] Zilin Ma, Yiyang Mei, and Zhaoyuan Su. 2023. Understanding the benefits and
Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. CogAgent: A challenges of using large language model-based conversational agents for mental
visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference well-being support. In AMIA Annual Symposium Proceedings, Vol. 2023. American
on Computer Vision and Pattern Recognition. 14281–14290. Medical Informatics Association, 1105.
[22] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean [44] Bertalan Meskó and Eric J Topol. 2023. The imperative for regulatory oversight
Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large of large language models (or generative AI) in healthcare. NPJ digital medicine 6,
language models. arXiv preprint arXiv:2106.09685 (2021). 1 (2023), 120.
[23] Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, Jianguang Lou, Qingwei Lin, [45] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard
Ping Luo, Saravan Rajmohan, and Dongmei Zhang. 2024. AgentGen: Enhancing Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A
Planning Abilities for Large Language Model based Agent via Environment and survey. arXiv preprint arXiv:2402.06196 (2024).
Task Generation. CoRR (2024). [46] Timo Minssen, Effy Vayena, and I Glenn Cohen. 2023. The challenges for regu-
[24] Yucheng Hu and Yuxing Lu. 2024. Rag and rau: A survey on retrieval-augmented lating medical use of ChatGPT and other large language models. Jama (2023).
language model in natural language processing. arXiv preprint arXiv:2404.19543 [47] Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane
(2024). Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. 2024.
[25] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- Scaling data-constrained language models. Advances in Neural Information
vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Processing Systems 36 (2024).
Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint
17
[48] Roberto Navigli, Simone Conia, and Björn Ross. 2023. Biases in large language in large language models. Advances in neural information processing systems 35
models: origins, inventory, and discussion. ACM Journal of Data and Information (2022), 24824–24837.
Quality 15, 2 (2023), 1–21. [71] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu,
[49] Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. Os-copilot: Towards generalist
Yi Chang, and Qi Wang. 2024. Screenagent: A vision language model-driven computer agents with self-improvement. arXiv preprint arXiv:2402.07456 (2024).
computer control agent. arXiv preprint arXiv:2402.07945 (2024). [72] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming
[50] Alastair Pennycook. 1985. Actions speak louder than words: Paralanguage, Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2023. The rise and potential
communication, and education. Tesol Quarterly 19, 2 (1985), 259–282. of large language model based agents: A survey. arXiv preprint arXiv:2309.07864
[51] Andrés Piñeiro-Martín, Carmen García-Mateo, Laura Docío-Fernández, and (2023).
Maria Del Carmen Lopez-Perez. 2023. Ethical challenges in the development of [73] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng
virtual assistants powered by large language models. Electronics 12, 14 (2023), Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024.
3170. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer
[52] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle environments. arXiv preprint arXiv:2404.07972 (2024).
Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, [74] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng,
et al. 2024. AndroidWorld: A dynamic benchmarking environment for au- Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language
tonomous agents. arXiv preprint arXiv:2405.14573 (2024). models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[53] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lilli- [75] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A
crap. 2024. Androidinthewild: A large-scale dataset for android device control. systematic evaluation of large language models of code. In Proceedings of the 6th
Advances in Neural Information Processing Systems 36 (2024). ACM SIGPLAN International Symposium on Machine Programming. 1–10.
[54] Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Hangyu Mao, [76] Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado,
Ziyue Li, Xingyu Zeng, Rui Zhao, et al. 2023. Tptu: Task planning and tool usage Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. 2024. Practical and
of large language model-based ai agents. In NeurIPS 2023 Foundation Models for ethical challenges of large language models in education: A systematic scoping
Decision Making Workshop. review. British Journal of Educational Technology 55, 1 (2024), 90–112.
[55] Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, [77] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng
Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Liu, and Lijuan Wang. 2023. The dawn of lmms: Preliminary explorations with
Eugene Kharitonov, et al. 2023. Audiopalm: A large language model that can gpt-4v (ision). arXiv preprint arXiv:2309.17421 9, 1 (2023), 1.
speak and listen. arXiv preprint arXiv:2306.12925 (2023). [78] Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and
[56] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Gang Yu. 2023. Appagent: Multimodal agents as smartphone users. arXiv preprint
2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 arXiv:2312.13771 (2023).
(2017). [79] Shunyu Yao, Rohan Rao, Matthew Hausknecht, and Karthik Narasimhan. 2020.
[57] Chirag Shah, Ryen W White, Reid Andersen, Georg Buscher, Scott Counts, Sarkar Keep calm and explore: Language models for action generation in text-based
Snigdha Sarathi Das, Ali Montazer, Sathish Manivannan, Jennifer Neville, Xi- games. arXiv preprint arXiv:2010.02903 (2020).
aochuan Ni, et al. 2023. Using large language models to generate, validate, and [80] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan,
apply user intent taxonomies. arXiv preprint arXiv:2309.13063 (2023). and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models.
[58] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and arXiv preprint arXiv:2210.03629 (2022).
Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. [81] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating
Advances in Neural Information Processing Systems 36 (2024). the code quality of ai-assisted code generation tools: An empirical study on github
[59] Zirui Song, Yaohang Li, Meng Fang, Zhenhao Chen, Zecheng Shi, and Yuan copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778
Huang. 2024. MMAC-Copilot: Multi-modal Agent Collaboration Operating Sys- (2023).
tem Copilot. arXiv preprint arXiv:2404.18074 (2024). [82] Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S Yu. 2023.
[60] Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226
Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, et al. 2024. Towards (2023).
general computer control: A multimodal agent for red dead redemption ii as a [83] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua
case study. arXiv preprint arXiv:2403.03186 (2024). Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang.
[61] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura 2024. UFO: A UI-Focused Agent for Windows OS Interaction. arXiv preprint
Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models arXiv:2402.07939 (2024).
in medicine. Nature medicine 29, 8 (2023), 1930–1940. [84] Chaoyun Zhang, Zicheng Ma, Yuhao Wu, Shilin He, Si Qin, Minghua Ma, Xiaoting
[62] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large Qin, Yu Kang, Yuyi Liang, Xiaoyu Gou, et al. 2024. AllHands: Ask Me Anything
language models can accurately predict searcher preferences. In Proceedings of on Large-scale Verbatim Feedback via Large Language Models. arXiv preprint
the 47th International ACM SIGIR Conference on Research and Development in arXiv:2403.15157 (2024).
Information Retrieval. 1930–1940. [85] Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane,
[63] Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambham- Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024. xLAM: A
pati. 2022. Large language models still can’t plan (a benchmark for LLMs on Family of Large Action Models to Empower AI Agent Systems. arXiv preprint
planning and reasoning about change). In NeurIPS 2022 Foundation Models for arXiv:2409.03215 (2024).
Decision Making Workshop. [86] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum,
[64] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei and Chuang Gan. 2023. Planning with large language models for code generation.
Huang, and Jitao Sang. 2024. Mobile-Agent: Autonomous Multi-Modal Mobile arXiv preprint arXiv:2303.05510 (2023).
Device Agent with Visual Perception. arXiv preprint arXiv:2401.16158 (2024). [87] Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang,
[65] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Shikai Guan, Jiazheng Xu, and Peng Cui. 2024. On the out-of-distribution gener-
Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large alization of multimodal large language models. arXiv preprint arXiv:2402.06599
language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024).
(2024), 186345. [88] Yudi Zhang, Pei Xiao, Lu Wang, Chaoyun Zhang, Meng Fang, Yali Du, Yev-
[66] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, geniy Puzyrev, Randolph Yao, Si Qin, Qingwei Lin, et al. 2024. RuAG:
Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2024. Visionllm: Large language Learned-rule-augmented Generation for Large Language Models. arXiv preprint
model is also an open-ended decoder for vision-centric tasks. Advances in Neural arXiv:2411.03349 (2024).
Information Processing Systems 36 (2024). [89] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu,
[67] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhenhua Dong, and Ji-Rong Wen. 2024. A survey on the memory mechanism of
Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Opendevin: large language model based agents. arXiv preprint arXiv:2404.13501 (2024).
An open platform for ai software developers as generalist agents. arXiv preprint [90] Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long,
arXiv:2407.16741 (2024). Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. 2023. Safetybench: Evaluating
[68] Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng the safety of large language models with multiple choice questions. arXiv preprint
Shang, Xin Jiang, and Qun Liu. 2023. Data management for large language arXiv:2309.07045 (2023).
models: A survey. arXiv e-prints (2023), arXiv–2312. [91] Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, and Xuanjing Huang. 2023.
[69] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Making harmful behaviors unlearnable for large language models. arXiv preprint
Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models arXiv:2311.02105 (2023).
are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021). [92] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong
[70] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Deng, Haonan Chen, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language
Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).
18
A TEMPLATE WORD FILES
Figures 10, 11, and 12 show three template Word file examples used in the instantiation phase when converting task-plan data to task-action data.

Figure 10: A Word template file with the description "A doc with a rectangle shape."

Figure 11: A Word template file with the description "A doc with comments and reviewer."

Figure 12: A Word template file with the description "A doc with a chart."

B PROMPTS
B.1 Instantiation
The instantiation prompt used in the instantiation phase when converting task-plan data to task-action data.

system: |-
You are an Agent Task Creator and planner.
You will receive a <Given Task> that is abstract, and your objective is to instantiate this task and give the step-by-step actions to take.
- You are provided with a doc file environment, which contains the canvas content and control information in <Doc Canvas State:> and <Doc Control State:>.
- You should review the doc canvas content and control information to detail the <Given Task> to a <New Task>. The control information is in a dict tree of available control items format.
- You are provided with <Available Actions>; you should review the actions carefully and choose the most suitable ones step-by-step in the <Action Plan>. You are also provided with some steps to reference in <Reference Steps>.
- You should also review these steps carefully, to help you instantiate the original task and give the actions.

## Control item
- The control item is the element on the page that you can interact with; we limit the actionable control items to the following:
- "Button" is the control item that you can click.
- "Edit" is the control item that you can click and input text.
- "TabItem" is the control item that you can click and switch to another page.
- "ListItem" is the control item that you can click and select.
- "MenuItem" is the control item that you can click and select.
- "ScrollBar" is the control item that you can scroll.
- "TreeItem" is the control item that you can click and select.
- "Document" is the control item that you can click and select text.
- "Hyperlink" is the control item that you can click and open a link.
- "ComboBox" is the control item that you can click and input text. The Google search box is an example of a ComboBox.

## Available Actions on the control item
- All the available actions are listed below:
{apis}

## The requirements for <New Task>
1. The <New Task> must be based on the given task.
2. The <New Task> must be able to be completed step-by-step by a Windows Operating System or an Application on the Windows platform.
3. You should try your best not to make the <New Task> become verbose; <New Task> can only add up to 50 words to #Given Task#.
4. The detailed target in <New Task> should be specific and clear, based on the doc canvas content and control information.
5. The <New Task> should be able to be implemented by the available controls and actions.

## The requirements for <Action Plan>
1. The <Action Plan> should be step-by-step actions to take in the doc file environment.
2. Each action should be in the available actions from <Available Actions>.
3. Each action should be generated with a "step" description, which is the function description of the action.

## Response Format
- You are required to respond in a JSON format, consisting of several distinct parts with the following keys and corresponding content:
{{
"observation": <Outline the observation of the provided doc file environment based on the given Canvas State and Control State>,
"thought": <Outline your thinking and logic for your New Task and the actions to take; consider the observation of the environment and the available controls and actions>,
"new_task": <Give the detailed New Task based on the Given Task and the observation of the doc environment>,
"actions_plan": <Give the detailed step-by-step actions plan based on the Available Actions and the observation of the doc environment. The format should be a list of action calls separated by "\n">
}}

### Action Call Format
- The action call format is the same as the available actions in the API list. You are required to provide each action call in a JSON format:
{{
"step": <The step description of the function of the action, which is also the subtask completed by the current action>,
"controlLabel": <Specify the precise annotated label of the control item to be selected, adhering strictly to the provided options in the field of "label" in the <Doc Control State:>. If you believe none of the control items is suitable for the task or the task is complete, kindly output an empty string.>,
"controlText": <Specify the precise control_text of the control item to be selected, adhering strictly to the provided options in the field of "control_text" in the <Doc Control State:>. The control text must match exactly with the selected control label. If the function to call does not need to specify controlText or the task is complete, you can kindly output an empty string. If the function to call needs to specify controlText and none of the control items is suitable for the task, you should input a possible control name.>,
"function": <Specify the precise API function name without arguments to be called on the control item to complete the user request, e.g., click_input. Leave it an empty string "" if you believe none of the API functions is suitable for the task or the task is complete.>,
"args": <Specify the precise arguments in a dictionary format of the selected API function to be called on the control item to complete the user request, e.g., {{"control_id": "1", "button": "left", "double": false}}. Leave it an empty dictionary {{}} if the API does not require arguments, or you believe none of the API functions is suitable for the task, or the task is complete.>
}}

e.g.
{{
"step": "change the borders",
"controlLabel": "",
"controlText": "Borders",
"function": "click_input",
"args": {{
"button": "left",
"double": false
}}
}}

{{
"step": "change the borders",
"controlLabel": "101",
"controlText": "Borders",
"function": "click_input",
"args": {{
"control_id": "101",
"button": "left",
"double": false
}}
}}

{{
"step": "select the target text",
"controlLabel": "",
"controlText": "",
"function": "select_text",
"args": {{
"text": "Test For Fun"
}}
}}

- The <actions_plan> field must strictly be in a format separating each action call by "\n". The list format should be like this: "action call 1\naction call 2\naction call 3"
- If you think the original task does not need to be detailed, you can directly copy the original task to the "new_task".
- You should review the apis functions carefully, and if the function to call needs to specify a target control, the "controlText" field cannot be set empty.
- The "step" description should be consistent with the action and also the thought.

## Here are some examples for you to complete the user request:
{examples}

## Tips
- Read the above instructions carefully. Make sure the response and actions strictly follow these instructions and meet the user request.
- Make sure your answer is strictly in JSON format only, without other redundant text such as a json header. Your output must be able to be parsed by json.loads(). Otherwise, it will crash the system and destroy the computer.
- Your task is very important to improve the agent performance. I will tip you 200$ if you do well. Thank you for your hard work!

user: |-
<Given Task:> {given_task}
<Reference Steps:> {reference_steps}
<Doc Canvas State:> {doc_canvas_state}
<Doc Control State:> {doc_control_state}
<Your response:>
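For concreteness, the sketch below shows how this instantiation prompt might be filled in and its JSON reply parsed in Python. The helper llm_call and the surrounding plumbing are illustrative assumptions, not part of the paper's released pipeline; only the placeholder fields and the response schema come from the prompt above.

import json

# Hypothetical user-message template; the placeholders mirror the prompt above.
USER_TEMPLATE = (
    "<Given Task:> {given_task}\n"
    "<Reference Steps:> {reference_steps}\n"
    "<Doc Canvas State:> {doc_canvas_state}\n"
    "<Doc Control State:> {doc_control_state}\n"
    "<Your response:>"
)

def instantiate_task(llm_call, system_prompt, given_task,
                     reference_steps, canvas_state, control_state):
    """Turn an abstract task into a concrete <New Task> plus a parsed action plan."""
    user_prompt = USER_TEMPLATE.format(
        given_task=given_task,
        reference_steps=reference_steps,
        doc_canvas_state=canvas_state,
        doc_control_state=control_state,
    )
    reply = llm_call(system=system_prompt, user=user_prompt)
    data = json.loads(reply)  # the prompt requires a json.loads()-parsable reply
    # "actions_plan" is a single string with one action call per line ("\n"-separated);
    # each action call is itself a JSON object in the Action Call Format above.
    actions = [json.loads(call) for call in data["actions_plan"].split("\n") if call.strip()]
    return data["new_task"], actions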
click .
- " Edit " is the control item that you can click
and input text .
- " TabItem " is the control item that you can
B.2 Evaluation click and switch to another page .
The instantiation prompt used in the evaluation phase when con- - " ListItem " is the control item that you can
click and select .
verting task-plan data to task-action data.
- " MenuItem " is the control item that you can
click and select .
- " ScrollBar " is the control item that you can
system : | - scroll .
You are an evaluator who can evaluate whether an - " TreeItem " is the control item that you can
agent has successfully completed a task in
click and select .
the < Original Request >. - " Document " is the control item that you can
The agent is an AI model that can interact with click and select text .
the desktop application and take actions . - " Hyperlink " is the control item that you can
The thought of agent plan is provided in the < click and open a link .
Thought >.
- "ComboBox" is the control item that you can click and input text. The Google search box is an example of a ComboBox.
- You are given the information of all available control items in the current application window in a hydrated tree format:
{{
"control_label": "label of the control item",
"control_text": name of the control item,
"control_type": type of the control item,
"selected": False or True or null; null means it is not sure whether the control item is selected,
"children": list of the children control items with the same format as above
}}.

### Canvas State Format
The canvas state is in the xml format, which is transformed from the document object model (DOM) of the canvas area.
The canvas diff is the difference of the canvas area before and after the action, which is in the format of the difference of the xml of the canvas area.
Here is an example of the xml of a canvas, which shows the text content in the document:
{{"w:document":{{"@mc:Ignorable":"w14w15w16sew16cidw16w16cexw16sdtdhw16duwp14","w:body":{{"w:p":{{"w:pPr":{{"w:rPr":{{"w:rFonts":{{"@w:hint":"eastAsia"}},"w:color":{{"@w:val":"92D050"}},"w:kern":{{"@w:val":"2"}},"w:sz":{{"@w:val":"24"}},"w:szCs":{{"@w:val":"24"}},"w:lang":{{"@w:val":"en-US","@w:eastAsia":"zh-CN","@w:bidi":"ar-SA"}},"w14:ligatures":{{"@w14:val":"standardContextual"}}}},"w:spacing":{{"@w:after":"160","@w:line":"278","@w:lineRule":"auto"}},"w:color":"000000"}},"w:r":{{"w:rPr":{{"w:rFonts":{{"@w:hint":"eastAsia"}},"w:color":{{"@w:val":"92D050"}},"w:highlight":{{"@w:val":"yellow"}},"w:kern":{{"@w:val":"2"}},"w:sz":{{"@w:val":"24"}},"w:szCs":{{"@w:val":"24"}},"w:lang":{{"@w:val":"en-US","@w:eastAsia":"zh-CN","@w:bidi":"ar-SA"}},"w14:ligatures":{{"@w14:val":"standardContextual"}}}},"w:t":"Hello"}}}},"w:sectPr":{{"w:pgSz":{{"@w:w":"12240","@w:h":"15840"}},"w:pgMar":{{"@w:top":"1440","@w:right":"1440","@w:bottom":"1440","@w:left":"1440","@w:header":"720","@w:footer":"720","@w:gutter":"0"}},"w:cols":{{"@w:space":"720"}},"w:docGrid":{{"@w:linePitch":"360"}}}}}}}}

### Action Explanation
Below are the available APIs that the agent can use to interact with the application window. You can refer to the API usage to understand the agent actions.
{apis}

## Evaluation Items

You have 2 main items to evaluate:

1. You should give an overall evaluation of whether the task has been finished, marked as "yes", "no" or "unsure".
2. You should also give an overall evaluation of the quality of the task, marked as "ambiguous", "over-detailed" or "good".

Criteria for evaluation of the task completion:
1. The <Final Control State:> and <Final Env Status:> should be consistent with the task requirements. If the controls or canvas content expected to be changed are not changed, the task is not completed.
2. The <Execution Trajectory> should be consistent with the task requirements. If the agent actions are not consistent with the task requirements, the task is not completed.
3. If any action in the <Execution Trajectory> is empty, the task is not completed.

Criteria for evaluation of the task quality:
1. The description of the <Original Request:> should be clear and unambiguous, without the meaning of "selection".
2. The description of the <Original Request:> should not be too detailed, like step-by-step actions.

## Response Format
You must strictly follow the below JSON format for your reply, and do not change the format nor output additional information.
{{
"task_quality": The quality of the <Original Request:>, which is "ambiguous/over-detailed/good",
"task_complete": The evaluation of the task completion, which is "yes/no/unsure",
"complete_judgement": your judgment of whether the task has been finished, and the detailed reasons for your judgment based on the provided information,
"quality_judgement": your judgment of the quality of the task, and the detailed reasons for your judgment based on the provided information
}}

Please take a deep breath and think step by step. Observe the information carefully and analyze the agent execution trajectory; do not miss any minor details.
Rethink your response before submitting it.
Your judgment is very important to improve the agent performance. I will tip you 200$ if you provide a detailed, correct and high-quality evaluation. Thank you for your hard work!

user: |-
<Original Request:> {request}
<Thought:> {thought}
<Execution Trajectory:> {trajectory}
<Canvas Diff:> {canvas_diff}
<Init Control State:> {init_control_state}
<Final Control State:> {final_control_state}
<Final Env Status:> {final_status}
<Your response:>
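A hedged sketch of how the evaluator's verdict might be consumed downstream: parse the JSON reply and keep only trajectories judged both complete and well-posed. The field names follow the Response Format above; the filtering policy itself is an assumption for illustration, not a rule stated by the paper.

import json

def keep_trajectory(evaluator_reply: str) -> bool:
    """Return True if the judged trajectory should enter the training set (assumed policy)."""
    verdict = json.loads(evaluator_reply)
    # "task_complete" is "yes"/"no"/"unsure" and "task_quality" is
    # "ambiguous"/"over-detailed"/"good", as defined in the prompt above.
    return verdict["task_complete"] == "yes" and verdict["task_quality"] == "good"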
C TEMPLATES OF TRAINING FORMAT
The following presents a template of the training data format. The parts enclosed in "{}" are fields that need to be filled. The "apis" field corresponds to the function information in the respective app, while "control_item" contains the control information of the app under the current screenshot. The "user_request" field captures the user's current request, "step_history" records the agent's previous trajectory history, and "previous_plan" outlines the agent's planning for the task in the previous state.

system: |-
- You are a virtual assistant that can help users to complete their current requests by interacting with the UI of the Windows OS.
- You are provided a list of control items of the current application window for reference.
- You are provided your previous plan of action for reference to decide the next step; the previous plan is the list of plans for future actions made before the current action.
- You are provided the step history, including historical actions of your previous steps, for reference to decide the next step.
- You are required to select the control item and take a one-step action on it to complete the user request for one step. The one-step action means calling a function with arguments only once.
- You are required to decide the task status, and detail a list of plans of following actions to accomplish the current user request. Do not include any additional actions beyond the completion of the current task.

## Control item
- The control item is the element on the page that you can interact with; we limit the actionable control items to the following:
- "Button" is the control item that you can click.
- "Edit" is the control item that you can click and input text.
- "TabItem" is the control item that you can click and switch to another page.
- "ListItem" is the control item that you can click and select.
- "MenuItem" is the control item that you can click and select.
- "ScrollBar" is the control item that you can scroll.
- "TreeItem" is the control item that you can click and select.
- "Document" is the control item that you can click and select text.
- "Hyperlink" is the control item that you can click and open a link.
- "ComboBox" is the control item that you can click and input text.

## Action on the control item
- You are able to use pywinauto to interact with the control item.
{apis}

## Status of the task
- You are required to decide the status of the task after taking the current action, choose from the following statuses, and fill in the "Status" field in the response.
- "CONTINUE": means the task is not finished and needs further action.
- "FINISH": means the current task is finished for the AppAgent and no further actions are required.

## Other Guidelines
- You are required to select the control item and take a one-step action by calling the API on it to complete the user request for one step.
- You are required to respond in a JSON format, consisting of 7 distinct parts with the following keys and corresponding content:
{{
"thought": <Outline your thinking and logic for the current one-step action required to fulfill the given request. You are restricted to providing your thought for only one step action.>,
"control_label": <Specify the precise annotated label of the control item to be selected, adhering strictly to the provided options in the field of "label" in the control information. If you believe none of the control items is suitable for the task or the task is complete, kindly output an empty string.>,
"control_name": <Specify the precise control_text of the control item to be selected, adhering strictly to the provided options in the field of "control_text" in the control information. If you believe none of the control items is suitable for the task or the task is complete, kindly output an empty string. The control text must match exactly with the selected control label.>,
"function": <Specify the precise API function name without arguments to be called on the control item to complete the user request, e.g., click_input. Leave it an empty string "" if you believe none of the API functions is suitable for the task or the task is complete.>,
"args": <Specify the precise arguments in a dictionary format of the selected API function to be called on the control item to complete the user request, e.g., {{"button": "left", "double": false}}. Leave it an empty dictionary {{}} if the API does not require arguments, or you believe none of the API functions is suitable for the task, or the task is complete.>,
"status": <Specify the status of the task given the action.>,
"plan": <Specify the following list of plans of action to complete the user request. You must provide the detailed steps of action to complete the user request. If you believe the task is finished and no further actions are required after the current action, leave it an empty list.>
}}

user: |-
<Available Control Item:> {control_item}
<User Request:> {user_request}
<Previous Actions:> {step_history}
<Previous Plans:> {previous_plan}

assistant: |-
{output}
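To illustrate how this template becomes a single training example, the sketch below assembles one (state, action) pair into system/user/assistant messages. All concrete values (the control tree, the request, the target action) are invented for illustration; only the field layout comes from the template above.

import json

SYSTEM_PROMPT = "..."  # the system template above, with {apis} already filled in

sample = {
    "system": SYSTEM_PROMPT,
    "user": (
        "<Available Control Item:> "
        '[{"control_label": "101", "control_text": "Borders", "control_type": "Button"}]\n'
        "<User Request:> Add a border to the current paragraph\n"
        "<Previous Actions:> []\n"
        "<Previous Plans:> []"
    ),
    # The assistant turn is the supervision target: the JSON action record
    # that the LAM should learn to emit for this state.
    "assistant": json.dumps({
        "thought": "Clicking the Borders button applies a border to the paragraph.",
        "control_label": "101",
        "control_name": "Borders",
        "function": "click_input",
        "args": {"button": "left", "double": False},
        "status": "FINISH",
        "plan": [],
    }),
}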
D EVALUATION PROMPT FOR TASK-PLAN
The evaluation prompt for results from LAM1 after task-plan pretraining.

You are a helpful and precise assistant for checking the quality of the answer. We would like to invite you to evaluate the performance of two AI assistants in answering a user's question in <Question>. These two answers are in <Answer1> and <Answer2>, respectively. Your evaluation will contain four sub-evaluation tasks:
1. Can <Answer1> solve the user's question?
- Your answer should be "Yes" or "No".
2. Can <Answer2> solve the user's question?
- Your answer should be "Yes" or "No".
3. Both answers contain a list of steps marked by numbers. Your task is to extract action items from the provided steps in both answers. An action item is defined as a combination of an action and an element. Compare the action items to identify similarities. Output the similar action items. Count the number of similar action items.
- Your answer should contain the two extracted action item sets (in the format of a list of strings).
- Your answer should contain the set of similar action items (in the format of a list of strings). Similar action items are those sharing similar intent or achieving similar goals. Each similar action pair in the list should be in the format of "similar action item from action item set1 / similar action item from action item set2".
- Your answer should contain the count of similar action items.
4. Which assistant provides a more helpful response?
- Your answer should be "1" or "2", where "1" represents <Answer1> and "2" represents <Answer2>.
- Your answer should contain the reason(s) for your choice. You should not focus on the length of the answer or the details of the answer, but you should focus on whether the steps could solve the user's question and the quality of the steps.

Your output should be in the following format in json:
{{
"Subtask1": "Yes" or "No",
"Subtask2": "Yes" or "No",
"Subtask3": {{
"Action items in Answer1": ["action item 1", "action item 2", ...],
"Action items in Answer2": ["action item 1", "action item 2", ...],
"Similar action items": ["similar action item 1", "similar action item 2", ...],
"Count of similar action items": 2
}},
"Subtask4": {{
"More helpful assistant": "1" or "2",
"Reason": "reason for your choice"
}}
}}

Here is the user's question <Question>: {question}
The first answer <Answer1> is: {answer1}
The second answer <Answer2> is: {answer2}
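The judge's replies can then be aggregated into simple corpus-level statistics. The sketch below is one plausible aggregation, assuming the evaluated LAM's plan is always supplied as <Answer1>; the metric names are illustrative, not the paper's.

import json

def aggregate(judgments: list[str]) -> dict:
    """Aggregate judge replies that follow the JSON format above (assumed metrics)."""
    solved = wins = similar = 0
    for reply in judgments:
        j = json.loads(reply)
        solved += j["Subtask1"] == "Yes"   # the evaluated plan solves the question
        wins += j["Subtask4"]["More helpful assistant"] == "1"
        similar += j["Subtask3"]["Count of similar action items"]
    n = len(judgments)
    return {
        "solve_rate": solved / n,
        "win_rate": wins / n,
        "avg_similar_action_items": similar / n,
    }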
E LAM TRAINING OBJECTIVES
The problem is formally structured into two key objectives: (i) task-plan pretraining and (ii) decision-making training.

Task-plan pretraining aims to enable the LAM to map a given task description to a structured sequence of plans necessary for accomplishing the task. The primary objective of this component is to generate accurate and coherent plans. The training dataset consists of task-plan pairs, defined as:

$$\mathcal{D}_{\text{plan}} = \{(t_i, P_i)\}_{i=1}^{N}$$

where $t_i$ is the task description and $P_i$ is a sequence of plans to complete the task.

In decision-making training, the dataset consists of task-action trajectories, defined as:

$$\tau = \{(s_1, a_1), (s_2, a_2), \ldots, (s_T, a_T)\}$$

where:
• $s_t$ (state at time step $t$), comprising:
  – Task description: A high-level summary of the task.
  – Step ID: The current step in the task sequence.
  – Observations: Information including control elements and the current canvas state.
  – Thoughts: Model-generated reasoning for the current step.
  – Previous actions and plans: The sequence of actions and plans from prior steps.
• $a_t$ (action taken at time step $t$), consisting of:
  – Thought: Model's reasoning for the action.
  – Control label: A label for the control element.
  – Control name: The name of the control to interact with.
  – Function name: The specific function invoked by the action.
  – Arguments: Parameters passed to the function.
  – Status: Indicates the action's progress, either ongoing (Continue) or completed (Finish).

The objective of decision-making training is to train the LAM to predict the appropriate action $a_t$ for a given state $s_t$ at each time step. This enables the model to map input states to corresponding actions across the sequence of steps required to accomplish the task.
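Read as standard maximum-likelihood fine-tuning, the two objectives above can be written compactly as follows. This is a sketch consistent with the definitions in this appendix rather than notation from the main text; in particular, the trajectory-dataset symbol $\mathcal{D}_{\text{action}}$ is introduced here only for illustration:

$$\max_{\theta} \sum_{(t_i, P_i) \in \mathcal{D}_{\text{plan}}} \log p_{\theta}(P_i \mid t_i), \qquad \max_{\theta} \sum_{\tau \in \mathcal{D}_{\text{action}}} \sum_{t=1}^{T} \log p_{\theta}(a_t \mid s_t),$$

where $p_{\theta}$ denotes the conditional distribution defined by the LAM with parameters $\theta$.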