Figure 1: The transition from LLMs to LAMs.
[28]. Current LLMs may excel at understanding and planning in textual form but often fall short when required to produce actionable outputs. This is particularly true in scenarios that demand precise task decomposition, long-term planning [12, 88], and the coordination of multi-step actions [63]. Furthermore, LLMs are generally optimized for broad, general-purpose tasks rather than tailored for specific scenarios or environments. This lack of specialization can result in suboptimal performance, especially when interacting with unfamiliar or dynamic environments where adaptive and robust action sequences are essential [39].

These limitations highlight a significant gap in the ability of LLMs to transition from passive understanding to active, real-world engagement. To address these challenges, the development of Large Action Models (LAMs) represents a transformative shift in AI capabilities [20]. Unlike traditional LLMs that primarily focus on text generation and response, LAMs are designed to perform actions in both physical and digital environments. These models are capable of interpreting user intentions from diverse data inputs, automating complex processes, planning for task completion, and interacting with the world via agents. This evolution marks a critical step toward a future where intelligent systems not only comprehend human language but can also translate that understanding into tangible, meaningful actions [85].

LAMs are often built upon the foundation of LLMs, but the transition from LLMs to LAMs is neither straightforward nor seamless, as shown in Figure 1. The process of transforming an LLM into a functional LAM involves multiple intricate stages, each requiring substantial effort and expertise. First, it is essential to collect comprehensive datasets that capture user requests, environmental states, and corresponding actions [11]. These data serve as the basis for training or fine-tuning LLMs to perform actions rather than merely generate text. This stage involves the integration of advanced training techniques that enable the model to understand and execute actions within specific environments [21]. Once the LAM has been trained, it must be incorporated into an agent system that can effectively interact with its environment. This system typically includes components for gathering observations, utilizing tools, maintaining memory, and implementing feedback loops. These components are critical for ensuring that the LAM can not only execute actions but also adapt its behavior based on real-time feedback and evolving situations [83]. The integration of these elements enhances the LAM's capacity to perform tasks autonomously, interact meaningfully with its surroundings, and make decisions that are grounded in the context of its environment.

A final but crucial step in the development of LAMs is evaluation [73]. Before deploying a LAM for real-world applications, it is imperative to rigorously assess its reliability, robustness, and safety. Unlike LLMs, which may be limited to generating text-based outputs, LAMs have the capacity to directly affect their environment through actions. This introduces new risks, as incorrect or inappropriate actions could have significant consequences. Therefore, thorough evaluation processes are essential to ensure that both the LAM and its accompanying agent are capable of making reliable decisions while minimizing potential risks. These evaluations often involve testing the model in a variety of scenarios to ensure that it can generalize across different environments and tasks, as well as effectively handle unexpected situations.

Given the complexity involved in developing LAMs, the purpose of this paper is to provide a comprehensive understanding of LAMs and guide practitioners in transforming an LLM into a functional LAM for real-world applications. To this end, we first present an overview of LAMs, clarifying their distinctions from traditional LLMs and discussing their unique characteristics. By offering this foundational knowledge, we aim to give readers a clear conceptual understanding of LAMs, enabling them to grasp the broader implications of their development and use.
Next, we delve into the practical process of obtaining a LAM from scratch. Using a Graphical User Interface (GUI) agent on Windows OS as an example, we provide a detailed, step-by-step exploration of the entire pipeline—beginning with data collection and preparation, followed by model training, integration, and grounding. This includes how to prepare datasets that capture user requests, environmental states, and actions, as well as how to fine-tune LLMs to generate executable actions rather than text responses. We also demonstrate how to integrate a trained LAM into an agent system, equipping it with tools, memory, and feedback mechanisms to enable dynamic interaction with its environment. The final stages focus on rigorous evaluation, ensuring that the LAM is robust, safe, and capable of handling real-world tasks. While this paper uses the Windows OS as a case study, the methodology outlined can be adapted to other environments, providing a generalizable workflow for obtaining functional LAMs. Finally, we address several limitations and challenges faced by LAMs in both research and industry. While LAMs represent a significant advancement over traditional LLMs, they are still in an early stage of development and present substantial areas for improvement. Issues such as privacy concerns, latency, safety risks, scalability, and ethical considerations all pose challenges that must be addressed for LAMs to be fully realized as practical tools.

The emergence of LAMs represents not merely an incremental advancement over LLMs, but a fundamental shift from passive language processing to active, real-world engagement. By executing actions, LAMs can interact dynamically with both digital and physical environments, marking a transformative milestone in the broader pursuit of AGI. We envision this paper as a foundational guide to LAMs, offering both theoretical insights and practical, actionable steps for creating and deploying LAMs in real-world scenarios.

2 LARGE ACTION MODELS 101

Large Action Models (LAMs) represent a significant advancement in artificial intelligence, extending the capabilities of Large Language Models (LLMs) [85]. While LLMs are proficient at generating human-like text based on user inputs, LAMs go beyond text generation by performing actions in both physical and digital environments [82]. These models interpret user intentions from various data forms, automate entire processes as per user requirements, plan for task completion, and interact with the world. This evolution signifies a shift from mere language interaction to action sequences that are grounded in real-world contexts.

2.1 Large Language Models

LLMs are neural networks with billions to hundreds of billions of parameters, trained on extensive text corpora to address general-purpose language tasks [31, 40, 75, 84, 92]. These models demonstrate exceptional capabilities in natural language understanding and generation, allowing them to perform complex tasks such as answering questions [27], generating code [86], and providing human-like textual responses [10] with minimal task-specific training, known as zero-shot [69] or few-shot [4] learning. Unlike traditional language models, which required extensive task-specific data and training, LLMs leverage their vast knowledge base to generalize across diverse tasks with minimal supervision.

Figure 2: The objective difference between LLMs and LAMs.

While LLMs possess significant language understanding and generation capabilities, they are primarily limited to generating text-based outputs. They excel at interacting with users and generating text, but they lack the ability to directly interface with environments to execute actions. This limitation restricts their applicability in scenarios that require tangible interaction with digital or physical environments.

To extend their utility, LLMs are often embedded within agent frameworks [65]. These agent systems augment LLMs, enabling them to interact with dynamic environments by collecting data from various sources [72], structuring it into meaningful inputs [32], and prompting the LLM for inference [72]. The agent then interprets the model's output—whether in the form of code [67] or tool-based actions [54]—and grounds it within the environment by executing actions and collecting feedback [58]. Agents equipped with LLMs typically function in a loop, continuously gathering environmental information, using LLM inference to form plans, executing those plans, and refining future actions based on feedback. This iterative process can incorporate external memory systems, enabling the agent to track historical actions and environmental states, further improving the decision-making process over time [24, 89].

2.2 From LLMs to LAMs

LAMs build upon the foundational capabilities of LLMs but are specifically optimized for action-oriented tasks. They are designed to perform actions in both physical and digital environments, interpreting user intentions from various data forms, automating processes as per user requirements, planning for task completion, and interacting with the world [82]. This evolution signifies a shift from passive language interaction to generating action sequences that are grounded in real-world contexts.

An illustrative example is shown in Figure 2. An LLM can comprehend a user's request to purchase a jacket and generate a detailed textual plan or recommendation, but it cannot autonomously complete the transaction on a website. In contrast, a LAM leverages this foundational understanding to generate action sequences that directly interact with the website, completing the request on the user's behalf. This ability to transition from understanding to execution bridges the gap between the model and real-world applications, moving beyond mere language output to tangible outcomes.
Furthermore, due to their specialization in specific domains or tasks, LAMs can be smaller in scale compared to general-purpose LLMs while achieving comparable or superior performance within their operational scope. By focusing on a narrower range of tasks, LAMs prioritize efficiency and effectiveness, leveraging targeted data and optimized architectures to reduce computational overhead without sacrificing capability. This specialization not only makes LAMs more practical for deployment in real-world applications but also opens opportunities for developing lightweight models that can operate in resource-constrained environments.

The evolution from LLMs to LAMs is achieved through specialized training and integration with agent systems. These systems enable LAMs to translate their inferences into real-world actions, bridging the gap between understanding and execution. Thus, LAMs not only enhance the functionality of LLMs but also redefine their applicability in real-world scenarios.

2.3 Key Characteristics of LAMs

LAMs are distinguished by advanced capabilities that enable them to perform complex tasks effectively. These characteristics include:

2.3.1 Interpretation of User Intentions. A fundamental capability of LAMs is the ability to accurately interpret user intentions from diverse forms of input. These inputs may include natural language requests, voice commands, images, or videos, such as device screenshots or instructional videos [8]. User inputs are often abstract or implicit [6], requiring LAMs to leverage their internal knowledge and complementary information to discern the true intent behind the input. This process involves understanding nuances, disambiguating instructions, and inferring unstated objectives. LAMs must translate these user intentions into actionable plans and steps, facilitating subsequent interactions with the environment to fulfill the user's objectives. This requires a robust foundation in LLMs, particularly those with multi-round conversational capabilities [57], enhancing LAMs' proficiency in engaging with users to accurately understand and execute their requests.

2.3.2 Action Generation. The hallmark feature of LAMs is their capacity for action generation grounded in the environment. LAMs translate user intentions into actionable steps that can be executed within specific contexts. These actions can take various forms: operations on graphical user interface (GUI) elements, API calls for software applications, physical manipulations performed by robots, invoking other AI agents or models, or autonomously generating code or combining meta-actions [5]. By incorporating detailed knowledge of the environment, including available actions, system states, and expected inputs, LAMs can select appropriate actions and apply them correctly to meet user requests. This involves not only executing predefined actions but also adapting to new situations by generating novel action sequences when necessary.

2.3.3 Dynamic Planning and Adaptation. LAMs exhibit a sophisticated capability for dynamic planning and adaptation, which is crucial for handling complex user requests that span multiple steps [19]. They can decompose a complex task into several subtasks, each further broken down into specific action steps. This hierarchical planning enables LAMs to approach task execution with a forward-looking perspective, anticipating future requirements and potential obstacles. Moreover, as the execution of each action alters the state of the environment, LAMs react to these changes, adapting and revising their plans and actions accordingly [58]. This flexibility ensures robustness in dynamic scenarios where deviations from initial expectations are common. For instance, if an unexpected error occurs or a resource becomes unavailable, a LAM can replan and adjust its actions to still achieve the desired outcome.

2.3.4 Specialization and Efficiency. LAMs are fine-tuned for executing specialized sequences of actions within specific environments [8]. By focusing on particular domains, LAMs achieve a high degree of accuracy and adaptability, outperforming general-purpose LLMs in targeted applications. This specialization allows LAMs to encode comprehensive knowledge about the environment deeply into their architecture, including available actions, system constraints, and contextual nuances. As a result, LAMs can operate more efficiently, reducing computational overhead and improving response times. Furthermore, since LAMs are expected to complete actionable tasks within a more limited scope, their scale can be smaller compared to general-purpose LLMs while achieving a comparable level of performance within that specific domain. This makes LAMs more practical for deployment in real-world applications, including resource-constrained environments such as edge devices or local systems.

2.3.5 Summary. In summary, LAMs transcend the basic functionality of converting user requests into a series of steps by comprehending the underlying logic that interconnects and contextualizes these actions. They understand sequence dependencies—why certain steps must precede or follow others—and recognize when to adapt the plan to accommodate changing circumstances. LAMs extend AI systems into the realm of actionable intelligence. This significantly enhances their ability to autonomously perform complex, real-world tasks, making them invaluable in applications requiring precise interaction and manipulation within defined operational contexts.

2.4 From Inception to Implementation

LAMs have the potential to significantly extend the impact of LLMs by enabling tangible interactions with real-world environments. To harness this potential, an LAM must be developed from the ground up and deployed within a real-world application, allowing it to operate effectively in a physical environment. This process involves five critical steps, as shown in Figure 3:
(1) Data Collection and Preparation (Section 3): The first step involves gathering and curating the necessary data for the specific use case. This includes not only user queries but also environmental context, potential actions, and any other relevant data required to train the LAM effectively. The data must undergo cleaning and pre-processing before it is used for training or fine-tuning a LAM.
(2) Model Training (Section 4): Using the prepared data, the next step is to train the LAM. This training process can involve various techniques such as supervised fine-tuning and reinforcement learning to ensure the model can perform the desired actions accurately and efficiently.
(3) Offline Evaluation (Section 5): After obtaining the LAM, we evaluate its performance using an offline dataset to verify its reliability in a controlled, static environment.
(4) Integration and Grounding (Section 6): The LAM is integrated into an agent framework that serves as its operational platform. This involves grounding the model with the ability to interact with external tools, maintain memory, and interface with the environment. By equipping the LAM with these capabilities, it becomes capable of making meaningful impacts in the physical world.
(5) Online Evaluation (Section 7): Finally, the performance of the LAM must be rigorously evaluated in the real environment from multiple perspectives, including accuracy, efficiency, and effectiveness in completing tasks. This step is crucial to ensure that the LAM functions as intended and meets the desired operational standards.

Figure 3: The process pipeline for LAM development and implementation.

Through these steps, LAMs can be effectively developed and deployed to bring LLMs' capabilities into real-world applications, enabling them to interact with and manipulate the physical environment, thereby making a tangible impact.

In the following sections, we use the Windows GUI agent UFO [83] as a case study to illustrate the process of building a robust LAM from the ground up. This LAM will serve as the core inference engine for UFO, enabling it to autonomously fulfill user requests within the Windows OS environment. While this example focuses on a Windows GUI agent, the outlined steps can be adapted for developing LAMs in other scenarios or for different applications.

3 DATA COLLECTION AND PREPARATION

The data collection proceeds in two phases:

(1) Task-Plan Data Collection: Tasks are user requests expressed in natural language, while plans are detailed, step-by-step procedures designed to fulfill these requests. For example, a task such as "How to change the font size in Word?" would have a corresponding plan outlining the steps required to complete the task. This data is used to fine-tune the model to generate effective plans and improve its high-level reasoning and planning capabilities. However, task-plan data cannot be directly executed in the environment, requiring the following data conversion phase.
(2) Task-Action Data Collection: In this phase, the task-plan data is converted into task-action data, which includes tasks, plans, and the associated action sequences needed to execute those plans. Tasks and plans are refined to become more concrete and grounded within a specific environment. Action sequences are generated at this stage, such as select_text(text="hello") or click(on=Button("20"), how="left", double=False), which represent actionable instructions capable of directly interacting with the environment. This enriched data provides the necessary granularity for training an LAM to perform reliable and accurate task executions in real-world scenarios.
Figure 5: The pipeline to construct the task-plan data.

3.1 Task-Plan Data

Figure 5 outlines a multi-step pipeline for collecting and processing task-plan data, essential for training LAMs. The process begins with gathering raw data from diverse sources, including application documentation, WikiHow, and historical search queries. This is followed by structured pre-processing to ensure that the data is high-quality and relevant to specific tasks.

3.1.1 Data Sources.

(1) Application Documentation: Documentation and usage manuals for software applications provide authoritative task descriptions. These resources, maintained by product teams, are considered highly reliable. Relevant documentation, such as M365 documentation†, is crawled, with outdated or inaccessible pages being filtered out. The HTML content is converted into markdown format, and GPT-4o is used to extract task-plan pairs in the desired structured format.
(2) WikiHow: WikiHow‡ hosts a wide range of how-to articles, including application-specific operational guides. Webpages related to Windows platform applications are crawled, and GPT-4o extracts task and plan components, ensuring the resulting data aligns with the desired structured format.
(3) Historical Search Queries: Search engine logs provide insight into real user demands, addressing gaps not covered by formal documentation. From Bing search logs, a 1% sample of queries mentioning application names (e.g., Word, Excel, PowerPoint) from the past year was taken.

† https://learn.microsoft.com/en-us/microsoft-365/?view=o365-worldwide
‡ https://www.wikihow.com/Main-Page

3.1.2 Data Extraction and Pre-Processing. The initial step in processing raw data involves parsing to extract task-relevant content while filtering out unnecessary or irrelevant information. This includes removing non-English entries, samples that are excessively short or long based on predefined heuristics, and data unrelated to actionable tasks (e.g., content focused on smartphone operations). The filtered data is then standardized into a unified format for further processing.

3.1.3 Data Construction. To create structured JSON samples, GPT-4o is employed to extract and format tasks along with their associated plans. For historical search queries, synthetic data is generated to enrich the raw input, addressing the common issue of insufficient context. GPT-4o reformulates these queries into complete, sentence-like user requests, ensuring consistency across all data sources and facilitating effective downstream processing.

The resulting dataset contains structured JSON samples, with each entry including a unique task identifier (task_id), the task description (task), and a step-by-step plan (plan). An example is shown below:

{
  "task_id": "word_032",
  "task": "Add a border to a page in Word",
  "plan": [
    "1. Go to Design > Page Borders.",
    "2. Make selections for how you want the border to look.",
    "3. To adjust the distance between the border and the edge of the page, select Options. Make your changes and select OK.",
    "4. Select OK."
  ]
}

With the above process, we initially collected a total of 29,182 task-plan data samples.
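To make the construction step concrete, the sketch below shows how a raw search query might be reformulated into a structured task-plan sample. The prompt wording and the chat helper are illustrative assumptions rather than the exact pipeline used here.

```python
import json

def chat(prompt: str) -> str:
    """Placeholder for a GPT-4o call; swap in your own LLM client."""
    raise NotImplementedError

def construct_task_plan(raw_query: str, task_id: str) -> dict:
    """Reformulate a raw search query into a structured task-plan sample."""
    prompt = (
        "Rewrite the following search query as a complete user request for a "
        "Windows application, then provide a concise step-by-step plan.\n"
        f"Query: {raw_query}\n"
        'Answer strictly as JSON: {"task": "...", "plan": ["1. ...", "2. ..."]}'
    )
    parsed = json.loads(chat(prompt))
    return {"task_id": task_id, "task": parsed["task"], "plan": parsed["plan"]}
```

Applied to a query such as "word page border", a call like this would be expected to yield a sample similar to the example above.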
3.1.4 Data Evolving. With the initial dataset processed, we employ data augmentation techniques to enhance its diversity and complexity. Inspired by WizardLM [74] and AgentGen [23], we use GPT-4o to evolve the raw task to generate new task-plan pairs, improving the model's ability to follow instructions and handle more complex tasks.

The data evolving process generates new tasks from the original ones by introducing additional complexity, constraints, or steps while preserving relevance. The guidelines for task evolution are as follows:
– The evolved task must be executable step-by-step on a Windows OS or application.
– The evolved task should include additional requirements, increasing its complexity without exceeding 20 extra words.
– The evolved task must remain concise and related to the original task.

For each evolved task, GPT-4o generates a corresponding plan adhering to the following guidelines:
– The plan must provide correct and actionable steps for Windows environments or applications.
– The plan should be concise and highlight critical action objects using bold emphasis.

This augmentation process results in a richer dataset where tasks become progressively more challenging, and plans incorporate domain-specific knowledge. For example:

Raw task: Create a drop-down list in Excel for Office 365.
Evolved Task: Create a dependent drop-down list in Excel for Office 365, where selecting an item from the first list filters options in the second list.
Evolved Plan:
– Prepare your data by organizing it into two columns. The first column contains items for the primary drop-down list, and the second column contains items for the dependent list.
– Name your ranges for the first and second lists.
– Create the primary drop-down list using Data Validation.
– Use the INDIRECT function to create the dependent drop-down list linked to the first selection.
– ···

Using data augmentation, we increased the original task-plan dataset by 150%, generating a larger pool of samples. This augmentation significantly enhances the diversity and complexity of the dataset, allowing the model to learn from a broader range of scenarios and develop robust planning capabilities. The augmented data introduces more challenging tasks and detailed plans, further enriching the training process and enabling the LAM to handle complex real-world applications effectively.
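The evolution guidelines above can also be enforced automatically. The following check is a hedged sketch: the 20-extra-words budget comes from the guidelines, while the remaining heuristics are assumptions.

```python
def within_word_budget(raw_task: str, evolved_task: str, max_extra_words: int = 20) -> bool:
    """The evolved task may add at most `max_extra_words` words to the raw task."""
    return len(evolved_task.split()) - len(raw_task.split()) <= max_extra_words

def keep_evolved_pair(raw_task: str, evolved_task: str, evolved_plan: list) -> bool:
    """Heuristic acceptance test for an evolved task-plan pair."""
    if not evolved_task or not evolved_plan:
        return False                                    # both components must be present
    return within_word_budget(raw_task, evolved_task)   # more complex, but still concise

assert keep_evolved_pair(
    "Create a drop-down list in Excel for Office 365.",
    "Create a dependent drop-down list in Excel for Office 365, where selecting "
    "an item from the first list filters options in the second list.",
    ["Prepare your data by organizing it into two columns.", "..."],
)
```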
3.2 Task-Action Data

The task-plan data collected in the previous stage provides high-level, step-by-step plans for resolving user-requested tasks, serving as general guidelines. However, these plans are textual and not directly executable in a real-world environment. For instance, a task-plan data sample for the task "Highlight text in document" outlines the necessary steps but does not translate into actionable instructions for interacting with the application's GUI. This gap highlights the need for actionable task-action data to bridge the divide between planning and execution. To enable LAMs to produce actionable outputs, we generate task-action data derived from the previously collected task-plan data. Task-action data captures the granular interactions required to complete a task in the application environment, including GUI navigation, button clicks, and responding to environmental feedback.

Traditional approaches for action data collection often involve manual or agent-based annotation for each task, which is both costly and labor-intensive. To address these limitations, we propose an efficient, fully automated, and low-cost pipeline that leverages LLMs and real-world application interactions. This pipeline consists of four stages, as depicted in Figure 6: Instantiation, Execution, Evaluation, and Post-Processing. Specifically,

(1) Instantiation: In this stage, the task-plan data is transformed into an executable trajectory. Using an LLM, each task is instantiated with specific operational objects, and the related high-level plan is instantiated into a concrete sequence of actions that can be directly executed in the application environment.
(2) Execution: The instantiated trajectory is then executed within the real-world application environment. During this stage, the system interacts with the application's GUI to carry out the specified actions. For example, the instantiated trajectory for highlighting text would involve selecting the appropriate text, navigating to the highlight tool, and applying the highlight. The result of this execution is the captured executed trajectory, including any feedback or environmental changes observed during the process.
(3) Evaluation: Once the execution is complete, the trajectory is evaluated for correctness using an LLM. The evaluation stage verifies whether the executed trajectory successfully accomplishes the intended task. This involves comparing the observed outcomes with the expected results outlined in the task-plan data. Tasks that fail to meet the criteria are flagged for review, while successful executions are retained for further processing.
(4) Post-Processing: In the final stage, successful task-action trajectories undergo post-processing to ensure consistency, completeness, and readiness for training. This includes refining the data format, ensuring compatibility with the training pipeline, and annotating the data with relevant metadata (e.g., task IDs, execution time, and step-by-step feedback). The post-processed task-action data is then added to the training dataset, enabling the LAM to learn from real-world interactions.

The pipeline minimizes human intervention and reduces the number of LLM calls required, significantly improving scalability and efficiency.
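Putting the four stages together, the loop below sketches how a batch of task-plan samples could flow through the pipeline. The stage callables are placeholders for the components described above, not the actual implementation.

```python
def run_task_action_pipeline(task_plan_samples, instantiate, execute, evaluate, post_process):
    """Drive Instantiation -> Execution -> Evaluation -> Post-Processing per sample.

    Stage callables (supplied by the caller):
      instantiate(sample)    -> executable trajectory grounded in a concrete file/app
      execute(trajectory)    -> executed trajectory with screenshots and feedback
      evaluate(executed)     -> "yes" | "no" | "unsure"
      post_process(executed) -> list of training records
    """
    training_data, flagged = [], []
    for sample in task_plan_samples:
        trajectory = instantiate(sample)
        executed = execute(trajectory)
        if evaluate(executed) == "yes":
            training_data.extend(post_process(executed))   # retained for training
        else:
            flagged.append(executed)                        # flagged for review
    return training_data, flagged
```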
3.2.1 Instantiation. The task-plan data are primarily collected from help documents or public websites, creating a gap between the generalized task-plan data and the specific requirements needed for execution within a particular environment. A common issue is the lack of specificity. For instance, the task "highlight text in document" does not specify actionable objects, such as which text to highlight.
Figure 6: The pipeline of task-action data conversion and collection.

Figure 7: Task-action data for the task "Highlight Text 'Hello World' in template.doc" before and after matching with the control item pool; after matching, the click step's controlLabel is "37" and its controlText is "Text Highlight Color".
3.2.2 Execution. To ensure that the steps in the instantiated task-plan data are accurate and truly actionable, the execution stage verifies the action sequence by matching control items with the real application environment and performing the specified actions. This process validates the task-action data, ensuring its correctness and compatibility with the application GUI.
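A minimal sketch of this control-item matching is shown below. It assumes the control pool is a list of (label, text) pairs exposed by the application, and the string-similarity matcher is a stand-in for whatever matching logic the pipeline actually uses.

```python
from difflib import SequenceMatcher

def match_control(action_text: str, control_pool: list, threshold: float = 0.5):
    """Return the (label, text) of the pool entry closest to the predicted control text."""
    def score(candidate: str) -> float:
        return SequenceMatcher(None, action_text.lower(), candidate.lower()).ratio()

    label, text = max(control_pool, key=lambda item: score(item[1]))
    if score(text) < threshold:
        return None                      # no sufficiently similar control: execution error
    return label, text

pool = [("36", "Font Color"), ("37", "Text Highlight Color"), ("38", "Bold")]
print(match_control("Highlight", pool))  # -> ('37', 'Text Highlight Color')
```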
For instance, as shown in Figure 7, the control item "Text Highlight Color" with its associated control label is retrieved using the action text "Highlight" from the control item pool. The corresponding task-action data is then executed in the application without further intervention from the LLM. During execution, errors may occur, for example a mismatch between the predicted control item and the controls actually available in the application.

3.2.3 Evaluation. The trajectory recorded during execution includes:
– Consecutive actions performed during execution.
– Screenshots captured before and after each action.
– Environmental changes observed between the initial and final states§.

§ More specifically, we compare the .xml files, which are the underlying data representation of Microsoft Word.

Using this comprehensive trajectory, we prompt GPT-4o to evaluate whether the executed task aligns with the original task description and achieves successful completion. The evaluation considers both the sequence of actions and the resulting application state.
The process assigns a "task-complete" key to indicate the outcome as "yes," "no," or "unsure." If the task is evaluated as "yes," the trajectory is deemed successful; otherwise, it is classified as a failure. The detailed prompt used for this evaluation is provided in Appendix B.2. This evaluation step ensures that only valid, accurate task-action data is included in the training dataset, contributing to the reliability and robustness of the LAM.
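Conceptually, the evaluation step is a single judge call over the recorded trajectory. The prompt and the chat helper below are illustrative; the actual prompt is the one given in Appendix B.2.

```python
import json

def chat(prompt: str) -> str:
    """Placeholder for a GPT-4o call that returns a JSON string."""
    raise NotImplementedError

def judge_trajectory(task: str, steps: list, final_state_diff: str) -> str:
    """Return the "task-complete" verdict: "yes", "no", or "unsure"."""
    prompt = (
        f"Task: {task}\n"
        f"Executed steps: {json.dumps(steps, indent=2)}\n"
        f"Observed environment changes: {final_state_diff}\n"
        'Did the trajectory complete the task? '
        'Answer as JSON: {"task-complete": "yes" | "no" | "unsure"}'
    )
    return json.loads(chat(prompt)).get("task-complete", "unsure")
```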
3.2.4 Post-Processing. As noted in Section 3.2.2, a trajectory was recorded during the execution process. This trajectory includes:
– Screenshots captured at each step.
– Environment states before and after each action.
– Plans and corresponding actions for every step.

During the post-processing stage, these trajectories are combined with the original task requests to generate synthetic step-wise training data. The resulting data format uses the task request as input and LAM's plan and actions as output. This structured format is critical for training LAMs to map task requests to actionable sequences effectively. The detailed template for the data format can be found in Appendix C.
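Under this format, each executed trajectory can be unrolled into one training record per step, pairing the task request and the history so far with the next plan and action. The field names below are illustrative; the exact template is given in Appendix C.

```python
def trajectory_to_training_records(task: str, trajectory: list) -> list:
    """Unroll an executed trajectory into step-wise (input -> output) records.

    Each trajectory step is assumed to carry an observation, a plan, and an action.
    """
    records, history = [], []
    for step in trajectory:
        records.append({
            "input": {
                "task": task,
                "observation": step["observation"],  # UI state before the action
                "history": list(history),            # previously executed actions
            },
            "output": {
                "plan": step["plan"],                # textual plan for the remaining steps
                "action": step["action"],            # executable action for this step
            },
        })
        history.append(step["action"])
    return records
```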
4 MODEL TRAINING

Our objective is to develop an LAM from scratch that can map user inputs to appropriate plans and executable actions, ultimately enabling complex task completion. To achieve this, we adopt a staged training strategy consisting of four phases, each building upon the previous one. As illustrated in Figure 8, these phases guide the model from learning structured task plans, to imitating expert demonstrations, to self-boosting from its own successes, and finally leveraging reward-based optimization. Throughout these stages, the model progressively evolves from LAM1 to LAM4.

Figure 8: The four-phase LAM training pipeline: Phase 1, Task-Plan Pretraining; Phase 2, Learning from Experts; Phase 3, Self-Boosting Exploration; and Phase 4, Learning from a Reward Model.

At a high level, Phase 1: Task-Plan Pretraining provides a strong foundation by teaching the model to generate coherent, step-by-step plans for various tasks. Phase 2: Learning from Experts then introduces action trajectories labeled by GPT-4o, enabling LAM2 to align its plan generation with actionable steps. However, relying solely on expert successes limits diversity and adaptability. To address this, Phase 3: Self-Boosting Exploration encourages the model to tackle tasks that even GPT-4o failed to solve, autonomously generating new success cases and evolving into LAM3. Finally, Phase 4: Learning from a Reward Model incorporates reinforcement learning (RL) principles, allowing LAM4 to learn from both successes and failures, refining its decision-making in complex, previously unseen scenarios. Table 1 summarizes the data used in each phase. Each phase uses different training objectives, namely (i) task-plan pretraining (Phase 1) and (ii) decision-making training (Phases 2–4), as detailed in Appendix E.

4.1 Phase 1: Task-Plan Pretraining

The initial stage focuses on imparting a broad understanding of how tasks can be decomposed into logical steps. We start with Mistral-7B [25] as the base model. A total of 76,672 task-plan pairs $(t_i, P_i)$ are collected from various sources, including application help documentation, WikiHow, and historical search queries. Of these, 29,182 pairs are sourced directly, while 47,490 are generated via data evolution techniques (as described in Section 3.1.4), enriching the dataset with more complex and diverse tasks.
Table 1: Training data summary for each phase of LAM training.

Model         Data Type                         Data Source                                                                    Input → Output Format   Data Size
LAM1          Task-Plan Pairs                   Application documentation, WikiHow, historical search queries, evolved data   t_i → P_i               76,672 tasks
LAM2          Task-Action Trajectories          GPT-4o                                                                         s_t → a_t               2,192 trajectories
LAM3          Task-Action Trajectories          LAM2 + GPT-4o                                                                  s_t → a_t               2,688 trajectories
LAM4          Task-Action-Reward Trajectories   RM + LAM3                                                                      (s_t, r_t) → a_t        1,788 trajectories
Reward Model  Task-Action-Reward Trajectories   GPT-4o + LAM3                                                                  (s_t, a_t) → r_t        4,476 trajectories
In this phase, LAM1 is trained via supervised fine-tuning (SFT) to predict the correct plan sequence $P_i$ for a given task $t_i$:
\[
\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_1}) = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\mathrm{CE}}\big(P_i^{\mathrm{pred}}, P_i^{\mathrm{true}}\big).
\]
Here, $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss, and $N$ is the number of tasks. Although no actions are generated at this stage, LAM1 gains a robust task-agnostic planning capability. This knowledge will prove critical in guiding the model's action execution in later phases, ensuring that the agent understands the logical structure of tasks before attempting to perform them.
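In practice, this objective is implemented by computing the cross-entropy only over the target (plan) tokens, as also noted in the training setup of Section 5.1.1. The snippet below is a generic sketch using the common convention that positions labeled -100 are ignored; it is not the exact training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over target tokens only; prompt positions carry the label -100."""
    shift_logits = logits[:, :-1, :].contiguous()   # each position predicts the next token
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                          # prompt tokens do not contribute
    )

# Toy usage: one sequence of length 6 over a 10-token vocabulary, where the
# first three positions belong to the task prompt and are therefore masked out.
logits = torch.randn(1, 6, 10)
labels = torch.tensor([[-100, -100, -100, 4, 7, 2]])
print(sft_loss(logits, labels))
```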
4.2 Phase 2: Learning from Experts

While LAM1 can produce structured plans, it lacks the ability to execute them. In Phase 2, we introduce expert-labeled task-action trajectories from GPT-4o (Section 3.2) to teach the model how to perform actions. The illustrative application in this paper is the Microsoft Word environment, where we have 2,192 successful expert trajectories. Each trajectory consists of a sequence of state-action pairs $(s_t, a_t)$, representing observed UI states and the corresponding actions to progress the task.

We split these 2,192 trajectories into a training set of 1,757 and a test set of 435 trajectories, providing a total of 3,959 steps for training. By applying imitation learning to LAM1 on these successful action sequences, we obtain LAM2. The objective is to minimize:
\[
\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_2}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\mathcal{L}_{\mathrm{CE}}\big(\mathrm{LAM}_{\theta_2}(s_t), a_t\big),
\]
where $N$ is the number of trajectories and $T_i$ is the number of steps in trajectory $i$. By imitating the expert's policy, LAM2 transforms from a passive planner into a model capable of executing actions aligned with its plans, grounding its reasoning in the real application environment.

4.3 Phase 3: Self-Boosting Exploration

Up to Phase 2, LAM2 only learns from successful trajectories provided by GPT-4o. This limits diversity and adaptability, as the model never sees how to handle situations that even GPT-4o could not deal with. To overcome this limitation, Phase 3 introduces self-boosting exploration.

Here, we revisit failed GPT-4o trajectories, i.e., tasks that GPT-4o did not complete successfully, and let LAM2 attempt them. Using the ReAct mechanism [58, 80], LAM2 interacts with the environment and tries alternative strategies for these challenging tasks. From these attempts, we sampled 2,284 GPT-4o failed tasks and then collect 496 newly successful trajectories generated by LAM2 itself. These self-labeled successes, combined with the original 2,192 GPT-4o successes, form an augmented dataset.

We then fine-tune LAM2 on this enriched data, yielding LAM3:
\[
\mathcal{L}_{\mathrm{SFT}}(\mathrm{LAM}_{\theta_3}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\mathcal{L}_{\mathrm{CE}}\big(\mathrm{LAM}_{\theta_3}(s_t), a_t\big).
\]
This self-boosting step allows the model to learn from its own newly discovered solutions, overcoming previous limitations and improving adaptability. By leveraging planning knowledge from Phase 1 and expert strategies from Phase 2, LAM3 becomes more resourceful, even in scenarios with sparse or absent expert guidance.

4.4 Phase 4: Learning from a Reward Model

Despite the improvements, Phases 1–3 focus on successes or expert-like behavior. They offer limited insights into intermediate decision quality and fail to exploit learning opportunities presented by failed attempts. In Phase 4, we integrate reinforcement learning (RL) to address these shortcomings.

To this end, we design a two-stage approach: we first build a reward model (RM) using LAM3 as the base model, with an additional output layer added to produce scalar values representing the quality of actions. Using the trained RM, we then fine-tune LAM4 in an offline RL setting. Here, the model refines its policy without additional environmental interactions, leveraging previously collected trajectories to learn from failures and improve action selection.

4.4.1 Reward Model Training. First, we train a reward model (RM) on both LAM3's successful (496) and failed (1,788) trajectories and GPT-4o's successful trajectories (2,192) gathered in previous phases. All steps in successful trajectories are assigned a reward of +1, and all steps in failed trajectories a reward of −1. This uniform, binary labeling of outcomes ensures the RM consistently captures overall trajectory quality. Formally:
\[
r_t = \mathrm{RM}(s_t, a_t; \phi),
\]
where $\phi$ denotes the RM parameters and $r_t \in \{+1, -1\}$ is the assigned reward. The RM is trained via mean squared error (MSE) to approximate these ground-truth rewards.

The training dataset for the RM includes both failed and successful task-action trajectories generated by LAM3, as well as the successful trajectories from the collected task-action data. All steps in successful trajectories receive a reward of +1, while every step in failed trajectories is assigned a reward of −1. This uniform labeling strategy ensures that the RM consistently reflects overall trajectory quality and effectively guides policy optimization.
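Structurally, the RM is the LAM3 backbone with a scalar value head trained by MSE against the +1/-1 labels described above. The sketch below illustrates that structure with a generic decoder backbone and should not be read as the exact architecture.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """A scalar value head on top of a decoder backbone (LAM3 in this work)."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(hidden_size, 1)        # additional output layer

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids, attention_mask)  # assumed shape (batch, seq, hidden)
        last_token = hidden[:, -1, :]                      # summary of the (state, action) pair
        return self.value_head(last_token).squeeze(-1)     # scalar reward, trained toward +1 / -1

def rm_loss(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MSE between predicted rewards and the +1 / -1 trajectory-level labels."""
    return nn.functional.mse_loss(predicted, target)
```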
4.4.2 Optimizing with Offline PPO [56]. Armed with the RM to evaluate intermediate actions, we fine-tune LAM4 via offline PPO. This stage focuses on the 1,788 failure trajectories collected during Phase 3, providing a unique opportunity to learn from mistakes. The training objective of PPO is:
\[
\mathcal{L}_{\mathrm{PPO}}(\mathrm{LAM}_{\theta_4}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\min\!\left(\frac{\mathrm{LAM}_{\theta_4}(a_t \mid s_t)}{\mathrm{LAM}_{\theta_4}^{\mathrm{old}}(a_t \mid s_t)}\,\hat{A}_t,\;\mathrm{clip}\!\left(\frac{\mathrm{LAM}_{\theta_4}(a_t \mid s_t)}{\mathrm{LAM}_{\theta_4}^{\mathrm{old}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right),
\]
where $\hat{A}_t$ denotes the advantage derived from RM-generated rewards, and $\epsilon$ is a clipping parameter to ensure stable updates.

By incorporating signals from both successes and failures, LAM4 gains a deeper understanding of action quality. This RL-based fine-tuning helps the model generalize to complex, previously unseen scenarios, ensuring more robust and reliable decision-making.
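The clipped objective translates directly into a few lines over a batch of logged (state, action, advantage) tuples. The sketch below mirrors the equation; returning the negative surrogate so that a standard minimizer maximizes it is a common implementation choice assumed here.

```python
import torch

def ppo_clip_loss(logprob_new: torch.Tensor,
                  logprob_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate over logged actions (offline setting).

    logprob_new: log-probability of a_t given s_t under the current policy
    logprob_old: log-probability under the policy that produced the data
    advantages:  advantage estimates derived from the reward model
    """
    ratio = torch.exp(logprob_new - logprob_old)                       # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with three logged steps.
print(ppo_clip_loss(torch.tensor([-1.0, -2.0, -0.5]),
                    torch.tensor([-1.1, -1.9, -0.7]),
                    torch.tensor([0.8, -1.0, 0.3])))
```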
4.5 Summary

The four-phase training pipeline incrementally builds a fully capable LAM. Phase 1 imparts a fundamental planning ability, Phase 2 incorporates expert knowledge for action execution, Phase 3 empowers the model to generate and learn from new successes, and Phase 4 leverages rewards from both successes and failures to optimize decision-making. By combining static knowledge with expert demonstrations, self-guided exploration, and reward-based refinement, we transform a general-purpose language model into a versatile LAM. This progressive training strategy ensures a robust, adaptive model ready to handle diverse and complex tasks.

5 OFFLINE EVALUATIONS

Offline evaluation results for task-plan pretraining (Phase 1) and task-action decision making (Phases 2–4) are presented in this section. Offline evaluation allows us to systematically assess the performance of LAM1 and subsequent phases (LAM2, LAM3, and LAM4) without interacting with the environment. This setup effectively provides a controlled and reproducible framework for comparing task success rates, precision, and recall metrics across models.

5.1 Experiment Setup

5.1.1 SFT Training (Phases 1, 2, 3). For supervised fine-tuning (SFT), the learning rate is set to 2 × 10^-5 with cosine decay and 2 warmup steps. The batch size is 16, and the training is conducted for 3 epochs on the training data. Loss is calculated only for the target tokens rather than the full input sequence, optimizing the efficiency of the fine-tuning process. The training is performed on 8 × A100 80G NVIDIA GPUs.

5.1.2 Reward Training (Phase 4). Reward scores are normalized to the range [0, 1] using a sigmoid function. We employ the LoRA (Low-Rank Adaptation) method [22] to train the reward model (RM). The LoRA parameters include a rank of 8, a LoRA alpha of 32, and a LoRA dropout of 0.1. The task type is sequence classification. The training process uses a learning rate of 2 × 10^-5 with linear decay, optimized with the AdamW optimizer, and spans 2 epochs. The training is conducted on 8 × A100 80G NVIDIA GPUs.

5.1.3 PPO Training (Phase 4). For Proximal Policy Optimization (PPO) training, we use a learning rate of 1.4 × 10^-5 and set the generated sample length to 256. The batch size is 8, and the mini-batch size is 1, with 4 PPO epochs and 1 gradient accumulation step per iteration. The target KL divergence is set to 0.1, and the initial KL coefficient is set to 0.2. To ensure robust training, reward values are normalized to the range [-0.5, 0.5]. The training is conducted on 8 NVIDIA A100 80G GPUs.

5.2 Task-Plan Pretraining Results (Phase 1)

5.2.1 Evaluation Metrics. We evaluate LAM1 on its ability to generate task plans. We use three metrics for this evaluation: (i) Task Success Rate (TSR), measuring whether the predicted plan matches the ground truth at the task level; (ii) Step Precision, evaluating the proportion of predicted plan steps that appear in the ground truth; and (iii) Step Recall, assessing the proportion of ground-truth plan steps that are correctly predicted.

To compute these metrics, we leverage GPT-4o to compare each step of the LAM1 output with the corresponding ground-truth steps. The counts of matched steps are then used to calculate the final evaluation metrics. Detailed prompt information for the evaluation can be found in Appendix D.
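Given the per-task step matches produced by the GPT-4o comparison, the three metrics reduce to simple counts. The sketch below assumes each evaluated task reports its matched-step count together with the predicted and ground-truth plan lengths; whether the averages are taken per task, as here, or over all steps is an implementation choice.

```python
def planning_metrics(results: list) -> dict:
    """Aggregate TSR, Step Precision, and Step Recall over evaluated tasks.

    Each result dict is assumed to contain:
      task_success        - bool, task-level match decided by the judge
      matched_steps       - predicted steps that also appear in the ground truth
      predicted_steps     - number of steps in the predicted plan
      ground_truth_steps  - number of steps in the reference plan
    """
    n = len(results)
    return {
        "TSR": sum(r["task_success"] for r in results) / n,
        "StepPrecision": sum(r["matched_steps"] / r["predicted_steps"] for r in results) / n,
        "StepRecall": sum(r["matched_steps"] / r["ground_truth_steps"] for r in results) / n,
    }

print(planning_metrics([
    {"task_success": True, "matched_steps": 3, "predicted_steps": 4, "ground_truth_steps": 3},
    {"task_success": False, "matched_steps": 1, "predicted_steps": 5, "ground_truth_steps": 4},
]))
```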
Table 2: Performance (%) comparison of different models on planning.

Model        TSR (%)   Step Precision (%)   Step Recall (%)
LAM1         82.2      54.7                 55.7
GPT-4o       84.5      28.2                 66.1
Mistral-7B    0.0       0.1                  0.5

5.2.2 Performance of LAM1 on Planning. Table 2 presents the performance of LAM1 in planning prediction across 15,334 tasks on Windows OS, utilizing the dataset detailed in Section 3.1. LAM1 achieves a TSR of 82.2%, which is comparable to GPT-4o's TSR of 84.5%. While GPT-4o demonstrates a slightly higher TSR, it exhibits a lower Step Precision of 28.2%, indicating inefficiencies in its planning by generating additional unnecessary steps. In contrast, LAM1 achieves a higher Step Precision, reflecting its ability to produce more efficient and accurate plans. This superior precision is attributed to LAM1's training regimen, which incorporates domain-specific knowledge through task-plan pretraining.

Additionally, the baseline Mistral-7B model, without any fine-tuning, performs inadequately with a TSR of 0.0%, Step Precision of 0.1%, and Step Recall of 0.5%. These stark results underscore the critical importance of task-plan pretraining in transforming a general-purpose language model into a competent task planner.

Overall, the evaluation highlights that while general-purpose models like GPT-4o can achieve high success rates, their lower step precision suggests a propensity for overcomplicating plans. In contrast, specialized models like LAM1 not only maintain competitive success rates but also generate more streamlined and accurate action sequences. This validates the effectiveness of targeted training approaches in enhancing planning capabilities and demonstrates the necessity of task-plan pretraining for developing reliable and efficient task planners.
Table 3: Offline performance comparison across different models and metrics on decision making.
Metric LAM1 LAM2 LAM3 LAM4 GPT-4o (Text-only) GPT-4o Mini (Text-only)
Object Acc (%) 39.4 85.6 87.4 87.8 73.2 74.6
Operation Acc (%) 59.9 97.3 97.7 97.7 94.2 91.5
Status Acc (%) 32.7 97.8 98.2 99.0 52.1 67.4
Step Success Rate (SSR) (%) 33.0 83.6 85.9 86.2 68.8 73.4
Task Success Rate (TSR) (%) 35.6 76.8 79.3 81.2 67.2 62.3
5.3 Task-Action Results (Phases 2–4)

5.3.1 Evaluation Metrics. To assess the performance of agents in completing tasks, we employ five primary metrics: Object Accuracy (Object Acc.), Operation Accuracy (Operation Acc.), Status Accuracy (Status Acc.), Step Success Rate (SSR), and Task Success Rate (TSR). The definitions and calculation methods for these metrics are detailed below:

(1) Object Accuracy (Object Acc.): This metric measures the accuracy of selecting the correct control object for each task step. The predicted object is compared with the set of acceptable objects defined in the ground truth. It evaluates the agent's ability to correctly identify and interact with the appropriate UI elements.
(2) Operation Accuracy (Operation Acc.): For operations such as Click, Type, or Select Option, this metric evaluates the correctness of the predicted action. It ensures that the agent performs the correct operation as specified in the ground truth.
(3) Status Accuracy (Status Acc.): This metric assesses whether the agent correctly identifies the task's completion status based on its predictions. It evaluates the agent's understanding of the overall progression and whether the task is marked as finished appropriately.
(4) Step Success Rate (SSR): A step is considered successful only if the selected object, predicted operation, and predicted status are all correct. This metric evaluates each step of the task independently by comparing the predicted outputs with the ground-truth action history.
(5) Task Success Rate (TSR): A task is considered successful only if all steps within the task are successful, making this a stringent evaluation metric. This metric provides a holistic measure of the agent's ability to complete complex, multi-step tasks accurately.

These metrics collectively cover various aspects of agent performance, including precision in object selection, operation execution, task understanding, and overall task completion. By combining step-level and task-level evaluations, they provide a comprehensive assessment of the agent's effectiveness in real-world task execution.
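These definitions translate directly into step- and task-level aggregation. The sketch below assumes each predicted step records whether its object, operation, and status were judged correct; the key names are illustrative.

```python
def decision_metrics(tasks: list) -> dict:
    """Compute Object/Operation/Status accuracy, SSR, and TSR.

    `tasks` is a list of tasks; each task is a list of step dicts with boolean
    keys "object_ok", "operation_ok", and "status_ok".
    """
    steps = [step for task in tasks for step in task]
    step_ok = [s["object_ok"] and s["operation_ok"] and s["status_ok"] for s in steps]
    task_ok = [all(s["object_ok"] and s["operation_ok"] and s["status_ok"] for s in task)
               for task in tasks]
    return {
        "ObjectAcc": sum(s["object_ok"] for s in steps) / len(steps),
        "OperationAcc": sum(s["operation_ok"] for s in steps) / len(steps),
        "StatusAcc": sum(s["status_ok"] for s in steps) / len(steps),
        "SSR": sum(step_ok) / len(steps),
        "TSR": sum(task_ok) / len(tasks),
    }
```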
5.3.2 Performance on Decision Making. Table 3 summarizes the results on 435 tasks of the Word application. The four-phase LAM training framework demonstrates incremental and cumulative improvements in task completion. Notably, LAM4 achieves a TSR of 81.2%, outperforming both GPT-4o (67.2%) and GPT-4o-mini (62.3%). This performance gap is substantial, considering that LAM's training process relies on progressively collected data and incremental refinements tailored to each phase.

The step-by-step training strategy explains these gains. In Phase 1 (LAM1), task-plan pretraining establishes a foundational understanding of task structures, resulting in a modest increase in TSR. In Phase 2 (LAM2), imitation learning on GPT-4o-labeled success trajectories imparts efficient execution strategies, driving a significant jump in TSR from 35.6% to 76.8%. Phase 3 (LAM3) introduces self-boosting exploration, where LAM autonomously tackles cases previously failed by GPT-4o. This yields an additional increase in TSR to 79.3%. Finally, in Phase 4 (LAM4), reward-guided fine-tuning refines decision-making based on sparse feedback, further elevating TSR to 81.2%.

An important outcome is that the LAM framework enables the model to surpass GPT-4o, despite GPT-4o providing initial annotations. Through targeted data collection and progressive refinement, LAM not only assimilates the strengths of GPT-4o, but also learns from its failures to develop more robust and adaptable policies. The ReAct mechanism plays a crucial role here, allowing LAM2 and beyond to gather new success trajectories from challenging tasks, thereby enhancing its policy and overall performance.

In summary, the phased training approach and judicious data utilization enable LAM to excel where a state-of-the-art LLM (GPT-4o) falls short. This highlights the effectiveness of the LAM framework in crafting agents that are both data-efficient and capable of executing complex, multi-step tasks with high accuracy and reliability.

6 INTEGRATION AND GROUNDING

Once the LAM is trained, we integrate it into the GUI agent UFO [83], enabling the model's predicted actions to be grounded and executable within the Windows OS environment. The UFO agent accepts user requests in natural language and completes tasks by interacting with the UI controls of Windows applications.

6.1 LAM Agent In a Nutshell

In UFO, the LAM serves as the inference engine within the AppAgent, enabling efficient and accurate task completion. Figure 9 illustrates the architecture of the AppAgent. UFO, equipped with LAMs, is designed for interactive engagement with Windows applications. For simplicity, we focus on automating tasks within Microsoft Word, a widely used productivity tool with a sophisticated GUI and diverse functionalities, making it an ideal testbed for training and evaluating LAM.
Figure 9: The overall architecture of the AppAgent employed in UFO.
By combining the control element and its associated function call,
UFO executes the inferred actions within the application.
passed to the LAM for decision-making. The LAM performs plan-
ning, orchestrates actions, and infers the necessary steps to fulfill 6.5 Memory
the user request. These inferred actions are grounded in the envi- UFO maintains additional information in its memory to assist LAMs
ronment by mapping them to predefined tools and function calls in making more informed and accurate decisions. This memory
used by the agent, such as mouse clicks, keyboard inputs, or API includes:
calls. This process iterates, with LAM continuously adjusting its (1) Historical Actions: A log of action trajectories and their ex-
plan based on real-time feedback from the environment, until the ecution results from the initial step onwards. This helps LAM
task is completed. Additionally, the agent maintains a memory that understand the current system state and aids in exploring
logs historical actions and plans, providing essential context for the the next steps based on prior actions.
LAM to make more informed and adaptive decisions as the task pro- (2) Previous Plan: The textual planning for future actions, gen-
gresses. This integration ensures that UFO can efficiently manage erated by LAM in the previous step. This serves as a refer-
and complete complex, real-world tasks in Windows environments. ence for guiding the current and future actions, ensuring
consistency across steps.
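To make this loop concrete, the following is a minimal sketch of the observe-infer-ground-execute cycle. It is illustrative only: the helper names (collect_observations, ground_action, execute) and the lam.infer interface are assumptions, not UFO's actual implementation.

def run_task(request, lam, env, max_steps=50):
    # Illustrative AppAgent loop; helper functions and the LAM interface
    # are hypothetical stand-ins for the components described above.
    memory = []  # logs prior actions and plans for later prompts
    for _ in range(max_steps):
        observation = env.collect_observations()             # UIA control info
        decision = lam.infer(request, observation, memory)   # plan + next action
        action = env.ground_action(decision)                 # map to a function call
        result = env.execute(action)                         # click, type, or API call
        memory.append({"action": decision["action"],
                       "plan": decision["plan"],
                       "result": result})
        if decision.get("status") == "FINISH":               # model signals completion
            break
    return memory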
6.2 Environment
The UFO agent leverages the LAM to interact with applications in the Windows environment. At each decision step, UFO employs the UI Automation (UIA) API [13] to inspect all actionable controls within the target Windows application, retrieving contextual information for each control¶. This information is passed to the LAM for control selection and action inference. The control data is structured as a list of dictionaries, where each control is assigned a numerical index (as a label), along with its title and control type, allowing the LAM to make informed decisions regarding control selection and the corresponding action. This input format mirrors the structure used during offline data collection for consistency in training and execution.

¶ UIA comprises the native Windows OS APIs used to detect actionable controls and provide their metadata, such as names and locations. On other platforms, UIA can be replaced by vision-based detectors that analyze screenshots or by alternative accessibility APIs.
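As a rough illustration of this collection step, the sketch below enumerates actionable controls of a Word window through the UIA backend of pywinauto and serializes them into an indexed list of dictionaries. The window-title pattern, the filtering rule, and the field names are assumptions for illustration; this is not UFO's exact collection code.

from pywinauto import Desktop

# Control types treated as actionable in this sketch (an assumption).
ACTIONABLE = {"Button", "Edit", "TabItem", "ListItem", "MenuItem",
              "ScrollBar", "TreeItem", "Document", "Hyperlink", "ComboBox"}

def collect_controls(title_re=".* - Word"):
    # Attach to the target window via the UIA backend and enumerate controls.
    window = Desktop(backend="uia").window(title_re=title_re)
    controls = []
    for ctrl in window.descendants():
        info = ctrl.element_info
        if info.control_type in ACTIONABLE:
            controls.append({
                "label": len(controls),             # numerical index used by the LAM
                "title": info.name,                 # control title, e.g., "New"
                "control_type": info.control_type,  # e.g., "Button"
            })
    return controls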
6.3 LAM Inference
Using the environmental observations of application control information, UFO constructs prompts in the same format as the offline training data, using planning and thought generation techniques [12, 70] to enable LAM to make reliable inferences about the appropriate controls and operations to invoke. These inferences target the controls detected by the UIA, where each control is selected from a predefined list. The function calls inferred by LAM are limited to predefined operations, such as mouse and keyboard actions, as well as APIs specific to Word-related tasks. Once inferred, these operations are parsed and executed within the environment.
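A prompt-assembly step along these lines might look like the sketch below. The section headers and required response fields are hypothetical placeholders that only mirror the structure implied above (user request, indexed control list, memory, and a constrained JSON response); the actual UFO prompt follows the offline training-data format.

import json

def build_prompt(request, controls, memory, last_k=5):
    # Assemble the inference prompt from the observation and memory
    # (section names and response fields are illustrative assumptions).
    return (
        "## User request\n" + request + "\n\n"
        "## Available controls (indexed)\n" + json.dumps(controls, indent=2) + "\n\n"
        "## Previous actions and plan\n" + json.dumps(memory[-last_k:], indent=2) + "\n\n"
        "## Response format\n"
        "Reply in JSON with the fields: thought, plan, control_label, "
        "function, args, status."
    )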
6.4 Action Execution
UFO employs a control interactor to ground the action strings generated by LAMs, translating them into tangible impacts within the target application. Each action typically consists of two key components:
(1) Control Element: This refers to the specific UI control within the application that will receive the action, such as a button, text box, or scroll bar.
(2) Function Call: This represents the operation to be performed on the control element, such as a mouse click, keyboard input, or invocation of native APIs.
By combining the control element and its associated function call, UFO executes the inferred actions within the application.
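A control interactor of this kind can be approximated by a small dispatcher, as in the sketch below. It assumes the selected control exposes pywinauto-style wrapper methods, and the function names shown are examples rather than UFO's full operation set.

def execute_action(control, function, args):
    # Map an inferred (control, function, args) triple to a concrete UI operation.
    # Function names and argument fields here are illustrative assumptions.
    if function == "click_input":
        control.click_input()                    # mouse click on the control
    elif function == "set_edit_text":
        control.set_edit_text(args["text"])      # type text into an edit control
    elif function == "type_keys":
        control.type_keys(args["keys"])          # send keystrokes to the control
    else:
        raise ValueError(f"Unsupported function: {function}")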
6.5 Memory
UFO maintains additional information in its memory to assist LAMs in making more informed and accurate decisions. This memory includes:
(1) Historical Actions: A log of action trajectories and their execution results from the initial step onwards. This helps LAM understand the current system state and aids in exploring the next steps based on prior actions.
(2) Previous Plan: The textual planning for future actions, generated by LAM in the previous step. This serves as a reference for guiding the current and future actions, ensuring consistency across steps.
This memory is fed into LAM at each decision point, allowing for more effective decision-making. By maintaining a comprehensive record of past actions and plans, LAMs can better understand what has been accomplished, what remains to be done, and the outcomes of previous actions. This situational awareness enhances LAMs' ability to complete user requests more effectively and efficiently.
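Conceptually, this memory can be an append-only list of per-step records that is serialized into each new prompt. The sketch below is a plausible structure under that assumption, not UFO's exact implementation.

import json

class AgentMemory:
    # Illustrative per-step memory of executed actions and textual plans.
    def __init__(self):
        self.records = []

    def add(self, action, result, plan):
        self.records.append({"action": action, "result": result, "plan": plan})

    def to_prompt(self, last_k=5):
        # Serialize the most recent steps for inclusion in the next LAM prompt.
        return json.dumps(self.records[-last_k:], indent=2)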
7 ONLINE EVALUATIONS
With the integration of the Windows GUI agent UFO, we evaluate the performance of the LAM in real-world environments. The evaluation process and results are detailed in the following subsections.

7.1 Testing Dataset
The online performance of LAM is evaluated on the same set of 435 test requests used during LAM training. The testing environments, specifically the Word document templates corresponding to each task, are also kept identical to the training setup to ensure consistency and comparability.

7.2 Implementation
Our LAM was deployed on a virtual machine (VM) configured as NC24s v3. The VM is equipped with 24 virtual cores (vCPUs), 448 GB of memory, and two NVIDIA Tesla V100 GPUs, each with 16 GB of memory, to support efficient inference. This computational setup was designed to meet the demanding requirements of LAM's inference processes effectively.
The UFO agent operates on six VMs running in parallel using Azure Dedicated Host∥ to accelerate the testing process. Each VM is equipped with a 15-core Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz, 64GB of RAM, and runs Windows 11 Enterprise version 23H2. Microsoft applications, such as Word and Excel, are installed at version 2410. GUI control is facilitated through the MSTSC tool∗∗. This setup ensures a consistent and controlled environment for evaluating the LAM's performance.

7.3 Baselines
To benchmark the performance of LAM, we compared it against two baseline models: GPT-4o and GPT-4o Mini. These models are widely recognized for their robust natural language processing and reasoning capabilities, making them popular choices in the development of GUI agents. To ensure consistency in evaluation, the top_p and temperature hyperparameters were set to 0 for both baseline models.
To further examine the impact of input modalities, we conducted an ablation study comparing performance with and without the inclusion of screenshots. Notably, LAM processes only textual inputs, excluding screenshots, while the baseline models were evaluated using both textual and visual modalities.
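For reproducibility, the deterministic decoding configuration described above (temperature and top_p set to 0) can be expressed as in the sketch below, assuming the OpenAI Python SDK; the model identifier and message contents are placeholders rather than the exact deployment used in our experiments.

from openai import OpenAI

client = OpenAI()

# Baseline query with deterministic decoding (temperature and top_p set to 0).
response = client.chat.completions.create(
    model="gpt-4o",                       # placeholder model identifier
    temperature=0,
    top_p=0,
    messages=[
        {"role": "system", "content": "You are a GUI agent for Microsoft Word."},
        {"role": "user", "content": "<request, control list, and memory prompt>"},
    ],
)
print(response.choices[0].message.content)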
7.4 Evaluation Metrics
We employ the following metrics to comprehensively evaluate the performance of LAM:
• Task Success Rate (TSR): The percentage of tasks successfully completed out of the total tasks attempted. Task success is determined by an evaluation agent using GPT-4o, which assesses the full task completion trajectory, including plans, action sequences, and screenshots, to verify task completion.
• Task Completion Time: The total time taken to complete each task, measured from the initial request to the final action.
• Task Completion Steps: The total number of action steps performed by the agent to successfully complete each task.
• Average Step Latency: The average time taken per action step, reflecting the model's efficiency in generating and executing each action.
These metrics collectively evaluate both the accuracy and efficiency of task completion, providing a comprehensive assessment of the LAM's capabilities in real-world scenarios.
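These four quantities can be computed directly from per-task execution logs. The sketch below assumes each log records a success flag, a wall-clock duration, and per-step latencies, which is an illustrative format rather than our actual logging schema.

def summarize(task_logs):
    # Aggregate per-task logs into the four reported metrics
    # (log field names are assumptions for illustration).
    n = len(task_logs)
    step_latencies = [s for t in task_logs for s in t["step_latencies"]]
    return {
        "TSR (%)": 100.0 * sum(t["success"] for t in task_logs) / n,
        "Task Completion Time (s)": sum(t["duration"] for t in task_logs) / n,
        "Task Completion Steps": sum(len(t["step_latencies"]) for t in task_logs) / n,
        "Average Step Latency (s)": sum(step_latencies) / len(step_latencies),
    }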
Table 4: Performance comparison of LAM and baseline models across metrics.

7.5 Experimental Analysis
The experimental results are presented in Table 4. LAM achieves a TSR of 71.0%, demonstrating competitive performance compared to the GPT-4o models. While GPT-4o with visual inputs attains the highest TSR of 76.5%, slightly outperforming LAM, its reliance on visual data introduces significant trade-offs in efficiency. Notably, when visual inputs are excluded, GPT-4o's TSR drops to 63.0%, 8.0 percentage points below LAM. Similarly, GPT-4o Mini exhibits lower TSRs in both the visual and non-visual settings (66.7% and 57.8%, respectively). These results underscore LAM's capability as a text-only model to maintain high task success rates, outperforming the text-only variants of the baseline models.
Efficiency is assessed through Task Completion Time and Average Step Latency, where LAM demonstrates clear superiority. LAM achieves the shortest Task Completion Time of 30.42 seconds, substantially outperforming all baseline models. In comparison, GPT-4o without visual inputs records a completion time of 86.42 seconds, roughly 2.8 times as long as LAM's. GPT-4o with visual inputs fares even worse, with a completion time of 96.48 seconds. Although the GPT-4o Mini models show slightly better efficiency than their larger counterparts, they remain less efficient than LAM, with completion times of 35.24 seconds (without visual inputs) and 46.21 seconds (with visual inputs).
LAM also excels in Average Step Latency, achieving the shortest time per action step at 5.41 seconds. Without visual inputs, GPT-4o reduces its step latency to 12.84 seconds but still remains more than twice as slow as LAM. GPT-4o with visual inputs exhibits the highest step latency at 19.36 seconds per step, more than triple LAM's latency. The GPT-4o Mini models show moderate improvements but still fall short, with step latencies of 7.29 seconds (with visual inputs) and 5.88 seconds (without visual inputs).
These findings highlight LAM's strengths as a text-only model, offering a compelling balance of competitive accuracy and superior efficiency. It achieves rapid task completion and low latency without sacrificing performance, making it an effective solution for real-world applications. Its specialized training enables precise action inference and execution, underscoring the potential of LAMs to enhance automation and productivity in agent-based systems.
Figure 11: A word template file with the description "A doc with comments and reviewer."

## Control item
- The control item is the element on the page that you can interact with, we limit the actionable control item to the following:
- "Button" is the control item that you can click.
- "Edit" is the control item that you can click and input text.
- "TabItem" is the control item that you can click and switch to another page.
- "ListItem" is the control item that you can click and select.
- "MenuItem" is the control item that you can click and select.
- "ScrollBar" is the control item that you can scroll.
- "TreeItem" is the control item that you can click and select.
- "Document" is the control item that you can click and select text.
- "Hyperlink" is the control item that you can click and open a link.
- "ComboBox" is the control item that you can click and input text. The Google search box is an example of ComboBox.
" function " : " select_text " , You will be provided with a task and the <
" args " : {{ Execution Trajectory > of the agent ,
" text " : " Test For Fun " including the agent actions that have been
}} taken , and the change of environment .
}} You will also be provided with a final canvas
state in < Final Env Status >.
- The < actions_plan > field must be strictly in a You will also be provided with a canvas
format separated each action call by " \ n " . difference in < Canvas Diff >.
The list format should be like this : " action You will also be provided with the initial
call 1\ naction call 2\ naction call 3 " control state in < Init Control State >.
- If you think the original task do not need to You will also be provided with the final control
be detailed , you can directly copy the state after each action in < Final Control
original task to the " new_task " . State >.
- You should review the apis function carefully
and if the function to call need to specify Besides , you will also be provided with two
target control , the " controlText " field screenshots , one before the agent execution
cannot be set empty . and one after the agent execution .
- The " step " description should be consistent
with the action and also the thought . Please judge whether the agent has successfully
completed the task based on the screenshots
# # Here are some examples for you to complete and the < Execution Trajectory >. You are
the user request : required to judge whether the agent has
{ examples } finished the task or not by observing the
screenshot differences and the intermediate
# # Tips steps of the agent .
- Read the above instruction carefully . Make
sure the response and action strictly # # Execution trajectory information
following these instruction and meet the Here are the detailed information about a piece
user request . of agent execution trajectory item :
- Make sure you answer must be strictly in JSON - number : The number of action in the execution
format only , without other redundant text trajectory .
such as json header . Your output must be - action : The action that the agent takes in the
able to be able to be parsed by json . loads () current step . It is the API call that the
. Otherwise , it will crash the system and agent uses to interact with the application
destroy the computer . window .
- Your task is very important to improve the
agent performance . I will tip you 200 $ if You will get a list of trajectory items in the <
you do well . Thank you for your hard work ! Execution Trajectory > of the agent actions .
Your judgment is very important to improve the agent performance. I will tip you 200$ if you provide a detailed, correct and high-quality evaluation. Thank you for your hard work!

user: |-
<Original Request:> {request}
<Thought:> {thought}
<Execution Trajectory:> {trajectory}
<Canvas Diff:> {canvas_diff}
<Init Control State:> {init_control_state}
<Final Control State:> {final_control_state}
<Final Env Status:> {final_status}
<Your response:>

- "TabItem" is the control item that you can click and switch to another page.
- "ListItem" is the control item that you can click and select.
- "MenuItem" is the control item that you can click and select.
- "ScrollBar" is the control item that you can scroll.
- "TreeItem" is the control item that you can click and select.
- "Document" is the control item that you can click and select text.
- "Hyperlink" is the control item that you can click and open a link.
- "ComboBox" is the control item that you can click and input text.