
Automating a Complete Software Test Process Using LLMs: An Automotive Case Study


Shuai Wang¹, Yinan Yu¹, Robert Feldt¹, Dhasarathy Parthasarathy²
¹Chalmers University of Technology, ²Volvo Group
Gothenburg, Sweden
shuaiwa@chalmers.se, yinan@chalmers.se, robert.feldt@chalmers.se, dhasarathy.parthasarathy@volvo.com

Abstract—Vehicle API testing verifies whether the interactions between a vehicle's internal systems and external applications meet expectations, ensuring that users can access and control various vehicle functions and data. However, this task is inherently complex, requiring the alignment and coordination of API systems, communication protocols, and even vehicle simulation systems to develop valid test cases. In practical industrial scenarios, inconsistencies, ambiguities, and interdependencies across various documents and system specifications pose significant challenges. This paper presents a system designed for the automated testing of in-vehicle APIs. By clearly defining and segmenting the testing process, we enable Large Language Models (LLMs) to focus on specific tasks, ensuring a stable and controlled testing workflow. Experiments conducted on over 100 APIs demonstrate that our system effectively automates vehicle API testing. The results also confirm that LLMs can efficiently handle mundane tasks requiring human judgment, making them suitable for complete automation in similar industrial contexts.

Index Terms—software testing, vehicle API testing, test automation, large language model

Fig. 1. We present the case of automatically testing SPAPI, an in-vehicle web server. Previously, the multistep process of testing SPAPI was largely manual. Using LLMs to automate each manual step, we achieve complete automation.

I. INTRODUCTION

Large Language Models (LLMs) are revolutionizing software engineering. In the past few years, we have witnessed the application of LLMs for assisting or automating numerous software engineering tasks like requirements engineering, software design, coding, and testing [1][2]. Software testing, in particular, is one area where LLMs have been applied with vigor. Facing ever-increasing needs for automation due to the volume and intensity of work involved, testing is rapidly benefiting from the generative capabilities of LLMs. As systematically surveyed in [3], LLMs have been applied in many testing tasks including system input generation, test case generation, test oracle generation, debugging, and program repair.

While a considerable amount of recent literature has focused on applying LLMs in narrowly scoped tasks [4] – such as specific unit tests [5][6], isolated integration tests [7], or individual verification scenarios [8][9] – few have reported on their application to automate a complete test process. Practical testing processes are a diverse mix of steps that are mechanical, creative, and anything in between [10][11]. They also involve several (teams of) engineers and tools, whose harmonious cooperation is essential to ensure the quality and cadence of testing. The challenge is only greater when testing automotive embedded systems, where software coexists with mechatronics and other physical systems. Under such heterogeneous conditions, it is not immediately apparent how one can effectively integrate LLMs into a testing process and gain efficiencies. In response to these challenges, we present a case study that (1) focuses upon a real-world test process in the automotive industry that is largely performed manually, and (2) automates it using a recipe that seamlessly combines selective use of LLMs with conventional automation.

The focus of this case study – our system under test – is SPAPI, a web server that is deployed in trucks made by a leading vehicle manufacturer. SPAPI exposes a set of REST APIs which can be used by clients to read or write selected vehicle states. For example, SPAPI exposes /speed, which can be used to read the vehicle speed, and /climate, which can be used to change the cabin climate. Essentially, SPAPI serves as a gateway between web clients (like apps on a tablet) on one side, and in-vehicle control and monitoring applications on the other side. More importantly for the purposes of this paper, since SPAPI enables crucial customer-facing applications, considerable effort is spent in ensuring its quality.

Testing SPAPI requires a dedicated team of 2-3 full-time engineers. As shown in Figure 1 (left), when new APIs are released, the team first (1) reviews the API specifications. They then (2-3) consult multiple documentation sources to understand the associated vehicle states, (4-5) organize this information to determine appropriate mocks and test inputs, and (6-7) write and integrate test cases into a nightly regression suite. Finally, they assess results (8), particularly test failures, to identify valid problems. Notably, as highlighted in Figure 1, most of this process is still performed manually.

These observations prompt the question – why is such intense manual effort needed to test an arguably simple gateway server? The main reasons are structural. First, as a gateway, SPAPI's engineering spans multiple teams with overlapping responsibilities. The three core components – the server, vehicle state system, and mocking system – are developed by separate teams, while testing falls to a fourth team that must interpret disparate documentation from each. Second, SPAPI bridges web applications and traditional in-vehicle systems, which differ fundamentally in documentation style. SPAPI APIs are specified in Swagger, making them machine-readable, whereas vehicle states are documented in a mix of natural and formal languages, often requiring human interpretation. Third, SPAPI testers rely heavily on implicit knowledge built over years to manage inconsistencies across systems and teams, leading to highly specialized expertise that intensifies manual effort and complicates team turnover.

In SPAPI testing, the potential for full automation presents two significant benefits: (1) SPAPI testing can be fully automated, increasing the cadence with which APIs can be delivered to customers, and (2) SPAPI testers can be unburdened of their tedious job, allowing their creative talents to be applied elsewhere. Our observations on SPAPI highlight that full automation is not only beneficial but essential under certain test process conditions. Specifically, full automation is crucial in scenarios where testers function as a "glue" between tools, systems, and stakeholders in tasks that rely on judgment rather than creativity. Here, automation enhances engineering quality while improving the testers' experience. Additionally, in testing workflows with extensive manual steps, partial automation offers limited gains, reinforcing the need for a comprehensive, all-or-nothing automation approach. Furthermore, when testers navigate legacy processes weighed down by technical debt, partial debt mitigation falls short; complete automation is necessary to address and eliminate debt effectively, benefiting both testers and the organization as a whole.

Recognizing these advantages and the rapid advancements in LLMs for automating manual processes, we explore the central question: can LLMs serve as the key to fully automating a largely manual test process? To address this, we make the following contributions:
1) We argue that a test process with clearly decomposed tasks, many of which are executed manually, is a prime candidate for complete automation based on LLMs.
2) When these criteria are satisfied, we propose a recipe for full automation that involves (a) retaining the test process structure, (b) leveraging LLMs as a general-purpose tool to automate each manual step, and (c) combining LLMs with conventional automation when required.
3) We present in-vehicle web server testing as a case study, illustrating how a real-world testing process aligns with our criteria and demonstrating its full automation using our proposed recipe.
4) As the test process structure remains largely intact, we highlight how evaluating the quality of AI-driven automation can be simplified by independently assessing each step where an LLM is applied.

As the following sections will demonstrate, using a real industrial example of in-vehicle embedded software testing, we show that a manual process like SPAPI testing can be fully automated (see Figure 1) to deliver practical improvements.

II. BACKGROUND

Since SPAPI is a web server that exposes REST APIs, our case study falls within the ambit of API testing [12]. Aspects of the SPAPI test process are therefore recognizable within the larger universe of API testing, but there are also several case-specific adjustments, which we now highlight.

Fig. 2. A comparative illustration of the SPAPI architecture – (1) a web server in the classic three-tier architecture, (2) SPAPI in a real in-vehicle embedded system, and (3) SPAPI in a test rig with vehicle state mocked by a Virtual Vehicle (VV) system. Compared to traditional API testing, vehicle API testing requires not only verifying the API's responses but also checking the vehicle's status.

A. System architecture

As jointly illustrated in Figures 2 and 3, SPAPI follows the typical 3-tier architecture of decoupling presentation [13], business logic, and data, each of which we discuss below.

Presentation – Like any web server, SPAPI presents RESTful endpoints with GET and PUT methods and JSON payloads/responses. Each API transacts an object of the form S = {(k_i, v_i)}_{i=1}^N, with N attribute-value pairs. Each pair (k_i, v_i) in the object corresponds to some vehicle state (k*_i, v*_i) that is managed by a control or monitoring application deployed in an Electronic Control Unit (ECU) in the vehicle. Figure 3 shows an example where the /speed endpoint provides a GET method that returns the instantaneous speed of the vehicle which, in turn, is calculated by a SpeedEstimation application in a vehicle master control ECU. The same figure also illustrates the /climate endpoint with a PUT method that sets different cabin climate states by communicating with an ACControl application in a climate control ECU. Thus, the essence of SPAPI is presenting APIs for reading or writing an object S = {(k_i, v_i)}_{i=1}^N. This corresponds to interacting with vehicle states S* = {(k*_i, v*_i)}_{i=1}^N managed by applications distributed across the in-vehicle embedded system.
Fig. 3. Three tiers of SPAPI operation: (1) presentation - SPAPI objects, (2) data access - CAN signals, and (3) data - vehicle states.

Data and data access – The typical web server may hold its data in a database, but, clearly, 'data' for SPAPI is vehicle state information managed by different in-vehicle control applications. As shown in Figure 2, these in-vehicle applications are distributed across several ECUs, interconnected using Controller Area Network (CAN) links. While the typical web server may access data by executing database queries, SPAPI accesses data by exchanging CAN signals S' = {(k'_i, v'_i)}_{i=1}^N with in-vehicle applications. A CAN signal is a pre-defined typed quantity sent through a CAN link between designated sender and receiver applications. In the simplest case, each vehicle state (k*_i, v*_i) maps to one CAN signal and value pair (k'_i, v'_i), which SPAPI sends or receives to access the state. We also clarify that this case study focuses upon testing SPAPI in a rig, and not in the real vehicle. In the test rig (see Figure 2), vehicle state is emulated by a Virtual Vehicle (VV) system, which maintains the superset of all N vehicle states S* = {(k*_i, v*_i)}_{i=1}^N in a single table, emulating the state managed by distributed control applications. To maintain consistency of interaction, VV allows state (k*_i, v*_i) to be accessed using the same CAN signal (k'_i, v'_i) that SPAPI uses in the real vehicle. In addition to easing testing using virtual means, unlike many other API testing cases, VV offers the advantage of being able to freely mock vehicle state for testing purposes. Due to the continuous evolution of CAN signals and the VV platform, it is essential to monitor the vehicle's state to accurately capture relevant state changes.

API logic – Since SPAPI is a gateway, the logic for each endpoint is relatively lean. When a client invokes an endpoint, SPAPI does the mapping (k_i, v_i) → (k'_i, v'_i) of each attribute-value pair in the API object to the corresponding CAN signal-value pair. Then, by sending or receiving the CAN signal and value (k'_i, v'_i), SPAPI reads or writes the corresponding vehicle state (k*_i, v*_i). Based upon the result of state manipulation, SPAPI sends an appropriate response to the client.

B. Current manual API testing

The current manual workflow for API testing, as shown in Figure 1, involves steps such as: understand the API specification, look up related information, write test cases, run and assess the test cases. Specifically, the tester should first identify the specific object set S by understanding the documentation. Following this, the tester will retrieve the corresponding CAN signal documentation S' and the VV system documentation S*. It is crucial to ensure that each attribute in S can be mapped to both S' and S*. This means verifying that every attribute can be converted into a CAN signal and can be simulated in the VV system, and testers can write test cases based on the matched results. Typically, two key aspects need to be checked during API testing. The first aspect is to verify whether the virtual vehicle's state aligns with expectations after setting certain attributes to specific values via the API:

    S* ← PUT(S),   S* =? S_expected    (1)

The second aspect is to check whether the API returns the expected values under a specific virtual vehicle state:

    S = GET(),   S =? S_expected    (2)

In the following content, we will introduce the details of each step.
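As a concrete reading of checks (1) and (2), the following pytest-style sketch mirrors the structure of the generated test in Figure 5(d). The api_client and vv fixtures are assumed to be provided by the test rig, as in that figure; the expected values and the response interface are invented for illustration.

    # Sketch of checks (1) and (2); fixtures and values are illustrative.
    import json

    def test_put_sets_vehicle_state(api_client, vv):
        # (1): S* <- PUT(S), then verify S* == S_expected in the VV system
        api_client.put(url="/api/climate",
                       data=json.dumps({"type": "Climate", "acMode": "ECONOMY"}))
        assert vv.climate_control.apiacmode_rqst == 1

    def test_get_returns_expected_state(api_client, vv):
        # (2): S = GET(), then verify S == S_expected
        response = api_client.get(url="/api/climate")
        assert json.loads(response.text)["acMode"] == "ECONOMY"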
1) Understand API specification: Test engineers need to understand the API documentation to extract the basic objects of the API. The documentation, like a Swagger file, always details each API's essential information, such as all available endpoints, expected request formats, and possible response formats for each endpoint. Additionally, Swagger defines the data structures used in the API, including objects, properties, and their types. An example of a Swagger file snippet describing the Climate object is shown in Figure 5(a). In this file, testers should parse the object's acMode and its corresponding details in the pairs. In summary, a thorough manual understanding of the API documentation is essential for constructing a comprehensive object set S = {(k_i, v_i)}_{i=1}^N from the original system documentation.

2) Retrieve related information: After obtaining the attributes and values corresponding to the object, denoted as S, it is necessary to search for related documentation, including the information about CAN signals and the details about the virtual vehicle. The search process is illustrated in Figure 4. First, the tester needs to locate the relevant CAN signal documentation from the CAN signal table. Then, by matching the corresponding key and value, the original state S is converted into the CAN signal S'. Afterward, the relevant virtual vehicle documentation is consulted, and the corresponding key and value are mapped to obtain the specific operation S* that needs to be performed on the VV.

Look up information: The three main components of SPAPI testing are the server, the vehicle state system, and the VV system. Correspondingly, system information, CAN signal specifications, and mocking documentation need to be retrieved.
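The two lookups can be pictured as chained table accesses. The sketch below is a toy rendering of the flow in Figure 4; the table contents echo the matching results in Figure 5(b) but are otherwise invented, and the real tables are large documents rather than Python dictionaries.

    # Toy rendering of Fig. 4: API pair -> CAN pair -> VV pair.
    can_table = {("acMode", "ECONOMY"): ("APIACModeRqst", "LOW")}   # S -> S'
    vv_table = {("APIACModeRqst", "LOW"): ("apiacmode_rqst", "1")}  # S' -> S*

    def to_vv_state(api_key, api_value):
        can_pair = can_table[(api_key, api_value)]  # look up CAN signal
        return vv_table[can_pair]                   # look up VV state

    assert to_vv_state("acMode", "ECONOMY") == ("apiacmode_rqst", "1")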
Fig. 4. The process of setting and getting vehicle status according to the API information.

When testing an attribute, the corresponding values should be looked up in both the CAN signal and VV tables. For example, suppose our goal is to set the vehicle's status to ECONOMY. First, we locate the relevant attribute acMode in the system documentation S. Then, we look up acMode in the CAN signal table S' and find its corresponding value for ECONOMY, which might be 1. We then transmit this information to the VV system via the CAN signal. Subsequently, in the VV system, we read the corresponding CAN signal and look up the VV table S* to find the value of acMode under the ECONOMY state, which might be 2. Finally, we set the value of acMode to 2 in the VV system, achieving the desired vehicle state of ECONOMY.

Information organizing: In automotive systems, to transmit signals via CAN and utilize the VV system correctly, we need to ensure that each attribute and its corresponding value in the system document S = {(k_i, v_i)}_{i=1}^N can be looked up in the CAN signal specifications to obtain S' = {(k'_i, v'_i)}_{i=1}^N. Simultaneously, each attribute and its value in S' should be looked up in the mocking documentation to get S* = {(k*_i, v*_i)}_{i=1}^N. Formally, our goal is to find a mapping such that:

    ∀(k_i, v_i) ∈ S, ∃(k'_j, v'_j) ∈ S' where (k_i, v_i) → (k'_j, v'_j)
    ∀(k'_i, v'_i) ∈ S', ∃(k*_k, v*_k) ∈ S* where (k'_i, v'_i) → (k*_k, v*_k)    (3)

This ensures that every key k_i from S maps to a corresponding key in S' and every key k'_i from S' maps to a corresponding key in S*.

However, these three components are developed by different teams, and the corresponding document tables may not match exactly; e.g., the names of the attributes in each table may not be consistent, since some attributes are recorded using a mixture of natural language and formal language. Table I summarizes 5 common types of records with different forms. Besides, there can be discrepancies in the number of values for an attribute. For example, the acMode attribute may have two states, STANDARD and ECONOMY, in the system document, but there are 3 modes (also TURBO) in the CAN signal specification. In such cases, it is also necessary to match the values with equivalent meanings. Moreover, there are instances of missing attributes, where a corresponding mapping key cannot be found. Since such issues are diverse and irregular, testers need to carry out such fuzzy matching cautiously based on their own knowledge and experience.

TABLE I
5 TYPES OF PROBLEMS THAT REQUIRE FUZZY MATCHING.

Category                  Example Key 1        Example Key 2
Spelling errors           DriverTimeSetting    DriverTimeSeting
Abbreviations             standard             STD
Similar writing formats   standard mode        STANDARDMODE
Logical equivalents       OFF                  NOT ON
Semantic equivalents      AutoStart            AutoLaunch

3) Write Test Cases: Based on the organized information, testers can write reasonable and comprehensive test cases. Specifically, the two main methods of a vehicle API, PUT and GET, need to be tested separately. The PUT method is used to set the car's state, while the GET method is used to retrieve the car's current state. To verify the effectiveness of the PUT method, we set the car's state to S using the PUT method and then check whether all the virtual vehicle's states S* in the VV system are as expected.

To verify whether the GET method is valid, we directly call the GET method to retrieve the car's current states, and check if the retrieved states S match the expected states.

The process of writing test cases requires testers to have a comprehensive understanding of the organized information and a background in computer science, such as ensuring the correctness of data types in test cases. In addition, testers need to consider as many test situations as possible to ensure high coverage of test cases. Finally, testers write the test code to execute the test cases.

4) Running code and Evaluating results: Once the environment and code are prepared, the code can be executed to automatically test the API. Existing test frameworks and tools can be used to organize the results, allowing testers to directly obtain the final outcomes.

C. Obstacles to automation

Based on the current API testing process, the obstacles to achieving automated API testing can be summarized as follows:

• Fuzzy Matching: Since the system information, CAN bus specifications, and VV documentation are recorded by different teams, the names of the attributes (i.e., the keys in the tables) are sometimes inconsistent, as shown in Table I. In addition, the value also needs to be mapped based on the semantics of the key. For instance, the attribute isAlarmActive may be TRUE/FALSE in system files but Active/Inactive in CAN specifications. Such inconsistency makes it difficult to achieve exact matching, necessitating the implementation of a fuzzy matching mechanism.

• Informal Pseudocoded Mappings: In the CAN signal table, data is often represented in the form of informal pseudocode, leading to situations where a single key-value pair maps to multiple counterparts. For example, activating the car's alarm clock requires setting the
attribute and value as {AlarmActive:True}. However, the corresponding data in the CAN signal table could be represented as {AlarmClockStat:Active OR AlarmClockStat:Ringing OR AlarmClockStat:Snoozed}. In this scenario, it is necessary not only to parse the CAN signal table but also to match the original key-value pair with each entry in the CAN signal table. This requires recognizing and parsing these pseudocode forms and being able to handle one-to-many mappings (see the sketch after this list).

• Inconsistent Units: Automotive values are often associated with units, such as speed, which can be measured in km/h or m/s. When units are inconsistent, direct mapping of values between tables is not possible. Values must be converted to the corresponding units before mapping. Thus, the variety of units and the different conversions required between them make detecting unit inconsistencies and performing conversions a major challenge.

• Inter-Parameter Dependencies: Parameters often have complex interdependencies, requiring coordinated settings. For example, the attribute alarmTime might be represented as a date-time string in system files, but in CAN files, it might need to be mapped separately to hours and minutes. Capturing and managing these parameter relationships is not an easy task.
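As a rough illustration of the pseudocode obstacle, the sketch below parses "Signal:Value OR Signal:Value" entries into a one-to-many mapping. The grammar is a guess at the documented convention; the real entries are more varied than this toy parser admits.

    # Guessing at the informal "Signal:Value OR Signal:Value" convention.
    def parse_pseudocode(entry):
        mapping = {}
        for alternative in entry.split(" OR "):
            signal, value = alternative.split(":", 1)
            mapping.setdefault(signal.strip(), []).append(value.strip())
        return mapping

    parsed = parse_pseudocode(
        "AlarmClockStat:Active OR AlarmClockStat:Ringing OR AlarmClockStat:Snoozed")
    # -> {"AlarmClockStat": ["Active", "Ringing", "Snoozed"]}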
III. FULLY AUTOMATED SPAPI TESTING WITH LLMS

This section presents the details of our automated testing tool, SPAPI-Tester, which integrates with LLMs to fully automate the entire API testing process. The overall process, as shown in Figure 5, can be divided into four main steps. These steps are detailed in Process 1.

Process 1: Overall Workflow of SPAPI-Tester
1 TestTracker = InitializeTestTracker()
2 For APISpec in List(APISpecifications)
3     S = ExtractTestObjects(APISpec)
4     S' = APIToCANMapping(S, CANTable)
5     S* = CANToVVMapping(S', VVTable)
6     TestCases = GenerateTestCase(S, S*)
7     TestCode = WritingTestCode(TestCases)
8     TestTracker.analyzeTestRun(TestCode)
9 TestReport = PushToTestRepo(TestTracker)

After initializing the test tracker (line 1), the entire testing process is divided into four parts: (1) Documentation understanding (line 3): This part involves identifying test objects based on the API specifications. (2) Information matching (lines 4, 5): This part entails looking up the relevant CAN table and virtual vehicle documents to match all these objects. (3) Test case generation (line 6): Using the matched data, this step focuses on generating test cases for the API's return results and verifying the virtual vehicle's status. (4) Executing test cases and generating test reports (lines 7, 8, 9).

A. Documentation understanding

The purpose of documentation understanding is to extract the test objects from the API documentation. Standard API documentation, commonly in YAML or JSON format [14], as shown in Figure 5(a), is structured to list attributes and values associated with various objects. This structured format lends itself well to template-based parsing. We parse these documents and use predefined templates to extract the relevant attributes and values. Based on existing templates [15], we define a few simple and common rules to ensure the method's general applicability. These templates focus on fundamental elements, such as endpoint names, attribute names, and data types. Additionally, if sample API calls are provided in the documentation, we extract these directly to test the basic accessibility and functionality of the API.

However, using templates alone is insufficient for determining reasonable attribute values. We have identified the following issues with relying only on templates:

(1) Cannot utilize attribute descriptions: API documentation often includes natural language descriptions of attributes that templates cannot interpret or utilize. These descriptions typically contain constraints on the attributes, which are crucial to prevent generating incorrect values.

(2) Lack of robustness: API documentation can sometimes be informal or inconsistent. For instance, attributes of enumeration types are usually presented as ["STANDARD", "ECONOMY"], but some documents might incorrectly use "STANDARD or ECONOMY". Using templates alone makes it difficult to address these random and informal issues effectively.

To overcome these two issues, we introduce LLMs to enhance the process. LLMs are utilized to analyze the entire API documentation, leveraging natural language descriptions to understand attribute constraints more effectively. Since LLMs are capable of semantic understanding, they also mitigate the impact of informal formatting or inconsistencies. This allows the system not only to parse API properties but also to map them to CAN signals, which is covered in detail in the subsequent section. LLMs further generate constraints based on attribute descriptions, producing reasonable values within these constraints. The contextual insights provided by LLMs help create a broader set of valid test values, thereby improving the coverage and reliability of our test cases.

In practice, to ensure the stability of LLM outputs and reduce the effect of specific prompt formulations, we employ DSPy [16] to automate prompt optimization. DSPy enables us to write declarative LLM invocations as Python code. Figure 6 illustrates a simplified example of one of our prompts, along with the DSPy Signature. This APIPropertyToCANSignal signature outlines the process of converting structured API properties to CAN signals, which automates the time-consuming task of constructing an API property (k_i, v_i) and mapping it to a corresponding CAN signal (k'_i, v'_i).

To further improve the accuracy and ease of extracting structured data from the LLM, we format the LLM inputs and outputs as dictionaries. We define dictionary-based prompt templates to make tasks more comprehensible for the LLM [17], as demonstrated in Figure 7. By specifying the expected output fields, the signature directs the LLM to navigate inconsistencies in documentation and accurately associate API properties with CAN signal values. Furthermore, by typing fields in the signature, we enable the use of a TypedPredictor in DSPy, which validates the LLM response. If the response does not conform to the specified types, DSPy re-prompts the LLM, repeating this up to a maximum threshold until compliance is achieved. This structured approach capitalizes on the improved format adherence of LLMs, enhancing consistency and reliability.
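For the template-based side of this step, a minimal sketch of the extraction is shown below. It assumes PyYAML and applies far simpler rules than SPAPI-Tester's actual templates; it recovers only attribute names, types, and enums from a Swagger-like snippet.

    # Minimal template-style extraction; assumes PyYAML is installed.
    import yaml

    spec = yaml.safe_load("""
    ClimateObject:
      type: object
      properties:
        acMode:
          type: string
          enum: ["STANDARD", "ECONOMY"]
    """)

    extracted = {
        name: {attr: {"type": d.get("type"), "enum": d.get("enum")}
               for attr, d in schema.get("properties", {}).items()}
        for name, schema in spec.items()
        if schema.get("type") == "object"
    }
    # -> {"ClimateObject": {"acMode": {"type": "string",
    #                                  "enum": ["STANDARD", "ECONOMY"]}}}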
Doc understanding & Matching Test case gen Test code writing Test execution

(a) API Specification (b) Matching Results (c) Test Cases (d) Test Code
ClimateObject: "ClimateObject": [{ import pytest
type: object "api_property": "acMode", API response import json
description: Manipulate climate "api_property_mappings": { import time
settings on the truck. "can_signal": "ClimateAPIObject":
required: "APIACModeRqst", {
def test_put_climate(spapi_setup_teardown,
- type "vv_state": "type": "Climate",
api_client, vv):
properties: "apiacmode_rqst" "acMode":
response = api_client.put(
"ECONOMY" Jinja

Test rig
acMode : }, url="/api/climate",
type: string "api_value_mappings": [ { }
data=json.dumps({"type": "Climate",
enum: ["STANDARD", "api_value": "ECONOMY", "acMode": "ECONOMY"})
"ECONOMY"] "can_value": "LOW", Virtual vehicle )
autoFanLevel: "vv_state_value": "1"}, "ClimateVVObject":
type: string { { # Check for correct status cod==e
enum: ["LOW", "NORMAL", "api_value": ”STANDARD", "apiacmode_rqst": assert response.status_code 200
"HIGH"] "can_value": ”HIGH", "1"
isAuxiliaryHeaterActivated: "vv_state_value": ”2"},] } # Assert VV attributes to verify correct behavior
type: boolean }] assert vv.climate_control.apiacmode_rqst == 1

Fig. 5. Architecture and workflow of SPAPI-Tester: The pipeline largely preserves the manual process and selectively uses LLMs to automate discrete steps.

    class APIPropertyToCANSignal(dspy.Signature):
        """Given an API table and an API property -> CAN signal map,
        generate a list of API properties -> CAN signal(s)."""
        api_to_can_dict: dict = dspy.InputField(desc="A dictionary containing the mappings between an API attribute and its corresponding CAN signal.")
        input_example: dict = dspy.InputField()
        output_example: list[dict] = dspy.OutputField()
        mapped_api_to_can: list[dict] = dspy.OutputField(desc="A list of API properties, with corresponding CAN signals.")

Fig. 6. A DSPy Signature for automating API to CAN lookup (simplified).

    API_CAN_INPUT_EXAMPLE = {
        "ABCObject::valueOne": "CANSignal1",
        "ABCObject::valueFour::TRUE": "AASignal:BB OR PV_AnotherSignal:CC",
        "ABCObject::valueFour::False": "AASignal:AA",
    }

    API_CAN_OUTPUT_EXAMPLE = [
        {"api_property": "valueOne",
         "can_signals": [{"can_name": "CanSignal1"}]},
        {"api_property": "valueFour",
         "can_signals": [
             {"can_name": "AASignal",
              "can_mappings": [
                  {"api_value": "true", "can_value": ["BB"]},
                  {"api_value": "false", "can_value": ["AA"]}]},
             {"can_name": "PV_AnotherSignal",
              "can_mappings": [
                  {"api_value": "true", "can_value": ["CC"]}]}]},
    ]

Fig. 7. Templatized examples for guiding API to CAN look up.
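To show how such a signature might be wired up, the sketch below invokes it with a TypedPredictor. The LM configuration, model name, and field values are placeholders in the style of a DSPy 2.x setup, not the exact calls used by SPAPI-Tester.

    # Hypothetical wiring of the Fig. 6 signature; DSPy 2.x-style API.
    import dspy

    dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4o"))  # placeholder LM

    mapper = dspy.TypedPredictor(APIPropertyToCANSignal)
    result = mapper(
        api_to_can_dict=API_CAN_INPUT_EXAMPLE,  # document-derived mappings
        input_example=API_CAN_INPUT_EXAMPLE,    # few-shot guidance (Fig. 7)
        output_example=API_CAN_OUTPUT_EXAMPLE,
    )
    mapped = result.mapped_api_to_can  # list[dict]; re-prompted on type errors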
B. Information Matching

As illustrated in Figure 4, the mapping of information in our system encompasses two stages: mapping API properties to CAN signals and mapping CAN signals to Virtual Vehicle (VV) signals. These mappings are crucial for enabling signal transmission within the vehicle as well as setting or verifying the vehicle's state. Since the processes and methods for these two mappings are similar, we will detail the approach for mapping API properties to CAN signals as an example.

First, we retrieve a set of candidate CAN signal key-value pairs {(k'_i, v'_i)} from a CAN signal library solely through matching the name of the endpoint. Subsequently, we use the extracted API attributes S = {(k_i, v_i)}_{i=1}^N and the candidate CAN signals {(k'_i, v'_i)} as input to an LLM, enabling many-to-many matching between API properties and CAN signals. In many cases, attributes may have multiple enumerated values. For instance, as shown in Figure 7, an API property 'valueFour' might take the values 'True' or 'False', while the corresponding CAN signal might represent these states as 'AA' and 'BB'. This type of mapping is common, and to increase the stability of SPAPI-Tester, we utilize a separate DSPy module specifically for matching enumerated values. The input consists of enumerated values from both the API property and the CAN signal, and the output is a mapping of these values.

As discussed in Section II.C, there are several challenges in the mapping process. First, for fuzzy matching, the LLM's strong semantic understanding is well-suited to handle these cases. Second, for pseudocode mappings, we enhance template robustness by embedding examples directly into the prompt, as shown in Figure 7. For example, we map "AASignal:BB OR PV_AnotherSignal:CC" to "can_value": "BB", thereby minimizing document noise while extracting relevant information. Third, for unit inconsistencies, we apply a dedicated DSPy module that uses a Chain-of-Thought (CoT) approach to extract and normalize units within values. This module converts units (e.g., 'kW' to 'Kilowatts'), ensuring unit alignment in the test case generation phase.

The final output is structured as a list, as defined in Figure 6, with each element containing a fully matched pair. The same approach is then applied to map information between CAN signals and VV signals, ultimately yielding complete matching results S' and S*.
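As a toy illustration of the unit-normalization idea: detect a unit and convert it before matching. The real system delegates this to the dedicated DSPy CoT module described above; the conversion table below is an invented stand-in.

    # Table-driven stand-in for the unit-normalization step.
    UNIT_FACTORS = {("kW", "W"): 1000.0, ("km/h", "m/s"): 1000.0 / 3600.0}

    def normalize(value, unit, target_unit):
        if unit == target_unit:
            return value
        return value * UNIT_FACTORS[(unit, target_unit)]

    assert normalize(1.5, "kW", "W") == 1500.0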
C. Test case generation

After matching all the necessary information, we integrate these details into a structured document, as illustrated in Figure 5(b), which then serves as the basis for generating test cases. Given the need to address multiple constraints during test case generation—such as unit consistency—we employ a stepwise CoT approach to progressively incorporate these constraints. Specifically, for inconsistent units, we prompt the LLM to identify relationships between units and perform any necessary conversions. For inter-parameter dependencies, the LLM captures relationships among parameters, ensuring compatibility and avoiding value conflicts. Additionally, the LLM identifies property types and manages specialized formats, such as date-time strings. Finally, we guide the LLM in handling cases common in industrial contexts, such as shared CAN signals among multiple properties or specific constraints on value ranges.

To ensure these constraints are applied consistently, we leverage DSPy's TypedChainOfThought method, which consolidates all conditions within a single prompt. Figure 8 provides a simplified example of this prompt. For ease of use, we specify that the module outputs test cases in dictionary format, as depicted in Figure 5(c).

    Let's think step by step to generate the values for the API properties.
    1. Identify dependencies and rules in the API spec, such as
       a property setting the unit for another property.
    2. Set property values based on descriptions to maintain
       consistency among dependent properties.
    3. Set values for properties:
       - For strings, follow the format (e.g., date-time, enum)
         and choose a random value.
       - For numbers, select a value based on 'can_min',
         'can_max', and 'can_resolution'.
       - For properties with the same CAN signal, set values
         using logical consistency and dependency rules.

Fig. 8. Chain-of-Thought prompt for test case generation (simplified).

After generating the test cases, we use them to create test code. The test code generally consists of two sections: a setup section, which includes essential elements such as package imports and requests to enable program execution, and a validation section containing assertions. Since the setup code remains consistent across tests, we design distinct Jinja¹ templates for PUT and GET test cases. Using a simple code renderer, we inject the generated API and VV test objects into the Jinja template to render the Pytest test case. Figure 5(d) shows an example test case rendered by the test-writing module.

¹ https://palletsprojects.com/projects/jinja/
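As a sketch of the rendering step, the toy template below injects a generated API object and the expected VV state into a PUT test. The template text is a simplified stand-in for the project's actual Jinja templates.

    # Toy Jinja rendering of a PUT test; requires jinja2.
    from jinja2 import Template

    PUT_TEMPLATE = Template(
        "def test_put_{{ endpoint }}(spapi_setup_teardown, api_client, vv):\n"
        "    response = api_client.put(url=\"/api/{{ endpoint }}\",\n"
        "                              data=json.dumps({{ api_object }}))\n"
        "    assert response.status_code == 200\n"
        "    assert vv.climate_control.{{ vv_key }} == {{ vv_value }}\n"
    )

    code = PUT_TEMPLATE.render(endpoint="climate",
                               api_object={"type": "Climate", "acMode": "ECONOMY"},
                               vv_key="apiacmode_rqst", vv_value=1)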
D. Executing test cases and generating test reports

To ensure the automation of the entire process, the system automatically executes the test code on the test rig [18], and then generates a comprehensive test report. This report documents the details of the automated testing process, including the test objects, the matching results, the generated test cases, and the execution logs. Such documentation ensures that our system maintains a high level of transparency, rather than functioning as a black box.

IV. EXPERIMENTS

Our evaluation investigates the following questions.

RQ1: What are the pass rate, coverage, and failure-detection capability of the test cases generated by the SPAPI-Tester?
RQ2: To what extent can LLMs overcome the obstacles outlined in Section II.C to achieve end-to-end automated testing?
RQ3: How efficient is this automated API testing?
RQ4: How effective is SPAPI-Tester in testing real-world industrial APIs?

Specifically, RQ1-RQ3 focus on ablation studies of SPAPI-Tester, using controlled experiments to evaluate its capabilities and performance. RQ4 examines the application of SPAPI-Tester in the real-world, industrial setting with newly developed (and thus guaranteed to be unseen) APIs to demonstrate the effectiveness of our end-to-end automated testing system.

A. Experimental Setup

In this section, we describe our experimental setup.

1) Subjects: Our research focuses on automating vehicle API testing within an industrial setting, addressing unique challenges such as inconsistencies across documentation and system specifications. As no existing methods directly address these issues in vehicle API automation, we could not compare our approach with general API testing techniques, as they lack the capability to handle the specific requirements of our industrial setting.

We evaluated the quality of generated test cases for 41 truck APIs using metrics such as pass rate and coverage. To assess SPAPI-Tester's error detection capabilities, we annotated an additional 109 APIs developed by a leading vehicle manufacturer. These APIs were supported by system documentation from in-house truck experts, CAN signal protocols from the CAN-bus team, and virtual vehicle documentation from the Virtual Vehicle team.

We tested four LLMs: two classic models—GPT-3.5-turbo (OpenAI, 2023-07-01-preview) and LLaMA3-70B (2024-04-18)—and two recent advancements, GPT-4o (2024-05-13) and LLaMA3.1-70B (2024-07-23). To ensure flexibility and reduce maintenance, we opted not to fine-tune these models with company-specific data, allowing seamless adaptation to new models or data without retraining.
TABLE II
PASS RATE ON DIFFERENT TYPES OF APIS.

API Type             Num.   GPT-3.5   LLaMA3   LLaMA3.1   GPT-4o
Energy               8      0.88      1.0      0.88       1.0
Driver Settings      6      0.83      0.83     1.0        0.83
Visibility Control   11     0.91      1.0      0.91       1.0
Software Control     3      1.0       1.0      1.0        1.0
Vehicle Condition    9      1.0       1.0      1.0        1.0
Other                4      1.0       1.0      1.0        1.0
Total/Average        41     0.93      0.98     0.95       0.98

2) Metrics: We evaluate our SPAPI-Tester both at the API level and at the test case level. At the API level, we use the pass rate of the APIs as our metric. If all generated test cases for a given API pass the tests, we consider that API to have passed. Conversely, if any test case fails, the API is considered to have failed. Therefore, the pass rate is defined as the proportion of APIs that pass the tests.

At the test case level, we primarily assess the quality and coverage of the generated test cases. For these evaluations, we employ precision and recall as our key metrics. Precision measures the quality of the test cases generated, while recall measures their coverage of API properties.
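Written out as code, these metrics reduce to a few lines; the data shapes below are illustrative, not SPAPI-Tester's internal format.

    # The evaluation metrics, spelled out. Data shapes are illustrative.
    def pass_rate(results_per_api):
        """results_per_api: {api_name: [bool per generated test case]}"""
        return sum(all(r) for r in results_per_api.values()) / len(results_per_api)

    def precision_recall(generated, ground_truth):
        """Compare generated test cases against expert ground truth."""
        generated, ground_truth = set(generated), set(ground_truth)
        hits = generated & ground_truth
        return len(hits) / len(generated), len(hits) / len(ground_truth)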
B. Pass Rate, Coverage, and Failure Detection (RQ1)

Pass rate: Since APIs with similar functions typically call the same electronic control unit (ECU) in embedded systems and, thus, share documentation within the same domain, we grouped 41 truck APIs into 6 categories based on their functions to present the results more clearly. Table II details the pass rates for each category.

These 41 APIs are online and pre-verified, ensuring that any failures observed during testing were due to issues within the generated test cases or code. Results show that for the majority of categories, all the APIs can pass the tests successfully, with all four LLMs achieving high pass rates. Notably, SPAPI-Tester achieved a 98% pass rate when using LLaMA3 and GPT-4o, demonstrating the method's accuracy in generating valid test samples. However, GPT-3.5 exhibited slightly lower performance in handling structured input-output, failing in two cases due to improper CAN connection settings. Additionally, a common error across all LLMs stemmed from missing unit descriptions in API specifications. For example, when documentation omitted units for battery power, LLMs incorrectly defaulted to watts (W) instead of kilowatts (kW), leading to test case failures. Broad patterns of errors like this could likely be addressed by further refining the prompts.

Coverage: In addition to pass rate analysis, we evaluated the coverage of generated test cases to assess whether they adequately test each API. A vehicle expert group was invited to create ground truth test cases for 12 representative APIs, each including 5 to 30 test cases across 4 categories. The results are presented in Table III. All LLMs demonstrated high precision, with precision rates exceeding 0.97 across the board and reaching 1.0 for half of the APIs, showcasing the high quality of test cases generated by our model. For cases where precision was below perfect, errors originated from limitations in the fuzzy matching step.

However, recall rates did not reach optimal levels, primarily due to missing information in the API documentation, such as absent units or variable types for some attributes. To maintain high precision, SPAPI-Tester skips samples that lack sufficient context for accurate matching, resulting in a recall loss of approximately 15 percentage points. All untested attributes are logged in the testing report, allowing developers to trace and address these underlying issues.

Failure detection: To further assess the effectiveness of the generated test cases in detecting failures, vehicle experts labeled 109 additional truck APIs (under development), identifying 38 as buggy. SPAPI-Tester created test cases that successfully detected all buggy APIs with only four false positives, achieving a 96% accuracy rate.

All models performed comparably, highlighting that our stepwise, structured pipeline design reduces dependence on specific LLM choices. We seamlessly migrated SPAPI-Tester to different LLMs without requiring additional adaptation. This largely model-agnostic pipeline design allows us to focus on refining the testing process rather than selecting specific LLMs, given the abundance of options.

C. LLMs' ability to overcome obstacles (RQ2)

Fuzzy matching presents a significant challenge in automated API testing. We categorized common fuzzy matching examples into five classes, selecting 20 test samples per class and supplementing with manually written samples where needed. The results, shown in Table IV (upper part), indicate that all models achieved high precision rates, highlighting the LLMs' capability to accurately recognize and match fuzzy inputs, a key requirement for full automation. For semantic equivalents, logical equivalents, and similar writing formats, all the models attained an accuracy of 1.0 or nearly so, demonstrating their strong pattern matching abilities in semantics and logic. However, for spelling errors, accuracy slightly dropped as some errors altered word semantics, like mistaking date for data. In the abbreviations category, some abbreviations were too short to discern, complicating the matching process.

For the inconsistent units issue, we selected 200 samples for this experiment. The results in Table IV (lower part) indicate that while SPAPI-Tester achieves a high precision rate, the recall remains suboptimal. The reason is that some documentation explicitly annotates units for each attribute, while other documentation omits these details. In these cases, it becomes necessary to infer the units based on descriptions or other contextual information, which can affect the performance.

Another notable challenge is informal pseudocoded mappings, where a single test case may correspond to multiple values. We selected 100 representative test cases for this experiment. Each test case consists of two sets with multiple (key, value) pairs, and the goal is to map elements between these sets as accurately and comprehensively as possible. To increase complexity, we intentionally selected test cases where the sets contained different numbers of elements, creating scenarios where matching was non-trivial and recall could fluctuate. To explore this, we conducted experiments under both strict (precision-focused) and relaxed (recall-focused) matching conditions. Examples reflecting different levels of strictness were included in the prompts, and the strictness level (e.g., strict, moderate, relaxed) was explicitly stated in the prompts. The results are presented in Figure 9.
TABLE III
TEST CASE COVERAGE OF DIFFERENT TYPES OF APIS. 'P' IS PRECISION, 'R' IS RECALL, AND 'F1' IS THE F1 SCORE.

                     GPT-3.5           LLaMA3            LLaMA3.1          GPT-4o
API Type             P     R     F1    P     R     F1    P     R     F1    P     R     F1
Energy               0.96  0.69  0.78  0.98  0.76  0.85  0.96  0.74  0.84  0.96  0.79  0.87
Visibility Control   0.97  0.70  0.78  0.96  0.70  0.79  0.97  0.74  0.84  0.96  0.80  0.87
Vehicle Condition    1.0   0.95  0.97  1.0   0.9   0.95  1.0   0.95  0.97  1.0   0.95  0.97
Other                1.0   0.63  0.77  1.0   0.85  0.92  1.0   0.83  0.91  1.0   0.80  0.89
Average              0.97  0.73  0.80  0.98  0.79  0.88  0.98  0.81  0.89  0.97  0.85  0.90

TABLE IV
PERFORMANCE ON DIFFERENT TYPES OF FUZZY MATCHING (UPPER PART) AND INCONSISTENT UNITS (LOWER PART).

                         GPT-3.5           LLaMA3            LLaMA3.1          GPT-4o
Category                 P     R     F1    P     R     F1    P     R     F1    P     R     F1
Spelling errors          0.89  0.76  0.82  0.92  0.73  0.81  0.91  0.78  0.84  0.91  0.83  0.87
Abbreviations            0.93  0.68  0.79  0.88  0.74  0.80  0.92  0.75  0.83  0.98  0.74  0.84
Similar writing formats  0.95  0.95  0.95  1.0   0.95  0.97  1.0   0.95  0.97  0.95  0.95  0.95
Logical equivalents      1.0   0.75  0.86  0.95  0.78  0.86  0.92  0.70  0.80  0.95  0.75  0.84
Semantic equivalents     1.0   0.70  0.82  1.0   0.73  0.84  0.94  0.73  0.82  1.0   0.70  0.82
Average                  0.95  0.77  0.85  0.95  0.79  0.86  0.94  0.78  0.85  0.96  0.80  0.87
Inconsistent Units       0.95  0.67  0.79  0.95  0.59  0.73  0.95  0.67  0.79  0.98  0.70  0.82

[Figure 9: precision-recall curves for GPT-3.5, LLaMA3-70B, GPT-4o, and LLaMA3.1 under varying matching strictness; precision axis spans roughly 0.88 to 1.00, recall axis 0.1 to 0.9.]
Fig. 9. Matching performance on informal pseudocoded mappings.

The experimental results indicate that under very strict conditions, precision can reach up to 100%; however, recall drops significantly, even below 20%. As the conditions are relaxed, precision slightly decreases, but recall increases substantially, reaching up to 55%. Under the most relaxed conditions, recall rates for all the models approach 90%. This demonstrates that our method can achieve high recall rates while maintaining a high level of precision.

D. Time efficiency (RQ3)

In practical industrial scenarios, time consumption is an important criterion for measuring tool efficiency. Therefore, we measured the total time and the time taken at each stage of the SPAPI-Tester in the testing process. Given that the LLaMA model relies on local computational resources and that the processing speeds of GPT-3.5 and GPT-4o do not significantly differ in this pipeline, we report only the results for GPT-3.5. We calculated the average time spent on testing all APIs. Additionally, we separately computed the time for the two major types of requests, i.e., PUT and GET. The results are shown in Table V.

TABLE V
TIME TO GENERATE TEST CASES, PER STEP (SECONDS). DU IS DOCUMENT UNDERSTANDING; RI IS RETRIEVAL OF INFORMATION; TSG IS TEST CASE GENERATION; RUN MEANS RUNNING THE TEST CASES.

Requests   Total   DU & RI   TSG   Run
GET        55.0    6.6       3.6   44.8
PUT        56.3    6.8       4.3   45.2
Average    55.7    6.7       4.0   45.0

The results indicate that most of the time is consumed during the execution of test cases, with a significant portion dedicated to environment setup. SPAPI's complexity requires the appropriate configuration of embedded system environments, such as setting up the CAN bus for signal transmission. Additionally, the VV system needs to read CAN signals and complete the reading or setting of the virtual vehicle's state, which consumes a large amount of time.

The time required for PUT and GET requests is almost identical, as our approach batch-generates matching results or test cases for these requests, effectively minimizing time differences. In the full pipeline, the DSPy module, which leverages LLM-based capabilities, is called six times: once for documentation comprehension, four times for information matching, and once for test case generation. Additionally, DSPy's retry mechanism re-calls the module if the output does not adhere to the predefined format. On average, the entire process from initial input to test case generation takes about 11 seconds, which is remarkably fast for automated API testing.
takes approximately 0.1 to 3 FTE workdays, with most APIs properties due to incomplete API documentation (e.g., missing
requiring about two hours. They generate 5 to 30 test cases units). SPAPI-Tester reached 85% coverage with GPT-4o,
for each API. while other models ranged between 73% and 81%.
In contrast, our SPAPI-Tester achieves remarkable efficiency To evaluate failure detection, we selected 10 APIs (5 of
improvements. The system generates a complete set of test which contained known bugs) from the 109 APIs mentioned
cases for a single API in just 11 seconds, representing a in Section IV.B. Both engineers identified all buggy APIs,
dramatic reduction in time and effort. This substantial speedup although one created a test case that falsely flagged a correct
not only reduces the time and effort required for API testing API as erroneous, resulting in a recall rate of 100% and a
but also alleviates the traditionally high time burden associated precision rate of 91% for manual testing. Similarly, SPAPI-
with manual test case creation, greatly enhancing the API Tester achieved a recall rate of 100% with a slightly lower
testing process. precision of 90%.
In summary, SPAPI-Tester consistently generates high-
E. Performance on real-world industry APIs (RQ4) quality test cases, demonstrating comparable performance to
To demonstrate the capability of SPAPI-Tester in an real- manual testing in terms of pass rate, coverage, and failure
world setting, we collected 193 newly developed and unveri- detection.
fied truck APIs and their corresponding documentation from
a leading truck manufacturing facility. We then employed V. D ISCUSSION
SPAPI-Tester to conduct end-to-end automated testing, aiming On complete test process automation – Perhaps the most
to identify issues within these APIs. significant finding from this case study is that our recipe is
SPAPI-Tester identified 23 test failures. The test report capable of completely automating a real world test process.
indicates that 22 test cases failed due to issues within the API Put simply, SPAPI testing – a process that currently takes 2-
implementation, and one test case failed due to an error while 3 FTEs – has effectively been substituted by SPAPI-Tester,
parsing the API documentation. On consultation with the API a fully automatic pipeline. This success stems from com-
developers, these were determined to be legitimate bugs in the bining LLMs with conventional automation, allowing SPAPI
API implementation. The team has already started addressing testing to proceed without human intervention. Key to this
these issues upon receiving the checking results. achievement is the nature of the SPAPI test process: it is
In addition, this demonstrates that SPAPI-Tester not only well-structured, decomposable, and requires human judgment
has a high accuracy in detecting API errors but also provides but not creativity. In such cases, LLMs serve as the critical
detailed reports that help quickly identify the root causes link to full automation by systematically replacing manual
of failures. Even when SPAPI-Tester was unable to generate steps. Maintaining the existing process structure further aids
correct code, the detailed reports can help to identify the automation in two ways. First, it defines clear, verifiable steps
failure causes quickly, thereby minimizing misdiagnoses. This where LLMs can be applied. Second, preserving the status
capability significantly enhances the practical utility of SPAPI- quo ensures that automation is achievable without imposing
Tester by providing precise and actionable insights. In sum- possibly unreasonable costs of changing the test process – an
mary, these results underscore the robust practical applicability observation that is crucial for real world application.
of SPAPI-Tester in real industrial environments. On the generality of LLMs as problem solvers – Preserving
the design of the process no doubt identifies discrete tasks
F. Performance comparison with manual testing where LLMs can be used. However, the clear enabler for
To illustrate the advantages of SPAPI-Tester over man- complete automation is that the LLM automates all manual
ual API testing, we conducted a comparative evaluation. tasks with little practical regard to the actual nature of the task.
As described in Section IV.B, an expert team created Alternative automation methods exist, such as using fuzzy
ground truth test cases for 12 APIs. To measure the pass matching for inconsistent key-value mappings or a formal
rate of manual testing, two additional engineers indepen- language to specify cardinality in key-value relationships.
dently created test cases for these APIs. Results showed However, LLMs, as general problem solvers, eliminate the
that one engineer’s test cases passed 10 APIs, while the need for multiple specialized solutions, simplifying real-world
other’s passed 11. Both engineers missed one or two APIs implementations. While there is a cost to recast an LLM to
due to confusion over similar data entries. For instance, solve a specific problem – like defining prompts or signatures
attributes like reducedWeeklyRestsForCurrentWeek – the cost turns out to be manageable.
and regularWeeklyRestsForCurrentWeek proved On implications on dependent processes – If SPAPI testing
challenging for human testers to differentiate, whereas SPAPI- can be fully automated, its impact on adjacent processes
Tester’s LLMs handled them effortlessly. This led to an becomes a natural consideration. API implementation directly
average pass rate of 87.5% for manual testing at the API level, precedes SPAPI testing, while integration within user-facing
while SPAPI-Tester, with test cases generated by four different subsystems follows it. Given SPAPI’s simplicity, LLMs could
LLMs, achieved pass rates between 93% and 98%. potentially automate these dependent processes, extending
In terms of coverage, the average rate for manually created automation across much of the development lifecycle—an
test cases was 82%, with human testers occasionally skipping important step for in-vehicle software engineering. Further,
A key enabler of this achievement is the nature of the SPAPI test process: it is well-structured, decomposable, and requires human judgment but not creativity. In such cases, LLMs serve as the critical link to full automation by systematically replacing manual steps. Maintaining the existing process structure further aids automation in two ways. First, it defines clear, verifiable steps where LLMs can be applied. Second, preserving the status quo ensures that automation is achievable without imposing possibly unreasonable costs of changing the test process – an observation that is crucial for real-world application.

On the generality of LLMs as problem solvers – Preserving the design of the process no doubt identifies discrete tasks where LLMs can be used. However, the clear enabler for complete automation is that the LLM automates all manual tasks with little practical regard to the actual nature of the task. Alternative automation methods exist, such as using fuzzy matching for inconsistent key-value mappings or a formal language to specify cardinality in key-value relationships. However, LLMs, as general problem solvers, eliminate the need for multiple specialized solutions, simplifying real-world implementations. While there is a cost to recasting an LLM to solve a specific problem – such as defining prompts or signatures – that cost turns out to be manageable.
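As a point of comparison, one such specialized alternative for inconsistent key-value mappings could look like the following sketch, built on Python's standard difflib; the key names are illustrative. Every further inconsistency class (abbreviations, reordered words, unit suffixes) would need its own hand-written rule, which is precisely the proliferation of narrow solutions that a general-purpose LLM avoids.

    import difflib

    def match_key(key, candidates, cutoff=0.8):
        # Return the closest candidate key name, or None if nothing is
        # similar enough. Purely lexical: it knows nothing about meaning.
        hits = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        return hits[0] if hits else None

    spec_keys = ["vehicleSpeed", "engineCoolantTemperature"]
    # A unit suffix is already enough to push the match below the cutoff.
    print(match_key("VehicleSpeed_kmh", spec_keys))  # -> None
    print(match_key("vehicleSpede", spec_keys))      # -> "vehicleSpeed"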
On implications on dependent processes – If SPAPI testing can be fully automated, its impact on adjacent processes becomes a natural consideration. API implementation directly precedes SPAPI testing, while integration within user-facing subsystems follows it. Given SPAPI's simplicity, LLMs could potentially automate these dependent processes, extending automation across much of the development lifecycle – an important step for in-vehicle software engineering. Further, automating SPAPI-dependent applications could create a cascade of fully automated lifecycles, reshaping automotive software development. While promising, this vision comes with challenges. Our results demonstrate LLMs' ability to automate well-defined tasks and connect dependent processes, but also highlight the effort required to adapt them for specific, verifiable problems. These insights encourage further exploration toward realizing this ambitious potential.

On the transferability of this recipe – We may have showcased completely automatic testing of an in-vehicle embedded software application, but it is clear that many of our observations and findings are transferable. Our proposed criteria for automation – a decomposable process with steps requiring judgment but not creativity – can extend to other domains. Additionally, our approach involves six distinct LLM interactions: three align with general API testing workflows, while the others, though tailored to automotive scenarios, require minimal adaptation for different contexts. For example, applying this method to another vehicle manufacturer would take roughly one full workday (1 FTE). Certain aspects may also benefit web server testing. Finally, our recipe of largely preserving a test process and using LLMs to verifiably automate discrete manual steps is transferable to any test process that meets the criteria we propose.
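To give a sense of what this adaptation cost looks like in practice, the sketch below recasts one hypothetical manual step as a declarative signature in the style of DSPy [16]; the signature and its fields are illustrative, not the ones our system actually uses.

    import dspy

    # Hypothetical signature for a single judgment step: deciding which
    # vehicle signal drives a given API attribute. Adapting the recipe to
    # a new context largely amounts to rewriting such signatures.
    class MapAttributeToSignal(dspy.Signature):
        """Pick the vehicle signal that controls the given API attribute."""
        attribute = dspy.InputField(desc="API attribute name")
        candidate_signals = dspy.InputField(desc="newline-separated signal names")
        signal = dspy.OutputField(desc="name of the best-matching signal")

    map_step = dspy.Predict(MapAttributeToSignal)

Because each manual step is captured by one small, verifiable artifact of this kind, porting the process to another manufacturer is mostly a matter of editing such definitions, consistent with the one-workday estimate above.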
VI. RELATED WORK
Existing research on API testing mainly focuses on black-box and white-box testing, depending on whether the source code of the API is accessible [19]. White-box testing typically involves generating test cases to thoroughly exercise the logic within the code [20], [21]. For example, EvoMaster [22] uses the Many Independent Objective (MIO) evolutionary algorithm to optimize multiple metrics simultaneously, such as line coverage, branch coverage, HTTP status coverage, and the number of errors. Building on this, some studies have employed additional tools for code instrumentation, such as for JVM programs [23], [24] and NodeJS programs [25], [26]. Atlidakis et al. [27] calculate code coverage by pre-configuring basic block locations and use this feedback to guide test generation.

Currently, most studies focus on black-box API testing, aiming to enhance test case coverage for more comprehensive API testing [28]. Template-based methods, such as fixed test specifications and JSON schemas, are commonly used for generating accurate test cases [29], [30], [31], [32], [33]. However, these approaches struggle to capture parameter dependencies. To address this, Stallenberg et al. [34] proposed a hierarchical clustering method, while Lin et al. [35] introduced a tree-based representation of parameter relationships. Martin-Lopez et al. [36] further improved test diversity by integrating external knowledge bases to generate reasonable values. Despite these advancements, traditional methods often fail to achieve robust and comprehensive testing.
Recently, LLMs have emerged as a promising direction for API testing [37], [38]. Kim et al. [39] demonstrated the utility of LLMs in interpreting natural language API documentation to generate test values. Building on this, Le et al. [40] proposed constructing dependency graphs from documentation to enhance test coverage. Other studies fine-tuned LLMs using Postman test cases [41] or applied masking techniques to predict test values [42]. However, these methods face challenges in ensuring the validity and robustness of generated test cases [43].

Moreover, existing methods focus solely on test case generation, which is only one part of the API testing process, and do not address the automation of the entire process. In practical applications, these methods require significant manual verification. For instance, some approaches need to retrieve relevant yet often ambiguous information from external databases. These methods also lack robustness; if the API specification is missing parameters or contains minor errors, the process may fail. Unlike previous approaches, we are the first to explore the automation of the entire API testing process, focusing on current bottlenecks in API automation and considering how to leverage LLMs to address these challenges robustly.

VII. CONCLUSION

Automated API testing is a critical process in software engineering, essential for ensuring the reliability and functionality of software systems. Despite its importance, API testing is often time-consuming, labor-intensive, and prone to errors. In practical applications, API testing involves retrieving and organizing relevant documents and writing test cases based on the organized information. Due to the fuzzy matching of information across documents, manual intervention is required, hindering the automation of the entire testing process.

In this paper, we introduced SPAPI-Tester, the first system designed for the automated testing of automotive APIs. We decomposed the API testing process into a series of steps, identifying the obstacles to automation at each stage. By leveraging LLMs, we addressed these challenges, enabling full automation of the testing workflow. The results from real-world industrial API testing demonstrate that SPAPI-Tester achieves high detection accuracy, and our comprehensive experiments show that the system is highly robust and effective.

Our system offers valuable insights for other automated API testing tasks and can be extended to web server API testing. The findings underscore the potential of LLMs to transform API testing by reducing manual effort and improving efficiency, paving the way for broader adoption in various testing environments.

ACKNOWLEDGMENT

This work was partially funded by the Wallenberg AI, Autonomous Systems and Software Program (WASP), supported by the Knut and Alice Wallenberg Foundation, and the Chalmers Artificial Intelligence Research Centre (CHAIR). The authors also thank Earl T. Barr for his insightful discussions.
REFERENCES

[1] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, "Large language models for software engineering: Survey and open problems," in IEEE/ACM International Conference on Software Engineering: Future of Software Engineering, ICSE-FoSE 2023, Melbourne, Australia, May 14-20, 2023, pp. 31-53, IEEE, 2023.
[2] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. C. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," CoRR, vol. abs/2308.10620, 2023.
[3] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software testing with large language models: Survey, landscape, and vision," IEEE Trans. Software Eng., vol. 50, no. 4, pp. 911-936, 2024.
[4] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software testing with large language models: Survey, landscape, and vision," IEEE Transactions on Software Engineering, 2024.
[5] X. Chen, M. Lin, N. Schärli, and D. Zhou, "Teaching large language models to self-debug," arXiv preprint arXiv:2304.05128, 2023.
[6] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, "No more manual tests? evaluating and improving chatgpt for unit test generation," arXiv preprint arXiv:2305.04207, 2023.
[7] D. Ajiga, P. A. Okeleke, S. O. Folorunsho, and C. Ezeigweneme, "Enhancing software development practices with ai insights in high-tech companies," 2024.
[8] J. Yoon, R. Feldt, and S. Yoo, "Intent-driven mobile gui testing with autonomous large language model agents," in 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 129-139, IEEE, 2024.
[9] R. Feldt, S. Kang, J. Yoon, and S. Yoo, "Towards autonomous testing agents via conversational large language models," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1688-1693, IEEE, 2023.
[10] M. Fani Sani, M. Sroka, and A. Burattin, "Llms and process mining: Challenges in rpa: Task grouping, labelling and connector recommendation," in International Conference on Process Mining, pp. 379-391, Springer, 2023.
[11] M. Boukhlif, N. Kharmoum, and M. Hanine, "Llms for intelligent software testing: a comparative study," in Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security, pp. 1-8, 2024.
[12] A. Golmohammadi, M. Zhang, and A. Arcuri, "Testing restful apis: A survey," ACM Trans. Softw. Eng. Methodol., vol. 33, nov 2023.
[13] X. Liu, J. Heo, and L. Sha, "Modeling 3-tiered web applications," in 13th IEEE international symposium on modeling, analysis, and simulation of computer and telecommunication systems, pp. 307-310, IEEE, 2005.
[14] OpenAPI, "Openapi standard," 2023. https://www.openapis.org.
[15] OpenAPI, "Openapi template," 2024. https://openapi-generator.tech/docs/templating.
[16] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, et al., "Dspy: Compiling declarative language model calls into self-improving pipelines," arXiv preprint arXiv:2310.03714, 2023.
[17] OpenAI, "Introducing structured outputs in the api." https://openai.com/index/introducing-structured-outputs-in-the-api/, 2023. Accessed: 2024-10-21.
[18] M. Asyraf, M. Ishak, M. Razman, and M. Chandrasekar, "Fundamentals of creep, testing methods and development of test rig for the full-scale crossarm: A review," Jurnal Teknologi, vol. 81, no. 4, 2019.
[19] A. Golmohammadi, M. Zhang, and A. Arcuri, "Testing restful apis: A survey," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 1, pp. 1-41, 2023.
[20] M. Zhang, B. Marculescu, and A. Arcuri, "Resource-based test case generation for restful web services," in Proceedings of the genetic and evolutionary computation conference, pp. 1426-1434, 2019.
[21] M. Zhang, B. Marculescu, and A. Arcuri, "Resource and dependency based test case generation for restful web services," Empirical Software Engineering, vol. 26, no. 4, p. 76, 2021.
[22] A. Arcuri, "Automated black-and white-box testing of restful apis with evomaster," IEEE Software, vol. 38, no. 3, pp. 72-78, 2020.
[23] A. Arcuri, "Test suite generation with the many independent objective (mio) algorithm," Information and Software Technology, vol. 104, pp. 195-206, 2018.
[24] A. Arcuri, "Restful api automated test case generation with evomaster," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 28, no. 1, pp. 1-37, 2019.
[25] M. Zhang, A. Belhadi, and A. Arcuri, "Javascript instrumentation for search-based software testing: A study with restful apis," in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 105-115, IEEE, 2022.
[26] A. Møller and M. T. Torp, "Model-based testing of breaking changes in node.js libraries," in Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp. 409-419, 2019.
[27] V. Atlidakis, R. Geambasu, P. Godefroid, M. Polishchuk, and B. Ray, "Pythia: grammar-based fuzzing of rest apis with coverage-guided feedback and learning-based mutations," arXiv preprint arXiv:2005.11498, 2020.
[28] E. Viglianisi, M. Dallago, and M. Ceccato, "Resttestgen: automated black-box testing of restful apis," in 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), pp. 142-152, IEEE, 2020.
[29] C. Benac Earle, L.-Å. Fredlund, Á. Herranz, and J. Mariño, "Jsongen: A quickcheck based library for testing json web services," in Proceedings of the Thirteenth ACM SIGPLAN workshop on Erlang, pp. 33-41, 2014.
[30] S. K. Chakrabarti and P. Kumar, "Test-the-rest: An approach to testing restful web-services," in 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns, pp. 302-308, IEEE, 2009.
[31] T. Fertig and P. Braun, "Model-driven testing of restful apis," in Proceedings of the 24th International Conference on World Wide Web, pp. 1497-1502, 2015.
[32] A. Arcuri, "Test suite generation with the many independent objective (mio) algorithm," Information and Software Technology, vol. 104, pp. 195-206, 2018.
[33] P. Godefroid, B.-Y. Huang, and M. Polishchuk, "Intelligent rest api data fuzzing," in Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pp. 725-736, 2020.
[34] D. Stallenberg, M. Olsthoorn, and A. Panichella, "Improving test case generation for rest apis through hierarchical clustering," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 117-128, IEEE, 2021.
[35] J. Lin, T. Li, Y. Chen, G. Wei, J. Lin, S. Zhang, and H. Xu, "forest: A tree-based approach for fuzzing restful apis," arXiv preprint arXiv:2203.02906, 2022.
[36] A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, "Restest: Black-box constraint-based testing of restful web apis," in Service-Oriented Computing: 18th International Conference, ICSOC 2020, Dubai, United Arab Emirates, December 14-17, 2020, Proceedings 18, pp. 459-475, Springer, 2020.
[37] N. Li, J. Wang, C. Chen, and H. Hu, "Application of api automation testing based on microservice mode in industry software," in Proceedings of the International Conference on Algorithms, Software Engineering, and Network Security, pp. 460-464, 2024.
[38] T. Olasehinde and S. Shekhar, "Optimizing microservices and api testing pipelines with ai,"
[39] M. Kim, T. Stennett, D. Shah, S. Sinha, and A. Orso, "Leveraging large language models to improve rest api testing," in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, pp. 37-41, 2024.
[40] T. Le, T. Tran, D. Cao, V. Le, T. N. Nguyen, and V. Nguyen, "Kat: Dependency-aware automated api testing with large language models," in 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 82-92, IEEE, 2024.
[41] S. Deepika Sri, M. Aadil S, S. Varshini R, R. CSP Raman, G. Rajagopal, and S. Taranath Chan, "Automating rest api postman test cases using llm," arXiv e-prints, pp. arXiv-2404, 2024.
[42] A. Decrop, G. Perrouin, M. Papadakis, X. Devroey, and P.-Y. Schobbens, "You can rest now: Automated specification inference and black-box testing of restful apis with large language models," arXiv preprint arXiv:2402.05102, 2024.
[43] A. Pereira, B. Lima, and J. P. Faria, "Apitestgenie: Automated api test generation through generative ai," arXiv preprint arXiv:2409.03838, 2024.