Abstract—Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient ...

• The data transformation logic in the building sector involves multiple combinations of aggregation, attribute flattening, ...
Fig. 1. Private-sector organizations use diverse formats to describe building load profiles. Each profile dataset must be converted into a unified target format for each different purpose. This figure provides several simplified examples.
Existing Text2SQL works [8]–[11] focus on selection queries, but cannot handle creation and modification queries. Furthermore, the utilization of LLMs for our target scenario is not only unique but also faces new challenges:

• Schema Change Challenge: Different from existing Text2SQL works, SQLMorpher needs to generate a query that maps data from the source schema to the target schema.

• Prompt Engineering Challenge: Designing a unified prompt to handle different types of schema changes and data transformation contexts is tedious.

• Accuracy Challenge: Most importantly, the code generated by LLMs could be error-prone and even dangerous (e.g., leading to security concerns such as SQL injection attacks).

To address these challenges, the proposed system, as illustrated in Fig. 2, consists of the following unique components:

First, a unique prompt generator provides a unified prompt template. It allows external tools, such as domain-specific databases, vector databases that index historically successful prompts, and existing schema change detection tools [13]–[15], to be easily plugged into the component to retrieve various optional information. The prompt generator compresses the prompt size by substituting a few sample rows for the full source datasets, while the generated SQL code remains applicable to transforming the entire source datasets.

Second, an automatic and iterative prompt optimization component executes the SQL code extracted from the LLM response in a sandbox database that is separated from user data. It also automatically detects flaws in the last prompt and adds a request to fix those flaws in the new prompt. Examples of such flaws include errors mentioned in the last LLM response, errors that occurred when executing the SQL query generated by the LLM, and insights extracted from these errors based on rules.

Our Key Contributions are summarized as follows:

• We are the first to apply LLMs to generate SQL code for data transformation. Our system, termed SQLMorpher, includes a prompt generator that can be easily integrated with domain-specific knowledge, high-level schema-change hints, and historical prompt knowledge. It also includes an iterative prompt optimization tool that identifies flaws in the prompt for enhancement. We implemented an evaluation framework based on SQLMorpher. (See details in Sec. III)

• We set up a benchmark that consists of 105 real-world data transformation cases in 15 groups in the smart building domain. We document each case using the source schema, the source data examples, the target schema, available domain-specific knowledge, the schema hints, and a working transformation SQL query for users to validate the solutions. We made the benchmark publicly available to benefit multiple communities in smart building, Text2SQL, and automatic data transformation.1 2 (See details in Sec. IV-B)

• We have conducted a detailed empirical evaluation with ablation studies. SQLMorpher using ChatGPT-3.5-turbo-16K achieved up to 96% accuracy on 105 real-world cases in the smart building domain. We verified that our approach can generalize to scenarios beyond building energy data, such as COVID-19 data and existing data transformation benchmarks. We also compared SQLMorpher to state-of-the-art data transformation tools such as Auto-Pipeline (though these tools are not publicly available) on their commercial benchmark. The results showed that SQLMorpher achieves 81% accuracy without using any domain knowledge and 94% accuracy using domain knowledge, both of which outperform Auto-Pipeline's accuracy on this benchmark. We also summarized a list of insights and observations that are helpful to these communities. (See details in Sec. IV)

1 https://github.com/asu-cactus/Data_Transformation_Benchmark
2 https://github.com/asu-cactus/ChatGPTwithSQLscript
[Fig. 2. System overview of SQLMorpher. A prompt generator draws on a domain-specific database, schema change tooling, and a historical prompt database to build a prompt from the source dataset and target schema. The prompt is sent to a large language model (e.g., ChatGPT), whose response is executed against a testing database (e.g., PostgreSQL). At experiment time, the transformed result is validated against ground truth; at deployment time, validation uses, e.g., unit test cases or self-consistency. If validation fails, the errors are returned to augment the prompt; if it passes, the target dataset is returned.]
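To make the control flow of Fig. 2 concrete, the following is a minimal Python sketch of the end-to-end loop, assuming up to five iterations per case as in our evaluation. All helper names (generate_prompt, call_llm, extract_sql, execute_in_sandbox, validate, augment_prompt) are hypothetical placeholders for the components described in Sec. III, not the actual implementation.

MAX_ITERATIONS = 5  # the evaluation allows up to 5 iterations per case

def transform(source_schema, sample_rows, target_schema, hints=None, demos=None):
    # Build the initial prompt from the unified template (Sec. III).
    prompt = generate_prompt(source_schema, sample_rows, target_schema,
                             hints=hints, demonstrations=demos)
    for _ in range(MAX_ITERATIONS):
        response = call_llm(prompt)                    # e.g., ChatGPT-3.5-turbo-16K
        sql = extract_sql(response)                    # pull the SQL script out of the response
        error, target_table = execute_in_sandbox(sql)  # sandboxed PostgreSQL, isolated from user data
        if error is None and validate(target_table):   # ground truth at experiment time;
            return target_table                        # e.g., unit tests at deployment time
        new_prompt = augment_prompt(prompt, error, response)  # append observed flaws
        if new_prompt == prompt:                       # stop if the prompt no longer changes
            break
        prompt = new_prompt
    return None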
II. RELATED WORKS

Existing Text2SQL tools [8]–[11] automatically generate SQL code to answer text-based questions on relations. However, existing Text2SQL tools focus on generating selection queries. To our knowledge, no Text2SQL tool supports the modification queries (e.g., insertions) that are required by data transformation. In addition, we surveyed multiple Text2SQL benchmarks, including Spider [16], SQUALL [17], Criteria2SQL [18], KaggleDBQA [19], and so on. However, we did not find any data transformation use cases in these benchmarks, which also indicates that data transformation problems are not the focus of today's Text2SQL research.

Existing automatic data transformation [3]–[5], [20]–[28] ...

[Figure: the unified prompt template and an example.
You are a SQL developer. Please generate a Postgres SQL script to convert the first table to be consistent with the format of the second table.
First, you must create the first table named $SourceTable with the given attribute names: {$source_data_schema} and insert $k rows into the source table.
{k rows of data to be inserted into the source table.} (Optional; if not provided, ChatGPT will generate the data to be inserted.)
Second, you must create a second table named $TargetTable with the given attributes: {$target_data_schema}
Finally, insert all rows from the first table into the second table. (This step generates the query that converts the first table to the schema of the second table, called the target transformation query.)
{explanation for the source table schema} (Optional)
{explanation for the target table schema} (Optional)
{hints about schema changes from the source to target} (Optional)
{demonstrations} (Optional)
{flaws in last round's response} (Not needed for the initial round)
(a) The basic prompt's response has only one error, in the aggregation function, which should be SUM rather than MAX. (b) The basic prompt with a domain-specific explanation for the target table returns the correct target transformation query.]
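For concreteness, the template above can be instantiated mechanically. Below is a minimal Python sketch of such a prompt builder; the slot names mirror the template, but the function name and formatting are a hypothetical illustration rather than the exact implementation.

def build_basic_prompt(source_table, source_schema, target_table, target_schema,
                       sample_rows, source_expl=None, target_expl=None,
                       hints=None, demos=None, flaws=None):
    parts = [
        "You are a SQL developer. Please generate a Postgres SQL script to "
        "convert the first table to be consistent with the format of the second table.",
        f"First, you must create the first table named {source_table} with the "
        f"given attribute names: {source_schema} and insert {len(sample_rows)} rows "
        "into the source table.",
        "\n".join(sample_rows),  # k sample rows stand in for the full dataset
        f"Second, you must create a second table named {target_table} with the "
        f"given attributes: {target_schema}",
        "Finally, insert all rows from the first table into the second table.",
    ]
    # Optional sections are appended only when available.
    for section in (source_expl, target_expl, hints, demos, flaws):
        if section:
            parts.append(section)
    return "\n\n".join(parts)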
(2) Schema change hints suggest how the source schema is mapped to the target schema. Given the strong semantic reasoning capability of LLMs, hints are also optional. We found that some high-level hints, such as "use aggregation", are sufficient for LLMs to generate the correct Group-By clause and aggregation functions in most scenarios. As illustrated in the brown box in Fig. 5, such information can be provided by (a) a rule engine that analyzes domain-specific databases as illustrated in Fig. 6, (b) a schema mapping tool such as Starmie [13], or (c) even an LLM itself (e.g., using a separate LLM prompt that asks the LLM to identify schema changes between the source and the target). Fig. 6 illustrates the example information that is available in a domain-specific database for smart buildings and that can be leveraged to generate schema change hints. In this experimental study, most schema change hints are derived from the domain-specific databases as illustrated in Fig. 4 and Fig. 6.

(3) Demonstrations add a few examples of historical prompt-response pairs to the prompt to perform few-shot learning. It is critical that the demonstration prompts be similar to the current prompt. In the SQLMorpher design, we choose to store the embedding vectors of historically successful prompts in a vector database, such as Faiss, for top-k nearest-neighbor search, as illustrated in Fig. 5. In this example, the prompt will fail unless it includes both the schema change hints (in the brown box) and the demonstration (in the purple box). We used the ChatGPT 3.5-turbo-16k model API in August 2023 to generate all examples in this section.

To retrieve the various types of information that augment the prompt, the SQLMorpher design includes a callback system. Each type of information corresponds to an event, and the user can register one or more callback functions with an event. Each callback function is expected to return a JSON object that specifies the retrieved information as well as a status code and an error message describing connection or execution errors, if any. When generating a prompt, SQLMorpher goes through all types of information and invokes all callback functions associated with each information type.
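A minimal sketch of such a callback system follows; the registry layout, event names, and JSON fields are hypothetical, chosen only to match the description above (each callback returns the retrieved information plus a status code and an error message).

from collections import defaultdict

# Event name -> list of registered callback functions.
callbacks = defaultdict(list)

def register(event, fn):
    callbacks[event].append(fn)

def schema_hint_callback(context):
    # Hypothetical retrieval of schema change hints from a domain-specific
    # database; here a static lookup table stands in for the real database.
    hints = {"Source14_3->Target_14": "Use row aggregation group by month, hour."}
    try:
        key = context["source_table"] + "->" + context["target_table"]
        return {"info": hints.get(key), "status": 0, "error": None}
    except KeyError as e:
        return {"info": None, "status": 1, "error": "missing context field: %s" % e}

register("schema_change_hints", schema_hint_callback)

def gather_prompt_info(context):
    # When generating a prompt, go through all information types and invoke
    # every callback registered for each of them.
    return {event: [fn(context) for fn in fns] for event, fns in callbacks.items()}

# Example:
# gather_prompt_info({"source_table": "Source14_3", "target_table": "Target_14"})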
[Figure content:

Prompt:
You are a SQL developer. Please generate a Postgres SQL script to convert the first table to be consistent with the format of the second table.
First, you must create the first table named $SourceTable with the given attribute names: {$source_data_schema} and insert $k rows into the first table.
{k rows of data to be inserted into the source table.}
Second, you must create a second table named $TargetTable with only the given attributes: {$target_data_schema}
Finally, insert all rows from the first table into the second table.
{explanation for the source table schema}
{explanation for the target table schema}
{hints about schema changes from the source to target}
{demonstrations}

A Real-world Test Case (Case 100, Group 14):
Source Schema: Source14_3(site, timestamp, TOTAL BLDG WHS [emon ch1,2], AC COMPRESSOR WHS [emon ch3,4], AIR HANDLER WHS [emon ch5,6], WATER HEATER WHS [emon ch7,8], DRYER WHS (1-CT) [emon ch9], RANGE WHS (1-CT) [emon ch10], DISH WASHER WHS [emon ch11], Primary Fridge WHS [emon ch12], 2nd Fridge WHS [emon ch13], SPARE1 WHS (1-CT) [emon ch14], SPARE2 WHS (1-CT) [xpod chA-1], SPARE3 WHS (1-CT) [xpod chA-2], POOL PUMP WHS (2-CTs) [xpod chA-3,4], SPARE4 WHS (2-CTs) [xpod chA-5,6], Minisplit WHS (2-CTs) [xpod chA-7,8], Dryer WHS (2-CTs) [xpod chA-9,10], Calculated Unmeasured loads (Whr), Calculated Energy Use (Whr), Future use- WHS (2-CTs) [xpod chB-3,4], Future use- WHS (2-CTs) [xpod chB-5,6], eMonitor Temp (deg. F), Wattsup Cumulative Ent.Ctr (WHS), Wattsup Energy Ent.Ctr (WHS), Wattsup Cumulative Washer (WHS), Wattsup Energy Washer (WHS), LaCrosse Device Temp (deg. F), PointSix Temp (deg. F), PointSix Humidity (%), HOBO LOGGER TEMP (deg. F), HOBO LOGGER RH (%))
Target Schema: Target_14(month, hour, HVAC, domestic_water_heating, major_appliances, lighting, miscellaneous, Total)
Samples in the Source Table:
36 "8/22/12 16:00" 322.0 323.0 324.0 325.0 326.0 327.0 328.0 329.0 330.0 331.0 332.0 333.0 334.0 335.0 336.0 337.0 74.9 319.0 320.0 321.0 322.0 323.0 324.0 325.0 326.0 327.0 75.9 53.0 327.0 327.0
36 "8/22/12 17:00" 322.0 323.0 324.0 325.0 326.0 327.0 328.0 329.0 330.0 331.0 332.0 333.0 334.0 335.0 336.0 337.0 74.9 319.0 320.0 321.0 322.0 323.0 324.0 325.0 326.0 327.0 75.9 53.0 327.0 327.0
...

Schema Change Hints (brown box): hints obtained from a domain-specific database, plus hints obtained from a simple rule-based tool: "Use row aggregation group by month, hour. Use column aggregation for total." Example rule: if the source tuple is at the hour level and the target tuple is at the month-hour level, use aggregation group by month, hour. The domain-specific rule-based schema change hints generator can also be replaced by a human expert.

Demonstrations (purple box): retrieved by nearest-neighbor search over the embedding vectors of historical successful prompts, stored through an embedding layer in a vector database (Faiss).

A successful prompt-response pair for the above case: the prompt follows the template above; the correct response is:
SELECT EXTRACT(MONTH FROM time) AS month, EXTRACT(HOUR FROM time) AS hour, … FROM Source14_1 GROUP BY month, hour; (part of the query omitted due to space limitations)]

Fig. 5. A Working Prompt for a Real-World Case (Case 100 in Group 14 in Tab. I).
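The demonstration-retrieval step in Fig. 5 stores embeddings of historically successful prompts and retrieves the nearest neighbors of the current prompt. A minimal Faiss sketch follows; the embedding dimensionality and the embed() function are stand-in assumptions (a real system would use a learned embedding layer).

import numpy as np
import faiss

DIM = 384                      # embedding dimensionality (assumed)
index = faiss.IndexFlatL2(DIM) # exact L2 nearest-neighbor index
stored = []                    # historical (prompt, response) pairs, aligned with index rows

def embed(text):
    # Stand-in embedding: a deterministic pseudo-random vector seeded by the
    # text hash, used here only to make the sketch self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(DIM).astype("float32")

def add_successful_prompt(prompt, response):
    index.add(embed(prompt).reshape(1, DIM))
    stored.append((prompt, response))

def retrieve_demonstrations(current_prompt, k=1):
    # Top-k nearest-neighbor search over the stored prompt embeddings.
    _, ids = index.search(embed(current_prompt).reshape(1, DIM), k)
    return [stored[i] for i in ids[0] if i != -1]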
B. SQL Execution

Compared to existing Text2SQL work, which focuses on read-only selection queries, leveraging an LLM to generate modification queries is more complicated, partially because running the generated query may raise security concerns. In the initial iteration for a given user request, the system automatically duplicates the source dataset in a separate PostgreSQL database that serves as a sandbox environment to isolate the errors, if the duplicate does not exist. This ensures that the generated code will not corrupt the source dataset. Then, the script creates the target table. Finally, it runs the generated query to transform the entire source dataset into the target format and inserts all transformed tuples into the target table. If another iteration is needed, e.g., because the response cannot pass the validation tests, the target table is removed or archived before running the next iteration.
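A minimal sketch of this sandboxed execution using psycopg2 follows; the DSN and function name are hypothetical, and a real deployment would also create the duplicate of the source data on first use.

import psycopg2

def run_in_sandbox(generated_sql, sandbox_dsn):
    # Execute LLM-generated SQL in the sandbox database (a duplicate of the
    # source data, isolated from user data) and capture any execution error.
    conn = psycopg2.connect(sandbox_dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(generated_sql)  # create target table + insert transformed rows
        return None                     # success: validation can proceed
    except psycopg2.Error as e:
        return str(e)                   # e.g., "INSERT has more expressions than target columns"
    finally:
        conn.close()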
C. Validation

Validation in the production environment could be challenging due to the lack of ground truth. It needs an automatic quality measurement (e.g., unit test cases, self-consistency, or accuracy of downstream tasks) for the transformed data, which we leave for future work to address.

In this work, we manually prepare the ground truth transformation queries for each transformation case in the experimental environment. At the validation stage, the ground truth transformation query is executed against the source table, resulting in a target table, which is called the ground truth target table. At the same time, by executing the target transformation query contained in the LLM response, as described in Sec. III-B, we also obtain a target table, which is called the generated target table.

We designed a validation script, which compares the generated target table to the ground truth target table. The comparison first validates whether the two tables have the same number of attributes and tuples. Then, it performs attribute reordering and tuple sorting to ensure the two tables share the same column-wise and row-wise orderings. Furthermore, the script compares the similarity of the values for each attribute in the two tables. We use the ratio of the number of equivalent values (difference less than e−10) to the total number of values to measure the similarity of numerical attributes. We use the Jaccard similarity to measure the similarity of categorical and text attributes. We average the similarity over all attributes to derive an overall similarity score. If the similarity score is below 1, the validation fails.
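The per-attribute comparison can be sketched as follows. The column-list data layout and the default tolerance are assumptions made for illustration (the text above specifies a threshold of e−10), not the exact validation script.

def column_similarity(gen_col, truth_col, tol=1e-10):
    # Numerical attributes: ratio of equivalent values, where two values are
    # equivalent if their difference is below a small tolerance.
    if all(isinstance(v, (int, float)) for v in truth_col):
        equal = sum(1 for g, t in zip(gen_col, truth_col) if abs(g - t) < tol)
        return equal / len(truth_col)
    # Categorical/text attributes: Jaccard similarity of the value sets.
    a, b = set(gen_col), set(truth_col)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def table_similarity(generated, ground_truth):
    # Assumes both tables passed the structural checks and were already
    # attribute-reordered and tuple-sorted; tables are dicts of column lists.
    scores = [column_similarity(generated[c], ground_truth[c]) for c in ground_truth]
    return sum(scores) / len(scores)

# Validation fails whenever table_similarity(...) < 1.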
D. Iterative Prompt Optimization

This component is incorporated to evaluate the LLM's potential self-optimization capability for the data transformation problem. If the validation fails, the prompt is automatically augmented by identifying errors in the prompt: (1) errors mentioned in the LLM response or encountered when executing the generated transformation query; (2) errors detected in the transformed dataset, e.g., reporting the difference between the schema of the transformed dataset and the target schema; (3) inconsistency between the schema change hint and the response query, e.g., reporting that the hint specifies to use aggregation, but no Group-By or aggregation functions have been used. These errors are then appended to the prompt, the new prompt is sent back to the LLM, and the process repeats until it passes the validation, the maximum number of iterations has been reached, or the new prompt is identical to the last prompt.

An example of a useful prompt flaw that we observed in those cases is "ERROR: INSERT has more expressions than target columns LINE 100: PCT HOURLY 2500". Before adding this error to the prompt, ChatGPT could not correctly handle an attribute that exists in the source table but not in the target table, PCT HOURLY 2500. Adding the error to the prompt resolves the problem.

[Figure content: Table 1, "Enduse Categorization", maps end uses (HVAC; domestic water heating; major appliances; lighting; miscellaneous plug loads; Total) and their sub end uses (e.g., heating, cooling, furnace/AC fan, boiler pumps, refrigerator, clothes washer, clothes dryer, dishwasher, cooking range, pool/spa pumps, interior/exterior lighting, other refrigerators) to enduse codes, and lists the 30 attributes of the source table Source14_3 with their codes:
1 BLDPWR TOTAL BLDG WHS [emon ch1,2]
2 CMPPWR AC COMPRESSOR WHS [emon ch3,4]
3 AHUPWR AIR HANDLER WHS [emon ch5,6]
4 DWHPWR WATER HEATER WHS [emon ch7,8]
5 DRY1CT DRYER WHS (1-CT) [emon ch9]
6 RNG1CT RANGE WHS (1-CT) [emon ch10]
7 DSHWSR DISH WASHER WHS [emon ch11]
8 FRIDG1 Primary Fridge WHS [emon ch12]
9 FRIDG2 2nd Fridge WHS [emon ch13]
10 SPARE1 SPARE1 WHS (1-CT) [emon ch14]
11 SPARE2 SPARE2 WHS (1-CT) [xpod chA-1]
12 SPARE3 SPARE3 WHS (1-CT) [xpod chA-2]
13 POOLPW POOL PUMP WHS (2-CTs) [xpod chA-3,4]
14 SPARE4 SPARE4 WHS (2-CTs) [xpod chA-5,6]
15 MSPLIT Minisplit WHS (2-CTs) [xpod chA-7,8]
16 DRY2CT Dryer WHS (2-CTs) [xpod chA-9,10]
17 OTHPWR Calculated Unmeasured loads (Whr)
18 BLDPWC Calculated Energy Use (Whr)
19 EXTRA1 Future use- WHS (2-CTs) [xpod chB-3,4]
20 EXTRA2 Future use- WHS (2-CTs) [xpod chB-5,6]
21 EMTEMP eMonitor Temp (deg. F)
22 WUPCUM Wattsup Cumulative Ent.Ctr (WHS)
23 WUPWHR Wattsup Energy Ent.Ctr (WHS)
24 WUCUMW Wattsup Cumulative Washer (WHS)
25 WUWASH Wattsup Energy Washer (WHS)
26 LCTMP1 LaCrosse Device Temp (deg. F)
27 P6TEMP PointSix Temp (deg. F)
28 P6HUMI PointSix Humidity (%)
29 HBTEMP HOBO LOGGER TEMP (deg. F)
30 HBRHUM HOBO LOGGER RH (%)]

Fig. 6. Example information from the smart building domain-specific database specifies the mapping from source attributes to the target attributes for the example in Fig. 5.

IV. EXPERIMENTAL EVALUATION

In this section, we first describe the goal of the comparison study and all baselines that were used. Then, we present the benchmark, which is the first benchmark for smart building data standardization problems. We further describe the setup of the experiments and the evaluation metrics. Finally, we present and analyze the results and summarize key findings.

A. Comparison and Baselines

In this work, we mainly compare the effectiveness of six different types of initial prompt templates:
• Prompt-1: Basic prompt with a domain-specific description for the target schema.
• Prompt-2: Prompt-1 with a domain-specific description for the source schema.
• Prompt-3: Prompt-2 with schema change hints.
• Prompt-1+Demo: Prompt-1 with one demonstration.
• Prompt-2+Demo: Prompt-2 with one demonstration.
• Prompt-3+Demo: Prompt-3 with one demonstration.
The first three prompt templates are designed for zero-shot learning, when there does not exist a database of abundant historical working prompts. The last three prompt templates are designed for one-shot learning.

We also considered comparing our approach to Auto-Pipeline [4], a state-of-the-art automatic data transformation tool that only requires schema and tuple examples of the source and target tables and applies deep reinforcement learning to synthesize the transformation pipeline.
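The six templates differ only in which optional sections of the unified prompt template (Sec. III) are enabled. A hypothetical encoding as feature flags:

# Hypothetical encoding of the six initial prompt templates as feature flags.
ZERO_SHOT = {
    "Prompt-1": {"target_expl"},
    "Prompt-2": {"target_expl", "source_expl"},
    "Prompt-3": {"target_expl", "source_expl", "schema_change_hints"},
}
# One-shot variants add a single retrieved demonstration to each template.
ONE_SHOT = {name + "+Demo": flags | {"demonstration"} for name, flags in ZERO_SHOT.items()}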
B. Benchmark Design

1) Building Energy Data Transformation: We collected 105 data transformation examples in the smart building domain from 21 energy companies in the United States. These examples are divided into 15 groups so that each group has one target dataset and multiple source datasets of different types. Each source needs to be converted to the target format in the group. The groups are described in Tab. I. In Tab. II, we further show more statistics of the 105 test cases by group: (1) the number of distinct SQL keywords used in the ground truth query and (2) the length (i.e., number of characters) of the ground truth query. For each group, we compute the average of the above metrics over all cases in the group.

We document the following information in the benchmark: (1) the target schema and domain-specific explanations for its attributes; (2) for each source dataset, its schema, domain-specific explanations of attributes, examples of instances, schema change hints for transforming the source table to the target format, and the ground truth query that transforms the source to the target. The benchmark dataset is open-sourced in a GitHub repository 1.
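As an illustration, one benchmark case could be documented as a record like the following; the field names are hypothetical, chosen to mirror the list above rather than the repository's actual file layout.

# Hypothetical record structure for one benchmark case.
case_100 = {
    "group": 14,
    "target_schema": "Target_14(month, hour, HVAC, domestic_water_heating, "
                     "major_appliances, lighting, miscellaneous, Total)",
    "target_attribute_explanations": {"HVAC": "..."},
    "source_schema": "Source14_3(site, timestamp, ...)",
    "source_attribute_explanations": {"timestamp": "..."},
    "sample_instances": ['36 "8/22/12 16:00" 322.0 ...'],
    "schema_change_hints": "Use row aggregation group by month, hour. "
                           "Use column aggregation for total.",
    "ground_truth_query": "SELECT EXTRACT(MONTH FROM time) AS month, ... GROUP BY month, hour;",
}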
2) Other Benchmarks Used: We also used two other benchmarks that go beyond smart building data transformation, for different purposes. One commercial benchmark consists of 16 cases used by the Auto-Pipeline baseline. Since the Auto-Pipeline code is not publicly available, we apply our proposed approach (without and with domain-specific knowledge) to the benchmark and compare the results.

Another benchmark consists of four COVID-19 data transformation cases, which we used to further validate how well our methodology generalizes to other data transformation scenarios. It includes all four transformation cases observed in the GitHub commit history of a widely used real-world COVID-19 data repository maintained by Johns Hopkins University [31]. The attributes in the target data are (Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered), which represent state-level COVID-19 statistics. The source schemas of the first two cases involve county-level data with different numbers of columns, and the latter two cases involve state-level data with different column names and different numbers of columns.

Overall, we have tested 125 cases across the three benchmarks, among which 27 cases involve attribute merging, 89 cases involve attribute name changes, 32 cases involve pivoting, 5 cases involve attribute flattening, 50 cases involve group-by and aggregation, and 8 cases involve joins.
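As an illustration of the county-to-state COVID-19 cases, such a transformation reduces to group-by aggregation; the following simplified query is hypothetical (the real source schemas and column names differ across cases):

# Hypothetical, simplified query for a county-level -> state-level case.
covid_case_sql = """
INSERT INTO target ("Province/State", "Country/Region", "Last Update",
                    "Confirmed", "Deaths", "Recovered")
SELECT province_state, country_region, MAX(last_update),
       SUM(confirmed), SUM(deaths), SUM(recovered)
FROM source_county_level
GROUP BY province_state, country_region;
"""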
achieved an execution accuracy of 28% and 36%, respectively.
C. Evaluation Metrics It demonstrated the importance of supplying domain-specific
knowledge, particularly schema change hints, as part of the
We report the following metrics in the experimental study: prompt to the LLM. The observation justifies the integration
• Execution Accuracy: This metric is defined as the ratio of of the LLM with the domain-specific knowledge base and the
the number of correctly transformed cases to the total number schema mapping tools for data transformation pipelines.
of transformation cases. For each case, if the LLM can return 2) Effectiveness of One-shot Learning.: For the four cases
the correct transformation query that passes the experimental that failed with Prompt-3, we applied Prompt-4, Prompt-5,
validation tests as described in Sec. III-C within 5 iterations, and Prompt-6 to check whether providing one demonstration
it is considered a correctly transformed case. example that involves a similar prompt and a correct response
• Column Similarity: We compute the similarity score for each can improve the LLM response. The results are illustrated
column in the transformed dataset and its corresponding col- in Tab. III, which showed that using prompts that combine
umn in the ground truth target dataset (defined in Sec. III-C). domain-specific knowledge and demonstration is capable of
As detailed in Sec. III-C, we compute a similarity score for solving all four complicated cases that failed with Prompt-3.
each column. We further define the column similarity per 3) Effectiveness of the Iterative Optimization Process.:
case as the average similarity scores of all target attributes Compared to Prompt-1 and Prompt-2, we have found that
in the case, the column similarity per group as the average Prompt-3 can gain significantly more from iterative prompt
similarity scores of all cases in the group, and the overall optimization. When using Prompt-1, five cases in three groups,
column similarity as the average similarity scores of all cases Group-1, Group-4, and Group-7, benefit from iterative prompt
in all groups. The similarity score is set to zero for cases that optimization, the average number of iterations being 1.2,
fail to generate output data for similarity comparison. 1.3, and 2.0, respectively, as illustrated in Tab. II. Other
groups either have all cases passed in one iteration or have
D. Experimental Setups

We implemented the end-to-end workflow, as illustrated in Fig. 2, in a Python script that uses the ChatGPT-3.5-turbo-16K model. We did not present results on ChatGPT-4 because the corresponding OpenAI API had a limit of 4K bytes for the total prompt-response size at this point, which is insufficient for a significant portion of real-world cases. For example, the tables in Group 10 to Group 15 have up to 152 attributes, leading to a large prompt size. We set the temperature to zero to avoid randomness, for several reasons. First, a primary goal of this work is to evaluate the effectiveness of LLMs on data transformation tasks using different types of initial prompts, as well as the effectiveness of iterative prompt optimization; random responses require additional methods (e.g., majority voting) for self-consistency, which would complicate the comparison. Second, setting the temperature to zero achieves better-quality results in most cases, according to a recent OpenAI article [32]. All SQL code is run on PostgreSQL version 15.0 for validation. All descriptions for source and target attributes are obtained from a domain-specific database 3.

3 The domain-specific database is maintained by co-author Liang Zhang. Some example information in the database is illustrated in Fig. 6 and Fig. 4.
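For reference, a zero-temperature call matching this setup, using the pre-1.0 openai Python package that was current in 2023; the function name and message structure are illustrative rather than the paper's exact script.

import openai

def call_llm(prompt):
    # Temperature 0 removes sampling randomness across iterations (Sec. IV-D).
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]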
TABLE I
DESCRIPTIONS OF BENCHMARK GROUPS
E. Smart Building Data Transformation Results

1) Overall Results: The zero-shot learning results for the smart building data transformation benchmark are illustrated in Tab. II. Using Prompt-3, our proposed SQLMorpher methodology achieved an execution accuracy of 96%, which is significantly higher than Prompt-1 and Prompt-2, which achieved execution accuracies of 28% and 36%, respectively. This demonstrates the importance of supplying domain-specific knowledge, particularly schema change hints, as part of the prompt to the LLM. The observation justifies integrating the LLM with the domain-specific knowledge base and schema mapping tools in data transformation pipelines.

2) Effectiveness of One-shot Learning: For the four cases that failed with Prompt-3, we applied the one-shot templates (Prompt-1+Demo, Prompt-2+Demo, and Prompt-3+Demo) to check whether providing one demonstration example that involves a similar prompt and a correct response can improve the LLM response. The results are illustrated in Tab. III, which shows that prompts combining domain-specific knowledge and a demonstration are capable of solving all four complicated cases that failed with Prompt-3.

3) Effectiveness of the Iterative Optimization Process: Compared to Prompt-1 and Prompt-2, we found that Prompt-3 gains significantly more from iterative prompt optimization. When using Prompt-1, five cases in three groups (Group-1, Group-4, and Group-7) benefit from iterative prompt optimization, with the average number of iterations being 1.2, 1.3, and 2.0, respectively, as illustrated in Tab. II. Other groups either have all cases pass in one iteration or have all cases fail. When using Prompt-2, four cases in three groups (Group-2, Group-3, and Group-7) require more than one iteration to succeed, with the average number of iterations being 1.4, 1.2, and 1.2, respectively. When using Prompt-3, 10 cases in six groups require more than one iteration to succeed. This means that 9.5% of the total cases benefit from iterative prompt optimization when using Prompt-3.

F. Results on Benchmarks Beyond Smart Building

First, we tested our approach on the COVID-19 benchmark. The results are illustrated in Tab. IV, which shows that our proposed methodology resolves all four cases simply using the basic prompt (Prompt-1).

Second, we also compared our proposed approach with the Auto-Pipeline approach, using its commercial benchmark [4]. The results are illustrated in Tab. V. Our proposed methodology achieved perfect execution accuracy on all 16 transformation problems in their benchmark using only the basic prompt, without additional domain-specific knowledge. The execution accuracy achieved by Auto-Pipeline on this benchmark is below 70% [4]. The comparison implies that our approach has great potential to outperform state-of-the-art automatic data transformation tools.

G. Summary of Key Findings

• Large language models are promising for automatically resolving complicated smart building data transformation cases if domain-specific knowledge is available and easily retrievable. We achieved 96% accuracy on our proposed benchmark, which consists of 105 real-world smart building cases.
TABLE II
COMPARISON OF EXECUTION ACCURACY USING DIFFERENT PROMPT TEMPLATES WITH ZERO-SHOT LEARNING (GRP STANDS FOR GROUP)

                  Grp-1 Grp-2 Grp-3 Grp-4 Grp-5 Grp-6 Grp-7 Grp-8 Grp-9 Grp-10 Grp-11 Grp-12 Grp-13 Grp-14 Grp-15
#keywords avg.    15.6  18.7  19.6  19.6  16.4  16.3  24.2  26.7  23.7  13.7   20.3   28.0   5.0    25.3   31.5
length avg.       1802  2826  1957  2023  1731  1713  1548  3239  1034  1712   2085   2365   1412   1732   1918

Prompt 1. Overall execution accuracy: 29/105 (28%); overall column similarity score: 0.4; overall iterations to success: 1.3
exec acc          6/10  2/10  4/10  6/10  3/10  2/10  6/10  0/10  0/3   0/3    0/4    0/4    0/3    0/4    0/4
sim score avg.    0.7   0.5   0.6   0.6   0.3   0.3   0.8   0.0   0.0   0.6    0.3    0.0    0.3    0.0    0.0
iter-to-succ avg. 1.2   1.0   1.0   1.3   1.0   1.0   2.0   -     -     -      -      -      -      -      -

Prompt 2. Overall execution accuracy: 38/105 (36%); overall column similarity score: 0.5; overall iterations to success: 1.1
exec acc          6/10  6/10  7/10  6/10  3/10  2/10  8/10  0/10  0/3   0/3    0/4    0/4    0/3    0/4    0/4
sim score avg.    0.7   0.6   0.7   0.6   0.4   0.3   0.9   0.0   0.4   0.8    0.7    0.2    0.3    0.1    0.0
iter-to-succ avg. 1.0   1.4   1.2   1.0   1.0   1.0   1.2   -     -     -      -      -      -      -      -

Prompt 3. Overall execution accuracy: 101/105 (96%); overall column similarity score: 0.96; overall iterations to success: 1.2
exec acc          10/10 10/10 10/10 10/10 10/10 10/10 10/10 9/10  3/3   3/3    4/4    3/4    3/3    2/4    4/4
sim score avg.    1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.9   1.0   1.0    1.0    0.8    1.0    0.5    1.0
iter-to-succ avg. 1.0   1.0   1.1   1.0   1.7   1.6   1.0   1.0   1.0   1.3    1.5    1.0    1.3    1.0    1.0
TABLE IV
PROMPT COMPARISON FOR COVID-19 BENCHMARK
#KEYWORDS AVG.: 5; LENGTH AVG.: 277

                  Prompt-1 Prompt-2 Prompt-3
exec acc          4/4      4/4      4/4
sim score avg.    1.0      1.0      1.0
iter-to-succ avg. 1.0      1.0      1.0

[Figure: execution accuracy (0%-100%) of Prompt-1, Prompt-2, and Prompt-3, broken down by transformation type: attribute merging, attribute name change, attribute pivoting, attribute flattening, group-by and aggregate, and join.]