
Automatic Data Transformation Using Large Language Model

– An Experimental Study on Building Energy Data


Ankita Sharmaa,b *, Xuanmao Lib,g *, Hong Guana,b *, Guoxin Sunb *, Liang Zhangc,f ‡, Lanjun Wangd ,
Kesheng Wua , Lei Caoc , Erkang Zhue , Alexander Sima , Teresa Wub , Jia Zoua,b †
Lawrence Berkeley National Laba , Arizona State Universityb , University of Arizonac , Tianjin Universityd
Microsofte , National Renewable Energy Laboratoryf , Huazhong University of Science and Technologyg

arXiv:2309.01957v2 [cs.DB] 6 Sep 2023

Abstract—Existing approaches to automatic data transformation are insufficient to meet the requirements in many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge easily. Second, they require significant training data collection overheads. Third, the accuracy suffers from complicated schema changes. To bridge this gap, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms the source datasets into the target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns in external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation where our approach achieved 96% accuracy in all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.

Index Terms—large language model, data transformation, smart building, ChatGPT, Text2SQL

I. INTRODUCTION

A recent study [1] showed that in 2022, the end-use energy consumption by the building sector accounted for 40% of total US energy consumption. This indicates that the energy management of buildings plays an important role in meeting the goals of energy sustainability [2]. Automatic building energy management, including design, certification, compliance, real-time control, operation, and policy-making, requires the integration of data from diverse sources in both the private and public sectors. Harmonizing these data, as illustrated in Fig. 1, remains a manual process. Extensive labor and expertise are thus required throughout the data lifecycle in the building sector. However, the state-of-the-art data transformation tools, such as Auto-Transform [3], Auto-Pipeline [4], and Auto-Tables [5], are not effective due to the following gaps:

• These tools are not publicly available and are based on supervised learning approaches, requiring non-trivial data labeling and training overheads.

• The data transformation logic in the building sector involves multiple combinations of aggregation, attribute flattening, merging, pivoting, and renaming relationships between the source and the target. These are more complicated than existing data transformation benchmarks [4], [5]. In addition, the accuracy achieved by the state-of-the-art tools on these simpler benchmarks is below 80% [4], [5], indicating that human effort is still required to fix a significant portion of the cases.

• Converting a building dataset to a target schema requires domain knowledge about both the source and target schemas, which is available in domain-specific knowledge bases as illustrated in Fig. 4 and Fig. 6. However, there is no easy way to directly supply such knowledge to existing data transformation tools.

To close these gaps, before this work, we once considered fine-tuning a pre-trained transformer model like BERT [6] to directly transform source data to target data [7]. However, we identified many shortcomings of this approach. First, it is hard to formulate one unified predictive problem to transform data for all types of schema changes. Second, the transformation process is too slow to handle large-scale data. Third, preparing a fine-tuning dataset for each task could also be challenging.

This work proposes a novel and better approach, termed SQLMorpher, which solves the problem in two steps. The first step is formulated as a Text2SQL problem [8]–[11]. We apply the LLM to generate Structured Query Language (SQL) code that converts the source dataset(s) into the target dataset. This step focuses on schema mapping, so we do not need to upload the entire source datasets. The second step applies the generated SQL code to efficiently transform the entire dataset in relational databases.

The idea is motivated by several key observations: (1) LLMs demonstrate superior performance in complex reasoning tasks. In the building sector, domain experts often document the semantics of the source and the target tables in natural language. LLMs can better understand such descriptions and reason about the relationships between the source and target than smaller pre-trained models. (2) LLMs have demonstrated strong coding and code explanation capability [12]. In addition, SQL's declarative nature makes it easier to map data transformation queries in natural language to SQL queries. (3) LLMs have outstanding capabilities in zero-shot and few-shot adaptation and generalization. Therefore, no or only a few training examples are needed.

* These authors made equal contributions; †Jia Zou is the corresponding author; ‡Liang Zhang is the contact for the datasets and use cases.
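The two-step workflow above can be sketched in miniature. This is an illustrative sketch, not SQLMorpher's implementation: `build_schema_prompt` and the hand-written `generated_sql` are hypothetical stand-ins for the real prompt and the real LLM response, and SQLite replaces PostgreSQL so the example is self-contained.

```python
import sqlite3

def build_schema_prompt(source_ddl, sample_rows, target_ddl):
    """Step 1 input: only the schemas and a few sample rows, never the full data."""
    return ("Generate a SQL script that transforms the source table into the "
            f"target schema.\nSource: {source_ddl}\nSamples: {sample_rows}\n"
            f"Target: {target_ddl}")

def apply_transformation(conn, generated_sql):
    """Step 2: run the LLM-generated SQL against the entire dataset in the database."""
    conn.executescript(generated_sql)

# Toy end-to-end run. The hand-written `generated_sql` stands in for an LLM
# response; a real deployment would send `prompt` to an LLM API instead.
prompt = build_schema_prompt("source1(dt TEXT, load REAL)",
                             [("2018-02-22 00:30", 22.875)],
                             "target1(day TEXT, total_load REAL)")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source1 (dt TEXT, load REAL)")
conn.executemany("INSERT INTO source1 VALUES (?, ?)",
                 [("2018-02-22 00:30", 22.875), ("2018-02-22 00:40", 22.937)])
generated_sql = """
CREATE TABLE target1 AS
SELECT substr(dt, 1, 10) AS day, SUM(load) AS total_load
FROM source1 GROUP BY day;
"""
apply_transformation(conn, generated_sql)
print(conn.execute("SELECT * FROM target1").fetchall())
```

Because only the schema and a handful of sample tuples reach the LLM, the cost of step 1 is independent of the dataset size; step 2 runs entirely inside the database engine.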
[Figure: three simplified source-to-target examples. (a) Transformation based on Group-By, Aggregation, and Pivoting: per-10-minute logger readings (datetime, cerc_logger_1) are summed into one row per day with one column per hour (CST, 1:00, 2:00, ..., 24:00). (b) Transformation based on Attribute Group and Merge: per-end-use columns (heating, cooling, interior_lighting, exterior_lighting, ...) are merged into grouped columns (index, HVAC, lighting, ...). (c) Transformation based on Attribute Merge and Attribute Name Change: (DT_STRATA, DOW, PCT_HOURLY_0100, PCT_HOURLY_0200, ..., PCT_HOURLY_2400) is mapped to (CST, 1:00, 2:00, ..., 24:00), with the day-of-week and date merged into a single column.]

Fig. 1. Private sectors are using diverse formats to describe building load profiles. Each profile dataset must be converted into a unified target format for each different purpose. This figure provides several simplified examples.
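The pattern of Fig. 1a (group-by, aggregation, and pivoting into hour columns) can be illustrated with a reduced sketch. SQLite and the toy table below are illustrative assumptions; only two of the 24 hour columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source1 (dt TEXT, cerc_logger_1 REAL)")
conn.executemany(
    "INSERT INTO source1 VALUES (?, ?)",
    [("2018-02-22 00:30", 1.0), ("2018-02-22 00:40", 2.0),
     ("2018-02-22 01:10", 3.0), ("2018-02-22 01:20", 4.0)],
)
# Pivot: one output row per day, one column per hour, SUM as the aggregate.
# substr(dt, 1, 10) is the date; substr(dt, 12, 2) is the hour.
conn.execute("""
CREATE TABLE target1 AS
SELECT substr(dt, 1, 10) AS CST,
       SUM(CASE WHEN substr(dt, 12, 2) = '00' THEN cerc_logger_1 END) AS h01,
       SUM(CASE WHEN substr(dt, 12, 2) = '01' THEN cerc_logger_1 END) AS h02
FROM source1
GROUP BY CST;
""")
print(conn.execute("SELECT * FROM target1").fetchall())
# → [('2018-02-22', 3.0, 7.0)]
```

The conditional-aggregation idiom (`SUM(CASE WHEN ... END)`) is how a pivot is expressed in plain SQL when the engine has no dedicated PIVOT operator.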

Existing Text2SQL works [8]–[11] focus on selection queries, but cannot handle creation and modification queries. Furthermore, the utilization of LLMs for our target scenario is not only unique but also faces new challenges:

• Schema Change Challenge: Different from existing Text2SQL works, SQLMorpher needs to generate the query that maps data from the source schema to the target schema.

• Prompt Engineering Challenge: Designing a unified prompt to handle different types of schema changes and data transformation contexts is tedious and laborious.

• Accuracy Challenge: Most importantly, the code generated by LLMs could be error-prone and even dangerous (e.g., leading to security concerns such as SQL injection attacks).

To address these challenges, the proposed system, as illustrated in Fig. 2, consists of the following unique components:

First, a unique prompt generator is designed to provide a unified prompt template. It allows external tools to be easily plugged into the component, such as domain-specific databases, vector databases that index historical successful prompts, and existing schema change detection tools [13]–[15], to retrieve various optional information. The prompt generator compresses the prompt size by using a few sample tuples in place of the source datasets to generate SQL code applicable to transforming the entire source datasets.

Second, an automatic and iterative prompt optimization component executes the SQL code extracted from the LLM response in a sandbox database that is separated from user data. It also automatically detects flaws in the last prompt and adds a request to fix the flaws in the new prompt. Examples of the flaws include errors mentioned in the last LLM response, errors that occurred when executing the SQL query generated by the LLM, as well as insights extracted from these errors based on rules.

Our Key Contributions are summarized as follows:

• We are the first to apply LLMs to generate SQL code for data transformation. Our system, termed SQLMorpher, includes a prompt generator that can be easily integrated with domain-specific knowledge, high-level schema-change hints, and historical prompt knowledge. It also includes an iterative prompt optimization tool that identifies flaws in the prompt for enhancement. We implemented an evaluation framework based on SQLMorpher. (See details in Sec. III)

• We set up a benchmark that consists of 105 real-world data transformation cases in 15 groups in the smart building domain. We document each case using the source schema, the source data examples, the target schema, available domain-specific knowledge, the schema hints, and a working transformation SQL query for users to validate the solutions. We made the benchmark publicly available to benefit multiple communities in smart building, Text2SQL, and automatic data transformation 1 2 . (See details in Sec. IV-B)

• We have conducted a detailed empirical evaluation with ablation studies. SQLMorpher using ChatGPT-3.5-turbo-16K achieved up to 96% accuracy in 105 real-world cases in the smart building domain. We verified that our approach can generalize to scenarios beyond building energy data, such as COVID-19 data and existing data transformation benchmarks. Although state-of-the-art data transformation tools such as Auto-Pipeline are not publicly available, we also compared SQLMorpher to them on their commercial benchmark. The results showed that SQLMorpher can achieve 81% accuracy without using any domain knowledge and 94% accuracy using domain knowledge, both of which outperform Auto-Pipeline's accuracy on this benchmark. We also summarized a list of insights and observations that are helpful to these communities. (See details in Sec. IV)

1 https://github.com/asu-cactus/Data_Transformation_Benchmark
2 https://github.com/asu-cactus/ChatGPTwithSQLscript
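The iterative prompt optimization described above amounts to a repair loop. A minimal sketch, assuming a hypothetical `call_llm` function and using an in-memory SQLite database as the sandbox:

```python
import sqlite3

def optimize_iteratively(initial_prompt, call_llm, max_rounds=5):
    """Sketch of the repair loop: run the generated SQL in a throwaway sandbox
    database; on failure, feed the error message back into the next prompt."""
    prompt = initial_prompt
    for _ in range(max_rounds):
        sql = call_llm(prompt)
        sandbox = sqlite3.connect(":memory:")  # isolated from user data
        try:
            sandbox.executescript(sql)
            return sql  # executed cleanly; ready for validation on real data
        except sqlite3.Error as exc:
            prompt = (f"{initial_prompt}\n"
                      f"The last script failed with: {exc}. Please fix it.")
    return None

# A fake LLM that answers incorrectly once, then correctly after seeing the error.
answers = iter(["CREATE TABLE t AS SELECT missing_col FROM nowhere;",
                "CREATE TABLE t (x INTEGER);"])
fixed_sql = optimize_iteratively("convert source to target", lambda p: next(answers))
print(fixed_sql)
# → CREATE TABLE t (x INTEGER);
```

Executing in a sandbox rather than against user tables is what keeps an erroneous or dangerous generated script from touching real data.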
[Figure: SQLMorpher workflow. Prompt Generation takes the source dataset and target schema and draws on a domain-specific database, schema change tooling, and a historical prompt database; the prompt goes to a large language model (e.g., ChatGPT) for response generation; the SQL extracted from the response is executed in a database (e.g., PostgreSQL) to produce the transformed target dataset. At experiment time, the result is validated against a ground-truth target dataset in a testing database; at deployment time, validation uses, e.g., unit test cases and self-consistency. If the result fails to pass the validation, the errors are returned for augmenting the prompt; if it passes validation, the target dataset is returned.]

Fig. 2. SQLMorpher: Automatic Data Transformation based on LLM

II. RELATED WORKS

Existing Text2SQL tools [8]–[11] automatically generate SQL code to answer text-based questions on relations. However, existing Text2SQL tools focus on generating selection queries. To our knowledge, there do not exist any Text2SQL tools that support modification queries (e.g., insertions) that are required by data transformation. In addition, we surveyed multiple Text2SQL benchmarks including Spider [16], SQUALL [17], Criteria2SQL [18], KaggleDBQA [19], and so on. However, we did not find any data transformation use cases in these benchmarks, which also indicates that data transformation problems are not the focus of today's Text2SQL research.

Existing automatic data transformation tools [3]–[5], [20]–[28] fall into two categories: (1) Transform-by-Example (TBE) [20]–[27] infers transformation programs based on user-provided input/output examples, and has been incorporated into popular software such as Microsoft Excel, Power BI [29], and Trifacta [30]. However, these works require users to provide examples of the transformed tuples, which is challenging for complicated data transformations. (2) To address this issue, Transform-by-Target (TBT) [3]–[5] was recently proposed. Works in this category, such as Auto-Transform [3], Auto-Pipeline [4], and Auto-Tables [5], transform data based only on input/output data schemas and, optionally, output data patterns. As mentioned, they learn a pipeline of data transformation operators using deep learning. They cannot easily integrate domain-specific knowledge represented in natural language or other formats. Although those tools are not publicly available, we conducted a comparison by running our approach on their benchmark, as detailed in Sec. IV-F.

[Figure: the single-table prompt template. Required parts:
"You are a SQL developer. Please generate a Postgres SQL script to convert the first table to be consistent with the format of the second table.
First, you must create the first table named $SourceTable with the given attribute names: {$source_data_schema} and insert $k rows into the source table.
{k rows of data to be inserted into the source table.} (Optional; if not provided, ChatGPT will generate the data to be inserted.)
Second, you must create a second table named $TargetTable with the given attributes: {$target_data_schema}
Finally, insert all rows from the first table into the second table." (This step generates the query that converts the first table to the schema of the second table, called the target data transformation query.)
Optional parts:
"{explanation for the source table schema}
{explanation for the target table schema}
{hints about schema changes from the source to target}
{demonstrations}
{flaws in last round's response}" (The last part is not needed for the initial round.)]

Fig. 3. Prompt Template for single-table transformation.

III. SQLMORPHER SYSTEM DESIGN

As illustrated in Fig. 2, the SQLMorpher system consists of a prompt generator, a large language model (LLM), a SQL execution engine, and a component for iterative prompt optimization. In this section, we describe each component in detail. Although SQLMorpher is primarily engineered to evaluate the LLM in our target use scenarios, it is a first-of-a-kind design that has research value in defining the workflows and the interfaces between the LLM and external tooling for the unique data transformation problem.

A. Prompt Generation

We designed a prompt template as illustrated in Fig. 3. The naive user must provide minimal information, such as the source and target table schemas and examples of the tuples in the source dataset. Although a source table could contain many tuples, SQLMorpher demonstrates only a few examples to the LLM, which are sufficient to generate code for correctly transforming the whole table. Despite the sampling techniques that can be applied here, we chose to randomly sample 5 source tuples in the evaluation. If source tuples are not available, we asked the LLM to generate 5 source tuples.

All other information is optional but is helpful for complicated transformation cases. We designed the prompt generator to retrieve additional information from external databases easily. Such information includes:

(1) Domain-specific information, which explains the semantics of each attribute in the source table and the target table. Given LLMs' diverse and ocean-volume training corpus, such explanations are not required for many domains, and so this part is marked as optional. However, we found that using domain knowledge to enhance the prompt could be critical for many smart building data transformation cases. This information can be retrieved from a domain-specific database, as illustrated in Fig. 4. In this example, the basic information plus the domain-specific information that explains only the target table is sufficient to address the first example, as shown in Fig. 1a.
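Assembling a prompt from required and optional parts, as in the template of Fig. 3, is mechanical. The function below is an illustrative sketch with paraphrased section text, not SQLMorpher's exact template or API.

```python
def generate_prompt(source_schema, target_schema, sample_rows=None,
                    source_notes=None, target_notes=None, hints=None,
                    demonstrations=None, last_flaws=None):
    """Assemble the prompt: required parts always appear; optional parts
    (domain knowledge, hints, demonstrations, flaws) only when supplied."""
    parts = [
        "You are a SQL developer. Please generate a Postgres SQL script to "
        "convert the first table to be consistent with the format of the "
        "second table.",
        f"First, create the source table with attributes: {source_schema} "
        f"and insert {len(sample_rows or [])} rows.",
    ]
    if sample_rows:
        parts.append("\n".join(", ".join(map(str, r)) for r in sample_rows))
    parts.append(f"Second, create the target table with attributes: {target_schema}.")
    parts.append("Finally, insert all rows from the first table into the second table.")
    for optional in (source_notes, target_notes, hints, demonstrations, last_flaws):
        if optional:
            parts.append(optional)
    return "\n\n".join(parts)

p = generate_prompt("(datetime, cerc_logger_1)", "(CST, hour columns)",
                    sample_rows=[("2/22/2018 0:30", 22.875)],
                    hints="use aggregation")
print("use aggregation" in p)  # → True
```

Keeping every optional section behind the same `if ... append` pattern is what makes the template unified: one code path serves the minimal prompt, the knowledge-augmented prompt, and the flaw-repair prompt.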
[Figure: side-by-side comparison for the example of Fig. 1a. The left side shows the basic prompt (create source1 with attributes (datetime, cerc_logger_1) and 10 sample rows; create target1 with only the attributes (CST, 1:00, 2:00, ..., 24:00); insert all rows from the first table into the second). The right side shows the same prompt augmented with a domain-specific explanation retrieved from a domain-specific database: "Note that the second table records the total load in each hour each day." Example domain-database entries include sensor channel descriptions such as "27 P6TEMP PointSix Temp (deg. F)" and "29 HBTEMP HOBO LOGGER TEMP (deg. F)", and load profile descriptions such as "C1 C1_L Total Load each hour each day".
(a) The response to the basic prompt pivots with MAX(CASE WHEN EXTRACT(HOUR FROM datetime) = h THEN cerc_logger_1 END) per hour column, grouped by DATE_TRUNC('day', datetime). It has only one error: the aggregation function should be SUM rather than MAX.
(b) The response to the prompt with the domain-specific explanation for the target table returns the correct target transformation query, using COALESCE(SUM(CASE WHEN EXTRACT(HOUR FROM datetime) = h THEN cerc_logger_1 END), 0) for each hour column, grouped by DATE_TRUNC('day', datetime); COALESCE handles hours that have no data.]

Fig. 4. Prompt-Response for the example illustrated in Fig. 1a.
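The single error in Fig. 4a (MAX where SUM is needed) is easy to reproduce in miniature whenever two readings fall into the same hour. SQLite and the toy values below are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source1 (dt TEXT, v REAL)")
conn.executemany("INSERT INTO source1 VALUES (?, ?)",
                 [("2018-02-22 01:10", 2.0), ("2018-02-22 01:20", 3.0)])

# MAX keeps a single reading per hour; SUM totals the hour, which is what
# a target defined as "total load in each hour each day" requires.
wrong, right = conn.execute("""
SELECT MAX(v), SUM(v) FROM source1
GROUP BY substr(dt, 1, 13)  -- group key: date plus hour
""").fetchone()
print(wrong, right)
# → 3.0 5.0
```

Both queries run without error, which is why this flaw cannot be caught by execution alone; only the domain knowledge about the target's semantics distinguishes the two.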

(2) Schema change hints suggest how the source schema is mapped to the target schema. Given the strong semantic reasoning capability of LLMs, hints are also optional. We found that some high-level hints, such as "use aggregation", are sufficient for LLMs to generate correct Group-By clauses and aggregation functions in most scenarios. As illustrated in the brown box in Fig. 5, such information can be provided by (a) a rule engine that analyzes domain-specific databases as illustrated in Fig. 6, (b) a schema mapping tool such as Starmie [13], or (c) even an LLM itself (e.g., using a separate LLM prompt that asks the LLM to identify schema changes between the source and the target). Fig. 6 illustrates the example information that is available in a domain-specific database for smart buildings that can be leveraged to generate schema change hints. In this experimental study, most schema change hints are derived from the domain-specific databases as illustrated in Fig. 4 and Fig. 6.

(3) Demonstrations add a few examples of historical prompt-response pairs to the prompt to perform few-shot learning. It is critical that the demonstration prompts be similar to the current prompt. In the SQLMorpher design, we choose to store the embedding vectors of historically successful prompts in a vector database, such as Faiss, for top-k nearest-neighbor search, as illustrated in Fig. 5. In this example, the prompt will fail unless it includes both the schema change hints (in the brown box) and the demonstration (in the purple box). We used the ChatGPT 3.5-turbo-16k model API in August 2023 to generate all examples in this section.

To retrieve the various types of information to augment the prompt, the SQLMorpher design includes a callback system. Each type of information corresponds to an event, and the user can register one or more callback functions with an event. Each callback function is expected to return a JSON object that specifies the retrieved information as well as a status code and an error message that describes connection or execution errors, if any. When generating a prompt, SQLMorpher goes through all types of information and invokes all callback functions associated with each information type.
Prompt
A Real-world Test Case (Case 100, Group 14)
You are a SQL developer. Please generate a Postgres SQL script to Source Schema
convert the first table to be consistent with the format of the Source14_3(site,timestamp,TOTAL BLDG WHS [emon ch1,2],AC COMPRESSOR WHS [emon
second table. ch3,4],AIR HANDLER WHS [emon ch5,6],WATER HEATER WHS [emon ch7,8], DRYER WHS (1-CT)
[emon ch9] , RANGE WHS (1-CT) [emon ch10] , DISH WASHER WHS [emon ch11] , Primary Fridge
WHS [emon ch12] , 2nd Fridge WHS [emon ch13] , SPARE1 WHS (1-CT) [emon ch14] , SPARE2
First, you must create the first table named $SourceTable with the WHS (1-CT) [xpod chA-1] , SPARE3 WHS (1-CT) [xpod chA-2] ,POOL PUMP WHS (2-CTs) [xpod chA-
given attribute names: {$source_data_schema} and Insert $k rows 3,4],SPARE4 WHS (2-CTs) [xpod chA-5,6],Minisplit WHS (2-CTs) [xpod chA-7,8],Dryer WHS (2-CTs)
into the first table. [xpod chA-9,10], Calculated Unmeasured loads (Whr) , Calculated Energy Use (Whr) ,Future use-
WHS (2-CTs)[xpod chB-3,4],Future use- WHS (2-CTs)[xpod chB-5,6], eMonitor Temp (deg. F) ,
Wattsup Cumulative Ent.Ctr (WHS) , Wattsup Energy Ent.Ctr (WHS) , Wattsup Cumulative Washer
{k rows of data to be inserted into the source table.} (WHS) , Wattsup Energy Washer (WHS) , LaCrosse Device Temp (deg. F) , PointSix Temp (deg. F) ,
PointSix Humidity (%) , HOBO LOGGER TEMP (deg. F) , HOBO LOGGER RH (%) )
Second, you must create a second table named $TargetTable with
only the given attributes: {$target_data_schema} Target Schema
Target_14(month, hour, HVAC, domestic,_water_heating major_appliances, lighting, miscellaneous, Total)

Finally, insert all rows from the first table into the second table.
Samples in the Source Table
36 "8/22/12 16:00" 322.0 323.0 324.0 325.0 326.0 327.0 328.0 329.0 330.0 331.0 332.0 333.0
{explanation for the source table schema} 334.0 335.0 336.0 337.0 74.9 319.0 320.0 321.0 322.0 323.0 324.0 325.0 326.0 327.0 75.9
[Figure: the prompt-generation pipeline and a filled-in prompt for Case 100 in Group 14. The prompt template reads: "You are a SQL developer. Please generate a Postgres SQL script to convert the first table to be consistent with the format of the second table. First, you must create the first table named $SourceTable with the given attribute names: {$source_data_schema} and insert $k rows into the first table. {k rows of data to be inserted into the source table.} Second, you must create a second table named $TargetTable with only the given attributes: {$target_data_schema}. Finally, insert all rows from the first table into the second table. {explanation for the source table schema} {explanation for the target table schema} {hints about schema changes from the source to target} {demonstrations}". The schema change hints are obtained from the domain-specific database, e.g.: month and hour information can be extracted from 'timestamp'; 'HVAC' should map to the sum of 'AC COMPRESSOR WHS [emon ch3,4]', 'AIR HANDLER WHS [emon ch5,6]', and 'Minisplit WHS (2-CTs) [xpod chA-7,8]'; domestic_water_heating should map to 'WATER HEATER WHS [emon ch7,8]'; major_appliances should map to the sum of 'Primary Fridge WHS [emon ch12]', 'DRYER WHS (1-CT) [emon ch9]', 'Dryer WHS (2-CTs) [xpod chA-9,10]', 'DISH WASHER WHS [emon ch11]', 'RANGE WHS (1-CT) [emon ch10]', and 'POOL PUMP WHS (2-CTs) [xpod chA-3,4]'; 'lighting' is missing; 'Miscellaneous' should map to the sum of 'Wattsup Energy Ent.Ctr (WHS)' and '2nd Fridge WHS [emon ch13]'; Total should map to 'TOTAL BLDG WHS [emon ch1,2]'. Additional hints come from a simple rule-based schema change hint generator (example rule: if the source tuple is at the hour level and the target tuple is at the month-hour level, use aggregation grouped by month and hour; the tool can also be replaced by a human expert), e.g., "Use row aggregation group by month, hour" and "Use column aggregation for total". A similar real-world test case (Case 98, Group 14) that has a successful prompt-response pair is retrieved from the historical successful prompt database by nearest-neighbor search over embedding vectors (Faiss) and supplied as a demonstration. The source schema is Source14_1(eiaid, time, raw_count, scaled_unit_count, net_site_electricity_kwh, electricity_heating_kwh, ..., electricity_hot_tub_pump_kwh); the target schema is Target_14(month, hour, HVAC, domestic_water_heating, major_appliances, lighting, miscellaneous_plug_loads, Total). The correct response is: SELECT EXTRACT(MONTH FROM time) AS month, EXTRACT(HOUR FROM time) AS hour, ... FROM Source14_1 GROUP BY month, hour; (part of the query omitted due to space limitation).]
Fig. 5. A Working Prompt for a Real-World Case (Case 100 in Group 14 in Tab. I).
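Filling such a template is a straightforward slot-filling exercise. Below is a minimal sketch of how a prompt like the one in Fig. 5 could be assembled; the function name `build_prompt` and the abbreviated template text are illustrative assumptions, not SQLMorpher's actual code:

```python
# Illustrative sketch of prompt assembly; the template is abridged from Fig. 5.
TEMPLATE = (
    "You are a SQL developer. Please generate a Postgres SQL script to convert "
    "the first table to be consistent with the format of the second table.\n"
    "First, you must create the first table named {source_table} with the given "
    "attribute names: {source_schema} and insert {k} rows into the first table.\n"
    "{sample_rows}\n"
    "Second, you must create a second table named {target_table} with only the "
    "given attributes: {target_schema}\n"
    "Finally, insert all rows from the first table into the second table.\n"
    "{source_explanation}\n{target_explanation}\n{hints}\n{demonstrations}"
)

def build_prompt(source_table, source_schema, sample_rows, target_table,
                 target_schema, source_explanation="", target_explanation="",
                 hints="", demonstrations=""):
    """Fill the template slots; leaving hint/demo slots empty yields the
    Prompt-1/2/3 variants described in Sec. IV-A."""
    return TEMPLATE.format(
        source_table=source_table,
        source_schema=", ".join(source_schema),
        k=len(sample_rows),
        sample_rows="\n".join(str(r) for r in sample_rows),
        target_table=target_table,
        target_schema=", ".join(target_schema),
        source_explanation=source_explanation,
        target_explanation=target_explanation,
        hints=hints,
        demonstrations=demonstrations)

prompt = build_prompt(
    "Source14_1", ["eiaid", "time", "electricity_cooling_kwh"],
    [("1", "8/22/12 17:00", 0.4)],
    "Target_14", ["month", "hour", "HVAC"],
    hints="Use row aggregation group by month, hour.")
```

In this sketch, richer prompts are produced simply by passing more context (schema explanations, hints, demonstrations) into the same template.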

B. SQL Execution

Compared to existing Text2SQL work, which focuses on read-only selection queries, leveraging an LLM to generate modification queries is more complicated, partly because running the generated query may raise security concerns. In the initial iteration for a given user request, the system automatically duplicates the source dataset into a separate PostgreSQL database that serves as a sandbox environment to isolate errors, if the duplicate does not already exist. This ensures that the generated code cannot corrupt the source dataset. Then, the script creates the target table. Finally, it runs the generated query to transform the entire source dataset into the target format and inserts all transformed tuples into the target table. If another iteration is needed, e.g., because the response cannot pass the validation tests, the target table is removed or archived before running the next iteration.

C. Validation

Validation in a production environment can be challenging due to the lack of ground truth. It requires an automatic quality measurement (e.g., unit test cases, self-consistency, or accuracy of downstream tasks) for the transformed data, which we leave for future work.

In this work, we manually prepare the ground truth transformation query for each transformation case in the experimental environment. At the validation stage, the ground truth transformation query is executed against the source table, resulting in a target table that is called the ground truth target table. At the same time, by executing the transformation query contained in the LLM response, as described in Sec. III-B, we also obtain a target table, which is called the generated target table.

We designed a validation script that compares the generated target table to the ground truth target table. The comparison first validates whether the two tables have the same number of attributes and tuples. Then, it performs attribute reordering and tuple sorting to ensure the two tables share the same column-wise and row-wise orderings. Furthermore, the script compares the similarity of the values for each attribute in the two tables. We use the ratio of the number of equivalent values (the difference should be less than e^-10) to the total number of values to measure the similarity of numerical attributes, and the Jaccard similarity to measure the similarity of categorical and text attributes. We average the similarity over all attributes to derive an overall similarity score. If the similarity score is below 1, the validation fails.

D. Iterative Prompt Optimization

This component is incorporated to evaluate the LLM's potential self-optimization capability for the data transformation problem. If the validation fails, the prompt is automatically augmented by identifying errors in the prompt: (1) errors mentioned in the LLM response or encountered when executing the generated transformation query; (2) errors detected in the transformed dataset, e.g., reporting the difference between the schema of the transformed dataset and the target schema; (3) inconsistencies between the schema change hint and the response query, e.g., reporting that the hint specifies to use aggregation but no Group-By or aggregation functions have been used. These errors are then appended to the prompt, the new prompt is sent back to the LLM, and the process repeats until the response passes the validation, the maximum number of iterations has been reached, or the new prompt is no different from the last prompt.

An example of a useful error message that we observed in such cases is "ERROR: INSERT has more expressions than target columns LINE 100: PCT HOURLY 2500". Before adding this error to the prompt, ChatGPT could not correctly handle an attribute, PCT HOURLY 2500, that exists in the source table but not in the target table. Adding the error to the prompt resolves the problem.

[Figure: an excerpt of the end-use categorization table from the smart building domain-specific database. It maps target end uses and sub end uses (HVAC: heating, cooling, furnace/AC fan, boiler pumps, kitchen range exhaust fan, bath exhaust fan; Domestic water heating; Major appliances: refrigerator, clothes washer, clothes dryer, dishwasher, cooking range, pool/spa pumps, pool/spa heaters; Lighting: interior, exterior; Miscellaneous plug loads; Total) to measured source channels such as BLDPWR "TOTAL BLDG WHS [emon ch1,2]", CMPPWR "AC COMPRESSOR WHS [emon ch3,4]", AHUPWR "AIR HANDLER WHS [emon ch5,6]", DWHPWR "WATER HEATER WHS [emon ch7,8]", DRY1CT "DRYER WHS (1-CT) [emon ch9]", RNG1CT "RANGE WHS (1-CT) [emon ch10]", DSHWSR "DISH WASHER WHS [emon ch11]", FRIDG1 "Primary Fridge WHS [emon ch12]", FRIDG2 "2nd Fridge WHS [emon ch13]", POOLPW "POOL PUMP WHS (2-CTs) [xpod chA-3,4]", MSPLIT "Minisplit WHS (2-CTs) [xpod chA-7,8]", DRY2CT "Dryer WHS (2-CTs) [xpod chA-9,10]", and WUPWHR "Wattsup Energy Ent.Ctr (WHS)".]
Fig. 6. Example information from the smart building domain-specific database specifies the mapping from source attributes to the target attributes for the example in Fig. 5.

IV. EXPERIMENTAL EVALUATION

In this section, we first describe the goal of the comparison study and all baselines that were used. Then, we present our benchmark, which is the first benchmark for smart building data standardization problems. We further describe the setup of the experiments and the evaluation metrics. Ultimately, we present and analyze the results and summarize key findings.

A. Comparison and Baselines

In this work, we mainly compare the effectiveness of six different types of initial prompt templates:
• Prompt-1: Basic prompt with a domain-specific description for the target schema.
• Prompt-2: Prompt-1 with a domain-specific description for the source schema.
• Prompt-3: Prompt-2 with schema change hints.
• Prompt-1+Demo: Prompt-1 with one demonstration.
• Prompt-2+Demo: Prompt-2 with one demonstration.
• Prompt-3+Demo: Prompt-3 with one demonstration.

The first three prompt templates are designed for zero-shot learning, when a database of abundant historical working prompts does not exist. The last three prompt templates are designed for one-shot learning.

We also considered comparing our approach to Auto-Pipeline [4], which is a state-of-the-art automatic data transformation tool that only requires schema and tuple examples of the source and target tables and applies deep reinforcement learning to synthesize the transformation pipeline.

B. Benchmark Design

1) Building Energy Data Transformation: We collected 105 data transformation examples in the smart building domain from 21 energy companies in the United States. These examples are divided into 15 groups so that each group has one target dataset and multiple source datasets of different types. Each source needs to be converted to the target format of its group. The groups are described in Tab. I. In Tab. II, we further show more statistics of the 105 test cases by group: (1) the number of distinct SQL keywords used in the ground truth query and (2) the length (i.e., number of characters) of the ground truth query. For each group, we compute the average of the above metrics over all cases in the group.

We document the following information in the benchmark: (1) the target schema and domain-specific explanations of its attributes; (2) for each source dataset, its schema, domain-specific explanations of its attributes, examples of instances, schema change hints for transforming the source table to the target format, and the ground truth query that transforms the source to the target. The benchmark dataset is open-sourced in a GitHub repository¹.

2) Other Benchmarks Used: We also used two other benchmarks that go beyond smart building data transformation, for different purposes. One commercial benchmark consists of 16 cases used by the Auto-Pipeline baseline. Since the Auto-Pipeline code is not publicly available, we apply our proposed approach (without and with domain-specific knowledge) to the benchmark and compare the results.

Another benchmark consists of four COVID-19 data transformation cases, which we used to further validate how well our methodology generalizes to other data transformation scenarios. It includes all four transformation cases observed in the GitHub commit history of a widely used real-world COVID-19 data repository maintained by Johns Hopkins University [31]. The attributes in the target data are (Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered), which represent state-level COVID-19 statistics. The source schemas of the first two cases involve county-level data with different numbers of columns, and the latter two cases involve state-level data with different column names and different numbers of columns.

Overall, we tested 125 cases across the three benchmarks, among which 27 cases involve attribute merging, 89 involve attribute name changes, 32 involve pivoting, 5 involve attribute flattening, 50 involve group-by and aggregation, and 8 involve joins.

C. Evaluation Metrics

We report the following metrics in the experimental study:
• Execution Accuracy: This metric is defined as the ratio of the number of correctly transformed cases to the total number of transformation cases. For each case, if the LLM returns a correct transformation query that passes the validation tests described in Sec. III-C within 5 iterations, it is considered a correctly transformed case.
• Column Similarity: As detailed in Sec. III-C, we compute a similarity score for each column in the transformed dataset against its corresponding column in the ground truth target dataset. We further define the column similarity per case as the average similarity score over all target attributes in the case, the column similarity per group as the average over all cases in the group, and the overall column similarity as the average over all cases in all groups. The similarity score is set to zero for cases that fail to generate output data for the similarity comparison.
• Number of Iterations to Success: For each case, we record the number of iterations used to achieve the correct response. We then record the average number of iterations to success over all successful cases (those that achieved a column similarity score of 1.0 within 5 iterations), both per group and over all groups. The latter is termed the overall number of iterations to success.

D. Experimental Setups

We implemented the end-to-end workflow illustrated in Fig. 2 as a Python script that uses the ChatGPT-3.5-turbo-16K model. We did not present results on ChatGPT-4 because the corresponding OpenAI API limited the total prompt-response size to 4K bytes at this point, which is insufficient for a significant portion of real-world cases. For example, the tables in Group 10 to Group 15 have up to 152 attributes, leading to a large prompt size. We set the temperature to zero to avoid randomness, for several reasons. First, a primary goal of this work is to evaluate the effectiveness of LLMs on data transformation tasks using different types of initial prompts, as well as the effectiveness of iterative prompt optimization; random responses would require additional methods (e.g., majority voting) for self-consistency, which would complicate the comparison. Second, setting the temperature to zero achieves better-quality results in most cases, according to a recent OpenAI article [32]. All SQL code is run on PostgreSQL version 15.0 for validation. All descriptions of source and target attributes are obtained from a domain-specific database, which is maintained by co-author Liang Zhang (example information from the database is illustrated in Fig. 6 and Fig. 4).

E. Smart Building Data Transformation Results

1) Overall Results: The zero-shot learning results for the smart building data transformation benchmark are illustrated in Tab. II. Using Prompt-3, our proposed SQLMorpher methodology achieved an execution accuracy of 96%, significantly higher than Prompt-1 and Prompt-2, which achieved execution accuracies of 28% and 36%, respectively. This demonstrates the importance of supplying domain-specific knowledge, particularly schema change hints, as part of the prompt to the LLM. The observation justifies integrating the LLM with the domain-specific knowledge base and the schema mapping tools in data transformation pipelines.

2) Effectiveness of One-shot Learning: For the four cases that failed with Prompt-3, we applied Prompt-1+Demo, Prompt-2+Demo, and Prompt-3+Demo to check whether providing one demonstration example involving a similar prompt and a correct response can improve the LLM response. The results are illustrated in Tab. III, which shows that prompts combining domain-specific knowledge and a demonstration can solve all four complicated cases that failed with Prompt-3.

3) Effectiveness of the Iterative Optimization Process: Compared to Prompt-1 and Prompt-2, we found that Prompt-3 gains significantly more from iterative prompt optimization. When using Prompt-1, five cases in three groups, Group-1, Group-4, and Group-7, benefit from iterative prompt optimization, with the average number of iterations being 1.2, 1.3, and 2.0, respectively, as illustrated in Tab. II. The other groups either have all cases pass in one iteration or have all cases fail. When using Prompt-2, four cases in three groups, Group-2, Group-3, and Group-7, require more than one iteration to succeed, with the average number of iterations being 1.4, 1.2, and 1.2, respectively. When using Prompt-3, 10 cases
TABLE I
DESCRIPTIONS OF BENCHMARK GROUPS

Groups 1-6: Daily Hour-Level Load Profile Transformation (10 sources each).
  Group 1 target: (Date, 1:00, 2:00, ..., 24:00); Date is of the format DOW MM/DD/YY, such as 'Fri 01/01/2016'.
  Group 2 target: (DT, DOW, HOURLY 0100, HOURLY 0200, ..., HOURLY 2400, HOURLY 2500); DT is in the MM/DD/YY format; DOW has values 1-7 corresponding to Mon-Sun; HOURLY 2500 is used for the leap second.
  Group 3 target: (Date, Hour1, Hour2, ..., Hour24); Date is of the format MM/DD/YYYY.
  Group 4 target: (Date, 1:00AM, 2:00AM, ..., 12:00PM); Date is of the format MM/DD/YYYY.
  Group 5 target: (Date, Value1, Value2, ..., Value24); Date is of the format MM/DD/YYYY.
  Group 6 target: (Date, Hr1, Hr2, ..., Hr24); Date is of the format MM/DD/YYYY.

Groups 7-8: Monthly Hour-Level Load Profile Transformation (10 sources each).
  Group 7 target: (Month, DayType, HR1, HR2, ..., HR24); Month has values such as January, February, etc.; DayType can be weekday or weekend.
  Group 8 target: (Hour, January, February, ..., December); Hour has values from 1 to 24; January records the average load in the corresponding hour in January, and the other columns are similar.

Sources for Groups 1-8 include load profiles captured per minute, per 5 minutes, per 10 minutes, and per hour, with different schemas and column names. Some example source schemas are as follows: Ex1. (DateTime, LoadValue), where DateTime is a timestamp such as '2/22/2018 0:30'. Ex2. (Segment, Date, 1:00, 2:00, ..., 24:00:00), where Date is in the format DOW MM/DD/YY such as 'Wed 01/01/2003', and Segment is an attribute that should be discarded from the relation.

Group 9: Seasonal Temperature Range Transformation (3 sources). Target: (Season, DayType, Hour, Temperature Range, Constant, Coefficient, Low End, High End); Season has values such as SPRING, SUMMER, FALL, and WINTER; DayType can be either WEEKDAY or WEEKEND; Low End is the lowest temperature and High End is the highest temperature. Sources 1 and 2 are hourly temperature data grouped in four and five ranges, respectively; Source 3 is seasonal temperature data grouped in three ranges.

Group 10: Daily Hour-Level Load Transformation by Detailed Enduse (3 sources). Target: (Datetime, HVAC, water heating, Refrigerator, Clothes washer, Clothes dryer, Dishwasher, Cooking range, Pool spa pumps, Interior lighting, Exterior lighting, Lighting Plug, Pool spa heater); the Datetime attribute has values in the format YYYY-MM-DD HH:00:00. The three sources are hourly datasets with 34, 151, and 32 attributes, respectively, mapped to 13 detailed end uses; for example, the sum of 'electricity pool pump kwh' and 'electricity hot tub pump kwh' in source-1 is mapped to Pool spa pumps in the target.

Group 11: Monthly Hour-Level Load Transformation by Detailed Enduse (4 sources). Target: (Month, Hour, plus the same end-use attributes as Group 10); Month is an integer from 1 to 12 and Hour is an integer from 0 to 24. Sources are similar to Group 10, with one additional seasonal source dataset having an End Use Category attribute whose values map to the 13 detailed end uses.

Group 12: Seasonal Hour-Level Load Transformation by Detailed Enduse (4 sources). Target: (Season, Hour, plus the same end-use attributes as Group 10); Season has values such as Spring, Summer, Fall, and Winter, and Hour is an integer from 0 to 24. Sources are similar to Group 11.

Group 13: Daily Hour-Level Load Transformation by High-Level Enduse (3 sources). Target: (Datetime, HVAC, Domestic water heating, Major appliances, Lighting, Miscellaneous plug loads, Total); similar to Group 10, except that this target has fewer (higher-level) end uses. Sources are similar to Group 10.

Group 14: Monthly Hour-Level Load Transformation by High-Level Enduse (4 sources). Target: (Month, Hour, HVAC, Domestic water heating, Major appliances, Lighting, Miscellaneous plug loads, Total); similar to Group 11, except that this target has fewer (higher-level) end uses. Sources are similar to Group 11.

Group 15: Seasonal Hour-Level Load Transformation by High-Level Enduse (4 sources). Target: (Season, Hour, HVAC, Domestic water heating, Major appliances, Lighting, Miscellaneous plug loads, Total); similar to Group 12, except that this target has fewer (higher-level) end uses. Sources are similar to Group 12.
in six groups, require more than one iteration to succeed. It means that 9.5% of the total cases can benefit from iterative prompt optimization when using Prompt-3.

F. Results on Benchmarks Beyond Smart Building

First, we tested our approach on the COVID-19 benchmark. The results are illustrated in Tab. IV, which shows that our proposed methodology resolves all four cases using only the basic prompt (Prompt-1).

Second, we also compared our proposed approach with the Auto-Pipeline approach, using its commercial benchmark [4]. The results are illustrated in Tab. V. They show that our proposed methodology achieved perfect execution accuracy on all 16 transformation problems in their benchmark using only the basic prompt, without additional domain-specific knowledge. The execution accuracy achieved by Auto-Pipeline on this benchmark is below 70% [4]. The comparison implies that our approach has great potential to outperform state-of-the-art automatic data transformation tools.

G. Summary of Key Findings

• Large language models are promising for automatically resolving complicated smart building data transformation cases if domain-specific knowledge is available and easily retrievable. We achieved 96% accuracy on our proposed benchmark, which consists of 105 real-world smart building cases.
proposed methodology achieved perfect execution accuracy consists of 105 real-world smart building cases.
TABLE II
COMPARISON OF EXECUTION ACCURACY USING DIFFERENT PROMPT TEMPLATES WITH ZERO-SHOT LEARNING (GRP STANDS FOR GROUP)

                  Grp-1 Grp-2 Grp-3 Grp-4 Grp-5 Grp-6 Grp-7 Grp-8 Grp-9 Grp-10 Grp-11 Grp-12 Grp-13 Grp-14 Grp-15
#keywords avg.    15.6  18.7  19.6  19.6  16.4  16.3  24.2  26.7  23.7  13.7   20.3   28.0   5.0    25.3   31.5
length avg.       1802  2826  1957  2023  1731  1713  1548  3239  1034  1712   2085   2365   1412   1732   1918
Prompt 1. Overall execution accuracy: 29/105 (28%); overall column similarity score: 0.4; overall iterations to success: 1.3
exec acc          6/10  2/10  4/10  6/10  3/10  2/10  6/10  0/10  0/3   0/3    0/4    0/4    0/3    0/4    0/4
sim score avg.    0.7   0.5   0.6   0.6   0.3   0.3   0.8   0.0   0.0   0.6    0.3    0.0    0.3    0.0    0.0
iter-to-succ avg. 1.2   1.0   1.0   1.3   1.0   1.0   2.0   -     -     -      -      -      -      -      -
Prompt 2. Overall execution accuracy: 38/105 (36%); overall column similarity score: 0.5; avg iterations to success: 1.1
exec acc          6/10  6/10  7/10  6/10  3/10  2/10  8/10  0/10  0/3   0/3    0/4    0/4    0/3    0/4    0/4
sim score avg.    0.7   0.6   0.7   0.6   0.4   0.3   0.9   0.0   0.4   0.8    0.7    0.2    0.3    0.1    0.0
iter-to-succ avg. 1.0   1.4   1.2   1.0   1.0   1.0   1.2   -     -     -      -      -      -      -      -
Prompt 3. Overall execution accuracy: 101/105 (96%); overall column similarity score: 0.96; avg iterations to success: 1.2
exec acc          10/10 10/10 10/10 10/10 10/10 10/10 10/10 9/10  3/3   3/3    4/4    3/4    3/3    2/4    4/4
sim score avg.    1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.9   1.0   1.0    1.0    0.8    1.0    0.5    1.0
iter-to-succ avg. 1.0   1.0   1.1   1.0   1.7   1.6   1.0   1.0   1.0   1.3    1.5    1.0    1.3    1.0    1.0
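As a sanity check, the overall execution accuracy reported in Tab. II can be recomputed from the per-group exec acc counts. A small sketch; the per-group pairs below are transcribed from the Prompt-3 row of Tab. II, and the helper name is illustrative:

```python
def overall_accuracy(per_group):
    """Sum per-group (passed, total) pairs into an overall execution accuracy."""
    passed = sum(p for p, _ in per_group)
    total = sum(t for _, t in per_group)
    return passed, total, passed / total

# Prompt-3 per-group results (Groups 1-15) from Tab. II
prompt3 = [(10, 10)] * 7 + [(9, 10), (3, 3), (3, 3), (4, 4),
                            (3, 4), (3, 3), (2, 4), (4, 4)]
passed, total, acc = overall_accuracy(prompt3)  # 101, 105, ~0.96
```

This reproduces the 101/105 (96%) figure quoted for Prompt-3; the same helper applied to the Prompt-1 and Prompt-2 rows yields 29/105 and 38/105.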

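The sim score avg. rows in Tab. II follow the measure defined in Sec. III-C: an exact-match ratio (within a small tolerance) for numerical columns, Jaccard similarity for categorical or text columns, averaged over attributes. A minimal sketch of that measure, with illustrative function names rather than the authors' actual validation script:

```python
def column_similarity(generated, truth, tol=1e-10):
    """Similarity of one column: exact-match ratio for numeric values,
    Jaccard similarity for categorical/text values."""
    if all(isinstance(v, (int, float)) for v in generated + truth):
        equal = sum(1 for g, t in zip(generated, truth) if abs(g - t) < tol)
        return equal / max(len(truth), 1)
    return len(set(generated) & set(truth)) / len(set(generated) | set(truth))

def table_similarity(gen_cols, truth_cols):
    """Average the per-column scores; any overall score below 1 fails validation."""
    scores = [column_similarity(g, t) for g, t in zip(gen_cols, truth_cols)]
    return sum(scores) / len(scores)

# One matching numeric column plus one half-overlapping text column
score = table_similarity(
    [[1.0, 2.0], ["HVAC", "Total"]],
    [[1.0, 2.0], ["HVAC", "Lighting"]])
```

Here the numeric column scores 1.0 and the text column scores 1/3 by Jaccard, so the table-level score falls below 1 and the case would fail validation.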
TABLE III
AVERAGE COLUMN SIMILARITY SCORE WITH ONE-SHOT LEARNING

Cases Failed with Prompt-3   Prompt-1+Demo  Prompt-2+Demo  Prompt-3+Demo
Case 78 (in Group 8)         1.00           1.00           1.00
Case 92 (in Group 12)        0.00           0.27           1.00
Case 100 (in Group 14)       0.38           0.50           1.00
Case 101 (in Group 14)       0.50           0.50           1.00

TABLE IV
PROMPT COMPARISON FOR COVID-19 BENCHMARK (#KEYWORDS AVG.: 5, LENGTH AVG.: 277)

                  Prompt-1  Prompt-2  Prompt-3
exec acc          4/4       4/4       4/4
sim score avg.    1.0       1.0       1.0
iter-to-succ avg. 1.0       1.0       1.0

TABLE V
COMPARISON TO AUTO-PIPELINE ON THEIR COMMERCIAL BENCHMARK (#KEYWORDS AVG.: 8, LENGTH AVG.: 566)

                  Prompt-1  Prompt-2  Prompt-3  Auto-Pipeline [4]
exec acc          13/16     16/16     16/16     11/16 [4]
sim score avg.    0.83      1.00      1.00      -
iter-to-succ avg. 1.00      1.00      1.00      -

[Figure: a grouped bar chart of execution accuracy (0% to 100%) for Prompt-1, Prompt-2, and Prompt-3 across six schema change categories: attribute merging, attribute name change, attribute pivoting, attribute flattening, group-by and aggregation, and join.]
Fig. 7. The overall execution accuracy of cases in each schema change category. (We considered all 125 cases in three benchmarks; each case may involve multiple types of schema changes.)
• Our SQLMorpher methodology is promising in generalizing to other data transformation cases and in outperforming state-of-the-art automatic data transformation tools that do not rely on LLMs. In particular, our methodology defines clean interfaces for integrating domain-specific knowledge into the data transformation process through the prompt generation process; this is a missing feature in state-of-the-art data transformation tools. The evaluation results on the commercial benchmark used by Auto-Pipeline showed that our approach, even without using any domain-specific knowledge, could achieve significantly better execution accuracy than Auto-Pipeline (81% vs. 69%). One observation is that while the LLM generates SQL code, Auto-Pipeline attempts to learn a pipeline of data transformation operators; the latter has a more limited search space, which may affect the execution accuracy.
• Compared to other domain-specific knowledge, a high-level schema change hint, such as column mapping relationships or instructions as simple as "use aggregation", is critical to the success of our proposed methodology.
• We further classified each of the 125 cases from all three benchmarks into one or more schema change types and counted the execution accuracy for each type of schema change, as illustrated in Fig. 7. We observe that while Prompt-3 with schema change hints can handle all schema change types well, Prompt-1 and Prompt-2 without schema change hints achieved relatively better accuracy (40% to 100%) for attribute name changes, attribute flattening, and joins than for other types of changes, such as attribute merging, attribute pivoting, and group-by/aggregation. This further verifies the importance of incorporating high-level schema change hints such as "use aggregation" and "use pivoting".
• Zero-shot learning is effective in resolving most data transformation problems investigated in this work. Few-shot learning can resolve the difficult cases that fail with zero-shot learning.
• The iterative optimization framework, which simply enhances the prompt with ChatGPT-reported errors or SQL execution errors in each iteration, can benefit 9.5% of cases when using Prompt-3 and 5% of cases when using Prompt-1 and Prompt-2.
• The examples in our proposed building energy data transformation benchmark are significantly more complicated than existing benchmarks in terms of the number of distinct keywords and the length of the transformation query. They are used in the real world but are missing in existing data transformation benchmarks [4], [5].

V. CONCLUSION AND FUTURE WORKS

In this work, we pioneered an experimental feasibility study of applying LLMs to data transformation problems. We proposed a novel approach, SQLMorpher, that uses an LLM to generate SQL modification queries for data transformation. SQLMorpher is designed to incorporate domain knowledge flexibly and to optimize prompts iteratively. We provided a unique benchmark for building energy data transformation, including 105 real-world cases collected from 21 energy companies in the United States. The results are promising, achieving up to 96% accuracy on the benchmark. In addition, we found that our system can generalize to scenarios beyond building energy data. The commercial benchmark results demonstrate that our approach is able to significantly outperform existing automatic data transformation techniques. In summary, SQLMorpher is promising for enabling the automatic integration of diverse data sources for building energy management and may benefit other domains. In the future, we will design quality control for SQLMorpher to further reduce human validation involvement in production environments.

REFERENCES

[1] "U.S. energy consumption by source and sector, 2022." https://www.eia.gov/totalenergy/data/monthly/pdf/flow/total energy 2022.pdf.
[2] G. Pinto, Z. Wang, A. Roy, T. Hong, and A. Capozzoli, "Transfer learning for smart buildings: A critical review of algorithms, applications, and future perspectives," Advances in Applied Energy, vol. 5, p. 100084, 2022.
[3] Z. Jin, Y. He, and S. Chaudhuri, "Auto-transform: learning-to-transform by patterns," Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 2368-2381, 2020.
[4] J. Yang, Y. He, and S. Chaudhuri, "Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search," Proceedings of the VLDB Endowment, vol. 14, no. 11, pp. 2563-2575, 2021.
[5] P. Li, Y. He, C. Yan, Y. Wang, and S. Chaudhuri, "Auto-tables: Synthesizing multi-step transformations to relationalize tables without using examples," arXiv preprint arXiv:2307.14565, 2023.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[7] Z. Wang, L. Zhou, A. Das, V. Dave, Z. Jin, and J. Zou, "Survive the schema changes: integration of unmanaged data using deep learning," arXiv preprint arXiv:2010.07586, 2020.
[8] I. Trummer, "Codexdb: Synthesizing code for query processing from natural language instructions using gpt-3 codex," Proceedings of the VLDB Endowment, vol. 15, no. 11, pp. 2921-2928, 2022.
[9] Z. Gu, J. Fan, N. Tang, S. Zhang, Y. Zhang, Z. Chen, L. Cao, G. Li, S. Madden, and X. Du, "Interleaving pre-trained language models and large language models for zero-shot nl2sql generation," arXiv preprint
[11] O. Popescu, I. Manotas, N. P. A. Vo, H. Yeo, E. Khorashani, and V. Sheinin, "Addressing limitations of encoder-decoder based approach to text-to-sql," in Proceedings of the 29th International Conference on Computational Linguistics, pp. 1593-1603, 2022.
[12] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[13] G. Fan, J. Wang, Y. Li, D. Zhang, and R. J. Miller, "Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning," Proceedings of the VLDB Endowment, vol. 16, no. 7, pp. 1726-1739, 2023.
[14] Y. Dong, K. Takeoka, C. Xiao, and M. Oyamada, "Efficient joinable table discovery in data lakes: A high-dimensional similarity-based approach," in 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 456-467, IEEE, 2021.
[15] L. Wang, S. Zhang, J. Shi, L. Jiao, O. Hassanzadeh, J. Zou, and C. Wangz, "Schema management for document stores," Proceedings of the VLDB Endowment, vol. 8, no. 9, pp. 922-933, 2015.
[16] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task," arXiv preprint arXiv:1809.08887, 2018.
[17] T. Shi, C. Zhao, J. Boyd-Graber, H. Daumé III, and L. Lee, "On the potential of lexico-logical alignments for semantic parsing to sql queries," arXiv preprint arXiv:2010.11246, 2020.
[18] X. Yu, T. Chen, Z. Yu, H. Li, Y. Yang, X. Jiang, and A. Jiang, "Dataset and enhanced model for eligibility criteria-to-sql semantic parsing," in 12th International Conference on Language Resources and Evaluation (LREC), 2020.
[19] C.-H. Lee, O. Polozov, and M. Richardson, "Kaggledbqa: Realistic evaluation of text-to-sql parsers," arXiv preprint arXiv:2106.11455, 2021.
[20] Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker, "Dataxformer: A robust transformation discovery system," in 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 1134-1145, IEEE, 2016.
[21] S. Gulwani, "Automating string processing in spreadsheets using input-output examples," ACM Sigplan Notices, vol. 46, no. 1, pp. 317-330, 2011.
[22] Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri, "Transform-data-by-example (tde): an extensible search engine for data transformations," Proceedings of the VLDB Endowment, vol. 11, no. 10, pp. 1165-1177, 2018.
[23] J. Heer, J. M. Hellerstein, and S. Kandel, "Predictive interaction for data transformation," in CIDR, Citeseer, 2015.
[24] Z. Jin, M. Cafarella, H. Jagadish, S. Kandel, M. Minar, and J. M. Hellerstein, "Clx: Towards verifiable pbe data transformation," arXiv preprint arXiv:1803.00701, 2018.
[25] R. Singh, "Blinkfill: Semi-supervised programming by example for syntactic string transformations," Proceedings of the VLDB Endowment, vol. 9, no. 10, pp. 816-827, 2016.
[26] Z. Jin, M. R. Anderson, M. Cafarella, and H. Jagadish, "Foofah: Transforming data by example," in Proceedings of the 2017 ACM International Conference on Management of Data, pp. 683-698, 2017.
[27] E. Zhu, Y. He, and S. Chaudhuri, "Auto-join: Joining tables by leveraging transformations," Proceedings of the VLDB Endowment, vol. 10, no. 10, pp. 1034-1045, 2017.
[28] C. Zuo, S. Assadi, and D. Deng, "Spine: Scaling up programming-by-negative-example for string filtering and transformation," in Proceedings of the 2022 International Conference on Management of Data, pp. 521-530, 2022.
[29] L. T. Becker and E. M. Gould, "Microsoft power bi: Extending excel to manipulate, analyze, and visualize diverse data," Serials Review, vol. 45, no. 3, pp. 184-188, 2019.
[30] Trifacta, "Trifacta wrangler," 2020.
[31] "Covid-19 data repository by the center for systems science and engineering (csse) at Johns Hopkins University." https://github.com/CSSEGISandData/COVID-19.
arXiv:2306.08891, 2023. [32] “Codex models and azure openai service.” https://learn.microsoft.com/
[10] Z. Gu, J. Fan, N. Tang, L. Cao, B. Jia, S. Madden, and X. Du, “Few- en-us/azure/ai-services/openai/how-to/work-with-code.
shot text-to-sql translation using structure and content prompt learning,”
Proceedings of the ACM on Management of Data, vol. 1, no. 2, pp. 1–28,
2023.
