HeatWave User Guide
Abstract
This document describes how to use HeatWave. It covers how to load data, run queries, optimize analytics
workloads, and use HeatWave machine learning capabilities.
For information about creating and managing a HeatWave Cluster on Oracle Cloud Infrastructure (OCI), see
HeatWave on OCI Service Guide.
For information about creating and managing a HeatWave Cluster on Amazon Web Services (AWS), see
HeatWave on AWS Service Guide.
For information about creating and managing a HeatWave Cluster on Oracle Database Service for Azure (ODSA),
see HeatWave for Azure Service Guide.
For information about the latest HeatWave features and updates, refer to the HeatWave Release Notes.
For help with using MySQL, please visit the MySQL Forums, where you can discuss your issues with other
MySQL users.
Preface and Legal Notices
This is the user manual for HeatWave.
Legal Notices
Copyright © 1997, 2024, Oracle and/or its affiliates.
License Restrictions
This software and related documentation are provided under a license agreement containing
restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly
permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate,
broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any
form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless
required by law for interoperability, is prohibited.
Warranty Disclaimer
The information contained herein is subject to change without notice and is not warranted to be error-
free. If you find any errors, please report them to us in writing.
If this is software, software documentation, data (as defined in the Federal Acquisition Regulation), or
related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the
U.S. Government, then the following notice is applicable:
U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated
software, any programs embedded, installed, or activated on delivered hardware, and modifications
of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed
by U.S. Government end users are "commercial computer software," "commercial computer software
documentation," or "limited rights data" pursuant to the applicable Federal Acquisition Regulation and
agency-specific supplemental regulations. As such, the use, reproduction, duplication, release, display,
disclosure, modification, preparation of derivative works, and/or adaptation of i) Oracle programs
(including any operating system, integrated software, any programs embedded, installed, or activated
on delivered hardware, and modifications of such programs), ii) Oracle computer documentation and/
or iii) other Oracle data, is subject to the rights and limitations specified in the license contained in
the applicable contract. The terms governing the U.S. Government's use of Oracle cloud services
are defined by the applicable contract for such services. No other rights are granted to the U.S.
Government.
This software or hardware is developed for general use in a variety of information management
applications. It is not developed or intended for use in any inherently dangerous applications, including
applications that may create a risk of personal injury. If you use this software or hardware in dangerous
applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and
other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any
damages caused by use of this software or hardware in dangerous applications.
Trademark Notice
Oracle, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other
names may be trademarks of their respective owners.
Intel and Intel Inside are trademarks or registered trademarks of Intel Corporation. All SPARC
trademarks are used under license and are trademarks or registered trademarks of SPARC
International, Inc. AMD, Epyc, and the AMD logo are trademarks or registered trademarks of Advanced
Micro Devices. UNIX is a registered trademark of The Open Group.
This software or hardware and documentation may provide access to or information about content,
products, and services from third parties. Oracle Corporation and its affiliates are not responsible
for and expressly disclaim all warranties of any kind with respect to third-party content, products,
and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle
Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to
your access to or use of third-party content, products, or services, except as set forth in an applicable
agreement between you and Oracle.
This documentation is NOT distributed under a GPL license. Use of this documentation is subject to the
following terms:
You may create a printed copy of this documentation solely for your own personal use. Conversion
to other formats is allowed as long as the actual content is not altered or edited in any way. You shall
not publish or distribute this documentation in any form or on any media, except if you distribute the
documentation in a manner similar to how Oracle disseminates it (that is, electronically for download
on a Web site with the software) or on a CD-ROM or similar medium, provided however that the
documentation is disseminated together with the software on the same medium. Any other use, such
as any dissemination of printed copies or use of this documentation, in whole or in part, in another
publication, requires the prior written consent from an authorized representative of Oracle. Oracle and/
or its affiliates reserve any and all rights to this documentation not expressly granted above.
Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle Accessibility Program
website at
http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Chapter 1 Overview
Table of Contents
1.1 HeatWave Architectural Features
1.2 HeatWave MySQL
1.3 HeatWave AutoML
1.4 HeatWave Lakehouse
1.5 HeatWave Autopilot
1.6 MySQL Functionality for HeatWave
HeatWave is a massively parallel, high performance, in-memory query accelerator that accelerates
MySQL performance by orders of magnitude for analytics workloads, mixed workloads, and machine
learning. HeatWave can be accessed through Oracle Cloud Infrastructure (OCI), Amazon Web
Services (AWS), and Oracle Database Service for Azure (ODSA).
HeatWave consists of a MySQL DB System and HeatWave nodes. Analytics queries that meet certain
prerequisites are automatically offloaded from the MySQL DB System to the HeatWave Cluster
for accelerated processing. With a HeatWave Cluster, you can run online transaction processing
(OLTP), online analytical processing (OLAP), and mixed workloads from the same MySQL database
without requiring extract, transfer, and load (ETL), and without modifying your applications. For more
information about the analytical capabilities of HeatWave, see Chapter 2, HeatWave MySQL.
The MySQL DB System includes a HeatWave plugin that is responsible for cluster management, query
scheduling, and returning query results to the MySQL DB System. The HeatWave nodes store data in
memory and process analytics and machine learning queries. Each HeatWave node hosts an instance
of the HeatWave query processing engine (RAPID).
Enabling a HeatWave Cluster also provides access to HeatWave AutoML, which is a fully managed,
highly scalable, cost-efficient, machine learning solution for data stored in MySQL. HeatWave AutoML
provides a simple SQL interface for training and using predictive machine learning models, which can
be used by novice and experienced ML practitioners alike. Machine learning expertise, specialized
tools, and algorithms are not required. With HeatWave AutoML, you can train a model with a single call
to an SQL routine. Similarly, you can generate predictions with a single CALL or SELECT statement
which can be easily integrated with your applications.
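For example, the following statements sketch the train-and-predict flow using the HeatWave AutoML routines (the schema, table, and target column names here are illustrative):
mysql> CALL sys.ML_TRAIN('ml_data.census_train', 'revenue',
          JSON_OBJECT('task', 'classification'), @census_model);
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
mysql> CALL sys.ML_PREDICT_TABLE('ml_data.census_test', @census_model,
          'ml_data.census_predictions');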
Scale-Out Data Management
Changes to analytics data on the MySQL DB System are automatically propagated to HeatWave nodes
in real time, which means that queries always have access to the latest data. Change propagation is
performed automatically by a light-weight algorithm.
Users and applications interact with HeatWave through the MySQL DB System using standard tools
and standard-based ODBC/JDBC connectors. HeatWave supports the same ANSI SQL standard and
ACID properties as MySQL and the most commonly used data types. This support enables existing
applications to use HeatWave without modification, allowing for quick and easy integration.
The number of HeatWave nodes required depends on data size and the amount of compression
that is achieved when loading data into the HeatWave Cluster. A HeatWave Cluster in Oracle Cloud
Infrastructure (OCI) or Oracle Database Service for Azure (ODSA) supports up to 64 nodes. On
Amazon Web Services (AWS), a HeatWave Cluster supports up to 128 nodes.
On Oracle Cloud Infrastructure (OCI), data that is loaded into HeatWave is automatically persisted to
OCI Object Storage, which allows data to be reloaded quickly when the HeatWave Cluster resumes
after a pause or when the HeatWave Cluster recovers from a cluster or node failure.
1.4 HeatWave Lakehouse
• Supports structured and semi-structured relational data in the following file formats:
• Avro.
• CSV.
• JSON.
• Parquet.
• With this feature, users can now analyze data in both InnoDB and an object store using familiar SQL syntax in the same query.
See: Chapter 4, HeatWave Lakehouse. To use Lakehouse with HeatWave AutoML, see: Section 3.12,
“HeatWave AutoML and Lakehouse”.
1.5 HeatWave Autopilot
System Setup
• Auto Provisioning
Estimates the number of HeatWave nodes required by sampling the data, which means that manual
cluster size estimations are not necessary.
• For HeatWave on OCI, see Generating a Node Count Estimate in the HeatWave on OCI Service
Guide.
• For HeatWave on AWS, see Estimating Cluster Size with HeatWave Autopilot in the HeatWave on
AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure Service
Guide.
• Auto Shape Prediction
For HeatWave on AWS, the Auto Shape Prediction feature in HeatWave Autopilot uses MySQL
statistics for the workload to assess the suitability of the current shape. Auto Shape Prediction
provides prompts to upsize the shape and improve system performance, or to downsize the shape if
the system is under-utilized. See: Autopilot Shape Advisor in the HeatWave on AWS Service Guide.
Data Load
• Auto Parallel Load
Optimizes load time and memory usage by predicting the optimal degree of parallelism for each table
loaded into HeatWave. See: Section 2.2.3, “Loading Data Using Auto Parallel Load”.
• Auto Encoding
Determines the optimal encoding for string column data, which minimizes the required cluster size
and improves query performance. See: Section 2.8.2, “Auto Encoding”.
• Auto Data Placement
Recommends how tables should be partitioned in memory to achieve the best query performance,
and estimates the expected performance improvement. See: Section 2.8.3, “Auto Data Placement”.
• Auto Compression
HeatWave and HeatWave Lakehouse can compress data stored in memory using different
compression algorithms. To minimize memory usage while providing the best query performance,
auto compression dynamically determines the compression algorithm to use for each column based
on its data characteristics. Auto compression employs an adaptive sampling technique during
the data loading process, and automatically selects the optimal compression algorithm without
user intervention. Algorithm selection is based on the compression ratio and the compression and
decompression rates, which balance the memory needed to store the data in HeatWave with query
execution time. See: Section 2.2.6, “Data Compression”.
Query Execution
• Auto Query Plan Improvement
Uses statistics from previously executed queries to improve future query execution plans. See:
Section 2.3.4, “Auto Query Plan Improvement”.
• Adaptive Query Optimization
Adaptive query optimization automatically improves query performance and memory consumption,
and mitigates skew-related performance issues as well as out of memory errors. It uses various
statistics to adjust data structures and system resources after query execution has started. It
independently optimizes query execution for each HeatWave node based on actual data distribution
at runtime. This helps improve the performance of ad hoc queries by up to 25%. The HeatWave
optimizer generates a physical query plan based on statistics collected by Autopilot. During query
execution, each HeatWave node executes the same query plan. With adaptive query execution,
each individual HeatWave node adjusts the local query plan based on statistics such as cardinality
and distinct value counts of intermediate relations collected locally in real-time. This allows each
HeatWave node to tailor the data structures that it needs, resulting in better query execution time,
lower memory usage, and improved data skew-related performance.
• Auto Query Time Estimation
Estimates query execution time, allowing you to determine how a query might perform without
having to run the query. Runtime estimates are provided by the Advisor Query Insights feature. See:
Section 2.8.4, “Query Insights”.
• Auto Change Propagation
For HeatWave on OCI, Auto Change Propagation intelligently determines the optimal time when
changes to data on the MySQL DB System should be propagated to the HeatWave Storage Layer.
• Auto Scheduling
Prioritizes queries in an intelligent way to reduce overall query execution wait times. See:
Section 2.3.3, “Auto Scheduling”.
• Auto Thread Pooling
Queues incoming transactions to give sustained throughput during high transaction concurrency.
Where multiple clients are running queries concurrently, Auto Thread Pooling applies workload-aware
admission control to eliminate resource contention caused by too many waiting
transactions. Auto Thread Pooling automatically manages the settings for the thread pool
control variables thread_pool_size, thread_pool_max_transactions_limit, and
thread_pool_query_threads_per_group. For details of how the thread pool works, see
Thread Pool Operation.
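You can inspect the thread pool settings that Auto Thread Pooling manages with a standard variables query, for example:
mysql> SHOW VARIABLES LIKE 'thread_pool%';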
Failure Handling
• Auto Error Recovery
For HeatWave on OCI, Auto Error Recovery recovers a failed node or provisions a new one and
reloads data from the HeatWave storage layer when a HeatWave node becomes unresponsive due
to a software or hardware failure. See: HeatWave Cluster Failure and Recovery in the HeatWave on
OCI Service Guide.
For HeatWave on AWS, Auto Error Recovery recovers a failed node and reloads data from the
MySQL DB System when a HeatWave node becomes unresponsive due to a software failure.
1.6 MySQL Functionality for HeatWave
• See Section 2.13, “SELECT Statement” for the QUALIFY and TABLESAMPLE clauses.
Chapter 2 HeatWave MySQL
Table of Contents
2.1 Before You Begin
2.2 Loading Data to HeatWave MySQL
2.2.1 Prerequisites
2.2.2 Loading Data Manually
2.2.3 Loading Data Using Auto Parallel Load
2.2.4 Monitoring Load Progress
2.2.5 Checking Load Status
2.2.6 Data Compression
2.2.7 Change Propagation
2.2.8 Reload Tables
2.3 Running Queries
2.3.1 Query Prerequisites
2.3.2 Running Queries
2.3.3 Auto Scheduling
2.3.4 Auto Query Plan Improvement
2.3.5 Debugging Queries
2.3.6 Query Runtimes and Estimates
2.3.7 CREATE TABLE ... SELECT Statements
2.3.8 INSERT ... SELECT Statements
2.3.9 Using Views
2.4 Modifying Tables
2.5 Unloading Data from HeatWave MySQL
2.5.1 Unloading Tables
2.5.2 Unloading Partitions
2.5.3 Unloading Data Using Auto Unload
2.5.4 Unload All Tables
2.6 Table Load and Query Example
2.7 Workload Optimization
2.7.1 Encoding String Columns
2.7.2 Defining Data Placement Keys
2.8 Workload Optimization using Advisor
2.8.1 Advisor Syntax
2.8.2 Auto Encoding
2.8.3 Auto Data Placement
2.8.4 Query Insights
2.8.5 Unload Advisor
2.8.6 Advisor Command-line Help
2.8.7 Advisor Report Table
2.9 Best Practices
2.9.1 Preparing Data
2.9.2 Provisioning
2.9.3 Importing Data into the MySQL DB System
2.9.4 Inbound Replication
2.9.5 Loading Data
2.9.6 Auto Encoding and Auto Data Placement
2.9.7 Running Queries
2.9.8 Monitoring
2.9.9 Reloading Data
2.10 Supported Data Types
2.11 Supported SQL Modes
2.12 Supported Functions and Operators
2.12.1 Aggregate Functions
2.12.2 Arithmetic Operators
2.12.3 Cast Functions and Operators
2.12.4 Comparison Functions and Operators
2.12.5 Control Flow Functions and Operators
2.12.6 Data Masking and De-Identification Functions
2.12.7 Encryption and Compression Functions
2.12.8 JSON Functions
2.12.9 Logical Operators
2.12.10 Mathematical Functions
2.12.11 String Functions and Operators
2.12.12 Temporal Functions
2.12.13 Window Functions
2.13 SELECT Statement
2.14 String Column Encoding Reference
2.14.1 Variable-length Encoding
2.14.2 Dictionary Encoding
2.14.3 Column Limits
2.15 Troubleshooting
2.16 Metadata Queries
2.16.1 Secondary Engine Definitions
2.16.2 Excluded Columns
2.16.3 String Column Encoding
2.16.4 Data Placement
2.17 Bulk Ingest Data to MySQL Server
2.18 HeatWave MySQL Limitations
2.18.1 Change Propagation Limitations
2.18.2 Data Type Limitations
2.18.3 Functions and Operator Limitations
2.18.4 Index Hint and Optimizer Hint Limitations
2.18.5 Join Limitations
2.18.6 Partition Selection Limitations
2.18.7 Variable Limitations
2.18.8 Bulk Ingest Data to MySQL Server Limitations
2.18.9 Other Limitations
When a HeatWave Cluster is enabled, queries that meet certain prerequisites are automatically
offloaded from the MySQL DB System to the HeatWave Cluster for accelerated processing.
Queries are issued from a MySQL client or application that interacts with the HeatWave Cluster by
connecting to the MySQL DB System. Results are returned to the MySQL DB System and to the
MySQL client or application that issued the query.
Manually loading data into HeatWave involves preparing tables on the MySQL DB System and
executing load statements. See Section 2.2.2, “Loading Data Manually”. The Auto Parallel Load utility
facilitates the process of loading data by automating required steps and optimizing the number of
parallel load threads. See Section 2.2.3, “Loading Data Using Auto Parallel Load”.
For HeatWave on AWS, load data into HeatWave using the HeatWave Console. See Manage Data in
HeatWave with Workspaces in the HeatWave on AWS Service Guide.
For HeatWave for Azure, see Importing Data to HeatWave in the HeatWave for Azure Service Guide.
When HeatWave loads a table, the data is sharded and distributed among HeatWave nodes. After a
table is loaded, DML operations on the table are automatically propagated to the HeatWave nodes.
No user action is required to synchronize data. For more information, see Section 2.2.7, “Change
Propagation”.
On Oracle Cloud Infrastructure (OCI), data loaded into HeatWave, including propagated changes, is
automatically persisted by the HeatWave Storage Layer to OCI Object Storage for fast recovery in
case of a HeatWave node or cluster failure. For HeatWave on AWS, data is recovered from the MySQL
DB System.
After running a number of queries, you can use the HeatWave Advisor to optimize your workload.
Advisor analyzes your data and query history to provide string column encoding and data placement
recommendations. See Section 2.8, “Workload Optimization using Advisor”.
2.1 Before You Begin
Before you begin, ensure that you have the following:
• An operational MySQL DB System and the ability to connect to it using a MySQL client. If not, refer to the
following procedures:
• For HeatWave on OCI, see Creating a DB System, and Connecting to a DB System in the
HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a DB System, and Connecting from a Client in the
HeatWave on AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave and Connecting to HeatWave in the
HeatWave for Azure Service Guide.
• The MySQL DB System has an operational HeatWave Cluster. If not, refer to the following
procedures:
• For HeatWave on OCI, see Adding a HeatWave Cluster in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a HeatWave Cluster in the HeatWave on AWS Service
Guide.
• For HeatWave for Azure, see Provisioning HeatWave in the HeatWave for Azure Service Guide.
2.2 Loading Data to HeatWave MySQL
Data can be loaded in the following ways:
• Loading data manually. This method loads one table at a time and involves executing multiple
statements for each table. See Section 2.2.2, “Loading Data Manually”.
• Loading data using Auto Parallel Load. This HeatWave Autopilot enabled method loads one or more
schemas at a time and facilitates loading by automating manual steps and optimizing the number of
parallel load threads for a faster load. See Section 2.2.3, “Loading Data Using Auto Parallel Load”.
• For users of HeatWave on AWS, load data using the HeatWave Console. This GUI-based and
HeatWave Autopilot enabled method loads selected schemas and tables using an optimized number
of parallel load threads for a faster load. See Manage Data in HeatWave with Workspaces in the
HeatWave on AWS Service Guide.
HeatWave loads data with batched, multi-threaded reads from InnoDB. HeatWave then converts the
data into columnar format and sends it over the network to distribute it among HeatWave nodes in
horizontal slices. HeatWave partitions data by the table primary key, unless the table definition includes
data placement keys. See Section 2.7.2, “Defining Data Placement Keys”.
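For example, a data placement key can be defined with the RAPID_COLUMN column comment attribute (a sketch; the table and column names are illustrative):
mysql> ALTER TABLE orders MODIFY o_custkey INT
       COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=1';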
Concurrent DML operations and queries on the MySQL node are supported while a data load operation
is in progress; however, concurrent operations on the MySQL node can affect load performance and
vice versa.
After tables are loaded, changes to table data on the MySQL DB System node are automatically
propagated to HeatWave. For more information, see Section 2.2.7, “Change Propagation”.
For each table that is loaded in HeatWave, 4MB of memory (the default heap segment size) is
allocated from the root heap. This memory requirement should be considered when loading a large
number of tables. For example, with a root heap of approximately 400GB available to HeatWave,
loading 100K tables would consume all available root heap memory (100K x 4MB = 400GB). As of
MySQL 8.0.30, the default heap segment size is reduced from 4MB to a default of 64KB per table,
reducing the amount of memory that must be allocated from the root heap for each loaded table.
Note
Before MySQL 8.0.31, DDL operations are not permitted on tables that are
loaded in HeatWave. In those releases, to alter the definition of a table, you
must unload the table and remove the SECONDARY_ENGINE attribute before
performing the DDL operation. See Section 2.4, “Modifying Tables”.
2.2.1 Prerequisites
Before loading data, ensure that you have met the following prerequisites:
• The data you want to load must be available on the MySQL DB System. For information about
importing data into a MySQL DB System, refer to the following instructions:
• For HeatWave on OCI, see Importing and Exporting Databases in the HeatWave on OCI Service
Guide.
• For HeatWave on AWS, see Importing Data in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, see Importing Data to HeatWave in the HeatWave for Azure Service
Guide.
• The tables you intend to load must be InnoDB tables. You can manually convert tables to InnoDB
using the following ALTER TABLE statement:
mysql> ALTER TABLE tbl_name ENGINE=InnoDB;
• The tables you intend to load must be defined with a primary key. You can add a primary key using
the following syntax:
mysql> ALTER TABLE tbl_name ADD PRIMARY KEY (column);
Adding a primary key is a table-rebuilding operation. For more information, see Primary Key
Operations.
Primary key columns defined with column prefixes are not supported.
Load time is affected if the primary key contains more than one column, or if the primary key column
is not an INTEGER column. The impact on MySQL performance during load, change propagation,
and query processing depends on factors such as data properties, available resources (compute,
memory, and network), and the rate of transaction processing on the MySQL DB System.
• Identify all of the tables that your queries access to ensure that you load all of them into HeatWave.
If a query accesses a table that is not loaded into HeatWave, it will not be offloaded to HeatWave for
processing.
• The number of columns per table cannot exceed 1017. Before MySQL 8.0.29, the limit was 900.
As of MySQL 8.2.0, HeatWave Guided Load uses HeatWave Autopilot to exclude schemas, tables, and
columns that cannot be loaded, and define RAPID as the secondary engine. To load data manually,
follow these steps:
1. Optionally, applying string column encoding and data placement workload optimizations. For more
information, see Section 2.7, “Workload Optimization”.
2. Loading tables or partitions using ALTER TABLE ... SECONDARY_LOAD statements. See
Section 2.2.2.3, “Loading Tables” and Section 2.2.2.4, “Loading Partitions”.
Before MySQL 8.2.0, loading data manually involves the following steps:
1. Excluding columns with unsupported data types. See Section 2.2.2.1, “Excluding Table Columns”.
2. Defining RAPID as the secondary engine for tables you want to load. See Section 2.2.2.2, “Defining
the Secondary Engine”.
3. Optionally, applying string column encoding and data placement workload optimizations. For more
information, see Section 2.7, “Workload Optimization”.
4. Loading tables using ALTER TABLE ... SECONDARY_LOAD statements, as shown in the example
after this list. See Section 2.2.2.3, “Loading Tables”.
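For example, with tbl_name as a placeholder:
mysql> ALTER TABLE tbl_name SECONDARY_ENGINE = RAPID;
mysql> ALTER TABLE tbl_name SECONDARY_LOAD;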
Optionally, exclude columns that are not relevant to the intended queries. Excluding irrelevant columns
is not required but doing so reduces load time and the amount of memory required to store table data.
To exclude a column, specify the NOT SECONDARY column attribute in an ALTER TABLE or CREATE
TABLE statement, as shown below. The NOT SECONDARY column attribute prevents a column from
being loaded into HeatWave when executing a table load operation.
mysql> ALTER TABLE tbl_name MODIFY description BLOB NOT SECONDARY;
mysql> CREATE TABLE orders (id INT, description BLOB NOT SECONDARY);
Note
If a query accesses a column defined with the NOT SECONDARY attribute, the
query is executed on the MySQL DB System by default.
To include a column that was previously excluded, refer to the procedure described in Section 2.4,
“Modifying Tables”.
The SECONDARY_LOAD PARTITION clause is valid whether or not the table has been loaded to HeatWave.
For example:
mysql> ALTER TABLE t1 SECONDARY_LOAD;
mysql> ALTER TABLE t1 ADD PARTITION (PARTITION p4 VALUES LESS THAN (2002));
mysql> ALTER TABLE t1 SECONDARY_LOAD PARTITION (p4);
2.2.3 Loading Data Using Auto Parallel Load
Auto Parallel Load facilitates loading by automating manual load steps, including:
• Excluding columns with unsupported data types.
• Defining RAPID as the secondary engine for tables that are to be loaded.
Auto Parallel Load, which can be run from any MySQL client or connector, is implemented as a stored
procedure named heatwave_load, which resides in the MySQL sys schema. Running Auto Parallel
Load involves issuing a CALL statement for the stored procedure, which takes schemas and options
as arguments; for example, this statement loads the tpch schema:
mysql> CALL sys.heatwave_load(JSON_ARRAY("tpch"),NULL);
• To run Auto Parallel Load in normal mode, the HeatWave Cluster must be active.
• As of MySQL 8.4.0, an input_list JSON array replaces the db_list JSON array. This adds an include list to exactly
specify the tables and columns to load for a set of queries. It is no longer necessary to include a
complete schema, and exclude unnecessary tables and columns.
mysql> CALL sys.heatwave_load (input_list,[options]);
input_list: {
JSON_ARRAY(input [,input] ...)
}
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["mode",{"normal"|"dryrun"|"validation"}]
["output",{"normal"|"compact"|"silent"|"help"}]
["sql_mode","sql_mode"]
["policy",{"disable_unsupported_columns"|"not_disable_unsupported_columns"}]
["set_load_parallelism",{true|false}]
["auto_enc",JSON_OBJECT("mode",{"off"|"check"})]
}
}
input: {
'db_name' | db_object
}
db_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"db_name": "db_name",
["tables": JSON_ARRAY(table [, table] ...)]
["exclude_tables": JSON_ARRAY(table [, table] ...)]
}
}
table: {
'table_name' | table_object
}
table_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"table_name": "table_name",
['engine_attribute': engine_attribute_object],
['columns': JSON_ARRAY('column_name' [, 'column_name'] ...)],
['exclude_columns': JSON_ARRAY('column_name' [, 'column_name'] ...)]
}
}
engine_attribute_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"sampling": true|false,
"dialect": {dialect_section},
"file": JSON_ARRAY(file_section [, file_section]...),
}
}
MySQL 8.0.33-u3 adds support for HeatWave Lakehouse with external_tables, see: Chapter 4,
HeatWave Lakehouse:
mysql> CALL sys.heatwave_load (db_list,[options]);
db_list: {
JSON_ARRAY(["schema_name","schema_name"] ...)
}
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["mode",{"normal"|"dryrun"}]
["output",{"normal"|"compact"|"silent"|"help"}]
["sql_mode","sql_mode"]
["policy",{"disable_unsupported_columns"|"not_disable_unsupported_columns"}]
["exclude_list",JSON_ARRAY(schema_name_1, schema_name_2.table_name_1, schema_name_3.table_name_2.
["set_load_parallelism",{true|false}]
["auto_enc",JSON_OBJECT("mode",{"off"|"check"})]
["external_tables",JSON_ARRAY(db_object [, db_object]... )]
}
}
db_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"db_name": "name",
"tables": JSON_ARRAY(table [, table] ...)
}
}
table: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"table_name": "name",
"sampling": true|false,
"dialect": {dialect_section},
"file": JSON_ARRAY(file_section [, file_section]...),
}
}
Before MySQL 8.0.33-u3, the syntax does not include the external_tables option:
db_list: {
JSON_ARRAY(["schema_name","schema_name"] ...)
}
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["mode",{"normal"|"dryrun"}]
["output",{"normal"|"compact"|"silent"|"help"}]
["sql_mode","sql_mode"]
["policy",{"disable_unsupported_columns"|"not_disable_unsupported_columns"}]
["exclude_list",JSON_ARRAY(schema_name_1, schema_name_2.table_name_1, schema_name_3.table_name_2.
["set_load_parallelism",{true|false}]
["auto_enc",JSON_OBJECT("mode",{"off"|"check"})]
}
}
As of MySQL 8.4.0 use input_list to define what to load. input_list is a JSON array and
requires one or more valid input which can be either a valid schema name or a db_object. An
empty array is permitted to view the Auto Parallel Load command-line help, see Section 2.2.3.5, “Auto
Parallel Load Command-Line Help”. This is backwards compatible with db_list.
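For example, a db_object can restrict the load to specific tables (the schema and table names are illustrative):
mysql> CALL sys.heatwave_load(
          JSON_ARRAY(JSON_OBJECT("db_name","tpch",
                                 "tables",JSON_ARRAY("orders","lineitem"))),
          NULL);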
Before MySQL 8.4.0, db_list specifies the schemas to load. The list is a JSON array and requires
one or more valid schema names. An empty array is permitted to view the Auto Parallel Load
command-line help.
Use key-value pairs in JSON format to specify parameters. HeatWave uses the default setting if there is
no option setting. Use NULL to specify no arguments.
For syntax examples, see Section 2.2.3.6, “Auto Parallel Load Examples”.
• mode: Defines the Auto Parallel Load operational mode. Permitted values are:
• dryrun: Generates a load script only. Auto Parallel Load executes in dryrun mode automatically
if the HeatWave Cluster is not active.
• validation: Only use with Lakehouse. validation performs the same checks as dryrun and
also validates external files before loading. It follows all the options and the load configuration,
for example column information, sql_mode, is_strict_mode and allow_missing_files,
but does not load any tables. It uses schema inference and might modify the schema, see:
Section 4.2.4.1, “Lakehouse Auto Parallel Load Schema Inference”. validation is faster than a
full load, particularly for large tables. The memory requirement is similar to running a full load.
• output: Defines how Auto Parallel Load produces output. Permitted values are:
• normal: The default. Produces summarized output and sends it to stdout and to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”.
Before MySQL 8.0.32, it sends it to the heatwave_load_report table. See Section 2.2.3.7,
“The Auto Parallel Load Report Table”.
• help: Displays Auto Parallel Load command-line help. See Section 2.2.3.5, “Auto Parallel Load
Command-Line Help”.
• sql_mode: Defines the SQL mode used while loading tables. Auto Parallel Load does not support
the MySQL global or session sql_mode variable. To run Auto Parallel Load with a non-default
SQL mode configuration, specify the configuration using the Auto Parallel Load sql_mode option as
a string value. If no SQL modes are specified, the default OCI SQL mode configuration is used.
• policy: Defines the policy for handling of tables containing columns with unsupported data types.
Permitted values are:
• disable_unsupported_columns: The default. Excludes columns with unsupported data types
from the load by defining them as NOT SECONDARY columns.
Auto Parallel Load does not generate statements to disable columns that are explicitly defined as
NOT SECONDARY.
• not_disable_unsupported_columns: Exclude the table from the load script if the table
contains a column with an unsupported data type.
A column with an unsupported data type that is explicitly defined as a NOT SECONDARY column
does not cause the table to be excluded. For information about defining columns as NOT
SECONDARY, see Section 2.2.2.1, “Excluding Table Columns”.
• exclude_list: Defines a list of schemas, tables, and columns to exclude from the load script.
Names must be fully qualified without backticks.
Do not use as of MySQL 8.4.0. Use db_object with tables, exclude_tables, columns or
exclude_columns instead. exclude_list will be deprecated in a future release.
Auto Parallel Load automatically excludes database objects that cannot be offloaded, according to
the default policy setting. These objects need not be specified explicitly in the exclude list. System
schemas, non-InnoDB tables, tables that are already loaded in HeatWave, and columns explicitly
defined as NOT SECONDARY are automatically excluded.
• auto_enc: Checks if there is enough memory for string column encoding. Settings include:
• check: The default. Checks if there is enough memory on the MySQL node for dictionary-
encoded columns and if there is enough root heap memory for variable-length column encoding
overhead. Dictionary-encoded columns require memory on the MySQL node for dictionaries.
For each loaded table, 4MB of memory (the default heap segment size) must be allocated from
the root heap for variable-length column encoding overhead. As of MySQL 8.0.30, the default
heap segment size is reduced from 4MB to a default of 64KB per table. If there is not enough
memory, Auto Parallel Load executes in dryrun mode and prints a warning about insufficient
memory. The auto_enc option runs check mode if it is not specified explicitly and set to off.
For more information, see Section 2.2.3.4, “Memory Estimation for String Column Encoding”.
• external_tables: Non-InnoDB tables that do not store any data but refer to data stored
externally. For the external_tables syntax, see: Section 4.2.4.2, “Lakehouse Auto Parallel Load
with the external_tables Option”.
As of MySQL 8.4.0, do not use this option. Use db_object with tables or exclude_tables instead.
external_tables will be deprecated in a future release.
• Use one or the other of the following, but not both. The use of both parameters throws an error.
• tables: As of MySQL 8.4.0, an optional JSON array of tables to include in the load.
• exclude_tables: As of MySQL 8.4.0, an optional JSON array of tables to exclude from the
load.
• sampling: Only use with Lakehouse. If set to true, the default setting, Lakehouse Auto
Parallel Load samples the data to infer the schema and collect statistics.
If set to false, Lakehouse Auto Parallel Load performs a full scan to infer the schema and
collect statistics. Depending on the size of the data, this can take a long time.
Auto Parallel Load uses the inferred schema to generate CREATE TABLE statements. The
statistics are used to estimate storage requirements and load times. See: Section 4.2.4.1,
“Lakehouse Auto Parallel Load Schema Inference”.
• For dialect and file, see: Section 4.2.2, “Lakehouse External Table Syntax”.
• Use one or the other of columns and exclude_columns, but not both. The use of both parameters
throws an error.
In dryrun mode, Auto Parallel Load sends the load script to the heatwave_autopilot_report
table only. See Section 6.1, “HeatWave Autopilot Report Table”. Before MySQL 8.0.32, it sends it to
the heatwave_load_report table. It does not load data into HeatWave.
If Auto Parallel Load fails with an error, inspect the errors with a query to the
heatwave_autopilot_report table. Before MySQL 8.0.32, query the heatwave_load_report
table.
mysql> SELECT log FROM sys.heatwave_autopilot_report
WHERE type="error";
When Auto Parallel Load finishes running, query the heatwave_autopilot_report table to check
for warnings. Before MySQL 8.0.32, query the heatwave_load_report table.
mysql> SELECT log FROM sys.heatwave_autopilot_report
WHERE type="warn";
Issue the following query to inspect the load script that was generated. Before MySQL 8.0.32, query
the heatwave_load_report table.
mysql> SELECT log->>"$.sql" AS "Load Script"
FROM sys.heatwave_autopilot_report
WHERE type = "sql" ORDER BY id;
Once you are satisfied with the Auto Parallel Load CALL statement and the generated load script,
reissue the CALL statement in normal mode to load the data into HeatWave. For example:
mysql> CALL sys.heatwave_load(JSON_ARRAY("tpch"), JSON_OBJECT("mode","normal"));
The time required to load data depends on the data size. Auto Parallel Load provides an estimate of
the time required to complete the load operation.
Tables are loaded in sequence, ordered by schema and table name. Load-time errors are reported as
they are encountered. If an error is encountered while loading a table, the operation is not terminated.
Auto Parallel Load continues running, moving on to the next table.
When Auto Parallel Load finishes running, it checks if tables are loaded and shows a summary with the
number of tables that were loaded and the number of tables that failed to load.
The following example uses the auto_enc option in check mode to ensure that there is sufficient
memory for string column encoding before attempting a load operation. Insufficient memory can
cause a load failure.
mysql> CALL sys.heatwave_load(JSON_ARRAY("tpch"),
JSON_OBJECT("mode","dryrun","auto_enc",JSON_OBJECT("mode","check")));
Note
Look for capacity estimation data in the Auto Parallel Load output. The results indicate whether there is
sufficient memory to load all tables.
The command-line help provides usage documentation for the Auto Parallel Load utility.
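For example, calling the procedure with an empty array displays the help:
mysql> CALL sys.heatwave_load(JSON_ARRAY(), NULL);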
• Load tables that begin with an “hw” prefix from a schema named schema_customer_1.
mysql> SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name)))
FROM information_schema.tables
WHERE table_schema = 'schema_customer_1'
AND table_name NOT LIKE 'hw%');
mysql> CALL sys.heatwave_load(JSON_ARRAY('schema_customer_1'), @exc_list);
• Load all schemas with tables that start with an “hw” prefix.
mysql> SET @db_list = (SELECT json_arrayagg(schema_name) FROM information_schema.schemata);
mysql> SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name)))
FROM information_schema.tables
WHERE table_schema NOT IN
('mysql','information_schema', 'performance_schema','sys')
AND table_name NOT LIKE 'hw%');
mysql> CALL sys.heatwave_load(@db_list, @exc_list);
You can check db_list and exc_list using SELECT JSON_PRETTY(@db_list); and SELECT
JSON_PRETTY(@exc_list);
For example, a wrapper procedure can run Auto Parallel Load and produce custom output from the
report table (a minimal sketch; the schema name is illustrative):
DELIMITER //
CREATE PROCEDURE auto_load_wrapper()
BEGIN
  CALL sys.heatwave_load(JSON_ARRAY('tpch'), NULL);
  -- CUSTOM OUTPUT
  SELECT log AS 'Unsupported objects' FROM sys.heatwave_autopilot_report
  WHERE type="warn" AND stage="VERIFICATION" AND log LIKE "%Unsupported%";
  SELECT COUNT(*) AS "Total Load commands Generated"
  FROM sys.heatwave_autopilot_report WHERE type = "sql";
END //
DELIMITER ;
CALL auto_load_wrapper();
When MySQL runs Auto Parallel Load, it sends output including execution logs and a generated load
script to the heatwave_load_report table in the sys schema.
The heatwave_load_report table is a temporary table. It contains data from the last execution
of Auto Parallel Load. Data is only available for the current session and is lost when the session
terminates or when the server is shut down.
Query the heatwave_load_report table after MySQL runs Auto Parallel Load, as in the following
examples:
• View the generated load script to see commands that would be executed by Auto Parallel Load in
normal mode:
mysql> SELECT log->>"$.sql" AS "Load Script"
FROM sys.heatwave_load_report
WHERE type = "sql" ORDER BY id;
• Concatenate Auto Parallel Load generated DDL statements into a single string to copy and
paste for execution. The group_concat_max_len variable sets the result length in bytes
for the GROUP_CONCAT() function to accommodate a potentially long string. (The default
group_concat_max_len setting is 1024 bytes.)
mysql> SET SESSION group_concat_max_len = 1000000;
mysql> SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ')
FROM sys.heatwave_load_report
WHERE type = "sql" ORDER BY id;
+------------------------------+---------------------+
| NAME                         | LOAD_STATUS         |
+------------------------------+---------------------+
| tpch.supplier | AVAIL_RPDGSTABSTATE |
| tpch.partsupp | AVAIL_RPDGSTABSTATE |
| tpch.orders | AVAIL_RPDGSTABSTATE |
| tpch.lineitem | AVAIL_RPDGSTABSTATE |
| tpch.customer | AVAIL_RPDGSTABSTATE |
| tpch.nation | AVAIL_RPDGSTABSTATE |
| tpch.region | AVAIL_RPDGSTABSTATE |
| tpch.part | AVAIL_RPDGSTABSTATE |
+------------------------------+---------------------+
The AVAIL_RPDGSTABSTATE status indicates that the table is loaded. For information about load
statuses, see Section 6.4.8, “The rpd_tables Table”.
2.2.6 Data Compression
While data compression results in a smaller HeatWave Cluster, decompression operations that occur
as data is accessed affect performance to a small degree. Specifically, decompression operations
have a minor effect on query runtimes, on the rate at which queries are offloaded to HeatWave during
change propagation, and on recovery time from Object Storage.
If data storage size is not a concern, disable data compression by setting the rapid_compression
session variable to OFF before loading data:
mysql> SET SESSION rapid_compression=OFF;
As of MySQL 8.3.0, the default option is AUTO which automatically chooses the best compression
algorithm for each column.
2.2.7 Change Propagation
DML operations, INSERT, UPDATE, and DELETE, on the MySQL DB System do not wait for changes
to be propagated to the HeatWave Cluster; that is, DML operations on the MySQL DB System are not
delayed by HeatWave change propagation.
Data changes on the MySQL DB System node are propagated to HeatWave in batch transactions.
Change propagation is initiated as follows:
• Every 200ms.
• When data updated by DML operations on the MySQL DB System are read by a subsequent
HeatWave query.
A change propagation failure can cause tables in HeatWave to become stale, and queries that access
stale tables are not offloaded to HeatWave for processing.
Tables that have become stale due to change propagation failures resulting from out-of-memory errors
are automatically reloaded. A check for stale tables is performed periodically when the HeatWave
Cluster is idle.
If change propagation failure has occurred for some other reason causing a table to become stale,
you must unload and reload the table manually to restart change propagation for the table. See
Section 2.5.1, “Unloading Tables”, and Section 2.2, “Loading Data to HeatWave MySQL”.
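For example, with tbl_name as a placeholder, the manual unload and reload statements are:
mysql> ALTER TABLE tbl_name SECONDARY_UNLOAD;
mysql> ALTER TABLE tbl_name SECONDARY_LOAD;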
To check if change propagation is enabled for individual tables, query the POOL_TYPE data in
HeatWave Performance Schema tables. RAPID_LOAD_POOL_TRANSACTIONAL indicates that
change propagation is enabled for the table. RAPID_LOAD_POOL_SNAPSHOT indicates that change
propagation is disabled.
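For example, a query along these lines (again assuming the rpd_tables and rpd_table_id Performance Schema tables) shows the pool type for each loaded table:
mysql> SELECT r2.NAME, r1.POOL_TYPE
       FROM performance_schema.rpd_tables r1
       JOIN performance_schema.rpd_table_id r2 ON r1.ID = r2.ID;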
2.2.8 Reload Tables
The heatwave_reload stored procedure in the sys schema reloads tables:
mysql> CALL sys.heatwave_reload([options]);
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["only_user_loaded_tables",{true|false}]
["output",{"normal"|"silent"}]
}
}
Use key-value pairs in JSON format to specify options. HeatWave uses the default setting if there is
no defined option. Use NULL to specify no arguments.
• output: Defines how the output is produced. Permitted values are:
• normal: The default. Produces summarized output and sends it to stdout and to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”.
• silent: Sends output to the heatwave_autopilot_report table only. See Section 6.1,
“HeatWave Autopilot Report Table”. The silent output type is useful if human-readable output is
not required; when the output is consumed by a script, for example.
Syntax Examples
• Reload all tables with default options:
mysql> CALL sys.heatwave_reload(NULL);
• Reload all user and system tables with the silent option:
mysql> CALL sys.heatwave_reload (JSON_OBJECT("only_user_loaded_tables",false,"output","silent"));
2.3 Running Queries
As of MySQL 8.4.0, HeatWave supports InnoDB partitions. Query processing in HeatWave can access
partitions with standard syntax.
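For example, standard partition selection syntax can be used in offloaded queries (the table and partition names are illustrative):
mysql> SELECT * FROM orders PARTITION (p0);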
• For HeatWave on OCI see Connecting to a DB System in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Connecting from a Client in the HeatWave on AWS Service Guide.
2.3.1 Query Prerequisites
Queries must meet the following prerequisites to be offloaded to HeatWave:
• The query must be a SELECT statement. INSERT ... SELECT and CREATE TABLE ... SELECT
statements are supported, but only the SELECT portion of the statement is offloaded to HeatWave.
See Section 2.3.7, “CREATE TABLE ... SELECT Statements”, and Section 2.3.8, “INSERT ...
SELECT Statements”.
• All tables accessed by the query must be defined with RAPID as the secondary engine. See
Section 2.2.2.2, “Defining the Secondary Engine”.
• All tables accessed by the query must be loaded in HeatWave. See Section 2.2, “Loading Data to
HeatWave MySQL”.
• autocommit must be enabled. If autocommit is disabled, queries are not offloaded and execution
is performed on the MySQL DB System. To check the autocommit setting:
mysql> SHOW VARIABLES LIKE 'autocommit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | ON |
+---------------+-------+
• Queries must only use supported functions and operators. See Section 2.12, “Supported Functions
and Operators”.
23
Running Queries
• Queries must avoid known limitations. See Section 2.18, “HeatWave MySQL Limitations”.
If any prerequisite is not satisfied, the query is not offloaded and falls back to the MySQL DB System
for processing by default.
If Using secondary engine RAPID does not appear in the Extra column, the query will
not be offloaded to HeatWave. To determine why a query will not offload, refer to Section 2.15,
“Troubleshooting”, or try debugging the query using the procedure described in Section 2.3.5,
“Debugging Queries”.
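For example, running EXPLAIN on the query used below shows whether it is offloaded; when it is, the Extra column includes Using secondary engine RAPID:
mysql> EXPLAIN SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
       FROM orders
       WHERE O_ORDERDATE >= DATE '1994-03-01'
       GROUP BY O_ORDERPRIORITY
       ORDER BY O_ORDERPRIORITY;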
After using EXPLAIN to verify that the query can be offloaded, run the query and note the execution
time.
mysql> SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
FROM orders
WHERE O_ORDERDATE >= DATE '1994-03-01'
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY;
+-----------------+-------------+
| O_ORDERPRIORITY | ORDER_COUNT |
+-----------------+-------------+
| 1-URGENT | 2017573 |
| 2-HIGH | 2015859 |
| 3-MEDIUM | 2013174 |
| 4-NOT SPECIFIED | 2014476 |
| 5-LOW | 2013674 |
+-----------------+-------------+
5 rows in set (0.04 sec)
To compare HeatWave query execution time with MySQL DB System execution time, disable the
use_secondary_engine variable and run the query again to see how long it takes to run on the
MySQL DB System.
mysql> SET SESSION use_secondary_engine=OFF;
mysql> SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
FROM orders
WHERE O_ORDERDATE >= DATE '1994-03-01'
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY;
+-----------------+-------------+
| O_ORDERPRIORITY | ORDER_COUNT |
+-----------------+-------------+
| 1-URGENT | 2017573 |
| 2-HIGH | 2015859 |
| 3-MEDIUM | 2013174 |
| 4-NOT SPECIFIED | 2014476 |
| 5-LOW | 2013674 |
+-----------------+-------------+
5 rows in set (8.91 sec)
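When the comparison is complete, re-enable offload for the session by setting the variable back to
ON, which is the default:
mysql> SET SESSION use_secondary_engine=ON;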
Note
Concurrently issued queries are prioritized for execution. For information about
query prioritization, see Section 2.3.3, “Auto Scheduling”.
When HeatWave is idle, an arriving query is scheduled immediately for execution. It is not queued. A
query is queued only if a preceding query is running on HeatWave.
A light-weight cost estimate is performed for each query at query compilation time.
Queries cancelled via Ctrl-C are removed from the scheduling queue.
For a query that you can run to view the HeatWave query history including query start time, end time,
and wait time in the scheduling queue, see Section 6.2, “HeatWave MySQL Monitoring”.
Each entry in the cache corresponds to a query execution plan node. A query execution plan may have
nodes for table scans, JOIN, GROUP BY and other operations.
The statistics cache is an LRU structure. When cache capacity is reached, the least recently used
entries are evicted from the cache as new entries are added. The number of entries permitted in the
statistics cache is 65536, which is enough to store statistics for 4000 to 5000 unique queries of medium
complexity. The maximum number of statistics cache entries is defined by the MySQL-managed
rapid_stats_cache_max_entries setting.
2. Issue the problematic query using EXPLAIN. If the query is supported by HeatWave, the Extra
column in the EXPLAIN output shows the following text: “Using secondary engine RAPID”;
otherwise, that text does not appear. The following query example uses the TIMEDIFF() function,
which is currently not supported by HeatWave:
mysql> EXPLAIN SELECT TIMEDIFF(O_ORDERDATE,'2000:01:01 00:00:00.000001')
FROM orders;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ORDERS
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1488248
filtered: 100
Extra: NULL
1 row in set, 1 warning (0.0011 sec)
3. Query the INFORMATION_SCHEMA.OPTIMIZER_TRACE table for a failure reason. There are two
trace markers for queries that fail to offload:
• Rapid_Offload_Fails
• secondary_engine_not_used
Note
If the optimizer trace does not return all of the trace information, increase
the optimizer trace buffer size. For more information, see Section 2.9.7,
“Running Queries”.
For the TIMEDIFF() query example used above, querying the Rapid_Offload_Fails marker
returns the reason for the failure:
mysql> SELECT QUERY, TRACE->'$**.Rapid_Offload_Fails'
FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
+---------------------------------------------------+--------------------------------------+
| QUERY | TRACE->'$**.Rapid_Offload_Fails' |
+---------------------------------------------------+--------------------------------------+
| EXPLAIN SELECT |[{"Reason": "Function timediff is not |
| TIMEDIFF(O_ORDERDATE,'2000:01:01 00:00:00.000001')| yet supported"}] |
| FROM ORDERS | |
+---------------------------------------------------+--------------------------------------+
The reason reported for a query offload failure depends on the issue or limitation encountered. For
common issues, such as unsupported clauses or functions, a specific reason is reported. For undefined
issues or unsupported query transformations performed by the optimizer, the following generic reason
is reported:
mysql> [{"Reason": "Currently unsupported RAPID query compilation scenario"}]
For a query that does not meet the query cost threshold for HeatWave, the following reason is
reported:
mysql> [{"Reason": "The estimated query cost does not exceed secondary_engine_cost_threshold."}]
The query cost threshold prevents queries of little cost from being offloaded to HeatWave. For
information about the query cost threshold, see Section 2.15, “Troubleshooting”.
For a query that attempts to access a column defined as NOT SECONDARY, a corresponding reason is
reported. Columns defined as NOT SECONDARY are excluded when a table is loaded into HeatWave. See
Section 2.2.2.1, “Excluding Table Columns”.
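To check which columns of a table carry the NOT SECONDARY attribute, one option is to inspect the
table definition:
mysql> SHOW CREATE TABLE orders;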
Runtime data is available for queries in the HeatWave query history, which is a non-persistent store of
information about the last 1000 executed queries.
• To view runtime data for queries executed by the current session only:
For additional information about using Query Insights, see Section 2.8.4, “Query Insights”.
• To view runtime data for all queries in the HeatWave query history:
• To view runtime data for a particular HeatWave query, filtered by query ID:
WHERE query_id = 1;
• EXPLAIN output includes the query ID. You can also query the
performance_schema.rpd_query_stats table for query IDs:
mysql> SELECT query_id, LEFT(query_text,160)
FROM performance_schema.rpd_query_stats;
The SELECT table must be loaded in HeatWave. For example, the following statement selects data
from the orders table on HeatWave and inserts the result set into the orders2 table created on the
MySQL DB System:
mysql> CREATE TABLE orders2 SELECT * FROM orders;
The SELECT portion of the CREATE TABLE ... SELECT statement is subject to the same HeatWave
requirements and limitations as regular SELECT queries.
The SELECT table must be loaded in HeatWave, and the INSERT table must be present on the MySQL
DB System. For example, the following statement selects data from the orders table on HeatWave
and inserts the result set into the orders2 table on the MySQL DB System:
mysql> INSERT INTO orders2 SELECT * FROM orders;
Usage notes:
• The SELECT portion of the INSERT ... SELECT statement is subject to the same HeatWave
requirements and limitations as regular SELECT queries.
• Functions, operators, and attributes deprecated by MySQL Server are not supported in the SELECT
query.
• See Section 2.3.1, “Query Prerequisites” and Section 2.18.9, “Other Limitations”.
In the following example, a view is created on the orders table, described in Section 2.6, “Table Load
and Query Example”. The example assumes the orders table is loaded in HeatWave.
mysql> CREATE VIEW v1 AS SELECT O_ORDERPRIORITY, O_ORDERDATE
FROM orders;
To determine if a query executed on a view can be offloaded to HeatWave for execution, use
EXPLAIN. If offload is supported, the Extra column of EXPLAIN output shows “Using secondary
engine RAPID”, as in the following example:
mysql> EXPLAIN SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
FROM v1
WHERE O_ORDERDATE >= DATE '1994-03-01'
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ORDERS
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1488248
filtered: 33.32999801635742
Extra: Using where; Using temporary; Using filesort; Using secondary engine RAPID
Before MySQL 8.0.31, follow this procedure to modify a table that is loaded in HeatWave:
1. Unload the table:
mysql> ALTER TABLE orders SECONDARY_UNLOAD;
2. Set the SECONDARY_ENGINE attribute to NULL:
mysql> ALTER TABLE orders SECONDARY_ENGINE = NULL;
3. Modify the table. The following examples demonstrate adding a previously excluded column,
modifying or removing a column encoding, and modifying or removing a data placement key.
The examples are based on the orders table described in Section 2.6, “Table Load and Query
Example”.
Columns are excluded by specifying the NOT SECONDARY column attribute in a CREATE TABLE
or ALTER TABLE statement; for example:
mysql> ALTER TABLE orders MODIFY `O_COMMENT` varchar(79) NOT NULL NOT SECONDARY;
To include a previously excluded column the next time the table is loaded, modify the column
definition to remove the NOT SECONDARY column attribute; for example:
mysql> ALTER TABLE orders MODIFY `O_COMMENT` varchar(79) NOT NULL;
To modify the column encoding, alter the column comment; for example:
mysql> ALTER TABLE orders MODIFY `O_COMMENT` VARCHAR(79) COLLATE utf8mb4_bin NOT NULL
COMMENT 'RAPID_COLUMN=ENCODING=VARLEN';
The following example removes the column comment entirely, but if there are other column
comments that you want to keep, you need only remove the encoding keyword string.
mysql> ALTER TABLE orders MODIFY `O_COMMENT` VARCHAR(79)
COLLATE utf8mb4_bin NOT NULL;
To modify a data placement key, modify the data placement keyword string:
mysql> ALTER TABLE orders MODIFY `O_ORDERDATE` DATE NOT NULL
COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=2';
To remove a data placement key, modify the column comment to remove the
RAPID_COLUMN=DATA_PLACEMENT_KEY=N keyword string. The following example removes
the column comment entirely, but if there are other column comments that you want to keep, you
need only remove the data placement keyword string.
mysql> ALTER TABLE orders MODIFY `O_ORDERDATE` DATE NOT NULL;
4. After making the desired changes to the table, set the SECONDARY_ENGINE attribute back to
RAPID; for example:
mysql> ALTER TABLE orders SECONDARY_ENGINE = RAPID;
To unload a table from HeatWave, specify the SECONDARY_UNLOAD clause in an ALTER TABLE
statement:
mysql> ALTER TABLE tbl_name SECONDARY_UNLOAD;
Data is removed from HeatWave only. The table contents on the MySQL DB System are not affected.
Among other actions, Auto Unload removes the secondary engine flag for tables that are to be
unloaded.
Auto Unload, which can be run from any MySQL client or connector, is implemented as a stored
procedure named heatwave_unload, which resides in the MySQL sys schema. Running Auto
Unload involves issuing a CALL statement for the stored procedure, which takes schemas and
options as arguments; for example, this statement unloads the tpch schema:
mysql> CALL sys.heatwave_unload(JSON_ARRAY("tpch"),NULL);
input_list: {
JSON_ARRAY(input [,input] ...)
}
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["mode",{"normal"|"dryrun"}]
["output",{"normal"|"silent"|"help"}]
}
}
input: {
'db_name' | db_object
}
db_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"db_name": "db_name",
["tables": JSON_ARRAY(table [, table] ...)]
["exclude_tables": JSON_ARRAY(table [, table] ...)]
}
}
table: {
'table_name'
}
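As a sketch based on the grammar above (the schema and table names are illustrative), the following
call unloads the tpch schema except for its orders table:
mysql> CALL sys.heatwave_unload(
JSON_ARRAY(JSON_OBJECT("db_name","tpch",
"exclude_tables",JSON_ARRAY("orders"))),
NULL);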
db_list: {
JSON_ARRAY(["schema_name"[,"schema_name"] ...])
}
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["mode",{"normal"|"dryrun"}]
["output",{"normal"|"silent"|"help"}]
["exclude_list",JSON_ARRAY(schema_name_1, schema_name_2.table_name_1, ...)]
}
}
As of MySQL 8.4.0, use input_list to define what to unload. input_list is a JSON array and
requires one or more valid input values, each of which can be either a valid schema name or a
db_object. An empty array is permitted in order to view the Auto Unload command-line help; see
Section 2.5.3.3, “Auto Unload Command-Line Help”. This is backward compatible with db_list.
Before MySQL 8.4.0, db_list specifies the schemas to unload. The list is a JSON array and requires
one or more valid schema names. An empty array is permitted to view the Auto Unload command-line
help.
Use key-value pairs in JSON format to specify parameters. HeatWave uses the default setting if there is
no option setting. Use NULL to specify no arguments.
• mode: Defines the Auto Unload operational mode. Permitted values are:
• normal: The default. Generates and executes the unload script.
• dryrun: Generates an unload script only. Auto Unload executes in dryrun mode automatically if
the HeatWave Cluster is not active.
• output: Defines how Auto Unload produces output. Permitted values are:
• normal: The default. Produces summarized output and sends it to stdout and to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”.
• silent: Sends output to the heatwave_autopilot_report table only. See Section 6.1,
“HeatWave Autopilot Report Table”. The silent output type is useful if human-readable output
is not required, such as when the output is consumed by a script. For an example of a stored
procedure with an Auto Unload call that uses the silent output type, see Section 2.5.3.4, “Auto
Unload Examples”.
• help: Displays Auto Unload command-line help. See Section 2.5.3.3, “Auto Unload Command-
Line Help”.
• exclude_list: Defines a list of schemas and tables to exclude from the unload script. Names must
be fully qualified without backticks.
Do not use as of MySQL 8.4.0. Use db_object with tables or exclude_tables instead.
exclude_list will be deprecated in a future release.
Auto Unload automatically excludes tables that are loading, unloading, or in recovery; that is,
tables whose load_status is one of the following: NOLOAD_RPDGSTABSTATE, LOADING_RPDGSTABSTATE,
UNLOADING_RPDGSTABSTATE, or INRECOVERY_RPDGSTABSTATE.
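As a sketch, assuming the performance_schema.rpd_tables table exposes id and load_status columns
alongside the rpd_table_id table used elsewhere in this guide, the current load status of each
table can be checked as follows:
mysql> SELECT t.schema_name, t.table_name, r.load_status
FROM performance_schema.rpd_table_id t
JOIN performance_schema.rpd_tables r ON t.id = r.id;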
• Use one or the other of the following parameters, but not both; using both throws an error.
• tables: As of MySQL 8.4.0, an optional JSON array of tables to unload.
• exclude_tables: As of MySQL 8.4.0, an optional JSON array of tables to exclude from the
unload.
Run Auto Unload in dryrun mode first to check for errors and warnings and to inspect the generated
unload script. To unload a single schema in dryrun mode:
mysql> CALL sys.heatwave_unload(JSON_ARRAY("tpch"),JSON_OBJECT("mode","dryrun"));
In dryrun mode, Auto Unload sends the unload script to the heatwave_autopilot_report table
only. See Section 6.1, “HeatWave Autopilot Report Table”.
If Auto Unload fails with an error, inspect the errors with a query to the
heatwave_autopilot_report table.
When Auto Unload finishes running, query the heatwave_autopilot_report table to check for
warnings.
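For example, queries along the following lines retrieve errors and warnings (the type values
follow the conventions used in the examples later in this section):
mysql> SELECT log FROM sys.heatwave_autopilot_report WHERE type = "error";
mysql> SELECT log FROM sys.heatwave_autopilot_report WHERE type = "warn";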
Issue the following query to inspect the unload script that was generated:
mysql> SELECT log->>"$.sql" AS "SQL Script"
FROM sys.heatwave_autopilot_report
WHERE type = "sql" ORDER BY id;
Once you are satisfied with the Auto Unload CALL statement and the generated unload script, reissue
the CALL statement in normal mode to unload the data from HeatWave. For example:
mysql> CALL sys.heatwave_unload(JSON_ARRAY("tpch"),NULL);
Note
Tables are unloaded in sequence, ordered by schema and table name. Unload-time errors are reported
as they are encountered. If an error is encountered while unloading a table, the operation is not
terminated. Auto Unload continues running, moving on to the next table.
When Auto Unload finishes running, it checks if tables are unloaded and shows a summary with the
number of tables that were unloaded and the number of tables that failed to unload.
The command-line help provides usage documentation for the Auto Unload utility.
• Unload tables that begin with an “hw” prefix from a schema named schema_customer_1:
mysql> SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name)))
FROM performance_schema.rpd_table_id
WHERE schema_name = 'schema_customer_1'
AND table_name NOT LIKE 'hw%');
mysql> CALL sys.heatwave_unload(JSON_ARRAY('schema_customer_1'), @exc_list);
• Unload all schemas with tables that start with an “hw” prefix:
mysql> SET @db_list = (SELECT JSON_ARRAYAGG(unique_schemas)
FROM (SELECT DISTINCT(schema_name) as unique_schemas
FROM performance_schema.rpd_table_id)
AS loaded_schemas);
mysql> SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name)))
FROM performance_schema.rpd_table_id
WHERE table_name NOT LIKE 'hw%');
mysql> CALL sys.heatwave_unload(@db_list, @exc_list);
-- CUSTOM OUTPUT
SELECT log as 'Warnings' FROM sys.heatwave_autopilot_report WHERE type="warn";
SELECT Count(*) AS "Total Unload commands Generated"
FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id;
END //
DELIMITER ;
CALL auto_unload_wrapper();
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
["only_user_loaded_tables",{true|false}]
["output",{"normal"|"silent"}]
}
}
Use key-value pairs in JSON format to specify options. HeatWave uses the default setting if there is
no defined option. Use NULL to specify no arguments.
• normal: The default. Produces summarized output and sends it to stdout and to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”.
• silent: Sends output to the heatwave_autopilot_report table only. See Section 6.1,
“HeatWave Autopilot Report Table”. The silent output type is useful if human-readable output is
not required, such as when the output is consumed by a script.
Syntax Examples
• Unload all tables with default options:
mysql> CALL sys.heatwave_unload_all (NULL);
• Unload all user and system tables with the silent option:
mysql> CALL sys.heatwave_unload_all (JSON_OBJECT("only_user_loaded_tables",false,"output","silent"));
It is assumed that HeatWave is enabled and the MySQL DB System has a schema named tpch with
a table named orders. The example shows how to exclude a table column, encode string columns,
define RAPID as the secondary engine, and load the table. The example also shows how to use
EXPLAIN to verify that the query can be offloaded, and how to force query execution on the MySQL DB
System to compare MySQL DB System and HeatWave query execution times.
2. Exclude columns that you do not want to load, such as columns with unsupported data types:
mysql> ALTER TABLE orders MODIFY `O_COMMENT` varchar(79) NOT NULL NOT SECONDARY;
3. Encode individual string columns as necessary. For example, apply dictionary encoding to string
columns with a low number of distinct values. Variable-length encoding is the default if no encoding
is specified.
mysql> ALTER TABLE orders MODIFY `O_ORDERSTATUS` char(1) NOT NULL
COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
7. Use EXPLAIN to determine if a query on the orders table can be offloaded. Using secondary
engine RAPID in the Extra column indicates that the query can be offloaded.
mysql> EXPLAIN SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT
FROM orders
WHERE O_ORDERDATE >= DATE '1994-03-01'
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY;
9. To compare HeatWave query execution time with MySQL DB System execution time, disable
use_secondary_engine and run the query again to see how long it takes to run on the MySQL
DB System.
mysql> SET SESSION use_secondary_engine=OFF;
You can determine where to apply the encoding and data placement optimizations yourself or run
the Advisor utility for recommendations. The Advisor Auto Encoding feature provides string column
encoding recommendations. The Advisor Auto Data Placement feature recommends data placement
keys. See Section 2.8, “Workload Optimization using Advisor”.
Advisor also includes a Query Insights feature that provides query runtimes and runtime estimates,
which can be used to optimize queries, troubleshoot, and perform workload cost estimations. See
Section 2.8.4, “Query Insights”.
When tables are loaded into HeatWave, variable-length encoding is applied to CHAR,
VARCHAR, and TEXT type columns by default. To use dictionary encoding, you must define the
RAPID_COLUMN=ENCODING=SORTED keyword string in a column comment before loading the table.
The keyword string must be uppercase; otherwise, it is ignored.
You can define the keyword string in a CREATE TABLE or ALTER TABLE statement, as shown:
mysql> CREATE TABLE orders (name VARCHAR(100)
COMMENT 'RAPID_COLUMN=ENCODING=SORTED');
Tip
For string column encoding recommendations, use the Advisor utility after
loading tables into HeatWave and running queries. For more information, see
Section 2.8, “Workload Optimization using Advisor”.
To modify or remove a string column encoding, refer to the procedure described in Section 2.4,
“Modifying Tables”.
• Variable-length encoding (VARLEN) is best suited to columns with a high number of distinct values,
such as “comment” columns.
• Dictionary encoding (SORTED) is best suited to columns with a low number of distinct values, such as
“country” columns.
Variable-length encoding requires space for column values on the HeatWave nodes. Dictionary
encoding requires space on the MySQL DB System node for dictionaries.
For additional information about string column encoding, see Section 2.14, “String Column Encoding
Reference”.
Generally, use data placement keys only if partitioning by the primary key does not provide adequate
performance. Also, reserve data placement keys for the most time-consuming queries. In such cases,
define data placement keys on the most frequently used JOIN keys and the keys of the longest running
queries.
Tip
For data placement key recommendations, use the Advisor utility after
loading tables into HeatWave and running queries. For more information, see
Section 2.8, “Workload Optimization using Advisor”.
Defining a data placement key requires adding a column comment with the data placement keyword
string:
RAPID_COLUMN=DATA_PLACEMENT_KEY=N
where N is an index value that defines the priority order of data placement keys.
• An index value cannot be repeated in the same table. For example, you cannot assign an index
value of 2 to more than one column in the same table.
• Gaps in index values are not permitted. For example, if you define a data placement key column with
an index value of 3, there must also be two other data placement key columns with index values of 1
and 2, respectively.
You can define the data placement keyword string in a CREATE TABLE or ALTER TABLE statement:
mysql> CREATE TABLE orders (date DATE
COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=1');
The following example shows multiple columns defined as data placement keys. Although a primary
key is defined, data is partitioned by the data placement keys, which are prioritized over the primary
key.
mysql> CREATE TABLE orders (
id INT PRIMARY KEY,
date DATE COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=1',
price FLOAT COMMENT 'RAPID_COLUMN=DATA_PLACEMENT_KEY=2');
When defining multiple columns as data placement keys, prioritize the keys according to query
cost. For example, assign DATA_PLACEMENT_KEY=1 to the key of the costliest query, and
DATA_PLACEMENT_KEY=2 to the key of the next costliest query, and so on.
Usage notes:
• JOIN and GROUP BY query optimizations are only applied if at least one of the JOIN or GROUP BY
relations has a key that matches the defined data placement key.
• If a JOIN operation can be executed with or without the JOIN and GROUP BY query optimization, a
compilation-time cost model determines how the query is executed. The cost model uses estimated
statistics.
• A data placement key cannot be defined on a dictionary-encoded string column, but data placement
keys are permitted on variable-length encoded columns. HeatWave applies variable-length encoding
to string columns by default. See Section 2.7.1, “Encoding String Columns”.
• A data placement key can only be defined on a column with a supported data type. See
Section 2.10, “Supported Data Types”.
• A data placement key column cannot be defined as a NOT SECONDARY column. See Section 2.2.2.1,
“Excluding Table Columns”.
• Auto Encoding
Recommends string column encodings that minimize the required cluster size and improve query
performance. See Section 2.8.2, “Auto Encoding”.
• Auto Data Placement
Recommends data placement keys for optimizing JOIN and GROUP BY query performance. See
Section 2.8.3, “Auto Data Placement”.
• Query Insights
Provides runtime information for successfully executed queries and runtime estimates for EXPLAIN
queries, queries cancelled with Ctrl+C, and queries that fail due to out-of-memory errors. Runtime
data is useful for query optimization, troubleshooting, and estimating the cost of running a
particular query or workload. See Section 2.8.4, “Query Insights”.
• Unload Advisor
Recommends tables to unload in order to reduce HeatWave memory usage. The recommendations are
based on when the tables were last queried. See Section 2.8.5, “Unload Advisor”.
To run Advisor, the HeatWave Cluster must be active, and the user must have the following MySQL
privileges:
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
['output',{'normal'|'silent'|'help'}]
['target_schema',JSON_ARRAY('schema_name'[,'schema_name'] ...)]
['exclude_query',JSON_ARRAY('query_id'[,'query_id'] ...)]
['query_session_id',JSON_ARRAY('query_session_id'[,'query_session_id'] ...)]
['query_insights',{true|false}]
['auto_enc',JSON_OBJECT(auto_enc_option)]
['auto_dp',JSON_OBJECT(auto_dp_option)]
['auto_unload',JSON_OBJECT(auto_unload_option)]
}
auto_enc_option: {
['mode',{'off'|'recommend'}]
['fixed_enc',JSON_OBJECT('schema.tbl.col',{'varlen'|'dictionary'}
[,'schema.tbl.col',{'varlen'|'dictionary'}] ...)]
}
auto_dp_option: {
['benefit_threshold',N]
['max_combinations',N]
}
auto_unload_option: {
['mode',{'off'|'recommend'}]
['exclude_list',JSON_ARRAY(schema_name_1, schema_name_2.table_name_1, ...)]
['last_queried_hours',N]
['memory_gain_ascending',{true|false}]
['limit_tables',N]
}
Advisor options are specified as key-value pairs in JSON format. Options include:
• output: Defines how Advisor produces output. Permitted values are:
• normal: The default. Produces summarized output and sends it to stdout and to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”.
Before MySQL 8.0.32, output is sent to the heatwave_advisor_report table. See Section 2.8.7,
“Advisor Report Table”.
• silent: Sends output to the heatwave_autopilot_report table only, which is useful if
human-readable output is not required, such as when the output is consumed by a script.
• help: Displays Advisor command-line help. See Section 2.8.6, “Advisor Command-line Help”.
• target_schema: Defines one or more schemas for Advisor to analyze. The list is specified in JSON
array format. If a target schema is not specified, all schemas in the HeatWave Cluster are analyzed.
When a target schema is specified, Advisor generates recommendations for tables belonging to the
target schema. For the most accurate recommendations, specify one schema at a time. Only run
Advisor on multiple schemas if your queries access tables in multiple schemas.
• exclude_query: Defines the IDs of queries to exclude when Advisor analyzes query statistics.
To identify query IDs, query the performance_schema.rpd_query_stats table. For a query
example, see Section 2.8.3.2, “Auto Data Placement Examples”.
• query_session_id: Defines session IDs for filtering queries by session ID. To identify session
IDs, query the performance_schema.rpd_query_stats table. For a query example, see
Section 2.8.4.3, “Query Insights Examples”.
• query_insights: Provides runtime information for successfully executed queries and runtime
estimates for EXPLAIN queries, queries cancelled using Ctrl+C, and queries that fail due to an out-
of-memory error. See Section 2.8.4, “Query Insights”. The default setting is false.
• auto_enc: Defines settings for Auto Encoding. See Section 2.8.2, “Auto Encoding”. Options
include:
• mode: Defines the Auto Encoding operational mode. Permitted values are off and recommend.
• fixed_enc: Defines an encoding type for specified columns. Use this option if you know the
encoding you want for a specific column and you are not interested in an encoding recommendation
for that column. Only applicable in recommend mode. Columns with a fixed encoding type are
excluded from encoding recommendations. The fixed_enc key is a fully qualified column name
without backticks in the following format: schema_name.tbl_name.col_name. The value is
the encoding type; either varlen or dictionary. Multiple key-value pairs can be specified in a
comma-separated list.
• auto_dp: Defines settings for Auto Data Placement, which recommends data placement keys. See
Section 2.8.3, “Auto Data Placement”. Options include:
• benefit_threshold: The minimum estimated query performance improvement, as a percentage,
required for Advisor to suggest a data placement key. The default is 1.
• max_combinations: The maximum number of data placement key combinations Advisor considers
before making recommendations. The default is 10000. The supported range is 1 to 100000.
Specifying fewer combinations generates recommendations more quickly, but the recommendations may
not be optimal.
• auto_unload: Defines settings for Unload Advisor, which recommends tables to unload. See
Section 2.8.5, “Unload Advisor”. Options include:
• exclude_list: Defines a list of schemas and tables to exclude from Unload Advisor. Names must
be fully qualified without backticks.
• last_queried_hours: Recommends unloading tables that were not queried in the past
last_queried_hours hours. Minimum: 1; maximum: 744; default: 24.
• memory_gain_ascending: When set to true, ranks the unload table suggestions in ascending
order of memory gain. The default is false.
• limit_tables: A limit to the number of unload table suggestions, based on the order imposed by
memory_gain_ascending. The default is 10.
To enable Auto Encoding, specify the auto_enc option in recommend mode. See Section 2.8.1,
“Advisor Syntax”.
Note
To run Advisor for both encoding and data placement recommendations, run
Auto Encoding first, apply the recommended encodings, rerun the queries,
and then run Auto Data Placement. This sequence allows data placement
performance benefits to be calculated with string column encodings in place,
which provides for greater accuracy from Advisor internal models.
For Advisor to provide string column encoding recommendations, tables must be loaded in HeatWave
and a query history must be available. Run the queries that you intend to use or run a representative
set of queries. Failing to do so can affect query offload after Auto Encoding recommendations are
implemented due to query constraints associated with dictionary encoding. For dictionary encoding
limitations, see Section 2.14.2, “Dictionary Encoding”.
In the following example, Auto Encoding is run in recommend mode, which analyzes column data,
checks the amount of memory on the MySQL node, and provides encoding recommendations intended
to reduce the amount of space required on HeatWave nodes and optimize query performance. No
target schema is specified, so Auto Encoding runs on all schemas loaded in HeatWave.
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('auto_enc',JSON_OBJECT('mode','recommend')));
The fixed_enc option can be used in recommend mode to specify an encoding for specific columns.
These columns are excluded from consideration when Auto Encoding generates recommendations.
Manually encoded columns are also excluded from consideration. (For manual encoding instructions,
see Section 2.7.1, “Encoding String Columns”.)
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('auto_enc',JSON_OBJECT('mode','recommend','fixed_enc',
JSON_OBJECT('tpch.CUSTOMER.C_ADDRESS','varlen'))));
Advisor output provides information about each stage of Advisor execution, including recommended
column encodings and estimated HeatWave Cluster memory savings.
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('target_schema',JSON_ARRAY('tpch_1024'),
'auto_enc',JSON_OBJECT('mode','recommend')));
+-------------------------------+
| INITIALIZING HEATWAVE ADVISOR |
+-------------------------------+
| Version: 1.12 |
| |
| Output Mode: normal |
| Excluded Queries: 0 |
| Target Schemas: 1 |
| |
+-------------------------------+
6 rows in set (0.01 sec)
+---------------------------------------------------------+
| ANALYZING LOADED DATA |
+---------------------------------------------------------+
| Total 8 tables loaded in HeatWave for 1 schemas |
| Tables excluded by user: 0 (within target schemas) |
| |
| SCHEMA TABLES COLUMNS |
| NAME LOADED LOADED |
| ------ ------ ------ |
| `tpch_1024` 8 61 |
| |
+---------------------------------------------------------+
8 rows in set (0.15 sec)
+-------------------------------------------------------------------------------------------+
| ENCODING SUGGESTIONS |
+-------------------------------------------------------------------------------------------+
| Total Auto Encoding suggestions produced for 21 columns |
| Queries executed: 200 |
| Total query execution time: 28.82 min |
| Most recent query executed on: Tuesday 8th June 2021 14:42:13 |
| Oldest query executed on: Tuesday 8th June 2021 14:11:45 |
| |
| CURRENT SUGGESTED |
| COLUMN COLUMN COLUMN |
| NAME ENCODING ENCODING |
| ------ -------- --------- |
| `tpch_1024`.`CUSTOMER`.`C_ADDRESS` VARLEN DICTIONARY |
| `tpch_1024`.`CUSTOMER`.`C_COMMENT` VARLEN DICTIONARY |
| `tpch_1024`.`CUSTOMER`.`C_MKTSEGMENT` VARLEN DICTIONARY |
| `tpch_1024`.`CUSTOMER`.`C_NAME` VARLEN DICTIONARY |
| `tpch_1024`.`LINEITEM`.`L_COMMENT` VARLEN DICTIONARY |
| `tpch_1024`.`LINEITEM`.`L_SHIPINSTRUCT` VARLEN DICTIONARY |
| `tpch_1024`.`LINEITEM`.`L_SHIPMODE` VARLEN DICTIONARY |
| `tpch_1024`.`NATION`.`N_COMMENT` VARLEN DICTIONARY |
| `tpch_1024`.`NATION`.`N_NAME` VARLEN DICTIONARY |
| `tpch_1024`.`ORDERS`.`O_CLERK` VARLEN DICTIONARY |
| `tpch_1024`.`ORDERS`.`O_ORDERPRIORITY` VARLEN DICTIONARY |
| `tpch_1024`.`PART`.`P_BRAND` VARLEN DICTIONARY |
| `tpch_1024`.`PART`.`P_COMMENT` VARLEN DICTIONARY |
| `tpch_1024`.`PART`.`P_CONTAINER` VARLEN DICTIONARY |
| `tpch_1024`.`PART`.`P_MFGR` VARLEN DICTIONARY |
| `tpch_1024`.`PARTSUPP`.`PS_COMMENT` VARLEN DICTIONARY |
| `tpch_1024`.`REGION`.`R_COMMENT` VARLEN DICTIONARY |
+-------------------------------------------------------------------------------------------+
| SCRIPT GENERATION |
+-------------------------------------------------------------------------------------------+
| Script generated for applying suggestions for 8 loaded tables |
| |
| Applying changes will take approximately 1.64 h |
| |
| Retrieve script containing 61 generated DDL commands using the query below: |
| SELECT log->>"$.sql" AS "SQL Script" FROM sys.heatwave_advisor_report WHERE type = "sql"|
| ORDER BY id; |
| |
| Caution: Executing the generated script will alter the column comment and secondary engine|
| flags in the schema |
| |
+-------------------------------------------------------------------------------------------+
9 rows in set (18.20 sec)
To inspect the load script, which includes the DDL statements required to implement the recommended
encodings, query the heatwave_autopilot_report table. Before MySQL 8.0.32, query the
heatwave_advisor_report table.
mysql> SELECT log->>"$.sql" AS "SQL Script"
FROM sys.heatwave_autopilot_report
WHERE type = "sql"
ORDER BY id;
To concatenate generated DDL statements into a single string that can be copied and pasted for
execution, issue the statements that follow. The group_concat_max_len variable sets the result
length in bytes for the GROUP_CONCAT() function to accommodate a potentially long string. (The
default group_concat_max_len setting is 1024 bytes.)
mysql> SET SESSION group_concat_max_len = 1000000;
mysql> SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ')
FROM sys.heatwave_advisor_report
WHERE type = "sql"
ORDER BY id;
Usage Notes:
• Auto Encoding analyzes string columns (CHAR, VARCHAR, and TEXT type columns) of tables that
are loaded in HeatWave. Automatically or manually excluded columns, columns greater than
65532 bytes, and columns with manually defined encodings are excluded from consideration. Auto
Encoding also analyzes HeatWave query history to identify query constraints that preclude the
use of dictionary encoding. Dictionary-encoded columns are not supported in JOIN operations,
with string functions and operators, or in LIKE predicates. For dictionary encoding limitations, see
Section 2.14.2, “Dictionary Encoding”.
• The time required to generate encoding recommendations depends on the number of queries to be
analyzed, the number of operators, and the complexity of each query.
• Encoding recommendations for the same table may differ after changes to data or data statistics.
For example, changes to table cardinality or the number of distinct values in a column can affect
recommendations.
• Auto Encoding does not generate recommendations for a given table if existing encodings do not
require modification.
• Auto Encoding only recommends dictionary encoding if it is expected to reduce the amount of
memory required on HeatWave nodes.
• If there is not enough MySQL node memory for the dictionaries of all columns that would benefit
from dictionary encoding, the columns estimated to save the most memory are recommended for
dictionary encoding.
• Auto Encoding uses the current state of tables loaded in HeatWave when generating
recommendations. Concurrent change propagation activity is not considered.
• Encoding recommendations are based on estimates and are therefore not guaranteed to reduce the
memory required on HeatWave nodes or improve query performance.
• Running Auto Encoding with the fixed_enc option to force variable-length encoding for the
tpch.CUSTOMER.C_ADDRESS column. Columns specified by the fixed_enc option are excluded
from consideration by the Auto Encoding feature.
Note
To run Advisor for both encoding and data placement recommendations, run
Auto Encoding first, apply the recommended encodings, rerun the queries,
and then run Auto Data Placement. This sequence allows data placement
performance benefits to be calculated with string column encodings in place,
which provides for greater accuracy from Advisor internal models.
• There must be a query history with at least 5 queries. A query is counted if it includes a JOIN on
tables loaded in the HeatWave Cluster or GROUP BY keys. A query executed on a table that is no
longer loaded or that was reloaded since the query was run is not counted.
For the most accurate data placement recommendations, run Advisor on one schema at a time. In the
following example, Advisor is run on the tpch_1024 schema using the target_schema option. No
other options are specified, which means that the default option settings are used.
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('target_schema',JSON_ARRAY('tpch_1024')));
Advisor output provides information about each stage of Advisor execution. The data placement
suggestion output shows suggested data placement keys and the estimated performance benefit of
applying the keys.
The script generation output provides a query for retrieving the generated DDL statements for
implementing the suggested data placement keys. Data placement keys cannot be added to a table or
modified without reloading the table. Therefore, Advisor generates DDL statements for unloading the
table, adding the keys, and reloading the table.
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('target_schema',JSON_ARRAY('tpch_1024')));
+-------------------------------+
| INITIALIZING HEATWAVE ADVISOR |
+-------------------------------+
| Version: 1.12 |
| |
| Output Mode: normal |
| Excluded Queries: 0 |
| Target Schemas: 1 |
| |
+-------------------------------+
6 rows in set (0.01 sec)
+---------------------------------------------------------+
| ANALYZING LOADED DATA |
+---------------------------------------------------------+
| Total 8 tables loaded in HeatWave for 1 schemas |
| Tables excluded by user: 0 (within target schemas) |
| |
| SCHEMA TABLES COLUMNS |
| NAME LOADED LOADED |
| ------ ------ ------ |
| `tpch_1024` 8 61 |
| |
+---------------------------------------------------------+
8 rows in set (0.02 sec)
+----------------------------------------------------------------------+
| AUTO DATA PLACEMENT |
+----------------------------------------------------------------------+
| Auto Data Placement Configuration: |
| Minimum benefit threshold: 1% |
| Producing Data Placement suggestions for current setup: |
| Tables Loaded: 8 |
| Queries used: 189 |
| Total query execution time: 22.75 min |
| Most recent query executed on: Tuesday 8th June 2021 16:29:02 |
| Oldest query executed on: Tuesday 8th June 2021 16:05:43 |
| HeatWave cluster size: 5 nodes |
| |
| All possible Data Placement combinations based on query history: 120 |
| Explored Data Placement combinations after pruning: 90 |
| |
+----------------------------------------------------------------------+
16 rows in set (12.38 sec)
+---------------------------------------------------------------------------------------+
| DATA PLACEMENT SUGGESTIONS |
+---------------------------------------------------------------------------------------+
| Total Data Placement suggestions produced for 2 tables |
| |
| TABLE DATA PLACEMENT DATA PLACEMENT |
| NAME CURRENT KEY SUGGESTED KEY |
| ------ -------------- -------------- |
| `tpch_1024`.`LINEITEM` L_ORDERKEY, L_LINENUMBER L_ORDERKEY |
+-------------------------------------------------------------------------------------------+
| SCRIPT GENERATION |
+-------------------------------------------------------------------------------------------+
| Script generated for applying suggestions for 2 loaded tables |
| |
| Applying changes will take approximately 1.18 h |
| |
| Retrieve script containing 12 generated DDL commands using the query below: |
| SELECT log->>"$.sql" AS "SQL Script" FROM sys.heatwave_advisor_report WHERE type = "sql"|
| ORDER BY id; |
| |
| Caution: Executing the generated script will alter the column comment and secondary engine|
| flags in the schema |
| |
+-------------------------------------------------------------------------------------------+
9 rows in set (16.43 sec)
Usage Notes:
• If a table already has data placement keys or columns are customized before running Advisor,
Advisor may generate DDL statements for removing previously defined data placement keys.
• Advisor provides recommendations only if data placement keys are estimated to improve query
performance. If not, an information message is returned and no recommendations are provided.
• Running Advisor with only the target_schema option runs the Data Placement Advisor on the
specified schemas with the default option settings.
• Running the Advisor with the data placement max_combinations and benefit_threshold
parameters. For information about these options, see Section 2.8.1, “Advisor Syntax”.
• The following example shows how to view the HeatWave query history by querying the
performance_schema.rpd_query_stats table, and how to exclude specific queries from Data
Placement Advisor analysis using the exclude_query option:
• This example demonstrates how to invoke the Data Placement Advisor with options specified in a
variable:
• This example demonstrates how to invoke Advisor in silent output mode, which is useful if the output
is consumed by a script, for example. Auto Data Placement is run by default if no option such as
auto_enc or query_insights is specified.
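As a sketch of the last item above, a silent-mode Advisor call that runs Auto Data Placement by
default might look like this:
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('output','silent'));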
Runtime data can be used for query optimization, troubleshooting, or to estimate the cost of running a
particular query or workload on HeatWave.
For Query Insights to provide runtime data, a query history must be available. Query Insights
provides runtime data for up to 1000 queries, which is the HeatWave query history limit. To view
the current HeatWave query history, query the performance_schema.rpd_query_stats table:
mysql> SELECT query_id, LEFT(query_text,160)
FROM performance_schema.rpd_query_stats;
The following example shows how to retrieve runtime data for the entire query history using Query
Insights. In this example, there are three queries in the query history: a successfully executed query, a
query that failed due to an out of memory error, and a query that was cancelled using Ctrl+C. For an
explanation of Query Insights data, see Section 2.8.4.2, “Query Insights Data”.
+---------------------------------------------------------+
| ANALYZING LOADED DATA |
+---------------------------------------------------------+
| Total 8 tables loaded in HeatWave for 1 schemas |
| Tables excluded by user: 0 (within target schemas) |
| |
| SCHEMA TABLES COLUMNS |
| NAME LOADED LOADED |
| ------ ------ ------ |
| `tpch128` 8 61 |
| |
+---------------------------------------------------------+
8 rows in set (0.02 sec)
+-------------------------------------------------------------------------------------+
| QUERY INSIGHTS |
+-------------------------------------------------------------------------------------+
| Queries executed on Heatwave: 4 |
| Session IDs (as filter): None |
| |
| QUERY-ID SESSION-ID QUERY-STRING EXEC-RUNTIME COMMENT |
| -------- ---------- ------------ ------------ ------- |
| 1 32 SELECT COUNT(*) |
| FROM tpch128.LINEITEM 0.628 |
| 2 32 SELECT COUNT(*) |
| FROM tpch128.ORDERS 0.114 (est.) Explain. |
| 3 32 SELECT COUNT(*) |
| FROM tpch128.ORDERS, |
| tpch128.LINEITEM 5.207 (est.) Out of memory |
| error during |
| query execution |
| in RAPID. |
| 4 32 SELECT COUNT(*) |
| FROM tpch128.SUPPLIER, |
| tpch128.LINEITEM 3.478 (est.) Operation was |
| interrupted by |
| the user. |
| TOTAL ESTIMATED: 3 EXEC-RUNTIME: 8.798 sec |
| TOTAL EXECUTED: 1 EXEC-RUNTIME: 0.628 sec |
| |
| |
| Retrieve detailed query statistics using the query below: |
| SELECT log FROM sys.heatwave_advisor_report WHERE stage = "QUERY_INSIGHTS" AND |
| type = "info"; |
| |
+-------------------------------------------------------------------------------------+
| |
| {"comment": "Explain.", "query_id": 2, "query_text": "SELECT COUNT(*) |
| FROM tpch128.ORDERS", "session_id": 32, "runtime_executed_ms": null, |
| "runtime_estimated_ms": 113.592768} |
| |
| {"comment": "Out of memory error during query execution in RAPID.", "query_id": 3, |
| "query_text": "SELECT COUNT(*) FROM tpch128.ORDERS, tpch128.LINEITEM", |
| "session_id": 32, "runtime_executed_ms": null, "runtime_estimated_ms": 5206.80822} |
| |
| {"comment": "Operation was interrupted by the user.", "query_id": 4, |
| "query_text": "SELECT COUNT(*) FROM tpch128.SUPPLIER, tpch128.LINEITEM", |
| "session_id": 32, "runtime_executed_ms": null, "runtime_estimated_ms": 3477.720953} |
+--------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
• QUERY-ID
The ID assigned to the query in the HeatWave query history.
• SESSION-ID
The ID of the session in which the query was issued.
• QUERY-STRING
The query string. EXPLAIN, if specified, is not displayed in the query string.
• EXEC-RUNTIME
The query execution runtime in seconds. Runtime estimates are differentiated from actual runtimes
by the appearance of the following text adjacent to the runtime: (est.). Actual runtimes are shown
for successfully executed queries. Runtime estimates are shown for EXPLAIN queries, queries
cancelled by Ctrl+C, and queries that fail with an out-of-memory error.
• COMMENT
• Operation was interrupted by the user: The query was successfully offloaded to
HeatWave but was interrupted by a Ctrl+C key combination.
• Out of memory error during query execution in RAPID: The query was successfully
offloaded to HeatWave but failed due to an out-of-memory error.
• TOTAL ESTIMATED: The total number of queries with runtime estimates and their total execution
runtime (estimated).
• TOTAL EXECUTED: The total number of successfully executed queries and their total execution
runtime (actual).
The query retrieves detailed statistics from the heatwave_autopilot_report table. Before
MySQL 8.0.32, it retrieves from the heatwave_advisor_report table. For an example of the
detailed statistics, see Section 2.8.4.1, “Running Query Insights”.
Query Insights data is available in machine-readable format for use in scripts. Query Insights data
is also available in JSON format or SQL table format through queries to the
heatwave_autopilot_report table. See Section 6.1, “HeatWave Autopilot Report Table”. Before
MySQL 8.0.32, query the heatwave_advisor_report table. See Section 2.8.7, “Advisor Report
Table”.
• This example demonstrates how to invoke the Query Insights Advisor in silent output mode, which
is useful when the output is consumed by a script.
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('query_insights',true,'output','silent'));
To enable Unload Advisor, specify the auto_unload option in recommend mode. See Section 2.8.1,
“Advisor Syntax”.
Use the exclude_list option to define a list of schemas and tables to exclude from Unload Advisor.
Use the last_queried_hours option to only recommend unloading tables that were not queried
during this past number of hours. The default is 24 hours.
Set memory_gain_ascending to true to rank the unload table suggestions in ascending order
based on the table size. The default is false.
Use the limit_tables option to limit the number of unload table suggestions, based on the order
imposed by memory_gain_ascending. The default is 10.
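For example, a call along the following lines (the option values are illustrative) runs Unload
Advisor in recommend mode with a 48-hour window:
mysql> CALL sys.heatwave_advisor(JSON_OBJECT('auto_unload',
JSON_OBJECT('mode','recommend','last_queried_hours',48)));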
| SCHEMA TABLE
| NAME NAME REASON
| ------ ----------- --------------------------------------------- -------
| `tpch128` `LINEITEM` LAST QUERIED ON '2022-09-12 15:30:40.538585'
| `tpch128` `CUSTOMER` LAST QUERIED ON '2022-09-12 15:30:40.538585'
|
|
| Storage consumed by base relations after unload: 100 GiB
+----------------------------------------------------------------------------------------------------
When MySQL runs Advisor, it sends detailed output to the heatwave_advisor_report table in the
sys schema.
The heatwave_advisor_report table is a temporary table. It contains data from the last execution
of Advisor. Data is only available for the current session and is lost when the session terminates or
when the server is shut down.
• Concatenate Advisor generated DDL statements into a single string to copy and paste for execution.
The group_concat_max_len variable sets the result length in bytes for the GROUP_CONCAT()
function to accommodate a potentially long string. The default group_concat_max_len setting is
1024 bytes.
mysql> SET SESSION group_concat_max_len = 1000000;
mysql> SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ')
FROM sys.heatwave_advisor_report
WHERE type = "sql"
ORDER BY id;
• Retrieve detailed Query Insights statistics:
mysql> SELECT log
FROM sys.heatwave_advisor_report
WHERE stage = "QUERY_INSIGHTS" AND type = "info";
• Instead of preparing and loading tables into HeatWave manually, consider using the Auto Parallel
Load utility. See Section 2.2.3, “Loading Data Using Auto Parallel Load”.
• To minimize the number of HeatWave nodes required for your data, exclude table columns that
are not accessed by your queries. For information about excluding columns, see Section 2.2.2.1,
“Excluding Table Columns”.
• To save space in memory, set CHAR, VARCHAR, and TEXT type column lengths to the minimum
length required for the longest string value.
• Where appropriate, apply dictionary encoding to CHAR, VARCHAR, and TEXT type columns.
Dictionary encoding reduces memory consumption on the HeatWave Cluster nodes. Use the
following criteria when selecting string columns for dictionary encoding:
1. Your queries do not use the column in JOIN operations. Dictionary-encoded columns are not
supported in JOIN operations.
2. Your queries do not perform operations such as LIKE, SUBSTR, or CONCAT on the column.
Variable-length encoding supports string functions and operators and LIKE predicates; dictionary
encoding does not.
3. The column has a limited number of distinct values. Dictionary encoding is best suited to columns
with a limited number of distinct values, such as “country” columns.
4. The column is expected to have few new values added during change propagation. Avoid
dictionary encoding for columns with a high number of inserts and updates. Adding a significant
number of new, unique values to a dictionary-encoded column can cause a change propagation
failure.
The following columns from the TPC Benchmark™ H (TPC-H) provide examples of string columns
that are suitable and unsuitable for dictionary encoding:
• ORDERS.O_ORDERPRIORITY
This column is used only in range queries. The values associated with the column are limited.
During updates, it is unlikely for a significant number of new, unique values to be added. These
characteristics make the column suitable for dictionary encoding.
• LINEITEM.L_COMMENT
This column is not used in joins or other complex expressions, but, as a comment field, its values
are expected to be unique, making the column unsuitable for dictionary encoding.
When in doubt about choosing an encoding type, use variable-length encoding, which is applied by
default when tables are loaded into HeatWave, or use the HeatWave Encoding Advisor to obtain
encoding recommendations. See Section 2.8.2, “Auto Encoding”.
• Data is partitioned by the table primary key when no data placement keys are defined. Only consider
defining data placement keys if partitioning data by the primary key does not provide suitable
performance.
Reserve the use of data placement keys for the most time-consuming queries. In such cases, define
data placement keys on:
• The most frequently used JOIN keys
• The keys of the longest-running queries
Consider using Auto Data Placement for data placement recommendations. See Section 2.8.3, “Auto
Data Placement”.
2.9.2 Provisioning
To determine the appropriate HeatWave Cluster size for a workload, you can estimate the required
cluster size. Cluster size estimates are generated by the HeatWave Auto Provisioning feature, which
uses machine learning models to predict the number of required nodes based on node shape and data
sampling. For instructions:
• For HeatWave on OCI, see Generating a Node Count Estimate in the HeatWave on OCI Service
Guide.
• For HeatWave on AWS, see Estimating Cluster Size with HeatWave Autopilot in the HeatWave on
AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure Service
Guide.
Generate a cluster size estimate in the following situations:
• When adding a HeatWave Cluster to a MySQL DB System, to determine the number of nodes
required for the data you intend to load.
• Periodically, to ensure that you have an appropriate number of HeatWave nodes for your data.
Over time, data size may increase or decrease, so it is important to monitor the size of your data by
performing cluster size estimates.
• When encountering out-of-memory errors while running queries. In this case, the HeatWave Cluster
may not have sufficient memory capacity.
• When the transaction rate (the rate of updates and inserts) is high.
For information about importing and exporting data:
• For HeatWave on OCI, see Importing and Exporting Databases in the HeatWave on OCI Service
Guide.
• For HeatWave on AWS, see Importing Data in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, see Importing Data to HeatWave in the HeatWave for Azure Service Guide.
Before MySQL 8.0.31, DDL operations are not permitted on tables defined with a secondary engine.
In those releases, before replicating a DDL operation from an on-premises instance to a table on the
MySQL DB System that is defined with a secondary engine, you must set the SECONDARY_ENGINE
option to NULL; for example:
mysql> ALTER TABLE orders SECONDARY_ENGINE = NULL;
Setting the SECONDARY_ENGINE option to NULL removes the SECONDARY_ENGINE option from the
table definition and unloads the table from HeatWave. To reload the table into HeatWave after the
DDL operation is replicated, specify the SECONDARY_LOAD option in an ALTER TABLE statement.
mysql> ALTER TABLE orders SECONDARY_LOAD;
The loading of data into HeatWave can be classified into three types: Initial Bulk Load, Incremental
Bulk Load, and Change Propagation.
• Initial Bulk Load: Performed when loading data into HeatWave for the first time, or when reloading
data. The best time to perform an initial bulk load is during off-peak hours, as bulk load operations
can affect OLTP performance on the MySQL DB System.
• Incremental Bulk Load: Performed when there is a substantial amount of data to load into tables
that are already loaded in HeatWave. An incremental bulk load involves these steps:
1. Performing a SECONDARY_UNLOAD operation to unload the table from HeatWave.
2. Loading the new data into the table on the MySQL DB System.
3. Performing a SECONDARY_LOAD operation to reload the table into HeatWave. See Section 2.2,
“Loading Data to HeatWave MySQL”.
Depending on the amount of data, an incremental bulk load may be a faster method of loading new
data than waiting for change propagation to occur. It also provides greater control over when new
data is loaded. As with initial bulk loads, the best time to perform an incremental bulk load is during
off-peak hours, as bulk load operations can affect OLTP performance on the MySQL DB System.
• Change Propagation: After tables are loaded into HeatWave, data changes are automatically
propagated from InnoDB tables on the MySQL DB System to their counterpart tables in HeatWave.
See Section 2.2.7, “Change Propagation”.
For medium to large tables, increase the number of read threads to 32 by setting the
innodb_parallel_read_threads variable on the MySQL DB System.
mysql> SET SESSION innodb_parallel_read_threads = 32;
If the MySQL DB System is not busy, you can increase the value to 64.
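For example:
mysql> SET SESSION innodb_parallel_read_threads = 64;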
Tip
The Auto Parallel Load utility automatically optimizes the number of parallel
read threads for each table. See Section 2.2.3, “Loading Data Using Auto
Parallel Load”. For users of HeatWave on AWS, the number of parallel read
threads is also optimized when loading data from the HeatWave Console.
See Manage Data in HeatWave with Workspaces in the HeatWave on AWS
Service Guide.
If you have many small and medium tables (less than 20GB in size), load tables from multiple
sessions:
Session 1:
mysql> ALTER TABLE supplier SECONDARY_LOAD;
Session 2:
mysql> ALTER TABLE parts SECONDARY_LOAD;
Session 3:
mysql> ALTER TABLE region SECONDARY_LOAD;
Session 4:
mysql> ALTER TABLE partsupp SECONDARY_LOAD;
Data load operations share resources with other OLTP DML and DDL operations on the MySQL DB
System. To improve load performance, avoid or reduce conflicting DDL and DML operations. For
example, avoid running DDL and large DML operations on the LINEITEM table while executing an
ALTER TABLE LINEITEM SECONDARY_LOAD operation.
In all cases, re-run your queries before running Advisor. See Section 2.8, “Workload Optimization using
Advisor”.
• If a query fails to offload and you cannot identify the reason, enable tracing and query the
INFORMATION_SCHEMA.OPTIMIZER_TRACE table to debug the query. See Section 2.3.5,
“Debugging Queries”.
If the optimizer trace does not return all of the trace information, increase the optimizer
trace buffer size. The MISSING_BYTES_BEYOND_MAX_MEM_SIZE column of the
INFORMATION_SCHEMA.OPTIMIZER_TRACE table shows how many bytes are missing from a
trace. If the column shows a non-zero value, increase the optimizer_trace_max_mem_size
setting accordingly. For example:
SET optimizer_trace_max_mem_size=1000000;
A query with a correlated subquery can be rewritten as follows to unnest the subquery so that it can
be offloaded.
mysql> EXPLAIN SELECT COUNT(*)
FROM orders o, (SELECT o_custkey, AVG(o_totalprice) a_totalprice
FROM orders
GROUP BY o_custkey)a
WHERE o.o_custkey=a.o_custkey AND o.o_totalprice>a.a_totalprice;
• By default, SELECT queries are offloaded to HeatWave for execution and fall back to the MySQL DB
System if that is not possible. To force a query to execute on HeatWave or fail if that is not possible,
set the use_secondary_engine variable to FORCED. In this mode, a SELECT statement returns
an error if it cannot be offloaded. The use_secondary_engine variable can be set as shown:
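For example, to force offload for the current session (a session-level setting; the variable can also
be set globally):
mysql> SET SESSION use_secondary_engine = FORCED;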
1. Avoid or rewrite queries that produce a Cartesian product. In the following query, a JOIN
predicate is not defined between the supplier and nation tables, which causes the query to
select all rows from both tables:
mysql> SELECT s_nationkey, s_suppkey, l_comment FROM lineitem, supplier, nation
WHERE s_suppkey = l_suppkey LIMIT 10;
ERROR 3015 (HY000): Out of memory in storage engine 'Failure detected in RAPID; query
execution cannot proceed'.
To avoid the Cartesian product, add a relevant predicate between the supplier and nation
tables to filter out rows:
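For example, a possible rewrite of the previous query, assuming the TPC-H schema where
supplier.s_nationkey references nation.n_nationkey:
mysql> SELECT s_nationkey, s_suppkey, l_comment FROM lineitem, supplier, nation
       WHERE s_suppkey = l_suppkey AND s_nationkey = n_nationkey LIMIT 10;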
2. Avoid or rewrite queries that produce a Cartesian product introduced by the MySQL optimizer.
Due to a lack of quality statistics or non-optimal cost decisions, the MySQL optimizer may introduce
one or more Cartesian products in a query even if the query has predicates defined among all
participating tables. For example:
The EXPLAIN plan output shows that there is no common predicate between the first two table
entries (NATION and SUPPLIER).
id: 1
select_type: SIMPLE
table: orders
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 14862970
filtered: 10.00
Extra: Using where; Using join buffer (hash join); Using secondary engine RAPID
*************************** 5. row ***************************
id: 1
select_type: SIMPLE
table: lineitem
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 56834662
filtered: 10.00
Extra: Using where; Using join buffer (hash join); Using secondary engine RAPID
To force a join order so that there are predicates associated with each pair of tables, add a
STRAIGHT_JOIN hint. For example:
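A sketch of such a hint, assuming the TPC-H tables used in these examples (the original statement
is not shown, and the EXPLAIN output below is from the original example); STRAIGHT_JOIN joins the
tables in the order listed, so each adjacent pair shares a join predicate:
mysql> EXPLAIN SELECT COUNT(*)
       FROM nation STRAIGHT_JOIN supplier STRAIGHT_JOIN lineitem STRAIGHT_JOIN orders
       WHERE n_nationkey = s_nationkey
       AND s_suppkey = l_suppkey
       AND l_orderkey = o_orderkey\G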
Extra: Using where; Using join buffer (hash join); Using secondary engine RAPID
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: orders
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 14862970
filtered: 10.00
Extra: Using where; Using join buffer (hash join); Using secondary engine RAPID
*************************** 5. row ***************************
id: 1
select_type: SIMPLE
table: lineitem
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 56834662
filtered: 10.00
Extra: Using where; Using join buffer (hash join); Using secondary engine RAPID
3. Avoid or rewrite queries that produce a very large result set. This is a common cause of out of
memory errors during query processing. Use aggregation functions, a GROUP BY clause, or a
LIMIT clause to reduce the result set size.
4. Avoid or rewrite queries that produce a very large intermediate result set. In certain cases, large
result sets can be avoided by adding a STRAIGHT_JOIN hint, which enforces a join order in
decreasing order of selectivity.
5. Check the size of your data by performing a cluster size estimate. If your data has grown
substantially, the HeatWave Cluster may require additional nodes.
• For HeatWave on OCI, see Generating a Node Count Estimate in the HeatWave on OCI
Service Guide.
• For HeatWave on AWS, see Estimating Cluster Size with HeatWave Autopilot in the HeatWave
on AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure
Service Guide.
6. HeatWave optimizes for network usage by default. Try running the query with the
MIN_MEM_CONSUMPTION strategy by setting rapid_execution_strategy to
MIN_MEM_CONSUMPTION. The rapid_execution_strategy variable can be set as shown:
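For example, at the session level:
mysql> SET SESSION rapid_execution_strategy = MIN_MEM_CONSUMPTION;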
• Unloading tables that are not used. These tables consume memory on HeatWave nodes
unnecessarily. See Section 2.5.1, “Unloading Tables”.
• Excluding table columns that are not accessed by your queries. These columns consume
memory on HeatWave nodes unnecessarily. This strategy requires reloading data. See
Section 2.2.2.1, “Excluding Table Columns”.
7. After running queries, consider using HeatWave Advisor for encoding and data placement
recommendations. See Section 2.8, “Workload Optimization using Advisor”.
2.9.8 Monitoring
The following monitoring practices are recommended:
• For HeatWave on OCI, monitor operating system memory usage by setting an alarm to notify you
when memory usage on HeatWave nodes remains above 450GB for an extended period of time.
If memory usage exceeds this threshold, either reduce the size of your data or add nodes to the
HeatWave Cluster. For information about using metrics, alarms, and notifications, refer to Metrics in
the HeatWave on OCI Service Guide.
• For HeatWave on AWS, you can monitor memory usage on the Performance tab of the HeatWave
Console. See Performance Monitoring in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, select Metrics on the details page for the HeatWave Cluster to access
Microsoft Azure Application Insights. See About Oracle Database Service for Azure.
• Monitor change propagation status. If change propagation is interrupted and tables are not
automatically reloaded for some reason, table data becomes stale. Queries that access tables
with stale data are not offloaded to HeatWave for processing. For instructions, see Section 2.2.7,
“Change Propagation”.
• After resizing the cluster by adding or removing nodes. Reloading data distributes the data among all
nodes of the resized cluster.
• After a maintenance window. Maintenance involves a DB System restart, which requires that you
reload data into HeatWave. On OCI, consider setting up a HeatWave Service event notification or
Service Connector Hub notification to let you know when an update has occurred. For information
about MySQL DB System maintenance:
• For HeatWave on OCI, see Maintenance in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Maintenance in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, see the Oracle Database Service for Azure documentation.
• For information about HeatWave Service events, see Managing a DB System in the HeatWave on
OCI Service Guide.
• For information about Service Connector Hub, see Service Connector Hub.
• For table load instructions, see Section 2.2, “Loading Data to HeatWave MySQL”.
Tip
Instead of loading data into HeatWave manually, consider using the Auto
Parallel Load utility, which prepares and loads data for you using an
optimized number of parallel load threads. See Section 2.2.3, “Loading
Data Using Auto Parallel Load”.
• When the HeatWave Cluster is restarted due to a DB System restart. Data in the HeatWave Cluster
is lost in this case, requiring reload.
2.10 Supported Data Types
• BIGINT
• BOOL
• DECIMAL
• DOUBLE
• FLOAT
• INT
• INTEGER
• MEDIUMINT
• SMALLINT
• TINYINT
• DATE
• DATETIME
• TIME
• TIMESTAMP
• YEAR
Temporal types are supported only with strict SQL mode. See Strict SQL Mode.
For limitations related to TIMESTAMP data type support, see Section 2.18, “HeatWave MySQL
Limitations”.
• CHAR
• VARCHAR
• TEXT
• TINYTEXT
• MEDIUMTEXT
• LONGTEXT
• ENUM
• JSON
• COALESCE() and IN(), see: Section 2.12.4, “Comparison Functions and Operators”.
2.11 Supported SQL Modes
• ANSI_QUOTES
• HIGH_NOT_PRECEDENCE
• IGNORE_SPACE
• NO_BACKSLASH_ESCAPES
• REAL_AS_FLOAT
• TIME_TRUNCATE_FRACTIONAL
Temporal types are supported only with strict SQL mode. See Strict SQL Mode.
Aggregate Functions
• HLL(expr,[expr...], precision)
Returns an approximate count of the number of rows with different non-NULL expr values. 4 ≤
precision ≤ 15. The default value is 10.
As of MySQL 8.3.0, HyperLogLog, HLL(), is available in the HeatWave secondary engine only. As of
MySQL 8.4.0, it is available in both the HeatWave primary and secondary engines. It is similar to
COUNT(DISTINCT), but with user-defined precision.
HLL() is faster than COUNT(DISTINCT). The greater the precision, the greater the accuracy,
and the greater the amount of memory required.
For each GROUP, HLL() requires 2^precision bytes of memory. The default precision of 10 requires
1KiB of memory per GROUP.
An example with precision set to 8 that uses 256 bytes of memory per GROUP:
mysql> SELECT HLL(results, 8) FROM student;
The ROLLUP modifier generates aggregated results that follow the hierarchy for the selected columns.
The CUBE modifier generates aggregated results for all possible combinations of the selected columns.
For a single column the results are the same.
Arithmetic Operators
• CAST() from and to all the HeatWave supported numeric, temporal, string and text data types. See
Section 2.10, “Supported Data Types”.
• CAST() of ENUM columns to CHAR, DECIMAL, FLOAT, and to SIGNED and UNSIGNED numeric
values. CAST() operates on the ENUM index rather than the ENUM values.
Comparison Functions and Operators
expr IN (value,...): IN() comparisons where the expression is a single value and the compared
values are constants of the same data type and encoding are optimized for performance. For
example, the following IN() comparison is optimized:
Control Flow Functions and Operators
JSON Functions
Logical Operators
Name Description
AND, && Logical AND
NOT, ! Negates value
||, OR Logical OR
XOR Logical XOR
Mathematical Functions
Name Description
ABS() Return the absolute value.
ACOS() Return the arc cosine.
ASIN() Return the arc sine.
ATAN() Return the arc tangent.
ATAN2(), ATAN() Return the arc tangent of the two arguments
CEIL() Return the smallest integer value not less than the
argument. The function is not applied to BIGINT
values. The input value is returned. CEIL() is a
synonym for CEILING().
CEILING() Return the smallest integer value not less than the
argument. The function is not applied to BIGINT
values. The input value is returned. CEILING() is
a synonym for CEIL().
COS() Return the cosine.
COT() Return the cotangent.
CRC32() Compute a cyclic redundancy check value.
DEGREES() Convert radians to degrees.
EXP() Raise to the power of.
FLOOR() Return the largest integer value not greater than
the argument. The function is not applied to
BIGINT values. The input value is returned.
LN() Return the natural logarithm of the argument.
LOG() Return the natural logarithm of the first argument.
LOG10() Return the base-10 logarithm of the argument.
LOG2() Return the base-2 logarithm of the argument.
MOD() Return the remainder.
PI() Return the value of pi.
POW() Return the argument raised to the specified power.
POWER() Return the argument raised to the specified power.
RADIANS() Return argument converted to radians.
RAND() Return a random floating-point value.
ROUND() Round the argument.
SIGN() Return the sign of the argument.
SIN() Return the sine of the argument.
SQRT() Return the square root of the argument.
TAN() Return the tangent of the argument.
TRUNCATE() Truncate to specified number of decimal places.
String Functions and Operators
Name Description
ASCII() Return numeric value of left-most character
BIN() Return a string containing binary representation of
a number. Supported as of MySQL 8.2.0.
BIT_LENGTH() Return length of argument in bits. Supported as of
MySQL 8.2.0.
CHAR_LENGTH() Return number of characters in argument
CONCAT() Return concatenated string
CONCAT_WS() Return concatenated with separator
ELT() Return string at index number. Supported as of
MySQL 8.2.0.
EXPORT_SET() Return a string such that for every bit set in the
value bits, it returns an on string (and for every
unset bit, it returns an off string). Supported as of
MySQL 8.2.0.
FIELD() Index (position) of first argument in subsequent
arguments. Supported as of MySQL 8.2.0.
FIND_IN_SET() Index (position) of first argument within second
argument
FORMAT() Return a number formatted to specified number of
decimal places. Does not support variable-length-
encoded columns.
FROM_BASE64() Decode base64 encoded string and return result
GREATEST() Return the largest argument. MySQL 8.0.30
includes support for temporal types with the
exception of the YEAR data type. MySQL 8.3.0
includes support for the YEAR data type.
HEX() Hexadecimal representation of decimal or string
value
INSERT() Insert substring at specified position up to the specified number of characters
INSTR() Return the index of the first occurrence of
substring
LEAST() Return the smallest argument. MySQL 8.0.30
includes support for temporal types with the
exception of the YEAR data type. MySQL 8.3.0
includes support for the YEAR data type.
LEFT() Return the leftmost number of characters as
specified
LENGTH() Return the length of a string in bytes
LIKE Simple pattern matching
LOCATE() Return the position of the first occurrence of
substring
LOWER() Return the argument in lowercase
LPAD() Return the string argument, left-padded with the
specified string
LTRIM() Remove leading spaces
MAKE_SET() Return a set of comma-separated strings that
have the corresponding bit in bits. Supported as of
MySQL 8.2.0.
MID() Return a substring starting from the specified
position
NOT LIKE Negation of simple pattern matching
OCTET_LENGTH() Synonym for LENGTH()
ORD() Return character code for leftmost character of the
argument
POSITION() Synonym for LOCATE()
QUOTE() Escape the argument for use in an SQL statement
REGEXP Whether string matches regular expression
REGEXP_INSTR() Starting index of substring matching regular
expression
REGEXP_LIKE() Whether string matches regular expression
REGEXP_REPLACE() Replace substrings matching regular expression.
Supports up to three arguments.
REGEXP_SUBSTR() Return substring matching regular expression.
Supports up to three arguments.
REPEAT() Repeat a string the specified number of times
REPLACE() Replace occurrences of a specified string
REVERSE() Reverse the characters in a string
RIGHT() Return the specified rightmost number of
characters
RLIKE Whether string matches regular expression
RPAD() Return the string argument, right-padded with the specified string
RTRIM() Remove trailing spaces
SOUNDEX() Return a soundex string
SPACE() Return a string of the specified number of spaces
STRCMP() Compare two strings
SUBSTR() Return the substring as specified
SUBSTRING() Return the substring as specified
SUBSTRING_INDEX() Return a substring from a string before the
specified number of occurrences of the delimiter
TO_BASE64() Return the argument converted to a base-64 string
TRIM() Remove leading and trailing spaces
UNHEX() Return a string containing the characters represented by the hexadecimal digits in the
argument
UPPER() Convert to uppercase
WEIGHT_STRING() Return the weight string for a string
As of MySQL 8.4.0, HeatWave supports named time zones such as MET or Europe/Amsterdam for
CONVERT_TZ(). For a workaround before MySQL 8.4.0, see Section 2.18.3, “Functions and Operator
Limitations”.
As of MySQL 8.3.0, all functions support variable-length encoded string columns. See Section 2.7.1,
“Encoding String Columns”.
Before MySQL 8.3.0, the VARLEN Support column identifies functions that support variable-length
encoded string columns.
Temporal Functions
Window Functions
• WINDOW and OVER clauses in conjunction with PARTITION BY, ORDER BY, and WINDOW frame
specifications.
• AVG()
• COUNT()
• MIN()
• MAX()
• SUM()
As of MySQL 8.2.0-u1, the SELECT statement includes the QUALIFY clause, which appears between
the WINDOW clause and the ORDER BY clause:
SELECT
[ALL | DISTINCT | DISTINCTROW ]
[HIGH_PRIORITY]
[STRAIGHT_JOIN]
[SQL_SMALL_RESULT] [SQL_BIG_RESULT] [SQL_BUFFER_RESULT]
[SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]
select_expr [, select_expr] ...
[into_option]
[FROM table_references
[PARTITION partition_list]]
[WHERE where_condition]
[GROUP BY {col_name | expr | position}, ... [WITH ROLLUP]]
[HAVING where_condition]
[WINDOW window_name AS (window_spec)
[, window_name AS (window_spec)] ...]
[QUALIFY qualify_condition]
[ORDER BY {col_name | expr | position}
[ASC | DESC], ... [WITH ROLLUP]]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
[into_option]
[FOR {UPDATE | SHARE}
[OF tbl_name [, tbl_name] ...]
[NOWAIT | SKIP LOCKED]
| LOCK IN SHARE MODE]
[into_option]
into_option: {
INTO OUTFILE 'file_name'
[CHARACTER SET charset_name]
export_options
| INTO DUMPFILE 'file_name'
| INTO var_name [, var_name] ...
}
In addition to constraints similar to the HAVING clause, the QUALIFY clause can also include
predicates related to a window function.
Similar to the HAVING clause, the QUALIFY clause can refer to aliases mentioned in the SELECT list.
The QUALIFY clause requires the inclusion of at least one window function in the query. The window
function can be part of any one of the following:
The following query includes a QUALIFY clause that only returns results where country_profit is
greater than 1,500:
mysql> SELECT
year, country, product, profit,
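    -- The remainder of this example is truncated in the source. A plausible
    -- completion, assuming a sales table and a window alias country_profit:
    SUM(profit) OVER(PARTITION BY country) AS country_profit
    FROM sales
    QUALIFY country_profit > 1500
    ORDER BY country, year, product, profit;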
As of MySQL 8.3.0-u2, the SELECT statement includes the TABLESAMPLE clause, which only applies
to base tables:
SELECT
select_expr [, select_expr] ...
[into_option]
FROM table_references
[PARTITION partition_list]
TABLESAMPLE { SYSTEM | BERNOULLI } ( sample_percentage )
into_option: {
INTO OUTFILE 'file_name'
[CHARACTER SET charset_name]
export_options
| INTO DUMPFILE 'file_name'
| INTO var_name [, var_name] ...
}
The TABLESAMPLE clause retrieves a sample of the data in a table with the SYSTEM or BERNOULLI
sampling method.
SYSTEM sampling samples chunks, and each chunk has a random number. Bernoulli sampling
samples rows, and each row has a random number. If the random number is less than or equal to the
sample_percentage, the chunk or row is chosen for subsequent processing.
A TABLESAMPLE example that counts the number of rows from a join of a 10% sample of the
LINEITEM table with a 10% sample of the ORDERS table. It applies TABLESAMPLE with the SYSTEM
sampling method to both base tables.
mysql> SELECT COUNT(*)
FROM LINEITEM TABLESAMPLE SYSTEM (10), ORDERS TABLESAMPLE SYSTEM (10)
WHERE L_ORDERKEY=O_ORDERKEY;
String Column Encoding Reference
String column encoding is automatically applied when tables are loaded into HeatWave. Variable-length
encoding is the default.
To use dictionary encoding, you must define the encoding type explicitly for individual string columns.
See Section 2.7.1, “Encoding String Columns”.
Variable-length Encoding
• It is the default encoding type. No action is required to use variable-length encoding. It is applied
to string columns by default when tables are loaded with the exception of string columns defined
explicitly as dictionary-encoded columns.
• It minimizes the amount of data stored for string columns by efficiently storing variable length column
values.
• It is more efficient than dictionary encoding with respect to storage and processing of string columns
with a high number of distinct values relative to the cardinality of the table.
• It permits more operations involving string columns to be offloaded than dictionary encoding.
• It supports all character sets and collation types supported by the MySQL DB System. User defined
character sets are not supported.
Before MySQL 8.3.0, both columns must be VARLEN encoded and column-to-column filters must use
columns that are encoded with the same character set and collation.
Before MySQL 8.3.0, the character set and collation of a constant must match the character set and
collation of the column it is compared with.
• JOIN
• LIMIT
• ORDER BY
• There is no memory requirement on the MySQL DB System node, apart from a small memory
footprint for metadata.
• Table load and change propagation operations perform more slowly on VARLEN encoded TEXT type
columns than on VARLEN encoded VARCHAR columns.
• There are two main differences with respect to HeatWave result processing for variable-length
encoding compared to dictionary encoding:
• A dictionary decode operation is not required, which means that fewer CPU cycles are required.
• Because VARLEN encoded columns use a larger number of bytes than dictionary-encoded
columns, the network cost for sending results from HeatWave to the MySQL DB System is greater.
Dictionary Encoding
• Best suited to string columns with a low number of distinct values relative to the cardinality of the
table. Dictionary encoding reduces the space required for column values on the HeatWave nodes but
requires space on the MySQL DB System node for dictionaries.
• Supports only a subset of the operations supported by variable-length encoding such as LIKE
with prefix expressions, and comparison with the exact same column. Dictionary-encoded columns
cannot be compared in any way with other columns or constants, or with other dictionary-encoded
columns.
• Does not support operations that use string operators. Queries that use string operators on
dictionary-encoded string columns are not offloaded.
• The dictionaries required to decode dictionary-encoded string columns must fit in MySQL DB System
node memory. Dictionary size depends on the size of the column and the number of distinct values.
Load operations for tables with dictionary-encoded string columns that have a high number of distinct
values can fail if there is not enough available memory on the MySQL DB System node.
• The column limit for base relations, tables as loaded into HeatWave, is 1017. Before MySQL 8.0.29,
the limit was 900.
• The column limit for intermediate relations (intermediate tables used by HeatWave when processing
queries) is 1800.
• The actual column limit when running queries depends on factors such as MySQL limits, protocol
limits, the total number of columns, column types, and column widths. For example, for any
HeatWave physical operator, the maximum number of 65532-byte VARLEN encoded columns is
31 if the query only uses VARLEN encoded columns. On the other hand, HeatWave can produce
a maximum of 1800 VARLEN encoded columns that are less than 1024 bytes in size if the query
includes only VARLEN encoded columns. If the query includes only non-string columns such as
DOUBLE, INTEGER, DECIMAL, and so on, 1800 columns are supported.
2.15 Troubleshooting
• Problem: Queries are not offloaded.
• Solution A: Your query contains an unsupported predicate, function, operator, or has encountered
some other limitation. See Section 2.3.1, “Query Prerequisites”.
• Solution B: The query cost estimate is below the query cost threshold.
HeatWave is designed for fast execution of large analytic queries. Smaller, simpler queries, such
as those that use indexes for quick lookups, often execute faster on the MySQL DB System.
To avoid offloading inexpensive queries to HeatWave, the optimizer uses a query cost estimate
threshold value. Only queries that exceed the threshold value on the MySQL DB System are
considered for offload.
The query cost threshold unit value is the same unit value used by the MySQL optimizer for query
cost estimates. The threshold is 100000.00000. The ratio between a query cost estimate value and
the actual time required to execute a query depends on the type of query, the type of hardware,
and MySQL DB System configuration.
3. Query the Last_query_cost status variable. If the value is less than 100000.00000, the
query cannot be offloaded.
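For example, immediately after executing the query on the MySQL DB System:
mysql> SHOW SESSION STATUS LIKE 'Last_query_cost';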
• Solution C: The table you are querying is not loaded. You can check the load status of a table in
HeatWave by querying LOAD_STATUS data from HeatWave Performance Schema tables. For
example:
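A minimal sketch of such a query, assuming the rpd_tables and rpd_table_id Performance
Schema tables described in Chapter 6 and a table named orders:
mysql> SELECT r2.SCHEMA_NAME, r2.TABLE_NAME, r1.LOAD_STATUS
       FROM performance_schema.rpd_tables r1
       JOIN performance_schema.rpd_table_id r2 ON r1.ID = r2.ID
       WHERE r2.TABLE_NAME = 'orders';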
For information about load statuses, see Section 6.4.8, “The rpd_tables Table”.
• Solution D: The HeatWave Cluster has failed. To determine the status of the HeatWave Cluster,
run the following statement:
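The exact statement is not shown in the source; a sketch using the status variable named below:
mysql> SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped';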
See Chapter 5, System and Status Variables for rapid_plugin_bootstrapped status values.
If the HeatWave Cluster has failed, restart it in the HeatWave Console. The HeatWave recovery
mechanism should reload the data automatically; reload the data manually only if it does not.
• Problem: You cannot alter the table definition to exclude a column, define a string column encoding,
or define data placement keys.
Solution: Before MySQL 8.0.31, DDL operations are not permitted on tables defined with a
secondary engine. In those releases, column attributes must be defined before or at the same time
that you define a secondary engine for a table. If you need to perform a DDL operation on a table
that is defined with a secondary engine in those releases, remove the SECONDARY_ENGINE option first:
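For example, using the orders table from the earlier examples:
mysql> ALTER TABLE orders SECONDARY_ENGINE = NULL;
-- perform the DDL operation, then redefine the secondary engine and reload:
mysql> ALTER TABLE orders SECONDARY_ENGINE = RAPID;
mysql> ALTER TABLE orders SECONDARY_LOAD;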
• Problem: A query fails with an out-of-memory error.
Solution: HeatWave optimizes for network usage rather than memory. If you encounter out of
memory errors when running a query, try running the query with the MIN_MEM_CONSUMPTION
strategy by setting rapid_execution_strategy before executing the query:
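For example:
mysql> SET SESSION rapid_execution_strategy = MIN_MEM_CONSUMPTION;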
Also consider checking the size of your data by performing a cluster size estimate. If your data has
grown substantially, you may require additional HeatWave nodes.
• For HeatWave on OCI, see Generating a Node Count Estimate in the HeatWave on OCI Service
Guide.
• For HeatWave on AWS, see Estimating Cluster Size with HeatWave Autopilot in the HeatWave on
AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure Service
Guide.
Avoid or rewrite queries that produce a Cartesian product. For more information, see Section 2.9.7,
“Running Queries”.
• Problem: A table load operation fails with “ERROR HY000: Error while running parallel scan.”
Solution: A TEXT type column value larger than 65532 bytes is rejected during SECONDARY_LOAD
operations. Reduce the size of the TEXT type column value to less than 65532 bytes or exclude the
column before loading the table. See Section 2.2.2.1, “Excluding Table Columns”.
• Problem: Change propagation fails with the following error: “Blob/text value of n bytes was
encountered during change propagation but RAPID supports text values only up to 65532 bytes.”
Solution: TEXT type values larger than 65532 bytes are rejected during change propagation. Reduce
the size of TEXT type values to less than 65532 bytes. Should you encounter this error, check the
change propagation status for the affected table. If change propagation is disabled, reload the table.
See Section 2.2.7, “Change Propagation”.
• Problem: Auto Parallel Load produces a warning and switches to dryrun mode.
Solution: When Auto Parallel Load encounters an issue that produces a warning, it automatically
switches to dryrun mode to prevent further problems. In this case, the load statements generated
by the Auto Parallel Load utility can still be obtained using the SQL statement provided in the utility
output, but avoid those load statements or use them with caution, as they may be problematic.
• If a warning message indicates that the HeatWave Cluster or service is not active or online, this
means that the load cannot start because a HeatWave Cluster is not attached to the MySQL
DB System or is not active. In this case, provision and enable a HeatWave Cluster and run Auto
Parallel Load again.
• If a warning message indicates that MySQL host memory is insufficient to load all of the tables,
the estimated dictionary size for dictionary-encoded columns may be too large for MySQL host
memory. Try changing column encodings to VARLEN to free space in MySQL host memory.
• If a warning message indicates that HeatWave Cluster memory is insufficient to load all of the
tables, the estimated table size is too large for HeatWave Cluster memory. Try excluding certain
schemas or tables from the load operation or increase the size of the cluster.
• If a warning message indicates that a concurrent table load is in progress, this means that another
client session is currently loading tables into HeatWave. While the concurrent load operation is
in progress, the accuracy of Auto Parallel Load estimates cannot be guaranteed. Wait until the
concurrent load operation finishes before running Auto Parallel Load.
• Problem: During retrieval of the generated Auto Parallel Load or Advisor DDL statements, an error
message indicates that the heatwave_autopilot_report, heatwave_advisor_report, or
heatwave_load_report table does not exist. For example:
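The error resembles the following (a sketch; the table named in the message depends on which
report table is missing):
mysql> SELECT * FROM sys.heatwave_autopilot_report;
ERROR 1146 (42S02): Table 'sys.heatwave_autopilot_report' doesn't exist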
Solution: This error can occur when querying a report table from a different session. Query the
report table using the same session that issued the Auto Parallel Load or Advisor CALL statement.
This error also occurs if the session used to call Auto Parallel Load or Advisor has timed out or was
terminated. In this case, run Auto Parallel Load or Advisor again before querying the report table.
Metadata Queries
Note
For queries that monitor HeatWave node status, memory usage, data loading,
change propagation, and queries, see Section 6.2, “HeatWave MySQL
Monitoring”.
You can also view create options for an individual table using SHOW CREATE TABLE.
Note
You can also view columns defined as NOT SECONDARY for an individual table using SHOW CREATE
TABLE.
You can also view explicitly defined column encodings for an individual table using SHOW CREATE
TABLE.
You can also view data placement keys for an individual table using SHOW CREATE TABLE.
• To identify columns defined as data placement keys in tables that are loaded in HeatWave, query the
DATA_PLACEMENT_INDEX column of the performance_schema.rpd_columns table for columns
with a DATA_PLACEMENT_INDEX value greater than 0, which indicates that the column is defined as
a data placement key. For example:
mysql> SELECT TABLE_NAME, COLUMN_NAME, DATA_PLACEMENT_INDEX
FROM performance_schema.rpd_columns r1
JOIN performance_schema.rpd_column_id r2 ON r1.COLUMN_ID = r2.ID
WHERE r1.TABLE_ID = (SELECT ID FROM performance_schema.rpd_table_id
WHERE TABLE_NAME = 'orders') AND r2.TABLE_NAME = 'orders';
For information about data placement key index values, see Section 2.7.2, “Defining Data Placement
Keys”.
• To determine if data placement partitions were used by a JOIN or GROUP BY query, you can query
the QEP_TEXT column of the performance_schema.rpd_query_stats table to view prepart
data. (prepart is short for “pre-partitioning”.) The prepart data for a GROUP BY operation contains
a single value; for example: "prepart":#, where # represents the number of HeatWave nodes.
A value greater than 1 indicates that data placement partitions were used. The prepart data for
a JOIN operation has two values that indicate the number of HeatWave nodes; one for each JOIN
branch; for example: "prepart":[#,#]. A value greater than 1 for a JOIN branch indicates that
the JOIN branch used data placement partitions. (A value of "prepart":[1,1] indicates that data
placement partitions were not used by either JOIN branch.) prepart data is only generated if a
GROUP BY or JOIN operation is executed. To query QEP_TEXT prepart data for the last executed
query:
mysql> SELECT CONCAT( '"prepart":[', (JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][0]')),
"," ,(JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][1]')) , ']' )
FROM performance_schema.rpd_query_stats
WHERE query_id = (select max(query_id)
FROM performance_schema.rpd_query_stats);
+-----------------------------------------------------------------------------+
| concat( '"prepart":[', (JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][0]')), |
|"," ,(JSON_EXTRACT(QEP_TEXT->>"$**.prepart", '$[0][1]')) , ']' ) |
+-----------------------------------------------------------------------------+
| "prepart":[2,2] |
+-----------------------------------------------------------------------------+
Bulk Ingest Data to MySQL Server
[MEMORY = M]
[ALGORITHM = BULK]
This requires a pre-authenticated request (PAR). See Section 4.3.1, “Pre-Authenticated Requests”. It
also requires the LOAD_FROM_URL user privilege.
For COUNT 5 and file_prefix set to data.csv., the five files would be: data.csv.1,
data.csv.2, data.csv.3, data.csv.4, and data.csv.5.
• IN PRIMARY KEY ORDER: Use when the data is already sorted. The values should be in ascending
order within the file.
For a file series, the primary keys in each file must be disjoint and in ascending order from one file to
the next.
• PARALLEL: The number of concurrent threads to use. A typical value might be 16, 32 or 48. The
default value is 16.
• MEMORY: The amount of memory to use. A typical value might be 512M or 4G. The default value is
1G.
• ALGORITHM: Set to BULK for bulk load. The file format is CSV.
• VARCHAR and VARBINARY. The record must fit in the page. There is no large data support.
For the data types that HeatWave supports, see Section 2.10, “Supported Data Types”.
Syntax Examples
• An example that loads unsorted data from AWS S3 with 48 concurrent threads and 4G of memory.
mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;
• An example that loads eight files of sorted data from AWS S3. The file_prefix ends with a
period. The files are lineitem.tbl.1, lineitem.tbl.2, ... lineitem.tbl.8.
mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;
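A sketch of the load statement that might follow the grant for the second example, assuming the
clauses described above (COUNT, IN PRIMARY KEY ORDER, PARALLEL, MEMORY, ALGORITHM) and an
S3 source clause; consult the full LOAD DATA syntax for the exact source specification:
mysql> LOAD DATA FROM S3 's3://bucket/lineitem.tbl.'
       COUNT 8 IN PRIMARY KEY ORDER
       INTO TABLE lineitem
       FIELDS TERMINATED BY '|'
       PARALLEL = 32
       MEMORY = '1G'
       ALGORITHM = BULK;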
Data Type Limitations
• BINARY
• VARBINARY
• Decimal values with a precision greater than 18 in expression operators, with the exception of the
following:
• CAST()
• COALESCE()
• CASE
• IF()
• NULLIF()
• ABS()
• CEILING()
• FLOOR()
• ROUND()
• TRUNCATE()
• GREATEST()
• LEAST()
• ENUM type columns as part of a UNION, EXCEPT, EXCEPT ALL, INTERSECT, or INTERSECT ALL
SELECT list or as a JOIN key, except when used inside a supported expression.
• ENUM type columns as part of a non-top level UNION ALL SELECT list or as a JOIN key, except
when used inside a supported expression.
• Comparison with string or numeric constants, and other numeric, non-temporal expressions
(numeric columns, constants, and functions with a numeric result).
• Comparison operators (<, <=, <=>, =, >=, >, and BETWEEN) with numeric arguments.
• The IN() function in combination with numeric arguments (constants, functions, or columns) and
string constants.
• COUNT(), SUM(), and AVG() aggregation functions on ENUM columns. The functions operate on
the numeric index value, not the associated string value.
• CAST(enum_col) AS {[N]CHAR} is supported only in the SELECT list and when it is not nested
in another expression.
• Temporal types are supported only with strict SQL mode. See Strict SQL Mode.
Functions and Operator Limitations
• CONVERT_TZ() with named time zones before MySQL 8.4.0. Only datetime values are supported.
Rewrite queries that use named time zones with equivalent datetime values. For example:
mysql> SELECT CONVERT_TZ(O_ORDERDATE, 'UTC','EST') FROM tpch.orders;
Rewrite as:
mysql> SELECT CONVERT_TZ(O_ORDERDATE, '+00:00','-05:00') FROM tpch.orders;
For information about time zone offsets, see MySQL Server Time Zone Support.
• JSON_ARRAY_APPEND()
• JSON_ARRAY_INSERT()
• JSON_INSERT()
• JSON_MERGE()
• JSON_MERGE_PATCH()
• JSON_MERGE_PRESERVE()
• JSON_REMOVE()
• JSON_REPLACE()
• JSON_SCHEMA_VALID()
• JSON_SCHEMA_VALIDATION_REPORT()
• JSON_SET()
• JSON_TABLE()
• Loadable Functions.
• GROUP_CONCAT() with:
• WITH ROLLUP
• A CASE control flow operator or IF() function that contains columns not within an aggregation
function and not part of the GROUP BY key.
• String functions and operators on columns that are not VARLEN encoded. See Section 2.7.1,
“Encoding String Columns”.
• The AVG() aggregate function with enumeration and temporal data types.
• STD()
• STDDEV()
• STDDEV_POP()
• STDDEV_SAMP()
• SUM()
• VARIANCE()
• VAR_POP()
• VAR_SAMP()
With the exception of SUM(), the same aggregate functions are not supported within a semi-join
predicate, due to the nondeterministic nature of floating-point results and potential mismatches. For
example, the following use is not supported:
mysql> SELECT * FROM A WHERE a1 IN (SELECT VAR_POP(b1) FROM B);
The same aggregate functions with numeric data types other than those supported by HeatWave.
See Section 2.10, “Supported Data Types”.
To use the mode argument, the mode value must be defined explicitly.
Index Hint and Optimizer Hint Limitations
MySQL attempts to enforce the FIRSTMATCH strategy and ignores all other semijoin strategies
specified explicitly as subquery optimizer hints. However, MySQL may still select the DUPSWEEDOUT
semijoin strategy during JOIN order optimization, even if an equivalent plan could be offered using
the FIRSTMATCH strategy. A plan that uses the DUPSWEEDOUT semijoin strategy would produce
incorrect results if executed on HeatWave.
For general information about subquery optimizer hints, see Subquery Optimizer Hints.
• EXISTS semijoins and antijoins are supported in the following variants only:
• A query with a supported semijoin or antijoin condition may be rejected for offload due to how
MySQL optimizes and transforms the query.
• Semijoin and antijoin queries use the best plan found after evaluating the first 10000 possible plans,
or after investigating 10000 possible plans since the last valid plan.
The plan evaluation count is reset to zero after each derived table, after an outer query, and after
each subquery. The plan evaluation limit is required because the DUPSWEEDOUT join strategy, which
is not supported by HeatWave, may be used as a fallback strategy by MySQL during join order
optimization (for related information, see FIRSTMATCH). The plan evaluation limit prevents too
much time being spent evaluating plans in cases where MySQL generates numerous plans that use
the DUPSWEEDOUT semijoin strategy.
• Outer join queries without an equality condition defined for the two tables.
• Some outer join queries with IN ... EXISTS sub-queries (semi-joins) in the ON clause.
Partition Selection Limitations
As of MySQL 8.4.0, HeatWave supports InnoDB partitions with the following limitations:
• HeatWave cannot load partitions from a table that contains dictionary-encoded columns.
• HeatWave maintains partition information in memory. This information is not available during a
restart, and the automatic reload on restart has to reload the entire table.
• HeatWave can unload up to 1,000 partitions in a single statement. Use additional statements to
unload more than 1,000 partitions.
• time_zone and timestamp variable settings are not passed to HeatWave when queries are
offloaded.
Other Limitations
• HeatWave on AWS does support LOAD DATA with ALGORITHM=BULK, but does not support the
INFILE and URL clauses.
LOAD DATA with ALGORITHM=BULK has the following limitations:
• It locks the target table exclusively and does not allow other operations on the table.
• It does not support automatic rounding or truncation of the input data. It will fail if the input data
requires rounding or truncation in order to be loaded.
• It is atomic but not transactional. It commits any transaction that is already running. On failure the
LOAD DATA statement is completely rolled back.
• It cannot execute when the target table is explicitly locked by a LOCK TABLES statement.
• The target table for LOAD DATA with ALGORITHM=BULK has the following limitations:
• It must be empty. The state of the table should be as though it has been freshly created. If the
table has instantly added or dropped columns, run TRUNCATE TABLE before running LOAD DATA
with ALGORITHM=BULK.
• It must have the default row format, ROW_FORMAT=DYNAMIC. Use ALTER TABLE to make any
changes to the table after LOAD DATA with ALGORITHM=BULK.
• It must contain a primary key, but the primary key must not have a prefix index.
• It must not use a secondary engine. Set the secondary engine after LOAD DATA with
ALGORITHM=BULK completes. See: Section 2.2.2.2, “Defining the Secondary Engine”.
For a list of supported SQL modes, see Section 2.11, “Supported SQL Modes”.
• UNION ALL queries with an ORDER BY or LIMIT clause, between dictionary-encoded columns, or
between ENUM columns.
• EXCEPT, EXCEPT ALL, INTERSECT, INTERSECT ALL, and UNION queries with or without an
ORDER BY or LIMIT clause, between dictionary-encoded columns, or between ENUM columns.
• EXCEPT, EXCEPT ALL, INTERSECT, INTERSECT ALL, UNION and UNION ALL subqueries with
or without an ORDER BY or LIMIT clause, between dictionary-encoded columns, between ENUM
columns, or specified in an IN or EXISTS clause.
• Comparison predicates, GROUP BY, JOIN, and so on, if the key column is DOUBLE PRECISION.
• Queries with an impossible WHERE condition (queries known to have an empty result set).
• Materialized views.
Only nonmaterialized views are supported. See Section 2.3.9, “Using Views”.
If all elements of the query are supported, the entire query is offloaded; otherwise, the query is
executed on the MySQL DB System by default.
HeatWave supports CREATE TABLE ... SELECT and INSERT ... SELECT statements where
only the SELECT portion of the operation is offloaded to HeatWave. See Section 2.3, “Running
Queries”.
• Row widths in intermediate and final query results that exceed 4MB in size.
A query that exceeds this row width limit is not offloaded to HeatWave for processing.
The query uses a filter for table tt1 in the table scan of table t1 (x < 7) followed by a consecutive
filter on table tt1 (tt1.x > 5) in the WHERE clause.
• Operations involving ALTER TABLE such as loading, unloading, or recovering data when MySQL
Server is running in SUPER_READ_ONLY mode.
MySQL Server is placed in SUPER_READ_ONLY mode when MySQL Server disk space drops
below a set amount for a specific duration. For information about thresholds that control this
behavior and how to disable SUPER_READ_ONLY mode, see Resolving SUPER_READ_ONLY and
OFFLINE_MODE Issue in the HeatWave on OCI Service Guide.
Chapter 3 HeatWave AutoML
Table of Contents
3.1 HeatWave AutoML Features
3.1.1 HeatWave AutoML Supervised Learning
3.1.2 HeatWave AutoML Ease of Use
3.1.3 HeatWave AutoML Workflow
3.1.4 Oracle AutoML
3.2 Before You Begin
3.3 Getting Started
3.4 Preparing Data
3.4.1 Labeled Data
3.4.2 Unlabeled Data
3.4.3 General Data Requirements
3.4.4 Example Data
3.4.5 Example Text Data
3.5 Training a Model
3.5.1 Advanced ML_TRAIN Options
3.6 Training Explainers
3.7 Predictions
3.7.1 Row Predictions
3.7.2 Table Predictions
3.8 Explanations
3.8.1 Row Explanations
3.8.2 Table Explanations
3.9 Forecasting
3.9.1 Training a Forecasting Model
3.9.2 Using a Forecasting Model
3.10 Anomaly Detection
3.10.1 Training an Anomaly Detection Model
3.10.2 Using an Anomaly Detection Model
3.11 Recommendations
3.11.1 Training a Recommendation Model
3.11.2 Using a Recommendation Model
3.12 HeatWave AutoML and Lakehouse
3.13 Managing Models
3.13.1 The Model Catalog
3.13.2 ONNX Model Import
3.13.3 Loading Models
3.13.4 Unloading Models
3.13.5 Viewing Models
3.13.6 Scoring Models
3.13.7 Model Explanations
3.13.8 Model Handles
3.13.9 Deleting Models
3.13.10 Sharing Models
3.14 Progress tracking
3.15 HeatWave AutoML Routines
3.15.1 ML_TRAIN
3.15.2 ML_EXPLAIN
3.15.3 ML_MODEL_IMPORT
3.15.4 ML_PREDICT_ROW
3.15.5 ML_PREDICT_TABLE
3.15.6 ML_EXPLAIN_ROW
3.15.7 ML_EXPLAIN_TABLE
3.1 HeatWave AutoML Features
Once a model is created, it can be used on unseen data, where the label is unknown, to make
predictions. In a business setting, predictive models have a variety of possible applications such as
predicting customer churn, approving or rejecting credit applications, predicting customer wait times,
and so on.
HeatWave AutoML supports both classification and regression models. A classification model predicts
discrete values, such as whether an email is spam or not, whether a loan application should be
approved or rejected, or what product a customer might be interested in purchasing. A regression
model predicts continuous values, such as customer wait times, expected sales, or home prices, for
example. The model type is selected during training, with classification being the default type.
The ML_TRAIN routine leverages Oracle AutoML technology to automate training of machine learning
models. For information about Oracle AutoML, see Section 3.1.4, “Oracle AutoML”.
You can use a model created by ML_TRAIN with other HeatWave AutoML routines to generate
predictions and explanations; for example, this call to the ML_PREDICT_TABLE routine generates
predictions for a table of input data:
CALL sys.ML_PREDICT_TABLE('heatwaveml_bench.census_test', @census_model,
'heatwaveml_bench.census_predictions');
All HeatWave AutoML operations are initiated by running CALL or SELECT statements, which can be
easily integrated into your applications. HeatWave AutoML routines reside in the MySQL sys schema
and can be run from any MySQL client or application that is connected to a MySQL DB System with a
HeatWave Cluster. HeatWave AutoML routines include:
In addition, with HeatWave AutoML, there is no need to move or reformat your data. Data and machine
learning models never leave the HeatWave Service, which saves you time and effort while keeping
your data and models secure.
3.1.3 HeatWave AutoML Workflow
1. When the ML_TRAIN routine is called, HeatWave AutoML calls the MySQL DB System where the
training data resides. The training data is sent from the MySQL DB System and distributed across
the HeatWave Cluster, which performs machine learning computation in parallel. See Section 3.5,
“Training a Model”.
2. HeatWave AutoML analyzes the training data, trains an optimized machine learning model, and
stores the model in a model catalog on the MySQL DB System. See Section 3.13.1, “The Model
Catalog”.
3. HeatWave AutoML ML_PREDICT_* and ML_EXPLAIN_* routines use the trained model to
generate predictions and explanations on test or unseen data. See Section 3.7, “Predictions”, and
Section 3.8, “Explanations”.
4. Predictions and explanations are returned to the MySQL DB System and to the user or application
that issued the query.
Optionally, the ML_SCORE routine can be used to compute the quality of a model to ensure that
predictions and explanations are reliable. See Section 3.13.6, “Scoring Models”.
Note
3. Selecting only predictive features to speed up the pipeline and reduce over-fitting.
4. Ensuring the model performs well on unseen data (also called generalization).
Oracle AutoML automates this workflow, providing you with an optimal model given a time budget. The
Oracle AutoML pipeline used by the HeatWave AutoML ML_TRAIN routine has these stages:
• Data preprocessing
• Algorithm selection
• Hyperparameter optimization
Oracle AutoML also produces high quality models very efficiently, which is achieved through a scalable
design and intelligent choices that reduce trials at each stage in the pipeline.
• Scalable design: The Oracle AutoML pipeline is able to exploit both HeatWave internode and
intranode parallelism, which improves scalability and reduces runtime.
• Intelligent choices reduce trials in each stage: Algorithms and parameters are chosen based on
dataset characteristics, which ensures that the model is accurate and efficiently selected. This is
achieved using meta-learning throughout the pipeline.
For additional information about Oracle AutoML, refer to Yakovlev, Anatoly, et al. "Oracle AutoML: A
Fast and Predictive AutoML Pipeline." Proceedings of the VLDB Endowment 13.12 (2020): 3166-3180.
3.2 Before You Begin
• You have an operational MySQL DB System and are able to connect to it using a MySQL client. If
not, complete the steps described in Getting Started in the HeatWave on OCI Service Guide.
• Your MySQL DB System has an operational HeatWave Cluster. If not, complete the steps described
in Adding a HeatWave Cluster in the HeatWave on OCI Service Guide.
• The MySQL account that you will use to train a model does not have a period character (".") in
its name; for example, a user named 'joesmith'@'%' is permitted to train a model, but a user
named 'joe.smith'@'%' is not. For more information about this requirement, see Section 3.18,
“HeatWave AutoML Limitations”.
• The MySQL account that will use HeatWave AutoML has been granted the following privileges:
• SELECT and ALTER privileges on the schema that contains the machine learning datasets; for
example:
mysql> GRANT SELECT, ALTER ON schema_name.* TO 'user_name'@'%';
• SELECT and EXECUTE on the MySQL sys schema where HeatWave AutoML routines reside; for
example:
mysql> GRANT SELECT, EXECUTE ON sys.* TO 'user_name'@'%';
3.3 Getting Started
Proceed through the following steps to prepare data, train a model, make predictions, and generate
explanations:
1. Prepare and load training and test data. See Section 3.4, “Preparing Data”.
2. Train a machine learning model. See Section 3.5, “Training a Model”.
3. Make predictions with test data using a trained model. See Section 3.7, “Predictions”.
4. Run explanations on test data using a trained model to understand how predictions are made. See
Section 3.8, “Explanations”.
5. Score your machine learning model to assess its reliability. See Section 3.13.6, “Scoring Models”.
6. View a model explanation to understand how the model makes predictions. See Section 3.13.7,
“Model Explanations”.
Alternatively, you can jump ahead to the Iris Data Set Machine Learning Quickstart, which provides a
quick run-through of HeatWave AutoML capabilities using a simple, well-known machine learning data
set. See Section 7.3, “Iris Data Set Machine Learning Quickstart”.
3.4.1 Labeled Data
Feature columns contain the input variables used to train the machine learning model. The target
column contains ground truth values or, in other words, the correct answers. A labeled dataset with
ground truth values is required to train a machine learning model. In the context of this guide, the
labeled dataset used to train a machine learning model is referred to as the training dataset.
A labeled dataset with ground truth values is also used to score a model (compute its accuracy and
reliability). This dataset should have the same columns as the training dataset but with a different set of
data. In the context of this guide, the labeled dataset used to score a model is referred to as the
validation dataset.
3.4.2 Unlabeled Data
Unlabeled data is required to generate predictions and explanations. It must have exactly the same
feature columns as the training dataset but no target column. In the context of this guide, the unlabeled
data used for predictions and explanations is referred to as the test dataset. Test data starts as labeled
data but the label is removed for the purpose of trialing the machine learning model.
The “unseen data” that you will eventually use with your model to make predictions is also unlabeled
data. Like the test dataset, unseen data must have exactly the same feature columns as the training
dataset but no target column.
For examples of training, validation, and test dataset tables and how they are structured, see
Section 3.4.4, “Example Data”, and Section 7.3, “Iris Data Set Machine Learning Quickstart”.
3.4.3 General Data Requirements
• Each dataset must reside in a single table on the MySQL DB System. HeatWave AutoML routines
such as ML_TRAIN, ML_PREDICT_TABLE, and ML_EXPLAIN_TABLE operate on a single table.
For information about loading data into a MySQL DB System, see Importing and Exporting
Databases in the HeatWave on OCI Service Guide.
• Tables used with HeatWave AutoML must not exceed 10 GB, 100 million rows, or 1017 columns.
Before MySQL 8.0.29, the column limit was 900.
• Table columns must use supported data types. For supported data types and recommendations for
how to handle unsupported types, see Section 3.16, “Supported Data Types”.
• NaN (Not a Number) values are not recognized by MySQL and should be replaced by NULL.
• The target column in a training dataset for a classification model must have at least two distinct
values, and each distinct value should appear in at least five rows. For a regression model, only a
numeric target column is permitted.
Note
The ML_TRAIN routine ignores columns that are missing more than 20% of their
values and columns with the same value in each row. Missing values in numerical
columns are replaced with the average value of the column, standardized to
a mean of 0 and with a standard deviation of 1. Missing values in categorical
columns are replaced with the most frequent value, and either one-hot or ordinal
encoding is used to convert categorical values to numeric values. The input
data as it exists in the MySQL database is not modified by ML_TRAIN.
Example Data
The machine learning examples in this guide use census data from the UCI Machine Learning
Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine,
CA: University of California, School of Information.
Note
To replicate the examples in this guide, perform the following steps to create the required schema and
tables. Python 3 and MySQL Shell are required.
1. Create the following schema and tables on the MySQL DB System by executing the following
statements:
mysql> CREATE SCHEMA heatwaveml_bench;
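The table definitions themselves are not shown here. As a sketch, a census_train table matching the
row examples in this guide could be defined as follows (the column types are assumptions, and
census_test would use the same definition):
mysql> CREATE TABLE heatwaveml_bench.census_train (
         age INT, workclass VARCHAR(255), fnlwgt INT, education VARCHAR(255),
         `education-num` INT, `marital-status` VARCHAR(255), occupation VARCHAR(255),
         relationship VARCHAR(255), race VARCHAR(255), sex VARCHAR(255),
         `capital-gain` INT, `capital-loss` INT, `hours-per-week` INT,
         `native-country` VARCHAR(255), revenue VARCHAR(255));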
2. Navigate to the HeatWave AutoML Code for Performance Benchmarks GitHub repository at https://
github.com/oracle-samples/heatwave-ml and download or clone the repository, which includes the
census source data and preprocessing script.
3. Run the preprocessing script included in the repository to preprocess the census source data and
generate the census_train.csv and census_test.csv files.
4. Start MySQL Shell with the --mysql option to open a ClassicSession, which is required when
using the Parallel Table Import Utility.
$> mysqlsh --mysql Username@IPAddressOfMySQLDBSystemEndpoint
5. Load the data from the .csv files into the MySQL DB System using the following commands:
MySQL>JS> util.importTable("census_train.csv",{table: "census_train",
dialect: "csv-unix", skipRows:1})
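Assuming the test file has the same layout, the test data can be imported the same way:
MySQL>JS> util.importTable("census_test.csv",{table: "census_test",
          dialect: "csv-unix", skipRows:1})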
Training a Model
ML_TRAIN supports training of classification, regression, and forecasting models. Use a classification
model to predict discrete values. Use a regression model to predict continuous values. Use a
forecasting model to create timeseries forecasts for temporal data.
Training a model can take from a few minutes to a few hours, depending on the number
of rows and columns in the dataset, the specified ML_TRAIN parameters, and the size of the HeatWave
Cluster. HeatWave AutoML supports tables up to 10 GB in size with a maximum of 100 million rows
and 1017 columns. Before MySQL 8.0.29, the column limit was 900.
ML_TRAIN stores machine learning models in the MODEL_CATALOG table. See Section 3.13.1, “The
Model Catalog”.
The training dataset used with ML_TRAIN must reside in a table on the MySQL DB System. For an
example training dataset, see Section 3.4.4, “Example Data”.
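An ML_TRAIN call for the census training dataset takes the following form; the JSON options shown
here specify the task type explicitly:
mysql> CALL sys.ML_TRAIN('heatwaveml_bench.census_train', 'revenue',
              JSON_OBJECT('task', 'classification'), @census_model);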
Where:
• heatwaveml_bench.census_train is the fully qualified name of the table that contains the
training dataset (schema_name.table_name).
• revenue is the name of the target column, which contains ground truth values.
Specify NULL instead of JSON options to use the default classification task type.
When using the regression task type, only a numeric target column is permitted.
For the anomaly_detection task type, see Section 3.10, “Anomaly Detection”
• @census_model is the name of the user-defined session variable that stores the model handle for
the duration of the connection. User variables are written as @var_name. Some of the examples in
this guide use @census_model as the variable name. Any valid name for a user-defined variable is
permitted, for example @my_model.
After ML_TRAIN trains a model, the model is stored in the model catalog. To retrieve the generated
model handle, query the specified session variable; for example:
mysql> SELECT @census_model;
+--------------------------------------------------+
| @census_model |
+--------------------------------------------------+
| heatwaveml_bench.census_train_user1_1636729526 |
+--------------------------------------------------+
Tip
While using the connection that executed ML_TRAIN, you can specify the
session variable, for example @census_model, in place of the model handle in
other HeatWave AutoML routines. The session variable data is lost when the
current session is terminated. If you need to look up a model handle, you can do
so by querying the model catalog table. See Section 3.13.8, “Model Handles”.
The quality and reliability of a trained model can be assessed using the ML_SCORE routine. For more
information, see Section 3.13.6, “Scoring Models”. As of MySQL 8.0.30, ML_TRAIN displays the
following message if a trained model has a low score: Model Has a low training score,
expect low quality model explanations.
Advanced ML_TRAIN Options
• The model_list option permits specifying the type of model to be trained. If more than one
type of model is specified, the best model type is selected from the list. For a list of supported
model types, see Section 3.15.11, “Model Types”. This option cannot be used together with the
exclude_model_list option.
• The exclude_model_list option specifies types of models that should not be trained. Specified
model types are excluded from consideration. For a list of model types you can specify, see
Section 3.15.11, “Model Types”. This option cannot be used together with the model_list option.
• The optimization_metric option specifies a scoring metric to optimize for. See: Section 3.15.13,
“Optimization and Scoring Metrics”.
• The exclude_column_list option specifies feature columns to exclude from consideration when
training a model.
The following example excludes the 'age' column from consideration when training a model for the
census dataset.
mysql> CALL sys.ml_train('heatwaveml_bench.census_train', 'revenue',
JSON_OBJECT('task','classification', 'exclude_column_list', JSON_ARRAY('age')),
@census_model);
Training Explainers
Explanations help you understand which features have the most influence on a prediction. Feature
importance is presented as a value ranging from -1 to 1. A positive value indicates that a feature
contributed toward the prediction. A negative value indicates that the feature contributed toward a
different prediction; for example, if a feature in a loan approval model with two possible predictions
('approve' and 'reject') has a negative value for an 'approve' prediction, that feature would have a
positive value for a 'reject' prediction. A value of 0 or near 0 indicates that the feature value has no
impact on the prediction to which it applies.
Prediction explainers are used when you run the ML_EXPLAIN_ROW and ML_EXPLAIN_TABLE
routines to generate explanations for specific predictions. You must train a prediction explainer for
the model before you can use those routines. The ML_EXPLAIN routine can train these prediction
explainers:
• The Permutation Importance prediction explainer, specified as permutation_importance, is
the default prediction explainer.
• The SHAP prediction explainer, specified as shap, uses feature importance values to explain the
prediction for a single row or table.
Model explainers are used when you run the ML_EXPLAIN routine to explain what the model learned
from the training dataset. The model explainer provides a list of feature importances to show what
features the model considered important based on the entire training dataset. The ML_EXPLAIN
routine can train these model explainers:
• The SHAP model explainer, specified as shap, produces global feature importance values based on
Shapley values.
• The Fast SHAP model explainer, specified as fast_shap, is a subsampling version of the SHAP
model explainer which usually has a faster runtime.
• The Partial Dependence model explainer, specified as partial_dependence, shows how
changing the values of one or more columns changes the model's predictions. It requires extra
options, as shown in the second example below.
The model explanation is stored in the model catalog along with the machine learning model (see
Section 3.13.1, “The Model Catalog”). If you run ML_EXPLAIN again for the same model handle and
model explainer, the field is overwritten with the new result.
Before you run ML_EXPLAIN, you must load the model, for example:
mysql> CALL sys.ML_MODEL_LOAD('ml_data.iris_train_user1_1636729526', NULL);
The following example runs ML_EXPLAIN to train the SHAP model explainer and the Permutation
Importance prediction explainer for the model:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class', 'ml_data.iris_train_user1_1636729526',
JSON_OBJECT('model_explainer', 'shap', 'prediction_explainer', 'permutation_importance'));
Where:
• ml_data.iris_train is the fully qualified name of the table that contains the training dataset
(schema_name.table_name).
• class is the name of the target column, which contains ground truth values.
• The JSON options are a list of key-value pairs naming the model explainer and prediction explainer
that are to be trained for the model. In this case, model_explainer specifies shap for the SHAP
model explainer, and prediction_explainer specifies permutation_importance for the
Permutation Importance prediction explainer.
This example runs ML_EXPLAIN to train the Partial Dependence model explainer (which requires extra
options) and the SHAP prediction explainer for the model:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class', @iris_model,
JSON_OBJECT('columns_to_explain', JSON_ARRAY('sepal width'),
'target_value', 'Iris-setosa', 'model_explainer',
'partial_dependence', 'prediction_explainer', 'shap'));
Where:
• columns_to_explain identifies the sepal width column; the explainer explains how changing
the value in this column affects the model's predictions. You can identify more than one column in
the JSON array.
• target_value is a valid value that the target column containing ground truth values (in this case,
class) can take.
For the full ML_EXPLAIN option descriptions, see Section 3.15.2, “ML_EXPLAIN”.
3.7 Predictions
Predictions are generated by running ML_PREDICT_ROW or ML_PREDICT_TABLE on unlabeled
data; that is, data that has the same feature columns as the data used to train the model but no target
column.
Before running ML_PREDICT_ROW, ensure that the model you want to use is loaded; for example:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
For more information about loading models, see Section 3.13.3, “Loading Models”.
The following example runs ML_PREDICT_ROW on a single row of unlabeled data, which is assigned to
a @row_input session variable:
mysql> SET @row_input = JSON_OBJECT(
"age", 25,
"workclass", "Private",
"fnlwgt", 226802,
"education", "11th",
"education-num", 7,
"marital-status", "Never-married",
"occupation", "Machine-op-inspct",
"relationship", "Own-child",
"race", "Black",
"sex", "Male",
"capital-gain", 0,
"capital-loss", 0,
"hours-per-week", 40,
"native-country", "United-States");
where:
• @row_input is a session variable containing a row of unlabeled data. The data is specified in JSON
key-value format. The column names must match the feature column names in the training dataset.
ML_PREDICT_ROW returns a JSON object containing a Prediction key with the predicted value and
the feature values used to make the prediction.
You can also run ML_PREDICT_ROW on multiple rows of data selected from a table. For an example,
refer to the syntax examples in Section 3.15.4, “ML_PREDICT_ROW”.
Table Predictions
Before running ML_PREDICT_TABLE, ensure that the model you want to use is loaded; for example:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
For more information about loading models, see Section 3.13.3, “Loading Models”.
The following example creates a table with 10 rows of unlabeled test data and generates predictions for
that table:
mysql> CREATE TABLE heatwaveml_bench.census_test_subset AS SELECT *
FROM heatwaveml_bench.census_test
LIMIT 10;
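A call such as the following then generates the predictions; the output table name here is illustrative,
and a table with that name must not already exist:
mysql> CALL sys.ML_PREDICT_TABLE('heatwaveml_bench.census_test_subset', @census_model,
              'heatwaveml_bench.census_predictions', NULL);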
ML_PREDICT_TABLE populates the output table with predictions and the features used to make each
prediction.
3.8 Explanations
Explanations are generated by running ML_EXPLAIN_ROW or ML_EXPLAIN_TABLE on unlabeled
data; that is, data that has the same feature columns as the data used to train the model but no target
column.
Explanations help you understand which features have the most influence on a prediction. Feature
importance is presented as a value ranging from -1 to 1; for a description of how to interpret these
values, see Section 3.6, “Training Explainers”.
As of MySQL 8.0.31, after the ML_TRAIN routine, use the ML_EXPLAIN routine to train prediction
explainers and model explainers for HeatWave AutoML. You must train prediction explainers in order
to use ML_EXPLAIN_ROW and ML_EXPLAIN_TABLE. In earlier releases, the ML_TRAIN routine trains
the default Permutation Importance model and prediction explainers. See Section 3.6, “Training
Explainers”.
Row Explanations
Before running ML_EXPLAIN_ROW, ensure that the model you want to use is loaded; for example:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
For more information about loading models, see Section 3.13.3, “Loading Models”.
The following example generates explanations for a single row of unlabeled data, which is assigned to
a @row_input session variable:
mysql> SET @row_input = JSON_OBJECT(
"age", 25,
"workclass", "Private",
"fnlwgt", 226802,
"education", "11th",
"education-num", 7,
"marital-status", "Never-married",
"occupation", "Machine-op-inspct",
"relationship", "Own-child",
"race", "Black",
"sex", "Male",
"capital-gain", 0,
"capital-loss", 0,
"hours-per-week", 40,
"native-country", "United-States");
where:
• @row_input is a session variable containing a row of unlabeled data. The data is specified in JSON
key-value format. The column names must match the feature column names in the training dataset.
• prediction_explainer provides the name of the prediction explainer that you have trained for
this model, either the Permutation Importance prediction explainer or the SHAP prediction explainer.
You train this using the ML_EXPLAIN routine (see Section 3.6, “Training Explainers”).
ML_EXPLAIN_ROW output includes a prediction, the features used to make the prediction,
and a weighted numerical value that indicates feature importance, in the following format:
"feature_attribution": value. As of MySQL 8.0.30, output includes a Notes field that
identifies features with the greatest impact on predictions and reports a warning if the model is low
quality.
You can also run ML_EXPLAIN_ROW on multiple rows of data selected from a table. For an example,
refer to the syntax examples in Section 3.15.6, “ML_EXPLAIN_ROW”.
Table Explanations
Before running ML_EXPLAIN_TABLE, ensure that the model you want to use is loaded; for example:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
For more information about loading models, see Section 3.13.3, “Loading Models”.
The following example creates a table with 10 rows of unlabeled test data and generates explanations
for that table:
mysql> CREATE TABLE heatwaveml_bench.census_test_subset AS SELECT *
FROM heatwaveml_bench.census_test
LIMIT 10;
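A call such as the following then generates the explanations; the output table name here is illustrative:
mysql> CALL sys.ML_EXPLAIN_TABLE('heatwaveml_bench.census_test_subset', @census_model,
              'heatwaveml_bench.census_explanations',
              JSON_OBJECT('prediction_explainer', 'permutation_importance'));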
where:
• prediction_explainer provides the name of the prediction explainer that you have trained for
this model, either the Permutation Importance prediction explainer or the SHAP prediction explainer.
You train this using the ML_EXPLAIN routine (see Section 3.6, “Training Explainers”).
The ML_EXPLAIN_TABLE output table includes the features used to make the explanations, the
explanations, and feature_attribution columns that provide a weighted numerical value that
indicates feature importance.
As of MySQL 8.0.30, ML_EXPLAIN_TABLE output also includes a Notes field that identifies features
with the greatest impact on predictions. ML_EXPLAIN_TABLE reports a warning if the model is low
quality.
3.9 Forecasting
To generate a forecast, run ML_TRAIN and specify a forecasting task as a JSON object literal.
ML_PREDICT_TABLE can then predict the values for the selected column. Use ML_SCORE to score the
model quality.
Create a timeseries forecast for a single column with a numeric data type. In forecasting terms, this is a
univariate endogenous variable.
MySQL 8.0.32 adds support for multivariate endogenous forecasting models, and exogenous
forecasting models.
As of MySQL 8.1.0, ML_TRAIN does not require target_column_name for forecasting, and it can be
set to NULL.
• datetime_index: 'column' The column name for a datetime column that acts as an index for
the forecast variable. The column can be one of the supported datetime column types, DATETIME,
TIMESTAMP, DATE, TIME, and YEAR, or an auto-incrementing index.
See Section 3.5, “Training a Model”, and for full details of all the options, see ML_TRAIN.
Syntax Examples
• An ML_TRAIN example that specifies the forecasting task type and the additional required
parameters datetime_index and endogenous_variables:
mysql> CALL sys.ML_TRAIN('ml_data.opsd_germany_daily_train', 'consumption',
JSON_OBJECT('task', 'forecasting', 'datetime_index', 'ddate',
'endogenous_variables', JSON_ARRAY('consumption')),
@forecast_model);
Using a Forecasting Model
For instructions to use the ML_PREDICT_TABLE and ML_SCORE routines, see Section 3.7,
“Predictions”, and Section 3.13.6, “Scoring Models”. For the complete list of option descriptions, see
ML_PREDICT_TABLE and ML_SCORE.
As of MySQL 8.1.0, ML_SCORE does not require target_column_name for forecasting, and it can be
set to NULL.
Syntax Examples
• A forecasting example with univariate endogenous_variables. Before MySQL 8.0.32,
the ML_PREDICT_TABLE routine does not include options, and the results do not include the
ml_results column:
mysql> CALL sys.ML_TRAIN('mlcorpus.opsd_germany_daily_train', 'consumption',
JSON_OBJECT('task', 'forecasting',
'datetime_index', 'ddate',
'endogenous_variables', JSON_ARRAY('consumption')),
@forecast_model);
Query OK, 0 rows affected (11.51 sec)
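The prediction and scoring calls for this model would then resemble the following sketch; the test
table name, output table name, and metric are illustrative:
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.opsd_germany_daily_test', @forecast_model,
              'mlcorpus.forecast_predictions', NULL);
mysql> CALL sys.ML_SCORE('mlcorpus.opsd_germany_daily_test', 'consumption',
              @forecast_model, 'neg_mean_squared_error', @score, NULL);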
3.10 Anomaly Detection
For anomaly detection, HeatWave AutoML uses Generalised kth Nearest Neighbours, GkNN, which
is a model developed at Oracle. It is a single ensemble algorithm that outperforms state-of-the-art
models on public benchmarks. It can identify common anomaly types, such as local, global, and
clustered anomalies, and can achieve an AUC score that is similar to, or better than, other algorithms
when identifying the following:
• Local anomalies.
• Global anomalies.
• Clustered anomalies.
Optimal k hyperparameter values would be extremely difficult to set without labels and knowledge of
the use-case.
Other algorithms would require training and comparing scores from at least three algorithms to address
global and local anomalies, ignoring clustered anomalies: LOF for local, KNN for global, and another
generic method to establish a 2/3 voting mechanism.
Training an Anomaly Detection Model
Anomaly detection introduces an optional contamination factor which represents an estimate of the
percentage of outliers in the training table.
Contamination factor := estimated number of rows with anomalies / total number of rows in the training
table
Run the ML_TRAIN routine to create an anomaly detection model, and use the following JSON
options:
• model_list: not supported because GkNN is currently the only supported algorithm.
• exclude_model_list: not supported because GkNN is currently the only supported algorithm.
• optimization_metric: not supported because the GkNN algorithm does not require labeled data.
See Section 3.5, “Training a Model”, and for full details of all the options, see ML_TRAIN.
Syntax Examples
• An ML_TRAIN example that specifies the anomaly_detection task type:
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train',
NULL, JSON_OBJECT('task', 'anomaly_detection',
'exclude_column_list', JSON_ARRAY('target')),
@anomaly);
Query OK, 0 rows affected (46.59 sec)
Using an Anomaly Detection Model
The default contamination value is 0.01. The default threshold value based on the default
contamination value is the 0.99-th percentile of all the anomaly scores.
An alternative to threshold is topk. The results include the top K rows with the highest anomaly
scores. The ML_PREDICT_TABLE and ML_SCORE routines include the topk option, which is an integer
between 1 and the table length.
To detect anomalies, run the ML_PREDICT_ROW or ML_PREDICT_TABLE routines on data with the
same columns as the data used to train the model.
For ML_SCORE, the target_column_name column must only contain the ground truth labels as an
integer: 1 for an anomaly, or 0 for normal.
ML_SCORE now includes an options parameter in JSON format. The options are threshold and topk.
For instructions to use the ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE routines, see
Section 3.7, “Predictions”, and Section 3.13.6, “Scoring Models”. For the complete list of option
descriptions, see ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE.
Syntax Examples
• An anomaly detection example that uses the roc_auc metric for ML_SCORE.
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train',
NULL, JSON_OBJECT('task', 'anomaly_detection',
'exclude_column_list', JSON_ARRAY('target')),
@anomaly);
Query OK, 0 rows affected (46.59 sec)
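The scoring call itself and the score retrieval would resemble the following sketch; the test table name
is illustrative, and the target column contains the 1 or 0 ground truth labels:
mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_test',
              'target', @anomaly, 'roc_auc', @score, NULL);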
mysql> SELECT @score;
+--------------------+
| @score             |
+--------------------+
| 0.7465642094612122 |
+--------------------+
1 row in set (0.00 sec)
mysql> CALL sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL);
+--------------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL)
+--------------------------------------------------------------------------------------------------------
| {"V1": 438.0, "V2": 959.0, "V3": 0.556034, "ml_results": "{'predictions': {'is_anomaly': 0}, 'probabili
+--------------------------------------------------------------------------------------------------------
1 row in set (5.35 sec)
• A ML_PREDICT_TABLE example that uses the threshold option set to 1%. All rows shown have
probabilities of being an anomaly above 1%, and are predicted to be anomalies.
• An ML_SCORE example that uses the accuracy metric with a threshold set to 90%.
• An ML_SCORE example that uses the precision_at_k metric with a topk value of 10. Sketches
of these two calls follow.
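Assuming the same training and test tables as the roc_auc example above, the two calls might look
like the following:
mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_test',
              'target', @anomaly, 'accuracy', @score, JSON_OBJECT('threshold', 0.9));
mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_test',
              'target', @anomaly, 'precision_at_k', @score, JSON_OBJECT('topk', 10));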
3.11 Recommendations
MySQL 8.0.33 introduces recommendation models.
Training a Recommendation Model
Recommendation models include matrix factorization models, and use algorithms from the Surprise
library; see Surprise.
MySQL 8.1.0 adds further options for recommendation models, which extend what the models can
recommend.
MySQL 8.2.0 introduces recommendation models for implicit feedback. When a user interacts with an
item, the implication is that they prefer it to an item that they do not interact with. Implicit feedback uses
BPR: Bayesian Personalized Ranking from Implicit Feedback, which is a matrix factorization model that
ranks user-item pairs.
Recommendation models that use explicit feedback learn and recommend ratings for users and items.
Recommendation models that use implicit feedback learn and recommend rankings for users and
items.
Ratings from explicit feedback are specific values, and the higher the value, the better the rating.
Rankings are a comparative measure, and the lower the value, the better the ranking. If A is
better than B, the ranking for A has a lower value than the ranking for B. HeatWave AutoML derives
rankings based on ratings from implicit feedback, for all ratings that are at or above the feedback
threshold.
Recommendation models can now repeat existing interactions from the training table.
Run the ML_TRAIN routine to create a recommendation model, and use the following JSON options:
• feedback: The type of feedback for a recommendation model, explicit, the default, or
implicit.
• feedback_threshold: The feedback threshold for a recommendation model that uses implicit
feedback.
If the users or items column contains NULL values, the corresponding rows will be dropped and will
not be considered during training.
HeatWave AutoML does not support recommendation tasks with a text column.
See Section 3.5, “Training a Model”, and for full details of all the options, see ML_TRAIN.
• An ML_TRAIN example that specifies three models for the model_list option. The training table
and column names in this call are illustrative:
mysql> SET @allowed_models = JSON_ARRAY('SVD', 'SVDpp', 'NMF');
mysql> CALL sys.ML_TRAIN('mlcorpus.recommendation_train', 'rating',
                JSON_OBJECT('task', 'recommendation',
                'users', 'user_id',
                'items', 'item_id',
'model_list', CAST(@allowed_models AS JSON)),
@model);
Query OK, 0 rows affected (14.88 sec)
• An ML_TRAIN example that specifies five models for the exclude_model_list option.
mysql> SET @exclude_models= JSON_ARRAY('NormalPredictor', 'Baseline', 'SlopeOne', 'CoClustering', 'SV
Using a Recommendation Model
• For known users and known items, the prediction is the model output.
• For a known user with a new item, the prediction is the global average rating or ranking. The
routines can add a user bias if the model includes it.
• For a new user with a known item, the prediction is the global average rating or ranking. The
routines can add an item bias if the model includes it.
• For a new user with a new item, the prediction is the global average rating or ranking.
• For known users and known items, the prediction is the model output.
• For a new item, and an explicit feedback model, the prediction is the global top K users who have
provided the average highest ratings.
For a new item, and an implicit feedback model, the prediction is the global top K users with the
highest number of interactions.
• For an item that has been tried by all known users, the prediction is an empty list because it
is not possible to recommend any other users. Set remove_seen to false to repeat existing
interactions from the training table.
• For known users and known items, the prediction is the model output.
• For a new user, and an explicit feedback model, the prediction is the global top K items that
received the average highest ratings.
For a new user, and an implicit feedback model, the prediction is the global top K items with the
highest number of interactions.
• For a user who has tried all known items, the prediction is an empty list because it is not possible
to recommend any other items. Set remove_seen to false to repeat existing interactions from
the training table.
• For a new item, there is no information to provide a prediction. This will produce an error.
• The predictions are expressed in cosine similarity, and range from 0, very dissimilar, to 1, very
similar.
• For a new user, there is no information to provide a prediction. This will produce an error.
• The predictions are expressed in cosine similarity, and range from 0, very dissimilar, to 1, very
similar.
For recommendations, run the ML_PREDICT_ROW or ML_PREDICT_TABLE routines on data with the
same columns as the data used to train the model.
• ratings: Use this option to predict ratings. This is the default value.
• remove_seen: If true, the model will not repeat existing interactions from the training table.
A table with the same name as the output table for ML_PREDICT_TABLE must not already exist.
NULL values for any row in the users or items columns will cause an error.
• threshold: The optional threshold that defines positive feedback, and a relevant sample. Only use
with ranking metrics.
• topk: The optional top K rows to recommend. Only use with ranking metrics.
• remove_seen: If true, the model will not repeat existing interactions from the training table.
ML_SCORE can use any of the recommendation metrics to score a recommendation model. Always
use the metric parameter to specify a ratings metric to use with a recommendation model that uses
explicit feedback, or a ranking metric to use with a recommendation model that uses implicit feedback.
See: Section 3.15.13, “Optimization and Scoring Metrics”.
For instructions to use the ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE routines, see
Section 3.7, “Predictions”, and Section 3.13.6, “Scoring Models”. For the complete list of option
descriptions, see ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE.
• An ML_PREDICT_TABLE example that predicts the ratings for particular users and items. This is the
default option for recommend, with options set to NULL.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.retailrocket-transactionto_to_predict',
@model, 'mlcorpus.table_predictions', NULL);
Query OK, 0 rows affected (0.7589 sec)
• An ML_PREDICT_ROW example that recommends the top 3 users that will like a particular item.
mysql> SELECT sys.ML_PREDICT_ROW('{"item_id": "64154"}', @model,
JSON_OBJECT("recommend", "users", "topk", 3));
+--------------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"item_id": "64154"}', @model, JSON_OBJECT("recommend", "users", "topk", 3))
+--------------------------------------------------------------------------------------------------------
| {"item_id": "64154", "ml_results": "{"predictions": {"user_id": ["171718", "1167457", "1352334"], "rati
+--------------------------------------------------------------------------------------------------------
1 row in set (0.34 sec)
• An ML_PREDICT_TABLE example that recommends the top 3 users that will like particular items.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.retailrocket-transactionto_to_predict',
@model, 'mlcorpus.table_predictions_items',
JSON_OBJECT("recommend", "users", "topk", 3));
Query OK, 0 rows affected (1.85 sec)
• A more complete example for the top 3 users that will like particular items.
mysql> SELECT * FROM train_table;
+---------+------------+--------+
| user_id | item_id | rating |
+---------+------------+--------+
| user_1 | good_movie | 5 |
| user_1 | bad_movie | 1 |
| user_2 | bad_movie | 1 |
| user_3 | bad_movie | 0 |
| user_4 | bad_movie | 0 |
+---------+------------+--------+
5 rows in set (0.00 sec)
• An ML_PREDICT_ROW example for the top 3 users that will like a particular item with the
items_to_users option.
mysql> SELECT sys.ML_PREDICT_ROW('{"item_id": "524"}', @model,
JSON_OBJECT("recommend", "items_to_users", "topk", 3));
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"item_id": "524"}', @model, JSON_OBJECT("recommend", "items_to_users", "topk"
+----------------------------------------------------------------------------------------------------
| {"item_id": "524", "ml_results": "{"predictions": {"user_id": ["7", "164", "894"], "rating": [4.05,
+----------------------------------------------------------------------------------------------------
1 row in set (0.2808 sec)
• An ML_PREDICT_TABLE example for the top 3 users that will like particular items with the
items_to_users option.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.ml-100k',
@model, 'mlcorpus.item_to_users_recommendation',
JSON_OBJECT("recommend", "items_to_users", "topk", 3));
Query OK, 0 rows affected (21.2070 sec)
• An ML_PREDICT_ROW example for the top 3 items that a particular user will like.
mysql> SELECT sys.ML_PREDICT_ROW('{"user_id": "836347"}', @model,
JSON_OBJECT("recommend", "items", "topk", 3));
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"user_id": "836347"}', @model, JSON_OBJECT("recommend", "items", "topk", 3))
+----------------------------------------------------------------------------------------------------
| {"user_id": "836347", "ml_results": "{"predictions": {"item_id": ["119736", "396042", "224549"], "r
+----------------------------------------------------------------------------------------------------
1 row in set (0.31 sec)
• An ML_PREDICT_TABLE example for the top 3 items that particular users will like.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.retailrocket-transactionto_to_predict',
@model, 'mlcorpus.user_recommendations',
JSON_OBJECT("recommend", "items", "topk", 3));
Query OK, 0 rows affected (6.0322 sec)
• A more complete example for the top 3 items that particular users will like.
mysql> SELECT * FROM train_table;
+---------+------------+--------+
| user_id | item_id | rating |
+---------+------------+--------+
| user_1 | good_movie | 5 |
| user_1 | bad_movie | 1 |
| user_2 | bad_movie | 1 |
| user_3 | bad_movie | 0 |
| user_4 | bad_movie | 0 |
+---------+------------+--------+
5 rows in set (0.00 sec)
• An ML_PREDICT_ROW example for the top 3 items that a particular user will like with the
users_to_items option.
mysql> SELECT sys.ML_PREDICT_ROW('{"user_id": "846"}', @model,
JSON_OBJECT("recommend", "users_to_items", "topk", 3));
+--------------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"user_id": "846"}', @model, JSON_OBJECT("recommend", "users_to_items", "topk", 3)
+--------------------------------------------------------------------------------------------------------
| {"user_id": "846", "ml_results": "{"predictions": {"item_id": ["313", "483", "64"], "rating": [4.06, 4.
+--------------------------------------------------------------------------------------------------------
1 row in set (0.2811 sec)
• An ML_PREDICT_TABLE example for the top 3 items that particular users will like with the
users_to_items option.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.ml-100k',
@model, 'mlcorpus.user_to_items_recommendation',
JSON_OBJECT("recommend", "users_to_items", "topk", 3));
Query OK, 0 rows affected (22.8504 sec)
• An ML_SCORE example:
mysql> CALL sys.ML_SCORE('mlcorpus.ipinyou-click_test',
'rating', @model, 'neg_mean_squared_error', @score, NULL);
Query OK, 0 rows affected (1 min 18.29 sec)
• An ML_PREDICT_TABLE example that predicts the rankings for particular users and items.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_table', @model, 'mlcorpus.table_predictions', NULL);
• An ML_PREDICT_ROW example that recommends the top 3 users that will like a particular item.
mysql> SELECT sys.ML_PREDICT_ROW('{"item_id": "13763"}', @model, JSON_OBJECT("recommend", "users", "topk
+--------------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"item_id": "13763"}', @model, JSON_OBJECT("recommend", "users", "topk", 3))
+--------------------------------------------------------------------------------------------------------
| {"item_id": "13763", "ml_results": {"predictions": {"rating": [1.26, 1.25, 1.25], "user_id": ["3929", "
+--------------------------------------------------------------------------------------------------------
1 row in set (0.3098 sec)
• An ML_PREDICT_ROW example that recommends the top 3 users that will like a particular item and
includes existing interactions from the training table.
mysql> SELECT sys.ML_PREDICT_ROW('{"item_id": "13763"}', @model, JSON_OBJECT("recommend", "users", "topk
+--------------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"item_id": "13763"}', @model, JSON_OBJECT("recommend", "users", "topk", 3, "remov
+----------------------------------------------------------------------------------------------------
| {"item_id": "13763", "ml_results": {"predictions": {"rating": [1.26, 1.26, 1.26], "user_id": ["4590
+----------------------------------------------------------------------------------------------------
1 row in set (0.3098 sec)
• An ML_PREDICT_TABLE example that recommends the top 3 users that will like particular items.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_items',
Query OK, 0 rows affected (4.1777 sec)
• An ML_PREDICT_ROW example that recommends the top 3 items that a particular user will like.
mysql> SELECT sys.ML_PREDICT_ROW('{"user_id": "1026"}', @model, JSON_OBJECT("recommend", "items", "t
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"user_id": "1026"}', @model, JSON_OBJECT("recommend", "items", "topk", 3))
+----------------------------------------------------------------------------------------------------
| {"user_id": "1026", "ml_results": {"predictions": {"rating": [3.43, 3.37, 3.18], "item_id": ["10",
+----------------------------------------------------------------------------------------------------
1 row in set (0.6586 sec)
• An ML_PREDICT_ROW example that recommends the top 3 items that a particular user will like and
includes existing interactions from the training table.
mysql> SELECT sys.ML_PREDICT_ROW('{"user_id": "1026"}', @model, JSON_OBJECT("recommend", "items", "t
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"user_id": "1026"}', @model, JSON_OBJECT("recommend", "items", "topk", 3, "re
+----------------------------------------------------------------------------------------------------
| {"user_id": "1026", "ml_results": {"predictions": {"rating": [3.43, 3.37, 3.18], "item_id": ["10",
+----------------------------------------------------------------------------------------------------
• An ML_PREDICT_TABLE example that recommends the top 3 items that particular users will like.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_users',
Query OK, 0 rows affected (5.0672 sec)
+-----+---------+---------+--------+---------------------------------------------------------------------
| 1 | 1026 | 13763 | 1 | {"predictions": {"item_id": ["13751", "13711", "13668"], "similarity
| 2 | 992 | 16114 | 1 | {"predictions": {"item_id": ["14050", "16413", "16454"], "similarity
| 3 | 1863 | 4527 | 1 | {"predictions": {"item_id": ["6008", "1873", "1650"], "similarity":
+-----+---------+---------+--------+---------------------------------------------------------------------
3 rows in set (0.0004 sec)
• ML_SCORE examples for the four metrics suitable for a recommendation model with implicit feedback,
with threshold set to 3 and topk set to 50.
mysql> CALL sys.ML_MODEL_LOAD(@model, NULL);
• ML_SCORE examples for the four metrics suitable for a recommendation model with implicit feedback,
with threshold set to 3, topk set to 50 and including existing interactions from the training table
with remove_seen set to false.
3.12 HeatWave AutoML and Lakehouse
• If the Lakehouse table had not been loaded into HeatWave before a HeatWave AutoML command,
then the data will be unloaded after the command.
• If the Lakehouse table had been loaded into HeatWave before a HeatWave AutoML command, then
the data will remain in HeatWave after the command.
HeatWave AutoML commands operate on data loaded into HeatWave. If the original Lakehouse data in
Object Storage is deleted or modified, this does not affect a HeatWave AutoML command until the data
is unloaded from HeatWave.
Syntax Examples
The following examples use data from: Bank Marketing. The target column is y.
• A CREATE TABLE example with Lakehouse details that loads the training dataset.
mysql> CREATE TABLE bank_marketing_lakehouse_train(
age int,
job varchar(255),
marital varchar(255),
education varchar(255),
default1 varchar(255),
balance float,
housing varchar(255),
loan varchar(255),
contact varchar(255),
day int,
month varchar(255),
duration float,
campaign int,
pdays float,
previous float,
poutcome varchar(255),
y varchar(255)
)
ENGINE=LAKEHOUSE
SECONDARY_ENGINE=RAPID
ENGINE_ATTRIBUTE='{"dialect": {"format": "csv",
"skip_rows": 1,
"field_delimiter":",",
"record_delimiter":"\\n"}}',
"file": [{ "region": "region",
"namespace": "namespace",
"bucket": "bucket",
"prefix": "mlbench/bank_marketing_train.csv"}]'
• An ALTER TABLE example with Lakehouse details that loads the test dataset.
mysql> CREATE TABLE bank_marketing_lakehouse_test(
age int,
job varchar(255),
marital varchar(255),
education varchar(255),
default1 varchar(255),
balance float,
housing varchar(255),
loan varchar(255),
contact varchar(255),
day int,
month varchar(255),
duration float,
campaign int,
132
Syntax Examples
pdays float,
previous float,
poutcome varchar(255),
y varchar(255)
);
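The ALTER TABLE statement that attaches the Lakehouse details would then resemble the following
sketch; it assumes the same ENGINE_ATTRIBUTE layout as the CREATE TABLE example, with the test
file prefix:
mysql> ALTER TABLE bank_marketing_lakehouse_test
       ENGINE=LAKEHOUSE
       SECONDARY_ENGINE=RAPID
       ENGINE_ATTRIBUTE='{"dialect": {"format": "csv",
       "skip_rows": 1,
       "field_delimiter": ",",
       "record_delimiter": "\\n"},
       "file": [{"region": "region",
       "namespace": "namespace",
       "bucket": "bucket",
       "prefix": "mlbench/bank_marketing_test.csv"}]}';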
• ML_TRAIN, ML_MODEL_LOAD, and ML_SCORE examples that use the Lakehouse data.
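As a sketch, using the tables above with the y target column, these calls could look like the following;
the schema name and metric are illustrative:
mysql> CALL sys.ML_TRAIN('bank.bank_marketing_lakehouse_train', 'y',
              JSON_OBJECT('task', 'classification'), @bank_model);
mysql> CALL sys.ML_MODEL_LOAD(@bank_model, NULL);
mysql> CALL sys.ML_SCORE('bank.bank_marketing_lakehouse_test', 'y', @bank_model,
              'balanced_accuracy', @score, NULL);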
• Examples for ML_PREDICT_ROW and ML_EXPLAIN_ROW that insert data directly, and avoid the FROM
clause; for example (the leading values in this ML_EXPLAIN_ROW call are illustrative):
mysql> SELECT sys.ML_EXPLAIN_ROW('{"age": 58,
"job": "management",
"marital": "married",
"education": "tertiary",
"default1": "no",
"balance": 2143,
"housing": "yes",
"loan": "no",
"contact": "unknown",
"day": 21,
"month": "may",
"duration": 1106,
"campaign": 1,
"pdays": -1,
"previous": 0,
"poutcome":
"unknown",
"y": "no"}',
@bank_model,
JSON_OBJECT('prediction_explainer', 'permutation_importance'));
• Examples for ML_PREDICT_ROW and ML_EXPLAIN_ROW that insert data directly with a JSON object,
and avoid the FROM clause.
• Examples for ML_PREDICT_ROW and ML_EXPLAIN_ROW that copy four rows to an InnoDB table,
and then use a FROM clause.
3.13 Managing Models
When a user creates a model, the ML_TRAIN routine creates the model catalog schema and table if
they do not exist. ML_TRAIN inserts the model as a row in the MODEL_CATALOG table at the end of
training.
A model catalog is accessible only to the owning user unless the user grants privileges on the model
catalog to another user. This means that HeatWave AutoML routines can only use models that are
accessible to the user running the routines. For information about granting model catalog privileges,
see Section 3.13.10, “Sharing Models”.
A database administrator can manage a model catalog table as they would a regular MySQL table.
• model_id
A unique, auto-incrementing numeric identifier for the model.
• model_handle
A name for the model. The model handle must be unique in the model catalog. The model handle
is generated or set by the user when the ML_TRAIN routine is executed on a training dataset. The
generated model_handle format is schemaName_tableName_userName_No, as in the following
example: heatwaveml_bench.census_train_user1_1636729526.
• model_object
• model_owner
The user who initiated the ML_TRAIN query to create the model.
• build_timestamp
A timestamp indicating when the model was created (in UNIX epoch time). A model is created when
the ML_TRAIN routine finishes executing.
• target_column_name
The name of the column in the training table that was specified as the target column.
• train_table_name
• model_object_size
• model_type
• task
MySQL 8.1.0 deprecates task, and replaces it with task in model_metadata. A future release will
remove it.
• column_names
• model_explanation
The model explanation generated during training. See Section 3.13.7, “Model Explanations”. This
column was added in MySQL 8.0.29.
• last_accessed
The last time the model was accessed. HeatWave AutoML routines update this value to the current
timestamp when accessing the model.
MySQL 8.1.0 deprecates last_accessed, because it is no longer used. A future release will
remove it.
• model_metadata
Metadata for the model. If an error occurs during training or you cancel the training operation,
HeatWave AutoML records the error status in this column. This column was added in MySQL 8.0.31.
See Section 3.15.12, “Model Metadata”.
• notes
Use this column to record notes about the trained model. It also records any error messages that
occur during model training.
MySQL 8.1.0 deprecates notes, and replaces it with notes in model_metadata. A future release
will remove it.
ONNX Model Import
Models in ONNX format, .onnx, cannot be loaded directly into a MySQL table. They require string
serialization and conversion to Base64 encoding before you use the ML_MODEL_IMPORT routine.
HeatWave AutoML supports the following types of ONNX model input:
• An ONNX model that has only one input and it is the entire MySQL table.
• An ONNX model that has more than one input and each input is one column in the MySQL table.
For example, HeatWave AutoML does not support an ONNX model that takes more than one input and
each input is associated with more than one column in the MySQL table.
The first dimension of the input to the ONNX model provided by the ONNX model get_inputs() API
should be the batch size. This should be None, a string, or an integer. None or string indicate a variable
batch size and an integer indicates a fixed batch size. Examples of input shapes:
[None, 2]
['batch_size', 2, 3]
[1, 14]
All other dimensions should be integers. For example, HeatWave AutoML does not support an input
shape similar to the following:
input shape = ['batch_size', 'sequence_length']
The output of an ONNX model is a list of results. The ONNX API documentation defines the results as
a numpy array, a list, a dictionary or a sparse tensor. HeatWave AutoML only supports a numpy array,
a list, and a dictionary.
• Numpy array examples:
array([[0.8896357 , 0.11036429],
[0.28360802, 0.716392 ],
[0.9404001 , 0.05959991],
[0.5655978 , 0.43440223]], dtype=float32)
array([[0.96875435],
[1.081366 ],
[0.5736201 ],
[0.90711355]], dtype=float32)
• List examples:
[0, 2, 0, 0]
[[[0.8896357] , [0.110364]],
[[0.28360802], [0.716392]],
[[0.9404001] , [0.059599]],
[[0.5655978] , [0.434402]]]
[[0.968754],
[1.081366],
[0.573620],
[0.907113]]
[[[0.968754]],
[[1.081366]],
[[0.573620]],
[[0.907113]]]
• Dictionary examples:
{'Iris-setosa': 0.0, 'Iris-versicolor': 0.0, 'Iris-virginica': 0.999}
For classification and regression tasks, HeatWave AutoML only supports model explainers and scoring
for variable batch sizes.
For forecasting, anomaly detection and recommendation tasks, HeatWave AutoML does not support
model explainers and scoring. The prediction column must contain a JSON object literal of name-value
pairs; for example, for three outputs:
{output1: value1, output2: value2, output3: value3}
onnx_inputs_info includes data_types_map. See Section 3.15.12, “Model Metadata” for the
default value.
Use the data_types_map to map the data type of each column to an ONNX model data type. For
example, to convert inputs of the type tensor(float) to float64:
data_types_map = {"tensor(float)": "float64"}
HeatWave AutoML first checks the user data_types_map, and then the default data_types_map to
check if the data type exists. HeatWave AutoML supports the following numpy data types:
Table 3.1 Supported numpy data types
str_ unicode_ int8 int16 int32 int64 int_ uint16
uint32 uint64 byte ubyte short ushort intc uintc
uint longlong ulonglong intp uintp float16 float32 float64
half single longfloat double longdouble bool_ datetime64 complex_
complex64 complex128 complex256 csingle cdouble clongdouble
The use of any other numpy data type will cause an error.
Use predictions_name to determine which of the ONNX model outputs is associated with
predictions. Use prediction_probabilities_name to determine which of the ONNX model
outputs is associated with prediction probabilities. Use a labels_map to map prediction
probabilities to predictions, known as labels.
For regression tasks, if the ONNX model generates only one output, then predictions_name is
optional. If the ONNX model generates more than one output, then predictions_name is required.
Do not provide prediction_probabilities_name as this will cause an error.
HeatWave AutoML adds a note for ONNX models that have inputs with four dimensions about the
reshaping of data to a suitable shape for an ONNX model. This would typically be for ONNX models
that are trained on image data. The note is added to the ml_results column.
1. Convert the .onnx file containing the model to Base64 encoding and carry out string serialization.
Do this with the Python base64 module. The following example converts the file iris.onnx:
$> python -c "import onnx; import base64;
$> open('iris_base64.onnx', 'wb').write(
$> base64.b64encode(onnx.load('iris.onnx').SerializeToString()))"
2. Connect to the MySQL DB System for the HeatWave Cluster as a client, and create a temporary
table to upload the model. For example:
mysql> CREATE TEMPORARY TABLE onnx_temp (onnx_string LONGTEXT);
3. Use a LOAD DATA INFILE statement to load the preprocessed .onnx file into the temporary
table. For example:
mysql> LOAD DATA INFILE 'iris_base64.onnx'
INTO TABLE onnx_temp
CHARACTER SET binary
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r' (onnx_string);
4. Select the uploaded model from the temporary table into a session variable. For example:
mysql> SELECT onnx_string FROM onnx_temp INTO @onnx_encode;
5. Call the ML_MODEL_IMPORT routine to import the ONNX model into the model catalog. For
example:
mysql> CALL sys.ML_MODEL_IMPORT(@onnx_encode, NULL, 'iris_onnx');
In this example, the model handle is iris_onnx, and the optional model metadata is omitted
and set to NULL. For details of the supported metadata for imported ONNX models, see
ML_MODEL_IMPORT and Section 3.15.12, “Model Metadata”.
After import, all the HeatWave AutoML routines can be used with the ONNX model. It is added to the
model catalog and can be managed in the same ways as a model created by HeatWave AutoML.
Loading Models
A model can only be loaded by the MySQL user that created the model. For more information, see
Section 3.13.10, “Sharing Models”.
HeatWave can load multiple models but to avoid taking up too much space in memory, limit the number
of loaded models to three.
The following example loads a HeatWave AutoML model from the model catalog:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
A fully qualified ML_MODEL_LOAD call that specifies the model handle and user name of the owner is as
follows:
mysql> CALL sys.ML_MODEL_LOAD('heatwaveml_bench.census_train_user1_1636729526', user1);
Viewing Models
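To view models, query the MODEL_CATALOG table for your user; for example (the columns selected
here are illustrative):
mysql> SELECT model_id, model_handle, model_owner
       FROM ML_SCHEMA_user1.MODEL_CATALOG;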
Note
The example above retrieves data from only a few MODEL_CATALOG table
columns. For other columns you can query, see Section 3.13.1, “The Model
Catalog”.
Scoring Models
The dataset used with ML_SCORE should have the same feature columns as the dataset used to train
the model but the data sample should be different from the data used to train the model; for example,
you might reserve 20 to 30 percent of a labeled dataset for scoring.
ML_SCORE returns a computed metric indicating the quality of the model. A value of None is reported
if a score for the specified or default metric cannot be computed. If an invalid metric is specified, the
following error message is reported: Invalid data for the metric. Score could not be
computed.
Models with a low score can be expected to perform poorly, producing predictions and explanations
that cannot be relied upon. A low score typically indicates that the provided feature columns are not
a good predictor of the target values. In this case, consider adding more rows or more informative
features to the training dataset.
You can also run ML_SCORE on the training dataset and a labeled test dataset and compare results to
ensure that the test dataset is representative of the training dataset. A high score on a training dataset
and low score on a test dataset indicates that the test data set is not representative of the training
dataset. In this case, consider adding rows to the training dataset that better represent the test dataset.
HeatWave AutoML supports a variety of scoring metrics to help you understand how your model
performs across a series of benchmarks. For ML_SCORE parameter descriptions and supported
metrics, see Section 3.15.8, “ML_SCORE”.
Before running ML_SCORE, ensure that the model you want to use is loaded; for example:
mysql> CALL sys.ML_MODEL_LOAD(@census_model, NULL);
For information about loading models, see Section 3.13.3, “Loading Models”.
The following example runs ML_SCORE to compute model quality using the balanced_accuracy
metric:
mysql> CALL sys.ML_SCORE('heatwaveml_bench.census_validate', 'revenue',
@census_model, 'balanced_accuracy', @score);
where:
• revenue is the name of the target column containing ground truth values.
• balanced_accuracy is the scoring metric. For other supported scoring metrics, see
Section 3.15.8, “ML_SCORE”.
• @score is the user-defined session variable that stores the computed score. The ML_SCORE routine
populates the variable. User variables are written as @var_name. The examples in this guide use
@score as the variable name. Any valid name for a user-defined variable is permitted, for example
@my_score.
Model Explanations
A model explanation helps you identify the features that are most important to the model overall.
Feature importance is presented as a numerical value ranging from 0 to 1. Higher values signify higher
feature importance, lower values signify lower feature importance, and a 0 value means that the feature
does not influence the model.
The following example retrieves the model explanation for the census model:
mysql> SELECT model_explanation FROM ML_SCHEMA_user1.MODEL_CATALOG
WHERE model_handle=@census_model;
Model Handles
The model handle is stored temporarily in a user-defined session variable specified in the ML_TRAIN
call. In the following example, @census_model is defined as the model handle session variable:
mysql> CALL sys.ML_TRAIN('heatwaveml_bench.census_train', 'revenue', NULL, @census_model);
To use your own model handle instead of a generated one, set the value of the session variable before
calling the ML_TRAIN routine, like this:
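mysql> SET @census_model = 'census_model_manual';
The handle value shown here is illustrative; any handle that does not already appear in the model
catalog is permitted.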
If you set a model handle that already appears in the model catalog, the ML_TRAIN routine returns an
error.
While the connection used to run ML_TRAIN remains active, that connection can retrieve the model
handle by querying the session variable; for example:
mysql> SELECT @census_model;
+--------------------------------------------------+
| @census_model |
+--------------------------------------------------+
| heatwaveml_bench.census_train_user1_1636729526 |
+--------------------------------------------------+
Note
While the session variable remains populated with the model handle, it can be specified in place of
the model handle when running other ML_* routines. However, once the connection is terminated, the
session variable data is lost. In this case, you can look up the model handle by querying the model
catalog table; for example:
mysql> SELECT model_handle, model_owner, train_table_name
FROM ML_SCHEMA_user1.MODEL_CATALOG;
+------------------------------------------------+-------------+-------------------------------+
| model_handle | model_owner | train_table_name |
+------------------------------------------------+-------------+-------------------------------+
| heatwaveml_bench.census_train_user1_1636729526 | user1 | heatwaveml_bench.census_train |
+------------------------------------------------+-------------+-------------------------------+
You can specify the model handle directly in ML_* routine calls; for example:
mysql> SELECT sys.ML_PREDICT_ROW(@row_input, 'heatwaveml_bench.census_train_user1_1636729526');
Alternatively, you can reassign a model handle to a session variable; for example:
• To assign a model handle to a session variable named @my_model for the most recently trained
model:
mysql> SET @my_model = (SELECT model_handle FROM ML_SCHEMA_user1.MODEL_CATALOG
ORDER BY model_id DESC LIMIT 1);
The most recently trained model is the last model inserted into the MODEL_CATALOG table. It has the
most recently assigned model_id, which is a unique auto-incrementing numeric identifier.
Deleting Models
To delete a model from the model catalog, issue a query similar to the following:
mysql> DELETE FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_id = 3;
Sharing Models
Sharing a model requires granting model catalog privileges to another user. You can only share a
model with another MySQL user on the same MySQL DB System.
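For example, user1 might grant another user access to their model catalog as follows; this is a sketch,
and the privileges to grant depend on what the other user needs to do:
mysql> GRANT SELECT ON ML_SCHEMA_user1.MODEL_CATALOG TO 'user2'@'%';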
Note
The user that is granted model catalog privileges must also have the privileges
required to use HeatWave AutoML and the CREATE privilege on the schema
where ML_PREDICT_TABLE or ML_EXPLAIN_TABLE results are written. See
Section 3.2, “Before You Begin”.
After a model catalog is shared with another user, that user can access models in the catalog when
running ML_* routines. For example, 'user2'@'%' from the example above might assign a model handle
from the user1 model catalog to a session variable, and pass that session variable to an
ML_PREDICT_TABLE call. The model owner is responsible for loading a model shared with other
users. For example:
mysql> SET @my_model = (SELECT model_handle
FROM ML_SCHEMA_user1.MODEL_CATALOG
WHERE train_table_name LIKE '%census_train%');
Progress Tracking
MySQL 8.2.0 adds progress tracking for all HeatWave AutoML routines:
• ML_TRAIN.
• ML_EXPLAIN.
• ML_MODEL_IMPORT for models in ONNX format. Progress tracking does not support
ML_MODEL_IMPORT for models in HeatWave AutoML format.
• ML_PREDICT_ROW.
• ML_PREDICT_TABLE.
• ML_EXPLAIN_ROW.
• ML_EXPLAIN_TABLE.
• ML_SCORE.
• ML_MODEL_LOAD.
• ML_MODEL_UNLOAD.
For each of these routines, progress tracking covers each individual operation and the stages within the
routine, and reports a completed percentage value.
Syntax Examples
• As of MySQL 8.2.0, run ML_TRAIN from the first MySQL Client window:
mysql> CALL sys.ML_TRAIN('mlcorpus_v5.`titanic_train`', 'survived', NULL, @model);
Query OK, 0 rows affected (1 min 28.7369 sec)
From the second MySQL Client window, run successive queries against
performance_schema.rpd_query_stats to track the progress of the operation; for example:
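mysql> SELECT * FROM performance_schema.rpd_query_stats;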
• As of MySQL 8.2.0, as an example for other operations, run ML_EXPLAIN from the first MySQL
Client window:
mysql> CALL sys.ML_MODEL_LOAD(@model, NULL);
Query OK, 0 rows affected (0.5951 sec)
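Then run ML_EXPLAIN; a representative call for the titanic model trained above:
mysql> CALL sys.ML_EXPLAIN('mlcorpus_v5.`titanic_train`', 'survived', @model, NULL);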
From the second MySQL Client window, run the following successive queries:
1. At the start of the ML_EXPLAIN operation. The first five rows relate to the progress of ML_TRAIN.
mysql> SELECT * FROM performance_schema.rpd_query_stats;
+----------+--------------+---------------+-------------------+-----------------------------------
| QUERY_ID | STATEMENT_ID | CONNECTION_ID | QUERY_TEXT | QEXEC_TEXT
+----------+--------------+---------------+-------------------+-----------------------------------
| 6 | 4294967295 | 4294967295 | ML_LOAD_TABLE | {"status": "Completed", "completed
| 7 | 4294967295 | 4294967295 | ML_LOAD_TABLE | {"status": "Completed", "completed
| 8 | 4294967295 | 4294967295 | ML_EXPLAIN | {"options": {"model_explainer": "f
+----------+--------------+---------------+-------------------+-----------------------------------
8 rows in set (0.0005 sec)
• As of MySQL 8.2.0, run ML_PREDICT_ROW from the first MySQL Client window:
mysql> SELECT sys.ML_PREDICT_ROW(
JSON_OBJECT('pclass',`titanic_test`.`pclass`,
'name',`titanic_test`.`name`,
'sex',`titanic_test`.`sex`,
'age',`titanic_test`.`age`,
'sibsp',`titanic_test`.`sibsp`,
'parch',`titanic_test`.`parch`,
'ticket',`titanic_test`.`ticket`,
'fare',`titanic_test`.`fare`,
'cabin',`titanic_test`.`cabin`,
'embarked',`titanic_test`.`embarked`,
'boat',`titanic_test`.`boat`,
'body',`titanic_test`.`body`,
'home.dest',`titanic_test`.`home.dest` ),
@model, NULL) FROM `titanic_test` LIMIT 4;
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW(JSON_OBJECT('pclass',`titanic_test`.`pclass`,'name',`titanic_test`.`name`,'sex',
+----------------------------------------------------------------------------------------------------
| {"age": 20.0, "sex": "male", "boat": null, "body": 89.0, "fare": 9.2250003815, "name": "Olsvigen, M
| {"age": 4.0, "sex": "female", "boat": "2", "body": null, "fare": 22.0249996185, "name": "Kink-Heilm
| {"age": 42.0, "sex": "male", "boat": null, "body": 120.0, "fare": 7.6500000954, "name": "Humblen, M
| {"age": 45.0, "sex": "male", "boat": "7", "body": null, "fare": 29.7000007629, "name": "Chevre, Mr.
+----------------------------------------------------------------------------------------------------
4 rows in set (1.1977 sec)
From the second MySQL Client window, run the following query:
mysql> SELECT * FROM performance_schema.rpd_query_stats;
149
Syntax Examples
+----------+--------------+---------------+-------------------+------------------------------------------
| QUERY_ID | STATEMENT_ID | CONNECTION_ID | QUERY_TEXT | QEXEC_TEXT
+----------+--------------+---------------+-------------------+------------------------------------------
| 23 | 4294967295 | 4294967295 | ML_PREDICT_ROW | {"status": "Completed", "completedSteps":
| 24 | 4294967295 | 4294967295 | ML_PREDICT_ROW | {"status": "Completed", "completedSteps":
| 25 | 4294967295 | 4294967295 | ML_PREDICT_ROW | {"status": "Completed", "completedSteps":
| 26 | 4294967295 | 4294967295 | ML_PREDICT_ROW | {"status": "Completed", "completedSteps":
+----------+--------------+---------------+-------------------+------------------------------------------
26 rows in set (0.0005 sec)
• As of MySQL 8.2.0, to extract the completed percentage for the ML_TRAIN operation:
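A sketch of such a query; the JSON path to the percentage value within QEXEC_TEXT is an
assumption here:
mysql> SELECT QUERY_TEXT,
              JSON_UNQUOTE(JSON_EXTRACT(QEXEC_TEXT, '$.percentage')) AS completed_percentage
       FROM performance_schema.rpd_query_stats
       WHERE QUERY_TEXT = 'ML_TRAIN'
       ORDER BY QUERY_ID DESC LIMIT 1;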
• As of MySQL 8.0.32, and before MySQL 8.2.0, run ML_TRAIN from the first MySQL Client window:
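mysql> CALL sys.ML_TRAIN('mlcorpus_v5.`titanic_train`', 'survived', NULL, @model);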
From the second MySQL Client window, run the following successive queries:
1. Before ML_TRAIN has started. It might take several seconds before ML_TRAIN starts with a large
dataset.
3.15 HeatWave AutoML Routines
Examples in this section are based on the Iris Data Set. See Section 7.3, “Iris Data Set Machine
Learning Quickstart”.
3.15.1 ML_TRAIN
Run the ML_TRAIN routine on a labeled training dataset to produce a trained machine learning model.
ML_TRAIN Syntax
MySQL 8.2.0 adds recommendation models that use implicit feedback to learn and recommend
rankings for users and items.
mysql> CALL sys.ML_TRAIN ('table_name', 'target_column_name', [options], model_handle);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'recommendation'}|NULL
|'datetime_index', 'column'
|'endogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'exogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'model_list', JSON_ARRAY('model'[,'model'] ...)
|'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
|'optimization_metric', 'metric'
|'include_column_list', JSON_ARRAY('column'[,'column'] ...)
|'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
|'contamination', 'contamination factor'
|'users', 'users_column'
|'items', 'items_column'
|'notes', 'notes_text'
|'feedback', {'explicit'|'implicit'}
|'feedback_threshold', 'threshold'
}
MySQL 8.1.0 adds notes to the JSON options, the ExtraTreesClassifier classification
model, and the ExtraTreesRegressor regression model. Forecasting does not require
target_column_name, and it can be set to NULL.
mysql> CALL sys.ML_TRAIN ('table_name', 'target_column_name', [options], model_handle);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'recommendation'}|NULL
|'datetime_index', 'column'
|'endogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'exogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'model_list', JSON_ARRAY('model'[,'model'] ...)
|'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
|'optimization_metric', 'metric'
|'include_column_list', JSON_ARRAY('column'[,'column'] ...)
|'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
|'contamination', 'contamination factor'
|'users', 'users_column'
|'items', 'items_column'
|'notes', 'notes_text'
}
MySQL 8.0.33 adds recommendation models, with the users and items options.
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'recommendation'}|NULL
|'datetime_index', 'column'
|'endogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'exogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'model_list', JSON_ARRAY('model'[,'model'] ...)
|'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
|'optimization_metric', 'metric'
|'include_column_list', JSON_ARRAY('column'[,'column'] ...)
|'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
|'contamination', 'contamination factor'
|'users', 'users_column'
|'items', 'items_column'
}
MySQL 8.0.32 adds support for multivariate endogenous forecasting models, and exogenous
forecasting models. MySQL 8.0.32 also adds support for anomaly detection models.
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'task', {'classification'|'regression'|'forecasting'|'anomaly_detection'}|NULL
|'datetime_index', 'column'
|'endogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'exogenous_variables', JSON_ARRAY('column'[,'column'] ...)
|'model_list', JSON_ARRAY('model'[,'model'] ...)
|'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
|'optimization_metric', 'metric'
|'include_column_list', JSON_ARRAY('column'[,'column'] ...)
|'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
|'contamination', 'contamination factor'
}
Before MySQL 8.0.32:
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'task', {'classification'|'regression'|'forecasting'}|NULL
|'datetime_index', 'column'
|'endogenous_variables', JSON_ARRAY('column')
|'model_list', JSON_ARRAY('model'[,'model'] ...)
|'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
|'optimization_metric', 'metric'
|'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
}
Note
The MySQL account that runs ML_TRAIN cannot have a period character (".")
in its name; for example, a user named 'joesmith'@'%' is permitted to train
a model, but a user named 'joe.smith'@'%' is not. For more information
about this limitation, see Section 3.18, “HeatWave AutoML Limitations”.
The ML_TRAIN routine also runs the ML_EXPLAIN routine with the default Permutation Importance
model for prediction explainers and model explainers. See Section 3.6, “Training Explainers”. To train
other prediction explainers and model explainers, use the ML_EXPLAIN routine with the preferred
explainer after ML_TRAIN. MySQL 8.0.31 does not run the ML_EXPLAIN routine after ML_TRAIN.
ML_EXPLAIN does not support the anomaly_detection and recommendation tasks; for those
tasks, ML_TRAIN does not run ML_EXPLAIN.
ML_TRAIN parameters:
• table_name: The name of the table that contains the labeled training dataset. The table name must
be valid and fully qualified; that is, it must include the schema name, schema_name.table_name.
The table cannot exceed 10 GB, 100 million rows, or 1017 columns. Before MySQL 8.0.29, the
column limit was 900.
• target_column_name: The name of the target column containing ground truth values.
Anomaly detection does not require labeled data, and target_column_name must be set to NULL.
As of MySQL 8.1.0, forecasting does not require target_column_name, and it can be set to NULL.
• model_handle: The name of a user-defined session variable that stores the machine learning
model handle for the duration of the connection. User variables are written as @var_name. Some of
the examples in this guide use @census_model as the variable name. Any valid name for a user-
defined variable is permitted, for example @my_model.
As of MySQL 8.0.31, if the model_handle variable was set to a value before calling ML_TRAIN, that
value is used as the model handle; otherwise, HeatWave AutoML generates one. A model handle must
be unique in the model catalog. When ML_TRAIN finishes executing, retrieve the generated model
handle by querying the session variable. See Section 3.13.8, “Model Handles”.
• options: Optional parameters specified as key-value pairs in JSON format. If an option is not
specified, the default setting is used. If no options are specified, you can specify NULL in place of the
JSON argument.
• task: The machine learning task type. Permitted values are:
• classification: The default. Use this task type if the target is a discrete value.
• regression: Use this task type if the target column is a continuous numerical value.
• forecasting: Use this task type if the target column is a date-time column that requires a
timeseries forecast. The datetime_index and endogenous_variables parameters are
required with the forecasting task.
• datetime_index: For forecasting tasks, the column name for a datetime column that acts as an
index for the forecast variable. The column can be one of the supported datetime column types,
DATETIME, TIMESTAMP, DATE, TIME, and YEAR, or an auto-incrementing index.
The datetime_index for the predict table must not have missing dates after the last date in the
training table. For example, if the training table's datetime_index has the YEAR data type and
ends with year 2023, the predict table must start with year 2024. It cannot start with, for example,
year 2025 or 2030, because that would skip 1 or 6 years, respectively.
When options does not include exogenous_variables, the datetime_index of the predict
table can overlap with the training table. This supports back testing.
The years of datetime_index dates must be between 1678 and 2261. Dates outside this range in
any part of the training table or predict table cause an error. The last date in the training table plus
the predict table length must also remain inside the valid year range.
For example, if the datetime_index in the training table has the YEAR data type, and the last date is
year 2023, the predict table can contain at most 238 rows: 2261 minus 2023 equals 238.
Univariate forecasting models support a single numeric column, specified as a JSON_ARRAY. This
column must also be specified as the target_column_name, because that field is required, but it
is not used in that location.
If options includes exogenous_variables, ML_TRAIN considers all supported models during the
algorithm selection stage, including models that do not support exogenous_variables.
If options also includes include_column_list, ML_TRAIN considers only those models that
support exogenous_variables.
• model_list: The type of model to be trained. If more than one model is specified, the best model
type is selected from the list. See Section 3.15.11, “Model Types”.
This option cannot be used together with the exclude_model_list option, and it is not
supported for anomaly_detection tasks.
• exclude_model_list: Model types that should not be trained. Specified model types are
excluded from consideration during model selection. See Section 3.15.11, “Model Types”.
This option cannot be specified together with the model_list option, and it is not supported for
anomaly_detection tasks.
• contamination: The optional contamination factor for use with the anomaly_detection task.
0 < contamination < 0.5. The default value is 0.1.
• users: For the recommendation task, the name of the column containing the user data.
This must be a valid column name, and it must be different from the items column name.
• items: For the recommendation task, the name of the column containing the item data.
This must be a valid column name, and it must be different from the users column name.
• feedback: The type of feedback for a recommendation model: explicit (the default) or
implicit.
Syntax Examples
• An ML_TRAIN example that uses the classification task option implicitly (classification is
the default if not specified explicitly):
mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class',
NULL, @iris_model);
• An ML_TRAIN example that specifies the classification task type explicitly, and sets a model
handle instead of letting HeatWave AutoML generate one:
mysql> SET @iris_model = 'iris_manual';
mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class',
JSON_OBJECT('task', 'classification'),
@iris_model);
• An ML_TRAIN example that specifies the model_list option. This example trains either an
XGBClassifier or LGBMClassifier model.
mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class',
JSON_OBJECT('task','classification',
'model_list', JSON_ARRAY('XGBClassifier', 'LGBMClassifier')),
@iris_model);
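• A hypothetical ML_TRAIN example for the anomaly_detection task; the table name is
illustrative, target_column_name is NULL because anomaly detection does not use labeled data,
and the contamination option is optional:
mysql> CALL sys.ML_TRAIN('ml_data.machine_logs', NULL,
          JSON_OBJECT('task', 'anomaly_detection', 'contamination', '0.05'),
          @anomaly_model);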
3.15.2 ML_EXPLAIN
Running the ML_EXPLAIN routine on a model and dataset trains a prediction explainer and model
explainer, and adds a model explanation to the model catalog.
MySQL 8.0.33 introduces recommendation models. ML_EXPLAIN does not support recommendation
models, and a call with a recommendation model will produce an error.
MySQL 8.0.32 introduces anomaly detection. ML_EXPLAIN does not support anomaly detection, and a
call with an anomaly detection model will produce an error.
ML_EXPLAIN Syntax
mysql> CALL sys.ML_EXPLAIN ('table_name', 'target_column_name',
model_handle_variable, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'model_explainer', {'permutation_importance'|'partial_dependence'|'shap'|'fast_shap'}| NULL
|'prediction_explainer', {'permutation_importance'|'shap'}
|'columns_to_explain', JSON_ARRAY('column'[,'column'] ...)
|'target_value', 'target_class'
}
Run the ML_EXPLAIN routine before the ML_EXPLAIN_ROW and ML_EXPLAIN_TABLE routines. The
ML_TRAIN routine also runs the ML_EXPLAIN routine with the default Permutation Importance
model. MySQL 8.0.31 does not run the ML_EXPLAIN routine after the ML_TRAIN routine. It is only
necessary to run the ML_EXPLAIN routine explicitly with MySQL 8.0.31, or to train prediction
explainers and model explainers with a non-default explainer. See Section 3.6, “Training Explainers”.
ML_EXPLAIN parameters:
• table_name: The name of the table that contains the labeled training dataset. The table name must
be valid and fully qualified; that is, it must include the schema name (schema_name.table_name).
Use NULL for help. Use the dataset that the model was trained on; running ML_EXPLAIN on a dataset
that the model has not been trained on produces errors or unreliable explanations.
• target_column_name: The name of the target column in the training dataset containing ground
truth values.
• model_handle: A string containing the model handle for the model in the model catalog. Use NULL
for help. The model explanation is stored in this model metadata. The model must be loaded first, for
example:
mysql> CALL sys.ML_MODEL_LOAD('ml_data.iris_train_user1_1636729526', NULL);
If you run ML_EXPLAIN again with the same model handle and model explainer, the model
explanation field is overwritten with the new result.
• options: Optional parameters specified as key-value pairs in JSON format. If an option is not
specified, the default setting is used. If you specify NULL in place of the JSON argument, the default
Permutation Importance model explainer is trained, and no prediction explainer is trained. The
available options are:
• model_explainer: The model explainer to train. Valid values are:
• permutation_importance: The default Permutation Importance model explainer.
• shap: The SHAP model explainer, which produces global feature importance values based on
Shapley values.
• fast_shap: The Fast SHAP model explainer, which is a subsampling version of the SHAP
model explainer that usually has a faster runtime.
• partial_dependence: Explains how changing the values in one or more columns will change
the value predicted by the model. The following additional arguments are required for the
partial_dependence model explainer:
• columns_to_explain: a JSON array of one or more column names in the table specified
by table_name. The model explainer explains how changing the value in this column or
columns affects the model.
• target_value: a valid value that the target column containing ground truth values, as
specified by target_column_name, can take.
• prediction_explainer: The prediction explainer to train. Valid values are:
• permutation_importance: The default Permutation Importance prediction explainer.
• shap: The SHAP prediction explainer, which produces global feature importance values based
on Shapley values.
Syntax Examples
• Load the model first:
mysql> CALL sys.ML_MODEL_LOAD('ml_data.iris_train_user1_1636729526', NULL);
• Running ML_EXPLAIN to train the SHAP prediction explainer and the Fast SHAP model explainer:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class',
'ml_data.iris_train_user1_1636729526',
JSON_OBJECT('model_explainer', 'fast_shap', 'prediction_explainer', 'shap'));
• Running ML_EXPLAIN with NULL for the options trains the default Permutation Importance model
explainer and no prediction explainer:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class',
'ml_data.iris_train_user1_1636729526', NULL);
• Running ML_EXPLAIN to train the Partial Dependence model explainer (which requires extra
options) and the SHAP prediction explainer:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class', @iris_model,
JSON_OBJECT('columns_to_explain', JSON_ARRAY('sepal width'),
'target_value', 'Iris-setosa', 'model_explainer', 'partial_dependence',
'prediction_explainer', 'shap'));
• Viewing the model explanation, in this case produced by the Permutation Importance model
explainer:
mysql> SELECT model_explanation FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle = @iris_model;
+----------------------------------------------------------------------------------------------------
| model_explanation
+----------------------------------------------------------------------------------------------------
| {"permutation_importance": {"petal width": 0.5926, "sepal width": 0.0, "petal length": 0.0423, "sep
+--------------------------------------------------------------------------------------------------------
3.15.3 ML_MODEL_IMPORT
Run the ML_MODEL_IMPORT routine to import a pre-trained model into the model catalog. MySQL 8.1.0
supports the import of HeatWave AutoML format models. MySQL 8.0.31 supports the import of ONNX
(Open Neural Network Exchange) format models. After import, all the HeatWave AutoML routines can
be used with the ONNX model.
Models in ONNX format (.onnx) cannot be loaded directly into a MySQL table. They require string
serialization and conversion to Base64 binary encoding. Before running ML_MODEL_IMPORT, follow the
instructions in Section 3.13.2, “ONNX Model Import” to carry out the required pre-processing and then
load the model into a temporary table for import to HeatWave.
ML_MODEL_IMPORT Syntax
mysql> CALL sys.ML_MODEL_IMPORT (model_object, model_metadata, model_handle);
ML_MODEL_IMPORT parameters:
• model_object: The preprocessed ONNX model object, which must be string serialized and
BASE64 encoded. See Section 3.13.2, “ONNX Model Import”.
• model_metadata: An optional JSON object literal that contains key-value pairs with model
metadata. See Section 3.15.12, “Model Metadata”.
• model_handle: The model handle for the model. The model is stored in the model catalog under
this name and accessed using it. Specify a model handle that does not already exist in the model
catalog.
Syntax Examples
• An example that imports a HeatWave AutoML model with metadata:
mysql> SET @hwml_model = "hwml_model";
Query OK, 0 rows affected (0.00 sec)
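One possible shape of the import call, assuming @model_object holds the serialized model object
and using an illustrative metadata object:
mysql> CALL sys.ML_MODEL_IMPORT(@model_object, JSON_OBJECT('task', 'classification'), @hwml_model);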
3.15.4 ML_PREDICT_ROW
ML_PREDICT_ROW generates predictions for one or more rows of unlabeled data specified in JSON
format. Invoke ML_PREDICT_ROW with a SELECT statement.
ML_PREDICT_ROW requires a loaded model to run. See Section 3.13.3, “Loading Models”.
ML_PREDICT_ROW Syntax
MySQL 8.2.0 adds recommendation models that use implicit feedback to learn and recommend
rankings for users and items.
mysql> SELECT sys.ML_PREDICT_ROW(input_data, model_handle, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}
|'remove_seen', {'true'|'false'}
}
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}
}
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'}|NULL
}
MySQL 8.0.32 added an options parameter in JSON format that supports the anomaly_detection
task. For all other tasks, set this parameter to NULL.
MySQL 8.0.32 allows a call to ML_PREDICT_ROW to include columns that were not present during
ML_TRAIN. A table can include extra columns, and still use the HeatWave AutoML model. This allows
side-by-side comparisons of target column labels, ground truth, and predictions in the same table.
ML_PREDICT_ROW ignores any extra columns, and appends them to the results.
mysql> SELECT sys.ML_PREDICT_ROW(input_data, model_handle, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
}
ML_PREDICT_ROW parameters:
• input_data: Specifies the data to generate predictions for. Data must be specified in JSON key-
value format, where the key is a column name. The column names must match the feature column
names in the table used to train the model.
To run ML_PREDICT_ROW on multiple rows of data, specify the columns as key-value pairs in JSON
format and select from a table:
format and select from a table:
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("output_col_name", schema.`input_col_name`,
"output_col_name", schema.`input_col_name`, ...),
model_handle, options)
FROM input_table_name LIMIT N;
• model_handle: Specifies the model handle or a session variable that contains the model handle.
• options: A set of options in JSON format. As of MySQL 8.0.32, this parameter only supports the
anomaly_detection task. For all other tasks, set this parameter to NULL. Before MySQL 8.0.32,
ignore this parameter.
• threshold: The optional threshold for use with the anomaly_detection task to convert
anomaly scores to 1: an anomaly or 0: normal. 0 < threshold < 1. The default value is (1 -
contamination)-th percentile of all the anomaly scores.
• topk: Use with the recommendation task to specify the number of recommendations to provide.
A positive integer. The default is 3.
• recommend: Use with the recommendation task to specify what to recommend. Permitted
values are:
• ratings: Use this option to predict ratings. This is the default value.
The input table must contain at least two columns with the same names as the user column and
item column from the training model.
The remaining values (items, users, users_to_items, items_to_users, items_to_items,
and users_to_users) each require the input table to at least contain a column with the same
name as the user column, the item column, or both, from the training model, depending on the
direction of the recommendation.
• remove_seen: If the input table overlaps with the training table, and remove_seen is true, then
the model will not repeat existing interactions. The default is true. Set remove_seen to false to
repeat existing interactions from the training table.
Syntax Examples
• To run ML_PREDICT_ROW on a single row of data, use a SELECT statement. The results include the
ml_results field, which uses JSON format:
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,
"petal length", 6.3, "petal width", 1.8), @iris_model, NULL);
+----------------------------------------------------------------------------------------------------
| sys.ML_PREDICT_ROW('{"sepal length": 7.3, "sepal width": 2.9, "petal length": 6.3, "petal width": 1
+----------------------------------------------------------------------------------------------------
| {"Prediction": "Iris-virginica", "ml_results": "{'predictions': {'class': 'Iris-virginica'}, 'proba
+----------------------------------------------------------------------------------------------------
1 row in set (1.12 sec)
Before MySQL 8.0.32, the ML_PREDICT_ROW routine does not include options, and the results do
not include the ml_results field:
mysql> SELECT sys.ML_PREDICT_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,
"petal length", 6.3, "petal width", 1.8), @iris_model);
+---------------------------------------------------------------------------+
| sys.ML_PREDICT_ROW(@row_input, @iris_model) |
+---------------------------------------------------------------------------+
| {"Prediction": "Iris-virginica", "petal width": 1.8, "sepal width": 2.9, |
| "petal length": 6.3, "sepal length": 7.3} |
+---------------------------------------------------------------------------+
3.15.5 ML_PREDICT_TABLE
ML_PREDICT_TABLE generates predictions for an entire table of unlabeled data and saves the results
to an output table. HeatWave AutoML performs the predictions in parallel.
A loaded model is required to run ML_PREDICT_TABLE. See Section 3.13.3, “Loading Models”.
• If the input table has a primary key, the output table will have the same primary key.
• If the input table does not have a primary key, the output table will have a new primary key named
_id that auto increments.
The input table must not have a column with the name _id that is not a primary key.
As of MySQL 8.0.32, the returned table also includes the ml_results column, which contains the
prediction results and the data. MySQL 8.1.0 adds support for text data types. The combination of
results and data must be less than 65,532 characters.
MySQL 8.2.0 adds recommendation models that use implicit feedback to learn and recommend
rankings for users and items. MySQL 8.2.0 also adds batch processing with the batch_size option.
ML_PREDICT_TABLE Syntax
MySQL 8.2.0 adds more options to support the recommendation task and batch processing.
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}
|'remove_seen', {'true'|'false'}
|'batch_size', 'N'
}
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}
}
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'recommend', {'ratings'|'items'|'users'}|NULL
}
MySQL 8.0.32 added an options parameter in JSON format that supports the anomaly_detection
task. For all other tasks, set this parameter to NULL.
MySQL 8.0.32 allows a call to ML_PREDICT_TABLE to include columns that were not present during
ML_TRAIN. A table can include extra columns, and still use the HeatWave AutoML model. This allows
side-by-side comparisons of target column labels, ground truth, and predictions in the same table.
ML_PREDICT_TABLE ignores any extra columns, and appends them to the results.
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
}
ML_PREDICT_TABLE parameters:
• table_name: Specifies the fully qualified name of the input table (schema_name.table_name).
The input table should contain the same feature columns as the training dataset but no target
column.
• model_handle: Specifies the model handle or a session variable containing the model handle
• output_table_name: Specifies the table where predictions are stored. The table is created if it
does not exist. A fully qualified table name must be specified (schema_name.table_name). If the
table already exists, an error is returned.
• options: A set of options in JSON format. As of MySQL 8.0.32, this parameter only supports the
anomaly_detection task. For all other tasks, set this parameter to NULL. Before MySQL 8.0.32,
ignore this parameter.
• threshold: The optional threshold for use with the anomaly_detection task to convert
anomaly scores to 1: an anomaly or 0: normal. 0 < threshold < 1. The default value is (1 -
contamination)-th percentile of all the anomaly scores.
• topk: The optional top K rows for use with the anomaly_detection and recommendation
tasks. A positive integer between 1 and the table length.
For the anomaly_detection task, the results include the top K rows with the highest anomaly
scores. If topk is not set, ML_PREDICT_TABLE uses threshold.
For an anomaly_detection task, do not set both threshold and topk. Use threshold or
topk, or set options to NULL.
For the recommendation task, the number of recommendations to provide. The default is 3.
A recommendation task with implicit feedback can use both threshold and topk.
• recommend: Use with the recommendation task to specify what to recommend. Permitted
values are:
• ratings: Use this option to predict ratings. This is the default value.
The input table must contain at least two columns with the same names as the user column and
item column from the training model.
The remaining values (items, users, users_to_items, items_to_users, items_to_items,
and users_to_users) each require the input table to at least contain a column with the same
name as the user column, the item column, or both, from the training model, depending on the
direction of the recommendation.
• remove_seen: If the input table overlaps with the training table, and remove_seen is true, then
the model will not repeat existing interactions. The default is true. Set remove_seen to false to
repeat existing interactions from the training table.
• batch_size: The size of each batch. 1 ≤ batch_size ≤ 1,000. The default is 1,000, which typically
provides the best performance.
Syntax Examples
• A typical usage example that specifies the fully qualified name of the table to generate predictions
for, the session variable containing the model handle, and the fully qualified output table name:
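A sketch using the iris tables from other examples in this guide; the output table name is illustrative:
mysql> CALL sys.ML_PREDICT_TABLE('ml_data.iris_test', @iris_model,
          'ml_data.iris_predictions', NULL);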
To view ML_PREDICT_TABLE results, query the output table. The table shows the predictions and
the feature column values used to make each prediction. The table includes the primary key, _id,
and the ml_results column, which uses JSON format:
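mysql> SELECT * FROM ml_data.iris_predictions LIMIT 3;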
Before MySQL 8.0.32, the ML_PREDICT_TABLE routine does not include options, and the results do
not include the ml_results column.
3.15.6 ML_EXPLAIN_ROW
The ML_EXPLAIN_ROW routine generates explanations for one or more rows of unlabeled data.
ML_EXPLAIN_ROW is invoked using a SELECT statement.
A loaded model, trained with a prediction explainer, is required to run ML_EXPLAIN_ROW. See
Section 3.13.3, “Loading Models” and Section 3.15.2, “ML_EXPLAIN”.
MySQL 8.0.32 introduces anomaly detection. ML_EXPLAIN_ROW does not support anomaly detection,
and a call with an anomaly detection model will produce an error.
MySQL 8.0.32 allows a call to ML_EXPLAIN_ROW to include columns that were not present during
ML_TRAIN. A table can include extra columns, and still use the HeatWave AutoML model. This allows
side by side comparisons of target column labels, ground truth, and explanations in the same table.
ML_EXPLAIN_ROW ignores any extra columns, and appends them to the results.
ML_EXPLAIN_ROW Syntax
mysql> SELECT sys.ML_EXPLAIN_ROW(input_data, model_handle, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'prediction_explainer', {'permutation_importance'|'shap'}|NULL
}
ML_EXPLAIN_ROW parameters:
• input_data: Specifies the data to generate explanations for. Data must be specified in JSON key-
value format, where the key is a column name. The column names must match the feature column
names in the table used to train the model. A single row of data can be specified as follows:
mysql> SELECT sys.ML_EXPLAIN_ROW(JSON_OBJECT("column_name", value, "column_name", value, ...),
model_handle, options);
You can run ML_EXPLAIN_ROW on multiple rows of data by specifying the columns in JSON key-
value format and selecting from an input table:
mysql> SELECT sys.ML_EXPLAIN_ROW(JSON_OBJECT("output_col_name", schema.`input_col_name`,
output_col_name", schema.`input_col_name`, ...),
model_handle, options)
FROM input_table_name
LIMIT N;
• model_handle: Specifies the model handle or a session variable containing the model handle.
• prediction_explainer: The name of the prediction explainer that you have trained for this
model using ML_EXPLAIN. Valid values are:
• permutation_importance: The default Permutation Importance prediction explainer.
• shap: The SHAP prediction explainer, which produces global feature importance values based
on Shapley values.
Syntax Examples
• Run ML_EXPLAIN_ROW on a single row of data with the default Permutation Importance prediction
explainer. The results include the ml_results field, which uses JSON format:
mysql> SELECT sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,
"petal length", 6.3, "petal width", 1.8), @iris_model,
JSON_OBJECT('prediction_explainer', 'permutation_importance'));
+--------------------------------------------------------------------------------------------------------
| sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9, "petal length", 6.3, "petal wid
| @iris_model, JSON_OBJECT('prediction_explainer', 'permutation_importance'))
+--------------------------------------------------------------------------------------------------------
| {"Notes": "petal width (1.8) had the largest impact towards predicting Iris-virginica",
Before MySQL 8.0.32, the results do not include the ml_results field:
+------------------------------------------------------------------------------+
| sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9, |
| "petal length", 6.3, "petal width", 1.8), @iris_model, |
| JSON_OBJECT('prediction_explainer', 'permutation_importance')) |
+------------------------------------------------------------------------------+
| {"Prediction": "Iris-virginica", "petal width": 1.8, "sepal width": 2.9, |
| "petal length": 6.3, "sepal length": 7.3, "petal width_attribution": 0.73, |
| "petal length_attribution": 0.57} |
+------------------------------------------------------------------------------+
3.15.7 ML_EXPLAIN_TABLE
ML_EXPLAIN_TABLE explains predictions for an entire table of unlabeled data and saves results to an
output table.
A loaded model, trained with a prediction explainer, is required to run ML_EXPLAIN_TABLE. See
Section 3.13.3, “Loading Models” and Section 3.15.2, “ML_EXPLAIN”.
• If the input table has a primary key, the output table will have the same primary key.
• If the input table does not have a primary key, the output table will have a new primary key named
_id that auto increments.
The input table must not have a column with the name _id that is not a primary key.
MySQL 8.0.32 introduces anomaly detection. ML_EXPLAIN_TABLE does not support anomaly
detection, and a call with an anomaly detection model will produce an error.
MySQL 8.0.32 allows a call to ML_EXPLAIN_TABLE to include columns that were not present during
ML_TRAIN. A table can include extra columns, and still use the HeatWave AutoML model. This allows
side by side comparisons of target column labels, ground truth, and explanations in the same table.
ML_EXPLAIN_TABLE ignores any extra columns, and appends them to the results.
ML_EXPLAIN_TABLE Syntax
MySQL 8.2.0 adds an option to support batch processing.
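The call takes the same shape as ML_PREDICT_TABLE:
mysql> CALL sys.ML_EXPLAIN_TABLE(table_name, model_handle, output_table_name, [options]);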
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'prediction_explainer', {'permutation_importance'|'shap'}|NULL
|'batch_size', 'N'
}
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'prediction_explainer', {'permutation_importance'|'shap'}|NULL
}
ML_EXPLAIN_TABLE parameters:
• table_name: Specifies the fully qualified name of the input table (schema_name.table_name).
The input table should contain the same feature columns as the table used to train the model but no
target column.
• model_handle: Specifies the model handle or a session variable containing the model handle.
• output_table_name: Specifies the table where explanation data is stored. The table is created if
it does not exist. A fully qualified table name must be specified (schema_name.table_name). If the
table already exists, an error is returned.
• prediction_explainer: The name of the prediction explainer that you have trained for this
model using ML_EXPLAIN. Valid values are:
• permutation_importance: The default Permutation Importance prediction explainer.
• shap: The SHAP prediction explainer, which produces global feature importance values based
on Shapley values.
• batch_size: The size of each batch. 1 ≤ batch_size ≤ 100. The default is 100, which typically
provides the best performance.
Syntax Examples
• The following example generates explanations for a table of data with the default Permutation
Importance prediction explainer. The ML_EXPLAIN_TABLE call specifies the fully qualified name of
the table to generate explanations for, the session variable containing the model handle, and the fully
qualified output table name.
mysql> CALL sys.ML_EXPLAIN_TABLE('ml_data.iris_test', @iris_model,
'ml_data.iris_explanations',
JSON_OBJECT('prediction_explainer', 'permutation_importance'));
To view ML_EXPLAIN_TABLE results, query the output table. The SELECT statement retrieves
explanation data from the output table. The table includes the primary key, _id, and the
ml_results column, which uses JSON format:
mysql> SELECT * FROM ml_data.iris_explanations LIMIT 5;
+-----+--------------+-------------+--------------+-------------+-----------------+-----------------+----
| _id | sepal length | sepal width | petal length | petal width | class | Prediction | Not
+-----+--------------+-------------+--------------+-------------+-----------------+-----------------+----
| 1 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica | Iris-virginica | pet
| 2 | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor | Iris-versicolor | pet
| 3 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica | Iris-versicolor | pet
Before MySQL 8.0.32, the output table does not include the ml_results column:
mysql> SELECT * FROM ml_data.iris_explanations LIMIT 3;
*************************** 1. row ***************************
sepal length: 7.3
sepal width: 2.9
petal length: 6.3
petal width: 1.8
Prediction: Iris-virginica
petal length_attribution: 0.57
petal width_attribution: 0.73
*************************** 2. row ***************************
sepal length: 6.1
sepal width: 2.9
petal length: 4.7
petal width: 1.4
Prediction: Iris-versicolor
petal length_attribution: 0.14
petal width_attribution: 0.6
*************************** 3. row ***************************
sepal length: 6.3
sepal width: 2.8
petal length: 5.1
petal width: 1.5
Prediction: Iris-virginica
petal length_attribution: -0.25
petal width_attribution: 0.31
3 rows in set (0.0006 sec)
3.15.8 ML_SCORE
ML_SCORE scores a model by generating predictions using the feature columns in a labeled dataset as
input and comparing the predictions to ground truth values in the target column of the labeled dataset.
The dataset used with ML_SCORE should have the same feature columns as the dataset used to train
the model but the data should be different; for example, you might reserve 20 to 30 percent of the
labeled training data for scoring.
ML_SCORE Syntax
MySQL 8.2.0 adds recommendation models that use implicit feedback to learn and recommend
rankings for users and items.
mysql> CALL sys.ML_SCORE(table_name, target_column_name, model_handle, metric, score, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
|'remove_seen', {'true'|'false'}
}
As of MySQL 8.1.0, forecasting does not require target_column_name, and it can be set to NULL.
MySQL 8.0.32 added an options parameter in JSON format that supports the anomaly_detection
task. For all other tasks, set this parameter to NULL.
mysql> CALL sys.ML_SCORE(table_name, target_column_name, model_handle, metric, score, [options]);
options: {
JSON_OBJECT('key','value'[,'key','value'] ...)
'key','value':
|'threshold', 'N'
|'topk', 'N'
}
ML_SCORE parameters:
• table_name: Specifies the fully qualified name of the table used to compute model quality
(schema_name.table_name). The table must contain the same columns as the training dataset.
• target_column_name: Specifies the name of the target column containing ground truth values.
As of MySQL 8.1.0, forecasting does not require target_column_name, and it can be set to NULL.
• model_handle: Specifies the model handle or a session variable containing the model handle.
• metric: Specifies the name of the metric. See Section 3.15.13, “Optimization and Scoring Metrics”.
• score: Specifies the user-defined variable name for the computed score. The ML_SCORE routine
populates the variable. User variables are written as @var_name. The examples in this guide use
@score as the variable name. Any valid name for a user-defined variable is permitted, for example
@my_score.
• options: A set of options in JSON format. As of MySQL 8.0.32, this parameter only supports the
anomaly detection task. For all other tasks, set this parameter to NULL. Before MySQL 8.0.32, ignore
this parameter.
• threshold: The optional threshold for use with the anomaly_detection and
recommendation tasks.
Use with the anomaly_detection task to convert anomaly scores to 1: an anomaly or 0: normal.
0 < threshold < 1. The default value is (1 - contamination)-th percentile of all the anomaly
scores.
Use with the recommendation task and ranking metrics to define positive feedback and a
relevant sample. All rankings at or above the threshold are treated as positive feedback;
all rankings below the threshold are treated as negative feedback. The default value is 1.
• topk: The optional top K rows for use with the anomaly_detection and recommendation
tasks. A positive integer between 1 and the table length.
For the anomaly_detection task the results include the top K rows with the highest anomaly
scores. It is an integer between 1 and the table length. If topk is not set, ML_SCORE uses
threshold.
For an anomaly_detection task, do not set both threshold and topk. Use threshold or
topk, or set options to NULL.
For the recommendation task and ranking metrics, the number of recommendations to provide.
The default is 3.
A recommendation task and ranking metrics can use both threshold and topk.
• remove_seen: If the input table overlaps with the training table, and remove_seen is true, then
the model will not repeat existing interactions. The default is true. Set remove_seen to false to
repeat existing interactions from the training table.
Syntax Example
• The following example runs ML_SCORE on the ml_data.iris_validate table to determine model
quality:
mysql> CALL sys.ML_SCORE('ml_data.iris_validate', 'class', @iris_model,
'balanced_accuracy', @score, NULL);
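The routine stores the computed score in the @score session variable; retrieve it with:
mysql> SELECT @score;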
3.15.9 ML_MODEL_LOAD
The ML_MODEL_LOAD routine loads a model from the model catalog. A model remains loaded until the
model is unloaded using the ML_MODEL_UNLOAD routine or until HeatWave AutoML is restarted by a
HeatWave Cluster restart.
As of MySQL 8.1.0, a user can only load their own model, and the user parameter is ignored.
To share models with other users, see: Section 3.13.10, “Sharing Models”.
You can load multiple models but to avoid taking up too much space in memory, the number of loaded
models should be limited to three.
ML_MODEL_LOAD Syntax
mysql> CALL sys.ML_MODEL_LOAD(model_handle, user);
ML_MODEL_LOAD parameters:
• model_handle: Specifies the model handle or a session variable containing the model handle. For
how to look up a model handle, see Section 3.13.8, “Model Handles”.
• user: The MySQL user name of the model owner. Specify NULL if the model owner is the current
user.
Syntax Examples
• An ML_MODEL_LOAD call with NULL specified, indicating that the model belongs to the user
executing the ML_MODEL_LOAD call:
mysql> CALL sys.ML_MODEL_LOAD('ml_data.iris_train_user1_1636729526', NULL);
• An ML_MODEL_LOAD call that specifies a session variable containing the model handle:
mysql> CALL sys.ML_MODEL_LOAD(@iris_model, NULL);
• Before MySQL 8.1.0, an ML_MODEL_LOAD call that specifies the model owner:
mysql> CALL sys.ML_MODEL_LOAD('ml_data.iris_train_user1_1636729526', 'user1');
3.15.10 ML_MODEL_UNLOAD
ML_MODEL_UNLOAD unloads a model from HeatWave AutoML.
ML_MODEL_UNLOAD Syntax
mysql> CALL sys.ML_MODEL_UNLOAD(model_handle);
ML_MODEL_UNLOAD parameters:
• model_handle: Specifies the model handle or a session variable containing the model handle.
Syntax Examples
• An ML_MODEL_UNLOAD call that specifies the model handle:
mysql> CALL sys.ML_MODEL_UNLOAD('ml_data.iris_train_user1_1636729526');
• An ML_MODEL_UNLOAD call that specifies a session variable containing the model handle:
mysql> CALL sys.ML_MODEL_UNLOAD(@iris_model);
3.15.11 Model Types
• Classification models:
• LogisticRegression.
• GaussianNB.
• DecisionTreeClassifier.
• RandomForestClassifier.
• XGBClassifier.
• LGBMClassifier.
• SVC.
• LinearSVC.
• Regression models:
• DecisionTreeRegressor.
• RandomForestRegressor.
• LinearRegression.
• LGBMRegressor.
• XGBRegressor.
• SVR.
• LinearSVR.
• Forecasting models:
• NaiveForecaster.
• ExpSmoothForecaster.
• Recommendation models:
• Recommendation models that rate users or items to use with explicit feedback.
• Baseline.
• CoClustering.
• NormalPredictor.
• SlopeOne.
• SVD.
• SVDpp.
• NMF.
• Recommendation models that rank users or items to use with implicit feedback.
3.15.12 Model Metadata
MySQL 8.1.0 adds several fields that replace deprecated columns in Section 3.13.1.1, “The Model
Catalog Table”. It also adds fields that support ONNX model import. See: Section 3.13.2, “ONNX Model
Import”.
• task: string
The task type specified in the ML_TRAIN query. The default is classification when used with
ML_MODEL_IMPORT. This was added in MySQL 8.1.0.
• build_timestamp: number
A timestamp indicating when the model was created, in UNIX epoch time. A model is created when
the ML_TRAIN routine finishes executing. This was added in MySQL 8.1.0.
• target_column_name: string
The name of the column in the training table that was specified as the target column. This was added
in MySQL 8.1.0.
• train_table_name: string
The name of the input table specified in the ML_TRAIN query. This was added in MySQL 8.1.0.
The feature columns used to train the model. This was added in MySQL 8.1.0.
The model explanation generated during training. See Section 3.13.7, “Model Explanations”. This
was added in MySQL 8.1.0.
• notes: string
The notes specified in the ML_TRAIN query. It also records any error messages that occur during
model training. This was added in MySQL 8.1.0.
• format: string
The model serialization format: HWMLv1.0 for a HeatWave AutoML model or ONNX for an ONNX
model. The default is ONNX when used with ML_MODEL_IMPORT.
The status of the model. The default is Ready when used with ML_MODEL_IMPORT.
• Creating
• Ready
• Error
Either training was canceled or an error occurred during training. Any error message appears in
the notes column. As of MySQL 8.1.0, the error message also appears in model_metadata
notes.
• model_quality: string
The quality of the model object. Either low or high. This was added in MySQL 8.1.0.
• training_time: number
• algorithm_name: string
• training_score: number
• n_rows: number
• n_columns: number
• n_selected_columns: number
• n_selected_rows: number
• optimization_metric: string
• contamination: number
The contamination factor for a model. This was added in MySQL 8.1.0.
The options specified in the ML_TRAIN query. This was added in MySQL 8.1.0.
Internal task dependent parameters used during ML_TRAIN. This was added in MySQL 8.1.0.
• onnx_inputs_info:
Information about the format of the ONNX model inputs. This was added in MySQL 8.1.0, and only
applies to ONNX models. See Section 3.13.2, “ONNX Model Import”.
Do not provide onnx_inputs_info if the model is not ONNX format. This will cause an error.
This maps the data type of each column to an ONNX model data type. The default value is:
JSON_OBJECT("tensor(int64)": "int64", "tensor(float)": "float32", "tensor(string)": "str_")
• onnx_outputs_info:
Information about the format of the ONNX model outputs. This was added in MySQL 8.1.0, and only
applies to ONNX models. See Section 3.13.2, “ONNX Model Import”.
Do not provide onnx_outputs_info if the model is not ONNX format, or if task is NULL. This will
cause an error.
• predictions_name: string
This name determines which of the ONNX model outputs is associated with predictions.
• prediction_probabilities_name: string
This name determines which of the ONNX model outputs is associated with prediction
probabilities.
3.15.13 Optimization and Scoring Metrics
For more information about scoring metrics, see: scikit-learn.org. For more information about
forecasting metrics, see: sktime.org and statsmodels.org.
• Classification metrics
• Binary-only metrics
• f1
• precision
• recall
• roc_auc
• accuracy
• f1_macro
• f1_micro
• f1_weighted
• neg_log_loss
• precision_macro
• precision_micro
• precision_weighted
• recall_macro
• recall_micro
• recall_weighted
• Regression metrics
• neg_mean_absolute_error
• neg_mean_squared_error
• neg_mean_squared_log_error
• neg_median_absolute_error
• r2
• Forecasting metrics
• neg_max_absolute_error
• neg_mean_absolute_error
• neg_mean_abs_scaled_error
• neg_mean_squared_error
• neg_root_mean_squared_error
• neg_root_mean_squared_percent_error
• neg_sym_mean_abs_percent_error
• Anomaly detection metrics, used with the threshold option or the topk option:
• roc_auc
• accuracy
• balanced_accuracy
• f1
• neg_log_loss
• precision
• recall
• precision_k is an Oracle implementation of a common metric for fraud detection and lead
scoring.
• Rating metrics to use with recommendation models that use explicit feedback.
• neg_mean_absolute_error
• neg_mean_squared_error
• neg_root_mean_squared_error
• r2
• Ranking metrics to use with recommendation models that use implicit feedback.
If a user and item combination in the input table is not unique, the input table is grouped by user
and item columns, and the result is the average of the rankings.
If the input table overlaps with the training table, and remove_seen is true, which is the default
setting, then the model will not repeat a recommendation, and it ignores the overlapping items.
• precision_at_k is the number of relevant topk recommended items divided by the total
topk recommended items for a particular user:
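precision_at_k = (relevant topk recommended items for the user) / (total topk recommended items
for the user)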
For example, if 7 out of 10 items are relevant for a user, and topk is 10, then
precision_at_k is 70%.
The precision_at_k value for the input table is the average for all users. If remove_seen is
true, the default setting, then the average only includes users for whom the model can make a
recommendation. If a user has implicitly ranked every item in the training table, the model cannot
recommend any more items for that user, and such users are excluded from the average calculation
if remove_seen is true.
• recall_at_k is the number of relevant topk recommended items divided by the total relevant
items for a particular user:
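recall_at_k = (relevant topk recommended items for the user) / (total relevant items for the user)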
For example, suppose there is a total of 20 relevant items for a user. If topk is 10, and 7 of the
recommended items are relevant, then recall_at_k is 7 / 20 = 35%.
The recall_at_k value for the input table is the average for all users.
• hit_ratio_at_k is the number of relevant topk recommended items divided by the total
relevant items for all users:
hit_ratio_at_k = (relevant topk recommended items, all users) / (total relevant items, all users)
The average of hit_ratio_at_k for the input table is recall_at_k. If there is only one user,
hit_ratio_at_k is the same as recall_at_k.
• ndcg_at_k is normalized discounted cumulative gain, which is the discounted cumulative gain
of the relevant topk recommended items divided by the discounted cumulative gain of the
relevant topk items for a particular user.
The discounted gain of an item is the true rating divided by log2(r+1) where r is the ranking of
this item in the relevant topk items. If a user prefers a particular item, the rating is higher, and
the ranking is lower.
The ndcg_at_k value for the input table is the average for all users.
Supported Data Types
HeatWave AutoML supports the following column data types:
• FLOAT
• DOUBLE
• INT
• TINYINT
• SMALLINT
• MEDIUMINT
• BIGINT
• INT UNSIGNED
• TINYINT UNSIGNED
• SMALLINT UNSIGNED
• MEDIUMINT UNSIGNED
• BIGINT UNSIGNED
• VARCHAR
• CHAR
• The ML_PREDICT_TABLE ml_results column contains the prediction results and the data. This
combination must be less than 65,532 characters.
• HeatWave AutoML does not support text columns with NULL values.
• HeatWave AutoML does not support recommendation tasks with a text column.
Before MySQL 8.0.30, remove temporal types or convert them to CHAR or VARCHAR columns, or split
them into separate day, month, and year columns and define them as numeric or string types.
DECIMAL data type columns are not supported. Remove them or convert them to FLOAT.
HeatWave AutoML Error Messages
Check the task option in the ML_TRAIN call to ensure that it is specified correctly.
Message: Running as a classification task. % classes have less than % samples per class, and
cannot be trained on. Maybe it should be trained as a regression task instead of a classification task.
Or the task ran on the default setting - classification, due to an incorrect JSON task argument.
If a classification model is intended, add more samples to the data to increase the minority class
count; that is, add more rows with the under-represented target column value. If a classification
model was not intended, run ML_TRAIN with the regression task option.
Message: One or more rows contain all NaN values. Imputation is not possible on such rows.
Example: ERROR HY000: ML001051: One or more rows contain all NaN values.
Imputation is not possible on such rows.
Message: All columns are dropped. They are constant, mostly unique, or have a lot of missing
values!
Example: ERROR HY000: ML001052: All columns are dropped. They are constant,
mostly unique, or have a lot of missing values!
ML_TRAIN ignores columns with certain characteristics such as columns missing more than 20% of
values and columns containing the same single value. See Section 3.4, “Preparing Data”.
Message: Unlabeled samples detected in the training data. (Values in target column can not be
NULL).
Example: ERROR HY000: ML003000: Number of offloaded datasets has reached the
limit!
Message: Columns of provided data need to match those used for training. Provided - ['%', '%', '%']
vs Trained - ['%', '%'].
The input data columns do not match the columns of training dataset used to train the model.
Compare the input data to the training data to identify the discrepancy.
Message: The size of model generated is larger than the maximum allowed.
Example: ERROR HY000: ML003014: The size of model generated is larger than
the maximum allowed.
Message: The input column types do not match the column types of dataset which the model was
trained on. ['%', '%'] vs ['%', '%'].
Example: ERROR HY000: ML003015: The input column types do not match the
column types of dataset which the model was trained on. ['numerical',
'numerical', 'categorical', 'numerical'] vs ['numerical', 'numerical',
'numerical', 'numerical'].
Message: Invalid data for the metric (%). Score could not be computed.
Example: ERROR HY000: ML003019: Invalid data for the metric (roc_auc). Score
could not be computed.
The scoring metric is legal and supported, but the data provided is not suitable to calculate such a
score. For example: ROC_AUC for multi-class classification. Try a different scoring metric.
The scoring metric is legal and supported, but the task provided is not suitable to calculate such a
score; for example: Using the accuracy metric for a regression model.
Example: ERROR HY000: ML003021: Cannot train a regression task with a non-
numeric target column.
ML_TRAIN was run with the regression task type on a training dataset with a non-numeric target
column. Regression models require a numeric target column.
Example: ERROR HY000: ML003022: At least 2 target classes are needed for
classification task.
ML_TRAIN was run with the classification task type on a training dataset where the target column did
not have at least two possible values.
Message: Unknown option given. Allowed options for training are: ['task', 'model_list',
'exclude_model_list', 'optimization_metric', 'exclude_column_list', 'datetime_index',
'endogenous_variables', 'exogenous_variables', 'positive_class', 'users', 'items', 'user_columns',
'item_columns'].
Message: Not enough available memory, unloading any RAPID tables will help to free up memory.
Example: ERROR HY000: ML003024: Not enough available memory, unloading any
RAPID tables will help to free up memory.
There is not enough memory on the HeatWave Cluster to perform the operation. Try unloading data
that was loaded for analytics to free up space.
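For example, a minimal sketch of unloading a table that was loaded for analytics (orders is a
placeholder table name):
mysql> ALTER TABLE orders SECONDARY_UNLOAD;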
The recommended node shape for HeatWave AutoML functionality is HeatWave.256GB. The
HeatWave.16GB node shape might not have enough memory to train the model with large data sets.
If this error message appears with the smaller node shape and HeatWave AutoML, use the larger
shape instead.
Message: Not all user specified columns are present in the input table - missing columns are {%}.
Example: ERROR HY000: ML003039: Not all user specified columns are present
in the input table - missing columns are {C4}.
Message: All columns cannot be excluded. User provided exclude_column_list is ['%', '%'].
The syntax includes an exclude_column_list that attempts to exclude too many columns.
Message: One or more columns in include_column_list ([%]) does not exist. Existing columns are
(['%', '%']).
The syntax includes an include_column_list that expects a column that does not exist.
The syntax for a forecasting task includes an include_column_list that expects one or more
columns that are not defined by exogenous_variables.
Message: Target column provided % is one of the independent variables used to train the model [%,
%, %].
Example: ERROR HY000: ML003052: Target column provided LSTAT is one of the
independent variables used to train the model [RM, RAD, LSTAT].
The syntax defines a target_column_name that is one of the independent variables used to train
the model.
Message: datetime_index must be specified by the user for forecasting task and must be a column in
the training table.
The syntax for a forecasting task must include datetime_index, and this must be a column in the
training table.
Message: endogenous_variables must be specified by the user for forecasting task and must be
column(s) in the training table.
The syntax for a forecasting task must include the endogenous_variables option, and these must
be a column or columns in the training table.
The syntax for a forecasting task includes exclude_column_list that contains columns that are
also in endogenous_variables or exogenous_variables.
Message: endogenous and exogenous variables may not have any common columns for forecasting
task.
Example: ERROR HY000: ML003057: endogenous and exogenous variables may not
have any common columns for forecasting task.
Message: Can not train a forecasting task with non-numeric endogenous_variables column(s).
Example: ERROR HY000: ML003058: Can not train a forecasting task with non-
numeric endogenous_variables column(s).
The syntax for a forecasting task includes endogenous_variables and some of the columns are
not defined as numeric.
The syntax for a forecasting task includes multivariate endogenous_variables, but the provided
models only support univariate endogenous_variables.
Message: endogenous_variables may not contain repeated column names ['%1', '%2', '%1'].
The syntax for a forecasting task includes endogenous_variables with a repeated column.
The syntax for a forecasting task includes exogenous_variables with a repeated column.
The syntax for a forecasting task includes endogenous_variables with a NULL argument.
The syntax for a forecasting task includes user provided exogenous_variables with a NULL
argument.
The syntax for a forecasting task must include at least one model.
Message: Prediction table cannot have overlapping datetime_index with train table when
exogenous_variables are used. It can only forecast into future.
The syntax for a forecasting task includes exogenous_variables and the prediction table
contains values in the datetime_index column that overlap with values in the datetime_index
column in the training table.
Message: datetime_index for test table must not have missing dates after the last date in training
table. Please ensure test table starts on or before 2034-01-01 00:00:00. Currently, start date in the
test table is 2036-01-01 00:00:00.
Example: ERROR HY000: ML003066: datetime_index for test table must not have
missing dates after the last date in training table. Please ensure test
table starts on or before 2034-01-01 00:00:00. Currently, start date in
the test table is 2036-01-01 00:00:00.
The syntax for a forecasting task includes a prediction table that contains values in the
datetime_index column that leave a gap to the values in the datetime_index column in the
training table.
Message: datetime_index for forecasting task must be between year 1678 and 2261.
The syntax for a forecasting task includes values in a datetime_index column that are outside the
date range from 1678 to 2261.
Message: Last date of datetime_index in the training table 2151-01-01 00:00:00 plus the length of
the table 135 must be between year 1678 and 2261.
The syntax for a forecasting task includes a prediction table that has too many rows, and the values
in the datetime_index column would be outside the date range from 1678 to 2261.
Message: For recommendation tasks both user and item column names should be provided.
Example: ERROR 3877 (HY000): ML003070: For recommendation tasks both user
and item column names should be provided.
Message: contamination must be numeric value greater than 0 and less than 0.5.
Message: item_columns can not contain repeated column names ['C4', 'C4'].
Example: ERROR 3877 (HY000): ML003071: item_columns can not contain repeated
column names ['C4', 'C4'].
Message: user_columns can not contain repeated column names ['C4', 'C4'].
Example: ERROR 3877 (HY000): ML003071: user_columns can not contain repeated
column names ['C4', 'C4'].
Example: ERROR HY000: ML003072: Can not use more than one threshold method.
Example: ERROR 3877 (HY000): ML003072: Target column C3 can not be specified
as a user or item column.
Message: topk must be an integer value between 1 and length of the table, inclusively (1 <= topk <=
20).
Example: ERROR HY000: ML003073: topk must be an integer value between 1 and
length of the table, inclusively (1 <= topk <= 20).
Example: ERROR 3877 (HY000): ML003073: The users and items columns should be
different.
Message: threshold must be a numeric value between 0 and 1, inclusively (0 <= threshold <= 1).
Message: Unknown option given. This scoring metric only allows for these options: ['topk'].
Example: ERROR HY000: ML003075: Unknown option given. This scoring metric
only allows for these options: ['topk'].
Message: Unknown option given. Allowed options for recommendations are ['recommend', 'top'].
Example: ERROR 3877 (HY000): ML003075: Unknown option given. Allowed options
for recommendations are ['recommend', 'top'].
Message: The recommend option should be provided when a value for topk is assigned.
Message: Unknown recommend value given. Allowed values for recommend are ['ratings', 'items',
'users'].
Message: anomaly_detection only allows 0 (normal) and 1 (anomaly) for labels in target column with
any metric used, and they have to be integer values.
Message: Should not provide a value for topk when the recommend option is set to ratings.
Example: ERROR 3877 (HY000): ML003078: Should not provide a value for topk
when the recommend option is set to ratings.
Message: Provided value for option topk is not a strictly positive integer.
Example: ERROR 3877 (HY000): ML003079: Provided value for option topk is not
a strictly positive integer.
Message: One or more rows contains NULL or empty values. Please provide inputs without NULL or
empty values for recommendation.
Example: ERROR 3877 (HY000): ML003080: One or more rows contains NULL or
empty values. Please provide inputs without NULL or empty values for
recommendation.
Message: Options should be NULL. Options are currently not supported for this task classification.
Example: ERROR 3877 (HY000): ML003081: options should be NULL. Options are
currently not supported for this task classification.
Message: All supported models are excluded, but at least one model should be included.
Example: ERROR 3877 (HY000): ML003082: All supported models are excluded,
but at least one model should be included.
Message: Both user column name ['C3'] and item column name C0 must be provided as string.
Example: ERROR HY000: ML003083: Both user column name ['C3'] and item column
name C0 must be provided as string.
Message: Cannot recommend users to a user not present in the training table.
Example: ERROR: 3877 (HY000): ML003105: Cannot recommend users to a user not
present in the training table.
Message: Cannot recommend items to an item not present in the training table.
Example: ERROR 3877 (HY000): ML003106: Cannot recommend items to an item not
present in the training table.
Message: Users to users recommendation is not supported, please retrain your model.
Message: Items to items recommendation is not supported, please retrain your model.
Example: ERROR HY000: ML003111: Unknown option given. Allowed options are
['batch_size'].
Message: Unknown option given. Allowed options for anomaly detection are [X, Y, ...].
Example: ERROR HY000: ML003112: Unknown option given. Allowed options for
anomaly detection are [X, Y, ...].
Example: ERROR HY000: ML003115: Empty input table after applying threshold.
Message: The feedback_threshold option can only be set for implicit feedback.
Message: The remove_seen option can only be used with the following recommendation ['items',
'users', 'users_to_items', 'items_to_users'].
Example: ERROR HY000: ML003117: The remove_seen option can only be used
with the following recommendation ['items', 'users', 'users_to_items',
'items_to_users'].
Message: The remove_seen option must be set to either True or False. Provided input.
Example: ERROR HY000: ML003118: The remove_seen option must be set to either
True or False. Provided input.
Message: The feedback option must either be set to explicit or implicit. Provided input.
Example: ERROR HY000: ML003119: The feedback option must either be set to
explicit or implicit. Provided input.
Message: The input table needs to contain strictly more than one unique item.
Example: ERROR HY000: ML003120: The input table needs to contain strictly
more than one unique item.
Message: The input table needs to contain at least one unknown or negative rating.
Example: ERROR HY000: ML003121: The input table needs to contain at least
one unknown or negative rating.
Example: ERROR HY000: ML003123: User and item columns should contain
strings.
Message: Calculation for precision_at_k metric could not complete because there are no
recommended items.
Example: HY000: ML004003: This ONNX model only supports fixed batch size=%.
Message: ML_SCORE is not supported for an onnx model that does not support batch inference.
Example: HY000: ML004006: ML_SCORE is not supported for an onnx model that
does not support batch inference.
Message: ML_EXPLAIN is not supported for an onnx model that does not support batch inference.
Example: HY000: ML004007: ML_EXPLAIN is not supported for an onnx model that
does not support batch inference.
Message: onnx model input type=% is not supported! Providing the appropriate types map using
'data_types_map' in model_metadata may resolve the issue.
Message: Output being sparse tensor with batch size > 1 is not supported.
Example: HY000: ML004010: Output being sparse tensor with batch size > 1 is
not supported.
Example: ERROR 3877 (HY000): ML004010: Received data exceeds maximum allowed
length 943718400.
Message: predictions_name should be provided when task=regression and onnx model generates
more than one output.
Example: ERROR HY000: ML004015: Expected JSON string type value for key
(schema_name).
Message: When task=classification, if the user does not provide prediction_probabilities_name for
the onnx model, ML_EXPLAIN method=% will not be supported.
Example: ERROR 3877 (HY000): ML004017: Input value to plugin variable is too
long.
Example: ERROR HY000: ML004018: Parsing JSON arg: Invalid value. failed!
Message: There are issues in running inference session for the onnx model. This might have
happened due to inference on inputs with incorrect names, shapes or types.
Example: HY000: ML004018: There are issues in running inference session for
the onnx model. This might have happened due to inference on inputs with
incorrect names, shapes or types.
Example: ERROR HY000: ML004019: Expected JSON object type value for key
(JSON root).
Message: The computed predictions do not have the right format. This might have happened
because the provided predictions_name is not correct.
Example: HY000: ML004019: The computed predictions do not have the right
format. This might have happened because the provided predictions_name is
not correct.
If a user-initiated interruption, Ctrl-C, is detected during the first phase of HeatWave AutoML
model and table load, where a MySQL parallel scan is used in the HeatWave plugin to read data
from the MySQL database and send it to the HeatWave Cluster, error messaging is handled by the
MySQL parallel scan function and directed to ERROR 1317 (70100): Query execution was
interrupted. The ERROR 1317 (70100) message is reported to the client instead of the
ML004020 error message.
Message: The computed prediction probabilities do not have the right format. This might have
happened because the provided prediction_probabilities_name is not correct.
Message: The onnx model and dataset do not match. The onnx model's input=% is not a column in
the dataset.
Example: HY000: ML004021: The onnx model and dataset do not match. The onnx
model's input=% is not a column in the dataset.
Example: ERROR HY000: ML004022: The user does not have access privileges to
ml.foo.
Message: Labels in y_true and y_pred should be of the same type. Got y_true=% and y_pred=YYY.
Make sure that the predictions provided by the classifier coincides with the true labels.
Example: HY000: ML004022: Labels in y_true and y_pred should be of the same
type. Got y_true=% and y_pred=YYY. Make sure that the predictions provided
by the classifier coincides with the true labels.
Message: Received results exceed `max_allowed_packet`. Please increase it or lower input options
value to reduce result size.
Message: onnx_outputs_info must only be provided for classification and regression tasks.
Message: The length of a key provided in onnx_inputs_info should not be greater than 32 characters.
Message: The length of a key provided in onnx_outputs_info should not be greater than 32
characters.
Message: Input table is empty. Please provide a table with at least one row.
Example: ERROR 45000: ML006052: Input table is empty. Please provide a table
with at least one row.
Message: Insufficient access rights. Grant user with correct privileges (SELECT, DROP, CREATE,
INSERT, ALTER) on input schema.
Example: ERROR 45000: ML006053: Insufficient access rights. Grant user with
correct privileges (SELECT, DROP, CREATE, INSERT, ALTER) on input schema.
Message: Input table already contains a column named '_id'. Please provide an input table without
such column.
Example: ERROR 45000: ML006054: Input table already contains a column named
'_id'. Please provide an input table without such column.
HeatWave AutoML Limitations
• The ML_TRAIN routine does not support MySQL user names that contain a period; for example,
a user named 'joe.smith'@'%' cannot run the ML_TRAIN routine. The model catalog schema
created by the ML_TRAIN procedure incorporates the user name in the schema name (e.g.,
ML_SCHEMA_joesmith), and a period is not a permitted schema name character.
• The table used to train a model (the training dataset) cannot exceed 10 GB, 100 million rows, or
1017 columns. Before MySQL 8.0.29, the column limit was 900.
• To avoid taking up too much space in memory, the number of loaded models should be limited to
three; see the sketch after this list for unloading a model.
• “Bring your own model” is not supported. Use of non-HeatWave AutoML models or manually
modified HeatWave AutoML models can cause undefined behavior.
• Models greater than 900 MB in size are not supported. If a model being trained by the ML_TRAIN
routine exceeds 900 MB, the ML_TRAIN query fails with an error.
• There is currently no way to monitor HeatWave AutoML query progress. ML_TRAIN is typically the
most time consuming routine. The time required to train a model depends on the number of rows and
columns in the dataset and the specified ML_TRAIN parameters and options.
• The ML_PREDICT_TABLE ml_results column contains the prediction results and the data. This
combination must be less than 65,532 characters.
• HeatWave AutoML does not support text columns with NULL values.
• HeatWave AutoML does not support recommendation tasks with a text column.
• Concurrent HeatWave analytics and HeatWave AutoML queries are not supported. A HeatWave
AutoML query must wait for HeatWave analytics queries to finish, and vice versa. HeatWave
analytics queries are given priority over HeatWave AutoML queries.
• HeatWave on AWS only supports HeatWave AutoML with the HeatWave.256GB node shape. To use
HeatWave machine learning functionality, select that shape when creating a HeatWave Cluster.
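As referenced in the list above, a minimal sketch of unloading a model to stay within the
recommended limit of three loaded models; the model handle is assumed to be stored in the @model
session variable:
mysql> CALL sys.ML_MODEL_UNLOAD(@model);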
Chapter 4 HeatWave Lakehouse
Table of Contents
4.1 Overview
4.1.1 External Tables
4.1.2 Lakehouse Engine
4.1.3 Data Storage
4.2 Loading Data to HeatWave Lakehouse
4.2.1 Prerequisites
4.2.2 Lakehouse External Table Syntax
4.2.3 Loading Data Manually
4.2.4 Loading Data Using Auto Parallel Load
4.2.5 Loading Data from External Storage Using Auto Parallel Load
4.3 Access Object Storage
4.3.1 Pre-Authenticated Requests
4.3.2 Resource Principals
4.4 External Table Recovery
4.5 Data Types
4.5.1 Parquet Data Type Conversions
4.6 HeatWave Lakehouse Limitations
4.6.1 Lakehouse Limitations for all File Formats
4.6.2 Lakehouse Limitations for the Avro Format Files
4.6.3 Lakehouse Limitations for the CSV File Format
4.6.4 Lakehouse Limitations for the JSON File Format
4.6.5 Lakehouse Limitations for the Parquet File Format
4.7 HeatWave Lakehouse Error Messages
4.1 Overview
The Lakehouse feature of HeatWave enables query processing on data resident in Object Storage.
The source data is read from Object Storage, transformed to the memory optimized HeatWave format,
stored in the HeatWave persistence storage layer in Object Storage, and then loaded to HeatWave
cluster memory.
• Supports structured and semi-structured relational data in the following file formats:
• Avro.
• CSV.
• JSON.
• Parquet.
• With this feature, users can now analyze data in both InnoDB and an object store using familiar SQL
syntax in the same query.
To use Lakehouse with HeatWave AutoML, see: Section 3.12, “HeatWave AutoML and Lakehouse”.
4.1.1 External Tables
External tables support the following file formats:
• Avro.
• CSV.
• JSON.
• Parquet.
The external table stores the location of the data, see Section 4.3, “Access Object Storage”.
4.1.2 Lakehouse Engine
The Lakehouse Engine enables you to create tables that point to external data sources.
For HeatWave Lakehouse, lakehouse is the primary engine, and rapid is the secondary engine.
4.1.3 Data Storage
The data is deleted if the external table is dropped or unloaded, or if the HeatWave Cluster is deleted.
4.2 Loading Data to HeatWave Lakehouse
4.2.1 Prerequisites
Lakehouse requires the following:
• A HeatWave enabled MySQL DB System with Lakehouse support enabled, and a minimum 512GB
shape. See: Adding a HeatWave Cluster in the HeatWave on OCI Service Guide.
For a replicated MySQL DB System, see Section 4.6, “HeatWave Lakehouse Limitations”.
4.2.2 Lakehouse External Table Syntax
• is_strict_mode as a dialect parameter now supports all file formats. It can override the global
sql_mode.
• allow_missing_files is both a dialect parameter and a file parameter that controls whether
missing files are allowed.
file_section: {
"bucket": "bucket_name",
"namespace": "namespace",
"region": "region",
("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
"is_strict_mode": true | false,
"allow_missing_files": true | false
}
or
file_section: {
"par": "PAR URL",
("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
"is_strict_mode": true | false,
"allow_missing_files": true | false
}
file_section: {
"bucket": "bucket_name",
"namespace": "namespace",
"region": "region",
("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
"is_strict_mode": true | false
}
or
file_section: {
"par": "PAR URL",
("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
"is_strict_mode": true | false
}
• ENGINE_ATTRIBUTE: JSON object literal. Defines the location of files, the file format, and how the
file format is handled.
Use key-value pairs in JSON format to specify options. Lakehouse uses the default setting if there
is no defined option. Use NULL to specify no arguments.
Tables created with json format must only have a single column that conforms to the JSON
data type, see: The JSON Data Type.
• check_constraints: Whether to validate primary key and unique key constraints or not.
The default is true. Supported as of MySQL 8.4.0.
If set to true, then Lakehouse validates primary key and unique key constraints.
If set to false, then Lakehouse does not validate primary key and unique key constraints.
• is_strict_mode: Whether the loading takes place in strict mode, true, or non-strict mode,
false. This setting overrides the global sql_mode. The default is the value of sql_mode.
See Strict SQL Mode. The file common parameter is_strict_mode can override this
setting.
If set to true, then missing files, empty columns, formatting errors or parsing errors throw an
error, and loading stops.
If set to false, then missing files, empty columns, formatting errors or parsing errors display
a warning, and loading continues.
As of MySQL 8.4.0, the dialect parameter is_strict_mode applies to all file formats.
Before MySQL 8.4.0, it only applies to the CSV file format. For Avro and Parquet file formats,
use the file parameter is_strict_mode to define strict mode before MySQL 8.4.0.
• allow_missing_files: Whether to allow missing files or not. This overrides the dialect
parameter is_strict_mode for missing files. Supported as of MySQL 8.4.0.
If set to true, then any missing files do not throw an error, and loading does not stop, unless
all files are missing.
If set to false, then any missing files throw an error, and loading stops.
• With the pattern parameter: there are no files that match the pattern.
• With the prefix parameter: there are no files with that prefix.
The use of any of these parameters with avro or parquet will produce an error.
• record_delimiter: Defines one or more characters used to delimit records. The maximum
record delimiter length is 64 characters.
The default for json is "\n". The only alternative for json is "\r\n".
The use of any of these parameters with avro, json or parquet will produce an error.
• field_delimiter: Defines one or more characters used to delimit fields. The maximum
field delimiter length is 64 characters. The default is "|".
• quotation_marks: Defines one or more characters used to enclose fields. The default is
"\"".
• skip_rows: The number of rows to skip at the start of the file. The maximum value is 20. The
default is 0.
• time_format: The time format, see: String and Numeric Literals in Date and Time Context.
The default is "auto".
• trim_spaces: Whether to remove leading and trailing spaces, or not. The default is false.
• has_header: Whether the CSV file has a header row, or not. The default is false.
If has_header and skip_rows are both defined, Lakehouse first skips the number of rows,
and then uses the next row as the header row.
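For example, a minimal sketch of a CSV dialect definition that combines these parameters; the
values are illustrative only:
"dialect": {
  "format": "csv",
  "field_delimiter": ",",
  "record_delimiter": "\n",
  "skip_rows": 2,
  "has_header": true,
  "trim_spaces": true
}
With this definition, Lakehouse skips the first two rows and uses the third row as the header row.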
Lakehouse supports a maximum of 256 file locations. To define more than 256, store the files
under the same bucket or use prefix or pattern.
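For example, a sketch of a single file entry that covers many files with a pattern; the bucket,
namespace, region, and path names are placeholders:
"file": [{
  "pattern": "src_data/csv/tpch/lineitem/.*\\.csv",
  "bucket": "myBucket",
  "namespace": "myNamespace",
  "region": "myRegion"
}]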
• file parameters for resource principals, see: Section 4.3.2, “Resource Principals”.
Do not specify a region, namespace or bucket with par. That information is contained
in the PAR URL and will generate an error if defined in separate parameters. See:
Section 4.3.1.1, “Recommendations”.
• Use one or more of the following parameters, unless the target defines a specific file:
• pattern: A regular expression that defines a set of Object Storage files. The pattern
follows the modified ECMAScript regular expression grammar, see: Modified ECMAScript
regular expression grammar.
• is_strict_mode: Whether the loading takes place in strict mode, true, or non-strict mode,
false. This overrides the dialect parameter is_strict_mode.
If set to true, then missing files throw an error, and loading stops.
If set to false, then missing files do not throw an error, and loading continues.
4.2.3 Loading Data Manually
1. Choose whether to use a pre-authenticated request or a resource principal, see: Section 4.2.3.1,
“Manually Loading Data from External Storage”.
2. Use the Lakehouse ENGINE_ATTRIBUTE with CREATE TABLE statements, see: Section 4.2.2,
“Lakehouse External Table Syntax”.
As of MySQL 8.3.0-u2, Lakehouse extends Guided Load for external tables, see: Section 2.2.2,
“Loading Data Manually”. This employs Autopilot to perform a series of pre-load validation checks, and
includes the following:
• Infers the schema and performs similar schema adjustments to those performed by Autopilot during
Lakehouse Auto Parallel Load. If the inferred schema is not compatible with the defined schema,
then Guided Load aborts the load. See: Section 4.2.4.1, “Lakehouse Auto Parallel Load Schema
Inference”.
• Predicts the amount of memory required, and checks that this is available. If the required memory is
not available, Guided Load aborts the load.
To monitor any issues encountered during this pre-load validation process, run the SHOW WARNINGS
statement after the load command has finished.
PARs can be used for any Object Storage data stored in any tenancy in the
same region.
mysql> CREATE TABLE `CUSTOMER` (`C_CUSTKEY` int NOT NULL PRIMARY KEY, `C_NATIONKEY` int NOT NULL)
ENGINE=lakehouse
SECONDARY_ENGINE = RAPID
ENGINE_ATTRIBUTE='{"dialect": {"format": "csv”},
"file": [{"par": "https://objectstorage.../n/some_bucket/customer.tbl"}]}';
ALTER TABLE `CUSTOMER` SECONDARY_LOAD;
4.2.4 Loading Data Using Auto Parallel Load
• Lakehouse Auto Parallel Load includes schema inference which analyzes the external data to infer
the table structure.
• As of MySQL 8.0.33-u3, Lakehouse Auto Parallel Load uses the external_tables option to
enable loading data from external sources. See: Section 4.2.4.2, “Lakehouse Auto Parallel Load with
the external_tables Option”. Do not use it as of MySQL 8.4.0; external_tables will be deprecated
in a future release.
• As of MySQL 8.4.0, Lakehouse Auto Parallel Load uses the db_object with table or
exclude_tables instead. See: Section 2.2.3.2, “Auto Parallel Load Syntax”.
Lakehouse Auto Parallel Load facilitates the process of loading data into HeatWave by automating
many of the steps involved, including:
• All these steps: Section 2.2.3, “Loading Data Using Auto Parallel Load”.
• Lakehouse Auto Parallel Load analyzes the data, infers the table structure, and creates the database
and all tables. This requires only the name of the database, the names of each table, and the external
file parameters; Lakehouse Auto Parallel Load then generates the CREATE DATABASE and CREATE
TABLE statements. For example, see: Section 4.2.5.1, “Load Configuration”.
Lakehouse Auto Parallel Load uses header information from the external files to define the column
names. If this is not available, Lakehouse Auto Parallel Load defines the column names sequentially:
col_1, col_2, col_3 ...
• As of MySQL 8.3.0, if the tables are already defined, Lakehouse Auto Parallel Load analyzes the
data, infers the table structure, and then modifies the structure to avoid errors during data load. For
example, if a table defines a column with TINYINT, but Lakehouse Auto Parallel Load infers that
the data requires SMALLINT, MEDIUMINT, INT, or BIGINT, then Lakehouse Auto Parallel Load
modifies the structure accordingly. If the inferred data type is incompatible with the table definition,
Lakehouse Auto Parallel Load raises an error, and specifies the column as NOT SECONDARY.
As of MySQL 8.0.33-u3, HeatWave Lakehouse extends Auto Parallel Load with the
external_tables option. This is a JSON array that includes one or more db_object.
Do not use it as of MySQL 8.4.0; use db_object with table or exclude_tables instead.
external_tables will be deprecated in a future release.
db_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"db_name": "name",
"tables": JSON_ARRAY(table [, table] ...)
}
}
table: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"table_name": "name",
"sampling": true|false,
"dialect": {dialect_section},
"file": JSON_ARRAY(file_section [, file_section]...),
}
}
• db_object: the details of one or more tables. Each db_object contains the following:
• db_name: name of the database. If the database does not exist, Lakehouse Auto Parallel Load
creates it during the load process.
• sampling: if set to true, the default setting, Lakehouse Auto Parallel Load infers the schema
by sampling the data and collecting statistics.
If set to false, Lakehouse Auto Parallel Load performs a full scan to infer the schema and
collect statistics. Depending on the size of the data, this can take a long time.
Auto Parallel Load uses the inferred schema to generate CREATE TABLE statements. The
statistics are used to estimate storage requirements and load times.
• dialect: details about the file format. See the dialect parameter in Section 4.2.2,
“Lakehouse External Table Syntax”.
• file: the location of the data in Object Storage. This can use a pre-authenticated request or
a resource principal, and can be a path to a file, a file prefix, or a file pattern. See the file
parameter in Section 4.2.2, “Lakehouse External Table Syntax”, and see: Section 4.3, “Access
Object Storage”.
Syntax Examples
While it is possible to define the entire load command on a single line, for readability the configuration
is divided into option definitions using SET.
• Define the name of the database which will store the data.
mysql> SET @db_list = '["tpch"]';
This assumes that Lakehouse Auto Parallel Load will analyze the data, infer the table structure, and
create the database and all tables. See: Section 4.2.4.2, “Lakehouse Auto Parallel Load with the
external_tables Option”.
• Define the db_object parameters that will load data from three external sources with Avro, CSV
and Parquet format files:
mysql> SET @ext_tables = '[
{
"db_name": "tpch",
"tables": [{
"table_name": "supplier_pq",
"dialect": {
"format": "parquet"
},
"file": [{
"prefix": "src_data/parquet/tpch/supplier/",
"bucket": "myBucket",
"namespace": "myNamespace",
"region": "myRegion"
}]
},
{
"table_name": "nation_csv",
"dialect": {
"format": "csv",
"field_delimiter": "|",
"record_delimiter": "|\\n",
"has_header": true
},
"file": [{
"par": "https://objectstorage.../nation.csv"
}]
},
{
"table_name": "region_avro",
"dialect": {
"format": "avro"
},
"file": [{
"par": "https://objectstorage.../region.avro"
}]
}]
}
]';
Setting mode to dryrun generates the load script but does not create or load the external tables. For
example:
mysql> SET @options = JSON_OBJECT('mode', 'dryrun', 'external_tables', CAST(@ext_tables AS JSON));
To implement the changes as part of the load command, set mode to normal. This is the default,
and it is not necessary to add it to the command.
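A sketch of running the load with the variables defined above, assuming @db_list and @options as
set earlier:
mysql> CALL sys.heatwave_load(CAST(@db_list AS JSON), @options);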
Exclude columns from the loading process with the exclude_list option. See Section 2.2.3.2, “Auto
Parallel Load Syntax”.
Lakehouse Auto Parallel Load infers the column names for Avro and Parquet files, and also for CSV
files if has_header is true. For these situations, use the column names with the exclude_list
option.
If the table already exists, but no data has been loaded, use the existing column names with the
exclude_list option.
For CSV files if has_header is false, use the generated schema names with the exclude_list
option. These are: col_1, col_2, col_3 ...
4.2.5 Loading Data from External Storage Using Auto Parallel Load
This section describes loading data with the Auto Parallel Load procedure and schema inference which
is part of the procedure. This process does not require an existing external table definition and creates
the external table based on schema inference.
The TPCH data set is used in this example, which loads data from external Avro, CSV and Parquet
format files.
• Define the input_list parameters that will load data from four external sources with Avro, CSV,
JSON and Parquet format files:
mysql> SET @input_list = '[
{"db_name": "tpch", "tables": [{
"table_name": "supplier_pq",
"engine_attribute": {
"dialect": {"format": "parquet"},
"file": [{
"prefix": "src_data/parquet/tpch/supplier/",
"bucket": "myBucket",
"namespace": "myNamespace",
"region": "myRegion"
}]
}
},
{
"table_name": "customer_csv",
"engine_attribute": {
"dialect": {
"format": "csv",
"field_delimiter": "|",
"record_delimiter": "|\\n",
"has_header": true
},
"file": [{"par": "https://objectstorage.../customer.csv"}]
}
},
{
"table_name": "region_avro",
"engine_attribute": {
"dialect": {"format": "avro"},
"file": [{"par": "https://objectstorage.../region.avro"}]
}
},
{
"table_name": "nation_json",
"engine_attribute": {
"dialect": {"format": "json"},
"file": [{"par": "https://objectstorage.../nation.json"}]
}
}
]}
]';
• Define the @options variable with SET. Setting mode to dryrun generates the load script but does
not create or load the external tables. For example:
mysql> SET @options = JSON_OBJECT('mode', 'dryrun');
To implement the changes as part of the load command, set mode to normal. This is the default,
and it is not necessary to add it to the command.
Set mode to validation to validate the data files against the created table for any potential data
errors. For example:
mysql> SET @options = JSON_OBJECT('mode', 'validation');
Note
validation requires the tables to be created first, and it does not load the
data to the tables. To load the tables the mode must be set to normal.
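A sketch of invoking the load with the variables defined above, assuming @input_list and @options
as set earlier:
mysql> CALL sys.heatwave_load(CAST(@input_list AS JSON), @options);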
Exclude columns from the loading process with the exclude_columns option. See Section 2.2.3.2,
“Auto Parallel Load Syntax”.
Lakehouse Auto Parallel Load infers the column names for Avro, JSON and Parquet files, and
also for CSV files if has_header is true. For these situations, use the column names with the
exclude_columns option.
If the table already exists, but no data has been loaded, use the existing column names with the
exclude_columns option.
For CSV files if has_header is false, use the generated schema names with the
exclude_columns option. These are: col_1, col_2, col_3 ...
This example is run without a defined mode, and defaults to normal mode, generating the script and
running it. If the mode is set to dryrun, the script is generated and made available to examine in the
LOAD SCRIPT GENERATION section of the Auto Parallel Load process.
The procedure initializes, runs, and displays a report. The report is divided into the following sections:
• INITIALIZING HEATWAVE AUTO PARALLEL LOAD: Lists the load mode, policy, and output mode.
For example:
+------------------------------------------+
| INITIALIZING HEATWAVE AUTO PARALLEL LOAD |
+------------------------------------------+
| Version: 3.11 |
| |
| Load Mode: normal |
| Load Policy: disable_unsupported_columns |
| Output Mode: normal |
| |
+------------------------------------------+
6 rows in set (0.02 sec)
• LAKEHOUSE AUTO SCHEMA INFERENCE: Displays the details of the table, how many rows and
columns it contains, its file size, and the name of the schema.
For example:
+--------------------------------------------------------------------------------------------------------
| LAKEHOUSE AUTO SCHEMA INFERENCE
+--------------------------------------------------------------------------------------------------------
| Verifying external lakehouse tables: 4
|
| SCHEMA TABLE TABLE IS RAW NUM. OF ESTIMATED
| NAME NAME CREATED FILE SIZE COLUMNS ROW COUNT
| ------ ----- -------- --------- ------- ---------
| `tpch` `customer_csv` NO 232.71 GiB 8 1.5 B
| `tpch` `nation_json` NO 3.66 KiB 1 25
| `tpch` `region_avro` NO 476 bytes 3 9
| `tpch` `supplier_pq` NO 7.46 GiB 7 100 M
|
| New schemas to be created: 1
| External lakehouse tables to be created: 4
|
+--------------------------------------------------------------------------------------------------------
13 rows in set (21.06 sec)
• OFFLOAD ANALYSIS: Displays an analysis of the number and name of the tables and columns
which can be offloaded to HeatWave.
For example:
+------------------------------------------------------------------------+
| OFFLOAD ANALYSIS |
+------------------------------------------------------------------------+
| Verifying input schemas: 1 |
| User excluded items: 0 |
| |
| SCHEMA OFFLOADABLE OFFLOADABLE SUMMARY OF |
| NAME TABLES COLUMNS ISSUES |
| ------ ----------- ----------- ---------- |
| `tpch` 4 19 |
| |
| Total offloadable schemas: 1 |
| |
+------------------------------------------------------------------------+
10 rows in set (21.09 sec)
• CAPACITY ESTIMATION: Displays the HeatWave cluster and MySQL node memory requirement to
process the data and an estimation of the load time.
For example:
+--------------------------------------------------------------------------------------------------------
| CAPACITY ESTIMATION
+--------------------------------------------------------------------------------------------------------
| Default encoding for string columns: VARLEN (unless specified in the schema)
| Estimating memory footprint for 1 schema(s)
|
| TOTAL ESTIMATED ESTIMATED TOTAL DICTIONARY
| SCHEMA OFFLOADABLE HEATWAVE NODE MYSQL NODE STRING ENCODED
| NAME TABLES FOOTPRINT FOOTPRINT COLUMNS COLUMNS
| ------ ----------- --------- --------- ------- ----------
| `tpch` 4 193.39 GiB 1.44 MiB 12 0
|
| Sufficient MySQL host memory available to load all tables.
| Sufficient HeatWave cluster memory available to load all tables.
|
+----------------------------------------------------------------------------------------------------
12 rows in set (21.10 sec)
Note
If there is insufficient memory, update the nodes before proceeding with the
load.
• EXECUTING LOAD: Displays information about the generated script and approximate loading time.
For example:
+----------------------------------------------------------------------------------------------------
| EXECUTING LOAD SCRIPT
+----------------------------------------------------------------------------------------------------
| HeatWave Load script generated
| Retrieve load script containing 9 generated DDL command(s) using the query below:
| Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot
| SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER
|
| Adjusting load parallelism dynamically per internal/external table.
| Using current parallelism of 4 thread(s) as maximum for internal tables.
|
| Warning: Executing the generated script may alter column definitions and secondary engine flags in
|
| Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVIS
|
| Proceeding to load 4 table(s) into HeatWave.
|
| Applying changes will take approximately 22.08 min
|
+----------------------------------------------------------------------------------------------------
16 rows in set (22.08 sec)
• SCHEMA CREATION: Displays information about the schema creation process and duration.
For example:
+-----------------------------------+
| SCHEMA CREATION |
+-----------------------------------+
| Schema `tpch` creation succeeded! |
| Warnings/errors encountered: 0 |
| Elapsed time: 2.62 ms |
| |
+-----------------------------------+
4 rows in set (14.70 sec)
For example:
+----------------------------------------+
| TABLE LOAD |
+----------------------------------------+
| TABLE (1 of 4): `tpch`.`customer_csv` |
| Commands executed successfully: 2 of 2 |
| Warnings encountered: 0 |
| Table load succeeded! |
| Total columns loaded: 8 |
| Elapsed time: 19.33 min |
| |
+----------------------------------------+
7 rows in set (19 min 41.74 sec)
+----------------------------------------+
| TABLE LOAD |
+----------------------------------------+
| TABLE (2 of 4): `tpch`.`nation_json` |
| Commands executed successfully: 2 of 2 |
| Warnings encountered: 0 |
| Table load succeeded! |
| Total columns loaded: 1 |
| Elapsed time: 3.70 s |
| |
+----------------------------------------+
7 rows in set (19 min 45.44 sec)
+----------------------------------------+
| TABLE LOAD |
+----------------------------------------+
| TABLE (3 of 4): `tpch`.`region_avro` |
| Commands executed successfully: 2 of 2 |
| Warnings encountered: 0 |
| Table load succeeded! |
| Total columns loaded: 3 |
| Elapsed time: 3.79 s |
| |
+----------------------------------------+
7 rows in set (19 min 49.24 sec)
+----------------------------------------+
| TABLE LOAD |
+----------------------------------------+
| TABLE (4 of 4): `tpch`.`supplier_pq` |
| Commands executed successfully: 2 of 2 |
| Warnings encountered: 0 |
| Table load succeeded! |
| Total columns loaded: 7 |
| Elapsed time: 1.96 min |
| |
+----------------------------------------+
7 rows in set (21 min 46.98 sec)
For example:
+-------------------------------------------------------------------------------+
| LOAD SUMMARY |
+-------------------------------------------------------------------------------+
| |
| SCHEMA TABLES TABLES COLUMNS LOAD |
| NAME LOADED FAILED LOADED DURATION |
| ------ ------ ------ ------- -------- |
| `tpch` 4 0 19 21.41 min |
| |
+-------------------------------------------------------------------------------+
6 rows in set (21 min 46.98 sec)
+----------------------------------------------------------------------------------------------------------
| Load Script
+------------------------------------------------------------------------------------------------------
| CREATE DATABASE `tpch`;
| CREATE TABLE `tpch`.`customer_csv`(
| `C_CUSTKEY` int unsigned NOT NULL,
| `C_NAME` varchar(19) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `C_ADDRESS` varchar(40) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `C_NATIONKEY` tinyint unsigned NOT NULL,
| `C_PHONE` varchar(15) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `C_ACCTBAL` decimal(6,2) NOT NULL,
| `C_MKTSEGMENT` varchar(10) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `C_COMMENT` varchar(116) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN'
| ) ENGINE=lakehouse SECONDARY_ENGINE=RAPID
| ENGINE_ATTRIBUTE='{"file": [{"par": "https://objectstorage.../customer.csv"}],
| "dialect": {"format": "csv", "field_delimiter": "|", "record_delimiter": "|\\n"}
| ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ `tpch`.`customer_csv` SECONDARY_LOAD;
| CREATE TABLE `tpch`.`nation_json`(`col_1` json NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN')
| ENGINE=lakehouse SECONDARY_ENGINE=RAPID
| ENGINE_ATTRIBUTE='{"file": [{"par": "https://objectstorage.../nation.json"}],
| "dialect": {"format": "json"}}';
| ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ `tpch`.`nation_json` SECONDARY_LOAD;
| CREATE TABLE `tpch`.`region_avro`(
| `R_REGIONKEY` tinyint unsigned NOT NULL,
| `R_NAME` varchar(11) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `R_COMMENT` varchar(115) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=VARLEN'
| ) ENGINE=lakehouse SECONDARY_ENGINE=RAPID
| ENGINE_ATTRIBUTE='{"file": [{"par": "https://objectstorage.../region.avro"}],
| "dialect": {"format": "avro"}}';
| ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ `tpch`.`region_avro` SECONDARY_LOAD;
| CREATE TABLE `tpch`.`supplier_pq`(
| `S_SUPPKEY` int,
| `S_NAME` varchar(19) COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `S_ADDRESS` varchar(40) COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `S_NATIONKEY` int,
| `S_PHONE` varchar(15) COMMENT 'RAPID_COLUMN=ENCODING=VARLEN',
| `S_ACCTBAL` decimal(15,2),
| `S_COMMENT` varchar(100) COMMENT 'RAPID_COLUMN=ENCODING=VARLEN'
| ) ENGINE=lakehouse SECONDARY_ENGINE=RAPID
| ENGINE_ATTRIBUTE='{"file": [{"prefix": "src_data/parquet/tpch/supplier/",
| "bucket": "myBucket",
| "namespace": "myNamespace",
| "region": "myRegion"}],
| "dialect": {"format": "parquet"}}';
| ALTER TABLE /*+ AUTOPILOT_DISABLE_CHECK */ `tpch`.`supplier_pq` SECONDARY_LOAD;
+------------------------------------------------------------------------------------------------------
Note
The output above is displayed in a readable format. The actual CREATE TABLE
output is generated on a single line.
4.3 Access Object Storage
Lakehouse accesses data in Object Storage with one of the following:
• A pre-authenticated request.
• A resource principal.
4.3.1 Pre-Authenticated Requests
For more information about pre-authenticated requests, see the Oracle Cloud Infrastructure
documentation at Using Pre-Authenticated Requests.
4.3.1.1 Recommendations
The following recommendations apply to pre-authenticated requests created for HeatWave Lakehouse:
• Set a short expiration date for the pre-authenticated request URL that matches the data loading plan.
• Use Enable Object Listing when creating the pre-authenticated request in the HeatWave
Console.
• When creating the pre-authenticated request from the command line, include the --access-type
AnyObjectRead parameter.
Note
Use a resource principal for access to more sensitive data in Object Storage as
it is more secure.
The default value of bucket-listing-action is Deny. This allows access to the defined object,
data-file-01.csv, only.
The following example creates a PAR named MyAllObjectsReadPAR, for a file named
data-file-01.csv, in a bucket named MyParBucket, and grants read-only access to all objects in the
bucket:
$>oci os preauth-request create --namespace MyNamespace --bucket-name MyParBucket \
--name MyAllObjectsReadPAR --access-type AnyObjectRead \
--time-expires="2022-11-21T23:00:00+00:00" --bucket-listing-action ListObjects
"id": "alphanumericString",
"name": "MyAllObjectsReadPAR",
"object-name": null,
"time-created": "2022-12-08T18:51:34.491000+00:00",
"time-expires": "2022-12-09T23:07:04+00:00"
}
}
The defined value of bucket-listing-action is ListObjects. This allows the listing of all objects in
the named bucket.
All pre-authenticated requests created with the command line are available to view in the HeatWave
Console. To view the pre-authenticated requests, navigate to the HeatWave Console page for the
bucket and select Pre-Authenticated Requests in the Resources section. You can also list the PARs
from the command line, with the following command:
$>oci os preauth-request list --bucket-name=name
4.3.2 Resource Principals
For HeatWave on OCI, see Resource Principals in the HeatWave on OCI Service Guide.
Dynamic Group
Dynamic groups allow you to group MySQL DB Systems as principal actors, similar to user groups.
You can then create policies to permit MySQL DB Systems in these groups to make API calls against
services, such as Object Storage. Membership in the group is determined by a set of criteria called
matching rules.
The following example shows a matching rule including all MySQL DB Systems in the defined
compartment:
"ALL{resource.type='mysqldbsystem', resource.compartment.id = 'ocid1.compartment.oc1..alphanumericStrin
For more information, see Writing Matching Rules to Define Dynamic Groups.
Policy
Policies define what your groups can and cannot do. For HeatWave Lakehouse to access Object
Storage, you must define a policy which grants the dynamic group's resources access to buckets and
their contents in a specific compartment.
For example, the following policy grants the dynamic group Lakehouse-dynamicGroup read-only
access to the buckets and objects contained in those buckets in the compartment Lakehouse-Data:
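The policy statements themselves are not reproduced here; a sketch, assuming standard OCI policy
syntax and the group and compartment names used above, might look like this:
Allow dynamic-group Lakehouse-dynamicGroup to read buckets in compartment Lakehouse-Data
Allow dynamic-group Lakehouse-dynamicGroup to read objects in compartment Lakehouse-Data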
4.6 HeatWave Lakehouse Limitations
4.6.1 Lakehouse Limitations for all File Formats
• Do not create Lakehouse tables on the source DB in a replicated MySQL DB System if any of the
replicas are outside HeatWave on OCI or HeatWave on AWS. This will cause replication errors.
• Before MySQL 8.4.0-u2, a replication channel might fail if a HeatWave Cluster is added to a replica
of a MySQL DB System, and later manually stopped.
• It is not possible to dump external tables using the MySQL Shell export utilities, such as
dumpInstance(). External tables are not replicated to InnoDB storage and cannot be exported.
To export InnoDB data from a Lakehouse enabled database, exclude the external tables with an
excludeTables option.
• It is not possible to restore a backup from a Lakehouse enabled MySQL DB System to a standalone
MySQL DB System.
• Before MySQL 8.4.0, Lakehouse does not enforce any specified constraints. For example,
Lakehouse does not enforce primary key uniqueness constraints. MySQL 8.4.0 removes this
limitation.
• The following DML statements are not supported on external tables:
• INSERT
• UPDATE
• DELETE
• REPLACE
• The following CREATE TABLE options are not supported:
• AUTOEXTEND_SIZE
• AVG_ROW_LENGTH
• CHECKSUM
• COMPRESS
• CONNECTION
• DATADIR
• DELAY_KEY_WRITE
• ENCRYPT
• INDEXDIR
• INSERT_METHOD
• KEY_BLOCK_SIZE
• MAX_ROWS
• MIN_ROWS
• PACK_KEYS
• PASSWORD
• ROW_FORMAT
• STATS_AUTO_RECALC
• STATS_PERSISTENT
• STATS_SAMPLE_PAGES
• UNION
• The default expression for a column definition for the CREATE TABLE statement.
• Creating triggers.
• Running ALTER TABLE statements that construct indexes, ADD or DROP columns, or add enforced
check constraints.
• Hidden columns.
• Index construction.
• The following limitations only apply before MySQL 8.3.0-u2; MySQL 8.3.0-u2 removes these
limitations:
• High Availability. To enable Lakehouse, disable High Availability or create a standalone MySQL
DB System for use with Lakehouse.
• Read Replication.
• Outbound Replication.
4.6.2 Lakehouse Limitations for the Avro Format Files
• The data in an Avro block must be no larger than 64MiB before applying any compression.
• Lakehouse only supports uncompressed Avro blocks, or Avro blocks compressed with Snappy or
Deflate algorithms. Lakehouse does not support other compression algorithms.
• Lakehouse does not support these data types in Avro files: Array, Map, and nested records.
Lakehouse marks columns with these data types as NOT SECONDARY, and does not load them.
• Lakehouse only supports an Avro union between one supported data type and NULL.
• The Avro DECIMAL data type is limited to the scale and precision supported by MySQL.
• Avro data with FIXED values or values in BYTES are only supported with the Avro DECIMAL data
type.
4.6.3 Lakehouse Limitations for the CSV File Format
• Lakehouse does not support CSV files with more than 4MB per line.
4.6.4 Lakehouse Limitations for the JSON File Format
• Tables created with json format must only have a single column that conforms to the JSON data
type, see: The JSON Data Type.
• Lakehouse can only load 64KB of data for each line, and ignores any line with more than 64KB of
data.
4.6.5 Lakehouse Limitations for the Parquet File Format
• Lakehouse does not support Parquet files with a row group size of more than 500MB.
• Lakehouse does not support these data types in Parquet files: ENUM, UUID, Interval, BSON, List,
Map, Unknown. Lakehouse marks columns with these data types as NOT SECONDARY, and does not
load them.
• Before MySQL 8.3.0-u2, Lakehouse does not support the JSON data type in Parquet files. MySQL
8.3.0-u2 removes this limitation.
HeatWave Lakehouse Error Messages
Message: Unable to load table from external source: Column %d of %s : Column contains null but it
is not nullable.
Message: Column %d of %s : Column is missing, but it is not nullable and no explicit default value is
provided.
Message: Unable to load table from external source: Column %d of %s : Column is empty.
Message: Unable to load table from external source: Column %d of %s : Invalid %s value.
Message: Unable to load table from external source: Column %d of %s : Unknown error while
parsing decimal.
Message: Unable to load table from external source: Column %d of %s : Out of memory loading
decimal.
Message: Unable to load table from external source: Column %d of %s : decimal precision exceeds
schema.
Message: Unable to load table from external source: Column %d of %s : %s exceeds min.
Message: Unable to load table from external source: Column %d of %s : %s exceeds max.
Message: Unable to load table from external source: Column %d of %s : Real value is NaN.
Message: Unable to load table from external source: Column %d of %s : %s value is out of range.
Message: Unable to load table from external source: Column %d of %s : Error while applying %s
format.
Message: Unable to load table from external source: Column %d of %s : Cannot convert string %s.
Message: Unable to load table from external source: Resource principal error.
Message: Unable to load table from external source: AWS authentication error.
Message: Unable to load table from external source: %s:%lu: Parsing error found. %s.
Message: Unable to load table from external source: %s: Mismatch in the number of columns.
Expected %u, found %u.
Message: Unable to load table from external source: %s: More than %u column(s) found.
Message: Unable to load table from external source: Column %d in %s: Charset is not supported.
Message: Unable to load table from external source: Column %d in %s: Decimal conversion error.
Message: Unable to load table from external source: Column %d in %s: String is too long.
Message: Unable to load table from external source: Resource Principal endpoint is not configured.
Cannot access following buckets: %s.
Message: Unable to load table from external source: No files found corresponding to the following
data locations: %s.
Message: Unable to load table from external source: File %s was not processed (duplicate).
Message: Unable to load table from external source: File %.*s: Avro schema depth exceeds max
allowed depth.
Message: Unable to load table from external source: File %s: Header does not match other files.
Message: Unable to load table from external source: File %s: Enum symbol can't convert to column
charset.
Message: Unable to load table from external source: File %s: Enum symbols are not the same, or
are not in the same order.
Message: Unable to load table from external source: Column %d in %s: Avro value of physical type
%.*s logical type %.*s cannot be converted to mysql type %s.
Message: Unable to load table from external source: File %.*s: Ends unexpectedly.
Message: Unable to load table from external source: File %.*s: File data might be corrupt.
Message: Unable to load table from external source: Column %d in %s: Invalid union encountered.
Unions are only supported to represent nullable columns. One of the two types of the union must be
null.
Message: Unable to load table from external source: File %.*s: Invalid avro block size.
Message: Unable to load table from external source: File %.*s: Invalid Avro block record count.
Message: Unable to load table from external source: File %.*s: Cannot locate %s specific metadata.
Either the file is corrupted or this is not an %s file.
Message: Unable to load table from external source: File %.*s: Could not process Avro header
metadata. The metadata is corrupted or invalid.
Message: Unable to load table from external source: File %.*s: No schema in Avro header metadata.
Message: Unable to load table from external source: File %.*s: No codec in Avro header metadata.
Message: Unable to load table from external source: File %.*s: Invalid name in Avro schema.
Message: Unable to load table from external source: File %.*s: Error decoding Avro %.*s. %s
Message: Unable to load table from external source: Column %d in %s: String with non-utf8 file
encoding.
Message: Unable to load table from external source: Difference in schema among Parquet files.
Message: Unable to load table from external source: File %s: Size of row group %ld is larger than
maximum allowed size (%d).
Message: Unable to load table from external source: File %s: Cannot locate offset of row group %ld
in metadata.
Message: Unable to load table from external source: Column %d in parquet table cannot be
converted to mysql format.
Message: Unable to load table from external source: Cannot locate schema in %s.
Message: Error during Lakehouse Schema Inference: Found files with mismatched schemas.
Message: Warning during Lakehouse Schema Inference: Skipped %lu line(s) due to mismatched
num of cols.
Message: Warning during Lakehouse Schema Inference: Skipped %lu file(s) due to mismatched
num of cols.
Message: Warning during Lakehouse Schema Inference: File %s is skipped because it has no valid
data. File is either empty or all rows are skipped.
Message: Error during Lakehouse Schema Inference: No valid data found for processing.
Message: Error during Lakehouse Schema Inference: No valid files found for processing.
Message: Warning during Lakehouse Schema Inference: Used default column names for column
index(s) %s as they are %s.
Message: Unable to load table from external source: File %s: Unable to read Parquet header.
Message: Error during Lakehouse Schema Inference: Got an exception while trying to process file
task.
Message: Unable to load table from external source: File %.*s: Could not parse avro header.
Probably corrupted avro file.
Message: Unable to load table from external source: Parquet reader cannot open %s. %s
Message: Unable to load table from external source: Unable to access the following data locations:
%s
Chapter 5 System and Status Variables
Table of Contents
5.1 System Variables ................................................................................................................ 233
5.2 Status Variables .................................................................................................................. 237
• bulk_loader.data_memory_size
Specifies the amount of memory to use for LOAD DATA with ALGORITHM=BULK, in bytes. See:
Section 2.17, “Bulk Ingest Data to MySQL Server”.
• rapid_compression
Whether to enable or disable data compression before loading data into HeatWave. Data
compression is enabled by default. The setting does not affect data that is already loaded. See
Section 2.2.6, “Data Compression”.
As of MySQL 8.3.0, the default option is AUTO, which automatically chooses the best compression
algorithm for each column.
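For example, to disable compression for subsequent loads (a sketch; this assumes the variable is
dynamically settable in your deployment):
mysql> SET GLOBAL rapid_compression = OFF;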
• rapid_bootstrap
Scope: Global
Dynamic: Yes
SET_VAR Hint Applies: No
Type: Enumeration
Default Value: OFF
Valid Values: IDLE, SUSPEND, ON
The setting for this variable is managed by OCI and cannot be modified directly. Defines the
HeatWave Cluster bootstrap state. States include:
• OFF
• IDLE
• SUSPENDED
The HeatWave Cluster is suspended. The SUSPENDED state is a transition state between IDLE
and ON that facilitates planned restarts of the HeatWave Cluster.
• ON
• rapid_dmem_size
The setting for this variable is managed by OCI and cannot be modified directly. Specifies the
amount of DMEM available on each core of each node, in bytes.
• rapid_memory_heap_size
Type: Integer
Default Value: unlimited
Minimum Value: 67108864
Maximum Value: unlimited
The setting for this variable is managed by OCI and cannot be modified directly. Defines the amount
of memory available for the HeatWave plugin, in bytes. Ensures that HeatWave does not use more
memory than is allocated to it.
• rapid_execution_strategy
Command-Line Format: --rapid_execution_strategy[={MIN_RUNTIME|MIN_MEM_CONSUMPTION}]
System Variable: rapid_execution_strategy
Scope: Session
Dynamic: No
SET_VAR Hint Applies: No
Type: Enumeration
Default Value: MIN_RUNTIME
Valid Values: MIN_RUNTIME, MIN_MEM_CONSUMPTION
Specifies the query execution strategy to use: minimum runtime (MIN_RUNTIME) or minimum
memory consumption (MIN_MEM_CONSUMPTION).
HeatWave optimizes for network usage rather than memory. If you encounter out-of-memory errors
when running a query, try the MIN_MEM_CONSUMPTION strategy by setting
rapid_execution_strategy prior to executing the query; for example:
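mysql> SET SESSION rapid_execution_strategy = MIN_MEM_CONSUMPTION;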
• rapid_stats_cache_max_entries
The setting for this variable is managed by OCI and cannot be modified directly. Specifies the
maximum number of entries in the statistics cache.
The number of entries permitted in the statistics cache by default is 65536, which is enough to store
statistics for 4000 to 5000 unique queries of medium complexity.
For more information, see Section 2.3.4, “Auto Query Plan Improvement”.
• show_create_table_skip_secondary_engine
Whether to exclude the SECONDARY ENGINE clause from SHOW CREATE TABLE output, and from
CREATE TABLE statements dumped by the mysqldump utility.
• use_secondary_engine
Default Value: ON
Valid Values: OFF, ON, FORCED
Whether to execute queries using the secondary engine. These values are permitted:
• OFF: Queries execute using the primary storage (InnoDB) on the MySQL DB System. Execution
using the secondary engine (RAPID) is disabled.
• ON: Queries execute using the secondary engine (RAPID) when conditions warrant, falling back
to the primary storage engine (InnoDB) otherwise. In the case of fallback to the primary engine,
whenever that occurs during statement processing, the attempt to use the secondary engine is
abandoned and execution is attempted using the primary engine.
• FORCED: Queries always execute using the secondary engine (RAPID) or fail if that is not possible.
Under this mode, a query returns an error if it cannot be executed using the secondary engine,
regardless of whether the tables that are accessed have a secondary engine defined.
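For example, to require that every query in the current session executes on HeatWave:
mysql> SET SESSION use_secondary_engine = FORCED;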
Status Variables
• hw_data_scanned
Tracks the amount of data scanned by successfully executed HeatWave queries. Data is tracked in
megabytes and is a cumulative total of data scanned since the HeatWave Cluster was last started.
The counter is reset to 0 when the HeatWave Cluster is restarted (when the rapid_bootstrap
state changes from OFF or IDLE to ON.)
• rapid_change_propagation_status
A status of ON indicates that change propagation is enabled globally, permitting changes to InnoDB
tables on the MySQL DB System to be propagated to their counterpart tables in the HeatWave
Cluster.
• rapid_cluster_status
• rapid_core_count
The HeatWave node core count. The value remains at 0 until all HeatWave nodes are started.
• rapid_heap_usage
• rapid_load_progress
• rapid_ml_operation_count
The number of HeatWave AutoML operations that a DB System executes. An operation
typically includes multiple queries, and it increases for both successful and failed queries. It resets
when a HeatWave node restarts.
• rapid_ml_status
• rapid_plugin_bootstrapped
• rapid_preload_stats_status
Reports the state of preload statistics collection. Column-level statistics are collected for tables
on the MySQL DB System during a cluster size estimate. You can perform a cluster estimate
when adding or editing a HeatWave Cluster. States include Not started, In progress, and
Statistics collected.
Preload statistics are cached in the rpd_preload_stats Performance Schema table. See
Section 6.4.5, “The rpd_preload_stats Table”.
• rapid_query_offload_count
• rapid_service_status
Reports the status of the cluster as it is brought back online after a node failure.
• Secondary_engine_execution_count
The number of queries executed by HeatWave. Execution occurs if query processing using
the secondary engine advances past the preparation and optimization stages. The variable is
incremented regardless of whether query execution is successful.
• rapid_n_external_tables
• rapid_lakehouse_total_loaded_bytes
• rapid_lakehouse_health
Chapter 6 HeatWave Performance and Monitoring
Table of Contents
6.1 HeatWave Autopilot Report Table ........................................................................................ 239
6.2 HeatWave MySQL Monitoring .............................................................................................. 241
6.2.1 HeatWave Node Status Monitoring ............................................................................ 241
6.2.2 HeatWave Memory Usage Monitoring ....................................................................... 241
6.2.3 Data Load Progress and Status Monitoring ................................................................ 241
6.2.4 Change Propagation Monitoring ................................................................................ 242
6.2.5 Query Execution Monitoring ...................................................................................... 243
6.2.6 Query History and Statistics Monitoring ..................................................................... 244
6.2.7 Scanned Data Monitoring ......................................................................................... 245
6.3 HeatWave AutoML Monitoring ............................................................................................. 246
6.4 HeatWave Performance Schema Tables .............................................................................. 247
6.4.1 The rpd_column_id Table ......................................................................................... 247
6.4.2 The rpd_columns Table ............................................................................................ 247
6.4.3 The rpd_exec_stats Table ......................................................................................... 248
6.4.4 The rpd_nodes Table ............................................................................................... 248
6.4.5 The rpd_preload_stats Table ..................................................................................... 250
6.4.6 The rpd_query_stats Table ....................................................................................... 251
6.4.7 The rpd_table_id Table ............................................................................................. 251
6.4.8 The rpd_tables Table ................................................................................................ 252
The MySQL Performance Schema collects statistics on the usage of HeatWave. Use SQL queries to
access this data and check the system status and performance.
As of MySQL 8.0.31, the Auto Shape Prediction feature in HeatWave Autopilot uses MySQL statistics
for the workload to assess the suitability of the current shape. Auto Shape Prediction provides prompts
to upsize the shape and improve system performance, or to downsize the shape if the system is
underutilized. This feature is available for HeatWave on AWS only.
You can also monitor HeatWave from the HeatWave Console for your platform:
• For HeatWave on OCI, the HeatWave Console uses Metrics. See Metrics in the HeatWave on OCI
Service Guide.
• HeatWave on AWS users can monitor HeatWave on the Performance page in the HeatWave
Console.
• For HeatWave for Azure, select Metrics on the details page for the HeatWave Cluster to access
Microsoft Azure Application Insights.
HeatWave Autopilot Report Table
When MySQL runs Advisor or Auto Parallel Load, it sends detailed output to the
heatwave_autopilot_report table in the sys schema. This includes execution logs and
generated load scripts.
The heatwave_autopilot_report table is a temporary table. It contains data from the last
execution of Advisor or Auto Parallel Load. Data is only available for the current session and is lost
when the session terminates or when the server shuts down.
Autopilot Report Table Query Examples
• View the generated DDL statements for Advisor recommendations, or see the commands that
would be executed by Auto Parallel Load in normal mode:
mysql> SELECT log->>"$.sql" AS "SQL Script"
FROM sys.heatwave_autopilot_report
WHERE type = "sql"
ORDER BY id;
• Concatenate Advisor or Auto Parallel Load generated DDL statements into a single string to
copy and paste for execution. The group_concat_max_len variable sets the result length in
bytes for the GROUP_CONCAT() function to accommodate a potentially long string. The default
group_concat_max_len setting is 1024 bytes.
mysql> SET SESSION group_concat_max_len = 1000000;
mysql> SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ')
FROM sys.heatwave_autopilot_report
WHERE type = "sql"
ORDER BY id;
• For Auto Parallel Load, view the number of load commands generated:
mysql> SELECT Count(*) AS "Total Load Commands Generated"
FROM sys.heatwave_autopilot_report
WHERE type = "sql";
• For Auto Parallel Load, view load script data for a particular table. For example, for a table named
orders (the LIKE pattern is illustrative; substitute your table name):
mysql> SELECT log->>"$.sql"
FROM sys.heatwave_autopilot_report
WHERE type = "sql" AND log->>"$.sql" LIKE '%orders%';

Data Load Progress and Status Monitoring
For information about load statuses, see Section 6.4.8, “The rpd_tables Table”.
• To view the time that the last successful recovery took across all tables.
mysql> SELECT variable_value
FROM performance_schema.global_status
WHERE variable_name="rapid_recovery_time";
+----------------+
| variable_value |
+----------------+
| N/A |
+----------------+
Change Propagation Monitoring
• To determine if change propagation is enabled for a particular table, query the POOL_TYPE data from
the HeatWave Performance Schema tables. RAPID_LOAD_POOL_TRANSACTIONAL indicates that
change propagation is enabled for the table. RAPID_LOAD_POOL_SNAPSHOT indicates that change
propagation is disabled.
mysql> USE performance_schema;
mysql> SELECT NAME, POOL_TYPE FROM rpd_tables,rpd_table_id
WHERE rpd_tables.ID = rpd_table_id.ID AND SCHEMA_NAME LIKE 'tpch';
+---------------+-------------------------------+
| NAME | POOL_TYPE |
+---------------+-------------------------------+
| tpch.orders | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.region | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.lineitem | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.supplier | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.partsupp | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.part | RAPID_LOAD_POOL_TRANSACTIONAL |
| tpch.customer | RAPID_LOAD_POOL_TRANSACTIONAL |
+---------------+-------------------------------+
Query Execution Monitoring
• As of MySQL 8.0.29, the Performance Schema statement event tables (see Performance
Schema Statement Event Tables), and the performance_schema.threads and
performance_schema.processlist tables include an EXECUTION_ENGINE column that
indicates whether a query was processed on the PRIMARY or SECONDARY engine, where the primary
engine is InnoDB and the secondary engine is HeatWave. The sys.processlist and
sys.x$processlist views in the MySQL sys schema also include an execution_engine column.
This query shows the schema, the first 50 characters of the query, and the execution engine that
processed the query:
mysql> SELECT CURRENT_SCHEMA, LEFT(DIGEST_TEXT, 50), EXECUTION_ENGINE
FROM performance_schema.events_statements_history
WHERE CURRENT_SCHEMA='tpch';
+----------------+----------------------------------------------------+------------------+
| CURRENT_SCHEMA | LEFT(DIGEST_TEXT, 50) | EXECUTION_ENGINE |
+----------------+----------------------------------------------------+------------------+
| tpch | SELECT COUNT(*) FROM tpch.LINEITEM | SECONDARY |
+----------------+----------------------------------------------------+------------------+
• As of MySQL 8.0.29, the Performance Schema statement summary tables (see Statement Summary
Tables) include a COUNT_SECONDARY column that indicates the number of times a query was
processed on the SECONDARY engine (HeatWave).
This query retrieves the total number of secondary engine execution events from the
events_statements_summary_by_digest table:
mysql> SELECT SUM(COUNT_SECONDARY)
FROM performance_schema.events_statements_summary_by_digest;
+----------------------+
| SUM(COUNT_SECONDARY) |
+----------------------+
| 25 |
+----------------------+
This query counts all engine execution events for a particular schema and shows how many
occurred on the primary engine (InnoDB) and how many occurred on the secondary engine
(HeatWave):
mysql> SELECT SUM(COUNT_STAR) AS TOTAL_EXECUTIONS,
SUM(COUNT_STAR) - SUM(COUNT_SECONDARY) AS PRIMARY_ENGINE,
SUM(COUNT_SECONDARY) AS SECONDARY_ENGINE
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'tpch';
Query History and Statistics Monitoring
• QUERY_ID
The ID assigned to the query by HeatWave. IDs are assigned in first in first out (FIFO) order.
• CONNECTION_ID
• QUERY_START
• QUERY_END
• QUEUE_WAIT
• EXEC_START
• To view the number of records in the rpd_query_stats table. The rpd_query_stats table
stores query compilation and execution statistics (the query history) for the last 1000 successfully
executed queries.
mysql> SELECT COUNT(*) FROM performance_schema.rpd_query_stats;
+----------+
| count(*) |
+----------+
| 1000 |
+----------+
• To view query IDs for the first and last successfully executed queries:
mysql> SELECT MIN(QUERY_ID), MAX(QUERY_ID) FROM performance_schema.rpd_query_stats;
+---------------+---------------+
| MIN(QUERY_ID) | MAX(QUERY_ID) |
+---------------+---------------+
| 2 | 1001 |
+---------------+---------------+
• To view the query count for a table and the last time the table was queried:
mysql> USE performance_schema;
mysql> SELECT rpd_table_id.TABLE_NAME, rpd_tables.NROWS, rpd_tables.QUERY_COUNT,
rpd_tables.LAST_QUERIED FROM rpd_table_id, rpd_tables
WHERE rpd_table_id.ID = rpd_tables.ID;
+------------+---------+-------------+----------------------------+
| TABLE_NAME | NROWS | QUERY_COUNT | LAST_QUERIED |
+------------+---------+-------------+----------------------------+
| orders | 1500000 | 1 | 2021-12-06 14:32:59.868141 |
+------------+---------+-------------+----------------------------+
Scanned Data Monitoring
To view a cumulative total of data scanned (in MB) by all successfully executed HeatWave queries
from the time the HeatWave Cluster was last started, query the hw_data_scanned global status
variable. For example:
mysql> SHOW GLOBAL STATUS LIKE 'hw_data_scanned';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| hw_data_scanned | 66 |
+-----------------+-------+
The cumulative total does not include data scanned by failed queries, queries that were not offloaded
to HeatWave, or EXPLAIN queries.
The hw_data_scanned value is reset to 0 only when the HeatWave Cluster is restarted.
If a subset of HeatWave nodes goes offline, HeatWave retains the cumulative total of scanned data
as long as the HeatWave Cluster remains in an active state. When the HeatWave Cluster becomes
fully operational and starts processing queries again, HeatWave resumes tracking the amount of data
scanned, adding to the cumulative total.
To view the data scanned by each query, along with any associated error message, query the
rpd_query_stats table. For example:
mysql> SELECT query_id,
JSON_EXTRACT(JSON_UNQUOTE(qkrn_text->'$**.sessionId'),'$[0]') AS session_id,
JSON_EXTRACT(JSON_UNQUOTE(qkrn_text->'$**.totalBaseDataScanned'), '$[0]') AS data_scanned,
JSON_EXTRACT(JSON_UNQUOTE(qexec_text->'$**.error'),'$[0]') AS error_message
FROM performance_schema.rpd_query_stats;
+----------+------------+--------------+---------------+
| query_id | session_id | data_scanned | error_message |
+----------+------------+--------------+---------------+
| 1 | 8 | 66 | "" |
+----------+------------+--------------+---------------+
The example above retrieves any error message associated with each query ID. If a query fails or is
interrupted, the number of bytes scanned by the failed or interrupted query and the associated error
message are returned, as shown in the following example:
mysql> SELECT query_id,
JSON_EXTRACT(JSON_UNQUOTE(qkrn_text->'$**.sessionId'),'$[0]') AS session_id,
JSON_EXTRACT(JSON_UNQUOTE(qkrn_text->'$**.totalBaseDataScanned'), '$[0]') AS data_scanned,
JSON_EXTRACT(JSON_UNQUOTE(qexec_text->'$**.error'),'$[0]') AS error_message
FROM performance_schema.rpd_query_stats;
+----------+------------+--------------+------------------------------------------+
| query_id | session_id | data_scanned | error_message |
+----------+------------+--------------+------------------------------------------+
| 1 | 8 | 461 | "Operation was interrupted by the user." |
+----------+------------+--------------+------------------------------------------+
HeatWave AutoML Monitoring
The rapid_ml_status variable provides the status of HeatWave AutoML. Possible values are ON
and OFF.
You can query the rapid_ml_status status variable directly or through the
performance_schema.global_status table; for example:
mysql> SHOW GLOBAL STATUS LIKE 'rapid_ml_status';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| rapid_ml_status | ON |
+-----------------+-------+
The HeatWave plugin writes HeatWave AutoML status information to the ML_STATUS column of the
performance_schema.rpd_nodes table after each ML query. Possible values include
AVAIL_MLSTATE and DOWN_MLSTATE.
The following query retrieves ID, STATUS, and ML_STATUS for each HeatWave node from the
performance_schema.rpd_nodes table:
mysql> SELECT ID, STATUS, ML_STATUS FROM performance_schema.rpd_nodes;
+----+---------------+---------------+
| ID | STATUS | ML_STATUS |
+----+---------------+---------------+
| 1 | AVAIL_RNSTATE | AVAIL_MLSTATE |
| 0 | AVAIL_RNSTATE | AVAIL_MLSTATE |
+----+---------------+---------------+
If rapid_ml_status is OFF or ML_STATUS reports DOWN_MLSTATE for any HeatWave node, you
can restart the HeatWave Cluster in the HeatWave Console, but be aware that restarting interrupts
any analytics queries that are running. See Managing a HeatWave Cluster in the HeatWave on OCI Service
Guide.
HeatWave Performance Schema Tables
Information about HeatWave nodes is available only when rapid_bootstrap mode is ON.
Information about tables and columns is available only after tables are loaded in the HeatWave Cluster.
See Section 2.2, “Loading Data to HeatWave MySQL”.
The rpd_column_id Table
• ID
• TABLE_ID
• COLUMN_NAME
The rpd_columns Table
• TABLE_ID
• COLUMN_ID
• NDV
• ENCODING
• DATA_PLACEMENT_INDEX
The data placement key index ID associated with the column. Index value ranges from 1 to
16. For information about data placement key index values, see Section 2.7.2, “Defining Data
Placement Keys”. NULL indicates that the column is not defined as a data placement key.
For a DATA_PLACEMENT_INDEX query that identifies columns with data placement keys, see
Section 2.16, “Metadata Queries”.
• DICT_SIZE_BYTES
The rpd_exec_stats Table
The rpd_exec_stats table stores query execution statistics produced by HeatWave nodes in JSON
format. One row of execution statistics is stored for each node that participates in the query. The table
stores a maximum of 1000 queries.
For HeatWave AutoML routines that include multiple sub-queries, such as ML_TRAIN, a new record is
used for each query.
• QUERY_ID
The query ID. The counter auto-increments for each RAPID or, as of MySQL 8.0.31, HeatWave
AutoML query.
• NODE_ID
• EXEC_TEXT
Query execution statistics. For HeatWave AutoML, this contains the HeatWave AutoML routine that
the user runs.
The rpd_nodes Table
• ID
• CORES
• MEMORY_USAGE
Node memory usage in bytes. The value is refreshed every four seconds. If a query starts and
finishes in the four seconds between refreshes, the memory used by the query is not accounted for
in the reported value.
• BASEREL_MEMORY_USAGE
• STATUS
• NOTAVAIL_RNSTATE
Not available.
• AVAIL_RNSTATE
Available.
• DOWN_RNSTATE
Down.
• SPARE_RNSTATE
Spare.
• DEAD_RNSTATE
• IP
• PORT
• CLUSTER_EVENT_NUM
The number of cluster events such as node down, node up, and so on.
• NUM_OBJSTORE_GETS
• NUM_OBJSTORE_PUTS
The number of PUT requests from the node to the object store.
• NUM_OBJSTORE_DELETES
The number of DELETE requests from the node to the object store.
• ML_STATUS
The rpd_nodes table may not show the current status for a new node or newly configured node
immediately. The rpd_nodes table is updated after the node has successfully joined the cluster.
If additional nodes fail while node recovery is in progress, the newly failed nodes are not detected and
their status is not updated in the performance_schema.rpd_nodes table until after the current
recovery operation finishes and the nodes that failed previously have rejoined the cluster.
The rpd_preload_stats Table
• TABLE_SCHEMA
• TABLE_NAME
• COLUMN_NAME
• AVG_BYTE_WIDTH_INC_NULL
The average byte width of the column. The average value includes NULL values.
• Statistics are deterministic, provided that the data does not change.
The rpd_query_stats Table
The rpd_query_stats table stores query compilation and execution statistics produced by the
HeatWave plugin in JSON format. One row of data is stored for each query. The table stores data for
the last 1000 executed queries. Data is stored for successfully processed queries and failed queries.
For HeatWave AutoML routines that include multiple sub-queries, such as ML_TRAIN, a new record is
used for each query.
• CONNECTION_ID
• QUERY_ID
The query ID. The counter auto-increments for each RAPID or, as of MySQL 8.0.31, HeatWave
AutoML query.
• QUERY_TEXT
The RAPID engine query or HeatWave AutoML query run by the user.
• QEXEC_TEXT
Query execution log. For HeatWave AutoML, this contains the arguments the user passed to the
HeatWave AutoML routine call.
• QKRN_TEXT
Logical query execution plan. This field is not used for HeatWave AutoML queries.
• QEP_TEXT
Physical query execution plan. This field is not used for HeatWave AutoML queries.
Includes prepart data, which can be queried to determine if a JOIN or GROUP BY query used data
placement partitions. See Section 2.16, “Metadata Queries”.
• STATEMENT_ID

The rpd_table_id Table
• ID
• NAME
• SCHEMA_NAME
• TABLE_NAME
The rpd_tables Table
• ID
• SNAPSHOT_SCN
The system change number (SCN) of the table snapshot. The SCN is an internal number that
represents a point in time according to the system logical clock that the table snapshot was
transactionally consistent with the source table.
• PERSISTED_SCN
• POOL_TYPE
The load pool type of the table. Possible values are RAPID_LOAD_POOL_SNAPSHOT and
RAPID_LOAD_POOL_TRANSACTIONAL.
• DATA_PLACEMENT_TYPE
• NROWS
The number of rows that are loaded for the table. The value is set initially when the table is loaded,
and updated as changes are propagated.
• LOAD_STATUS
• NOLOAD_RPDGSTABSTATE
• LOADING_RPDGSTABSTATE
• AVAIL_RPDGSTABSTATE
• UNLOADING_RPDGSTABSTATE
• INRECOVERY_RPDGSTABSTATE
The table is being recovered. After completion of the recovery operation, the table is placed back
in the UNAVAIL_RPDGSTABSTATE state if there are pending recoveries.
• STALE_RPDGSTABSTATE
A failure occurred during change propagation and the table has become stale. See Section 2.2.7,
“Change Propagation”.
• UNAVAIL_RPDGSTABSTATE
• LOAD_PROGRESS
• 70% - 80%: the transformation to native HeatWave format is complete and the aggregation phase
is in progress.
• SIZE_BYTES
• TRANSFORMATION_BYTES
• NROWS
• QUERY_COUNT
• LAST_QUERIED
• LOAD_START_TIMESTAMP
• LOAD_END_TIMESTAMP
• RECOVERY_SOURCE
Indicates the source of the last successful recovery for a table. The possible values are MySQL (that
is, InnoDB) and ObjectStorage.
• RECOVERY_START_TIMESTAMP
• RECOVERY_END_TIMESTAMP
Chapter 7 HeatWave Quickstarts
Table of Contents
7.1 tpch Analytics Quickstart ..................................................................................................... 255
7.1.1 tpch Prerequisites ..................................................................................................... 255
7.1.2 Generating tpch Sample Data ................................................................................... 255
7.1.3 Creating the tpch Sample Database and Importing Data ............................................. 256
7.1.4 Loading tpch Data Into HeatWave MySQL ................................................................. 258
7.1.5 Running tpch Queries ............................................................................................... 260
7.1.6 Additional tpch Queries ............................................................................................. 261
7.1.7 Unloading tpch Tables .............................................................................................. 262
7.2 AirportDB Analytics Quickstart ............................................................................................. 263
7.2.1 AirportDB Prerequisites ............................................................................................. 263
7.2.2 Installing AirportDB ................................................................................................... 263
7.2.3 Loading AirportDB into HeatWave MySQL ................................................................. 264
7.2.4 Running AirportDB Queries ....................................................................................... 264
7.2.5 Additional AirportDB Queries ..................................................................................... 266
7.2.6 Unloading AirportDB Tables ...................................................................................... 267
7.3 Iris Data Set Machine Learning Quickstart ............................................................................ 267
• For HeatWave on OCI, see Creating a DB System in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a DB System in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave in the HeatWave for Azure Service Guide.
• For HeatWave on OCI, see Adding a HeatWave Cluster in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a HeatWave Cluster in the HeatWave on AWS Service
Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure Service
Guide.
• For HeatWave on OCI, see Connecting to a DB System in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Connecting with MySQL Shell in the HeatWave on AWS Service
Guide.
Generating tpch Sample Data
The following instructions describe how to generate tpch sample data using the dbgen utility. The
instructions assume you are on a Linux system with gcc and make installed.
1. Download the TPC-H tools zip file from TPC Download Current.
3. Change to the dbgen directory and make a copy of the makefile template.
$> cd 2.18.0/dbgen
$> cp makefile.suite makefile
6. Issue the following dbgen command to generate a 1GB set of data files for the tpch database:
$> ./dbgen -s 1
The operation may take a few minutes. When finished, the following data files appear in the working
directory, one for each table in the tpch database:
$> ls -1 *.tbl
customer.tbl
lineitem.tbl
nation.tbl
orders.tbl
partsupp.tbl
part.tbl
region.tbl
supplier.tbl
Sample database creation and import operations are performed using MySQL Shell. The MySQL Shell
Parallel Table Import Utility provides fast data import for large data files. The utility analyzes an input
data file, divides it into chunks, and uploads the chunks to the target MySQL DB System using parallel
connections. The utility is capable of completing a large data import many times faster than a standard
single-threaded upload using a LOAD DATA statement. For additional information, see Parallel Table
Import Utility.
To create the tpch sample database on the MySQL DB System and import data:
The --mysql option opens a ClassicSession, which is required when using the MySQL Shell
Parallel Table Import Utility.
Change the MySQL Shell execution mode to SQL:
MySQL>JS> \sql
5. Change back to JavaScript execution mode to use the Parallel Table Import Utility:
MySQL>SQL> \js
6. Execute the following operations to import the data into the tpch database on the MySQL DB
System.
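A minimal sketch for one table, assuming the pipe-delimited .tbl files generated earlier are in the
working directory (repeat for each table, and adjust the thread count as needed):
MySQL>JS> util.importTable("lineitem.tbl", {schema: "tpch", table: "lineitem",
fieldsTerminatedBy: "|", threads: 8})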
Loading tpch Data Into HeatWave MySQL
Note
For HeatWave on AWS, load data into HeatWave using the HeatWave Console.
See Manage Data in HeatWave with Workspaces in the HeatWave on AWS
Service Guide.
3. Execute the following operations to prepare the tpch sample database tables and load them
into the HeatWave Cluster. The operations performed include defining string column encodings,
defining the secondary engine, and executing SECONDARY_LOAD operations.
mysql> USE tpch;
mysql> ALTER TABLE nation modify `N_NAME` CHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE nation modify `N_COMMENT` VARCHAR(152) COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE nation SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE nation SECONDARY_LOAD;
mysql> ALTER TABLE region modify `R_NAME` CHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE region modify `R_COMMENT` VARCHAR(152) COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE region SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE region SECONDARY_LOAD;
mysql> ALTER TABLE part modify `P_MFGR` CHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE part modify `P_BRAND` CHAR(10) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE part modify `P_CONTAINER` CHAR(10) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE part modify `P_COMMENT` VARCHAR(23) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE part SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE part SECONDARY_LOAD;
mysql> ALTER TABLE supplier modify `S_NAME` CHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE supplier modify `S_ADDRESS` VARCHAR(40) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE supplier modify `S_PHONE` CHAR(15) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE supplier SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE supplier SECONDARY_LOAD;
mysql> ALTER TABLE customer modify `C_NAME` VARCHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE customer modify `C_ADDRESS` VARCHAR(40) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE customer modify `C_MKTSEGMENT` CHAR(10) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE customer modify `C_COMMENT` VARCHAR(117) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE customer SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE customer SECONDARY_LOAD;
mysql> ALTER TABLE orders modify `O_ORDERSTATUS` CHAR(1) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE orders modify `O_ORDERPRIORITY` CHAR(15) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE orders modify `O_CLERK` CHAR(15) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE orders SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE orders SECONDARY_LOAD;
mysql> ALTER TABLE lineitem modify `L_RETURNFLAG` CHAR(1) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE lineitem modify `L_LINESTATUS` CHAR(1) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE lineitem modify `L_SHIPINSTRUCT` CHAR(25) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE lineitem modify `L_SHIPMODE` CHAR(10) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE lineitem modify `L_COMMENT` VARCHAR(44) NOT NULL COMMENT 'RAPID_COLUMN=ENCODING=SORTED';
mysql> ALTER TABLE lineitem SECONDARY_ENGINE=RAPID;
mysql> ALTER TABLE lineitem SECONDARY_LOAD;
4. Verify that the tpch sample database tables are loaded in the HeatWave Cluster by querying
LOAD_STATUS data from the HeatWave Performance Schema tables. Loaded tables have an
AVAIL_RPDGSTABSTATE load status.
MySQL>SQL> USE performance_schema;
MySQL>SQL> SELECT NAME, LOAD_STATUS
FROM rpd_tables,rpd_table_id
WHERE rpd_tables.ID = rpd_table_id.ID;
+------------------------------+---------------------+
| NAME | LOAD_STATUS |
+------------------------------+---------------------+
| tpch.supplier | AVAIL_RPDGSTABSTATE |
| tpch.partsupp | AVAIL_RPDGSTABSTATE |
| tpch.orders | AVAIL_RPDGSTABSTATE |
| tpch.lineitem | AVAIL_RPDGSTABSTATE |
| tpch.customer | AVAIL_RPDGSTABSTATE |
| tpch.nation | AVAIL_RPDGSTABSTATE |
| tpch.region | AVAIL_RPDGSTABSTATE |
| tpch.part | AVAIL_RPDGSTABSTATE |
+------------------------------+---------------------+
Running tpch Queries
Note
For HeatWave on AWS, run queries from the Query Editor in the HeatWave
Console. See Running Queries in the HeatWave on AWS Service Guide.
4. Before running a query, use EXPLAIN to verify that the query can be offloaded to the HeatWave
Cluster. For example:
MySQL>SQL> EXPLAIN SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '1994-01-01';
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: lineitem
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 56834662
filtered: 33.33
Extra: Using where; Using secondary engine RAPID
If the query can be offloaded, the Extra column in the EXPLAIN output reports Using
secondary engine RAPID.
5. After verifying that the query can be offloaded, run the query and note the execution time.
6. To compare the HeatWave execution time with MySQL DB System execution time, disable the
use_secondary_engine variable to see how long it takes to run the same query on the MySQL
DB System. For example:
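A sketch that reuses the revenue query from the EXPLAIN step:
MySQL>SQL> SET SESSION use_secondary_engine=OFF;
MySQL>SQL> SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '1994-01-01';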
For other tpch sample database queries that you can run, see Section 7.1.6, “Additional tpch
Queries”. For more information about running queries, refer to Section 2.3, “Running Queries”.
Additional tpch Queries
As described in the TPC Benchmark™ H (TPC-H) specification: "The Pricing Summary Report
Query provides a summary pricing report for all lineitems shipped as of a given date. The date is
within 60 - 120 days of the greatest ship date contained in the database. The query lists totals for
extended price, discounted extended price, discounted extended price plus tax, average quantity,
average extended price, and average discount. These aggregates are grouped by RETURNFLAG
and LINESTATUS, and listed in ascending order of RETURNFLAG and LINESTATUS. A count of the
number of lineitems in each group is included."
mysql> SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY l_returnflag , l_linestatus
ORDER BY l_returnflag , l_linestatus;
As described in the TPC Benchmark™ H (TPC-H) specification: "The Shipping Priority Query
retrieves the shipping priority and potential revenue, defined as the sum of l_extendedprice *
(1-l_discount), of the orders having the largest revenue among those that had not been shipped
as of a given date. Orders are listed in decreasing order of revenue. If more than 10 unshipped
orders exist, only the 10 orders with the largest revenue are listed."
mysql> SELECT
l_orderkey,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1995-03-15'
AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey , o_orderdate , o_shippriority
ORDER BY revenue DESC , o_orderdate
LIMIT 10;
As described in the TPC Benchmark™ H (TPC-H) specification: "The Product Type Profit Measure
Query finds, for each nation and each year, the profit for all parts ordered in that year that contain
a specified substring in their names and that were filled by a supplier in that nation. The profit
is defined as the sum of [(l_extendedprice*(1-l_discount)) - (ps_supplycost *
l_quantity)] for all lineitems describing parts in the specified line. The query lists the nations in
ascending alphabetical order and, for each nation, the year and profit in descending order by year
(most recent first). "
mysql> SELECT
nation, o_year, SUM(amount) AS sum_profit
FROM
(SELECT
n_name AS nation,
YEAR(o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
FROM
part
STRAIGHT_JOIN partsupp
STRAIGHT_JOIN lineitem
STRAIGHT_JOIN supplier
STRAIGHT_JOIN orders
STRAIGHT_JOIN nation
WHERE
s_suppkey = l_suppkey
AND ps_suppkey = l_suppkey
AND ps_partkey = l_partkey
AND p_partkey = l_partkey
AND o_orderkey = l_orderkey
AND s_nationkey = n_nationkey
AND p_name LIKE '%green%') AS profit
GROUP BY nation , o_year
ORDER BY nation , o_year DESC;
Unloading tpch Tables
Note
For HeatWave on AWS, unload data from HeatWave using the HeatWave
Console. See Manage Data in HeatWave with Workspaces in the HeatWave on
AWS Service Guide.
mysql> USE tpch;
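Then unload each loaded table with a SECONDARY_UNLOAD operation; one table is shown as a
sketch (repeat for each loaded table):
mysql> ALTER TABLE lineitem SECONDARY_UNLOAD;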
AirportDB Analytics Quickstart
For an online workshop that demonstrates HeatWave using the airportdb sample database, see
Turbocharge Business Insights with HeatWave Service and HeatWave.
• For HeatWave on OCI, see Creating a DB System in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a DB System in the HeatWave on AWS Service Guide.
• For HeatWave for Azure, see Provisioning HeatWave in the HeatWave for Azure Service Guide.
• For HeatWave on OCI, see Adding a HeatWave Cluster in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Creating a HeatWave Cluster in the HeatWave on AWS Service
Guide.
• For HeatWave for Azure, see Provisioning HeatWave Nodes in the HeatWave for Azure Service
Guide.
• For HeatWave on OCI, see Connecting to a DB System in the HeatWave on OCI Service Guide.
• For HeatWave on AWS, see Connecting with MySQL Shell in the HeatWave on AWS Service
Guide.
1. Download the airportdb sample database and unpack it. The airportdb sample database is
provided for download as a compressed tar or Zip archive. The download is approximately 640
MB in size.
$> wget https://downloads.mysql.com/docs/airport-db.zip
$> unzip airport-db.zip
Unpacking the compressed tar or Zip archive results in a single directory named airport-db,
which contains the data files.
2. Start MySQL Shell and connect to the MySQL DB System Endpoint. For additional information
about connecting to a MySQL DB System, see Connecting to a DB System in the HeatWave on
OCI Service Guide.
$> mysqlsh Username@DBSystem_IP_Address_or_Host_Name
3. Load the airportdb database into the MySQL DB System using the MySQL Shell Dump Loading
Utility.
MySQL>JS> util.loadDump("airport-db", {threads: 16, deferTableIndexes: "all",
ignoreVersion: true})
After the data is imported into the MySQL DB System, you can load the tables into HeatWave. For
instructions, see Section 7.2.3, “Loading AirportDB into HeatWave MySQL”.
Loading AirportDB into HeatWave MySQL
Note
For HeatWave on AWS, load data into HeatWave using the HeatWave Console.
See Manage Data in HeatWave with Workspaces in the HeatWave on AWS
Service Guide.
2. Change the MySQL Shell execution mode to SQL and run the following Auto Parallel Load
command to load the airportdb tables into HeatWave.
MySQL>JS> \sql
MySQL>SQL> CALL sys.heatwave_load(JSON_ARRAY('airportdb'), NULL);
For information about the Auto Parallel Load utility, see Section 2.2.3, “Loading Data Using Auto
Parallel Load”.
Running AirportDB Queries
Note
For HeatWave on AWS, run queries from the Query Editor in the HeatWave
Console. See Running Queries in the HeatWave on AWS Service Guide.
4. Before running a query, use EXPLAIN to verify that the query can be offloaded to the HeatWave
Cluster. For example:
MySQL>SQL> EXPLAIN SELECT booking.price, count(*)
FROM booking
WHERE booking.price > 500
GROUP BY booking.price
ORDER BY booking.price LIMIT 10;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: booking
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 54081693
filtered: 33.32999801635742
Extra: Using where; Using temporary; Using filesort; Using secondary engine RAPID
If the query can be offloaded, the Extra column in the EXPLAIN output reports Using
secondary engine RAPID.
5. After verifying that the query can be offloaded, run the query and note the execution time.
MySQL>SQL> SELECT booking.price, count(*)
FROM booking
WHERE booking.price > 500
GROUP BY booking.price
ORDER BY booking.price
LIMIT 10;
+--------+----------+
| price | count(*) |
+--------+----------+
| 500.01 | 860 |
| 500.02 | 1207 |
| 500.03 | 1135 |
| 500.04 | 1010 |
| 500.05 | 1016 |
| 500.06 | 1039 |
| 500.07 | 1002 |
| 500.08 | 1095 |
| 500.09 | 1117 |
| 500.10 | 1106 |
+--------+----------+
10 rows in set (0.0537 sec)
6. To compare the HeatWave execution time with MySQL DB System execution time, disable the
use_secondary_engine variable to see how long it takes to run the same query on the MySQL
DB System; for example:
MySQL>SQL> SET SESSION use_secondary_engine=OFF;
MySQL>SQL> SELECT booking.price, count(*)
FROM booking
WHERE booking.price > 500
GROUP BY booking.price
ORDER BY booking.price
LIMIT 10;
+--------+----------+
| price | count(*) |
+--------+----------+
| 500.01 | 860 |
| 500.02 | 1207 |
| 500.03 | 1135 |
| 500.04 | 1010 |
| 500.05 | 1016 |
| 500.06 | 1039 |
| 500.07 | 1002 |
| 500.08 | 1095 |
| 500.09 | 1117 |
| 500.10 | 1106 |
+--------+----------+
10 rows in set (9.3859 sec)
For other airportdb sample database queries that you can run, see Section 7.2.5, “Additional
AirportDB Queries”. For more information about running queries, see Section 2.3, “Running Queries”.
Unloading AirportDB Tables
Note
For HeatWave on AWS, unload data from HeatWave using the HeatWave
Console. See Manage Data in HeatWave with Workspaces in the HeatWave on
AWS Service Guide.
mysql> USE airportdb;
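Then unload each loaded table with a SECONDARY_UNLOAD operation; one table is shown as a
sketch (repeat for each loaded table):
mysql> ALTER TABLE booking SECONDARY_UNLOAD;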
Iris Data Set Machine Learning Quickstart
For an online workshop based on this tutorial, see Get started with MySQL HeatWave AutoML.
The tutorial uses the publicly available Iris Data Set from the UCI Machine Learning Repository.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine,
CA: University of California, School of Information.
The Iris Data Set has the following data, where the sepal and petal features are used to predict the
class label, which is the type of Iris plant:
• Iris Setosa
• Iris Versicolour
• Iris Virginica
Data is stored in the MySQL database in the following schema and tables:
• ml_data schema: The schema containing training and test dataset tables.
• iris_train table: The training dataset (labeled). Includes feature columns (sepal length, sepal
width, petal length, petal width) and a populated class target column with ground truth values.
• iris_test table: The test dataset (unlabeled). Includes feature columns (sepal length, sepal width,
petal length, petal width) but no target column.
• iris_validate table: The validation dataset (labeled). Includes feature columns (sepal length,
sepal width, petal length, petal width) and a populated class target column with ground truth values.
This tutorial assumes that you have met the prerequisites outlined in Section 3.2, “Before You Begin”.
1. Create the example schema and tables on the MySQL DB System with the following statements:
mysql> CREATE SCHEMA ml_data;
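A minimal sketch of the training table definition, assuming FLOAT feature columns and a VARCHAR
class label (iris_test and iris_validate follow the same pattern, with iris_test omitting the
class column):
mysql> CREATE TABLE ml_data.iris_train (
`sepal length` FLOAT,
`sepal width` FLOAT,
`petal length` FLOAT,
`petal width` FLOAT,
class VARCHAR(20));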
2. Train the model with ML_TRAIN. Since this is a classification dataset, use the classification
task to create a classification model:
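A sketch of the training call, using the training table, target column, and session variable named in
this tutorial:
mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class',
JSON_OBJECT('task', 'classification'), @iris_model);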
When the training operation finishes, the model handle is assigned to the @iris_model session
variable, and the model is stored in the model catalog. View the entry in the model catalog with the
following query. Replace user1 with the MySQL account name:
mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1.MODEL_CATALOG;
+----------+---------------------------------------+--------------------+
| model_id | model_handle | train_table_name |
+----------+---------------------------------------+--------------------+
| 1 | ml_data.iris_train_user1_1648140791 | ml_data.iris_train |
+----------+---------------------------------------+--------------------+
MySQL 8.0.31 does not run the ML_EXPLAIN routine with the default Permutation Importance
model after ML_TRAIN. For MySQL 8.0.31, run ML_EXPLAIN and use NULL for the options:
mysql> CALL sys.ML_EXPLAIN('ml_data.iris_train', 'class',
'ml_data.iris_train_user1_1648140791', NULL);
3. Load the model with the ML_MODEL_LOAD routine. A model must be loaded before you can use it.
The model remains loaded until you unload it or the HeatWave Cluster is restarted.
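A sketch of the load call (the second argument names the model owner; NULL defaults to the
current user):
mysql> CALL sys.ML_MODEL_LOAD(@iris_model, NULL);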
4. Make a prediction for a single row of data using the ML_PREDICT_ROW routine. In this example,
data is assigned to a @row_input session variable, and the variable is called by the routine. The
model handle is called using the @iris_model session variable:
mysql> SET @row_input = JSON_OBJECT(
"sepal length", 7.3,
"sepal width", 2.9,
"petal length", 6.3,
"petal width", 1.8);
Before MySQL 8.0.32, the ML_PREDICT_ROW routine does not include options, and the results do
not include the ml_results field:
mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @iris_model);
+---------------------------------------------------------------------------+
| sys.ML_PREDICT_ROW(@row_input, @iris_model) |
+---------------------------------------------------------------------------+
| {"Prediction": "Iris-virginica", "petal width": 1.8, "sepal width": 2.9, |
| "petal length": 6.3, "sepal length": 7.3} |
+---------------------------------------------------------------------------+
Based on the feature inputs that were provided, the model predicts that the Iris plant is of the class
Iris-virginica. The feature values used to make the prediction are also shown.
5. Now, generate an explanation for the same row of data using the ML_EXPLAIN_ROW routine with
the default Permutation Importance prediction explainer to understand how the prediction was
made:
mysql> SELECT sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,
"petal length", 6.3, "petal width", 1.8), @iris_model, NULL);
+------------------------------------------------------------------------------------------+
| sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,                   |
| "petal length", 6.3, "petal width", 1.8), @iris_model, NULL)                              |
+------------------------------------------------------------------------------------------+
| {"Notes": "petal width (1.8) had the largest impact towards predicting Iris-virginica",   |
| "Prediction": "Iris-virginica", ...}                                                      |
+------------------------------------------------------------------------------------------+
1 row in set (5.92 sec)
Before MySQL 8.0.32, the results do not include the ml_results field:
+------------------------------------------------------------------------------+
| sys.ML_EXPLAIN_ROW(JSON_OBJECT("sepal length", 7.3, "sepal width", 2.9,      |
| "petal length", 6.3, "petal width", 1.8), @iris_model)                       |
+------------------------------------------------------------------------------+
| {"Prediction": "Iris-virginica", "petal width": 1.8, "sepal width": 2.9,     |
| "petal length": 6.3, "sepal length": 7.3, "petal width_attribution": 0.73,   |
| "petal length_attribution": 0.57}                                            |
+------------------------------------------------------------------------------+
The attribution values show which features contributed most to the prediction, with petal length and
petal width being the most important features. The other features have a 0 value indicating that they
did not contribute to the prediction.
6. Make predictions for a table of data using the ML_PREDICT_TABLE routine. The routine takes data
from the iris_test table as input and writes the predictions to an iris_predictions output
table.
mysql> CALL sys.ML_PREDICT_TABLE('ml_data.iris_test', @iris_model,
'ml_data.iris_predictions', NULL);
Before MySQL 8.0.32, the ML_PREDICT_TABLE routine does not include options, and the results
do not include the ml_results column:
mysql> CALL sys.ML_PREDICT_TABLE('ml_data.iris_test', @iris_model,
'ml_data.iris_predictions');
The table shows the predictions and the feature column values used to make each prediction.
7. Generate explanations for the same table of data using the ML_EXPLAIN_TABLE routine with the
default Permutation Importance prediction explainer.
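A sketch of the call, writing explanations to a hypothetical ml_data.iris_explanations output
table:
mysql> CALL sys.ML_EXPLAIN_TABLE('ml_data.iris_test', @iris_model,
'ml_data.iris_explanations', NULL);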
Explanations help you understand which features have the most influence on a prediction. Feature
importance is presented as an attribution value ranging from -1 to 1. A positive value indicates that
a feature contributed toward the prediction. A negative value indicates that the feature contributes
positively towards one of the other possible predictions.
Before MySQL 8.0.32, the output table does not include the ml_results column:
8. Score the model with ML_SCORE to assess the reliability of the model. This example uses the
balanced_accuracy metric, which is one of the many scoring metrics that HeatWave AutoML
supports.
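A sketch of the call, scoring against the labeled iris_validate table (the trailing options
argument, NULL here, may not apply to older versions):
mysql> CALL sys.ML_SCORE('ml_data.iris_validate', 'class', @iris_model,
'balanced_accuracy', @score, NULL);
mysql> SELECT @score;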
To avoid consuming too much space, it is good practice to unload a model with the
ML_MODEL_UNLOAD routine when you are finished using it.
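A sketch of the unload call:
mysql> CALL sys.ML_MODEL_UNLOAD(@iris_model);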