Lab: Updating Dynamic Data in Place
Duration
This lab will require approximately 90 minutes to complete.
Scenario
Mary is a member of the data science team and works with a lot of streaming data that is
collected from IoT devices. Every time the devices are reset, the size and structure of the data
changes. A device that normally sends only a few fields on a regular basis might occasionally
send several fields. This is complex to handle given that the standard tools expect data to follow a
certain structure. Also, the changed data should affect only the requisite rows and not the entire
dataset.
Your challenge is to develop a proof of concept (POC) to accommodate the ever-changing
schema and only update the affected records.
You have decided to use an AWS Glue job with custom scripts to handle the dynamic schema and
the Apache Hudi Connector for in-place updates for streaming data. You will use Athena to run
SQL-like queries on the dynamic data and use Amazon S3 for a data lake. Finally, you will use
Amazon Kinesis Data Streams to ingest data that is randomly generated from the Amazon Kinesis Data Generator (KDG).
https://awsacademy.instructure.com/courses/96839/modules/items/8946944 1/14
1/2/25, 17:20 Lab: Updating Dynamic Data in Place
By the end of the lab, you will have created the architecture that is shown in the following diagram.
The table after the diagram provides a detailed explanation of the architecture.
Numbered Step | Detail
1 | You start an AWS Glue job. The KDG runs and sends data to a Kinesis data stream.
2 | The AWS Glue job runs a Python script to iterate through the stream.
3 | A Python script inserts or updates the data in an S3 bucket.
4 | The AWS Glue Data Catalog provides metadata, such as tables and columns, to Athena.
5 | Athena interacts with Amazon S3 using the metadata that the Data Catalog provides.
6 | You run queries in Athena to view the data.
7 | You change the schema and run queries to analyze the data.
8 | Finally, you revert the schema changes and run queries again to analyze the data.
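The core behavior in step 3, inserting a record or updating the row that shares its key, can be illustrated with a minimal Python sketch. This is a toy in-memory version of the upsert semantics that Hudi provides at scale on S3; the key fields (name and date) are an assumption for illustration, not taken from the lab's actual Hudi configuration.

```python
# Toy sketch of upsert (insert-or-update) semantics: records are keyed,
# and a new record either adds a row or overwrites the matching one.
# The (name, date) key is assumed here for illustration only.
def upsert(table: dict, record: dict) -> dict:
    """Insert the record, or update the existing row that shares its key."""
    key = (record["name"], record["date"])
    table[key] = {**table.get(key, {}), **record}
    return table

table = {}
upsert(table, {"name": "Sensor1", "date": "2025-01-02", "column_to_update_string": "45f"})
upsert(table, {"name": "Sensor1", "date": "2025-01-02", "column_to_update_string": "48f"})
print(len(table))  # 1 row: the second write updated the first in place
```

Only the affected row changes; every other row in the table is left untouched, which is the requirement stated in the scenario.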
2. To connect to the AWS Management Console, choose the AWS link in the upper-left corner.
A new browser tab opens and connects you to the console.
Tip: If a new browser tab does not open, a banner or icon is usually at the top of your
browser with the message that your browser is preventing the site from opening pop-up
windows. Choose the banner or icon, and then choose Allow pop-ups.
3. Retrieve values for resources that were created in the lab environment.
In the search box to the right of Services, search for and choose CloudFormation to
open the AWS CloudFormation console.
In the stacks list, choose the link for the stack name where the Description does not
contain ADE.
Choose the Outputs tab.
Outputs are listed for some of the resources in the stack as shown in the following image.
Resource | Description
S3 buckets | Used to store data for AWS Glue and Athena
Kinesis data stream | Required to ingest data from the KDG tool
AWS Glue database and table | Required to logically represent the data that is stored in Amazon S3
AWS Glue IAM role | Required to run the AWS Glue job
AWS Cloud9 environment | Required to run commands
Kinesis Data Generator | Amazon Cognito configuration for the Kinesis Data Generator (KDG)
In this task, you copied output values from the CloudFormation stack to a text file for later use.
wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/CUR-TF-200-ACDENG-1-91570/lab-06-hudi/s3/glue_job_script.py
wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/CUR-TF-200-ACDENG-1-91570/lab-06-hudi/s3/glue_job.template
Tip: To confirm that both files are successfully downloaded, you can run the ls command
to list them.
7. Retrieve the URL for the CloudFormation template that you uploaded for the AWS Glue job.
In the search box to the right of Services, search for and choose S3 to open the Amazon
S3 console.
Choose the link for the bucket name that contains ade-hudi-bucket.
Choose the templates link.
Select glue_job.template, and then choose Copy URL to copy the URL for the template.
Save the URL to your text editor.
In this task, you configured the scripts that are necessary to create and run the AWS Glue job.
{
"name" : "{{random.arrayElement(["Sensor1","Sensor2","Sensor3", "Sensor4"])}}",
"date": "{{date.utc(YYYY-MM-DD)}}",
"year": "{{date.utc(YYYY)}}",
"month": "{{date.utc(MM)}}",
"day": "{{date.utc(DD)}}",
"column_to_update_integer": {{random.number(1000000000)}},
"column_to_update_string":"{{random.arrayElement(["45f","47f","44f", "48f"])}}"
}
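A plain-Python approximation of the records this KDG template emits may help when reasoning about the downstream schema. The field names and value sets come from the template above; `random.choice` and `random.randrange` stand in for the template's `random.arrayElement` and `random.number` helpers, and the date handling is a simplification.

```python
# Approximation of a record produced by the KDG template above.
# Field names and value pools are taken from the template; the
# random/datetime calls merely mimic the KDG helper functions.
import random
from datetime import datetime, timezone

def sample_record() -> dict:
    now = datetime.now(timezone.utc)
    return {
        "name": random.choice(["Sensor1", "Sensor2", "Sensor3", "Sensor4"]),
        "date": now.strftime("%Y-%m-%d"),
        "year": now.strftime("%Y"),
        "month": now.strftime("%m"),
        "day": now.strftime("%d"),
        "column_to_update_integer": random.randrange(1_000_000_000),
        "column_to_update_string": random.choice(["45f", "47f", "44f", "48f"]),
    }

record = sample_record()
print(sorted(record))  # seven fields, matching the template
```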
Analysis: Note the four partition columns in the schema. These four columns are used to
partition data in the S3 bucket where the data for this table is stored. If you would like, you
can examine these partitions in the S3 bucket.
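Partitioned data in S3 is laid out under Hive-style `column=value` prefixes. Assuming the four partition columns are name, year, month, and day (matching the fields in the KDG template; the lab's actual partition configuration may differ), a record's prefix can be sketched as follows.

```python
# Sketch of a Hive-style partition prefix, as Hudi writes to S3.
# The partition columns (name, year, month, day) are assumed from
# the KDG template fields, not read from the lab's Hudi config.
def partition_prefix(record: dict,
                     columns=("name", "year", "month", "day")) -> str:
    return "/".join(f"{col}={record[col]}" for col in columns)

rec = {"name": "Sensor2", "year": "2025", "month": "01", "day": "02"}
print(partition_prefix(rec))  # name=Sensor2/year=2025/month=01/day=02
```

Listing the bucket in the S3 console shows these same prefixes as nested folders.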
Note: The KDG is simulating IoT devices. In this case, the tool is simulating temperature
sensors.
Run the query multiple times to see that the values are changing, as shown in the
following image.
In this task, you used Athena to query the data and observed how the data changed.
{
"name" : "{{random.arrayElement(["Sensor1","Sensor2","Sensor3", "Sensor4"])}}",
"date": "{{date.utc(YYYY-MM-DD)}}",
"year": "{{date.utc(YYYY)}}",
"month": "{{date.utc(MM)}}",
"day": "{{date.utc(DD)}}",
"column_to_update_integer": {{random.number(1000000000)}},
"column_to_update_string": "{{random.arrayElement(["45f","47f","44f","48f"])}}",
"new_column": "{{random.number(1000000000)}}"
}
Now, run the query again and observe the changes in the Sensor 3 values.
The results are similar to the following.
Analysis: When the schema was changed to add the new column, the AWS Glue job relied
on the schema evolution capabilities that are built into Hudi. These capabilities allow the
AWS Glue Data Catalog to be updated with the new column. Hudi also added the extra
column to the output files (Parquet files that are written to Amazon S3). As a result, the query
engine (Athena) can query the Hudi dataset with the extra column without any issues. For more
information, see Schema Evolution on the Apache Hudi website.
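The read-side effect of this evolution can be sketched in a few lines: rows written before the new column existed surface a null for it when read under the evolved schema. The column names follow the lab; the projection logic below is illustrative, not Hudi's actual implementation.

```python
# Sketch of schema evolution on read: old rows lack new_column, so a
# read under the evolved schema fills it with None (null in Athena).
# Illustrative only; this is not how Hudi is implemented internally.
evolved_schema = ["name", "date", "column_to_update_string", "new_column"]

old_row = {"name": "Sensor3", "date": "2025-01-02",
           "column_to_update_string": "44f"}
new_row = {"name": "Sensor3", "date": "2025-01-02",
           "column_to_update_string": "47f", "new_column": 123456789}

def project(row: dict, schema: list) -> dict:
    """Read a row under a schema, filling absent columns with None."""
    return {col: row.get(col) for col in schema}

print(project(old_row, evolved_schema)["new_column"])  # None
print(project(new_row, evolved_schema)["new_column"])  # 123456789
```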
In this task, you modified the schema and observed how the AWS Glue job handled the change.
You were able to run Athena queries and perform data analysis without any issues after modifying
the schema.
Notice that the additional column, new_column, is still included in the table.
Run the following query multiple times.
Notice that new_column is still included in the query results; however, that column
doesn't contain any values.
Analysis: After you changed the schema again and removed new_column from the data,
the Python script in the AWS Glue job handled the record layout mismatches.
For each incoming record, the script queries the AWS Glue Data Catalog to get
the current Hudi table schema. It then merges the table schema with the schema of
the incoming record and enriches the result with null values for new_column.
This enables Athena to query the Hudi dataset without any issues.
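The merge-and-enrich step described above can be sketched as follows. The table schema and record fields use the lab's column names; the logic is a conceptual mimic of what the Glue script does, not its actual code.

```python
# Conceptual sketch of the Glue script's schema merge: the current table
# schema (which still contains new_column) is merged with the incoming
# record's fields, and any column the record lacks is enriched with null.
table_schema = ["name", "date", "column_to_update_integer",
                "column_to_update_string", "new_column"]

# Incoming record after the schema was reverted: no new_column.
incoming = {"name": "Sensor1", "date": "2025-01-03",
            "column_to_update_integer": 42,
            "column_to_update_string": "45f"}

merged_schema = table_schema + [c for c in incoming if c not in table_schema]
enriched = {col: incoming.get(col) for col in merged_schema}

print(enriched["new_column"])  # None: Athena still sees the column, unvalued
```

This is why the query results still show new_column, but without any values in it.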
In this task, you reverted the schema and observed that records were updated in place.
Tip: You can submit your work multiple times. After you change your work, choose Submit
again. Your last submission is recorded for this lab.
18. To find detailed feedback about your work, choose Submission Report.
Lab complete
Congratulations! You have completed the lab.
19. At the top of this page, choose End Lab, and then choose Yes to confirm that you want to
end the lab.
A message panel indicates that the lab is ending.
© 2022, Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be
reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited.