Retail Data Management Ps
Overview
This project enhances ABC Retail's data management capabilities using AWS
Glue, an ETL service, to streamline data processing workflows and derive
actionable insights from sales and product data. It aims to integrate and analyze
data from multiple sources to understand customer behavior, product
performance, and market trends. This integration leads to data-driven decisions
that optimize inventory management, marketing strategies, and overall business
performance.
Instructions
• Review the learning materials in the ETL course
• Carefully read the situation, tasks, actions, and result sections to grasp the
assignment fully
• Complete and submit your assignment via the Learning Management
System (LMS)
• Follow the provided guidelines closely, ensuring your report includes all
required analyses and interpretations
Situation
You are a data analyst at ABC Retail, tasked with improving data processing and
analysis workflows. ABC Retail aims to leverage AWS Glue to streamline its data
processing workflows and derive actionable insights from its sales and product
data. Your role is crucial in unlocking insights from the company's vast stores of
data to drive business growth and enhance operational efficiency.
Task
Your task is to use AWS Glue to join the Products and Orders tables based on
ProductID, ensuring data integrity. Develop a script to cleanse the Sales column,
converting it to a numerical format. Create a transformation to calculate net
sales and remove duplicates for optimized analysis, then summarize the average
sales by category and ship mode.
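The cleansing step in the task can be sketched in plain Python. This is only an illustration, assuming Sales values arrive as strings such as "$140" and possibly with thousands separators; the helper name parse_sales is hypothetical and is not part of AWS Glue or of the assignment materials.

```python
import re
from typing import Optional

# Matches an optionally signed integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_sales(raw: Optional[str]) -> Optional[float]:
    """Convert a currency string such as "$1,234.50" to a float.

    Returns None when the input is missing or contains no number.
    (Hypothetical helper; the input format is an assumption.)
    """
    if raw is None:
        return None
    # Drop thousands separators so "1,234.50" becomes "1234.50".
    cleaned = raw.replace(",", "")
    match = _NUMBER.search(cleaned)
    return float(match.group()) if match else None
```

Returning None for unparseable values, rather than raising an error, keeps bad rows visible so they can be counted or filtered before the averages are computed.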
Action
1. Log in to the AWS Console:
• Open your web browser and navigate to the AWS Management
Console
• Log in with your AWS account credentials
• Navigate to S3
• Click on Create bucket and enter etl-cep-01 as the bucket name. Scroll
down and click on the Create bucket button.
• Fill in the second classifier details as given below, then click on Create
o Classifier name as txnClass
o Classifier type and properties as CSV
o CSV Serde – optional as None
o Column delimiter as comma (,)
o Quote symbol as Double-quote (")
o Column headings as Has headings and fill in the details as given
below:
o Order ID, Order Date, Ship Date, Aging, Ship Mode,
Product ID, Sales, Quantity, Discount, Profit, Shipping
Cost, Order Priority, Customer ID
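For reference, the same classifier settings could be expressed through the AWS SDK for Python instead of the console. This is a minimal sketch, not the method the assignment requires: it assumes boto3 is installed and credentials are configured, and it assumes that the console option "Has headings" corresponds to ContainsHeader="PRESENT". The function names are hypothetical.

```python
from typing import Any, Dict

# Column headings exactly as listed in the classifier step above.
HEADER = [
    "Order ID", "Order Date", "Ship Date", "Aging", "Ship Mode",
    "Product ID", "Sales", "Quantity", "Discount", "Profit",
    "Shipping Cost", "Order Priority", "Customer ID",
]


def build_csv_classifier_request() -> Dict[str, Any]:
    """Build the keyword arguments for Glue's create_classifier call."""
    return {
        "CsvClassifier": {
            "Name": "txnClass",
            "Delimiter": ",",
            "QuoteSymbol": '"',
            # Assumption: "Has headings" in the console maps to PRESENT.
            "ContainsHeader": "PRESENT",
            "Header": HEADER,
        }
    }


def create_classifier(region_name: str) -> None:
    """Send the request to AWS Glue (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_classifier(**build_csv_classifier_request())
```

Keeping the request in a separate builder function makes it possible to review the settings without making any AWS call.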
7. Set up a Crawler
• Navigate to AWS Glue and click on Databases from the Data Catalog
and select abc-retail database
• Click on Add tables using a crawler
• Enter the name as retail-crawl and click on Next
• Click on Add a data source
• Click on Browse S3 and click on etl-cep-01, then select
transaction-files/ and click on Choose
• Click on Add an S3 data source
• Choose txnClass from the Custom classifiers – optional dropdown
and click on Next
• Choose glue-role in the IAM role section and click on Next
• Choose Target database as abc-retail and enter the table name
prefix as txn and click on Next
• Click on Create crawler
• Click on Run crawler
Note: Repeat the above steps for the Product dataset as well. When choosing
the classifier, select cust_classifier.
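The crawler settings for the transaction files can also be written as an SDK request. The sketch below is an assumption-laden illustration rather than a required step: it assumes boto3 is installed, that glue-role already exists, and that the data sits under s3://etl-cep-01/transaction-files/. The function names are hypothetical.

```python
from typing import Any, Dict


def build_crawler_request() -> Dict[str, Any]:
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": "retail-crawl",
        "Role": "glue-role",
        "DatabaseName": "abc-retail",
        "TablePrefix": "txn",
        "Classifiers": ["txnClass"],
        # Assumption: the S3 path is derived from the bucket and folder above.
        "Targets": {
            "S3Targets": [{"Path": "s3://etl-cep-01/transaction-files/"}]
        },
    }


def create_and_run_crawler(region_name: str) -> None:
    """Create the crawler and start it (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    request = build_crawler_request()
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

With the prefix txn and the folder transaction-files/, the crawler is expected to produce a table named txntransaction_files, which matches the table selected in the ETL job.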
8. Create ETL job
• Navigate to AWS Glue, click on ETL jobs, and click on Visual ETL
• In the Add nodes, double-click on AWS Glue Data Catalog
• Select Join from the add nodes and link Join to both the AWS Glue
Data Catalog
• Click on the Join box, and then select Drop Fields from the Add nodes
• Click on the Drop Fields box, and then select Regex Extractor from
the Add nodes
• Click on the Regex Extractor box, and then select Aggregate from the
Add nodes
• Click on the Aggregate box, and then select Amazon S3 from the
Targets in Add nodes
• Click on the first AWS Glue Data Catalog box, and select abc-retail in
the Database dropdown and select txntransaction_files under the
Table dropdown
• Click on the second AWS Glue Data Catalog box and select abc-retail
in the Database dropdown and select product_files under Table
dropdown
• Select the Join box and add both AWS Glue Data Catalog nodes as
node parents. Select Inner join as the Join type and, in the Join
conditions box, select product id for both AWS Glue Data Catalog
nodes
• Click on the Drop Fields box and, in the DropFields section, select
product id, since it appears twice after the join
• The sales values contain a $ symbol, so the numeric part must be
extracted. Click on Regex Extractor and fill in the following fields:
o Column to extract from as sales
o Regular expression as \d+
o Extracted column as NetSales
• Now, let's create our summary report using the Aggregate block.
Fill in the following fields:
o Fields to group by as product category and ship mode
o Field to aggregate as sales
o Aggregation function as avg
• Click on the Amazon S3 block, then click on Browse S3 and select
etl-cep-output-01
• In the top-left corner, click on Untitled job and rename it to etl-cep-job,
click on Save, and then click on Run
• Check the progress in Runs
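To check what the visual job should produce, the same logic can be sketched in plain Python on a few rows. This is not the Glue job itself: the sample rows and column names are made up for illustration, the duplicate-removal step comes from the task description rather than from a node listed above, and averaging the extracted NetSales column (instead of the raw sales string) is an assumption.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up sample rows for illustration only; real data will differ.
orders = [
    {"order_id": "O1", "product_id": "P1", "ship_mode": "First Class", "sales": "$140"},
    {"order_id": "O2", "product_id": "P1", "ship_mode": "First Class", "sales": "$160"},
    {"order_id": "O3", "product_id": "P2", "ship_mode": "Standard", "sales": "$90"},
    {"order_id": "O3", "product_id": "P2", "ship_mode": "Standard", "sales": "$90"},  # exact duplicate
    {"order_id": "O4", "product_id": "P9", "ship_mode": "Standard", "sales": "$50"},  # no matching product
]
products = [
    {"product_id": "P1", "product_category": "Fashion"},
    {"product_id": "P2", "product_category": "Home"},
]


def summarize(orders: List[dict], products: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average NetSales per (product_category, ship_mode)."""
    # Inner join on product_id; only one product_id column is kept.
    by_id = {p["product_id"]: p for p in products}
    joined = [
        {**o, "product_category": by_id[o["product_id"]]["product_category"]}
        for o in orders
        if o["product_id"] in by_id
    ]

    # Regex extraction with the pattern from the steps above (\d+).
    for row in joined:
        match = re.search(r"\d+", row["sales"])
        row["NetSales"] = int(match.group()) if match else None

    # Remove exact duplicate rows while preserving order.
    seen = set()
    unique = []
    for row in joined:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Group by category and ship mode, then average.
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for row in unique:
        if row["NetSales"] is not None:
            groups[(row["product_category"], row["ship_mode"])].append(row["NetSales"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


summary = summarize(orders, products)
```

Note that in Python the pattern \d+ captures only the first run of digits, so a value such as "$12.50" yields "12". If the Sales column contains decimals or thousands separators, a broader pattern may be needed.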
Result
Create a Word document detailing the steps you performed, with
screenshots. Upload the solution document to the Learning Management
System (LMS).