Day 27


Scenario Based Interview

PySpark vs Spark SQL

Ganesh. R
#Problem Statement: Product recommendation of the basic type ("customers who bought this also bought..."). In its simplest form, this is an outcome of basket analysis. In this solution we will find the products that are most frequently bought together using simple SQL; for example, if products A and B appear together in two different orders, the pair (A, B) gets a frequency of 2. Based on this purchase history, an e-commerce website can recommend products to new users.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session
spark = SparkSession.builder \
    .appName("OrdersProducts") \
    .getOrCreate()

# Define schema for orders
orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True)
])

# Define schema for products
products_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

# Create data for orders: (order_id, customer_id, product_id)
orders_data = [
    (1, 1, 1),
    (1, 1, 2),
    (1, 1, 3),
    (2, 2, 1),
    (2, 2, 2),
    (2, 2, 4),
    (3, 1, 5)
]

# Create data for products: (id, name)
products_data = [
    (1, 'A'),
    (2, 'B'),
    (3, 'C'),
    (4, 'D'),
    (5, 'E')
]

# Create DataFrame for orders
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)

# Create DataFrame for products
products_df = spark.createDataFrame(products_data, schema=products_schema)

# Show the input DataFrames
orders_df.display()
products_df.display()
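Note: display() is the Databricks notebook rendering. In a plain PySpark session, the same data can be inspected with show():

# Outside Databricks, show() prints the same data as display()
orders_df.show()
products_df.show()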

# Create temporary views for SQL queries
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")
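Optionally, a small sanity check that the views are registered before moving on (spark.sql works in any PySpark session; the counts below are for the sample data, which has 7 order rows and 5 products):

# Sanity check: the temp views should resolve and return the row counts
spark.sql("select count(*) as n_orders from orders").show()
spark.sql("select count(*) as n_products from products").show()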

###PySpark

from pyspark.sql.functions import col, concat, lit, monotonically_increasing_id, row_number, countDistinct
from pyspark.sql.window import Window

# Alias the orders DataFrame twice for a self-join
a = orders_df.alias("a")
b = orders_df.alias("b")

# Self-join orders on the same order_id with different product_ids,
# then attach the product names for both sides of each pair
t1 = a.join(b, (col("a.order_id") == col("b.order_id")) &
            (col("a.product_id") != col("b.product_id"))) \
    .join(products_df.alias("p1"), col("a.product_id") == col("p1.id"), "left") \
    .join(products_df.alias("p2"), col("b.product_id") == col("p2.id"), "left") \
    .select(
        col("a.order_id").alias("order_id"),
        col("a.customer_id").alias("customer_id"),
        col("p1.name").alias("name1"),
        col("p2.name").alias("name2"),
        (col("p1.id") + col("p2.id")).alias("pair_sum"),  # same value for (A, B) and (B, A)
        monotonically_increasing_id().alias("idf")
    )

# Define window specification for row_number
window_spec = Window.partitionBy("order_id", "pair_sum").orderBy("idf")

# Apply row_number to rank duplicate pairs within each (order_id, pair_sum) group
t2 = t1.withColumn("rnk", row_number().over(window_spec))

# Keep only the first occurrence of each pair_sum within each order
# (this drops the mirrored row, e.g. (B, A) once (A, B) is kept),
# then build the pair label from the two product names
t3 = t2.filter(col("rnk") == 1) \
    .withColumn("pair", concat(col("name1"), lit(" "), col("name2")))

# Perform the final aggregation: count distinct orders per pair, most frequent first
result_df = t3.groupBy("pair") \
    .agg(countDistinct("order_id").alias("frequency")) \
    .orderBy(col("frequency").desc())

# Show the result
result_df.display()
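For the sample data, products A and B appear together in orders 1 and 2, while every other pair shows up in only one order, so the output should look roughly like this (the order of the two names inside a pair can vary, since row_number keeps an arbitrary member of each mirrored pair):

pair  frequency
A B   2
A C   1
B C   1
A D   1
B D   1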

###Spark SQL

%sql
with t1 as (
  select a.order_id, a.customer_id, p1.name as name1, p2.name as name2,
         (p1.id + p2.id) as pair_sum, monotonically_increasing_id() as idf
  from orders a
  inner join orders b on a.order_id = b.order_id and a.product_id <> b.product_id
  left join products p1 on a.product_id = p1.id
  left join products p2 on b.product_id = p2.id
),
t2 as (
  select order_id, customer_id, name1, name2, pair_sum,
         row_number() over (partition by order_id, pair_sum order by idf asc) as rnk
  from t1
),
t3 as (
  select *, concat(name1, ' ', name2) as pair
  from t2
  where rnk = 1
)
select pair, count(distinct order_id) as frequency
from t3
group by pair
order by 2 desc
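One caveat with the pair_sum trick in both versions: two different pairs can share the same id sum (for example, product ids 1 + 4 and 2 + 3 both give 5), which would collapse unrelated pairs within the same order. A simpler deduplication, assuming numeric product ids, is to keep only rows where the first product id is smaller than the second; this also removes the need for monotonically_increasing_id and row_number. A minimal PySpark sketch of that variant (pairs_df and result_alt_df are illustrative names, not part of the original solution):

from pyspark.sql.functions import col, concat, lit, countDistinct

# Keep each unordered pair exactly once by requiring a.product_id < b.product_id
pairs_df = orders_df.alias("a") \
    .join(orders_df.alias("b"),
          (col("a.order_id") == col("b.order_id")) &
          (col("a.product_id") < col("b.product_id"))) \
    .join(products_df.alias("p1"), col("a.product_id") == col("p1.id"), "left") \
    .join(products_df.alias("p2"), col("b.product_id") == col("p2.id"), "left") \
    .select(col("a.order_id").alias("order_id"),
            concat(col("p1.name"), lit(" "), col("p2.name")).alias("pair"))

# Count distinct orders per pair, most frequent first
result_alt_df = pairs_df.groupBy("pair") \
    .agg(countDistinct("order_id").alias("frequency")) \
    .orderBy(col("frequency").desc())

result_alt_df.display()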
IF YOU FOUND THIS POST USEFUL, PLEASE SAVE IT.

Ganesh. R
+91-9030485102. Hyderabad, Telangana. rganesh0203@gmail.com

https://medium.com/@rganesh0203 https://rganesh203.github.io/Portfolio/
https://github.com/rganesh203 https://www.linkedin.com/in/r-ganesh-a86418155/

https://www.instagram.com/rg_data_talks/ https://topmate.io/ganesh_r0203
