CSCI 720 - Project

CS_project

Uploaded by

Atif mirza

Big data analytics on Amazon product reviews

Team: Classyfiers
-Satyanarayan Iyengar
-Vaibhav Joshi
-Amritha Venkataramana
Agenda
1) Goal
2) Datasets
3) Data pre-processing
4) Data management in SQL
5) Data mining component
6) Exploratory Analysis and Visualization
7) Results
8) Tools used
Goal

1) To analyse reviews of books purchased on Amazon, using datasets obtained from
two different sources, with data mining and visualization techniques.
2) Why product reviews?
a) They reveal customer sentiment.
b) They help manufacturers decide on the constraints that could make a
product a success.
3) To study and implement industry-standard practices for data mining.
Datasets and specifications
1) The datasets chosen for the project are:
a) Stanford Amazon Reviews Dataset - a collection of customer reviews written in the
Amazon.com marketplace. (http://jmcauley.ucsd.edu/data/amazon/links.html)
Specifications:
Data format - JSON
Attributes:
"reviewerID" - the ID of the reviewer
"asin" - Amazon product ID
"reviewerName" - name of the reviewer
"helpful" - the number of times the review was thought to be helpful
"reviewText" - the content of the review
"overall" - the product rating (from 1 to 5)
"summary" - title of the review
"unixReviewTime" - the time of the review in UNIX format
"reviewTime" - the time of the review
b) AWS Amazon Customer Reviews Dataset - this dataset is divided into a product reviews dataset and
a product metadata dataset. The reviews dataset includes ratings, text and helpfulness. The product
metadata dataset includes product category, descriptions, price etc.
(https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
Specifications:
Data format - Tab-separated file (.tsv)
Attributes:
"marketplace" - two-letter country code of the marketplace where the review was written
"customer_id" - random identifier that can be used to aggregate reviews written by a single author
"review_id" - the unique ID of the review
"product_id" - the unique product ID
"product_parent" - random identifier that can be used to aggregate reviews for the same product
"product_title" - title of the product
"product_category" - category of the product
"star_rating" - the rating of the review (from 1 to 5)
"helpful_votes" - total number of helpful votes the review received
"total_votes" - total number of votes the review received
"vine" - whether the review was written as part of the Vine program
"verified_purchase" - whether the review is on a verified purchase
"review_headline" - the title of the review
"review_body" - the review text
"review_date" - the date of the review
Data pre-processing
Data pre-processing is commonly said to account for about 80% of the effort in a data mining
task. It serves as the basis for a strong analysis.

Pre-processing tasks performed on the datasets:

1) Data loading and formatting
2) Data conversion
3) Dropping unnecessary columns/attributes
4) Handling missing values
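
The four tasks above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: it assumes the JSON file holds one review object per line, the shared field names (product_id, rating, review_text, review_date) are hypothetical, and dropping incomplete rows is only one possible way of handling missing values. The sample records are made up.

```python
import csv
import io
import json
from datetime import datetime, timezone


def load_json_lines(handle):
    """Task 1: parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


def load_tsv(handle):
    """Task 1: parse a tab-separated file with a header row into dicts."""
    # QUOTE_NONE treats quote characters inside review text as ordinary text.
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def to_common(record, source):
    """Tasks 2-3: map to shared (hypothetical) field names, keeping only those."""
    if source == "json":
        ts = record.get("unixReviewTime")
        date = (datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
                if ts is not None else None)
        return {"product_id": record.get("asin"), "rating": record.get("overall"),
                "review_text": record.get("reviewText"), "review_date": date}
    return {"product_id": record.get("product_id"), "rating": record.get("star_rating"),
            "review_text": record.get("review_body"), "review_date": record.get("review_date")}


def clean(rows):
    """Task 4: drop rows with a missing/malformed rating or an empty review text."""
    cleaned = []
    for row in rows:
        text = (row["review_text"] or "").strip()
        try:
            rating = int(float(row["rating"]))
        except (TypeError, ValueError):
            continue  # rating missing or not numeric
        if text and 1 <= rating <= 5:
            cleaned.append({**row, "rating": rating, "review_text": text})
    return cleaned


# Made-up sample data standing in for the real files.
JSON_SAMPLE = ('{"asin": "B001", "overall": 5.0, "reviewText": " Great read ", '
               '"unixReviewTime": 1400000000}\n')
TSV_SAMPLE = (
    "product_id\tstar_rating\treview_body\treview_date\n"
    "B002\t4\tSolid plot\t2015-01-01\n"
    "B003\t\tNo rating given\t2015-01-02\n"
)

combined = clean(
    [to_common(r, "json") for r in load_json_lines(io.StringIO(JSON_SAMPLE))]
    + [to_common(r, "tsv") for r in load_tsv(io.StringIO(TSV_SAMPLE))]
)
```

Here the third sample row is dropped because its rating is empty, so combined holds two reviews in the shared format.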
Data management

1) Data management is useful for storing and querying data, as well as for keeping the data
separate from the analysis. It is typically handled by a database management system (DBMS).
2) As part of the data management component, a base schema was designed using the
attributes from the combined dataset.
3) Data management was done in MySQL, using SQL Workbench and Python.
4) A representation of the data management component is shown in the figure that follows.
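
As an illustration of what a base schema over the combined attributes might look like, here is a small sketch. It is not the schema from the project: the table and column names are hypothetical, and it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server. The inserted rows are made up.

```python
import sqlite3

# Hypothetical base schema; the project's real schema is not shown in the slides.
SCHEMA = """
CREATE TABLE reviews (
    review_id     INTEGER PRIMARY KEY,
    product_id    TEXT NOT NULL,
    reviewer_id   TEXT NOT NULL,
    rating        INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    helpful_votes INTEGER DEFAULT 0,
    review_text   TEXT,
    review_date   TEXT
)
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a MySQL database
conn.execute(SCHEMA)

sample_rows = [
    ("B001", "A1", 5, 2, "Great read", "2014-05-13"),
    ("B001", "A2", 3, 0, "Average", "2014-06-01"),
    ("B002", "A1", 1, 4, "Too long", "2013-01-10"),
]
conn.executemany(
    "INSERT INTO reviews "
    "(product_id, reviewer_id, rating, helpful_votes, review_text, review_date) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)
conn.commit()

# Example query: average rating and review count per product.
avg_by_product = conn.execute(
    "SELECT product_id, AVG(rating), COUNT(*) "
    "FROM reviews GROUP BY product_id ORDER BY product_id"
).fetchall()
```

Keeping the data in a table like this lets the analysis code issue queries instead of re-reading the raw files.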
Data Mining
1) Why data mining?
-> Data management is useful in web applications and query-based environments. It can
execute complex queries, but it does not by itself yield insights, and visualizations are
difficult to produce from it. Data mining is therefore needed for predicting, modeling and
visualizing data.
2) Customer reviews can be mined to identify trends, as well as to analyse past history in order
to improve future recommendations.
3) The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used process model
for data mining tasks. It consists of the following phases:
a) Business Understanding
b) Data Understanding
c) Data Preparation
d) Modeling
e) Evaluation
f) Deployment
Exploratory analysis and Visualization
We performed visualizations in Tableau to explore relationships between attributes and to
determine timelines and trends in the attributes.

The visualizations follow in the next slides.

[Figure: Descriptive statistics for numeric attributes]
[Figure: Word cloud for all reviews]
[Figure: Total reviews for each year, 1997 - 2015]
[Figure: Pairwise comparison of helpful votes and overall votes]
[Figure: Average helpful votes per rating]
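
The aggregations behind two of these charts (total reviews per year and average helpful votes per rating) are simple to state in code. The sketch below is only an illustration in Python; the charts themselves were built in Tableau. The field names and sample rows are assumptions.

```python
from collections import Counter, defaultdict


def reviews_per_year(rows):
    """Count reviews per calendar year, taken from an ISO date string (YYYY-MM-DD)."""
    counts = Counter(row["review_date"][:4] for row in rows)
    return dict(sorted(counts.items()))


def avg_helpful_per_rating(rows):
    """Average number of helpful votes for each star rating."""
    totals = defaultdict(lambda: [0, 0])  # rating -> [sum of votes, review count]
    for row in rows:
        totals[row["rating"]][0] += row["helpful_votes"]
        totals[row["rating"]][1] += 1
    return {rating: votes / count for rating, (votes, count) in sorted(totals.items())}


# Made-up rows for illustration only.
sample = [
    {"rating": 5, "helpful_votes": 2, "review_date": "2014-05-13"},
    {"rating": 5, "helpful_votes": 4, "review_date": "2014-06-01"},
    {"rating": 1, "helpful_votes": 0, "review_date": "2013-01-10"},
]

per_year = reviews_per_year(sample)
per_rating = avg_helpful_per_rating(sample)
```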
Modeling and Evaluation

Models used:

Classification: Support Vector Machine (SVM)

Clustering: Agglomerative (BIRCH)

SVM Results

Accuracy: 68%
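
The slides do not state which features or target the SVM was trained on. As a rough sketch of one common setup, the example below classifies review text as positive or negative using TF-IDF features and a linear SVM from scikit-learn. The target, the features and the training sentences are all assumptions made for illustration, and the example says nothing about how the 68% figure was obtained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up reviews; the positive/negative target is an assumption.
train_texts = [
    "wonderful story, loved every page",
    "great characters and a gripping plot",
    "an excellent and moving book",
    "boring, slow and far too long",
    "terrible writing, a waste of money",
    "dull plot and poor editing",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# TF-IDF turns each review into a sparse term-weight vector;
# LinearSVC then fits a linear support vector classifier on those vectors.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

predictions = model.predict(["loved this gripping story", "a boring waste of money"])
```

On real data, accuracy would be measured on a held-out test split rather than on the training reviews.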
[Figure: BIRCH results for a specific topic cluster]
[Figure: BIRCH results for a generic topic cluster]
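
For readers unfamiliar with BIRCH, here is a minimal sketch of clustering review text with scikit-learn's Birch estimator. The TF-IDF features, the threshold, the number of clusters and the sample reviews are all assumptions; the slides do not describe the configuration that produced the clusters shown.

```python
from sklearn.cluster import Birch
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews touching on two rough topics (story vs. delivery).
docs = [
    "the plot and characters were gripping",
    "gripping plot with memorable characters",
    "characters felt flat and the plot dragged",
    "the book arrived late and the cover was damaged",
    "shipping was slow and the package arrived damaged",
    "damaged cover, slow shipping",
]

features = TfidfVectorizer(stop_words="english").fit_transform(docs)

# threshold bounds the radius of each subcluster in the BIRCH tree;
# n_clusters is the number of clusters after the final global clustering step.
birch = Birch(threshold=0.5, n_clusters=2)
cluster_ids = birch.fit_predict(features)

# Group the reviews by assigned cluster for inspection.
clusters = {}
for doc, cluster_id in zip(docs, cluster_ids):
    clusters.setdefault(int(cluster_id), []).append(doc)
```

Inspecting the reviews grouped under each cluster id is one way to judge whether a cluster corresponds to a specific topic or a generic one.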
Conclusion / Future Work

● Learnt industry-standard data mining procedures
● Implemented SVM classification (68% accuracy) and BIRCH clustering
● Performed visualization and explored relationships between attributes

Future Work

● More categories
● User identification
Tools used

1) Language: Python (PyCharm)
2) Software: Tableau
3) Frameworks: scikit-learn, matplotlib
THANK YOU
