CSCI 720 - Project

CS_project

Uploaded by

Atif mirza
Big data analytics on

Amazon product reviews


Team: Classyfiers
-Satyanarayan Iyengar
-Vaibhav Joshi
-Amritha Venkataramana
Agenda
1) Goal
2) Datasets
3) Data pre-processing
4) Data management in SQL
5) Data mining component
6) Exploratory Analysis and Visualization
7) Results
8) Tools used
Goal

1) To analyse reviews of books purchased on Amazon, using datasets
obtained from two different sources, with data mining and visualization
techniques.
2) Why product reviews?
a) They reveal customer sentiments
b) They help manufacturers identify the constraints that could make the
product a success.
3) To study and implement industry standard practices for data mining
Datasets and specifications
1) The datasets chosen for the project are
a) Stanford Amazon Reviews Dataset- a collection of customer reviews written in the
Amazon.com marketplace. (http://jmcauley.ucsd.edu/data/amazon/links.html)
Specifications:
Data format - JSON
Attributes: "reviewerID" - the id of the reviewer
"asin" - Amazon product ID
"reviewerName" - name of the reviewer
"helpful" - helpfulness of the review, given as a pair [helpful votes, total votes]
"reviewText" - the content of the review
"overall" - the product rating (from 1 to 5)
"summary" - title of the review
"unixReviewTime" - the time of the review in UNIX format
"reviewTime" - the time of the review
b) AWS Amazon Customer Reviews Dataset- This dataset is divided into product reviews dataset and
product metadata dataset. The reviews dataset includes ratings, text and helpfulness. The product
metadata dataset includes product category, descriptions, price etc.
(https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
Specifications:
Data format - Tab Separated File (.tsv)
Attributes: "marketplace"- 2 letter country code of the marketplace where the review was written.
"customer_id" - Random identifier that can be used to aggregate reviews written by a single
author.
"review_id" - The unique ID of the review.
"product_id" - The unique Product ID
"product_parent" - Random identifier that can be used to aggregate reviews for the same
product.
"product_title" - Title of the product.
"product_category" - category of the product
"star_rating" - The rating of the review (from 1-5)
Attributes continued..
"helpful_votes" - Number of helpful votes the review received.
"total_votes" - Total number of votes the review received.
"vine" - Review was written as part of the Vine program.
"verified_purchase" - The review is on a verified purchase.
"review_headline" - The title of the review.
"review_body" - The review text.
"review_date" - The date of the review
Data pre-processing
1) Data pre-processing is often estimated to take up about 80% of the effort in a data mining task.
It lays the foundation for a strong analysis.

Pre-processing tasks performed on the datasets:


1) Data Loading and Formatting
2) Data Conversion
3) Dropping unnecessary columns/attributes
4) Handling missing values
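
The four pre-processing tasks above can be sketched in plain Python. This is only an illustration, not the team's actual pipeline: the two sample records are made up, and the choice of columns to drop and the rule for filling missing text are assumptions.

```python
import json
from datetime import datetime, timezone

# Two made-up records in the JSON-lines layout of the Stanford dataset.
RAW_LINES = [
    '{"reviewerID": "A1", "asin": "B001", "reviewerName": "Ann", '
    '"helpful": [2, 3], "reviewText": "Great read", "overall": 5.0, '
    '"summary": "Loved it", "unixReviewTime": 1356998400, '
    '"reviewTime": "01 1, 2013"}',
    '{"reviewerID": "A2", "asin": "B002", "helpful": [0, 0], '
    '"reviewText": null, "overall": 3.0, "summary": "Okay", '
    '"unixReviewTime": 1388534400, "reviewTime": "01 1, 2014"}',
]

# Assumption: these attributes are treated as unnecessary for the analysis.
DROP_COLUMNS = {"reviewerName", "reviewTime"}


def preprocess(line):
    # 1) Data loading and formatting: parse one JSON record.
    record = json.loads(line)
    # 2) Data conversion: UNIX timestamp -> ISO date (UTC).
    timestamp = record.pop("unixReviewTime")
    record["review_date"] = (
        datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
    )
    # 3) Dropping unnecessary columns/attributes.
    for column in DROP_COLUMNS:
        record.pop(column, None)
    # 4) Handling missing values: replace missing review text with "".
    if record.get("reviewText") is None:
        record["reviewText"] = ""
    return record


cleaned = [preprocess(line) for line in RAW_LINES]
```

Each entry of `cleaned` is a dictionary that no longer carries the dropped attributes and always has a string in `reviewText`.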
Data management

1) Data management is useful in storing and querying data as well as keeping the data
separate from the analysis. Typically done by database management systems (DBMS)
2) As part of the data management component, a base schema was designed using the
attributes from the combined dataset.
3) Data management done in MySQL using SQL Workbench and Python.
4) A representation of the data management component is shown in the figure that follows
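
As a rough sketch of the idea of keeping data in a DBMS and querying it from Python, the example below uses the standard-library `sqlite3` module as a stand-in for MySQL. The table layout is a hypothetical base schema built from a few of the attributes listed earlier, and the rows are made up; it is not the schema from the project's figure.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical base schema built from a subset of the dataset attributes.
conn.execute(
    """
    CREATE TABLE reviews (
        review_id     TEXT PRIMARY KEY,
        product_id    TEXT NOT NULL,
        customer_id   TEXT NOT NULL,
        star_rating   INTEGER CHECK (star_rating BETWEEN 1 AND 5),
        helpful_votes INTEGER DEFAULT 0,
        total_votes   INTEGER DEFAULT 0,
        review_body   TEXT,
        review_date   TEXT
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("R1", "B001", "C1", 5, 2, 3, "Great read", "2013-01-01"),
    ("R2", "B001", "C2", 4, 0, 1, "Good", "2013-02-10"),
    ("R3", "B002", "C1", 2, 5, 6, "Disappointing", "2014-01-01"),
]
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Querying stays separate from the analysis code: average rating per product.
avg_by_product = conn.execute(
    "SELECT product_id, AVG(star_rating) FROM reviews "
    "GROUP BY product_id ORDER BY product_id"
).fetchall()
```

Here `avg_by_product` is a list of `(product_id, average rating)` tuples that the analysis code can consume without touching the raw files.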
Data Mining
1) Why data mining?
-> Data management is useful in web applications and query-based environments. It can
execute complex queries; however, queries alone do not yield insights, and visualizations are
difficult to produce. Thus, data mining is needed for predicting, modeling and visualizing data.
2) Customer reviews can be mined to identify trends as well as to analyse past history to improve future
recommendations.
3) The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used process model for
data mining tasks. It consists of the following steps:
a) Business Understanding
b) Data Understanding
c) Data Preparation
d) Modeling
e) Evaluation
f) Deployment
Exploratory analysis and Visualization
We performed visualizations in Tableau to explore relationships between attributes as well as
to determine timelines and trends in the attributes.

The visualizations follow in the next slides.


Descriptive Statistics for numeric attributes
WORD CLOUD FOR ALL REVIEWS
Total reviews for each year from 1997 - 2015.
Pairwise comparison of helpful votes and overall votes
Average helpful votes per rating
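
The aggregates behind two of the charts above (total reviews per year and average helpful votes per rating) could be computed along the following lines. The charts themselves were built in Tableau; this Python sketch only shows the underlying arithmetic, and the sample records are made up.

```python
from collections import Counter, defaultdict

# Made-up review records for illustration only.
reviews = [
    {"review_date": "2013-05-02", "star_rating": 5, "helpful_votes": 4},
    {"review_date": "2013-11-20", "star_rating": 5, "helpful_votes": 2},
    {"review_date": "2014-01-15", "star_rating": 1, "helpful_votes": 9},
]

# Total reviews for each year: count records by the year part of the date.
reviews_per_year = Counter(r["review_date"][:4] for r in reviews)

# Average helpful votes per rating: group votes by rating, then average.
votes_by_rating = defaultdict(list)
for r in reviews:
    votes_by_rating[r["star_rating"]].append(r["helpful_votes"])

avg_helpful_per_rating = {
    rating: sum(votes) / len(votes) for rating, votes in votes_by_rating.items()
}
```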
Modeling and Evaluation

Models Used:

Classification :

Support Vector Machine

Clustering :

Agglomerative (Birch)
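
A minimal scikit-learn sketch of the two model types named above follows. The slides do not say which features, SVM variant or parameters were used, so the TF-IDF features, `LinearSVC`, the `Birch` settings and the toy reviews are all assumptions made for illustration.

```python
from sklearn.cluster import Birch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy review texts and sentiment labels, made up for illustration only.
train_texts = [
    "loved this book, a wonderful and gripping story",
    "great characters and a wonderful ending",
    "terrible plot, boring and badly written",
    "boring book, a waste of money",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Classification: TF-IDF features fed into a linear support vector machine.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_texts, train_labels)
predicted = classifier.predict(["a wonderful story with great characters"])[0]

# Clustering: BIRCH on a dense TF-IDF representation of the same texts.
vectors = TfidfVectorizer().fit_transform(train_texts).toarray()
birch = Birch(n_clusters=2, threshold=0.5)
cluster_ids = birch.fit_predict(vectors)
```

`predicted` holds the label assigned to the unseen review, and `cluster_ids` holds one cluster index per training review.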
SVM Results

Accuracy : 68%
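
For reference, classification accuracy is the fraction of test examples whose predicted label matches the true label. The short sketch below shows the calculation on made-up labels; it does not reproduce the figure reported above.

```python
# Made-up true and predicted labels for five test reviews.
true_labels = ["positive", "negative", "positive", "positive", "negative"]
predicted_labels = ["positive", "negative", "negative", "positive", "negative"]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
```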
Birch results for a specific topic cluster
Birch results for a Generic topic cluster
Conclusion /Future Work

● Learnt industry-standard data mining procedures
● Implemented SVM and clustering, with reasonable results
● Performed visualization and explored relationships between attributes

Future Work

● More categories
● User identification
Tools used

1) Language: Python (PyCharm)
2) Software: Tableau
3) Frameworks: scikit-learn, matplotlib
THANK YOU
