IT6006-Data Analytics Department of CSE 2018-2019

This document provides an introduction to big data concepts. It discusses the three Vs of big data - volume, variety, and velocity. It describes sources of big data like human and machine generated data. It explains how big data is used to extract useful patterns and insights through case studies. Popular big data platforms like Hadoop and programming languages like Pig and Hive for MapReduce application development are covered. Finally, it outlines some risks of big data like being overwhelmed by data and privacy issues.

Uploaded by

slogeshwari

[Figure: course concept map with four branches: Introduction to Big Data, Data Analysis, Mining Data Streams, and Frequent Itemsets and Clustering. Each branch is labelled with its syllabus topics, for example statistical concepts, regression modeling, stream computing, and the Apriori algorithm.]

St. Joseph’s College of Engineering Page 1 of 193


UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Statistical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
The 3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data,
and velocity refers to the speed at which data is generated and processed.
2. What are the sources of Big Data?
•Human-generated data
◦Email
◦Social networks
◦Cloud-storage-hosted information
◦Electronic publications (scientific, engineering, social, administrative)
◦Enterprise web pages
◦Product documents
◦Legacy documents presented in electronic form
◦Online videos (TV, YouTube)
◦Photos
•Machine-generated data
3. What do we do with Big Data?
•Extract useful patterns from Big Data and put them to use, for example:
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2: Facebook’s personalized product marketing
•Short Case Study 3: Google search – gaining a competitive edge over Wikipedia
4. List some Big Data platforms.
Hadoop:
•Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad hoc queries (processing)
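The batch-oriented MapReduce model named above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real Hadoop job would read its input from HDFS and run the phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; on Hadoop this would be file blocks stored in HDFS.
    sample = ["big data needs big storage", "data moves fast"]
    print(word_count(sample))
```

Because every input line is mapped independently and every key is reduced independently, both steps can be spread over many machines, which is why the model suits batch processing better than interactive ad hoc queries.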
5. What are the other alternative application development languages?
The other alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive and Jaql are popular ones.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
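To make the idea of a query being broken down into MapReduce work more concrete, the sketch below shows how a simple grouped aggregate could be expressed as a map step and a reduce step. It is written in plain Python and does not reproduce Hive's actual query planner; the query text, the employees table, its dept and salary columns, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical query, shown only to motivate the two steps below:
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;
Row = Dict[str, Any]


def map_row(row: Row) -> Tuple[str, float]:
    """Map step: emit the GROUP BY key together with the value to aggregate."""
    return str(row["dept"]), float(row["salary"])


def reduce_group(dept: str, salaries: List[float]) -> Tuple[str, float]:
    """Reduce step: apply SUM() to all values that share one key."""
    return dept, sum(salaries)


def run_group_by_sum(rows: Iterable[Row]) -> Dict[str, float]:
    """Run the map step, group by key, then run the reduce step."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        grouped[key].append(value)  # stands in for the shuffle between phases
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical rows standing in for a table stored on a Hadoop cluster.
    employees = [
        {"dept": "cse", "salary": 100.0},
        {"dept": "it", "salary": 80.0},
        {"dept": "cse", "salary": 50.0},
    ]
    print(run_group_by_sum(employees))
```

The GROUP BY column becomes the map output key and the aggregate function becomes the reducer, which is the general shape that lets a declarative query run as a batch job.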
7. What are the risks of Big Data?
•An organization may become so overwhelmed with big data that it makes no
progress.
•Costs may escalate too fast because too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

\[ \int_x \int_y f_{X,Y}(x, y) \, dx \, dy = 1 \]
IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 8 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 9 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 10 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 11 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 24 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What are the sources of Big Data?
•Human-generated data
◦Email
◦Social networks
◦Cloud-storage-hosted information
◦Electronic publications (scientific, engineering, social, administrative)
◦Enterprise web pages
◦Product documents
◦Legacy documents presented in electronic form
◦Online videos (TV, YouTube)
◦Photos
•Machine-generated data
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use them, as these short case studies illustrate:
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2: Facebook (FB) personalized product marketing
•Short Case Study 3: Google search, getting a competitive edge over Wikipedia.
St. Joseph’s College of Engineering Page 25 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 26 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 27 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 28 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 29 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 30 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 31 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 32 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 33 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 34 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 35 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 36 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and put them to use, for example:
•Short Case Study 1: Amazon’s personalized product recommendations
•Short Case Study 2: Facebook’s personalized product marketing
•Short Case Study 3: Google search, gaining a competitive edge over Wikipedia
4. List some Big Data platforms.
Hadoop:
•Supports a large Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited for batch processing
•Not suitable for ad hoc queries (processing)
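The batch-oriented MapReduce model named above can be illustrated with a minimal sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions chosen for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over a small in-memory batch of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input standing in for files stored in HDFS.
counts = word_count(["big data big insight", "data in motion"])
print(counts)
```

The whole input is read before any result is produced, which is one way to see why this style suits batch processing better than ad hoc queries.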
5. What are the other alternative application development languages?
The other alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive and Jaql are popular.
6. What is Hive?
Hive is one of the languages that support MapReduce application
development. It was developed at Facebook. Hive provides a query language, Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
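As a rough illustration of how a query can be broken down into map and reduce steps, the sketch below mimics a grouped aggregation in plain Python. The table sales(region, amount), its rows and the function names are hypothetical, and the code is only a conceptual analogy: it does not show how Hive actually plans or executes queries.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a table sales(region, amount). The query being imitated is:
#   SELECT region, SUM(amount) FROM sales GROUP BY region
SALES = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("east", 50.0)]


def map_rows(rows):
    """Map step: key each row by the GROUP BY column."""
    for region, amount in rows:
        yield region, amount


def reduce_groups(pairs):
    """Sort so equal keys are adjacent, then apply SUM to each group of values."""
    ordered = sorted(pairs, key=itemgetter(0))
    for region, group in groupby(ordered, key=itemgetter(0)):
        yield region, sum(amount for _, amount in group)


totals = dict(reduce_groups(map_rows(SALES)))
print(totals)
```

The GROUP BY column plays the role of the map output key, and the aggregate function plays the role of the reducer.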
7. What are the risks of Big Data?
•An organization may be so overwhelmed with big data that it does not make any
progress.
•Costs may escalate too fast when too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
St. Joseph’s College of Engineering Page 37 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 38 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 39 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 40 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 41 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 42 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 43 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 44 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 45 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 46 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 47 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 48 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data platforms.
Hadoop:
•Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad hoc queries (processing)
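The batch-oriented MapReduce model listed above can be illustrated with a minimal, single-machine sketch. This is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mimic the three stages a MapReduce job goes through, using word counting as the example task.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data platform"]))
```

In a real cluster the map and reduce functions would run in parallel on different nodes over blocks stored in HDFS; here everything runs in one process purely to show the data flow.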
5. What are the alternative application development languages?
The alternative is to use one of the MapReduce application development
languages. Among these, Pig, Hive, and Jaql are popular.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
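To make the idea of a query being broken down into MapReduce work more concrete, consider a hypothetical HQL query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept;`. The sketch below is an assumption-laden illustration in plain Python, not Hive's actual execution plan: the table `employees`, its rows, and the helper names are all invented for the example. The map step emits the grouping key for each row, and the reduce step counts the rows per key.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [("asha", "cse"), ("ravi", "it"), ("mala", "cse")]


def map_row(row):
    """Map: emit (dept, 1) for one row, since dept is the GROUP BY key."""
    _name, dept = row
    return dept, 1


def count_by_dept(rows):
    """Group the mapped pairs by key, then reduce each group with a sum (COUNT(*))."""
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return {dept: sum(ones) for dept, ones in groups.items()}


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

The point of the sketch is only that a declarative query can be expressed as a map step plus a reduce step, which is why a query written in HQL can be run as batch jobs on a Hadoop cluster.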
7. What are the risks of Big Data?
•An organization may be so overwhelmed with big data that it makes no progress.
•Costs may escalate too fast because too much big data is captured before the
organization knows what to do with it. Avoiding this is a matter of making sure
that progress moves at a pace the organization can keep up with.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
$X \sim N\big((0,0),\, I_2\big), \quad p(x, y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}}$
$X \sim N(\mu, \sigma^2)$
IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 57 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 58 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 59 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 60 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data platforms.
Hadoop:
• Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
• Supports a batch-oriented distributed computing architecture with MapReduce
• HDFS and MapReduce are the two core components of Hadoop
• Apache Hadoop is open source and suited to batch processing
• Not suitable for ad-hoc queries (processing)
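The batch-oriented MapReduce model listed above can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel over HDFS blocks.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    for doc in documents:  # each document stands in for one input split
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, used only to exercise the sketch.
counts = word_count(["big data big insight", "data at rest"])
print(counts)
```

The whole input is processed in one pass before any result is available, which is why this style suits batch jobs better than ad-hoc queries.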
5. What are the other alternative application development languages?
The alternative is to use one of the MapReduce application development
languages. Among these, Pig, Hive and Jaql are popular.
6. What is Hive?
Hive is one of the languages that support MapReduce application
development. Hive was developed at Facebook. It provides a query language, the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
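As a rough illustration of how a query can be broken down into map and reduce steps, the sketch below evaluates a simple aggregation in plain Python. The table name `sales`, its columns, and the helper names are hypothetical, and Hive's actual query planner is considerably more involved than this.

```python
from collections import defaultdict

# Hypothetical table "sales" with columns (region, amount). The query being
# imitated is illustrative only:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 30},
]


def mapper(row):
    # The GROUP BY column becomes the key; the aggregated column is the value.
    return row["region"], row["amount"]


def reducer(key, values):
    # SUM(amount) is applied to all values that share one key.
    return key, sum(values)


def run_group_by(rows):
    grouped = defaultdict(list)
    for key, value in map(mapper, rows):  # map step, then group by key
        grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())  # reduce step


totals = run_group_by(sales)
print(totals)
```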
7. What are the risks of BIG DATA?
• An organization will be so overwhelmed with big data that it won't make any
progress.
• Costs will escalate too fast if too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
• The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

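minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$, with $\alpha_i \ge 0$
$w = \sum_{i=1}^{n} \alpha_i y_i x_i$, $\sum_{i=1}^{n} \alpha_i y_i = 0$
$g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i x_i^T x + b$
get $b$ from $y_i (w^T x_i + b) - 1 = 0$, where $x_i$ is a support vector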
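The support vector relations can be checked numerically on a toy problem: the weight vector is a sum of the training points weighted by their multipliers and labels, and b follows from the margin condition at any support vector. The two training points and the multiplier values below are assumptions chosen so the result can be verified by hand; in practice the multipliers come from solving the optimization problem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy training set (assumed): one point per class.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
# Multipliers assumed to be the solution of the dual problem for this data;
# both points are support vectors (alpha_i > 0).
alpha = [0.25, 0.25]

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b from y_i (w^T x_i + b) - 1 = 0 at a support vector.
# Since y_i is +1 or -1, this gives b = y_i - w^T x_i.
sv = next(i for i, a in enumerate(alpha) if a > 0)
b = y[sv] - dot(w, X[sv])


def g(x):
    """Decision function g(x) = w^T x + b; its sign gives the predicted class."""
    return dot(w, x) + b


print(w, b, g((2.0, 0.0)))
```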
[Figure: attributes Outlook, Temperature, Humidity and Windy used to predict Play (?)]
$$ w_1 = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left\{ (w^T x_i)^2 \right\} $$

$$ w_k = \arg\max_{\|w\|=1} \frac{1}{m} \sum_{i=1}^{m} \left\{ \left[ w^T \left( x_i - \sum_{j=1}^{k-1} w_j w_j^T x_i \right) \right]^2 \right\} $$
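The two maximizations can be read as a procedure: find the unit direction of largest variance, subtract each point's projection onto it, and repeat on what is left. The sketch below follows that reading in plain Python, using power iteration as one possible way to approximate each arg max. The function names, the starting direction, the iteration count and the sample points are all assumptions, and the method presumes the starting direction is not orthogonal to the direction being sought.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def leading_direction(data, iterations=200):
    """Approximate arg max over unit w of (1/m) * sum_i (w^T x_i)^2.

    The 1/m factor does not change the maximizer, so it is omitted.
    """
    dim = len(data[0])
    w = normalize([float(d + 1) for d in range(dim)])  # arbitrary start (assumption)
    for _ in range(iterations):
        # Power-iteration step: w <- sum_i (w^T x_i) x_i, then renormalize.
        step = [0.0] * dim
        for x in data:
            proj = dot(w, x)
            for d in range(dim):
                step[d] += proj * x[d]
        w = normalize(step)
    return w


def principal_components(data, k):
    m = len(data)
    mean = [sum(x[d] for x in data) / m for d in range(len(data[0]))]
    residual = [[a - mu for a, mu in zip(x, mean)] for x in data]  # centre the data
    components = []
    for _ in range(k):
        w = leading_direction(residual)
        components.append(w)
        # Deflate: x_i <- x_i - w w^T x_i, so the next search ignores this direction.
        residual = [[a - dot(w, x) * wd for a, wd in zip(x, w)] for x in residual]
    return components


# Hypothetical 2-D points lying close to the line y = x.
points = [(2.0, 1.9), (-1.0, -1.1), (0.5, 0.6), (-1.5, -1.4)]
w1, w2 = principal_components(points, 2)
print(w1, w2)
```

Deflating the data after each component is what the inner sum over j expresses: the k-th direction is sought only in the part of each point not already explained by the first k-1 directions.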
IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 67 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 68 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 69 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 70 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 71 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 72 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platforms.
Hadoop:
•Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad-hoc (interactive) queries
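The batch-oriented MapReduce model mentioned above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for illustration. On a real cluster the input would be read from HDFS and the phases would run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a batch of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input batch; a real job would process files stored in HDFS.
counts = word_count(["big data big insight", "data at rest"])
print(counts)
```

The whole batch passes through map, shuffle and reduce before any result is available, which is one way to see why this style of processing suits batch jobs better than ad-hoc queries.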
5. What are the other alternative application development languages?
The alternative is to use one of the MapReduce application development languages.
Among these, Pig, Hive and Jaql are popular.
6. What is Hive?
Hive is one of the languages that support MapReduce application development. It was
developed at Facebook. Hive provides a query language, the Hive Query Language (HQL).
Hive queries are broken down into MapReduce jobs and executed across a Hadoop cluster.
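To make the idea of a query being broken down into MapReduce work more concrete, the sketch below shows how a simple aggregation could be expressed as a map step and a reduce step. It is a conceptual illustration in plain Python, not Hive's actual execution plan. The table name (sales), its columns, the sample rows and the helper names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL query, shown only for orientation:
#   SELECT category, SUM(amount) FROM sales GROUP BY category;

# Hypothetical rows of the assumed "sales" table.
sales = [
    {"category": "books", "amount": 12.0},
    {"category": "music", "amount": 5.0},
    {"category": "books", "amount": 8.0},
]


def map_row(row):
    """Map: emit the GROUP BY column as key and the summed column as value."""
    return row["category"], row["amount"]


def run_group_by_sum(rows):
    """Group the mapped pairs by key, then reduce each group with a sum."""
    groups = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        groups[key].append(value)
    # Reduce: one output row per distinct key.
    return {key: sum(values) for key, values in groups.items()}


totals = run_group_by_sum(sales)
print(totals)
```

The GROUP BY column plays the role of the MapReduce key and the aggregate function plays the role of the reducer. A query with joins or several aggregations would generally need more than one such stage.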
7. What are the risks of Big Data?
•An organization may be so overwhelmed by big data that it makes no progress.
•Costs may escalate too fast when too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 73 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 74 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 75 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 76 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 77 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 78 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 79 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 80 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 81 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 82 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 83 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 84 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
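To see how a query could break down into map and reduce steps, consider a grouped sum. The sketch below is an assumption-laden illustration in plain Python, not Hive's actual execution plan: the query text, the sales table, its columns and the helper run_group_by_sum are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query over a hypothetical table, shown only for comparison.
HQL = "SELECT category, SUM(amount) FROM sales GROUP BY category"

# Hypothetical rows standing in for the sales table.
rows = [
    {"category": "books", "amount": 12},
    {"category": "toys", "amount": 5},
    {"category": "books", "amount": 8},
]


def run_group_by_sum(rows, key_col, value_col):
    """Emulate GROUP BY key_col with SUM(value_col) as map, shuffle, reduce."""
    # Map: project each row to a (group key, value) pair.
    mapped = [(row[key_col], row[value_col]) for row in rows]
    # Shuffle: bring all values that share a key together.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: aggregate the values of each key.
    return {key: sum(values) for key, values in grouped.items()}


totals = run_group_by_sum(rows, "category", "amount")
print(totals)
```

The GROUP BY column plays the role of the map output key and the aggregate function plays the role of the reducer, so the user writes only the query and not the individual phases.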
7. What are the risks of Big Data?
•An organization can become so overwhelmed with big data that it won’t make any
progress.
•Costs can escalate too fast when too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.












St. Joseph’s College of Engineering Page 96 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of Big Data?
•An organization can become so overwhelmed with big data that it makes no progress.
•Costs can escalate too fast when too much big data is captured before the
organization knows what to do with it. Avoiding this is a matter of making sure that
progress moves at a pace that allows the organization to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
St. Joseph’s College of Engineering Page 97 of 193
IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 98 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 99 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 100 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 101 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 102 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 103 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 104 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 105 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 106 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 107 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 108 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of Big Data?
• An organization may be so overwhelmed by big data that it makes no progress.
• Costs may escalate too fast if too much big data is captured before the
organization knows what to do with it. Avoiding this is a matter of making sure
that progress moves at a pace that allows the organization to keep up.
• The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
St. Joseph’s College of Engineering Page 109 of 193
IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
The 3Vs (volume, variety and velocity) are the three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data,
and velocity refers to the speed at which data is generated and processed.
2. What are the sources of Big Data?
• Human Generated Data
◦ Email
◦ Social networks
◦ Cloud storage hosted information
◦ Electronic publications – scientific, engineering, social, administrative
◦ Enterprise web pages
◦ Product documents
◦ Legacy documents presented in electronic form
◦ Online videos (TV, YouTube)
◦ Photos
• Machine Generated Data
3. What do we do with Big Data?
• Extract useful patterns from Big Data and put them to use, as in these short case studies:
• Short Case Study 1: Amazon’s personalized product recommendation
• Short Case Study 2: Facebook’s personalized product marketing
• Short Case Study 3: Google Search – gaining a competitive edge over Wikipedia
4. List some Big Data platforms.
Hadoop:
• Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
• Supports a batch-oriented distributed computing architecture with MapReduce
• HDFS and MapReduce are the two core components of Hadoop
• Apache Hadoop is open source and suited for batch processing
• Not suitable for ad hoc queries (processing)
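The MapReduce model named above can be illustrated with a small single-machine sketch. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real Hadoop job would distribute these phases across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for file blocks stored in HDFS.
    sample = ["big data needs big storage", "data drives decisions"]
    print(word_count(sample))
```

Each call to map_phase depends only on its own line, which is what would let a cluster run many map tasks in parallel over different blocks of a file.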
5. What are the alternative application development languages?
The alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive, and Jaql are popular.
6. What is Hive?
Hive is one of the languages that support MapReduce application
development. Hive was developed at Facebook. It provides an SQL-like query language,
the Hive Query Language (HQL). Hive queries are broken down into MapReduce jobs and
executed across a Hadoop cluster.
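As a rough illustration of how a query can be broken down into map and reduce steps, consider a query of the form SELECT dept, COUNT(*) FROM employees GROUP BY dept. The sketch below mimics that breakdown in plain Python. The employees table, its rows, and the function names are hypothetical, and this is not Hive's actual execution plan.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of an "employees" table: (name, dept).
EMPLOYEES = [
    ("asha", "cse"),
    ("ravi", "it"),
    ("meena", "cse"),
    ("john", "ece"),
    ("li", "it"),
]


def map_rows(rows):
    """Map: emit (dept, 1) for each row, using the GROUP BY column as the key."""
    return [(dept, 1) for _name, dept in rows]


def reduce_counts(pairs):
    """Shuffle and reduce: sort by key, then sum the counts within each group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        dept: sum(count for _dept, count in group)
        for dept, group in groupby(ordered, key=itemgetter(0))
    }


def count_by_dept(rows):
    """Equivalent of the GROUP BY query on an in-memory list of rows."""
    return reduce_counts(map_rows(rows))


if __name__ == "__main__":
    print(count_by_dept(EMPLOYEES))
```

The GROUP BY column becomes the key emitted by the map step, and the aggregate (here COUNT) becomes the reduce step.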
7. What are the risks of BIG DATA?
• An organization may be so overwhelmed with big data that it does not make any
progress.
• Costs may escalate too fast if too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
• The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
St. Joseph’s College of Engineering Page 134 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 135 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 136 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 137 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 138 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 139 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 140 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 141 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 142 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 143 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 144 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 145 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What are the sources of Big Data?
•Human-generated data: email, social networks, cloud-storage-hosted information,
electronic publications (scientific, engineering, social, administrative), enterprise
web pages, product documents, legacy documents presented in electronic form,
online videos (TV, YouTube) and photos.
•Machine-generated data.
3. What do we do with Big Data?
Extract useful patterns from Big Data and put them to use, for example:
•Short Case Study 1: Amazon's personalized product recommendation
•Short Case Study 2: Facebook's personalized product marketing
•Short Case Study 3: Google search, gaining a competitive edge over Wikipedia
4. List some Big Data platforms.
Hadoop:
•Supports a large distributed file system, the Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad-hoc query processing
5. What are the alternative application development languages?
The other alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive and Jaql are popular ones.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
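As a rough illustration of what "broken down into MapReduce jobs" means, the sketch
below imitates in plain Python how a simple grouped count could be split into map,
shuffle and reduce phases. It is not Hive's actual execution plan and it does not
use Hadoop. The table name orders, the column product and all function names are
assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical query, shown in HiveQL-style syntax for illustration only:
#   SELECT product, COUNT(*) FROM orders GROUP BY product;


def map_phase(rows):
    """Map: emit one (key, 1) pair for every input row."""
    for row in rows:
        yield row["product"], 1


def shuffle_phase(pairs):
    """Shuffle: collect all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values gathered for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    """Chain the three phases, as a single-machine stand-in for one job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Made-up sample rows standing in for the hypothetical orders table.
orders = [
    {"product": "book"},
    {"product": "pen"},
    {"product": "book"},
]
counts = run_group_by_count(orders)
print(counts)  # {'book': 2, 'pen': 1}
```

On a real cluster the map and reduce steps would run in parallel on many nodes over
data stored in HDFS; here everything runs in one process so the data flow is easy to
follow.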
7. What are the risks of Big Data?
•An organization may be so overwhelmed with big data that it makes no progress.
•Costs may escalate too fast because too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 146 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 147 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 148 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 149 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 150 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 151 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 152 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 153 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 154 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 155 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 156 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The other alternative is to use one of the Map Reduce application development
languages Among these language, Pig, Hive, Jaql are popular one.
6. What is Hive?
Hive is one of the programming languages that support Map Reduce application
development. Hive is developed at Face book. Hive is a query language -Hive Query
Language (HQL). Hive queries are broken down into Map Reduce jobs and executed
across a Hadoop Cluster.
7. What are the risks of BIG DATA?
 An organization will be so overwhelmed with big data that it won’t make any
progress.
 That costs escalate too fast as too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
 The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.

St. Joseph’s College of Engineering Page 157 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use them. Examples:
•Short Case Study 1: Amazon’s personalized product recommendations
•Short Case Study 2: Facebook’s personalized product marketing
•Short Case Study 3: Google Search gaining a competitive edge over Wikipedia
4. List some Big Data platforms.
Hadoop:
•Supports a large Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad-hoc queries (processing)
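The batch-oriented MapReduce model named above can be illustrated with a minimal single-machine sketch. This is plain Python, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the framework distributes the map and reduce tasks across nodes and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because every input line is mapped independently and every key is reduced independently, both phases can run in parallel over a large file, which is why the model suits batch processing rather than ad-hoc queries.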
5. What are the other alternative application development languages?
The alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive, and Jaql are popular.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
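How a query can be broken down into a MapReduce job may be sketched conceptually as follows. This is an illustration in plain Python, not Hive's actual query planner; the table name sales, the columns region and amount, and the functions map_row and run_query are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT region, SUM(amount) FROM sales WHERE amount > 0 GROUP BY region

def map_row(row):
    """Map side: apply the WHERE filter and emit (group key, value)."""
    if row["amount"] > 0:
        yield row["region"], row["amount"]

def run_query(rows):
    """Group the mapped pairs by key (shuffle), then aggregate each group (reduce)."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            grouped[key].append(value)  # shuffle: collect values per GROUP BY key
    return {key: sum(values) for key, values in grouped.items()}  # reduce: SUM
```

In this sketch the filter and column selection happen in the map step, the GROUP BY key becomes the shuffle key, and the aggregate function becomes the reduce step.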
7. What are the risks of Big Data?
•An organization may be so overwhelmed by big data that it makes no
progress.
•Costs may escalate too fast if too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
[Figure: big data taxonomy (diagram missing from source)]
St. Joseph’s College of Engineering Page 169 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data platforms.
Hadoop:
•Supports a large Hadoop Distributed File System (HDFS)
•Supports a batch-oriented distributed computing architecture with MapReduce
•HDFS and MapReduce are the two core components of Hadoop
•Apache Hadoop is open source and suited to batch processing
•Not suitable for ad hoc query processing
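The MapReduce model named above can be illustrated with a minimal single-machine sketch. This is plain Python, not the Hadoop API: the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and a real cluster would run the map and reduce steps in parallel on many nodes, reading from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big insight", "data at scale"]))
```

Each input line is mapped independently of the others, which is what lets a batch framework spread the work across machines.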
5. What are the other alternative application development languages?
The other alternative is to use one of the MapReduce application development
languages. Among these languages, Pig, Hive and Jaql are popular ones.
6. What is Hive?
Hive is one of the programming languages that support MapReduce application
development. Hive was developed at Facebook. Its query language is the Hive Query
Language (HQL). Hive queries are broken down into MapReduce jobs and executed
across a Hadoop cluster.
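As a rough illustration of how a query can be broken down into map and reduce steps, the sketch below evaluates a simple grouped sum by hand in plain Python. The table name orders, its columns customer and amount, and the function run_group_by_sum are all hypothetical, and the code does not reproduce Hive's actual query planner.

```python
from collections import defaultdict

# Hypothetical HQL query being illustrated (table and column names are invented).
HQL = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"


def run_group_by_sum(rows):
    """Evaluate the grouped sum above as map, shuffle and reduce steps."""
    # Map: project each row to a (group key, value) pair.
    mapped = [(row["customer"], row["amount"]) for row in rows]

    # Shuffle: bring together all values that share the same key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    orders = [
        {"customer": "a", "amount": 10},
        {"customer": "b", "amount": 5},
        {"customer": "a", "amount": 7},
    ]
    print(run_group_by_sum(orders))
```

The GROUP BY column plays the role of the MapReduce key, and the aggregate function becomes the reduce step.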
7. What are the risks of Big Data?
•An organization may become so overwhelmed with big data that it makes no
progress.
•Costs may escalate too fast because too much big data is captured before the
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
•The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
St. Joseph’s College of Engineering Page 181 of 193


IT6006- Data Analytics Department of CSE 2018-2019
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems – Web data –
Evolution of Analytic scalability, analytic processes and tools, Analysis vs reporting –
Modern data analytic tools, Stastical concepts: Sampling distributions, resampling,
statistical inference, prediction error.
UNIT-I / PART-A
1. What are the Three Vs of big data? (Nov/Dec 2016)
3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data
and velocity refers to the speed of data processing.
2. What is the source of the Big Data?
Human Generated Data, ◦Email, ◦Social Networks, ◦Cloud Storage hosted info,
◦Electronics publications –Scientific, Engineering, social, admin, ◦Enterprise Web pages,
◦Product Documents, ◦Legacy documents presented in electronic form, ◦On-line Videos
(TV, YouTube,), ◦Photos, ◦Machine Generated Data.
3. What do we do with Big Data?
•Extract useful patterns from Big Data and use it for
•Short Case Study 1: Amazon’s personalized product recommendation
•Short Case Study 2:FB personalized product marketing
•Short Case Study 3: Google search –getting competitive edger over Wikipedia.
4. List some Big Data Platform.
Hadoop -Supports a large Hadoop distributed file system (HDFS)
•Supports a batch-oriented distributed computing architecture with Map Reduce
•HDFS and Map Reduce are the two core components of Hadoop
•Hadoop Apacheis open source. Suited for Batch processing,
•Not suitable for Ad-hoc queries (processing)
5. What are the other alternative application development languages?
The alternative is to use one of the MapReduce application development languages.
Among these languages, Pig, Hive and Jaql are popular.
6. What is Hive?
Hive is one of the languages that support MapReduce application development. It was
developed at Facebook. Its query language is the Hive Query Language (HQL). Hive
queries are broken down into MapReduce jobs and executed across a Hadoop cluster.
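To make the idea of a query being broken down into map and reduce work more concrete, the sketch below mimics a grouped sum such as SELECT category, SUM(amount) FROM sales GROUP BY category. It is a conceptual illustration in plain Python, not Hive's actual query planner; the sales table, its rows and the function names are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a hypothetical `sales` table: (category, amount).
SALES = [("books", 12.0), ("toys", 5.0), ("books", 8.0), ("toys", 7.5)]


def map_rows(rows):
    """Map step: emit the GROUP BY column as key and the summed column as value."""
    for category, amount in rows:
        yield category, amount


def reduce_groups(pairs):
    """Sort by key (standing in for the shuffle), then sum each group's values."""
    ordered = sorted(pairs, key=itemgetter(0))
    for category, group in groupby(ordered, key=itemgetter(0)):
        yield category, sum(amount for _, amount in group)


totals = dict(reduce_groups(map_rows(SALES)))
print(totals)
```

In this picture the GROUP BY column plays the role of the MapReduce key and the aggregate function plays the role of the reducer.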
7. What are the risks of BIG DATA?
• An organization may be so overwhelmed with big data that it won’t make any
progress.
• Costs escalate too fast when too much big data is captured before an
organization knows what to do with it. As with anything, avoiding this is a
matter of making sure that progress moves at a pace that allows the organization
to keep up.
• The biggest risk with many sources of big data is privacy.
8. Draw the big data taxonomy.
