01 Unit-I Introduction To Big Data
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Big data refers to the large, diverse sets of information that grow at ever-increasing
rates. It encompasses the volume of information, the velocity or speed at which it is
created and collected, and the variety or scope of the data points being covered.
In short, the term Big Data applies to information that can’t be processed or analyzed
using traditional processes or tools.
The events of the last 20 years have fundamentally changed the way data is generated.
We create more of it each day; it is not a waste product but a buried treasure waiting to
be discovered by curious, motivated researchers and practitioners who see these trends
and are reaching out to meet the current challenges.
Existing data analytics architectures are therefore evolving in order to keep pace
with changing demands and technologies.
Velocity: Velocity refers to the rate at which data is generated and the speed at which
it must be analyzed and acted upon. The proliferation of digital devices such as
smartphones and sensors has led to data being generated continually, at a pace that is
impossible for traditional systems to capture, store and analyze.
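As a rough illustration of velocity, streaming systems often compute rolling aggregates over the most recent data rather than storing everything first. The sketch below (a hypothetical, simplified example; the class name and timestamps are made up) counts events inside a sliding time window:

```python
from collections import deque

# Hypothetical sketch: count events within a sliding time window,
# the kind of incremental computation applied to high-velocity streams.
class SlidingWindowCounter:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of recent events

    def record(self, timestamp):
        self.events.append(timestamp)
        # Evict events that have fallen outside the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def rate(self):
        # Number of events still inside the window.
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=10)
for t in [1, 2, 3, 11, 12]:
    counter.record(t)
print(counter.rate())  # prints 3: events at t=1 and t=2 have expired
```

The point of the design is that each arriving event is processed once, in constant amortized time, instead of re-scanning a stored history.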
Variety: Variety refers to the different types of data and data sources. Variety in big
data is a measure of the heterogeneity of data representation: structured, semi-structured
and unstructured. With the explosion of sensors, smart devices and social collaboration
technologies, enterprise data now includes not just traditional relational records but
also raw, semi-structured and unstructured content.
Veracity: Veracity denotes data uncertainty, that is, the level of reliability
associated with certain types of data. Some data is inherently uncertain, for example:
sentiment, truthfulness, weather conditions, and economic factors. The need to
acknowledge and plan for this dimension of uncertainty remains a major quality concern
in the processing of big data.
The term big data applies when traditional data mining and handling techniques
cannot uncover the insights and meaning of the underlying data. Such data requires a
different processing approach, called big data analytics, which uses massive
parallelism on readily available hardware.
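The parallelism mentioned above typically follows the map/reduce pattern: the input is split into chunks, each chunk is processed independently (on separate machines in a real cluster), and the partial results are merged. A minimal single-process sketch, with made-up input text:

```python
from collections import Counter
from functools import reduce

# Minimal sketch of the map/reduce pattern behind big data analytics.
# In a real cluster each chunk's map step would run on a different machine.
chunks = [
    "big data refers to large diverse data",
    "data grows at ever increasing rates",
]

def map_chunk(chunk):
    # Map step: count words within a single chunk.
    return Counter(chunk.split())

def reduce_counts(a, b):
    # Reduce step: merge two partial word counts.
    return a + b

partials = [map_chunk(c) for c in chunks]   # the parallelizable step
totals = reduce(reduce_counts, partials)
print(totals["data"])  # prints 3
```

Because the map step has no shared state, it scales out by simply adding more workers; only the reduce step combines results.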
Unstructured data: Unstructured data is information that either does not have a pre-
defined data model or is not organized in a pre-defined manner. It takes different forms
such as text, images, video, and documents.
Real-time media: Real-time streaming of live or stored media data. The main sources of
media data are services such as YouTube, video conferencing, and Flickr.
Natural language data: Human-generated data, particularly in verbal form. The sources
of natural language data include speech capture devices, landline phones, mobile
phones, and Internet of Things devices that generate large volumes of text-like
communication between devices.
Event data: Data generated by matching external events with time series. For example,
information related to vehicle crashes or accidents can be collected and analyzed to
help understand what the vehicles were doing before, during and after the event.
Network data: Data concerning very large networks, such as social networks (e.g.
Facebook and Twitter), information networks (e.g. the World Wide Web), and biological
networks. Network data is represented as nodes connected via one or more types of
relationship.
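The node-and-relationship structure described above can be sketched as an adjacency list. The people and edge types ("follows", "mentions") below are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical sketch: network data as nodes connected by typed relationships.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "mentions", "carol"),
]

# Build an adjacency list keyed by source node.
graph = defaultdict(list)
for src, rel, dst in edges:
    graph[src].append((rel, dst))

# Query: whom does alice reach, and via which relationship type?
print(graph["alice"])  # prints [('follows', 'bob'), ('mentions', 'carol')]
```

Real network datasets use the same idea at vastly larger scale, which is why graph-specific storage and partitioning strategies are needed.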
Linked data: Data that is built upon standard Web technologies such as HTTP, RDF,
SPARQL and URIs to share information that can be semantically queried by computers.
Finally, each data type has different requirements for analysis and poses different
challenges.
Six Big Data use cases at organizations across industries, illustrating various
architectural approaches for modernizing their information management platforms, are
discussed in the table below.

Table 1.3. Patterns for Big Data Development

| Case | Industry | Motivation | Scope |
|------|----------|------------|-------|
| 1 | Banking | Transformational Modernization | Transform core business processes to improve decision-making agility, and transform and modernize the supporting information architecture and technology. |
| 2 | Retail | Agility and Resiliency | Develop a two-layer architecture that includes a business process–neutral … |
Data can be either structured or unstructured. Data science helps us transform a
business problem into a research project and then transform that into a practical solution.
Machine learning helps systems learn from sample/training data and predict results by
teaching themselves with various algorithms. An ideal machine learning model would
require no human intervention at all; however, such models do not yet exist. Machine
learning is used in sectors such as healthcare, infrastructure, science, education,
banking, finance, and marketing.
Big Data is used to store, analyze and organize huge volumes of structured as well as
unstructured datasets.
| Machine Learning | Big Data |
|------------------|----------|
| Deals with using more data as input and algorithms to predict future outcomes based on trends. | Deals with the extraction and analysis of data from a large number of datasets. |
| Uses tools such as NumPy, Pandas, TensorFlow, and Keras to analyze datasets. | Requires tools like Apache Hadoop and MongoDB. |
| Can learn from training data and act intelligently, making effective predictions by teaching itself using algorithms. | Big Data analytics pulls raw data and looks for patterns to help firms make stronger decisions. |
| Its scope is vast: improving the quality of predictions, building strong decision-making capability, cognitive analysis, improving healthcare services, speech and text recognition, etc. | Its scope is not limited to collecting huge amounts of data; it also extends to optimizing data for analysis. |
| Has a wide range of applications, such as email and spam filtering, product recommendation, infrastructure, marketing, transportation, medicine, finance and banking, education, and self-driving cars. | Also has a wide range of applications for the analysis and storage of data in structured form, such as stock market analysis. |
| Does not need human intervention for the complete process, because it uses various algorithms to build intelligent models that predict results. It typically works with limited-dimensional data, which makes features easier to recognize. | Requires human intervention because of the huge amount of multidimensional data; this multidimensionality makes it difficult to extract features from the data. |
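The "predict future outcomes based on trends" idea in the left column can be sketched with the simplest possible model: a least-squares line fit over past observations. The numbers below are made up for the example:

```python
# Minimal sketch of trend-based prediction: fit y = slope*x + intercept
# by ordinary least squares over past observations, then extrapolate.
xs = [1, 2, 3, 4, 5]       # e.g. months (made-up data)
ys = [10, 12, 14, 16, 18]  # e.g. sales, perfectly linear here

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Predict the next point on the trend.
print(slope * 6 + intercept)  # prints 20.0
```

Real machine learning libraries such as the ones named in the table wrap far more capable versions of this fit-then-predict loop, but the workflow is the same: learn parameters from past data, then apply them to unseen inputs.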
| Data Science | Big Data |
|--------------|----------|
| The study of working with huge volumes of data, enabling descriptive, predictive, and prescriptive analytical models. | The study of collecting and analyzing huge volumes of data sets to find hidden patterns that support stronger decision-making. |
| Its main aim is to build data-based products for firms. | Its main goal is to extract useful information from huge volumes of data and use it to build products for firms. |
| Requires strong knowledge of Python, R, SAS, and Scala, as well as hands-on knowledge of SQL databases. | Requires tools like Apache Hadoop and MongoDB. |
| Used for scientific or research purposes. | Used for business and customer satisfaction. |
| Broadly focuses on the science of the data. | More involved with the processes of handling voluminous data. |