
SHRI VISHNU ENGINEERING COLLEGE FOR WOMEN :: BHIMAVARAM

(Autonomous)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

BIG DATA TECHNOLOGIES


UNIT- I

Introduction to Big Data

1.1. What is Big Data?


Data is a set of qualitative or quantitative variables, and it can be structured or
unstructured, machine-readable or not, digital or analogue, personal or not. Ultimately,
it is a specific set of individual data points, which can be used to generate insights,
and be combined and abstracted to create information, knowledge, and wisdom.

Big data refers to the large, diverse sets of information that grow at ever-increasing
rates. It encompasses the volume of information, the velocity or speed at which it is
created and collected, and the variety or scope of the data points being covered.

In short, the term Big Data applies to information that can’t be processed or analyzed
using traditional processes or tools.

1.2. Where is Big Data produced?


Big Data can be generated by humans, machines, or a combination of the two. It can
be generated anywhere that information is created and stored, in structured or
unstructured formats: in industry, in military units, on the Internet, in hospitals,
or anywhere else.
The following are some areas that produce big data at an exponential rate.
• Social Media: Statistics show that 500+ terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges, comments, etc.
• Education: A system provides all students with a knowledge tool that includes
the content of millions of books, including textbooks, with a search function.
• Weather: A national weather service processes millions of hourly observations
from thousands of remote weather stations to produce hourly weather forecasts
for hundreds of cities.
• Media: A streaming media service allows millions of customers to access millions
of hours of video and music content on demand.
• Robotics: A robot vacuum cleaner doesn't just learn from its own experiences
but learns from the experience of all robot vacuum cleaners deployed.
• Internet: A search engine ingests millions of new web pages each day to assess
their quality, reputation, usefulness, and context.
• Sensors and records from electronic devices: This kind of information is produced
in real time, and the number and periodicity of observations will vary. The quality
of this kind of source depends mostly on the capacity of the sensor to take
accurate measurements in the way that is expected.
• Business transactions: Data produced as a result of business activities can be
recorded in structured or unstructured databases. The data may be produced too
fast, so different strategies are needed to use and process it. The quality of
information produced from business transactions is tightly related to the capacity
to get representative observations and to process them.
• Electronic files: These refer to unstructured documents, statically or dynamically
produced, which are stored or published as electronic files, such as Internet pages,
videos, audio, PDF files, etc. They can have contents of special interest but are
difficult to extract; different techniques can be used, such as text mining, pattern
recognition, and so on. The quality of our measurements will mostly rely on the
capacity to extract and correctly interpret all the representative information from
those documents.
• Broadcasting: This mainly refers to video and audio produced in real time. Getting
statistical data from the contents of this kind of electronic data is, for now, too
complex and demands substantial computational and communications power. Once
the problems of converting "digital-analog" content into "digital-data" content are
solved, we will face processing complications similar to those found in social
interactions.

1.3. Rise of Big Data


A number of influential events prepared for and gave rise to the big data era, along
with significant milestones during the era.
• 1991 - The Internet, or World Wide Web as we know it, is born. The Hypertext
Transfer Protocol (HTTP) becomes the standard means for sharing information in
this new medium.
• 1995 - Sun releases the Java platform. Java, invented in 1991, has become the
second most popular language behind C. It dominates the Web applications space
and is the de facto standard for middle-tier applications. These applications are the
source for recording and storing web traffic. The Global Positioning System (GPS)
also becomes fully operational.
• 1998 - Carlo Strozzi develops an open-source relational database and calls it NoSQL.
Ten years later, a movement to develop NoSQL databases to work with large,
unstructured data sets gains momentum.
Google is founded by Larry Page and Sergey Brin, who worked for about a year on a
Stanford search engine project called BackRub.
• 2003 - According to studies by IDC and EMC, the amount of data created in 2003
surpasses the amount of data created in all of human history before then.
LinkedIn, the popular social networking website for professionals, launches.
• 2004 - Facebook, the social networking service, is launched.
• 2005 - The Apache Hadoop project is created by Doug Cutting and Mike Cafarella.
• 2008 - The number of devices connected to the Internet exceeds the world's
population.
• 2013 - The democratization of data begins. With smartphones, tablets, and Wi-Fi,
everyone generates data at prodigious rates. More individuals access large volumes
of public data and put data to creative use.

The events of the last 20 years have fundamentally changed the way big data arises.
We create more of it each day; it is not a waste product but a buried treasure waiting to
be discovered by curious, motivated researchers and practitioners who see these trends
and are reaching out to meet the current challenges.

1.4. Hadoop vs. Traditional Systems


• Comparison between RDBMS and Hadoop:
Hadoop is an open-source software framework that allows distributed storage and
processing of huge amounts of data, i.e., Big Data. The Hadoop framework works
very well with structured, semi-structured, and unstructured data. Hadoop is a good
choice in environments that require big data processing where the data being
processed does not have dependable relationships.

RDBMS works efficiently when there is an entity-relationship flow that is defined
perfectly. An RDBMS works well with structured data only. The comparison between
RDBMS and Hadoop is presented in Table 1.1. The comparison between a data
warehouse and Hadoop is presented in Table 1.2.

Table 1.1. RDBMS vs Hadoop

Feature          | RDBMS                                 | Hadoop
-----------------|---------------------------------------|------------------------------------------
Data Variety     | Mainly for structured data            | Used for structured, semi-structured, and unstructured data
Data Storage     | Average-size data (GBs)               | Used for large datasets (TBs and PBs)
Querying         | SQL                                   | Hive Query Language (HiveQL)
Schema           | Required on write (static schema)     | Required on read (dynamic schema)
Speed            | Reads are fast                        | Both reads and writes are fast
Cost             | Licensed                              | Free, open source
Use case         | OLTP (online transaction processing)  | Analytics (audio, video, logs, etc.), data discovery
Data Objects     | Works on relational tables            | Works on key/value pairs
Throughput       | Low                                   | High
Scalability      | Vertical                              | Horizontal
Hardware Profile | High-end servers                      | Commodity/utility hardware
Integrity        | High                                  | Low
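
To make the key/value processing model in Table 1.1 concrete, below is a minimal sketch
of a word-count job for Hadoop Streaming, which lets any executable act as mapper and
reducer over standard input/output. The file names and sample launch command are
illustrative assumptions, not something prescribed by these notes.

# wordcount_mapper.py -- emits one tab-separated (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# wordcount_reducer.py -- sums counts per word; Hadoop sorts mapper output
# by key, so identical words arrive at the reducer consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job might then be launched with something like:
hadoop jar hadoop-streaming.jar -files wordcount_mapper.py,wordcount_reducer.py
-input /data/in -output /data/out -mapper "python wordcount_mapper.py"
-reducer "python wordcount_reducer.py"
(the exact streaming jar path varies by installation).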

• Comparison between Data Warehouse and Hadoop:

Table 1.2. Data Warehouse vs Hadoop

Measures   | Data Warehouse                                  | Hadoop
-----------|-------------------------------------------------|------------------------------------------
Data       | Analyzes structured and processed data          | Can process any kind of data: structured, semi-structured, or unstructured
Processing | Based on schema-on-write concepts               | Based on schema-on-read concepts
Storage    | Suitable for small data volumes; too expensive for large volumes | Works well with large data sets having huge volume, velocity, and variety
Agility    | Less agile, of fixed configuration              | Highly agile; configure and reconfigure as needed
Security   | Technologies have been around for decades, so security can be relied upon | Technologies are relatively new, so security is a bigger concern
Users      | Business professionals                          | Data scientists and data engineers
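
The schema-on-write versus schema-on-read contrast in Tables 1.1 and 1.2 can be
illustrated with a short, self-contained Python sketch; the table name and sample
records below are illustrative assumptions.

import json
import sqlite3

# Schema-on-write (RDBMS / data warehouse style): the table structure must
# be declared before any row is stored, and every row must conform to it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, qty INTEGER, price REAL)")
db.execute("INSERT INTO sales VALUES (?, ?, ?)", ("pen", 10, 1.50))

# Schema-on-read (Hadoop style): raw records are stored as-is, and a
# structure is imposed only at query time, so irregular records are
# tolerated until someone actually reads them.
raw_records = [
    '{"item": "pen", "qty": 10, "price": 1.5}',
    '{"item": "book", "qty": 2}',  # missing field is fine until read time
]
for rec in raw_records:
    parsed = json.loads(rec)
    print(parsed.get("item"), parsed.get("qty"), parsed.get("price", 0.0))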

1.5. Limitations and Solutions of existing Data Analytics Architecture


Data Analytics architecture refers to the systems, protocols, and technology used
to collect, store, and analyze data. When building analytics architecture,
organizations need to consider both the hardware (how data will be physically
stored) and the software that will be used to manage and process it.

Analytics architecture also focuses on multiple layers, starting with data warehouse
architecture, which defines how users in an organization can access and interact
with data. Storage is a key aspect of creating a reliable analytics process, as it will
establish how your data is organized, who can access it, and how quickly it can be
referenced.

Analytics architecture is used in several industries to allow organizations and
companies to make better decisions, as well as to verify or disprove existing
theories or models. The focus of data analytics lies in inference, the process of
deriving conclusions based solely on what the researcher already knows.

Limitations of existing Data Analytics Architecture:


• Inefficient in keeping pace with increasing data demands
• Inefficient in defining strict security principles
• No guidance for real-time analytics
• Change management
• Incapable of offering data as a service
• No guidance for promoting self-service environments
• Data redundancy
• Complex, lengthy, and inflexible processes

So, the existing data analytics architecture is going through evolution and change
in order to keep pace with changing demands and technologies.

Solutions of existing Data Analytics Architecture:


On the technology front, there is a clear shift towards Big Data technologies, which
have already started gaining wide popularity and adoption in the market.
• Big data solutions are ideal for analyzing not only raw structured data, but
also semi-structured and unstructured data from a wide variety of sources.
• Big data solutions are ideal when all, or most, of the data needs to be
analyzed rather than a sample, or when sampling is not nearly as effective
as analysis over the larger data set.
• Big data solutions are ideal for iterative and exploratory analysis when
business measures on the data are not predetermined.
• Big data is well suited to solving information challenges that do not natively
fit within a traditional relational database approach to the problem at hand.

1.6. Attributes of Big Data
Big data refers to the large, diverse sets of information that grow at ever-increasing
rates. It encompasses the volume of information, the velocity or speed at which it is
created and collected, and the variety or scope of the data points being covered. Big
data often comes from multiple sources and arrives in multiple formats. The
characteristics of Big Data are presented in Figure 1.1.

Figure 1.1. Characteristics of Big Data


Volume: Volume indicates the size of the data. The volume of big data has evolved from
megabytes to gigabytes, gigabytes to terabytes, terabytes to petabytes, and petabytes
to exabytes. Big data volumes are relative and vary with factors such as time and the
type of data. This data is too big to be handled by the current state of techniques and
systems, and in the future it will continue to expand exponentially at an unprecedented
rate.

Velocity: Velocity refers to the rate at which data is generated and the speed at which
it should be analyzed and acted upon. The proliferation of digital devices such as
smartphones and sensors has led to an unprecedented rate of data creation, with data
continually being generated at a pace that is impossible for traditional systems to
capture, store, and analyze.

Variety: Variety denotes the different types of data and data sources. Variety in big
data is a measure of the heterogeneity of data representation: structured,
semi-structured, and unstructured. With the explosion of sensors, smart devices, and
social collaboration technologies, data is being generated in countless forms, including
text, web data, tweets, sensor data, audio, video, click streams, log files, and more.

Veracity: Veracity denotes data uncertainty. Veracity in big data is the level of
reliability associated with certain types of data. Some data is inherently uncertain,
for example: sentiment, truthfulness, weather conditions, and economic factors. The
need to acknowledge and plan for this dimension of uncertainty remains a major quality
concern in the processing of big data.

Value: Value refers to the worth that can be extracted from data. A new generation of
big data technologies and architectures is designed to economically extract value from
the typical characteristics of data.

Big data processing is used when traditional data mining and handling techniques cannot
uncover the insights and meaning of the underlying data. This type of data requires a
different processing approach, called big data analytics, which uses massive
parallelism on readily available hardware.
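
As a toy illustration of massive parallelism, the sketch below partitions a workload
across all local CPU cores using only Python's standard library; a real big data system
would distribute the same map-and-combine pattern across many machines. The synthetic
word-counting workload is an illustrative assumption.

from multiprocessing import Pool, cpu_count

def count_words(chunk):
    # Each worker counts words in its own partition independently.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = [f"record {i} with some sample text" for i in range(1_000_000)]
    n = cpu_count()
    chunks = [lines[i::n] for i in range(n)]  # partition the data n ways
    with Pool(processes=n) as pool:
        partial = pool.map(count_words, chunks)  # parallel "map" step
    print(sum(partial))                          # combine partial results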

1.7. Types of data


The data types involved in Big Data analytics are many:

Structured data: Structured data refers to any data that resides in a fixed field within a
record or file. This includes data contained in relational databases and spreadsheets.

Unstructured data: Unstructured data is information that either does not have a
pre-defined data model or is not organized in a pre-defined manner. It takes different
forms such as text, images, video, and documents.

Geographic data: Data related to roads, buildings, lakes, addresses, people,
workplaces, and transportation routes, generated by geographic information systems.
These data link place, time, and attributes.

Real-time media: Real-time streaming of live or stored media data. The main sources of
media data are services such as YouTube, video conferencing, and Flickr.

Natural language data: Human-generated data, particularly in verbal form. Sources of
natural language data include speech capture devices, landline phones, mobile phones,
and Internet of Things devices that generate large volumes of text-like communication
between devices.

Time series: A sequence of data points, typically consisting of successive
measurements made over a time interval. The goals are to detect trends and anomalies,
identify context and external influences, and compare individuals against the group or
compare an individual at different times (see the sketch at the end of this section).

Event data: Data generated by matching external events with time series. For example,
information related to vehicle crashes or accidents can be collected and analyzed to
help understand what the vehicles were doing before, during, and after the event.

Network data: Data concerning very large networks, such as social networks (e.g.,
Facebook and Twitter), information networks (e.g., the World Wide Web), and biological
networks. Network data is represented as nodes connected via one or more types of
relationship.

Linked data: Data that is built upon standard Web technologies such as HTTP, RDF,
SPARQL and URIs to share information that can be semantically queried by computers.

Finally, each data type has different requirements for analysis and poses different
challenges.
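
As an example of the time-series case above, the short pandas sketch below smooths a
signal to expose its trend and flags anomalies; the sensor readings and the 2-sigma
threshold are illustrative assumptions.

import pandas as pd

# Hourly sensor readings with one injected anomaly (illustrative data).
readings = pd.Series(
    [20.1, 20.3, 20.2, 20.5, 35.0, 20.4, 20.6, 20.3],
    index=pd.date_range("2024-01-01", periods=8, freq="h"),
)

trend = readings.rolling(window=3, center=True).mean()  # smoothed trend
deviation = (readings - readings.mean()).abs()
anomalies = readings[deviation > 2 * readings.std()]    # simple 2-sigma rule

print(trend)
print(anomalies)  # flags only the 35.0 spike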

1.8. Use Cases of Big Data


Just as Big Data does not mean a single technology or a single use case, there is no
single path to start or expand an existing Big Data architecture. Some organizations may
want to supplement their relational data warehouse with a Hadoop-based data store,
while others may want to deploy streaming event analytics. Some may want to focus on
a narrowly scoped data exploration project, while others may want to operationalize
their findings based on data-driven analytical models. There is no one right path.

Six Big Data use cases at organizations across industries, illustrating various
architectural approaches to modernizing information management platforms, are
discussed in Table 1.3.
Table 1.3. Patterns for Big Data Development

1. Banking - Transformational Modernization: Transform core business processes to
improve decision-making agility, and transform and modernize the supporting
information architecture and technology.
2. Retail - Agility and Resiliency: Develop a two-layer architecture that includes a
business process-neutral canonical data model and a separate layer that allows agile
addition of any type of business interpretation or optimization.
3. Investment Banking - Complementary Expansion: Complement the existing relational
data warehouse with a Hadoop-based data store to address near-real-time financial
consolidation and risk assessment.
4. Travel - Targeted Enablement: Improve a personalized sales process by deploying a
specific, targeted solution based on real-time decision management, while ensuring
minimal impact on the rest of the information architecture.
5. Consumer Packaged Goods - Optimized Exploration: Enable the ingestion, integration,
exploration, and discovery of structured, semi-structured, and unstructured data,
coupled with advanced analytic techniques, to better understand the buying patterns
and profiles of customers.
6. Higher Education - Vision Development: Guarantee architectural readiness for new
requirements that would ensure a much higher satisfaction level from end users as they
seek to leverage new data and new analytics to improve decision making.

1.9. Other Technologies vs. Big Data

Data Science vs. Machine Learning vs. Big Data


Data Science, Machine Learning, and Big Data are all buzzwords today. Data science is a
method for preparing, organizing, and manipulating data in order to perform data
analysis. After analyzing the data, we extract the structured data, which is later used
to train machine learning models. Hence, these three technologies are interrelated, and
together they can produce remarkable outcomes. Data is the key player in the IT world,
and all three technologies are based on data.

What is Data Science?


Data science is defined as the field of study of the scientific methods, algorithms,
tools, and processes that extract useful insights from vast amounts of data. It also
enables data scientists to discover hidden patterns in raw data. It allows us to deal
with Big Data, including its extraction, organization, preparation, and analysis.

The data can be structured, unstructured, or both. Data science helps us transform a
business problem into a research project and then transform that into a practical
solution. The term Data Science has emerged because of the evolution of mathematical
statistics, data analysis, and big data.

What is Machine Learning?


Machine Learning is defined as the subset of Artificial Intelligence that enables
machines/systems to learn from past experiences or trends and predict future events
accurately.

It helps systems learn from sample/training data and predict results by teaching
themselves with various algorithms. An ideal machine learning model would require no
human intervention at all; however, such models do not yet exist. The use of Machine
Learning can be seen in various sectors such as healthcare, infrastructure, science,
education, banking, finance, and marketing.

What is Big Data?


Big data is the huge, voluminous data, information, or relevant statistics acquired by
large organizations that is difficult to process with traditional tools. Big data
systems can analyze structured, unstructured, or semi-structured data. Data is one of
the key players in running any business, and it is increasing exponentially with the
passage of time. A decade ago, organizations could deal only with gigabytes of data and
suffered problems with data storage, but with the emergence of big data frameworks such
as Hadoop and of cloud storage, organizations are now capable of handling, storing, and
analyzing petabytes and exabytes of data.

Big Data is used to store, analyze, and organize huge volumes of structured as well as
unstructured datasets.

Machine Learning vs. Big Data

Machine Learning: Deals with using data as input to algorithms in order to predict
future outcomes based on trends.
Big Data: Deals with the extraction and analysis of data from large numbers of datasets.

Machine Learning: Includes techniques such as supervised, unsupervised,
semi-supervised, and reinforcement learning.
Big Data: Can be categorized as structured, unstructured, and semi-structured.

Machine Learning: Uses tools such as NumPy, Pandas, TensorFlow, and Keras to analyze
datasets.
Big Data: Requires tools like Apache Hadoop and MongoDB.

Machine Learning: Can learn from training data and act intelligently, making effective
predictions by teaching itself using algorithms.
Big Data: Analytics pulls raw data and looks for patterns to support stronger
decision-making for firms.

Machine Learning: Helpful for virtual assistance, product recommendations, email spam
filtering, etc.
Big Data: Helpful for many different purposes, including stock analysis, market
analysis, etc.

Machine Learning: Has a vast scope, such as improving the quality of predictions,
building strong decision-making capability, cognitive analysis, improving healthcare
services, and speech and text recognition.
Big Data: Its scope is not limited to collecting huge amounts of data; it also covers
optimizing the data for analysis.

Machine Learning: Has a wide range of applications, such as email and spam filtering,
product recommendation, infrastructure, marketing, transportation, medicine, finance
and banking, education, and self-driving cars.
Big Data: Also has a wide range of applications involving the analysis and storage of
data in a structured format, such as stock market analysis.

Machine Learning: Does not need human intervention for the complete process, because it
uses algorithms to build intelligent models that predict results; it also works with
data of limited dimensionality, which makes recognizing features easier.
Big Data: Requires human intervention because of the huge amount of multidimensional
data; this high dimensionality makes it difficult to extract features from the data.
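As a minimal illustration of a model "teaching itself" from training data, the sketch
below fits a simple linear model with NumPy (one of the tools named above); the
hours-versus-score numbers are illustrative assumptions.

import numpy as np

# Training data: hours studied vs. exam score (illustrative numbers).
hours = np.array([1, 2, 3, 4, 5], dtype=float)
score = np.array([52, 57, 61, 68, 71], dtype=float)

# "Learning" here means fitting score ~= a * hours + b to past examples.
a, b = np.polyfit(hours, score, deg=1)

# Prediction for an unseen input, based purely on the learned trend.
new_hours = 6.0
print(f"predicted score for {new_hours} hours: {a * new_hours + b:.1f}")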

Data Science vs. Big Data

Data Science: The study of working with huge volumes of data, enabling descriptive,
predictive, and prescriptive analytical models.
Big Data: The study of collecting and analyzing huge volumes of data sets to find
hidden patterns that support stronger decision-making.

Data Science: A combination of various concepts from computer science, statistics, and
applied mathematics.
Big Data: A technique for extracting meaningful insights from complex data sets.

Data Science: Its main aim is to build data-based products for firms.
Big Data: Its main goal is to extract useful information from huge volumes of data and
use it to build products for firms.

Data Science: Requires strong knowledge of Python, R, SAS, or Scala, as well as
hands-on knowledge of SQL databases.
Big Data: Requires tools like Apache Hadoop and MongoDB.

Data Science: Used for scientific or research purposes.
Big Data: Used for business purposes and customer satisfaction.

Data Science: Broadly focuses on the science of the data.
Big Data: More involved with the processes of handling voluminous data.

Data Science: Includes various data operations such as collection, cleaning, and
manipulation.
Big Data: Includes the analysis of data stored in a structured format, such as stock
market analysis.
