Lecture 1
Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. It is data of such size and complexity that no traditional
data management tool can store or process it efficiently.
Traditional Data vs. Big Data:
• Volume: Traditional data ranges from gigabytes to terabytes; big data ranges from
petabytes to zettabytes or exabytes.
• Data types: Traditional database systems deal with structured data; big data systems
deal with structured, semi-structured, and unstructured data.
• Generation rate: Traditional data is generated per hour or per day; big data is
generated far more frequently, often every second.
• Size: Traditional data is comparatively small; big data is far larger.
• Tooling: Traditional database tools suffice for traditional data operations; big data
requires special-purpose tools for schema-based operations.
• Functions: Normal functions can manipulate traditional data; big data requires
special-purpose functions.
• Data model: Traditional data follows a strict, static schema; big data follows a flat,
dynamic schema.
• Relationships: Traditional data is stable with known interrelationships; big data is
volatile with unknown relationships.
• Sources: Traditional data comes from ERP transactions, financial data, organizational
data, web transactions, etc.; big data comes from social media, device and sensor data,
video, images, audio, etc.
Structured Data
• Key Characteristics:
Organized & Standardized – Data is stored in fixed formats (rows and
columns).
Easily Searchable – Can be quickly queried using SQL.
Highly Scalable – Can be managed efficiently in large databases.
Fixed Schema – The structure of the data (like column names and data
types) is predefined.
• Examples
🔹 Student Records – A database storing student details with fields like Roll
No, Name, Course, and Marks.
🔹 Bank Transactions – A banking system storing transaction details
(Account No, Transaction Amount, Date).
🔹 E-commerce Data – Online retailers storing product details (Product ID,
Name, Price, Stock).
🔹 Employee Records – HR databases containing structured employee
information.
Unstructured Data
• Key Characteristics:
Unorganized & Raw – Data exists in its natural form without any
predefined schema.
Difficult to Search & Analyze – Traditional SQL queries cannot retrieve
information easily.
Massive in Volume – Unstructured data forms over 80% of the world’s
digital data.
Requires AI & ML for Analysis – Deep learning models help in
recognizing patterns in images, videos, and text.
• Examples
🔹 Social Media Posts – Text, images, and videos posted on Facebook, Twitter,
Instagram.
🔹 Emails & Messages – Emails have structured headers (To, From) but an unstructured
body.
🔹 Videos & Images – Surveillance footage, YouTube videos, and digital images.
🔹 IoT Sensor Data – Unstructured logs generated by smart devices like temperature
sensors.
Real-World Use Cases:
Netflix & YouTube: Analyze unstructured data (watch history) to recommend content.
Facebook & Twitter: Process billions of posts for sentiment analysis & trends.
Healthcare: AI scans X-ray images to detect diseases.
INDEXING: Data is indexed to enable faster search and retrieval. Based on some value
in the data, an index is defined as an identifier that represents a larger record in the
data set. Indexing unstructured data is difficult: text can be indexed on a text string,
but for non-text files such as audio or video, indexing typically depends on file names.
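As a minimal illustration of the idea, the Python sketch below builds an in-memory index over a hypothetical record set, so a lookup on the indexed field avoids scanning every record (the field names are invented for illustration):

```python
# Minimal sketch of indexing: map an indexed field's value to the record's
# position so lookups avoid a full scan. Records are hypothetical.
records = [
    {"roll_no": 101, "name": "Asha"},
    {"roll_no": 102, "name": "Ravi"},
    {"roll_no": 103, "name": "Meena"},
]

# Build the index once: field value -> position in the data set
index = {rec["roll_no"]: i for i, rec in enumerate(records)}

# O(1) lookup by roll number instead of scanning every record
print(records[index[102]])  # {'roll_no': 102, 'name': 'Ravi'}
```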
CAS (Content Addressable Storage): It assigns a unique, content-derived identifier to
every object stored in it. An object is retrieved based on its content and not its
location. It is commonly used to store fixed content such as emails.
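A minimal sketch of the content-addressing idea, assuming an in-memory store and SHA-256 as the content-derived identifier (real systems are distributed, but the addressing principle is the same):

```python
# Content-addressable storage in miniature: the object's address is derived
# from its content (a SHA-256 digest here), not from its storage location.
import hashlib

store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()  # content-derived identifier
    store[key] = data
    return key

def get(key: str) -> bytes:
    return store[key]

addr = put(b"archived email body")
assert get(addr) == b"archived email body"
print(addr)  # identical content always yields the same address
```

Because identical content always maps to the same address, duplicate objects are stored only once, which is one reason CAS suits archival workloads like email.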
Semi-Structured Data
• Key Characteristics
✅ Partially Organized – Some structure is present, but not as rigid as relational databases.
✅ Uses Tags & Metadata – Data often has labels, attributes, or key-value pairs.
✅ Flexible Schema – Unlike structured data, the format can change dynamically.
✅ Common in Web & API Systems – Used in internet applications and data exchanges.
Examples
🔹 JSON & XML Files – Used for data exchange in web applications (APIs).
🔹 HTML & Web Pages – Contains structured elements (headings, paragraphs) but variable
content.
🔹 Log Files & System Records – Semi-structured data generated by software applications.
🔹 Emails (Partially Structured) – Headers are structured, but the email body is
unstructured.
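The flexible schema is easiest to see in code. In the hedged sketch below, two JSON records in the same feed carry different fields, so the consumer reads optional fields defensively (the feed and field names are invented for illustration):

```python
# Semi-structured data: each JSON record carries its own tags/fields,
# so the schema can vary from record to record. Feed is hypothetical.
import json

feed = """[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["new", "mobile"]}
]"""

for record in json.loads(feed):
    # Fields differ per record, so read optional ones defensively.
    print(record["id"], record.get("email", "-"), record.get("tags", []))
```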
Real-World Use Cases:
• Email
• XML
• TCP/IP Packets
• Zipped File
• Binary Executables
• Mark-Up Languages
• Integration of data from heterogeneous sources
Web & API Processing: REST APIs, XML parsers, JSON databases
Key Takeaway: Even in its infancy, data collection and analysis were critical
for governance and decision-making.
The 20th century saw the advent of computers and the first steps toward
modern data management.
Key Takeaway: The 20th century transformed data from a static resource
into a dynamic, digital asset.
Key Takeaway: The early 21st century marked the tipping point where data
became too big, too fast, and too diverse for traditional tools.
The 2010s saw the development of tools and frameworks designed to handle
Big Data:
Machine Learning and AI: Big Data became the fuel for training
advanced algorithms, enabling breakthroughs in fields like natural
language processing and computer vision.
Key Takeaway: The 2010s were about building the infrastructure and tools to
harness the power of Big Data.
5. Big Data Today and the Future (2020s and Beyond)
Smart Cities: Urban centers use Big Data to optimize traffic, reduce
energy consumption, and improve public safety.
b. Data storage
Once the data is collected, it must be stored for efficient retrieval and processing.
Big data platforms typically utilize distributed storage systems that can handle
large volumes of data. These systems include the Hadoop Distributed File System
(HDFS), Google Cloud Storage, and Amazon S3. This distributed storage
architecture ensures high availability, fault tolerance, and scalability.
Technologies Used:
• Hadoop Distributed File System (HDFS) – Stores large datasets across
multiple servers.
• Amazon S3 / Google Cloud Storage – Cloud-based data storage.
• NoSQL Databases (MongoDB, Apache Cassandra, HBase) – Stores
semi-structured and unstructured data.
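As a small illustration of the cloud-storage option, the sketch below uploads a raw log file to Amazon S3 with the boto3 client; it assumes boto3 is installed and AWS credentials are configured, and the bucket, file name, and key are hypothetical:

```python
# Hedged sketch: landing a raw log file in cloud object storage (Amazon S3
# via boto3). Credentials are read from the environment/AWS config.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="events-2024-01-01.log",       # local file (hypothetical)
    Bucket="my-bigdata-landing-zone",       # hypothetical bucket
    Key="raw/logs/events-2024-01-01.log",   # object key in the bucket
)
```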
c. Data processing
Once the data is stored, it must be processed to extract valuable insights. This layer
is responsible for processing large-scale data using parallel computing.
The process involves various operations such as cleaning, transforming, and
aggregating the data. The parallel processing capabilities of big data platforms,
such as Apache Hadoop and Apache Spark, enable rapid computations and
complex data transformations.
Technologies Used:
• Apache Hadoop (MapReduce) – Processes massive datasets across
distributed nodes.
• Apache Spark – Real-time, in-memory Big Data processing.
• Flink / Storm / Kafka – Handles real-time streaming data (IoT, Stock
Market).
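To make the clean/transform/aggregate pipeline concrete, here is a minimal PySpark sketch; the file path and column names are hypothetical, and Spark distributes each step across the cluster's nodes:

```python
# Hedged PySpark sketch: clean, transform, and aggregate a dataset in
# parallel. Path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

daily = (
    df.dropna(subset=["amount"])                             # cleaning
      .withColumn("amount", F.col("amount").cast("double"))  # transforming
      .groupBy("date")
      .agg(F.sum("amount").alias("total_sales"))             # aggregating
)
daily.show()
spark.stop()
```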
d. Data analysis
Data analysis involves examining and interpreting large volumes of data to extract
meaningful insights and patterns. This layer provides data querying,
transformation, and analysis. The analysis process includes using machine
learning algorithms, data mining techniques, or visualization tools to better
understand the information. The analysis results can then be used to make data-
driven decisions, optimize processes, identify opportunities, or solve complex
problems.
Technologies Used:
• SQL-based tools (Hive, Presto, Google BigQuery) – Query large datasets
using SQL.
• AI & Machine Learning (TensorFlow, Scikit-Learn, H2O.ai) – Applies
ML on Big Data for predictions & automation.
• BI Tools (Tableau, Power BI, Looker) – Visualizes insights from Big
Data.
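As one hedged example of the machine-learning path, the sketch below trains a classifier on an analysis-ready feature table with scikit-learn; the file name and columns (including the "churned" label) are invented for illustration:

```python
# Hedged sketch: applying ML to an analysis-ready extract with scikit-learn.
# File name and columns ("churned" label) are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_features.csv")
X, y = df.drop(columns=["churned"]), df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```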
1. Apache Hadoop
Overview
Apache Hadoop is an open-source framework that provides a reliable, scalable, and
distributed computing environment. It is designed to store and process large data
sets across clusters of computers using simple programming models.
Use Cases
Data warehousing: Storing and analyzing large volumes of structured and
unstructured data.
Log and event data processing: Analyzing massive log files and event data in
real-time or batch mode.
Fraud detection: Identifying fraudulent activities by analyzing vast amounts of
transaction data.
Pros and Cons
Pros: Cost-effective, highly scalable, fault-tolerant.
Cons: Complex setup and management, slower processing speed compared to
newer technologies like Spark.
2. Apache Spark
Overview
Apache Spark is a unified analytics engine for big data processing with built-in
modules for streaming, SQL, machine learning, and graph processing. It is
designed to be fast and general-purpose, capable of processing large datasets
quickly by leveraging in-memory processing.
Use Cases
• Real-time data analytics: Analyzing streaming data in real-time for
immediate insights.
• Machine learning: Training models on large datasets using MLlib, Spark’s
machine learning library.
• Interactive data exploration: Allows data scientists to explore and
manipulate large datasets interactively.
Pros and Cons
• Pros: Fast processing, versatile, extensive community support.
• Cons: High memory usage; can be complex to deploy and manage for large-
scale applications.
3. Google BigQuery
• Overview
• Google BigQuery is a fully managed, serverless data warehouse allowing
super-fast SQL queries using Google’s infrastructure. It is designed to
analyze large datasets efficiently. Data strategy consulting can help
organizations integrate BigQuery into their existing data strategy, optimizing
its use for specific business intelligence needs and cost management.
• Use Cases
• Data warehousing: Storing and querying massive datasets without needing
to manage infrastructure.
• Business intelligence: Using SQL queries to generate reports and
dashboards for decision-making.
• Log analysis: Analyzing logs and other event data for operational insights.
• Pros and Cons
• Pros: Fully managed, fast, integrates well with Google’s ecosystem.
• Cons: Costs can be high for large-scale queries, and data transfer fees may
apply.
4. IBM Big Data Platform
• Overview
• IBM’s Big Data Platform offers tools for managing and analyzing big data,
including IBM Watson Studio, IBM Db2 Big Data, and IBM Cloud Pak for
Data. It is known for its comprehensive capabilities and focus on enterprise-
grade solutions.
• Use Cases
• Predictive analytics: Using IBM Watson to predict trends and outcomes
based on big data.
• Data management: Managing large volumes of structured and unstructured
data.
• AI and machine learning: Developing AI-driven applications using
Watson’s machine learning capabilities.
• Pros and Cons
• Pros: Strong enterprise focus, advanced analytics and AI capabilities, robust
security.
• Cons: High cost; may have a steep learning curve for some users.
Detailed Notes :
a. Apache Hadoop
Apache Hadoop is one of the industry's most widely used big data platforms. It is
an open-source framework that enables distributed processing for massive datasets
throughout clusters. Hadoop provides a scalable and cost-effective solution for
storing, processing, and analyzing massive amounts of structured and unstructured
data.
One of the key features of Hadoop is its distributed file system, known as Hadoop
Distributed File System (HDFS). HDFS enables data to be stored across multiple
machines, providing fault tolerance and high availability. This feature allows
businesses to store and process data at a previously unattainable scale. Hadoop also
includes a powerful processing engine called MapReduce, which allows for
parallel data processing across the cluster. The prominent companies that use
Apache Hadoop are:
Yahoo
Facebook
Twitter
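To make the MapReduce model concrete, here is a word-count sketch in plain Python; the real framework runs the map and reduce phases in parallel across HDFS blocks, but the two-phase logic is the same:

```python
# MapReduce in miniature: map emits (key, value) pairs, reduce folds the
# values per key. Hadoop runs these phases in parallel across the cluster.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]  # map: (word, 1) per word

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:                        # reduce: sum per key
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```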
b. Apache Spark
Apache Spark is a unified analytics engine for batch processing, streaming data,
machine learning, and graph processing. It is one of the most popular big data
platforms used by companies. One of the key benefits that Apache Spark offers is
speed. It is designed to perform data processing tasks in-memory and achieve
significantly faster processing times than traditional disk-based systems.
Spark also supports various programming languages, including Java,
Scala, Python, and R, making it accessible to a wide range of developers. Spark
offers a rich set of libraries and tools, such as Spark SQL for querying structured
data, MLlib for machine learning, and GraphX for graph processing. Spark
integrates well with other big data technologies, such as Hadoop, allowing
companies to leverage their existing infrastructure. The prominent companies that
use Apache Spark include:
Netflix
Uber
Airbnb
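As a hedged sketch of the MLlib path mentioned above, the snippet below assembles feature columns and fits a logistic-regression model on a Spark DataFrame; the data path, feature columns, and "label" column are hypothetical:

```python
# Hedged MLlib sketch: assemble features and train a classifier on a
# Spark DataFrame. Path, feature columns, and "label" are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.read.parquet("hdfs:///data/transactions.parquet")

assembler = VectorAssembler(
    inputCols=["amount", "hour", "merchant_risk"], outputCol="features"
)
train = assembler.transform(df).select("features", "label")

model = LogisticRegression().fit(train)
print(model.coefficients)
spark.stop()
```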
c. Google Cloud BigQuery
Google Cloud BigQuery is a top-rated big data platform that provides a fully
managed and serverless data warehouse solution. It offers a robust and scalable
infrastructure for storing, querying, and analyzing massive datasets. BigQuery is
designed to handle petabytes of data and allows users to run SQL queries on large
datasets with impressive speed and efficiency.
BigQuery supports multiple data formats and integrates seamlessly with other
Google Cloud services, such as Google Cloud Storage and Google Data Studio.
BigQuery's unique architecture enables automatic scaling, ensuring users can
process data quickly without worrying about infrastructure management. BigQuery
offers a standard SQL interface for querying data, built-in machine learning
algorithms for predictive analytics, and geospatial analysis capabilities. The
prominent companies that use Google Cloud BigQuery are:
Spotify
Walmart
The New York Times
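A minimal sketch of querying BigQuery from Python with the google-cloud-bigquery client, assuming the library is installed and GCP credentials are configured; the table queried here is one of Google's public datasets:

```python
# Hedged sketch: running standard SQL on BigQuery from Python. Assumes
# google-cloud-bigquery is installed and application credentials are set.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():  # runs serverlessly on Google's side
    print(row.name, row.total)
```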
d. Amazon EMR
Amazon EMR is a widely used big data platform from Amazon Web Services
(AWS). It offers a scalable and cost-effective solution for processing and analyzing
large datasets using popular open-source frameworks such as Apache Hadoop,
Apache Spark, and Apache Hive. EMR allows users to quickly provision and
manage clusters of virtual servers, known as instances, to process data in parallel.
EMR integrates seamlessly with other AWS services, such as Amazon S3 for data
storage and Amazon Redshift for data warehousing, enabling a comprehensive big
data ecosystem. Additionally, EMR supports various data processing frameworks
and tools, making it suitable for a wide range of use cases, including data
transformation, machine learning, log analysis, and real-time analytics. The
prominent companies that use Amazon EMR are:
Expedia
Lyft
Pfizer
e. Microsoft Azure HDInsight
Microsoft Azure HDInsight is a leading big data platform offered by Microsoft
Azure. It provides a fully managed cloud service for processing and analyzing
large datasets using popular open-source frameworks such as Apache Hadoop,
Apache Spark, Apache Hive, and Apache HBase. HDInsight offers a scalable and
reliable infrastructure that allows users to easily deploy and manage clusters.
HDInsight integrates seamlessly with other Azure services, such as Azure Data
Lake Storage and Azure Synapse Analytics, offering a comprehensive ecosystem
of Microsoft Azure services. HDInsight supports various programming languages,
including Java, Python, and R, making it accessible to a wide range of users. The
prominent companies that use Microsoft Azure HDInsight are:
Starbucks
Boeing
T-Mobile
f. Cloudera
Cloudera is a leading big data platform that offers a comprehensive suite of tools
and services designed to help organizations effectively manage and analyze large
volumes of data. Cloudera's platform is built on Apache Hadoop, an open-source
framework for distributed storage and processing of big data. Cloudera is a hybrid
data platform deployed across on-premise, cloud, and edge environments.
Cloudera offers a unified platform that integrates various components such as
Hadoop Distributed File System (HDFS), Apache Spark, and Apache Hive,
enabling users to perform various data processing and analytics tasks. Cloudera
also provides machine learning and advanced analytics tools, allowing businesses
to gain deeper insights from their data. The prominent companies that use Cloudera
are:
Dell
Nissan Motor
Comcast
g. IBM InfoSphere BigInsights
IBM InfoSphere BigInsights is a powerful big data platform that offers a range of
tools to manage and analyze large volumes of structured as well as unstructured
data in a reliable manner. IBM InfoSphere BigInsights can handle massive data,
making it suitable for enterprises dealing with complex datasets. It provides a
comprehensive set of features for data management, data warehousing, data
analytics, machine learning, and more.
IBM InfoSphere BigInsights provides a user-friendly interface and intuitive data
exploration and visualization tools. The platform also offers robust security and
governance features, ensuring data privacy and compliance with regulatory
requirements. BigInsights is built on top of Apache Hadoop and Apache Spark,
and it integrates with other IBM products and services, such as IBM DB2, IBM
SPSS Modeler, and IBM Watson Analytics. This integration makes it a good
choice for businesses already using the IBM product/services ecosystem. The
prominent companies that use IBM InfoSphere BigInsights are:
Lenovo
DBS Bank
General Motors
h. Databricks
Databricks is a prominent big data platform built on Apache Spark. Databricks
simplifies the process of building and deploying big data applications by providing
a scalable and fully managed infrastructure. It allows users to process large
datasets in real-time, perform complex analytics, and build machine learning
models using Spark's powerful capabilities.
Databricks provides an interactive workspace where users can write code, visualize
data, and collaborate on projects. It also integrates with popular data sources and
tools, making it easy to ingest and process data from various sources. With its
auto-scaling capabilities, Databricks ensures that users have the resources to handle
their workloads efficiently. Its automated infrastructure management and scaling
capabilities make it a reliable choice for handling large datasets and complex
workloads. The prominent companies that use Databricks are:
Nvidia Corporation
Johnson & Johnson
Salesforce
Factors for Big Data Platform Consideration:
• Scalability – The platform should handle increasing data volume, velocity, and
variety without performance issues. It must support horizontal scaling to
accommodate growing workloads.
• Performance – High data processing speed, efficient scaling, and fault tolerance are
crucial. The platform should support parallel processing and distributed computing
for optimized performance.
• Security & Compliance – Must include data encryption, access controls, and
authentication mechanisms to protect sensitive data. Compliance with regulations
like GDPR and HIPAA is essential.
• Ease of Use – A user-friendly interface with a minimal learning curve improves
adoption and productivity. The platform should offer comprehensive documentation,
resources, and tutorials.
• Integration Capabilities – The platform should integrate with databases, cloud
services, APIs, and programming languages to ensure seamless data processing
without complex migrations.
Big Data is everywhere: in social media, e-commerce, banking, healthcare, IoT, and
AI. Today, we will explore why Big Data is growing so fast and what is fueling its
adoption.