Lecture 1
Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. It is data of such size and complexity that no traditional
data management tool can store or process it efficiently.
Traditional Data vs. Big Data:
• Volume: Traditional data ranges from gigabytes to terabytes; big data ranges from
petabytes to zettabytes or exabytes.
• Data types: Traditional database systems deal with structured data; big data systems
deal with structured, semi-structured, and unstructured data.
• Generation rate: Traditional data is generated per hour or per day; big data is
generated far more frequently, often every second.
• Size: Traditional data is comparatively small; big data is far larger.
• Tooling: Traditional database tools suffice for traditional data operations; big data
requires special-purpose tools for schema-based operations.
• Functions: Normal functions can manipulate traditional data; big data requires
special-purpose functions.
• Data model: Traditional data follows a strict, static schema; big data follows a flat,
dynamic schema.
• Relationships: Traditional data is stable with known interrelationships; big data is
volatile with unknown relationships.
• Sources: Traditional data comes from ERP transactions, financial data, organizational
data, web transactions, etc.; big data comes from social media, device and sensor data,
video, images, audio, etc.
Structured Data
• Key Characteristics:
Organized & Standardized – Data is stored in fixed formats (rows and
columns).
Easily Searchable – Can be quickly queried using SQL.
Highly Scalable – Can be managed efficiently in large databases.
Fixed Schema – The structure of the data (like column names and data
types) is predefined.
• Examples
🔹 Student Records – A database storing student details with fields like Roll
No, Name, Course, and Marks.
🔹 Bank Transactions – A banking system storing transaction details
(Account No, Transaction Amount, Date).
🔹 E-commerce Data – Online retailers storing product details (Product ID,
Name, Price, Stock).
🔹 Employee Records – HR databases containing structured employee
information.
Unstructured Data
• Key Characteristics:
Unorganized & Raw – Data exists in its natural form without any
predefined schema.
Difficult to Search & Analyze – Traditional SQL queries cannot retrieve
information easily.
Massive in Volume – Unstructured data forms over 80% of the world’s
digital data.
Requires AI & ML for Analysis – Deep learning models help in
recognizing patterns in images, videos, and text.
• Examples
🔹 Social Media Posts – Text, images, and videos posted on Facebook, Twitter,
Instagram.
🔹 Emails & Messages – Emails have structured headers (To, From) but an unstructured
body.
🔹 Videos & Images – Surveillance footage, YouTube videos, and digital images.
🔹 IoT Sensor Data – Unstructured logs generated by smart devices like temperature
sensors.
Real-World Use Cases:
Netflix & YouTube: Analyze unstructured data (watch history) to recommend content.
Facebook & Twitter: Process billions of posts for sentiment analysis & trends.
Healthcare: AI scans X-ray images to detect diseases.
INDEXING: Data is indexed to enable faster search and retrieval. Based on some value
in the data, an index is defined as an identifier that represents a larger record in the
data set. Indexing unstructured data is difficult: text can be indexed on a text string,
but for non-text files such as audio or video, indexing typically depends on file names.
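As a minimal illustration of the idea, the Python sketch below builds an in-memory index over a hypothetical record set, so a lookup on the indexed field avoids scanning every record (the field names are invented for illustration):

```python
# Minimal sketch of indexing: map an indexed field's value to the record's
# position so lookups avoid a full scan. Records are hypothetical.
records = [
    {"roll_no": 101, "name": "Asha"},
    {"roll_no": 102, "name": "Ravi"},
    {"roll_no": 103, "name": "Meena"},
]

# Build the index once: field value -> position in the data set
index = {rec["roll_no"]: i for i, rec in enumerate(records)}

# O(1) lookup by roll number instead of scanning every record
print(records[index[102]])  # {'roll_no': 102, 'name': 'Ravi'}
```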
CAS (Content Addressable Storage): It assigns a unique, content-derived identifier to
every object stored in it. An object is retrieved based on its content and not its
location. It is commonly used to store fixed content such as emails.
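A minimal sketch of the content-addressing idea, assuming an in-memory store and SHA-256 as the content-derived identifier (real systems are distributed, but the addressing principle is the same):

```python
# Content-addressable storage in miniature: the object's address is derived
# from its content (a SHA-256 digest here), not from its storage location.
import hashlib

store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()  # content-derived identifier
    store[key] = data
    return key

def get(key: str) -> bytes:
    return store[key]

addr = put(b"archived email body")
assert get(addr) == b"archived email body"
print(addr)  # identical content always yields the same address
```

Because identical content always maps to the same address, duplicate objects are stored only once, which is one reason CAS suits archival workloads like email.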
Semi-Structured Data
• Key Characteristics
✅ Partially Organized – Some structure is present, but not as rigid as relational databases.
✅ Uses Tags & Metadata – Data often has labels, attributes, or key-value pairs.
✅ Flexible Schema – Unlike structured data, the format can change dynamically.
✅ Common in Web & API Systems – Used in internet applications and data exchanges.
Examples
🔹 JSON & XML Files – Used for data exchange in web applications (APIs).
🔹 HTML & Web Pages – Contains structured elements (headings, paragraphs) but variable
content.
🔹 Log Files & System Records – Semi-structured data generated by software applications.
🔹 Emails (Partially Structured) – Headers are structured, but the email body is
unstructured.
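The flexible schema is easiest to see in code. In the hedged sketch below, two JSON records in the same feed carry different fields, so the consumer reads optional fields defensively (the feed and field names are invented for illustration):

```python
# Semi-structured data: each JSON record carries its own tags/fields,
# so the schema can vary from record to record. Feed is hypothetical.
import json

feed = """[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["new", "mobile"]}
]"""

for record in json.loads(feed):
    # Fields differ per record, so read optional ones defensively.
    print(record["id"], record.get("email", "-"), record.get("tags", []))
```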
Real-World Use Cases:
• Email
• XML
• TCP/IP Packets
• Zipped File
• Binary Executables
• Mark-Up Languages
• Integration of data from heterogeneous sources
Web & API Processing: REST APIs, XML parsers, JSON databases
Key Takeaway: Even in its infancy, data collection and analysis were critical
for governance and decision-making.
The 20th century saw the advent of computers and the first steps toward
modern data management.
Key Takeaway: The 20th century transformed data from a static resource
into a dynamic, digital asset.
Key Takeaway: The early 21st century marked the tipping point where data
became too big, too fast, and too diverse for traditional tools.
The 2010s saw the development of tools and frameworks designed to handle
Big Data:
Machine Learning and AI: Big Data became the fuel for training
advanced algorithms, enabling breakthroughs in fields like natural
language processing and computer vision.
Key Takeaway: The 2010s were about building the infrastructure and tools to
harness the power of Big Data.
5. Big Data Today and the Future (2020s and Beyond)
Smart Cities: Urban centers use Big Data to optimize traffic, reduce
energy consumption, and improve public safety.
b. Data storage
Once the data is collected, it must be stored for efficient retrieval and processing.
Big data platforms typically utilize distributed storage systems that can handle
large volumes of data. These systems include the Hadoop Distributed File System
(HDFS), Google Cloud Storage, and Amazon S3. This distributed storage
architecture ensures high availability, fault tolerance, and scalability.
Technologies Used:
• Hadoop Distributed File System (HDFS) – Stores large datasets across
multiple servers.
• Amazon S3 / Google Cloud Storage – Cloud-based data storage.
• NoSQL Databases (MongoDB, Apache Cassandra, HBase) – Stores
semi-structured and unstructured data.
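As a small illustration of the cloud-storage option, the sketch below uploads a raw log file to Amazon S3 with the boto3 client; it assumes boto3 is installed and AWS credentials are configured, and the bucket, file name, and key are hypothetical:

```python
# Hedged sketch: landing a raw log file in cloud object storage (Amazon S3
# via boto3). Credentials are read from the environment/AWS config.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="events-2024-01-01.log",       # local file (hypothetical)
    Bucket="my-bigdata-landing-zone",       # hypothetical bucket
    Key="raw/logs/events-2024-01-01.log",   # object key in the bucket
)
```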
c. Data processing
Once the data is stored, it must be processed to extract valuable insights. This layer
is responsible for processing large-scale data using parallel computing.
The process involves various operations such as cleaning, transforming, and
aggregating the data. The parallel processing capabilities of big data platforms,
such as Apache Hadoop and Apache Spark, enable rapid computations and
complex data transformations.
Technologies Used:
• Apache Hadoop (MapReduce) – Processes massive datasets across
distributed nodes.
• Apache Spark – Real-time, in-memory Big Data processing.
• Flink / Storm / Kafka – Handles real-time streaming data (IoT, Stock
Market).
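To make the clean/transform/aggregate pipeline concrete, here is a minimal PySpark sketch; the file path and column names are hypothetical, and Spark distributes each step across the cluster's nodes:

```python
# Hedged PySpark sketch: clean, transform, and aggregate a dataset in
# parallel. Path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

daily = (
    df.dropna(subset=["amount"])                             # cleaning
      .withColumn("amount", F.col("amount").cast("double"))  # transforming
      .groupBy("date")
      .agg(F.sum("amount").alias("total_sales"))             # aggregating
)
daily.show()
spark.stop()
```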
d. Data analysis
Data analysis involves examining and interpreting large volumes of data to extract
meaningful insights and patterns. This layer provides data querying,
transformation, and analysis. The analysis process includes using machine
learning algorithms, data mining techniques, or visualization tools to better
understand the information. The analysis results can then be used to make data-
driven decisions, optimize processes, identify opportunities, or solve complex
problems.
Technologies Used:
• SQL-based tools (Hive, Presto, Google BigQuery) – Query large datasets
using SQL.
• AI & Machine Learning (TensorFlow, Scikit-Learn, H2O.ai) – Applies
ML on Big Data for predictions & automation.
• BI Tools (Tableau, Power BI, Looker) – Visualizes insights from Big
Data.
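As one hedged example of the machine-learning path, the sketch below trains a classifier on an analysis-ready feature table with scikit-learn; the file name and columns (including the "churned" label) are invented for illustration:

```python
# Hedged sketch: applying ML to an analysis-ready extract with scikit-learn.
# File name and columns ("churned" label) are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_features.csv")
X, y = df.drop(columns=["churned"]), df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```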
1. Apache Hadoop
Overview
Apache Hadoop is an open-source framework that provides a reliable, scalable, and
distributed computing environment. It is designed to store and process large data
sets across clusters of computers using simple programming models.
Use Cases
Data warehousing: Storing and analyzing large volumes of structured and
unstructured data.
Log and event data processing: Analyzing massive log files and event data in
real-time or batch mode.
Fraud detection: Identifying fraudulent activities by analyzing vast amounts of
transaction data.
Pros and Cons
Pros: Cost-effective, highly scalable, fault-tolerant.
Cons: Complex setup and management, slower processing speed compared to
newer technologies like Spark.
2. Apache Spark
Overview
Apache Spark is a unified analytics engine for big data processing with built-in
modules for streaming, SQL, machine learning, and graph processing. It is
designed to be fast and general-purpose, capable of processing large datasets
quickly by leveraging in-memory processing.
Use Cases
• Real-time data analytics: Analyzing streaming data in real-time for
immediate insights.
• Machine learning: Training models on large datasets using MLlib, Spark’s
machine learning library.
• Interactive data exploration: Allows data scientists to explore and
manipulate large datasets interactively.
Pros and Cons
• Pros: Fast processing, versatile, extensive community support.
• Cons: High memory usage; can be complex to deploy and manage for large-
scale applications.
3. Google BigQuery
• Overview
• Google BigQuery is a fully managed, serverless data warehouse allowing
super-fast SQL queries using Google’s infrastructure. It is designed to
analyze large datasets efficiently. Data strategy consulting can help
organizations integrate BigQuery into their existing data strategy, optimizing
its use for specific business intelligence needs and cost management.
• Use Cases
• Data warehousing: Storing and querying massive datasets without needing
to manage infrastructure.
• Business intelligence: Using SQL queries to generate reports and
dashboards for decision-making.
• Log analysis: Analyzing logs and other event data for operational insights.
• Pros and Cons
• Pros: Fully managed, fast, integrates well with Google’s ecosystem.
• Cons: Costs can be high for large-scale queries, and data transfer fees may
apply.
4. IBM Big Data Platform
• Overview
• IBM’s Big Data Platform offers tools for managing and analyzing big data,
including IBM Watson Studio, IBM Db2 Big Data, and IBM Cloud Pak for
Data. It is known for its comprehensive capabilities and focus on enterprise-
grade solutions.
• Use Cases
• Predictive analytics: Using IBM Watson to predict trends and outcomes
based on big data.
• Data management: Managing large volumes of structured and unstructured
data.
• AI and machine learning: Developing AI-driven applications using
Watson’s machine learning capabilities.
• Pros and Cons
• Pros: Strong enterprise focus, advanced analytics and AI capabilities, robust
security.
• Cons: High cost; may have a steep learning curve for some users.
Detailed Notes :
a. Apache Hadoop
Apache Hadoop is one of the industry's most widely used big data platforms. It is
an open-source framework that enables distributed processing for massive datasets
throughout clusters. Hadoop provides a scalable and cost-effective solution for
storing, processing, and analyzing massive amounts of structured and unstructured
data.
One of the key features of Hadoop is its distributed file system, known as Hadoop
Distributed File System (HDFS). HDFS enables data to be stored across multiple
machines, providing fault tolerance and high availability. This feature allows
businesses to store and process data at a previously unattainable scale. Hadoop also
includes a powerful processing engine called MapReduce, which allows for
parallel data processing across the cluster. The prominent companies that use
Apache Hadoop are:
Yahoo
Facebook
Twitter
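To make the MapReduce model concrete, here is a word-count sketch in plain Python; the real framework runs the map and reduce phases in parallel across HDFS blocks, but the two-phase logic is the same:

```python
# MapReduce in miniature: map emits (key, value) pairs, reduce folds the
# values per key. Hadoop runs these phases in parallel across the cluster.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]  # map: (word, 1) per word

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:                        # reduce: sum per key
        counts[word] += n
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```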
b. Apache Spark
Apache Spark is a unified analytics engine for batch processing, streaming data,
machine learning, and graph processing. It is one of the most popular big data
platforms used by companies. One of the key benefits that Apache Spark offers is
speed. It is designed to perform data processing tasks in-memory and achieve
significantly faster processing times than traditional disk-based systems.
Spark also supports various programming languages, including Java,
Scala, Python, and R, making it accessible to a wide range of developers. Spark
offers a rich set of libraries and tools, such as Spark SQL for querying structured
data, MLlib for machine learning, and GraphX for graph processing. Spark
integrates well with other big data technologies, such as Hadoop, allowing
companies to leverage their existing infrastructure. The prominent companies that
use Apache Spark include:
Netflix
Uber
Airbnb
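As a hedged sketch of the MLlib path mentioned above, the snippet below assembles feature columns and fits a logistic-regression model on a Spark DataFrame; the data path, feature columns, and "label" column are hypothetical:

```python
# Hedged MLlib sketch: assemble features and train a classifier on a
# Spark DataFrame. Path, feature columns, and "label" are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.read.parquet("hdfs:///data/transactions.parquet")

assembler = VectorAssembler(
    inputCols=["amount", "hour", "merchant_risk"], outputCol="features"
)
train = assembler.transform(df).select("features", "label")

model = LogisticRegression().fit(train)
print(model.coefficients)
spark.stop()
```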
c. Google Cloud BigQuery
Google Cloud BigQuery is a top-rated big data platform that provides a fully
managed and serverless data warehouse solution. It offers a robust and scalable
infrastructure for storing, querying, and analyzing massive datasets. BigQuery is
designed to handle petabytes of data and allows users to run SQL queries on large
datasets with impressive speed and efficiency.
BigQuery supports multiple data formats and integrates seamlessly with other
Google Cloud services, such as Google Cloud Storage and Google Data Studio.
BigQuery's unique architecture enables automatic scaling, ensuring users can
process data quickly without worrying about infrastructure management. BigQuery
offers a standard SQL interface for querying data, built-in machine learning
algorithms for predictive analytics, and geospatial analysis capabilities. The
prominent companies that use Google Cloud BigQuery are:
Spotify
Walmart
The New York Times
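A minimal sketch of querying BigQuery from Python with the google-cloud-bigquery client, assuming the library is installed and GCP credentials are configured; the table queried here is one of Google's public datasets:

```python
# Hedged sketch: running standard SQL on BigQuery from Python. Assumes
# google-cloud-bigquery is installed and application credentials are set.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():  # runs serverlessly on Google's side
    print(row.name, row.total)
```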
d. Amazon EMR
Amazon EMR is a widely used big data platform from Amazon Web Services
(AWS). It offers a scalable and cost-effective solution for processing and analyzing
large datasets using popular open-source frameworks such as Apache Hadoop,
Apache Spark, and Apache Hive. EMR allows users to quickly provision and
manage clusters of virtual servers, known as instances, to process data in parallel.
EMR integrates seamlessly with other AWS services, such as Amazon S3 for data
storage and Amazon Redshift for data warehousing, enabling a comprehensive big
data ecosystem. Additionally, EMR supports various data processing frameworks
and tools, making it suitable for a wide range of use cases, including data
transformation, machine learning, log analysis, and real-time analytics. The
prominent companies that use Amazon EMR are:
Expedia
Lyft
Pfizer
e. Microsoft Azure HDInsight
Microsoft Azure HDInsight is a leading big data platform offered by Microsoft
Azure. It provides a fully managed cloud service for processing and analyzing
large datasets using popular open-source frameworks such as Apache Hadoop,
Apache Spark, Apache Hive, and Apache HBase. HDInsight offers a scalable and
reliable infrastructure that allows users to easily deploy and manage clusters.
HDInsight integrates seamlessly with other Azure services, such as Azure Data
Lake Storage and Azure Synapse Analytics, offering a comprehensive ecosystem
of Microsoft Azure services. HDInsight supports various programming languages,
including Java, Python, and R, making it accessible to a wide range of users. The
prominent companies that use Microsoft Azure HDInsight are:
Starbucks
Boeing
T-Mobile
f. Cloudera
Cloudera is a leading big data platform that offers a comprehensive suite of tools
and services designed to help organizations effectively manage and analyze large
volumes of data. Cloudera's platform is built on Apache Hadoop, an open-source
framework for distributed storage and processing of big data. Cloudera is a hybrid
data platform deployed across on-premise, cloud, and edge environments.
Cloudera offers a unified platform that integrates various components such as
Hadoop Distributed File System (HDFS), Apache Spark, and Apache Hive,
enabling users to perform various data processing and analytics tasks. Cloudera
also provides machine learning and advanced analytics tools, allowing businesses
to gain deeper insights from their data. The prominent companies that use Cloudera
are:
Dell
Nissan Motor
Comcast
g. IBM InfoSphere BigInsights
IBM InfoSphere BigInsights is a powerful big data platform that offers a range of
tools to manage and analyze large volumes of structured as well as unstructured
data in a reliable manner. IBM InfoSphere BigInsights can handle massive data,
making it suitable for enterprises dealing with complex datasets. It provides a
comprehensive set of features for data management, data warehousing, data
analytics, machine learning, and more.
IBM InfoSphere BigInsights provides a user-friendly interface and intuitive data
exploration and visualization tools. The platform also offers robust security and
governance features, ensuring data privacy and compliance with regulatory
requirements. BigInsights is built on top of Apache Hadoop and Apache Spark,
and it integrates with other IBM products and services, such as IBM DB2, IBM
SPSS Modeler, and IBM Watson Analytics. This integration makes it a good
choice for businesses already using the IBM product/services ecosystem. The
prominent companies that use IBM InfoSphere BigInsights are:
Lenovo
DBS Bank
General Motors
h. Databricks
Databricks is a prominent big data platform built on Apache Spark. Databricks
simplifies the process of building and deploying big data applications by providing
a scalable and fully managed infrastructure. It allows users to process large
datasets in real-time, perform complex analytics, and build machine learning
models using Spark's powerful capabilities.
Databricks provides an interactive workspace where users can write code, visualize
data, and collaborate on projects. It also integrates with popular data sources and
tools, making it easy to ingest and process data from various sources. With its
auto-scaling capabilities, Databricks ensures that users have the resources to handle
their workloads efficiently. Its automated infrastructure management and scaling
capabilities make it a reliable choice for handling large datasets and complex
workloads. The prominent companies that use Databricks are:
Nvidia Corporation
Johnson & Johnson
Salesforce
Factors for Big Data Platform Consideration:
• Scalability – The platform should handle increasing data volume, velocity, and
variety without performance issues. It must support horizontal scaling to
accommodate growing workloads.
• Performance – High data processing speed, efficient scaling, and fault tolerance are
crucial. The platform should support parallel processing and distributed computing
for optimized performance.
• Security & Compliance – Must include data encryption, access controls, and
authentication mechanisms to protect sensitive data. Compliance with regulations
like GDPR and HIPAA is essential.
• Ease of Use – A user-friendly interface with a minimal learning curve improves
adoption and productivity. The platform should offer comprehensive documentation,
resources, and tutorials.
• Integration Capabilities – The platform should integrate with databases, cloud
services, APIs, and programming languages to ensure seamless data processing
without complex migrations.
Big Data is everywhere: in social media, e-commerce, banking, healthcare, IoT, and
AI. Today, we will explore why Big Data is growing so fast and what is fueling its
adoption.