BIGDATAUNIT1 AKTUpdf
BIG DATA
Study4sub
SYLLABUS:
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to
Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big
Data, Big Data technology components, Big Data importance and applications, Big Data
features – security, compliance, auditing and protection, Big Data privacy and ethics, Big Data
Analytics, Challenges of conventional systems, intelligent data analysis, nature of data, analytic
processes and tools, analysis vs reporting, modern data analytic tools.
What is Big Data?
Big Data refers to extremely large and complex data sets that are difficult to store, manage,
and process using traditional data processing tools and methods.
In Simple Words:
• Big Data means “a huge amount of data”: so large and arriving so fast that traditional software (like
Excel or simple databases) can't handle it.
Key Points:
• It includes structured, semi-structured, and unstructured data.
1960s–70s: Early development of databases and data storage systems like IBM’s IMS. Data stored on magnetic tapes.
2020 onwards: AI, ML, and IoT generate and consume massive real-time data. Focus on data privacy, ethics, and advanced analytics tools.
Key Drivers of Big Data
• The term “drivers” refers to the factors or reasons behind the growth and importance
of Big Data. These are the main forces that have pushed Big Data to become essential
in today’s world.
1. Rapid Growth of Internet and Social Media
• Billions of users are active on platforms like Facebook, YouTube, Instagram, and
Twitter.
• Every second, people are uploading photos, videos, comments, likes, etc.
• This creates huge volumes of data every day.
Example: YouTube gets more than 500 hours of video uploads per minute.
2. Increased Use of Smartphones and IoT Devices
• Every smartphone, smartwatch, and smart device (like Alexa, fitness bands, smart
TVs) collects and sends data.
• These devices generate real-time data from various sensors.
• Example: A smart home system collects data about temperature, lights, and
energy usage.
3. Cheap Storage and Cloud Computing
• Earlier, storing large amounts of data was expensive.
• Now, cloud services like AWS, Google Cloud, and Microsoft Azure offer cheap and scalable storage.
• This allows companies to collect and store massive amounts of data easily.
• Point to remember: Cloud storage is flexible, cost-effective, and accessible from anywhere.
4. Advancements in Data Processing Technologies
• Tools like Hadoop, Spark, and NoSQL databases allow fast, distributed processing of large datasets.
• These technologies help in handling structured and unstructured data with ease.
• Example: Hadoop splits a large dataset into smaller blocks and processes them in parallel across a cluster of machines.
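The split-and-process-in-parallel idea can be sketched in a few lines of Python. This is only a simplified illustration in the style of MapReduce, not Hadoop's actual API: the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`) and the sample sentences are assumptions made for this example, and threads on one machine stand in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Split the input into smaller parts, much as a large file is split into blocks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count words inside one chunk, independently of all other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counts into one overall result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    # Threads stand in for cluster nodes; a real cluster runs chunks on separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(map_chunk, chunks))
    return reduce_counts(partial)


if __name__ == "__main__":
    sample = [
        "big data is big",
        "data is everywhere",
        "big data needs parallel processing",
    ]
    print(word_count(sample).most_common(3))
```

Because each chunk is counted independently, adding more workers (or machines) lets the same logic handle more data, which is the main point of the Hadoop example.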
5. Need for Real-Time Decision Making
• Companies need to make quick decisions to stay competitive.
• Big Data helps in analyzing trends, customer behavior, and business performance in real time.
• Example: E-commerce sites suggest products based on what users just searched.
6. Growth of AI and Machine Learning
• AI and ML need huge amounts of data to learn and make accurate predictions.
• Big Data provides the fuel for these smart systems.
• Example: Netflix uses ML and Big Data to recommend shows based on your watch history.
1. Introduction to Big Data Platform
Definition:
A Big Data Platform is an integrated system that combines various tools and
technologies to manage, store, and analyze massive volumes of data efficiently.
It provides the infrastructure and environment required for:
• Ingesting data (bringing data from sources)
• Storing data (on distributed systems)
• Processing data (in batch or real-time)
• Analyzing and visualizing data
Benefits of Big Data Platforms:
• Scalability: Easily handles growing data volumes
• Flexibility: Supports all types of data (structured, semi-structured, unstructured)
• Real-Time Processing: Immediate insights and decisions
• Cost-Effective: Cloud and open-source tools reduce expenses
Main Components of a Big Data Platform:
1.Data Ingestion Tools
1. Used to collect and import data from different sources
2. Examples: Apache Kafka, Apache Flume, Sqoop
2.Data Storage Systems
1. Store large datasets reliably
2. Examples: HDFS (Hadoop Distributed File System), NoSQL (MongoDB, Cassandra)
3.Processing Engines
1. Perform computations and analytics on data
2. Examples: Hadoop MapReduce (batch), Apache Spark (in-memory batch and near-real-time stream processing)
4.Data Management
1. Tools to organize, clean, and maintain data quality
2. Examples: Hive (SQL-like querying and table definitions on Hadoop), HBase (NoSQL store built on top of HDFS)
5.Analytics & Visualization Tools
1. Help in generating reports and dashboards
2. Examples: Tableau, Power BI, Apache Pig, R, Python
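The difference between batch and real-time processing, mentioned under Processing Engines, can be illustrated with a small Python sketch. It does not use MapReduce or Spark. The names `batch_average` and `RunningAverage` and the temperature readings are hypothetical, chosen only to show the two styles side by side.

```python
def batch_average(readings):
    """Batch style: wait until the whole dataset is available, then compute once."""
    return sum(readings) / len(readings)


class RunningAverage:
    """Streaming style: update the result as each new value arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Only a count and a running total are kept, not the full history.
        self.count += 1
        self.total += value
        return self.total / self.count


if __name__ == "__main__":
    temperatures = [21.0, 22.5, 23.0, 24.5]  # hypothetical sensor readings

    print("batch result:", batch_average(temperatures))

    stream = RunningAverage()
    for t in temperatures:
        print("after reading", t, "average so far:", stream.update(t))
```

The batch function gives one answer at the end, while the streaming class gives a usable answer after every reading, which is what immediate insights and decisions rely on.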
Examples of Big Data Platforms:
• Apache Hadoop Ecosystem
• Apache Spark Framework
• Google Cloud BigQuery
• Amazon EMR (Elastic MapReduce)
• Microsoft Azure HDInsight
Big Data Architecture and Characteristics
• These systems can't process data fast enough for real-time decisions.
• Problem: Businesses miss out on opportunities that require immediate action.
7. Limited Flexibility and Integration
• Traditional systems don’t integrate well with modern technologies like cloud or
machine learning.
• Problem: It's hard to use new tools alongside old systems.
8. Data Quality Issues
• Conventional systems struggle to ensure clean and consistent data.
• Problem: Data errors or inconsistencies can affect decision-making.
Intelligent Data Analysis
Intelligent Data Analysis refers to using advanced techniques, algorithms, and tools to analyze large
datasets and extract meaningful patterns, insights, and predictions. It involves the application of
artificial intelligence (AI), machine learning (ML), and statistical models to make smarter decisions
based on data.
Key Points:
1.AI & Machine Learning: These technologies help in learning from data and predicting future trends
or behaviors with minimal human intervention.
2.Pattern Recognition: Intelligent data analysis identifies hidden patterns in data that are not
immediately obvious.
3.Automation: It automates data analysis processes, making it faster and more efficient.
4.Predictive Analytics: It helps forecast future events, trends, or behaviors based on historical data.
5.Real-time Insights: Intelligent analysis can provide real-time insights, helping businesses to make
quicker, more informed decisions.
Example:
• In retail, intelligent data analysis can be used to predict which products will sell best in the future by
analyzing past sales data, customer preferences, and trends.
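The retail example can be sketched with a very simple predictive model: fit a straight trend line to past sales and extend it one period ahead. This is a minimal illustration, assuming sales follow a roughly linear trend. Real systems use richer models and many more inputs, and the function names `fit_line` and `forecast` and the sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(past_sales, periods_ahead=1):
    """Extend the fitted trend line beyond the last observed period."""
    xs = list(range(len(past_sales)))  # period index: 0, 1, 2, ...
    slope, intercept = fit_line(xs, past_sales)
    next_x = len(past_sales) - 1 + periods_ahead
    return slope * next_x + intercept


if __name__ == "__main__":
    monthly_units = [100, 110, 120, 130]  # hypothetical past sales
    print("forecast for next month:", forecast(monthly_units))
```

The model "learns" the slope and intercept from historical data and then uses them to predict a value it has not seen, which is the core idea behind predictive analytics.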
Nature of Data
Nature of Data refers to the different forms, types, and characteristics of data that affect how
it is stored, processed, and analyzed.
Types of Data:
1.Structured Data:
1. What it is: Data that is organized into tables, rows, and columns, typically in relational databases (e.g.,
customer records, sales data).
2. Example: A database of employee information where each row represents an employee with columns like
name, age, salary, etc.
2.Unstructured Data:
1. What it is: Data that doesn't have a predefined structure, making it difficult to analyze with traditional
methods (e.g., text, images, audio, video).
2. Example: Social media posts, customer reviews, or video files.
3.Semi-structured Data:
1. What it is: Data that doesn't have a rigid structure but contains tags or markers that make it easier to
analyze (e.g., XML, JSON).
2. Example: A log file that contains a mixture of structured data (timestamps) and unstructured data (event
descriptions).
4.Big Data:
1. What it is: Extremely large datasets that require advanced tools and techniques for storage, processing,
and analysis. Big Data is often characterized by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.
2. Example: Data from IoT sensors, social media platforms, and web logs.
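The difference between structured, semi-structured, and unstructured data becomes clearer when each is handled in code. The sketch below uses only Python's standard library, and the sample records (names, events, review text) are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, as in a relational table (hypothetical employee rows).
structured = "name,age,salary\nAsha,30,50000\nRavi,41,72000\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: JSON keys act as tags, but records need not share the same fields.
semi_structured = (
    '[{"user": "asha", "event": "login"},'
    ' {"user": "ravi", "event": "purchase", "amount": 250}]'
)
events = json.loads(semi_structured)

# Unstructured: free text with no predefined fields, so even a simple
# question ("how often is the battery mentioned?") needs text processing.
unstructured = "Great phone, but the battery drains fast. Battery life should be better."
battery_mentions = unstructured.lower().count("battery")

if __name__ == "__main__":
    print(rows[0]["name"], rows[0]["salary"])   # every row has the same columns
    print([sorted(e.keys()) for e in events])   # fields differ between records
    print(battery_mentions)
```

Structured rows can be read by column name directly, semi-structured records must be checked for which fields they carry, and unstructured text has to be searched or modeled before it yields anything.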
Characteristics of Data:
1.Volume: The amount of data being generated. It can be terabytes or even
petabytes.
2.Velocity: The speed at which data is generated and needs to be processed
(e.g., real-time data).
3.Variety: The different types of data (structured, unstructured, semi-structured).
4.Veracity: The quality and accuracy of the data.
5.Value: The usefulness of the data for decision-making or gaining insights.
Nature of Data in Big Data:
• Big Data contains data from multiple sources that vary in type, speed, and
structure. Processing this data requires advanced technologies like Hadoop,
Spark, and machine learning to handle its complexity.
Analytic Processes and Tools
Analytic Process
The analytic process in Big Data involves several steps to extract meaningful insights
from large datasets. These steps are essential for data analysis and decision-making.
1.Data Collection:
1. Collect data from various sources such as sensors, databases, social media, and logs.
2. Example: Collecting sales data from e-commerce websites.
2.Data Cleaning:
1. Remove errors, duplicates, and irrelevant
information to ensure high-quality data.
2. Example: Removing duplicate customer entries from a database.
3.Data Analysis:
1. Apply statistical methods, machine learning models, and algorithms to analyze the data and
uncover patterns.
2. Example: Analyzing customer behavior patterns using machine learning.
4.Interpretation of Results:
1. After analysis, interpret the results to make informed decisions.
2. Example: Predicting future sales trends based on past data.
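The four steps above can be sketched as a small Python pipeline. This is an illustration on a handful of hard-coded records, not a real Big Data job: the function names (`collect`, `clean`, `analyze`, `interpret`) and the order data are assumptions made for this example.

```python
from collections import defaultdict


def collect():
    """Step 1: gather raw records (hard-coded here; normally read from logs, databases, or APIs)."""
    return [
        {"order_id": 1, "product": "laptop", "amount": 900.0},
        {"order_id": 2, "product": "phone", "amount": 500.0},
        {"order_id": 2, "product": "phone", "amount": 500.0},  # duplicate entry
        {"order_id": 3, "product": "laptop", "amount": None},  # missing value
        {"order_id": 4, "product": "phone", "amount": 450.0},
    ]


def clean(records):
    """Step 2: drop records with missing amounts and duplicates (same order_id)."""
    seen = set()
    cleaned = []
    for record in records:
        if record["amount"] is None or record["order_id"] in seen:
            continue
        seen.add(record["order_id"])
        cleaned.append(record)
    return cleaned


def analyze(records):
    """Step 3: aggregate revenue per product."""
    revenue = defaultdict(float)
    for record in records:
        revenue[record["product"]] += record["amount"]
    return dict(revenue)


def interpret(revenue):
    """Step 4: turn the numbers into a statement a decision-maker can act on."""
    best = max(revenue, key=revenue.get)
    return f"{best} generates the most revenue ({revenue[best]:.2f})"


if __name__ == "__main__":
    print(interpret(analyze(clean(collect()))))
```

Note that skipping the cleaning step would count the duplicate order twice and fail on the missing amount, which is why cleaning comes before analysis.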
Tools: Excel, R and Python, Hadoop and Spark, Tableau/Power BI, SQL
Analysis vs Reporting
While both analysis and reporting involve working with data, they serve different
purposes.
Analysis:
• Goal: To explore data, find patterns, and make predictions.
• Process: Involves using statistical models, machine learning, and algorithms.
• Outcome: Provides insights that can guide strategic decision-making.
• Example: Using customer data to predict future purchase behavior.
Reporting:
• Goal: To present data in a simple, understandable format.
• Process: Involves summarizing data in charts, graphs, and tables.
• Outcome: Provides an overview of performance or trends, typically for monitoring
purposes.
• Example: A monthly sales report showing the total revenue, top-selling products, and
key metrics.
Key Differences:
• Analysis is more about understanding and extracting insights from data, while reporting is about summarizing and
presenting data for easy consumption.
• Analysis typically involves advanced methods, while reporting is more about presenting results in an understandable way.
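The contrast can be seen by running both on the same data. In the sketch below, `report` only summarizes what happened, while `pearson` asks an analytical question: how strongly are advertising spend and revenue related? The monthly figures and both function names are hypothetical, and a correlation is only one simple example of an analysis technique.

```python
from math import sqrt

# Hypothetical monthly figures used only for illustration.
monthly = [
    {"month": "Jan", "revenue": 1000.0, "ad_spend": 100.0},
    {"month": "Feb", "revenue": 1200.0, "ad_spend": 150.0},
    {"month": "Mar", "revenue": 1100.0, "ad_spend": 120.0},
    {"month": "Apr", "revenue": 1500.0, "ad_spend": 200.0},
]


def report(rows):
    """Reporting: summarize the data in a simple, readable form."""
    total = sum(r["revenue"] for r in rows)
    best = max(rows, key=lambda r: r["revenue"])
    return {"total_revenue": total, "best_month": best["month"]}


def pearson(xs, ys):
    """Analysis: Pearson correlation, from -1 (opposite) to 1 (move together)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    print("Report:", report(monthly))
    r = pearson([m["ad_spend"] for m in monthly], [m["revenue"] for m in monthly])
    print("Analysis: correlation between ad spend and revenue =", round(r, 2))
```

The report answers "what happened?", while the correlation starts to answer "why, and what might happen if we change something?". A strong correlation alone does not prove that ad spend causes revenue.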
Modern Data Analytic Tools (Short Notes - AKTU Oriented)
1. Hadoop
1. Open-source framework for storing and processing large data sets in a distributed manner.
2. Handles structured and unstructured data.
2. Apache Spark
1. Fast in-memory data processing engine.
2. Suitable for batch jobs and near-real-time (streaming) analytics.
3. Power BI
1. Microsoft’s tool for creating interactive dashboards and reports.
2. Easy to use and integrates with various data sources.
4. Tableau
1. Data visualization tool.
2. Helps in making graphs, charts, and dashboards for better understanding.
5. Python & R
1. Programming languages for data analysis, visualization, and machine learning.
2. Python is widely used due to its simplicity and libraries like pandas and NumPy.
6. SQL
1. Language used to query and manage structured data in databases.
2. Essential for data extraction and manipulation.
7. Google Analytics
1. Used to track and report website traffic and user behavior.
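As a short illustration of the SQL entry above, the sketch below runs two queries from Python using the built-in `sqlite3` module and a temporary in-memory database. The `employees` table and its rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database that lives only in memory
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Sales", 50000), ("Ravi", "Sales", 60000), ("Meera", "IT", 80000)],
)

# Extraction: pick only the rows that match a condition.
high_earners = conn.execute(
    "SELECT name FROM employees WHERE salary > ? ORDER BY name", (55000,)
).fetchall()

# Manipulation: group rows and compute an aggregate per group.
avg_by_dept = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

conn.close()

if __name__ == "__main__":
    print(high_earners)
    print(avg_by_dept)
```

`WHERE` filters rows and `GROUP BY` with `AVG` summarizes them, covering the two uses named in the notes: data extraction and manipulation.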