
SHRI SAKTHIKAILASSH WOMEN’S COLLEGE

(AUTONOMOUS)
SALEM

DEPARTMENT OF COMPUTER APPLICATIONS

SUBJECT NAME: BIG DATA ANALYTICS

SUBJECT CODE: 23UCAE06

ACADEMIC YEAR
(2025-2026)
ODD SEMESTER
Paper Title: Big Data Analytics

Semester: V | Course Code: 23UCAE06 | Course Name: Big Data Analytics | Category: Elective | L: 4 | T: - | P: - | Credits: 3

Preamble: Big Data Analytics is the process of analyzing large and complex data sets to discover patterns, trends, and insights that support better decision-making.

Prerequisite: Basic knowledge of programming (preferably Python, Java, or R), understanding of databases and SQL, fundamentals of statistics and mathematics, and familiarity with data structures and algorithms.
Course Outcomes (COs): Work with big data tools and their analysis techniques.

CO Number | Course Outcome (CO) Statement | Bloom's Taxonomy Knowledge Level
CO1 | Understand the Big Data platform and its use cases, MapReduce jobs | K1
CO2 | Identify and understand the basics of clustering and decision trees | K2
CO3 | Study the Association Rules and Recommendation Systems | K2
CO4 | Learn about the concept of streams | K3
CO5 | Understand the concepts of NoSQL databases | K3

Mapping with Program Outcomes:


COs/POs PO1 PO2 PO3 PO4 PO5
CO1 1 3 2 2 3
CO2 3 2 3 2 3
CO3 1 3 2 2 2
CO4 3 3 3 1 3
CO5 3 2 3 3 3

Syllabus
Unit I (15 hours): Evolution of Big Data – Best Practices for Big Data Analytics – Big Data Characteristics – Validating the Promotion of the Value of Big Data – Big Data Use Cases – Characteristics of Big Data Applications – Perception and Quantification of Value – Understanding Big Data Storage – A General Overview of High-Performance Architecture – HDFS – MapReduce and YARN – MapReduce Programming Model.

Unit II (15 hours): Advanced Analytical Theory and Methods: Clustering: Overview of Clustering – K-Means – Use Cases – Overview of the Method – Determining the Number of Clusters – Diagnostics – Reasons to Choose and Cautions. Classification: Decision Trees – Overview of a Decision Tree – The General Algorithm – Decision Tree Algorithms – Evaluating a Decision Tree – Decision Trees in R. Naïve Bayes: Bayes' Theorem – Naïve Bayes Classifier.

Unit III (15 hours): Advanced Analytical Theory and Methods: Association Rules – Overview – Apriori Algorithm – Evaluation of Candidate Rules – Applications of Association Rules – Finding Association and Finding Similarity. Recommendation System: Collaborative Recommendation – Content-Based Recommendation – Knowledge-Based Recommendation – Hybrid Recommendation Approaches.

Unit IV (15 hours): Introduction to Streams Concepts: Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Window. Real-Time Analytics Platform (RTAP) Applications: Case Studies – Real-Time Sentiment Analysis – Stock Market Predictions. Using Graph Analytics for Big Data: Graph Analytics.

Unit V (15 hours): NoSQL Databases: Schema-less Models – Increasing Flexibility for Data Manipulation – Key-Value Stores – Document Stores – Tabular Stores – Object Data Stores – Graph Databases – Hive – Sharding – HBase. Applications of Big Data: Analyzing Big Data with Twitter – Big Data for E-Commerce – Big Data for Blogs. Tools and Methods: Review of Basic Data Analytic Methods using R.

Total: 75 hours

Text Books:
1. Svetlin Nakov, Veselin Kolev & Co., Fundamentals of Computer Programming with C#, Faber Publication, 2019.
2. Mathew MacDonald, The Complete Reference ASP.NET, Tata McGraw-Hill, 2015.

Reference Books:
1. Herbert Schildt, The Complete Reference C#.NET, Tata McGraw-Hill, 2017.
2. Kogent Learning Solutions, C# 2012 Programming Covers .NET 4.5 Black Book, Dreamtech Press, 2013.
3. Anne Boehm, Joel Murach, Murach's C# 2015, Mike Murach & Associates Inc., 2016.
4. Denielle Otey, Michael Otey, ADO.NET: The Complete Reference, McGraw-Hill, 2008.

Journals and Magazines:
MSDN Magazine – Published by Microsoft, covering ASP.NET, .NET Framework, C#, and related development tools.

E-Resources and Websites:
https://www.geeksforgeeks.org/introduction-to-net-framework/
https://www.javatpoint.com/net-framework

Learning Methods

Focus of the Course


Methods of Evaluation

Internal Evaluation: Continuous Internal Assessment (Test, Assignments, Seminar, Attendance and Class Participation) – 25 Marks
External Evaluation: End Semester Examinations – 75 Marks

Total: 100 Marks


MODEL QUESTION PAPER

Time: 3 Hours                                    Maximum: 75 Marks

PART-A (15*1 =15 Marks)

(Answer All Questions)

1 TO 3- Unit I

4 TO 6- Unit II

7 TO 9- Unit III

10 TO 12- Unit IV

13 TO 15- Unit V

PART-B (5*2 =10 Marks)

(Answer ALL Questions)

16- Unit I

17- Unit II

18- Unit III


19- Unit IV

20- Unit V

PART-C (10*5 =50 Marks)

(Answer ALL Questions)

21. (a) (Or) (b) - Unit I

22. (a) (Or) (b) - Unit II

23. (a) (Or) (b) - Unit III

24. (a) (Or) (b) - Unit IV

25. (a) (Or) (b) - Unit V


Big Data Analytics

Evolution of Big Data

What is Big Data?

Big Data refers to very large volumes of varied data. It is the concept of gathering useful insights from such voluminous amounts of structured, semi-structured and unstructured data, which can then be used for effective decision-making in the business environment. This data is collected from various sources over a period of time and is too cumbersome to manage with traditional database tools.

Key Stages in the Evolution of Big Data:


• Early Data Management (1960s-1980s):
The foundation was laid with the development of data centers and relational
databases.
• The Rise of the Internet (1990s):
The internet and online services fueled the rapid growth of data, leading to
challenges in managing large datasets.
• Web 2.0 and Social Media (2000s):
Platforms like Facebook and YouTube generated massive amounts of unstructured
data, necessitating new tools like Hadoop and NoSQL to handle it.
• Advanced Analytics and Cloud Computing (2010s):
Big data analytics and cloud computing enabled more scalable and cost-effective
data storage and processing, with tools like Apache Spark and machine learning
frameworks playing a crucial role.
• AI and IoT Integration (2020s and Beyond):
AI and machine learning are now integrated with big data to enhance analytics, while
the Internet of Things (IoT) continues to generate vast amounts of data, driving further
evolution in data processing and analysis.
Best Practices of Big Data Analytics

Best practices for big data analytics include understanding business requirements, prioritizing data quality and security, utilizing appropriate tools and technologies, and fostering collaboration and continuous improvement. Specifically, it is crucial to define clear business goals, ensure data integrity, employ scalable and distributed testing tools, and automate testing processes.

1. Define Clear Business Objectives:


• Understand Business Requirements:
Before diving into data, it's essential to understand the specific business goals and how
big data analytics can contribute to achieving them.
• Prioritize Business Value:
Focus on projects and initiatives that offer the greatest potential for revenue or
operational gains.

2. Data Quality and Integrity:


• Data Quality Management: Implement programs to ensure data accuracy,
completeness, and consistency.
• Data Profiling: Conduct thorough analysis of data sources to identify potential issues
and ensure data quality.
• Data Cleaning and Transformation: Address missing values, remove duplicates, and
normalize data to ensure reliable insights.
• Version Control and Documentation: Maintain a detailed history of data changes and
processes.

3. Security and Privacy:


• Data Security Measures: Implement strong encryption and access controls to
protect sensitive data.
• Data Breach Prevention: Regularly audit systems and train employees on data
security best practices.
• Ethical Data Practices: Adhere to ethical guidelines and privacy regulations when
collecting, using, and managing data.
4. Technology and Tools:
• Scalable and Distributed Tools:
Use tools capable of handling distributed environments and large data volumes.
• Cloud-Based Platforms:
Leverage cloud-based platforms for storage, processing, and analysis.
• Data Integration Solutions:
Employ middleware solutions or data lakes to integrate data from various sources.
• Data Visualization Tools:
Utilize tools to present complex data in a clear and understandable format.

5. Collaboration and Communication:


• Team Collaboration: Foster collaboration between data scientists, data engineers,
and business stakeholders.
• Data Literacy: Promote data literacy across the organization to enhance
understanding and collaboration.
• Continuous Feedback and Iteration: Embrace an iterative approach to data analysis
and development.

6. Testing and Validation:


• End-to-End Testing: Validate data across all stages of the big data pipeline.
• Performance Testing: Assess the performance of big data systems and ensure
scalability.
• Automated Testing: Automate testing where possible to improve efficiency and
reduce errors.
• Fault Tolerance and Failover: Test for fault tolerance and ensure systems can
recover from failures.

7. Continuous Monitoring and Optimization:


• Resource Optimization:
Monitor resource utilization and optimize systems for performance and efficiency.
• Performance Monitoring:
Continuously monitor system performance and identify areas for improvement.
• Agile Development:
Employ Agile methodologies to enable flexibility and continuous adaptation to
changing requirements.
By implementing these best practices, organizations can effectively harness the
power of big data analytics, extract valuable insights, and drive business
outcomes.

Big Data Characteristics

In recent years, Big Data was defined by the "3Vs", but there are now "6Vs" of Big Data, which are also termed the characteristics of Big Data, as follows:

1. Volume:

The name 'Big Data' itself refers to an enormous size; volume means a huge amount of data.

The size of data plays a very crucial role in determining its value. If the volume of data is very large, then it is actually considered 'Big Data'. This means that whether a particular data set can be considered Big Data or not depends on its volume.

Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.

Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. It was also projected that by the year 2020 there would be almost 40,000 exabytes of data.

2. Velocity:
• Velocity refers to the high speed of accumulation of data.
• In Big Data velocity data flows in from sources like machines, networks,
social media, mobile phones etc.
• There is a massive and continuous flow of data. This determines the potential of the data, i.e., how fast the data is generated and processed to meet the demands.
• Sampling data can help in dealing with issues like velocity.
• Example: More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users is increasing by about 22% year on year.

3. Variety:
• It refers to nature of data that is structured, semi-structured and
unstructured data.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured, semi-structured
and unstructured.
o Structured data: This is basically organized data. It generally refers to data that has a defined length and format.
o Semi-structured data: This is basically semi-organised data. It is generally a form of data that does not conform to the formal structure of data. Log files are examples of this type of data.
o Unstructured data: This data basically refers to
unorganized data. It generally refers to data that doesn’t
fit neatly into the traditional row and column structure of
the relational database. Texts, pictures, videos etc. are the
examples of unstructured data which can’t be stored in
the form of rows and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data; that is, the available data can sometimes get messy, and its quality and accuracy are difficult to control.
• Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
• Example: Data in bulk could create confusion, whereas a small amount of data could convey only half or incomplete information.
5. Value:
• After taking the above 4 Vs into account, there comes one more V, which stands for Value. Bulk data having no value is of no good to a company unless it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, Value can be considered the most important of all the 6 Vs.
6. Variability:
• How fast, and to what extent, is the structure of your data changing?
• How often does the meaning or shape of your data change?
• Example: it is as if you were eating the same ice cream daily and the taste just kept changing.

Validating the Promotion of the Value of Big Data

Uses of Big Data

Big Data enables you to gather information about customers and their experience, and eventually helps you to align your offerings properly.

It helps anticipate failures beforehand by analyzing the underlying problems and providing potential solutions.

Big Data is also useful for companies to anticipate customer demand, roll out new plans, test markets, etc.

Big Data is very useful in predicting failures by analyzing various indicators such as unstructured data, error messages, log entries, engine temperature, etc.

Big Data is also very efficient in maintaining operational functions and in anticipating future customer demands and current market demands, thus providing proper results.

Benefits of Big Data in Business

• Data quality has a direct impact on business process efficiency. In the purchase-to-pay process, poor-quality vendor data can cause missing purchase contracts or pricing information, which can lead to delays in procuring vital goods. Many companies use big data solutions or algorithms to do what they have already been doing, but without data loss; moreover, if we run an algorithm against the data set, the result might be a list of individuals who exhibit attributes of fraudulent behavior.
• In the order-to-cash process, incomplete or inaccurate credit limits or pricing information can lead to loss of customer service, reduced revenue, or increased service cost. With the help of big data technologies and the ability to run various algorithms more quickly, the data can be updated at regular intervals throughout the day.


Benefits of Big Data in IT Sectors

• Many established IT companies depend on big data to modernize their outdated mainframes and antiquated code bases by identifying the root causes of failures and issues in real time. Many organizations are replacing their traditional systems with open-source platforms like Hadoop.
• Most big data solutions are based on Hadoop, which allows designs to scale from a single machine to thousands of machines, each offering local computation and storage. Moreover, it is a free, open-source platform, minimizing the capital investment an organization needs to acquire new platforms.
• With the help of big data technologies, IT companies are able to process third-party data quickly, data that is otherwise often hard to digest at once, thanks to the inherently high horsepower and parallelized working of these platforms.

Benefits of Big Data in Enterprise :


• Big data might allow a company to collect trillions or billions of real-time
data points on its products, resources, or customers- and then repackage
the data instantaneously to optimize the customer experience.
• The speed at which data is updated using big data technologies allows
enterprises to more quickly and accurately respond to customer
demands. For example, MetLife used MongoDB to quickly consolidate
customer information from over 70 different sources and provide a single,
rapidly updated view.
• Big data can help enterprises to act more nimbly allowing them to adapt
to changes faster than their competitors.


Big Data Use Cases
Big data use cases span various industries and include applications
like predictive maintenance, fraud detection, customer analytics, supply chain
optimization, and risk management.

Examples of Big Data Use Cases:


• Predictive Maintenance:
Analyzing sensor data from equipment to predict potential failures and schedule
maintenance proactively, reducing downtime and costs.
• Fraud Detection:
Identifying fraudulent transactions or activities by analyzing large datasets of financial
transactions and other data sources.
• Customer Analytics:
Understanding customer behavior, preferences, and needs by analyzing customer data
from various sources like websites, social media, and purchase history, enabling
personalized marketing and improved customer service.
• Supply Chain Optimization:
Optimizing supply chain operations by analyzing data on inventory levels,
transportation routes, and supplier performance to improve efficiency and reduce
costs.
• Risk Management:
Identifying and mitigating potential risks by analyzing historical data and real-time data
streams to assess risks in various areas like finance, healthcare, and cybersecurity.
• Healthcare:
Using big data for disease prediction, early symptom detection, electronic health
records, real-time alerts, patient engagement, and enhanced analysis of medical
images.
• Finance:
Utilizing big data for predictive analysis, customer segmentation, personalized banking,
asset and wealth management, credit scoring, and business performance monitoring.
• Manufacturing:
Customizing product design, improving quality, detecting anomalies, managing supply
chains, forecasting production, and improving yield.
• Retail and E-commerce:
Tracking customer spending habits, providing recommendations, and optimizing
pricing strategies.
• Telecommunications:
Optimizing network performance, identifying and resolving network issues, and
improving customer service.
• Transportation:
Optimizing traffic flow, improving public transportation, and managing logistics.
• Education:
Personalizing learning experiences, predicting student outcomes, and improving
educational resources.
• Government:
Using big data for emergency response, crime prevention, and smart city initiatives.

Characteristics of Big Data Applications


Big companies utilize this data for their business growth. By analyzing such data, useful decisions can be made in various cases, as discussed below:

1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team has to keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), shopping behavior, and customers' most liked products (so that they can keep those products in the store). Based on which products are being searched or sold most, the production/collection rate of those products is fixed.

2. Recommendation: By tracking customer spending habits and shopping behavior, big retail stores provide recommendations to the customer. E-commerce sites like Amazon, Walmart and Flipkart do product recommendation: they track what products a customer is searching for, and based on that data they recommend similar products to that customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon now has data suggesting that this customer may be interested in buying a bed cover. The next time that customer visits any Google page, advertisements for various bed covers will be shown. Thus, the advertisement of the right product can be sent to the right customer.

3. Virtual Personal Assistant Tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) to answer the various questions asked by users. These tools track the user's location, local time, season, and other data.

4. IoT:
• Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can take action before the machine develops serious issues or breaks down completely. Thus, the cost of replacing the whole machine can be saved.
• In the healthcare field, big data is making a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable coming disease in the human body and help prevent it by enabling advance treatment.

Perception and Quantification of Value

Perception and quantification of value in big data involve both a subjective understanding of its worth and objective measurements of its impact. Organizations must assess the perceived benefits of big data initiatives and then translate them into measurable outcomes.

1. Perception of Value – Key Facets:

Organizations need to consider how big data can potentially:
• Increase Revenues: For example, a recommendation engine might be expected to increase sales by suggesting related products.
• Lower Costs: Using big data platforms can reduce reliance on specialized servers and lower operational costs.
• Increase Productivity: Faster analysis can lead to quicker identification of fraudulent activities and proactive prevention measures.
• Reduce Risk: Real-time data from sensors can provide insights into potential outages or other risks.

2. Quantification of Value:

Measurable Outcomes: Once the perceived value is identified, it needs to be translated into specific, measurable metrics.

Examples of Quantification (a small numeric sketch follows at the end of this section):
• Cost Savings: Measuring the reduction in operational costs due to a new big data platform.
• Revenue Growth: Tracking the increase in sales attributed to a recommendation engine or other data-driven initiatives.
• Productivity Gains: Quantifying the increase in efficiency or speed of processes, such as fraud detection.
• Risk Reduction: Calculating the cost of avoided risks or the reduction in incidents, such as power outages.

Challenges in Quantification:
• Attributing Value: It can be difficult to isolate the specific impact of big data initiatives from other factors.
• Data Quality: The accuracy of the data and the reliability of the analysis are crucial for accurate quantification.
• Time Horizon: The value of big data may not be immediately apparent, and it may take time to realize the full benefits.

3. Importance of Both Perception and Quantification:
• Justifying Investments: Quantifiable value is essential for justifying investments in big data initiatives and securing buy-in from stakeholders.
• Measuring Success: Quantification helps organizations track the progress of their data-driven efforts and measure the effectiveness of their strategies.
• Improving Data-Driven Decision-Making: By understanding both the perceived and quantified value of big data, organizations can make more informed decisions about where to focus their resources and efforts.
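
To make quantification concrete, here is a minimal numeric sketch in Python; all figures are hypothetical and purely for illustration, not drawn from any real initiative:

# Hypothetical figures illustrating how perceived value can be quantified
baseline_annual_cost = 1_200_000   # operating cost before the big data platform
new_annual_cost = 900_000          # operating cost after adoption
platform_investment = 250_000      # cost of the big data initiative
extra_revenue = 180_000            # sales increase attributed to a recommendation engine

cost_savings = baseline_annual_cost - new_annual_cost
total_gain = cost_savings + extra_revenue
roi = (total_gain - platform_investment) / platform_investment

print("Annual cost savings:", cost_savings)
print("Total annual gain  :", total_gain)
print("First-year ROI     : {:.0%}".format(roi))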
Understanding Big Data Storage

Big data storage refers to systems designed to efficiently store, manage, and retrieve
massive datasets for analysis and decision-making. It addresses the challenges of
storing and processing large volumes, diverse formats, and rapidly changing
data. Big data storage solutions often utilize distributed architectures and
specialized technologies to handle the unique needs of big data.

Key Factors of Big Data Storage


• Scalability: Big data storage solutions need to scale to accommodate growing data volumes and processing needs.
• Data Variety: They must handle a wide range of data types, including structured, unstructured, and semi-structured data.
• Data Velocity: They need to manage the speed at which data is generated and processed, often in real time.
• Cost-Effectiveness: Storing and managing large amounts of data can be expensive, so solutions must be cost-effective.
• Data Access and Retrieval: Solutions need to provide efficient access to data for analysis and reporting.

Common big data storage technologies and approaches (a short access sketch follows this list):
• Hadoop: A distributed processing framework that includes the Hadoop Distributed File System (HDFS) for storage.
• Data Lakes: Centralized repositories that store data in its native format, allowing for different analysis methods.
• Cloud Storage: Leveraging cloud services like Amazon S3 or Google Cloud Storage to store and manage data.
• NoSQL Databases: Alternative database systems designed to handle unstructured and semi-structured data.
• Object Storage: A storage model where data is stored as objects, allowing for efficient access and scalability.
• Network Attached Storage (NAS): A storage system that provides network access to data, offering scalability and cost-effectiveness.
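
As a small illustration of how an analytics job typically reads from these storage layers, the sketch below uses PySpark; the HDFS path and the S3 bucket name are hypothetical, and it assumes a working Spark installation with access to such storage:

# Minimal PySpark sketch: reading data from HDFS or cloud object storage
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageAccessDemo").getOrCreate()

# Read a CSV file stored in HDFS (hypothetical path)
df = spark.read.csv("hdfs:///user/student/sales.csv", header=True, inferSchema=True)

# The same API works for cloud object storage such as Amazon S3 (hypothetical bucket)
# df = spark.read.csv("s3a://my-demo-bucket/sales.csv", header=True, inferSchema=True)

df.printSchema()
print("Row count:", df.count())

spark.stop()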

Overview of High-Performance Architecture


High-performance big data architectures are designed to process and analyze
massive datasets quickly and efficiently, often using parallel computing and scalable
infrastructure. They are crucial for modern AI and other data-intensive applications
that require real-time insights.

Key Components and Characteristics:


• Parallel Processing:
Big data architectures leverage parallel computing, where multiple processors work on
different parts of the data simultaneously, significantly speeding up processing.
• Scalability:
They are designed to be horizontally scalable, meaning resources can be added or removed
as needed to handle varying workloads.
• High-Performance Computing (HPC):
HPC often underpins big data architectures, using clusters of powerful processors and
specialized hardware (like GPUs) for accelerated computation.
• Data Storage and Networking:
High-performance storage and networking are essential for efficient data transfer and
access.
• Data Processing Frameworks:
Frameworks like Hadoop and Spark are used for processing and analyzing data in a
distributed manner.

Benefits of High-Performance Big Data Architectures:


• Faster Processing:
Enables rapid analysis of large datasets, leading to quicker insights and decision-making.
• Real-time Analytics:
Supports real-time processing and analysis of data streams, crucial for applications like
streaming analytics and IoT.
• Improved Decision-Making:
Provides timely and accurate data analysis for better business decisions.
• Scalability and Flexibility:
Adapts to changing data volumes and workload demands, ensuring optimal performance.
• AI and Machine Learning:
Provides the necessary infrastructure for training and deploying AI models at scale.

HDFS and MapReduce


Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big-size data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today lots of big-brand companies are using Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:

• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common
MapReduce

MapReduce is a programming model, built on top of the YARN framework, whose major feature is to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use.
MapReduce has mainly 2 tasks, which are divided phase-wise:

In the first phase Map is utilized, and in the next phase Reduce is utilized.

Map Task:
• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function. The key is the record's locational information and the value is the data associated with it.
• Map: A map is a user-defined function whose work is to process the tuples obtained from the RecordReader. The Map() function may generate zero or more key-value pairs from these tuples.
• Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to a local reducer. The intermediate key-value pairs that are generated in the Map phase are combined with the help of this combiner. Using a combiner is optional.
• Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. It generates the shards corresponding to each reducer. The hashcode of each key is fetched, and the partitioner performs a modulus of that hashcode with the number of reducers (key.hashCode() % numberOfReducers), as sketched below.
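
Below is a minimal Python sketch of the idea of hash partitioning; it is illustrative only and is not Hadoop's actual partitioner (which uses Java's hashCode()):

# Illustrative sketch of hash partitioning: each key maps to one reducer
num_reducers = 4

def partition(key, num_reducers):
    # Hadoop's default partitioner computes key.hashCode() % numReduceTasks;
    # Python's built-in hash() is used here only for illustration (its value
    # for strings can vary between runs unless PYTHONHASHSEED is fixed).
    return hash(key) % num_reducers

for key in ["apple", "banana", "cherry", "apple"]:
    print(key, "-> reducer", partition(key, num_reducers))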

Reduce Task

• Shuffle and Sort: The Task of Reducer starts with this step, the process in
which the Mapper generates the intermediate key-value and transfers them
to the Reducer task is known as Shuffling. Using the Shuffling process the
system can sort the data using its key value.

Once some of the Mapping tasks are done Shuffling begins that is why it is a
faster process and does not wait for the completion of the task performed by
Mapper.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs, depending on their key element.
• OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of a record writer, each record on a new line, with the key and value separated by a space. (A plain-Python word-count sketch of this whole flow follows below.)
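
To tie the Map and Reduce tasks together, here is a minimal word-count sketch in plain Python that mimics the flow described above (RecordReader -> Map -> Shuffle and Sort -> Reduce); it is for illustration only and does not use the actual Hadoop API:

# Minimal word-count sketch mimicking the MapReduce flow in plain Python
from collections import defaultdict

records = ["big data is big", "data drives decisions"]

# Map phase: emit (word, 1) pairs for every record
mapped = []
for record in records:
    for word in record.split():
        mapped.append((word, 1))

# Shuffle and sort: group all values by key
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key
for word, counts in grouped.items():
    print(word, sum(counts))
# Output includes: big 2, data 2, decisions 1, drives 1, is 1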

Hadoop YARN Architecture


YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to
remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launching, but it has now
evolved to be known as large-scale distributed operating system used for Big Data
processing.

YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to
manage the available resources properly so that every application can leverage them.

YARN Features: YARN gained popularity because of the following features-


Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
Compatibility: YARN supports the existing map-reduce applications without disruptions thus
making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
Multi-tenancy: It allows access by multiple engines, thus giving organizations the benefit of multi-tenancy.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each of the jobs and then reduce them to equivalent tasks, providing less overhead over the cluster network and reducing the processing power needed. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do
which is comprised of so many smaller tasks that the client wants to process
or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main
job. The result of all the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
In MapReduce, we have a client. The client will submit the job of a particular size to
the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into
further equivalent job-parts. These job-parts are then made available for the Map and
Reduce Task. This Map and Reduce task will contain the program as per the
requirement of the use-case that the particular company is solving. The developer
writes their logic to fulfill the requirement that the industry requires. The input data
which we are using is then fed to the Map Task and the Map will generate intermediate
key-value pair as its output. The output of Map i.e. these key-value pairs are then fed
to the Reducer and the final output is stored on the HDFS. There can be n number of
Map and Reduce tasks made available for processing the data as per the requirement.
The algorithms for Map and Reduce are written in a highly optimized way so that time and space complexity are kept to a minimum.
Let's discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests its main use is to map the input data in key-
value pairs. The input to the map may be a key-value pair where the key can
be the id of some kind of address and value is the actual value that it keeps.
The Map() function will be executed in its memory repository on each of
these input key-value pairs and generates the intermediate key-value pair
which works as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node since there can be hundreds of data nodes
available in the cluster.

2. Task Tracker: The Task Tracker can be considered the worker (slave) node that carries out the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
Unit II
Advanced Analytical Theory and Methods

Advanced analytics and methods in big data utilize sophisticated


techniques, including machine learning, deep learning, and predictive
modeling, to extract valuable insights and make informed decisions from
large datasets. These methods go beyond traditional business intelligence
(BI) by identifying patterns, predicting future outcomes, and uncovering
hidden relationships in data.

Key Areas and Techniques:


• Machine Learning:
This field uses algorithms that learn from data to make predictions or
decisions.
• Deep Learning:
A subset of machine learning, deep learning uses artificial neural networks with
multiple layers to analyze complex data.
• Predictive Modeling:
This involves building statistical models to forecast future trends or events
based on historical data.
• Data Mining:
This process involves discovering patterns, relationships, and trends within large
datasets.
• Clustering:
A technique used to group similar data points together based on their
characteristics.
• Association Rule Mining:
This method identifies relationships between different items or events in a
dataset.
• Time Series Analysis:
This technique analyzes data that changes over time to identify patterns and
trends.
• Regression Analysis:
This method explores the relationship between a dependent variable and one or
more independent variables.
• Classification:
This technique assigns data points to different categories or classes based on
their characteristics.

Applications:
• Fraud Detection:
Advanced analytics can be used to identify fraudulent activities by analyzing
transaction patterns.
• Marketing:
Predictive analytics can help businesses develop targeted marketing campaigns
and improve customer engagement.
• Supply Chain Management:
By analyzing data, businesses can optimize inventory levels, reduce costs, and
improve efficiency.
• Healthcare:
Advanced analytics can be used to improve patient outcomes, predict disease
outbreaks, and personalize treatment plans.
• Financial Services:
Advanced analytics can be used to assess credit risk, detect anomalies, and
improve financial decision-making.

Overview of clustering

Clustering is a technique in data science that groups similar data points


together into clusters, where points within a cluster are more alike than
points in different clusters. This process helps reveal underlying patterns
and structures within data by identifying natural groupings.

Here's a more detailed overview:

Key Concepts:
• Unsupervised Learning:
Clustering is a type of unsupervised learning, meaning it doesn't rely on labeled
data to train the model.
• Similarity and Dissimilarity:
Clustering algorithms rely on measuring the similarity or dissimilarity between
data points, often using distance metrics like Euclidean distance or cosine
similarity.
• Purpose:
Clustering is used for various purposes, including:
• Exploratory Data Analysis: Identifying natural groupings and trends in
data.
• Data Reduction: Simplifying large datasets by grouping similar data
points into clusters, reducing the number of features.
• Anomaly Detection: Identifying data points that are far from any cluster,
potentially indicating outliers or anomalies.
• Types of Clustering Algorithms (compared in the short sketch after this list):
• K-Means: A popular centroid-based algorithm that assigns data points to
clusters based on their distance to cluster centers (centroids).
• Hierarchical Clustering: Builds a hierarchy of clusters, starting with each
point as a separate cluster and iteratively merging the closest clusters.
• Density-Based Spatial Clustering of Applications with Noise
(DBSCAN): Groups data points based on their density, identifying clusters
as dense regions of points.
• Hard vs. Soft Clustering:
• Hard Clustering: Assigns each data point to exactly one cluster.
• Soft Clustering: Allows data points to belong to multiple clusters with
varying degrees of membership.
• Applications:
Clustering finds applications in various fields, including:
• Marketing: Customer segmentation.

• Image Processing: Image segmentation.


• Network Analysis: Identifying communities in social networks.
• Bioinformatics: Clustering gene expression data.
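
As a brief, hedged illustration of the algorithm types listed above, the sketch below runs K-Means, hierarchical (agglomerative) clustering, and DBSCAN from scikit-learn on a small synthetic dataset; the parameter values are illustrative choices, not recommendations:

# Comparing three clustering algorithms from scikit-learn on synthetic data
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Small synthetic dataset with three natural groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # -1 marks noise points

print("K-Means labels (first 10):      ", kmeans_labels[:10])
print("Hierarchical labels (first 10): ", hier_labels[:10])
print("DBSCAN labels (first 10):       ", dbscan_labels[:10])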

K-Means
K-means clustering is an iterative process to minimize the sum of
distances between the data points and their cluster centroids. The
k-means clustering algorithm operates by categorizing data points
into clusters by using a mathematical distance measure, usually
euclidean, from the cluster center.
The algorithm works by first randomly picking some central points
called centroids and each data point is then assigned to the closest centroid
forming a cluster. After all the points are assigned to a cluster the centroids
are updated by finding the average position of the points in each cluster.
This process repeats until the centroids stop changing forming clusters. The
goal of clustering is to divide the data points into clusters so that similar
data points belong to same group.

The algorithm will categorize the items into k groups or clusters of similarity.
To calculate that similarity we will use the Euclidean distance as a
measurement. The algorithm works as follows:
1. First we randomly initialize k points called means or cluster
centroids.
2. We categorize each item to its closest mean and we update the
mean's coordinates, which are the averages of the items
categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the
end, we have our clusters.
The "points" mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set. For example, if a feature x has values in [0, 3], we will initialize the means with values for x in [0, 3]. (A short scikit-learn sketch of these steps follows below.)
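
The steps above can be reproduced with scikit-learn's KMeans in a few lines. This is a minimal sketch on a tiny made-up 2-D dataset, not tied to any particular dataset from the course:

# Minimal K-Means sketch following the steps described above
import numpy as np
from sklearn.cluster import KMeans

# Small 2-D dataset: two obvious groups of points
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels :", labels)            # e.g. [0 0 0 1 1 1] (label order may vary)
print("Centroids      :", kmeans.cluster_centers_)
print("Inertia (WCSS) :", kmeans.inertia_)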

Use cases

Big data use cases span various industries and include applications
like predictive maintenance, fraud detection, customer analytics, supply
chain optimization, and risk management. These use cases leverage the
analysis of large, complex datasets to gain insights and make informed
decisions.

Examples of Big Data Use Cases:


• Predictive Maintenance:
Analyzing sensor data from equipment to predict potential failures and schedule
maintenance proactively, reducing downtime and costs.
• Fraud Detection:
Identifying fraudulent transactions or activities by analyzing large datasets of
financial transactions and other data sources.
• Customer Analytics:
Understanding customer behavior, preferences, and needs by analyzing
customer data from various sources like websites, social media, and purchase
history, enabling personalized marketing and improved customer service.
• Supply Chain Optimization:
Optimizing supply chain operations by analyzing data on inventory levels,
transportation routes, and supplier performance to improve efficiency and
reduce costs.
• Risk Management:
Identifying and mitigating potential risks by analyzing historical data and real-
time data streams to assess risks in various areas like finance, healthcare, and
cybersecurity.
• Healthcare:
Using big data for disease prediction, early symptom detection, electronic health
records, real-time alerts, patient engagement, and enhanced analysis of medical
images.
• Finance:
Utilizing big data for predictive analysis, customer segmentation, personalized
banking, asset and wealth management, credit scoring, and business
performance monitoring.
• Manufacturing:
Customizing product design, improving quality, detecting anomalies, managing
supply chains, forecasting production, and improving yield.
• Retail and E-commerce:
Tracking customer spending habits, providing recommendations, and optimizing
pricing strategies.
• Telecommunications:
Optimizing network performance, identifying and resolving network issues, and
improving customer service.
• Transportation:
Optimizing traffic flow, improving public transportation, and managing logistics.
• Education:
Personalizing learning experiences, predicting student outcomes, and improving
educational resources.

• Government:
Using big data for emergency response, crime prevention, and smart city initiatives.

Overview of the method to determine the number of clusters

In Clustering algorithms like K-Means clustering, we have to determine the right


number of clusters for our dataset. This ensures that the data is properly and
efficiently divided. An appropriate value of 'k' i.e. the number of clusters helps in
ensuring proper granularity of clusters and helps in maintaining a good balance
between compressibility and accuracy of clusters.
Let us consider two cases:
Case 1: Treat the entire dataset as one cluster
Case 2: Treat each data point as a cluster
This will give the most accurate clustering because of the zero distance between the
data point and its corresponding cluster center. But, this will not help in predicting
new inputs. It will not enable any kind of data summarization.
So, we can conclude that it is very important to determine the 'right' number of
clusters for any dataset. This is a challenging task but very approachable if we depend
on the shape and scaling of the data distribution. A simple method to calculate the
number of clusters is to set the value to about √(n/2) for a dataset of 'n' points. In the
rest of the article, two methods have been described and implemented in Python for
determining the number of clusters in data mining.
1. Elbow Method:
This method is based on the observation that increasing the number of clusters can
help in reducing the sum of the within-cluster variance of each cluster. Having more
clusters allows one to extract finer groups of data objects that are more similar to each
other. For choosing the 'right' number of clusters, the turning point of the curve of the
sum of within-cluster variances with respect to the number of clusters is used. The
first turning point of the curve suggests the right value of 'k' for any k > 0. Let us
implement the elbow method in Python.
Step 1: Importing the libraries
# importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


Step 2: Loading the dataset

We have used the Mall Customers dataset (Mall_Customers.csv).

# loading the dataset


dataset = pd.read_csv('Mall_Customers.csv')

# printing first five rows of the dataset


print(dataset.head(5))
Output:

First five rows of the dataset

Step 3: Checking for any null values

The dataset has 200 rows and 5 columns. It has no null values.

# printing the shape of dataset


print(dataset.shape)

# checking for any


# null values present
print(dataset.isnull().sum())

Output:
Shape of the dataset along
with count of null values

Step 4: Extracting 2 columns from the dataset for clustering

Let us extract two columns namely 'Annual Income (k$)' and 'Spending
Score (1-100)' for further process.

# extracting values from two


# columns for clustering
dataset_new = dataset[['Annual Income (k$)',
'Spending Score (1-100)']].values

Step 5: Determining the number of clusters using the elbow method


and plotting the graph

# determining the maximum number of clusters


# using the simple method
limit = int((dataset_new.shape[0]//2)**0.5)

# selecting optimal value of 'k'


# using elbow method

# wcss - within cluster sum of


# squared distances
wcss = {}

for k in range(2, limit + 1):
    model = KMeans(n_clusters=k)
    model.fit(dataset_new)
    wcss[k] = model.inertia_

# plotting the wcss values
# to find out the elbow value
plt.plot(list(wcss.keys()), list(wcss.values()), 'gs-')
plt.xlabel('Values of "k"')
plt.ylabel('WCSS')
plt.show()
Output:

Plot of Elbow Method

Through the above plot, we can observe that the turning point of this curve
is at the value of k = 5. Therefore, we can say that the 'right' number of
clusters for this data is 5.

Clusters and it’s types

The process of making a group of abstract objects into classes of similar


objects is known as clustering.
Objectives of clustering:
• One group is treated as a cluster of data objects.
• In the process of cluster analysis, the first step is to partition the set of data into groups with the help of data similarity, and then groups are assigned to their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to changes made and helps single out useful features that differentiate different groups.
Applications of cluster analysis :
• It is widely used in many applications such as image processing,
data analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base
and they can characterize their customer groups by using
purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant
taxonomies and identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on
the web.
Clustering Methods:
It can be classified based on the following categories.
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
Requirements of clustering in data mining:
The following are some points why clustering is important in data mining.
• Scalability - we require highly scalable clustering algorithms to
work with large databases.
• Ability to deal with different kinds of attributes - Algorithms
should be able to work with the type of data such as categorical,
numerical, and binary data.
• Discovery of clusters with attribute shape - The algorithm should
be able to detect clusters in arbitrary shapes and it should not be
bounded to distance measures.
• Interpretability - The results should be comprehensive, usable,
and interpretable.
• High dimensionality - The algorithm should be able to handle
high dimensional space instead of only handling low dimensional
data.

Classification: Decision Trees

A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.
Understanding Decision Tree
A decision tree is a graphical representation of different options for solving a problem, and it shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called a node, which further branches out into different possible outcomes, where:
• Root Node is the starting point that represents the entire dataset.
• Branches: These are the lines that connect nodes. They show the flow from one decision to another.
• Internal Nodes are points where decisions are made based on the input features.
• Leaf Nodes: These are the terminal nodes at the end of branches that represent the final outcomes. They also support decision-making by visualizing outcomes: you can quickly evaluate and compare the "branches" to determine which course of action is best for you.

Now, let's take an example to understand the decision tree. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day. If it's morning, it asks whether you are tired: if you're tired, the tree suggests drinking coffee; if not, it says there's no need. Similarly, in the afternoon the tree again asks if you are tired: if you are, it recommends drinking coffee; if not, it concludes no coffee is needed.

Overview of Decision Tree

Types of Decision Tree

We have mainly two types of decision tree based on the nature of the target variable:
classification trees and regression trees.

Classification trees: They are designed to predict categorical outcomes means they
classify data into different classes. They can determine whether an email is “spam” or
“not spam” based on various features of the email.

Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features. (A short scikit-learn sketch of both tree types follows below.)
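
As a brief illustration of both tree types, the sketch below fits a classification tree and a regression tree with scikit-learn; the feature values and labels are made up purely for illustration:

# Minimal sketch: a classification tree and a regression tree with scikit-learn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict "spam" (1) or "not spam" (0) from two made-up features
X_cls = [[0, 1], [1, 1], [0, 0], [1, 0]]   # e.g. [contains_link, many_recipients]
y_cls = [1, 1, 0, 0]
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_cls, y_cls)
print("Predicted class:", clf.predict([[1, 0]]))

# Regression: predict a house price from its size in square feet (made-up data)
X_reg = [[600], [800], [1000], [1200]]
y_reg = [30.0, 42.0, 55.0, 68.0]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
print("Predicted price:", reg.predict([[900]]))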

Advantages of Decision Trees


• Simplicity and Interpretability: Decision trees are straightforward
and easy to understand. You can visualize them like a flowchart
which makes it simple to see how decisions are made.
• Versatility: They can be used for different types of tasks and
work well for both classification and regression.
• No Need for Feature Scaling: They don’t require you to normalize
or scale your data.
• Handles Non-linear Relationships: They are capable of capturing non-
linear relationships between features and target variables.
Disadvantages of Decision Trees
• Overfitting: Overfitting occurs when a decision tree captures noise
and details in the training data, causing it to perform poorly on new data.
• Instability: The model can be unreliable; slight variations in the
input can lead to significant differences in predictions.
• Bias towards Features with More Levels: Decision trees can
become biased towards features with many categories, focusing
too much on them during decision-making. This can cause the
model to miss other important features, leading to less accurate
predictions.
Applications of Decision Trees
• Loan Approval in Banking: A bank needs to decide whether to
approve a loan application based on customer profiles.
o Input features include income, credit score,
employment status, and loan history.
o The decision tree predicts loan approval or
rejection, helping the bank make quick and reliable
decisions.
• Medical Diagnosis: A healthcare provider wants to predict
whether a patient has diabetes based on clinical test results.
o Features like glucose levels, BMI, and blood
pressure are used to make a decision tree.
o The tree classifies patients as diabetic or non-
diabetic, assisting doctors in diagnosis.
• Predicting Exam Results in Education: A school wants to predict
whether a student will pass or fail based on study habits.
o Data includes attendance, time spent studying, and
previous grades.
o The decision tree identifies at-risk students,
allowing teachers to provide additional support.
A decision tree can also be used to help build automated predictive models,
which have applications in machine learning, data mining, and statistics.

The General Decision Tree Algorithm

Decision trees are among the most widely used machine learning algorithms and can
be applied to both classification and regression tasks. These models work by splitting
data into subsets based on features; this process is known as decision
making. Each leaf node provides a prediction, and the splits create a tree-
like structure. Decision trees are popular because they are easy
to interpret and visualize, making it easier to understand the decision-
making process.
In machine learning, there are various types of decision tree algorithms. This
section explores these types so that you can choose the most
appropriate one for your task.

Types of Decision Tree Algorithms

There are six commonly used decision tree algorithms, listed below. Each one
has its own advantages and limitations.
1. ID3 (Iterative Dichotomiser 3)
ID3 is a classic decision tree algorithm commonly used for classification
tasks. It works by greedily choosing the feature that maximizes the
information gain at each node. It calculates entropy and information
gain for each feature and selects the feature with the highest information
gain for splitting.
Entropy: It measures impurity in the dataset. Denoted by H(D) for dataset D,
it is calculated using the formula:

H(D) = − Σ_{i=1}^{n} p_i · log2(p_i)

where p_i is the proportion of examples in D belonging to class i.

Information gain: It quantifies the reduction in entropy after splitting the
dataset on a feature:

InformationGain = H(D) − Σ_{v=1}^{V} (|D_v| / |D|) · H(D_v)

ID3 recursively splits the dataset using the feature with the highest
information gain until all examples in a node belong to the same class or no
features remain to split on. After the tree is constructed, it prunes branches that
don't significantly improve accuracy in order to reduce overfitting. However, ID3 tends to
overfit the training data and cannot directly handle continuous attributes.
These issues are addressed by other algorithms like C4.5 and CART.
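To make the two quantities concrete, here is a small hand-rolled R sketch; the helper names and the toy weather data frame are invented for illustration and are not part of ID3 itself.

entropy <- function(labels) {
  p <- table(labels) / length(labels)      # class proportions p_i
  -sum(p * log2(p))                        # H(D) = -sum(p_i * log2(p_i))
}

info_gain <- function(data, feature, target) {
  parent <- entropy(data[[target]])
  groups <- split(data[[target]], data[[feature]])   # partition the labels by feature value
  child  <- sum(sapply(groups, function(g) length(g) / nrow(data) * entropy(g)))
  parent - child                                     # reduction in entropy
}

weather <- data.frame(Outlook = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"),
                      Windy   = c("No", "Yes", "No", "No", "Yes", "Yes"),
                      Play    = c("No", "No", "Yes", "Yes", "No", "Yes"))
info_gain(weather, "Outlook", "Play")   # ID3 would split on the feature with the larger gain
info_gain(weather, "Windy", "Play")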

2. C4.5
C4.5 uses a modified version of information gain called the gain ratio to reduce the
bias towards features with many values. The gain ratio is computed by dividing
the information gain by the intrinsic information which measures the amount of
data required to describe an attribute’s values:
GainRatio = InformationGain / IntrinsicInformation
C4.5 has limitations:
• It can be prone to overfitting, especially on noisy datasets, even
though it uses pruning techniques.
• Performance may degrade when dealing with datasets that have
many features.
3. CART (Classification and Regression Trees)
CART is a widely used decision tree algorithm that is used
for classification and regression tasks.
• For classification, CART splits the data based on the Gini impurity,
which measures the likelihood of incorrectly classifying a randomly
selected data point. The feature that minimizes the Gini impurity is
selected for splitting at each node (a small numeric sketch follows
below). The formula is:

Gini(D) = 1 − Σ_{i=1}^{n} p_i²

• For regression, CART typically chooses the split that minimizes the
variance (mean squared error) of the target variable within the
resulting subsets.
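A tiny R sketch of the Gini impurity calculation; the function name and example labels are illustrative only.

gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  1 - sum(p^2)                          # Gini(D) = 1 - sum(p_i^2)
}
gini(c("Yes", "Yes", "Yes", "No"))      # 0.375: a fairly pure node
gini(c("Yes", "No", "Yes", "No"))       # 0.5: maximally impure for two classes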

4. CHAID (Chi-Square Automatic Interaction Detection)
CHAID uses chi-square tests to determine the best splits especially
for categorical variables. It recursively divides the data into smaller subsets
until each subset contains only data points of the same class or within a
specified range of values. It chooses the feature with the highest chi-
squared statistic for splitting, indicating the strongest relationship with the target variable.
This approach is particularly useful for analyzing large datasets with many
categorical features. The Chi-Square Statistic formula:

χ² = Σ (O_i − E_i)² / E_i

Where:
• O_i represents the observed frequency,
• E_i represents the expected frequency in each category.

5. MARS (Multivariate Adaptive Regression Splines)
MARS is an extension of the CART algorithm. It uses splines to model non-
linear relationships between variables. It constructs a piecewise linear
model where the relationship between the input and output variables is
linear but with variable slopes at different points, known as knots. It
automatically selects and positions these knots based on the data
distribution and the need to capture non-linearities.
Basis Functions: Each basis function in MARS is a simple linear function
defined over a range of the predictor variable. The function is described as:

h(x) = max(0, x − t)   or   h(x) = max(0, t − x)

Where
• x is a predictor variable,
• t is the knot (the point at which the piecewise linear function changes).
Knot Function: The knots are the points where the piecewise linear
functions connect. MARS places these knots to best represent the data's
non-linear structure.
6. Conditional Inference Trees
Conditional Inference Trees uses statistical tests to choose splits based on
the relationship between features and the target variable. It uses
permutation tests to select the feature that best splits the data while
minimizing bias.
The algorithm follows a recursive approach. At each node it evaluates the
statistical significance of potential splits using tests like the Chi-squared
test for categorical features and the F-test for continuous features. The
feature with the strongest relationship to the target is selected for the split.
The process continues until the data cannot be further split or meets
predefined stopping criteria.
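For reference, conditional inference trees are available in R through the partykit package's ctree() function; the following is a minimal sketch, assuming the package is installed and using the built-in iris data.

library(partykit)
cit <- ctree(Species ~ ., data = iris)   # splits chosen via permutation tests on p-values
print(cit)
plot(cit)                                # visualize nodes, p-values and terminal distributions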

Evaluation of Decision Trees

Evaluating decision trees in R involves assessing their performance and robustness.
Common methods include using metrics like accuracy, precision, recall, and F1-score,
and employing techniques like cross-validation. Pruning decision trees can also help
improve generalization by reducing complexity.

Evaluation Metrics:

• Accuracy: Measures the overall correctness of the model by comparing predicted values
with actual values.
• Precision: Measures the ability of the model to correctly identify positive cases among
those predicted as positive.
• Recall: Measures the ability of the model to correctly identify all actual positive cases.
• F1-score: The harmonic mean of precision and recall, providing a balance between both
metrics (see the sketch after this list).
• ROC-AUC (Receiver Operating Characteristic – Area Under the Curve): A measure of the
model's ability to distinguish between different classes, especially useful for binary
classification tasks.
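The sketch below computes the first four metrics by hand from a small confusion matrix; the predicted and actual vectors are made up purely for illustration.

actual    <- factor(c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"))
predicted <- factor(c("Yes", "No", "No",  "Yes", "No", "Yes", "Yes", "No"))
cm <- table(Predicted = predicted, Actual = actual)    # confusion matrix

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])       # correct positives among predicted positives
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])       # correct positives among actual positives
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)

The caret package's confusionMatrix() function reports the same quantities automatically.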

Techniques for Robustness:

• Cross-validation: Splits the data into multiple folds and trains the model on different
combinations of these folds to assess its performance on unseen data.
• Pruning: Removes branches or nodes from the decision tree to simplify it and improve
its ability to generalize to new data. This can be achieved using techniques like
cost-complexity pruning (illustrated in the sketch after this list).
• Ensemble methods: Use multiple decision trees to create a more robust and accurate
model, such as random forests or gradient boosting.
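As one possible illustration with rpart (which builds an internal cross-validated complexity-parameter table), the sketch below grows a deliberately deep tree on the built-in iris data and prunes it back; the parameter values are assumptions for the example.

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001, minsplit = 2))   # grow a deliberately deep tree
printcp(fit)                                        # cross-validated error for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                 # cost-complexity pruning at the best cp
print(pruned)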

R Packages:

• rpart: A powerful package for building decision trees.
• caret: A package that simplifies model training and evaluation, including cross-
validation and hyperparameter tuning.
• randomForest: A package for implementing random forests, an ensemble method built
on decision trees.

By employing these evaluation metrics, techniques, and R packages, you can assess
the performance of decision trees in R and ensure that they provide accurate and
robust predictions.

Naive Bayes

Naive Bayes is a classification algorithm that uses Bayes' Theorem to
predict the probability of a certain event occurring based on prior
knowledge. It is known for its simplicity and efficiency, making it a popular
choice for tasks like text classification, spam filtering, and document
classification.

Here's a more detailed explanation:

Key Concepts:
• Supervised Learning:
Naive Bayes is a supervised learning algorithm, meaning it learns from labeled
data to make predictions.
• Bayes' Theorem:
The core principle behind Naive Bayes is Bayes' Theorem, which calculates the
probability of an event based on prior knowledge.
• Conditional Independence:
A key assumption in Naive Bayes is that the presence of one feature does not
affect the presence of another feature. This is where the "naive" part of the
name comes in.
• Probabilistic Classifier:
Naive Bayes is a probabilistic classifier, meaning it assigns probabilities to
different classes and predicts the most likely class for a given input.

Bayes Theorem

Bayes' Theorem helps us update probabilities based on prior knowledge
and new evidence. For example, if we know that a pet is quiet (new
information), we can use Bayes' Theorem to calculate the updated
probability of the pet being a cat or a dog, based on how likely each animal
is to be quiet.

Bayes Theorem and Conditional Probability


Bayes' theorem (also known as the Bayes Rule or Bayes Law) is used to
determine the conditional probability of event A when event B has already
occurred.
The general statement of Bayes' theorem is: "The conditional probability of
an event A, given the occurrence of another event B, is equal to the product
of the probability of B given A and the probability of A, divided by the
probability of event B," i.e. P(A|B) = P(B|A) · P(A) / P(B).
For example, if we want to find the probability that a white marble drawn
at random came from the first bag, given that a white marble has already
been drawn, and there are three bags each containing some white and
black marbles, then we can use Bayes’ Theorem.

Bayes Theorem Formula

For any two events A and B, Bayes' theorem is given by:

P(A|B) = [P(B|A) · P(A)] / P(B)

Where,
• P(A) and P(B) are the probabilities of events A and B, also, P(B) is
never equal to zero.
• P(A|B) is the probability of event A when event B happens,
• P(B|A) is the probability of event B when A happens.
Bayes Theorem Statement

Bayes' theorem for n sets of events is defined as follows.
Let E1, E2, …, En be a set of events associated with the sample space S, in
which all the events E1, E2, …, En have a non-zero probability of occurrence
and form a partition of S. Let A be an event from the space S for which we
have to find the probability. Then, according to Bayes' theorem,

P(E_i | A) = [P(E_i) · P(A | E_i)] / [Σ_{k=1}^{n} P(E_k) · P(A | E_k)],  for i = 1, 2, 3, …, n
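A quick numeric sketch of the theorem in R, using the three-bags example from above; the marble counts are invented for illustration, since the text does not give them.

prior   <- c(bag1 = 1/3, bag2 = 1/3, bag3 = 1/3)    # P(E_k): each bag equally likely to be chosen
p_white <- c(bag1 = 3/5, bag2 = 2/5, bag3 = 1/5)    # P(A | E_k): assumed share of white marbles per bag
posterior <- prior * p_white / sum(prior * p_white) # Bayes theorem: P(E_k | white)
posterior["bag1"]                                   # probability the white marble came from the first bag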

Bayes Theorem Applications


Bayesian inference is very important and has found application in various
activities, including medicine, science, philosophy, engineering, sports, law,
etc., and Bayesian inference is directly derived from Bayes theorem.
Some of the Key Applications are:
• Medical Testing → Finding the real probability of having a disease
after a positive test.
• Spam Filters → Checking if an email is spam based on keywords.
• Weather Prediction → Updating the chance of rain based on new
data.
• AI & Machine Learning → Used in Naïve Bayes classifiers to
predict outcomes.
Difference Between Conditional Probability
and Bayes Theorem
The difference between conditional probability and Bayes' theorem can be
understood with the help of the comparison given below.

Bayes Theorem vs. Conditional Probability

• Definition: Bayes' theorem is derived using the definition of conditional probability and
is used to find the reverse probability. Conditional probability is the probability of event A
when event B has already occurred.
• Formula: Bayes' theorem: P(A|B) = [P(B|A) · P(A)] / P(B). Conditional probability:
P(A|B) = P(A∩B) / P(B).
• Purpose: Bayes' theorem updates the probability of an event based on new evidence.
Conditional probability finds the probability of one event based on the occurrence of another.
• Focus: Bayes' theorem uses prior knowledge and evidence to compute a revised probability.
Conditional probability describes the direct relationship between two events.

Naive Bayes Classifier

The Naive Bayes classifier is a simple probabilistic machine learning algorithm
based on Bayes' theorem. It is called "naive" because it makes the
assumption that each feature is independent of the others given the class
label. This means that the presence or absence of one feature does not
affect the probability of another feature being present in the same class.

Objectives:
• Bayes' Theorem:
The foundation of the algorithm, used to calculate the probability of a
hypothesis (or class) given the evidence (or features).
• Conditional Independence:
The core assumption that features are independent of each other given the
class label.
• Probabilistic Classification:
The algorithm predicts the class with the highest probability for a given input.

How it works:
1. Training: The algorithm learns the probability distribution of each feature given each class
label from the training data.
2. Prediction: For a new input, the algorithm calculates the probability of each class given the
input features, based on the learned probabilities.
3. Classification: The algorithm assigns the input to the class with the highest calculated
probability (see the sketch below).
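A minimal sketch of these three steps in R, assuming the e1071 package (whose naiveBayes() function implements this classifier) and the built-in iris data.

library(e1071)
model <- naiveBayes(Species ~ ., data = iris)   # training: estimate class priors and P(feature | class)
new_obs <- iris[1, -5]                          # a new input (class column dropped)
predict(model, new_obs)                         # classification: most probable class
predict(model, new_obs, type = "raw")           # prediction: posterior probability of each class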

Advantages:
• Simple and fast: It's easy to implement and computationally efficient, making
it suitable for large datasets.
• Scalable: Can handle large datasets and many features.
• Handles both continuous and categorical data: Can be used with various data
types.
• Not sensitive to irrelevant features: Can ignore irrelevant data and maintain
good performance.
• Low false positive rate: Studies of Naive Bayes spam filtering have shown that it can
achieve low false positive rates in spam detection.

Disadvantages:
• The "naive" assumption: The assumption of feature independence is often not
true in real-world scenarios. This can lead to suboptimal performance,
especially when features are strongly correlated.
• Limited ability to model complex dependencies: It struggles to model
complex relationships between features.

Applications:
• Spam filtering: Used to classify emails as spam or not spam.
• Text classification: Used to categorize documents into different topics.
• Sentiment analysis: Used to determine the sentiment expressed in text, such
as positive or negative.
• Medical diagnosis: Used to help diagnose patients by predicting the
probability of different diseases.
• Face recognition: Used to identify faces or features like the nose, mouth, and
eyes.
