DM HarshQuesAns
Q1: Explain the KDD Process and Its Steps
KDD stands for Knowledge Discovery in Databases. It is a systematic process of discovering useful,
valid, and understandable patterns or knowledge from large volumes of data.
It includes Data Mining as a key step, but it goes beyond just mining – it covers everything from data
preparation to interpretation of results.
The KDD process typically consists of the following five key steps:
1. Data Selection
• Description: Choose the data relevant to the analysis task from the available sources.
• Example: From a hospital’s records, you may select only patient age, disease type, and
treatment data for analysis.
2. Data Cleaning
• Description: Real-world data is often dirty – it may contain errors, missing values,
duplicates, or inconsistencies.
• Techniques include:
o Filling missing values
o Removing duplicates
o Normalizing data
• Example: Replacing all null values in an income column with the average income.
3. Data Transformation
• Description: Convert the cleaned data into forms appropriate for mining.
• Techniques include:
o Normalization
o Aggregation
o Discretization
4. Data Mining
• Description: This is the core step where intelligent methods are applied to extract useful
knowledge.
• Techniques:
o Classification
o Clustering
o Prediction
• Example: Discovering that customers who buy milk and bread also tend to buy butter
(association rule).
5. Interpretation / Evaluation
• Description:
o Interpret the mined patterns.
o Evaluate whether they are interesting, valid, novel, and useful.
Summary
Step | Purpose
Data Selection | Choose relevant data
Data Cleaning | Fix errors and missing values
Data Transformation | Format data for mining
Data Mining | Discover patterns using algorithms
Interpretation | Evaluate and understand discovered knowledge
Q2: Differentiate Descriptive and Predictive Data Mining
Data mining is the process of discovering useful patterns, knowledge, and insights from large amounts of
data. It is mainly categorized into two types: Descriptive and Predictive data mining.
Let’s compare both based on definition, objective, methods used, examples, and a comparison table for
quick revision.
1. Descriptive Data Mining
Definition:
Descriptive data mining is used to analyze past data and provide a summary or insights about what has
happened. It helps in understanding the underlying patterns, relationships, and characteristics of data.
Objective:
To describe the main features of data and identify patterns or trends without making future
predictions.
Common Techniques:
• Clustering
• Association rule mining
• Summarization
• Data visualization
Example:
• A supermarket finds that "people who buy bread also often buy butter."
• Clustering customers based on purchasing behavior.
2. Predictive Data Mining
Definition:
Predictive data mining is used to predict future outcomes based on historical data. It involves building
models using machine learning or statistical methods.
Objective:
To forecast unknown or future values using patterns discovered in past data.
Common Techniques:
• Classification
• Regression
• Time series analysis
• Decision trees
• Neural networks
Example:
• Predicting whether a loan applicant will default based on past repayment data.
• Forecasting next month's sales from historical sales records.
Comparison Table
Aspect | Descriptive Data Mining | Predictive Data Mining
Purpose | Understand and describe patterns in data | Predict future outcomes or unknown values
Use Case | “What has happened?” | “What will happen?”
Goal | Describe the main features and trends of the data | Forecast unknown or future values
Conclusion
• Descriptive data mining is like looking in the rear-view mirror to understand where
you’ve been.
• Predictive data mining is like looking through the windshield to see where you're going.
Both are essential in data analysis: one for understanding data, the other for making informed decisions.
Q3: Define Data Warehouse and Differentiate with Database
A Data Warehouse is a centralized repository that stores large volumes of historical data from
multiple sources. It is specifically designed for querying, analysis, and reporting, rather than for
routine transactional processing.
Definition:
A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data
that helps support decision-making processes in an organization.
Feature | Description
Subject-oriented | Focuses on high-level subjects (e.g., sales, finance) rather than daily operations.
Integrated | Combines data from multiple heterogeneous sources into a consistent format.
Time-variant | Stores historical data collected over long periods of time.
Non-volatile | Data is loaded and read; it is not updated or deleted by routine transactions.
What is a Database?
A Database is a collection of related data that is organized to support day-to-day operations (like insert,
update, delete, retrieve) of an organization. It is optimized for transactional processing.
Example: A bank’s system that updates your account balance every time you deposit or withdraw money
is based on a database.
Feature | Database | Data Warehouse
Data Structure | Normalized (3NF) to avoid redundancy | Denormalized for faster queries
Users | Clerks, Admins, Application Users | Analysts, Managers, Decision-makers
Summary: A database supports day-to-day transactional operations, while a data warehouse stores integrated historical data for analysis and decision-making.
Q4: Differentiate OLTP and OLAP
OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are two different
types of data processing systems used in database management, each serving a different purpose in an
organization.
What is OLTP?
Definition:
OLTP systems are used for managing day-to-day transactional data. These systems are designed for
high-speed, real-time operations, such as insert, update, and delete operations.
Objective:
To handle a large number of short, fast transactions such as banking, online shopping, or airline
reservations.
Characteristics:
• Many short, fast read/write transactions
• Highly normalized schema holding current, up-to-date data
• Large number of concurrent users
Examples:
• ATM transactions
• Online order placement
What is OLAP?
Definition:
OLAP systems are used for data analysis and decision-making. These systems work with large
volumes of historical data and help in analyzing trends, patterns, and insights.
Objective:
To support complex queries and analytical operations like reporting, data mining, and business
intelligence.
Characteristics:
• Complex analytical queries over large volumes of historical data
• Data is aggregated and often stored in multidimensional (cube) form
• Fewer users (analysts, managers) with a read-mostly workload
Examples:
• Sales forecasting
• Market trend analysis
Feature | OLTP | OLAP
Operations | INSERT, UPDATE, DELETE, SELECT | Complex SELECT queries (analysis, aggregation)
Data Size | Smaller, per transaction | Large volumes (multi-terabyte data warehouses)
Conclusion
• OLTP focuses on efficiently handling real-time business operations, and is optimized for
performance and reliability.
• OLAP focuses on analyzing large volumes of historical data to extract insights and
support strategic decisions.
Think of OLTP as the engine that runs the business, and OLAP as the compass that guides the
business.
Q5. Explain Data Mining Functionalities: Characterization, Regression, Discrimination
Data mining functionalities are types of tasks or operations that help discover patterns, relationships,
or trends from large datasets. These are grouped mainly into descriptive tasks (such as characterization and discrimination) and predictive tasks (such as regression).
1⃣ Data Characterization (Descriptive)
Definition:
Data Characterization is the process of summarizing the general features of a target class of data.
What it does:
• Summarizes the data of the class under study (the target class) in general, high-level terms.
Example:
• "What are the characteristics of customers who bought electronics last year?"
o Age: 25-40
o Location: Urban areas
Techniques Used:
• Data aggregation
• Generalization
2⃣ Data Discrimination (Descriptive)
Definition:
Data Discrimination compares two or more classes (groups) of data and highlights the differences
between them.
What it does:
• Contrasts the general features of the target class with those of one or more comparison classes.
Example:
• Comparing the characteristics of customers who buy electronics frequently with those who rarely buy them.
3⃣ Regression (Predictive)
Definition:
Regression is a data mining technique used to predict a continuous numeric value based on input
variables.
What it does:
• Models the relationship between a dependent (target) and one or more independent
(input) variables.
Example:
• Predicting the price of a house based on:
• Location
• Number of bedrooms
• Year built
Types of Regression:
• Linear regression (straight-line relationship)
• Non-linear / multiple regression (curved or multi-variable relationships)
Summary Table
Functionality | Type | Description | Example
Characterization | Descriptive | Summarizes features of a target class | "What are the characteristics of customers who bought electronics?"
Discrimination | Descriptive | Compares features of two or more classes | "How do frequent buyers differ from rare buyers?"
Regression | Predictive | Predicts numeric/continuous values | "What will be next month's sales based on trends?"
Q6. State Major Requirements and Challenges in Data Mining
Data mining involves extracting valuable patterns and knowledge from large data sets. However,
successful data mining requires meeting certain requirements and overcoming several challenges.
1. Scalability
• Data mining systems must handle very large volumes of data, possibly in terabytes or
petabytes.
2. High Performance
• Mining algorithms should be efficient and scalable, often using parallel and distributed processing to return results in acceptable time.
3. Data Quality
• The data used for mining should be clean, complete, accurate, and well-formatted.
4. User-Friendly Interface
• The system should provide interactive interfaces, visualization tools, and query support
for ease of use.
• Users should be able to interpret and explore results without technical expertise.
5. Privacy and Security
• Data mining must protect sensitive information and follow legal and ethical standards.
6. Handling Diverse Data Types
• The system must handle structured, semi-structured, and unstructured data (e.g., text,
images, videos).
Despite its benefits, data mining comes with several key challenges:
1. Privacy and Security Concerns
• Mining sensitive or personal data (e.g., medical records) raises serious privacy concerns.
2. Handling Large and Complex Data
• Dealing with big data that is distributed, high-dimensional, and dynamic is a big challenge.
3. Resource Requirements
• Requires powerful storage, processing, and optimization techniques.
4. Algorithm Complexity
• Many mining algorithms are computationally expensive and difficult to tune for a specific problem.
5. Interpretability of Results
• The patterns and models discovered may be too complex for end-users to understand.
• Need for explainable AI and visualization tools.
6. Dynamic and Evolving Data
• Data can evolve over time; for example, customer preferences or fraud patterns.
• Models must be updated to adapt to dynamic environments.
7. Data Integration from Heterogeneous Sources
• Data often comes from multiple sources (databases, web, sensors), possibly with different
formats and semantics.
Conclusion
• Successful data mining systems must meet requirements such as scalability, high performance, good data quality, and usable interfaces.
• They must also overcome challenges like poor data quality, privacy concerns, and limited interpretability.
“Data mining is powerful, but only when supported by clean data, smart algorithms, and strong ethics.”
Q7: Define Market Basket Analysis with Example
Market Basket Analysis (MBA) is a data mining technique used to discover associations or
relationships between items that customers buy together frequently.
Definition:
Market Basket Analysis is a technique used to identify patterns of co-occurrence among items in large
datasets, typically in the context of transactional data, such as sales records in a retail store.
It is commonly used in association rule mining, where we try to find rules like:
If a customer buys item A, they are likely to buy item B.
Real-Life Example
Imagine a supermarket analyzing customer transactions. Market Basket Analysis might discover:
“Customers who buy bread and butter often also buy jam.”
Market Basket Analysis uses Association Rule Mining, which involves three main metrics:
Metric | Description
Support | How frequently the items appear together across all transactions
Confidence | How often the rule holds: the likelihood of buying B given that A was bought
Lift | How much more likely items occur together than if they were independent
Example:
Suppose a store has 1,000 transactions and 80 of them contain both bread and butter.
So,
• Support = 80 / 1000 = 8%
• If, for instance, 100 transactions contain bread, then Confidence(bread → butter) = 80 / 100 = 80%
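A minimal Python sketch of how these metrics can be computed from a list of transactions (the item names and numbers below are illustrative, not taken from the notes):

```python
# Illustrative transactions (hypothetical data)
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """How much more often A and B co-occur than expected under independence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "butter"}))          # 0.6
print(confidence({"bread"}, {"butter"}))     # 0.75
print(lift({"bread"}, {"butter"}))           # 0.9375
```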
Summary
• Market Basket Analysis finds items that are frequently bought together using support, confidence, and lift.
• It's widely used in retail, marketing, and e-commerce for upselling and product
recommendations.
Q8. What are Data Mining Issues?
Data mining is a powerful process for extracting valuable knowledge from large datasets. However, it
faces several issues that can affect the accuracy, efficiency, and usefulness of the results.
• Data may come from multiple heterogeneous sources (databases, files, web).
• Integrating these sources and transforming data into a suitable format is complex.
• Users often have limited knowledge about the data mining process.
• Questions about who owns the data and the mined knowledge can arise.
Issue | Description | Impact
Data Ownership / Legal | Rights and legal restrictions on the data | Access and use limitations
Conclusion
Understanding and addressing these data mining issues is crucial to extract reliable, useful, and ethical
insights from data. Data mining is not just about algorithms, but also about managing data quality,
privacy, scalability, and user needs.
Q9: Importance of Data Mining
Data mining is the process of extracting meaningful patterns, trends, and knowledge from large
datasets using algorithms and statistical methods.
• Provides data-driven insights that help managers and executives make informed
decisions.
• Example: Banks use data mining to decide which customers are creditworthy.
• Organizations that effectively use data mining can gain a strategic edge over competitors.
• Data mining automates the process of analyzing large, complex datasets efficiently.
• Example: Social media companies analyzing millions of user posts to understand trends.
Summary Table
Benefit | Description | Example
Discover Hidden Patterns | Find unknown relationships in data | Customer buying habits
Predict Future Trends | Forecast outcomes based on historical data | Stock price prediction
Gain Competitive Advantage | Better market positioning and innovation | Targeted advertising
Handle Big Data | Analyze huge datasets quickly and accurately | Social media trend analysis
Q10. Explain Interestingness of Association Rules
Association rule mining is a key technique in data mining used to discover relationships or patterns
among a set of items in large datasets, such as market basket analysis (e.g., customers who buy bread also
buy butter).
However, not all discovered rules are useful or meaningful. This is where the concept of
interestingness comes in — it helps us identify which rules are worth analyzing or actionable.
What is Interestingness?
Interestingness measures evaluate how valuable, surprising, or useful a discovered association rule is.
They help filter out trivial or unimportant rules and highlight those that provide significant insights.
1. Objective Measures
• Based on statistics of the data itself, e.g., support, confidence, and lift.
2. Subjective Measures
• Based on the user's beliefs and goals, e.g., unexpectedness (surprising rules) and actionability (rules the user can act on).
Summary
Feature | Explanation
Purpose of Interestingness | Filter and rank association rules
Example
In a supermarket dataset, consider the rule Bread → Butter with:
o Confidence = 70% (70% of people who bought bread also bought butter),
o Lift = 1.5 (meaning buying bread increases the chance of buying butter by 1.5 times
compared to chance alone).
If these values are high, the rule is interesting and useful for marketing campaigns or product
placements.
Q11: Describe Data Cleaning and Methods
Data Cleaning (also called Data Cleansing) is the process of detecting and correcting (or removing)
errors and inconsistencies in data to improve its quality and ensure accurate analysis.
Since real-world data is often noisy, incomplete, or inconsistent, data cleaning is a critical step in the
Data Mining and KDD process.
• Removes noise and errors that can mislead or degrade mining results.
• Helps in creating a clean dataset that reflects the true nature of the data.
1. Handling Missing Values
Methods:
• Ignore the record: Remove tuples (rows) with missing values if they are few.
• Fill with a global constant: Use a fixed value such as “Unknown” or “0”.
• Fill with mean/median/mode: Replace missing numeric values with the average or most
common value of that attribute.
• Predict missing values: Use machine learning or regression to estimate missing data.
2. Handling Noisy Data
Methods:
• Binning: Sort data values and group them into bins, then smooth by replacing values with
bin means or medians.
• Regression: Fit data to a regression model and replace noisy values with predicted values.
• Clustering: Group similar data points and replace outliers or noise based on cluster
properties.
3. Removing Duplicates
Methods:
• Identify and merge or delete duplicate records, usually by matching key attributes (e.g., customer ID).
4. Resolving Inconsistencies
Methods:
• Apply validation rules and standard formats (e.g., a single date format), and involve domain experts to resolve conflicts.
Summary Table
Problem | Description | Methods
Missing Values | Data is incomplete | Ignore record, fill with mean/mode, predict values
Noisy Data | Contains random errors | Binning, regression, clustering, outlier removal
Inconsistent Data | Conflicting or invalid entries | Validation rules, standard formats, expert input
Example
Suppose a customer database has missing ages, typos in names, duplicate entries, and inconsistent date formats. Data cleaning would fill or remove the missing ages, correct the typos, merge the duplicates, and standardize the date format before mining.
Conclusion
Data cleaning is a fundamental process that ensures data quality and improves the effectiveness of data
mining algorithms. Ignoring this step can lead to inaccurate, misleading results.
Q12. Handling Missing Values in Real-World Data
In many real-world datasets, some data entries are missing — meaning certain attribute values are not
recorded or are unavailable. Missing values can occur due to various reasons such as:
• Human error during data entry
• Equipment or sensor failure
• Respondents skipping survey questions
• System glitches
Missing data can negatively impact the quality of data analysis and the accuracy of data mining models
if not properly handled.
There are several approaches to handle missing data depending on the context, the amount of missing
data, and the data mining task.
1. Deletion Methods
• Listwise Deletion: Remove entire records (rows) that contain missing values.
o Simple but may result in significant data loss if many records have missing data.
o Appropriate only when missing data is minimal and random.
• Pairwise Deletion: Use all available data without deleting entire records. For example,
compute statistics using pairs of variables where data is present.
o Retains more data but can complicate analysis and produce inconsistent sample
sizes.
2. Imputation Methods
• Mean/Median/Mode Imputation:
o Replace missing numeric values with the mean or median of the attribute.
o Replace missing categorical values with the mode (most frequent category).
• Regression Imputation:
o Predict the missing value from the other attributes using a regression model.
• Multiple Imputation:
o Generate several plausible values for each missing entry, analyze each completed dataset, and combine the results.
3. Algorithms That Tolerate Missing Values
• Some machine learning algorithms, like decision trees or random forests, can handle
missing values internally by splitting based on available data.
4. Indicator (Flag) Variables
• Create a new binary attribute indicating whether a value was missing (1 if missing, 0 if
present).
Method | Description | Advantage | Disadvantage
KNN Imputation | Use nearest neighbors to impute | Captures complex patterns | Computationally expensive
Multiple Imputation | Multiple estimates of missing values | Statistically valid | Complex, time-consuming
Indicator Variables | Flag missing values explicitly | Captures missingness pattern | Adds complexity
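A minimal pandas sketch of mean and mode imputation plus an indicator column (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical patient data with missing entries
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "blood_group": ["A", "B", None, "A", "O"],
})

# Flag which ages were missing before imputing (indicator variable)
df["age_was_missing"] = df["age"].isna().astype(int)

# Numeric attribute: fill with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the mode (most frequent value)
df["blood_group"] = df["blood_group"].fillna(df["blood_group"].mode()[0])

print(df)
```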
Best Practices
• Choose imputation methods based on the amount of missing data, data type, and analysis
goals.
• Always validate the impact of your missing data handling on model performance.
Example
Suppose a healthcare dataset has missing blood pressure values. Simply deleting these records may
remove important patient data. Instead, using regression imputation based on age, weight, and heart rate
could provide reasonable estimates, preserving the dataset's integrity.
Q13: Handling Noisy Data – Techniques
Noisy data refers to data that contains errors, inconsistencies, or random variations which do not
represent the true underlying patterns. Noise can arise from:
• Measurement errors
• Data entry mistakes
• Transmission errors
• Sensor malfunctions
Noisy data can mislead analysis, reduce the accuracy of models, and increase computational complexity.
Handling noisy data is a key part of data preprocessing to ensure clean and reliable data for mining.
1. Binning
• Description: Binning smooths data by dividing the data into intervals (bins) and then
smoothing the values within each bin.
• How it works:
o Sort the values and partition them into equal-width or equal-frequency bins.
o Replace the values in each bin with the bin mean, median, or boundary values.
• Example: Temperatures measured every hour can be binned into daily averages to reduce
noise.
2. Regression
• Description: Use regression techniques to fit a model to the data and replace noisy values
with predicted values.
• How it works:
o Fit a regression model (e.g., linear regression) to the attribute.
o Replace noisy or suspect values with the values predicted by the model.
3. Clustering
• Description: Group similar data points into clusters and identify noise as points that do not
belong well to any cluster.
• How it works:
o Group the data into clusters; points that fall far from every cluster are treated as noise or outliers.
• Example: In customer segmentation, customers who don’t fit into any segment can be
considered noise.
4. Outlier Detection and Removal
• Description: Identify values that deviate strongly from the rest of the data and remove or correct them.
• How it works:
o Flag values that lie far from the norm, e.g., beyond three standard deviations from the mean or far from every cluster.
5. Smoothing by Averaging (Moving Average)
• Description: Replace each data point with the average of its neighboring points.
• How it works:
o Slide a fixed-size window over the data and replace each value with the mean of the values inside the window.
• Example: Smooth stock prices using moving average to reduce day-to-day volatility.
6. Discretization / Concept Hierarchy Generation
• Description: Reduce noise by replacing detailed numeric values with broader categories.
• How it works:
o Convert numeric data into categorical data.
o Use concept hierarchies (e.g., age groups instead of exact age).
• Example: Group ages into ranges like 0-18, 19-35, 36-60, etc.
Summary Table
Technique | Description | Best Suited For
Binning | Group data into bins and smooth values | Numeric data with local noise
Regression | Model data trends and replace noisy values | Continuous data with trend
Clustering | Group similar points and identify noise | Data with natural clusters
Outlier Detection | Identify and remove extreme values | Data with clear outliers
Smoothing by Averaging | Use neighboring averages to smooth data | Time-series or sequential data
Example
Suppose a sensor records temperature readings that occasionally contain impossible spikes:
• Detect outliers where readings exceed physical limits and remove them.
• Apply a moving average to smooth the remaining small fluctuations.
Conclusion
Handling noisy data is crucial to improve the quality and reliability of data mining outcomes. Choosing
the right technique depends on the nature of the data and the specific noise characteristics.
Q14. Explain Normalization (Min-Max, Z-Score)
What is Normalization?
Normalization is a data preprocessing technique used in data mining and machine learning to rescale
data attributes to a common scale without distorting differences in the ranges of values. It helps
improve the performance of many algorithms, especially those sensitive to the scale of data (e.g.,
distance-based methods like KNN, clustering).
• Different features may have different units and scales (e.g., age in years, income in dollars).
• Algorithms might be biased toward features with larger scales.
• Normalization brings all features to a comparable scale, improving convergence speed and
accuracy.
Main Techniques:
• Min-Max Normalization: rescales data to a specific range [new_min, new_max], preserving the shape of the distribution but sensitive to outliers.
v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min
• Z-Score Normalization: standardizes data to zero mean and unit variance, better for (approximately) normally distributed data.
v' = (v − mean) / standard deviation
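A small Python sketch of both techniques on an illustrative list of values:

```python
import numpy as np

values = np.array([20.0, 35.0, 50.0, 65.0, 80.0])  # illustrative data

# Min-Max normalization to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization (zero mean, unit variance)
z_score = (values - values.mean()) / values.std()

print(min_max)   # [0.   0.25 0.5  0.75 1.  ]
print(z_score)   # values centered around 0
```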
Q15: Explain Binning with Example
What is Binning?
Binning is a data preprocessing technique used to reduce noise and smooth data by grouping
continuous data values into a smaller number of intervals, called bins.
Instead of working with individual raw data points, data values are replaced with representative values
for each bin, which helps reduce the effect of minor observation errors or noise.
Purpose of Binning
• Reduce the effect of noise and minor observation errors.
• Simplify continuous data into a small number of intervals.
How Binning Works
1. Sort the data values in ascending order.
2. Divide the data into a set number of bins (intervals), which can be:
o Equal-width bins: Each bin covers an interval of equal size.
o Equal-frequency bins: Each bin contains roughly the same number of data points.
3. Smooth the data in each bin by replacing the data points with a representative value:
o Bin mean: Replace all points with the average value of the bin.
o Bin median: Replace all points with the median value of the bin.
o Bin boundaries: Replace each point with the closest bin boundary (minimum or maximum of the bin).
• Easy to implement.
• Reduces noise and smooths fluctuations.
Summary
Step | Description
Sort | Arrange the values in ascending order
Partition | Divide the values into equal-width or equal-frequency bins
Smooth | Replace values with the bin mean, median, or boundaries
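Since the question asks for an example, here is a small sketch (with illustrative numbers) of equal-frequency binning smoothed by bin means:

```python
import numpy as np

# Illustrative sorted data (e.g., prices)
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26])

# Step 1-2: partition into 3 equal-frequency bins of 3 values each
bins = data.reshape(3, 3)

# Step 3: smoothing by bin means — replace every value with its bin's mean
smoothed = np.repeat(bins.mean(axis=1), 3)

print(bins)      # [[ 4  8  9] [15 21 21] [24 25 26]]
print(smoothed)  # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]
```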
Q16: Data Smoothing Techniques
What is Smoothing?
Smoothing is a data preprocessing technique used to reduce noise and remove outliers from data,
making it easier to analyze and model. It helps in revealing important patterns and trends by eliminating
irregularities or random variations in data.
Smoothing is especially useful for time series data, sensor data, or any data with fluctuations or
measurement errors.
1. Binning
• Smooth values within bins using bin means or medians (see Q15).
2. Moving Average Smoothing
• Replace each value with the average of a fixed window of neighboring values.
3. Exponential Smoothing
• Weighted average in which recent values receive higher weight, controlled by a decay factor.
4. Median Smoothing
• More robust to outliers than moving average since median is less affected by extreme
values.
5. Gaussian Smoothing
• Weights neighboring values according to a Gaussian (bell-shaped) curve, so closer points contribute more.
Comparison Table
Method | Description | Advantage | Disadvantage
Exponential Smoothing | Weighted average with decay factor | Responsive, adaptable | Needs parameter tuning
Median Smoothing | Median of neighbors | Robust to outliers | Less smooth than average
Gaussian Smoothing | Weights based on Gaussian curve | Smooth, well-behaved | Computationally intensive
Summary
• Moving average and exponential smoothing are widely used for time series.
• Median smoothing is preferred when outliers are present.
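A short numpy sketch of moving-average smoothing over an illustrative noisy series:

```python
import numpy as np

# Illustrative noisy daily measurements
series = np.array([10, 12, 25, 11, 13, 12, 30, 14, 13, 12], dtype=float)

# 3-point moving average: each value becomes the mean of itself and its neighbours
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(series, kernel, mode="valid")

print(smoothed.round(2))  # the spikes at 25 and 30 are damped
```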
Q17: Describe Data Reduction and Dimensionality Reduction
Data Reduction refers to the process of reducing the volume or size of data while maintaining its
integrity and meaning for analysis. It helps make data mining faster and more efficient without losing
significant information.
Dimensionality Reduction is a specific type of data reduction focused on reducing the number of
input variables (features or attributes) in a dataset while preserving the essential properties of the data.
• High-dimensional data (many features) can cause the “curse of dimensionality” leading to
poor model performance.
1. Feature Selection
o Select a subset of relevant features and discard redundant or irrelevant ones.
2. Feature Extraction
o Transform the data into a lower-dimensional space of new features, e.g., Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA).
Comparison Table
Aspect | Data Reduction | Dimensionality Reduction
Purpose | Improve efficiency and reduce storage | Improve model performance and reduce complexity
Techniques | Sampling, aggregation, compression | Feature selection, feature extraction (PCA, LDA)
Example
Suppose a retail dataset contains 1 million transactions with 100 features each. Data reduction might keep a representative sample of the transactions, while dimensionality reduction (e.g., PCA) might compress the 100 features into a much smaller set of components that retain most of the information.
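A minimal scikit-learn sketch of dimensionality reduction with PCA (the data shape here is purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative dataset: 1,000 records with 100 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # fewer columns after reduction
```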
Summary
• Data Reduction decreases the dataset size while preserving meaningful information.
• Dimensionality Reduction specifically reduces the number of features/attributes.
• Both are essential preprocessing steps in data mining for handling large, complex datasets
efficiently.
Q18. Define Concept Hierarchy
A concept hierarchy is a structured representation that organizes data attributes or concepts into
multiple levels of abstraction or granularity, forming a hierarchy from general to specific.
Key Points
• Used mainly in OLAP, data mining, and knowledge discovery to support operations like
roll-up and drill-down.
Example
Location hierarchy: City → Country → Continent (e.g., Pune → India → Asia).
The concept hierarchy allows grouping or generalizing city-level data to the country or continent
level.
• Efficient Querying: Supports OLAP operations like roll-up (aggregating data) and drill-
down (breaking down data).
• Improves Interpretability: Easier to understand broad trends than raw detailed data.
Summary
Aspect | Description
Definition | Organization of attribute values into levels from specific to general
Main Use | Supports roll-up and drill-down in OLAP and multi-level data mining
Q19: Correlation Analysis
Correlation Analysis is a statistical method used to measure and describe the strength and
direction of the relationship between two variables.
• It helps understand how one variable changes when another variable changes.
• The result is a correlation coefficient that quantifies this relationship.
Correlation Coefficient
• The most common measure is Pearson's correlation coefficient r, which ranges from −1 to +1.
• r close to +1 indicates a strong positive relationship, r close to −1 a strong negative relationship, and r near 0 little or no linear relationship.
Types of Correlation
• Positive correlation: both variables move in the same direction.
• Negative correlation: one variable increases while the other decreases.
• No (zero) correlation: no linear relationship between the variables.
1. Understanding Relationships
o Helps identify how variables are related, which is key in exploratory data analysis.
2. Feature Selection
3. Data Reduction
o Correlated variables can be combined or one can be discarded, reducing
dimensionality.
4. Predictive Modeling
5. Detecting Multicollinearity
Summary Table
Type | Meaning | Example
Positive Correlation | Both variables increase (or decrease) together | Advertising spend and sales
Negative Correlation | One variable increases as the other decreases | Price and demand
No Correlation | No linear relationship between the variables | —
Example
Suppose a company records its monthly advertising spend and monthly sales. Correlation analysis shows r = 0.9, indicating a strong positive correlation: as advertising increases,
sales tend to increase.
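A small numpy sketch of computing the Pearson correlation coefficient (the advertising/sales figures are made up for illustration):

```python
import numpy as np

# Illustrative monthly figures (in thousands)
advertising = np.array([10, 15, 20, 25, 30, 35])
sales       = np.array([100, 130, 155, 190, 210, 250])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(advertising, sales)[0, 1]
print(round(r, 3))  # close to +1 => strong positive correlation
```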
Conclusion
Correlation analysis is a fundamental tool in data mining that helps understand and quantify
relationships between variables, aiding in feature selection, model building, and data
simplification.
Q20. What is Data Discretization and Concept Hierarchy Generation?
1. Data Discretization
Definition:
Data discretization is the process of transforming continuous (numeric) attributes into discrete
(categorical) intervals or bins. This helps simplify the data, reduce its complexity, and make it
easier to analyze, especially for algorithms that work better with categorical data.
• Many data mining algorithms (e.g., decision trees, association rule mining) perform better
or require categorical data.
Common Discretization Methods
Method | Description
Equal-width binning | Divides the value range into intervals of equal size
Equal-frequency binning | Each interval contains roughly the same number of records
Entropy-based | Uses information gain to choose intervals that best separate classes
Clustering-based | Groups values into clusters and treats each cluster as an interval
Example: Discretizing the continuous attribute Age into:
• Young: 18–30
• Middle-aged: 31–50
• Senior: 51–70
2. Concept Hierarchy Generation
Definition:
Concept hierarchy generation is the process of building a hierarchy (levels) of concepts that
represent data attributes at various levels of abstraction, from detailed to general.
• Enables multi-level analysis and roll-up/drill-down operations in OLAP and data mining.
Example:
• Location: Street → City → State → Country
• Time: Day → Month → Quarter → Year
Q21: Data Integration and Its Issues
Data Integration is the process of combining data from multiple heterogeneous sources into a
unified, consistent view for analysis and mining.
Example: A company keeps customer records in a CRM system, sales records in a transactional database, and clickstream data in web server logs. Data integration merges these into a single dataset for analysis, showing customer behavior, sales trends, and web interactions together.
Data integration is a challenging task due to differences in data sources and formats. Common issues
include:
1. Data Heterogeneity
• Sources differ in formats, schemas, and naming conventions (e.g., "DOB" vs. "Date_of_Birth"), which complicates matching attributes.
2. Redundancy and Inconsistency
• Same data may appear in multiple sources but with conflicting values.
• For example, a customer’s address might differ between systems.
• Requires conflict resolution and data cleansing.
3. Volume and Scalability
• Efficient algorithms and storage solutions are necessary to handle big data.
4. Security and Privacy
• Data from different sources may have varying security and privacy requirements.
Summary Table
Issue | Description | Impact
Redundancy & Inconsistency | Conflicting data across sources | Inaccurate or unreliable integrated data
Schema Integration | Different schemas and field names | Complex mapping and transformation required
Data Quality | Missing, noisy, or erroneous data | Poor analysis results if unaddressed
Volume and Scalability | Large datasets across multiple sources | Performance and storage challenges
Conclusion
Data Integration is essential for creating a unified data view from diverse sources but involves
challenges such as heterogeneity, inconsistency, schema differences, and data quality issues.
Addressing these issues requires careful planning, preprocessing, and the use of specialized tools and
techniques.
Q22. Preprocessing Steps in Data Mining
Data preprocessing is a crucial initial step in the data mining process. It involves cleaning,
transforming, and organizing raw data into a suitable format for analysis. Proper preprocessing
improves the quality of the data and helps mining algorithms produce better and more reliable results.
1. Data Cleaning
• Purpose: Improve data quality before mining.
• Tasks:
o Handling missing values: Fill in or remove incomplete entries.
o Removing noise: Smooth noisy data using techniques like binning or regression.
o Correcting inconsistencies: Fix conflicting data entries.
2. Data Integration
• Purpose: Combine data from multiple heterogeneous sources into a unified dataset.
• Example: Integrating customer data from sales, marketing, and support databases.
3. Data Transformation
• Purpose: Convert data into appropriate forms for mining.
• Tasks: Normalization, aggregation, generalization, and discretization.
4. Data Reduction
• Purpose: Reduce data volume while keeping analytical quality.
• Techniques: Sampling, aggregation, dimensionality reduction, and data compression.
5. Data Discretization
• Purpose: Convert continuous attributes into discrete intervals or concept hierarchies.
Summary Table
Step | Tasks | Purpose
Data Cleaning | Handle missing, noisy, and inconsistent data | Improve data quality
Data Integration | Combine data from multiple sources | Build a unified dataset
Data Transformation | Normalize, aggregate, discretize data | Prepare data for mining
Data Reduction | Sample, aggregate, compress data | Reduce volume while preserving information
Q23: Data Generalization
Data Generalization is a data abstraction process that summarizes detailed data into a higher-level,
more compact form.
• It replaces low-level, detailed data with concepts or categories from a concept hierarchy.
• Helps reduce the complexity and volume of data while preserving important patterns.
• Often used in data preprocessing and knowledge discovery to enable easier interpretation
and analysis.
1. Concept Hierarchy
o Data attributes are organized into a hierarchy of concepts, ranging from detailed
(low-level) to general (high-level).
2. Aggregation
o Data values are mapped from specific values to higher-level concepts.
o For example, individual ages (23, 29, 34) can be generalized to “Adult”.
3. Result
o The dataset size reduces because many specific data points are grouped into fewer
generalized categories.
o The generalized data retains meaningful information for mining patterns or trends.
Example
Original dataset:
Employee ID | Age
101 | 23
102 | 29
103 | 34
104 | 52
105 | 58
Generalized dataset:
Employee ID | Age Group
101 | Adult
102 | Adult
103 | Adult
104 | Middle-aged
105 | Middle-aged
Summary
Aspect | Description
Definition | Summarizing detailed data into higher-level concepts using a concept hierarchy
Benefit | Smaller, easier-to-interpret dataset that still preserves the main patterns
Q24: Data Transformation Techniques
Data transformation is a key preprocessing step in data mining that converts data into a suitable format
or structure for analysis. It enhances data quality and improves the performance of mining algorithms.
1. Smoothing
Definition:
Smoothing reduces noise and random fluctuations in data to reveal important patterns and trends.
How it works:
• Replaces each value with a value derived from its neighborhood (bin, window, or fitted model).
Common methods:
• Binning, moving average, and regression.
Purpose:
• Remove noise.
• Highlight trends.
2. Normalization
Definition:
Normalization rescales numeric attributes to a common scale without distorting differences in the ranges.
Common techniques:
• Min-Max normalization and Z-score standardization (see Q14).
Purpose:
• Bring attributes measured on different scales to a comparable range so that no single attribute dominates.
3. Discretization
Definition:
Discretization converts continuous data into discrete intervals or categories.
How it works:
• Divide the range of a continuous attribute into intervals and replace values with interval labels.
Methods:
• Equal-width binning: Fixed interval size.
• Equal-frequency binning: Each interval holds roughly the same number of records.
Purpose:
• Simplify data.
• Prepare continuous attributes for algorithms that expect categorical input.
Q25: Data Mining Task Primitives
Primitives in data mining refer to the basic building blocks or commands used to specify a data
mining task clearly and precisely. They define what the data mining system should do and how it should
perform the mining.
Primitives are like a set of instructions or parameters that help the user describe the mining process,
including:
1. Task-Relevant Data – which database, tables, and attributes to mine.
2. Kind of Knowledge to be Mined – e.g., association rules, classification, clustering.
3. Background Knowledge – e.g., concept hierarchies over the attributes.
4. Interestingness Measures – thresholds such as minimum support and confidence.
5. Presentation and Visualization – how the discovered patterns should be displayed.
• Facilitate communication between the user and the data mining system.
Suppose you want to mine association rules from a sales database: the primitives would specify the relevant sales tables, the kind of knowledge (association rules), any concept hierarchies on products, the minimum support and confidence thresholds, and how the resulting rules should be presented.
Summary
• Primitives let the user state precisely what to mine, from which data, and how to judge and present the results.
Q26: The Apriori Algorithm
Apriori is a classic algorithm used in data mining for finding frequent itemsets and generating
association rules in transactional databases. It is widely used for market basket analysis to discover
relationships between items bought together.
Key Concepts:
• Frequent Itemsets: Sets of items that appear together in transactions with frequency above
a specified threshold called minimum support.
• Association Rules: Implication rules of the form A → B, meaning if itemset A appears,
itemset B is likely to appear.
• Support: Proportion of transactions containing an itemset.
• Confidence: For a rule A → B, the proportion of transactions containing A that also contain B.
How the Algorithm Works:
1. Scan the database to count the support of each item; keep the items that meet minimum support (frequent 1-itemsets).
2. Generate candidate 2-itemsets by joining the frequent 1-itemsets.
3. Count support for candidate 2-itemsets; retain those meeting minimum support.
4. Repeat the process for k-itemsets using (k-1)-itemsets until no more frequent itemsets are
found.
5. Generate association rules from frequent itemsets that satisfy minimum confidence.
Example
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Cola
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Cola
• Minimum Support = 60% (i.e., itemsets must appear in at least 3 out of 5 transactions)
Item | Support Count | Support %
Bread | 4 | 80%
Milk | 4 | 80%
Diaper | 4 | 80%
Beer | 3 | 60%
Eggs | 1 | 20%
Cola | 2 | 40%
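A small Python sketch of the first two Apriori passes (counting item support and checking candidate pairs) on the transactions above:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
min_support = 0.6  # 60% => at least 3 of the 5 transactions

# Pass 1: count individual items and keep the frequent ones
counts = Counter(item for t in transactions for item in t)
frequent_1 = {item for item, c in counts.items() if c / len(transactions) >= min_support}
print(sorted(frequent_1))  # ['Beer', 'Bread', 'Diaper', 'Milk']

# Pass 2: candidate 2-itemsets from frequent items, then count their support
candidates = [frozenset(pair) for pair in combinations(sorted(frequent_1), 2)]
for cand in candidates:
    sup = sum(cand <= t for t in transactions) / len(transactions)
    if sup >= min_support:
        print(set(cand), sup)  # frequent 2-itemsets
```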
Summary
• Apriori finds frequent itemsets level by level (1-itemsets, 2-itemsets, …) and then derives association rules that meet the minimum confidence.
Q27: Association Rules with Support and Confidence
Association rules are if-then patterns that describe relationships between items in large datasets.
• For example: If a customer buys bread, they are likely to buy butter.
Key Terms
• Itemset: a set of one or more items, e.g., {Bread, Milk}.
• Support: the fraction of transactions that contain the itemset.
• Confidence: for a rule A → B, how often B appears in transactions that contain A.
Example Transactions
Transaction ID | Items
1 | Bread, Milk
• Use Apriori algorithm or similar to find itemsets with support above a threshold.
• Example: Minimum support = 60% (support ≥ 0.6)
Final Output
A set of association rules, such as Bread → Milk, each reported with its support and confidence values.
Q28: FP-Growth Algorithm
FP-Growth (Frequent Pattern Growth) is an efficient and scalable data mining algorithm
used to find frequent itemsets in a transactional database, an alternative to the Apriori
algorithm.
Unlike Apriori, FP-Growth does not generate candidate itemsets explicitly, which makes it
much faster especially for large datasets.
Key Concepts
• FP-Tree (Frequent Pattern Tree): A compact data structure that stores essential
information about frequent patterns in the dataset.
• Frequent Itemsets: Sets of items that appear together in transactions at least as often
as a user-specified minimum support threshold.
1. Scan the database once to find all frequent items (those that meet minimum support).
2. Build the FP-tree in a second scan:
o Sort the frequent items in each transaction by descending frequency.
o Insert transactions one by one into the FP-tree, following the order of sorted frequent items.
3. Mine the FP-tree:
o For each frequent item, build its conditional pattern base and conditional FP-tree, and extract the frequent patterns.
Example
TID | Items
1 | f, a, c, d, g, i, m, p
2 | a, b, c, f, l, m, o
3 | b, f, h, j, o
4 | b, c, k, s, p
5 | a, f, c, e, l, p, m, n
Item Count
f 4
a 3
c 3
b 3
m 3
p 3
others <3
Frequent items: f, a, c, b, m, p
• For example, frequent patterns involving ‘p’ might be: {p}, {f, p}, {a, p}, etc.
Advantages of FP-Growth
• Requires only two scans of the database.
• No explicit candidate generation.
• Compresses the database into a compact FP-tree structure.
Q29: Compare Apriori and FP-Growth
Aspect | Apriori | FP-Growth
Performance | Can be slow and inefficient for large datasets due to costly candidate generation and repeated database scans. | Generally faster and more scalable than Apriori, especially on large and dense datasets.
Handling Sparse Data | Performs well on sparse datasets where candidate sets are small. | Also effective, but the FP-tree size can grow for very dense datasets.
Use Cases | Suitable for small to medium datasets or where simplicity is preferred. | Preferred for large datasets with many frequent patterns.
Summary
• Apriori is conceptually simpler but can be inefficient because it generates many
candidate itemsets and requires multiple database scans.
• FP-Growth is more efficient by compressing the dataset into an FP-tree and mining
frequent patterns directly without candidate generation.
• FP-Growth generally outperforms Apriori on large, dense datasets but requires more
memory.
Q30. Frequent Itemset Mining Using Support and Confidence Thresholds
Frequent itemset mining is a fundamental task in data mining where the goal is to find sets of
items (itemsets) that frequently occur together in a transactional database.
Key Concepts
• Support: The proportion (or count) of transactions in the dataset that contain the
itemset.
Mining Process
1. Frequent Itemset Generation:
o Find all itemsets whose support is greater than or equal to the minimum support.
2. Rule Generation:
o From each frequent itemset, generate association rules whose confidence meets the minimum confidence threshold.
Example
TID | Items
1 | Bread, Milk
2 | Bread, Diaper
3 | Milk, Diaper
4 | Bread, Milk
5 | Milk, Diaper
With a minimum support of, say, 60% (3 of 5 transactions), the frequent itemsets are {Bread} (60%), {Milk} (80%), and {Diaper} (60%); no 2-itemset reaches the threshold.
Summary Table
Term | Description
Itemset | A set of one or more items
Frequent itemset | An itemset whose support meets the minimum support threshold
Minimum support / confidence | User-defined thresholds used to filter itemsets and rules
Why Important?
Frequent itemset mining is the basis for association rule mining, market basket analysis, and recommendation systems.
Q31: Define Itemset, Support, Confidence, and Association Rule
1. Itemset
• A collection of one or more items, e.g., {Bread, Milk}.
2. Support
• Formula: Support(A) = (number of transactions containing A) / (total number of transactions).
3. Confidence
• Formula: Confidence(A → B) = Support(A ∪ B) / Support(A).
4. Association Rule
• An implication of the form A → B stating that transactions containing A tend to also contain B, reported with its support and confidence.
Q32: Classification of Association Rule Mining
Association Rule Mining is a key data mining task that discovers interesting relationships
(rules) among items in large datasets. These rules help in understanding patterns, like which
products are bought together.
Association rule mining can be classified based on different criteria such as the nature of
rules, constraints, and the type of data involved.
4. Based on Constraints
• Constrained Association Rules:
Mining rules that satisfy user-specified constraints such as rule length, specific items,
or thresholds.
Example: Rules must contain “Milk” or have a confidence > 80%.
Q33: Methods to Improve the Efficiency of Apriori
Background
The Apriori algorithm is a classic method for mining frequent itemsets and association rules.
However, it can be computationally expensive because it:
• Generates a very large number of candidate itemsets, and
• Scans the entire database repeatedly (once per itemset size).
Techniques to Improve Efficiency
• Pruning:
Use the Apriori property: All subsets of a frequent itemset must also be frequent.
So, candidate itemsets containing any infrequent subset are discarded early.
• Hash-based Techniques:
Use a hash tree to store candidate itemsets and count their support efficiently. This
helps reduce the number of candidates by hashing itemsets into buckets.
• Instead of scanning the entire database for every candidate itemset, use transaction
reduction:
o Remove transactions that do not contain any frequent itemsets from previous
passes.
• Use partitioning:
o Split the database into smaller partitions.
• Hash Trees:
Store candidate itemsets in a hash tree for efficient counting.
4. Sampling
• Mine a random sample of the database (often with a slightly lowered support threshold).
• This reduces the size of data scanned but may miss some itemsets.
• Often combined with a second pass to verify candidates on the full database.
6. Transaction Reduction
• After identifying frequent itemsets of size k, remove transactions that do not contain
any frequent k-itemsets from future scans.
Conclusion
• These strategies reduce candidate generation and the number of database scans, making Apriori practical for larger datasets.
Q34: Boolean vs Quantitative Association Rules
Aspect | Boolean Association Rules | Quantitative Association Rules
Data Type | Binary data: items are either present (1) or absent (0) | Numerical data: items have continuous or discrete numeric values
Item Representation | Items are treated as yes/no or true/false attributes | Items have numeric attributes such as quantity, price, weight
Summary
• Boolean association rules focus on whether items appear together or not — a simple
presence/absence model.
• Quantitative association rules additionally involve numeric attributes, which are usually discretized into intervals before mining.
Q35: Constraint-Based Association Mining and Meta-Rules
Constraint-based mining lets the user guide the mining process with conditions on the rules to be found.
• Instead of mining all possible patterns, the system focuses on patterns that satisfy
given constraints.
Meta-Rules are high-level rules or templates that define the kinds of constraints or patterns
users want to mine.
o Example: "Find association rules where the antecedent contains 'age' and the
consequent contains 'income'".
• Meta-rule 1: Find classification rules where the target class is 'high income'.
• Meta-rule 2: Rules must involve the attribute 'age' in the antecedent.
Using these meta-rules, the mining process only searches for rules matching these criteria.
Summary
Aspect | Description
Constraint-Based Mining | Mining patterns under user-defined constraints
Meta-Rules | High-level templates that specify the form of rules to be discovered
Q36: Market Basket Analysis with Support and Confidence Calculation
Market Basket Analysis (MBA) is a data mining technique used to discover associations or
patterns between products customers buy together. It helps retailers understand product
relationships and optimize sales strategies, such as product placement or promotions.
Example Case
Suppose there are 5 transactions in total; 4 of them contain Bread, and 3 of them contain both Bread and Milk. Consider the rule:
Bread → Milk
• Support Calculation:
Count transactions with both Bread and Milk:
Support = 3 / 5 = 60%
• Confidence Calculation:
Confidence = (transactions with Bread and Milk) / (transactions with Bread) = 3 / 4 = 75%
Interpretation
• The support of 60% means that Bread and Milk are bought together in 60% of all
transactions.
• The confidence of 75% means that in 75% of the transactions where Bread is bought,
Milk is also bought.
• Support ensures that the rule applies to a significant portion of the dataset (avoiding
rare associations).
• Confidence indicates the strength or trustworthiness of the rule.
Q37: Frequent Pattern Mining Classification Criteria
Frequent pattern mining algorithms can be classified based on several important criteria:
1. Type of Patterns Mined
• Frequent Itemsets:
Sets of items appearing frequently together (e.g., in market basket analysis).
• Frequent Sequences:
Ordered sequences of items occurring frequently (e.g., customer purchase sequences).
• Frequent Subgraphs:
Patterns in graph data (e.g., social networks, chemical structures).
2. Mining Approach
• Apriori-based Methods:
Use candidate generation and pruning (e.g., Apriori algorithm).
Generate candidate patterns level-wise and prune infrequent ones.
• Pattern-Growth Methods:
Avoid candidate generation by compressing the data into structures such as the FP-tree (e.g., FP-Growth).
4. Handling of Constraints
• Constraint-Based Mining:
Incorporates user-specified constraints (e.g., length, value constraints) during mining
to focus on relevant patterns.
• Unconstrained Mining:
Mines all frequent patterns without restrictions.
5. Support Specification
• Absolute Support:
Minimum number of transactions containing the pattern.
• Relative Support:
Percentage or proportion of transactions containing the pattern.
6. Output Type
• All frequent patterns, closed patterns, or maximal patterns only.
7. Type of Data
• Transactional Data:
Market basket transactions, logs.
• Sequence Data:
Time-ordered data, clickstreams.
• Graph Data:
Networks, molecules.
Summary Table
Conclusion
Understanding these classification criteria helps in selecting the right frequent pattern mining
algorithm tailored to the data characteristics, mining goals, and constraints.
Q38. Multidimensional Association Rules
Key Idea
• Instead of mining associations only among items in a single dimension (e.g., products
purchased), multidimensional rules consider multiple attributes such as customer
demographics, time, location, product categories, etc.
Rule Format
age(X, "25–35") ∧ income(X, "high") ⇒ buys(X, "laptop")
Interpretation:
Customers aged 25-35 with high income are likely to buy laptops.
Applications
• Targeted marketing and customer segmentation based on demographics, time, and location.
Advantages
• Captures richer patterns than single-dimensional (item-only) rules.
Summary
Aspect | Description
Definition | Association rules involving two or more attributes (dimensions)
Example | age ∧ income ⇒ buys
• This is useful when some items are rare but important, and setting a high global
minimum support would exclude them.
• It relaxes the minimum support constraint for certain items, enabling discovery of
interesting patterns involving infrequent but meaningful items.
• In real-world data, some important items or events occur less frequently but are still
valuable.
• For example, in market basket analysis, rare but high-value products (like luxury
goods) might have low support.
• Reduced minimum support helps to capture both frequent and rare patterns
effectively.
How It Works
• Set: a global minimum support of 50%, but a reduced minimum support of 10% for the rare Item B.
• Itemsets involving Item B will be considered frequent if their support ≥ 10%, even if
this is below the global 50% threshold.
Advantages
Summary
Aspect Description
In association rule mining, interestingness measures help evaluate the quality and
usefulness of the discovered rules beyond just support and confidence. These measures help
filter out trivial or misleading rules and highlight the most relevant patterns for decision-
making.
• Support and confidence alone may generate too many rules, some of which may be
obvious or uninformative.
• Lift:
Indicates how much more often A and B occur together than expected if independent.
Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B))
o Lift > 1 → Positive correlation (items occur together more than chance)
o Lift = 1 → Independence
o Lift < 1 → Negative correlation (items occur together less than chance)
• Leverage:
Shows the difference between actual co-occurrence and what would be expected if A
and B were independent.
Leverage(A → B) = Support(A ∪ B) − Support(A) × Support(B)
• Conviction:
Measures the degree of implication; higher conviction means fewer exceptions to the
rule.
Conviction(A → B) = (1 − Support(B)) / (1 − Confidence(A → B))
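A short Python sketch computing these measures from support values (the numbers are illustrative):

```python
def lift(sup_ab, sup_a, sup_b):
    # How much more often A and B co-occur than expected under independence
    return sup_ab / (sup_a * sup_b)

def leverage(sup_ab, sup_a, sup_b):
    # Difference between observed and expected co-occurrence
    return sup_ab - sup_a * sup_b

def conviction(sup_ab, sup_a, sup_b):
    # Degree of implication; infinite when confidence equals 1
    confidence = sup_ab / sup_a
    return float("inf") if confidence == 1 else (1 - sup_b) / (1 - confidence)

# Illustrative values: Support(A)=0.4, Support(B)=0.5, Support(A and B)=0.3
print(round(lift(0.3, 0.4, 0.5), 2))        # 1.5 -> positive correlation
print(round(leverage(0.3, 0.4, 0.5), 2))    # 0.1
print(round(conviction(0.3, 0.4, 0.5), 2))  # 2.0
```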
Conclusion
Aspect | Classification | Prediction
Goal | To classify data into distinct groups (classes). | To predict future values or trends.
Nature of Target Variable | Categorical (nominal or ordinal). | Numeric (continuous).
Summary
Aspect | Supervised Learning | Unsupervised Learning
Definition | Learning from labeled data where input-output pairs are known | Learning from unlabeled data without predefined outputs
Data Requirement | Requires labeled dataset (input features + target labels) | Uses unlabeled dataset (only input features)
Training Process | Model learns a mapping function from inputs to outputs | Model learns intrinsic data structure or distribution
Summary
Bayes’ Theorem
P(H | X) = [ P(X | H) × P(H) ] / P(X)
where P(H | X) is the posterior probability of hypothesis H given data X, P(X | H) is the likelihood, P(H) is the prior probability, and P(X) is the evidence. Naïve Bayes classifiers apply this theorem while assuming the attributes are conditionally independent given the class.
Summary
Concept | Explanation
Posterior P(H | X) | Probability of the class after seeing the data
Prior P(H) | Probability of the class before seeing the data
Likelihood P(X | H) | Probability of the data given the class
Steps
• Compute the entropy of the full dataset and the expected entropy after splitting on each attribute.
• Select the attribute with maximum information gain as the decision node.
• Repeat recursively on each branch until the subsets are pure or a stopping criterion is met.
Example Dataset
Final Result
A Decision Tree is a supervised learning algorithm used for classification (and regression)
tasks. It models decisions and their possible consequences as a tree structure, where:
The algorithm recursively splits the dataset based on attribute values, aiming to create subsets
that are pure (i.e., contain data points mostly from one class).
Key Steps
1. Start with the entire training dataset at the root node.
2. Select the best attribute to split the data based on a criterion (e.g., Information Gain,
Gini Index).
3. Split the data into subsets according to the selected attribute’s values.
o If the subset is pure (all same class) or meets stopping criteria (max depth,
minimum samples), assign a class label.
5. Build the tree until all data are classified or stopping conditions are met.
Splitting Criteria
• Information Gain:
Measures the reduction in entropy after a split; choose the attribute with the highest gain.
• Gini Index:
Measures impurity of a dataset. Choose attribute that minimizes Gini impurity.
Example
Suppose you want to classify whether a customer will buy a product based on attributes: Age
(Young, Middle, Old), Income (High, Medium, Low).
ID | Age | Income | Buys Product?
1 | Young | High | No
4 | Old | High | No
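A minimal scikit-learn sketch of training a decision tree on data of this shape (the encoded sample rows are illustrative, not the full dataset from the notes):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative encoding: Age (0=Young, 1=Middle, 2=Old), Income (0=Low, 1=Medium, 2=High)
X = [[0, 2], [0, 1], [1, 2], [2, 2], [2, 0], [1, 1]]
y = ["No", "Yes", "Yes", "No", "Yes", "Yes"]   # Buys product?

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["Age", "Income"]))  # learned split rules
print(tree.predict([[0, 2]]))  # classify a young, high-income customer
```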
Advantages
Summary
Step Description
Example of a Rule
IF age = "young" AND income = "high" THEN buys_computer = "yes"
1. Rule Generation:
Extract rules from the training dataset using algorithms like RIPPER, CN2, or OneR.
2. Rule Pruning:
Simplify rules to remove noise and avoid overfitting by eliminating unnecessary
conditions.
3. Rule Selection:
Choose the most accurate and relevant rules based on coverage and accuracy.
4. Classification:
Apply the rules to new instances — the first matching rule usually decides the class.
Advantages
Disadvantages
• May have conflicts when multiple rules match; requires conflict resolution strategies.
Common Rule-Based Algorithms
• RIPPER
• CN2
• OneR
Summary
Aspect Description
Definition
An attribute selection measure is a heuristic used in decision tree induction to pick the splitting attribute at each node (common measures: Information Gain, Gain Ratio, Gini Index).
The goal is to choose the attribute that best separates the data into classes, creating subsets
that are more pure (i.e., contain mostly instances of a single class).
• Compute Information Gain = Entropy before split – weighted entropy after split.
Summary
Helps identify the best attribute for splitting the data at each node in a decision tree.
What is Cross-Validation?
• Cross-validation is a model evaluation technique that repeatedly splits the data into training and test parts so that every record is used for both training and testing.
• It helps estimate how well the model will perform on unseen data, reducing problems
like overfitting.
• When you train and test a classifier on the same data, the accuracy might be overly
optimistic.
1. k-Fold Cross-Validation
• The dataset is divided into k equal-sized folds; in each round, k−1 folds are used for training and the remaining fold for testing.
• This process repeats k times, each fold used once as the test set; the k accuracies are averaged.
Example:
With k = 10 and 1,000 records, each fold contains 100 records; the model is trained on 900 records and tested on the remaining 100, ten times.
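A minimal scikit-learn sketch of 10-fold cross-validation (the synthetic dataset is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data: 1,000 records, 10 features
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=10)  # 10-fold cross-validation

print(scores.round(2))          # accuracy on each of the 10 folds
print(scores.mean().round(3))   # averaged accuracy estimate
```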
Benefits of Cross-Validation
Summary Table
Aspect Description
Information Gain
• Information Gain (IG) is a measure used in decision tree algorithms (like ID3) to
select the best attribute for splitting the data.
• Higher Information Gain means the attribute better separates the classes.
Example (Simplified)
If splitting a dataset on the attribute "Outlook" reduces the entropy from 0.94 to 0.69, then Information Gain(Outlook) = 0.94 − 0.69 = 0.25.
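A small Python sketch of computing entropy and information gain for a toy binary-labeled dataset (the rows and labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy before the split minus the weighted entropy after splitting on one attribute."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - weighted

# Toy data: each row = [Outlook, Windy], label = Play?
rows = [["Sunny", "No"], ["Sunny", "Yes"], ["Rain", "No"], ["Rain", "Yes"], ["Overcast", "No"]]
labels = ["No", "No", "Yes", "No", "Yes"]

print(round(information_gain(rows, labels, 0), 3))  # gain from splitting on Outlook
print(round(information_gain(rows, labels, 1), 3))  # gain from splitting on Windy
```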
What is Logistic Regression?
Definition:
Logistic Regression is a supervised classification technique used to predict a categorical (usually binary) outcome from one or more input variables.
Unlike linear regression, which predicts continuous values, logistic regression predicts
probabilities and maps the result to a class label using a sigmoid function.
• It's ideal when the dependent variable (target) is categorical (usually binary).
Core Idea:
• Logistic regression calculates the probability that a given input X belongs to a certain
class (say, class 1).
Feature Description
What is Regression?
Regression is a supervised technique that models the relationship between a dependent (target) variable and one or more independent (input) variables in order to predict numeric values.
1. Linear Regression
Definition:
Linear regression models the relationship between the variables by fitting a straight line to
the data.
Mathematical Form:
y = b0 + b1·x (a straight line with intercept b0 and slope b1)
2. Non-Linear Regression
Definition:
Non-linear regression fits a curved relationship (e.g., polynomial, exponential, or logarithmic models) between the variables.
Key Differences:
• Linear regression assumes a straight-line relationship; non-linear regression can capture curves and more complex patterns.
Summary
• Use non-linear regression when the data shows curves or patterns that linear models
can't capture.
Q54. Define and Explain Backpropagation Algorithm
Definition
Backpropagation is a supervised learning algorithm used to train multi-layer neural networks by adjusting the connection weights so that the prediction error is minimized (typically with gradient descent).
It's called "backpropagation" because it propagates the error backward from the output layer
to the input layer during training.
Purpose of Backpropagation
• To minimize the difference (error) between the network's predicted output and the actual target output by iteratively updating the weights.
Network Structure
A typical feed-forward network trained with backpropagation has:
• Input layer
• One or more hidden layers
• Output layer
1. Forward Propagation: the inputs pass through the network layer by layer to produce an output.
2. Error Calculation: the output is compared with the target value using a loss function.
3. Backward Propagation: the error is propagated backward and the gradient of the loss with respect to each weight is computed.
4. Weight Update: each weight is adjusted in the direction that reduces the error (gradient descent).
• This process (forward + backward + update) is repeated over many epochs (iterations
over the training dataset) until the model converges (error becomes minimal).
Intuition
• It makes predictions.
• If predictions are wrong, it learns how wrong it was and adjusts each weight slightly
in the right direction.
Summary
Concept Description
• Each time they miss, you tell them how far off they were.
• They adjust their throw slightly.
• Eventually, they get better with every throw — that's like backpropagation adjusting
weights in a neural network.
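A compact numpy sketch of backpropagation for a one-hidden-layer network learning XOR (hyperparameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)              # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)              # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    # Forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward propagation of the squared error
    d_out = (out - y) * out * (1 - out)                    # gradient at output layer
    d_h = (d_out @ W2.T) * h * (1 - h)                     # gradient at hidden layer

    # Weight updates (gradient descent)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # predictions should be close to [0, 1, 1, 0] after training
```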
Q55: What is a Perceptron? Single-Layer vs Multi-Layer
What is a Perceptron?
• A perceptron takes multiple input values, applies weights to them, adds a bias, and
passes the result through an activation function (usually a step function or sign
function).
Goal:
To find optimal weights so the perceptron can classify input data correctly.
1. Single-Layer Perceptron
Definition:
A single-layer perceptron consists of only one layer of output nodes connected directly to
the input features.
Architecture:
• Input Layer → Output Neuron
(No hidden layers)
Suitable For:
• Simple, linearly separable problems (e.g., AND, OR logic); it cannot learn non-linearly separable functions such as XOR.
2. Multi-Layer Perceptron (MLP)
Definition:
A multi-layer perceptron has one or more hidden layers between the input and output
layers. It’s the foundation of modern neural networks.
Architecture:
• Input Layer → Hidden Layer(s) → Output Layer
Key Features:
• Uses non-linear activation functions and is trained with backpropagation.
Suitable For:
• Non-linearly separable problems (e.g., XOR) and complex tasks such as image and speech recognition.
Summary
• A single-layer perceptron can only draw one linear decision boundary, while a multi-layer perceptron can model complex, non-linear boundaries.
Q56: K-Means Clustering Algorithm
What is K-Means?
K-Means is an unsupervised clustering algorithm that partitions data into k groups.
• It’s used when you don’t have labels and want to find natural groupings in your
data.
• K-Means tries to minimize the distance between data points and their respective
cluster centers (called centroids).
1. Initialize: Choose the number of clusters, k, and randomly select k centroids.
2. Assign: Assign each data point to the nearest centroid based on distance (usually
Euclidean distance).
3. Update: Recalculate the centroid of each cluster (i.e., mean of all data points
assigned to that cluster).
4. Repeat: Repeat the assign and update steps until the assignments no longer change (convergence).
Mathematical Objective
K-Means minimizes the within-cluster sum of squared distances: J = Σ over clusters k Σ over points x in cluster k ||x − μk||², where μk is the centroid of cluster k.
Example
Point X Y
A 1 2
B 1 4
C 1 0
D 10 2
E 10 4
F 10 0
Step-by-step (with k = 2):
1. Initialize: pick two initial centroids, e.g., A (1, 2) and D (10, 2).
2. Assign: points A, B, C go to the first centroid and D, E, F to the second.
3. Update centroids: the new centroids are (1, 2) and (10, 2).
4. Reassign: the assignments do not change, so the algorithm converges with clusters {A, B, C} and {D, E, F}.
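A scikit-learn sketch of running K-Means on the six points above (k = 2):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],      # A, B, C
                   [10, 2], [10, 4], [10, 0]])  # D, E, F

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # e.g., [1 1 1 0 0 0] (cluster ids may be swapped)
print(kmeans.cluster_centers_)  # centroids at (1, 2) and (10, 2), in some order
```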
Advantages of K-Means
Limitations
• Affected by outliers.
Summary
Feature Description
Q57: K-Means vs K-Medoids Clustering
What is Clustering?
Clustering is an unsupervised learning technique used to group similar data points into
clusters, where:
1. K-Means Clustering
Concept:
• Each cluster is represented by the mean (centroid) of its points, which is usually not an actual data point.
Algorithm Steps:
• Initialize k centroids, assign points to the nearest centroid, recompute the centroids, and repeat until convergence.
2. K-Medoids Clustering
Concept:
• Each cluster is represented by an actual data point (the medoid) that minimizes the total dissimilarity to the other points in the cluster, which makes it more robust to outliers.
Comparison Table
Aspect | K-Means | K-Medoids
Cluster center | Mean of all points in the cluster | Actual data point (medoid)
Sensitivity to outliers | High (means shift easily) | Low (medoids are more robust)
Accuracy in noisy data | Lower (due to outliers) | Higher (more stable)
Used in | Large datasets, fast clustering | More robust clustering, smaller data
Example
When clustering customers by purchasing behavior:
• K-Means will use the average customer profile as the group center (which may not match any real customer).
• K-Medoids will choose the most representative customer as the group center.
When to Use
• Use K-Means for large datasets with little noise; use K-Medoids when robustness to outliers is important or cluster centers must be real data points.
Q58: Requirements of a Good Clustering Algorithm
What is Clustering?
Clustering is an unsupervised learning technique that groups similar data points into
clusters, where data points in the same cluster are more similar to each other than to those in
other clusters.
For a clustering algorithm to be practical and meaningful, it must meet the following
requirements:
1. Scalability
• Real-world data often contains millions of records, so the clustering method must
scale well in terms of both time and memory.
2. Ability to Handle Different Data Types
The clustering algorithm should be able to handle all relevant data types (numeric, categorical, mixed) or be adaptable.
3. Handling High-Dimensional Data
• The algorithm must be able to handle high-dimensional data (e.g., datasets with
hundreds or thousands of features).
Techniques: Dimensionality reduction (like PCA or t-SNE) is often used before clustering.
4. Ability to Handle Noise and Outliers
• Real data often contains noise and outliers; clustering methods should be robust and not get skewed by such data.
Example: DBSCAN can ignore noise points, whereas K-Means is sensitive to outliers.
5. Minimal Requirement for Domain Knowledge
• The algorithm should require as little prior knowledge as possible, like the number of clusters k.
• In some cases, automatically determining the number of clusters is preferred.
Summary Table
Requirement Explanation
Handling different data types | Should support numeric, categorical, and mixed data
Low domain knowledge dependency | Minimal need for user-defined inputs like number of clusters
Q59: Partitioning vs Hierarchical Clustering Methods
Clustering Overview
Clustering is an unsupervised learning technique used to group data into clusters such that:
• Partitioning Methods
• Hierarchical Methods
1. Partitioning Methods
Definition:
Partitioning methods divide the dataset into a predefined number (k) of clusters, where each
data point belongs to only one cluster.
Working:
• Move points between clusters to optimize an objective function (e.g., minimize intra-
cluster distance).
• Iterative process.
Examples:
• K-Means
• K-Medoids
• CLARANS
Suitable For:
• Large datasets where the number of clusters is known (or can be estimated) in advance.
Key Characteristics:
Feature | Description
Structure | Flat / non-hierarchical
2. Hierarchical Methods
Definition:
Hierarchical methods build a tree-like structure (dendrogram) showing how data points are
grouped together step-by-step.
Two Approaches:
1. Agglomerative (Bottom-Up): start with each point as its own cluster and repeatedly merge the closest clusters.
2. Divisive (Top-Down): start with one cluster containing all points and recursively split it.
Examples:
• Single-link
• Complete-link
• Average-link
• BIRCH, CURE
Suitable For:
• Smaller datasets and cases where a nested (tree-like) cluster structure is informative.
Key Characteristics:
Feature | Description
Number of clusters | Not required initially (can cut dendrogram at desired level)
Structure | Hierarchical / tree-like
Comparison
Aspect | Partitioning Methods | Hierarchical Methods
Sensitivity to noise/outliers | High (e.g., in K-Means) | Moderate (depends on linkage)
Summary
• Use Partitioning methods (like K-Means) when you know how many clusters you
want and need a fast solution.
• Use Hierarchical methods when you want to understand the structure or nested
grouping of your data.
Q60. Agglomerative vs Divisive Hierarchical Clustering
Hierarchical clustering builds a tree of clusters (dendrogram) in one of two ways:
• Agglomerative (Bottom-Up)
• Divisive (Top-Down)
1. Agglomerative Clustering
Process:
• At each step, merge the two closest clusters based on a distance metric (e.g.,
Euclidean).
• Repeat until all points are in a single cluster or until a stopping condition is met.
Output:
• A dendrogram built from the leaves (individual points) up to the root (one all-inclusive cluster).
2. Divisive Clustering
Process:
1. Start with all data points in one single cluster.
2. Choose the cluster to split (typically the one with the highest variance).
3. Use a clustering algorithm (like K-Means with k = 2) to split it.
Output:
• A dendrogram built from the root (all points) down to the leaves (individual points).
Key Differences
Aspect | Agglomerative | Divisive
Starting Point | Each point is its own cluster | All points in one cluster
Dendrogram Direction | Builds from leaves to root | Builds from root to leaves
Q61: Discuss PAM, CLARA, and BIRCH Algorithms
These are three popular clustering algorithms, each designed for different types and sizes of
data:
1. PAM (Partitioning Around Medoids)
Concept:
• PAM is a k-medoids algorithm: each cluster is represented by an actual data point (medoid), which makes it robust to outliers.
How It Works:
1. Select k data points as the initial medoids.
2. Assign every remaining point to its nearest medoid.
3. Try to swap a medoid with a non-medoid and check if the total cost (dissimilarity)
reduces.
4. Keep the swap if the cost decreases; repeat until no swap improves the clustering.
• Robust to outliers.
Disadvantages:
• Computationally expensive; does not scale well to large datasets.
2. CLARA (Clustering LARge Applications)
Concept:
• CLARA is an extension of PAM designed to handle large datasets efficiently.
How It Works:
• Draw several random samples of the data, run PAM on each sample, and keep the clustering (set of medoids) with the lowest overall cost.
Advantages:
• Can handle much larger datasets than PAM while staying reasonably robust.
Disadvantages:
• Result quality depends on how representative the random samples are.
3. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
Concept:
• BIRCH incrementally builds a compact, in-memory CF (Clustering Feature) tree that summarizes the data, then clusters the leaf entries.
How It Works:
• Scan the data once to build the CF tree, optionally condense it, apply a clustering algorithm to the CF leaf entries, and refine with additional scans if needed.
Disadvantages:
• Designed for numeric data, favors spherical clusters, and is sensitive to the order in which records are inserted.
Comparison Table
Feature | PAM | CLARA | BIRCH
Robust to outliers | Yes | Yes | No (less robust)
Summary
• PAM is accurate and robust but slow; CLARA trades some accuracy for scalability by sampling.
• BIRCH is ideal for very large, high-dimensional datasets and performs well in
memory-constrained environments.
Q62. Clustering with Euclidean Distance
Overview
In clustering, one of the most common ways to measure how similar or dissimilar two data
points are is by using a distance metric. The Euclidean distance is the most widely used
metric, especially in algorithms like K-Means, Hierarchical Clustering, and DBSCAN.
Formula
For two points x = (x1, …, xn) and y = (y1, …, yn):
d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )
Why Euclidean Distance Is Popular
• It has an intuitive geometric meaning (straight-line distance) and works well for continuous numeric attributes.
Limitations
• Sensitive to the scale of the attributes (normalization is usually required) and to outliers, and it becomes less meaningful in very high-dimensional spaces.
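A tiny Python sketch of the Euclidean distance and how a clustering step uses it to find the nearest centroid (the points are illustrative):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

centroids = [(1, 2), (10, 2)]           # illustrative cluster centers
point = (2, 3)

# Assign the point to the nearest centroid, as K-Means does
distances = [euclidean(point, c) for c in centroids]
print(distances)                        # [1.414..., 8.062...]
print(distances.index(min(distances)))  # 0 -> nearest centroid is (1, 2)
```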
Q63: Mining Graph and Network Data (Graph Clustering)
What is Graph / Network Data?
• Graph data consists of nodes (vertices) connected by edges (links).
• Network data is a special case of graph data, often used in social networks, citation
networks, web graphs, etc.
Examples:
• Social networks (users and friendships), citation networks (papers and references), the web (pages and hyperlinks).
To discover:
• Communities of closely connected nodes, influential nodes, and unusual link patterns.
Graph clustering (also called community detection) is the task of dividing a graph into
groups of nodes (clusters or communities) such that:
1. Modularity-Based Clustering
Modularity:
A measure of the quality of clustering. It compares the actual number of edges within
clusters vs expected in a random graph.
Algorithm Example:
Concept:
Suitable for:
Concept:
Example Methods:
4. Label Propagation
Concept:
• Every node starts with its own label; in each iteration a node adopts the label most common among its neighbours, and densely connected groups converge to the same label (community).
5. Random Walk-Based Methods
Idea:
• Nodes in the same community are more likely to be visited together in a random
walk.
Example:
• Walktrap algorithm
Summary Table
• Requires specialized algorithms that respect the graph topology (unlike regular data
clustering).
• Plays a key role in social network analysis, bioinformatics, web structure mining,
and more.
Q64. Outlier Detection and Its Importance
What is an Outlier?
An outlier is a data point that is significantly different from the rest of the data. It lies far
away from the other observations and does not conform to the general pattern of the dataset.
Definition:
An outlier is an object or value that deviates so much from other observations that it raises
suspicion that it was generated by a different mechanism.
Examples of Outliers:
• A credit card transaction of $50,000 from a customer who normally spends under $500.
• A temperature sensor reading of 150°C among readings that are otherwise between 20°C and 40°C.
What is Outlier Detection?
Outlier detection is the process of identifying unusual data points that are inconsistent
with the rest of the data.
1. Data Quality
• Outliers can result from data entry errors, sensor faults, or measurement issues.
2. Model Accuracy
• Outliers can skew the results of machine learning models, especially regression and
classification.
3. Anomaly Detection
• Genuine outliers may indicate fraud, network intrusions, or equipment failures that need attention.
4. Business Intelligence
• Unusual cases can reveal new opportunities, risks, or changing customer behavior.
Aspect Explanation
Q65: Distributed and Parallel Data Mining
• Distributed Data Mining (DDM): Data mining performed on data stored in multiple
locations (different machines or sites).
• Parallel Data Mining (PDM): Data mining done by splitting tasks across multiple
processors or machines to run simultaneously and reduce computation time.
Application Areas
1. Finance and Banking
2. Healthcare
3. Retail and E-commerce
4. Telecommunications
Advantages
Advantage | Description
Scalability | Workload is shared across machines, so very large datasets can be mined
Speed | Parallel execution reduces the overall computation time
Conclusion
Distributed and Parallel Data Mining are essential in the age of Big Data, enabling:
• Real-time processing
• Large-scale learning
These techniques form the backbone of modern data-driven systems in industries ranging
from finance and healthcare to IoT and AI.
Q66. Web Content Mining
Web Content Mining is the process of extracting useful information or knowledge from
the content of web pages. It focuses on analyzing text, images, audio, video, and
structured data (like tables) found on websites.
Definition:
Web Content Mining refers to the technique of retrieving, analyzing, and deriving patterns
from the actual content of web pages (not just links or structure).
1. Structured data: Tables, lists, HTML tags like <table>, <ul>, etc.
2. Semi-structured data: HTML documents with tags but not strictly formatted.
3. Unstructured data: Free text, images, audio, and video.
Common Techniques
Technique | Description
Text mining / NLP | Extracting topics, sentiment, and entities from page text
Information extraction | Pulling structured facts (names, prices, dates) out of pages
Web scraping / crawling | Automatically collecting page content for analysis
1. E-commerce: Analyzing product descriptions and customer reviews for recommendations.
3. News Aggregators: Grouping and summarizing similar articles from many sites.
5. Academic Research: Mining papers, abstracts, and citations for topic analysis.
Summary Table
Aspect Details
Q67: Web Usage Mining
Web Usage Mining (WUM) is the process of discovering useful patterns and insights from
web user behavior data collected from web server logs, browser logs, or user profiles.
It focuses on analyzing how users interact with websites—what pages they visit, how long
they stay, and the sequence of their actions.
Process
1. Data Preprocessing: Clean the raw logs, identify individual users, and group page requests into sessions.
2. Pattern Discovery: Apply techniques such as association rules, clustering, and sequential pattern mining to the sessions.
3. Pattern Analysis: Filter, visualize, and interpret the discovered patterns to keep only the useful ones.
Technique Purpose
• Website Optimization: Improve site structure and navigation for better user
experience.
Summary
Aspect Explanation
Data Sources Server logs, browser logs, proxy logs, user profiles
Aspect Explanation
Q68: Web Structure Mining and Web Logs
1. Web Structure Mining
Definition
Web Structure Mining is the process of analyzing the structure of hyperlinks within the
web to discover patterns and relationships between web pages.
Instead of focusing on the content of pages, it studies the connections (links) between pages,
treating the web as a directed graph where:
• Nodes represent web pages
• Edges represent hyperlinks between pages
Purpose
• To identify important or authoritative pages and to understand how pages and sites are related to each other.
Key Concepts
Concept | Explanation
PageRank | Scores a page based on the number and quality of links pointing to it
Hubs and Authorities | Pages that link to many pages (hubs), and pages linked by many hubs (authorities)
Link Analysis | Studying link structure to find clusters and important pages
Applications
• Search engines use web structure mining to improve relevance (e.g., Google’s
PageRank).
This network of links can be analyzed to identify which pages are more “important” or
“central”.
2. Web Logs
Definition
Web logs (or server logs) are records automatically generated by web servers that track all
the requests made to a website.
They contain detailed information about user activities on the site.
Typical Log Fields
Field | Description
IP address | Identifies the client machine making the request
Timestamp | Date and time of the request
Requested URL | The page or resource that was accessed
Status code | Whether the request succeeded (e.g., 200) or failed (e.g., 404)
User agent | Browser and operating system of the visitor
Purpose
• Web logs are the main data source for web usage mining: they are cleaned, sessionized, and mined for navigation patterns.
Summary Table
Q69: Applications of Web Mining
Web Mining is the process of using data mining techniques to extract knowledge from web
data. It includes:
• Web Content Mining: Extracting useful information from the content of web pages
(text, images, videos).
• Web Structure Mining: Analyzing the link structure between web pages.
• Web Usage Mining: Analyzing user behavior from web logs.
Summary Table
Application Area Use Case / Benefit
What is WEKA?
WEKA (Waikato Environment for Knowledge Analysis) is an open-source data mining and machine learning workbench written in Java. It provides a graphical interface and a large collection of algorithms for preprocessing, classification, clustering, association rule mining, and visualization.
Key Features
Feature | Description
Cross-validation and Built-in tools for model evaluation like confusion matrix, ROC
Evaluation curves, etc.
• Educational Tool: Widely used for teaching machine learning and data mining
concepts.
Task | Algorithms
Classification | J48 (decision tree), Naive Bayes, SVM (SMO)
Clustering | K-Means (SimpleKMeans), EM
Association Rules | Apriori, FP-Growth
Example Workflow
1. Load a dataset (ARFF or CSV) in the Explorer.
2. Apply preprocessing filters (e.g., handle missing values, normalize).
3. Choose a classifier such as J48 and evaluate it with 10-fold cross-validation.
4. Inspect the accuracy, confusion matrix, and visualizations of the results.
Summary
Aspect Description