DM passing package

The document is a model question paper for a Data Mining course, structured into three parts with questions of varying marks. It covers fundamental concepts of data mining, including definitions, techniques, algorithms, and issues associated with data mining. Additionally, it discusses specific algorithms like K-Nearest Neighbors and K-Means Clustering, along with their applications and complexities.


Model Question Paper -1 DATA MINING

Instructions to Candidates: 1. Answer any Four questions from each part.


2. Answer All Parts
PART-A I. Answer any Four questions, each carries Two marks. (4x2=8)

1. What do you mean by Data Mining?


Ans: Definition: Data Mining is defined as the procedure of extracting information
from huge sets of data.
In other words, data mining is mining knowledge from data.
Terminologies involved in data mining: Knowledge discovery, query language,
classification and prediction, decision tree induction, cluster analysis etc.

2. Define Prediction.
Ans: PREDICTION:
To find a numerical output, prediction is used. The training dataset contains the
inputs and numerical output values. According to the training dataset, the algorithm
generates a model or predictor. When fresh data is provided, the model should find
a numerical output. This approach, unlike classification, does not have a class
label. A continuous-valued function or ordered value is predicted by the model.
Example: 1. Predicting the worth of a home based on facts like the number of
rooms, total area, and so on.

3. Define Regression.
Ans: REGRESSION IN DATA MINING:
Regression refers to a data mining technique that is used to predict the numeric
values in a given data set. Regression involves the technique of fitting a straight
line or a curve on numerous data points.
For example, regression might be used to predict the product or service cost or
other variables. It is also used in various industries for business and marketing
behavior, trend analysis, and financial forecast.
Regression is divided into five different types
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
4. What do you mean by outliers?
Ans: Outliers are sample points with values much different from those of the
remaining set of data. Outliers may represent errors in the data or could be
correct data values that are simply much different from the remaining data. A
person who is 2.5 meters tall is much taller than most people; in analysing
the height of individuals, this value would probably be viewed as an outlier.
Some clustering techniques do not perform well in the presence of outliers.
5. What is Decision Tree?
Ans: A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where each
internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction.

6. What do you mean by Distributed Algorithm?


Ans: The distribution of sample data values has to do with the shape, which refers to
how data values are spread across the range of values in the sample. In simple
terms, it indicates whether the values are clustered symmetrically around the
average or whether there are more values to one side than the other.
Two ways to explore the distribution of the sample data are:
1. Graphically
2. Through shape statistics

PART-B
II. Answer any Four questions, each carries Five marks. ( 4 x 5 = 20 )

7) What are the difference between Data Mining and knowledge discovery in databases?
Ans: DATA MINING VS KDD.

Key Feature: Basic Definition
Data Mining: Data mining is the process of identifying patterns and extracting details about big data sets using intelligent methods.
KDD: The KDD method is a complex and iterative approach to knowledge extraction from big data.

Key Feature: Goal
Data Mining: To extract patterns from datasets.
KDD: To discover knowledge from datasets.

Key Feature: Scope
Data Mining: In the KDD method, the fourth phase is called "data mining."
KDD: KDD is a broad method that includes data mining as one of its steps.

Key Feature: Used Techniques
Data Mining: Classification, Clustering, Decision Trees, Dimensionality Reduction, Neural Networks, Regression.
KDD: Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, Knowledge presentation.

Key Feature: Example
Data Mining: Clustering groups of data elements based on how similar they are.
KDD: Data analysis to find patterns and links.

8) What are the various issues associated with the Data Mining?
Ans: FACTORS THAT CREATE SOME ISSUES.

1. Mining Methodology and User Interaction issues


2. Performance Issues
3. Diverse Data Types Issues

MINING METHODOLOGY AND USER INTERACTION ISSUES:


1. Mining different kinds of knowledge in databases − Different users
may be interested in different kinds of knowledge. Therefore it is
necessary for data mining to cover a broad range of knowledge discovery
tasks.
2. Interactive mining of knowledge at multiple levels of abstraction −
The data mining process needs to be interactive because it allows users to
focus the search for patterns, providing and refining data mining requests
based on the returned results.
3. Incorporation of background knowledge − To guide discovery process
and to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of abstraction.
4. Data mining query languages and ad hoc data mining − Data Mining
Query language that allows the user to describe ad hoc mining tasks,
should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
5. Presentation and visualization of data mining results − Once the
patterns are discovered, they need to be expressed in high-level languages
and visual representations. These representations should be easily
understandable.
6. Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy
of the discovered patterns will be poor.
7. Pattern evaluation − The patterns discovered should be interesting or
relevant.

PERFORMANCE ISSUES :

1. Efficiency and scalability of data mining algorithms − In order to


effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.

2. Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of parallel
and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel; the results from the
partitions are then merged. Incremental algorithms update databases
without mining the data again from scratch.

DIVERSE DATA TYPES ISSUES :

1. Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data (data
related to a specific location on the Earth's surface), etc. It is not possible
for one system to mine all these kinds of data.
2. Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on
a LAN or WAN. These data sources may be structured, semi-structured or
unstructured. Therefore mining knowledge from them adds challenges
to data mining.

9) Write short note on K-Nearest Neighbors algorithm and its applications.


Ans: K Nearest Neighbors:
One common classification scheme based on the use of distance measures is that of the K
nearest neighbors (KNN). The KNN technique assumes that the entire training set includes not
only the data in the set but also the desired classification for each item. In effect, the training data
become the model. When a classification is to be made for a new item, its distance to each item
in the training set must be determined. Only the K closest entries in the training set are
considered further. The new item is then placed in the class that contains the most items from this
set of K closest items.

Fig: Classification using KNN


Here the points in the training set are shown and K = 3. The three closest items in the
training set are shown; t will be placed in the class to which most of these are members. We use
T to represent the training data. Since each tuple to be classified must be compared to each
element in the training data, if there are q elements in the training set, this is O (q). Given n
elements to be classified, this becomes an O (nq) problem. Given that the training data are of a
constant size (although perhaps quite large), this can then be viewed as an O(n) problem.

Applications

• Simplistic algorithm — uses only the value of K (an odd number) and a distance function
(such as Euclidean distance).
• Efficient method for small datasets.
• Utilises "lazy learning": the training dataset is simply stored and is used only when
making predictions, therefore making it quicker than Support Vector Machines
(SVMs) and Linear Regression.

10) Describe in detail one of the Decision Tree Algorithms and give examples.
Ans: Decision tree algorithm:
1. Begin with the entire dataset as the root node of the decision tree.
2. Determine the best attribute to split the dataset based on a given
criterion,
3. Create a new internal node that corresponds to the best attribute and
connects it to the root node.
4. Partition the dataset into subsets based on the values of the best
attribute.
5. Recursively repeat steps 1-4 for each subset until all instances in a
given subset belong to the same class or no further splitting is
possible.
6. Assign a leaf node to each subset that contains instances that belong to
the same class.
7. Make predictions based on the decision tree by traversing it from the
root node to a leaf node that corresponds to the instance being
classified.

The benefits of having a decision tree are

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

The following decision tree is for the concept "buy computer"; it indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.
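As a rough illustration of steps 1-7 above, the sketch below grows a tiny tree on an invented "buy computer"-style table. The attribute-selection criterion used here (information gain computed from entropy) and the column values are assumptions made for the example, not something fixed by the question paper.

import math
from collections import Counter

def entropy(rows):
    # rows are tuples whose last element is the class label
    counts = Counter(r[-1] for r in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attrs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1 or not attrs:                 # pure subset, or nothing left to split on
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    def gain(a):                                       # information gain of splitting on attribute a
        groups = Counter(r[a] for r in rows)
        remainder = sum((n / len(rows)) * entropy([r for r in rows if r[a] == v])
                        for v, n in groups.items())
        return entropy(rows) - remainder
    best = max(attrs, key=gain)                        # step 2: choose the best attribute
    node = {}
    for value in {r[best] for r in rows}:              # step 4: partition on its values
        subset = [r for r in rows if r[best] == value]
        node[(best, value)] = build_tree(subset, [a for a in attrs if a != best])
    return node

# Hypothetical rows: (age, student, buys_computer)
data = [("youth", "no", "no"), ("youth", "yes", "yes"), ("senior", "no", "no"),
        ("senior", "yes", "yes"), ("middle", "no", "yes"), ("middle", "yes", "yes")]
print(build_tree(data, attrs=[0, 1]))                  # nested dict of (attribute, value) tests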
11) Explain Hierarchical clustering in detail.
Ans: HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of
clusters. Hierarchical algorithms differ in how the sets are created. A tree data
structure, called a dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters. The root in a dendrogram tree
contains one cluster where all elements are together. The leaves in the
dendrogram each consist of a single-element cluster. Internal nodes in the
dendrogram represent new clusters formed by merging the clusters that appear as
its children in the tree. Each level in the tree is associated with the distance
measure that was used to merge the clusters. All clusters created at a particular
level were combined because the children clusters had a distance between them
less than the distance value associated with this level in the tree.
Fig: Dendrogram

Fig: Five Levels of Clustering


The figure shows six elements, {A, B, C, D, E, F}, to be clustered. Parts (a) to (e) of the figure
show five different sets of clusters; in part (a) each cluster consists of a single element.
The space complexity for hierarchical algorithms is O(n²) because this is the space
required for the adjacency matrix. The space required for the dendrogram is O(kn),
which is much less than O(n²). The time complexity for hierarchical algorithms is
O(kn²) because there is one iteration for each level in the dendrogram. Depending
on the specific algorithm, however, this could actually be O(maxd·n²), where maxd
is the maximum distance between points. Different algorithms may actually merge
the closest clusters from the next lowest level or simply create new clusters at each
level with progressively larger distances. Hierarchical techniques are well suited for
many clustering applications that naturally exhibit a nesting relationship between
clusters.
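A minimal sketch of the agglomerative (bottom-up) variant in Python, assuming one-dimensional points and single-link (minimum) distance between clusters; the six values stand in for the elements A-F and are invented. Each pass of the loop produces one level of the dendrogram, from all-singleton clusters up to a single cluster.

def single_link(c1, c2):
    # single-link distance: smallest pairwise distance between the two clusters
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points):
    clusters = [[p] for p in points]        # lowest level: every element is its own cluster
    levels = [list(clusters)]
    while len(clusters) > 1:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels

# Hypothetical 1-D values standing in for elements A..F
for level in agglomerative([1, 2, 5, 6, 12, 13]):
    print(level)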

12) Write short note on Data Parallelism.


Ans:

Data Parallelism
Data parallelism means concurrent execution of the same task on multiple computing cores, each core working on a different subset of the data.

Let's take an example: summing the contents of an array of size N. For a single-core system, one thread would
simply sum the elements [0] . . . [N − 1]. For a dual-core system, however, thread A, running on core 0, could
sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, sums the elements [N/2] . . .
[N − 1]. The two threads would run in parallel on separate computing cores.

1. The same task is performed on different subsets of the same data.
2. Synchronous computation is performed.
3. As there is only one execution flow operating on all sets of data, the speedup is more.
4. The amount of parallelization is proportional to the input size.
5. It is designed for optimum load balance on multiprocessor systems.
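The array-summing example can be sketched in Python with the standard multiprocessing module. A hedge on the details: Python threads would not run the two halves truly in parallel because of the interpreter lock, so worker processes stand in for the "threads" of the description, and the array contents are arbitrary.

from multiprocessing import Pool

def partial_sum(chunk):
    # the same task (summation) applied to one subset of the data
    return sum(chunk)

if __name__ == "__main__":
    N = 1_000_000
    data = list(range(N))
    halves = [data[:N // 2], data[N // 2:]]      # worker A gets [0..N/2-1], worker B gets [N/2..N-1]
    with Pool(processes=2) as pool:
        parts = pool.map(partial_sum, halves)    # both halves are summed concurrently
    print(sum(parts) == sum(data))               # True: same result as the single-core version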

PART C
III. Answer any Four questions, each carries Eight marks. ( 4 x 8 = 32 )

13) How can you describe Data mining from the perspective of database?
Ans: Data Mining from a Database Perspective.

A data mining system can be classified according to the kinds of databases on


which the data mining is performed. For example, a system is a relational data
miner if it discovers knowledge from relational data, or an object-oriented one if it
mines knowledge from object-oriented databases.
Database technology has been successfully used in traditional business
data processing. Companies have been gathering a large amount of
data, using a DBMS system to manage it. Therefore, it is desirable that
we have an easy and painless use of database technology within other
areas, such as data mining.
DBMS technology offers many features that make it valuable when
implementing data mining applications. For example, it is possible to
work with data sets that are considerably larger than main memory, since
the database itself is responsible for handling information, paging and
swapping when necessary. Besides, a simplified data management and a
closer integration to other systems are available (e.g. data may be updated
or managed as a part of a larger operational process). Moreover,
as emerging object-relational databases are providing the ability to
handle image, video and voice, there is a potential area to exploit mining
of complex data types. Finally, after rules are discovered, we can use
ad-hoc and OLAP queries to validate discovered patterns in an easy way. We
must not forget that information used during mining processing is often
confidential. Thus, DBMSs can also be used as a means of providing data
security, which is widely implemented in commercial databases, avoiding
the need of using encryption algorithms to process information.

14) Write a short note on Scalable DT techniques.
Ans: Refer notes.
15) Explain how the K-Means Clustering algorithm works and give examples.
Ans: K Means Clustering:
K-means is an iterative clustering algorithm in which items are moved among sets of clusters
until the desired set is reached. As such, it may be viewed as a type of squared error
algorithm, although the convergence criteria need not be defined based on the squared
error. A high degree of similarity among elements in clusters is obtained, while a high
degree of dissimilarity among elements in different clusters is achieved simultaneously.

The time complexity of K-means is O(tkn), where t is the number of iterations. K-means
finds a local optimum and may actually miss the global optimum. K-means does not work
on categorical data because the mean must be defined on the attribute type.
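A minimal K-means sketch in Python, assuming one-dimensional numeric items, absolute difference as the distance, and the first k items as the initial means; the data and k = 2 are invented. Items move between clusters until the means stop changing, which matches the O(tkn) behaviour noted above.

def kmeans(items, k, max_iter=100):
    means = list(items[:k])                         # naive initialisation: first k items as seeds
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in items:                             # assign each item to its closest mean
            idx = min(range(k), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        new_means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                      # converged: no mean moved, so no item will move
            break
        means = new_means
    return means, clusters

means, clusters = kmeans([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2)
print(means)      # one centre near the small values, one near the large values
print(clusters)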

16) Write a short note on hierarchical clustering. Ans: repeated


17) What do you mean by Large item-sets? Explain in detail.
Ans: Refer notes.
18) What is Data Parallelism? Explain in detail.
Ans: Repeated; see Q12 above and refer notes.


Model Question Paper-2 DATA MINING

Instructions to Candidates: 1. Answer any Four questions from each part.


2. Answer All Parts
PART-A I. Answer any Four questions, each carries Two marks. (4x2=8)

1. What do you mean by ETL process?


Ans: ETL Tools are applications/platforms that enable users to execute ETL
(Extract, Transform, Load) processes. In simple terms, these tools help businesses move data from one or
many disparate data sources to a destination. These help in making the data both
digestible and accessible (and in turn analysis-ready) in the desired location – often
a data warehouse.
ETL tools are the first essential step in the data warehousing process and
eventually help make more informed decisions in less time.
2. Define Regression and its types.

Ans: Regression

Regression is a statistical tool that helps determine the cause and effect relationship
between the variables. It determines the relationship between a dependent and an
independent variable. It is generally used to predict future trends and events.

Regression is divided into five different types


1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression

3. How will you solve Classification problem?


Ans: The decision tree approach is most useful in classification problems. With this
technique, a tree is constructed to model the classification process. Once the tree is built,
it is applied to each tuple in the database and results in a classification for that tuple. There
are two basic steps in the technique: building the tree and applying the tree to the database
4. What do you mean by outliers? Ans: repeated
5. What is CART classification?

Ans: CART is a predictive algorithm used in machine learning; it explains how the target
variable's values can be predicted based on other variables. It is a decision tree where each fork is
a split on a predictor variable and each leaf node holds a prediction for the target variable.

6. What do you mean by Distributed Algorithm?


Ans: The distribution of sample data values has to do with the shape, which refers to
how data values are spread across the range of values in the sample. In simple
terms, it indicates whether the values are clustered symmetrically around the
average or whether there are more values to one side than the other.
Two ways to explore the distribution of the sample data are:
1. Graphically
2. Through shape statistics
PART-B
II. Answer any Four questions, each carries Five marks. ( 4 x 5 = 20 )

7) What are the difference between Data Mining and knowledge


discovery in databases?
Ans: DATA MINING VS KDD.

Key Feature: Basic Definition
Data Mining: Data mining is the process of identifying patterns and extracting details about big data sets using intelligent methods.
KDD: The KDD method is a complex and iterative approach to knowledge extraction from big data.

Key Feature: Goal
Data Mining: To extract patterns from datasets.
KDD: To discover knowledge from datasets.

Key Feature: Scope
Data Mining: In the KDD method, the fourth phase is called "data mining."
KDD: KDD is a broad method that includes data mining as one of its steps.

Key Feature: Used Techniques
Data Mining: Classification, Clustering, Decision Trees, Dimensionality Reduction, Neural Networks, Regression.
KDD: Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, Knowledge presentation.

Key Feature: Example
Data Mining: Clustering groups of data elements based on how similar they are.
KDD: Data analysis to find patterns and links.

8) Explain Naive Bayesian method.


Ans: Bayesian classification:
Bayesian classification uses Bayes theorem to predict the occurrence of any
event. Bayesian classifiers are statistical classifiers based on Bayesian
probability. The theorem expresses how a level of belief, expressed as a
probability, should be updated in the light of evidence.
Bayes theorem came into existence after Thomas Bayes, who first utilized
conditional probability to provide an algorithm that uses evidence to
calculate limits on an unknown parameter.

Bayes's theorem is expressed mathematically by the following equation:

P(X|Y) = P(Y|X) P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X|Y) is the conditional probability of event X given that Y is true.

P(Y|X) is the conditional probability of event Y given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of
each other; these are known as the marginal probabilities.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief."
Bayes theorem connects the degree of belief in a hypothesis before and after
accounting for evidence. For example, let us consider the example of a
coin. If we toss a coin, we get either heads or tails, and the chance of
either heads or tails is 50%. If the coin is flipped a number of
times, and the outcomes are observed, the degree of belief may rise, fall, or
remain the same depending on the outcomes.

For proposition X and evidence Y:

o P(X), the prior, is the initial degree of belief in X.
o P(X|Y), the posterior, is the degree of belief having accounted for Y.
o The quotient P(Y|X)/P(Y) represents the support Y provides for X.

Bayes theorem can be derived from conditional probability:

P(X|Y) = P(X ∩ Y) / P(Y) and P(Y|X) = P(X ∩ Y) / P(X),

where P(X ∩ Y) is the joint probability of both X and Y being true. Because the
joint probability is the same in both expressions, equating them and rearranging
gives Bayes theorem.
Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results.
First, the attributes usually are not independent. We could use a subset of the attributes by ignoring any
that are dependent on others. The technique does not handle continuous data.
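To make the theorem concrete, here is a hedged sketch of a naive Bayes classifier on a tiny invented weather/play table (categorical attributes only, since, as noted above, this simple form does not handle continuous data; no smoothing is applied, so a zero count simply zeroes out a class).

from collections import Counter

def naive_bayes(train, query):
    # train: list of (attribute_dict, class_label); query: attribute_dict to classify
    labels = Counter(label for _, label in train)
    scores = {}
    for label, count in labels.items():
        prob = count / len(train)                          # prior P(label)
        rows = [attrs for attrs, l in train if l == label]
        for attr, value in query.items():
            matches = sum(1 for attrs in rows if attrs.get(attr) == value)
            prob *= matches / count                        # naive likelihood P(value | label)
        scores[label] = prob                               # proportional to the posterior P(label | query)
    return max(scores, key=scores.get), scores

# Hypothetical training tuples
train = [({"outlook": "sunny", "windy": "no"}, "play"),
         ({"outlook": "sunny", "windy": "yes"}, "no-play"),
         ({"outlook": "rain", "windy": "yes"}, "no-play"),
         ({"outlook": "overcast", "windy": "no"}, "play"),
         ({"outlook": "rain", "windy": "no"}, "play")]
print(naive_bayes(train, {"outlook": "sunny", "windy": "no"}))   # picks "play"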

9) Write a short note on Data Mining tasks.

Ans: 1. Classification:

This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria

2. Clustering:

Clustering is the division of information into groups of connected objects. Describing
the data by a few clusters mainly loses certain fine details, but accomplishes
improvement. It models data by its clusters. In other words, we can say that
clustering analysis is a data mining technique to identify similar data. This
technique helps to recognize the differences and similarities between the data.
Clustering is very similar to classification, but it involves grouping chunks of
data together based on their similarities.

3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used
to define the probability of a specific variable. Regression is primarily a form of
planning and modeling. For example, we might use it to project certain costs,
depending on other factors such as availability, consumer demand, and
competition. Primarily it gives the exact relationship between two or more
variables in the given data set.

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of
databases. For example, given a list of grocery items that you have been buying for the
last six months, it calculates the percentage of items being purchased together.

The three major measures are Lift, Support and Confidence.
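These three measures can be computed directly from transaction data. A small worked sketch in Python, with invented baskets:

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # support of the combined itemset divided by support of the antecedent
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # lift > 1 means the items occur together more often than if they were independent
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk", "bread"}))          # 3/5 = 0.6
print(confidence({"milk"}, {"bread"}))     # 0.6 / 0.8 = 0.75
print(lift({"milk"}, {"bread"}))           # 0.75 / 0.8 = 0.9375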

5. Outlier detection:

This type of data mining technique relates to the observation of data items in the
data set, which do not match an expected pattern or expected behavior. This
technique may be used in various domains like intrusion, detection, fraud
detection, etc. It is also known as Outlier Analysis or Outlier mining. The outlier is
a data point that diverges too much from the rest of the dataset. The majority of the
real-world datasets have an outlier. Outlier detection plays a significant role in the
data mining field. Outlier detection is valuable in numerous fields like network
interruption identification, credit or debit card fraud detection, detecting outlying in
wireless sensor network data, etc.

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It comprises finding interesting
subsequences in a set of sequences, where the interestingness of a subsequence can be
measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over some time.
7. Prediction:

Prediction uses a combination of other data mining techniques such as trends,


clustering, classification, etc. It analyzes past events or instances in the right
sequence to predict a future event.

10) Describe in detail one of the Decision Tree Algorithms and give examples.
Ans: Decision tree algorithm:
1. Begin with the entire dataset as the root node of the decision tree.
2. Determine the best attribute to split the dataset based on a given
criterion.
3. Create a new internal node that corresponds to the best attribute and
connect it to the root node.
4. Partition the dataset into subsets based on the values of the best
attribute.
5. Recursively repeat steps 1-4 for each subset until all instances in a
given subset belong to the same class or no further splitting is
possible.
6. Assign a leaf node to each subset that contains instances that belong to
the same class.
7. Make predictions based on the decision tree by traversing it from the
root node to a leaf node that corresponds to the instance being
classified.

The benefits of having a decision tree are

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.

The following decision tree is for the concept "buy computer"; it indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.
11) Explain Hierarchical clustering in detail.
Ans: HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of
clusters. Hierarchical algorithms differ in how the sets are created. A tree data
structure, called a dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters. The root in a dendrogram tree
contains one cluster where all elements are together. The leaves in the
dendrogram each consist of a single-element cluster. Internal nodes in the
dendrogram represent new clusters formed by merging the clusters that appear as
its children in the tree. Each level in the tree is associated with the distance
measure that was used to merge the clusters. All clusters created at a particular
level were combined because the children clusters had a distance between them
less than the distance value associated with this level in the tree.
Fig: Dendrogram

Fig: Five Levels of Clustering


The figure shows six elements, {A, B, C, D, E, F}, to be clustered. Parts (a) to (e) of the figure
show five different sets of clusters; in part (a) each cluster consists of a single element.
The space complexity for hierarchical algorithms is O(n²) because this is the space
required for the adjacency matrix. The space required for the dendrogram is O(kn),
which is much less than O(n²). The time complexity for hierarchical algorithms is
O(kn²) because there is one iteration for each level in the dendrogram. Depending
on the specific algorithm, however, this could actually be O(maxd·n²), where maxd
is the maximum distance between points. Different algorithms may actually merge
the closest clusters from the next lowest level or simply create new clusters at each
level with progressively larger distances. Hierarchical techniques are well suited
for many clustering applications that naturally exhibit a nesting relationship
between clusters.

12) Write a short note on Data warehouse.


Ans: A data warehouse, or enterprise data warehouse (EDW), is a system that
aggregates data from different sources into a single, central, consistent data store to
support data analysis, data mining, artificial intelligence (AI), and machine
learning. Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from multiple
heterogeneous sources that support analytical reporting, structured and/or ad hoc
queries, and decision making. Data warehousing involves data cleaning, data
integration, and data consolidations.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data
warehouse. These technologies help executives to use the warehouse quickly and
effectively. They can gather data, analyze it, and take decisions based on the
information present in the warehouse. The information gathered in a warehouse can
be used in any of the following domains
Tuning Production Strategies − The product strategies can be well tuned by
repositioning the products and managing the product portfolios by comparing the
sales quarterly or yearly.
Customer Analysis − Customer analysis is done by analyzing the customer's
buying preferences, buying time, budget cycles, etc.
Operations Analysis − Data warehousing also helps in customer relationship
management, and making environmental corrections. The information also allows
us to analyze business operations.
FUNCTIONS OF DATA WAREHOUSE TOOLS AND UTILITIES:
• Data Extraction − Involves gathering data from multiple heterogeneous
sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy format to
warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking
integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.

PART C
III. Answer any Four questions, each carries Five marks. ( 4 x 8 = 32 )

13) How can you describe Data mining from the perspective of a database?
Ans: Data Mining from a Database Perspective.

A data mining system can be classified according to the kinds of databases on


which the data mining is performed. For example, a system is a relational data
miner if it discovers knowledge from relational data, or an object-oriented one if it
mines knowledge from object-oriented databases.
Statistical Methods in Data Mining

Data mining refers to extracting or mining knowledge from large amounts of data.
In other words, data mining is the science, art, and technology of exploring large
and complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process
more efficient, cost-effective, and accurate. Any situation can be analyzed in two
ways in data mining:

1. Non-statistical Analysis: This analysis provides generalized information


and includes sound, still images, and moving images.
2. Statistical Analysis: In statistics, data is collected, analyzed, explored, and
presented to identify patterns and trends. Alternatively, it is referred to as
quantitative analysis. It is the analysis of raw data using mathematical
formulas, models, and techniques. Through the use of statistical methods,
information is extracted from research data, and different ways are available
to judge the robustness of research outputs. It is created for the effective
handling of large amounts of data that are generally multidimensional and
possibly of several complex types.

14) Write a short note on Scalable DT techniques.


Ans: Refer notes

15) Explain how the K-Means Clustering algorithm works and give examples.
Ans: K Means Clustering:
K-means is an iterative clustering algorithm in which items are moved among sets of clusters
until the desired set is reached. As such, it may be viewed as a type of squared error
algorithm, although the convergence criteria need not be defined based on the squared
error. A high degree of similarity among elements in clusters is obtained, while a high
degree of dissimilarity among elements in different clusters is achieved simultaneously.
The time complexity of K-means is O(tkn), where t is the number of iterations. K-means
finds a local optimum and may actually miss the global optimum. K-means does not work
on categorical data because the mean must be defined on the attribute type.

16) Write a short note on clustering techniques.


Ans: Clustering is similar to classification in that data are grouped. However, unlike
classification,
the groups are not predefined. Instead, the grouping is accomplished by finding
similarities between data according to characteristics found in the actual data. The
groups are called clusters.
-Set of like elements. Elements from different clusters are not alike.
-The distance between points in a cluster is less than the distance between a point
in the cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples (records) in
a database are grouped together. This is done to partition or segment the database
into components that then give the user a more general view of the data. This
example illustrates the fact that determining how to do the clustering is not
straightforward
Clustering has been used in many application domains, including biology,
medicine, anthropology, marketing, and economics. Clustering applications include
plant and animal classification, disease classification, image processing, pattern
recognition, and document retrieval. One of the first domains in which clustering
was used was biological taxonomy.

When clustering is applied to a real-world database, many interesting


problems occur:

o Outlier handling is difficult: outlier elements do not naturally fall
into any cluster.
o Dynamic data in the database implies that cluster membership may
change over time.
o Interpreting the semantic meaning of each cluster may be difficult. With
classification, the labeling of the classes is known ahead of time. However,
with clustering, this may not be the case. Thus, when the clustering process
finishes creating a set of clusters, the exact meaning of each cluster may not
be obvious.
o There is no one correct answer to a clustering problem. In fact, many
answers may be found. The exact number of clusters required is not easy to
determine; again, a domain expert may be required.
o Another related issue is what data should be used for clustering. Unlike
learning during a classification process, where there is some a priori
knowledge concerning what the attributes of each classification should be,
in clustering we have no supervised learning to aid the process. Indeed,
clustering can be viewed as similar to unsupervised learning.

A classification of the different types of clustering algorithms shows that clustering
algorithms themselves may be viewed as hierarchical or partitional. With
hierarchical clustering, a nested set of clusters is created.

17) Explain the Apriori algorithm.

Ans: Apriori Algorithm – Frequent Pattern Algorithms
The Apriori algorithm was the first algorithm proposed for frequent itemset
mining. It was later improved by R. Agrawal and R. Srikant and came to be known
as Apriori. This algorithm uses two steps, "join" and "prune", to reduce the search
space. It is an iterative approach to discover the most frequent itemsets.

Apriori says:

• If P(I) < minimum support threshold, then itemset I is not frequent.
• If P(I+A) < minimum support threshold, then I+A is not frequent, where A
also belongs to the itemset.
• If an itemset has support less than the minimum support, then all of its
supersets will also fall below minimum support, and thus can be ignored. This
property is called the Antimonotone property.
The steps followed in the Apriori Algorithm of data mining are:
1. Join Step: This step generates (K+1) itemset from K-itemsets by joining
each item with itself.
2. Prune Step: This step scans the count of each item in the database. If the
candidate item does not meet minimum support, then it is regarded as
infrequent and thus it is removed. This step is performed to reduce the size
of the candidate itemsets.
Steps In Apriori
Apriori algorithm is a sequence of steps to be followed to find the most frequent
itemset in the given database. This data mining technique follows the join and the
prune steps iteratively until the most frequent itemset is achieved. A minimum
support threshold is given in the problem or it is assumed by the user.

#1) In the first iteration of the algorithm, each item is taken as a 1-itemset
candidate. The algorithm counts the occurrences of each item.

#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets
whose occurrence satisfies min_sup is determined. Only those candidates whose
count is greater than or equal to min_sup are taken ahead to the next iteration;
the others are pruned.

#3) Next, 2-itemset frequent items with min_sup are discovered. For this, in the join
step, the 2-itemsets are generated by forming groups of two, combining items with
each other.

#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the
table will have only the 2-itemsets that meet min_sup.

#5) The next iteration forms 3-itemsets using the join and prune steps. This
iteration uses the antimonotone property: the 2-itemset subsets of each candidate
3-itemset must themselves meet min_sup. If all 2-itemset subsets are frequent,
the superset is kept as a candidate; otherwise it is pruned.

#6) The next step forms 4-itemsets by joining 3-itemsets with each other and
pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm
stops when no further frequent itemsets can be generated.

Advantages
1. Easy to understand algorithm.
2. Join and Prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned.
Frequent pattern mining has many applications in the fields of data analysis, software bugs,
cross-marketing, sale campaign analysis, market basket analysis, etc.
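A compact sketch of the join-and-prune loop in Python, assuming min_sup is an absolute count of 2 transactions and an invented basket dataset; it returns the frequent itemsets only (generating association rules from them is a separate step).

def apriori(transactions, min_sup=2):
    items = {item for t in transactions for item in t}
    # frequent 1-itemsets: each single item whose count meets min_sup
    frequent = [{frozenset([i]) for i in items
                 if sum(1 for t in transactions if i in t) >= min_sup}]
    k = 2
    while frequent[-1]:
        # join step: build k-itemset candidates from the frequent (k-1)-itemsets
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
        # prune step: keep only candidates whose support meets min_sup
        level = {c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_sup}
        frequent.append(level)
        k += 1
    return [s for level in frequent for s in level]

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]
for itemset in apriori(baskets, min_sup=2):
    print(set(itemset))                     # all pairs are frequent; no 3-itemset survives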

18) What is Data Parallelism? Explain in detail.


Ans:

Data Parallelism
Data parallelism means concurrent execution of the same task on multiple computing cores, each core working on a different subset of the data.

Let's take an example: summing the contents of an array of size N. For a single-core system, one thread would
simply sum the elements [0] . . . [N − 1]. For a dual-core system, however, thread A, running on core 0, could
sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, sums the elements [N/2] . . .
[N − 1]. The two threads would run in parallel on separate computing cores.

1. The same task is performed on different subsets of the same data.
2. Synchronous computation is performed.
3. As there is only one execution flow operating on all sets of data, the speedup is more.
4. The amount of parallelization is proportional to the input size.
5. It is designed for optimum load balance on multiprocessor systems.
V Semester B.C.A. Degree Examination, February/March - 2024
PART-A
I. Answer the FOUR questions. Each question carries Two marks.
1.Define data mining.
Ans: Refer MQP 1 Q 1
2.What is prediction?
Ans: Refer MQP 1 Q 2
3.What is Regression?
Ans: Refer MQP 1 Q 3
4.Define outliers.
Ans: Refer MQP 1 Q 4
5.What is parallel algorithm?
Ans: These algorithms perform multiple operations simultaneously on different processors, enhancing
computational speed. Categories include:
◦ Data Parallelism: Distributes subsets of data across processors.
◦ Task Parallelism: Distributes different tasks across processors
6.What is spanning tree?
Ans: A spanning tree is a subgraph of a connected, undirected graph that includes all the vertices of the
graph and is a tree (i.e., it is connected and acyclic). In simpler terms, a spanning tree connects all the
vertices in a graph without forming any cycles and uses the minimum number of edges necessary to do so.
PART-B
II. Answer any FOUR questions. Each question carries Five marks.

7. Compare data mining and knowledge discovery in databases.


Ans: Refer MQP 1 Q 7
8. Discuss the data mining issues.
Ans: Refer MQP 1 Q 8
9. Explain K-nearest algorithm with example.
Ans: Refer MQP 1 Q 9
10. Explain any one of the decision tree algorithms with example.
Ans: Refer MQP 1 Q 10
11. Explain outlier in detail with examples.
Ans: An outlier is an observation in a given dataset that lies far from the rest of the observations. That means an
outlier is vastly larger or smaller than the remaining values in the set.
An outlier may occur due to the variability in the data, or due to experimental/human error. In statistics, we
have three measures of central tendency, namely Mean, Median, and Mode. They help us describe the data.
• Mean is the accurate measure to describe the data when we do not have any outliers present.
• Median is used if there is an outlier in the dataset.
• Mode is used if there is an outlier AND about half or more of the data is the same.
Mean is the only measure of central tendency that is affected by outliers, which in turn impacts standard
deviation.
Example: Consider a small dataset, sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By looking at it, one can
quickly say '101' is an outlier that is much larger than the other values.
Computation with and without the outlier: with 101 included, the mean is about 20.1 and the median is 14;
with 101 removed, the mean drops to about 12.7 while the median only moves to 13.
From the above calculations, we can clearly say the Mean is more affected than the Median.
Detecting Outliers
If our dataset is small, we can detect an outlier by just looking at the dataset. But what if we have a huge
dataset? How do we identify the outliers then? We need to use visualization and mathematical techniques.
Below are some of the techniques for detecting outliers:
• Boxplots
• Z-score
• Inter Quartile Range (IQR)
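The Z-score and IQR rules can be applied in Python to the same sample list used above; the usual conventions (|z| > 2 and the 1.5 × IQR fences) are assumed here.

import statistics

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# Z-score rule: flag values more than 2 standard deviations from the mean
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)
z_outliers = [x for x in sample if abs((x - mean) / stdev) > 2]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in sample if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("z-score outliers:", z_outliers)     # [101]
print("IQR outliers:", iqr_outliers)       # [101]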
12. Explain Apriori algorithm with examples.
Ans: Refer MQP 2 Q 17

PART-C
III. Answer any FOUR questions. Each question carries Eight marks.
13. Explain data mining from a database perspective.
Ans: Refer MQP 2 Q 13
14. Explain CART in detail.
Ans:
A Classification and Regression Tree (CART) is a decision tree algorithm used for both classification and
regression tasks. It is a popular and versatile machine learning algorithm that recursively splits the dataset
into subsets based on the most significant attribute, resulting in a tree-like structure.
Key Concepts:
1. Decision Tree:
• A tree-like model where each internal node represents a decision based on the value of a particular attribute.
• Each leaf node represents the outcome or predicted value.
2. Splitting Criteria:
• The algorithm selects the attribute and the split point (or threshold) that best separates the data into
homogeneous subsets.
• For classification, common criteria include Gini impurity and entropy.
• For regression, the mean squared error (MSE) is often used.
3. Recursive Splitting:
• The dataset is split recursively until a stopping criterion is met (e.g., maximum depth, minimum samples per
leaf).
• Each split further refines the decision boundaries.
4. Classification:
• For classification tasks, the leaf nodes represent the predicted class based on majority voting.
5. Regression:
• For regression tasks, the leaf nodes represent the predicted value based on the average of the target values in
that node.
How CART Works:
1. Root Node:
• The algorithm selects the attribute and split point that best separates the entire dataset.
2. Splitting:
• The dataset is split into subsets based on the chosen attribute and split point.
• This process is repeated for each subset until a stopping criterion is met.
3. Leaf Nodes:
• The terminal nodes (leaves) contain the final predictions or classifications.
4. Prediction/Classification:
• For a new instance, it traverses the tree from the root to a leaf, making decisions based on attribute values.
• The predicted class or value is determined by the leaf node reached.
Example:
Let's consider a classification task where we want to predict whether a passenger survived or not based on
features like age, gender, and ticket class.
• The root node might split the data based on gender.
• The next level might split based on age.
• The leaf nodes might represent different survival outcomes.
For a regression task, the target might be the price of a house based on features like the number of bedrooms and
square footage.
• The tree would split the dataset based on features to create leaves that represent predicted house prices.
Applications:
• Classification: Predicting outcomes like spam or non-spam emails, customer churn, etc.
• Regression: Predicting numeric values like house prices, temperature, etc.
• Interpretability: Decision trees are human-readable and can help understand the decision-making process.

CART is a powerful algorithm with the ability to handle complex relationships in data.
However, it is prone to overfitting, and techniques like pruning are often applied to prevent this.
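In practice, CART-style trees are usually obtained from a library rather than written by hand. A hedged sketch using scikit-learn's DecisionTreeClassifier (an implementation of a CART-style tree); the tiny passenger-like table is invented, and scikit-learn must be installed for this to run.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [age, gender (0 = male, 1 = female), ticket_class]
X = [[22, 0, 3], [38, 1, 1], [26, 1, 3], [35, 1, 1],
     [35, 0, 3], [54, 0, 1], [2, 1, 3], [27, 1, 2]]
y = [0, 1, 1, 1, 0, 0, 0, 1]      # 1 = survived, 0 = did not survive

# Gini impurity is the default splitting criterion, as described above
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "gender", "class"]))
print(tree.predict([[30, 1, 1]]))  # predicted outcome for a new passenger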
15. Explain K-means clustering algorithm with examples.
Ans: Refer MQP 1 Q 15
16. What is hierarchical clustering? Explain in detail and give example.
Ans: Refer MQP 1 Q 16
17. What is large item-sets? Explain in detail.
Ans: Refer MQP 1 Q 17
18. Write a note on data parallelism.
Ans: Refer MQP 1 Q 18

DATA MINING –ELECTIVE


Short answers
1.What is data mining?
Ans: Refer MQP 1 Q 1
2.What are the applications of data mining?
Ans:
• Marketing and Sales
• Healthcare
• Finance
• Telecommunications
• Education
• Manufacturing
3.What is data warehousing?
Ans: Refer MQP 2 Q 12
4.What are the functions of data warehouse tools and utilities?
Ans:
• Data Extraction
• Data Cleaning
• Data Transformation
• Data Loading
• Refreshing
5.what is ETL? what are the steps involved in ETL process?
Ans: ETL: ETL stands for Extract, Transform, and Load.

It is defined as a Data Integration service and allows companies to combine data from various sources into a single,
consistent data store that is loaded into a Data Warehouse or any other target system.

STEPS INVOLVED IN THE ETL PROCESS:

• Extraction: In this step, the structured or unstructured data is extracted from its source and consolidated into a single
repository. ETL tools automate the extraction process and create a more efficient and reliable workflow for handling
large volumes of data and multiple sources.
• Transformation: In order to improve data integrity, the data needs to be transformed: it needs to be sorted,
standardized, and redundant data should be removed. This step ensures that the raw data which arrives at its new
destination is fully compatible and ready to use.
• Loading: This is the final step of the ETL process, which involves loading the data into the final destination (data lake or
data warehouse). The data can be loaded all at once (full load) or at scheduled intervals (incremental load).
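A toy end-to-end sketch of the three steps in plain Python; the CSV text, the field names and the SQLite table standing in for the warehouse are all invented for illustration.

import csv, io, sqlite3

# Extract: read raw records from a (here in-memory) CSV source
raw = io.StringIO("id,name,amount\n1, Alice ,100\n2,Bob,75\n2,Bob,75\n3,Carol,250\n4,Dan,\n")
rows = list(csv.DictReader(raw))

# Transform: standardise names, drop incomplete records, remove duplicates
seen, clean = set(), []
for r in rows:
    if not r["amount"]:
        continue                              # incomplete record is dropped
    key = (r["id"], r["name"].strip().title())
    if key in seen:
        continue                              # redundant (duplicate) record is dropped
    seen.add(key)
    clean.append((int(r["id"]), key[1], float(r["amount"])))

# Load: write the transformed rows into the target store (a stand-in for the warehouse)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)
print(db.execute("SELECT * FROM sales").fetchall())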

6.What are ETL tools ?what are the different types of ETL tools?
Ans:
ETL Tools are applications/platforms that enable users to execute ETL processes. In simple terms, these tools
help businesses move data from one or many disparate data sources to a destination. These help in making the
data both digestible and accessible (and in turn analysis-ready) in the desired location – often a data warehouse.

ETL tools are the first essential step in the data warehousing process that eventually make more informed
decisions in less time.

TYPES OF ETL TOOLS:

• Enterprise ETL Tools

The ETL tools are often bundled as part of a larger platform and appeal to enterprises with older, legacy systems
that they need to work with and build on. These ETL tools can handle pipelines efficiently and are highly
scalable since they were one of the first to offer ETL tools and mature in the market. These tools support most
relational and nonrelational databases.
• Custom ETL Tools
In this, the custom tools and pipelines are created using scripting languages like SQL or Python. While this gives
you an opportunity for customization and higher flexibility, it also requires more administration and
maintenance.
• Cloud-Based ETL Tools
These tools integrate with proprietary data sources and ingest data from different web apps or on-premises
sources. These tools move data between systems and copy, transform, and enrich data before writing it to data
warehouses or data lakes.
• Open-Source ETL Tools
Many ETL tools today are free and provide easy-to-use user interfaces for designing data exchange processes
and monitoring the flow of information. An advantage of open-source solutions is that organizations can access
the source code to study the tool infrastructure and extend the functionality.

7.Define classification in data mining.


Ans: The process of categorizing data or objects into predefined groups or classes based on their characteristics
is referred to as classification. The primary goal of classification is to develop a model that can effectively assign
a label or category to a new observation using its features. For example, a classification model could be trained
on a dataset of images labelled as either dogs or cats and then utilized to predict the class of new, unseen images
of dogs or cats based on attributes such as color, texture, and shape.
8.What is prediction? Give examples.
Ans: Refer MQP 1 Q 2
9.Define summarization.
Ans:
The term Data Summarization can be defined as the presentation of a summary/report of generated data in a
comprehensible and informative manner. To relay information about the dataset, summarization is obtained from
the entire dataset.
It is a carefully performed summary that will convey trends and patterns from the dataset in a simplified manner.
Data has become more complex hence, there is a need to summarize the data to gain useful information. Data
summarization has great importance in data mining as it can also help in deciding appropriate statistical tests to
use depending on the general trends revealed from the summarization.
10.What is clustering?
Ans: Refer MQP 1 Q 11
11.What is sequence discovery?
Ans:
Sequence Discovery or Sequential Analysis: Sequential analysis or sequence discovery is a technique used in
data mining to find patterns in data that happen in a specific order over time. It's like finding connections
between events, but the order in which they happen is important.

Long answers
1.Explain classification process with examples.
Ans:
THE DATA CLASSIFICATION PROCESS INCLUDES TWO STEPS
1. Building the Classifier or Model This step is the learning step or the learning phase. In this step the
classification algorithms build the classifier. The classifier is built from the training set made up of database
tuples and their associated class labels. A model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data.
2. Using Classifier for Classification In this step, the classifier is used for classification. Here the test data is
used to estimate the accuracy of classification rules. The classification rules can be applied to the new data
tuples if the accuracy is considered acceptable.
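The two steps can be illustrated with a hedged scikit-learn sketch; the loan-style attributes and the risky/safe labels below are invented, and scikit-learn is assumed to be available.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical loan data: [income (thousands), existing_debts] -> "safe"/"risky"
X = [[60, 0], [25, 3], [80, 1], [20, 4], [55, 1], [30, 2], [90, 0], [18, 5]]
y = ["safe", "risky", "safe", "risky", "safe", "risky", "safe", "risky"]

# Step 1: build the classifier from the training tuples and their class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use the classifier; held-out test data estimates the accuracy of its rules
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("new applicant:", model.predict([[45, 1]]))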

2.Define regression. Explain linear and logistic regression.


Ans: Regression refers to a data mining technique that is used to predict the numeric values in a given data set.
Regression involves the technique of fitting a straight line or a curve on numerous data points.
LINEAR REGRESSION

Linear regression is the type of regression that forms a relationship between the target variable and one or more
independent variables utilizing a straight line. The given equation represents the equation of linear regression.

It is a statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc. Linear regression algorithm
shows a linear relationship between a dependent (y) and one or more independent (x) variables, hence called as
linear regression. The linear regression model provides a sloped straight line representing the relationship
between the variables.
The values for the x and y variables are training datasets (data points) for the linear regression model. The model is

Y = a + b*X + e

where a represents the intercept of the line (the point where the line or curve crosses the axis of the graph; if a
point crosses the x-axis it is called the x-intercept, and if it crosses the y-axis it is called the y-intercept),
b represents the slope of the regression line, e represents the random error, X represents the predictor
(independent) variable, and Y represents the target (dependent) variable.
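A minimal sketch of fitting Y = a + b*X by ordinary least squares in plain Python; the x/y points (a made-up area-versus-price example) are assumptions of the illustration, and the error term is ignored when predicting.

# Hypothetical training points: x = house area (hundreds of sq. m), y = price (lakhs)
xs = [1.0, 1.5, 2.0, 2.5, 3.0]
ys = [20.0, 27.0, 41.0, 48.0, 61.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates: b = covariance(x, y) / variance(x), a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"Y = {a:.2f} + {b:.2f}*X")             # the fitted sloped straight line
print("prediction for X = 2.2:", a + b * 2.2)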
LOGISTIC REGRESSION

When the dependent variable is binary in nature, i.e., 0 and 1, true or false, success or failure, the logistic
regression technique comes into existence. Here, the target value (Y) ranges from 0 to 1, and it is primarily used
for classification based problems. Unlike linear regression, it does not need any independent and dependent
variables to have a linear relationship. Example: Acceptance into university based on student grades.

Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be
either Yes or No, 0 or 1, true or false, etc. In logistic regression, instead of fitting a regression line, we fit an
"S"-shaped logistic function, which predicts two maximum values (0 or 1).

3.What are the applications of linear and logistic regression?


Ans:
LINEAR REGRESSION

Applications:

1. Medical researchers can use this regression model to determine the relationship between independent
characteristics, such as age and body weight, and dependent ones, such as blood pressure. This can help
reveal the risk factors associated with diseases. They can use this information to identify high-risk
patients and promote healthy lifestyles.
2. Financial analysts use linear models to evaluate a company's operational performance and forecast returns on investment. Linear regression is also used in the capital asset pricing model, which studies the relationship between the expected investment returns and the associated market risks. It shows companies whether an investment has a fair price and contributes to decisions on whether or not to invest in the asset.

Logistic regression applications in business

1. An e-commerce company that mails expensive promotional offers to customers would like to know whether a particular customer is likely to respond to the offers or not, i.e., whether that consumer will be a "responder" or a "non-responder."
2. Likewise, a credit card company can develop a model to help it predict whether a customer is going to default on their credit card, based on characteristics such as annual income, monthly credit card payments and the number of past defaults.
3. A medical researcher may want to know the impact of a new drug on treatment outcomes across different age groups. This involves comparing the outcomes of younger and older people who never received the treatment, younger people who received the treatment, older people who received the treatment, and the spontaneous healing rate of the whole group.
4. Logistic regression has become particularly popular in online advertising, enabling marketers to predict, as a yes/no probability, whether a specific website user will click on a particular advertisement.
5. In healthcare, to identify risk factors for diseases and plan preventive measures.
6. In drug research, to learn the effectiveness of medicines on health outcomes across age, gender, etc.
7. In weather forecasting apps, to predict snowfall and weather conditions.
8. In political polls, to determine whether voters will vote for a particular candidate.
9. In banking, to predict the chance that a loan applicant will default on a loan, based on annual income, past defaults and past debts.

4.Write a note on time series analysis.


Ans: TIME SERIES ANALYSIS:
Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at consistent intervals over a set period of time rather than recording them intermittently or randomly. However, this type of analysis is not merely the act of collecting data over time: time series analysis can show how variables change over time. In other words, time is a crucial variable because it shows how the data adjusts over the course of the data points as well as in the final results. It provides an additional source of information and a set order of dependencies between the data. Time series analysis typically requires a large number of data points to ensure consistency and reliability. An extensive data set ensures you have a representative sample size and that the analysis can cut through noisy data.
Examples:
 Weather forecast
 Rainfall measurements
 Temperature readings
 Heart rate monitoring (ECG)
 Brain monitoring
 Quarterly sales
 Stock market analysis
 Automated stock trading
 Industry forecasts
 Interest rates
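As a minimal illustration of working with such ordered data, the sketch below smooths a short, invented quarterly sales series with a simple moving average, one elementary time series technique among many.

# Smoothing an ordered sequence of observations with a 3-period moving average.
sales = [120, 135, 128, 150, 165, 158, 172, 180]  # one (invented) value per quarter, in time order

window = 3
moving_avg = [sum(sales[i:i + window]) / window for i in range(len(sales) - window + 1)]
print(moving_avg)  # each value summarises three consecutive quarters, revealing the underlying trend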
5.Explain the implementation of data summarization
Ans:
Areas in which Data Summarization is implemented:
1. Centrality
2. Dispersion
3. Distribution of a Sample of Data
1) Centrality: The principle of centrality is used to describe the centre or middle value of the data. Measures used to show centrality:
Mean: This is used to calculate the numerical average of the set of values.
Mode: This shows the most frequently repeated value in a dataset.
Median: This identifies the value in the middle of all the values in the dataset when values are ranked in order.
The most appropriate measure to use will depend largely on the shape of the dataset.
2) Dispersion: The dispersion of a sample refers to how spread out the values are around the average (centre). It shows the amount of variation or diversity within the data. When the values are close to the centre, the sample has low dispersion; high dispersion occurs when they are widely scattered about the centre.

Different measures of dispersion can be used based on the dataset

Standard deviation: This provides a standard way of knowing what is normal or extra large or extra small and
helps to understand the spread of the variable from the mean. It shows how close all the values are to the mean.
Variance: This is similar to standard deviation but it measures how tightly or loosely values are spread around
the average.
Range: The range indicates the difference between the largest and the smallest values thereby showing the
distance between the extremes.
3) Distribution of a Sample of Data
The distribution of sample data values has to do with the shape, which refers to how the data values are spread across the range of values in the sample. In simple terms, it describes whether the values are clustered symmetrically around the average or whether there are more values to one side than the other.
Two ways to explore the distribution of the sample data are:
1. graphically
2. through shape statistics.
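A minimal sketch of the three areas using Python's built-in statistics module; the sample values are invented for illustration.

# Summarizing a small sample: centrality, dispersion and a quick look at distribution shape.
import statistics as st

data = [12, 15, 15, 18, 20, 22, 22, 22, 25, 90]  # 90 is a deliberately extreme value

# 1) Centrality
print("mean:", st.mean(data), "median:", st.median(data), "mode:", st.mode(data))

# 2) Dispersion
print("std dev:", round(st.stdev(data), 2),
      "variance:", round(st.variance(data), 2),
      "range:", max(data) - min(data))

# 3) Distribution shape: a mean pulled well above the median suggests a right-skewed sample.
print("skewed right (mean > median)?", st.mean(data) > st.median(data))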

6.What are the applications of cluster analysis?


Ans: Applications of cluster analysis in data mining:
1. Data analysis, market research, pattern recognition, and image processing.
2. It assists marketers in finding distinct groups in their customer base based on purchasing patterns, so that they can characterize their customer groups.
3. It helps in grouping documents on the internet for information discovery.
4. It is used in applications such as the detection of credit card fraud.
5. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent to populations.

7.What are the requirements of clustering in data mining?


Ans: REQUIREMENTS OF CLUSTERING IN DATA MINING :

1. Scalability: Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow roughly in proportion to the complexity order of the algorithm. If we raise the number of data objects 10-fold, then the time taken to cluster them should also increase only about 10 times; that is, there should be an approximately linear relationship. If that is not the case, there is some error in our implementation.
2. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape: The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only small spherical clusters.
4. Ability to deal with different types of attributes: Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data: Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality: The clustering tools should be able to handle not only low-dimensional data but also high-dimensional data spaces.

8.Explain with an example how we generate association rules.


Ans:
Steps to Generate Association Rules

 Data Preparation:
Collect and preprocess the dataset to ensure it is clean and formatted correctly. This often involves removing
duplicates, handling missing values, and transforming data into a suitable format (e.g., transactional data).
 Define Minimum Support and Confidence:
Support: This is the proportion of transactions in the dataset that contain a particular itemset. It helps to identify
how frequently an itemset appears in the dataset.
Confidence: This measures the likelihood that an item B is purchased when item A is purchased. It is calculated
as the ratio of the support of the itemset containing both A and B to the support of the itemset containing A.
 Generate Frequent Itemsets:
Use an algorithm (like Apriori or FP-Growth) to identify all itemsets that meet the minimum support threshold.
Apriori Algorithm: This algorithm works by iteratively identifying frequent itemsets. It starts with single items
and combines them to form larger itemsets, pruning those that do not meet the support threshold.
 Generate Association Rules:
From the frequent itemsets, generate rules of the form (A → B), where A and B are itemsets.
For each frequent itemset, generate all possible rules by splitting the itemset into two parts (antecedent and
consequent).
 Calculate Confidence for Each Rule:
For each generated rule, calculate the confidence to determine how strong the rule is. Only keep the rules
that meet the minimum confidence threshold.
 Evaluate and Filter Rules:
Optionally, you can also calculate other metrics such as lift, which measures how much more likely the
consequent is purchased when the antecedent is purchased compared to when it is not.
Filter the rules based on additional criteria, such as lift or interestingness, to focus on the most relevant rules.
Example: Suppose you have a dataset of transactions in a grocery store:
Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Rice, Beer, Eggs}
Transaction 3: {Milk, Rice, Beer, Cola}
Transaction 4: {Bread, Milk, Rice, Beer}
Transaction 5: {Bread, Milk, Cola}
Define Support and Confidence: Set minimum support to 40% and minimum confidence to 70%.
Generate Frequent Itemsets: Identify frequent itemsets such as {Bread}, {Milk}, {Rice}, {Beer}, {Bread, Milk}, etc.
Generate Rules: From the frequent itemsets, generate rules like {Bread} → {Milk}, {Rice} → {Beer}, etc.
Calculate Confidence: For the rule {Bread} → {Milk}, support({Bread, Milk}) = 3/5 = 60% and support({Bread}) = 4/5 = 80%, so confidence = 60% / 80% = 75%, which meets the 70% threshold.
Filter Rules: Keep only the rules that meet both the support and confidence thresholds.
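The support and confidence figures above can be reproduced with a short sketch in plain Python; this only checks individual itemsets and rules, whereas a full Apriori implementation would additionally generate and prune candidate itemsets level by level.

# Recomputing support and confidence for the grocery example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Rice", "Beer", "Eggs"},
    {"Milk", "Rice", "Beer", "Cola"},
    {"Bread", "Milk", "Rice", "Beer"},
    {"Bread", "Milk", "Cola"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # confidence(A -> B) = support(A union B) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print("support({Bread, Milk}):", support({"Bread", "Milk"}))                    # 0.6, above the 40% minimum
print("confidence(Bread -> Milk):", round(confidence({"Bread"}, {"Milk"}), 2))  # 0.75, above the 70% minimum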
