DM Module 4
Cluster analysis
Cluster analysis, also known as clustering, is a method of data mining that groups similar
data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters)
such that the data points within each group are more similar to each other than to data points
in other groups. This process is often used for exploratory data analysis and can help identify
patterns or relationships within the data that may not be immediately obvious. There are many
different algorithms used for cluster analysis, such as k-means, hierarchical clustering, and
density-based clustering. The choice of algorithm will depend on the specific requirements
of the analysis and the nature of the data being analyzed.
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It
is an unsupervised machine learning technique that acts on unlabelled data. Data points that
are similar to one another are grouped together, and each such group of objects forms a cluster.
The given data is divided into different groups by combining similar objects into a group.
Each group is a cluster: a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles
such as cars, buses, and bicycles. Because this is unsupervised learning, there are no class
labels such as Car or Bike attached to the records; all the data is mixed together and is not in
a structured form.
Our task is to convert the unlabelled data into labelled data, and this can be done using
clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, such as a cars
cluster that contains all the cars, a bikes cluster that contains all the bikes, and so on.
In short, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering must deal
with huge databases. In order to handle extensive databases, the clustering algorithm should
be scalable; if it is not, it cannot produce appropriate results on such data, which leads to
wrong conclusions.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as
well as small, low-dimensional datasets.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with
clustering algorithms. The algorithm should be capable of dealing with different types of data,
such as discrete, categorical, interval-based (numeric), and binary data.
4. Dealing with unstructured data: There would be some databases that contain missing
values, and noisy or erroneous data. If the algorithms are sensitive to such data then it may
lead to poor quality clusters. So it should be able to handle unstructured data and give some
structure to the data by organising it into groups of similar data objects. This makes the job
of the data expert easier in order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. Interpretability reflects how easily the clustering results can be understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to make partitions on the data in order to form clusters. If
“n” partitions are made on “p” objects of the database, then each partition is represented by a
cluster and n ≤ p. The two conditions which need to be satisfied by this partitioning
clustering method are:
• Each object must belong to exactly one group.
• There should be no group without even a single object.
In the partitioning method, there is a technique called iterative relocation, which means an
object may be moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. Hierarchical methods are classified on the basis of how the hierarchical
decomposition is formed. There are two approaches for creating the hierarchical
decomposition:
• Agglomerative Approach: The agglomerative approach is also known as the
bottom-up approach. Initially, each object is placed in its own separate group.
The method then keeps merging the objects or groups that are close to one
another, i.e., that exhibit similar properties. This merging process continues
until the termination condition holds.
• Divisive Approach: The divisive approach is also known as the top-down
approach. In this approach, we start with all the data objects in a single
cluster. The cluster is then divided into smaller clusters by continuous
iteration. The iteration continues until the termination condition is met or
until each cluster contains only one object.
Once a group is split or merged, it can never be undone; hierarchical clustering is therefore a
rigid and not very flexible method. Two approaches that can be used to improve the quality of
hierarchical clustering in data mining are:
• One should carefully analyse the linkages between objects at every partitioning of
the hierarchical clustering.
• One can combine hierarchical agglomeration with other clustering techniques.
In this approach, the objects are first grouped into micro-clusters, and then
macro-clustering is performed on the micro-clusters.
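As an illustration (not part of the original notes), here is a minimal sketch of bottom-up (agglomerative) clustering using scikit-learn, assuming the library is available; the toy data points are invented for the example:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Small two-dimensional toy dataset: two visually separated groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

# Bottom-up merging; "ward" linkage merges the pair of clusters that
# increases the total within-cluster variance the least
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] (cluster numbering may vary)
```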
Grid-Based Method: In the grid-based method, a grid is formed over the objects, i.e., the
object space is quantized into a finite number of cells that form a grid structure. One of the
major advantages of the grid-based method is its fast processing time, which depends only on
the number of cells in each dimension of the quantized space and not on the number of data
objects.
Model-Based Method: In the model-based method, a model is hypothesized for each of the
clusters in order to find the best fit of the data to the given model. A density function, which
reflects the spatial distribution of the data points, is used to locate the clusters. This method
also provides a way to automatically determine the number of clusters based on standard
statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user
expectation or the properties of the desired clustering results. Constraints provide us with an
interactive way of communication with the clustering process. The user or the application
requirement can specify constraints.
Applications Of Cluster Analysis:
• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies
and identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
Limitations of Cluster Analysis:
• It can be difficult to interpret the results of the analysis if the clusters are not well-
defined.
• It can be computationally expensive for large datasets.
• The results of the analysis can be affected by the choice of clustering algorithm used.
• It is important to note that the success of cluster analysis depends on the data, the
goals of the analysis, and the ability of the analyst to interpret the results.
Here we will see the working of the K-means algorithm in detail.
K-Means (A Centroid-based Technique):
The K-means algorithm takes the input parameter K from the user and partitions the dataset containing
N objects into K clusters so that the resulting similarity among the data objects inside a group
(intra-cluster similarity) is high, while the similarity of data objects with data objects outside the cluster
(inter-cluster similarity) is low. The similarity of a cluster is measured with respect to the mean value of
the cluster. It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the
dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the
nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then
recalculated from the objects assigned to it, and the process repeats until the assignments no longer change.
Algorithm: K-means
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from D as the initial cluster means (centres).
2. Assign each remaining object to the cluster whose mean it is closest to.
3. Recompute the mean of each cluster from the objects currently assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
Example:
Suppose we want to group the visitors to a website using just their age as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration-3 and Iteration-4, so we stop. Therefore, we get the two clusters
(16–29) and (36–66) using the K-means algorithm.
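A minimal sketch of this same example using scikit-learn (assumed available); the initial centroids are fixed to 16 and 22 so that the result mirrors the hand trace above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Ages of website visitors (one feature per sample)
ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

# Fix the initial centroids to 16 and 22 to mirror the worked example
kmeans = KMeans(n_clusters=2, init=np.array([[16.0], [22.0]]), n_init=1)
labels = kmeans.fit_predict(ages)

print("Cluster centres:", kmeans.cluster_centers_.ravel())  # ~20.50 and ~48.89
for label in (0, 1):
    print(f"Cluster {label}:", ages[labels == label].ravel())
```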
Density-based clustering
Density-based clustering refers to unsupervised ML approaches that find discrete clusters in the dataset,
based on the notion that a cluster/group in a dataset is a continuous area of high point density that is
isolated from other clusters by sparse regions. Data points lying in the dividing, sparse zones are
typically regarded as noise or outliers.
Clustering Techniques
Several density-based techniques can be used to locate clusters in point data; the most widely used
one is DBSCAN:
Density-Based Spatial Clustering Of Applications With Noise (DBSCAN) : Clusters are dense
regions in the data space, separated by regions of the lower density of points. The DBSCAN algorithm
is based on this intuitive notion of “clusters” and “noise”. The key idea is that for each point of a cluster,
the neighbourhood of a given radius has to contain at least a minimum number of points.
Parameters Required for DBSCAN Algorithm
eps: It defines the neighbourhood around a data point, i.e., if the distance between two points is lower
than or equal to ‘eps’ then they are considered neighbours. If the eps value is chosen too small, a large
part of the data will be treated as outliers. If it is chosen very large, the clusters will merge and the
majority of the data points will end up in the same cluster. One way to choose the eps value is based
on the k-distance graph.
MinPts: The minimum number of neighbours (data points) within the eps radius. The larger the dataset,
the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be
derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be
at least 3.
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has at least MinPts points within its eps-neighbourhood.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighbourhood of a
core point.
Noise or outlier: A point that is neither a core point nor a border point.
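A minimal DBSCAN sketch using scikit-learn (assumed available); the coordinates are invented so that two dense blobs appear plus one isolated point that should be labelled as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should be flagged as noise
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
              [9.0, 0.0]])

# eps: neighbourhood radius, min_samples: MinPts
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # points labelled -1 are noise/outliers
```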
Grid-based clustering
Grid-based clustering algorithms are more concerned with the value space surrounding the data points
than with the data points themselves. One of the greatest advantages of these algorithms is their
reduction in computational complexity, which makes them appropriate for dealing with humongous
data sets.
After partitioning the data set into cells, the algorithm computes the density of the cells, which helps
in identifying the clusters. A few grid-based clustering algorithms are as follows:
STING (Statistical Information Grid Approach): In STING, the data space is divided recursively in
a hierarchical manner, with each cell further sub-divided into a number of smaller cells. The statistical
measures of each cell are pre-computed and stored, which helps in answering queries in a very small
amount of time.
Wave Cluster: In this algorithm, the data space is treated as an n-dimensional signal and represented
in the form of wavelets. A wavelet transformation is applied to the original feature space in order to
find dense regions in the transformed space. The parts of the signal with low frequency and high
amplitude indicate regions where the data points are concentrated; these regions are identified as
clusters by the algorithm. The parts of the signal where the frequency is high represent the boundaries
of the clusters.
Model-based clustering
Model-based clustering is a statistical approach to data clustering. The observed (multivariate) data is
considered to have been created from a finite combination of component models. Each component
model is a probability distribution, generally a parametric multivariate distribution.
For instance, in a multivariate Gaussian mixture model, each component is a multivariate Gaussian
distribution. The component responsible for generating a particular observation determines the cluster
to which the observation belongs.
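A minimal sketch of model-based clustering with a Gaussian mixture, assuming scikit-learn is available; the data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian components
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=6.0, scale=1.0, size=(100, 2))])

# Each mixture component plays the role of one cluster
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

print("Component means:\n", gmm.means_)
print("Soft membership of the first point:", gmm.predict_proba(X[:1]))
```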
Model-based clustering attempts to optimize the fit between the given data and some mathematical
model, and is based on the assumption that the data are generated by a mixture of underlying
probability distributions.
Machine learning approach − Machine learning is an approach that builds complex algorithms for
processing huge amounts of data and provides results to its users. It uses complex programs that can
learn from experience and make predictions.
The algorithms improve themselves through repeated input of training data. The main objective of
machine learning is to learn from data and to build models from data that can be understood and used
by humans.
Limitations
• The assumption that the attributes are independent of each other is often too strong because
correlation can exist.
• It is not suitable for clustering data from large databases, because it can produce skewed trees
and requires the computation of expensive probability distributions.
Neural Network Approach − The neural network approach represents each cluster as an exemplar,
which acts as a prototype of the cluster. New objects are assigned to the cluster whose exemplar is the
most similar, according to some distance measure.
Constrained Clustering
Constrained clustering is an approach to clustering data that incorporates domain knowledge in the
form of constraints. The input data, the constraints, and the domain knowledge are all processed by
the clustering-with-constraints process, which produces the resulting clusters as output.
There are various methods for clustering with constraints, each able to handle specific kinds of
constraints:
Handling hard constraints: Hard constraints are handled by taking the constraints into account
directly in the cluster-assignment procedure, so that no assignment ever violates them.
Generating super instances for must-link constraints: Must-link constraints are transitive, so their
transitive closure can be computed; in other words, must-link constraints define an equivalence
relation. This relation partitions the objects into subsets, and each subset of objects connected by
must-link constraints can be replaced by its mean, forming a “super instance” (see the sketch below).
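A sketch of this super-instance idea, assuming only NumPy and the Python standard library: the transitive closure of the must-link pairs is computed with a simple union-find, and each resulting subset is replaced by its mean.

```python
import numpy as np

def must_link_super_instances(X, must_links):
    """Merge points connected by must-link constraints into mean 'super instances'."""
    parent = list(range(len(X)))

    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:            # transitive closure of must-link pairs
        union(i, j)

    groups = {}
    for idx in range(len(X)):
        groups.setdefault(find(idx), []).append(idx)

    # Replace each equivalence class by the mean of its members
    return [np.mean(X[members], axis=0) for members in groups.values()]

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])
print(must_link_super_instances(X, must_links=[(0, 1), (2, 3)]))
```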
Handling soft constraints: Clustering with soft constraints is an optimization process in which
violating a constraint incurs a penalty; the aim is to optimize the clustering quality while minimizing
the total constraint-violation penalty. For example, given a data set and a set of constraints, the CVQE
(Constrained Vector Quantization Error) algorithm extends K-means clustering with a constraint-
violation penalty. The objective of CVQE is the total distance used in K-means, adjusted with the
following penalties:
• Penalty for a must-link violation: This penalty occurs when a must-link constraint on objects
x and y is violated because they are assigned to two different centres c1 and c2. The distance
between c1 and c2 is then added to the objective function as a penalty.
• Penalty for a cannot-link violation: This penalty is different from the must-link case. It occurs
when a cannot-link constraint on objects x and y is violated because both are assigned to the
same centre c. The distance between c and its nearest other centre c′ is then added to the
objective function as a penalty.
Outlier analysis
Outlier analysis is the process of identifying and examining data points that significantly differ from
the rest of the dataset. Outliers can be caused by a variety of factors, such as measurement errors,
unexpected events, data processing errors, or they may simply be legitimate data points that fall outside
of the normal range.
Outlier analysis is an important step in data analysis, as it can help to identify and remove erroneous or
inaccurate observations that might otherwise skew conclusions. It can also be used to identify legitimate
outliers that may be of interest, such as fraud cases or rare events.
Types of Outliers
Outliers are divided into three different types
1. Global or point outliers
2. Collective outliers
3. Contextual or conditional outliers
Global Outliers
Global outliers are also called point outliers. Global outliers are the simplest form of outliers: a data
point is a global outlier when it deviates from all the rest of the data points in a given data set. In most
cases, outlier detection procedures are targeted at determining global outliers (for instance, a single
point lying far away from the main mass of points in a scatter plot).
Collective Outliers
In a given set of data, when a group of data points deviates from the rest of the data set, they are
called collective outliers. Here, an individual data object may not be an outlier, but when the data
objects are considered as a whole, they behave as outliers. To identify these types of outliers, you
need background information about the relationship between the behaviours of the different data
objects. For example, in an intrusion detection system, a single denial-of-service (DoS) packet sent
from one system to another may be treated as normal behaviour. However, if this happens on various
computers simultaneously, it is considered abnormal behaviour, and taken as a whole these data
points are called collective outliers.
Contextual Outliers
As the name suggests, “contextual” means that these outliers are introduced within a context. For
example, in speech recognition, background noise is an outlier only in the context of speech.
Contextual outliers are also known as conditional outliers. These outliers occur when a data object
deviates from the other data points because of a specific condition in the given data set. As we know,
data objects have two types of attributes: contextual attributes (which define the context, such as time
and location) and behavioural attributes (which define the characteristics being examined, such as
temperature). Contextual outlier analysis enables users to examine outliers in different contexts and
conditions, which can be useful in various applications. For example, a temperature reading of 45
degrees Celsius may behave as an outlier in a rainy season but as a normal data point in a summer
season; similarly, a low temperature value in June is a contextual outlier, while the same value in
December is not.
There are a number of different techniques that can be used for outlier analysis, including:
Statistical methods: These methods identify outliers based on their distance from the rest of
the data. Some common statistical methods for outlier detection include z-scores, the median
absolute deviation (MAD), and the Tukey fences.
Machine learning algorithms: These algorithms can be used to learn the distribution of the
data and identify outliers that fall outside of the learned distribution. Some common machine
learning algorithms for outlier detection include isolation forests, one-class support vector
machines (SVMs), and k-nearest neighbours (KNN).
Data visualization: This can be a helpful way to identify outliers by visually inspecting the
data. For example, you can plot the data on a scatter plot or histogram to see if there are any
data points that fall far outside of the main distribution.
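The statistical methods mentioned above can be sketched with z-scores and Tukey fences, assuming NumPy is available; the readings below are invented for the example:

```python
import numpy as np

# Mostly "normal" readings plus one suspicious value (95)
data = np.array([10, 12, 11, 13, 12] * 4 + [95])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# Tukey fences: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("Tukey outliers:", data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])
```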
The purpose of outlier analysis is to identify and understand outliers in a dataset. This can be
done for a variety of reasons, such as:
To improve the accuracy of data analysis: Outliers can skew the results of data analysis,
so it is important to identify and remove them before performing any analysis.
To identify fraudulent or suspicious activity: Outliers can sometimes indicate fraudulent
or suspicious activity. For example, an outlier in a financial dataset could indicate money
laundering or fraud.
To discover new insights: Outliers can sometimes reveal new insights about the data. For
example, an outlier in a customer satisfaction dataset could indicate a problem with a
particular product or service.
Outlier analysis is a valuable tool that can be used to improve the accuracy of data analysis,
identify fraudulent or suspicious activity, and discover new insights. If you are working with
a dataset, it is important to consider whether outlier analysis is appropriate for your needs.
Graph mining
Graph mining is a subfield of data mining that deals with the extraction of knowledge from
graphs. Graphs are data structures that represent relationships between entities. For example,
a social network graph can represent the relationships between people, such as friends, co-
workers, or family members.
Graph mining algorithms can be used to find patterns in graphs, such as frequent subgraphs,
influential nodes, and communities. These patterns can be used to answer a variety of
questions, such as:
• What are the most common relationships between people in a social network?
• Who are the most influential people in a social network?
• What are the different communities in a social network?
Graph mining is a powerful tool that can be used to extract valuable insights from graphs. It
is used in a variety of applications, such as:
Social network analysis: Graph mining is used to analyze social networks to identify patterns
and relationships between people. This information can be used to improve marketing
campaigns, identify influencers, and prevent fraud.
Fraud detection: Graph mining is used to detect fraud in financial transactions. By analyzing
the relationships between transactions, graph mining algorithms can identify patterns that are
indicative of fraud.
Biological network analysis: Graph mining is used to analyze biological networks, such as
protein-protein interaction networks. This information can be used to identify potential drug
targets and understand the mechanisms of disease.
Graph mining is a rapidly growing field, and new algorithms are being developed all the time.
As the amount of graph data continues to grow, graph mining will become an increasingly
important tool for extracting knowledge from data.
Common graph mining tasks include the following:
Frequent subgraph mining: This task involves finding subgraphs that occur frequently in a
graph. Frequent subgraphs can be used to identify patterns in graphs, such as common social
interactions or biological pathways.
Community detection: This task involves finding groups of nodes that are densely connected
to each other. Communities can be used to identify groups of people with similar interests or
groups of genes that work together.
Link prediction: This task involves predicting whether two nodes in a graph are likely to be
connected. Link prediction can be used to recommend new friends on social media or to
identify potential fraudsters.
Influential node identification: This task involves finding nodes that have a large impact on
the graph. Influential nodes can be used to target marketing campaigns or to identify key
players in a social network.
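As an illustration of two of these tasks, a minimal sketch using the NetworkX library (assumed available); the graph is the small karate-club social network bundled with the library:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Small social network bundled with NetworkX
G = nx.karate_club_graph()

# Community detection: greedy modularity maximisation groups densely
# connected nodes together
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")

# Influential node identification: degree centrality as a simple proxy
top = sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)[:3]
print("Most connected nodes:", top)
```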
Data mining of complex data types
Data mining of complex data types is the process of extracting knowledge from data that is
not easily represented in a traditional tabular format. Complex data types can include
images, text, audio, video, and time series data.
There are a number of different techniques that can be used for data mining of complex
data types, including:
• Image mining: This is the process of extracting knowledge from images. Image
mining techniques can be used to identify objects in images, classify images, and
find patterns in images.
• Text mining: This is the process of extracting knowledge from text. Text mining
techniques can be used to identify keywords, phrases, concepts, and relationships in
text.
• Audio mining: This is the process of extracting knowledge from audio. Audio
mining techniques can be used to identify sounds, classify audio, and find patterns
in audio.
• Video mining: This is the process of extracting knowledge from video. Video
mining techniques can be used to identify objects in videos, classify videos, and find
patterns in videos.
• Time series mining: This is the process of extracting knowledge from time series
data. Time series mining techniques can be used to identify trends, patterns, and
anomalies in time series data.
Data mining of complex data types is a challenging task, but it can be a powerful tool for
extracting valuable insights from data. By using the right techniques, you can find patterns
and trends in complex data that would be difficult to identify using traditional methods.
Here are some of the benefits of data mining of complex data types:
• It can help you to understand your customers better. By analyzing customer reviews,
social media posts, and other text data, you can learn about their needs, wants, and
pain points. This information can be used to improve your products and services.
• It can help you to identify new market opportunities. By analyzing trends in image,
audio, and video data, you can identify new products or services that are in demand.
This information can help you to grow your business.
• It can help you to improve your marketing campaigns. By understanding how
people interact with your content, you can tailor your marketing messages to be
more effective. This can help you to reach more customers and improve your
conversion rates.
If you are looking for a way to extract valuable insights from complex data, then data
mining of complex data types is a powerful tool that you should consider.
Here are some of the challenges of data mining of complex data types:
• The data can be noisy and inconsistent. This can make it difficult to find patterns
and trends in the data.
• The data can be large and complex. This can make it difficult to store and process
the data.
• There are a limited number of techniques available for data mining of complex data
types. This can make it difficult to find the right techniques for your data.
Despite these challenges, data mining of complex data types remains a powerful tool for extracting
valuable insights from data.
What is Spatial Data Mining?
A spatial database stores a large amount of space-related data, such as maps, pre-
processed remote sensing or medical imaging data, and VLSI chip layout data. Spatial
databases have many features distinguishing them from relational databases.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands an
integration of data mining with spatial database technologies.
It can be used for understanding spatial data, discovering spatial relationships and
relationships between spatial and nonspatial data, constructing spatial knowledge bases,
reorganizing spatial databases, and optimizing spatial queries.
A crucial challenge to spatial data mining is the exploration of efficient spatial data mining
techniques due to the huge amount of spatial data and the complexity of spatial data types and
spatial access methods.
“Can we construct a spatial data warehouse?” Yes, as with relational data, we can integrate
spatial data to construct a data warehouse that facilitates spatial data mining.
A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-
volatile collection of both spatial and nonspatial data in support of spatial data mining and
spatial-data related decision-making processes.
Three types of dimensions can be distinguished in a spatial data cube:
1. A nonspatial dimension contains only nonspatial data (e.g., temperature described as hot or cold).
2. A spatial-to-nonspatial dimension is a dimension whose primitive-level data are spatial but whose
generalization, beyond some high level, becomes nonspatial (e.g., a city generalized to a region or
country name).
3. A spatial-to-spatial dimension is a dimension whose primitive level and all of its high-level
generalized data are spatial (e.g., equi-temperature regions).
Multimedia data mining
Mining associations in multimedia data is the process of finding patterns in multimedia data whose
items are associated with each other. Several kinds of associations can be mined:
Finding relationships between image content and non-image content features: For example, you could
find that images that contain a certain type of object are more likely to be viewed by people who are
interested in a particular topic.
Finding associations among image contents that are not related to spatial relationships: For example,
you could find that images that contain a certain color are more likely to be viewed together with images
that contain a certain shape.
Finding associations among image contents related to spatial relationships: For example, you could find
that images that contain a certain object are more likely to be located in a certain part of the image.
There are a number of different techniques that can be used to mine associations in multimedia data,
including:
Association rule mining: This is a well-known technique for finding patterns in transactional data. It
can be used to find associations between image content and non-image content features, as well as
between image contents that are not related to spatial relationships.
Spatial association rule mining: This is a specialized technique for finding associations between image
contents that are related to spatial relationships. It takes into account the spatial location of objects in
an image when finding associations.
Content-based image retrieval: This is a technique for finding images that are similar to a given image.
It can be used to find associations between images that contain the same or similar content.
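A minimal sketch of the association-rule idea applied to image content, assuming only the Python standard library; the per-image tag sets are hypothetical and simply illustrate frequent co-occurrence counting:

```python
from itertools import combinations
from collections import Counter

# Hypothetical tag sets extracted from images
images = [
    {"beach", "sunset", "people"},
    {"beach", "sunset"},
    {"mountain", "snow"},
    {"beach", "people"},
    {"mountain", "snow", "people"},
]

# Count how often each pair of tags appears together (pair support)
pair_counts = Counter()
for tags in images:
    for pair in combinations(sorted(tags), 2):
        pair_counts[pair] += 1

min_support = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # pairs of tags that co-occur in at least 2 images
```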
Mining associations in multimedia data is a powerful tool that can be used to find valuable insights
from multimedia data. It can be used to improve the accuracy of image retrieval, recommend images to
users, and detect fraud.
Here are some examples of how mining associations in multimedia data can be used:
A retailer could use association rule mining to find that images of products that are frequently purchased
together are more likely to be viewed by customers who are interested in those products.
A museum could use spatial association rule mining to find that images of paintings that are located
near each other are more likely to be viewed together by visitors.
A bank could use content-based image retrieval to find images of fraudulent transactions.
These are just a few examples of how mining associations in multimedia data can be used. As the
amount of multimedia data continues to grow, this technique will become increasingly important for
finding valuable insights from data.
Text mining
Text mining, also known as text data mining or text analytics, is the process of extracting knowledge
from text. It is a subfield of data mining that deals with the extraction of knowledge from unstructured
text data. Text mining can be used to extract keywords, phrases, concepts, and relationships from text.
It can also be used to identify patterns and trends in text.
There are a number of different techniques that can be used for text mining, including:
Natural language processing (NLP): This is a field of computer science that deals with the interaction
between computers and human (natural) languages. NLP techniques can be used to extract keywords,
phrases, and concepts from text.
Machine learning: This is a field of computer science that deals with the development of algorithms
that can learn from data. Machine learning algorithms can be used to identify patterns and trends in text.
Statistical methods: This is a field of mathematics that deals with the collection, analysis,
interpretation, and presentation of data. Statistical methods can be used to identify patterns and trends
in text.
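As an illustration of extracting keywords statistically, a short TF-IDF sketch with scikit-learn (assumed available); the review texts are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the battery life of this phone is excellent",
    "terrible battery, the phone died in two hours",
    "great camera and excellent screen on this phone",
]

# TF-IDF weighs words that are frequent in a document but rare across documents
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Print the top-weighted keywords per document
for i, row in enumerate(tfidf.toarray()):
    top = sorted(zip(terms, row), key=lambda kv: kv[1], reverse=True)[:3]
    print(f"Doc {i}: {[t for t, w in top]}")
```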
Text mining is a powerful tool that can be used to find valuable insights from text. It can be used for a
variety of purposes, such as:
Sentiment analysis: This is the process of identifying the sentiment of text, such as whether it is
positive, negative, or neutral. Sentiment analysis can be used to understand customer sentiment, identify
trends in social media, and track the effectiveness of marketing campaigns.
Topic modelling: This is the process of identifying the topics that are discussed in text. Topic modelling
can be used to understand the content of documents, identify trends in news articles, and track the
evolution of topics over time.
Relation extraction: This is the process of identifying the relationships between entities in text.
Relation extraction can be used to understand the relationships between people, products, and events.
Text mining is a rapidly growing field, and new techniques are being developed all the time. As the
amount of text data continues to grow, text mining will become an increasingly important tool for
extracting knowledge from data.
Here are some of the benefits of text mining:
• It can help you to understand your customers better. By analyzing customer reviews and social
media posts, you can learn about their needs, wants, and pain points. This information can be
used to improve your products and services.
• It can help you to identify new market opportunities. By analyzing trends in text data, you can
identify new products or services that are in demand. This information can help you to grow
your business.
• It can help you to improve your marketing campaigns. By understanding how people interact
with your content, you can tailor your marketing messages to be more effective. This can help
you to reach more customers and improve your conversion rates.
Web mining
Web mining is the process of extracting knowledge from the World Wide Web (WWW). It is a subfield
of data mining that deals with the extraction of knowledge from web documents, web usage data, and
web structure.
Web content mining: This is the process of extracting knowledge from the content of web pages. This
can be done by extracting keywords, phrases, and concepts from the text of web pages.
Web usage mining: This is the process of extracting knowledge from the usage of web pages. This can
be done by tracking the pages that users visit, the links that they click on, and the time that they spend
on each page.
Web structure mining: This is the process of extracting knowledge from the structure of the WWW.
This can be done by analyzing the links between web pages and the relationships between web pages.
There are a number of different techniques that can be used for web mining, including:
Text mining: This is a technique for extracting knowledge from text. It can be used to extract keywords,
phrases, and concepts from the text of web pages.
Web crawler: This is a program that automatically traverses the WWW and collects web pages.
Web server log analysis: This is the process of analyzing the logs of web servers to extract information
about the usage of web pages.
Link analysis: This is the process of analyzing the links between web pages to extract information
about the structure of the WWW.
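The link-analysis idea can be sketched with PageRank from the NetworkX library (assumed available); the page names and links below are made up for the example:

```python
import networkx as nx

# Directed graph: an edge A -> B means page A links to page B
web = nx.DiGraph([
    ("home", "about"), ("home", "products"),
    ("about", "home"), ("products", "home"),
    ("blog", "home"), ("blog", "products"),
])

# PageRank scores pages by link structure: pages linked from
# important pages become important themselves
scores = nx.pagerank(web, alpha=0.85)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page}: {score:.3f}")
```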
Web mining is a powerful tool that can be used to find valuable insights from the WWW. It can be used
for a variety of purposes, such as:
Personalization: This is the process of presenting web pages to users in a way that is tailored to their
interests.
Recommendation systems: This is the process of recommending web pages to users based on their
past behaviour.
Fraud detection: This is the process of detecting fraudulent activity on the WWW.
Market research: This is the process of collecting data about the WWW to understand the behaviour
of users and the market.
Web mining is a rapidly growing field, and new techniques are being developed all the time. As the
WWW continues to grow, web mining will become an increasingly important tool for extracting
knowledge from data.
The web mining process typically involves the following steps:
Data collection: The first step is to collect data from the WWW. This can be done by using a web
crawler to collect web pages or by analyzing web server logs.
Data pre-processing: The collected data needs to be pre-processed before it can be mined. This
includes tasks such as cleaning the data, removing noise, and normalizing the data.
Data mining: The pre-processed data can then be mined to extract knowledge. This can be done using
a variety of techniques, such as text mining, web crawler, and web server log analysis.
Data interpretation: The extracted knowledge needs to be interpreted to make sense of it. This can be
done by using statistical methods, machine learning algorithms, or human experts.
Knowledge presentation: The interpreted knowledge needs to be presented in a way that is
understandable and useful. This can be done by using visualization techniques, reports, or dashboards.
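A small sketch of the web server log analysis step, assuming logs in the common log format used by servers such as Apache or Nginx; the sample lines are invented:

```python
import re
from collections import Counter

# "Common log format" lines (invented samples)
log_lines = [
    '192.168.1.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.9 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.168.1.5 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120',
]

# Extract the client IP, the HTTP method, and the requested page
pattern = re.compile(r'^(\S+) .*?"(GET|POST) (\S+)')
page_hits = Counter()
visitors = Counter()

for line in log_lines:
    m = pattern.match(line)
    if m:
        ip, _, page = m.groups()
        page_hits[page] += 1
        visitors[ip] += 1

print("Most requested pages:", page_hits.most_common())
print("Most active visitors:", visitors.most_common())
```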
Web mining is a complex process, but it can be a powerful tool for extracting knowledge from the
WWW. By following the steps outlined above, you can start to mine the WWW to extract valuable
insights.