D. Nagamalai, E. Renault, M. Dhanushkodi (Eds.): CCSEIT 2011, CCIS 204, pp. 171–178, 2011.

© Springer-Verlag Berlin Heidelberg 2011


Rule Acquisition in Data Mining Using a Self Adaptive Genetic Algorithm

K. Indira¹, S. Kanmani², D. Gaurav Sethia², S. Kumaran², and J. Prabhakar²

¹ Department of Computer Science,
² Department of Information Technology,
Pondicherry Engineering College,
Puducherry, India
{induharini,kanmani.n,gaurav.sethia7,kumarane.90}@gmail.com,
prabhakar_pec64@yahoo.co.in

Abstract. Rule acquisition is a data mining technique used to draw inferences
from large databases; such inferences are not easily noticed without data
mining. Genetic algorithms (GAs) are regarded as a global search approach for
optimization problems: with a proper evaluation strategy, the best chromosome
can be found among the numerous genetic combinations. The main idea of the
self-adaptive genetic algorithm is to let the control parameters (crossover
rate, mutation rate) adjust adaptively within a proper range and thereby reach
a better solution. The experiments show that the self-adaptive genetic
algorithm converges faster and attains higher accuracy than the traditional
genetic algorithm.
Keywords: Association rule mining, Genetic algorithm, Crossover, Mutation,
Fitness, Support, Confidence.
1 Introduction
Data mining refers to the process of searching through a large volume of data
stored in a database to discover useful, interesting and previously unknown
information. Association rule mining is a type of data mining: it is the method
of finding the relations between entities in databases. Association rule mining
is mainly used in market analysis, transaction data analysis and the medical
field. For example, in a medical database a diagnosis can be inferred from the
symptoms, and in a supermarket the relation between the purchases of different
commodities can be obtained. Such inferences are drawn using association rule
mining and can be used for making decisions.
There are several well known techniques for association rule mining, among them
Apriori, constraint-based mining, the Frequent Pattern Growth approach and
genetic algorithms. There have been several attempts at mining association
rules using a genetic algorithm.
The main reason for choosing a genetic algorithm for data mining is that a GA
performs a global search and copes better with attribute interaction than the
traditional greedy methods based on induction.
The genetic algorithm evolved from Charles Darwin's "survival of the fittest"
theory. It is based on the fitness of individuals and the genetic similarity
between them. Breeding occurs in every generation and eventually leads to a
better, more optimal group in the later generations. [1] analyses the mining of
association rules by applying genetic algorithms.
[2] introduces the CBGA approach, which hybridizes constraint-based reasoning
within a genetic algorithm for rule induction. The CBGA approach uses the
Apriori algorithm to improve its efficiency.
[3], [4] discuss some variations of the traditional genetic algorithm in the
field of data mining. [3] is based on an evolutionary strategy and [4] adopts a
self-adaptive approach. The self-adaptive modification of a GA has not
previously been attempted for association rule mining; as it promises to
improve efficiency, it is taken up here.
The main modules in the data mining process are
i. Data cleaning: also known as data cleansing, a phase in which noisy
data and irrelevant data are removed from the collection.
ii. Data selection: at this step, the data relevant to the analysis is decided
on and retrieved from the large data collection.
iii. Data mining: the crucial step, in which clever techniques are applied to
extract potentially useful patterns.
A brief introduction to association rule mining and GAs is given in Section 2,
followed by the proposed system in Section 3. Section 4 defines the parameters
used in association rule mining with SAGA. Section 5 presents the experimental
results, followed by the conclusion in the last section.
2 Association Rules and Genetic Algorithm
2.1 Association Rules
An important type of knowledge acquired by many data mining systems takes the
form of if-then rules [5]. Such rules state that the presence of one or more items
implies or predicts the presence of other items. A typical rule has the form
If A, B, C, ... then Y
The two parameters with respect to if-then rules are described below.
The confidence [6] for a given rule is a measure of how often the consequent
is true, given that the antecedent is true. If the consequent is false while the antecedent
is true, then the rule is also false. If the antecedent is not matched by a given data
item, then this item does not contribute to the determination of the confidence of
the rule.
The support indicates how often the rule holds in a set of data. This is a
relative measure determined by dividing the number of data items that the rule
covers, i.e., that support the rule, by the total number of data items in the set.
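
The two measures can be made concrete with a short sketch. The Python below is illustrative only: the function names `support` and `confidence`, the representation of each record as a set of items, and the toy transactions are assumptions made here, not part of the paper's implementation. Support is taken over all records, while confidence is taken only over the records that match the antecedent, as described above.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all records that contain every item of the rule."""
    rule_items = antecedent | consequent
    covered = sum(1 for record in transactions if rule_items <= record)
    return covered / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among records matching the antecedent, the fraction matching the consequent."""
    matched = [record for record in transactions if antecedent <= record]
    if not matched:
        return 0.0  # no record matches the antecedent, so nothing contributes
    return sum(1 for record in matched if consequent <= record) / len(matched)


# Toy market-basket data (hypothetical).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]
rule_if = frozenset({"bread"})
rule_then = frozenset({"milk"})

print(support(transactions, rule_if, rule_then))     # 0.5 (2 of 4 records)
print(confidence(transactions, rule_if, rule_then))  # 2 of the 3 "bread" records
```

The record containing only milk lowers neither value of confidence, because it does not match the antecedent.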


2.2 Genetic Algorithm
A genetic algorithm [7] simulates biological evolution and genetics in the
natural environment to form an adaptive, probabilistic search algorithm for
global optimization. Genetic algorithms are suited to highly complex global
optimization problems characterized by large search spaces, multiple peaks and
non-linearity. The parameters of the problem are encoded as genes in binary or
decimal code (other codes, such as hexadecimal, are also possible); a number of
genes form a chromosome (an individual), and a number of chromosomes form a
population. The population undergoes operations modelled on natural selection,
crossover and mutation, iterated repeatedly (that is, inherited from generation
to generation) until the final optimized result is reached. Using a genetic
algorithm to solve a problem involves the following key factors: encoding,
fitness function, selection operator, crossover operator, mutation operator and
control parameters.
In the traditional genetic algorithm, the crossover rate and mutation rate are
fixed values selected on the basis of experience. It is generally believed that
when the crossover rate is too low, the evolutionary process easily falls into
a local optimum and the population converges prematurely, owing to the limited
population size and the lack of diversity. When the crossover rate is too high,
the search reaches the vicinity of the optimal point but individuals have
difficulty reaching the optimum itself, which slows convergence significantly
even though the diversity of the population is preserved.

Fig. 1. Flow chart of the traditional GA

3 Proposed System
To overcome the drawbacks of the traditional genetic algorithm, a self-adaptive
genetic algorithm (SAGA) is proposed. SAGA changes the crossover and mutation
rates adaptively [8]. The main purpose of the mutation operator is to maintain
the diversity of the population and avoid stagnation of evolution. In the
traditional genetic algorithm the mutation rate is fixed, so after several
iterations the quality of the population gradually converges and inbreeding
sets in. The adaptive genetic algorithm offers higher robustness, global
optimality and efficiency.

Procedure SAGA

Begin
  Initialize population p(k);
  Define the initial crossover and mutation rates;
  Do
  {
    Do
    {
      Calculate support of all k rules;
      Calculate confidence of all k rules;
      Obtain fitness;
      Select individuals for crossover / mutation;
      Calculate the average fitness of the nth and (n-1)th generations;
      Calculate the maximum fitness of the nth and (n-1)th generations;
      Based on the fitness of the selected individual, calculate the new
      crossover and mutation rates;
      Choose the operation to be performed;
    } k times;
  } until the termination condition is met;
End

4 Parameters in Genetic Algorithm
4.1 Selection of Individuals
Chromosomes are selected from the population for breeding. According to Darwin's
theory of evolution, the best individuals should survive and create new
offspring. In roulette wheel selection, parents are selected according to their
fitness: the fitter a chromosome is, the greater its chance of being selected.


Fig. 2. Roulette Wheel Selection
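
Roulette wheel selection can be sketched in a few lines. The Python below is a minimal illustration, assuming non-negative fitness values; the function name `roulette_wheel_select`, the uniform fallback for an all-zero wheel, and the sample rules are hypothetical choices made here rather than details taken from the paper.

```python
import random


def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate wheel: fall back to a uniform choice (an assumption).
        return rng.choice(population)
    pick = rng.random() * total  # spin the wheel: a point in [0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit  # each individual owns a slice as wide as its fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point rounding


rng = random.Random(42)  # seeded only to make the demonstration repeatable
rules = ["rule_a", "rule_b", "rule_c"]
fitness_values = [0.1, 0.6, 0.3]
parents = [roulette_wheel_select(rules, fitness_values, rng) for _ in range(1000)]
print({rule: parents.count(rule) for rule in rules})
```

Over many spins the selection counts approach the fitness proportions, so fitter chromosomes breed more often while weaker ones keep a small chance of being chosen.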
4.2 Fitness Function
Given a particular chromosome, the fitness function returns a single numerical
"fitness", or "figure of merit". This value is proportional to the "utility" or
"ability" of the individual that the chromosome represents.
This paper adopts minimum support and minimum confidence for filtering rules.
The degree of correlation is then determined for the rules that satisfy the
minimum support and minimum confidence. Taking both support and confidence into
account, the fitness function is defined as follows.
(1)
In the above formula, $R_s + R_c = 1$ ($R_s \ge 0$, $R_c \ge 0$), and
$\mathrm{Supp}_{\min}$ and $\mathrm{Conf}_{\min}$ are the respective values of
minimum support and minimum confidence. By all appearances, if
$\mathrm{Supp}_{\min}$ and $\mathrm{Conf}_{\min}$ are set to higher values, then
the value of the fitness function is also found to be high.
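
As a rough illustration of how the quantities named above could fit together, the Python sketch below first filters a rule by the minimum thresholds and then combines support and confidence with weights that sum to 1. The weighted-sum form, the function name `rule_fitness`, and the default weights of 0.5 are assumptions made for illustration and should not be read as Equation (1); the threshold defaults are the values listed in Table 1.

```python
def rule_fitness(supp, conf, supp_min=0.2, conf_min=0.8, r_s=0.5, r_c=0.5):
    """Hypothetical fitness: threshold filter, then a weighted sum.

    r_s and r_c play the role of R_s and R_c and must satisfy
    r_s + r_c = 1 with both weights non-negative.
    """
    if r_s < 0 or r_c < 0 or abs(r_s + r_c - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if supp < supp_min or conf < conf_min:
        return 0.0  # rule filtered out by minimum support / confidence
    return r_s * supp + r_c * conf  # assumed combination, not Equation (1)


print(rule_fitness(0.5, 0.9))  # passes both thresholds
print(rule_fitness(0.1, 0.9))  # 0.0: support below the minimum
```

Changing `r_s` and `r_c` shifts the emphasis between frequently occurring rules and reliable ones.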
4.3 Crossover Operator
Crossover selects genes from the parent chromosomes and creates new offspring.
The most common form is single-point crossover, in which one crossover point is
selected on both parents: child 1 takes the head of parent 1's chromosome and
the tail of parent 2's, and child 2 takes the head of parent 2's and the tail
of parent 1's.
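
Single-point crossover on fixed-length binary chromosomes can be sketched as follows. This is a minimal Python illustration; the function name `single_point_crossover` and the list-of-bits representation are assumptions made here.

```python
import random


def single_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length chromosomes at one random point."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    # The point lies strictly inside the chromosome, so each child
    # receives at least one gene from each parent.
    point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]  # head of 1, tail of 2
    child2 = parent2[:point] + parent1[point:]  # head of 2, tail of 1
    return child1, child2


rng = random.Random(7)  # seeded only to make the demonstration repeatable
child1, child2 = single_point_crossover([0] * 8, [1] * 8, rng)
print(child1)
print(child2)
```

With an all-zero and an all-one parent, the position where the bits change in each child shows the crossover point directly.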
4.4 Mutation Operator
Mutation randomly changes the new offspring. For binary encoding, a few
randomly chosen bits are switched from 1 to 0 or from 0 to 1. Mutation provides
a small amount of random search and helps ensure that no point in the search
space has zero probability of being examined.
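
One common way to realize this for binary encoding is to flip each bit independently with a small probability. The Python sketch below assumes that per-bit scheme; the function name `bit_flip_mutation` is hypothetical, and the paper does not state which variant its implementation uses.

```python
import random


def bit_flip_mutation(chromosome, mutation_rate, rng):
    """Return a copy in which each bit is flipped with probability mutation_rate."""
    return [1 - bit if rng.random() < mutation_rate else bit for bit in chromosome]


rng = random.Random(3)  # seeded only to make the demonstration repeatable
offspring = [1, 0, 1, 1, 0, 0, 1, 0]
print(bit_flip_mutation(offspring, 0.1, rng))
```

A rate of 0 leaves the chromosome unchanged and a rate of 1 inverts every bit; useful values lie close to the lower end of that range.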

4.5 Number of Generations
The generational process of mining association rules by a genetic algorithm is
repeated until a termination condition has been reached. Common terminating
conditions are:
i. A solution is found that satisfies minimum criteria.
ii. A fixed number of generations is reached.
iii. The highest-ranking solution's fitness is reaching or has reached a
plateau such that successive iterations no longer produce better results.
iv. Manual inspection.
v. Combinations of the above.
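
The automatic conditions in this list can be combined into a single check, sketched below in Python. The function name `should_stop` and the parameters `target_fitness`, `patience` and `tol` are hypothetical names introduced for illustration; manual inspection is left out because it cannot be automated.

```python
def should_stop(best_fitness_history, max_generations,
                target_fitness=None, patience=5, tol=1e-6):
    """Decide whether to stop, given the best fitness of each generation so far."""
    generation = len(best_fitness_history)
    if generation == 0:
        return False
    # A solution satisfying the minimum criteria has been found.
    if target_fitness is not None and best_fitness_history[-1] >= target_fitness:
        return True
    # The fixed number of generations has been reached.
    if generation >= max_generations:
        return True
    # Plateau: no meaningful gain over the last `patience` generations.
    if generation > patience:
        gain = best_fitness_history[-1] - best_fitness_history[-1 - patience]
        if gain <= tol:
            return True
    return False


print(should_stop([0.1, 0.2], max_generations=100))   # False: still improving
print(should_stop([0.5] * 10, max_generations=100))   # True: plateau reached
```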
4.6 Self Adaptive
With a fixed mutation probability Pm, a small value of Pm leaves the mutation
operator with little effect, while a large value can destroy the population's
good genes, so that the algorithm converges slowly or not at all. Here, an
adaptive mutation rate is used, as follows:
(2)
$p_m^{n}$ is the mutation rate of the nth generation and $p_m^{(n+1)}$ is the
mutation rate of the (n+1)th generation. The first-generation mutation rate is
$p_m^{0}$. $f_i^{(m)}$ is the fitness of individual i in the nth population.
$f_{\max}^{(n+1)}$ is the highest fitness in the (n+1)th population.
$f_i^{(n)}$ is the fitness of individual i in the nth generation. m is the
number of individuals in the population. The formula also contains an
adjustment factor.
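
The general idea of adapting the rate from one generation to the next can be illustrated with a deliberately simple rule. The Python sketch below is not Equation (2): the update rule, the function name `adapt_mutation_rate`, and the parameters `factor`, `low` and `high` are assumptions made here. It lowers the rate when the maximum or average fitness improved over the previous generation and raises it otherwise, keeping the result within a fixed range.

```python
def adapt_mutation_rate(rate, prev_fitnesses, curr_fitnesses,
                        factor=0.1, low=0.01, high=0.5):
    """Assumed rule: shrink the rate on improvement, grow it on stagnation."""
    prev_avg = sum(prev_fitnesses) / len(prev_fitnesses)
    curr_avg = sum(curr_fitnesses) / len(curr_fitnesses)
    improved = (max(curr_fitnesses) > max(prev_fitnesses)
                or curr_avg > prev_avg)
    if improved:
        new_rate = rate * (1.0 - factor)  # progress: disturb good genes less
    else:
        new_rate = rate * (1.0 + factor)  # stagnation: add diversity
    return min(high, max(low, new_rate))  # clamp to the allowed range


print(adapt_mutation_rate(0.1, [0.2, 0.4], [0.3, 0.5]))  # rate decreases
print(adapt_mutation_rate(0.1, [0.3, 0.5], [0.3, 0.5]))  # rate increases
```

Here `factor` plays a role loosely analogous to an adjustment factor, controlling how strongly the rate reacts to each generation.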
5 Experimental Studies
The objective of this study is to compare the accuracy achieved on different
datasets using a traditional GA and SAGA. The chromosomes use fixed-length
binary encoding. The fitness function adopted is the one defined in Section 4.2.
Three datasets from the UCI Machine Learning Repository, namely Lenses,
Haberman and Car Evaluation, are used for experimentation. The Lenses dataset
has 4 attributes and 24 instances. The Haberman dataset has 4 attributes and
306 instances. The Car Evaluation dataset has 6 attributes and 1728 instances.
The algorithm is implemented in Java.
The accuracy and the convergence rate are recorded in Table 2. Accuracy is the
number of rules that match between the original dataset and the resulting
population, divided by the number of instances in the dataset. The convergence
rate is the generation at which the fitness value becomes fixed.
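
Both measures are simple to compute, as the Python sketch below shows. The function names and the sample fitness trace are hypothetical. The figure of 21 matches is chosen only because 21 of the 24 Lenses instances corresponds to 87.5%; the paper reports percentages, not match counts.

```python
def accuracy_percent(matching_rules, num_instances):
    """Matches between dataset and final population, as a percentage."""
    return 100.0 * matching_rules / num_instances


def convergence_generation(best_fitness_per_generation, tol=1e-9):
    """First generation (1-based) after which the best fitness no longer changes."""
    if not best_fitness_per_generation:
        raise ValueError("fitness trace is empty")
    final = best_fitness_per_generation[-1]
    generation = len(best_fitness_per_generation)
    # Walk backwards while the preceding generation already had the final value.
    while generation > 1 and abs(best_fitness_per_generation[generation - 2] - final) <= tol:
        generation -= 1
    return generation


print(accuracy_percent(21, 24))                             # 87.5
print(convergence_generation([0.2, 0.5, 0.7, 0.7, 0.7]))    # 3
```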
The parameters of the fitness function and the other parameters are chosen for
the best performance of a traditional GA, that is, so that convergence is
fastest and the number of matches is maximal. SAGA is then run with the same
parameters, and the results obtained are given in Table 2. Accuracy is
expressed as a percentage of the number of matches.
Table 1. Default GA parameters

Parameter                Value
Population size          Varies as per the dataset
Initial crossover rate   0.9
Initial mutation rate    0.1
Selection method         Roulette wheel selection
Minimum support          0.2
Minimum confidence       0.8

Table 2. Accuracy comparison between GA and SAGA when parameters are ideal for
traditional GA

                 Traditional GA                      Self Adaptive GA
Dataset          Accuracy (%)   No. of Generations   Accuracy (%)   No. of Generations
Lenses           75             38                   87.5           35
Haberman         52             36                   68             28
Car Evaluation   85             29                   96             21

Table 3. Accuracy comparison between GA and SAGA when parameters are according to
termination of SAGA

                 Traditional GA                      Self Adaptive GA
Dataset          Accuracy (%)   No. of Generations   Accuracy (%)   No. of Generations
Lenses           50             35                   87.5           35
Haberman         36             38                   68             28
Car Evaluation   74             36                   96             21


From Table 2 we can conclude that the self-adaptive GA performs better than the
traditional GA in both respects, i.e. the convergence rate and the accuracy.
The accuracy for the Haberman dataset is low because one of the attributes in
the data is age. As age can take a wide range of values and only perfect
matches are counted, the accuracy comes down.

When the algorithm terminates, the mutation rate and crossover rate differ from
their initial values because of self-adaptation. If these final values are used
as the fixed rates of a traditional GA, the performance of the GA is as shown
in Table 3.
Table 3 shows that the accuracy of the traditional GA goes down when its
parameters are set to the mutation rate that SAGA has reached at termination.
This is because the mutation rate may have taken a high value by the time SAGA
ends, and such a value, applied to a GA, brings down the accuracy. The fitness
threshold plays a major role in deciding the efficiency of the rules mined and
the convergence of the system.
6 Conclusion
Genetic algorithms have been used to solve difficult optimization problems in a
number of fields and have proved to produce optimum results in mining
association rules. When a genetic algorithm is used for mining association
rules, the GA parameters decide the efficiency of the system. Once the optimum
values are fixed for the individual parameters, making the algorithm
self-adaptive increases the efficiency, because the mutation and crossover
rates change adaptively, making the algorithm more intelligent. When the
mutation rate is varied according to the result of the previous generation, the
accuracy increases. The efficiency of the methodology could be further explored
on more datasets with varying numbers of attributes.
References
1. Collard, M., Francisci, D.: Evolutionary Data Mining: An Overview of Genetic-Based
Algorithms. In: 8th IEEE International Conference on Emerging Technologies and Factory
Automation, vol. 1, pp. 3–9 (2001)
2. Chiu, C., Hsu, P.-l.: A Constraint Based Genetic Algorithm Approach for Mining
Classification Rules. IEEE Transactions on Systems, Man and Cybernetics 35, 305–320
(2005)
3. Saggar, M., Agarwal, A.K., Lad, A.: Optimization of Association Rule Mining using
Improved Genetic Algorithms. IEEE, Transaction on System, Man and Cybernetics 4,
3725–3729 (2004)
4. Zhu, X., Yu, Y., Guo, X.: Genetic Algorithm Based on Evolution Strategy and the
Application in Data Mining. In: First International Workshop on Education Technology and
Computer Science, ETCS 2009, vol. 1, pp. 848–852 (2009)
5. Cattral, R., Oppacher, F., Deugo, D.: Rule Acquisition with Genetic Algorithm. In:
Congress on Evolutionary Computation, CEC 1999, vol. 1 (1999)
6. Dai, S., Gao, L., Zhu, Q., Zhu, C.: A Novel Genetic Algorithm Based on Image Databases
for Mining Association Rules. In: IEEE Conference on Computer and Information Science,
pp. 977–980 (2007)
7. Wu, Y.-T., An, Y.J., Geller, J., Wu, Y.T.: A Data Mining Based Genetic Algorithm. In:
IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems
(2006)
8. Li, J., Feng, H.R.: A Self-Adaptive Genetic Algorithm Based on Real Code, pp. 1–4.
Capital Normal University, CNU (2010)
