D. Nagamalai, E. Renault, M. Dhanushkodi (Eds.): CCSEIT 2011, CCIS 204, pp. 171-178, 2011.
Springer-Verlag Berlin Heidelberg 2011
Rule Acquisition in Data Mining Using a Self Adaptive Genetic Algorithm

K. Indira¹, S. Kanmani², D. Gaurav Sethia², S. Kumaran², and J. Prabhakar²
¹ Department of Computer Science, ² Department of Information Technology,
Pondicherry Engineering College, Puducherry, India
{induharini,kanmani.n,gaurav.sethia7,kumarane.90}@gmail.com, prabhakar_pec64@yahoo.co.in

Abstract. Rule acquisition is a data mining technique used to draw inferences from large databases. These inferences cannot be noticed easily without data mining. Genetic algorithms (GAs) are regarded as a global search approach for optimization problems: with a proper evaluation strategy, the best chromosome can be found among the numerous genetic combinations. The main idea of the self-adaptive genetic algorithm is to let the control parameters (crossover rate, mutation rate) adjust adaptively within a proper range and thereby reach a better solution. The experiments reported here indicate that the self-adaptive genetic algorithm converges in fewer generations and with higher accuracy than the traditional genetic algorithm.

Keywords: Association rule mining, Genetic algorithm, Crossover, Mutation, Fitness, Support, Confidence.

1 Introduction

Data mining refers to the process of searching through a large volume of data stored in a database to discover useful and interesting information that was previously unknown. Association rule mining is a type of data mining: it is the method of finding relations between entities in databases. It is mainly used in market analysis, transaction data analysis and the medical field. For example, in a medical database a diagnosis can be inferred from the given symptoms, and in a supermarket the relation between the purchases of different commodities can be obtained. Such inferences are drawn using association rule mining and can be used for making decisions.

There are several well-known techniques for association rule mining, among them Apriori, constraint-based mining, the frequent pattern growth approach and genetic algorithms.
There have been several attempts at mining association rules using genetic algorithms. The main reason for choosing a genetic algorithm for data mining is that a GA performs a global search and copes better with attribute interaction than the traditional greedy methods based on induction.

The genetic algorithm evolved from Charles Darwin's "survival of the fittest" theory. It is based on the fitness of individuals and the genetic similarity between them. Breeding occurs in every generation and eventually leads to a better, closer-to-optimal population in later generations.

[1] analyses the mining of association rules by applying genetic algorithms. [2] introduces the CBGA approach, which hybridizes constraint-based reasoning within a genetic algorithm for rule induction and uses the Apriori algorithm to improve its efficiency. [3] and [4] discuss variations of the traditional genetic algorithm in the field of data mining: [3] is based on an evolutionary strategy and [4] adopts a self-adaptive approach. The self-adaptive modification of a GA has not been attempted on association rule mining before; because it promises to improve efficiency, it is taken up here.

The main modules in the data mining process are:

i. Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
ii. Data selection: the step in which the data relevant to the analysis is decided on and retrieved from the large data collection.
iii. Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.

A brief introduction to association rule mining and GAs is given in Section 2, followed by the proposed system in Section 3. Section 4 defines the parameters used in association rule mining with SAGA. Section 5 presents the experimental results, followed by the conclusion in the last section.
2 Association Rules and Genetic Algorithm

2.1 Association Rules

An important type of knowledge acquired by many data mining systems takes the form of if-then rules [5]. Such rules state that the presence of one or more items implies or predicts the presence of other items. A typical rule has the form

If A, B, C ... n then Y

The two parameters associated with if-then rules are described below.

The confidence [6] of a given rule is a measure of how often the consequent is true, given that the antecedent is true. If the consequent is false while the antecedent is true, then the rule is false for that data item. If the antecedent is not matched by a given data item, then this item does not contribute to the determination of the confidence of the rule.

The support indicates how often the rule holds in a set of data. This is a relative measure determined by dividing the number of data items that the rule covers, i.e., that support the rule, by the total number of data items in the set.
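The two measures can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each data item is a set of item names and that a rule is a pair of item sets (antecedent, consequent); the function names and the toy transactions are hypothetical and are not taken from the paper's implementation.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all data items that contain every item of the rule."""
    rule_items = antecedent | consequent
    covered = sum(1 for t in transactions if rule_items <= t)
    return covered / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of data items matching the antecedent that also match the consequent.

    Data items that do not match the antecedent are ignored, as described above.
    Returning 0.0 when nothing matches the antecedent is an assumption of this sketch.
    """
    matched = [t for t in transactions if antecedent <= t]
    if not matched:
        return 0.0
    hits = sum(1 for t in matched if consequent <= t)
    return hits / len(matched)


# Hypothetical toy data: four supermarket baskets.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
rule_if = frozenset({"bread"})
rule_then = frozenset({"milk"})

print(support(transactions, rule_if, rule_then))     # 2 of 4 baskets -> 0.5
print(confidence(transactions, rule_if, rule_then))  # 2 of 3 bread baskets -> 0.666...
```

For the rule "If bread then milk", two of the four baskets contain both items, and two of the three baskets containing bread also contain milk.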
2.2 Genetic Algorithm

The genetic algorithm [7] is an adaptive, probabilistic search algorithm for global optimization that simulates biological evolution and genetics in the natural environment. It is suitable for highly complex global optimization problems characterized by a large search space, multiple peaks and non-linearity. The parameters of the problem to be solved are encoded in binary or decimal code (or another code such as hexadecimal) to form genes; a number of genes form a chromosome (individual); and a number of chromosomes undergo operations similar to natural selection, crossover and mutation. The process is iterated from generation to generation until the final optimized result is obtained.

The use of genetic algorithms to solve a problem involves the following key factors: encoding, fitness function, selection operator, crossover operator, mutation operator and control parameters.

In the traditional genetic algorithm, the crossover rate and mutation rate are fixed values selected on the basis of experience. It is generally held that when the crossover rate is too low, the evolutionary process easily falls into a local optimum, and the population converges prematurely because of limited population size and lack of diversity. When the crossover rate is too high, the search reaches the vicinity of the optimal point but individuals have difficulty attaining the optimum itself, which slows convergence significantly even though the diversity of the population is maintained.
Fig. 1. Flow chart of the traditional GA
3 Proposed System

To overcome the drawbacks of the traditional genetic algorithm, SAGA is proposed. SAGA changes the crossover and mutation rates adaptively [8]. The main purpose of the mutation operator is to maintain the diversity of the population and avoid stagnation of evolution. In the traditional genetic algorithm the mutation rate is fixed, and after several iterations the quality of the population gradually converges and inbreeding sets in. The self-adaptive genetic algorithm has higher robustness, global optimality and efficiency.
Procedure SAGA
Begin
    Initialize population p(k);
    Define the crossover and mutation rate;
    Do {
        Do {
            Calculate support of all k rules;
            Calculate confidence of all k rules;
            Obtain fitness;
            Select individuals for crossover / mutation;
            Calculate the average fitness of the nth and (n-1)th generations;
            Calculate the maximum fitness of the nth and (n-1)th generations;
            Based on the fitness of the selected item, calculate the new crossover and mutation rate;
            Choose the operation to be performed;
        } k times;
    } until the termination condition is met;
End
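The overall shape of the procedure can be sketched in Python as follows. This is only an illustrative skeleton under stated assumptions: the rate update in `adapt_rate` is a hypothetical placeholder (raise the rate on stagnation, lower it on progress) and is not the update rule of the paper; the fitness is passed in as a generic non-negative function rather than computed from support and confidence; and elitism, the fixed generation count and all names are choices of this sketch.

```python
import random


def adapt_rate(rate, best_now, best_prev, step=0.05, low=0.01, high=0.9):
    """HYPOTHETICAL placeholder update, not the paper's formula:
    lower the rate after an improvement, raise it after stagnation."""
    rate = rate - step if best_now > best_prev else rate + step
    return min(high, max(low, rate))


def saga(fitness, length, pop_size=20, generations=30, pc=0.9, pm=0.1, seed=0):
    """Self-adaptive GA skeleton over fixed-length binary chromosomes.

    `fitness` must return a non-negative number for a list of 0/1 genes.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_prev = max(fitness(c) for c in pop)
    history = [best_prev]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        # Tiny offset keeps the weights valid even if every score is zero.
        weights = [s + 1e-9 for s in scores]
        # Elitism (an assumption of this sketch): carry the best individual over.
        next_pop = [max(pop, key=fitness)[:]]
        while len(next_pop) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)  # fitness-proportionate
            child = a[:]
            if rng.random() < pc:                          # single-point crossover
                cut = rng.randrange(1, length)
                child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < pm else g for g in child]  # bit flips
            next_pop.append(child)
        pop = next_pop
        best_now = max(fitness(c) for c in pop)
        # Rates for the next generation depend on the last two generations.
        pm = adapt_rate(pm, best_now, best_prev)
        pc = adapt_rate(pc, best_now, best_prev)
        best_prev = best_now
        history.append(best_now)
    return max(pop, key=fitness), history, pc, pm


# Stand-in fitness for demonstration only: the number of 1-bits.
best, history, pc, pm = saga(fitness=sum, length=12)
print(best, history[-1], pc, pm)
```

In an association rule setting, the `fitness` argument would be replaced by a function that decodes the chromosome into a rule and scores it from its support and confidence.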
4 Parameters in Genetic Algorithm

4.1 Selection of Individuals

Chromosomes are selected from the population for breeding. According to Darwin's theory of evolution, the best ones should survive and create new offspring. In roulette wheel selection, parents are selected according to their fitness: the better a chromosome is, the greater its chance of being selected.
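Roulette wheel selection can be sketched in a few lines of Python. The sketch assumes non-negative fitness values; the function name, the fallback to a uniform choice when all fitness values are zero, and the toy population are assumptions made for illustration.

```python
import random
from itertools import accumulate


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Assumption of this sketch: fall back to a uniform choice.
        return rng.choice(population)
    spin = rng.uniform(0, total)  # where the wheel stops
    # Each individual owns a slice of the wheel as wide as its fitness.
    for individual, upper_edge in zip(population, accumulate(fitnesses)):
        if spin < upper_edge:
            return individual
    return population[-1]  # guard against floating-point round-off


# Hypothetical population of three rules with fitness 1, 3 and 6.
rng = random.Random(42)
population = ["a", "b", "c"]
fitnesses = [1.0, 3.0, 6.0]
counts = {name: 0 for name in population}
for _ in range(10000):
    counts[roulette_select(population, fitnesses, rng)] += 1
print(counts)  # "c" should be drawn most often and "a" least often
```

Over many spins the selection frequencies approach the fitness shares (here roughly 10%, 30% and 60%), so fitter chromosomes breed more often while weaker ones still keep a small chance.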
Fig. 2. Roulette Wheel Selection

4.2 Fitness Function

Given a particular chromosome, the fitness function returns a single numerical "fitness", or "figure of merit". This value is proportional to the "utility" or "ability" of the individual that the chromosome represents. This paper uses minimum support and minimum confidence to filter rules. The degree of correlation is then determined for the rules that satisfy both the minimum support and the minimum confidence. Taking support and confidence into account together, the fitness function is defined as follows.

(1)

In the above formula, $R_s + R_c = 1$ ($R_s \ge 0$, $R_c \ge 0$), and $Supp_{min}$ and $Conf_{min}$ are the values of minimum support and minimum confidence respectively. Evidently, if $Supp_{min}$ and $Conf_{min}$ are set to higher values, the value of the fitness function is also found to be high.

4.3 Crossover Operator

Crossover selects genes from the parent chromosomes and creates new offspring. The most common form is single-point crossover, in which one crossover point is selected on both parents: child 1 is the head of parent 1's chromosome joined to the tail of parent 2's chromosome, and child 2 is the head of parent 2 joined to the tail of parent 1.

4.4 Mutation Operator

Mutation randomly changes the new offspring. For binary encoding, a few randomly chosen bits can be switched from 1 to 0 or from 0 to 1. Mutation provides a small amount of random search and helps ensure that no point in the search space has zero probability of being examined.
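The two operators described in Sections 4.3 and 4.4 can be sketched in Python for fixed-length binary chromosomes. The function names and the use of Python lists of 0/1 values are assumptions of this sketch, not details of the paper's implementation.

```python
import random


def single_point_crossover(parent1, parent2, rng):
    """Swap the tails of two equal-length chromosomes at one random point."""
    point = rng.randrange(1, len(parent1))  # both children keep a non-empty head and tail
    child1 = parent1[:point] + parent2[point:]  # head of 1, tail of 2
    child2 = parent2[:point] + parent1[point:]  # head of 2, tail of 1
    return child1, child2


def bit_flip_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


rng = random.Random(7)
zeros, ones = [0] * 8, [1] * 8
child1, child2 = single_point_crossover(zeros, ones, rng)
print(child1, child2)                        # complementary head/tail patterns
print(bit_flip_mutation(child1, 0.1, rng))   # child1 with occasional flipped bits
```

Using all-zero and all-one parents makes the crossover point visible: child 1 is a run of zeros followed by ones, and child 2 is its complement.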
4.5 Number of Generations

The generational process of mining association rules by a genetic algorithm is repeated until a termination condition is reached. Common terminating conditions are:

i. A solution is found that satisfies the minimum criteria.
ii. A fixed number of generations is reached.
iii. The fitness of the highest-ranking solution is reaching or has reached a plateau, such that successive iterations no longer produce better results.
iv. Manual inspection.
v. Combinations of the above.

4.6 Self Adaptive

With a fixed mutation probability $P_m$, a small value leaves the mutation operator with little effect, while a large value can destroy the good genes of the population, so that the algorithm slows down or may even fail to converge. Here, an adaptive mutation rate is used as follows:

(2)

$p_m^{n}$ is the mutation rate of the nth generation and $p_m^{(n+1)}$ is the mutation rate of the (n+1)th generation. The mutation rate of the first generation is $p_m^{0}$. $f_i(n)$ is the fitness of individual i in the nth generation. $f_{max}^{(n+1)}$ is the highest fitness in the (n+1)th generation. m is the number of individuals in the population. The remaining coefficient in (2) is the adjustment factor.

5 Experimental Studies

The objective of this study is to compare the accuracy achieved on different datasets using a traditional GA and SAGA. The chromosomes use binary encoding with fixed length. The fitness function adopted is the one given in Section 4.2. Three datasets from the UCI Machine Learning Repository, namely Lenses, Haberman and Car Evaluation, have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances. The second dataset, Haberman, has 4 attributes and 306 instances. The final one, the Car Evaluation dataset, has 6 attributes and 1728 instances. The algorithm is implemented in Java. The accuracy and the convergence rate are recorded in the tables below.
Accuracy is the count of rules matching between the original dataset and the resulting population, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed. The parameters for the fitness function and the other parameters are chosen for the best performance of a traditional GA: they are set such that convergence is fastest and the number of matches is maximum. SAGA is then run with the same parameters, and the results obtained are tabulated below. Accuracy is given as a percentage with respect to the number of matches.

Table 1. Default GA parameters

Parameter                Value
Population size          Varies as per the dataset
Initial crossover rate   0.9
Initial mutation rate    0.1
Selection method         Roulette wheel selection
Minimum support          0.2
Minimum confidence       0.8

Table 2. Accuracy comparison between GA and SAGA when parameters are ideal for traditional GA

                  Traditional GA                       Self Adaptive GA
Dataset           Accuracy (%)   No. of generations    Accuracy (%)   No. of generations
Lenses            75             38                    87.5           35
Haberman          52             36                    68             28
Car Evaluation    85             29                    96             21

Table 3. Accuracy comparison between GA and SAGA when parameters are set according to the termination of SAGA

                  Traditional GA                       Self Adaptive GA
Dataset           Accuracy (%)   No. of generations    Accuracy (%)   No. of generations
Lenses            50             35                    87.5           35
Haberman          36             38                    68             28
Car Evaluation    74             36                    96             21
From Table 2 we can conclude that the self-adaptive GA performs better than a traditional GA in both aspects, i.e. the convergence rate and the accuracy. The accuracy for the Haberman dataset is low because one of the attributes in the data is age. As age can take a wide range of values and only perfect matches are counted, the accuracy comes down.
When the algorithm comes to an end, the values of the mutation rate and crossover rate have changed because of self-adaptivity. If these new values are used as the initial values for a traditional GA, the performance of the GA is as shown in Table 3. The table shows that the accuracy of the traditional GA goes down if the parameters are set to the mutation rate reached at the termination of SAGA. This is because, when SAGA ends, the mutation rate may have taken a high value, which brings down the accuracy when applied to a GA. The fitness threshold plays a major role in deciding the efficiency of the rules mined and the convergence of the system.

6 Conclusion

Genetic algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimum results in mining association rules. When a genetic algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. Once the optimum values are fixed for the individual parameters, making the algorithm self-adaptive increases the efficiency, because it changes the mutation and crossover rates adaptively and thus makes the algorithm more intelligent. When the mutation rate is varied with respect to the result from the previous generation, the accuracy increases. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.

References

1. Collard, M., Francisi, D.: Evolutionary Data Mining: An Overview of Genetic-Based Algorithms. In: 8th IEEE International Conference on Emerging Technologies and Factory Automation, vol. 1, pp. 3-9 (2001)
2. Chiu, C., Hsu, P.-l.: A Constraint Based Genetic Algorithm Approach for Mining Classification Rules. IEEE Transactions on Systems, Man and Cybernetics 35, 305-320 (2005)
3. Saggar, M., Agarwal, A.K., Lad, A.: Optimization of Association Rule Mining using Improved Genetic Algorithms. IEEE Transaction on System, Man and Cybernetics 4, 3725-3729 (2004)
4. Zhu, X., Yu, Y., Guo, X.: Genetic Algorithm Based on Evolution Strategy and the Application in Data Mining. In: First International Workshop on Education Technology and Computer Science, ETCS 2009, vol. 1, pp. 848-852 (2009)
5. Cattral, R., Oppacher, F., Dwego, D.: Rule Acquisition with Genetic Algorithm. In: Congress on Evolutionary Computation, CEC 1999, vol. 1 (1999)
6. Dai, S., Gao, L., Zhu, Q., Zhu, C.: A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules. In: IEEE Conference on Computer and Information Science, pp. 977-980 (2007)
7. Wu, Y.-T., An, Y.J., Geller, J., Wu, Y.T.: A Data Mining Based Genetic Algorithm. In: IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems (2006)
8. Li, J., Feng, H.R.: A Self-Adaptive Genetic Algorithm Based on Real Code, pp. 1-4. Capital Normal University, CNU (2010)