CSE 634 Data Mining Concepts & Techniques
Prof: Anita Wasilewska
Genetic Algorithms (GAs)
By: Group 1: Abhishek Sharma, Mikhail Rubnich, George Iordache, Marcela Boboila
General description of the method By: Abhishek Sharma
References
• Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2003
• Data Mining Techniques: Class Lecture Notes and PP Slides.
• http://cs.felk.cvut.cz/~xobitko/ga/
• Massachusetts Institute of Technology, Prof. de Weck and Prof. Willcox, Multidisciplinary System Design Optimization Course Lecture Notes on Heuristic Techniques, "A Basic Introduction to Genetic Algorithms": http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
History of Genetic Algorithms
• "Evolutionary Computing" was introduced in the 1960s by I. Rechenberg.
• Professor John Holland of the University of Michigan, in his book "Adaptation in Natural and Artificial Systems", explored the concept of using mathematically based artificial evolution as a method to conduct a structured search for solutions to complex problems.
• Dr. David E. Goldberg, in his 1989 landmark text "Genetic Algorithms in Search, Optimization and Machine Learning", suggested applications for genetic algorithms in a wide range of engineering fields.
What Are Genetic Algorithms (GAs)?
• Genetic Algorithms are search and optimization techniques based on Darwin's Principle of Natural Selection: "problems are solved by an evolutionary process resulting in a best (fittest) solution (survivor)". In other words, the solution is evolved.
1. Inheritance: offspring acquire characteristics
2. Mutation: change, to avoid similarity
3. Natural Selection: variations improve survival
4. Recombination: crossover
Genetics
Chromosome
• All living organisms consist of cells. Each cell contains the same set of chromosomes.
• Chromosomes are strings of DNA and consist of genes, which are blocks of DNA.
• Each gene encodes a trait, for example eye color.
Reproduction
• During reproduction, recombination (or crossover) occurs first: genes from the parents combine to form a whole new chromosome. The newly created offspring can then be mutated. These changes are mainly caused by errors in copying genes from the parents.
• The fitness of an organism is measured by the organism's success in life (survival).
Citation: http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
Principle Of Natural Selection
• "Select The Best, Discard The Rest"
• Two elements are required for any problem before a genetic algorithm can be used to solve it:
• A method for representing a solution (encoding), e.g. a string of bits, numbers, or characters
• A method for measuring the quality of any proposed solution, using a fitness function, e.g. determining total weight
GA Elements Citation: http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
Search Space
• When solving a problem, we usually look for the solution that is best among many candidates. The space of all feasible solutions (the set of objects that contains the desired solution) is called the search space (also state space). Each point in the search space represents one feasible solution, and each feasible solution can be "marked" by its value, or fitness, for the problem.
• Initialization: Many individual solutions are randomly generated to form an initial population that is intended to cover the entire range of possible solutions (the search space).
• Selection: A proportion of the existing population is selected to breed a new generation.
• Reproduction: A second-generation population of solutions is generated from those selected, through the genetic operators crossover and mutation.
• Termination: The algorithm stops when one of the following holds:
  - A solution is found that satisfies minimum criteria
  - A fixed number of generations is reached
  - The allocated budget (computation time/money) is reached
  - The highest-ranking solution's fitness is reaching or has reached a plateau
Methodology Associated with GAs
Begin → Initialize population → Evaluate solutions (T = 0, first step) → Optimum solution?
• Y: Stop
• N: Selection → Crossover → Mutation → T = T + 1 (go to next step) → Evaluate solutions
Citation: http://cs.felk.cvut.cz/~xobitko/ga/
Creating a GA on Computer
Simple_Genetic_Algorithm() {
    Initialize the Population;
    Calculate Fitness Function;
    While (Fitness Value != Optimal Value) {
        Selection;   // Natural Selection, Survival Of Fittest
        Crossover;   // Reproduction, Propagate favorable characteristics
        Mutation;    // Mutation
        Calculate Fitness Function;
    }
}
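The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the fitness function (count of 1 bits), the tournament-of-two selection, and all parameter values are hypothetical choices, and the loop is capped at `max_generations` because the optimal value may never be reached.

```python
import random

def fitness(chromosome):
    # Hypothetical fitness for illustration: the number of 1 bits.
    return sum(chromosome)

def select(population, rng):
    # Tournament of two: keep the fitter of two random individuals.
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent_a, parent_b, rng):
    # Single-point crossover producing one child.
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

def simple_genetic_algorithm(length=20, pop_size=30, rate=0.02,
                             max_generations=200, seed=0):
    rng = random.Random(seed)
    # Initialize the population with random bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):
        if fitness(best) == length:      # optimal value reached
            break
        # Selection, crossover and mutation build the next generation.
        population = [
            mutate(crossover(select(population, rng),
                             select(population, rng), rng), rate, rng)
            for _ in range(pop_size)
        ]
        best = max(population + [best], key=fitness)
    return best
```

Calling `simple_genetic_algorithm()` returns the best bit string seen; whether it is the all-ones optimum depends on the seed and parameters.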
Encoding
• The process of representing the solution in the form of a string that conveys the necessary information.
• Just as each gene in a chromosome controls a particular characteristic of the individual, each element in the string represents a characteristic of the solution.
Encoding Methods
• Binary Encoding: the most common method of encoding. Chromosomes are strings of 1s and 0s, and each position in the chromosome represents a particular characteristic of the problem.
  Chromosome A: 10110010110011100101
  Chromosome B: 11111110000000011111
• Permutation Encoding: useful in ordering problems such as the Traveling Salesman Problem (TSP). In TSP, every chromosome is a string of numbers, each of which represents a city to be visited.
  Chromosome A: 1 5 3 2 6 4 7 9 8
  Chromosome B: 8 5 6 7 2 3 1 4 9
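As a rough illustration of the two encodings, the sketch below builds one random chromosome of each kind. The lengths (20 bits, 9 cities) mirror the example chromosomes; the helper `is_valid_permutation` and the fixed seed are assumptions made only for this example.

```python
import random

rng = random.Random(1)  # fixed seed, only so the sketch is repeatable

# Binary encoding: a fixed-length list of 0/1 genes.
binary_chromosome = [rng.randint(0, 1) for _ in range(20)]

# Permutation encoding: every city number appears exactly once.
cities = list(range(1, 10))
permutation_chromosome = rng.sample(cities, len(cities))

def is_valid_permutation(chromosome, items):
    # A permutation chromosome must contain each item once and only once.
    return sorted(chromosome) == sorted(items)
```

A check such as `is_valid_permutation` matters because ordinary bit-string crossover can duplicate or drop cities, which is one reason permutation encodings usually need their own operators.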
Encoding Methods (contd.)
• Value Encoding: used in problems where complicated values, such as real numbers, are needed and binary encoding would not suffice. It works well for some problems, but it is often necessary to develop specific crossover and mutation techniques for these chromosomes.
  Chromosome A: 1.235 5.323 0.454 2.321 2.454
  Chromosome B: (left), (back), (left), (right), (forward)
Encoding Methods (contd.)
• Tree Encoding: used mainly for evolving programs or expressions, i.e. for genetic programming. Every chromosome is a tree of objects, such as values and arithmetic operators, or commands in a programming language.
  ( + x ( / 5 y ) )
  ( do_until step wall )
Citation: http://ocw.mit.edu/NR/rdonlyres/Aeronautics-and-Astronautics/16-888Spring-2004/D66C4396-90C8-49BE-BF4A-4EBE39CEAE6F/0/MSDO_L11_GA.pdf
GA Operators By: Mikhail Rubnich
References
• Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2003
• http://www.ai-junkie.com/ga/intro/gat2.html
• http://www.faqs.org/faqs/ai-faq/genetic/part2/
• http://en.wikipedia.org/wiki/Genetic_algorithms
Basic GA Operators
• Recombination
• Crossover: looking for solutions near existing solutions
• Mutation: looking at completely new areas of the search space
Fitness function
• Quantifies the optimality of a solution (that is, a chromosome) so that the chromosome can be ranked against all the other chromosomes.
• A fitness value is assigned to each solution depending on how close it is to solving the problem.
• An ideal fitness function correlates closely with the goal and is quick to compute.
• For instance, in the knapsack problem: fitness function = total value of the things in the knapsack.
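A knapsack fitness function along these lines might look like the sketch below. The binary chromosome (1 = item packed), the `capacity` parameter, and the rule that an overweight knapsack scores zero are assumptions for illustration; the slides only state that fitness is the total value of the packed items.

```python
def knapsack_fitness(chromosome, values, weights, capacity):
    # chromosome[i] == 1 means item i is packed.
    total_weight = sum(w for gene, w in zip(chromosome, weights) if gene)
    if total_weight > capacity:
        return 0  # assumption: infeasible (overweight) solutions score zero
    return sum(v for gene, v in zip(chromosome, values) if gene)
```

For example, with hypothetical values `[60, 100, 120]`, weights `[10, 20, 30]` and capacity 50, the chromosome `[0, 1, 1]` scores 220, while `[1, 1, 1]` weighs 60 and scores 0.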
Recombination (selection)
Main idea: "Select The Best, Discard The Rest". The process that chooses which solutions are preserved and allowed to reproduce and which ones must die out.
• The main goal of the recombination operator is to emphasize the good solutions and eliminate the bad solutions in a population (while keeping the population size constant).
So, how to select the best? • Roulette Selection • Rank Selection • Steady State Selection • Tournament Selection
Roulette wheel selection
Main idea: the fitter the solution, the greater its chance of being chosen.
HOW IT WORKS?
Example of Roulette wheel selection Citation: : www.cs.vu.nl/~gusz/
Roulette wheel selection All you have to do is spin the ball and grab the chromosome at the point it stops
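One common way to implement the spin is shown below as a sketch. It assumes non-negative fitness values with a positive total; the function name and the optional `rng` parameter are illustrative choices, not taken from the slides.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    # Each individual owns a slice of the wheel proportional to its fitness.
    total = sum(fitnesses)
    spin = rng.uniform(0, total)      # where the ball stops
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return individual
    return population[-1]             # guard against floating-point round-off
```

With fitnesses `[1, 3, 6]`, the third individual should be picked in roughly 60% of spins, the second in about 30%, and the first in about 10%.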
Crossover
Main idea: combine the genetic material (bits) of 2 "parent" chromosomes (solutions) to produce a new "child" possessing characteristics of both "parents".
How does it work? There are several methods.
Crossover methods
• Single-Point Crossover: a random point is chosen on the individual chromosomes (strings) and the genetic material is exchanged at this point.
Citation: http://www.ewh.ieee.org/soc/es/May2001/14/CROSS0.GIF
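A minimal sketch of single-point crossover on strings or lists follows. It assumes both parents have the same length of at least 2; the function name and `rng` parameter are illustrative.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    # Pick a cut point strictly inside the chromosome, then swap the tails.
    point = rng.randint(1, len(parent_a) - 1)
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2
```

Applied to two 20-bit strings, each child keeps the head of one parent and the tail of the other.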
Crossover methods
• Two-Point Crossover: two random points are chosen on the individual chromosomes (strings) and the genetic material is exchanged at these points.
NOTE: These chromosomes are different from the last example.
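Two-point crossover can be sketched the same way: the segment between the two cut points is swapped. The sketch assumes equal-length parents of length at least 3; names are illustrative.

```python
import random

def two_point_crossover(parent_a, parent_b, rng=random):
    # Choose two distinct interior cut points i < j and swap the middle segment.
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_1 = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_2 = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_1, child_2
```

Crossing `"0000000000"` with `"1111111111"` gives a child of zeros with one contiguous run of ones in the middle, and its complement.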
Crossover methods
• Uniform Crossover: each gene (bit) of the offspring is selected randomly from the corresponding genes of the two parent chromosomes.
NOTE: In this form, uniform crossover yields ONLY 1 offspring.
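A sketch of uniform crossover producing a single offspring is given below. The 50/50 choice between parents at each position is an assumption; the slides only say each gene is selected randomly from the corresponding parent genes.

```python
import random

def uniform_crossover(parent_a, parent_b, rng=random):
    # For every position, copy the gene from one parent chosen at random.
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]
```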
Crossover (contd.) • Crossover between 2 good solutions MAY NOT ALWAYS yield a better or as good a solution. • Since parents are good, probability of the child being good is high. • If offspring is not good (poor solution), it will be removed in the next iteration during “Selection”.
Elitism
Main idea: copy the best chromosomes (solutions) to the new population before applying crossover and mutation.
• When a new population is created by crossover or mutation, the best chromosome might be lost.
• Elitism forces the GA to retain some number of the best individuals at each generation.
• Elitism has been found to improve performance significantly.
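Elitism can be layered on top of any breeding step, as in this sketch. The `breed` callback (which is expected to return the requested number of offspring) and the default `elite_count` are hypothetical; they stand in for whatever selection, crossover and mutation pipeline is in use.

```python
def next_generation_with_elitism(population, fitness, breed, elite_count=2):
    # Copy the best individuals unchanged into the new population.
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:elite_count]
    # Fill the remaining slots with offspring, keeping the population size constant.
    offspring = breed(population, len(population) - elite_count)
    return elites + offspring
```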
Mutation
Main idea: random inversion of bits in a solution to maintain diversity in the population.
Example: giraffes, where mutations could be beneficial.
Citation: http://www.ewh.ieee.org/soc/es/May2001/14/MUTATE0.GIF
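Bit inversion is short to express in code. In this sketch the per-bit mutation probability `rate` is an assumed parameter; the slides do not give a value.

```python
import random

def bit_flip_mutation(chromosome, rate, rng=random):
    # Invert each bit independently with probability `rate`.
    return [1 - gene if rng.random() < rate else gene
            for gene in chromosome]
```

A rate of 0 leaves the chromosome unchanged and a rate of 1 inverts every bit; practical rates are usually small so that mutation adds diversity without destroying good solutions.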
Advantages and disadvantages Advantages: • Always an answer; answer gets better with time • Good for “noisy” environments • Inherently parallel; easily distributed Issues: • Performance • Solution is only as good as the evaluation function (often hardest part) • Termination Criteria
Applications - Genetic programming and data mining By: George Iordache
References
• A.A. Freitas. "A survey of evolutionary algorithms for data mining and knowledge discovery", Pontificia Universidade Catolica do Parana, Brazil. In A. Ghosh and S. Tsutsui, editors, Advances in Evolutionary Computation, pages 819-845. Springer-Verlag, 2002. http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_papers.dirzSzAdvEC-bk.pdf/freitas01survey.pdf
• Anita Wasilewska, Course Lecture Notes (2007 and previous years) on Classification (Data Mining book Chapters 5 and 7). http://www.cs.sunysb.edu/~cse634/lecture_notes/07classification.pdf
• J. Han and M. Kamber. "Data Mining: Concepts and Techniques", 2nd ed., Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6
• R. Mendes, F. Voznika, A. Freitas, and J. Nievola. "Discovering fuzzy classification rules with genetic programming and co-evolution", Pontificia Universidade Catolica do Parana, Brazil. In L. de Raedt and A. Siebes, editors, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), volume 2168 of LNAI, pages 314-325. Springer-Verlag, 2001. http://citeseer.ist.psu.edu/cache/papers/cs/23050/http:zSzzSzwww.ppgia.pucpr.brzSz~alexzSzpub_papers.dirzSzPKDD-2001.pdf/mendes01discovering.pdf
• John R. Koza, Medical Informatics, Department of Medicine, Department of Electrical Engineering, Stanford University. Genetic algorithms and genetic programming, Lecture notes, 2003. www.genetic-programming.com/c2003lecture1modified.ppt
Genetic Programming
• A program in C:
int foo (int time)
{
    int temp1, temp2;
    if (time > 10)
        temp1 = 3;
    else
        temp1 = 4;
    temp2 = temp1 + 1 + 2;
    return (temp2);
}
• Equivalent expression (similar to a classification rule in data mining): (+ 1 2 (IF (> TIME 10) 3 4))
Citation: www.genetic-programming.com/c2003lecture1modified.ppt
Program tree (+ 1 2 (IF (> TIME 10) 3 4)) Citation:www.genetic-programming.com/c2003lecture1modified.ppt
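To make the tree form concrete, the sketch below stores the expression as nested Python tuples and evaluates it recursively. The tuple layout and the `evaluate` helper are illustrative assumptions; only the expression itself comes from the slide.

```python
def evaluate(node, env):
    # Internal nodes are tuples: (operator, argument, argument, ...).
    if isinstance(node, tuple):
        op, *args = node
        if op == "IF":
            condition, then_branch, else_branch = args
            chosen = then_branch if evaluate(condition, env) else else_branch
            return evaluate(chosen, env)
        values = [evaluate(arg, env) for arg in args]
        if op == "+":
            return sum(values)
        if op == ">":
            return values[0] > values[1]
        raise ValueError(f"unknown operator: {op}")
    # Leaves are either variable names or constants.
    if isinstance(node, str):
        return env[node]
    return node

# (+ 1 2 (IF (> TIME 10) 3 4))
program = ("+", 1, 2, ("IF", (">", "TIME", 10), 3, 4))
```

`evaluate(program, {"TIME": 11})` gives 6 and `evaluate(program, {"TIME": 10})` gives 7, matching what the C function `foo` returns for the same inputs.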
Given data Citation:www.genetic-programming.com/c2003lecture1modified.ppt
Problem description Citation:www.genetic-programming.com/c2003lecture1modified.ppt
Generation 0
Population of 4 randomly created individuals: x, x + 1, x^2 + 1, 2
Citation: examples taken from: www.genetic-programming.com/c2003lecture1modified.ppt
Fitness (Σ per individual): 4.40, 6.00, 9.48, 15.40 (figure label: Best in Gen 0)
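The data and problem-description slides were images and did not survive, so the following is a reconstruction under assumptions rather than the presenters' setup. Assuming the target curve is x^2 + x + 1, sampled at the 11 points -1.0, -0.8, ..., 1.0, and fitness is the sum of absolute errors (lower is better), the four Generation 0 individuals score exactly 4.40, 6.00, 9.48 and 15.40.

```python
def target(x):
    # Assumed target curve; the original data slide was an image.
    return x * x + x + 1

# Assumed sample points: -1.0, -0.8, ..., 1.0
SAMPLE_XS = [i / 5 for i in range(-5, 6)]

def error_sum(candidate):
    # Fitness as the sum of absolute errors over the sample points.
    return sum(abs(candidate(x) - target(x)) for x in SAMPLE_XS)

generation_0 = {
    "x + 1": lambda x: x + 1,
    "x^2 + 1": lambda x: x * x + 1,
    "2": lambda x: 2,
    "x": lambda x: x,
}

scores = {name: round(error_sum(f), 2) for name, f in generation_0.items()}
# scores == {"x + 1": 4.4, "x^2 + 1": 6.0, "2": 9.48, "x": 15.4}
```

Under these assumptions an individual equal to x^2 + x + 1 scores 0.00, which is consistent with the "Found!" label in the next generation.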
Mutation Mutation: picking “2” as mutation point / Citation: part of the pictures used as examples are taken from: www.genetic-programming.com/c2003lecture1modified.ppt
Crossover Crossover: picking “+” subtree and leftmost “x” as crossover points Citation: example taken from: www.genetic-programming.com/c2003lecture1modified.ppt
Generation 1
• First offspring of crossover of (a) and (b), picking "+" of parent (a) and left-most "x" of parent (b) as crossover points
• Second offspring of crossover of (a) and (b), picking "+" of parent (a) and left-most "x" of parent (b) as crossover points
• Mutant of (c), picking "2" as mutation point
• Copy of (a)
Citation: part of the examples is taken from: www.genetic-programming.com/c2003lecture1modified.ppt
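Subtree crossover can be sketched on nested tuples as below. The tree shapes are assumptions, since the figures are missing: parent (a) is taken to be x + 1 and parent (b) to be x^2 + 1 written as (x * x) + 1. Paths are tuples of child indices, with index 0 holding the operator.

```python
def get_subtree(tree, path):
    # Follow the child indices in `path` down from the root.
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new_subtree):
    # Return a copy of `tree` with the node at `path` replaced.
    if not path:
        return new_subtree
    head, rest = path[0], path[1:]
    replaced = replace_subtree(tree[head], rest, new_subtree)
    return tree[:head] + (replaced,) + tree[head + 1:]

def subtree_crossover(tree_a, path_a, tree_b, path_b):
    # Swap the subtree of A at path_a with the subtree of B at path_b.
    child_1 = replace_subtree(tree_a, path_a, get_subtree(tree_b, path_b))
    child_2 = replace_subtree(tree_b, path_b, get_subtree(tree_a, path_a))
    return child_1, child_2

parent_a = ("+", "x", 1)                 # assumed shape of (a): x + 1
parent_b = ("+", ("*", "x", "x"), 1)     # assumed shape of (b): x*x + 1

# "+" of (a) is its root, path (); the left-most "x" of (b) is at path (1, 1).
child_1, child_2 = subtree_crossover(parent_a, (), parent_b, (1, 1))
```

Here `child_1` is just `"x"`, and `child_2` is `("+", ("*", ("+", "x", 1), "x"), 1)`, that is (x + 1) * x + 1 = x^2 + x + 1.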
Fitness (Σ per individual): 4.40, 6.00, 15.40, 0.00. Found!
GA & Classification Classify customers based on number of children and salary: Citation: data table is taken from prof. Anita Wasilewska previous years course slides
GA & Classification Rules
• A classification rule has the form (the rule is in predicate form; see course lectures): IF formula THEN class = ci, where the formula is the antecedent and class = ci is the consequent.
Formula representation
• Possible rule: If (NOC = 2) AND (S > 80000) then GOOD (customer)
• As a tree, the formula is (AND (= NOC 2) (> S 80000)) and the class is GOOD.
Citation: the example is taken from prof. Anita Wasilewska previous years course slides
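The rule tree can be represented and applied to a customer record as in this sketch. The nested-tuple layout, the `holds` and `classify` helpers, and the sample records are illustrative assumptions; NOC (number of children) and S (salary) come from the slides.

```python
import operator

# Comparison operators that may appear in a formula node.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def holds(formula, record):
    # A formula is either ("AND", sub-formula, ...) or (op, attribute, value).
    op = formula[0]
    if op == "AND":
        return all(holds(part, record) for part in formula[1:])
    attribute, value = formula[1], formula[2]
    return OPS[op](record[attribute], value)

# IF (NOC = 2) AND (S > 80000) THEN GOOD
rule_formula = ("AND", ("=", "NOC", 2), (">", "S", 80000))
rule_class = "GOOD"

def classify(record):
    # Return the rule's class when the antecedent holds, else None.
    return rule_class if holds(rule_formula, record) else None
```

For a hypothetical customer `{"NOC": 2, "S": 90000}` the rule fires and returns `"GOOD"`; for `{"NOC": 3, "S": 90000}` it does not apply. Because the formula is a tree, the subtree crossover and mutation operators used in genetic programming can be applied to rules directly.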