Exploring Genetic Programming for Automated Discovery in Mathematics

Automated discovery in math • Machine learning techniques such as genetic programming (GP) and inductive logic programming (ILP) have been successfully applied in science • How about mathematics? Can they be used to discover interesting relationships in mathematical “data”? • This is an exploration of using GP for that purpose • Specifically, using GP to automatically discover Euler’s identity for polyhedra (V – E + F = 2) from a fairly limited amount of data

Cubes V = 8 E = 12 F = 6 V – E + F = 8 – 12 + 6 = 2

Tetrahedra V = 4 E = 6 F = 4 V – E + F = 4 – 6 + 4 = 2

Octahedra V = 6 E = 12 F = 8 V – E + F = 6 – 12 + 8 = 2

Data for Euler’s identity
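The three solids listed above can be checked mechanically. The following is a minimal sketch; the names `SOLIDS` and `euler_characteristic` are illustrative and do not come from the slides.

```python
# (V, E, F) counts for the three solids given in the slides.
SOLIDS = {
    "cube": (8, 12, 6),
    "tetrahedron": (4, 6, 4),
    "octahedron": (6, 12, 8),
}

def euler_characteristic(v, e, f):
    """Return V - E + F for one solid."""
    return v - e + f

for name, (v, e, f) in SOLIDS.items():
    print(name, euler_characteristic(v, e, f))
```

Each line printed should end in 2, which is the relationship the GP run is meant to rediscover.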

At a glance • 50 generations • Population: 4000 ASTs • Individuals created by genetic operations per generation: 3600 (90% of population) • Maximum AST depth: 13 • Ramped half-and-half initialization • 3 non-terminals: +, -, * • 12 terminals: V, E, F, 1, 2, …, 9 • Crossover, no mutation

Genetic algorithms (GA) • Search a space of solution attempts (“individuals”) • Use natural selection to guide the search • Must have a fitness function that can evaluate any given individual • Individuals procreate by exchanging (recombining) “genetic material”

Example: SAT solving • Problem: Given a CNF formula P over n variables x1, …, xn, find a satisfying assignment • Search space: all n-bit strings • Fitness measure for a given individual b1 … bn: # of satisfied clauses in P • Genetic operations: crossover and mutation

Crossover: a1 … aj-1 | aj … an + b1 … bj-1 | bj … bn → a1 … aj-1 | bj … bn and b1 … bj-1 | aj … an
Mutation: 0 1 1 0 1 0 0 1 → 0 1 1 0 0 0 0 1
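The SAT example can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's code: clauses are encoded as lists of signed integers (k for xk, -k for its negation), and the per-bit mutation rate is a hypothetical parameter.

```python
import random

def fitness(formula, bits):
    """Count the clauses of a CNF formula satisfied by a bit string."""
    def satisfied(clause):
        # Literal k is true when bit k is 1; literal -k is true when bit k is 0.
        return any((bits[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
    return sum(1 for clause in formula if satisfied(clause))

def crossover(a, b, j):
    """One-point crossover: the parents exchange tails starting at index j."""
    return a[:j] + b[j:], b[:j] + a[j:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate` (assumed scheme)."""
    return [1 - b if rng.random() < rate else b for b in bits]

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
print(fitness(formula, [1, 1, 0]))               # 3
print(crossover([0, 1, 1, 0], [1, 0, 0, 1], 2))  # ([0, 1, 0, 1], [1, 0, 1, 0])
```

An individual whose fitness equals the number of clauses is a satisfying assignment.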

Generic GA algorithm
Parameterized over: N (maximum number of generations), P (population size), G (number of new individuals created per generation)
1. Construct a random initial population
2. Set i := 1
3. If i > N then halt
4. Compute the fitness of each individual; if the fittest solves the problem, halt
5. Create a new population:
   • Pick P – G individuals and copy them
   • Create G new individuals by repeated applications of genetic operations
6. Set i := i + 1 and go to step 3
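The generic loop can be written as a single function parameterized over N, P, and G. The sketch below assumes the problem-specific pieces are passed in as callables (all of those names are hypothetical), and it returns the fittest individual when the generation limit is reached, which the slides leave unspecified. The toy problem of evolving an all-ones bit string is only there to make the example runnable.

```python
import random

def run_ga(N, P, G, new_individual, fitness, is_solution, select, breed, rng):
    """Generic GA: at most N generations, population size P, G new individuals each."""
    population = [new_individual(rng) for _ in range(P)]
    for _ in range(N):
        fittest = max(population, key=fitness)
        if is_solution(fittest):
            return fittest
        # Copy P - G selected individuals unchanged.
        copied = [select(population, fitness, rng) for _ in range(P - G)]
        # Create G new individuals by repeated genetic operations.
        created = []
        while len(created) < G:
            mother = select(population, fitness, rng)
            father = select(population, fitness, rng)
            created.extend(breed(mother, father, rng))
        population = copied + created[:G]
    return max(population, key=fitness)

# Toy usage (hypothetical problem): evolve an 8-bit string of all ones.
def new_bits(rng):
    return [rng.randint(0, 1) for _ in range(8)]

def pick(population, fitness, rng):
    # Tournament of size 3 (an assumed choice of selection method).
    return max(rng.sample(population, 3), key=fitness)

def one_point(a, b, rng):
    j = rng.randrange(1, len(a))
    return [a[:j] + b[j:], b[:j] + a[j:]]

best = run_ga(N=50, P=20, G=18, new_individual=new_bits, fitness=sum,
              is_solution=lambda bits: sum(bits) == 8,
              select=pick, breed=one_point, rng=random.Random(0))
print(best, sum(best))
```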

Selection • How is an individual “picked” for reproduction or copying? • Main idea: the probability that an individual is selected should increase with the individual’s fitness • Many ways to achieve that. One method is tournament selection: • Pick k individuals at random, where 0 < k ≤ P • Select the fittest of the k • When k = 1: No selection pressure • When k = P: Too much selection pressure
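Tournament selection is short enough to show in full. This sketch assumes the k contestants are drawn without replacement, which the slides do not specify; the function name is illustrative.

```python
import random

def tournament_select(population, fitness, k, rng):
    """Pick k individuals at random and return the fittest of them."""
    if not 0 < k <= len(population):
        raise ValueError("k must satisfy 0 < k <= population size")
    contestants = rng.sample(population, k)  # assumption: no repeats
    return max(contestants, key=fitness)

rng = random.Random(0)
population = [3, 9, 1, 7, 5]
# With k equal to the population size, the fittest individual always wins.
print(tournament_select(population, lambda x: x, len(population), rng))  # 9
```

With k = 1 the call reduces to a uniform random pick, which is why it exerts no selection pressure.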

Genetic Programming (GP) • An instance of the generic GA scheme • Individuals are now programs, i.e., syntactic objects • Search space is kept finite by bounding program size • Programs are represented as ASTs (abstract syntax trees)

Programs as ASTs • Source: if x > 0 then y := x * x else y := z + 1 • Parsing yields the AST: if(>(x, 0), :=(y, *(x, x)), :=(y, +(z, 1)))

Program structure in GP • Programs are usually simple Herbrand terms, i.e., functional expressions • AST leaves are called terminals • Internal nodes are non-terminals • Non-terminals are function symbols (e.g. +) • Terminals are constants and variables • Terminals + non-terminals must be sufficient for expressing solutions

Viewing a functional AST as a “program” • AST: +(*(x, 2), y), i.e., (x * 2) + y • The program has two “inputs”, x and y. Given specific values for these, it produces a unique result as output
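One way to make the "AST as program" view concrete is a small recursive evaluator. The representation below (nested tuples of the form `(op, left, right)`, strings for variables, integers for constants) is an assumption chosen for brevity, not something the slides prescribe.

```python
import operator

# Binary function symbols available at internal nodes.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(ast, env):
    """Evaluate an AST given a mapping from variable names to values."""
    if isinstance(ast, tuple):      # internal node: (op, left, right)
        op, left, right = ast
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(ast, str):        # variable terminal
        return env[ast]
    return ast                      # constant terminal

tree = ("+", ("*", "x", 2), "y")    # (x * 2) + y
print(evaluate(tree, {"x": 3, "y": 4}))  # 10
```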

AST Crossover • [Figure: two parent ASTs over the subtrees T1 … T6, each with a marked crossover point; the children are formed by exchanging the subtrees rooted at those points]
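Subtree exchange can be sketched on tuple-encoded ASTs. The parent trees below are hypothetical stand-ins built from the labels T1 … T6, since the original figure's exact shapes are not recoverable; paths are tuples of child indices, an encoding chosen here for illustration.

```python
def get(ast, path):
    """Return the subtree reached by following child indices in `path`."""
    for i in path:
        ast = ast[i]
    return ast

def put(ast, path, new):
    """Return a copy of `ast` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return ast[:i] + (put(ast[i], path[1:], new),) + ast[i + 1:]

def swap_at(a, path_a, b, path_b):
    """Crossover: exchange the subtrees rooted at the two crossover points."""
    return put(a, path_a, get(b, path_b)), put(b, path_b, get(a, path_a))

parent1 = ("+", ("+", "T1", "T2"), "T4")
parent2 = ("-", ("*", "T5", "T6"), "T3")
child1, child2 = swap_at(parent1, (1,), parent2, (1,))
print(child1)  # ('+', ('*', 'T5', 'T6'), 'T4')
print(child2)  # ('-', ('+', 'T1', 'T2'), 'T3')
```

In a GP run the two paths would be drawn at random from each parent rather than fixed.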

Initial population • Built randomly • Two methods for building a random AST: • Full method: All branches are equally long • Grow method: Different subtrees can have different sizes (but less than the maximum) • More usual: ramped half-and-half initialization: half of the trees are built with one method, the other half with the other method
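The two construction methods and their combination might look as follows. Several details are assumptions rather than facts from the slides: the 0.3 probability of stopping early in the grow method, the cycling of depth limits from 1 up to the maximum (the "ramp"), and building the first half of the population with the full method.

```python
import random

FUNCTIONS = ["+", "-", "*"]
TERMINALS = ["V", "E", "F"] + list(range(1, 10))

def full(depth, rng):
    """Full method: every branch has exactly the requested depth."""
    if depth == 0:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS), full(depth - 1, rng), full(depth - 1, rng))

def grow(depth, rng):
    """Grow method: branches may stop early, so subtrees differ in size."""
    if depth == 0 or rng.random() < 0.3:  # 0.3 is an assumed stopping rate
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS), grow(depth - 1, rng), grow(depth - 1, rng))

def depth_of(ast):
    """Depth of a tree; a lone terminal has depth 0."""
    if isinstance(ast, tuple):
        return 1 + max(depth_of(ast[1]), depth_of(ast[2]))
    return 0

def ramped_half_and_half(size, max_depth, rng):
    """First half built with full, second half with grow, over ramped depths."""
    population = []
    for i in range(size):
        depth = 1 + i % max_depth
        method = full if i < size // 2 else grow
        population.append(method(depth, rng))
    return population

pop = ramped_half_and_half(20, 4, random.Random(0))
print(len(pop), max(depth_of(t) for t in pop))
```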

Problem formulation • Can cast it as a standard symbolic regression problem • View F as a function of E and V, and search the space of all rational functions of two variables (up to a max depth) • Error function: difference between the actual # of faces and the result produced by the program • Optimization: minimize the error • Quick convergence
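The regression view can be illustrated with an error function over the three solids from the earlier slides. Summing absolute differences across samples is an assumption; the slides only say the error is the difference between the actual and predicted number of faces. Candidate programs are shown as plain Python functions for brevity.

```python
# (V, E, F) samples taken from the cube, tetrahedron, and octahedron slides.
DATA = [(8, 12, 6), (4, 6, 4), (6, 12, 8)]

def total_error(program, data):
    """Sum of |actual F - predicted F| over all samples."""
    return sum(abs(f - program(v, e)) for v, e, f in data)

def exact(v, e):
    return e - v + 2   # F = E - V + 2, a rearrangement of V - E + F = 2

def near_miss(v, e):
    return e - v       # off by 2 on every sample

print(total_error(exact, DATA))      # 0
print(total_error(near_miss, DATA))  # 6
```

A regression-style GP run would minimize this quantity, halting once it reaches zero.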

Another approach • Search the space of all identities • Generated by the grammar: I → T1 = T2; T → L | T1 + T2 | T1 – T2 | T1 * T2; L → V | E | F | 1 | 2 | … | 9 • Any other integer can be built from 1, …, 9 and the given non-terminals • The identity symbol (=) is not a non-terminal; it can only appear at the root of an AST

Details • Generate P identities randomly (using ramped half-and-half initialization) • Crossover on two identities S1 = S2 and T1 = T2: • Mate two random subterms Si and Tj from each identity, producing two new subterms Si’ and Tj’ • If either new term is deeper than the max depth, then use one of the original parents • Replace Si and Tj in the identities by Si’ and Tj’ • No mutation
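The crossover step on identities, including the depth guard, can be sketched as below. An identity is represented here as a pair `(lhs, rhs)` of tuple-encoded terms, and "use one of the original parents" is interpreted as keeping the original term whenever its replacement would be too deep. Both choices are assumptions, as are all helper names.

```python
import random

def depth_of(t):
    return 1 + max(depth_of(t[1]), depth_of(t[2])) if isinstance(t, tuple) else 0

def paths(t, prefix=()):
    """Yield the path (tuple of child indices) to every node of a term."""
    yield prefix
    if isinstance(t, tuple):
        for i in (1, 2):
            yield from paths(t[i], prefix + (i,))

def get(t, path):
    for i in path:
        t = t[i]
    return t

def put(t, path, new):
    if not path:
        return new
    i = path[0]
    return t[:i] + (put(t[i], path[1:], new),) + t[i + 1:]

def mate_terms(s, t, max_depth, rng):
    """Swap random subterms of s and t; fall back to the parent if too deep."""
    ps = rng.choice(list(paths(s)))
    pt = rng.choice(list(paths(t)))
    s_new = put(s, ps, get(t, pt))
    t_new = put(t, pt, get(s, ps))
    if depth_of(s_new) > max_depth:
        s_new = s
    if depth_of(t_new) > max_depth:
        t_new = t
    return s_new, t_new

def cross_identities(id_s, id_t, max_depth, rng):
    """Mate one random side of each identity and put the results back."""
    i, j = rng.randrange(2), rng.randrange(2)
    s_new, t_new = mate_terms(id_s[i], id_t[j], max_depth, rng)
    child_s, child_t = list(id_s), list(id_t)
    child_s[i], child_t[j] = s_new, t_new
    return tuple(child_s), tuple(child_t)

A = (("+", ("-", "V", "E"), "F"), 2)          # V - E + F = 2
B = (("*", "V", 2), ("+", "E", "F"))          # V * 2 = E + F
print(cross_identities(A, B, 2, random.Random(0)))
```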

Fitness • An identity is evaluated on a given triple of values for V, E, and F • Computing the fitness of an identity S = T: • For each of the k data triples p: • If S = T holds for p, then give the identity a point • Higher score, greater fitness • Maximum fitness: 9, minimum: 0

Problem • Trivially true identities can get perfect scores, e.g.: • V = V • 1 + 2 = 5 – 2 • E – E + E = E • Solution: negative triples, e.g.: • V = 0, E = 0, F = 1 • Trivial identities will hold for such negative triples, but plausible identities will not

Fitness computation • To evaluate an identity S = T: • For each of the k data triples p: • Allocate a point if S = T holds for p • Allocate a second point if S = T does not hold for the negative triple • Maximum score: 18, minimum: 0 • Also impose a penalty of ⌊n/20⌋ points for an identity of length n (to discourage excessively long expressions)
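The scoring rule can be sketched as follows. Several points are assumptions: each data triple is paired with one negative triple, the "length" of an identity is taken to be its total node count, and identities are `(lhs, rhs)` pairs of tuple-encoded terms. The data uses only the three solids and the one negative triple that appear in the slides, so the maximum score here is 6 rather than 18.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(term, env):
    if isinstance(term, tuple):
        op, left, right = term
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[term] if isinstance(term, str) else term

def holds(identity, triple):
    """True if both sides agree for the given (V, E, F) values."""
    env = dict(zip("VEF", triple))
    lhs, rhs = identity
    return evaluate(lhs, env) == evaluate(rhs, env)

def size(term):
    """Node count, used here as the 'length' of a term (an assumption)."""
    return 1 + size(term[1]) + size(term[2]) if isinstance(term, tuple) else 1

def fitness(identity, positives, negatives):
    score = 0
    for p, neg in zip(positives, negatives):
        if holds(identity, p):
            score += 1          # point for holding on real data
        if not holds(identity, neg):
            score += 1          # point for failing on the negative triple
    n = size(identity[0]) + size(identity[1])
    return score - n // 20      # length penalty of floor(n / 20)

POSITIVES = [(8, 12, 6), (4, 6, 4), (6, 12, 8)]
NEGATIVES = [(0, 0, 1)] * len(POSITIVES)

euler = (("+", ("-", "V", "E"), "F"), 2)   # V - E + F = 2
trivial = ("V", "V")                       # V = V
print(fitness(euler, POSITIVES, NEGATIVES))    # 6
print(fitness(trivial, POSITIVES, NEGATIVES))  # 3
```

The trivial identity earns points on the real data but none on the negative triples, which is the separation the negative triples are meant to provide.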
