Pat Langley School of Computing and Informatics Arizona State University Tempe, Arizona

Computational Discovery of Explanatory Process Models
http://cll.stanford.edu/~langley
langley@asu.edu
Thanks to N. Asgharbeygi, K. Arrigo, D. Billman, S. Borrett, W. Bridewell, S. Dzeroski, O. Shiran, and L. Todorovski for their contributions to this research, which is funded by a grant from the National Science Foundation.

The Challenge of Systems Science
Disciplines like Earth science and computational biology differ from traditional fields in that they:
• focus on synthesis rather than analysis in their operation;
• develop system-level models with many variables / relations;
• rely on computational methods to aid in their construction.
However, the key challenge involves search through the model space, not running rapid simulations or handling large data sets.

Example: Explain Data from the Ross Sea

A Model of the Ross Sea Ecosystem
d[phyto,t,1] = −0.307 × phyto − 0.495 × zoo + 0.411 × phyto
d[zoo,t,1] = −0.251 × zoo + 0.615 × 0.495 × zoo
d[detritus,t,1] = 0.307 × phyto + 0.251 × zoo + 0.385 × 0.495 × zoo − 0.005 × detritus
d[nitro,t,1] = −0.098 × 0.411 × phyto + 0.005 × detritus
Differential equation models of this sort are regularly used to explain observations and predict future behavior.
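To make concrete how such a model predicts behavior, here is a minimal sketch that integrates the four equations with forward Euler. The signs follow the process descriptions (loss and uptake terms are negative for the variable being depleted). The function names, the step size, and the initial concentrations are illustrative assumptions, not values from the Ross Sea study.

```python
def ross_sea_derivatives(state):
    """Right-hand sides of the four Ross Sea equations for one state."""
    phyto, zoo, detritus, nitro = state
    d_phyto = -0.307 * phyto - 0.495 * zoo + 0.411 * phyto
    d_zoo = -0.251 * zoo + 0.615 * 0.495 * zoo
    d_detritus = (0.307 * phyto + 0.251 * zoo
                  + 0.385 * 0.495 * zoo - 0.005 * detritus)
    d_nitro = -0.098 * 0.411 * phyto + 0.005 * detritus
    return (d_phyto, d_zoo, d_detritus, d_nitro)


def simulate(initial, dt=0.01, steps=1000):
    """Forward-Euler integration; returns every state visited, start included."""
    state = tuple(initial)
    trajectory = [state]
    for _ in range(steps):
        derivs = ross_sea_derivatives(state)
        state = tuple(x + dt * dx for x, dx in zip(state, derivs))
        trajectory.append(state)
    return trajectory


# Hypothetical initial concentrations: (phyto, zoo, detritus, nitro).
trajectory = simulate((1.0, 0.1, 0.0, 1.0), dt=0.01, steps=10)
```

Forward Euler is chosen only for brevity; a production simulator would more likely use an adaptive solver.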

The Task of Model Construction
Environmental scientists are confronted with a challenging task:
• Given: A set of variables of interest to the scientist;
• Given: Observations of how these variables change over time;
• Find: A model that explains these variations in plausible terms and that generalizes well to future observations.
Automating such model construction is a natural task for artificial intelligence and machine learning. We can develop algorithms that search the space of differential equation models, but this space is huge, so we need constraints.

Another Account of the Ross Sea Ecosystem
d[phyto,t,1] = −0.307 × phyto − 0.495 × zoo + 0.411 × phyto
d[zoo,t,1] = −0.251 × zoo + 0.615 × 0.495 × zoo
d[detritus,t,1] = 0.307 × phyto + 0.251 × zoo + 0.385 × 0.495 × zoo − 0.005 × detritus
d[nitro,t,1] = −0.098 × 0.411 × phyto + 0.005 × detritus
As phytoplankton takes up nitrogen, its concentration increases and the nitrogen concentration decreases. This continues until the nitrogen supply is exhausted, which leads to a phytoplankton die-off. The die-off produces detritus, which gradually remineralizes to replenish the nitrogen. Zooplankton grazes on phytoplankton, which slows the latter’s increase and also produces detritus.

Processes in the Ross Sea Ecosystem
Knowledge about candidate processes requires that some terms occur either together or not at all.
d[phyto,t,1] = −0.307 × phyto − 0.495 × zoo + 0.411 × phyto
d[zoo,t,1] = −0.251 × zoo + 0.615 × 0.495 × zoo
d[detritus,t,1] = 0.307 × phyto + 0.251 × zoo + 0.385 × 0.495 × zoo − 0.005 × detritus
d[nitro,t,1] = −0.098 × 0.411 × phyto + 0.005 × detritus
The terms related to phytoplankton loss are −0.307 × phyto in the phyto equation and 0.307 × phyto in the detritus equation; this process decreases phyto concentration and increases detritus.

Processes in the Ross Sea Ecosystem
The terms related to zooplankton grazing are −0.495 × zoo in the phyto equation, 0.615 × 0.495 × zoo in the zoo equation, and 0.385 × 0.495 × zoo in the detritus equation; this process decreases phyto but increases zoo and detritus.
d[phyto,t,1] = −0.307 × phyto − 0.495 × zoo + 0.411 × phyto
d[zoo,t,1] = −0.251 × zoo + 0.615 × 0.495 × zoo
d[detritus,t,1] = 0.307 × phyto + 0.251 × zoo + 0.385 × 0.495 × zoo − 0.005 × detritus
d[nitro,t,1] = −0.098 × 0.411 × phyto + 0.005 × detritus
We can use knowledge about processes to reorganize models and constrain search through the model space.

A Process Model for the Ross Sea
model Ross_Sea_Ecosystem
  variables: phyto, zoo, nitro, detritus
  observables: phyto, nitro
  process phyto_loss
    equations: d[phyto,t,1] = −0.307 × phyto
               d[detritus,t,1] = 0.307 × phyto
  process zoo_loss
    equations: d[zoo,t,1] = −0.251 × zoo
               d[detritus,t,1] = 0.251 × zoo
  process zoo_phyto_grazing
    equations: d[zoo,t,1] = 0.615 × 0.495 × zoo
               d[detritus,t,1] = 0.385 × 0.495 × zoo
               d[phyto,t,1] = −0.495 × zoo
  process nitro_uptake
    equations: d[phyto,t,1] = 0.411 × phyto
               d[nitro,t,1] = −0.098 × 0.411 × phyto
  process nitro_remineralization
    equations: d[nitro,t,1] = 0.005 × detritus
               d[detritus,t,1] = −0.005 × detritus
This model is equivalent to a standard differential equation model, but it makes explicit assumptions about which processes are involved. For completeness, we must also make assumptions about how to combine influences from multiple processes.
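The equivalence with a standard differential equation model can be illustrated with a short sketch. It assumes that influences from different processes combine by addition, which is one possible combining assumption and the one that reproduces the Ross Sea equations. The dictionary layout and the name `combined_derivatives` are hypothetical choices for illustration only.

```python
from collections import defaultdict

# Each process maps an affected variable to its contribution to that
# variable's derivative, written as a function of the current state.
ROSS_SEA_PROCESSES = {
    "phyto_loss": {
        "phyto": lambda s: -0.307 * s["phyto"],
        "detritus": lambda s: 0.307 * s["phyto"],
    },
    "zoo_loss": {
        "zoo": lambda s: -0.251 * s["zoo"],
        "detritus": lambda s: 0.251 * s["zoo"],
    },
    "zoo_phyto_grazing": {
        "zoo": lambda s: 0.615 * 0.495 * s["zoo"],
        "detritus": lambda s: 0.385 * 0.495 * s["zoo"],
        "phyto": lambda s: -0.495 * s["zoo"],
    },
    "nitro_uptake": {
        "phyto": lambda s: 0.411 * s["phyto"],
        "nitro": lambda s: -0.098 * 0.411 * s["phyto"],
    },
    "nitro_remineralization": {
        "nitro": lambda s: 0.005 * s["detritus"],
        "detritus": lambda s: -0.005 * s["detritus"],
    },
}


def combined_derivatives(processes, state):
    """Sum every process's influence on each variable (additive assumption)."""
    totals = defaultdict(float)
    for equations in processes.values():
        for variable, term in equations.items():
            totals[variable] += term(state)
    return dict(totals)


state = {"phyto": 1.0, "zoo": 2.0, "detritus": 3.0, "nitro": 4.0}
derivatives = combined_derivatives(ROSS_SEA_PROCESSES, state)
```

Removing one entry from the dictionary removes all of that process's terms at once, which is the "together or not at all" constraint in executable form.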

The Task of Inductive Process Modeling
We can use these ideas to reformulate the modeling problem:
• Given: A set of variables of interest to the scientist;
• Given: Observations of how these variables change over time;
• Given: Background knowledge about plausible processes;
• Find: A process model that explains these variations and that generalizes well to future observations.
We can use background knowledge about candidate processes to make search much more tractable. Moreover, the resulting model will be consistent with this domain knowledge, making it more comprehensible.

Generic Processes as Background Knowledge
We cast background knowledge as generic processes that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose a specific process model.

Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = −1 × α × S
             d[D,t,1] = α × S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π × D
             d[D,t,1] = −1 × π × D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ × ρ × S1
             d[D,t,1] = (1 − γ) × ρ × S1
             d[S2,t,1] = −1 × ρ × S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ × S
             d[N,t,1] = −1 × β × μ × S
Our current library contains about 20 generic processes, including ones with alternative functional forms for loss and grazing processes.
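A brief sketch shows how type constraints limit the ways generic processes can be bound to specific variables. It records only the variable types of the five generic processes above and enumerates the type-consistent bindings. The type assignments for phyto, zoo, nitro, and detritus are assumptions inferred from their names, and the function `instantiate` is a hypothetical helper, not part of any published system.

```python
from itertools import permutations

# Variable types required by each generic process, in declaration order.
GENERIC_PROCESSES = {
    "exponential_loss": ("species", "detritus"),
    "remineralization": ("nutrient", "detritus"),
    "grazing": ("species", "species", "detritus"),
    "constant_inflow": ("nutrient",),
    "nutrient_uptake": ("species", "nutrient"),
}

# Assumed types for the Ross Sea variables.
VARIABLE_TYPES = {
    "phyto": "species",
    "zoo": "species",
    "nitro": "nutrient",
    "detritus": "detritus",
}


def instantiate(generic_processes, variable_types):
    """Return every (process name, variable binding) that respects the types."""
    instances = []
    for name, role_types in generic_processes.items():
        # Ordered selections of distinct variables, one per role.
        for binding in permutations(variable_types, len(role_types)):
            if all(variable_types[v] == t for v, t in zip(binding, role_types)):
                instances.append((name, binding))
    return instances


instances = instantiate(GENERIC_PROCESSES, VARIABLE_TYPES)
```

Without the type check, grazing alone would admit 24 bindings over four variables; with it, only the two species orderings remain.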

Constructing Process Models
Inputs to heuristic search:
generic processes
  process exponential_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] × P
  process logistic_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] × P × (1 − P / [0, 1, ∞])
  process constant_inflow
    variables: I {inorganic_nutrient}
    equations: d[I,t] = [0, 1, ∞]
  process consumption
    variables: P1 {population}, P2 {population}, nutrient_P2
    equations: d[P1,t] = [0, 1, ∞] × P1 × nutrient_P2,
               d[P2,t] = −[0, 1, ∞] × P1 × nutrient_P2
  process no_saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P
  process saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P / (P + [0, 1, ∞])
variables
  phyto, nitro, zoo, nutrient_nitro, nutrient_phyto
observations
Output of heuristic search:
process model
  model AquaticEcosystem
    variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
    observables: nitro, phyto, zoo
    process phyto_exponential_growth
      equations: d[phyto,t] = 0.1 × phyto
    process zoo_logistic_growth
      equations: d[zoo,t] = 0.1 × zoo × (1 − zoo / 1.5)
    process phyto_nitro_consumption
      equations: d[nitro,t] = −1 × phyto × nutrient_nitro,
                 d[phyto,t] = 1 × phyto × nutrient_nitro
    process phyto_nitro_no_saturation
      equations: nutrient_nitro = nitro
    process zoo_phyto_consumption
      equations: d[phyto,t] = −1 × zoo × nutrient_phyto,
                 d[zoo,t] = 1 × zoo × nutrient_phyto
    process zoo_phyto_saturation
      equations: nutrient_phyto = phyto / (phyto + 0.5)

A Method for Process Model Construction
Our initial system, IPM, constructs process models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter space to find good coefficients;
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error, but we have also explored other measures of explanatory adequacy.
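The four stages can be sketched on a toy problem. This is not IPM itself: it uses a single variable, single-variable simplifications of two generic processes, and an exhaustive grid search standing in for the parameter search, whose actual method is not specified here. All names (`GENERIC`, `induce_model`) and the synthetic observations are assumptions made for illustration. With one variable, stage 1 (instantiation) is trivial, so the code covers stages 2 through 4.

```python
from itertools import combinations, product

# Hypothetical one-variable versions of two generic processes; each maps
# (parameter, x) to a contribution to dx/dt.
GENERIC = {
    "exponential_loss": lambda p, x: -p * x,
    "constant_inflow": lambda p, x: p,
}


def simulate(structure, params, x0, dt, steps):
    """Forward-Euler trajectory of x under the summed process influences."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        dx = sum(GENERIC[name](p, x) for name, p in zip(structure, params))
        xs.append(x + dt * dx)
    return xs


def squared_error(predicted, observed):
    """Sum of squared differences between two equal-length trajectories."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def induce_model(observed, dt, max_processes=2, grid_size=11):
    """Enumerate structures, grid-search parameters, return the best triple."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]  # values in [0, 1]
    steps = len(observed) - 1
    best = None
    for k in range(1, max_processes + 1):
        # Stage 2: candidate structures are sets of at most max_processes.
        for structure in combinations(sorted(GENERIC), k):
            # Stage 3: search parameter space for this structure.
            for params in product(grid, repeat=k):
                predicted = simulate(structure, params, observed[0], dt, steps)
                error = squared_error(predicted, observed)
                if best is None or error < best[0]:
                    best = (error, structure, params)
    return best  # Stage 4: the best-scoring parameterized model.


# Synthetic observations generated from a known two-process model.
observed = simulate(("constant_inflow", "exponential_loss"), (0.2, 0.5),
                    x0=1.0, dt=0.1, steps=50)
error, structure, params = induce_model(observed, dt=0.1)
```

Exhaustive enumeration is feasible only because the toy space is tiny; the number of structures grows combinatorially with the number of instantiated processes, which is why constraints on model structure matter.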

Results on Observations from Ross Sea
We provided IPM with 188 samples of phytoplankton, nitrate, and ice measures taken from the Ross Sea. From 2035 distinct model structures, it found accurate models that limited phyto growth by the nitrate and the light available. Some high-ranking models incorporated zooplankton, whereas others did not.

Results with Inductive Process Modeling
[Figure panels: battery behavior, population dynamics, biochemical kinetics, hydrology]

Extensions to Inductive Process Modeling
In recent work, we have extended our system to incorporate:
• heuristic beam search through the space of process models;
• hierarchical generic processes that further constrain search;
• an ensemble-like method that mitigates overfitting effects;
• an EM-like method that deals with missing observations.
This approach has great potential to speed the construction of scientific models, provided that domain users adopt it.

Interfacing with Scientists
Because few scientists want to be replaced, we are developing an interactive environment, PROMETHEUS, that lets users:
• specify a quantitative process model of the target system;
• display and edit the model’s structure and details graphically;
• simulate the model’s behavior over time and situations;
• compare the model’s predicted behavior to observations;
• invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and evaluating models but lets the user retain control.

Viewing a Process Model Graphically

Viewing a Process Model as Equations

Adding a Process Manually

Requesting Automatic Model Revision

Results of Automatic Model Revision

Directions for Future Research
Despite our progress to date, we need further work in order to:
• provide better ways to visualize models, data, and their relation;
• offer users more natural ways to define the space of models:
  • specifying constraints on relations among entities and processes;
  • characterizing subsystems that decompose complex models;
• incorporate intuitive metrics like match to trajectory shape;
• more generally, improve the usability of PROMETHEUS.
Taken together, these will make inductive process modeling a more robust approach to scientific model construction.

Intellectual Influences
Our approach to aiding scientific model construction incorporates ideas from many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, AI, programming languages, and human-computer interaction.

Contributions of the Research
In summary, our work on computational model construction has produced an approach that:
• incorporates a formalism that is familiar to many scientists;
• takes into account background knowledge about the domain;
• produces meaningful results from small amounts of data;
• generates models that explain rather than describe observations;
• provides an interactive environment for model construction.
We need much more research in computational systems science that addresses these challenges.
