Exploring Metabolomic data with recursive partitioning

Metabolomic Workshop, NISS, July 14-15, 2005

Why study metabolites?
• Metabolomics – the global study of all small molecules produced in the human body
• Biochemical consequences of environment, drugs, and mutations can be observed directly through metabolites
• Understand how drugs work, how they interact, and what side effects they may have
• ~2500 metabolites

Challenges of metabolomic data
• Non-normal distributions
• Outliers
• Informative missing values
• High correlation among metabolites
• n < p problem (n = number of biological samples, p = number of metabolites)

Why recursive partitioning?
• Fairly robust to non-normal data
• Missing values are not an issue
• Correlation among variables is not an issue
• Useful for discovering outliers
• Efficient at handling large p, small n data sets

How recursive partitioning works
• Recursive partitioning efficiently searches through all of the variables and finds the one with the best (most significant) split
• Once the data are split or “partitioned” on this variable, the resulting daughter nodes are more homogeneous
• Each daughter node is then explored in turn to find its best split
• This process continues until no significant split remains
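The search-split-recurse loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scores splits by reduction in Gini impurity and stops at a hypothetical `min_gain` threshold, whereas FIRM selects splits by formal significance tests. The toy data, the function names, and the parameters `min_gain` and `max_depth` are all invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 = perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Search every variable and cut point; return (gain, var_index, cut) or None."""
    parent, n, best = gini(labels), len(rows), None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2  # midpoint between adjacent observed values
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, j, cut)
    return best

def grow(rows, labels, min_gain=0.01, depth=0, max_depth=3):
    """Split recursively until no split improves homogeneity by min_gain."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None or split[0] < min_gain:
        # Stop: report the majority class and node size as a leaf.
        return {"leaf": Counter(labels).most_common(1)[0][0], "n": len(labels)}
    _, j, cut = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= cut]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > cut]
    return {
        "var": j,
        "cut": cut,
        "left": grow([r for r, _ in left], [y for _, y in left],
                     min_gain, depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right],
                      min_gain, depth + 1, max_depth),
    }

# Toy data (invented): 6 samples x 3 metabolites; only column 0 separates the groups.
rows = [[1.0, 5.0, 0.2], [1.2, 3.0, 0.9], [0.9, 4.0, 0.4],
        [3.1, 5.1, 0.3], [2.9, 2.9, 0.8], [3.3, 4.2, 0.5]]
labels = ["healthy"] * 3 + ["diseased"] * 3

tree = grow(rows, labels)
print(tree)
```

On this toy data the root splits on the first column and both daughter nodes are pure, so the recursion stops after one level.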

Example

Multiple Trees
• Not all effects are necessarily found in a single tree
• In any node, there may be more than one significant variable
• Creating multiple trees may reveal a number of possible effects
• Gain an understanding of interactions/correlations among metabolites
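To see why a single tree can hide effects, it helps to rank every variable's best split within one node rather than keeping only the winner. The sketch below does this on invented data, again using Gini impurity reduction as a stand-in for a significance test; the metabolite names `met_A`, `met_B`, `met_C` and the function `rank_splits` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 = perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def rank_splits(rows, labels, names):
    """Return (best_gain, name) for every variable, strongest first."""
    parent, n, ranked = gini(labels), len(rows), []
    for j, name in enumerate(names):
        values = sorted(set(r[j] for r in rows))
        best_gain = 0.0
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= cut]
            right = [y for r, y in zip(rows, labels) if r[j] > cut]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            best_gain = max(best_gain, gain)
        ranked.append((best_gain, name))
    # Stable sort by gain only, so tied variables keep their input order.
    return sorted(ranked, key=lambda t: -t[0])

# Toy data (invented): met_A and met_B are correlated and both separate the
# groups; met_C is noise. A single tree would show only one of A or B.
names = ["met_A", "met_B", "met_C"]
rows = [[1.0, 10, 0.2], [1.2, 11, 0.9], [0.9, 9, 0.4],
        [3.1, 20, 0.3], [2.9, 22, 0.8], [3.3, 21, 0.5]]
labels = ["healthy"] * 3 + ["diseased"] * 3

ranked = rank_splits(rows, labels, names)
for gain, name in ranked:
    print(f"{name}: gain = {gain:.3f}")
```

Here two correlated variables tie for the best split, so a second tree grown from the runner-up would reveal an effect the first tree leaves out.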

Software
• Helix Tree (Partitionator)
• www.goldenhelix.com
• Uses Formal Inference-based Recursive Modeling (FIRM), developed by Douglas Hawkins
• A free 7-day trial is available for download (webinars are offered to assist in using the software)

Illustration of Software
• Data
  • 317 metabolites
  • LC/MS and GC/MS
  • 63 biological samples
• Goal: discover which metabolites differentiate the diseased group from the “healthy” individuals (within the diseased group, a subset of individuals is currently taking drugs)
