Modeling Gene Interactions in Disease CS 686 Bioinformatics
Some Definitions
• Data mining: extracting hidden patterns and useful information from large data sets. Examples: clustering, machine learning. It should not be: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything" (Jeff Jonas, IBM).
• Machine learning: the ability of a program to learn from experience. Examples: neural networks, decision trees, rule-based methods, multifactor dimensionality reduction (MDR).
Methods
• Regression methods: model the relationship between a dependent variable and one or more independent variables.
• Data mining methods: search the space of possible models efficiently. Better suited to non-linear and high-dimensional data, or data with many potential interactions.
• Exhaustive search: evaluate all possible models to find the best one.
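To get a feel for why exhaustive search becomes expensive, here is a minimal sketch that enumerates every non-empty subset of a predictor list, treating each subset as one candidate model. The function name `all_models` and the predictor names are hypothetical, and a real search would also score each candidate.

```python
from itertools import combinations

def all_models(predictors):
    """Yield every non-empty subset of predictors (one candidate model each)."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset

snps = ["A", "B", "C", "D"]   # hypothetical predictor names
models = list(all_models(snps))
print(len(models))            # 2**4 - 1 = 15 candidate models
```

With p predictors there are 2^p - 1 such subsets, which is why the data mining methods above try to search this space efficiently rather than visit every model.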
Linear regression
• Models the outcome as a linear combination of the parameters (but not necessarily of the independent variables).
• Ex: Let y = incidence of disease, with n data points and independent variables A, B:
  1) $y_i = b_0 + b_1 A_i + \varepsilon_i$, $i = 1, \dots, n$
  2) $y_i = b_0 + b_2 B_i^2 + \varepsilon_i$, $i = 1, \dots, n$
  where $b_0, b_1, b_2$ are parameters and $\varepsilon_i$ is the error term.
• In both examples the disease is modeled as linear in the parameters, although model 2 is quadratic in the variable B.
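Linear regression
Given a sample, we estimate the parameters (e.g., by least squares) to arrive at the fitted linear regression model [1]. For model 1 above, for instance, the fitted model has the form $\hat{y}_i = \hat{b}_0 + \hat{b}_1 A_i$.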
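As an illustration of least-squares estimation, the sketch below uses the closed-form solution for a one-predictor model. The helper `fit_line` and the data are made up for this example (the data are generated without noise from y = 1 + 2B²). It also shows that the quadratic model is fitted by regressing y on the transformed variable B², since the model is still linear in the parameters.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope estimate
    b0 = mean_y - b1 * mean_x    # intercept estimate
    return b0, b1

# Made-up, noise-free data generated from y = 1 + 2*B**2.
B = [0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * b ** 2 for b in B]

# Quadratic in B, linear in the parameters: regress y on B**2.
b0, b2 = fit_line([b ** 2 for b in B], y)
print(b0, b2)
```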
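Multiple regression
• Relates the outcome to a linear combination of several predictor variables.
• Ex: Let y = incidence of disease, with n data points and independent variables $x_1, x_2, \dots, x_p$:
  $y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + \varepsilon_i$, $i = 1, \dots, n$
• Best fit: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i1} + \dots + \hat{b}_p x_{ip}$. For each unit increase in $x_{ip}$, with the other predictors held fixed, $\hat{y}_i$ is expected to increase by $\hat{b}_p$.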
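A minimal sketch of fitting a multiple regression by least squares, assuming the normal equations (XᵀX)b = Xᵀy are solved directly. The helpers `solve` and `fit_multiple` and the data are hypothetical. The data are generated without noise from y = 1 + 2x₁ + 3x₂, so the fit should recover those coefficients.

```python
def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_multiple(X, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    rows = [[1.0] + list(x) for x in X]              # prepend intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
coef = fit_multiple(X, y)
print(coef)
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one. Statistical packages typically use more numerically stable decompositions.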
Logistic regression [1]
• Often used when the outcome is binary; relates the log-odds of an event to a linear combination of predictor variables.
• Ex: $\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x_B + \gamma x_C + i\, x_B x_C$, where $x_B$ and $x_C$ are measured binary indicator variables, the regression coefficients $\beta$ and $\gamma$ represent main effects, and $i$ represents the interaction.
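To make the log-odds model concrete, the sketch below inverts it to get the probability p for given indicator values. The function name `disease_probability` and the coefficient values are hypothetical. They are chosen so that there are no main effects and only an interaction term.

```python
import math

def disease_probability(x_b, x_c, alpha, beta, gamma, interaction):
    """Invert ln(p/(1-p)) = alpha + beta*xB + gamma*xC + i*xB*xC to get p."""
    log_odds = alpha + beta * x_b + gamma * x_c + interaction * x_b * x_c
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients: no main effects (beta = gamma = 0), interaction only.
params = dict(alpha=-2.0, beta=0.0, gamma=0.0, interaction=3.0)

for x_b in (0, 1):
    for x_c in (0, 1):
        print(x_b, x_c, disease_probability(x_b, x_c, **params))
```

With these values the probability changes only when both indicators equal 1, which is what an interaction without main effects looks like on the log-odds scale.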
Other statistical methods [1]
• Bayesian model selection: a statistical approach incorporating both prior distributions for parameters and observed data into the model.
• Maximum likelihood: a statistical method used to make inferences about the combination of parameter values resulting in the highest probability of obtaining the observed data.
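A small sketch of maximum likelihood, assuming a simple binomial setting that is not part of the slides: `cases` affected individuals out of `n`, with unknown disease probability p. A grid search over p finds the value under which the observed data are most probable. The counts are made up.

```python
import math

def log_likelihood(p, cases, n):
    """Binomial log-likelihood of p (dropping the constant binomial coefficient)."""
    return cases * math.log(p) + (n - cases) * math.log(1 - p)

cases, n = 30, 100                          # made-up counts
grid = [j / 1000 for j in range(1, 1000)]   # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: log_likelihood(p, cases, n))
print(p_hat)
```

In this setting the grid maximum coincides with the closed-form estimate cases / n. The grid search is used here only to show the "highest probability of the observed data" idea directly.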
Modeling Terminology [1]
• Saturated: a statistical model that is as full as possible (saturated) with parameters.
• Marginal effects: the effects of one parameter averaged over the possible values taken by other parameters.
• Entropy: the uncertainty associated with a random variable.
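As one way to quantify entropy, the sketch below assumes the Shannon definition, H = -Σ p log₂ p, measured in bits. The function name `entropy` is hypothetical. The example distribution reuses the genotype frequencies 0.25 / 0.50 / 0.25.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

genotype_freqs = [0.25, 0.50, 0.25]   # e.g. AA, Aa, aa
print(entropy(genotype_freqs))        # 1.5 bits
print(entropy([0.5, 0.5]))            # 1.0 bit: a fair coin
```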
Modeling Terminology [1]
• Cross-validation: partitioning a data set into n subsets, then using each subset in turn as the test set while training on the other n-1.
• Overfitting: when a model provides a good fit to a specific data set but generalizes poorly.
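The cross-validation procedure can be sketched as an index generator: each fold serves once as the test set while the remaining folds form the training set. The function name `k_fold_splits` is hypothetical, and samples are assigned to folds round-robin without shuffling, which is a simplification.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    folds = [indices[f::n_folds] for f in range(n_folds)]   # round-robin assignment
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

A model that scores well on its training folds but poorly on the held-out folds is showing the overfitting described above.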
Marginal Effects [2]
Marginal penetrance. Ex: the probability P(D | A = Aa), irrespective of the value of B.

Table II. Penetrance values for combinations of genotypes from two single nucleotide polymorphisms exhibiting interactions in the absence of independent main effects.

Genotype                 AA (0.25)   Aa (0.50)   aa (0.25)   Marginal penetrance B
BB (0.25)                0           1           0           0.5
Bb (0.50)                1           0           1           0.5
bb (0.25)                0           1           0           0.5
Marginal penetrance A    0.5         0.5         0.5

Genotype frequencies are given in parentheses. The last row and last column give the marginal penetrance values for the A and B genotypes, respectively.
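The marginal penetrance values in the table can be reproduced by averaging each row or column over the genotype frequencies of the other locus. This sketch assumes the genotypes at the two loci are independent. The variable names are illustrative.

```python
freq_a = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
freq_b = {"BB": 0.25, "Bb": 0.50, "bb": 0.25}

# Penetrance P(D | A genotype, B genotype), copied from Table II.
penetrance = {
    ("AA", "BB"): 0, ("Aa", "BB"): 1, ("aa", "BB"): 0,
    ("AA", "Bb"): 1, ("Aa", "Bb"): 0, ("aa", "Bb"): 1,
    ("AA", "bb"): 0, ("Aa", "bb"): 1, ("aa", "bb"): 0,
}

# P(D | A = a): average over B genotypes, weighted by their frequencies.
marginal_a = {a: sum(freq_b[b] * penetrance[(a, b)] for b in freq_b) for a in freq_a}
# P(D | B = b): average over A genotypes, weighted by their frequencies.
marginal_b = {b: sum(freq_a[a] * penetrance[(a, b)] for a in freq_a) for b in freq_b}

print(marginal_a)
print(marginal_b)
```

Every marginal value comes out as 0.5, so neither locus shows an effect on its own even though the joint genotype fully determines disease status.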
Weka [3]
• A collection of visualization tools and algorithms for data analysis and predictive modeling.
• Preprocessing tools for reading data in a variety of formats and transforming it.
• Classification algorithms include regression, neural networks, support vector machines, and decision trees. Output displays include ROC curves.
• Clustering: k-means, expectation maximization.
• Visualization includes scatter plots and bar graphs.
References
[1] Cordell, 2009. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics.
[2] McKinney et al., 2006. Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics.
[3] Weka site: http://www.cs.waikato.ac.nz/ml/weka