
Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium


Presentation Transcript


  1. Workshop report: Biclustering Methods for Microarray Data, Hasselt University, Belgium • Guy Harari

  2. FABIA: factor analysis for bicluster acquisition • Sepp Hochreiter et al., University of Linz, Austria

  3. FABIA - Motivation • Plaid models: for bicluster i: • They use a least-squares fit for model selection • and thus assume Gaussian effects • However, microarray data are not Gaussian (they have heavy tails)

  4. FABIA – model • Biclusters have multiplicatively coherent values: each bicluster is an outer product $\lambda z^T$ • $\lambda$ – prototype • $z$ – factors • In the example above:

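  5. FABIA – model • For p biclusters and additive Gaussian noise: $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon = \Lambda Z + \Upsilon$ • The j-th sample (column in X) is: $x_j = \sum_{i=1}^{p} \lambda_i z_{ij} + \epsilon_j = \Lambda \tilde{z}_j + \epsilon_j$ • where $\tilde{z}_j$ is the j-th column of Z. • Λ and Z are sparse.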
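
The sum-of-outer-products structure of this model can be illustrated with a minimal sketch. The vectors below are hypothetical toy values (not from the talk), and the noise term is left out so that the block structure stays visible.

```python
def outer(lam, z):
    """Outer product lam * z^T as a list of rows (genes x samples)."""
    return [[l * zj for zj in z] for l in lam]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two hypothetical sparse biclusters over 4 genes and 5 samples.
lam1, z1 = [2.0, 1.0, 0.0, 0.0], [1.0, 3.0, 0.0, 0.0, 0.0]
lam2, z2 = [0.0, 0.0, 1.0, -1.0], [0.0, 0.0, 0.0, 2.0, 1.0]

# Noise-free part of the data: the sum of the two outer products.
X = add(outer(lam1, z1), outer(lam2, z2))

for row in X:
    print(row)
```

Each bicluster occupies the rows where its prototype is nonzero and the columns where its factor is nonzero. Within that block the rows are multiples of one another, which is the multiplicative coherence the model assumes.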

  6. Generative Model for Factor Analysis • The data are assumed to be produced by: • Picking values independently from some Gaussian hidden factors. • Linearly combining the factors using a factor loading matrix. • Adding Gaussian noise to each input.

  7. Generative Model for Factor Analysis • Assume factors and noise are independent. • Assume also $z \sim \mathcal{N}(0, I)$. • Select the number of factors by, e.g., the Kaiser criterion (keep factors whose eigenvalue exceeds 1). • Extract factors using, e.g., maximum likelihood.

  8. FABIA – model • Fix the sample index j. • The factors are the $z_{ij}$'s, $i = 1, \dots, p$. • Biclusters shouldn’t be correlated. • The $\lambda_{ki}$ are the loading matrix’s entries. • $\Psi$ is diagonal – independent Gaussian noise.

  9. Sparseness • We want sparse solutions for $\Lambda$ and $Z$ • So use a Laplace distribution for the factors $z$: • For the loadings $\lambda$ use one of: • FABIA: • FABIAS: parameter
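
For reference, the density of a Laplace distribution scaled to unit variance is given below. Whether the slide used exactly this normalization is an assumption; the formula itself is the standard one.

```latex
p(z) = \frac{1}{\sqrt{2}} \exp\!\left(-\sqrt{2}\,\lvert z \rvert\right),
\qquad \mathbb{E}[z] = 0, \quad \operatorname{Var}(z) = 1
```

Compared with a Gaussian of the same variance, this density has a sharper peak at zero and heavier tails, which is why it favors sparse solutions.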

  10. Model Selection • Center the data to zero median. • Normalization – divide the values by the row’s standard deviation. • Use EM where the parameters are $\Lambda$ and $\Psi$. • Rank biclusters according to mutual information: • Determine the members of each bicluster using two thresholds, one for the $\lambda$ values and one for the $z$ values.
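
The two preprocessing steps on this slide (median centering and scaling by the row's standard deviation) could look roughly like the sketch below. The function name `preprocess` is hypothetical. Using the population standard deviation and leaving constant rows unscaled are assumptions, since the slide specifies neither.

```python
import statistics


def preprocess(rows):
    """Center each row (gene) to zero median, then divide by its std."""
    out = []
    for row in rows:
        med = statistics.median(row)
        centered = [v - med for v in row]
        sd = statistics.pstdev(centered)  # assumption: population std
        # Assumption: a constant row is left at zero instead of dividing by 0.
        out.append([v / sd for v in centered] if sd > 0 else centered)
    return out


data = [[1.0, 2.0, 3.0, 10.0], [5.0, 5.0, 5.0, 5.0]]
result = preprocess(data)
```

Centering on the median rather than the mean keeps a few extreme expression values from shifting the baseline, which fits the heavy-tailed data the talk describes.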

  11. Experiments – Simulated Datasets • n = 1000 genes, l = 100 samples • p = 10 multiplicative biclusters • Generate $\lambda_i$: • Choose $N_i^{\lambda}$ - the number of genes in bicluster i - uniformly at random from {10,…,210}. • Choose $N_i^{\lambda}$ genes from {1,…,1000}. • Set $\lambda_i$ components not in bicluster i to . • Set $\lambda_i$ components in bicluster i to .

  12. Experiments – Simulated Datasets • Generate $z_i$: • Choose $N_i^{z}$ - the number of samples in bicluster i - uniformly at random from {5,…,25}. • Choose $N_i^{z}$ samples from {1,…,100}. • Set $z_i$ components not in bicluster i to . • Set $z_i$ components in bicluster i to . • Add random noise to all entries according to . • Compute the dataset with $X = \sum_{i=1}^{p} \lambda_i z_i^T + \Upsilon$.
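
The generation procedure can be sketched as follows. The set sizes and dimensions are taken from the slides, but the distributions are not preserved in this transcript. The Gaussian parameters used for background, signal and noise below are therefore purely hypothetical placeholders, and `sparse_vector` and `simulate` are illustrative names.

```python
import random


def sparse_vector(length, size_range, inside, outside, rng):
    """Draw a vector whose randomly chosen 'member' entries carry signal."""
    size = rng.randint(*size_range)  # uniform on {lo, ..., hi}
    members = set(rng.sample(range(length), size))
    vec = [inside(rng) if k in members else outside(rng) for k in range(length)]
    return vec, members


def simulate(n=1000, l=100, p=10, seed=0):
    rng = random.Random(seed)
    background = lambda r: r.gauss(0.0, 0.2)  # hypothetical choice
    signal = lambda r: r.gauss(2.0, 1.0)      # hypothetical choice
    # Start from the noise matrix (hypothetical unit-variance Gaussian).
    X = [[rng.gauss(0.0, 1.0) for _ in range(l)] for _ in range(n)]
    truth = []
    for _ in range(p):
        lam, genes = sparse_vector(n, (10, 210), signal, background, rng)
        z, samples = sparse_vector(l, (5, 25), signal, background, rng)
        for k in range(n):
            for j in range(l):
                X[k][j] += lam[k] * z[j]  # add the outer product lam * z^T
        truth.append((genes, samples))
    return X, truth


X, truth = simulate()
```

Returning the planted gene and sample sets alongside the data makes it possible to score a biclustering method against the known ground truth.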

  13. Evaluation – consensus score • For two sets of biclusters: • Compute the similarity between each pair of biclusters, one from each set. • Find the maximum assignment using the Munkres (Hungarian) algorithm. • Penalize different numbers of biclusters: divide the sum of similarities of the assigned biclusters by the number of biclusters in the larger set. • Use the Jaccard index for computing similarity.
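
A small sketch of the consensus score follows. To stay dependency-free it finds the best assignment by brute force over permutations instead of the Munkres algorithm, so it is only suitable for a handful of biclusters. Representing a bicluster as a set of (row, column) cells is an assumption about how the Jaccard index is applied.

```python
from itertools import permutations


def cells(rows, cols):
    """Represent a bicluster as the set of matrix cells it covers."""
    return {(r, c) for r in rows for c in cols}


def jaccard(b1, b2):
    """Jaccard index of two biclusters given as sets of cells."""
    union = len(b1 | b2)
    return len(b1 & b2) / union if union else 0.0


def consensus_score(set_a, set_b):
    """Best one-to-one assignment score, divided by the larger set's size."""
    small, large = sorted((set_a, set_b), key=len)
    if not large:
        return 0.0
    best = 0.0
    # Brute force stand-in for Munkres: try every injective assignment.
    for perm in permutations(range(len(large)), len(small)):
        total = sum(jaccard(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


truth = [cells({0, 1}, {0, 1}), cells({2, 3}, {2, 3})]
found = [cells({2, 3}, {2, 3}), cells({0, 1}, {0, 1, 2}), cells({5}, {5})]
score = consensus_score(truth, found)
```

In this toy case one bicluster is recovered exactly, one is recovered with an extra column (Jaccard 4/6), and one spurious bicluster is reported, so the score is (1 + 4/6) / 3.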

  14. Simulated Datasets - Results • Average score and STD for each method:

  15. Simulated Datasets - Results • Avg. and STD of information content and similarity:

  16. Simulated additive datasets • Generate biclusters in the same way. • Use additive model for each bicluster: • Choose from and from . • Choose from one of three models: • Low signal – • Moderate signal – • High signal –

  17. Additive Datasets - results • Low signal:

  18. Additive Datasets - results • Moderate signal:

  19. Additive Datasets - results • High signal:

  20. Gene Expression Datasets • Breast cancer (van ’t Veer et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007). • Multiple tissue types dataset (Su et al., 2002) • Diffuse large-B-cell lymphoma dataset (DLBCL) (Rosenwald et al., 2002) – 3 classes (clusters) were found in Hoshida et al. (2007).

  21. Gene Expression Datasets - results

  22. Biological Interpretation • Breast cancer: • Bicluster 1 is related to the cell cycle (GO and KEGG) and to the proteins CDC2 (division control) and KIF (mitosis). • Bicluster 2 is related to immune response (GO) and cytokine-cytokine receptor interaction (KEGG), and to cytokine-related proteins such as CCR5, CCL4 and CSF2RB. • Multiple tissue – no biological interpretation.

  23. Biological Interpretation • DLBCL: • Bicluster 1 is related to the ribosome (GO, KEGG) and to B-cell receptor signaling (KEGG). • Bicluster 2 is related to the immune system (GO, KEGG).

  24. Drug Design • Goal: find compounds with similar effects on gene expression. • Use Affymetrix GeneChip HT HG-U133+ PM array plates with 12 x 8 samples per plate. • Selected compounds are active on a cancer cell line. • Each compound was tested in a group of three replicates.

  25. Drug Design • 3 biclusters were found to have 2-5 replicate sets. • One of them extracted genes related to mitosis (GO). • The compounds of this bicluster are now under investigation by Johnson & Johnson Pharmaceutical R&D.

  26. Biclustering Gene Expression Time Series Sara C Madeira, Technical University of Lisbon

  27. Introduction • Input: columns correspond to samples taken at consecutive time points. • Output: biclusters with contiguous columns. • Motivation: biological processes start and end within a contiguous time interval, leading to increased/decreased activity of some genes. • Goal: find all maximal contiguous column coherent (CCC) biclusters, sorted by a statistical score.

  28. Discretization • Let $A$ be the input expression matrix. • Define $A'$: • Standardize $A'$ to mean 0 and standard deviation 1 by gene.

  29. Discretization • Define the discretized matrix: • where D denotes down-regulation, U up-regulation and N no change, • and the threshold t = 1 is the standard deviation of a gene.
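
The discretization into the symbols D, U and N can be sketched as below. The exact definition of the matrix being standardized is not preserved in this transcript, so the sketch simply assumes each input row is the per-gene series to standardize. Whether the thresholds are inclusive is also an assumption.

```python
import statistics


def standardize(row):
    """Scale one gene's values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(row)
    sd = statistics.pstdev(row)  # assumption: population std
    return [(v - mean) / sd for v in row] if sd > 0 else [0.0] * len(row)


def discretize(row, t=1.0):
    """Map standardized values to D (<= -t), U (>= t) or N (otherwise)."""
    return "".join(
        "D" if v <= -t else "U" if v >= t else "N" for v in standardize(row)
    )


pattern = discretize([0.0, 0.0, 0.0, 0.0, 10.0, -10.0])
print(pattern)
```

After standardization a threshold of t = 1 corresponds to one standard deviation of the gene, so only values that depart clearly from the gene's own baseline are labeled U or D.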

  30. CCC-Bicluster • Definition: A CCC-Bicluster is a subset of rows $I$ and a contiguous subset of columns $J$ such that $A_{ij} = A_{lj}$ for all rows $i, l \in I$ and columns $j \in J$. • Note that each CCC-Bicluster defines a string S which is common to every row in I.
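
To make the definition concrete, the sketch below enumerates maximal CCC-biclusters of a tiny discretized matrix by brute force: for every contiguous column range it groups the rows that carry the same symbols there. This is a naive illustration, not the suffix-tree algorithm presented in the talk. The function name and the half-open column ranges `[start, end)` are choices made here.

```python
from collections import defaultdict


def ccc_biclusters(rows, min_rows=2):
    """Return maximal CCC-biclusters as (row indices, start, end) triples."""
    m = len(rows[0])
    found = set()
    for start in range(m):
        for end in range(start + 1, m + 1):
            groups = defaultdict(set)
            for i, row in enumerate(rows):
                groups[row[start:end]].add(i)  # same pattern, same columns
            for members in groups.values():
                if len(members) >= min_rows:
                    found.add((frozenset(members), start, end))

    def contains(big, small):
        """True if 'big' strictly contains 'small' in rows and columns."""
        return (big != small and big[0] >= small[0]
                and big[1] <= small[1] and big[2] >= small[2])

    maximal = [b for b in found if not any(contains(o, b) for o in found)]
    return sorted(((sorted(I), s, e) for I, s, e in maximal),
                  key=lambda b: (b[1], b[2], b[0]))


matrix = ["UDNU", "UDNN", "NDNU"]
biclusters = ccc_biclusters(matrix)
print(biclusters)
```

For this matrix the maximal biclusters are rows 0 and 1 on columns 0 to 2 (pattern UDN), all three rows on columns 1 to 2 (pattern DN), and rows 0 and 2 on columns 1 to 3 (pattern DNU).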

  31. Suffix Trees • Each internal node, other than the root, has at least two children. • Each edge is labeled with a nonempty substring of S (here “BANANA”). • No two edges out of a node have edge labels starting with the same symbol. • The edge labels on the path from the root to a leaf spell out a suffix of S.

  32. Example • Internal node = row-maximal, right-maximal CCC-Bicluster

  33. Main Result • Every (inclusion-)maximal CCC-Bicluster with at least two rows corresponds to an internal node in the suffix tree such that: • It does not have incoming suffix links, or, • It has incoming suffix links only from nodes having fewer leaves in their subtrees. • Each such internal node defines a maximal CCC-Bicluster with at least two rows. • This implies an O(nm) time algorithm for finding all maximal CCC-Biclusters.

  34. Experiments – Simulated Datasets • Generate a random 1000 x 50 dataset. • Apply the algorithm to it. • Plant 10 CCC-Biclusters in the same dataset. • Apply the algorithm to the dataset again. • Define a similarity measure, the Jaccard index (over genes and conditions), and a statistical test. • Filter out similar biclusters and those that did not pass the statistical test.

  35. The Statistical Test • Null hypothesis – expression values of a subset of genes evolve independently. • Expression patterns are modeled by a first-order Markov Chain, e.g. for the pattern : where

  36. The Statistical Test • n – the number of genes in the dataset. • I – the subset of genes in a CCC-Bicluster. • The significance of a CCC-Bicluster B with an expression pattern is:
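
The formulas for this test are not preserved in the transcript, so the following is only a sketch of one plausible reading. The probability of a pattern is computed under a first-order Markov chain, and the significance is taken as the binomial tail probability that at least |I| of the n independently evolving genes show that pattern. The single transition table shared by all time points, the probabilities in it, and the binomial form are all assumptions.

```python
from math import comb


def pattern_probability(pattern, initial, transition):
    """P(pattern) = P(s1) * product of P(s_{k+1} | s_k)."""
    prob = initial[pattern[0]]
    for prev, nxt in zip(pattern, pattern[1:]):
        prob *= transition[prev][nxt]
    return prob


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Hypothetical symbol probabilities, identical for every previous symbol.
initial = {"U": 0.25, "D": 0.25, "N": 0.5}
transition = {s: {"U": 0.25, "D": 0.25, "N": 0.5} for s in "UDN"}

p = pattern_probability("UDN", initial, transition)
p_value = binomial_tail(n=10, k=3, p=p)
```

A small tail probability means that so many genes sharing the pattern would be unlikely if the genes evolved independently, which is the null hypothesis stated on the previous slide.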

  37. Simulated Datasets - results • 165 CCC-Biclusters passed the test at the 1 percent level, after Bonferroni correction.

  38. Experiments – Real Datasets • Use the yeast heat shock response dataset from Gasch et al. • 25 CCC-Biclusters were found to be highly significant at the 1% level after Bonferroni correction. • 9 of them were removed after the similarity check. • Test the results for GO enrichment (hypergeometric test).
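
A hypergeometric enrichment test with a Bonferroni-corrected threshold can be sketched as follows. The function names and the toy numbers are illustrative only and do not come from the talk.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k): n genes drawn from N, of which K carry the GO term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def bonferroni_significant(p_value, n_tests, alpha=0.01):
    """Compare the p-value multiplied by the number of tests with alpha."""
    return p_value * n_tests <= alpha


# Toy case: 10 genes in total, 3 annotated, bicluster of 4 genes with 2 hits.
p_enrich = hypergeom_tail(N=10, K=3, n=4, k=2)
print(p_enrich, bonferroni_significant(p_enrich, n_tests=25))
```

Here the tail probability is (3 * 21 + 1 * 7) / 210 = 1/3, so this toy bicluster would not count as enriched.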

  39. Real Datasets - results

  40. Up-regulated CCC-Biclusters

  41. Down-regulated CCC-Biclusters

  42. Improvements • Allow errors: replacement of D/U with N and vice versa. • Discover biclusters with opposite patterns (anti-correlated). • Allow scaled and time-lagged (shifted) patterns. • TriClustering – genes x time points x exemplars (different patients/stress conditions).

  43. Other talks • “biclust” R package – Ludwig Maximilian University of Munich (Institute of Statistics) and Hasselt University. • ISA and related tools (R packages) – Gabor Csardi, University of Lausanne, Switzerland. • Clustering of dose-response microarray data – Hasselt University, Johnson & Johnson PR&D. • Model- and graph-based clustering of genomic data – Freiburg Institute for Advanced Studies, Germany.

  44. Questions?
