
Clustering and pathway analysis

Clustering and pathway analysis. MLW STO2.3 2010 – BiGCaT Bioinformatics. Clustering and pathway analysis. Tools to help you interpret large datasets (e.g. microarray) Clustering Discover patterns in your data Usually without prior knowledge (unsupervised)



  1. Clustering and pathway analysis MLW STO2.3 2010 – BiGCaT Bioinformatics

  2. Clustering and pathway analysis • Tools to help you interpret large datasets (e.g. microarray data) • Clustering • Discover patterns in your data • Usually without prior knowledge (unsupervised) • Find sets of genes that behave in a similar way • Pathway analysis • Combine data with what we know about biology • Find which biological processes are affected in your experiment, and how

  3. Microarray analysis (workflow diagram) • Scanned microarrays → Image analysis → Raw intensities → QC, normalization → Normalized intensities → Statistical analysis → Lists of regulated genes → Clustering / Pathway analysis → Sets of co-regulated genes / Sets of affected pathways → Biological interpretation

  4. Clustering • A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation).

  5. Clustering microarray data • High-dimensional: more than 10,000 measurements from a relatively small number of samples • Use clustering to find groups of similar profiles • You can cluster on both samples and genes • (Image from J. Pennings, RIVM, NL)

  6. Quality control • You can also use clustering for quality control • Figure labels: control group, Hdh knockout group (Zhang et al., BMC Neuroscience, 2008)

  7. How to compare data points? • Calculate a (dis)similarity measure • Commonly used similarity measures are the Pearson and Spearman correlation coefficients • A commonly used dissimilarity measure is the Euclidean distance
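
The two similarity measures named on this slide can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names and the example profiles `gene_p` and `gene_q` are made up, and Spearman is computed here as the Pearson correlation of the ranks.

```python
import math

def pearson(p, q):
    """Pearson correlation between two equally long expression profiles."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(p, q):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(p), ranks(q))

# Hypothetical profiles: q rises with p, but not linearly.
gene_p = [1.0, 2.0, 3.0, 4.0]
gene_q = [1.0, 4.0, 9.0, 16.0]
print(pearson(gene_p, gene_q))
print(spearman(gene_p, gene_q))
```

For a monotonic but non-linear relation like this one, Spearman reports a perfect correlation while Pearson stays slightly below 1.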

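  8. Calculating Euclidean distance • $d_{pq}$ is the distance between gene p and gene q • n is the number of samples • E.g. $p_1$ is the measured expression of gene p in sample 1 • $d_{pq} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$ • Two-dimensional example for points $a = (x_a, y_a)$ and $b = (x_b, y_b)$: $d_{ab} = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2}$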
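
As a small worked example of the distance described on this slide, the following Python sketch computes the Euclidean distance between two profiles of equal length. The function name and the example points `a` and `b` are illustrative choices, not part of the original slides.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences over all n samples."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of samples")
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Two-dimensional case from the slide: a = (x_a, y_a), b = (x_b, y_b)
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean_distance(a, b))  # 5.0, since sqrt(3^2 + 4^2) = 5
```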

  9. Calculating Euclidean distance • Watch out for measures on different scales • E.g. weight in kg, length in cm, age in years • It is important to standardize the scales first • This can be done by subtracting the mean and dividing by the estimate of the standard deviation (s)
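
The standardization step can be sketched as follows, assuming the sample standard deviation (`statistics.stdev`) as the estimate s. The variable names and measurements are invented for illustration.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]

# Hypothetical measurements on very different scales
weights_kg = [60.0, 70.0, 80.0]
lengths_cm = [160.0, 170.0, 180.0]

print(standardize(weights_kg))  # [-1.0, 0.0, 1.0]
print(standardize(lengths_cm))  # [-1.0, 0.0, 1.0]
```

After standardization both variables are on the same scale, so neither dominates the Euclidean distance just because of its unit.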

  10. Cluster algorithms • There are different cluster algorithms • Make use of (dis)similarity measure • Hierarchical • K-means • SOM • … and many more variations

  11. Hierarchical clustering • May be agglomerative … • building up the branches of a tree, beginning with the two most closely related objects • … or divisive • building the tree by finding the most dissimilar objects first • In each case, we end up with a clustering tree or dendrogram having branches and nodes.

  12. Agglomerative clustering (figure, adapted from Kaufman and Rousseeuw, 1990) • Steps 0 to 4 merge the objects a, b, c, d, e via {a,b}, {d,e} and {c,d,e} into {a,b,c,d,e} • Tree is constructed!

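  13. Divisive clustering (figure, adapted from Kaufman and Rousseeuw, 1990) • Steps 0 to 4, read in the opposite direction, split {a,b,c,d,e} via {a,b}, {c,d,e} and {d,e} into the single objects a, b, c, d, e • Tree is constructed!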

  14. Agglomerative and divisive clustering sometimes give conflicting results, as shown here (Fig. 7.10, page 207)

  15. Hierarchical clustering (continued) • How to find the most closely related items? • In step 1, the two items with the highest similarity (or smallest distance) are simply grouped first. • For the following steps we need to compute the (dis)similarity between groups of items • Several methods are available to compute this • In single-linkage clustering, the (dis)similarity between two groups of items is that of their most similar (least dissimilar) pair.
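
The agglomerative procedure with single linkage can be sketched as a naive loop. This is an illustration only: it uses Euclidean distance as the dissimilarity, the function name and example points are made up, and it has none of the optimizations of real clustering packages.

```python
import math

def agglomerate_single_linkage(points):
    """Repeatedly merge the two closest groups; return the merge history.

    The distance between two groups is that of their closest pair
    of members (single linkage).
    """
    clusters = [[i] for i in range(len(points))]  # start: one group per item
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical one-dimensional items (indices 0 to 4)
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerate_single_linkage(points)
for left, right, distance in merges:
    print(left, right, distance)
```

Each entry of the merge history corresponds to one node of the dendrogram, with the merge distance as its height.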

  16. Other hierarchical clustering methods • There are many other criteria for defining clusters: • Single linkage • Complete linkage • Average linkage • Median linkage • Centroid linkage • Ward’s method (minimises variance) • Figure panels: single linkage, complete linkage, centroid linkage
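
To make the difference between some of these criteria concrete, the sketch below computes the distance between two small groups under single, complete and average linkage. It assumes Euclidean distance between items, and the groups are invented for illustration.

```python
import math
from itertools import product

def pair_distances(group_a, group_b):
    """All distances between a member of group_a and a member of group_b."""
    return [math.dist(a, b) for a, b in product(group_a, group_b)]

def single_linkage(group_a, group_b):
    return min(pair_distances(group_a, group_b))   # closest pair

def complete_linkage(group_a, group_b):
    return max(pair_distances(group_a, group_b))   # farthest pair

def average_linkage(group_a, group_b):
    distances = pair_distances(group_a, group_b)
    return sum(distances) / len(distances)         # mean over all pairs

group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(group_a, group_b))    # 3.0
print(complete_linkage(group_a, group_b))  # 6.0
print(average_linkage(group_a, group_b))   # 4.5
```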

  17. Other hierarchical clustering methods (continued) • These do not always give equivalent clustering patterns • Average linkage and Ward’s method seem most stable • Single-linkage clustering may be susceptible to ‘chaining’ of closely related items, obscuring reasonable cluster structure • Figure: chaining in single linkage

  18. K-means clustering • Besides hierarchical methods many others are available • K-means clustering / Fuzzy K-means clustering • Choose the number of clusters K • Take a random center for each cluster • Assign every item to the closest cluster • Recompute the cluster centers (each becomes the average of all items it contains) • Re-assign items to the cluster whose center is now closest • …and so on…until no change occurs any more • Advantage: easy method • Disadvantages: the choice of K is arbitrary; you still do not learn much about the relationships between the clusters (no dendrogram)
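
The K-means steps listed on this slide can be sketched as follows. The sketch makes a few assumptions that are not on the slide: the random initial centers are drawn from the data items themselves, a `seed` parameter makes the run repeatable, and `max_iter` caps the number of rounds. All names and the example profiles are illustrative.

```python
import math
import random

def k_means(items, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(items, k)      # assumption: random items as start centers
    labels = None
    for _ in range(max_iter):
        # assign every item to the cluster with the closest center
        new_labels = [min(range(k), key=lambda c: math.dist(item, centers[c]))
                      for item in items]
        if new_labels == labels:        # no change any more: stop
            break
        labels = new_labels
        # recompute each center as the average of the items it contains
        for c in range(k):
            members = [item for item, lab in zip(items, labels) if lab == c]
            if members:                 # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical two-sample profiles forming two obvious groups
profiles = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
labels, centers = k_means(profiles, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.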

  19. Kohonen Self Organising Maps (SOM) • Start with a ‘grid’ of MxN cluster centers • Train this grid to fit the data at hand • Clusters are linked to each other: if one cluster moves, its neighbours also move a bit • Advantage: indicates relations between the clusters • Disadvantages: the choice of M and N is arbitrary; results are not easy to interpret • (Image from J. Pennings, RIVM, NL)
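
The idea that neighbouring clusters move along can be illustrated with a single training step. This is a simplified sketch, not a full SOM: the two rates and the neighbourhood (grid cells directly adjacent to the winner) are assumptions chosen for illustration, and a real SOM would repeat this step over many items while shrinking both the rates and the neighbourhood.

```python
import math

def som_update(grid, item, learning_rate=0.5, neighbour_rate=0.25):
    """One training step on an M x N grid of cluster centers (in place).

    The center closest to the item moves towards it; its direct grid
    neighbours move too, but by a smaller fraction.
    """
    rows, cols = len(grid), len(grid[0])
    best_r, best_c = min(((r, c) for r in range(rows) for c in range(cols)),
                         key=lambda rc: math.dist(grid[rc[0]][rc[1]], item))
    for r in range(rows):
        for c in range(cols):
            grid_distance = abs(r - best_r) + abs(c - best_c)
            if grid_distance == 0:
                rate = learning_rate
            elif grid_distance == 1:
                rate = neighbour_rate
            else:
                continue                # centers further away stay where they are
            grid[r][c] = tuple(w + rate * (x - w) for w, x in zip(grid[r][c], item))
    return best_r, best_c

# Hypothetical 1 x 3 grid of two-dimensional centers
grid = [[(0.0, 0.0), (4.0, 4.0), (8.0, 8.0)]]
winner = som_update(grid, (1.0, 1.0))
print(winner)  # (0, 0)
print(grid)    # [[(0.5, 0.5), (3.25, 3.25), (8.0, 8.0)]]
```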

  20. Clustering can lead to biological interpretation • Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) (Fig. 7.13, page 209)

  21. Free clustering software • TM4 MeV http://www.tm4.org/mev.html • R http://www.r-project.org/ • Eisen http://rana.lbl.gov/EisenSoftware.htm • NCBI GEO also supports basic clustering

  22. Pathway analysis • You found that 1300 genes in your microarray experiment were significantly up-regulated after treatment with X • You found a cluster containing 60 genes increasing over time • But then what? You really want to know what this means biologically… • Is a certain biological pathway activated in my experiment? • Which pathways are more activated than average? • How do the genes in the cluster interact with each other? • Pathway analysis can help here…

  23. Why pathway analysis • Provides biologically relevant context • More intuitive than purely mathematical clustering methods • More efficient than looking up biology gene-by-gene • Improves statistical power • Network analysis

  24. Biological pathways • Interactions: • Biochemical reactions • Transport • Inhibition • Activation • Figure label: ENSG00000141510, Ensembl

  25. Pathway analysis • Based on linking microarray data to genes on a pathway • Figure labels: annotation ENSG00000131828; identifier mapping

  26. Identifier mapping • What if the annotation on the pathways differs from that used for the microarray? • Microarrays typically use internal ids: • Affymetrix: 205749_at • Agilent: A_14_P106416 • Illumina: ILMN_4380 • Pathways typically use gene/protein ids: • Entrez Gene: 1543 • Ensembl: ENSG00000140465 • UniProt: P04637

  27. Identifier mapping • Two scenarios: • The software takes care of it • E.g. PathVisio uses synonym databases • You have to convert the ids yourself: • DAVID http://david.abcc.ncifcrf.gov/ • SOURCE http://smd.stanford.edu/cgi-bin/source/sourceBatchSearch • BioMart http://www.biomart.org/ • NetAffx http://www.affymetrix.com

  28. Where to get pathways? • Online pathway databases • KEGG http://www.genome.jp/kegg • Reactome http://www.reactome.org • WikiPathways http://www.wikipathways.org • Many more… http://pathguide.org • Gene Ontology http://www.geneontology.org (not really pathways, but usable in pathway statistics)

  29. Draw your own pathways! • On your computer with PathVisio • Share with WikiPathways

  30. Pathway analysis tools • Based on pathway diagrams • BioRAG • MetaCore (GeneGO) • Pathway-Express • GenMAPP / MAPPFinder • PathVisio • Based on Gene Ontology • Onto-Express • GOToolbox • MAPPFinder • Gostat • GeneMerge • GOSurfer • EASE • Fatigo

  31. Pathway analysis tools • From lists of differentially expressed genes to biological interpretation • Main functions: • Find pathways that are overrepresented among the regulated genes • Visualize expression changes on pathways

  32. Analysis on Gene Ontology • The Gene Ontology (GO) project gives a consistent description of gene products from different databases. • GO consortium: http://www.geneontology.org

  33. Analysis on Gene Ontology • Find the terms that contain the highest ratio of significantly changed genes…

  34. Analysis on pathway diagrams

  35. PathVisio www.pathvisio.org • Visualize gene expression on biological pathways • Identify significantly changed processes

  36. Pathway Content • Contributed by research community • Contributed by large-scale curation efforts • Converted between species • Distributed on WikiPathways

  37. Identifiers in PathVisio Pathways Experimental Data • Affymetrix, Illumina, Agilent, CodeLink • Entrez Gene • RefSeq (protein only) • Unigene • UniProt • Ensembl • PDB • Species-specific MOD IDs • Entrez Gene • Unigene • UniProt • Ensembl

  38. Synonym Database Pathways Experimental Data Gene Database • Genes and annotation • Relational information • Assembled from Ensembl

  39. PathVisio results: Data mapped on Pathway

  40. PathVisio Results: Z-score

  41. Z-score • Figure legend: unchanged gene, changed gene • Question: Does the small circle have a higher percentage of changed genes than the large circle? Is this difference significant?

  42. Z-score • The Z-score can be used as a measure of how much a subset of genes differs from the rest • r = changed genes in the pathway • n = total genes in the pathway • R = changed genes • N = total genes
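
The formula itself is not part of this transcript. A form commonly used for this kind of pathway statistic (for example in MAPPFinder) compares the observed number of changed genes r in the pathway with the number expected from the overall fraction R/N, scaled by the standard deviation under sampling without replacement: z = (r - n·R/N) / sqrt(n · (R/N) · (1 - R/N) · (1 - (n - 1)/(N - 1))). The sketch below assumes that this is the formula meant on the slide; the function name and the example counts are made up.

```python
import math

def pathway_z_score(r, n, R, N):
    """Z-score for over-representation of changed genes in a pathway.

    r: changed genes in the pathway, n: total genes in the pathway,
    R: changed genes overall,       N: total genes overall.
    Assumed formula (hypergeometric mean and standard deviation).
    """
    fraction_changed = R / N
    expected = n * fraction_changed
    variance = (n * fraction_changed * (1 - fraction_changed)
                * (1 - (n - 1) / (N - 1)))
    return (r - expected) / math.sqrt(variance)

# Hypothetical counts: 15 of 50 pathway genes changed, against 100 of 1000 overall
z = pathway_z_score(r=15, n=50, R=100, N=1000)
print(round(z, 2))
```

A Z-score of 0 means the pathway contains exactly as many changed genes as expected by chance; large positive values indicate over-representation.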

  43. Afternoon practical session You are going to do: • Clustering using the GEO website • Pathway analysis using PathVisio

  44. Internships and “Jaarwerkstuk” at the dept. of Bioinformatics • Subjects: • Designing new pathways and analyzing microarray data • Developing an online repository for data generated by PathVisio/GenMAPP • miRNA • CpG islands • SNPs • Microarray analysis • Interested? Email: chris.evelo@bigcat.unimaas.nl • For more information about the department of Bioinformatics, look at the eleum course information
