Clustering and pathway analysis MLW STO2.3 2010 – BiGCaT Bioinformatics
Clustering and pathway analysis • Tools to help you interpret large datasets (e.g. microarray) • Clustering • Discover patterns in your data • Usually without prior knowledge (unsupervised) • Find sets of genes that behave in a similar way • Pathway Analysis • Combine data with what we know about biology • Find out which biological processes are affected in your experiment, and how
Microarray analysis • Scanned microarrays → Image analysis → Raw intensities → QC, Normalization → Normalized intensities → Statistical analysis → Lists of regulated genes • Clustering → Sets of co-regulated genes • Pathway analysis → Sets of affected pathways • Biological interpretation
Clustering • A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation).
Clustering microarray data • High dimensional: >10,000 measurements from a relatively small number of samples • Use clustering to find groups of similar profiles • You can cluster on both samples and genes Image from J. Pennings, RIVM, NL
Quality control • You can also use clustering for quality control • Figure labels: control group, Hdh knockout group (Zhang et al., BMC Neuroscience, 2008)
How to compare data points? • Calculate a (dis)similarity measure • Commonly used similarity measures are the Pearson and Spearman correlation coefficients • A commonly used dissimilarity measure is the Euclidean distance
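The two similarity measures named on this slide can be sketched in plain Python. This is a minimal illustration, not the implementation of any particular tool; the function names and the two example profiles are made up for the sketch. Spearman correlation is computed here as the Pearson correlation of the ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented example profiles: gene_q rises monotonically but not linearly with gene_p.
gene_p = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_q = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(gene_p, gene_q)
r_spearman = spearman(gene_p, gene_q)
```

For these invented profiles the rank-based measure treats the two genes as perfectly correlated, while the Pearson coefficient is slightly lower because the relationship is not linear.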
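Calculating Euclidean distance • $d_{pq} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$ • $d_{pq}$ is the distance between gene p and gene q • n is the number of samples • E.g. $p_1$ is the measured expression of gene p, sample 1 • Two-dimensional example for points a = $(x_a, y_a)$ and b = $(x_b, y_b)$: $d_{ab} = ((x_b - x_a)^2 + (y_b - y_a)^2)^{1/2}$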
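As a small sketch of the distance described on this slide, the function below sums the squared differences over all n samples and takes the square root. The function name and the example values are assumptions made for illustration.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of samples")
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Two-dimensional case, matching the a/b example on the slide (invented coordinates).
d_ab = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```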
Calculating Euclidean distance • Watch out for measures on different scales • E.g.: weight in kg, length in cm, age in years • Important to standardize scales first: • This can be done by: • subtracting the mean • dividing by the estimate of the standard deviation (s)
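The standardization step can be sketched as follows: subtract the mean and divide by the sample standard deviation s. The function name and the example weights are invented for illustration, and the n - 1 denominator is an assumption about which estimate of s is meant.

```python
import math

def standardize(values):
    """Return (value - mean) / s, with s the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

# Invented measurements on one scale, e.g. weight in kg.
weights_kg = [2.0, 4.0, 6.0]
standardized = standardize(weights_kg)
```

After this step every variable has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distance just because of its unit.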
Cluster algorithms • There are different cluster algorithms • Make use of (dis)similarity measure • Hierarchical • K-means • SOM • … and many more variations
Hierarchical clustering • May be agglomerative … • building up the branches of a tree, beginning with the two most closely related objects • … or divisive • building the tree by finding the most dissimilar objects first • In each case, we end up with a clustering tree or dendrogram having branches and nodes.
Agglomerative (steps 0 to 4) • Start from the single objects a, b, c, d, e • Merge via the groups {a,b}, {d,e} and {c,d,e} into {a,b,c,d,e} • Tree is constructed! Adapted from Kaufman and Rousseeuw (1990)
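Divisive (steps 4 to 0) • Start from the complete set {a,b,c,d,e} • Split via the groups {a,b}, {c,d,e} and {d,e} down to the single objects a, b, c, d, e • Tree is constructed! Adapted from Kaufman and Rousseeuw (1990)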
Agglomerative and divisive clustering sometimes give conflicting results (Fig. 7.10, page 207)
Hierarchical clustering (continued) • How do we determine the most closely related items? • For step 1, the two items with the highest similarity (or smallest distance) are simply grouped first. • For the following steps we need to compute the (dis)similarity between groups of items • Several methods are available to compute this • In single-linkage clustering, the (dis)similarity between two groups of items is that of their most similar (least dissimilar) pair.
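The procedure on this slide can be sketched as a naive agglomerative loop with single linkage. This is a minimal illustration under stated assumptions: the function names are made up, the five one-dimensional points (standing in for objects a to e) are invented, and the exhaustive pair search is chosen for readability rather than speed.

```python
import math

def single_linkage(group_a, group_b, points):
    """Distance between two groups = distance of their closest pair."""
    return min(math.dist(points[i], points[j]) for i in group_a for j in group_b)

def agglomerate(points, linkage=single_linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Invented positions for five objects a..e (indices 0..4).
points = [(0.0,), (1.0,), (6.0,), (9.0,), (10.5,)]
merges = agglomerate(points)
merge_order = [sorted(left + right) for left, right, _ in merges]
merge_distances = [d for _, _, d in merges]
```

The merge history is the information a dendrogram displays: which groups were joined, and at what distance.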
Other hierarchical clustering methods • There are many other criteria for defining clusters: • Single linkage • Complete linkage • Average linkage • Median linkage • Centroid linkage • Ward’s method (minimises variance) Figure panels: single linkage, complete linkage, centroid linkage
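Four of the listed criteria can be written as small functions that differ only in how they summarise the distances between two groups. The sketch below uses invented function names and invented two-dimensional points; median linkage and Ward's method are left out to keep it short.

```python
import math

def pairwise_distances(group_a, group_b):
    """All distances between a point in group_a and a point in group_b."""
    return [math.dist(p, q) for p in group_a for q in group_b]

def single_linkage(group_a, group_b):
    return min(pairwise_distances(group_a, group_b))   # closest pair

def complete_linkage(group_a, group_b):
    return max(pairwise_distances(group_a, group_b))   # farthest pair

def average_linkage(group_a, group_b):
    distances = pairwise_distances(group_a, group_b)
    return sum(distances) / len(distances)             # mean over all pairs

def centroid(group):
    return tuple(sum(coords) / len(group) for coords in zip(*group))

def centroid_linkage(group_a, group_b):
    return math.dist(centroid(group_a), centroid(group_b))

# Invented groups of two points each.
group_a = [(0.0, 0.0), (0.0, 2.0)]
group_b = [(3.0, 0.0), (3.0, 2.0)]
d_single = single_linkage(group_a, group_b)
d_complete = complete_linkage(group_a, group_b)
d_average = average_linkage(group_a, group_b)
d_centroid = centroid_linkage(group_a, group_b)
```

By construction the average-linkage distance always lies between the single-linkage and complete-linkage distances.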
Other hierarchical clustering methods (continued) • These do not always give equivalent clustering patterns • Average linkage and Ward’s method seem most stable • Single-linkage clustering may be susceptible to ‘chaining’ of closely related items, obscuring reasonable cluster structure Figure: chaining in single linkage
K-means clustering • Besides hierarchical methods, many others are available • K-means clustering / Fuzzy K-means clustering • Choose number of clusters K • Take a random center for each cluster • Assign every item to the closest cluster • Recompute cluster centers (to be the average of all items it contains) • Re-assign items to the cluster of which the center is closest now • …and so on…until no change occurs any more • Advantage: easy method • Disadvantages: the choice of K is arbitrary, and you still do not learn much about the relationships between the clusters (no dendrogram)
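The steps listed on this slide can be sketched as a short K-means loop. This is a minimal illustration, not a production implementation: the function name, the `seed` and `max_iter` parameters and the example points are assumptions, and the random centers are taken to be randomly chosen data points. Fuzzy K-means is not covered.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (cluster index per point, list of centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # assumption: start from k random data points
    assignment = None
    for _ in range(max_iter):
        # assign every item to the cluster with the closest center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:   # no change occurs any more
            break
        assignment = new_assignment
        # recompute each center as the average of the items it contains
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers

# Invented data: two well separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, k=2)
```

The result is a flat partition: each item gets a cluster index, but nothing is learned about how the clusters relate to each other.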
Kohonen Self Organising Maps (SOM) • Start with a ‘grid’ of MxN cluster centers • Train this grid to fit the data at hand • Clusters are linked to each other: if one cluster moves, its neighbours also move a bit • Advantage: indicates relations between the clusters • Disadvantages: the choice of M and N is arbitrary, and the results are not easy to interpret Image from J. Pennings, RIVM, NL
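A very small SOM training loop, as a sketch of the idea that moving one cluster center also moves its grid neighbours. Everything specific here is an assumption made for illustration: the function names, the Gaussian neighbourhood, the linearly decaying learning rate and radius, the initialisation from random data points and the example data.

```python
import math
import random

def best_matching_unit(grid, x):
    """Grid coordinate of the node whose weight vector is closest to x."""
    return min(grid, key=lambda node: math.dist(grid[node], x))

def train_som(data, m, n, epochs=50, lr0=0.5, seed=0):
    """Train an m x n grid of cluster centers on the data."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(m, n) / 2
    # one weight vector per grid node, started at a random data point
    grid = {(i, j): list(rng.choice(data)) for i in range(m) for j in range(n)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)              # learning rate shrinks over time
        radius = radius0 * (1 - frac)      # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):
            bmu = best_matching_unit(grid, x)
            for node, w in grid.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # the winner moves most, its grid neighbours move a bit
                h = math.exp(-grid_dist2 / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return grid

# Invented data: two groups of three points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
grid = train_som(data, m=2, n=2)
node_for_first_sample = best_matching_unit(grid, data[0])
```

Each sample is then assigned to its best matching grid node, and nodes that are adjacent on the grid tend to hold similar profiles.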
Clustering can lead to biological interpretation Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) Fig. 7.13 Page 209
Free clustering software • TM4 MeV http://www.tm4.org/mev.html • R http://www.r-project.org/ • Eisen http://rana.lbl.gov/EisenSoftware.htm • NCBI GEO also supports basic clustering:
Pathway analysis • You found that 1300 genes in your microarray experiment were significantly up-regulated after treatment with X • You found a cluster containing 60 genes increasing over time • But then what? You really want to know what this means biologically… • Is a certain biological pathway activated in my experiment? • Which pathways are more strongly activated than average? • How do the genes in the cluster interact with each other? Pathway analysis can help here…
Why pathway analysis • Provides biologically relevant context • More intuitive method than purely mathematical clustering methods • More efficient than looking up biology gene-by-gene • Improve statistical power • Network analysis
Biological pathways Interactions • Biochemical reactions • Transport • Inhibition • Activation ENSG00000141510, Ensembl
Pathway analysis • Based on linking microarray data to genes on a pathway Annotation: ENSG00000131828 Identifier mapping
Identifier mapping • What if the annotation on the pathways differs from that used for the microarray? • Microarrays typically use internal ids: • Affymetrix: 205749_at • Agilent: A_14_P106416 • Illumina: ILMN_4380 • Pathways typically use gene/protein ids: • Entrez Gene: 1543 • Ensembl: ENSG00000140465 • UniProt: P04637
Identifier mapping • 2 scenarios: • The software will take care of it • E.g. PathVisio uses synonym databases • You will have to convert the ids yourself: • DAVID http://david.abcc.ncifcrf.gov/ • SOURCE http://smd.stanford.edu/cgi-bin/source/sourceBatchSearch • BioMART http://www.biomart.org/ • NetAffx http://www.affymetrix.com
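When the ids have to be converted by hand, the core operation is a lookup from array probe ids to gene ids. The sketch below is an illustration only: the function name is made up, the probe and gene ids are placeholders rather than real annotations, and averaging several probes that map to the same gene is one possible design choice among several. In practice the mapping table would come from one of the resources listed on the slide.

```python
def map_identifiers(expression, probe_to_gene):
    """Re-key probe-level values by gene id.

    Values of probes that map to the same gene are averaged (a design choice).
    Returns (values per gene, list of probes without a mapping).
    """
    collected = {}
    unmapped = []
    for probe, value in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)
            continue
        collected.setdefault(gene, []).append(value)
    per_gene = {gene: sum(vals) / len(vals) for gene, vals in collected.items()}
    return per_gene, unmapped

# Placeholder ids and values, invented for this sketch.
expression = {"probe_1": 2.0, "probe_2": 4.0, "probe_3": -1.0, "probe_4": 0.5}
probe_to_gene = {"probe_1": "gene_A", "probe_2": "gene_A", "probe_3": "gene_B"}
per_gene, unmapped = map_identifiers(expression, probe_to_gene)
```

Keeping track of the unmapped probes is useful, because genes that silently drop out at this step cannot show up in any pathway result.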
Where to get pathways? • Online pathway databases • KEGG http://www.genome.jp/kegg • Reactome http://www.reactome.org • WikiPathways http://www.wikipathways.org • Many more… http://pathguide.org • Gene Ontology http://www.geneontology.org • Not really pathways but usable in pathway statistics
Draw your own pathways! • On your computer with PathVisio • Share with WikiPathways
Pathway analysis tools • Based on pathway diagrams • BioRAG • MetaCore (GeneGO) • Pathway-Express • GenMAPP / MAPPFinder • PathVisio • Based on Gene Ontology • Onto-Express • GOToolbox • MAPPFinder • GOstat • GeneMerge • GOSurfer • EASE • FatiGO
Pathway analysis tools • From lists of differentially expressed genes to biological interpretation • Main functions: • Find pathways that are overrepresented in regulated genes • Visualize expression changes on pathways
Analysis on Gene Ontology The Gene Ontology (GO) project gives a consistent description of gene products from different databases. GO consortium: http://www.geneontology.org
Analysis on Gene Ontology Find terms that contain the highest ratio of significantly changed genes…
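Ranking terms by their ratio of changed genes can be sketched as below. The function name, the term names and the gene names are placeholders invented for illustration, not real GO annotations.

```python
def rank_terms_by_ratio(term_to_genes, changed_genes):
    """Return (term, changed, total, ratio) tuples, highest ratio first."""
    changed = set(changed_genes)
    rows = []
    for term, genes in term_to_genes.items():
        genes = set(genes)
        if not genes:
            continue                      # skip terms without annotated genes
        n_changed = len(genes & changed)
        rows.append((term, n_changed, len(genes), n_changed / len(genes)))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Placeholder annotation and list of significantly changed genes.
term_to_genes = {
    "term_1": ["g1", "g2", "g3", "g4"],
    "term_2": ["g3", "g5"],
    "term_3": ["g6", "g7", "g8"],
}
changed_genes = ["g1", "g3", "g5"]
ranking = rank_terms_by_ratio(term_to_genes, changed_genes)
```

A raw ratio favours very small terms, where one or two changed genes already give a high percentage, so it is usually combined with a statistic that takes the term size into account.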
PathVisio www.pathvisio.org • Visualize gene expression on biological pathways • Identify significantly changed processes
Pathway Content • Contributed by research community • Contributed by large-scale curation efforts • Converted between species • Distributed on WikiPathways
Identifiers in PathVisio Pathways Experimental Data • Affymetrix, Illumina, Agilent, CodeLink • Entrez Gene • RefSeq (protein only) • Unigene • UniProt • Ensembl • PDB • Species-specific MOD IDs • Entrez Gene • Unigene • UniProt • Ensembl
Synonym Database Pathways Experimental Data Gene Database • Genes and annotation • Relational information • Assembled from Ensembl
Z-score Unchanged gene Changed gene Question: Does the small circle have a higher percentage of changed genes than the large circle? Is this difference significant?
Z-score • The Z-score can be used as a measure for how much a subset of genes is different from the rest • r = changed genes in Pathway • n = total genes in Pathway • R = changed genes • N = total genes
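The formula itself did not survive on this slide. With the symbols r, n, R and N as defined above, the Z-score used by MAPPFinder-style tools is commonly written as $z = \frac{r - n\frac{R}{N}}{\sqrt{n\,\frac{R}{N}\left(1 - \frac{R}{N}\right)\left(1 - \frac{n-1}{N-1}\right)}}$, and the sketch below assumes that this is the intended formula. The function name and the example counts are invented for illustration.

```python
import math

def pathway_z_score(r, n, R, N):
    """Z-score for r changed genes among n pathway genes,
    given R changed genes among N genes in total."""
    expected = n * R / N                                   # changed genes expected by chance
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# Invented counts: 1000 genes measured, 100 changed; a pathway of 50 genes with 15 changed.
z_enriched = pathway_z_score(r=15, n=50, R=100, N=1000)
# Same pathway with exactly the number of changed genes expected by chance.
z_neutral = pathway_z_score(r=5, n=50, R=100, N=1000)
```

A Z-score near 0 means the pathway contains about as many changed genes as expected by chance; a large positive value means changed genes are overrepresented in the pathway.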
Afternoon practical session You are going to do: • Clustering using the GEO website • Pathway analysis using PathVisio
Internships and “Jaarwerkstuk” (year project) at the dept. of Bioinformatics • Subjects: • Designing new pathways and analyzing microarray data • Developing an online repository for data generated by PathVisio/GenMAPP • miRNA • CpG islands • SNPs • Microarray analysis • ??????? Interested? Email: chris.evelo@bigcat.unimaas.nl. For more information about the department of Bioinformatics, see the eleum course information.