Clustering and pathway analysis

Clustering and pathway analysis MLW STO2.3 2010 – BiGCaT Bioinformatics

Clustering and pathway analysis • Tools to help you interpret large datasets (e.g. microarray) • Clustering • Discover patterns in your data • Usually without prior knowledge (unsupervised) • Find sets of genes that behave in a similar way • Pathway Analysis • Combine data with what we know about biology • Find which biological processes are affected in your experiment, and how

Microarray analysis • Scanned microarrays → Image analysis → Raw intensities → QC, normalization → Normalized intensities → Statistical analysis → Lists of regulated genes • Lists of regulated genes → Clustering → Sets of co-regulated genes → Pathway analysis • Lists of regulated genes → Pathway analysis → Sets of affected pathways → Biological interpretation

Clustering • A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation).

Clustering microarray data • High dimensional: >10,000 measurements from a relatively small number of samples • Use clustering to find groups of similar profiles • You can cluster on both samples and genes • Image from J. Pennings, RIVM, NL

Quality control • You can also use clustering for quality control • [Figure: clustering of samples, labelled control group and Hdh knockout group; Zhang et al., BMC Neuroscience, 2008]

How to compare data points? • Calculate a (dis)similarity measure • Commonly used similarity measures are the Pearson and Spearman correlation coefficients • A commonly used dissimilarity measure is the Euclidean distance
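The two similarity measures named on this slide can be sketched in plain Python. This is a minimal illustration, not the implementation of any particular tool; the profiles `gene_p` and `gene_q` are made-up values, and the helper names are assumptions. Spearman is computed here as the Pearson correlation of the ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined for a constant profile

def ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up profiles: gene_q rises with gene_p, but not linearly.
gene_p = [1.0, 2.0, 3.0, 4.0]
gene_q = [1.0, 4.0, 9.0, 16.0]
print(pearson(gene_p, gene_q), spearman(gene_p, gene_q))
```

Because Spearman only looks at ranks, it scores any monotone relationship as perfectly correlated, while Pearson rewards a linear one.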

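Calculating Euclidean distance • $d_{pq}$ is the distance between gene p and gene q • n is the number of samples • E.g. $p_1$ is the measured expression of gene p in sample 1 • $d_{pq} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$ • In two dimensions, for points a = $(x_a, y_a)$ and b = $(x_b, y_b)$: $d_{ab} = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2}$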
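The Euclidean distance between two genes, taken over all n samples, can be sketched as follows. The function name and the example points are illustrative assumptions; with two samples the calculation reduces to the two-dimensional case on the slide.

```python
import math

def euclidean(p, q):
    """Euclidean distance between expression profiles p and q (same length n)."""
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Two-dimensional case: a = (x_a, y_a), b = (x_b, y_b)
a = (1.0, 2.0)
b = (4.0, 6.0)
d_ab = euclidean(a, b)  # sqrt(3^2 + 4^2)
print(d_ab)
```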

Calculating Euclidean distance • Watch out for measures on different scales • E.g.: weight in kg, length in cm, age in years • Important to standardize scales first: • This can be done by: • subtracting the mean • dividing by the estimate of the standard deviation (s)
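The standardization described here (subtract the mean, divide by the estimated standard deviation s) can be sketched as below. The measurements are made-up numbers chosen only to show that variables on different scales become comparable.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # estimate of the standard deviation (n - 1)
    return [(v - mean) / s for v in values]

# Made-up measurements on different scales
weight_kg = [60, 70, 80]
length_cm = [160, 170, 180]
print(standardize(weight_kg))
print(standardize(length_cm))
```

After standardization both variables have mean 0 and standard deviation 1, so neither dominates the Euclidean distance simply because of its unit.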

Cluster algorithms • There are different cluster algorithms • Make use of (dis)similarity measure • Hierarchical • K-means • SOM • … and many more variations

Hierarchical clustering • May be agglomerative … • building up the branches of a tree, beginning with the two most closely related objects • … or divisive • building the tree by finding the most dissimilar objects first • In each case, we end up with a clustering tree or dendrogram having branches and nodes.

[Figure: agglomerative clustering of objects a, b, c, d, e over steps 0 to 4, forming the groups {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}. Tree is constructed! Adapted from Kaufman and Rousseeuw (1990)]

[Figure: divisive clustering over steps 4 to 0, splitting {a,b,c,d,e} into {a,b} and {c,d,e}, then {d,e}, down to the single objects a, b, c, d, e. Tree is constructed! Adapted from Kaufman and Rousseeuw (1990)]

Agglomerative and divisive clustering sometimes give conflicting results, as shown here (Fig. 7.10, page 207)

Hierarchical clustering (continued) • How to compute the most closely related items? • For step 1, the two items with the highest similarity (or smallest distance) are simply grouped first. • For the following steps we need to compute the (dis)similarity between groups of items • Several methods are available to compute this • In single-linkage clustering, the (dis)similarity between two groups of items is that of their most similar (least dissimilar) pair.
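The agglomerative procedure with single linkage can be sketched as follows. This is a deliberately simple illustration rather than an efficient implementation; the one-dimensional positions for the five items are made up, and all function names are assumptions.

```python
def single_linkage(dist, group_a, group_b):
    """Distance between two groups: that of their closest pair of items."""
    return min(dist[i][j] for i in group_a for j in group_b)

def agglomerate(dist, linkage=single_linkage):
    """Repeatedly merge the two closest groups; return the merge history."""
    groups = [(i,) for i in range(len(dist))]
    merges = []
    while len(groups) > 1:
        # Find the pair of groups with the smallest between-group distance.
        d, x, y = min(
            (linkage(dist, groups[x], groups[y]), x, y)
            for x in range(len(groups))
            for y in range(x + 1, len(groups))
        )
        merges.append((groups[x], groups[y], d))
        merged = groups[x] + groups[y]
        groups = [g for k, g in enumerate(groups) if k not in (x, y)] + [merged]
    return merges

# Made-up positions for five items (indices 0..4) on a line
positions = [0, 1, 5, 6, 9]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = agglomerate(dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The list of merges, read from first to last, is the information a dendrogram displays: which groups were joined and at what distance.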

Other hierarchical clustering methods • There are many other criteria for defining clusters: • Single linkage • Complete linkage • Average linkage • Median linkage • Centroid linkage • Ward’s method (minimises variance) • [Figure: single linkage, complete linkage and centroid linkage]
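Three of the linkage criteria listed here differ only in how they summarise the pairwise distances between two groups. A minimal sketch, using made-up one-dimensional groups and assumed function names:

```python
def pair_distances(group_a, group_b):
    """All pairwise distances between items of two groups (1-D values here)."""
    return [abs(a - b) for a in group_a for b in group_b]

def single(group_a, group_b):
    return min(pair_distances(group_a, group_b))    # closest pair

def complete(group_a, group_b):
    return max(pair_distances(group_a, group_b))    # farthest pair

def average(group_a, group_b):
    d = pair_distances(group_a, group_b)
    return sum(d) / len(d)                          # mean over all pairs

group_a = [0, 1]
group_b = [4, 6]
print(single(group_a, group_b), complete(group_a, group_b), average(group_a, group_b))
```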

Other hierarchical clustering methods (continued) • These do not always give equivalent clustering patterns • Average linkage and Ward’s method seem most stable • Single-linkage clustering may be susceptible to ‘chaining’ of closely related items, obscuring reasonable cluster structure • [Figure: chaining in single linkage]

K-means clustering • Besides hierarchical methods many others are available • K-means clustering / Fuzzy K-means clustering • Choose number of clusters K • Take a random center for each cluster • Assign every item to the closest cluster • Recompute cluster centers (to be the average of all items the cluster contains) • Re-assign items to the cluster whose center is now closest • …and so on…until no change occurs any more • Advantage: easy method • Disadvantages: the choice of K is arbitrary • You still do not learn much about the relationships between the clusters (no dendrogram)
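The K-means steps above can be sketched as follows. This is an illustrative version only: the data points are made up, the random starting centers are drawn from the items themselves, and the `seed` and `max_iter` parameters are assumptions added to keep the run reproducible and bounded.

```python
import math
import random

def kmeans(items, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(items, k)            # a random center for each cluster
    assignment = None
    for _ in range(max_iter):
        # Assign every item to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(item, centers[c]))
            for item in items
        ]
        if new_assignment == assignment:      # no change any more: stop
            break
        assignment = new_assignment
        # Recompute each center as the average of the items it contains.
        for c in range(k):
            members = [it for it, a in zip(items, assignment) if a == c]
            if members:                       # an empty cluster keeps its center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Made-up data: two well-separated groups of points
items = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(items, k=2)
print(centers, assignment)
```

On less clearly separated data the result depends on the starting centers, which is why K-means is commonly run several times with different seeds.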

Kohonen Self Organising Maps (SOM) • Start with a ‘grid’ of MxN cluster centers • Train this grid to fit the data at hand • Clusters are linked to each other: if one cluster moves, its neighbours also move a bit • Advantage: indicates relations between the clusters • Disadvantages: the choice of M and N is arbitrary • Results are not easy to interpret • Image from J. Pennings, RIVM, NL
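A very small SOM can be sketched as below to show how neighbouring cluster centers move together. The slide does not specify the training details, so the linearly decaying learning rate, the Gaussian neighbourhood function, and all parameter names and defaults are assumptions chosen for illustration; the data points are made up.

```python
import math
import random

def train_som(data, m, n, epochs=50, lr=0.5, radius=1.0, seed=0):
    """Train an m x n grid of cluster centers on a list of equal-length tuples."""
    rng = random.Random(seed)
    # One center per grid node, started at a randomly chosen data point.
    grid = {(i, j): list(rng.choice(data)) for i in range(m) for j in range(n)}
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)      # assumed: learning rate decays
        for x in data:
            # Best matching unit: the node whose center is closest to x.
            bmu = min(grid, key=lambda node: math.dist(grid[node], x))
            for node, w in grid.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Assumed Gaussian neighbourhood: nearby nodes move more.
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += rate * h * (x[d] - w[d])
    return grid

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
som = train_som(data, m=2, n=2)
for node, center in sorted(som.items()):
    print(node, center)
```

Each item can then be assigned to its best matching node, and nodes that are adjacent on the grid tend to hold similar profiles.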

Clustering can lead to biological interpretation • Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) • Fig. 7.13, page 209

Free clustering software • TM4 MeV http://www.tm4.org/mev.html • R http://www.r-project.org/ • Eisen http://rana.lbl.gov/EisenSoftware.htm • NCBI GEO also supports basic clustering:

Pathway analysis • You found that 1300 genes in your microarray experiment were significantly up-regulated after treatment with X • You found a cluster containing 60 genes increasing over time • But then what? You really want to know what this means biologically… • Is a certain biological pathway activated in my experiment? • Which pathways are more activated than average? • How do the genes in the cluster interact with each other? • Pathway analysis can help here…

Why pathway analysis • Provides biologically relevant context • More intuitive than purely mathematical clustering methods • More efficient than looking up biology gene-by-gene • Improves statistical power • Network analysis

Biological pathways Interactions • Biochemical reactions • Transport • Inhibition • Activation ENSG00000141510, Ensembl

Pathway analysis • Based on linking microarray data to genes on a pathway Annotation: ENSG00000131828 Identifier mapping

Identifier mapping • What if the annotation on the pathways differs from that used for the microarray? • Microarrays typically use internal ids: • Affymetrix: 205749_at • Agilent: A_14_P106416 • Illumina: ILMN_4380 • Pathways typically use gene/protein ids: • Entrez Gene: 1543 • Ensembl: ENSG00000140465 • UniProt: P04637

Identifier mapping • 2 scenarios: • The software will take care of it • E.g. PathVisio uses synonym databases • You will have to convert the ids yourself: • DAVID http://david.abcc.ncifcrf.gov/ • SOURCE http://smd.stanford.edu/cgi-bin/source/sourceBatchSearch • BioMART http://www.biomart.org/ • NetAffx http://www.affymetrix.com
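If you convert the ids yourself, the core step is a table lookup from array ids to the ids used on the pathway. The sketch below uses a hypothetical, hand-written mapping table with placeholder ids (`probe_001`, `gene_A`, …); in practice the table would be exported from one of the resources listed above.

```python
# Hypothetical mapping table with placeholder ids (not real identifiers).
probe_to_gene = {
    "probe_001": "gene_A",
    "probe_002": "gene_A",   # several probes can map to the same gene
    "probe_003": "gene_B",
}

def map_measurements(measurements, mapping):
    """Group probe-level values under the gene id used on the pathway."""
    by_gene, unmapped = {}, []
    for probe, value in measurements.items():
        gene = mapping.get(probe)
        if gene is None:
            unmapped.append(probe)            # keep track of what was lost
        else:
            by_gene.setdefault(gene, []).append(value)
    return by_gene, unmapped

# Made-up log ratios per probe
measurements = {"probe_001": 1.2, "probe_002": 0.9, "probe_003": -0.4, "probe_999": 2.0}
by_gene, unmapped = map_measurements(measurements, probe_to_gene)
print(by_gene, unmapped)
```

Reporting the unmapped probes matters: ids that silently drop out during mapping reduce the number of genes that can be shown on a pathway.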

Where to get pathways? • Online pathway databases • KEGG http://www.genome.jp/kegg • Reactome http://www.reactome.org • WikiPathways http://www.wikipathways.org • Many more… http://pathguide.org • Gene Ontology http://www.geneontology.org • Not really pathways but usable in pathway statistics

Draw your own pathways! • On your computer with PathVisio • Share with WikiPathways

Pathway analysis tools • Based on pathway diagrams • BioRAG • MetaCore (GeneGO) • Pathway-Express • GenMAPP / MAPPFinder • PathVisio • Based on Gene Ontology • Onto-Express • GOToolbox • MAPPFinder • Gostat • GeneMerge • GOSurfer • EASE • Fatigo

Pathway analysis tools • From lists of differentially expressed genes to biological interpretation • Main functions: • Find pathways that are overrepresented in regulated genes • Visualize expression changes on pathways

Analysis on Gene Ontology The Gene Ontology (GO) project gives a consistent description of gene products from different databases. GO consortium: http://www.geneontology.org

Analysis on Gene Ontology Find terms that contain highest ratio of significantly changed genes…
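Ranking terms by their ratio of significantly changed genes can be sketched as follows. The term names and gene ids are hypothetical placeholders, and the function name is an assumption; real annotations would come from the GO consortium files.

```python
def changed_ratio(term_genes, changed):
    """For each term, the fraction of its genes that are significantly changed."""
    return {
        term: len(genes & changed) / len(genes)
        for term, genes in term_genes.items()
        if genes                                # skip terms without genes
    }

# Hypothetical annotation: term -> set of gene ids
term_genes = {
    "term_1": {"g1", "g2", "g3", "g4"},
    "term_2": {"g3", "g5"},
    "term_3": {"g6", "g7", "g8"},
}
changed = {"g1", "g3", "g5"}                    # made-up list of changed genes

ratios = changed_ratio(term_genes, changed)
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ranking, ratios)
```

A high ratio on a very small term can arise by chance, which is why the ratio is usually accompanied by a statistic that takes term size into account.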

Analysis on pathway diagrams

PathVisio www.pathvisio.org • Visualize gene expression on biological pathways • Identify significantly changed processes

Pathway Content • Contributed by research community • Contributed by large-scale curation efforts • Converted between species • Distributed on WikiPathways

Identifiers in PathVisio Pathways Experimental Data • Affymetrix, Illumina, Agilent, CodeLink • Entrez Gene • RefSeq (protein only) • Unigene • UniProt • Ensembl • PDB • Species-specific MOD IDs • Entrez Gene • Unigene • UniProt • Ensembl

Synonym Database Pathways Experimental Data Gene Database • Genes and annotation • Relational information • Assembled from Ensembl

PathVisio results: Data mapped on Pathway

PathVisio Results: Z-score

Z-score • [Figure legend: unchanged gene, changed gene] • Question: Does the small circle have a higher percentage of changed genes than the large circle? Is this difference significant?

Z-score • The Z-score can be used as a measure of how much a subset of genes differs from the rest • r = changed genes in pathway • n = total genes in pathway • R = changed genes (in the whole dataset) • N = total genes (in the whole dataset)
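With the four counts defined on this slide, a commonly used form of the Z-score (the one described for MAPPFinder, assumed here to match the slide) is $z = \frac{r - n\frac{R}{N}}{\sqrt{n\,\frac{R}{N}\left(1-\frac{R}{N}\right)\left(1-\frac{n-1}{N-1}\right)}}$: the observed number of changed genes in the pathway minus the expected number, divided by the standard deviation of that number. A minimal sketch with made-up counts:

```python
import math

def z_score(r, n, R, N):
    """Z-score for r changed genes among n pathway genes,
    given R changed genes among N genes in total."""
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# Made-up counts: 10 of 20 pathway genes changed, 100 of 1000 genes overall
z = z_score(r=10, n=20, R=100, N=1000)
print(z)
```

A Z-score near 0 means the pathway contains about as many changed genes as expected by chance; large positive values point to pathways enriched in changed genes.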

Afternoon practical session You are going to do: • Clustering using the GEO website • Pathway analysis using PathVisio

Internships and “Jaarwerkstuk” at the dept. of Bioinformatics • Subjects: • Designing new pathways and analyzing microarray data • Developing an online repository for data generated by PathVisio/GenMAPP • miRNA • CpG islands • SNPs • Microarray analysis • Interested? Email: chris.evelo@bigcat.unimaas.nl • For more information about the department of Bioinformatics, see the course information on eleum
