Visual Exploratory Data Analysis: HCE

In this lecture you will learn
• Analysis and communication of multi-dimensional data sets
• Gene expression data (multi-dimensional) from micro-array experiments
• HCE (Hierarchical Clustering Explorer)
• GRID principles
• Rank-by-feature framework
Dept. of Computing Science, University of Aberdeen

Introduction
• Multi-dimensional data sets are studied in many domains
  • Micro-array data sets from genomics
  • Aggregate data sets from censuses
• Several techniques have been proposed for their analysis
  • Principal Component Analysis (PCA)
  • Online Analytical Processing (OLAP)
  • Data mining algorithms, including clustering
• With new data sets, exploratory data analysis is recommended before applying the above techniques

HCE
• A visual knowledge discovery tool for analysing and understanding multi-dimensional (> 3D) data
• Offers multiple views of
  • input data and clustered input data
  • where the views are coordinated
• Like all modern information visualization tools, HCE
  • is highly interactive, allowing the user
    • to control the visual displays and
    • to query data visually
  • handles very large data sets (such as data from genomics)
• Many similar tools offer only a patchwork of statistics and graphics
• HCE follows two fundamental statistical principles of exploratory data analysis
  • Examine each dimension first, then find relationships among dimensions
  • Try graphical displays first, then find numerical summaries

GRID Principles
• GRID: graphics, ranking and interaction for discovery
• Two principles
  • Study 1D, study 2D, then find features
  • Ranking guides insight, statistics confirm
• These principles help users organize their knowledge discovery process
• Because of GRID, HCE is more than R + visualization
• GRID can be used to derive scripts that organize exploratory data analysis in R (or a similar statistics package)

Rank-by-Feature Framework
• A user interface framework based on the GRID principles
• The framework
  • combines interactive information visualization techniques with statistical methods and data mining algorithms
  • enables users to examine input data in an orderly way
• HCE implements the rank-by-feature framework
• This means that HCE
  • uses existing statistical and data mining methods to analyse input data and
  • communicates the results using interactive information visualization techniques

Multiple Views in HCE
• Dendrogram
• Colour mosaic
• 1D histograms
• 2D scatterplots
• And more

Micro-array Experiments
• Functional genomics is a field of molecular biology and genetics that connects genome sequence data to genome function
• A DNA micro-array is a glass or nylon substrate with specific DNA gene samples spotted in an array
  • Also known as a gene array or gene chip
• Micro-array data is used in genomics to understand the function of genes
• A flash movie on DNA micro-array methodology is available at http://www.bio.davidson.edu/courses/genomics/chip/chip.html

DNA Micro-array Data
• Gene samples from experiments are ‘hybridized’ with the genes on the micro-array
• The experimental gene sample binds with varying strength to different genes on the array
• The strength of binding is measured as gene expression data
• Gene expression data sets from several experiments are tabulated to form a multi-dimensional data set

Micro-array Data (2)
• Micro-array data has several thousand rows and columns
  • Rows (i) correspond to genes
  • Columns (j) correspond to samples from different experiments
• An element a(i,j) holds the gene expression (strength) value of the jth sample on the ith gene of the array
[Figure: data matrix with genes as rows (1..n), samples as columns (1..m) and a(i,j) as one cell]
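
The row and column convention above can be illustrated with a small sketch. The gene names, sample names and expression values below are made up for illustration, and the helper functions are hypothetical, not part of HCE.

```python
# Hypothetical toy data: n = 4 genes (rows) by m = 3 samples (columns).
genes = ["geneA", "geneB", "geneC", "geneD"]
samples = ["sample1", "sample2", "sample3"]
a = [
    [2.1, 0.4, 1.8],
    [0.3, 0.2, 0.5],
    [1.9, 0.6, 2.0],
    [0.1, 1.7, 0.2],
]


def expression(i, j):
    """Return a(i, j) using the slide's 1-based indices (gene i, sample j)."""
    return a[i - 1][j - 1]


def gene_profile(i):
    """Return row i: the expression of one gene across all samples."""
    return list(a[i - 1])


def sample_column(j):
    """Return column j: the expression of all genes in one sample."""
    return [row[j - 1] for row in a]
```

Each row is one gene's profile across experiments and each column is one experiment's reading across genes, so the same table can be treated as n points in m dimensions or as m points in n dimensions.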

Hierarchical Clustering
• Researchers use clustering to discover interesting patterns in gene expression data
• Clustering is the process of grouping data items with similar properties
• There are many clustering algorithms, each with different behaviour
  • It is hard to know which algorithm's results agree best with the natural clusters in the input data
• Hierarchical clustering produces a hierarchy of clusters rather than a flat set of clusters

Hierarchical Agglomerative Clustering (HAC)
• A bottom-up clustering algorithm, very similar to the bottom-up segmentation you studied
  1. Initially, each data item is a cluster by itself
  2. Merge the pair of clusters with maximum similarity (based on a pre-selected similarity metric)
  3. Compute the similarity values between the new cluster and the others
  4. Repeat steps 2 and 3 until all the items are grouped into one cluster
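
The four steps can be sketched in a few lines of Python. This is a minimal illustration, not HCE's implementation: it assumes Euclidean distance as the (dis)similarity measure, so "maximum similarity" becomes "minimum distance", and it assumes average linkage between clusters. The function names and the sample points are hypothetical.

```python
from itertools import combinations


def euclidean(p, q):
    """Distance between two items; smaller means more similar (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5


def average_linkage(c1, c2, items):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    total = sum(euclidean(items[i], items[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(items):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    # Step 1: every item starts as its own cluster (a tuple of item indices).
    clusters = [(i,) for i in range(len(items))]
    merges = []
    while len(clusters) > 1:
        # Step 2: pick the most similar pair, i.e. the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], items),
        )
        merges.append((a, b, average_linkage(a, b, items)))
        # Step 3: replace the pair by the merged cluster; its distances to the
        # other clusters are recomputed on the next pass of the loop.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        # Step 4: the loop repeats until a single cluster remains.
    return merges


# Hypothetical 2D items forming two obvious groups.
items = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
history = hac(items)
```

The merge history is exactly the information a dendrogram draws: each entry is one internal node, and its distance gives the height at which the two branches join. Recomputing every linkage on each pass keeps the sketch short but is slow, so it only suits small examples.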

Dendrogram Display
• Results of HAC are shown visually using a dendrogram
• A dendrogram is a tree
  • with data items at the terminal (leaf) nodes
  • in which the distance from the root node represents the similarity among leaf nodes
• Two visual controls
  • The minimum similarity bar allows users to adjust the number of clusters
  • The detail cut-off bar allows users to reduce clutter
[Figure: dendrogram with leaf nodes D, C, A, B]
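
The effect of the minimum similarity bar can be sketched as a cut through the merge history: only merges at or above the chosen similarity are kept, and whatever remains connected forms a cluster. The merge list and similarity values below are invented, and `cut_dendrogram` is a hypothetical helper, not an HCE function.

```python
def cut_dendrogram(n_items, merges, min_similarity):
    """Apply only the merges at or above min_similarity and return the clusters."""
    parent = list(range(n_items))  # union-find forest, one tree per cluster

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, similarity in merges:
        if similarity >= min_similarity:
            parent[find(right)] = find(left)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Hypothetical merge history for four items (indices 0..3). Each entry joins the
# clusters containing the two listed items at the given similarity.
merges = [(0, 1, 0.9), (2, 3, 0.7), (0, 2, 0.2)]
```

Lowering the bar from 1.0 to 0.1 with this history takes the result from four singleton clusters to three, then two, then one, which is how dragging the bar adjusts the number of clusters.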

Colour Mosaic
• Input data is shown using this view
• A colour-coded visual display of tabular data
• Each cell in the table is painted in a colour that reflects the cell's value
• Two variations
  • The layout of the mosaic matches the original table
  • The layout is a transpose of the original table
• HCE uses the transposed layout because data sets usually have more rows than columns
• A colour mapping control is available
[Figure: a table shown in the original layout and in the transposed layout]
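
A colour mosaic boils down to two operations: mapping each cell value to a colour, and optionally transposing the table. The sketch below assumes a linear green-black-red scale, which is common for gene expression displays but is not confirmed here as HCE's default. The function names are hypothetical.

```python
def value_to_rgb(value, low, high):
    """Map a value in [low, high] to a green-black-red colour (assumed scale)."""
    if high == low:
        return (0, 0, 0)
    t = (value - low) / (high - low)  # position of the value on a 0..1 scale
    t = min(1.0, max(0.0, t))         # clamp values outside the range
    if t < 0.5:
        return (0, round(255 * (1 - 2 * t)), 0)  # low values: green fading to black
    return (round(255 * (2 * t - 1)), 0, 0)      # high values: black rising to red


def mosaic(table, transpose=True):
    """Return a grid of RGB tuples for the table, transposed by default."""
    low = min(min(row) for row in table)
    high = max(max(row) for row in table)
    cells = [list(col) for col in zip(*table)] if transpose else table
    return [[value_to_rgb(v, low, high) for v in row] for row in cells]
```

With the transposed layout, a table of many genes and few samples becomes a wide, short image, which fits a screen better than a very tall one.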

1D Histogram Ordering
• This data view is part of the rank-by-feature framework
• Data belonging to one column (variable) is displayed as a histogram plus a box plot
  • The histogram shows the scale and skewness
  • The box plot shows the data distribution, centre and spread
• For the entire data set many such views are possible
• By studying individual variables in detail, users can select the variables for other visualizations
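
The numbers behind a histogram and box plot, and the idea of ordering columns by a 1D feature, can be sketched with the standard library. Skewness is used as the ranking criterion purely as an assumption for this example; all function names are hypothetical.

```python
import statistics


def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bin
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # put the maximum in the last bin
        counts[index] += 1
    return counts


def box_plot_summary(values):
    """Five-number summary that a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def skewness(values):
    """Population skewness: positive when the right tail is longer."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def rank_columns(columns, score=skewness):
    """Order column names by the absolute value of a 1D score, highest first."""
    return sorted(columns, key=lambda name: abs(score(columns[name])), reverse=True)
```

Passing a different `score` function (for example, one that counts outliers) reorders the same columns by another feature, which is the core of the ordering idea.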

2D Scatter Plot Ordering
• This data view is also part of the rank-by-feature framework
• Three categories of 2D presentations are possible
  1. Axes of the plot obtained from Principal Component Analysis
    • Linear or non-linear combinations of the original variables
  2. Axes of the plot obtained directly from the original variables
  3. Parallel coordinates
• HCE uses the second option, plotting pairs of the original variables
• Both 1D and 2D plots can be sorted according to user-selected criteria, such as the number of outliers
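
Ordering scatter plots amounts to scoring every pair of original variables and sorting the pairs by that score. The sketch below assumes the absolute Pearson correlation as the criterion; other criteria, such as the number of outliers mentioned above, would plug in the same way. The column names and values are invented.

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (var_x * var_y) ** 0.5


def rank_pairs(columns, score=pearson):
    """Return (|score|, name_a, name_b) for every column pair, highest first."""
    scored = [
        (abs(score(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical samples: s1 and s2 are almost linearly related, s3 is not.
columns = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.5],
    "s3": [4.0, 1.0, 3.0, 2.0],
}
ranking = rank_pairs(columns)
```

The top of the ranking tells the user which of the m(m-1)/2 possible scatter plots to inspect first, instead of paging through all of them.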

Conclusion
• HCE is a very good example of data interpretation and communication technology
  • It performs data analysis using statistical methods and clustering
  • It communicates the results of data analysis visually
• HCE has many other features that have not been described here
• GRID and the rank-by-feature framework are useful ideas that can also be applied when working with other data analysis tools such as SPSS