
E-mail: * svinga@itqb.unl.pt ‡ almeidaj@musc





Presentation Transcript


  1. Sequence Comparison by Universal Sequence Mapping (USM): alignment-free and scale-independent analysis of texts

Vinga, S. *1; Almeida, J.S. ‡1,2
1. ITQB/UNL - R. Quinta Grande 6, 2780-156 Oeiras, Portugal.
2. MUSC - Dept Biometry & Epidemiology, Medical Univ. South Carolina, 135 Cannon Street, Suite 303, P.O. Box 250835, Charleston, SC 29425, USA
E-mail: *svinga@itqb.unl.pt ‡almeidaj@musc.edu

Abstract
Universal Sequence Mapping (USM) is a technique for representing discrete sequences in a continuous iterative map. It generalizes the Chaos Game Representation (CGR), originally developed for genomic sequences, to higher-order alphabets, thus enabling the study of proteins and general texts. The present report documents the use of USM, which conserves all CGR properties (reference below): 1) bijection of the transformation from sequence to continuous map; 2) generalization of Markov Chain probability tables; 3) the distance between coordinates in the map estimates the similarity between the sequences being compared. The distribution of the overestimation was also obtained by bootstrap methods (applying the method to random sequences). This new USM-based metric does not depend on a specific sequence alignment, which allows a truly alignment-free and resolution-free sequence comparison method that does not rely on heuristic reasoning. USM is a statistical mechanics approach to discrete sequence analysis, particularly useful for comparing weakly correlated sequences where alignment-based methods are ineffective.

Availability
Algorithms are freely available in a web-based tool at: http://bioinformatics.musc.edu/~jonas/usm/

USM Algorithm
• Mapping of a sequence X into a continuous space.
• Extension of the Chaos Game Representation (CGR) procedure to higher-order alphabets using an n-dimensional hypercube; all CGR properties are maintained.
• Dimension needed for an m-symbol alphabet: log2(m), rounded up to an integer.
• Input: a sequence X of length k from an m-symbol alphabet A.
• Each symbol s is represented by a unique binary number u corresponding to a corner of the n-hypercube.
• USM construction is based on an iterative function, where USM_i is the point in the USM map after the ith iteration and u_i are the coordinates of the vertex of the ith symbol in the sequence: start in a random position and always go half the distance toward the corner representing the next symbol in the sequence, that is, USM_i = USM_(i-1) + (u_i - USM_(i-1))/2.
• The forward USM (forward resolution f) iterates from the start of the sequence; the backward USM (backward resolution b) iterates from its end.

Example: Map
Figure: USM representation of a random sequence of length 100 from an alphabet of 8 different symbols. Each corner of the cube (0,1)^3 corresponds to a symbol.

Example: Coordinates
Representation of a sequence: FASTA format and USM coordinates.
>> The Uncertainty of the Poet, by Wendy Cope
>1st stanza
I am a poet.I am very fond of bananas.
>2nd stanza
I am bananas.I am very fond of a poet.
Forward USM, backward USM: [0.02 0.01 0.63 0.53 0.41], [0.07 0.30 0.52 0.67 0.37]

Distance estimation
Estimates the maximum length of the shared segment: dUSM = dfUSM + dbUSM.
Figure: cross-tabulation of the similarity estimates between positions in the two stanzas above.

USM Properties
• Bijection of the transformation from sequence to continuous map.
• Generalization of Markov Chain probability tables.
• The distance between coordinates in the map estimates the similarity between the sequences being compared. The overestimation (bias) distribution was also obtained by bootstrap methods (applying the method to random sequences).
Figure: equivalence of the USM representation and Markov Chain models. Sequences ending in a specific suffix fall in the square labelled with that suffix, which allows the calculation of transition probability matrices.

Published report
Almeida JS, Vinga S: Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics 2002, 3:6 (5 February 2002)

Conclusions and Generalizations
• USM is a scale-independent representation of sequences, freeing sequence analysis from the need to assume a memory length in the investigation of syntactic rules.
• Algorithmic optimization and parallelization lead to highly efficient performance.

Web-based tool
On-line submission of sequences.

Acknowledgments
The authors thankfully acknowledge financial support by grant SFRH/BD/3134/2000, awarded to S. Vinga, and project SAPIENS POCTI/1999/BSE/34794 of Fundação para a Ciência e Tecnologia of the Portuguese Ministry of Science and Technology (FCT/MCT).
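The USM construction and distance estimation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`corner_codes`, `usm`, `similar_length`, `cross_tabulate`) are hypothetical, the walk starts at the centre of the hypercube rather than at a random position so that results are reproducible, and the per-direction estimate is assumed to be -log2 of the largest coordinate difference (each shared symbol halves the gap between two points). The text above only states that the two directional estimates are summed.

```python
def corner_codes(symbols):
    """Map each distinct symbol to a binary corner of the n-cube, n = ceil(log2(m))."""
    alphabet = sorted(set(symbols))
    n = max(1, (len(alphabet) - 1).bit_length())  # ceil(log2(m)), at least 1
    return {s: tuple((i >> b) & 1 for b in range(n)) for i, s in enumerate(alphabet)}

def usm(seq, codes, backward=False, start=0.5):
    """Return one USM point per position; each step moves half-way to the symbol's corner."""
    n = len(next(iter(codes.values())))
    point = [start] * n  # assumption: fixed start (the source uses a random one)
    coords = [None] * len(seq)
    order = range(len(seq) - 1, -1, -1) if backward else range(len(seq))
    for i in order:
        point = [p + 0.5 * (u - p) for p, u in zip(point, codes[seq[i]])]
        coords[i] = tuple(point)
    return coords

def similar_length(p, q, cap=64.0):
    """Assumed estimate: -log2 of the largest coordinate gap, capped for identical points."""
    from math import log2
    gap = max(abs(a - b) for a, b in zip(p, q))
    return cap if gap == 0 else min(cap, -log2(gap))

def cross_tabulate(x, y):
    """Table of d = d_forward + d_backward for every pair of positions (i in x, j in y)."""
    codes = corner_codes(x + y)  # both sequences must share one symbol-to-corner map
    fx, fy = usm(x, codes), usm(y, codes)
    bx, by = usm(x, codes, backward=True), usm(y, codes, backward=True)
    return [[similar_length(fx[i], fy[j]) + similar_length(bx[i], by[j])
             for j in range(len(y))] for i in range(len(x))]

stanza1 = "I am a poet.I am very fond of bananas."
stanza2 = "I am bananas.I am very fond of a poet."
table = cross_tabulate(stanza1, stanza2)
i, j = stanza1.index("very"), stanza2.index("very")
print(round(table[i][j], 2))  # estimate for the 'v' of "very" in both stanzas
```

The two stanzas share the 19-character segment ".I am very fond of ", so the estimate printed for the aligned "v" positions should be at least that long. The estimate never falls below the true shared length, which is the overestimation that the bootstrap analysis characterizes.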
