Explore efficient algorithms for approximate database searching of polypeptide structures, enabling fast and accurate retrieval of similar protein structures. Learn about suffix trees, polypeptide angles, and fault-tolerant searching.
Fast Approximate Database Searching of Polypeptide Structures • Hanjo Taeubig, Arno Buchner, Jan Griebsch • Efficient Algorithms Group, Prof. Ernst W. Mayr, Technical University of Munich • German Conference on Bioinformatics, October 4th, 2004
Structure • motivation & problem definition • suffix trees • polypeptide angles suffix trees • application & future work
I. Motivation • the function of a protein is largely determined by its structure and geometric shape • How to find similar structures in a database? • related work • DALI, VAST, CE • TopScan, ProtDex2 • existing methods are mostly based on the principle "filter heuristics + exhaustive search/pairwise comparison" and scale at least linearly with the database size
I. Motivation • PDB – Protein Data Bank • ca. 3.5 GB compressed, 14 GB decompressed • > 23,000 entries • 90% proteins, 5% nucleotide sequences, 4% nucleotide-protein complexes • 85% X-ray crystallography, 15% NMR • protein structure databases grow almost exponentially • search methods with time complexity of at most O(n) are required
I. Problem Definition • search for a given polypeptide structure in a protein database • search for the longest common substructure in the database • identify frequent substructures (motifs) in the database
II. Suffix Trees Tries • tree with a root node • every edge is labeled with a letter • labels of all edges to the child nodes of one node are pairwise distinct
II. Suffix Trees Suffix Tries • a suffix trie stores all suffixes of a string • the sentinel $ ensures that every suffix is represented by a leaf • figure: suffix tree for the word aaabbb$
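The suffix trie for the word aaabbb$ can be sketched in a few lines. This is only a naive illustration with quadratic construction cost, not the construction used in the talk; the function names `build_suffix_trie`, `count_leaves` and `contains` are made up for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # One edge per letter; edges leaving a node have distinct labels.
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """Count nodes without children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aaabbb$")
print(count_leaves(trie))                           # 7: one leaf per suffix
print(count_leaves(build_suffix_trie("aaabbb")))    # 4: without $, "b" and "bb" end inside the tree
print(contains(trie, "abb"), contains(trie, "ba"))  # True False
```

Without the sentinel, the suffixes b and bb are prefixes of bbb and therefore end at inner nodes, which is exactly what the $ prevents.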
II. Suffix Trees Compressed Suffix Tries • collapse linear (non-branching) paths in the tree • store only the start and end index of each edge label • linear number of inner nodes
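A minimal sketch of the path compression, assuming the trie is first built naively and then collapsed (a linear-time construction would build the compressed tree directly). The names `build_suffix_trie`, `compress` and `count_inner` are hypothetical helpers for this illustration.

```python
def build_suffix_trie(text):
    """Naive suffix trie; every node remembers one suffix start passing through it."""
    root = {"children": {}, "start": 0}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "start": i})
    return root


def compress(node, depth=0):
    """Collapse non-branching paths.

    Each edge becomes (start, end, subtree); its label is text[start:end],
    so only two indices are stored per edge instead of the letters.
    """
    edges = {}
    for ch, child in node["children"].items():
        d = depth + 1
        while len(child["children"]) == 1:   # follow the linear path
            child = next(iter(child["children"].values()))
            d += 1
        s = child["start"]                   # a suffix that runs along this edge
        edges[ch] = (s + depth, s + d, compress(child, d))
    return edges


def count_inner(tree):
    """Count branching nodes (the root included); leaves are empty dicts."""
    if not tree:
        return 0
    return 1 + sum(count_inner(sub) for _, _, sub in tree.values())


text = "aaabbb$"
tree = compress(build_suffix_trie(text))
print(count_inner(tree))                     # 5 inner nodes for 7 suffixes
start, end, _ = tree["a"][2]["b"]
print(text[start:end])                       # bbb$
```

Since every inner node of the compressed tree branches and there is one leaf per suffix, the number of inner nodes cannot exceed the number of suffixes.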
II. Suffix Trees Further Extensions • generalized suffix trees • store the suffixes of multiple strings in one tree • online linear-time construction Time Complexity • finding an occurrence of the search pattern does not depend on the size of the searched database, but only linearly on the length m of the pattern • finding all k occurrences of a pattern takes time proportional to m + k
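The m + k behaviour can be illustrated with a simplified generalized suffix trie. This sketch is an assumption-laden stand-in: it stores an occurrence list at every node (quadratic memory), whereas a real generalized suffix tree is built in linear time and reports occurrences by visiting the leaves below the matched node. The names `build_generalized_suffix_trie` and `find_all` are hypothetical.

```python
def build_generalized_suffix_trie(strings):
    """Insert all suffixes of all strings; each node lists (string id, start) pairs."""
    root = {"children": {}, "occ": []}
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, {"children": {}, "occ": []})
                node["occ"].append((sid, i))
    return root


def find_all(root, pattern):
    """Walk m edges for a pattern of length m, then report the k stored occurrences."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["occ"]


db = build_generalized_suffix_trie(["abba", "babb"])
print(find_all(db, "bb"))   # [(0, 1), (1, 2)]
print(find_all(db, "bab"))  # [(1, 0)]
print(find_all(db, "aa"))   # []
```

The walk in `find_all` touches one node per pattern letter regardless of how many strings were inserted, which is the property the slide refers to.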
III. Polypeptide Angles Suffix Tree Idea • encode the geometry of the database proteins in a translation and rotation invariant linear description (“structural text”) • torsion angle encoding of the protein backbone • adapt efficient text mining methods to the error tolerant substructure searching problem • generalized suffix trees with fault tolerant search strategies
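A torsion (dihedral) angle is defined by four consecutive points and does not change when the whole chain is translated or rotated, which is what makes it suitable for an invariant linear description. The sketch below uses the standard atan2 formula; which backbone atoms are fed in (for example those defining the φ and ψ angles) is an assumption, since the slides do not spell it out, and `dihedral` is a name chosen for this example.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in [-180, 180], around the bond p1-p2."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four points in one plane: cis gives 0 degrees, trans gives 180 degrees.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0)))        # 0.0
print(abs(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))))  # 180.0

# Shifting all points by the same vector leaves the angle unchanged.
shifted = [(x + 5, y - 2, z + 1) for x, y, z in
           [(1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1)]]
print(round(dihedral(*shifted)))                                   # -90
```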
III. Polypeptide Angles Suffix Tree • example: 1a1f • angles: … (22,93), (112, 4) … • discretization: … a b b a …
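The discretization step can be sketched as mapping each angle to the letter of the interval it falls into. The interval layout is an assumption: the slides only show the result, so this example starts the first interval at 0° and uses equal widths. With four intervals of 90° it reproduces the a b b a of the illustration; the function name `discretize` and its parameter `n_intervals` are hypothetical.

```python
import string


def discretize(angles, n_intervals=24):
    """Map angles in degrees to letters, one letter per equal-width interval."""
    letters = string.ascii_lowercase
    if not 1 <= n_intervals <= len(letters):
        raise ValueError("this sketch supports between 1 and 26 intervals")
    width = 360.0 / n_intervals
    # a % 360.0 folds negative angles into [0, 360]; the final modulo guards
    # against a rounding result of exactly 360.0.
    return "".join(letters[int((a % 360.0) // width) % n_intervals] for a in angles)


pairs = [(22, 93), (112, 4)]
flat = [angle for pair in pairs for angle in pair]

print(discretize(flat, n_intervals=4))   # abba
print(discretize(flat, n_intervals=24))  # bgha
print(discretize([-60], n_intervals=24)) # u
```

Finer intervals give a more selective alphabet but make exact matching more sensitive to small angle deviations, which motivates the fault-tolerant search.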
III. Polypeptide Angles Suffix Tree Fault Tolerant Searching • accept a “neighborhood range” of intervals left and right • worst case time complexity: exponential (!) • average: O( ) figure: branching with =1
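A minimal sketch of searching with a neighborhood range, assuming letters a, b, c, … stand for consecutive angle intervals that wrap around at 360°. At every pattern position the search follows all edges whose letter lies within `delta` intervals of the pattern letter, so up to 2·delta + 1 branches can open per position, which is where the exponential worst case comes from. The parameter name `delta`, the uncompressed trie and the helper names are assumptions made for this example.

```python
import string

LETTERS = string.ascii_lowercase


def build_suffix_trie(text):
    """Naive suffix trie; each node lists the start positions of its substring."""
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i)
    return root


def cyclic_distance(a, b, n_intervals):
    """Distance between two interval letters on a circle of n_intervals."""
    d = abs(LETTERS.index(a) - LETTERS.index(b))
    return min(d, n_intervals - d)


def tolerant_search(root, pattern, delta=1, n_intervals=24):
    """Return start positions of substrings whose letters are each within delta."""
    frontier = [root]
    for p in pattern:
        next_frontier = []
        for node in frontier:
            for ch, child in node["children"].items():
                if cyclic_distance(ch, p, n_intervals) <= delta:
                    next_frontier.append(child)   # branch into the neighbor
        frontier = next_frontier
    return sorted(s for node in frontier for s in node["starts"])


trie = build_suffix_trie("bghabhhx")
print(tolerant_search(trie, "bgh", delta=0))  # [0]
print(tolerant_search(trie, "bgh", delta=1))  # [0, 4]
print(tolerant_search(trie, "a", delta=1))    # [0, 3, 4, 7]
```

In the last call the letter x (interval 23) counts as a neighbor of a (interval 0) because the intervals wrap around.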
IV. Application Example • search for occurrences of the C2H2 zinc finger in the complete PDB • discretization: 24 intervals of 15° • compare with SCOP classification, sequence-based search, SPASM
IV. Application
                                                                  Score    E
Sequences producing significant alignments:                      (bits)  Value
gi|37926551|pdb|1LLM|C Chain C, Crystal Structure Of A Zif2...     47    6e-07
gi|15988358|pdb|1F2I|G Chain G, Cocrystal Structure Of Sele...     42    2e-05
gi|3319019|pdb|1A1H|A Chain A, Qgsr (Zif268 Variant) Zinc F...     42    3e-05
gi|3319013|pdb|1A1F|A Chain A, Dsnr (Zif268 Variant) Zinc F...     41    3e-05
gi|3319022|pdb|1A1I|A Chain A, Radr (Zif268 Variant) Zinc F...     41    3e-05
gi|16975178|pdb|1JK1|A Chain A, Zif268 D20a Mutant Bound To...     41    3e-05
gi|2098365|pdb|1AAY|A Chain A, Zif268 Zinc Finger-Dna Compl...     41    4e-05
gi|33357855|pdb|1P47|A Chain A, Crystal Structure Of Tandem...     41    5e-05
gi|443340|pdb|1ZAA|C Chain C, Zif268 Immediate Early Gene (...     40    8e-05
gi|15988466|pdb|1G2F|C Chain C, Structure Of A Cys2his2 Zin...     33    0.015
gi|15988460|pdb|1G2D|C Chain C, Structure Of A Cys2his2 Zin...     32    0.025
gi|1941952|pdb|1MEY|C Chain C, Crystal Structure Of A Desig...     28    0.44
gi|40889293|pdb|1P7A|A Chain A, Solution Stucture Of The Th...     27    0.64
gi|3318788|pdb|2ADR| Adr1 Dna-Binding Domain From Saccharo...      27    0.78
gi|2094895|pdb|1SP1| Nmr Structure Of A Zinc Finger Domain...      26    1.4
gi|1420993|pdb|1ARD| Yeast Transcription Factor Adr1 (Resi...      23    9.7
. . .
IV. Application • figure: minimum RMSD superposition, 1a1f vs. 1f2i • figure: 1a1f vs. 6 other true positives • figure: “false” positives, 1a1f vs. 1vl2
IV. Application Run Time • pre-processing • decompression of the packed PDB files: 25 min • parsing of the PDB files and calculating the torsion angles: 55 min • discretization and building the PAST: 2 min • searching • searching a structure: seconds
Summary • suffix-tree-based protein (sub-)structure database search method • preprocessing required • fast search • does not rely on heuristics or SSE recognition • adaptable sensitivity and error models • until gapped matching is modeled: applicable to shorter peptide chains and motifs • surprisingly simple
Future Work • model matching with insertions & deletions • consensus search pattern • implementation and practical testing of further error models and angle encodings • identification of new motifs • testing, testing, testing: evaluating the method further with real-life problems from pharmaceutical researchers, biologists, patent offices, …
Acknowledgements • Hanjo Taeubig, Arno Buchner • Volker Heun, Moritz Maass • BFAM/BMBF • ALTANA