Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre
Common Computational Analyses • Sequence Assembly • Simple sequence analysis • Translation and reverse Complement, ORF • Composition statistics (protein & DNA) • Molecular mass • Total charge and pI; local hydropathy • Simple determination of secondary structures • Restriction site analysis • Internal repeat analysis • Detection of active sites, functional residues, characteristic structures, substrates, and processing signals
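Several of the simple sequence analyses listed above (reverse complement, translation, composition statistics) fit in a few lines of code. The following is a minimal sketch, assuming plain uppercase DNA strings and the standard genetic code; the function names are illustrative and do not come from any particular package.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA in frame 0; '*' marks a stop codon, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def gc_content(dna):
    """Fraction of G and C bases, a basic DNA composition statistic."""
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0


print(reverse_complement("ATGC"))   # GCAT
print(translate("ATGGCCTAA"))       # MA*
print(gc_content("ATGC"))           # 0.5
```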
Common Computational Analyses • Database sequence search • Multiple alignment • 2° and 3° Structure prediction; transmembrane helix detection • Structure modeling • Docking prediction and design • Hidden Markov model searches
Database Searching • Text-based Database Searching - using a text string to match an annotation in a sequence database record, i.e. keyword search • Sequence-based Database Searching - using a biological sequence to match all or part of it against the sequence of every record in a sequence database
Text-Based Database Searching • Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems • Search Concepts • Boolean Search - AND, OR, NOT • Broadening the Search • Narrowing the Search • Proximity searching, soundex • Wild Card, Stemming e.g. Thala* for thalasemia, thalassemia, thalassemic • Use standard string search algorithms and Boolean operations, vocabulary matches
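The Boolean and wild-card concepts above can be illustrated on a toy record set. This is only a sketch: the records and the `search` function are hypothetical, and real systems such as Entrez or SRS use indexed fields and their own query syntax rather than the word-by-word matching shown here.

```python
import fnmatch

# Hypothetical annotation records (record id -> free-text annotation).
RECORDS = {
    "rec1": "human beta thalassemia mutation in the HBB gene",
    "rec2": "mouse model of thalassemic anemia",
    "rec3": "human period circadian protein homolog",
}


def matches(term, text):
    """True if any word in the text matches the term (shell-style wild cards)."""
    pattern = term.lower()
    return any(fnmatch.fnmatchcase(word, pattern) for word in text.lower().split())


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for rec_id, text in records.items():
        if not all(matches(t, text) for t in all_of):
            continue
        if any_of and not any(matches(t, text) for t in any_of):
            continue
        if any(matches(t, text) for t in none_of):
            continue
        hits.append(rec_id)
    return hits


print(search(RECORDS, all_of=["thala*"]))                    # ['rec1', 'rec2']
print(search(RECORDS, all_of=["thala*", "human"]))           # ['rec1']
print(search(RECORDS, all_of=["human"], none_of=["thala*"])) # ['rec3']
```

Adding a term to `all_of` narrows the search, while moving terms to `any_of` or using a wild card broadens it.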
Text-based Database Searching • Example: To find the human homolog of the Drosophila per gene • Procedure • Web to Entrez • All Fields: enter "human" "per" • Hits returned, irrelevant - broaden search • "human" "period" - more hits • Check every one, find the human RIGUI gene • Hit and miss, clever guesswork, free form or controlled vocabulary (MeSH terms)? Use Boolean searches?
Sequence-based Database Searching • Homology Search • Global or Local Sequence Alignment • Needleman-Wunsch Algorithm • Smith-Waterman Algorithm • Lipman-Pearson FASTA • Altschul's BLAST • Take a sequence, pairwise comparison with each sequence in the database
Sequence-based Database Searching • Basic Assumptions: • Sequences of homologous genes/proteins diverge over time even though structure and/or function change little • Significant sequence similarity is taken to imply potential structural/functional similarity or common evolutionary origin • Based on a well-characterised protein, infer the function of an unknown sequence at gene or protein sequence level.
Sequence-based Database Searching • Global Alignment - forces complete alignment of the two input sequences in the pairwise comparison • Local Alignment - looks for local stretches of similarity and tries to align the most similar segments • Algorithms used may be similar, but the output differs; statistics are needed to assess results
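The difference between global and local alignment can be seen in a small dynamic programming sketch. It assumes a simple match/mismatch score and a linear gap penalty (not a PAM or BLOSUM matrix, and not affine gaps), and it returns only the optimal score, not the alignment itself. The global mode follows the Needleman-Wunsch recurrence; the local mode adds the Smith-Waterman floor of zero.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style recurrence).
    local=True:  local alignment (Smith-Waterman style, scores floored at 0).
    The scoring values are illustrative assumptions.
    """
    # First row: aligning a prefix of b against nothing.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                        prev[j] + gap,       # gap in b
                        curr[j - 1] + gap)   # gap in a
            if local:
                score = max(score, 0)        # a local alignment may restart
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


x, y = "TTTGATTACATTT", "CCGATTACACC"
print(align_score(x, y, local=True))   # 7: the shared GATTACA segment
print(align_score(x, y, local=False))  # lower: the flanks must be aligned too
```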
Sequence-based Database Searching • Alignment Scoring • Substitution score and substitution matrix - PAM, BLOSUM • Affine gap costs/gap penalty and gap scores • Optimal alignments, dynamic programming - Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH) • Additional heuristics to speed up the search - FASTA, BLAST
Some definitions • Affine gap costs - scoring system for gaps within alignments which charges a penalty for gap formation and an additional per-residue penalty proportional to the size of the gap • Alignment score - numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment. • Algorithm - fixed procedure embodied in a computer program • Heuristics - a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules. • Gapped Alignment - alignment of sequences where gaps are permitted
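The affine gap cost defined above can be written as a one-line formula: a fixed charge for opening the gap plus a charge per gapped residue. The sketch below assumes illustrative penalty values (10 to open, 1 per residue); conventions differ between programs, and some count the first residue as part of the opening penalty.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of the given length under an affine scheme.

    gap_open is charged once per gap, gap_extend once per gapped residue.
    Both default values are assumptions chosen for illustration.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of five residues is much cheaper than five separate gaps of one.
print(affine_gap_cost(5))        # 15
print(5 * affine_gap_cost(1))    # 55
```

This is why affine costs favour a few long gaps over many short ones, which matches how insertions and deletions tend to occur.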
Computational Genefinding • Major challenge in genome project • Given a DNA sequence, where does a gene begin and stop? - ORF • Where are the exons and introns? • Where are the transcription elements? • Gene structure and other regulatory elements?
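The simplest answer to "where does a gene begin and stop?" is an open reading frame (ORF) scan. The sketch below is a deliberately naive version: it assumes an uppercase-able DNA string, looks only at the three forward frames, takes the first ATG after the previous stop as the start, and ignores introns entirely. The function name and the `min_codons` threshold are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATAGGG"))   # [(2, 11, 2)]  i.e. ATG AAA TAG
```

To cover the reverse strand as well, the same scan can be run on the reverse complement of the sequence.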
Genomic Elements • Intron-exon splice sites • Start-Stop codons • Branch Points • Promoters and terminators of transcription • Polyadenylation sites • ribosomal binding sites • Topoisomerase II binding sites • Topoisomerase I cleavage sites • Transcription factor binding sites
Detecting Genomic Elements • Local sites and motifs/patterns for such elements - signals and signal sensors • Extended variable-length regions, e.g. exons and introns - contents and content sensors • Linguistic technique - gene structure described in formal grammar - GeneLang genefinding program
Signal sensors • Simple consensus sequence - use of pattern matching algorithms • Weight matrices - assign a weighted score to each position of the site; the position scores are summed to score a candidate site • Use of Artificial Neural Networks (ANN)
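A weight matrix sensor can be sketched as follows: build a log-odds score for each base at each position from a set of aligned example sites, then slide the matrix along a sequence and sum the position scores in each window. The example sites, the pseudocount, and the uniform background frequency are assumptions made for illustration, not data from a real promoter set.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix from equal-length aligned DNA sites."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def best_site(seq, matrix):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(matrix)
    windows = range(len(seq) - width + 1)
    return max((sum(matrix[k][seq[i + k]] for k in range(width)), i) for i in windows)


# Hypothetical aligned sites resembling a TATAAT-like signal.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
matrix = build_weight_matrix(sites)
score, position = best_site("GGCGTATAATGCGC", matrix)
print(position)   # 4, where TATAAT begins
```

Unlike a strict consensus match, the summed score degrades gradually for sites that differ from the consensus at one or two positions.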
Content Sensors • Long ORF for bacteria • Statistical models, e.g. Markov models - GeneMark - statistical models of nucleotide frequencies and dependencies in codon structure • Neural Nets, e.g. Grail - exon detection by neural network combined with signal sensors for exon-intron splice sites
Some Definitions • Artificial Neural Nets - statistical pattern recognition method - a type of nonlinear regression • Markov Models - statistical models for sequences in which the probability of each residue depends on the residues preceding it. • Dynamic Programming - type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures
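The Markov model definition above can be made concrete with a first-order chain, where each base depends only on the one before it. The sketch trains transition probabilities from example sequences and compares the log-likelihood of a query under two models. The training sequences are toy data, and this is far simpler than the higher-order, codon-position-aware models used by programs such as GeneMark.

```python
import math


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base)."""
    counts = {a: {b: pseudocount for b in "ACGT"} for a in "ACGT"}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in "ACGT"}
        for a in "ACGT"
    }


def log_likelihood(seq, model):
    """Log probability of the transitions in seq (first base not scored)."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))


# Toy training sets standing in for two sequence classes.
gc_model = train_markov(["GCGCGCGCGC", "CGCGCGGCGC"])
at_model = train_markov(["ATATATATAT", "TATATAATAT"])

query = "GCGCGC"
log_odds = log_likelihood(query, gc_model) - log_likelihood(query, at_model)
print(log_odds > 0)   # True: the query looks more like the GC-rich class
```

A content sensor applies the same idea with one model trained on coding sequence and one on non-coding sequence.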
Other Genefinding methods • Use of dynamic programming • Linguistic rules for functional features • Parameters of a Markov process on hidden variables - hidden Markov Models (HMM) • HMM genefinders - EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan
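A hidden Markov model combines the two ideas above: hidden states (for example coding and non-coding) follow a Markov process, and dynamic programming recovers the most probable state path for an observed sequence. Below is a minimal Viterbi sketch with two states. All probabilities are invented for illustration, and the model is far simpler than those in the genefinders listed; it also assumes a non-empty sequence.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Most probable hidden state path for seq, computed in log space."""
    scores = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans[prev][s])
                             + math.log(emit[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Hypothetical model: N = AT-rich non-coding, C = GC-rich coding.
states = "NC"
start = {"N": 0.5, "C": 0.5}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

print(viterbi("ATATATGCGCGCGCATATAT", states, start, trans, emit))
# NNNNNNCCCCCCCCNNNNNN
```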