Position weight matrix (PWM), perceptron and their applications
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca
Sequence Annotation
• Objective of sequence annotation: Given a nucleotide or amino acid sequence, find its biological function by bioinformatics tools.
• Approaches:
  • Homology search: annotation based on known, well-annotated genes in databases, using BLAST and FASTA
  • Gene prediction: annotation based on known gene structure, using position weight matrix, perceptron, HMM
Position-weight matrix (PWM)
• Also called position-specific scoring matrix (PSSM)
• Used in
  • Characterizing sequence motifs
    • Eukaryotic translation initiation consensus
    • Splicing sites
    • Branchpoint sites
    • Shine-Dalgarno sequences
  • Database searches (PHI-BLAST, PSI-BLAST and RPS-BLAST)
Position weight matrix
Sequences flanking the initiation codon of 508 CDSs:
        1234567890123
A4GALT  ATACCATGTCCAA
ACO2    ACAAAATGGCGCC
ACR     GGAGTATGGTTGA
ADM2    CCGCCATGGCCCG
....    .....
N: Number of sequences, i.e., 508
L: Sequence length, i.e., 13
i: A, C, G or T
j: Site index, i.e., 1, 2, ..., 13
Site-specific frequencies: Pij
Non-site-specific (global) frequencies: Pi
Two hypotheses
• No: All sites have the same nuc/aa distributions
• Yes: Different sites have different nuc/aa distributions
• Related terms:
  • Observation: S
  • Likelihood: probability of having S given a model (a hypothesis)
  • Odds ratio: LYes/LNo
  • Log-odds: log(LYes/LNo)
        1234567890123
A4GALT  ATACCATGTCCAA
ACO2    ACAAAATGGCGCC
ACR     GGAGTATGGTTGA
ADM2    CCGCCATGGCCCG
....    .....
S = ACGGTACCACGTT
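The two likelihoods can be written out explicitly. The lines below are a sketch that assumes the sites are independent and uses base-2 logarithms (neither is stated on the slide); s_j denotes the nucleotide at site j of the observation S, and Pij and Pi are the site-specific and global frequencies defined earlier.

```latex
L_{\mathrm{No}} = \prod_{j=1}^{L} P_{s_j}, \qquad
L_{\mathrm{Yes}} = \prod_{j=1}^{L} P_{s_j j}

\log_2 \frac{L_{\mathrm{Yes}}}{L_{\mathrm{No}}}
  = \sum_{j=1}^{L} \log_2 \frac{P_{s_j j}}{P_{s_j}}
```

Under these assumptions the log-odds is a sum of one term per site, which is why the terms can be precomputed and stored as a matrix.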
Position weight matrix (PWM)
• Two major purposes of PWM
  • To characterize the sequence pattern (the motif)
  • To facilitate the computation of log-odds (or PWM score, PWMS), e.g., computing the PWMS for ATACCATGTCCAA
RCCAUGG
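Both purposes can be illustrated with a minimal Python sketch. It assumes the conventional definition PWM entry = log2(Pij/Pi), which the slides do not spell out, and a simple pseudocount scheme chosen here only to keep the logarithm defined when a count is zero; the function names `build_pwm` and `pwm_score` are hypothetical. Only the four sequences shown on the slide are used, not the full set of 508.

```python
import math
from collections import Counter

def build_pwm(seqs, alphabet="ACGT", pseudocount=0.01):
    """Return (pwm, background); pwm[j][nuc] = log2(Pij / Pi) (assumed definition)."""
    n = len(seqs)
    length = len(seqs[0])
    # Global (non-site-specific) frequencies Pi, pooled over all sites.
    pooled = Counter("".join(seqs))
    background = {a: pooled[a] / (n * length) for a in alphabet}
    pwm = []
    for j in range(length):
        counts = Counter(s[j] for s in seqs)
        column = {}
        for a in alphabet:
            # Assumed pseudocount scheme: avoids log2(0) for unobserved nucleotides.
            p_ij = (counts[a] + pseudocount * background[a]) / (n + pseudocount)
            column[a] = math.log2(p_ij / background[a])
        pwm.append(column)
    return pwm, background

def pwm_score(pwm, seq):
    """PWMS: sum of the matrix entries selected by the sequence."""
    return sum(col[nuc] for col, nuc in zip(pwm, seq))

seqs = ["ATACCATGTCCAA", "ACAAAATGGCGCC", "GGAGTATGGTTGA", "CCGCCATGGCCCG"]
pwm, background = build_pwm(seqs)
print(round(pwm_score(pwm, "ATACCATGTCCAA"), 4))
```

Large positive entries mark the motif (here the invariant ATG at sites 6 to 8), and the score of a candidate 13-mer is a table lookup plus a sum. The sketch requires every nucleotide to occur at least once in the training set, since a zero background frequency would make the ratio undefined.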
PWMS over sites
12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
------------- ------------- -------------
Figure 5-1. Illustration of scanning the 5’-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking each side of the ATG. PWMS computed with = 0.01.
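The scanning step in the figure can be sketched as a sliding window. The 3-column matrix below is a made-up toy (not the 13-column PWM behind Figure 5-1) that simply rewards AUG, and `scan` is a hypothetical helper name; the sequence is the NCF4 fragment shown above.

```python
def scan(pwm, seq):
    """Score every window of width len(pwm); return a list of (start, score)."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        hits.append((start, sum(col[nuc] for col, nuc in zip(pwm, window))))
    return hits

# Hypothetical toy matrix: +2 for the AUG nucleotide at each column, -1 otherwise.
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "U": -1},
    {"A": -1, "C": -1, "G": -1, "U": 2},
    {"A": -1, "C": -1, "G": 2, "U": -1},
]
ncf4 = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
        "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")

hits = scan(toy_pwm, ncf4)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start + 1, best_score)  # 1-based start of the best window
```

With a real 13-column PWM the same loop yields the score profile plotted in the figure, with one PWMS per window start.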
Slide 8 BLAST Programs
Yeast 5’ ss PWM
Table 3. Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG|GUAUGUU UAAUU) can be obtained from the large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when introns in the 5′ UTR are excluded. Ma and Xia 2011
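The per-site χ² test mentioned in the caption can be sketched as a goodness-of-fit statistic against the stated background frequencies. The counts below are invented for illustration (the table itself is not reproduced here), and the comparison uses the standard 5% critical value for 3 degrees of freedom, 7.815.

```python
# Background frequencies quoted in the Table 3 caption.
background = {"A": 0.3279, "C": 0.1915, "G": 0.2043, "U": 0.2763}

def chi_square(counts, background):
    """Pearson goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    n = sum(counts.values())
    return sum((counts[b] - n * p) ** 2 / (n * p) for b, p in background.items())

# Hypothetical counts at one site across 275 sequences (not taken from the table).
hypothetical_counts = {"A": 2, "C": 1, "G": 270, "U": 2}

stat = chi_square(hypothetical_counts, background)
CRITICAL_DF3_P05 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
significant = stat > CRITICAL_DF3_P05
print(round(stat, 1), significant)
```

A site whose counts match the background gives a statistic near 0, while a strongly conserved site gives a very large one.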
Yeast 3’ ss PWM
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3’ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from the large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1. Ma and Xia 2011
PWMS as a proxy of splicing strength
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, for non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
Highly expressed genes should have high splicing efficiency.
Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.
Lowly expressed genes could have their splicing sites drifting to low efficiency.
PWMS and Splicing Mechanisms
• Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution
• What does a strongly negative PWMS mean?
  • 5’ ss:
    • HAC1: -8.8291
    • HFM1: -7.3825
    • HOP2: -7.8898
  • 3’ ss:
    • HAC1: -4.4039
    • REC102: -3.4464
Perceptron
• The perceptron is one of the simplest artificial neural networks, invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).
• The perceptron has been used in bioinformatics research since the 1980s:
  • The identification of translational initiation sites in E. coli (Stormo et al., 1982a).
  • Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991).
• More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.
What perceptron does
• Positive sequences
  POS1 ACGT
  POS2 GCGC
• Negative sequences
  NEG1 AGCT
  NEG2 GGCC
• Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences
Definitions
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC
Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4. For amino acid sequences, the matrix would be 20 by 4.
Post-processing
What is the score for: TAAA?
POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC
A WSi,j = 0 means either there is no data on that cell or the cell has no discriminant power
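The training and post-processing steps can be sketched as follows. This is one plausible reading, not necessarily the procedure behind the slides: it assumes a threshold of 0, an update of ±1 to the cells used by a misclassified sequence, and a post-processing step that zeroes every cell never observed in the training data. The names `train_perceptron` and `zero_unseen_cells` are hypothetical.

```python
def score(w, seq):
    """Sum the weights of the cells selected by the sequence."""
    return sum(w[nuc][j] for j, nuc in enumerate(seq))

def train_perceptron(pos, neg, alphabet="ACGT", max_epochs=100):
    length = len(pos[0])
    w = {nuc: [1] * length for nuc in alphabet}  # initialized with 1, as in Table 5-3
    for _ in range(max_epochs):
        errors = 0
        for seqs, sign in ((pos, 1), (neg, -1)):
            for seq in seqs:
                # Assumed rule: positives must score > 0, negatives < 0.
                if sign * score(w, seq) <= 0:
                    for j, nuc in enumerate(seq):
                        w[nuc][j] += sign
                    errors += 1
        if errors == 0:
            break
    return w

def zero_unseen_cells(w, training_seqs):
    """Assumed post-processing: cells with no training data get weight 0."""
    seen = {(nuc, j) for seq in training_seqs for j, nuc in enumerate(seq)}
    return {nuc: [v if (nuc, j) in seen else 0 for j, v in enumerate(row)]
            for nuc, row in w.items()}

pos = ["ACGT", "GCGC"]
neg = ["AGCT", "GGCC"]
w = zero_unseen_cells(train_perceptron(pos, neg), pos + neg)
print(w, score(w, "TAAA"))
```

Without the post-processing step, TAAA would inherit the initial weight of 1 from four cells that were never trained, which is the motivation for asking about its score. After zeroing, the only nonzero cells are at sites 2 and 3, where the two groups actually differ.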
Doublet perceptron
    1234567890
P1  ACGUAUACGU
P2  ACGUCUACGU
P3  ACGUGUACGU
P4  ACGUUAACGU
P5  ACGUUCACGU
P6  ACGUUGACGU
N1  ACGUAAACGU
N2  ACGUACACGU
N3  ACGUAGACGU
N4  ACGUCAACGU
N5  ACGUCCACGU
N6  ACGUCGACGU
N7  ACGUGAACGU
N8  ACGUGCACGU
N9  ACGUGGACGU
N10 ACGUUUACGU
Doublet Perceptron
A large amount of data is needed to avoid the problem of overfitting.
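The doublet example can be reproduced with a small sketch. In the data above, the positive sequences are exactly those with one U at sites 5-6, so no set of single-site weights can separate the groups, whereas weights on adjacent-site doublets can. The training rule (threshold 0, ±1 updates, weights starting at 1) and the names `train`, `singlets` and `doublets` are assumptions made for illustration.

```python
from itertools import product

def singlets(s):
    return [(j, s[j]) for j in range(len(s))]

def doublets(s):
    return [(j, s[j:j + 2]) for j in range(len(s) - 1)]

def train(pos, neg, featurize, max_epochs=1000):
    """Perceptron over (site, symbol) features; returns (weights, converged)."""
    w = {}
    def score(s):
        return sum(w.setdefault(f, 1) for f in featurize(s))  # weights start at 1
    for _ in range(max_epochs):
        errors = 0
        for seqs, sign in ((pos, 1), (neg, -1)):
            for s in seqs:
                if sign * score(s) <= 0:
                    for f in featurize(s):
                        w[f] += sign
                    errors += 1
        if errors == 0:
            return w, True
    return w, False

# Rebuild the 16 sequences: ACGU + doublet + ACGU, positive when exactly one U.
pairs = ["".join(p) for p in product("ACGU", repeat=2)]
pos = ["ACGU" + d + "ACGU" for d in pairs if d.count("U") == 1]
neg = ["ACGU" + d + "ACGU" for d in pairs if d.count("U") != 1]

_, single_ok = train(pos, neg, singlets)
_, double_ok = train(pos, neg, doublets)
print(single_ok, double_ok)
```

The price of the extra power is the number of weights: up to 16 per adjacent site pair instead of 4 per site, which is why more training data is needed to avoid overfitting.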
Gene/Motif Prediction
• Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology).
  • Cellular localization
  • Biological processes the gene (its product) participates in
  • The biological reaction
• Related terms:
  • Motif: e.g., RccAUGG
  • Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively
• Gene/Motif prediction methods
  • Position weight matrix
  • Perceptrons
  • Supervised learning
  • Hidden Markov Models (HMMs)
  • Neural networks (e.g., self-organizing map or SOM)
Bayesian inference on breast cancer
Population: women aged 40+
A woman has a chance of 0.01 of getting breast cancer.
80% of those with breast cancer will get positive mammographies.
10% of those without breast cancer will also get a positive mammography.
What is the probability that a woman with a positive mammography actually has breast cancer?

Status      Prior   Test result    Joint   Posterior (given positive)
Cancer      0.01    positive 0.8   0.008   0.075 = 0.008/(0.008+0.099)
                    negative 0.2   0.002
No cancer   0.99    positive 0.1   0.099   0.925
                    negative 0.9   0.891
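The arithmetic in the table is Bayes' theorem applied to a positive result, and it can be checked with a few lines of Python. The function name `posterior` and its parameter names are chosen here for illustration.

```python
def posterior(prior, p_pos_if_true, p_pos_if_false):
    """P(condition | positive test) by Bayes' theorem."""
    joint_true = prior * p_pos_if_true          # e.g. 0.01 * 0.8
    joint_false = (1 - prior) * p_pos_if_false  # e.g. 0.99 * 0.1
    return joint_true / (joint_true + joint_false)

breast_cancer = posterior(prior=0.01, p_pos_if_true=0.8, p_pos_if_false=0.1)
print(round(breast_cancer, 3))
```

Even with a fairly accurate test, the low prior keeps the posterior small, because false positives from the large healthy group outnumber true positives.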
Application of Bayes' Theorem
Many more diagnostic tools are needed, and their predictions must be combined to reach a better joint prediction.
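One simple way to combine predictions is to feed the posterior from one test in as the prior for the next. The sketch below assumes the tests are conditionally independent given the true status, and it reuses the mammography error rates for a hypothetical second test purely for illustration; `update` and `combine` are made-up names.

```python
def update(prior, p_pos_if_true, p_pos_if_false):
    """One Bayes update after a positive result."""
    true_pos = prior * p_pos_if_true
    false_pos = (1 - prior) * p_pos_if_false
    return true_pos / (true_pos + false_pos)

def combine(prior, tests):
    """Chain updates over positive results; assumes conditional independence."""
    p = prior
    for p_pos_if_true, p_pos_if_false in tests:
        p = update(p, p_pos_if_true, p_pos_if_false)
    return p

one_test = combine(0.01, [(0.8, 0.1)])
# Hypothetical second test with the same error rates as the first.
two_tests = combine(0.01, [(0.8, 0.1), (0.8, 0.1)])
print(round(one_test, 3), round(two_tests, 3))
```

Each additional independent positive result raises the posterior, which is the sense in which several weak diagnostic tools can yield a better joint prediction.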
Bayesian prediction of genes
Population: all 300mers
Probability that a 300mer is from a gene: 0.02
95% of those 300mers from a gene will get positive scores.
15% of those 300mers from non-genes or pseudogenes will also get a positive score.
What is the probability that a 300mer with a positive score is from a real gene?

Status      Prior   Score           Joint   Posterior (given positive)
Gene        0.02    positive 0.95   0.019   0.114 = 0.019/(0.019+0.147)
                    negative 0.05   0.001
Non-gene    0.98    positive 0.15   0.147   0.886
                    negative 0.85   0.833