Based on: MicroRNA identification based on sequence and structure alignment

A Modified miRAlign Approach to Finding MicroRNAs in the Chicken Genome
Based on: MicroRNA identification based on sequence and structure alignment, by Xiaowo Wang†, Jing Zhang†, Fei Li, Jin Gu, Tao He, Xuegong Zhang and Yanda Li
Presented by: Neeta Jain, Nehar Arora, and Jeff Bonis

Outline • Introduction • Motivation • Methods • Results • Conclusion

Introduction • What are miRNAs and why are they important? • miRNAs are ~22 nt long non-coding RNAs • They are derived from their ~70 nt precursors, which typically have a hairpin structure Importance of miRNAs: • They are found to regulate the expression of target genes via complementary base pair interactions.

Motivation • miRNAs are short (~22 nt) and are more conserved in secondary structure than in primary sequence • Hence, conventional sequence alignment methods such as BLAST can only find relatively close homologues • Several steps of miRAlign can be swapped for alternative tools, and the resulting increase or decrease in performance should be evaluated • Prof. Joan at the Delaware Biotechnology Institute is working on identifying miRNAs in the chicken genome, but the secondary structure information has not yet been exploited

Methods • Data • Reference sets • miRBase Registry Version 8.0 (http://microrna.sanger.ac.uk/sequences) • MicroRNA Registry Version 5.0 was previously used • 1300 animal miRNAs from six species and their precursors (1104) composed our raw training set, Train_All • Train_Sub_1: all miRNAs in Train_All except those from G. gallus • Train_Sub_2: all miRNAs in Train_All except those from G. gallus and C. elegans • Genomic sequences • Only the chicken genome (G. gallus) was used
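
The two reduced training sets can be sketched as simple filters over Train_All. This is a minimal illustration, assuming each record carries a miRBase-style identifier whose three-letter prefix encodes the species (gga for G. gallus, cel for C. elegans). The sample records and the helper names `species_prefix` and `exclude_species` are hypothetical and are not part of miRAlign.

```python
# Hypothetical stand-in for Train_All: (miRNA identifier, mature sequence).
# Identifiers and sequences are toy placeholders, not real miRBase entries.
train_all = [
    ("gga-toy-1", "AAAA"),
    ("hsa-toy-1", "CCCC"),
    ("cel-toy-1", "GGGG"),
    ("mmu-toy-1", "UUUU"),
]


def species_prefix(mirna_id):
    """Return the species prefix of a miRBase-style identifier."""
    return mirna_id.split("-", 1)[0]


def exclude_species(records, excluded):
    """Keep only records whose species prefix is not in `excluded`."""
    return [rec for rec in records if species_prefix(rec[0]) not in excluded]


# Train_Sub_1: everything except G. gallus.
train_sub_1 = exclude_species(train_all, {"gga"})
# Train_Sub_2: everything except G. gallus and C. elegans.
train_sub_2 = exclude_species(train_all, {"gga", "cel"})
```

Leaving the target species out of the training set is what lets the search be scored against the known chicken miRNAs afterwards.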


Methods (contd) • Preprocessing • Known precursors from the training set are aligned against the chicken genome with BLAT (instead of BLAST) • The resulting hits are used as the candidate pre-miRNAs • Experienced difficulty extracting flanking sequences
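
The flanking-sequence extraction mentioned above can be sketched as follows. This is an illustrative helper, not the code used in the project: it assumes each hit is given as 0-based, half-open coordinates on the forward strand of the target sequence plus a strand flag, and the function names and the `flank` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper or lower case)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def extract_with_flanks(chrom_seq, start, end, strand="+", flank=10):
    """Return the hit region extended by `flank` bases on each side.

    Coordinates are assumed to be 0-based and half-open on the forward
    strand. The window is clamped to the ends of the sequence, and
    minus-strand hits are returned as the reverse complement.
    """
    lo = max(0, start - flank)
    hi = min(len(chrom_seq), end + flank)
    region = chrom_seq[lo:hi]
    return reverse_complement(region) if strand == "-" else region


# Toy usage on a short made-up sequence.
chrom = "AAACCCGGGTTT"
plus_hit = extract_with_flanks(chrom, 3, 6, "+", flank=2)
minus_hit = extract_with_flanks(chrom, 3, 6, "-", flank=2)
```

Clamping at the sequence ends avoids the silent truncation or wrap-around that negative slice indices would otherwise cause for hits near the start of a chromosome.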

Methods (contd) • “Modified” miRAlign (1.) Secondary structure prediction • Both the candidate sequence and its reverse complement are analyzed by RNAfold to predict hairpins • As an alternative, sequences were also analyzed in parallel by mFold to predict their secondary structures • Only hairpins with an MFE lower than -20 kcal/mol are retained (2.) Pairwise sequence alignment • Sequences from the previous step are aligned pairwise to all the ~22 nt known miRNA sequences from the training set • The sequence similarity score between the candidate and each known mature miRNA is calculated by CLUSTALW • If the score exceeds a user-defined threshold (default = 70), the candidate/known-miRNA pair is kept for further analysis
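
The two filters in steps (1.) and (2.) can be sketched as below. This is a simplified illustration under stated assumptions: the MFE values are taken as already computed by a folding program, and the similarity score is a plain ungapped percent identity used here as a stand-in for the CLUSTALW pairwise score, which is computed differently. All function names are hypothetical.

```python
MFE_CUTOFF = -20.0   # kcal/mol, cutoff stated in the slide
SCORE_CUTOFF = 70    # default similarity threshold stated in the slide


def passes_mfe(mfe):
    """A hairpin is retained only if its MFE is lower than the cutoff."""
    return mfe < MFE_CUTOFF


def best_ungapped_identity(candidate, mirna):
    """Best percent identity of `mirna` over all ungapped placements.

    Stand-in for the CLUSTALW pairwise score (an assumption of this sketch).
    """
    n = len(mirna)
    best = 0.0
    for i in range(len(candidate) - n + 1):
        window = candidate[i:i + n]
        matches = sum(a == b for a, b in zip(window, mirna))
        best = max(best, 100.0 * matches / n)
    return best


def keep_pairs(candidates, known_mirnas):
    """Return (candidate, miRNA, score) triples that pass both filters.

    candidates: list of (name, sequence, mfe)
    known_mirnas: dict mapping miRNA name to mature sequence
    """
    kept = []
    for cname, cseq, mfe in candidates:
        if not passes_mfe(mfe):
            continue
        for mname, mseq in known_mirnas.items():
            score = best_ungapped_identity(cseq, mseq)
            if score > SCORE_CUTOFF:
                kept.append((cname, mname, score))
    return kept


# Toy usage with made-up sequences and MFE values.
toy_candidates = [
    ("c1", "GGGACGUACGUCCC", -25.0),
    ("c2", "GGGACGUACGUCCC", -10.0),  # fails the MFE filter
]
toy_known = {"m1": "ACGUACGU"}
kept_pairs = keep_pairs(toy_candidates, toy_known)
```

Applying the MFE filter first means the alignment step, the more expensive of the two, runs only on candidates that already look like stable hairpins.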

Methods (contd) (3.) Checking the miRNA’s position on the stem-loop • Should not be located on the terminal loop of the hairpin • Omitted because the offsets of the known mature miRNAs within their pre-miRNAs were unavailable: • Should be located on the same arm of the hairpin as its known homologues • Position of the potential miRNA on the hairpin should not differ too much from that of its known homologues (chosen delta_len = 15)
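
The position checks in step (3.) can be sketched on a dot-bracket structure string. This is an assumption-laden illustration: it handles only simple, unbranched hairpins, treats any overlap with the terminal loop as a failure, and interprets delta_len as the maximum allowed difference in start position. The function names are hypothetical. The homologue check is optional, mirroring the fact that it was omitted when offsets were unavailable.

```python
def terminal_loop(structure):
    """Return the (start, end) span of the terminal loop, 0-based half-open.

    Assumes a simple hairpin: the loop is the unpaired run between the
    last '(' and the first ')'.
    """
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    if last_open == -1 or first_close == -1 or last_open > first_close:
        raise ValueError("not a simple hairpin structure")
    return last_open + 1, first_close


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def position_ok(structure, mir_start, mir_end, known_start=None, delta_len=15):
    """Check the candidate miRNA position against the hairpin."""
    loop_start, loop_end = terminal_loop(structure)
    if overlaps(mir_start, mir_end, loop_start, loop_end):
        return False
    # Optional homologue check, skipped when the known offset is unavailable.
    if known_start is not None and abs(mir_start - known_start) > delta_len:
        return False
    return True


# Toy usage on a made-up 16-nt hairpin.
toy_structure = "((((((....))))))"
on_stem = position_ok(toy_structure, 0, 5)
on_loop = position_ok(toy_structure, 4, 9)
```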

Methods (contd) (4.) RNA secondary structure alignment • RNAforester computes a pairwise structure alignment and gives a similarity score • The score is a sum over all base (or base-pair) match, insertion, and deletion scores • Normalized similarity score of structure C and m is given as: • An alternative structure alignment program, SimTree, transforms the structures into labeled trees, then computes the distance between them and assigns a normalized score
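
The idea of turning a secondary structure into a labeled tree can be sketched as follows. This is a generic encoding chosen for illustration, with "P" nodes for base pairs and "U" leaves for unpaired bases. It is not claimed to be the representation that SimTree or RNAforester uses internally, and the function names are hypothetical.

```python
def dotbracket_to_tree(structure):
    """Convert a dot-bracket string into a nested labeled tree.

    Each '(' opens a 'P' (paired) node that is closed by its matching ')'.
    Each '.' becomes a 'U' (unpaired) leaf under the current node.
    """
    root = {"label": "root", "children": []}
    stack = [root]
    for ch in structure:
        if ch == "(":
            node = {"label": "P", "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure: extra ')'")
            stack.pop()
        elif ch == ".":
            stack[-1]["children"].append({"label": "U", "children": []})
        else:
            raise ValueError("unexpected character: " + repr(ch))
    if len(stack) != 1:
        raise ValueError("unbalanced structure: missing ')'")
    return root


def count_label(node, label):
    """Count the nodes carrying `label` in the subtree rooted at `node`."""
    own = 1 if node["label"] == label else 0
    return own + sum(count_label(child, label) for child in node["children"])


# Toy usage on a made-up hairpin with two base pairs and a 2-nt loop.
toy_tree = dotbracket_to_tree("((..))")
```

Once both structures are trees, a tree edit distance or tree alignment can compare them node by node, which is the general family of methods the two programs belong to.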

Methods (contd) (5.) Total similarity score • After aligning all potential homologue pairs, a total similarity score (tss) is assigned to each candidate sequence. Where C = candidate sequence; R = set composed of all C’s

Results • The search for miRNAs in the chicken genome proved somewhat difficult. BLAT was used instead of BLAST because of time constraints • For secondary structure prediction, mFold predicted a lower MFE than RNAfold on average • T-Coffee could be used for pairwise sequence alignment instead of CLUSTALW, but is about N times slower

Results (contd) • Requirements for the position of the mature miRNA on the stem-loop were reduced • Only the condition that the miRNA not lie on the terminal loop was enforced • The orientation (5’ vs. 3’) of known pre-miRNAs was needed to check arm location and hairpin length • It was previously found that over 97.5% of known animal miRNAs met the non-stringent hairpin length difference cutoff of 15 • For secondary structure alignments, SimTree was used along with the original RNAforester • SimTree uses tree alignment methods similar to those of RNAforester

Conclusion • Final results are still under analysis • Future work: • Perform primary sequence steps first, then secondary structure filter steps • Primary sequence filters provide a greater reduction in the candidate set than the secondary structure-based filters • Additional secondary structure prediction, primary sequence alignment, and secondary structure alignment tools could be evaluated • Different combinations of these tools could also lead to better performance • Tertiary structure tools could supplement or replace some of the filtering steps

THANK YOU Questions ??
