Picking Alignments from (Steiner) Trees

Lior Pachter, Fumei Lam, Marina Alexandersson

Overview (figure):
• Alignment: ATCG--G / A-CGTCA
• Biologically meaningful
• Steiner networks
• Pair Hidden Markov Models (states X, M, Y)
• Fast alignments based on HMM structure

Some basic definitions: Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S, the length of the shortest path between u and v in G' is at most k times the distance between u and v in G. Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
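The 1-spanner condition can be checked directly on a small example. The sketch below is illustrative only: the function name is_manhattan_network and the input convention (a network given as axis-parallel segments that meet only at their endpoints, with integer coordinates) are assumptions, not part of the talk.

```python
import heapq

def l1(p, q):
    """Manhattan (L1) distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def shortest_paths(adj, source):
    """Dijkstra from source over a weighted adjacency map."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_manhattan_network(points, segments):
    """True if the segments form a 1-spanner for points under L1 distance.

    Assumption (hypothetical input convention): segments are axis-parallel
    and intersect only at shared endpoints.
    """
    adj = {}
    for p, q in segments:
        assert p[0] == q[0] or p[1] == q[1], "segments must be axis-parallel"
        w = l1(p, q)
        adj.setdefault(p, []).append((q, w))
        adj.setdefault(q, []).append((p, w))
    for u in points:
        dist = shortest_paths(adj, u)
        for v in points:
            # Network distance must equal the L1 distance exactly.
            if dist.get(v, float("inf")) != l1(u, v):
                return False
    return True
```

For example, S = {(0,0), (1,1)} with segments (0,0)-(1,0) and (1,0)-(1,1) passes the check, and (1,0) is a Steiner point.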

Example (figure): a Manhattan network for a point set S (red points), with a Steiner point labeled.

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points: 4-approximation in O(n^3) and 8-approximation in O(n log n).

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
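As a reminder of what the Hanan grid is, the following minimal sketch builds it for a small point set: draw a horizontal and a vertical line through every point and take all intersections as vertices. The function name hanan_grid and the edge representation are assumptions made for illustration.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of a 2D point set.

    Vertices are all (x, y) with x an x-coordinate and y a y-coordinate
    occurring in points; edges join neighboring grid vertices.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):  # horizontal edge to the next grid column
                edges.append(((x, y), (xs[i + 1], y)))
            if j + 1 < len(ys):  # vertical edge to the next grid row
                edges.append(((x, y), (x, ys[j + 1])))
    return vertices, edges
```

With n points in general position the grid has n^2 vertices, which is why restricting attention to it keeps the search space polynomial.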

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations). Slide: A(v) = {u : v is the topmost node below and to the left of u}.
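The slide definition can be read as a simple grouping of points. The sketch below computes A(v) for one of the four orientations only. Two details are assumptions not stated on the slide: "below and to the left" is taken as non-strict (x and y both less than or equal), and ties for "topmost" are broken by taking the rightmost candidate. The function name slides is hypothetical.

```python
from collections import defaultdict

def slides(points):
    """Group points into slides A(v) for the lower-left orientation.

    For each u, v is the topmost point below and to the left of u
    (assumed non-strict, ties broken by largest x); u is added to A(v).
    """
    result = defaultdict(list)
    for u in points:
        below_left = [p for p in points
                      if p != u and p[0] <= u[0] and p[1] <= u[1]]
        if below_left:
            v = max(below_left, key=lambda p: (p[1], p[0]))
            result[v].append(u)
    return dict(result)
```

The other three orientations would follow by reflecting the coordinates before calling the same routine.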

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide. The minimum slide arborescence problem: Lingas-Pinter-Rivest-Shamir 1982, O(n^3) optimal solution using dynamic programming.

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness (figure with points labeled a, b, u, v).

What is an alignment?
ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC

Pair HMMs: simple sequence-alignment PHMM with states X, M, Y
M = (mis)match
X = insert seq1
Y = insert seq2

Pair HMMs
Hidden sequence (transition probabilities): M M X M Y Y M
Hidden alignment (output probabilities):
ATCG--G
AC-GTCA
Observed sequences: ATCGG and ACGTCA

Using the Pair HMM
MMXMYYM
ATCG--G
AC-GTCA
In practice, we have observed sequences ATCGG and ACGTCA, for which we wish to infer the underlying hidden states. One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
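A minimal sketch of Viterbi decoding for a three-state pair HMM follows. It is not the SLAM implementation. The function name pair_hmm_viterbi, the parameter layout, and every probability in the EXAMPLE_ constants are made-up assumptions chosen only so the code runs.

```python
import math

NEG_INF = float("-inf")
STEP = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}  # characters consumed from (seq1, seq2)

def pair_hmm_viterbi(a, b, log_trans, log_emit_pair, log_emit_gap, log_start):
    """Return the most likely state path over M, X, Y for sequences a and b."""
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return ""
    # V[s][i][j]: best log-probability of emitting a[:i], b[:j], ending in state s
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STEP}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STEP}
    for i in range(n + 1):
        for j in range(m + 1):
            for s, (di, dj) in STEP.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue  # state s cannot end at cell (i, j)
                e = log_emit_pair(a[pi], b[pj]) if s == "M" else log_emit_gap
                if pi == 0 and pj == 0:
                    V[s][i][j] = log_start[s] + e  # first state of the path
                    continue
                best, best_t = NEG_INF, None
                for t in STEP:
                    cand = V[t][pi][pj] + log_trans[t][s]
                    if cand > best:
                        best, best_t = cand, t
                if best_t is not None:
                    V[s][i][j] = best + e
                    back[s][i][j] = best_t
    # Trace back from the best final state.
    s = max(STEP, key=lambda t: V[t][n][m])
    i, j, path = n, m, []
    while s is not None:
        path.append(s)
        di, dj = STEP[s]
        s, i, j = back[s][i][j], i - di, j - dj
    return "".join(reversed(path))

# Hypothetical parameters, for illustration only.
EXAMPLE_TRANS = {
    "M": {"M": math.log(0.8), "X": math.log(0.1), "Y": math.log(0.1)},
    "X": {"M": math.log(0.6), "X": math.log(0.3), "Y": math.log(0.1)},
    "Y": {"M": math.log(0.6), "X": math.log(0.1), "Y": math.log(0.3)},
}
EXAMPLE_START = {"M": math.log(0.8), "X": math.log(0.1), "Y": math.log(0.1)}

def example_emit(x, y):
    """Made-up pair emission: matches are favored over mismatches."""
    return math.log(0.2) if x == y else math.log(0.2 / 12)

if __name__ == "__main__":
    print(pair_hmm_viterbi("ATCGG", "ACGTCA", EXAMPLE_TRANS,
                           example_emit, math.log(0.25), EXAMPLE_START))
```

The table has one cell per pair of prefixes for each state, so this direct version takes time proportional to the product of the two sequence lengths.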

Needleman-Wunsch: Viterbi in PHMM (figure: states X, M, Y)
Match prob: p_m, match score: log(p_m)
Mismatch prob: p_r, mismatch score: log(p_r)
Gap prob: p_g, gap score: log(p_g)
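The scoring scheme above can be plugged into a standard Needleman-Wunsch recurrence. The sketch below uses the slide's three scores, log(p_m), log(p_r), and log(p_g), with a linear gap cost. The function name and the numeric probabilities in the usage line are assumptions for illustration.

```python
import math

def needleman_wunsch(a, b, pm, pr, pg):
    """Global alignment with scores log(pm), log(pr), log(pg).

    Returns (score, aligned_a, aligned_b).
    """
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap

    def sub(i, j):
        """Score for aligning a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback: recompute the same expressions to find which move was taken.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to make the example run.
    print(needleman_wunsch("ATCGG", "ACGTCA", 0.5, 0.1, 0.05))
```

Because the scores are logarithms of probabilities, maximizing the summed score is the same as maximizing the product of the corresponding probabilities along the alignment.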

We want to take into account that the sequences are genomic sequences. Example: a pair of syntenic genomic regions.

PHMM (figure: states X, M, Y, with additional states labeled X and Y)

PHMM (figure: states X, Y): A property of "single sequence" states is that all paths in the Viterbi graph between two vertices have the same weight.

Strategy for Alignment (figure; example sequence GATTACATTGATCAGACAGGTGAAGA)

The CD4 region (figure: human, 0 to 50000, plotted against mouse, 0 to 50000)

Gene structure (figure), 5' to 3': Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4.
Signals: translation initiation ATG; splice site GGTGAG; branchpoint CTGAC; splice site CAG; stop codon TAG/TGA/TAA.

Suggests a new Steiner problem: find the shortest 1-spanner connecting reds to blues.

Generalizes the Manhattan network problem (all points red and blue). Generalizes the Rectilinear Steiner Arborescence problem.

History of the Rectilinear Steiner Arborescence Problem
1985, Trubin: polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor: error in Trubin
2000, Shi and Su: NP-complete!

Results for unlabeled problem
• An O(n^3) 2-approximation algorithm (implemented)
• An O(n log n) 4-approximation algorithm
• Testing on CD4 region in human/mouse
• Implementation (SLIM): http://bio.math.berkeley.edu/slim/
• SLIM for SLAM (in progress): http://bio.math.berkeley.edu/slam/

(Figure: alignment PHMM with states CNS, D, X, Y, M, and I, shown with example sequences.)

The Viterbi graph for a more complicated alignment PHMM

Comparison and Analysis of Performance
• Our method has two main steps (L = length of seqs, n = #HSP):
  1. Building the network: O(n^3) or O(n log n)
  2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case
• Banding algorithms are O(L^2) worst case for step 2.
• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.
• These strategies do not generalize well for more sophisticated HMMs.

Summary
Software:
SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/
Thanks: Nick Bray and Simon Cawley
