Gaurav Chadha Deepak Desore

Inferring Functional Information from Domain Co-evolution • Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and Shankar Subramaniam • Gaurav Chadha, Deepak Desore

Layout • Motivation • Computational Methods and Algorithms • Results • Conclusion • Questions

Motivation (1 of 2) • Prior work • Focused on understanding protein function at the level of entire protein sequences • Assumption: the complete sequence follows a single evolutionary trajectory • It is well known that a domain can exist in various contexts, which invalidates this assumption for multi-domain protein sequences

Motivation (2 of 2) • Our approach • Improves on the multiple-profile method • Constructs a co-evolutionary matrix to assign a phylogenetic similarity score to each protein pair • Identifies co-evolving regions using residue-level conservation

Computational Methods & Algorithms • Constructing phylogenetic profiles • Protein (single) phylogenetic profiles • Segment (multiple) phylogenetic profiles • Residue phylogenetic profiles • Computing co-evolutionary matrices • Deriving phylogenetic similarity scores

Protein phylogenetic profiles • A phylogenetic profile is a vector that records, for each genome, whether a protein is present in that genome. • Let P = {P1, P2, …, Pn} be the set of proteins and G = {G1, G2, …, Gm} be the set of genomes • Every row represents the binary phylogenetic profile of one protein.
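The binary profile described on this slide can be sketched in a few lines. This is only an illustration: the function name `binary_profiles` and the `hits` set of (protein, genome) pairs are hypothetical, and how a "hit" is decided (for example, an E-value cutoff) is not specified here.

```python
def binary_profiles(proteins, genomes, hits):
    """Return {protein: [0/1 per genome]}.

    `hits` is a hypothetical set of (protein, genome) pairs for which a
    homolog of the protein was found in the genome.
    """
    return {
        p: [1 if (p, g) in hits else 0 for g in genomes]
        for p in proteins
    }


# Each value is one row of the profile matrix: one protein across all genomes.
rows = binary_profiles(
    ["P1", "P2"],
    ["G1", "G2", "G3"],
    {("P1", "G1"), ("P1", "G3"), ("P2", "G2")},
)
```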

Protein phylogenetic profiles (contd.) • The single phylogenetic profile $\psi_i$ for protein $P_i$ is $\psi_i(j) = -\dfrac{1}{\log E_{ij}}$, $1 \le j \le m$, where $E_{ij}$ is the minimum BLAST E-value of the local alignments between $P_i$ and $G_j$ • Advantage: reflects the degree of sequence divergence
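A minimal sketch of the real-valued profile, under assumptions that the slide does not state: base-10 logarithms, a value of 1.0 for genomes with no significant hit (E-value of 1 or more), values capped at 1.0, and a small floor to avoid log(0) when BLAST reports an E-value of 0. The name `evalue_profile` is hypothetical.

```python
import math


def evalue_profile(evalues, min_e=1e-300):
    """Map the minimum E-value per genome to psi(j) = -1 / log10(E)."""
    profile = []
    for e in evalues:
        e = max(e, min_e)  # assumption: clamp E = 0 to a tiny positive value
        if e >= 1.0:
            profile.append(1.0)  # assumption: no significant hit maps to 1.0
        else:
            # assumption: cap at 1.0 so every value lies in (0, 1]
            profile.append(min(1.0, -1.0 / math.log10(e)))
    return profile


# Smaller values mean a more significant (less diverged) match.
psi = evalue_profile([1e-10, 1e-100, 5.0])
```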

Protein phylogenetic profiles (contd.) • Mutual information $I(X,Y)$ is defined as $I(X,Y) = H(X) + H(Y) - H(X,Y)$, where the Shannon entropy of $X$ is $H(X) = -\sum_{x \in X} p_x \log p_x$ with $p_x = P[X = x]$ • The phylogenetic similarity between profiles $\psi_i$ and $\psi_j$ is $\mu_S(P_i, P_j) = I(\psi_i, \psi_j)$
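The mutual information between two profiles can be sketched as follows. The entropy estimate uses empirical frequencies over the genomes. Real-valued profiles have to be discretized first; the equal-width binning in `discretize` (10 bins over [0, 1]) is an assumption for illustration, not something stated on the slide.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for two equal-length discrete vectors."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def discretize(profile, bins=10):
    """Assumed binning: equal-width bins over [0, 1]."""
    return [min(int(v * bins), bins - 1) for v in profile]


# Two profiles that rise and fall together share more information.
mu_s = mutual_information(discretize([0.05, 0.07, 0.95, 1.0]),
                          discretize([0.02, 0.03, 0.90, 0.99]))
```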

Segment phylogenetic profiles • Single profile based methods could miss significant interactions. • Domain D12 of P2 follows evolutionary trajectory similar to P1 and P3 which single profile method didn’t capture.

Segment phylogenetic profiles (contd.) • Divide each protein $P_i$ into fixed-size segments $S_i^1, S_i^2, \dots, S_i^k$ • The phylogenetic similarity between two proteins is $\mu_M(P_i, P_j) = \max_{s,t} I(\psi_i^s, \psi_j^t)$, where $\psi_i^s$ is the phylogenetic profile of segment $S_i^s$ of protein $P_i$
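A sketch of the multiple-profile score: split each protein into fixed-size segments and take the best mutual information over all segment pairs. The helper names are hypothetical, the segment profiles are assumed to be already discretized, and the last segment is allowed to be shorter than the others.

```python
import math
from collections import Counter
from itertools import product


def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def split_into_segments(length, size):
    """Half-open (start, end) residue intervals of fixed size; the last may be shorter."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]


def segment_similarity(seg_profiles_i, seg_profiles_j):
    """mu_M: maximum mutual information over all pairs of segment profiles."""
    return max(
        mutual_information(a, b)
        for a, b in product(seg_profiles_i, seg_profiles_j)
    )


segments = split_into_segments(250, 100)
mu_m = segment_similarity([[0, 0, 1, 1], [0, 1, 0, 1]], [[1, 1, 0, 0]])
```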

Residue phylogenetic profiles • Problem with multiple phylogenetic profiles: • Both domains covered together by the segment S22, overriding their individual phylogenetic profiles. • Significant local alignment between two proteins corresponds to the residues covered in the alignment rather than the whole sequences.

Residue phylogenetic profiles (contd.) • $\mathcal{A}(P_i, G_j)$: the set of significant local alignments between protein $P_i$ and genome $G_j$ • $T(A) = [r_b, r_e]$: the interval of residues on $P_i$ covered by each alignment $A \in \mathcal{A}(P_i, G_j)$ • For each residue $r$ on $P_i$, the phylogenetic profile is $\psi_i^r(j) = \min_{A \in \mathcal{A}_r} \left( -\dfrac{1}{\log E(A)} \right)$, $1 \le j \le m$, where $\mathcal{A}_r = \{A \in \mathcal{A}(P_i, G_j) : r \in T(A)\}$ is the set of local alignments that contain $r$
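The residue-level profile can be sketched as below. Each alignment is represented by a hypothetical tuple `(rb, re, evalue)` with 1-based inclusive residue positions. As assumptions, logarithms are base 10 and a residue not covered by any alignment in a genome gets the value 1.0.

```python
import math


def score(evalue, min_e=1e-300):
    """-1 / log10(E), capped at 1.0 (assumed convention)."""
    e = max(evalue, min_e)
    return 1.0 if e >= 1.0 else min(1.0, -1.0 / math.log10(e))


def residue_profiles(length, alignments_by_genome):
    """Return profiles[r-1][j] for residues r = 1..length and each genome j.

    alignments_by_genome[j] is a list of (rb, re, evalue) tuples describing
    significant local alignments of the protein against genome j.
    """
    m = len(alignments_by_genome)
    profiles = [[1.0] * m for _ in range(length)]  # assumption: uncovered = 1.0
    for j, alignments in enumerate(alignments_by_genome):
        for rb, re_, evalue in alignments:
            s = score(evalue)
            for r in range(rb, re_ + 1):
                # keep the minimum over all alignments that contain residue r
                profiles[r - 1][j] = min(profiles[r - 1][j], s)
    return profiles


# Residue 3 is covered by both alignments in genome 0 and keeps the smaller value.
prof = residue_profiles(5, [[(1, 3, 1e-10), (3, 5, 1e-100)], []])
```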

Computing co-evolutionary matrices • For each protein pair $P_i$ and $P_j$ with lengths $l_i$ and $l_j$, the co-evolutionary matrix entry is $M_{ij}(r,s) = I(\psi_i^r, \psi_j^s)$, where $1 \le r \le l_i$ and $1 \le s \le l_j$ • The co-evolutionary matrix contains • Information about which regions of the two proteins co-evolved • The co-evolved domain(s) appear as a block of high mutual information scores in the matrix
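Given discretized residue profiles for two proteins, the matrix is one mutual information value per residue pair. The sketch below is the naive version, with cost proportional to l_i × l_j × m; the function name `coevolution_matrix` is hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def coevolution_matrix(res_profiles_i, res_profiles_j):
    """M[r][s] = I(profile of residue r on P_i, profile of residue s on P_j)."""
    return [
        [mutual_information(pr, ps) for ps in res_profiles_j]
        for pr in res_profiles_i
    ]


# Rows index residues of P_i, columns index residues of P_j.
M = coevolution_matrix(
    [[0, 0, 1, 1], [0, 1, 0, 1]],
    [[1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0]],
)
```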

Deriving phylogenetic similarity scores • The phylogenetic similarity score between two proteins $P_i$ and $P_j$ is $\mu_C(P_i, P_j) = \max_{1 \le r \le l_i,\; 1 \le s \le l_j} \; \min_{r \le a \le r+W,\; s \le b \le s+W} M_{ij}(a,b)$, where $W$ is the window parameter that sets the minimum size of a region on a protein to be considered a conserved domain.
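The max-min rule rewards a whole block of high scores rather than a single high cell. Below is a naive sketch, assuming that only windows lying entirely inside the matrix are considered (the slide does not say how boundaries are handled); `window_similarity` is a hypothetical name.

```python
def window_similarity(M, W):
    """mu_C: best (max) over window positions of the worst (min) entry
    in the (W+1) x (W+1) block starting at that position."""
    rows, cols = len(M), len(M[0])
    if rows <= W or cols <= W:
        raise ValueError("matrix is smaller than the window")
    best = float("-inf")
    for r in range(rows - W):        # assumption: window must fit in the matrix
        for s in range(cols - W):
            block_min = min(
                M[a][b]
                for a in range(r, r + W + 1)
                for b in range(s, s + W + 1)
            )
            best = max(best, block_min)
    return best


M = [
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.2],
    [0.1, 0.2, 1.0],
]
# With W = 1 the isolated 1.0 in the corner does not win; the top-left block does.
mu_c = window_similarity(M, 1)
```

With W = 0 the score reduces to the largest single entry of the matrix, so W controls how large a co-evolving region must be before it counts.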

Results • Implemented and tested on 4311 E. coli proteins • 152 genomes (131 Bacteria, 17 Archaea, 4 Eukaryota) • Down-sampling factor f = 30, window W = 2 • These values translate into overlapping segments 60 residues long • Excluded homologous proteins from the analysis • Define p-value as fraction of non-homologous protein pairs (N)

Results (contd.) • MIS – Mutual Information Score • PP – No. of predicted protein pairs • PPV = TP / (TP + FP) • For all μ*, coverage = TP + FP • TN and FN are the no. of protein pairs that do not meet the threshold
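The evaluation quantities on this slide can be sketched as follows. The score dictionary, the set of known interacting pairs, and the threshold are all hypothetical inputs; sensitivity is computed with the standard definition TP / (TP + FN).

```python
def evaluate(scores, known_pairs, threshold):
    """Return (ppv, coverage, sensitivity) for pairs scoring >= threshold.

    scores: {pair: mutual information score}, with pairs as hashable tuples
    known_pairs: set of pairs treated as true interactions
    """
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    tp = len(predicted & known_pairs)
    fp = len(predicted) - tp
    fn = len(known_pairs - predicted)
    ppv = tp / (tp + fp) if predicted else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    coverage = tp + fp  # number of predicted pairs (PP)
    return ppv, coverage, sensitivity


scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.2, ("c", "d"): 0.1}
known = {("a", "b"), ("b", "c")}
ppv, coverage, sensitivity = evaluate(scores, known, threshold=0.5)
```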

Results (contd.) • Co-evolutionary matrix has 1.5 times greater coverage at PPV = 0.7 than the single profile method • At same no. of PP, Co-evolutionary matrix has better PPV and sensitivity values than single profile method

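Results (contd.) • Mutual information score (MIS) distribution for interacting and non-interacting protein pairs • At an MIS of 0, the single-profile method (SP) shows a peak while the co-evolutionary matrix method (CM) does not. In other words, SP assigns low MIS scores to more protein pairs than CM does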

Results (contd.) • Shows p-values of the single-profile method vs. the co-evolutionary matrix method • The scattered circles show that the two methods can give very different predictions

Results (contd.) – Phosphotransferase system • Domain IIA (residues 1-170) and domain IIB (residues 170-320) • The darker region shows where the domains have co-evolved, so we can conclude that IIB evolved with IIC rather than IIA • Top-20 predicted interacting partners of protein IIAB for both methods

Results (contd.) – Chemotaxis • The N-terminus of CheA (residues 1-200) and the C-terminus of CheA (residues 540-670) co-evolved with the C-terminal region of CheB (residues 170-340) • Top-20 predicted interacting partners of protein CheA using both methods

Results (contd.) – Kdp System • N-terminal domain of KdpD (residues 1-395) co-evolved with KdpC • Top-10 predicted interacting partners of protein KdpD using both methods

Conclusion • Results in this paper strongly suggest that co-evolution of proteins should be captured at the domain level • Because domains with conflicting evolutionary histories can co-exist in a single protein sequence • Regions that are important for supporting both functional and physical interactions between proteins can be detected

Questions Thank You !!
