Soft Computing Tools for Gene Similarity Measures in Bioinformatics

James M. Keller, Mihail Popescu, and Joyce Mitchell
Electrical and Computer Engineering Department
Health Management and Informatics Department
University of Missouri-Columbia

Introduction
• The principal features of gene products are the sequence and the expression values following a microarray experiment
• Many (dis)similarity measures have been proposed to measure the closeness of sequences
• For many gene products, additional functional information comes from
  • the set of Gene Ontology (GO) annotations and
  • the set of journal abstracts related to the gene (MeSH annotations)
• For these genes, it is reasonable to include similarity measures based on these terms

Features for Gene Product Similarity
• ATM: human ataxia telangiectasia mutated
• STK11: serine/threonine kinase 11

Term-based Similarity
• Given two gene products, G1 and G2, we can consider them as being represented by collections of terms
• The goal is to define a natural similarity s(G1, G2)
• There are two main approaches:
  • Similarities between pairwise elements of the two sets are defined and then aggregated using a given fusion operator
  • The similarity degree is defined globally for the two entire sets; in a sense, the “aggregation” is performed before the similarity is computed

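First Approach
• Similarity is computed pair-wise: $s_{ij} = s(T_{1i}, T_{2j})$ for $T_{1i} \in G_1$ and $T_{2j} \in G_2$
• Aggregation is performed, here with the average: $s_a(G_1, G_2) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} s_{ij}$, where $n = |G_1|$ and $m = |G_2|$
• Problem with the average: even when the two sets are very similar, sa(G1, G2) may not be 1
• Can use Max, Min, or another order statistic instead
• Trouble with the maximum: if G1 and G2 have only one element in common, the similarity is 1, ignoring the others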
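The two failure modes named on this slide can be reproduced with a small sketch. It assumes a hypothetical exact-match term similarity (1 for identical terms, 0 otherwise) and made-up term labels; neither comes from the slides.

```python
from statistics import mean


def pairwise(g1, g2, sim):
    """Return all pairwise similarities s(T1i, T2j)."""
    return [sim(a, b) for a in g1 for b in g2]


def exact_match(a, b):
    # Hypothetical term similarity: 1 for identical terms, 0 otherwise.
    return 1.0 if a == b else 0.0


g = ["T1", "T2", "T3"]
h = ["T1", "T4", "T5"]

# Averaging: a set compared with itself still scores well below 1.
avg_identical = mean(pairwise(g, g, exact_match))

# Maximum: a single shared term is enough to reach 1.
max_one_shared = max(pairwise(g, h, exact_match))

print(avg_identical, max_one_shared)
```

With a graded term similarity the average for identical sets would be higher, but it would still fall short of 1 whenever the terms within a set differ from each other.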

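Second Approach
Set-based Measures
• Jaccard similarity: $s_J(G_1, G_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$
• Dice similarity: $s_D(G_1, G_2) = \frac{2\,|G_1 \cap G_2|}{|G_1| + |G_2|}$
Vector Space-based Measures
• Cosine similarity: $s_C(G_1, G_2) = \frac{\vec{G}_1 \cdot \vec{G}_2}{\|\vec{G}_1\| \, \|\vec{G}_2\|}$, where $\vec{G}_1$ and $\vec{G}_2$ are augmented vectors in an augmented space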

Our First Technique
• Based on the concept of Fuzzy Measures
• Idea:
  • Terms describing gene products can be given natural “weights” if they come from taxonomies, like the GO
  • Weights may be based on “information theory” or “depth in tree”
  • Weights might be assigned by experts
• Fuzzy measures allow the measure of the “whole” to be more (or less) than the “sum of its parts”

Fuzzy Measures
• Sources of information in a set G (sensors, features, algorithms, etc.)
• Here, G = {T1, …, Tn}, the set of terms describing a gene product
• Worth of sources comes from a Fuzzy Measure $g: 2^G \to [0,1]$ with
  • $g(\emptyset) = 0$ and $g(G) = 1$
  • $g(A) \le g(B)$ if $A \subseteq B$
  • If $\{A_i\}$ is an increasing sequence of subsets of G, then $\lim_{i \to \infty} g(A_i) = g\left(\bigcup_{i=1}^{\infty} A_i\right)$

Fuzzy Measures
• For a fuzzy measure g, let gi = g({Ti})
• The mapping Ti → gi is called a fuzzy density function
• The fuzzy density value gi is interpreted as the (possibly subjective) importance of the single information source Ti in determining the similarity of two genes
• General fuzzy measures are broad, but often the densities can be extracted from the problem domain or supplied by experts
• Need fuzzy measures that can be “built” from densities

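Fuzzy Measures
• A fuzzy measure g is called a λ-measure ($g_\lambda$-fuzzy measure) if additionally, for some $\lambda > -1$: $g(A \cup B) = g(A) + g(B) + \lambda\, g(A)\, g(B)$ for any $A, B \subseteq G$ with $A \cap B = \emptyset$
• For a λ-fuzzy measure on a finite set, λ can be uniquely determined by solving $1 + \lambda = \prod_{i=1}^{n} (1 + \lambda g_i)$ for the root $\lambda > -1$, $\lambda \neq 0$ (λ = 0 when the densities already sum to 1)
• Here G = {T1, …, Tn} and $g_i = g(\{T_i\})$ is interpreted as the (possibly subjective) importance of the single information source Ti in determining the evaluation of a hypothesis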

Fuzzy Measure Similarity
The new fuzzy measure similarity (FMS) between two sets G1 and G2 of terms is defined as $s_{FMS}(G_1, G_2) = \frac{g_1(G_1 \cap G_2) + g_2(G_1 \cap G_2)}{2}$, where g1 is a fuzzy measure defined on G1 and g2 is a fuzzy measure defined on G2

Example With densities (supplied by expert): Other measures: Both too low (intuition) for this example

FMS Calculation
• To calculate FMS, we need to generate the two measures
• For G1, λ = 3.1, which gives the measure of the common set under g1
• The λ-measure for G2 has λ = 0.37, which gives the measure of the common set under g2
• Hence, the FMS similarity is 0.61
• FMS is sensitive to the common elements: if the common elements have a high confidence, then the similarity is stronger, which agrees with our intuition about similarity
• In the vector cosine similarity, the non-common elements have no contribution (multiplied by zero); in FMS they contribute implicitly, since the fuzzy measures are defined a priori for each term set
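The calculation described on this slide can be sketched as follows. This is only an illustration under stated assumptions: FMS is taken to be the average of the two λ-measures evaluated on the common terms, λ is found by plain bisection on the standard Sugeno equation 1 + λ = Π(1 + λ·gi), and the term names and densities are hypothetical rather than the expert values from the example slide.

```python
from math import prod


def sugeno_lambda(densities, iterations=200):
    """Solve 1 + lam = prod(1 + lam * g_i) for the root lam > -1."""
    if len(densities) < 2 or not all(0.0 < d < 1.0 for d in densities):
        raise ValueError("need at least two densities strictly between 0 and 1")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to 1: the measure is additive

    def f(lam):
        return prod(1.0 + lam * d for d in densities) - (1.0 + lam)

    if total < 1.0:
        # Positive root: f < 0 just right of 0 and f > 0 for large lam.
        lo, hi = 0.0, 1.0
        while f(hi) <= 0.0:
            hi *= 2.0
        for _ in range(iterations):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    else:
        # Root in (-1, 0): f > 0 near -1 and f < 0 just left of 0.
        lo, hi = -1.0, 0.0
        for _ in range(iterations):
            mid = (lo + hi) / 2.0
            lo, hi = (mid, hi) if f(mid) > 0.0 else (lo, mid)
    return (lo + hi) / 2.0


def lambda_measure(subset_densities, lam):
    """g(A) for a subset A, given the densities of its elements."""
    if lam == 0.0:
        return sum(subset_densities)
    return (prod(1.0 + lam * d for d in subset_densities) - 1.0) / lam


def fms(dens1, dens2):
    """Assumed FMS: average of g1 and g2 evaluated on the common terms."""
    common = dens1.keys() & dens2.keys()
    if not common:
        return 0.0
    lam1 = sugeno_lambda(list(dens1.values()))
    lam2 = sugeno_lambda(list(dens2.values()))
    g1 = lambda_measure([dens1[t] for t in common], lam1)
    g2 = lambda_measure([dens2[t] for t in common], lam2)
    return 0.5 * (g1 + g2)


# Hypothetical term sets and densities, chosen only for illustration.
G1 = {"T1": 0.35, "T2": 0.25, "T3": 0.15}
G2 = {"T2": 0.5, "T3": 0.3, "T4": 0.4}
similarity = fms(G1, G2)
print(round(similarity, 3))
```

Because each measure is normalized so that g(G) = 1, comparing a set with itself under this definition gives 1, and disjoint sets give 0.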

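Augmented Sets
• Suppose that G1 and G2 are as before (terms from a taxonomy)
• What happens if $G_1 \cap G_2 = \emptyset$? Then $s_{FMS}(G_1, G_2) = 0$
• Augment each set as $G_1^+ = G_1 \cup N$ and $G_2^+ = G_2 \cup N$, where $N$ is the set of nearest common ancestors (NCA) of every pair $(T_{1i}, T_{2j})$, and calculate FMS on the augmented sets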
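A minimal sketch of the augmentation step, assuming a toy taxonomy stored as a child-to-parent map. The term names are hypothetical, and the single-parent tree is a simplification: the real GO is a directed acyclic graph in which a term may have several parents.

```python
# Hypothetical toy taxonomy: each term maps to its single parent.
PARENT = {"A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B"}


def ancestors(term, parent):
    """Path from a term up to the root, including the term itself."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def nca(a, b, parent):
    """Nearest common ancestor of two terms, or None if they share none."""
    seen = set(ancestors(a, parent))
    for t in ancestors(b, parent):
        if t in seen:
            return t
    return None


def augment(g1, g2, parent):
    """Add the NCA of every cross pair to both term sets."""
    shared = {nca(a, b, parent) for a in g1 for b in g2} - {None}
    return set(g1) | shared, set(g2) | shared


g1_aug, g2_aug = augment({"a1"}, {"a2"}, PARENT)
print(g1_aug & g2_aug)  # the intersection is no longer empty
```

The augmented sets always intersect when the taxonomy has a single root, so a fuzzy measure on the intersection is no longer forced to zero for disjoint annotations.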

Example Piece of Ontology with densities

Example (cont.)
• For G1, λ = 0.57
• For G2, λ = −0.15
• The resulting similarity is lower than when the intersection was non-empty, but not zero, since both sets share “near” common ancestors
• Note that both the Jaccard and the vector cosine similarities are 0

s(ATM, STK11) = ? (GO dimension)
• Retrieve LocusLink GO annotations:
• ATM = {
  4674: “protein serine/threonine kinase activity”,
  3677: “DNA binding”,
  4428: “inositol/phosphatidylinositol kinase activity”,
  7131: “meiotic recombination”,
  6281: “DNA repair”,
  7165: “signal transduction”,
  5634: “nucleus”,
  16740: “transferase activity”,
  45786: “negative regulation of cell cycle”}
• STK11 = {
  5524: “ATP binding”,
  4674: “protein serine/threonine kinase activity”,
  6468: “protein amino acid phosphorylation”,
  16740: “transferase activity”}
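As a quick check, the set-based measures can be computed directly on these two annotation lists. The sketch below uses the GO identifiers exactly as listed above; the cosine variant treats each gene as a binary indicator vector over the union of terms, which is an assumption and not necessarily the augmented-vector form used in the talk.

```python
from math import sqrt

# GO identifiers as listed on the slide.
ATM = {4674, 3677, 4428, 7131, 6281, 7165, 5634, 16740, 45786}
STK11 = {5524, 4674, 6468, 16740}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))


def cosine_binary(a, b):
    # Assumed binary indicator vectors: the dot product counts shared terms.
    return len(a & b) / sqrt(len(a) * len(b))


print(round(jaccard(ATM, STK11), 2))        # 0.18
print(round(dice(ATM, STK11), 2))           # 0.31
print(round(cosine_binary(ATM, STK11), 2))  # 0.33
```

The two genes share two of eleven distinct terms, so all three scores sit well below the "somewhat similar" level an expert would assign. The values 0.18 and 0.31 also appear in the Pair2 row of the results table later in the deck.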

Densities from the GO • Compute GO term densities using the Resnik formula, the normalized version [.] or the depth in the hierarchy (.)

Compute the Similarity
• Recall: the “expert” deemed these to be “somewhat similar”
• Similarity around 0.5?

Our Second Approach
• What if pairs of terms have both similarities and “importance” towards determining total gene similarity?
• For example, use the same or similar keywords to generate pair similarity and use depth in tree to create importance (fuzzy measure)
• Useful (we conjecture) for comparing based on abstracts
  • Keywords build pairwise similarities
  • Impact factors (or source of terms) give importance

Concept Use Choquet Fuzzy Integral to fuse

Abstract Term Example

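Choquet Fuzzy Integral
• Suppose that G1 and G2 are as before (terms from a taxonomy)
• Let $s_{ij} = s(T_{1i}, T_{2j})$ be the similarity of each pair of terms and $g_{ij}$ the importance (density) of that pair
• To simplify the notation, we reorder the abstract pairs and label them by a single subscript, so that Tk = (T1i, T2j) for some pair (i, j)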

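Choquet Fuzzy Integral
• Let g be a fuzzy measure on a finite set $X = \{x_1, \ldots, x_n\}$ and let $s: X \to [0,1]$
• Then the Choquet fuzzy integral of s with respect to g is given by $C_g(s) = \sum_{i=1}^{n} \left[ s(x_i) - s(x_{i+1}) \right] g(A_i)$
• where the function values are reordered so that $s(x_1) \ge s(x_2) \ge \cdots \ge s(x_n)$, with $s(x_{n+1}) = 0$ and $A_i = \{x_1, \ldots, x_i\}$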

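Choquet Fuzzy Integral
• Define $\delta_i = g(A_i) - g(A_{i-1})$, where $A_i$ holds the $i$ elements with the largest function values and $g(A_0) = 0$
• Then the Choquet fuzzy integral can be rewritten as $C_g(s) = \sum_{i=1}^{n} \delta_i \, s(x_i)$, with the values $s(x_i)$ sorted in decreasing order
• Looks linear, but isn’t: the weights depend on the sort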
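A minimal sketch of the discrete Choquet integral in both forms, assuming the standard definition over values sorted in decreasing order. The element names, similarity values, and the cardinality-based measure are hypothetical and serve only to show that the two forms agree.

```python
def choquet(values, g):
    """Sum of s(x_i) * (g(A_i) - g(A_{i-1})) over values sorted descending."""
    order = sorted(values, key=values.get, reverse=True)
    total, previous = 0.0, 0.0
    for i in range(1, len(order) + 1):
        current = g(frozenset(order[:i]))  # g(A_i): the i top-ranked elements
        total += values[order[i - 1]] * (current - previous)
        previous = current
    return total


def choquet_differences(values, g):
    """Equivalent form: sum of (s(x_i) - s(x_{i+1})) * g(A_i), s(x_{n+1}) = 0."""
    order = sorted(values, key=values.get, reverse=True)
    s = [values[x] for x in order] + [0.0]
    return sum(
        (s[i] - s[i + 1]) * g(frozenset(order[: i + 1]))
        for i in range(len(order))
    )


# Hypothetical pair similarities.
values = {"pair1": 0.9, "pair2": 0.5, "pair3": 0.1}


def g(subset):
    # Hypothetical measure that depends only on the subset size.
    return (len(subset) / len(values)) ** 2


print(choquet(values, g), choquet_differences(values, g))
```

With an additive measure the result reduces to an ordinary weighted sum, which is why the integral only looks linear: for a general measure, the weight attached to each value changes with the ranking.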

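OWA Operator
• Special case of the Choquet integral
• Define the weights $w_i = g(A_i) - g(A_{i-1})$ as above, where $A_i$ holds the $i$ top-ranked elements, normalized so that $\sum_{i=1}^{n} w_i = 1$
• Pick the fuzzy measure so that g(A) depends only on the cardinality of A; the weights $w_i$ then do not depend on which elements are ranked first
• Then C(s) is called the Ordered Weighted Average (OWA), a weighted sum of order statistics
• Only need to specify positional weights instead of the entire measure
• Similarity: $s_{OWA}(G_1, G_2) = \sum_{i=1}^{n} w_i \, s_{(i)}$, where $s_{(1)} \ge \cdots \ge s_{(n)}$ are the sorted pairwise similarities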
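A minimal sketch of the OWA as a weighted sum of order statistics. The similarity values and weight vectors are hypothetical; they are chosen to show how positional weights recover the maximum, the mean, and a rule that rewards a second good match.

```python
def owa(similarities, weights):
    """Ordered weighted average: weights apply to ranks, not to sources."""
    if len(similarities) != len(weights):
        raise ValueError("need one weight per similarity value")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(similarities, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))


sims = [0.2, 0.9, 0.6]  # hypothetical pairwise similarities

as_max = owa(sims, [1.0, 0.0, 0.0])           # all weight on the largest value
as_mean = owa(sims, [1 / 3, 1 / 3, 1 / 3])    # uniform weights give the average
second_best = owa(sims, [0.0, 1.0, 0.0])      # high only if two values are high

print(as_max, as_mean, second_best)
```

Putting the weight on the second rank is one way to encode a requirement that at least two pairs of terms agree before the overall similarity is high.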

Simple Example
• Pairwise GO similarity (information theoretic)
• “At least two terms should be similar to get high value”
• Then sOWA = 0.72 while sa = 0.43

Choquet Integral Fusion
• Suppose:
• For the GO annotations, the confidence might be assigned based on the annotation procedure: “traceable author statement” is high, “computer assisted analysis” is medium, and “not recorded” is low
• If a Sugeno measure is used with these densities, its parameter is λ = −0.77, resulting in a similarity:

Gene Ontology Example
Three pairs of genes:
• First pair: COL1A1 (human collagen, type 1, alpha 1) and Col1a1 (mouse collagen, type 1, alpha 1), very similar
• Second pair: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11), quasi-similar
• Third pair: COL1A1 and STK11, not similar at all
The GO term annotations for the four genes are found in LocusLink

Densities Info Theory: 1-p; depth-based:

Target Sim. Max. (1-p) Jacc. Ave. (norm) Dice Max. (norm) FMS (norm) OWA (norm) FMS (depth) OWA (1-p) Ave. (1-p) Pair1 Pair1 1 1 0.2 0.14 0.65 0.33 0.62 0.91 1 0.33 0.31 Pair2 Pair2 0.5 1 0.18 0.09 0.31 0.44 0.37 0.64 0.36 0.99 0.37 Pair3 Pair3 0.67 0 0 0.03 0.1 0 0.0 0.1 0.67 0 0.22 GO Similarity Measures

Matching by Abstract
• s(ATM, STK11) = ? (Abstract dimension)
• Algorithm:
  • Retrieve PubMed abstracts for ATM, STK11
  • Calculate all the pair-wise distances based on the MeSH indexing
  • Keep the 4 best-matching pairs
  • Find the impact factor for each journal: g(Ai), i = 1…8
ATM abstracts (PubMed ID, journal, impact factor):
• 12917635, Oncogene (6.737)
• 12970738, Oncogene (6.737)
• 14500819, Nucleic Acids Res. (6.373)
• 14499692, Science (23.329)
STK11 abstracts (PubMed ID, journal, impact factor):
• 12183403, Cancer Res (8.30)
• 12234250, Biochem J (4.326)
• 12805220, EMBO J. (12.459)
• 11853558, Biochem J (4.326)

Abstract Similarity Example
• Calculate the confidence of the pair: g(A1, A2) = g(A1)*g(A2), normalized to the maximum value
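The confidence step can be sketched with the journal impact factors listed on the previous slide. The pairing of abstracts below (first ATM abstract with first STK11 abstract, and so on) is a hypothetical stand-in, since the slides do not say which four pairs were the best matches.

```python
# Impact factors as listed for the ATM and STK11 abstracts.
atm_impact = [6.737, 6.737, 6.373, 23.329]
stk11_impact = [8.30, 4.326, 12.459, 4.326]

# Hypothetical pairing: i-th ATM abstract with i-th STK11 abstract.
pairs = list(zip(atm_impact, stk11_impact))

# g(A1, A2) = g(A1) * g(A2), then normalize to the maximum value.
raw = [a * b for a, b in pairs]
confidence = [r / max(raw) for r in raw]

print([round(c, 2) for c in confidence])
```

After normalization the most authoritative pair has confidence 1 and the rest are scaled relative to it, giving values that can serve as densities for a fuzzy measure.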

Abstract Similarity Example
Pairwise Similarity by FMS
• Weighted Average: sa(ATM, STK11) = 0.37
• Choquet Integral: sChoquet(ATM, STK11) = 0.53

Hot Off the Press
• Three families of proteins: microtubularin (1-21), receptor precursor (22-108), collagen alpha chain (109-194)
• Figure panels: GO Similarity by Average; GO Similarity by FMS

Conclusions
• Introduced soft computing methods to determine gene product similarity from taxonomy terms
  • Use fuzzy measures on the (augmented) term intersection set
  • Use fuzzy integrals to fuse confidence and “worth” (very general)
• Results can (should) be combined with sequence information and expression values for robust similarity
• Next step is to cluster on microarray experiment data
• Knowledge Discovery
  • Do the clusters found exhibit linguistic similarity?
  • If an unknown gene product maps into a cluster by sequence, does it share the linguistic properties?

You should always thank your friends: • National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11 supporting M. Popescu • And all of you!
