PathoLogic Pathway Predictor


Inference of Metabolic Pathways
[Figure: PathoLogic software integrates genome and pathway data to identify putative metabolic networks. Inputs: annotated genomic sequence (DNA sequences, genes/ORFs, gene products) and the multi-organism pathway database MetaCyc (pathways, reactions, compounds). Output: a Pathway/Genome Database containing pathways, reactions, compounds, gene products, genes, and the genomic map.]

PathoLogic Functionality
• Initialize schema for new PGDB
• Transform existing genome to PGDB form
• Infer metabolic pathways and store in PGDB
• Infer operons and store in PGDB
• Assemble Overview diagram
• Assist user with manual tasks
    • Assign enzymes to reactions they catalyze
    • Identify false-positive pathway predictions
    • Build protein complexes from monomers
    • Infer transport reactions

PathoLogic Input/Output
• Inputs:
    • File listing genetic elements
        • http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
    • Files containing DNA sequence for each genetic element
    • Files containing annotation for each genetic element
    • MetaCyc database
• Output:
    • Pathway/genome database for the subject organism
    • Reports that summarize:
        • Evidence contained in the input genome for the presence of reference pathways
        • Reactions missing from inferred pathways

PathoLogic Analysis Phases
• Trial parsing of input data files [few days]
• Initialize schema of new PGDB [3 min]
• Create DB objects for replicons, genes, proteins [5 min]
• Assign enzymes to reactions they catalyze [10 min / 1 week]
    • Examples: ferrochelatase; glutamate 1-semialdehyde 2,1-aminomutase; porphobilinogen deaminase

PathoLogic Analysis Phases (continued)
• From assigned reactions, infer what pathways are present [5 min / few days]
• Define metabolic overview diagram [30 min]
• Define protein complexes [few days]

genetic-elements.dat

ID	TEST-CHROM-1
NAME	Chromosome 1
TYPE	:CHRSM
CIRCULAR?	N
ANNOT-FILE	chrom1.pf
SEQ-FILE	chrom1.fsa
//
ID	TEST-CHROM-2
NAME	Chromosome 2
CIRCULAR?	N
ANNOT-FILE	/mydata/chrom2.gbk
SEQ-FILE	/mydata/chrom2.fna
//
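A file such as the one above can be checked before a trial parse with a few lines of Python. This is only a sketch: the function name `parse_genetic_elements` is hypothetical, and splitting each line on the first run of whitespace is an assumption about the delimiter, not something PathoLogic itself documents here.

```python
def parse_genetic_elements(text):
    """Return one dict per record; records are terminated by a '//' line."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        # Assumption: attribute and value are separated by whitespace (tab or spaces).
        parts = line.split(None, 1)
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:  # tolerate a final record with no terminating '//'
        records.append(current)
    return records


SAMPLE = """\
ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
"""

elements = parse_genetic_elements(SAMPLE)
for element in elements:
    print(element["ID"], element["ANNOT-FILE"], element["SEQ-FILE"])
```

A quick use of the result is to confirm that every element names both an annotation file and a sequence file before starting a build.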

File Naming Conventions
• One pair of sequence and annotation files for each genetic element
• Sequence files: FASTA format
    • suffix .fsa or .fna
• Annotation file:
    • GenBank format: suffix .gbk
    • PathoLogic format: suffix .pf

Typical Problems Using GenBank Files With PathoLogic
• Wrong qualifier names used: read the PathoLogic documentation!
• Extraneous information in a given qualifier
• Check results of trial parse carefully

GenBank File Format
• Accepted feature types: CDS, tRNA, rRNA, misc_RNA
• Accepted qualifiers ([req] = required, [recm] = recommended, [opt] = optional):
    • /locus_tag: unique ID [recm]
    • /gene: gene name [req]
    • /product [req]
    • /EC_number [recm]
    • /product_comment [opt]
    • /gene_comment [opt]
    • /alt_name: synonyms [opt]
    • /pseudo: gene is a pseudogene [opt]
• For multifunctional proteins, put each function in a separate /product line

PathoLogic File Format
• Each record starts with a line containing an ID attribute
• Tab delimited
• Each record ends with a line containing //
• One attribute-value pair is allowed per line
• Use multiple FUNCTION lines for multifunctional proteins
• Lines starting with ';' are comment lines
• Valid attributes are:
    • ID, NAME, SYNONYM
    • STARTBASE, ENDBASE, GENE-COMMENT
    • FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
    • DBLINK
    • INTRON

PathoLogic File Format (example)

ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
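The record rules listed for the PathoLogic format (tab-delimited attribute-value pairs, `//` terminators, `;` comment lines, repeated FUNCTION lines) can be illustrated with a small reader. This is a sketch, not the parser PathoLogic uses: the function name `parse_pf` and the choice of which attributes are collected into lists are assumptions made for the example.

```python
# Assumption: these attributes may repeat within a record and are kept as lists.
MULTI_VALUED = {"FUNCTION", "SYNONYM", "DBLINK", "EC", "INTRON"}


def parse_pf(text):
    """Parse PathoLogic-format annotation text into a list of record dicts."""
    records, current = [], None
    for line in text.splitlines():
        if not line.strip() or line.startswith(";"):
            continue  # skip blank lines and ';' comment lines
        if line.strip() == "//":
            if current is not None:
                records.append(current)
            current = None
            continue
        attr, _, value = line.partition("\t")  # one tab-delimited pair per line
        attr, value = attr.strip(), value.strip()
        if current is None:
            current = {}
        if attr in MULTI_VALUED:
            current.setdefault(attr, []).append(value)
        else:
            current[attr] = value
    if current is not None:  # tolerate a last record without '//'
        records.append(current)
    return records


SAMPLE_PF = "\n".join([
    "; example records taken from the slide above",
    "ID\tTP0734",
    "NAME\tdeoD",
    "STARTBASE\t799084",
    "ENDBASE\t799785",
    "FUNCTION\tpurine nucleoside phosphorylase",
    "DBLINK\tPID:g3323039",
    "PRODUCT-TYPE\tP",
    "//",
    "ID\tTP0735",
    "NAME\tgltA",
    "STARTBASE\t799867",
    "ENDBASE\t801423",
    "FUNCTION\tglutamate synthase",
    "DBLINK\tPID:g3323040",
    "PRODUCT-TYPE\tP",
])

genes = parse_pf(SAMPLE_PF)
for gene in genes:
    length = int(gene["ENDBASE"]) - int(gene["STARTBASE"]) + 1
    print(gene["ID"], gene["NAME"], length, gene["FUNCTION"])
```

Keeping FUNCTION as a list means a multifunctional protein with several FUNCTION lines is represented without losing any of them.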

Before you start: What to do when an error occurs
• Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
• All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger
    • Unix: the error message will show up in the original terminal window from which you started Pathway Tools.
    • Windows: the error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
• Two goals when an error occurs:
    • Try to continue working
    • Obtain enough information for a bug report to send to the Pathway Tools support team.

The Lisp Debugger
• Sample error (details and number of restart actions differ for each case):

    Error: Received signal number 2 (Keyboard interrupt)
    Restart actions (select using :continue):
     0: continue computation
     1: Return to command level
     2: Pathway Tools version 10.0 top level
     3: Exit Pathway Tools version 10.0
    [1c] EC(2):

• To generate debugging information (stack backtrace):
    :zoom :count :all
• To continue from an error, find a restart that takes you to the top level (in this case, number 2):
    :cont 2
• To exit Pathway Tools:
    :exit

How to report an error
• Determine if the problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
• Send email to ptools-support@ai.sri.com containing:
    • Pathway Tools version number and platform
    • Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
    • error.tmp file, if one was generated
    • If the software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on the previous slide)

Using the PPP GUI to Create a Pathway/Genome Database • Input Project Information • Organism -> Create New

Input Project Information

Next Steps
• Trial Parse
    • Build -> Trial Parse
    • Fix any errors in input files
• Build pathway/genome database
    • Build -> Automated Build

PathoLogic Parser Output

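Assign Enzymes to Reactions
[Flowchart: a gene product name (example: UDP-glucose-4-epimerase) is compared against MetaCyc (example: EC 5.1.3.2, UDP-D-glucose <=> UDP-galactose).]
• Match? yes: assign the enzyme to the reaction
• Match? no: is it a probable enzyme (name contains "-ase")?
    • no: not a metabolic enzyme
    • yes: manually search for the reaction
        • found: assign
        • not found: can't assign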

Enzyme Name Matcher
• Matches on full enzyme name
• Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
• Also matches after removal of prefixes and suffixes such as:
    • “Putative”, “Hypothetical”, etc.
    • alpha|beta|…|catalytic|inducible chain|subunit|component
    • Parenthetical gene name
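The matching rules above can be sketched in Python. This is an illustration under stated assumptions, not PathoLogic's matcher: the function names are hypothetical, only the prefixes and suffix words actually named on the slide are included (the slide's "etc." and "…" are left unfilled), and a "parenthetical gene name" is assumed to be a single trailing word in parentheses.

```python
import re

# Punctuation characters removed before comparison (from the slide; includes space).
PUNCTUATION = " -_(){}',:"
# Only the prefixes named on the slide; the real list is longer ("etc.").
PREFIXES = ("putative", "hypothetical")
# Only the suffix words named on the slide; the "…" is not filled in.
SUFFIX_RE = re.compile(
    r"\s+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)$", re.I
)
# Assumption: a parenthetical gene name is one trailing word in parentheses.
GENE_NAME_RE = re.compile(r"\s*\(\w+\)\s*$")


def canonical(name):
    """Lower-case the name and drop the listed punctuation characters."""
    return "".join(ch for ch in name.lower() if ch not in PUNCTUATION)


def strip_decorations(name):
    """Remove a parenthetical gene name, a known prefix, and a known suffix."""
    s = GENE_NAME_RE.sub("", name.strip())
    for prefix in PREFIXES:
        if s.lower().startswith(prefix + " "):
            s = s[len(prefix) + 1:]
            break
    return SUFFIX_RE.sub("", s).strip()


def names_match(product_name, reference_name):
    """Try the full name first, then the name with decorations removed."""
    target = canonical(reference_name)
    if canonical(product_name) == target:
        return True
    return canonical(strip_decorations(product_name)) == target


print(names_match("UDP-glucose-4-epimerase", "UDP-glucose 4-epimerase"))
print(names_match("putative UDP-glucose 4-epimerase (galE)", "UDP-glucose 4-epimerase"))
print(names_match("glutamate synthase", "glutamine synthetase"))
```

Trying the undecorated full name first keeps exact matches from being altered by the prefix and suffix stripping.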

Enzyme Name Matcher (unmatched names)
• For names that do not match, the software identifies probable metabolic enzymes as those
    • Containing “ase”
    • Not containing keywords such as
        • “sensor kinase”
        • “topoisomerase”
        • “protein kinase”
        • “peptidase”
        • Etc.
• Research unknown enzymes
    • MetaCyc, Swiss-Prot, PubMed
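The probable-enzyme test described above amounts to a substring check with an exclusion list. A minimal sketch, assuming only the four keywords named on the slide (the real exclusion list is longer) and a hypothetical function name:

```python
# Keywords named on the slide; the full list ("Etc.") is not reproduced here.
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    if "ase" not in lowered:
        return False
    return not any(keyword in lowered for keyword in EXCLUDED_KEYWORDS)


for product in (
    "porphobilinogen deaminase",
    "DNA topoisomerase I",
    "ribosomal protein L2",
):
    print(product, "->", is_probable_enzyme(product))
```

A plain substring test also accepts names such as "phase variation protein", which is one reason the probable-enzyme list still needs manual review.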

Enzyme Name to Reaction Mapping See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt

Manual Polishing
• Refine -> Assign Probable Enzymes (do this first)
• Refine -> Rescore Pathways (redo after assigning enzymes)
• Refine -> Create Protein Complexes (can be done at any time)
• Refine -> Assign Modified Proteins (can be done at any time)
• Refine -> Transport Identification Parser (can be done at any time)
• Refine -> Pathway Hole Filler
• Refine -> Predict Transcription Units
• Refine -> Update Overview (do this last, and repeat after any material changes to the PGDB)

Assign Probable Enzymes

How to find reactions for probable enzymes
• First, verify that the enzyme name describes a specific, metabolic function
• Search for a fragment of the name in MetaCyc; you may be able to find a match that PathoLogic missed
• Look up the protein in Swiss-Prot or other DBs
• Search for the gene name in the PGDB for a related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
• Search for the function name in PubMed
• Other…

Manual Polishing • Refine -> Assign Probable Enzymes • Refine -> Rescore Pathways • Refine -> Create Protein Complexes • Refine -> Assign Modified Proteins • Refine -> Transport Identification Parser • Refine -> Pathway Hole Filler • Refine -> Predict Transcription Units • Refine -> Run Consistency Checker • Refine -> Update Overview

Automated Pathway Inference
• All pathways in MetaCyc for which at least one enzyme has been identified in the target organism are considered for possible inclusion.
• The algorithm errs on the side of inclusivity: it is easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn't.

Considerations taken into account when deciding whether or not a pathway should be inferred:
• Is there a unique enzyme, that is, an enzyme not involved in any other pathway?
• Does the organism fall in the expected taxonomic domain of the pathway?
• Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
• If there is no unique enzyme:
    • Is there evidence for more than one enzyme?
    • If a biosynthetic pathway, is there evidence for the final reaction(s)?
    • If a degradation pathway, is there evidence for the initial reaction(s)?
    • If an energy metabolism pathway, is there evidence for more than half the reactions?

Assigning Evidence Scores to Predicted Pathways
• X|Y|Z denotes the score for pathway P in organism O, where:
    • X = total number of reactions in P
    • Y = number of reactions in P for which there is evidence of a catalyzing enzyme in O
    • Z = number of the Y reactions that are also used in other pathways in O
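A worked example of the X|Y|Z score may help. The sketch below uses invented pathway and reaction identifiers (P1, P2, r1 to r5), and it assumes that "other pathways in O" means the other candidate pathways supplied in the same dictionary; neither the function name nor the data comes from PathoLogic.

```python
def pathway_score(pathway, pathway_reactions, reactions_with_evidence):
    """Return the X|Y|Z evidence score for one pathway as a string."""
    reactions = pathway_reactions[pathway]
    x = len(reactions)
    # Y: reactions of this pathway that have enzyme evidence in the organism.
    present = [r for r in reactions if r in reactions_with_evidence]
    y = len(present)
    # Z: how many of those Y reactions also occur in some other pathway.
    z = sum(
        1
        for r in present
        if any(r in other for name, other in pathway_reactions.items() if name != pathway)
    )
    return f"{x}|{y}|{z}"


# Hypothetical data: two candidate pathways sharing reaction r3.
PATHWAY_REACTIONS = {
    "P1": ["r1", "r2", "r3", "r4"],
    "P2": ["r3", "r5"],
}
EVIDENCE = {"r1", "r3", "r5"}

print(pathway_score("P1", PATHWAY_REACTIONS, EVIDENCE))  # 4|2|1
print(pathway_score("P2", PATHWAY_REACTIONS, EVIDENCE))  # 2|2|1
```

In this toy case P1 scores 4|2|1: four reactions, two with evidence, one of which (r3) is shared with P2, so only r1 is evidence unique to P1.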

Manual Pruning of Pathways
• Use the pathway evidence report
    • Coloring scheme aids in assessing pathway evidence
• Phase I: Prune extra variant pathways
    • Rescore pathways, re-generate pathway evidence report
• Phase II: Prune pathways unlikely to be present
    • No/few unique enzymes
    • Most pathway steps present because they are used in another pathway
    • Pathway very unlikely to be present in this organism
    • Nonspecific enzyme name assigned to a pathway step

Caveats
• Cannot predict pathways not present in MetaCyc
• Evidence for short pathways is hard to interpret
• Since many reactions occur in multiple pathways, some predictions will be false positives

Output from PPP
• Pathway/genome database
• Summary pages
    • Pathway evidence page
        • Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
    • Missing enzymes report
• Directory tree containing sequence files, reports, etc.

Resulting Directory Structure
• ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
    • input
        • organism.dat
        • organism-init.dat
        • genetic-elements.dat
        • annotation files
        • sequence files
    • reports
        • name-matching-report.txt
        • trial-parse-report.txt
    • kb
        • ORGIDbase.ocelot
    • data
        • overview.graph
• released -> VERSION


Creating Protein Complexes

Complex Subunits Stoichiometries

Manual Polishing • Refine -> Assign Probable Enzymes • Refine -> Re-run Name Matcher • Refine -> Create Protein Complexes • Refine -> Assign Modified Proteins • Refine -> Transport Identification Parser • Refine -> Pathway Hole Filler • Refine -> Predict Transcription Units • Refine -> Run Consistency Checker • Refine -> Update Overview

Proteins as Reaction Substrates


Nomenclature • WO pair = pair of genes within an operon • TUB pair = pair of genes at a transcription unit boundary (delineate operons)

Operation of the operon predictor
• For each contiguous gene pair, predict whether the pair is within the same operon or at a transcription unit boundary
• Use the pairwise predictions to identify potential operons
• Example for five contiguous genes A B C D E:
    • AB = TUB pair
    • BC = WO pair
    • CD = WO pair
    • DE = TUB pair
    • Resulting operon = BCD
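Turning pairwise WO/TUB predictions into transcription units is a single pass over the gene list. The sketch below reproduces the A to E example; the function name `group_transcription_units` is hypothetical, and it assumes the pair labels are already available, one per adjacent gene pair.

```python
def group_transcription_units(genes, pair_labels):
    """Group ordered genes into units, splitting at every TUB pair.

    pair_labels[i] is 'WO' or 'TUB' for the pair (genes[i], genes[i + 1]).
    """
    if not genes:
        return []
    if len(pair_labels) != len(genes) - 1:
        raise ValueError("need exactly one label per adjacent gene pair")
    units, current = [], [genes[0]]
    for gene, label in zip(genes[1:], pair_labels):
        if label == "WO":
            current.append(gene)   # same operon as the previous gene
        else:
            units.append(current)  # boundary: close the current unit
            current = [gene]
    units.append(current)
    return units


units = group_transcription_units(list("ABCDE"), ["TUB", "WO", "WO", "TUB"])
operons = [unit for unit in units if len(unit) > 1]
print(units)    # [['A'], ['B', 'C', 'D'], ['E']]
print(operons)  # [['B', 'C', 'D']]
```

Units with a single gene are kept in `units` so that every gene is accounted for; only multi-gene units are reported as operons.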

Operon predictor
• Predicts operon gene pairs based on:
    • intergenic distance between genes
    • genes in the same functional class
    • (both are typically used for operon prediction)
• We use the method from Salgado et al., PNAS (2000) as a starting point.
    • Uses experimentally verified E. coli data as a training set.
    • Compute the log likelihood of two genes being a WO or TUB pair based on intergenic distance.
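One way to turn training distances into a log-likelihood score is to bin the distances and compare the relative frequencies of WO and TUB pairs in each bin. The sketch below is an assumption-laden illustration, not the published method or the Pathway Tools code: the 10 bp bin width, the pseudocount, the function names, and the toy training distances are all invented for the example.

```python
import math
from collections import Counter

BIN_WIDTH = 10  # bp per distance bin (assumption)


def distance_bin(distance):
    # Floor division keeps negative distances (overlapping genes) in their own bins.
    return distance // BIN_WIDTH


def train_distance_model(wo_distances, tub_distances, pseudocount=1.0):
    """Return a function giving log(P(bin | WO) / P(bin | TUB)) for a distance."""
    wo_counts = Counter(distance_bin(d) for d in wo_distances)
    tub_counts = Counter(distance_bin(d) for d in tub_distances)
    n_bins = len(set(wo_counts) | set(tub_counts))
    # The pseudocount (an assumption) avoids log(0) for bins seen in one class only.
    wo_total = len(wo_distances) + pseudocount * n_bins
    tub_total = len(tub_distances) + pseudocount * n_bins

    def log_likelihood(distance):
        b = distance_bin(distance)
        p_wo = (wo_counts[b] + pseudocount) / wo_total
        p_tub = (tub_counts[b] + pseudocount) / tub_total
        return math.log(p_wo / p_tub)

    return log_likelihood


# Toy training data (invented): WO pairs tend to be close, TUB pairs far apart.
WO_TRAINING = [5, 8, 12, -4, 3, 15]
TUB_TRAINING = [120, 250, 90, 300, 8, 180]

score = train_distance_model(WO_TRAINING, TUB_TRAINING)
print(score(6))    # positive: short distances favour a WO pair
print(score(125))  # negative: long distances favour a TUB pair
```

A positive score favours the within-operon call and a negative score favours a transcription unit boundary; a real model would be trained on the experimentally verified E. coli pairs rather than on toy numbers.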

Operon predictor: additional features easily computed from a PGDB
1. Both gene products are enzymes in the same metabolic pathway
2. Both gene products are monomers in the same protein complex
3. One gene product transports a substrate for a metabolic pathway in which the other gene product is involved as an enzyme
4. A gene upstream or downstream from the gene pair (and within the same directon) is related to either one of the genes in the pair as per features 1, 2, and 3 above
