
De novo sequencing




  1. De novo sequencing: practical session. Salvador Martínez de Bartolomé, Bioinformatics Support, ProteoRed, smartinez@proteored.org

  2. Why de novo sequencing is difficult
     • Leucine and isoleucine have the same mass
     • Glutamine and lysine differ in mass by 0.036 Da
     • Phenylalanine and oxidized methionine differ in mass by 0.033 Da
     • Cleavages do not occur at every peptide bond (or cannot be observed in the MS/MS spectrum)
     • Poor-quality spectra (some fragment ions are below the noise level)
     • The C-terminal side of proline is often resistant to cleavage
     • Absence of mobile protons
     • Peptides with free N-termini often lack fragmentation between the first and second amino acids

  3. Why de novo sequencing is difficult (II)
     • Certain amino acids have the same, or nearly the same, mass as pairs of other amino acids:
         Gly + Gly (114.0429)    Asn (114.0429)
         Ala + Gly (128.0586)    Gln (128.0586)
         Ala + Gly (128.0586)    Lys (128.0950)
         Gly + Val (156.0899)    Arg (156.1011)
         Ala + Asp (186.0641)    Trp (186.0793)
         Ser + Val (186.1004)    Trp (186.0793)
     • The directionality of an ion series is not always known (are they b- or y-ions?)
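The pair-versus-single-residue coincidences listed on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original presentation: the function name `ambiguous_pairs` and the tolerance values are illustrative assumptions, and the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "D": 115.02694, "N": 114.04293, "Q": 128.05858, "K": 128.09496,
    "R": 156.10111, "W": 186.07931,
}


def ambiguous_pairs(tolerance_da):
    """List (pair, single, delta) where a two-residue sum lies within
    tolerance_da of a single residue mass. Hypothetical helper."""
    hits = []
    letters = sorted(RESIDUE_MASS)
    for i, first in enumerate(letters):
        for second in letters[i:]:  # unordered pairs, including X + X
            pair_mass = RESIDUE_MASS[first] + RESIDUE_MASS[second]
            for single, mass in RESIDUE_MASS.items():
                delta = abs(pair_mass - mass)
                if delta <= tolerance_da:
                    hits.append((first + second, single, round(delta, 4)))
    return hits


if __name__ == "__main__":
    for pair, single, delta in ambiguous_pairs(0.05):
        print(f"{pair[0]} + {pair[1]} vs {single}: {delta} Da apart")
```

With a 0.05 Da tolerance this small table reports all six combinations from the slide. Tightening the tolerance to 0.001 Da leaves only Gly + Gly / Asn and Ala + Gly / Gln, which no mass accuracy can separate.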

  4. De novo sequencing
     • Two bioinformatic approaches
     • Global approach: all possible peptides Pi with mass Mi are enumerated. A theoretical fragmentation spectrum T(Pi) is then generated for each candidate amino acid sequence, and each T(Pi) is compared with the experimental spectrum S. The solution is the peptide sequence Pi whose theoretical spectrum T(Pi) is most similar to S.
     • Local approach: the peaks in the spectrum are used to reduce the number of candidates that have to be generated. Sequencing starts at the N-terminus by checking for a peak whose mass difference from the terminus matches the mass of an amino acid. The process then continues in the same way, checking for further peaks in the spectrum.
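The local approach can be illustrated with a toy ladder walk. This is a minimal sketch under stated assumptions, not the algorithm used by any of the tools in this course: it assumes the peak list contains only singly charged b-ions, uses a reduced residue table (isoleucine is omitted because it cannot be told apart from leucine), and the function name `local_sequences`, the tolerance, and the example peaks are all hypothetical.

```python
PROTON = 1.00728  # mass of a proton in Da; a b-ion is the residue sum plus one proton

# Reduced table of monoisotopic residue masses (Da), for illustration only.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
}


def local_sequences(peaks, tol=0.02, start=PROTON):
    """Read residue ladders from the N-terminus: from the current position,
    look for a peak one residue mass away, then repeat from that peak."""

    def extend(current):
        found = False
        for residue, mass in RESIDUE_MASS.items():
            for peak in peaks:
                if abs(peak - (current + mass)) <= tol:
                    found = True
                    for tail in extend(peak):
                        yield residue + tail
        if not found:
            yield ""  # no further peak explains a residue: the ladder ends here

    return sorted(set(extend(start)))


if __name__ == "__main__":
    # Hypothetical b-ion series of the peptide G-A-S-P (b1 to b4).
    b_ions = [58.0287, 129.0659, 216.0979, 313.1506]
    print(local_sequences(b_ions))  # ['GASP', 'QSP']
```

The second candidate, QSP, appears because Gly + Ala has the same mass as Gln. If the b1 peak were missing, only QSP would be found.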

  5. De novo sequencing
     • Improved approach: local or global?
     • Global algorithms have two main problems: the exponential growth in the number of candidates, and the computing time needed to compare the theoretical spectra generated from those candidates with the experimental spectrum. Both problems depend on the mass to be determined, although a solution, if one exists, is always found.
     • Solution: the region of the spectrum handled by the global approach is kept to a minimum so that these drawbacks do not become a problem.
     • Local algorithms are faster and more targeted, but they do not produce good solutions in poorly populated regions of the spectrum (few peaks).
     • Solution: in this system, a local search is therefore run only in regions with a considerable abundance of peaks.

  6. Summary of de novo sequencing tools
     Software                               Source website
     PEAKS*                                 www.bioinformaticssolutions.com
     SeqMS (download)                       www.protein.osaka-u.ac.jp/rcsfp/profiling/SeqMS.html
     Sherenga (included in SpectrumMill)*   N/A
     Lutefisk (download)                    www.hairyfatguy.com/Lutefisk
     DeNovoX*                               www.thermo.com
     PepNovo                                peptide.ucsd.edu/pepnovo.py
     SpectrumMill*                          www.home.agilent.com
     * Commercialized

  7. De novo practical exercises

  8. Lutefisk

  9. Lutefisk is software for the de novo interpretation of peptide CID spectra
     • http://www.hairyfatguy.com/lutefisk/
     • To run Lutefisk, you need four files in the same directory or folder:
         • CID data file (data files can be specified with a full or partial pathname)
         • Lutefisk.details
         • Lutefisk.params
         • Lutefisk.residues
     • One additional file is optional:
         • Database.sequence
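Before launching Lutefisk it can help to confirm that the files named on this slide are in place. The following is a small sketch, not something shipped with Lutefisk: the helper name `check_lutefisk_dir` and the example paths are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the slide.
REQUIRED = ("Lutefisk.details", "Lutefisk.params", "Lutefisk.residues")
OPTIONAL = ("Database.sequence",)


def check_lutefisk_dir(folder, cid_file):
    """Return (missing, optional_present) for a Lutefisk working folder.

    cid_file may be a full path or a path relative to folder.
    """
    folder = Path(folder)
    cid_path = Path(cid_file)
    if not cid_path.is_absolute():
        cid_path = folder / cid_path

    missing = [name for name in REQUIRED if not (folder / name).is_file()]
    if not cid_path.is_file():
        missing.append(str(cid_file))
    optional_present = [name for name in OPTIONAL if (folder / name).is_file()]
    return missing, optional_present


if __name__ == "__main__":
    # Hypothetical folder and spectrum file; adjust to your own setup.
    missing, optional = check_lutefisk_dir(".", "spectrum.dta")
    print("Missing:", missing or "nothing")
    print("Optional files found:", optional or "none")
```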

  10. ‘Lutefisk.details’ file
     • The Lutefisk.details file contains the so-called "ion probabilities" for each type of ion.
     • Each column in the file contains the "ion probabilities" for a different fragmentation pattern (see the description of "fragmentation patterns" below).
     • Only two fragmentation patterns are currently coded, both for low-energy CID of tryptic peptides: one for triple quadrupole (or Qtof) instruments and one for ion traps. Their ion probabilities are listed in the second and third columns. The first column is not used (oddly enough).

  11. ‘Lutefisk.residues’ file
     • The Lutefisk.residues file contains the single-letter code, monoisotopic mass, average mass, and nominal mass for each amino acid.
     • To add a residue to the list, replace the 0's in one of the rows with the corresponding monoisotopic, average, and nominal masses.
     • Up to five additional non-standard residues can be entered here; they are given the single-letter codes J, O, U, X, or Z.

  12. ‘Database.sequence’ file
     • The Database.sequence file is a text file containing a sequence, or a list of sequences, that might have been derived from a sequence database search.
     • In the final steps, where it scores the candidate sequences, Lutefisk adds these database-derived sequences to the de novo candidates to determine whether the database sequences are as good as or better than the de novo sequences. If so, this is evidence that the database-derived sequences might actually be correct.

  13. ‘Lutefisk.params’ file 241103plata_bernabe.369.369.2.dta

  14. ‘Lutefisk.params’ file

  15. Lutefisk help >lutefisk.exe -h

  16. Run Lutefisk.exe
     • Once all files are configured correctly, open a command prompt in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\LUTEFISK” folder and type:
       >Lutefisk.exe

  17. Output from Lutefisk: the .lut file
     • The candidate sequences are ranked according to Pr(C), the estimated probability of being correct.
     • Four further scores are given:
         • Pevzscr is an adaptation of the ideas presented by Dancik et al. (J. of Comput. Biol (1999) Vol 6, 327). It penalizes the absence of expected ions and accounts for the possibility of random matches.
         • Quality is the percentage of the peptide mass that can be accounted for by a contiguous ion series.
         • Intscr is the percentage of the fragment ion intensity that can be accounted for as b, y, internal fragment, etc., ions.
         • X-corr is the cross-correlation score, normalized by its auto-correlation score.

  18. PepNovo

  19. PepNovo
     • The scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern peptide fragmentation
     • Specific to ion trap data

  20. PepNovo
     • PepNovo was developed at the University of California, San Diego.
     • PepNovo uses a probabilistic network to model the peptide fragmentation events in a mass spectrometer.
     • It is available online at http://bix.ucsd.edu/MassSpec/ and also as an in-house installation.

  21. PepNovo
     • PepNovo runs via command-line arguments:
         • -file <full path to input file> to specify a single input file (mgf, dta, mzxml), or
         • -list <full path to txt file> to give a list of input files (this is the preferred method for large numbers of files, since the models are not reread for each input file).
         • -model <model name> (currently only CID_IT_TRYP is available)

  22. PepNovo
     • Optional PepNovo arguments:
         • -prm - only prints spectrum graph nodes with scores.
         • -prm_norm - prints spectrum graph scores after normalization and removal of negative scores.
         • -correct_pm - finds optimal precursor mass and charge values.
         • -use_spectrum_charge - does not correct the charge.
         • -use_spectrum_mz - does not correct the precursor m/z value that appears in the file.
         • -no_quality_filter - does not remove low-quality spectra.
         • -fragment_tolerance < 0-0.75 > - the fragment tolerance (each model has a default setting)
         • -pm_tolerance < 0-5.0 > - the precursor mass tolerance (each model has a default setting)
         • -PTMs <PTM string> - PTMs separated by colons (no spaces), e.g., M+16:S+80:N+1
         • -digest <NON_SPECIFIC,TRYPSIN> - default TRYPSIN
         • -num_solutions < number > - default 20
         • -tag_length < 3-6 > - returns peptide sequence tags of the specified length (only lengths 3-6 are allowed).
         • -model_dir < path > - directory where model files are kept (default ./Models)

  23. PepNovo
     • Using PepNovo:
       >PepNovo.exe -list paths_of_lots_of_spectra.txt -model CID_IT_TRYP -PTMs C+57:M+16 -digest TRYPSIN
       This command runs PepNovo on all the spectra files listed in “paths_of_lots_of_spectra.txt”. It assumes that the peptides were digested with trypsin, that the cysteines are carbamidomethylated (C+57), and that the methionines can be oxidized (M+16). The output is the default of 20 sequences.
       >PepNovo.exe -file my_great_spectra.mgf -model CID_IT_TRYP -PTMs C+57:M+16 -digest NON_SPECIFIC -tag_length 3 -num_solutions 50
       This command runs PepNovo on a single mgf file and generates 50 tags of length 3 for each spectrum (it assumes that the digest was not with trypsin).
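When many runs have to be scripted, the command lines above can be assembled programmatically. The sketch below only builds the argument list from the flags documented on the previous slides; the function name `pepnovo_command` and its parameters are illustrative assumptions and are not part of PepNovo.

```python
def pepnovo_command(spectra, model="CID_IT_TRYP", ptms=(), digest=None,
                    tag_length=None, num_solutions=None, is_list=False,
                    exe="PepNovo.exe"):
    """Build a PepNovo argument list (hypothetical helper).

    spectra is a single input file, or a text file of paths if is_list=True.
    """
    cmd = [exe, "-list" if is_list else "-file", spectra, "-model", model]
    if ptms:
        # PTMs are joined with colons and no spaces, e.g. C+57:M+16
        cmd += ["-PTMs", ":".join(ptms)]
    if digest is not None:
        if digest not in ("TRYPSIN", "NON_SPECIFIC"):
            raise ValueError("digest must be TRYPSIN or NON_SPECIFIC")
        cmd += ["-digest", digest]
    if tag_length is not None:
        if not 3 <= tag_length <= 6:
            raise ValueError("only tag lengths 3-6 are allowed")
        cmd += ["-tag_length", str(tag_length)]
    if num_solutions is not None:
        cmd += ["-num_solutions", str(num_solutions)]
    return cmd


if __name__ == "__main__":
    cmd = pepnovo_command("my_great_spectra.mgf", ptms=["C+57", "M+16"],
                          digest="NON_SPECIFIC", tag_length=3, num_solutions=50)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` from the folder that contains the executable and its Models directory.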

  24. PepNovo
     • Using PepNovo (in the “C:\Documents and Settings\Bioworks32\Desktop\denovo\pepNovo” folder):
       >PepNovo.exe -file 241103plata_bernabe.369.369.2.dta -model CID_IT_TRYP -digest TRYPSIN
       >PepNovo.exe -file FQSEEQQQTEDELQDK.dta -model CID_IT_TRYP -digest TRYPSIN

  25. PepNovo
     • PepNovo output:
     • The output gives the following tab-delimited fields for each MS/MS spectrum:
         • Idx - the sequence/tag rank (starts at 0)
         • RnkScr - the ranking score (the major score that is used)
         • PnvScr - the PepNovo score of the sequence (see Anal Chem 2005 and JPR 2006 for more details on the score)
         • N-Gap - the mass gap from the N-terminus to the start of the de novo sequence
         • C-Gap - the mass gap from the C-terminus to the end of the de novo sequence
         • Sequence - the predicted amino acid sequence
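The tab-delimited fields described on this slide are easy to load for further filtering. The following is a rough sketch that assumes each candidate line holds exactly the six fields listed above, in that order. Real PepNovo output may contain header lines or extra columns, which this parser simply skips. The function name `parse_candidates` and the sample values are invented for illustration and are not real results.

```python
FIELDS = ["Idx", "RnkScr", "PnvScr", "N-Gap", "C-Gap", "Sequence"]


def parse_candidates(text):
    """Parse candidate lines with the six fields named on the slide.

    Lines that do not have six tab-separated values starting with an
    integer rank are ignored (assumed to be headers or comments).
    """
    candidates = []
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue
        record = dict(zip(FIELDS, parts))
        candidates.append({
            "Idx": int(record["Idx"]),
            "RnkScr": float(record["RnkScr"]),
            "PnvScr": float(record["PnvScr"]),
            "N-Gap": float(record["N-Gap"]),
            "C-Gap": float(record["C-Gap"]),
            "Sequence": record["Sequence"],
        })
    return sorted(candidates, key=lambda c: c["Idx"])


if __name__ == "__main__":
    # Made-up example lines, for illustration only.
    sample = (
        "1\t5.2\t60.1\t128.095\t0.0\tSEEQQQTEDELQDK\n"
        "0\t7.9\t75.4\t0.0\t0.0\tFQSEEQQQTEDELQDK\n"
    )
    best = parse_candidates(sample)[0]
    print(best["Sequence"], best["RnkScr"])
```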

  26. Pepnovo http://bix.ucsd.edu/MassSpec/
