
Short introduction to perl & gff

Short introduction to perl & gff. Marcus Ronninger, The Linnaeus Centre for Bioinformatics. Motivation: Bioinformatics yields lots of information, and that information has to be mined, often by building or modifying text files. Small changes can take a long time with lots of data.


Presentation Transcript


  1. Short introduction to perl & gff. Marcus Ronninger, The Linnaeus Centre for Bioinformatics

  2. Motivation • Bioinformatics yields lots of information • The information has to be mined • Build or modify text files • Small changes can take a long time with lots of data • Example: Change every letter to lower case • With script programming this could be done in less than a second

  3. perl • Practical extraction and report language • Scripts • Object oriented programming • Graphical web interface, CGI • Possibilities • BioPerl

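  4. Example
     Example of a very simple perl script, to_lower_case.pl

         #!/usr/bin/perl -w
         use strict;

         # Usage: to_lower_case.pl <input file> <output file>
         my $seqfile = $ARGV[0];
         my $outfile = $ARGV[1];
         open (SEQ, $seqfile) || die "Can't open file: $seqfile";
         open (OUTFILE, "> $outfile") || die "Can't write file: $outfile";
         while (<SEQ>) {
             if ($_ =~ /^>/) {         # header line (starts with ">"): copy unchanged
                 print OUTFILE $_;
             }
             else {                    # sequence line: convert to lower case
                 print OUTFILE lc($_);
             }
         }
         close(SEQ);
         close(OUTFILE);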

  5. Useful tools for parsing files • Scalar $ • Array @ • Regular expression /\.fasta/ • Split, @chars = split //, $word • Substitute s/old-regex/new-string/ • Upper and lower case: uc, lc • Escape sequences and character classes: \n \t \s etc • Subroutines: sub
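The tools listed on this slide can be tried out together in a few lines. The following is only a minimal sketch: the sequence, the file name and the helper gc_count are made up for illustration and do not come from the presentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example data, made up for illustration.
my $seq      = "ACGTTGCA";
my $filename = "example.fasta";

# Scalar ($) and array (@): split the sequence into single characters.
my @chars = split //, $seq;

# Regular expression: does the file name end in ".fasta"?
my $is_fasta = ($filename =~ /\.fasta$/) ? 1 : 0;

# Substitute: replace every T with U in a copy of the sequence.
(my $rna = $seq) =~ s/T/U/g;

# Upper and lower case.
my $lower = lc $seq;
my $upper = uc $lower;

# sub: a small reusable routine that counts G and C characters.
sub gc_count {
    my ($s) = @_;
    my @hits = $s =~ /[GC]/gi;   # list of all single-character matches
    return scalar @hits;
}

print "chars:    ", scalar(@chars), "\n";
print "is fasta: $is_fasta\n";
print "rna:      $rna\n";
print "lower:    $lower\n";
print "GC count: ", gc_count($seq), "\n";
```

Note the escaped dot in /\.fasta$/: an unescaped dot would match any character, not only a literal ".".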

  6. General feature format, gff • AKA “gene finding format” • A format for handling output from different feature finding programs • Processes can be decoupled but the result can still be put together • Makes it easy to include external algorithms

  7. General feature format
     The construction of the format is very simple. The values are tab-delimited.

         SEQ1  EMBL  atg   103  105  .   +   0
         SEQ1  EMBL  exon  103  172  .   +   0
         1.    2.    3.    4.   5.   6.  7.  8.

     1. Sequence name
     2. Source of the feature
     3. Feature type
     4. Start
     5. End
     6. Score - most feature finding programs have some kind of score for the found motif
     7. Strand - can either be + or -
     8. Frame - 0, 1, 2 or .
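Because every record is a single tab-delimited line, reading the format back takes little more than a split. The sketch below assumes exactly the eight columns listed on the slide; the column names and the helper parse_gff_line are chosen here for illustration and are not part of the presentation.

```perl
use strict;
use warnings;

# Column names follow the eight fields listed on the slide.
my @columns = qw(seqname source feature start end score strand frame);

# Parse one tab-delimited GFF line into a hash reference.
# Returns undef (in scalar context) if the line has fewer than eight fields.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    return if @values < @columns;
    my %record;
    @record{@columns} = @values[0 .. $#columns];
    return \%record;
}

my $rec = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\n");

# Start and end are both part of the feature, hence the + 1.
my $length = $rec->{end} - $rec->{start} + 1;
print "$rec->{feature} on $rec->{seqname}: $length bp, strand $rec->{strand}\n";
```

A hash keyed by column name keeps later code readable ($rec->{start} rather than $values[3]), at the cost of a few more lines than a bare split.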

  8. Small example
     A small script that transforms known transcription factor binding sites into a .gff file

     Input:

         #Gfap
         #Known TFBS (Besnard et al 1991)
         #count backwards from the TSS
         #start -14
         AP-2: ccccaccccc -101
         NF-1: tgggctgcggccca -116
         Hgcs: ctgggctgcggc -117

  9. Example
     Basically the same procedure as the perl example above

         $seqlength = 5000;
         $gff = "";
         while (<LIT>) {
             if ($_ =~ /^#start/) {
                 $rel_start = $';      # text after "#start": the relative start position
             }
             elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {   # skip comment lines and empty lines
                 make_gff($_, $rel_start, "Literature");
             }
         }

  10. Example

          while (<LIT>) {
              if ($_ =~ /^#start/) {
                  $rel_start = $';      # text after "#start": the relative start position
              }
              elsif (!($_ =~ /^#/) && ($_ =~ /\w+/)) {   # skip comment lines and empty lines
                  make_gff($_, $rel_start, "Literature");
              }
          }

          sub make_gff {
              my $start;
              my $stop;
              my ($seq, $rs, $type) = @_;
              my @feature = split(/\s+/, $seq);
              # now the array has the feature information:
              # $feature[0] = name, $feature[1] = site sequence, $feature[2] = position
              if ($type eq "Literature") {
                  $start = $seqlength + $rs + $feature[2];
                  $stop = $start + length($feature[1]) - 1;
                  my $sign = '.';
                  $gff .= "$feature[0]\t$type\t$feature[0]\t$start\t$stop\tundef\t$sign\t$sign\n";
              }
              # etc.
          }

  11. Example
      Output: a file named lit.gff with the following contents

          AP-2:  Literature  AP-2:  4886  4895  undef  .  .
          NF-1:  Literature  NF-1:  4871  4884  undef  .  .
          Hgcs:  Literature  Hgcs:  4870  4881  undef  .  .

      This can now be loaded into programs that support the gff format, e.g. Apollo
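The steps on slides 8 to 11 can be put together into one self-contained sketch, with the input lines held in an array instead of a file. One detail is worth spelling out: the formula as written on slide 10 gives 5000 - 14 - 101 = 4885 as the start for AP-2, while the listed output starts at 4886. The sketch therefore takes an extra offset argument. That argument, and the name make_gff_line, are assumptions made here, not part of the original script; an offset of 1 reproduces the coordinates shown in lit.gff.

```perl
use strict;
use warnings;

my $seqlength = 5000;   # same value as on slide 9

# Input lines in the layout of the small example (name, site, position).
my @lit = (
    "#start -14",
    "AP-2: ccccaccccc -101",
    "NF-1: tgggctgcggccca -116",
    "Hgcs: ctgggctgcggc -117",
);

# Build one GFF line. $offset is an assumption (see the text above):
# 0 follows the slide formula literally, 1 matches the listed output.
sub make_gff_line {
    my ($line, $rel_start, $type, $offset) = @_;
    my ($name, $site, $pos) = split /\s+/, $line;
    my $start = $seqlength + $rel_start + $pos + $offset;
    my $stop  = $start + length($site) - 1;
    return join("\t", $name, $type, $name, $start, $stop, "undef", ".", ".") . "\n";
}

my $rel_start = 0;
my $gff = "";
for my $line (@lit) {
    if ($line =~ /^#start\s+(-?\d+)/) {
        $rel_start = $1;                      # relative start position
    }
    elsif ($line !~ /^#/ && $line =~ /\w+/) { # skip comment and empty lines
        $gff .= make_gff_line($line, $rel_start, "Literature", 1);
    }
}
print $gff;
```

Capturing the number with a regular expression instead of using $' avoids carrying the trailing newline along in $rel_start.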

  12. Apollo • GFF files are boring to view as they are • Use graphical software • Apollo, a sequence annotation editor • Great for viewing gff files together with the sequence

  13. References • Tisdall J.D., “Beginning Perl for Bioinformatics”, 2001, O’Reilly • http://www.sanger.ac.uk/Software/formats/GFF/ • http://www.fruitfly.org/annot/apollo/
