190 likes | 362 Views
Lab7. Twinscan, HMMER, PFAM. TWINSCAN. TwinScan. TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the gene structure in the target and the evolutionary conservation derived from " informant " genomic sequences.
E N D
Lab7 Twinscan, HMMER, PFAM
TwinScan • TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the gene structure in the target and the evolutionary conservation derived from "informant" genomic sequences. • The target sequence (i.e. the sequence to be annotated) should generally be of draft or finished quality. • The informant can range from a single sequence to a whole genome in any condition from raw shotgun reads to finished assembly. • References • http://mblab.wustl.edu/media/publications/Hu-Brent-2003-Proof.pdf • http://mrw.interscience.wiley.com/emrw/9780471250951/cp/cpbi/article/bi0408/current/abstract • http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S140
Requirements • TwinScan/ Nscan 4.0 executable • DNA sequence in FASTA format • Twinscan parameter file (/parameters/twinscan_parameters) • Each filename contains the name of the target organism: eg maize_twinscan.zhmm. • Conservation sequence (see examples/example.conseq) • A symbolic representation of the the best alignments between the target and informant sequences. • To create this conservation sequence, WU-BLAST (http://blast.wustl.edu) is used. • For NCBI BLAST, the input parameters need to be changed. (see parameters/examples/example_blast_parameters.txt) • xdformat (WU-BLAST package) formats the informant sequences to create a blast database. • After running BLAST, the output must be formatted with conseq, which is included in this package. • Example (using WU-BLAST) • xdformat -n informant.fa • Blast M=1 N=-1 Q=5 R=1 B=10000 V=100 -cpus=1 -warnings -lcfilter filter=seg filter=dust topcomboN=1 informant.fa target.fa > blast.out • conseq target.fa blast.out > conseq.fa • Note: The runTwinscan2 script will run these steps without user intervention (see below). • EST sequence (optional) • EST sequence is a symbolic representation of evidence from ESTs that align to the target sequence. • The estseq script included in the distribution creates EST sequence when given a DNA sequence and a (set of) BLAT reports of the the ESTs aligned to the target. • For downloading BLAT, go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3
Web service and Installation • Web service • http://mblab.wustl.edu/nscan/submit/ • Local installation • http://mblab.wustl.edu/software/twinscan/ • Installation step $ tar xvzf iscan-4.1.2.tar.gz $ cd N-SCAN $ make linux $ ./test-executable
How to run • Usage twinscan <parameter file> <masked sequence file> -c=<conseq file> [-e=estseq_file] > <outputfile> • Example: ./bin/iscan ./parameters/twinscan_parameters/human_twinscan.zhmm ./examples/example.fa.masked -c=./examples/example.conseq > mySequence.gtf
Running Twinscan using the runTwinscan2 script • In summary, there are 5 steps required to run Twinscan: • Step 1: Mask target sequence with RepeatMasker • Step 2: Create informant BLAST database • Step 3: Run BLAST • Step 4: Create conservation sequence • (Step 4b: Create EST sequence) • Step 5: Run Twinscan • These five steps are all contained in the example script runTwinscan2 (see ./bin) • The default BLAST parameters used by runTwinscan2 are those for C.elegans (see parameters/blast_params/Celegans.blast.param). • This can and should be changed for any other species with the -B option to the runTwinscan2 script. • The file example.output in the /examples directory contains the output from runTwincan2 using the BLAST parameters found in the script.
Local environment setting for runTwinscan2 • Example • ../bin/runTwinscan2 -r ../parameters/twinscan_parameters/human_twinscan.zhmm -d output -B ../parameters/blast_params/Hsapiens.blast.param example.fa.masked informant.fa • After running you can find output files in the newly created /output directory. • Several programs must be installed on your system to run runTwinscan. • You may need to change runTwinscan to point it to these programs. • my $REPEATMASKER = "RepeatMasker"; # Format for local environment • my $BLASTN = "blastn"; # Format for local environment • my $BLAT = "blat"; # Format for local environment • my $XDFORMAT = "xdformat"; # Format for local environment • my $PRESSDB = "pressdb"; # Format for local environment
Sean Eddy’s Lab • http://selab.janelia.org/software.html
Introduction • HMMER2 is an implementation in UNIX (Linux, MacOS) platform of profile hidden Markov model, whose source code, executables, and user guide can be downloaded from http://hmmer.janelia.org/ • The experiment of HMMER is to look for known domains in a query sequence by searching a single sequence again a library of HMMs. • One such library is PFAM, and you can also create your own library using HMMER
HMMER executables • hmmalign ‐ Align sequences to an existing model. • hmmbuild‐ Build a model from a multiple sequence alignment. • hmmcalibrate‐ Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E‐values). • hmmconvert ‐ Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles. • hmmemit ‐ Emit sequences probabilistically from a profile HMM. • hmmfetch ‐ Get a single model from an HMM database. • hmmindex ‐ Index an HMM database. • hmmpfam ‐ Search an HMM database for matches to a query sequence. • hmmsearch‐ Search a sequence database for matches to an HMM.
Installation • Simple installation • Download the current version of HMMER “hmmer‐2.3.2.bin.intel‐linux.tar.gz” from http://hmmer.janelia.org/#download; • Unpack the software by typing “tar –xvf hmmer‐2.3.2.bin.intel‐linux.tar.gz” in the command line. You will see a new directory “hmmer‐2.3.2.bin.intel‐linux”. • Enter the directory of hmmer‐2.3.2.bin.intel‐linux. You will see NINE executables ready in the subdirectory “/binaries”, and also nine files in the subdirectory “/tutorial”; • Installation from source code • Download the current HMMER source code version “hmmer‐2.3.2.tar.gz” from http://hmmer.janelia.org/#download; • Create a new directory in your Linux account and upload or move the software package to the directory; • Unpack the software by typing “tar –xvf hmmer‐2.3.2.tar.gz” in the command line; • Type “cd hmmer‐2.3.2” to enter the software directory; • Type “./configure” to configure for your system and build the programs; • Type “make” to generate the executables; • Type “make check” to run the automated test suite; (This is optional but recommended, and all these tests should pass); • Please note that by default programs are in “/usr/local/bin/” and man pages are in “/usr/local/man/man1”; • Type “make install” to install all executables;
Hmmbuild : build a profile HMM from an aignment • hmmbuild[options]hmmfilealignfile • hmmbuild test.hmm test.aln • hmmbuild -h • hmmbuild reads a multiple sequence alignment file alignfile , builds a new profile HMM, and saves the HMM in hmmfile. • alignfile may be in ClustalW, GCG MSF, or SELEX alignment format. • By default, the model is configured to find one or more non-overlapping alignments to the complete model. • To configure the model for a single global alignment, use the -g option; • To configure the model for multiple local alignments, use the -f option; • To configure the model for a single local alignment (standard Smith/Waterman), use the -s option.
Hmmcalibrate: calibrate HMM search statistics • hmmcalibrate[options] hmmfile • hmmcalibrate test.hmm • Hmmcalibrate -h • hmmcalibrate reads an HMM file from hmmfile, scores a large number of synthesized random sequences with it, fits an extreme value distribution (EVD) to the histogram of those scores, and re-saves hmmfile now including the EVD parameters. • This step is optional, but it will increase the sensitivity of your database search • hmmcalibrate may take several minutes (or longer) to run. While it is running, a temporary file called hmmfile.xxx is generated in your working directory. • If you abort hmmcalibrate prematurely (ctrl-C, for instance), your original hmmfile will be untouched, and you should delete the hmmfile.xxx temporary file.
Hmmsearch: - search a sequence database with a profile HMM • hmmsearch[options]hmmfileseqfile • hmmsearch test.hmm query.faa > query.faa.domain • hmmsearch -h • hmmsearch reads an HMM from hmmfile and searches seqfile for significantly similar sequence matches. • hmmsearch may take minutes or even hours to run, depending on the size of the sequence database. It is a good idea to redirect the output to a file. • The output consists of four sections: • a ranked list of the best scoring sequences, • a ranked list of the best scoring domains, • alignments for all the best scoring domains, and • a histogram of the scores. • A sequence score may be higher than a domain score for the same sequence if there is more than one domain in the sequence; the sequence score takes into account all the domains. All sequences scoring above the -E and -T cutoffs are shown in the first list, then every domain found in this list is shown in the second list of domain hits. If desired, E-value and bit score thresholds may also be applied to the domain list using the -domE and -domT options.
Pfam 23.0 (July 2008, 10340 families) • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. • The identification of domains that occur within proteins can therefore provide insights into their function. • There are two components to Pfam: Pfam-A and Pfam-B. • Pfam-A entries are high quality, manually curated families. • Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. • Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. • Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. (see Pfam-C)
Sequence analysis with HMM • ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam23.0/ to download files “Pfam_fs.gz” and “Pfam_ls.gz” • Pfam_ls - All global (ls mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. • Pfam_fs - All local (fs mode) Pfam-A HMMs in an HMM library searchable with the hmmpfam program. • Data location • http://darwin.informatics.indiana.edu/col/courses/I529-12/Lab/Lab7/Data/PFAM_data/ • Copy to your working director • To search for domains in “test.faa” in the global sequence database, type • hmmpfam Pfam_fs test.faa > test.faa.pfam” • The results is logged into an output file “test.faa.pfam”;