130 likes | 254 Views
Tuning Tophat2. Belinda Giardine. Tophat2. Aligns reads from RNA to the genome Ribonucleic acid (RNA) is a ubiquitous family of large biological molecules that perform multiple vital roles in the coding, decoding, regulation, and expression of genes.
E N D
Tuning Tophat2 Belinda Giardine
Tophat2 • Aligns reads from RNA to the genome • Ribonucleic acid (RNA) is a ubiquitous family of large biological molecules that perform multiple vital roles in the coding, decoding, regulation, and expression of genes. • Adds on dealing with gaps in the alignments by breaking the reads into small pieces ~20 bases and reassembling the reads after mapping. • Though the new version is more parallel still slow (more than 4 days for recent runs) • It uses Bowtie to do the actual mapping
RNA-seq image from wikipedia fastq file, a single read: @DGM97JN1:330:C3EW0ACXX:1:1101:2723:1993 1:N:0:NAAGGC GAATGCCCCCGGCCGTCCCTCTTAATCATGGCCTCAGTTCCGAAAACCANCAAAATAGAACCGCGGTCCTATTNN + CCCFFFFFHHHHGIIFGIIIIJJIIJIFGIJEHIIJIGHIJHAGHHFEE#,;;BACEEDDDDDD@B>BBDCDC##
Tophat2 • Pipeline written in C++ (34,351 lines of code in 63 files) • Wrapper written in Python • 3 of the programs use Boost pthreads • long_spanning_reads.cpp • segment_juncs.cpp • tophat_reports.cpp • Programs are compiled as one unit under autoconfig and automake, communication between programs with temporary files. • Many prerequisites: zlib, Boost, samtools, Bowtie, this and the amount of file IO makes running on MIC only not feasible.
Data files • Reads in fastq format, 20–200 million reads (2 x 20gb for my test) • Reference sequence and indexes used for mapping 6gb for mouse • Final output 14gb for my test
Work from last time • Compiling • start with gcc • then icc • then add –mmic (this failed in trying to get all the prerequisites) • Test run on host, using Tophat’s log of run for time. • Run on biostar(Xeon) using 8 threads took 26 hours • Run on stampede (host) using 16 threads took 19 hours, 40 mins • Run on stampede (host) using 32 threads took 24 hours
New work • Python wrapper and long run times makes gprof and vtune difficult to profile code with. • Going from my experience in Biostar, I am starting with segment_juncs executable. • Keeping the temporary files that are used for passing data between programs, I ran just segment_juncs. • Time for segment_junctions run alone: • 8 threads 2 hours 13 minutes • 16 threads 1 hour 15 minutes (2 ½ out of 19 ½ hours total) • of this 76% is spent in the parallel section • 32 threads 2 hours 12 minutes
Failed attempts • Run vtune onsegment_juncs • times out of full data • license errors • Check loops inpar_reportthat are assumed dependencies. • lines of code indicated not loops or in loops? • contradictory lines • Offloading threaded section of code in segment_juncs.cpp. Will it actually improve speed or too much file IO? • Lots of variables to copy • File IO
vec_report3 segment_juncs.cpp(135): (col. 32) remark: loop was not vectorized: existence of vector dependence. segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed ANTI dependence between r.92068 line 135 and r.92068 line 135. segment_juncs.cpp(135): (col. 32) remark: vector dependence: assumed FLOW dependence between r.92068 line 135 and r.92068 line 135. Line 135: left_seg.left = max(0, T.right() - 2);
opt_report • REMOVED VAR left_mismatches.201433.0_V$78b • REMOVED PACK left_mismatches.201433.0 • REMOVED VAR right_mismatches.201433.0_V$78d • REMOVED PACK right_mismatches.201433.0
gprof output for segment_juncs • Each sample counts as 0.01 seconds. • % cumulative self self total • time seconds seconds calls Ts/call Ts/call name • 100.01 0.01 0.01 extend_from_seeds(std::vector<SeedExtension, std::allocator<SeedExtension> >&, PackedSplice const&, std::vector<std::vector<ReadHit, std::allocator<ReadHit> >, std::allocator<std::vector<ReadHit, std::allocator<ReadHit> > > > const&, std::string const&, std::string const&, unsigned long, unsigned long, int) • 0.00 0.01 0.00 89528 0.00 0.00 pack_splice(std::string const&, int, int, unsigned int) • 0.00 0.01 0.00 3 0.00 0.00 __do_global_dtors_aux • 0.00 0.01 0.00 2 0.00 0.00 pack_right_splice_half(std::string const&, unsigned int, unsigned int)
Parallel section of code: vector<boost::thread*> threads; for (inti = 0; i < num_threads; ++i) { SegmentSearchWorker worker; worker.rt = &rt; worker.reads_fname = left_reads_fname; worker.segmap_fnames = &left_segmap_fnames; worker.partner_reads_map_fname = right_reads_map_fname; worker.seg_partner_reads_map_fname = right_seg_fname_for_segment_search; worker.juncs = &vseg_juncs[i]; worker.deletions = &vdeletions[i]; worker.insertions = &vinsertions[i]; worker.fusions = &vfusions[i]; worker.read = READ_LEFT; worker.partner_hit_offset = 0; worker.seg_partner_hit_offset = 0; if (i == 0) { worker.begin_id = 0; worker.seg_offsets = vector<int64_t>(left_segmap_fnames.size(), 0); worker.read_offset = 0; } else { worker.begin_id = read_ids[i-1]; worker.seg_offsets.insert(worker.seg_offsets.end(), offsets[i-1].begin()+1, offsets[i-1].end()); worker.read_offset = offsets[i-1][0]; if (partner_offsets.size() > 0) worker.partner_hit_offset = partner_offsets[i-1]; if (seg_partner_offsets.size() > 0) worker.seg_partner_hit_offset = seg_partner_offsets[i-1]; } worker.end_id = (i+1 < num_threads) ? read_ids[i] : std::numeric_limits<uint64_t>::max(); //Geo debug: //fprintf(stderr, "Worker %d: begin_id=%lu, end_id=%lu\n", i, worker.begin_id, worker.end_id); if (num_threads > 1 && i + 1 < num_threads) threads.push_back(newboost::thread(worker)); else worker(); }