Scrubbing Reads for Better Assembly


In an earlier post, I introduced the concept of  intrinsic quality values — a statistic indicating the quality of every say TP=100bp interval of a read — that could be computed from the pile of local alignments for each read computed by daligner with the -t option set to TP.  This idea is important because of the complex and variable error profile of PacBio long reads which I illustrated with the following hypothetical example:

Error profile of a typical long read. The average error rate is say 12% but it varies and occasionally is pure junk.

The quality values together with the pattern of alignments in a read’s pile allow one to detect these, sometimes long, low-quality segments of a read, as well as detect when the read is a chimer or contains undetected adaptamer sequence that should have been excised.  These artifacts are quite frequent in PacBio data sets and apart from repeats are the major cause of poor assembly.  We call the process of repairing the low-quality sequence and removing the chimers and missed adaptamers, scrubbing, and formally the goal of a scrubber is to edit a set of reads with as little loss of data as possible so that:

  1. every edited read is a contiguous segment of the underlying genome being sequenced, and
  2. every portion of a read is of reasonable quality, i.e. <20% error or Q8.

The conditions above are exactly those assumed by a string graph construction algorithm and any data that doesn’t meet these conditions introduces breaks and false branches in the string graph.  For this reason I choose to scrub hard, i.e. my scrubber will over call chimers and adaptamers rather than miss them, so that condition 1 above is meet by all the data after scrubbing.

The scrubber module consists of a pipeline of 4-5 modules as opposed to a single program that performs the end-to-end task.  We structured it this way so that one can run each pass in a block-parallel manner on an HPC cluster, and secondarily, so that a creative user can potentially customize or improve aspects of the overall process.  The figure below, shows the pipeline:

Data flow, including tracks, of the scrubbing pipeline

Each program in the pipeline can be run on the overlap piles for a block of the underlying DB in parallel, allowing a simple HPC parallel scheme.  By doing so each program per force outputs only the portion of a track for the intput block (i.e.. .qual, .trim, .patch), and these block tracks must be concatenated into a single track for the entire database with Catrack before the next step in the pipeline can be performed.  Fortunately, this serial step is very fast, basically amounting to concatenating a number of files of total size O(R) where R is the number of reads in the database.

The first step in scrubbing is to compute the quality values for every read segment and record them in a qual track with DASqv as described in an earlier post.  Recall that with the -v option DASqv displays a histogram of the QV’s over the dataset.  Using this histogram one should select a threshold GOOD for which 80% of the QV’s are not greater than this value, and another BAD for which 5-7% of the QV’s are not less than this value.  In the next two steps of the pipeline, QV’s ≤ GOOD will be considered “(definitley) good”, and QV’s ≥ BAD will be considered “(definitely) bad”.  Any QV’s between the two thresholds will be considered of unknown quality.   Decreasing these values, increases the amount of data that will be scrubbed away as not of sufficient quality, and increasing these thresholds retains more data but at a sacrifice in quality.  The percentile guidelines above represent good settings for typical PacBio data (that is about 80% of the data actually is good and 5% of it is actually crap).

The GOOD/BAD thresholds must be supplied as the -g and -b parameters to the next phase of the pipeline realized by DAStrimDAStrim first uses the quality values and the supplied -g and -b thresholds to determine a sequence of high-quality (HQ) segments for the A-read of each pile.  Formally, an HQ segment begins and ends with a good  trace point interval interval, contains no bad intervals, and is at least 400bp long.  Any intervals before the first HQ segment and after the last HQ segment, are clipped/trimmed from the beginning and end of the read, respectively.  The main task of DAStrim, is to then determine what to do with the gaps between the HQ segments of a read.  There are 4 cases:

  1. The gap is completely covered/spanned by lots of local alignments with other reads – then the gap is simply low quality sequence and deemed LOW-Q. This is illustrated in the example at right by the leftmost low-quality gap.
  2. The gap is not spanned by local alignments with other reads, but most of the local alignments to the left (X) and right (Y) of the gap are for the same B-read and the distance between the two B-read segments in the relevant local alignments are consistent with the size of the gap – then the gap is a (very) low quality (and often long) drop out in the A-read and the gap is deemed LOW-Q.
  3. the gap is not spanned but the left (X) and right (Y) alignment sets have the relevant B-reads in the complement orientation with respect to each other – the gap is due to an ADAPTAMER. Typically, the gap is also very sharp with the gap between paired local alignments being very narrow as in the example below left.  In the situation where one or more adatamer gaps is found for a read, all but the longest subread between two of the adaptamers is thrown away.


  1. the gap is not spanned and there is no correspondence between the reads in the local alignments to the left and right – the gap is due to a chimeric join in the A-read and the gap is designated CHIMER as illustrated in the example above right.  Every chimeric gap splits the A-read into two new reads.

DAStrim outputs a track (or track block) called trim that for each read encodes the begin/end coordinates for each HQ segment and the classification of each intervening gap (either LOW-Q or CHIMER).  With the -v option it also outputs a report describing how much data was trimmed, designated LOW-Q, etc.  In the example below, one sees that the left column is the number of reads or items, and the right column is the total number of bases in those items.  3,117 reads totalling 11.6% of the bases in the data set were found to be either garbage or were totally masked by the DAMASKER module and thrown out.  Of the remaining reads, 5.7% of the bases were trimmed from the beginning of reads, and 2.5% from the end of reads.  37 missed adaptamer sites were found and 117.8Kbp was trimmed away leaving only the longest sub-read between adaptamers.

Input:     22,403 (100.0%) reads      250,012,393 (100.0%) bases

Discard:    3,117 ( 13.9%) reads       29,036,136 ( 11.6%) bases
5' trim:   13,749 ( 61.4%) reads       14,137,800 (  5.7%) bases
3' trim:    7,919 ( 35.3%) reads        6,316,157 (  2.5%) bases
Adapter:       37 (  0.2%) reads          117,816 (  0.0%) bases

Gaps:      35,253 (157.4%) gaps        15,205,800 (  6.1%) bases
Low QV:    30,903 (137.9%) gaps         8,834,200 (  3.5%) bases
Span'd:     3,346 ( 14.9%) gaps         4,441,700 (  1.8%) bases
Chimer:     1,004 (  4.5%) gaps         1,929,900 (  0.8%) bases

Clipped:   25,826 clips                49,607,909 ( 20.6%) bases
Patched:   34,249 patches              13,275,900 (  5.3%) bases

In the adaptamer-free reads that were retained, there were 35,253 gaps between HQ segments.  30,903 were deemed LOW-Q by virtue of being spanned by alignments (Low QV) and 3,346 were deemed LOW-Q by virture of consistent alignment pairs (Span’d).  The remaining 1,004 where deemed to be CHIMERs and will be split creating 1,004 new reads.  The last two lines tell one that a total of 20.6% of the bases were lost altogether because of trimming, adapatmers, and chimers, and that 5.3% of the bases were in low-quality gaps that will be patched with better sequence in the ensuing scrubbing passes.

The 3rd phase of scrubbing is realized by DASpatch which selects a good B-read segment to replace each LOW-Q gap with.  We call this high-quality sequence a patch.  The program also reviews all the gap calls in light of the fact that the trimming of reads affects their B-read alignment intervals in the piles of other reads, occasionally implying that a gap should have been called a CHIMER and not LOW-QDASpatch further changes a LOW-Q gap to a CHIMER call if it cannot find what it thinks is a good B-read segment to patch the gap.  The number of these calls is generally very small, often 0, and their number is reported when the -v option is set.  DASpatch outputs a track (or track block) called patch that encodes the sequence of B-read segments to use for patching each low quality gap in a read.

The final phase of scrubbing is to perform the patching/editing of reads that is encoded in the trim and patch tracks produced in the last two phases, and produce a new data base of all the resulting scrubbed reads.  This is accomplished by DASedit that takes the name of the original database and the name you would like for the new database as input.  This phase runs sequentially and can can be quite slow due to the random access pattern of read sequences required to patch, so you might want to go do something else while it runs.  But when its finished you will have a database of reads that are typically 99.99% chimer and adatamer free, and that contain almost no stretches of very low-quality base calls.

The new database does not have a .qvs or .arr component, that is it is a a sequence or S-database (see the original Dazzler DB post).  Very importantly, the new database has exactly the same block divisions as the original.  That is, all patched subreads in a block of the new database have been derived from reads of the same block in the original database, and only from those reads.  DASedit also produces a map track that for each read encodes the original read in the source DB that was patched and the segments of that read that were high-quality (i.e. not patched).  A program DASmap output this information in either an easy-to-read or an easy-to-parse format.  A small example of a snippet output by DASmap is as follows:

55 -> 57(2946) [400,2946]
56 -> 58(11256) [700,1900]
57 -> 58(11256) [6600,9900] 83 [10000,11256]
58 -> 59(12282) [400,4100] 88 [4200,9400] 97 [9500,12282]

The first line indicates that read 55 in the patched database was derived from read 57 in the original database and is the segment from [400,2946] of that read.  Reads 56 and 57 were both derived from read 58 in the original DB, and read 57 consists of segments [6600,9900] and [10000,11256] of read 58 with a patch between them of 83bp (but the source of the patch data is not given).  The read length of each original read is given for convenience.  The purpose of this map is to allow you to map back to the original reads in the final consensus phase of an assembly where one will want to use the original data along with its Quiver or Arrow data encoded in the .qvs/.arr component of the DB.

A database of scrubbed reads can be used in three ways as illustrated in the figure at the end of this paragraph.  In the first use-case, a user can simply take the database of scrubbed reads and use them as input to their favorite long-read assembler (e.g. Falcon or Canu).  In our experience doing so improves the assemblies produced by these third party systems.  For example, in the plot at right, Falcon’s error corrector (EC) tends to break reads at long low quality gaps producing the red line read-length profile given a data set with the blue read-length distribution.  When Falcon starts on our scrubbed reads with the grey read-length profile, (which is significantly better than the EC profile), its own error correct leaves these longer scrubbed reads intact. Longer reads into the string graph phase of Falcon’s assembler implies a more coherent assembly, i.e. longer contigs.The second use-case, is to rerun the daligner on this set of reads to produce .las files anew over the scrubbed reads.  The down side is that this takes considerable compute time.  However, because the reads are scrubbed, one should expect that reads properly overlap and a purely local alignments can now be ignored as repeat-induced.  Giving the result set of overlaps to a string graph algorithm for assembly should result in better assembly as many repeat-induced correspondences have been removed.

The final third use-case, is rather than re-run daligner, to take advantage of the fact that the piles computed the first time already contain all the pairs of reads that align, and so in fact all that needs to be done is to “re-align” an local alignment found between original reads in terms of the derived reads.  This includes extending the local alignment when the introduction of a patch allows it to now extend through a formerly low-quality region.  To this end I produced a program DASrealign that takes a pair of blocks of the scrubbed DB and the original .las file for the block pair (in the original DB), and produces the realigned overlap piles in a .las file for the scrubbed DB blocks.  Not carefully that this requires (1) that you have kept the block pair .las files produced by the first phase of the HPC.daligner process and not removed them, and (2) that after realigning every block pair, you must then merge these to produce the overlap blocks for the scrubbed DB.  This process is roughly 40x faster than running daligner from scratch.  The trade off is that some alignments might be missed that would otherwise be found if daligner were rerun on the scrubbed reads, and the trace point encoding of a patched alignment is no longer uniform in the A-read and so one could not scrub the patched reads a second time with the piles so produced.  On the otherhand, one can take just the overlaps that are produced this way and perform a string graph assembly.

The pipeline above careful identifies and removes read artifacts such as chimers and adapatamers, and repairs high error-rate read segments, providing a set of reads that will assemble well (without correction).  However, ultimately we believe good assembly will further require correction in order to carefully separate haplotypes and low copy number repeats prior to the string graph phase.  Therefore, our ultimate ongoing development plan is to replace DASedit with a program that produces error corrected A-reads based on the analysis of DASqv-DAStrim-DASpatch and to then realign (at much higher stringency / lower error rate) the resulting reads.  We expect this process to take more time than simply applying DASrealign, but with the advantage that the reads will also be error corrected.  Never the less, just scrubbing reads and feeding them into another assembler is proving to have sufficient value that I decided to release the current pipeline.  Best, Gene

Mapping Your Reads: damapper


OK, so there are over 100 different read mappers out there.  Why another one?  First, its faster than all the rest on long noisy reads while being equally sensitive to the best, second, because it uses a chain model of alignments that accounts for and reveals the frequent drop outs in quality within a Pacbio read, and third, because it actually estimates and reports those regions of reads that involve repeat elements and their approximate copy number in the genome to which they are being mapped.  Finally, I was just curious as to whether the daligner‘s sort and merge k-mer strategy would be competitive with the prevailing paradigm of using FM-indices or Suffix tree/arrays.  The answer is that it is, albeit it depends on M, the amount of memory available, and G, the size of the reference genome.  The larger G, the slower damapper becomes relative to index-based mappers.  Morever, while damapper will run in any reasonable (e.g. 16Gb) amount of memory, it runs faster the more memory there is available.  For example, with 48Gb, damapper is 1.8 times faster than BWA on the human genome, 15 times faster on the fly genome, and 36 times faster on E.Coli.  Performance will be reported on more extensively in a subsequent post.

So “damapper REF DB.1 DB.2” will map the reads in blocks 1 and 2 of DB to a reference genome  REF.  It really doesn’t care if REF is a .dam or a .db, and it will handle whatever size it happens to be in the amount of memory available or specified with the -M option.  On the other hand the DB blocks must contain sequences that are not too long, i.e reads only, and is expected to be of an appropriate block size (e.g. 250Mbp).  The difference between damapper and daligner is that damapper is looking for the best or near best map of the read to a location in REF, whereas daligner will find every possible local alignment between the reads in DB.1 and DB.2 including all repeat induced LA’s.  Thus damapper has much less work to do and so can be and is much faster.

Most of the parameters to damapper are the same as for the daligner, i.e. -v, -b, -k, -t, -M, -e, -s, and -m, but with a much larger default of -k (20) and a higher correlation for -e (.85), and the -w, -h, and -l parameters are not needed.  Like daligner, damapper finds local alignments consistent with the error rate -e, and does not extend these alignments into regions where the sequences are no longer reasonably correlating at the specified rate.  This means that if a Pacbio read has, say, two very low quality regions internal to the read, a frequent occurrence in such data, then damapper will find and report a chain of 3 local alignments and not a single alignment that is basically nonsense in the two drop out regions.  This is distinctly different from other mappers that force an alignment through these regions.  This seems to my mind a more principled approach and identifies the fact that parts of a read are very low quality.  I’m not sure why seeing 2,000 garbage bases in a forced alignment to the reference sequence is valuable :-).

So a damapper mapping of a read is a chain of one or more local alignments to a corresponding contig of the reference.  Occasionally reads are chimers, in which case one wants to see a mapping of each of the two distinct parts of the read.  damapper accomplishes this by finding the best chain mapping for the read, then finding the best mapping to any portion of the read not spanned by the first chain, and so on, until all significant portions of the read are mapped (if possible).  Moreover, damapper will report other good matches to a segment if requested with the -n option.  For example, with -n.95, it reports all matches whose score is within 95% of the best score covering a given portion of a read.

Apart from the -n option, the other options unique to damapper are the -C, -N, and -p flags.  By default, damapper returns .las files where the A-reads are the mapped reads, and the B-reads are the contigs of the reference.  That is, “damapper REF DB.1 DB.2” will output DB.1.REF.M1.las, DB.1.REF.M2.las, … DB.2.REF.Mn.las, where T is the number of threads it is run with (specified by the -T option, default 4).  The T files for each block can be combined into a single .las file with for example “LAcat DB.1.REF.M# >DB.REF.1.las“.  If you would like to see the mapping from the point of view of the reference contigs, specifying the -C option, will further produce files of the form REF.DB.1.C1.las, … REF.DB.2.Cn.las, where the A-reads are the contigs of the reference, and the B-reads are the mapped reads.  Specifying -N suppresses the production of the .M#.las files in case you only want to see the coverage of the contigs by reads.

As is customary, there is an HPC script generator, HPC.damapper, that will generate all the calls necessary to produce the mapping of reads to contigs for blocks of a data set (and vice versa if -C is set).  For example, “HPC.damapper -C -n.95 REF DB” will produce DB.#.REF.las and REF.DB.#.las where DB.# is a block of DB.  For example, if DB has 4 blocks then HPC.damapper generates a script equivalent to the following:

# Damapper jobs (1)
damapper -v -C -n0.95 REF DB.1 DB.2 DB.3 DB.4
# Catenate jobs (8)
LAsort -a DB.1.REF.M*.las && LAcat DB.1.REF.M#.S >DB.1.REF.las
LAsort -a DB.2.REF.M*.las && LAcat DB.2.REF.M#.S >DB.2.REF.las
LAsort -a DB.3.REF.M*.las && LAcat DB.3.REF.M#.S >DB.3.REF.las
LAsort -a DB.4.REF.M*.las && LAcat DB.4.REF.M#.S >DB.1.REF.las
LAsort -a REF.DB.1.C*.las && LAmerge -a REF.DB.1 REF.DB.1.C*.S.las
LAsort -a REF.DB.2.C*.las && LAmerge -a REF.DB.2 REF.DB.2.C*.S.las
LAsort -a REF.DB.3.C*.las && LAmerge -a REF.DB.3 REF.DB.3.C*.S.las
LAsort -a REF.DB.4.C*.las && LAmerge -a REF.DB.4 REF.DB.4.C*.S.las
# Check all .las files (optional but recommended)
# Cleanup all intermediate .las files
rm DB.*.REF.*.las REF.DB.*.*.las

Its important to note that all the commands that deal with .las files, i.e. LAsort, LAmerge, LAcheck, LAshow, and LAdump have been upgraded to properly handle damapper-produced .las files that encode chains of LAs.  Internally, Dazzler software can distinguish between .las files that encode chains and those that don’t (e.g. as produced by daligner or datander).  For sorting, chains are treated as a unit, and sorted on the basis of the first LA.  Display commands will indicate chains when present, and LAcheck checks chains and their internal encoding for validity.  Another thing to note very carefully, is that chains are best sorted with the -a option and not the default.  This is particularly important for files produced in response to the -C option.  For example, to produce a mapping of all reads to REF in say REF.DB.las, one should merge the relevant files above with “LAmerge -va REF DB REF.DB.*.las”  One can then see how the reference sequence is covered by the database by then opening REF.DB.las in DaViewer.

In developing the chaining algorithm for damapper, it became obvious that at little additional overhead one could estimate the repetitiveness of the sequence in a region of a read simply by counting how many (sub-optimal) chains covered the region.  The -p option requests that damapper produce a repeat profile track for each read based on these counts which roughly estimate the repetitiveness of a read segment within the underlying genome.  That is, for each trace-point sized interval (established by the -s parameter) the number of times, c, this interval is involved in a distinct alignment to the reference is estimated.  0 is recorded for segments that don’t match anything in the reference, 1 for segments that match uniquely, and otherwise floor(log10c/10) up to a cap of 40 corresponding to 10,000 copies.  Observe carefully that the track has the same form as intrinsic quality values: a vector of small byte-sized integers for each trace point interval.  These vectors can be output by DBdump with the -p option and DaViewer is able to graphically display them.  These profiles should make it obvious when a read does not have a unique location in a reference sequence due to its being entirely or almost entirely repetitive.  I think having this information is critical in any subsequent analysis.

Seeing Your Reads: DaViewer

The old adage that “A picture is worth a 1000 words” might well be stated today to as “A visualization of your data is worth a 1000 print outs”.  While there are many programs in the Dazzler system to produce ASCII text displays of your reads and their overlaps with each other, I have found that it is hugely more powerful to look at pile-o-grams and visualize the quality of the matches, the intrinsic QVs of a read, and any mask intervals produced by programs such as the repeat maskers or the forthcoming scrubber.  To this end I’m releasing today DaViewer that is a Qt-implemented user interface for seeing Dazzler piles and associated information.  Unlike other modules, this one does have a dependency on the Qt library, which you can download and install for free here.  I developed it under Qt5.4 and have recently tested it with Qt5.6 (latest version) under Mac OS X.  The library and my C++ code should compile and run under any OS, but this has not been tested, so I’d appreciate reports of porting problems, and of course, any patches.  In this post I give a brief overview of what the DaViewer can do, and refer you to the manual page for detailed documentation.

In brief, DaViewer allows you to view any subset of the information in a given .las file of local alignments computed by the daligner or the forth coming damapper, and any track information associated with the read database(s) that were compared to produce the local alignments (LAs).  As an example here is a screen showing several overlap piles (click on the image to see a bigger view of it):

Screen Shot 2016-05-29 at 1.42.44 PM The viewer has a concept of a palette that controls the color of all the elements on a screen and you can set this up according to how you like to see the data.  I prefer a black background with primary colors for the foreground, but in the example below I set the palette so that the background is white and adjusted the foreground colors accordingly:Screen Shot 2016-05-29 at 1.44.13 PMWith the query panel you can zoom to any given read or range of reads, and when you zoom in far enough for the display to be meaningful you see (a) a heat map displaying the quality of all trace point interval matches, (b) a “tri-state” color map of the quality values of read 19 for each trace point (as computed by DASqv), and (c) a grid-spacer (if the appropriate option is turned on).  Moreover, you can ask the viewer to chain together local alignments that are consistently spaced with a dashed line connecting them and a color-coded ramp indicating the compression or expansion of the spacing.  For example, below I queried read “19”: Screen Shot 2016-05-29 at 2.27.25 PMYou can see the grid-lines spaced every 1000bp, compressed dash lines across what must be a low quality gap in the read between bases 2000 and 3000, and numerous expanded dashes across what are most likely low quality gaps in matching B-reads.  The settings in the palette dialog producing this view were as follows:

Screen Shot 2016-05-29 at 2.28.06 PMScreen Shot 2016-05-29 at 2.28.57 PM

In the panel at right the “Tri-State” option is chosen for “Show qual qv’s” and the colors are set so that trace point intervals with quality values not greater than 23 (i.e. “Good”) are colored green,  intervals with values not less than 30 (i.e. “Bad”) are colored red, and otherwise the interval is colored yellow.  Further note that you can also ask to see read’s QV’s projected on to the B-reads of the pile, producing a view like this:

Screen Shot 2016-06-01 at 7.39.53 AMA variety of programs such as DBdust, REPmask, TANmask, and the forthcoming DAStrim, all produce interval tracks associated with the underlying DB(s).  When a new .las file is opened with daviewer the program automatically searches the specified DBs for any interval tracks associated with them and loads them for (possible) display.  The found tracks are listed in the “Masks” tab of the palette dialog as illustrated immediately below.

Screen Shot 2016-06-01 at 7.49.11 AMScreen Shot 2016-06-01 at 7.48.17 AM

In the palette tab at left each available interval track or mask is listed.  One can choose whether a track is displayed, on which register line, its color, and whether or not you want to see the mask on the B-reads as well.  The tracks are displayed in the order given and the order can be adjusted with drag-and-drop initiated by depressing the up/down arrow at the far left.

Finally, one can record all the settings of a particular palette arrangement in what is called a view.  At the bottom of the palette dialog is a combo-box that allows one to select any saved view, and the creation of views, their updating, and removal are controlled by the Add, Update, and Delete buttons, respectively.  In the palette example above, a view call “trim” has been created and that as seen in “Quality”-tab has a different heat map for trace interval segments.  The screen capture below shows an example of this view and further illustrates the display of tracks.Screen Shot 2016-05-30 at 5.07.03 PMOne should also know that clicking on items in the display area produce popups with information about the object, clicking below the coordinate axis at bottom zooms in or out (shift-click) by sqrt(2), and grabbing a background area allows you to scroll by hand (as will as with the scroll widgets).

Finally, one can have multiple, independently scrollable and zoomable views of the same data set by clicking the “Duplicate” button (+-sign in the tool bar at upper-left), and one can further arrange them into a tiling of your screen by clicking the “Tile” button (again in the tool bar).  As an example, the (entire) screen capture below was created by hitting “Duplicate” 3 times, then “Tile”, and then zooming and scrolling the four individual views.

Screen Shot 2016-06-01 at 10.56.21 AMIn closing, DaViewer is a full featured and flexible visualization tool for the specific task of seeing your Pacbio reads, their quality, and their LAs with other reads.  It has so many features that a man page will take a while for me to produce.  In the meantime, I will hope that you can just open up a data set and figure it out by trial and error button pushing and clicking.  Unlike other Dazzler modules it is after all a GUI, have you ever read the Microsoft Word manual?

Detecting and Masking Repeats

The daligner was specifically designed to find local alignments (LAs) between reads as opposed to overlaps where in each end of the alignment reaches an end of one or the other read.  This is a necessity because raw PacBio reads can have very low quality or even random segments at any position, and one cannot therefore expect an alignment through these regions.  Finding LAs instead of overlaps, allows the Dazzler suite to then analyze pile-ograms in order to detect and address these regions (recall that a pile for a read x is the set of all LAs where x is the A-read and a pile-ogram is a depiction of all the A-intervals covering x).  Computing LAs also implies that if a read x is from a region of DNA that has a relatively conserved repetitive element in it, then every copy of the element in any read of the shotgun data set will align with the relevant interval of x.  For example, given a 50X (=c) data set and a repeat that occurs 20 (=r) times in the genome, one should expect every occurrence of the repeat in a read to be covered on average by 1,000 (=rc) LA’s in its pile.  The pile-ogram below depicts such a repeat stack:

A small repeat stackWhat’s good about this is that the excessive depth of “stacked” LA’s clearly signals the repetitive element, its approximate copy number, and its approximate boundaries.  What’s bad about this is that the daligner can spend 95% or more of its time finding all the local alignments in these repeat stacks.  This observation raises two questions: (a) Is it worth finding them all, and (b) If not, how can we avoid doing so?

The answer to the first question is “no” (i.e., it is not worth finding them all) and our argument is as follows.  First recall, that daligner can accept any number of interval tracks that it uses to soft mask the reads.  That is, it will only seed/seek alignments in regions that are not masked, but will extend an alignment through a masked region if the the two sequences involved continue to be similar.  Next observe, that if a repetitive element is flanked by unique sequence and it is short enough that with high probability there is one or more spanning reads that cover the repeat and at least say 1Kbp of the unique flank on either side, then there is no consequence to soft masking this repeat as the spanning reads will provide a coherent assembly across it as shown in the figure below as the left (small) scenario.  For Pacbio reads this is generally true of any interspersed repeat of 10Kb or less.  If a repeat is compound or so large that there are no spanning reads for it (as shown in the right (large) scenario), then all current assemblers frankly fail to assemble the region involving it, and so in some sense Effect of soft-masking repeatsno harm is done by soft-masking the long repeats too. We therefore think that soft-masking the repetitive parts of reads is a good initial strategy for the down stream assembly problem: most of the genome will come together with long reads, the “overlap”/daligner stage of assembly will be 10 to 20 times faster, and the big repeats can still be solved in a subsequent post-process by analyzing the much smaller number of reads that were completely masked as repetitive.

So on to the second question, which is how does one efficiently build the needed repeat masks?  Consider taking the hypothetical 50X data set above and comparing every 1X block of this data set against itself.  First observe that doing so is 2% of all the comparisons that would normally be performed by daligner.  Second, observe that unique parts of a read will have on average about 1 LA covering them, whereas a segment from a repeat that occurs, say, 10 times in the genome will already have have 10 LA’s covering it and will appear as a repeat stack in a read’s pile-ogram.  The probability of having 10 or more LA’s covering a unique segment of a read is exceedingly low, so simply specifying a threshold, such as 10, above which a region is labelled repetitive is an effective strategy for identifying high-copy number repetitive elements.  This mask can then be used by the daligner when performing the remaining 98% of its comparisons.  The resulting suppression of purely repeat-induced LAs will speed things up 10-20 times at almost no degradation in the performance of a downstream assembler like Falcon or the Celera Assembler as argued earlier.

For a very lScreen Shot 2016-04-04 at 9.44.23 AMarge data set one may want to repeat this process at different coverage levels and with different thresholds in order to achieve greater efficiency.  For example, we have a 30X data set of a 30Gbp genome, that requires dividing the Dazzler DB into 3,600 250Mbp blocks.  We first run each block against itself (.008X vs .008X) and create a mask .rep1 with a threshold of 20, that detects only super-repetitive elements.  Next we compare every group of 10 consecutive blocks (.08X) against themselves soft-maskiScreen Shot 2016-04-04 at 9.44.35 AMng with .rep1, and create a mask .rep10 with a threshold of 15 to detect fairly abundant repetitive elements.  Then we compare every consecutive group of 100 blocks (.8X) against themselves soft-masking with .rep1 and .rep10, and create a mask .rep100 with threshold 10 used to further soft-mask the final all-against-all of the 3,600 blocks in the entire data set.  The figures interspersed within this paragraph illustrate a 1×1 and 5×5 schema for a hypothetical set of blocks.

To implement this strategy we have created the programs HPC.REPmask and REPmask as part of a new DAMASKER module.  “HPC.REPmask -g#1 -c#2” produces a job-based UNIX script suitable for an HPC cluster that:

  1. compares each consecutive #1 blocks against themselves,
  2. sorts and merges these into .las files for each block, and then
  3. calls “REPmask -c#2 -mrep#1” on each block and associated .las file to produce a block interval track or mask, with the name .rep#1, of any read region covered #2 or more times.

As a running example consider a hypothetical Dazzler data base D that is divided into 150 blocks and contains an estimated 50X coverage of the target genome.  Then 3 blocks of data contains 1X of the genome, and to realize a 1X versus 1X repeat mask, we need to compare every consecutive group of 3 blocks against each other and then produce a mask of all reads covered more than say 10 deep.  Calling “HPC.REPmask -g3 -c10 D” will produce a shell script to do this, where the script has the following form:

# Daligner jobs (150)
daligner D.1 D.1
daligner D.2 D.1 D.2
daligner D.3 D.1 D.2 D.3
daligner D.4 D.4
daligner D.5 D.4 D.5
daligner D.6 D.4 D.5 D.6

daligner D.148 D.148
daligner D.149 D.148 D.149
daligner D.150 D.148 D.149 D.150
# Initial sort & merge jobs (450)
LAsort D.1.D.1.*.las && LAmerge L1.1.1 D.1.D.1.*.S.las
LAsort D.1.D.2.*.las && LAmerge L1.1.2 D.1.D.2.*.S.las
LAsort D.1.D.3.*.las && LAmerge L1.1.3 D.1.D.3.*.S.las
LAsort D.2.D.1.*.las && LAmerge L1.2.1 D.2.D.1.*.S.las
LAsort D.2.D.2.*.las && LAmerge L1.2.2 D.2.D.2.*.S.las
LAsort D.2.D.3.*.las && LAmerge L1.2.3 D.2.D.3.*.S.las

LAsort D.150.D.150.*.las && LAmerge L1.150.150 D.150.D.150.*.S.las
# Level 1 merge jobs (150)
LAmerge D.R3.1 L1.1.1 L1.1.2 L1.1.3
LAmerge D.R3.2 L1.2.1 L1.2.2 L1.2.3
LAmerge D.R3.3 L1.3.1 L1.3.2 L1.3.3
LAmerge D.R3.4 L1.4.1 L1.4.2 L1.4.3
LAmerge D.R3.5 L1.5.1 L1.5.2 L1.5.3
LAmerge D.R3.6 L1.6.1 L1.6.2 L1.6.3

# REPmask jobs (50)
REPmask -c10 -mrep3 D D.R3.1 D.R3.2 D.R3.3
REPmask -c10 -mrep3 D D.R3.4 D.R3.5 D.R3.6
REPmask -c10 -mrep3 D D.R3.7 D.R3.8 D.R3.9

REPmask -c10 -mrep3 D D.R3.148 D.R3.149 D.R3.150
# Produce final mask for entirety of D
Catrack D rep3

So the above would pretty much be the end of this post if it were not for one other kind of repetitive element that haunts single-molecule data sets that sample the entire genome including the centromeres and telomeres, namely tandem satellites.  Back in the days of Sanger sequencing, such repeats were rarely seen as they were unclonable, but with single molecule reads we see and sample all the genome including these regions.  Tandem repeats are even nastier than interspersed repeats at creating deep stacks in a read’s pile-ogram because they align with themselves at every offset of the tandem element as well as with each other!  So our masking strategy actually starts by detecting and building a mask for tandem repeats in the data set.

A tandem repeat can be detected in a read by comparing the read against itself, with the slight augmentation of not allowing the underlying alignment algorithm to use the main diagonal of the edit matrix/graph.  When a tandem is present, it induces off-diagonal alignments at intervalTandem dot plots equal to the tandem element length.  That is, if the tandem is x10 then the prefix of the first n copies of x aligns with the suffix of the last n copies of x creating a ladder of n-1 alignments as seen in the dot plot accompanying this paragraph.  Comparing a read against itself is already possible in our framework with the -I option to daligner.  To detect tandems, we built a special version of daligner that we call datander that given a DB block, compares each read in the block only against itself and outputs these alignments.  We further developed the program TANmask to use these self-alignments to detect tandem regions to mask: namely, when the two aligned segments of a self-LA overlap, the union of the two segments marks a tandem region.  Lastly, the program HPC.TANmask generates a script generator to call datander, sort and merge the resulting self-LAs, and finally call TANmask on the results.  For example, “HPC.TANmask D” for the 50X database D introduced earlier produces a script of the form:

# Datander jobs (38)
datander D.1 D.2 D.3 D.4
datander D.5 D.6 D.7 D.8

datander D.149 D.150
# Sort & merge jobs (150)
LAsort D.1.T*.las && LAmerge D.T.1 D.1.T*.S.las
LAsort D.2.T*.las && LAmerge D.T.2 D.2.T*.S.las

LAsort D.150.T*.las && LAmerge D.T.150 D.150.T*.S.las
# TANmask jobs (38)
TANmask D D.T.1 D.T.2 D.T.3 D.T.4
TANmask D D.T.5 D.T.6 D.T.7 D.T.8

TANmask D D.T.149 D.T.150
# Produce final mask for entirety of D
Catrack D tan

One should note carefully, that tandem repeat masking should be performed before the interspersed repeat masking discussed at the beginning.  To conclude, we close with the example of the super large 30X database of a 30Gbp genome consisting of 3,600 blocks.  Assuming the DB is called, say Big, all the scripts for performing the repeat-masked “overlap” phase of assembly would be produced by invoking the commands:

HPC.TANmask Big
HPC.REPmask -g1 -c20 -mtan Big
HPC.REPmask -g10 -c15 -mtan -mrep1 Big
HPC.REPmask -g100 -c10 -mtan -mrep1 -mrep10 Big
HPC.daligner -mtan -mrep1 -mrep10 -mrep100 Big

The scripts produced are quite huge for a task of this scale and include a variety of integrity checks and cleanup commands not mentioned here.  A newcomer would best be served by gaining experience on a small data set with small blocks, and only proceeding to large production runs with these “HPC.<x>” scripts when one has confidence that they can manage such a pipeline 😄  In a subsequent post on performance/results I will present some timing studies of the impact of repeat masking on the overlap phase of assembly.

The Dazzler DB: Organizing an Assembly Pipeline

Assembling DNA sequence data is not a simple problem. Conceptually it generally can be thought of as proceeding in a number of sequential steps, performed by what we call phases arranged in a pipeline, that perform certain aspects of the overall process. For example in the “OLC” assembly paradigm, the first phase (O) is to find all the overlaps between reads, the second phase (L) is to then use the overlaps to construct a layout of the reads, and the final phase (C) is to compute a consensus sequence over the arrangement of reads. One can easily imagine additional phases, such as a “scrubber” that detects chimeric reads, a “cleaner” that corrects read sequence, and so on. Indeed, the header banner for this blog shows a 7 phase pipeline that roughly realizes our current dazzler prototype. To facilitate the multiple and evolving phases of the dazzler assembler, we have chosen to organize all the read data into what is effectively a “database” of the reads and their meta-information. The design goals for this data base are as follows:

  1. The database stores the source Pacbio read information in such a way that it can recreate the original input data, thus permitting a user to remove the (effectively redundant) source files. This avoids duplicating the same data, once in the source file and once in the database.
  1. The data base can be built up incrementally, that is new sequence data can be added to the data base over time.
  1. The data base flexibly allows one to store any meta-data desired for reads. This is accomplished with the concept of tracks that implementers can add as they need them.
  1. The data is held in a compressed form equivalent to the .dexta and .dexqv files of the data extraction module. Both the .fasta and .quiva information for each read is held in the data base and can be recreated from it. The .quiva information can be added separately and later on if desired.
  1. To facilitate job parallel, cluster operation of the phases of our assembler, the data base has a concept of a current partitioning in which all the reads that are over a given length and optionally unique to a well, are divided up into blocks containing roughly a given number of bases, except possibly the last block which may have a short count. Often programs can be run on blocks or pairs of blocks and each such job is reasonably well balanced as the blocks are all the same size. One must be careful about changing the partition during an assembly as doing so can void the structural validity of any interim block-based results.

The Dazzler DB module source code is available on Github here. Grabbing the code and uttering “make” should produce 13 programs: fast2DB, quiva2DB, DB2fasta, DB2quiva, fasta2DAM, DAM2fasta, DBsplit, DBdust, Catrack, DBshow, DBstats, DBrm, and simulator.

One can add the information from a collection of .fasta files to a given DB with fasta2DB, and one can subsequently add the quality streams in the associated .quiva files with quiva2DB. After adding these files to a data base one can then remove them as property 1 above guarantees that these files can be recreated from the DB, and the routines DB2fasta and DB2quiva do exactly that. Indeed, here in Dresden we typically extract .fasta and .quiva files with dextract from the Pacbio .bax.h5 source files and then immediately put the .fasta and .quiva files into the relevant DB, remove the .fasta and .quiva files, and send the .bax.h5 files to our tape backup library. We can do this because we know that we can recreate the .fasta and .quiva files exactly as they were when imported to the database.

A Dazzler DB consists of one named, visible file, e.g. FOO.db, and several invisible secondary files encoding various elements of the DB. The secondary files are “invisible” to the UNIX OS in the sense that they begin with a “.” and hence are not listed by “ls” unless one specifies the -a flag. We chose to do this so that when a user lists the contents of a directory they just see a single name, e.g. FOO.db, that is the one used to refer to the DB in commands. The files associated with a database named, say FOO, are as follows:

(a) FOO.db
a text file containing (i) the list of input files added to the database so far, and (ii) how to partition the database into blocks (if the partition parameters have been set).

(b) .FOO.idx
a binary “index” of all the meta-data about each read allowing, for example, one to randomly access a read’s sequence (in the store .FOO.bps). It is 28N + 88 bytes in size where N is the number of reads in the database.

(c) .FOO.bps
a binary compressed “store” of all the DNA sequences. It is M/4 bytes in size where M is the total number of base pairs in the database.

(d) .FOO.qvs
a binary compressed “store” of the 5 Pacbio quality value streams for the reads. Its size is roughly 5/3M bytes depending on the compression achieved. This file only exists if .quiva files have been added to the database.

(e) .FOO.<track>.anno, .FOO.<track>.data
a track called <track> containing customized meta-data for each read. For example, the DBdust command annotates low complexity intervals of reads and records the intervals for each read in the two files .FOO.dust.anno and that are together called the “dust” track. Any kind of information about a read can be recorded, such as micro-sat intervals, repeat intervals, corrected sequence, etc. Specific tracks will be described as phases that produce them are created and released.

If one does not like the convention of the secondary files being invisible, then un-defining the constant HIDE_FILES in DB.h before compiling the library, creates commands that do not place a prefixing “.” before secondary file names, e.g. FOO.idx instead of .FOO.idx. One then sees all the files realizing a DB when listing the contents of a directory with ls.

While a Dazzler DB holds a collection of Pacbio reads, a Dazzler map DB or DAM holds a collection of contigs from a reference genome assembly. This special type of DB has been introduced in order to facilitate the mapping of reads to an assembly and has been given the suffix .dam to distinguish it from an ordinary DB. It is structurally identical to a .db except:

(a) there is no concept of quality values, and hence no .FOO.qvs file.

(b) every .fasta scaffold (a sequence with runs of N’s between contigs estimating the length of the gap) is broken into a separate contig sequence in the DB and the header for each scaffold is retained in a new .FOO.hdr file.

(c) the original and first and last pulse fields in the meta-data records held in .FOO.idx, hold instead the contig number and the interval of the contig within its original scaffold sequence.

A map DB can equally well be the argument of any of the commands below that operate on normal DBs. In general, a .dam can be an argument anywhere a .db can, with the exception of routines or optioned calls to routines that involve quality values, or the special routines fasta2DAM and DAM2fasta that create a DAM and reverse said, just like the pair fasta2DB and DB2fasta. So in general when we refer to a database we are referring to either a DB or a DAM.

The command DBsplit sets or resets the current partition for a database which is determined by 3 parameters: (i) the total number of basepairs to place in each block, (ii) the minimum read length of reads to include within a block, and (iii) whether or not to only include the longest read from a given well or all reads from a well (NB: several reads of the same insert in a given well can be produced by the Pacbio instrument). Note that the length and uniqueness parameters effectively select a subset of the reads that contribute to the size of a block. We call this subset the trimmed data base. Some commands operate on the entire database, others on the trimmed database, and yet others have an option flag that permits them to operate on either at the users discretion. Therefore, one should note carefully to which version of the database a command refers to. This is especially important for any command that identifies reads by their index (ordinal position) in the database.

Once the database has been split into blocks, the commands DBshow, DBstats, and DBdust below and commands yet to come, such as the local alignment finder dalign, can take a block or blocks as arguments. On the command line this is indicated by supplying the name of the DB followed by a period and then a block number, e.g. FOO.3.db or simply FOO.3, refers to the 3’rd block of DB FOO (assuming of course it has a current partition and said partition has a 3rd block). One should note carefully that a block is a contiguous range of reads such that once it is trimmed has a given size in base pairs (as set by DBsplit). Thus like an entire database, a block can be either untrimmed or trimmed and one needs to again be careful when giving a read index to a command such as DBshow.

The command DBdust runs the symmetric DUST low complexity filter on every read in a given untrimmed DB or DB block and it is the first example of a phase that produces a track. It is incremental in that you can call it repeatedly on a given DB and it will extend the track information to include any new reads added since the last time it was run on the DB. Reads are analyzed for low-complexity intervals and these intervals constitute the information stored in the .dust-track. These intervals are subsequently “soft”-masked by the local alignment command dalign, i.e. the low-complexity intervals are ignored for the purposes of finding read pairs that might have a significant local alignment between them, but any local alignment found will include the low-complexity intervals.

When DBdust is run on a block, say FOO.3, it is run on the untrimmed version of the block and it produces a block track in the files .FOO.3.dust.anno and, whereas running DBdust on the entire DB FOO produces .FOO.dust.anno and Thus one can run an independent job on each block of a DB, and then after the fact stitch all the block tracks together with a call to Catrack, e.g. “Catrack FOO dust“. Catrack is a general command that merges any sequence of block tracks together into a track for the specified DB. For example, a 50X human dataset would partition into roughly 375, 400Mb blocks, for which DBdust takes about 20 seconds on a typical processor. One can either call DBdust on the entire DB and wait 375*20 ~ 2.07 hours, or run 375 jobs on each block on your local compute cluster, waiting 1 or 2 minutes, depending on how loaded your cluster is, and then call Catrack on the block tracks, which takes another half a minute at most.

Tracks that encode intervals of the underlying reads in the data base, of which the “dust” track is an example, are specifically designated interval tracks, and commands like DBshow, DBstats, and daligner can accept and utlize one or more such tracks, typically specified through a -m option on the command line.

DBshow and DBstats are routines that allow one to examine the current contents of a DB or DB block. DBstats gives one a summary of the contents of an untrimmed or trimmed DB of DB block including a histogram of read lengths and histograms of desired interval tracks if desired. DBshow allows your to display the sequence and/or quality streams of any specific set of reads or read ranges in the DB or DB block where the output can be in Pacbio fasta or quiva format depending on which option flags are set. There is a flag that controls whether the command applies to the entire DB or the trimmed DB and one must be careful, as the, say 5th read of the DB may not be the 5th read of the trimmed DB. A nice trick here is that since the output is in the Pacbio formats, one can then create a “mini-DB” of the set of reads output by DBshow, by calling fasta2DB and/or quiva2DB on its output. This conveniently allows a developer to test/debug a phase on a small, troublesome subset of reads in the DB.

DBrm removes all the files, including all tracks, realizing a given DB. It is safer to use this command than to attempt to remove a DB by calling rm on each hidden file. Lastly, this release includes a simple simulator, simulator, that creates a synthetic Pacbio data set useful for testing purposes.

A precise, detailed, and up-to-date command reference can be found here.