Scrubbing Reads for Better Assembly


In an earlier post, I introduced the concept of  intrinsic quality values — a statistic indicating the quality of every say TP=100bp interval of a read — that could be computed from the pile of local alignments for each read computed by daligner with the -t option set to TP.  This idea is important because of the complex and variable error profile of PacBio long reads which I illustrated with the following hypothetical example:

Error profile of a typical long read. The average error rate is say 12% but it varies and occasionally is pure junk.

The quality values together with the pattern of alignments in a read’s pile allow one to detect these, sometimes long, low-quality segments of a read, as well as detect when the read is a chimer or contains undetected adaptamer sequence that should have been excised.  These artifacts are quite frequent in PacBio data sets and apart from repeats are the major cause of poor assembly.  We call the process of repairing the low-quality sequence and removing the chimers and missed adaptamers, scrubbing, and formally the goal of a scrubber is to edit a set of reads with as little loss of data as possible so that:

  1. every edited read is a contiguous segment of the underlying genome being sequenced, and
  2. every portion of a read is of reasonable quality, i.e. <20% error or Q8.

The conditions above are exactly those assumed by a string graph construction algorithm and any data that doesn’t meet these conditions introduces breaks and false branches in the string graph.  For this reason I choose to scrub hard, i.e. my scrubber will over call chimers and adaptamers rather than miss them, so that condition 1 above is meet by all the data after scrubbing.

The scrubber module consists of a pipeline of 4-5 modules as opposed to a single program that performs the end-to-end task.  We structured it this way so that one can run each pass in a block-parallel manner on an HPC cluster, and secondarily, so that a creative user can potentially customize or improve aspects of the overall process.  The figure below, shows the pipeline:

Data flow, including tracks, of the scrubbing pipeline

Each program in the pipeline can be run on the overlap piles for a block of the underlying DB in parallel, allowing a simple HPC parallel scheme.  By doing so each program per force outputs only the portion of a track for the intput block (i.e.. .qual, .trim, .patch), and these block tracks must be concatenated into a single track for the entire database with Catrack before the next step in the pipeline can be performed.  Fortunately, this serial step is very fast, basically amounting to concatenating a number of files of total size O(R) where R is the number of reads in the database.

The first step in scrubbing is to compute the quality values for every read segment and record them in a qual track with DASqv as described in an earlier post.  Recall that with the -v option DASqv displays a histogram of the QV’s over the dataset.  Using this histogram one should select a threshold GOOD for which 80% of the QV’s are not greater than this value, and another BAD for which 5-7% of the QV’s are not less than this value.  In the next two steps of the pipeline, QV’s ≤ GOOD will be considered “(definitley) good”, and QV’s ≥ BAD will be considered “(definitely) bad”.  Any QV’s between the two thresholds will be considered of unknown quality.   Decreasing these values, increases the amount of data that will be scrubbed away as not of sufficient quality, and increasing these thresholds retains more data but at a sacrifice in quality.  The percentile guidelines above represent good settings for typical PacBio data (that is about 80% of the data actually is good and 5% of it is actually crap).

The GOOD/BAD thresholds must be supplied as the -g and -b parameters to the next phase of the pipeline realized by DAStrimDAStrim first uses the quality values and the supplied -g and -b thresholds to determine a sequence of high-quality (HQ) segments for the A-read of each pile.  Formally, an HQ segment begins and ends with a good  trace point interval interval, contains no bad intervals, and is at least 400bp long.  Any intervals before the first HQ segment and after the last HQ segment, are clipped/trimmed from the beginning and end of the read, respectively.  The main task of DAStrim, is to then determine what to do with the gaps between the HQ segments of a read.  There are 4 cases:

  1. The gap is completely covered/spanned by lots of local alignments with other reads – then the gap is simply low quality sequence and deemed LOW-Q. This is illustrated in the example at right by the leftmost low-quality gap.
  2. The gap is not spanned by local alignments with other reads, but most of the local alignments to the left (X) and right (Y) of the gap are for the same B-read and the distance between the two B-read segments in the relevant local alignments are consistent with the size of the gap – then the gap is a (very) low quality (and often long) drop out in the A-read and the gap is deemed LOW-Q.
  3. the gap is not spanned but the left (X) and right (Y) alignment sets have the relevant B-reads in the complement orientation with respect to each other – the gap is due to an ADAPTAMER. Typically, the gap is also very sharp with the gap between paired local alignments being very narrow as in the example below left.  In the situation where one or more adatamer gaps is found for a read, all but the longest subread between two of the adaptamers is thrown away.


  1. the gap is not spanned and there is no correspondence between the reads in the local alignments to the left and right – the gap is due to a chimeric join in the A-read and the gap is designated CHIMER as illustrated in the example above right.  Every chimeric gap splits the A-read into two new reads.

DAStrim outputs a track (or track block) called trim that for each read encodes the begin/end coordinates for each HQ segment and the classification of each intervening gap (either LOW-Q or CHIMER).  With the -v option it also outputs a report describing how much data was trimmed, designated LOW-Q, etc.  In the example below, one sees that the left column is the number of reads or items, and the right column is the total number of bases in those items.  3,117 reads totalling 11.6% of the bases in the data set were found to be either garbage or were totally masked by the DAMASKER module and thrown out.  Of the remaining reads, 5.7% of the bases were trimmed from the beginning of reads, and 2.5% from the end of reads.  37 missed adaptamer sites were found and 117.8Kbp was trimmed away leaving only the longest sub-read between adaptamers.

Input:     22,403 (100.0%) reads      250,012,393 (100.0%) bases

Discard:    3,117 ( 13.9%) reads       29,036,136 ( 11.6%) bases
5' trim:   13,749 ( 61.4%) reads       14,137,800 (  5.7%) bases
3' trim:    7,919 ( 35.3%) reads        6,316,157 (  2.5%) bases
Adapter:       37 (  0.2%) reads          117,816 (  0.0%) bases

Gaps:      35,253 (157.4%) gaps        15,205,800 (  6.1%) bases
Low QV:    30,903 (137.9%) gaps         8,834,200 (  3.5%) bases
Span'd:     3,346 ( 14.9%) gaps         4,441,700 (  1.8%) bases
Chimer:     1,004 (  4.5%) gaps         1,929,900 (  0.8%) bases

Clipped:   25,826 clips                49,607,909 ( 20.6%) bases
Patched:   34,249 patches              13,275,900 (  5.3%) bases

In the adaptamer-free reads that were retained, there were 35,253 gaps between HQ segments.  30,903 were deemed LOW-Q by virtue of being spanned by alignments (Low QV) and 3,346 were deemed LOW-Q by virture of consistent alignment pairs (Span’d).  The remaining 1,004 where deemed to be CHIMERs and will be split creating 1,004 new reads.  The last two lines tell one that a total of 20.6% of the bases were lost altogether because of trimming, adapatmers, and chimers, and that 5.3% of the bases were in low-quality gaps that will be patched with better sequence in the ensuing scrubbing passes.

The 3rd phase of scrubbing is realized by DASpatch which selects a good B-read segment to replace each LOW-Q gap with.  We call this high-quality sequence a patch.  The program also reviews all the gap calls in light of the fact that the trimming of reads affects their B-read alignment intervals in the piles of other reads, occasionally implying that a gap should have been called a CHIMER and not LOW-QDASpatch further changes a LOW-Q gap to a CHIMER call if it cannot find what it thinks is a good B-read segment to patch the gap.  The number of these calls is generally very small, often 0, and their number is reported when the -v option is set.  DASpatch outputs a track (or track block) called patch that encodes the sequence of B-read segments to use for patching each low quality gap in a read.

The final phase of scrubbing is to perform the patching/editing of reads that is encoded in the trim and patch tracks produced in the last two phases, and produce a new data base of all the resulting scrubbed reads.  This is accomplished by DASedit that takes the name of the original database and the name you would like for the new database as input.  This phase runs sequentially and can can be quite slow due to the random access pattern of read sequences required to patch, so you might want to go do something else while it runs.  But when its finished you will have a database of reads that are typically 99.99% chimer and adatamer free, and that contain almost no stretches of very low-quality base calls.

The new database does not have a .qvs or .arr component, that is it is a a sequence or S-database (see the original Dazzler DB post).  Very importantly, the new database has exactly the same block divisions as the original.  That is, all patched subreads in a block of the new database have been derived from reads of the same block in the original database, and only from those reads.  DASedit also produces a map track that for each read encodes the original read in the source DB that was patched and the segments of that read that were high-quality (i.e. not patched).  A program DASmap output this information in either an easy-to-read or an easy-to-parse format.  A small example of a snippet output by DASmap is as follows:

55 -> 57(2946) [400,2946]
56 -> 58(11256) [700,1900]
57 -> 58(11256) [6600,9900] 83 [10000,11256]
58 -> 59(12282) [400,4100] 88 [4200,9400] 97 [9500,12282]

The first line indicates that read 55 in the patched database was derived from read 57 in the original database and is the segment from [400,2946] of that read.  Reads 56 and 57 were both derived from read 58 in the original DB, and read 57 consists of segments [6600,9900] and [10000,11256] of read 58 with a patch between them of 83bp (but the source of the patch data is not given).  The read length of each original read is given for convenience.  The purpose of this map is to allow you to map back to the original reads in the final consensus phase of an assembly where one will want to use the original data along with its Quiver or Arrow data encoded in the .qvs/.arr component of the DB.

A database of scrubbed reads can be used in three ways as illustrated in the figure at the end of this paragraph.  In the first use-case, a user can simply take the database of scrubbed reads and use them as input to their favorite long-read assembler (e.g. Falcon or Canu).  In our experience doing so improves the assemblies produced by these third party systems.  For example, in the plot at right, Falcon’s error corrector (EC) tends to break reads at long low quality gaps producing the red line read-length profile given a data set with the blue read-length distribution.  When Falcon starts on our scrubbed reads with the grey read-length profile, (which is significantly better than the EC profile), its own error correct leaves these longer scrubbed reads intact. Longer reads into the string graph phase of Falcon’s assembler implies a more coherent assembly, i.e. longer contigs.The second use-case, is to rerun the daligner on this set of reads to produce .las files anew over the scrubbed reads.  The down side is that this takes considerable compute time.  However, because the reads are scrubbed, one should expect that reads properly overlap and a purely local alignments can now be ignored as repeat-induced.  Giving the result set of overlaps to a string graph algorithm for assembly should result in better assembly as many repeat-induced correspondences have been removed.

The final third use-case, is rather than re-run daligner, to take advantage of the fact that the piles computed the first time already contain all the pairs of reads that align, and so in fact all that needs to be done is to “re-align” an local alignment found between original reads in terms of the derived reads.  This includes extending the local alignment when the introduction of a patch allows it to now extend through a formerly low-quality region.  To this end I produced a program DASrealign that takes a pair of blocks of the scrubbed DB and the original .las file for the block pair (in the original DB), and produces the realigned overlap piles in a .las file for the scrubbed DB blocks.  Not carefully that this requires (1) that you have kept the block pair .las files produced by the first phase of the HPC.daligner process and not removed them, and (2) that after realigning every block pair, you must then merge these to produce the overlap blocks for the scrubbed DB.  This process is roughly 40x faster than running daligner from scratch.  The trade off is that some alignments might be missed that would otherwise be found if daligner were rerun on the scrubbed reads, and the trace point encoding of a patched alignment is no longer uniform in the A-read and so one could not scrub the patched reads a second time with the piles so produced.  On the otherhand, one can take just the overlaps that are produced this way and perform a string graph assembly.

The pipeline above careful identifies and removes read artifacts such as chimers and adapatamers, and repairs high error-rate read segments, providing a set of reads that will assemble well (without correction).  However, ultimately we believe good assembly will further require correction in order to carefully separate haplotypes and low copy number repeats prior to the string graph phase.  Therefore, our ultimate ongoing development plan is to replace DASedit with a program that produces error corrected A-reads based on the analysis of DASqv-DAStrim-DASpatch and to then realign (at much higher stringency / lower error rate) the resulting reads.  We expect this process to take more time than simply applying DASrealign, but with the advantage that the reads will also be error corrected.  Never the less, just scrubbing reads and feeding them into another assembler is proving to have sufficient value that I decided to release the current pipeline.  Best, Gene

DFA – The Dagstuhl Format for Assembly


A proposal for a comprehensive assembly format

The value of a standard for encoding DNA sequence assemblies is based on the observation that in general different development teams build assemblers, assembly editors, and assembly visualizers, respectively, because of the complexity and distinct nature of each of the three applications.  While these apps should certainly use tailored encodings internally for efficiency, the nexus between the three types of tools would benefit from a standard encoding format that would allow them to all inter-operate.  In other words, if I build an assembler it would be great if I could use someone’s string graph drawing package or multi-alignment editor, or better yet, be able to choose among several depending on which serves a particular purpose best.  As a drawing:screen-shot-2016-09-11-at-12-59-20-pmSo Jason Chin, Richard Durbin, and myself found ourselves together at a workshop meeting in Dagstuhl Germany at the end of August and hammered out an initial proposal for an assembly format that we think is quite comprehensive and general purpose. We thought GFA was a good start point and built a more comprehensive design around it.  We are calling this preliminary version the “Dagstuhl Format for Assembly” or DFA and have placed it on Github here where one can fork and version control the document.  We also include it below.  You are welcome to comment or make suggestions either here or at Github (via the “Issues” mechanism).  Our hope is that eventually some version of DFA that includes at least all the features in the proposal might find adoption, possibly as “GFA 2“.


DFA is a generalization of GFA that allows one to specify an assembly graph in either less detail, e.g. just the topology of the graph, or more detail, e.g. the multi-alignment of reads giving rise to each sequence.  We further designed it to be a suitable representation for a string graph at any stage of assembly, from the graph of all overlaps, to a final resolved assembly of contig paths with multi-alignments.  Apart from meeting our needs, the extensions also supports other assembly and variation graph types.

We further want to emphasize that we are proposing a core standard.  As will be seen later in the technical specification, the format is extensible in that additional description lines can be added and additional SAM tags can be appended to core description lines.

In overview, an assembly is a graph of vertices called segments representing sequences that are connected by edges that denote local alignments between the vertex sequences.  At a minimum one must specify the length of the sequence represented, further specifying the actual sequence is optional.  In the direction of more detail, one can optionally specify a collection of external sequences, called fragments, from which the sequence was derived (if applicable) and how they multi-align to produce the sequence.  Similarly, the specification of an edge need only describe the range of base pairs aligned in each string, but may optionally contain a Dazzler-trace or a CIGAR string that describes the alignment of the edge.  Traces are a space-efficient Dazzler assembler concept that allow one to efficiently compute an alignment in linear time, and CIGAR strings are a SAM concept explicitly detailing the details of a particular alignment.  Many new technologies such a Hi-C and BioNano maps organize segments into scaffolds along with traditional data sets involving paired reads, and so a gap concept is also introduced so that order and orientation between disjoint contigs of an assembly can be described.  Finally, one -can create named sub-graphs of the graph, that we call a group, that includes the special case of a path.


<spec>     <- ( <header> | <segment> | <fragment>
| <edge> | <gap> | <group> )+

<header>   <- H VN:Z:1.0 {TS:i:<trace spacing>}

<segment>  <- S <sid:id> <slen:int> <sequence>

<fragment> <- F <sid:id> [+-] <external:id>
<sbeg:pos> <send:pos> <fbeg:pos> <fend:pos> <alignment>

<edge>     <- E <eid:id> <sid1:id> [+-] <sid2:id>
<beg1:pos> <end1:pos> <beg2:pos> <end2:pos> <alignment>

<gap>      <- G <eid:id> <sid1:id> [+-] <sid2:id> <dist:int> <var:int>

<group>    <- P[UO] <pid:id> <item>([ ]<item>)*

<id>        <- [!-~]+
<item>      <- <sid:id> | <eid:id> | <pid:id>
<pos>       <- [$]<int>
<sequence>  <- * | [A-Za-z]+
<alignment> <- * | <trace array> | <CIGAR string>

<CIGAR string> <- ([0-9]+[MX=DIP])+
<trace array>  <- <int>(,<int>)*

In the grammar above all symbols are literals other than tokens between <>, the derivation operator <-, and the following marks:

  • {} enclose an optional item
  • | denotes an alternative
  • * zero-or-more
  • + one-or-more
  • [] a set of one character alternatives.

Like GFA, DFA is tab-delimited in that every lexical token is separated from the next by a single tab.

Each descriptor line must begin with a letter and lies on a single line with no white space before the first symbol.   The tokens that generate descriptor lines are <header>, <segment>, <fragment>, <edge>, <gap>, and <group>.  Any line that does not begin with a recognized code (i.e. H, S, F, E, G, or P) can be ignored.  This will allow users to have additional descriptor lines specific to their special processes.  Moreover, the suffix of any DFA descriptor line may contain any number of user-specific SAM tags which are ignored by software designed to support the core standard.


The header contains an optional ‘VN’ SAM-tag version number, 1.0, and an optional ‘TS’ SAM-tag specifying the trace point spacing for any Dazzler traces specified to accelerate alignment computation. Any number of header lines containing SAM-tags may occur.

A segment is specified by an S-line giving a user-specified ID for the sequence, its length in bases, and the string of bases denoted by the segment or * if absent.  The length does not need to be the actual length of the sequence, if given, but is rather an indication to a drawing program of how long to draw the representation of the segment.  The segment sequences and any CIGAR strings referring to them if present follow the unpadded SAM convention.

Fragments, if present, are encoded in F-lines that give (a) the segment they belong to, (b) the orientation of the fragment to the segment, (c) an external ID that references a sequence in an external collection (e.g. a database of reads or segments in another DAF or SAM file), (d) the interval of the vertex segment that the external string contributes to, and (e) the interval of the fragment that contributes to to segment. One concludes with either a trace or CIGAR string detailing the alignment, or a * if absent.

Edges are encoded in E-lines that in general represent a local alignment between arbitrary intervals of the sequences of the two vertices in question. One gives first an edge ID and then the segment ID’s of the two vertices and a + or – sign between them to indicate whether the second segment should be complemented or not.  An edge ID does not need to be unique or distinct from a segment ID, unless you plan to refer to it in a P-line (see below).  This allows you to use something short like a single *-sign as the id when it is not needed.

One then gives the intervals of each segment that align, each as a pair of positions. A position is either an integer or the special symbol $ followed immediately by an integer. If an integer, the position is the distance from the left end of the segment, and if $ followed by an integer, the distance from the right end of the sequence. This ability to define a 0-based position from either end of a segment allows one to conveniently address end-relative positions without knowing the length of the segments. Note carefully, that the segment and fragment intervals in an F-line are also positions.

Position intervals are always intervals in the segment in its normal orientation. If a minus sign is specified, then the interval of the second segment is reverse complemented in order to align with the interval of the first segment. That is, S s1 - s2 b1 e1 b2 e2 aligns s1[b1,e1] to the reverse complement of s2[b2,e2].

A field for a CIGAR string or Dazzler-trace describing the alignment is last, but may be absent by giving a *. One gives a CIGAR string to describe an exact alignment relationship between the two segments. A trace string by contrast is given when one simply wants an accelerated method for computing an alignment between the two intervals. If a * is given as the alignment note that it is still possible to compute the implied alignment by brute force.

The DFA concept of edge generalizes the link and containment lines of GFA.  For example a GFA edge which encodes what is called a dovetail overlap (because two ends overlap) is simply a DFA edge where end1 = $0 and beg2 = 0 or beg1 = 0 and end2 = $0. A GFA containment is modeled by the case where beg2 = 0 and end2 = $0 or beg1 = 0 and end1 = $0. The figure below illustrates:


Special codes could be adopted for dovetail and containment relationships but the thought is there is no particular reason to do so, the use of the special characters for terminal positions makes their identification simple both algorithmically and visually, and the more general scenario allows interesting possibilities.  For example, one might have two haplotype bubbles shown in the “Before” picture below, and then in a next phase choose a path through the bubbles as the primary “contig”, and then capture the two bubble alternatives as a vertex linked with generalized edges shown in the “After” picture.  Note carefully that you need a generalized edge to capture the attachment of the two haplotype bubbles in the “After” picture.


While one has graphs in which vertex sequences actually overlap as above, one also frequently encounters models in which there is no overlap (basically edge-labelled models captured in a vertex-labelled form).  This is captured by edges for which beg1 = end1 and beg2 = end2 (i.e. 0-length overlap)!

While not a concept for pure DeBrujin or long-read assemblers, it is the case that paired end data and external maps often order and orient contigs/vertices into scaffolds with intervening gaps.  To this end we introduce a gap edge described in G-lines that give the estimated gap distance between the two vertex sequences and the variance of that estimate or 0 if no estimate is available.  Relationships in E-lines are fixed and known, where as in a G-line, the distance is an estimate and the line type is intended to allow one to define assembly scaffolds.

A group encoding on a P-line allows one to name and specify a subgraph of the overall graph. Such a collection could for example be hilighted by a drawing program on command, or might specify decisions about tours through the graph. The P is immediately followed by U or O indicating either an unordered or ordered collection. The remainder of the line then consists of a name for the collection followed by a non-empty list of ID’s referring to segments and/or edges that are separated by single spaces (i.e. the list is in a single column of the tab-delimited format). P-lines with the same name are considered to be concatenated together in the order in which they appear, and a group list may refer to another group recursively.

An unordered collection refers to the subgraph induced by the vertices and edges in the collection (i.e. one adds all edges between a pair of segments in the list and one adds all segments adjacent to edges in the list.) An ordered collection captures paths in the graph consisting of the listed objects and the implied adjacent objects between consecutive objects in the list (e.g. the edge between two consecutive segments, the segment between two consecutive edges, etc.)

Note carefully that there may be several edges between a given pair of segments, so in in this event it is necessary that edges have unique IDs that can be referred to as a segment pair does not suffice.  The convention is that every edge has an id, but this id need only be unique and distinct from the set of segment ids when one wishes to refer to the edge in a P-line.  So an id in a P-line refers first to a segment id, and if there is no segment with that id, then to the edge with that id.  All segment id’s must be unique, and it is an error if an id in a P-line does not refer to a unique edge id when a segment with the id does not exist.


<link>     <- L <eid:id> <sid1:id> [+-] <sid2:id> [+-]
                         <ovl1:int> <ovl2:int> {<alignment>}

Some users want a specific specification for edges that encode dovetail overlaps reminiscent of the L-line in GFA. The table below gives the direct translation from an L-line into the more general E-line, and serves effectively as the definition of an L-line’s semantics:

L s1 + s2 + o1 o2       <==>      E s1 + s2 $o1  $0   0  o2
L s1 + s2 - o1 o2       <==>      E s1 - s2 $o1  $0 $o2  $0
L s1 - s2 + o1 o2       <==>      E s1 - s2   0  o1   0  o2
L s1 - s2 - o1 o2       <==>      E s1 + s2   0  o1 $o2  $0


DFA is a super-set of GFA, that is, everything that can be encoded in GFA can be encoded in DFA, with a relatively straightforward transformation of each input line.

On the other hand, a GFA parser, even one that accepts optional SAM-tags at the end of a defined line type, and which ignores line types not defined in GFA, will not accept a DFA specification because of changes in the defined fields of the S- and P-lines. Moreover, any useful DFA reader must process E-lines which replace L- and C-lines.

The syntactic conventions, however, are identical to GFA, so upgrading a GFA parser to a DFA parser is relatively straight forward. Each description line begins with a single letter and has a fixed set of fields that are tab-delimited. The changes are as follows:

  1. There is an integer length field in S-lines.
  2. The L- and C-lines have been replaced by a consolidated E-line.
  3. The P-line has been enhanced to encode both subgraphs and paths, and can take edge id’s, obviating the need for orientation signs and alignments between segments.

  4. There is a new F-line for describing multi-alignments.
  5. Alignments can be trace length sequences as well as CIGAR strings.
  6. Positions have been extended to include a notation for 0-based position with respect to either end of a segment (the default being with respect to the beginning).