Scrubbing Reads for Better Assembly


In an earlier post, I introduced the concept of intrinsic quality values — a statistic indicating the quality of each, say, TP=100bp interval of a read — that can be computed from the pile of local alignments for each read produced by daligner with the -t option set to TP.  This idea is important because of the complex and variable error profile of PacBio long reads, which I illustrated with the following hypothetical example:

Error profile of a typical long read.  The average error rate is, say, 12%, but it varies, and occasional stretches are pure junk.

The quality values together with the pattern of alignments in a read’s pile allow one to detect these, sometimes long, low-quality segments of a read, as well as detect when the read is a chimer or contains undetected adaptamer sequence that should have been excised.  These artifacts are quite frequent in PacBio data sets and apart from repeats are the major cause of poor assembly.  We call the process of repairing the low-quality sequence and removing the chimers and missed adaptamers, scrubbing, and formally the goal of a scrubber is to edit a set of reads with as little loss of data as possible so that:

  1. every edited read is a contiguous segment of the underlying genome being sequenced, and
  2. every portion of a read is of reasonable quality, i.e. <20% error or Q8.

The conditions above are exactly those assumed by a string graph construction algorithm, and any data that doesn’t meet these conditions introduces breaks and false branches in the string graph.  For this reason I choose to scrub hard, i.e. my scrubber will over-call chimers and adaptamers rather than miss them, so that condition 1 above is met by all the data after scrubbing.

The scrubber module consists of a pipeline of 4-5 modules as opposed to a single program that performs the end-to-end task.  We structured it this way so that one can run each pass in a block-parallel manner on an HPC cluster, and secondarily, so that a creative user can potentially customize or improve aspects of the overall process.  The figure below shows the pipeline:

Data flow, including tracks, of the scrubbing pipeline

Each program in the pipeline can be run on the overlap piles for a block of the underlying DB in parallel, allowing a simple HPC parallel scheme.  By doing so, each program perforce outputs only the portion of a track for the input block (i.e. .qual, .trim, .patch), and these block tracks must be concatenated into a single track for the entire database with Catrack before the next step in the pipeline can be performed.  Fortunately, this serial step is very fast, basically amounting to concatenating a number of files of total size O(R) where R is the number of reads in the database.
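
As a concrete illustration of the block-parallel scheme, here is a minimal Python sketch that emits the per-block jobs for one pass plus the serial Catrack merge.  The block-suffix naming (Project.1, Project.2, …) follows the Dazzler convention, but the exact command lines an HPC script generates may differ, so treat this as a schematic, not the real script generator.

```python
# Sketch of the block-parallel scheme: one command per DB block (these can
# be submitted to a cluster in parallel), then a single serial Catrack
# concatenation.  Naming and arguments are illustrative.

def scrub_step_commands(db, nblocks, prog="DASqv", args="-c40"):
    """Per-block jobs for one pipeline pass, plus the Catrack merge step."""
    parallel = [f"{prog} {args} {db} {db}.{i}" for i in range(1, nblocks + 1)]
    serial = f"Catrack {db} qual"     # concatenate the block tracks
    return parallel, serial

jobs, merge = scrub_step_commands("Project", 3)
```

The same pattern applies to each pass of the pipeline; only the program name, arguments, and track name change.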

The first step in scrubbing is to compute the quality values for every read segment and record them in a qual track with DASqv, as described in an earlier post.  Recall that with the -v option DASqv displays a histogram of the QV’s over the dataset.  Using this histogram one should select a threshold GOOD for which 80% of the QV’s are not greater than this value, and another BAD for which 5-7% of the QV’s are not less than this value.  In the next two steps of the pipeline, QV’s ≤ GOOD will be considered “(definitely) good”, and QV’s ≥ BAD will be considered “(definitely) bad”.  Any QV’s between the two thresholds will be considered of unknown quality.  Decreasing these thresholds increases the amount of data that will be scrubbed away as not of sufficient quality, and increasing them retains more data but at a sacrifice in quality.  The percentile guidelines above represent good settings for typical PacBio data (that is, about 80% of the data actually is good and 5% of it is actually crap).
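
The threshold selection from the -v histogram can be made mechanical.  The sketch below, in Python, picks GOOD as the smallest QV at or below which 80% of the data falls, and BAD as the smallest QV at or above which only ~6% of the data lies.  The percentile defaults are the guidelines from the text; the histogram representation (QV → count) is an assumption of mine, not DASqv output.

```python
# Pick -g/-b thresholds from a QV histogram, following the 80% / 5-7%
# percentile guidelines.  hist maps QV value -> number of intervals.

def pick_thresholds(hist, good_frac=0.80, bad_frac=0.06):
    total = sum(hist.values())
    qvs = sorted(hist)

    # GOOD: smallest QV with at least good_frac of the data at or below it
    good, cum = qvs[-1], 0
    for v in qvs:
        cum += hist[v]
        if cum / total >= good_frac:
            good = v
            break

    # BAD: smallest QV with at most bad_frac of the data at or above it
    bad, cum = qvs[-1] + 1, 0
    for v in reversed(qvs):
        if (cum + hist[v]) / total > bad_frac:
            break
        cum += hist[v]
        bad = v
    return good, bad
```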

The GOOD/BAD thresholds must be supplied as the -g and -b parameters to the next phase of the pipeline, realized by DAStrim.  DAStrim first uses the quality values and the supplied -g and -b thresholds to determine a sequence of high-quality (HQ) segments for the A-read of each pile.  Formally, an HQ segment begins and ends with a good trace point interval, contains no bad intervals, and is at least 400bp long.  Any intervals before the first HQ segment and after the last HQ segment are clipped/trimmed from the beginning and end of the read, respectively.  The main task of DAStrim is then to determine what to do with the gaps between the HQ segments of a read.  There are 4 cases:

  1. The gap is completely covered/spanned by lots of local alignments with other reads – then the gap is simply low quality sequence and deemed LOW-Q. This is illustrated in the example at right by the leftmost low-quality gap.
  2. The gap is not spanned by local alignments with other reads, but most of the local alignments to the left (X) and right (Y) of the gap are for the same B-read and the distance between the two B-read segments in the relevant local alignments are consistent with the size of the gap – then the gap is a (very) low quality (and often long) drop out in the A-read and the gap is deemed LOW-Q.
  3. The gap is not spanned but the left (X) and right (Y) alignment sets have the relevant B-reads in the complement orientation with respect to each other – the gap is due to an ADAPTAMER. Typically, the gap is also very sharp, with the gap between paired local alignments being very narrow as in the example below left.  In the situation where one or more adaptamer gaps are found for a read, all but the longest subread between two of the adaptamers is thrown away.


  4. The gap is not spanned and there is no correspondence between the reads in the local alignments to the left and right – the gap is due to a chimeric join in the A-read and the gap is designated CHIMER as illustrated in the example above right.  Every chimeric gap splits the A-read into two new reads.
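
The HQ-segment determination that precedes this gap classification can be sketched as follows.  This is a simplification of what DAStrim actually does, working only from a read’s QV vector: a run starts and ends on a good (≤ GOOD) interval, tolerates unknown but not bad (≥ BAD) intervals, and is kept only if it spans at least 400bp.

```python
# Find high-quality (HQ) segments from a read's intrinsic QV vector
# (one QV per trace point interval).  A simplified sketch, not DAStrim.

DELTA = 100      # trace point spacing (daligner default)
MIN_LEN = 400    # minimum HQ segment length in bp

def hq_segments(qvs, good, bad):
    segs, start = [], None
    for i, q in enumerate(qvs + [bad]):           # sentinel flushes the last run
        if q >= bad:                              # a bad interval ends any run
            if start is not None:
                end = i - 1
                while end >= start and qvs[end] > good:   # trim trailing unknowns
                    end -= 1
                if end >= start and (end - start + 1) * DELTA >= MIN_LEN:
                    segs.append((start * DELTA, (end + 1) * DELTA))
                start = None
        elif start is None and q <= good:         # runs must begin on a good one
            start = i
    return segs
```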

DAStrim outputs a track (or track block) called trim that for each read encodes the begin/end coordinates for each HQ segment and the classification of each intervening gap (either LOW-Q or CHIMER).  With the -v option it also outputs a report describing how much data was trimmed, designated LOW-Q, etc.  In the example below, one sees that the left column is the number of reads or items, and the right column is the total number of bases in those items.  3,117 reads totalling 11.6% of the bases in the data set were found to be either garbage or were totally masked by the DAMASKER module and thrown out.  Of the remaining reads, 5.7% of the bases were trimmed from the beginning of reads, and 2.5% from the end of reads.  37 missed adaptamer sites were found and 117.8Kbp was trimmed away leaving only the longest sub-read between adaptamers.

Input:     22,403 (100.0%) reads      250,012,393 (100.0%) bases

Discard:    3,117 ( 13.9%) reads       29,036,136 ( 11.6%) bases
5' trim:   13,749 ( 61.4%) reads       14,137,800 (  5.7%) bases
3' trim:    7,919 ( 35.3%) reads        6,316,157 (  2.5%) bases
Adapter:       37 (  0.2%) reads          117,816 (  0.0%) bases

Gaps:      35,253 (157.4%) gaps        15,205,800 (  6.1%) bases
Low QV:    30,903 (137.9%) gaps         8,834,200 (  3.5%) bases
Span'd:     3,346 ( 14.9%) gaps         4,441,700 (  1.8%) bases
Chimer:     1,004 (  4.5%) gaps         1,929,900 (  0.8%) bases

Clipped:   25,826 clips                49,607,909 ( 20.6%) bases
Patched:   34,249 patches              13,275,900 (  5.3%) bases

In the adaptamer-free reads that were retained, there were 35,253 gaps between HQ segments.  30,903 were deemed LOW-Q by virtue of being spanned by alignments (Low QV) and 3,346 were deemed LOW-Q by virtue of consistent alignment pairs (Span’d).  The remaining 1,004 were deemed to be CHIMERs and will be split, creating 1,004 new reads.  The last two lines tell one that a total of 20.6% of the bases were lost altogether because of trimming, adaptamers, and chimers, and that 5.3% of the bases were in low-quality gaps that will be patched with better sequence in the ensuing scrubbing passes.

The 3rd phase of scrubbing is realized by DASpatch, which selects a good B-read segment with which to patch each LOW-Q gap.  We call this high-quality sequence a patch.  The program also reviews all the gap calls in light of the fact that the trimming of reads affects their B-read alignment intervals in the piles of other reads, occasionally implying that a gap should have been called a CHIMER and not LOW-Q.  DASpatch further changes a LOW-Q gap to a CHIMER call if it cannot find what it thinks is a good B-read segment to patch the gap.  The number of these calls is generally very small, often 0, and their number is reported when the -v option is set.  DASpatch outputs a track (or track block) called patch that encodes the sequence of B-read segments to use for patching each low quality gap in a read.

The final phase of scrubbing is to perform the patching/editing of reads that is encoded in the trim and patch tracks produced in the last two phases, and to produce a new database of all the resulting scrubbed reads.  This is accomplished by DASedit, which takes the name of the original database and the name you would like for the new database as input.  This phase runs sequentially and can be quite slow due to the random access pattern of read sequences required to patch, so you might want to go do something else while it runs.  But when it’s finished you will have a database of reads that are typically 99.99% chimer and adaptamer free, and that contain almost no stretches of very low-quality base calls.

The new database does not have a .qvs or .arr component, that is, it is a sequence or S-database (see the original Dazzler DB post).  Very importantly, the new database has exactly the same block divisions as the original.  That is, all patched subreads in a block of the new database have been derived from reads of the same block in the original database, and only from those reads.  DASedit also produces a map track that for each read encodes the original read in the source DB that was patched and the segments of that read that were high-quality (i.e. not patched).  The program DASmap outputs this information in either an easy-to-read or an easy-to-parse format.  A small example of a snippet output by DASmap is as follows:

55 -> 57(2946) [400,2946]
56 -> 58(11256) [700,1900]
57 -> 58(11256) [6600,9900] 83 [10000,11256]
58 -> 59(12282) [400,4100] 88 [4200,9400] 97 [9500,12282]

The first line indicates that read 55 in the patched database was derived from read 57 in the original database and is the segment from [400,2946] of that read.  Reads 56 and 57 were both derived from read 58 in the original DB, and read 57 consists of segments [6600,9900] and [10000,11256] of read 58 with a patch between them of 83bp (but the source of the patch data is not given).  The read length of each original read is given for convenience.  The purpose of this map is to allow you to map back to the original reads in the final consensus phase of an assembly where one will want to use the original data along with its Quiver or Arrow data encoded in the .qvs/.arr component of the DB.
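
For scripting against the easy-to-parse form, a small parser for lines like those above might look like this.  It assumes exactly the layout shown in the snippet — “new -> src(len)” followed by alternating [b,e] segments and patch lengths — and may not cover every variation the real DASmap emits.

```python
import re

# Parse one DASmap line into (new_id, src_id, src_len, segments, patches).
# Layout assumed from the snippet above; a sketch, not the official format spec.

def parse_dasmap(line):
    head, rest = line.split("->")
    new_id = int(head)
    m = re.match(r"\s*(\d+)\((\d+)\)\s*(.*)", rest.strip())
    src_id, src_len, tail = int(m.group(1)), int(m.group(2)), m.group(3)
    segs = [(int(b), int(e)) for b, e in re.findall(r"\[(\d+),(\d+)\]", tail)]
    patches = [int(p) for p in re.findall(r"\]\s+(\d+)\s+\[", tail)]
    return new_id, src_id, src_len, segs, patches
```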

A database of scrubbed reads can be used in three ways as illustrated in the figure at the end of this paragraph.  In the first use-case, a user can simply take the database of scrubbed reads and use them as input to their favorite long-read assembler (e.g. Falcon or Canu).  In our experience doing so improves the assemblies produced by these third party systems.  For example, in the plot at right, Falcon’s error corrector (EC) tends to break reads at long low quality gaps, producing the red line read-length profile given a data set with the blue read-length distribution.  When Falcon starts on our scrubbed reads with the grey read-length profile (which is significantly better than the EC profile), its own error corrector leaves these longer scrubbed reads intact.  Longer reads into the string graph phase of Falcon’s assembler imply a more coherent assembly, i.e. longer contigs.

The second use-case is to rerun daligner on this set of reads to produce .las files anew over the scrubbed reads.  The down side is that this takes considerable compute time.  However, because the reads are scrubbed, one should expect that reads properly overlap, and purely local alignments can now be ignored as repeat-induced.  Giving the resulting set of overlaps to a string graph algorithm for assembly should result in a better assembly, as many repeat-induced correspondences have been removed.

The third and final use-case is, rather than re-running daligner, to take advantage of the fact that the piles computed the first time already contain all the pairs of reads that align, so in fact all that needs to be done is to “re-align” each local alignment found between original reads in terms of the derived reads.  This includes extending a local alignment when the introduction of a patch allows it to now extend through a formerly low-quality region.  To this end I produced a program DASrealign that takes a pair of blocks of the scrubbed DB and the original .las file for the block pair (in the original DB), and produces the realigned overlap piles in a .las file for the scrubbed DB blocks.  Note carefully that this requires (1) that you have kept the block pair .las files produced by the first phase of the HPC.daligner process and not removed them, and (2) that after realigning every block pair, you must then merge these to produce the overlap blocks for the scrubbed DB.  This process is roughly 40x faster than running daligner from scratch.  The trade-off is that some alignments might be missed that would otherwise be found if daligner were rerun on the scrubbed reads, and the trace point encoding of a patched alignment is no longer uniform in the A-read, so one could not scrub the patched reads a second time with the piles so produced.  On the other hand, one can take just the overlaps that are produced this way and perform a string graph assembly.

The pipeline above carefully identifies and removes read artifacts such as chimers and adaptamers, and repairs high error-rate read segments, providing a set of reads that will assemble well (without correction).  However, ultimately we believe good assembly will further require correction in order to carefully separate haplotypes and low copy number repeats prior to the string graph phase.  Therefore, our ongoing development plan is to replace DASedit with a program that produces error corrected A-reads based on the analysis of DASqv-DAStrim-DASpatch, and to then realign (at much higher stringency / lower error rate) the resulting reads.  We expect this process to take more time than simply applying DASrealign, but with the advantage that the reads will also be error corrected.  Nevertheless, just scrubbing reads and feeding them into another assembler is proving to have sufficient value that I decided to release the current pipeline.  Best, Gene

Intrinsic Quality Values

One of the most vexing problems with Pacbio data is that it is very difficult to understand the quality of a given small segment or region of a read.  The Q-scores that come out of the instrument are not particularly informative in this regard; their value seems to be mostly in indicating if a given base is better or worse than the bases neighboring it (and this is of importance to multi-aligners and consensus callers like Quiver).  I tried every way I could think of to aggregate these Q-scores over a, say, 100bp interval so as to arrive at an informative statistic about the overall quality of that 100bp stretch, and failed.  But having this information is very important because over the course of a long 10-40Kbp read the quality of the base calls varies significantly, and there are even long internal stretches that are essentially junk.  I illustrate a typical Pacbio read error profile below to help you further understand why I think it is extremely important to understand the “regional” quality of a read.

Error profile of a typical long read.
My solution to determining read quality begins with the idea of a pile, which is the set of all alignments that involve a given read, say a, as the A-read in an alignment pair (a,b).  Recall that the HPCdaligner script arranges to sort all the alignments between reads found by daligner so that they are in order of A-read, then B-read, then B-orientation, etc.  In particular, this means that for a given read, say 103, one encounters all the alignments with 103 as the A-read in consecutive order in a .las file.  That is the 103-pile and in fact every pile can be collected and processed consecutively as one scans a .las file!  All post-overlap and pre-assembly analyses in the Dazzler suite are based on looking at read piles.

The figure below shows a pile in a graphical form I call a pile-ogram.  The long yellow bar at the top represents the A-read and every white bar below it represents the A-interval of an alignment with another read.  The brown tips indicate that the B-read continues but does not align with the A-read  (note that some white bars have brown tips, others do not).

Recall from the previous post on trace points that for each alignment we record the number of differences in the alignment for each section between tick marks spaced Δ base pairs apart in A.  Daligner uses a default of Δ = 100, so the number of differences is also the percentage mismatch between the two reads over each 100bp stretch of the A-read (save possibly at the very beginning and very end of an alignment that spans only a part of a 100bp stretch).  In the pile-ogram below, we have color-coded each 100bp interval of each alignment according to the number of differences in the segment, where the heat map scale is given just above the pile-ogram.

It should immediately be obvious that in the 100bp “columns” that are all orange, red, and purple, every B-read disagrees with the corresponding segment of the A-read, and the clear inference is that the quality of the A-read is very low in that segment.  Going a little further, if a column has say 40 alignments covering it and the average error rate of these is 12%, then it is very unlikely that 50% or more of the 40 B-reads are worse than average in their portions of the alignment spanning the column, and indeed we expect the average quality of this 50% to be quite stable, perhaps from 8-12%.  The variable part that mostly determines the color code must then be the quality of the A segment, so a good proxy statistic for the quality of each 100bp segment of the A-read is the average match fidelity of the best, say, 25-50% of the alignments covering the segment.  In the figure below we now color code the segments of the A-read according to their intrinsic quality value.  We use the adjective “intrinsic” to emphasize that this statistic was derived solely from the read data set itself.
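
The statistic itself is simple to state in code.  Below is a Python sketch of the per-column computation: given the per-segment difference counts of all alignments covering one 100bp column, average the best fraction of them and cap at 50.  The 50% fraction is the upper end of the 25-50% range suggested above; the exact fraction DASqv uses may differ.

```python
# Per-column intrinsic QV: average the difference counts of the
# best-matching alignments covering one 100bp column of the A-read.
# A sketch of the statistic, not DASqv itself.

def intrinsic_qv(column_diffs, best_frac=0.5, cap=50):
    """column_diffs: difference counts, one per alignment covering the column."""
    if not column_diffs:
        return cap                        # no coverage: assume the worst
    best = sorted(column_diffs)
    k = max(1, int(len(best) * best_frac))
    return min(round(sum(best[:k]) / k), cap)
```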

In the case that there is a large drop out in the quality of an A-read, this tends to “break” local alignments as the alignment software cannot find a good correspondence with any B-read in this region.  The pile below illustrates this effect where it is clear all alignments are interrupted over the interval [2100,2200].

The pile below shows an even larger low quality region that is over 4000bp long.  However, one cannot immediately conclude that the segments in question are low quality, as another possibility is that the read is a chimer, and the drop out is due to the presence of the chimer junction.  Typically such breaks are small, as in the pile above.  To distinguish the two possibilities, one needs to see if the B-reads are the same on both sides of the gap and that the gap distance is consistent with the gaps implied in the B-reads.  In the pile below, alignment pairs that are so consistent are indicated by dashed lines across the gap.  One sees that about 1/2 the pairs are consistent across the gap.  For the others, a careful analysis reveals that the B-reads are not long enough to actually span the 4Kbp gap, terminating before reaching the other side.
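
The spanning test can be sketched directly: a B-read is consistent across the gap if it appears in the alignment sets on both sides and the implied B-side gap matches the A-side gap within a tolerance.  The 30% tolerance and the input representation here are illustrative assumptions, not the actual criteria used by the scrubber.

```python
# Which B-reads are consistent across a gap in the A-read?  A sketch of
# the chimer-vs-low-quality test; thresholds are illustrative.

def consistent_pairs(left_end, right_start, a_gap, tol=0.3):
    """left_end[b]:  B coordinate where b's alignment stops at the gap's left edge
    right_start[b]: B coordinate where b's alignment resumes on the right"""
    hits = []
    for b, stop in left_end.items():
        if b in right_start:
            b_gap = right_start[b] - stop
            if abs(b_gap - a_gap) <= tol * a_gap:   # gap sizes must agree
                hits.append(b)
    return hits
```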

Hopefully, the examples and exposition above make it clear that analyzing read piles is pretty interesting and informative.  We have built a complete research prototype pipeline that uses the intrinsic quality values and the piles to trim poor prefixes and suffixes of each read, to detect and break all chimeric reads, to remove all adaptamers missed by the Pacbio software, and to identify and patch low quality gaps.  I am still not completely happy with elements of the current scrubbing pipeline and so will not release all of it at this time.  However, the intrinsic quality value caller is stable and I am releasing it now, so that at least you can have an error profile for your reads at this time.  Perhaps you can build a better scrubber.

The command “DASqv -c40 Project.db Project.las” will produce an intrinsic quality track, “qual”, in the files .Project.qual.anno and .Project.qual.data.  In this example I was assuming that Project entailed 40X sequencing of a target genome and set the parameter -c accordingly.  While I could have tried to estimate the coverage of your project by looking at the piles, the assumption was that the user invariably knows this information and so can easily supply it.  For a given read of length n, its quality vector has length ⌊ n/Δ ⌋ + 1 where Δ is the trace spacing (default 100), and it consists of one-byte numbers between 0 and 50, where I capped the values at 50 as any segment with worse than an average 50% difference to its B-reads is obviously a hideously bad segment (two random DNA strings will align with 52% differences).  As stated carefully previously, intrinsic quality values are statistics that correlate with the underlying quality, and not the actual error rate of the segment.  To help you calibrate this to the actual quality, DASqv can be asked to output a histogram of the match values and quality values with the -v option.  An example output might be:

DASqv -c40 Project Project

Input:   28,347 reads,   213,832,622 bases

Histogram of q-values (average 10 best)

         Input                 QV

50:     124591    0.2%       369417   17.2%

49:      30831    0.0%         1177    0.1%
48:      45052    0.1%         1304    0.1%
47:      55249    0.2%         1318    0.2%
46:      70125    0.3%         1576    0.3%
45:      88451    0.4%         1823    0.4%
44:     109586    0.6%         2019    0.5%
43:     138026    0.8%         2404    0.7%
42:     167315    1.0%         2891    0.8%
41:     206812    1.3%         3271    1.0%
40:     276479    1.7%         3718    1.2%
39:     273300    2.1%         4164    1.4%
38:     363039    2.7%         4507    1.7%
37:     439018    3.3%         5130    2.0%
36:     520598    4.1%         5792    2.3%
35:     629455    5.0%         6352    2.7%
34:     760599    6.1%         6980    3.1%
33:     893687    7.4%         8118    3.5%
32:    1109495    9.0%         9013    4.0%
31:    1309997   10.9%        10322    4.6%
30:    1599547   13.2%        12054    5.3%
29:    1881921   16.0%        14483    6.1%
28:    2230686   19.2%        17300    7.1%
27:    2659619   23.1%        21917    8.3%
26:    3071431   27.6%        27422    9.8%
25:    3660064   32.9%        34941   11.8%
24:    3751121   38.4%        45721   14.3%
23:    4299877   44.7%        58711   17.6%
22:    4550533   51.3%        75977   21.9%
21:    4729179   58.2%        96397   27.3%
20:    4818604   65.2%       118736   34.0%
19:    4445302   71.7%       142094   41.9%
18:    4232805   77.8%       160940   51.0%
17:    3771886   83.3%       172833   60.7%
16:    3209555   88.0%       173724   70.4%
15:    2585374   91.8%       161052   79.4%
14:    1974279   94.7%       135985   87.1%
13:    1416910   96.7%       100996   92.7%
12:     958661   98.1%        66090   96.4%
11:     599259   99.0%        37087   98.5%
10:     350258   99.5%        17349   99.5%
9:     185194   99.8%         6601   99.9%
8:      91231   99.9%         1966  100.0%
7:      39498  100.0%          518  100.0%
6:      15779  100.0%          114  100.0%
5:       5154  100.0%           23  100.0%
4:       1630  100.0%            3  100.0%

It is a matter of experience, but my general finding is that of the QV’s not capped at 50, about 80% of the data is definitely usable and 5-7% is definitely bad, leaving the rest in a “grey” zone.  From the histogram above, anything under 22 is definitely usable, and anything over 28 is definitely bad.  Future scrubbing commands that trim and patch reads take these two user-selected thresholds as input (the -g and -b parameters) to control their function.  In this way, you can scrub harder or softer by adjusting these two thresholds.

To end, the routine DBdump now takes a -i option that asks it to output the quality vector for reads, provided, of course, the “qual” track exists.

A developing command guide for the scrubber module is available here.

Recording Alignments with Trace Points

The daligner often produces millions or billions of alignments for large genomes and records each of these alignments in .las files.  Thus far I’ve alluded to trace points as the way alignments are recorded in a couple of posts without really stating what they are.  An alignment between two reads a and b must at a minimum record the interval of a, say [ab,ae], the interval of b, say [bb,be], and the orientation o (= n or c) of b relative to a.  In brief:

a[ab,ae] x bo[bb,be]

However, at later stages one may want the actual alignment between these two substrings.  One choice is to simply compute the alignment whenever it is needed.  Letting n = ae-ab ~ be-bb, and assuming an error rate of say ε = 15%, this takes O(εn²) time, which can be rather expensive when n is 10Kbp or more.  Another choice is to encode the alignment (daligner has it at one point in its workflow) as say a sequence of edit positions as in the SAM format.  This means that the alignment is immediately available when needed, but unfortunately the encoding takes O(εn) space for each alignment, which is a nontrivial amount.  For example, a 10Kbp alignment would involve 2500-3000 edits requiring conservatively 5-6Kb to encode, and that’s per alignment!

Trace points are a concept I came up with to realize a parameter-controlled trade off between the time to compute an alignment and the space to encode it.  Let Δ be a positive number. For instance, the daligner default is 100.  For an alignment with A-interval [ab,ae], imagine a trace point every 100bp, on every 100th base of the A-read.  For example, if [ab,ae] = [57,432], then place points at 100, 200, 300, and 400.  Generalizing, there are:

 τ = ⌈ ae/Δ ⌉ – ⌊ ab/Δ ⌋ = O(n/Δ)

intervals between the end-points of the A-interval and the trace points, e.g. [57,100], [100,200], [200,300], [300,400], [400,432].  Note that only the parameter Δ and the A-interval are needed to know where the trace points are in the A-read.  What we record is the number of bases in the B-interval [bb,be] that align with each of the trace-point intervals in A.  For example, suppose the B-interval is [1002,1391], and in the full alignment trace that daligner has just computed:

A[ 57,100] aligns to B[1002,1050] with 10 diffs
A[100,200] aligns to B[1050,1155] with 25 diffs
A[200,300] aligns to B[1155,1250] with 18 diffs
A[300,400] aligns to B[1250,1351] with 22 diffs
A[400,432] aligns to B[1351,1391] with  7 diffs

Then daligner records the alignment as the sequence of pairs (10,48) (25,105) (18,95) (22,101) (7,40) in the .las file along with the basic alignment information above.  In general, an alignment is encoded as a sequence (d1,b1) (d2,b2) (d3,b3) … (dτ,bτ) such that [bb+Σ(i-1),bb+Σ(i)] aligns with the Δ-spaced intervals of A, where Σ(i) = b1 + b2 + … + bi and by construction Σ(τ) = be-bb.  Note that if Δ is 100 and ε = 15% then one can be certain the bi‘s and di‘s fit in a single byte.  So the encoding is very sparse at exactly 2n/Δ bytes per alignment for the daligner running with the default trace point spacing.
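
To make the encoding concrete, here is the worked example above in Python: prefix-summing the b_i of the (d_i, b_i) pairs recovers the B coordinate aligned to each A trace point, with the final boundary landing exactly on be.

```python
# Rebuild the B-interval boundaries of the worked example from its
# (d_i, b_i) trace-point pairs by prefix-summing the b_i.

def b_boundaries(bb, pairs):
    bounds = [bb]
    for _, b in pairs:                 # the d_i are not needed to rebuild B
        bounds.append(bounds[-1] + b)
    return bounds

pairs = [(10, 48), (25, 105), (18, 95), (22, 101), (7, 40)]
bounds = b_boundaries(1002, pairs)     # [1002, 1050, 1155, 1250, 1351, 1391]
```

The last boundary is be = bb + Σ(τ) = 1391, matching the example.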

So now suppose you want the alignment and you have the trace-point distances bi.  Very simply, all you need to do is compute alignments between each pair of A- and B- trace point intervals, and concatenate them together!  Each interval pair alignment takes O(εΔ²) time and there are O(n/Δ) intervals, for a total of O(εΔn) time.  For any fixed value of Δ, alignment delivery is thus linear in n and not quadratic!  One should note that the number of differences in each trace point interval is not actually needed to build an alignment.  The value of having these differences tucked away in the .las file will become apparent in a future post on “Intrinsic Quality Values”, but we introduce the exact encoding here for completeness.

With Δ=100 as the default, daligner uses only 200 bytes for every 10Kbp of aligned bases (compared to 6,000 bytes for BAM), but can still deliver alignments very rapidly with time that grows linearly in the number of aligned bases.  While the higher error rate of Pacbio data inspired this design, one should note that this scheme can, and I think should, be used for low error rate scenarios by simply using a much bigger Δ.  SAM was designed for short, almost perfect (ε = .5%) read alignments.  It is neither general nor space efficient, whereas the trace point concept scales to whatever read size and error rate one could imagine for a technology, simply by adjusting Δ.  SAM is also a really bad example of a bio-format that violates the “easy-to-read-by-a-program” philosophy I explained in the post “Seeing Dazzler Data & Results“.  The utility LAdump described in that post can be used to get the trace point sequence for any alignment from a daligner-produced .las file.

On Perfect Assembly

During the AGBT meeting this February at Marco Island I decided to “get modern” and joined Twitter, in part due to the encouragement of my younger colleagues, Lex Nederbragt and Pavel Tomancak.  I was at the AGBT meeting to present my new work towards an assembler for long-read-only data sets.  I’ve been out of the DNA sequencing “scene” for almost ten years now, mostly because the trend has been so much on cost versus quality that I just lost interest as a mathematician.  With the short read instruments, a great assembly is just not possible, especially in the face of strong sampling bias of the reads and consistently reproducible sequencing errors.  Yes, the results are still biologically useful, but I’m not a biologist.  What mathematician wants to work on a problem they know they cannot solve to their satisfaction?  Moreover, as a bioinformatician I know how much time is spent (wasted) dealing with the overwhelming number of caveats in the data sets.  Wouldn’t it be great if at least the source of much of our work, the DNA sequence, was perfect?

What I perceived early in 2013 was that the relatively new Pacbio “long read” DNA sequencer was reaching sufficient maturity that it could produce data sets that might make this possible, or at least get us much, much closer to the goal of near perfect, reference quality reconstructions of novel genomes.  While the accuracy of its reads is relatively low (85-88%), causing many to erroneously think it not a good machine, the Pacbio machine has two incredibly powerful offsetting properties, namely, that (i) the set of reads produced is a nearly Poisson sampling of the underlying genome, and (ii) the location of errors within reads is truly randomly distributed.   Given my previous experience, I immediately understood the implication of these two properties, and decided in the summer of 2013 to buy a machine and get back into the genome assembly game.

So with my new twitter account, I decided to float the following 140 character theorem, that encapsulates my reasoning for coming back to the DNA sequencing scene:

Thm: Perfect assembly possible iff a) errors random b) sampling is Poisson c) reads long enough 2 solve repeats. Note: e-rate not needed

There was much follow-on discussion: some people didn’t believe it, and some were (quite reasonably) confused about some of the conditions (especially c).  In addition, I got some interesting links to work along these lines (here and here).  So how did I arrive at the conclusion of the theorem?

Condition (b), by the Poisson theory of Lander and Waterman, implies that for any minimum target coverage level k, there exists a level of sequencing coverage c that guarantees that every region of the underlying genome is covered k times.  In their 1988 paper they showed that the fraction of the genome left uncovered was roughly e^-c, where e is Napier’s constant, and an easy generalization shows that the fraction of the genome not covered at least k times is (c^k/k!)·e^-c.  Certainly, as c goes to infinity this quantity goes to 0, and since the genome is finite, at some point every part of the genome is covered k times.
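To make the numbers concrete, here is a small sketch (mine, not from the Lander-Waterman paper) that evaluates the exact Poisson tail — the fraction of the genome covered fewer than k times at coverage c — of which the expression above is the dominant term:

```python
import math

def frac_below_k(c, k):
    """Fraction of the genome covered fewer than k times under
    Poisson (Lander-Waterman) sampling at coverage c: the sum of
    Poisson(c) probabilities for 0, 1, ..., k-1 read hits."""
    return sum(c**i / math.factorial(i) * math.exp(-c) for i in range(k))

# e.g. at 20X coverage, how much of the genome sees fewer than 5 reads?
print(frac_below_k(20, 5))   # a tiny fraction, on the order of 1e-5
```

Note that for k = 1 this reduces to exactly the e^-c uncovered fraction, and raising c from 20X to 30X drives the shortfall down by several more orders of magnitude.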

Condition (a), from the early work of Churchill and Waterman, implies that the accuracy of the consensus sequence of k sequences, each with average error err, is O(err^k), which goes to 0 as k increases.  This clearly implies that one can arrive at an arbitrarily accurate consensus by sampling enough sequences.  Note carefully that this is not true for a device that has reproducible errors, and every previous sequencing device makes such reproducible errors, the best ones at about a rate of 1 per 10Kbp.  This implies that with previous devices the best accuracy you could ever expect is a “Q40” reconstruction.  For the PacBio, where error is truly random, any desired accuracy is possible.  For example, the PacBio team has produced a Q60 reconstruction of E. coli with their machines.
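The flavor of this argument can be seen with a toy model (an illustration of mine, much cruder than the Churchill-Waterman estimator): a per-base majority vote over k independent copies, each wrong with probability err.  The vote is wrong only when more than half the copies err, a probability that shrinks geometrically in k:

```python
from math import comb

def majority_error(err, k):
    """Probability that a per-base majority vote over k independent
    copies is wrong, assuming a binary alphabet and i.i.d. per-copy
    error rate err (ties counted as errors)."""
    return sum(comb(k, i) * err**i * (1 - err)**(k - i)
               for i in range((k + 1) // 2, k + 1))

# Consensus error at a 15% per-read error rate as depth grows:
for k in (1, 5, 15, 31):
    print(k, majority_error(0.15, k))
```

Running this shows the consensus error dropping from 15% at depth 1 to well below 0.1% by depth 31 — the qualitative point being that random error averages away, while a reproducible error would survive any depth.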

So conditions (a) and (b) together imply that with sufficiently large coverage c, an arbitrarily accurate reconstruction of the entire genome is possible, provided the sequencing reads can be correctly assembled.  As is well known, it is the underlying repeat structure of a genome that confounds assembly.  Condition (c) is very loosely stated and basically says: if the reads are so long that every repetitive region is spanned, then the repeats aren’t a problem.  Of course, one could try to characterize more carefully how long the reads need to be (for example, they tried here), or one could simply capitulate and say all regions not involving repeats over a given size and fidelity are correctly assembled.  Practically, full-length retrotransposons, the most frequent interspersed repetitive element, are 7Kbp long, so most regions of any genome will not be problematic with 10Kbp reads.  Typically one finds that the euchromatic arms of most genomes assemble correctly into one or perhaps a few pieces with reads of this length.  While one can try to make theoretical statements, the truth of the matter is that most assemblers are still not good at resolving repeats on the basis of the micro-variation within them, i.e. there is still significant room for improvement in the combinatorial aspects of genome assembly.  We intend to aggressively address this weakness going forward.
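A back-of-the-envelope sketch (my simplification, with fixed-length reads and Poisson read starts — real read lengths vary) shows why 10Kbp reads tame 7Kbp repeats: a read spans a repeat only if it starts in a window of read_len - repeat_len positions, so spanning reads arrive at rate c·(read_len - repeat_len)/read_len, and the chance that no read spans the repeat decays exponentially in that rate:

```python
import math

def prob_repeat_unspanned(c, read_len, repeat_len):
    """Probability that no read spans a given repeat, assuming
    fixed-length reads with Poisson-distributed start positions
    at coverage c (an illustrative model, not from the post)."""
    if repeat_len >= read_len:
        return 1.0  # no single read can span the repeat
    rate = c * (read_len - repeat_len) / read_len
    return math.exp(-rate)

# 7Kbp retrotransposon, 10Kbp reads, 20X coverage:
print(prob_repeat_unspanned(20, 10000, 7000))   # exp(-6), about 0.25%
```

Under these assumptions a 7Kbp repeat is left unspanned only about 0.25% of the time at 20X with 10Kbp reads, whereas any repeat as long as the reads themselves is never spanned — which is why read length, not error rate, is the binding constraint in condition (c).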

The theorem presupposes that there is a single haplotype under consideration.  Diploidy and polyploidy create additional complications that I believe are addressable.  But the really important issue is how to build an assembler that is efficient in coverage c.  That is, the theorem just says for c “big” enough, but pragmatically, the smaller c, the better the cost and efficiency.  Currently most of the “PacBio-only” projects have involved 50X or more.  In principle, a 20X assembly should be pretty darn good, but no one currently gets a good assembly at 20X because the 50X is needed to “correct” reads, i.e. to understand read error, sufficiently to put the pieces together.  So there is lots of room for innovation and improvement in the arena of genome assembly algorithms: assembly is not a solved problem!