On Perfect Assembly

  During the AGBT meeting this February at Marco Island I decided to “get modern” and joined Twitter in part due to the encouragement of my younger colleagues, Ned Lederbragt and Pavel Tomancak.  I was at the AGBT meeting to present my new work towards an assembler for long-read-only data sets.  I’ve been out of the DNA sequencing “scene” for almost ten years now, mostly because the trend has been so much on cost versus quality that I just lost interest as a mathematician.  With the short read instruments, a great assembly is just not possible, especially in the face of strong sampling bias of the reads and consistently reproducible sequencing errors.  Yes, the results are still biologically useful, but I’m not a biologist.  What mathematician wants to work on a problem they know they cannot solve to their satisfaction?  Moreover, as a bioinformatician I know how much time is spent (wasted) dealing with the overwhelming number of caveats in the data sets.  Wouldn’t it be great if at least the source of much our work, the DNA sequence, was perfect?

What I perceived early in 2013 was that the relatively new Pacbio “long read” DNA sequencer was reaching sufficient maturity that it could produce data sets that might make this possible, or at least get us much, much closer to the goal of near perfect, reference quality reconstructions of novel genomes.  While the accuracy of its reads is relatively low (85-88%), causing many to erroneously think it not a good machine, the Pacbio machine has two incredibly powerful offsetting properties, namely, that (i) the set of reads produced is a nearly Poisson sampling of the underlying genome, and (ii) the location of errors within reads is truly randomly distributed.   Given my previous experience, I immediately understood the implication of these two properties, and decided in the summer of 2013 to buy a machine and get back into the genome assembly game.

So with my new twitter account, I decided to float the following 140 character theorem, that encapsulates my reasoning for coming back to the DNA sequencing scene:

Thm: Perfect assembly possible iff a) errors random b) sampling is Poisson c) reads long enough 2 solve repeats. Note: e-rate not needed

There was much follow on: some people didn’t believe it and some were (quite reasonably) confused about some of the conditions (especially c).  In addition I got some interesting links to work along these lines (here and here).  So how did I get to the conclusion of the theorem?

Condition (b), by the Poisson theory of Lander and Waterman, implies that for any minimum target coverage level k, there exists a level of sequencing coverage c that guarantees that every region of the underlying genome is covered k times.  In their 1988 paper they showed that the amount of the genome uncovered was roughly e-c where e is Napier’s constant, and an easy generalization shows that the amount of the genome not covered at least k times is ck/k! e-c.  Certain as c goes to infinity this quantity goes to 0 and since the genome is finite, at some point every part of the genome is covered k times.

Condition (a), from the early work of Churchill and Waterman, implies that the accuracy of the consensus sequence of k sequences, each with average error err, is O(errk) which goes to 0 as k increases.  This clearly implies that one can arrive at an arbitrarily accurate consensus by sampling enough sequences.  Note carefully that this is not true for a device that has reproducible errors and that every previous sequencing device makes such reproducible errors, the best ones at about a rate of 1 per 10Kbp.  This implies that with previous devices the best accuracy you could ever expect is a “Q40” reconstruction.  For the Pacbio where error is truly random any desired accuracy is possible.  For example, the Pacbio team has produced a Q60 reconstruction of E. Coli with their machines.

So conditions (a) and (b) together imply that with sufficiently large coverage c, an arbitrarily accurate reconstruction of the entire genome is possible, provided the sequencing reads can be correctly assembled.  As is well known it is the underlying repeat structure of a genome that confounds assembly.  Condition (c) is very loosely stated and basically says, if the reads are so long that every repetitive region is spanned, then the repeats aren’t a problem.  Of course, one could try to characterize more carefully how long the reads need to be (for example, they tried here), or one could simply capitulate and say all regions not involving repeats over a given size and fidelity are correctly assembled.  Practically, full length retrotransposons, the most frequent interspersed repetitive element, are 7Kbp long so most regions of any genome will not be problematic with 10Kbp reads.  Typically one finds that the euchromatic arms of most genomes assemble correctly into one or perhaps a few pieces with reads of this length.  While one can try to make theoretical statements, the truth of the matter is most assemblers are still not good at resolving repeats on the basis of the micro-variation within them, i.e. there is still significant room for improvement in the combinatorial aspects of genome assembly.  We intend to aggressively address this weakness going forward.

The theorem presupposes that there is a single haplotype under consideration.  Diploidy and polyploidy create additional complications that I believe are addressable.  But the really important issue is how to build an assembler that is efficient in coverage c.  That is, the theorem just says for c “big” enough.  But pragmatically the smaller c, the better the cost and efficiency.  Currently most of the “Pacbio-only” projects have involved 50X or more.  But in principle, a 20X assembly should be pretty darn good.  But no one currently gets good assembly at 20X because the 50X is needed to “correct” reads, i.e. understand read error, sufficiently to put the pieces together.  So there is lots of room for innovation and improvement in the arena of genome assembly algorithms: assembly is not a solved problem !


6 thoughts on “On Perfect Assembly

  1. Gene – thanks for the mention. I always liked the C&W 1992 paper but the independence assumption seemed insurmountable and blatantly wrong. I decided to pursue other things. You have made an interesting observation about error characteristics of PacBio data. Do you have reference, data to support or intuition? (I know it is just a blog.)

    I read the rest of the blog and noticed an off the cuff comment about quality scores. So I have a theorem of my own to propose. In denovo genome assembly quality scores may be useful. Maybe. But if we are aligning reads to a known target sequence, as we do in RNASeq, then

    in statistical jargon: conditional on the alignment, the quality score is ancillary.
    in English: the quality scores contain no additional information that is not already available from the alignment.
    or by analogy: it is like saying that the probability of heads is 1/2 after the coin has been tossed.


  2. Hi Gene,

    It was nice seeing you again at AGBT.
    When do you expect a beta version of Dazzler to be available for testing?
    We have several eukaryotic PacBio datasets I would like to try it on.

    Regarding your post. A truly perfect assembly should produce a separate phased consensus sequence for each haplotype. Even with perfect long-reads and infinite coverage there are theoretical cases where a perfect assembly is not achievable. In particular when the distance between unique alleles is longer than the longest read. In a sense every haplotype is a very long repeat that needs to be resolved. Regardless, there is need for better tools for de-novo haplotype assembly from long reads. This is especially important for highly polymorphic diploid and polyploid genomes, which current assemblers can’t handle very well.

    Part of the challenge is moving from a linear sequence representation to a more realistic graph based representation of genomes. This requires all downstream analysis tools like mapping and gene prediction to be able to operate on reference graphs instead of reference sequences.

  3. Dear Dr Myers,

    I have a naive question after reading this post: when you say “Note carefully that this is not true for a device that has reproducible errors and that every previous sequencing device makes such reproducible errors, the best ones at about a rate of 1 per 10Kbp.”, does this include Sanger capillary sequencers or do you mean only previous NGS instruments?

    I ask out of curiosity but I suppose the answer could be relevant to the integration of old sequencing data into new future assemblies.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s