Mapping Your Reads: damapper


OK, so there are over 100 different read mappers out there.  Why another one?  First, it's faster than all the rest on long noisy reads while being equally sensitive to the best; second, it uses a chain model of alignments that accounts for and reveals the frequent drop-outs in quality within a Pacbio read; and third, it actually estimates and reports those regions of reads that involve repeat elements and their approximate copy number in the genome to which they are being mapped.  Finally, I was just curious as to whether the daligner's sort-and-merge k-mer strategy would be competitive with the prevailing paradigm of FM-indices or suffix trees/arrays.  The answer is that it is, although this depends on M, the amount of memory available, and G, the size of the reference genome.  The larger G is, the slower damapper becomes relative to index-based mappers.  Moreover, while damapper will run in any reasonable amount of memory (e.g. 16Gb), it runs faster the more memory is available.  For example, with 48Gb, damapper is 1.8 times faster than BWA on the human genome, 15 times faster on the fly genome, and 36 times faster on E.Coli.  Performance will be reported on more extensively in a subsequent post.

So “damapper REF DB.1 DB.2” will map the reads in blocks 1 and 2 of DB to a reference genome REF.  It really doesn't care if REF is a .dam or a .db, and it will handle whatever size REF happens to be in the amount of memory available or specified with the -M option.  On the other hand, the DB blocks must contain sequences that are not too long, i.e. reads only, and are expected to be of an appropriate block size (e.g. 250Mbp).  The difference between damapper and daligner is that damapper looks for the best or near-best map of a read to a location in REF, whereas daligner finds every possible local alignment between the reads in DB.1 and DB.2, including all repeat-induced LAs.  Thus damapper has much less work to do and so can be, and is, much faster.

Most of the parameters to damapper are the same as for the daligner, i.e. -v, -b, -k, -t, -M, -e, -s, and -m, but with a much larger default for -k (20) and a higher correlation for -e (.85); the -w, -h, and -l parameters are not needed.  Like daligner, damapper finds local alignments consistent with the error rate -e, and does not extend these alignments into regions where the sequences no longer correlate reasonably at the specified rate.  This means that if a Pacbio read has, say, two very low quality regions internal to the read, a frequent occurrence in such data, then damapper will find and report a chain of 3 local alignments, and not a single alignment that is basically nonsense in the two drop-out regions.  This is distinctly different from other mappers that force an alignment through these regions.  To my mind this is a more principled approach, and it identifies the fact that parts of a read are very low quality.  I'm not sure why seeing 2,000 garbage bases in a forced alignment to the reference sequence is valuable :-).

So a damapper mapping of a read is a chain of one or more local alignments to a corresponding contig of the reference.  Occasionally reads are chimeras, in which case one wants to see a mapping of each of the two distinct parts of the read.  damapper accomplishes this by finding the best chain mapping for the read, then finding the best mapping to any portion of the read not spanned by the first chain, and so on, until all significant portions of the read are mapped (if possible).  Moreover, damapper will report other good matches to a segment if requested with the -n option.  For example, with -n.95, it reports all matches whose score is within 95% of the best score covering a given portion of a read.

Apart from the -n option, the other options unique to damapper are the -C, -N, and -p flags.  By default, damapper returns .las files in which the A-reads are the mapped reads and the B-reads are the contigs of the reference.  That is, “damapper REF DB.1 DB.2” will output DB.1.REF.M1.las, DB.1.REF.M2.las, …, DB.2.REF.MT.las, where T is the number of threads it is run with (specified by the -T option, default 4).  The T files for each block can be combined into a single .las file with, for example, “LAcat DB.1.REF.M# >DB.1.REF.las”.  If you would like to see the mapping from the point of view of the reference contigs, specifying the -C option will further produce files of the form REF.DB.1.C1.las, …, REF.DB.2.CT.las, where the A-reads are the contigs of the reference and the B-reads are the mapped reads.  Specifying -N suppresses the production of the .M#.las files in case you only want to see the coverage of the contigs by reads.
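For instance, a hypothetical run of block 1 with 8 threads, keeping near-optimal chains and requesting a repeat profile (discussed further below), followed by collapsing the per-thread files into one .las, might look like this (the option values are illustrative, not recommendations):

damapper -v -T8 -n.95 -p REF DB.1
LAsort -a DB.1.REF.M*.las && LAcat DB.1.REF.M#.S >DB.1.REF.las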

As is customary, there is an HPC script generator, HPC.damapper, that will generate all the calls necessary to produce the mapping of reads to contigs for the blocks of a data set (and vice versa if -C is set).  For example, “HPC.damapper -C -n.95 REF DB” will produce DB.#.REF.las and REF.DB.#.las, where DB.# is a block of DB.  In particular, if DB has 4 blocks then HPC.damapper generates a script equivalent to the following:

# Damapper jobs (1)
damapper -v -C -n0.95 REF DB.1 DB.2 DB.3 DB.4
# Catenate jobs (8)
LAsort -a DB.1.REF.M*.las && LAcat DB.1.REF.M#.S >DB.1.REF.las
LAsort -a DB.2.REF.M*.las && LAcat DB.2.REF.M#.S >DB.2.REF.las
LAsort -a DB.3.REF.M*.las && LAcat DB.3.REF.M#.S >DB.3.REF.las
LAsort -a DB.4.REF.M*.las && LAcat DB.4.REF.M#.S >DB.4.REF.las
LAsort -a REF.DB.1.C*.las && LAmerge -a REF.DB.1 REF.DB.1.C*.S.las
LAsort -a REF.DB.2.C*.las && LAmerge -a REF.DB.2 REF.DB.2.C*.S.las
LAsort -a REF.DB.3.C*.las && LAmerge -a REF.DB.3 REF.DB.3.C*.S.las
LAsort -a REF.DB.4.C*.las && LAmerge -a REF.DB.4 REF.DB.4.C*.S.las
# Check all .las files (optional but recommended)
LAcheck -vS DB REF DB.1.REF DB.2.REF DB.3.REF DB.4.REF
LAcheck -vS REF DB REF.DB.1 REF.DB.2 REF.DB.3 REF.DB.4
# Cleanup all intermediate .las files
rm DB.*.REF.*.las REF.DB.*.*.las

It's important to note that all the commands that deal with .las files, i.e. LAsort, LAmerge, LAcheck, LAshow, and LAdump, have been upgraded to properly handle damapper-produced .las files that encode chains of LAs.  Internally, Dazzler software can distinguish between .las files that encode chains and those that don't (e.g. as produced by daligner or datander).  For sorting, chains are treated as a unit and sorted on the basis of their first LA.  Display commands indicate chains when present, and LAcheck checks chains and their internal encoding for validity.  Another thing to note very carefully is that chains are best sorted with the -a option and not the default.  This is particularly important for files produced in response to the -C option.  For example, to produce a mapping of all reads to REF in, say, REF.DB.las, one should merge the relevant files above with “LAmerge -va REF.DB REF.DB.*.las”.  One can then see how the reference sequence is covered by the database by opening REF.DB.las in DaViewer.

In developing the chaining algorithm for damapper, it became obvious that with little additional overhead one could estimate the repetitiveness of the sequence in a region of a read simply by counting how many (sub-optimal) chains covered the region.  The -p option requests that damapper produce a repeat profile track for each read based on these counts, which roughly estimate the repetitiveness of a read segment within the underlying genome.  That is, for each trace-point-sized interval (established by the -s parameter), the number of times, c, this interval is involved in a distinct alignment to the reference is estimated.  A value of 0 is recorded for segments that don't match anything in the reference, 1 for segments that match uniquely, and otherwise floor(10·log10(c)), up to a cap of 40 corresponding to 10,000 copies.  Observe carefully that the track has the same form as intrinsic quality values: a vector of small byte-sized integers, one per trace point interval.  These vectors can be output by DBdump with the -p option, and DaViewer is able to display them graphically.  These profiles should make it obvious when a read does not have a unique location in a reference sequence because it is entirely or almost entirely repetitive.  I think having this information is critical in any subsequent analysis.
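As a sketch of how one might inspect these profiles from the command line (assuming a -p run as above has added the track for block 1; the exact dump format is best checked against the command reference):

DBdump -p DB.1 | less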


Dazzler DAMs and Other Upgrades

Good news: the PacBio machines now occasionally produce reads that are longer than 65,536 bases!  While I was excited about the length, my heart sank as I knew it meant that I had to take all my carefully compacted Dazzler data structures that used short integers and turn them into proper 4-byte integers, requiring a review and retest of all the code released thus far.  To my relief it actually wasn't that hard, and while I was at it I further modified the daligner local alignment finder so that it can accommodate k-mers of length up to 32.  The downside is that the daligner now takes twice as much memory as before, but it is only 4% slower on previous test cases.

The upside, on the other hand, is huge, because the daligner can now also be used as a general read mapper and string-to-string comparison tool, as a “read” can now be a DNA sequence as long as, say, the human genome or any other genomic sequence one might care to align reads to.  The only change required was to generalize a Dazzler DB to something that holds all the contigs of a reference sequence and their .fasta header lines.  I decided to call these objects Dazzler Maps (DAMs) and identify them with a .dam suffix versus the .db suffix of a regular database of Pacbio reads.  Internally almost everything is the same, save that a DAM, say foo, doesn't have quality values (i.e. a .foo.qvs file) but does retain the header of every entry (in a new .foo.hdr file).  In fact, every command that runs on a DB now also runs on a DAM if it makes sense to do so.  In overview:

New Commands:

fasta2DAM, DAM2fasta:

Like fasta2DB and DB2fasta, except entries with runs of N’s are broken into separate “contig” sequences (and the range of each contig recorded) by fasta2DAM and stitched back together with intervening N’s by DAM2fasta.  If you just have a plain old .fasta file with a bunch of sequences with no N’s in them, don’t worry, everything works just fine.
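For instance, a minimal round trip might look like the following (the file name ecoli.fasta is purely illustrative):

fasta2DAM R ecoli.fasta    # build R.dam, breaking entries at runs of N's into contigs
DAM2fasta R                # recreate the original entries, re-stitching contigs with N's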

HPCmapper:

Like HPCdaligner, HPCmapper produces a UNIX script that uses daligner, LAsort, and LAmerge to find all local alignments between the sequences in two databases, each of which can be either a .dam or a .db.  In addition, HPCdaligner itself also works on a .dam, so, for example, I compared the E.Coli reference genome against itself to find all its repetitive elements, which I thought was kind of cool.

Extended Commands:

DBshow, DBdust, Catrack, DBstats, DBsplit, DBrm:

All these commands run perfectly well on either a .db or a .dam.  DBshow has also been upgraded to output the original .fasta header of displayed entries for .db's as well as .dam's.

daligner:

The arguments to the daligner can now be a .dam, a .db, or a block thereof.  The daligner takes care of all the details in handling one versus the other, and now has a -A option that controls whether an alignment between reads a and b is reported twice, as (a,b) and (b,a) (the previous behavior), or just once as (a,b).  It also has what I think is a cool -I option that tells daligner to allow a read to be compared against itself, excluding the identity match.
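As a sketch of the self-comparison idea mentioned above for finding repeats (assuming REF fits in a single block; the -k and -e values simply follow the mapper defaults shown further below):

daligner -A -I -k20 -e.85 REF REF   # report each LA once (-A), skip only the identity match (-I)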

LAcheck, LAshow:

Previously, LAcheck and LAshow took a single DB as an argument in order to check or display .las entries, respectively.  For HPCmapper applications, however, the A- and B-sequences come from two different DBs.  So the extended commands can take either a single .db or .dam, in which case they assume both the A- and B-sequences come from said DB, or they can now take a pair of .db or .dam arguments that give the sources of the A- and B-sequences, respectively.
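For example, using the R.dam and D.db names from the mapping example below, one might check and browse a mapping file with:

LAcheck -v R D R.D.1     # A-sequences from R.dam, B-sequences from D.db
LAshow R D R.D.1 | more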

The new release of all the modules is available at Github here.

So with these upgrades and additions it is now possible to use Dazzler modules to map PacBio reads to a reference genome assembly.  In brief, one creates a DAM of the reference genome, say R.dam, with fasta2DAM, and presumably one has the PacBio reads in a DB, D.db.  The .dam can be DBdust‘ed and split into blocks with DBsplit (if necessary).  A read mapping can then be effected by calling “HPCmapper R D”, which produces a script involving daligner, LAsort, and LAmerge that will result in files R.D.1.las, R.D.2.las, …, R.D.n.las, where R.D.k.las contains the alignments between all of R.dam and the k‘th block of D.db.  For example, if R is split into two blocks and D into 3, then the call “HPCmapper -vd R D” produces the script:

# Daligner jobs (2)
daligner -vdA -k20 -h50 -e.85 R.1 D.1 D.2 D.3
daligner -vdA -k20 -h50 -e.85 R.2 D.1 D.2 D.3
# Initial sort jobs (6)
LAsort R.1.D.1.*.las && LAmerge L1.1.1 R.1.D.1.*.S.las && rm R.1.D.1.*.las
LAsort R.1.D.2.*.las && LAmerge L1.1.2 R.1.D.2.*.S.las && rm R.1.D.2.*.las
LAsort R.1.D.3.*.las && LAmerge L1.1.3 R.1.D.3.*.S.las && rm R.1.D.3.*.las
LAsort R.2.D.1.*.las && LAmerge L1.2.1 R.2.D.1.*.S.las && rm R.2.D.1.*.las
LAsort R.2.D.2.*.las && LAmerge L1.2.2 R.2.D.2.*.S.las && rm R.2.D.2.*.las
LAsort R.2.D.3.*.las && LAmerge L1.2.3 R.2.D.3.*.S.las && rm R.2.D.3.*.las

# Level 1 merge jobs (3)
LAmerge R.D.1 L1.*.1.las && rm L1.*.1.las
LAmerge R.D.2 L1.*.2.las && rm L1.*.2.las
LAmerge R.D.3 L1.*.3.las && rm L1.*.3.las
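For completeness, a minimal sketch of the preparatory calls implied above might be the following (the file names are hypothetical, and the block size is just the usual suggestion):

fasta2DAM R reference.fasta   # reference contigs into R.dam
DBdust R
DBsplit -s250 R
fasta2DB D reads.fasta        # Pacbio reads into D.db
DBsplit -s250 D
HPCmapper -vd R D >map.sh     # generate a mapping script like the one shown above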

Note carefully that the parameters -k, -h, and -e default to more stringent values when running daligner as a mapper.  I have not carefully tested or evaluated the new daligner to see what values of -k and -h give the best time vs. sensitivity tradeoff for a given choice of error rate -e.  The values chosen for PacBio reads are, if anything, overly sensitive and could be further optimized for speed, but with the values given it's already so fast I haven't yet bothered.  Indeed, I wonder if the daligner would now be a good mapper for Illumina reads.  Hmmmm, to be continued …

A precise, detailed, and up-to-date command reference including these new commands can be found here.

 

On Perfect Assembly

During the AGBT meeting this February at Marco Island I decided to “get modern” and joined Twitter, in part due to the encouragement of my younger colleagues, Ned Lederbragt and Pavel Tomancak.  I was at the AGBT meeting to present my new work towards an assembler for long-read-only data sets.  I've been out of the DNA sequencing “scene” for almost ten years now, mostly because the trend has been so much about cost versus quality that I just lost interest as a mathematician.  With the short-read instruments, a great assembly is just not possible, especially in the face of strong sampling bias of the reads and consistently reproducible sequencing errors.  Yes, the results are still biologically useful, but I'm not a biologist.  What mathematician wants to work on a problem they know they cannot solve to their satisfaction?  Moreover, as a bioinformatician I know how much time is spent (wasted) dealing with the overwhelming number of caveats in the data sets.  Wouldn't it be great if at least the source of much of our work, the DNA sequence, was perfect?

What I perceived early in 2013 was that the relatively new Pacbio “long read” DNA sequencer was reaching sufficient maturity that it could produce data sets that might make this possible, or at least get us much, much closer to the goal of near perfect, reference quality reconstructions of novel genomes.  While the accuracy of its reads is relatively low (85-88%), causing many to erroneously think it not a good machine, the Pacbio machine has two incredibly powerful offsetting properties, namely, that (i) the set of reads produced is a nearly Poisson sampling of the underlying genome, and (ii) the location of errors within reads is truly randomly distributed.   Given my previous experience, I immediately understood the implication of these two properties, and decided in the summer of 2013 to buy a machine and get back into the genome assembly game.

So with my new Twitter account, I decided to float the following 140-character theorem, which encapsulates my reasoning for coming back to the DNA sequencing scene:

Thm: Perfect assembly possible iff a) errors random b) sampling is Poisson c) reads long enough 2 solve repeats. Note: e-rate not needed

There was much follow-on discussion: some people didn't believe it and some were (quite reasonably) confused about some of the conditions (especially c).  In addition I got some interesting links to work along these lines (here and here).  So how did I get to the conclusion of the theorem?

Condition (b), by the Poisson theory of Lander and Waterman, implies that for any minimum target coverage level k, there exists a level of sequencing coverage c that guarantees that every region of the underlying genome is covered k times.  In their 1988 paper they showed that the amount of the genome left uncovered was roughly e^-c, where e is Napier's constant, and an easy generalization shows that the amount of the genome not covered at least k times is (c^k/k!)·e^-c.  Certainly as c goes to infinity this quantity goes to 0, and since the genome is finite, at some point every part of the genome is covered k times.
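As a rough illustration of the formula (the numbers here are purely for illustration): plugging in c = 30 and k = 10 gives (30^10/10!)·e^-30 ≈ 1.5×10^-5, i.e. only about 45Kbp of a 3Gbp genome would be expected to be covered fewer than 10 times.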

Condition (a), from the early work of Churchill and Waterman, implies that the accuracy of the consensus sequence of k sequences, each with average error err, is O(err^k), which goes to 0 as k increases.  This clearly implies that one can arrive at an arbitrarily accurate consensus by sampling enough sequences.  Note carefully that this is not true for a device that has reproducible errors, and every previous sequencing device makes such reproducible errors, the best ones at a rate of about 1 per 10Kbp.  This implies that with previous devices the best accuracy you could ever expect is a “Q40” reconstruction.  For the Pacbio, where error is truly random, any desired accuracy is possible.  For example, the Pacbio team has produced a Q60 reconstruction of E. Coli with their machines.

So conditions (a) and (b) together imply that with sufficiently large coverage c, an arbitrarily accurate reconstruction of the entire genome is possible, provided the sequencing reads can be correctly assembled.  As is well known, it is the underlying repeat structure of a genome that confounds assembly.  Condition (c) is very loosely stated and basically says that if the reads are so long that every repetitive region is spanned, then the repeats aren't a problem.  Of course, one could try to characterize more carefully how long the reads need to be (for example, they tried here), or one could simply capitulate and say all regions not involving repeats over a given size and fidelity are correctly assembled.  Practically, full-length retrotransposons, the most frequent interspersed repetitive elements, are 7Kbp long, so most regions of any genome will not be problematic with 10Kbp reads.  Typically one finds that the euchromatic arms of most genomes assemble correctly into one or perhaps a few pieces with reads of this length.  While one can try to make theoretical statements, the truth of the matter is that most assemblers are still not good at resolving repeats on the basis of the micro-variation within them, i.e. there is still significant room for improvement in the combinatorial aspects of genome assembly.  We intend to aggressively address this weakness going forward.

The theorem presupposes that there is a single haplotype under consideration.  Diploidy and polyploidy create additional complications that I believe are addressable.  But the really important issue is how to build an assembler that is efficient in coverage c.  That is, the theorem just says for c “big” enough, but pragmatically the smaller c is, the better the cost and efficiency.  Currently most of the “Pacbio-only” projects have involved 50X or more, yet in principle a 20X assembly should be pretty darn good.  No one currently gets a good assembly at 20X because the 50X is needed to “correct” reads, i.e. understand read error, sufficiently to put the pieces together.  So there is lots of room for innovation and improvement in the arena of genome assembly algorithms: assembly is not a solved problem!

The Dextractor Module: Save disk space for your Pacbio projects

The Pacbio sequencers currently produce HDF5-formatted files that contain a lot of information about a sequencing run, including the sequence, streams of quality values, and a lot of other stuff that most of us will never look at.  These files are quite big.  The 3 .bax.h5 files produced for a given SMRT cell typically occupy 7-9 Gb.  The recent 54X human genome data set produced by the company consists of the output from 277 SMRT cells, so if you want to download it and work with it, you will need to set aside roughly 2+Tb of disk.  Here we present a set of UNIX commands that extract just the data you need for an assembly project, and allow you to compress said data, so that it occupies roughly 1/13th the space.  For example, the extracted and compressed human data set occupies only ~150Gb!  For our own projects we routinely extract and compress data as it is produced, saving the HDF5 files to a tape archive as backup.  This saves our organization money, as keeping such data sets on disk at these scales is not a cheap proposition.

The Dextractor module source code is available on Github here.  Grabbing the code and uttering “make” should produce 5 programs: dextract, dexta, undexta, dexqv, and undexqv.  dextract takes .bax.h5 files as input and produces a .fasta file containing the sequences of all the reads, and a .quiva file that contains all the quality value (QV) streams needed by the Quiver program to produce a consensus sequence as the final phase of assembly.  The .fasta file can then be compressed (~4x) with dexta to produce a .dexta file, and this can be decompressed with undexta to restore the original .fasta file.  Similarly, dexqv and undexqv compress (~3.3x) and decompress a .quiva file into a .dexqv file and back.  The figure below shows the flow of information:

[Figure: How data flows through the Dextractor commands from a source .bax.h5 file to the assembler.  The yellow percentages indicate the amount of space occupied by each file type relative to the source (100%).]

As an example, the Pacbio RS II produces 3 .bax.h5 files as output for a given SMRT cell (there are currently 3 blades on the machine's server, each handling 1/3rd of the 150K zero-mode wells).  When we receive such an output on our disk storage system, say files G.1.bax.h5, G.2.bax.h5, G.3.bax.h5, we immediately execute the following commands:

dextract -o G.*.bax.h5
dexta G
dexqv G
mv G.*.bax.h5 <Our_tape_backup_system>

This leaves us with files G.dexta and G.dexqv on spooled-up disk.  Whenever we are ready to begin an assembly we decompress G.dexta with undexta (for most phases only the .fasta files are needed), and then, if we have an assembly we like and want a final consensus sequence from Quiver, we decompress G.dexqv back into a .quiva file with undexqv, which we then feed into our customized version of Quiver (not yet available, alas).
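In terms of commands, the decompression at assembly time is simply the reverse of the steps above (a sketch, assuming the same base-name convention as the compression commands):

undexta G    # restores G.fasta for the assembly phases
undexqv G    # restores G.quiva when a Quiver consensus is wanted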

A precise, detailed, and up-to-date command reference can be found here.