Mapping Your Reads: damapper


OK, so there are over 100 different read mappers out there.  Why another one?  First, its faster than all the rest on long noisy reads while being equally sensitive to the best, second, because it uses a chain model of alignments that accounts for and reveals the frequent drop outs in quality within a Pacbio read, and third, because it actually estimates and reports those regions of reads that involve repeat elements and their approximate copy number in the genome to which they are being mapped.  Finally, I was just curious as to whether the daligner‘s sort and merge k-mer strategy would be competitive with the prevailing paradigm of using FM-indices or Suffix tree/arrays.  The answer is that it is, albeit it depends on M, the amount of memory available, and G, the size of the reference genome.  The larger G, the slower damapper becomes relative to index-based mappers.  Morever, while damapper will run in any reasonable (e.g. 16Gb) amount of memory, it runs faster the more memory there is available.  For example, with 48Gb, damapper is 1.8 times faster than BWA on the human genome, 15 times faster on the fly genome, and 36 times faster on E.Coli.  Performance will be reported on more extensively in a subsequent post.

So “damapper REF DB.1 DB.2” will map the reads in blocks 1 and 2 of DB to a reference genome  REF.  It really doesn’t care if REF is a .dam or a .db, and it will handle whatever size it happens to be in the amount of memory available or specified with the -M option.  On the other hand the DB blocks must contain sequences that are not too long, i.e reads only, and is expected to be of an appropriate block size (e.g. 250Mbp).  The difference between damapper and daligner is that damapper is looking for the best or near best map of the read to a location in REF, whereas daligner will find every possible local alignment between the reads in DB.1 and DB.2 including all repeat induced LA’s.  Thus damapper has much less work to do and so can be and is much faster.

Most of the parameters to damapper are the same as for the daligner, i.e. -v, -b, -k, -t, -M, -e, -s, and -m, but with a much larger default of -k (20) and a higher correlation for -e (.85), and the -w, -h, and -l parameters are not needed.  Like daligner, damapper finds local alignments consistent with the error rate -e, and does not extend these alignments into regions where the sequences are no longer reasonably correlating at the specified rate.  This means that if a Pacbio read has, say, two very low quality regions internal to the read, a frequent occurrence in such data, then damapper will find and report a chain of 3 local alignments and not a single alignment that is basically nonsense in the two drop out regions.  This is distinctly different from other mappers that force an alignment through these regions.  This seems to my mind a more principled approach and identifies the fact that parts of a read are very low quality.  I’m not sure why seeing 2,000 garbage bases in a forced alignment to the reference sequence is valuable :-).

So a damapper mapping of a read is a chain of one or more local alignments to a corresponding contig of the reference.  Occasionally reads are chimers, in which case one wants to see a mapping of each of the two distinct parts of the read.  damapper accomplishes this by finding the best chain mapping for the read, then finding the best mapping to any portion of the read not spanned by the first chain, and so on, until all significant portions of the read are mapped (if possible).  Moreover, damapper will report other good matches to a segment if requested with the -n option.  For example, with -n.95, it reports all matches whose score is within 95% of the best score covering a given portion of a read.

Apart from the -n option, the other options unique to damapper are the -C, -N, and -p flags.  By default, damapper returns .las files where the A-reads are the mapped reads, and the B-reads are the contigs of the reference.  That is, “damapper REF DB.1 DB.2” will output DB.1.REF.M1.las, DB.1.REF.M2.las, … DB.2.REF.Mn.las, where T is the number of threads it is run with (specified by the -T option, default 4).  The T files for each block can be combined into a single .las file with for example “LAcat DB.1.REF.M# >DB.REF.1.las“.  If you would like to see the mapping from the point of view of the reference contigs, specifying the -C option, will further produce files of the form REF.DB.1.C1.las, … REF.DB.2.Cn.las, where the A-reads are the contigs of the reference, and the B-reads are the mapped reads.  Specifying -N suppresses the production of the .M#.las files in case you only want to see the coverage of the contigs by reads.

As is customary, there is an HPC script generator, HPC.damapper, that will generate all the calls necessary to produce the mapping of reads to contigs for blocks of a data set (and vice versa if -C is set).  For example, “HPC.damapper -C -n.95 REF DB” will produce DB.#.REF.las and REF.DB.#.las where DB.# is a block of DB.  For example, if DB has 4 blocks then HPC.damapper generates a script equivalent to the following:

# Damapper jobs (1)
damapper -v -C -n0.95 REF DB.1 DB.2 DB.3 DB.4
# Catenate jobs (8)
LAsort -a DB.1.REF.M*.las && LAcat DB.1.REF.M#.S >DB.1.REF.las
LAsort -a DB.2.REF.M*.las && LAcat DB.2.REF.M#.S >DB.2.REF.las
LAsort -a DB.3.REF.M*.las && LAcat DB.3.REF.M#.S >DB.3.REF.las
LAsort -a DB.4.REF.M*.las && LAcat DB.4.REF.M#.S >DB.1.REF.las
LAsort -a REF.DB.1.C*.las && LAmerge -a REF.DB.1 REF.DB.1.C*.S.las
LAsort -a REF.DB.2.C*.las && LAmerge -a REF.DB.2 REF.DB.2.C*.S.las
LAsort -a REF.DB.3.C*.las && LAmerge -a REF.DB.3 REF.DB.3.C*.S.las
LAsort -a REF.DB.4.C*.las && LAmerge -a REF.DB.4 REF.DB.4.C*.S.las
# Check all .las files (optional but recommended)
# Cleanup all intermediate .las files
rm DB.*.REF.*.las REF.DB.*.*.las

Its important to note that all the commands that deal with .las files, i.e. LAsort, LAmerge, LAcheck, LAshow, and LAdump have been upgraded to properly handle damapper-produced .las files that encode chains of LAs.  Internally, Dazzler software can distinguish between .las files that encode chains and those that don’t (e.g. as produced by daligner or datander).  For sorting, chains are treated as a unit, and sorted on the basis of the first LA.  Display commands will indicate chains when present, and LAcheck checks chains and their internal encoding for validity.  Another thing to note very carefully, is that chains are best sorted with the -a option and not the default.  This is particularly important for files produced in response to the -C option.  For example, to produce a mapping of all reads to REF in say REF.DB.las, one should merge the relevant files above with “LAmerge -va REF DB REF.DB.*.las”  One can then see how the reference sequence is covered by the database by then opening REF.DB.las in DaViewer.

In developing the chaining algorithm for damapper, it became obvious that at little additional overhead one could estimate the repetitiveness of the sequence in a region of a read simply by counting how many (sub-optimal) chains covered the region.  The -p option requests that damapper produce a repeat profile track for each read based on these counts which roughly estimate the repetitiveness of a read segment within the underlying genome.  That is, for each trace-point sized interval (established by the -s parameter) the number of times, c, this interval is involved in a distinct alignment to the reference is estimated.  0 is recorded for segments that don’t match anything in the reference, 1 for segments that match uniquely, and otherwise floor(log10c/10) up to a cap of 40 corresponding to 10,000 copies.  Observe carefully that the track has the same form as intrinsic quality values: a vector of small byte-sized integers for each trace point interval.  These vectors can be output by DBdump with the -p option and DaViewer is able to graphically display them.  These profiles should make it obvious when a read does not have a unique location in a reference sequence due to its being entirely or almost entirely repetitive.  I think having this information is critical in any subsequent analysis.

Dazzler DAMs and Other Upgrades

  Good news: the PacBio machines now occasionally produce reads that are longer than 65,536 bases long !  While I was excited about the length, my heart sank as I knew it meant that I had to take all my carefully compacted Dazzler data structures that used short integers and turn them into proper 4 byte integers requiring a review and retest of all the code released thus far.  To my relief it actually wasn’t that hard, and while I was at it I further modified the daligner local alignment finder so that it can accommodate k-mers of length up to 32.  The downside is that the daligner now takes twice as much memory as before, but it is only 4% slower on previous test cases.

  The upside, on the other hand, is huge because now the daligner can also be used as a general read mapper and string to string comparison tool, as a “read” can now be a DNA sequence that is as long as say the human genome or any other genomic sequence one might care to align reads to.  The only change required was to generalize a dazzler DB to something that holds all the contigs of a reference sequence and their .fasta header lines.  I decided to call these objects Dazzler Maps (DAM) and identify them with a .dam suffix versus the .db suffix of a regular database of Pacbio reads.  Internally almost everything is the same, save that a DAM, say foo, doesn’t have quality values (i.e. a .foo.qvs file) but does retain the header of every entry (in a new .foo.hdr file).  In fact every command that runs on a DB, now also runs on a DAM if it makes sense to do so.  In overview:

New Command:

fasta2DAM, DAM2fasta:

Like fasta2DB and DB2fasta, except entries with runs of N’s are broken into separate “contig” sequences (and the range of each contig recorded) by fasta2DAM and stitched back together with intervening N’s by DAM2fasta.  If you just have a plain old .fasta file with a bunch of sequences with no N’s in them, don’t worry, everything works just fine.


Like HPCdaligner, HPCmapper produces a UNIX script that using daligner, LAsort, and LAmerge to find all local alignments between the sequences in two databases that can be either a .dam or .db.  In addition, HPCdaligner also works on a .dam so for example I compared the E.Coli reference genome against itself to find all the repetitive elements which I thought was kind of cool.

Extended Commands:

DBshow, DBdust, Catrack, DBstats, DBsplit, DBrm:

All these commands run perfectly well on either a .db or a .damDBshow has also been upgraded to output the original .fasta header of displayed entries for .db‘s as well as .dam‘s.


The arguments to the daligner can now be a .dam, a .db, or a block thereof.  The daligner takes care of all the details in handling one versus the other, and now has a -A option that controls whether an alignment between reads a and b is reported twice as (a,b) and (b,a) (the previous behavior), or as just (a,b).  It also has what I think is a  cool -I option that tells daligner to allow a read to be compared against itself, excluding the identity match.

LAcheck, LAshow

LAcheck and LAshow both took a DB as an argument in order to either check or display .las entries, respectively.  For HPCmapper applications, the A- and B-sequences come from two different DB’s.  So the extended commands can take either a single .db or .dam, in which case they assume both the A- and B-sequences come from said DB, or they can now take a pair of .db or .dam arguments, that give the sources of the A- and B-sequences, respectively.

The new release of all the modules is available at Github here.

So with these upgrades and additions is it now possible to use Dazzler modules to map PacBio reads to a reference genome assembly.  In brief, one creates a DAM, say R.dam, of the reference genome with fasta2DAM and presumably one has the PacBio reads in a DB, D.db.  The .dam can be DBdust‘ed and split into blocks (if necessary with DBsplit).  A read mapping can then be effected by calling “HPCmapper R D” that produces a script involving daligner, LAsort, and LAmerge that will result in files R.D.1.las, R.D.2.las, … R.D.n.las where R.D.k.las contains the alignments between all of R.dam and the k‘th block of the DB D.db.  For example, if R is split into two blocks, and D into 3, then the call “HPCmapper -vd R D” produces the script:

# Daligner jobs (2)
daligner -vdA -k20 -h50 -e.85 R.1 D.1 D.2 D.3
daligner -vdA -k20 -h50 -e.85 R.2 D.1 D.2 D.3
# Initial sort jobs (6)
LAsort R.1.D.1.*.las && LAmerge L1.1.1 R.1.D.1.*.S.las && rm R.1.D.1.*.las
LAsort R.1.D.2.*.las && LAmerge L1.1.2 R.1.D.2.*.S.las && rm R.1.D.2.*.las
LAsort R.1.D.3.*.las && LAmerge L1.1.3 R.1.D.3.*.S.las && rm R.1.D.3.*.las
LAsort R.2.D.1.*.las && LAmerge L1.2.1 R.2.D.1.*.S.las && rm R.2.D.1.*.las

# Level 1 merge jobs (3)
LAmerge R.D.1 L1.*.1.las && rm L1.*.1.las
LAmerge R.D.2 L1.*.2.las && rm L1.*.2.las
LAmerge R.D.3 L1.*.3.las && rm L1.*.3.las

Note carefully that the parameters -k, -h, and -e default to more stringent values when running daligner as a mapper.  I have not carefully tested or evaluated the new daligner to see what values of -k and -h give the best time vs. sensitivity tradeoff for a given choice of error rate -e.  The values chosen for PacBio reads are if anything overly sensitive and could be further optimized for speed, but with the values given its already so fast I haven’t as yet bothered.  Indeed, I wonder if the daligner would now be a good mapper for Illumina reads.   Hmmmm, to be continued …

A precise, detailed, and up-to-date command reference including these new commands can be found here.


DALIGNER: Fast and sensitive detection of all pairwise local alignments

The first phase of the Dazzler assembler compares every read in the trimmed Dazzler database (DB) against every other read in the trimmed DB, seeking significant local alignments between them. In an old-fashioned OLC assembler, one would look, more restrictively, for overlaps between reads. But one can use local alignments to detect artifacts such as chimeric reads and unclipped adapter sequence by looking for consistent alignment terminations within a read, and one can also detect and annotate repetitive elements of the genome covered within a read by looking for “pile-ups” of alignments. Removing the artifacts and knowing the repetitive regions covered by a read is a huge advantage prior to assembling the reads. Indeed, the dazzler assembler employes several downstream phases specifically to remove artifacts, correct reads, and annotate repetitive elements prior to building a string graph which is the modern analog of the layout phase for an OLC architecture.

You can get the source code for the DALIGNER module on Github here. Grabbing the code and uttering “make” should produce 9 programs: daligner, LAsort, LAmerge, LAshow, LAcat, LAsplit, LAcheck, HPCdaligner, and HPCmapper.

Ideally one would like to call the alignment finder, daligner, with a DB, e.g. “daligner MyDB“, and after some computing it would output a file encoding all the found local alignments. Unfortunately for big projects the amount of memory and CPU time required make this impossible. For example, on the Pacbio 54X human data set, daligner took 15,600 CPU hours on our modest 480 core HPC cluster, and as a monolithic run would have required at least 12Tb of memory! I could start apologizing, but consider that the only other program that can compare reads at a 12-15% error rate, BLASR, took 404,000 CPU hours, and only delivered overlaps. Indeed, a primary reason for releasing daligner now, and not waiting to release a complete end-to-end assembler is precisely to provide the bioinformatics community with a feasible way to compute alignments for Pacbio data at this scale. While we’ve embedded daligner within our database framework, it is easy to grab daligner‘s output in an ASCII format and then use that in an HGAP pipeline (which is the only current end to end assembler publicly available).

So given the scale of the required computation we have to “bite the bullet” and take a job parallel, HPC approach to effect an all-against-all comparison of the trimmed DB. So as a running example, let’s suppose you’ve set up a DB that is partitioned and optionally has a “dust” track (see this blog post) with a command sequence like:

fasta2DB DB MyData1.fasta MyData2.fasta ...
DBdust DB
DBsplit -x1000 DB

Suppose for simplicity that DB splits into 3 blocks. Then calling HPCdaligner on the DB will produce a UNIX shell script that contains all the commands needed to effect the all against all comparison. For example, “HPCdaligner DB” produces the script:

# Daligner jobs (3)
daligner DB.1 DB.1
daligner DB.2 DB.1 DB.2
daligner DB.3 DB.1 DB.2 DB.3
# Initial sort jobs (9)
LAsort DB.1.DB.1.*.las && LAmerge L1.1.1 DB.1.DB.1.*.S.las && rm DB.1.DB.1.*.las
LAsort DB.1.DB.2.*.las && LAmerge L1.1.2 DB.1.DB.2.*.S.las && rm DB.1.DB.2.*.las
LAsort DB.1.DB.3.*.las && LAmerge L1.1.3 DB.1.DB.3.*.S.las && rm DB.1.DB.3.*.las
LAsort DB.2.DB.1.*.las && LAmerge L1.2.1 DB.2.DB.1.*.S.las && rm DB.2.DB.1.*.las

# Level 1 merge jobs (3)
LAmerge DB.1 L1.1.*.las && rm L1.1.*.las
LAmerge DB.2 L1.2.*.las && rm L1.2.*.las
LAmerge DB.3 L1.3.*.las && rm L1.3.*.las

Each non-comment line of the script should be submitted to your cluster as a job, and all the jobs following one of the 3 comment lines can be run in parallel, but these 3 batches of job groups need to be run in sequence. That is, the 3 “Daligner” jobs can be run in parallel as a group, and once they have all completed, one can then run the 9 “Initial sort” jobs in parallel as a group, and finally the “Level 1 merge” jobs as a group. And of course, for a small script like this one, you could just invoke the shell on the script and let the jobs run sequentially on your desktop, e.g. “HPCdaligner DB | csh“.

When all the jobs of our sample script have completed there will be 3 sorted alignment files DB.1.las, DB.2.las, DB.3.las that encode the alignments found for the reads in each of the 3 trimmed DB blocks, respectively. To describe the contents of these 3 files precisely, one must first know that for each alignment, a .las-file records:

a[ab,ae] x bcomp[bb,be]

where a and b are the indices (in the trimmed DB) of the reads that overlap, comp indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and bcomp, respectively, that align. It is further the case that an alignment between reads a and b results in two alignment records, one for a versus b and another for b versus a. In addition, a series of trace points is retained with each alignment record to facilitate the rapid delivery of an optimal alignment between the relevant intervals of the reads. We can now explain that the output file DB.n.las contains all the overlap records for which the first read a is in the trimmed block DB.n and these records further occur in sorted order of a, then b, then comp, and finally ab. What this means is that all the alignments involving a given a-read occur in a contiguous interval of the output files permitting easy and job parallel computation in the downstream phases for chimer detection, read correction, repeat analysis, etc..

The HPCdaligner UNIX shell script employes the three commands daligner, LAsort, and LAmerge to produce the sorted and partitioned .las files for each of the N blocks of a hypothetical database D as follows:

  • “Daligner” step: Every block is compared against every other block by calling daligner on every pair D.x and D.y such that y≤x. A call such as “daligner D.x D.y1 .. D.yk” performs k comparisons of block D.x versus each block D.yi and does so with a compiled number of threads T (default 4). Each of these pairwise block comparisons produces 2T unsorted .las files D.x.D.y.[C|N]t.las and if x > y, an additional 2T unsorted .las files D.y.D.x.[C|N]t.las. The file D.x.D.y.Ot.las contains every alignment produced by thread t where the a-read is in D.x and the b-read is in D.y and is in orientation O. In total (N+1)N/2 blocks are compared and 2N2T .las files result.
  • “Initial sort” step: The 2T .las files for each block pair x,y, are sorted and merged into a single sorted .las file. The 2T files are first sorted with LAsort, where for each unsorted file U.las the command produces the sorted overlap file U.S.las. These sorted .S.las files are then merged into a single sorted file, L1.x.y.las, with LAmerge and afterwords removed. At the end of this step there are exactly N2 sorted .las files, one for each block pair.
  • “Level k merge” steps: In a sequence of O(log N) levels the N files L1.x.*.las for a given block x are merged by LAmerge into a single .las file, D.x.las. At level 1, the N sorted files for a block are merged in groups of up to 25 resulting in ⌊N/25⌋ files, that in level 2 are merged in groups of up to 25, resulting in ⌊N/625⌋ files, and so on until only one file D.x.las remains. Even big projects involve less than 625 blocks, so typically only one or at most two merge levels are required.

For extraordinarily large projects the computation can be performed in stages as new data is added to the database, thus amortizing the cost of finding alignments over the interval of data production. For example, suppose you have a really big project going and that thus far the DB has 78 blocks in its current partition. Executing the script produced by “HPCdaligner DB 1-77" will compare blocks 1 to 77 versus each other producing files DB.1.las through DB.77.las. Note carefully that the last block, 78, was not included, because it is a partial block. Later on suppose you’ve added more data and now have 144 blocks. Then executing the script produced by “HPCdaligner DB 78-143” will compare blocks 78 to 143 against each other and against the earlier blocks 1 through 77, updating DB.1.las through DB.77.las and creating DB.78.las through DB.144.las. And so on. The only restrictions are that (a) one must add the blocks in sequential order once only, and (b) one should never add the last, partial block until the very last increment. By submitting each new complete block as it appears, I figure one could keep up with data production for say a 100X shotgun of D. melanogaster on a decent laptop.

The command LAshow produces an ASCII display of all or a selected subset of the alignments encoded in a given .las file, sorted or unsorted. One can view just the read indices and their aligned intervals, one per line, or a line rendering of the relationship between the two reads, or a complete BLAST-style alignment of each local alignment where one has several options for customizing the display. In addition to permitting one to browse local alignments, the ASCII output can also be fed into a python or pearl script that converts the information to a from that can be input to an external program of your choosing, such as the read correction step of the HGAP assembler.

The LAcat and LAsplit commands allow one to effectively change the partition of the .las files for a project should it ever be necessary. Suppose the .las files D.1.las, D.2.las, …, D.N.las have been computed for a database D.db that has been partitioned into N blocks. “LAcat D.# >E.las” will find all the .las files of the form D.#.las and stream the conceptual concatenation of all the .las files in numeric order to the standard output, in this case to the file E.las. In the other direction, “LAsplit F.# N <E.las” will split E.las as evenly as possible into N block files, F.1.las, F.2.las, …, F.N.las subject to the restriction that the alignments involving a given a-read are all in the same file. Note that F.x.las does not necessarily equal the original D.x.las, as the a-reads in F.x.las may not be the same as the reads in a given block, D.x.db, because the splitting criteria for LAsplit and DBsplit are not the same. To get exactly the same files back, one should instead utter “LAsplit F.#.las D.db <E.las” or more tersely “LAsplit F.# D <E.las” that splits the piped input according to the partitioning of database D !  Finally, observe that if you want to repartition the DB D and also all the D.#.las files, first call DBsplit on D, and afterwords call “LAcat D.# | LAsplit D.# D” to redistribute the alignment records into .las files according to the new partitioning.

Lastly, the command LAcheck reads through the .las files given to it and checks them for structural integrity, i.e. do they reasonably encode a properly formatted .las file? We find this command useful in ensuring that every stage of a large HPCdaligner script has been properly executed. In processes involving thousands of jobs, it is not unusual for one or two to go “haywire”. So while we didn’t build it into the HPCdaligner scripts, we strongly recommend that you run LAcheck on .las files throughout the process to make sure all is well. Doing so is standard operating procedure for our assembly group.

A precise, detailed, and up-to-date command reference can be found here.