Good news: the PacBio machines now occasionally produce reads that are longer than 65,536 bases long ! While I was excited about the length, my heart sank as I knew it meant that I had to take all my carefully compacted Dazzler data structures that used short integers and turn them into proper 4 byte integers requiring a review and retest of all the code released thus far. To my relief it actually wasn’t that hard, and while I was at it I further modified the daligner local alignment finder so that it can accommodate k-mers of length up to 32. The downside is that the daligner now takes twice as much memory as before, but it is only 4% slower on previous test cases.
The upside, on the other hand, is huge because now the daligner can also be used as a general read mapper and string to string comparison tool, as a “read” can now be a DNA sequence that is as long as say the human genome or any other genomic sequence one might care to align reads to. The only change required was to generalize a dazzler DB to something that holds all the contigs of a reference sequence and their .fasta header lines. I decided to call these objects Dazzler Maps (DAM) and identify them with a .dam suffix versus the .db suffix of a regular database of Pacbio reads. Internally almost everything is the same, save that a DAM, say foo, doesn’t have quality values (i.e. a .foo.qvs file) but does retain the header of every entry (in a new .foo.hdr file). In fact every command that runs on a DB, now also runs on a DAM if it makes sense to do so. In overview:
Like fasta2DB and DB2fasta, except entries with runs of N’s are broken into separate “contig” sequences (and the range of each contig recorded) by fasta2DAM and stitched back together with intervening N’s by DAM2fasta. If you just have a plain old .fasta file with a bunch of sequences with no N’s in them, don’t worry, everything works just fine.
Like HPCdaligner, HPCmapper produces a UNIX script that using daligner, LAsort, and LAmerge to find all local alignments between the sequences in two databases that can be either a .dam or .db. In addition, HPCdaligner also works on a .dam so for example I compared the E.Coli reference genome against itself to find all the repetitive elements which I thought was kind of cool.
All these commands run perfectly well on either a .db or a .dam. DBshow has also been upgraded to output the original .fasta header of displayed entries for .db‘s as well as .dam‘s.
The arguments to the daligner can now be a .dam, a .db, or a block thereof. The daligner takes care of all the details in handling one versus the other, and now has a -A option that controls whether an alignment between reads a and b is reported twice as (a,b) and (b,a) (the previous behavior), or as just (a,b). It also has what I think is a cool -I option that tells daligner to allow a read to be compared against itself, excluding the identity match.
LAcheck and LAshow both took a DB as an argument in order to either check or display .las entries, respectively. For HPCmapper applications, the A- and B-sequences come from two different DB’s. So the extended commands can take either a single .db or .dam, in which case they assume both the A- and B-sequences come from said DB, or they can now take a pair of .db or .dam arguments, that give the sources of the A- and B-sequences, respectively.
The new release of all the modules is available at Github here.
So with these upgrades and additions is it now possible to use Dazzler modules to map PacBio reads to a reference genome assembly. In brief, one creates a DAM, say R.dam, of the reference genome with fasta2DAM and presumably one has the PacBio reads in a DB, D.db. The .dam can be DBdust‘ed and split into blocks (if necessary with DBsplit). A read mapping can then be effected by calling “
HPCmapper R D” that produces a script involving daligner, LAsort, and LAmerge that will result in files R.D.1.las, R.D.2.las, … R.D.n.las where R.D.k.las contains the alignments between all of R.dam and the k‘th block of the DB D.db. For example, if R is split into two blocks, and D into 3, then the call “
HPCmapper -vd R D” produces the script:
# Daligner jobs (2)
daligner -vdA -k20 -h50 -e.85 R.1 D.1 D.2 D.3
daligner -vdA -k20 -h50 -e.85 R.2 D.1 D.2 D.3
# Initial sort jobs (6)
LAsort R.1.D.1.*.las && LAmerge L1.1.1 R.1.D.1.*.S.las && rm R.1.D.1.*.las
LAsort R.1.D.2.*.las && LAmerge L1.1.2 R.1.D.2.*.S.las && rm R.1.D.2.*.las
LAsort R.1.D.3.*.las && LAmerge L1.1.3 R.1.D.3.*.S.las && rm R.1.D.3.*.las
LAsort R.2.D.1.*.las && LAmerge L1.2.1 R.2.D.1.*.S.las && rm R.2.D.1.*.las
# Level 1 merge jobs (3)
LAmerge R.D.1 L1.*.1.las && rm L1.*.1.las
LAmerge R.D.2 L1.*.2.las && rm L1.*.2.las
LAmerge R.D.3 L1.*.3.las && rm L1.*.3.las
Note carefully that the parameters -k, -h, and -e default to more stringent values when running daligner as a mapper. I have not carefully tested or evaluated the new daligner to see what values of -k and -h give the best time vs. sensitivity tradeoff for a given choice of error rate -e. The values chosen for PacBio reads are if anything overly sensitive and could be further optimized for speed, but with the values given its already so fast I haven’t as yet bothered. Indeed, I wonder if the daligner would now be a good mapper for Illumina reads. Hmmmm, to be continued …
A precise, detailed, and up-to-date command reference including these new commands can be found here.