Seeing Dazzler Data & Results

Because the Dazzler is under development and not yet an end-to-end assembler, every Dazzler user at some point needs to get the information that tools like daligner have computed, so that they can carry on with the rest of their assembly process.  Currently, a few intrepid souls have either gotten into the code far enough to directly decode various files or have built ASCII parsers that pull the information from the outputs of commands like DBshow and LAshow that give users a printed representation of the information in a database or alignment file, respectively.  To rectify this, I have recently added some tools that make it very easy for a user to get any information they want from the Dazzler suite in an easy-to-understand and easy-to-parse format.

In a recent talk delivered at the PacBio developers conference this August, I railed about how awful formats like .fasta and .bam are.  Basically, I think that the programs that produce a dataset (I call them writers) should produce output that is as easy as possible for a consumer of the data to use (I call them readers).  For example, why should the reader not know in advance the size of the largest sequence in a data set?  The writer knows this and should give the information to the reader at the start of the encoding so that the reader can simply allocate a buffer of this maximum size, knowing that they need not check for overflow or have to reallocate and expand the buffer during input.  As another example, why put new-lines in the middle of a sequence?  I mean really, how often do you actually read a .fasta file?  And even if you did, every text manager I know of wraps overly long lines.  So at this point you get the idea, I’m sure, and I’ll stop the rant here 🙂

Thus far the Dazzler suite produces a database organized around read records (or contigs of a reference sequence) and one or more alignment files giving local alignments between reads.  The new utility DBdump gets information out of a database, including any mask tracks associated with reads, and LAdump gets information out of an alignment file.  The formal specifications can be found by following the link associated with each name; here we just give the intuition and idea behind the commands.  For example, the command “DBdump -hrs -mdust DB 10-12” will output the header (-h), sequence (-s), read number (-r), and dust track (-mdust) for reads 10 to 12 inclusive from DB.  The output might look like:

+ R 3
+ M 1
+ H 183
@ H 61
+ T0 4
@ T0 3
+ S 29186
@ S 11702
R 10
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 338 0 8952
T0 1  4370 4380
S 8952 aaaagagagatactg...ccctgcggt
R 11
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 347 0 8532
T0 1  6613 6647
S 8532 gaggggaaagatgat...ttcgacggc
R 12
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 385 0 11702
T0 3  491 539 3363 3383 3817 3827
S 11702 agtgaaagagtgaa...atccgctgg

Each line begins with a single symbol or 1-code that designates what is to follow, e.g. R for read number, L for well and pulse range, H for header, and S for sequence.  Lines that have the 1-code + or @ always come at the beginning of the file, as they give the total number of a given type of object in the file (+) or the maximum size of any given object over all reads (@).  For example, the first 8 lines tell you there are 3 reads, 1 mask track, 183 characters in all headers with 61 characters in the longest header, 4 track-0 (dust) intervals altogether with at most 3 intervals for any given read, and 29186 bases in all the reads with the longest being 11702bp long.  With this information you can allocate, once and up front, all the buffers you need for reading and processing the file.  The records themselves should be easy to understand noting that (a) all items on a line are separated by a single space, (b) a string is a length/sequence pair (no newlines in the string!), and (c) track codes are T0, T1, T2, … where the number corresponds to their order on the command line.
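To make the point concrete, here is a minimal Python sketch (a hypothetical reader, not part of the suite) that parses a DBdump stream; in C one would use the @ maxima to allocate fixed-size buffers once, whereas here we simply record them:

import sys

def parse_dbdump(stream):
    """Parse DBdump 1-code output, yielding one dict per read record."""
    totals, maxima = {}, {}                   # the + and @ prolog lines
    read = None
    for line in stream:
        f = line.split()
        if not f:
            continue
        if f[0] == '+':                       # total count of an object type
            totals[f[1]] = int(f[2])
        elif f[0] == '@':                     # maximum size of an object type
            maxima[f[1]] = int(f[2])
        elif f[0] == 'R':                     # start of a new read record
            if read is not None:
                yield read
            read = {'number': int(f[1])}
        elif f[0] == 'H':                     # header: length, then string
            read['header'] = f[2]
        elif f[0] == 'L':                     # well and pulse range
            read['well'], read['beg'], read['end'] = map(int, f[1:4])
        elif f[0] == 'S':                     # sequence: length, then bases
            read['seq'] = f[2]
        elif f[0][0] == 'T':                  # track: count, then interval pairs
            pts = list(map(int, f[2:]))
            read[f[0]] = list(zip(pts[0::2], pts[1::2]))
    if read is not None:
        yield read

for r in parse_dbdump(sys.stdin):
    print(r['number'], len(r.get('seq', '')))

One might invoke it as “DBdump -hrs -mdust DB 10-12 | python parse_dump.py”.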

Similarly one can access the information in a sorted alignment file with a call such as “LAdump [-cdt] DB.db DB.las 4-6”, which will output the overlap pairs together with coordinate intervals (-c), number of differences (-d), and trace points (-t) for all the overlaps in DB.las whose A-read is 4, 5, or 6.  The set of all alignments with a given A-read is called a pile, and many components of the Dazzler suite downstream of daligner analyze piles.

+ P 117
% P 44
+ T 6986
% T 2858
@ T 90
P 4 241 n
C 0 1315 2028 3327
D 252
T 14
25  84
17  97
16 103
...
1  14
P 4 274 n
C 0 1085 3335 4443
D 247
T 11
28  91
15 101
20  90
P 4 1740 c
C 0 1945 13444 15451
D 461
T 20
21  86
...
P 4 2248 n
C 244 2850 0 2383
D 590
T 27
...

The 1-codes used are: P for a read pair and orientation (c or n), C for the alignment coordinates, D for the number of differences in the alignment, and T for the start of a list of trace points.  In the example output above, the first 5 lines give information about the number of objects, where the new 1-code % designates the maximum size of something with respect to a pile.  In the example, there are 117 alignments altogether with 44 in the largest pile, and there are 6986 trace points over all alignments, where the alignment with the most trace points has 90 and the pile with the most has 2858.  The first overlap record is between reads 4[0..1315] and 241n[2028..3327] with 252 differences in the alignment and 14 trace points given in pairs in the 14 lines following.  A forthcoming post will give a precise description of what trace points are exactly.
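A matching Python sketch (again hypothetical, and assuming -c, -d, and -t were all given as in the example above) shows how little code a consumer needs, and how the sort order makes pile extraction trivial:

import sys
from itertools import groupby

def parse_ladump(stream):
    """Yield one dict per alignment record in LAdump output."""
    rec = None
    for line in stream:
        f = line.split()
        if not f or f[0] in ('+', '%', '@'):   # file-level counts and maxima
            continue
        if f[0] == 'P':                        # a-read, b-read, orientation
            if rec is not None:
                yield rec
            rec = {'a': int(f[1]), 'b': int(f[2]),
                   'comp': f[3] == 'c', 'trace': []}
        elif f[0] == 'C':                      # coordinates: ab ae bb be
            rec['ab'], rec['ae'], rec['bb'], rec['be'] = map(int, f[1:5])
        elif f[0] == 'D':                      # differences in the alignment
            rec['diffs'] = int(f[1])
        elif f[0] == 'T':                      # trace point count; pairs follow
            pass
        else:                                  # one trace point pair
            rec['trace'].append((int(f[0]), int(f[1])))
    if rec is not None:
        yield rec

# Records are sorted on the a-read, so each pile is a contiguous group.
for a, pile in groupby(parse_ladump(sys.stdin), key=lambda r: r['a']):
    print(a, len(list(pile)))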

Dazzler DAMs and Other Upgrades

Good news: the PacBio machines now occasionally produce reads that are longer than 65,536 bases!  While I was excited about the length, my heart sank as I knew it meant that I had to take all my carefully compacted Dazzler data structures that used short integers and turn them into proper 4-byte integers, requiring a review and retest of all the code released thus far.  To my relief it actually wasn’t that hard, and while I was at it I further modified the daligner local alignment finder so that it can accommodate k-mers of length up to 32.  The downside is that the daligner now takes twice as much memory as before, but it is only 4% slower on previous test cases.

The upside, on the other hand, is huge because now the daligner can also be used as a general read mapper and string-to-string comparison tool, as a “read” can now be a DNA sequence as long as, say, the human genome or any other genomic sequence one might care to align reads to.  The only change required was to generalize a dazzler DB to something that holds all the contigs of a reference sequence and their .fasta header lines.  I decided to call these objects Dazzler Maps (DAMs) and identify them with a .dam suffix versus the .db suffix of a regular database of Pacbio reads.  Internally almost everything is the same, save that a DAM, say foo, doesn’t have quality values (i.e. a .foo.qvs file) but does retain the header of every entry (in a new .foo.hdr file).  In fact every command that runs on a DB now also runs on a DAM if it makes sense to do so.  In overview:

New Commands:

fasta2DAM, DAM2fasta:

Like fasta2DB and DB2fasta, except entries with runs of N’s are broken into separate “contig” sequences (and the range of each contig recorded) by fasta2DAM and stitched back together with intervening N’s by DAM2fasta.  If you just have a plain old .fasta file with a bunch of sequences with no N’s in them, don’t worry, everything works just fine.

HPCmapper:

Like HPCdaligner, HPCmapper produces a UNIX script that uses daligner, LAsort, and LAmerge to find all local alignments between the sequences in two databases, each of which can be either a .dam or a .db.  In addition, HPCdaligner itself now also works on a .dam, so, for example, I compared the E. coli reference genome against itself to find all the repetitive elements, which I thought was kind of cool.

Extended Commands:

DBshow, DBdust, Catrack, DBstats, DBsplit, DBrm:

All these commands run perfectly well on either a .db or a .dam.  DBshow has also been upgraded to output the original .fasta header of displayed entries for .db‘s as well as .dam‘s.

daligner:

The arguments to the daligner can now be a .dam, a .db, or a block thereof.  The daligner takes care of all the details in handling one versus the other, and now has a -A option that controls whether an alignment between reads a and b is reported twice as (a,b) and (b,a) (the previous behavior), or as just (a,b).  It also has what I think is a cool -I option that tells daligner to allow a read to be compared against itself, excluding the identity match.

LAcheck, LAshow:

LAcheck and LAshow both take a DB as an argument in order to either check or display .las entries, respectively.  For HPCmapper applications, however, the A- and B-sequences come from two different DB’s.  So the extended commands can take either a single .db or .dam, in which case they assume both the A- and B-sequences come from said DB, or they can now take a pair of .db or .dam arguments that give the sources of the A- and B-sequences, respectively.

The new release of all the modules is available on Github here.

So with these upgrades and additions it is now possible to use Dazzler modules to map PacBio reads to a reference genome assembly.  In brief, one creates a DAM, say R.dam, of the reference genome with fasta2DAM, and presumably one has the PacBio reads in a DB, D.db.  The .dam can be DBdust‘ed and, if necessary, split into blocks with DBsplit.  A read mapping can then be effected by calling “HPCmapper R D”, which produces a script involving daligner, LAsort, and LAmerge that will result in files R.D.1.las, R.D.2.las, … R.D.n.las, where R.D.k.las contains the alignments between all of R.dam and the k‘th block of the DB D.db.  For example, if R is split into two blocks, and D into 3, then the call “HPCmapper -vd R D” produces the script:

# Daligner jobs (2)
daligner -vdA -k20 -h50 -e.85 R.1 D.1 D.2 D.3
daligner -vdA -k20 -h50 -e.85 R.2 D.1 D.2 D.3
# Initial sort jobs (6)
LAsort R.1.D.1.*.las && LAmerge L1.1.1 R.1.D.1.*.S.las && rm R.1.D.1.*.las
LAsort R.1.D.2.*.las && LAmerge L1.1.2 R.1.D.2.*.S.las && rm R.1.D.2.*.las
LAsort R.1.D.3.*.las && LAmerge L1.1.3 R.1.D.3.*.S.las && rm R.1.D.3.*.las
LAsort R.2.D.1.*.las && LAmerge L1.2.1 R.2.D.1.*.S.las && rm R.2.D.1.*.las
LAsort R.2.D.2.*.las && LAmerge L1.2.2 R.2.D.2.*.S.las && rm R.2.D.2.*.las
LAsort R.2.D.3.*.las && LAmerge L1.2.3 R.2.D.3.*.S.las && rm R.2.D.3.*.las

# Level 1 merge jobs (3)
LAmerge R.D.1 L1.*.1.las && rm L1.*.1.las
LAmerge R.D.2 L1.*.2.las && rm L1.*.2.las
LAmerge R.D.3 L1.*.3.las && rm L1.*.3.las

Note carefully that the parameters -k, -h, and -e default to more stringent values when running daligner as a mapper.  I have not carefully tested or evaluated the new daligner to see what values of -k and -h give the best time vs. sensitivity tradeoff for a given choice of error rate -e.  The values chosen for PacBio reads are if anything overly sensitive and could be further optimized for speed, but with the values given it’s already so fast I haven’t as yet bothered.  Indeed, I wonder if the daligner would now be a good mapper for Illumina reads.  Hmmmm, to be continued …

A precise, detailed, and up-to-date command reference including these new commands can be found here.


DALIGNER: Fast and sensitive detection of all pairwise local alignments

The first phase of the Dazzler assembler compares every read in the trimmed Dazzler database (DB) against every other read in the trimmed DB, seeking significant local alignments between them. In an old-fashioned OLC assembler, one would look, more restrictively, for overlaps between reads. But one can use local alignments to detect artifacts such as chimeric reads and unclipped adapter sequence by looking for consistent alignment terminations within a read, and one can also detect and annotate repetitive elements of the genome covered within a read by looking for “pile-ups” of alignments. Removing the artifacts and knowing the repetitive regions covered by a read is a huge advantage prior to assembling the reads. Indeed, the dazzler assembler employs several downstream phases specifically to remove artifacts, correct reads, and annotate repetitive elements prior to building a string graph, which is the modern analog of the layout phase of an OLC architecture.

You can get the source code for the DALIGNER module on Github here. Grabbing the code and uttering “make” should produce 9 programs: daligner, LAsort, LAmerge, LAshow, LAcat, LAsplit, LAcheck, HPCdaligner, and HPCmapper.

Ideally one would like to call the alignment finder, daligner, with a DB, e.g. “daligner MyDB“, and after some computing it would output a file encoding all the found local alignments. Unfortunately, for big projects the amount of memory and CPU time required make this impossible. For example, on the Pacbio 54X human data set, daligner took 15,600 CPU hours on our modest 480-core HPC cluster, and as a monolithic run would have required at least 12Tb of memory! I could start apologizing, but consider that the only other program that can compare reads at a 12-15% error rate, BLASR, took 404,000 CPU hours, and only delivered overlaps. Indeed, a primary reason for releasing daligner now, and not waiting to release a complete end-to-end assembler, is precisely to provide the bioinformatics community with a feasible way to compute alignments for Pacbio data at this scale. While we’ve embedded daligner within our database framework, it is easy to grab daligner‘s output in an ASCII format and then use that in an HGAP pipeline (which is the only end-to-end assembler currently publicly available).

Given the scale of the required computation, we have to “bite the bullet” and take a job-parallel, HPC approach to effect an all-against-all comparison of the trimmed DB. As a running example, let’s suppose you’ve set up a DB that is partitioned and optionally has a “dust” track (see this blog post) with a command sequence like:

fasta2DB DB MyData1.fasta MyData2.fasta ...
DBdust DB
DBsplit -x1000 DB

Suppose for simplicity that DB splits into 3 blocks. Then calling HPCdaligner on the DB will produce a UNIX shell script that contains all the commands needed to effect the all against all comparison. For example, “HPCdaligner DB” produces the script:

# Daligner jobs (3)
daligner DB.1 DB.1
daligner DB.2 DB.1 DB.2
daligner DB.3 DB.1 DB.2 DB.3
# Initial sort jobs (9)
LAsort DB.1.DB.1.*.las && LAmerge L1.1.1 DB.1.DB.1.*.S.las && rm DB.1.DB.1.*.las
LAsort DB.1.DB.2.*.las && LAmerge L1.1.2 DB.1.DB.2.*.S.las && rm DB.1.DB.2.*.las
LAsort DB.1.DB.3.*.las && LAmerge L1.1.3 DB.1.DB.3.*.S.las && rm DB.1.DB.3.*.las
LAsort DB.2.DB.1.*.las && LAmerge L1.2.1 DB.2.DB.1.*.S.las && rm DB.2.DB.1.*.las
LAsort DB.2.DB.2.*.las && LAmerge L1.2.2 DB.2.DB.2.*.S.las && rm DB.2.DB.2.*.las
LAsort DB.2.DB.3.*.las && LAmerge L1.2.3 DB.2.DB.3.*.S.las && rm DB.2.DB.3.*.las
LAsort DB.3.DB.1.*.las && LAmerge L1.3.1 DB.3.DB.1.*.S.las && rm DB.3.DB.1.*.las
LAsort DB.3.DB.2.*.las && LAmerge L1.3.2 DB.3.DB.2.*.S.las && rm DB.3.DB.2.*.las
LAsort DB.3.DB.3.*.las && LAmerge L1.3.3 DB.3.DB.3.*.S.las && rm DB.3.DB.3.*.las

# Level 1 merge jobs (3)
LAmerge DB.1 L1.1.*.las && rm L1.1.*.las
LAmerge DB.2 L1.2.*.las && rm L1.2.*.las
LAmerge DB.3 L1.3.*.las && rm L1.3.*.las

Each non-comment line of the script should be submitted to your cluster as a job, and all the jobs following one of the 3 comment lines can be run in parallel, but these 3 batches of job groups need to be run in sequence. That is, the 3 “Daligner” jobs can be run in parallel as a group, and once they have all completed, one can then run the 9 “Initial sort” jobs in parallel as a group, and finally the “Level 1 merge” jobs as a group. And of course, for a small script like this one, you could just invoke the shell on the script and let the jobs run sequentially on your desktop, e.g. “HPCdaligner DB | csh“.

When all the jobs of our sample script have completed there will be 3 sorted alignment files DB.1.las, DB.2.las, DB.3.las that encode the alignments found for the reads in each of the 3 trimmed DB blocks, respectively. To describe the contents of these 3 files precisely, one must first know that for each alignment, a .las-file records:

a[ab,ae] x b^comp[bb,be]

where a and b are the indices (in the trimmed DB) of the reads that overlap, comp indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and b^comp, respectively, that align. It is further the case that an alignment between reads a and b results in two alignment records, one for a versus b and another for b versus a. In addition, a series of trace points is retained with each alignment record to facilitate the rapid delivery of an optimal alignment between the relevant intervals of the reads. We can now explain that the output file DB.n.las contains all the overlap records for which the first read a is in the trimmed block DB.n, and these records further occur in sorted order of a, then b, then comp, and finally ab. What this means is that all the alignments involving a given a-read occur in a contiguous interval of the output files, permitting easy and job parallel computation in the downstream phases for chimera detection, read correction, repeat analysis, etc.

The HPCdaligner UNIX shell script employs the three commands daligner, LAsort, and LAmerge to produce the sorted and partitioned .las files for each of the N blocks of a hypothetical database D as follows:

  • “Daligner” step: Every block is compared against every other block by calling daligner on every pair D.x and D.y such that y≤x. A call such as “daligner D.x D.y1 .. D.yk” performs k comparisons of block D.x versus each block D.yi and does so with a compiled number of threads T (default 4). Each of these pairwise block comparisons produces 2T unsorted .las files D.x.D.y.[C|N]t.las and, if x > y, an additional 2T unsorted .las files D.y.D.x.[C|N]t.las. The file D.x.D.y.Ot.las contains every alignment produced by thread t where the a-read is in D.x, the b-read is in D.y, and the orientation is O. In total (N+1)N/2 block pairs are compared and 2N²T .las files result.
  • “Initial sort” step: The 2T .las files for each block pair x,y are sorted and merged into a single sorted .las file. The 2T files are first sorted with LAsort, where for each unsorted file U.las the command produces the sorted overlap file U.S.las. These sorted .S.las files are then merged into a single sorted file, L1.x.y.las, with LAmerge and afterwards removed. At the end of this step there are exactly N² sorted .las files, one for each block pair.
  • “Level k merge” steps: In a sequence of O(log N) levels the N files L1.x.*.las for a given block x are merged by LAmerge into a single .las file, D.x.las. At level 1, the N sorted files for a block are merged in groups of up to 25, resulting in ⌈N/25⌉ files, which in level 2 are merged in groups of up to 25, resulting in ⌈N/625⌉ files, and so on until only one file D.x.las remains. Even big projects involve fewer than 625 blocks, so typically only one or at most two merge levels are required (the job arithmetic is sketched below).
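For planning a run, the job and file counts above are easy to tabulate. A quick Python sketch (a hypothetical helper; T is the compiled thread count and 25 the LAmerge fan-in described above):

import math

def daligner_plan(N, T=4):
    """Counts for an N-block all-against-all with T threads per comparison."""
    block_pairs = (N + 1) * N // 2                # D.x vs D.y comparisons, y <= x
    unsorted_las = 2 * N * N * T                  # D.x.D.y.[C|N]t.las files
    sort_jobs = N * N                             # one LAsort+LAmerge per block pair
    merge_levels = max(1, math.ceil(math.log(N, 25)))  # fan-in 25 per level
    return block_pairs, unsorted_las, sort_jobs, merge_levels

print(daligner_plan(3))    # the example above: (6, 72, 9, 1)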

For extraordinarily large projects the computation can be performed in stages as new data is added to the database, thus amortizing the cost of finding alignments over the interval of data production. For example, suppose you have a really big project going and that thus far the DB has 78 blocks in its current partition. Executing the script produced by “HPCdaligner DB 1-77” will compare blocks 1 to 77 versus each other producing files DB.1.las through DB.77.las. Note carefully that the last block, 78, was not included, because it is a partial block. Later on suppose you’ve added more data and now have 144 blocks. Then executing the script produced by “HPCdaligner DB 78-143” will compare blocks 78 to 143 against each other and against the earlier blocks 1 through 77, updating DB.1.las through DB.77.las and creating DB.78.las through DB.143.las. And so on. The only restrictions are that (a) one must add the blocks in sequential order once only, and (b) one should never add the last, partial block until the very last increment. By submitting each new complete block as it appears, I figure one could keep up with data production for, say, a 100X shotgun of D. melanogaster on a decent laptop.

The command LAshow produces an ASCII display of all or a selected subset of the alignments encoded in a given .las file, sorted or unsorted. One can view just the read indices and their aligned intervals, one per line, or a line rendering of the relationship between the two reads, or a complete BLAST-style alignment of each local alignment, where one has several options for customizing the display. In addition to permitting one to browse local alignments, the ASCII output can also be fed into a Python or Perl script that converts the information to a form that can be input to an external program of your choosing, such as the read correction step of the HGAP assembler.

The LAcat and LAsplit commands allow one to effectively change the partition of the .las files for a project should it ever be necessary. Suppose the .las files D.1.las, D.2.las, …, D.N.las have been computed for a database D.db that has been partitioned into N blocks. “LAcat D.# >E.las” will find all the .las files of the form D.#.las and stream the conceptual concatenation of all of them in numeric order to the standard output, in this case to the file E.las. In the other direction, “LAsplit F.# N <E.las” will split E.las as evenly as possible into N block files, F.1.las, F.2.las, …, F.N.las, subject to the restriction that the alignments involving a given a-read are all in the same file. Note that F.x.las does not necessarily equal the original D.x.las, as the a-reads in F.x.las may not be the same as the reads in a given block, D.x.db, because the splitting criteria for LAsplit and DBsplit are not the same. To get exactly the same files back, one should instead utter “LAsplit F.#.las D.db <E.las” or more tersely “LAsplit F.# D <E.las”, which splits the piped input according to the partitioning of database D!  Finally, observe that if you want to repartition the DB D and also all the D.#.las files, first call DBsplit on D, and afterwards call “LAcat D.# | LAsplit D.# D” to redistribute the alignment records into .las files according to the new partitioning.

Lastly, the command LAcheck reads through the .las files given to it and checks them for structural integrity, i.e. do they reasonably encode a properly formatted .las file? We find this command useful in ensuring that every stage of a large HPCdaligner script has been properly executed. In processes involving thousands of jobs, it is not unusual for one or two to go “haywire”. So while we didn’t build it into the HPCdaligner scripts, we strongly recommend that you run LAcheck on .las files throughout the process to make sure all is well. Doing so is standard operating procedure for our assembly group.

A precise, detailed, and up-to-date command reference can be found here.


The Dazzler DB: Organizing an Assembly Pipeline

Assembling DNA sequence data is not a simple problem. Conceptually it can generally be thought of as proceeding in a number of sequential steps, performed by what we call phases arranged in a pipeline, each phase realizing one aspect of the overall process. For example, in the “OLC” assembly paradigm, the first phase (O) is to find all the overlaps between reads, the second phase (L) is to then use the overlaps to construct a layout of the reads, and the final phase (C) is to compute a consensus sequence over the arrangement of reads. One can easily imagine additional phases, such as a “scrubber” that detects chimeric reads, a “cleaner” that corrects read sequence, and so on. Indeed, the header banner for this blog shows a 7-phase pipeline that roughly realizes our current dazzler prototype. To facilitate the multiple and evolving phases of the dazzler assembler, we have chosen to organize all the read data into what is effectively a “database” of the reads and their meta-information. The design goals for this database are as follows:

  1. The database stores the source Pacbio read information in such a way that it can recreate the original input data, thus permitting a user to remove the (effectively redundant) source files. This avoids duplicating the same data, once in the source file and once in the database.
  2. The database can be built up incrementally, that is, new sequence data can be added to the database over time.
  3. The database flexibly allows one to store any meta-data desired for reads. This is accomplished with the concept of tracks that implementers can add as they need them.
  4. The data is held in a compressed form equivalent to the .dexta and .dexqv files of the data extraction module. Both the .fasta and .quiva information for each read are held in the database and can be recreated from it. The .quiva information can be added separately and later on if desired.
  5. To facilitate job-parallel, cluster operation of the phases of our assembler, the database has a concept of a current partitioning in which all the reads that are over a given length, and optionally unique to a well, are divided up into blocks containing roughly a given number of bases, except possibly the last block which may have a short count. Often programs can be run on blocks or pairs of blocks and each such job is reasonably well balanced as the blocks are all the same size. One must be careful about changing the partition during an assembly as doing so can void the structural validity of any interim block-based results.

The Dazzler DB module source code is available on Github here. Grabbing the code and uttering “make” should produce 13 programs: fasta2DB, quiva2DB, DB2fasta, DB2quiva, fasta2DAM, DAM2fasta, DBsplit, DBdust, Catrack, DBshow, DBstats, DBrm, and simulator.

One can add the information from a collection of .fasta files to a given DB with fasta2DB, and one can subsequently add the quality streams in the associated .quiva files with quiva2DB. After adding these files to a data base one can then remove them as property 1 above guarantees that these files can be recreated from the DB, and the routines DB2fasta and DB2quiva do exactly that. Indeed, here in Dresden we typically extract .fasta and .quiva files with dextract from the Pacbio .bax.h5 source files and then immediately put the .fasta and .quiva files into the relevant DB, remove the .fasta and .quiva files, and send the .bax.h5 files to our tape backup library. We can do this because we know that we can recreate the .fasta and .quiva files exactly as they were when imported to the database.

A Dazzler DB consists of one named, visible file, e.g. FOO.db, and several invisible secondary files encoding various elements of the DB. The secondary files are “invisible” to the UNIX OS in the sense that they begin with a “.” and hence are not listed by “ls” unless one specifies the -a flag. We chose to do this so that when a user lists the contents of a directory they just see a single name, e.g. FOO.db, that is the one used to refer to the DB in commands. The files associated with a database named, say FOO, are as follows:

(a) FOO.db
a text file containing (i) the list of input files added to the database so far, and (ii) how to partition the database into blocks (if the partition parameters have been set).

(b) .FOO.idx
a binary “index” of all the meta-data about each read allowing, for example, one to randomly access a read’s sequence (in the store .FOO.bps). It is 28N + 88 bytes in size where N is the number of reads in the database.

(c) .FOO.bps
a binary compressed “store” of all the DNA sequences. It is M/4 bytes in size where M is the total number of base pairs in the database.

(d) .FOO.qvs
a binary compressed “store” of the 5 Pacbio quality value streams for the reads. Its size is roughly 5M/3 bytes depending on the compression achieved. This file only exists if .quiva files have been added to the database.

(e) .FOO.<track>.anno, .FOO.<track>.data
a track called <track> containing customized meta-data for each read. For example, the DBdust command annotates low complexity intervals of reads and records the intervals for each read in the two files .FOO.dust.anno and .FOO.dust.data that are together called the “dust” track. Any kind of information about a read can be recorded, such as micro-sat intervals, repeat intervals, corrected sequence, etc. Specific tracks will be described as phases that produce them are created and released.
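Putting the size formulas above together gives a quick footprint estimate; a minimal sketch (hypothetical helper; it ignores the small .db text file and any track files):

def db_footprint(nreads, nbases, with_qvs=True):
    """Approximate on-disk size in bytes of a Dazzler DB:
       .idx = 28N + 88, .bps = M/4, .qvs ~ 5M/3 (if .quiva data was added)."""
    size = 28 * nreads + 88 + nbases // 4
    if with_qvs:
        size += 5 * nbases // 3
    return size

# e.g. a hypothetical 1M reads totalling 10Gbp: ~2.5Gb of sequence store
print(db_footprint(1_000_000, 10_000_000_000) / 1e9, "Gb")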

If one does not like the convention of the secondary files being invisible, then un-defining the constant HIDE_FILES in DB.h before compiling the library creates commands that do not place a prefixing “.” before secondary file names, e.g. FOO.idx instead of .FOO.idx. One then sees all the files realizing a DB when listing the contents of a directory with ls.

While a Dazzler DB holds a collection of Pacbio reads, a Dazzler map DB or DAM holds a collection of contigs from a reference genome assembly. This special type of DB has been introduced in order to facilitate the mapping of reads to an assembly and has been given the suffix .dam to distinguish it from an ordinary DB. It is structurally identical to a .db except:

(a) there is no concept of quality values, and hence no .FOO.qvs file.

(b) every .fasta scaffold (a sequence with runs of N’s between contigs estimating the length of the gap) is broken into a separate contig sequence in the DB and the header for each scaffold is retained in a new .FOO.hdr file.

(c) the origin and the first and last pulse fields in the meta-data records held in .FOO.idx hold instead the contig number and the interval of the contig within its original scaffold sequence.

A map DB can equally well be the argument of any of the commands below that operate on normal DBs. In general, a .dam can be an argument anywhere a .db can, with the exception of routines or optioned calls to routines that involve quality values, or the special routines fasta2DAM and DAM2fasta that create a DAM and reverse said, just like the pair fasta2DB and DB2fasta. So in general when we refer to a database we are referring to either a DB or a DAM.

The command DBsplit sets or resets the current partition for a database, which is determined by 3 parameters: (i) the total number of basepairs to place in each block, (ii) the minimum read length of reads to include within a block, and (iii) whether to include only the longest read from a given well or all reads from a well (NB: several reads of the same insert in a given well can be produced by the Pacbio instrument). Note that the length and uniqueness parameters effectively select a subset of the reads that contribute to the size of a block. We call this subset the trimmed database. Some commands operate on the entire database, others on the trimmed database, and yet others have an option flag that permits them to operate on either at the user’s discretion. Therefore, one should note carefully which version of the database a command refers to. This is especially important for any command that identifies reads by their index (ordinal position) in the database.

Once the database has been split into blocks, the commands DBshow, DBstats, and DBdust below and commands yet to come, such as the local alignment finder dalign, can take a block or blocks as arguments. On the command line this is indicated by supplying the name of the DB followed by a period and then a block number, e.g. FOO.3.db or simply FOO.3 refers to the 3rd block of DB FOO (assuming of course it has a current partition and said partition has a 3rd block). One should note carefully that a block is a contiguous range of reads such that, once trimmed, it has a given size in base pairs (as set by DBsplit). Thus, like an entire database, a block can be either untrimmed or trimmed, and one needs to again be careful when giving a read index to a command such as DBshow.

The command DBdust runs the symmetric DUST low complexity filter on every read in a given untrimmed DB or DB block and it is the first example of a phase that produces a track. It is incremental in that you can call it repeatedly on a given DB and it will extend the track information to include any new reads added since the last time it was run on the DB. Reads are analyzed for low-complexity intervals and these intervals constitute the information stored in the .dust-track. These intervals are subsequently “soft”-masked by the local alignment command dalign, i.e. the low-complexity intervals are ignored for the purposes of finding read pairs that might have a significant local alignment between them, but any local alignment found will include the low-complexity intervals.

When DBdust is run on a block, say FOO.3, it is run on the untrimmed version of the block and it produces a block track in the files .FOO.3.dust.anno and .FOO.3.dust.data, whereas running DBdust on the entire DB FOO produces .FOO.dust.anno and .FOO.dust.data. Thus one can run an independent job on each block of a DB, and then after the fact stitch all the block tracks together with a call to Catrack, e.g. “Catrack FOO dust“. Catrack is a general command that merges any sequence of block tracks together into a track for the specified DB. For example, a 50X human dataset would partition into roughly 375 blocks of 400Mb each, for which DBdust takes about 20 seconds per block on a typical processor. One can either call DBdust on the entire DB and wait 375·20 seconds ~ 2.1 hours, or run 375 jobs on each block on your local compute cluster, waiting 1 or 2 minutes, depending on how loaded your cluster is, and then call Catrack on the block tracks, which takes another half a minute at most.
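The per-block job lines for this pattern are trivial to generate; a minimal Python sketch (a hypothetical helper; the block count is whatever DBsplit reported for your DB):

def dust_jobs(db, nblocks):
    """One DBdust job per block, then one Catrack job to stitch the
       block tracks into a single dust track for the whole DB."""
    jobs = [f"DBdust {db}.{i}" for i in range(1, nblocks + 1)]
    jobs.append(f"Catrack {db} dust")
    return jobs

print("\n".join(dust_jobs("FOO", 375)))   # the 50X human example above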

Tracks that encode intervals of the underlying reads in the database, of which the “dust” track is an example, are specifically designated interval tracks, and commands like DBshow, DBstats, and daligner can accept and utilize one or more such tracks, typically specified through a -m option on the command line.

DBshow and DBstats are routines that allow one to examine the current contents of a DB or DB block. DBstats gives one a summary of the contents of an untrimmed or trimmed DB or DB block, including a histogram of read lengths and, if requested, histograms of specified interval tracks. DBshow allows you to display the sequence and/or quality streams of any specific set of reads or read ranges in the DB or DB block, where the output can be in Pacbio fasta or quiva format depending on which option flags are set. There is a flag that controls whether the command applies to the entire DB or the trimmed DB, and one must be careful, as, say, the 5th read of the DB may not be the 5th read of the trimmed DB. A nice trick here is that since the output is in the Pacbio formats, one can then create a “mini-DB” of the set of reads output by DBshow, by calling fasta2DB and/or quiva2DB on its output. This conveniently allows a developer to test/debug a phase on a small, troublesome subset of reads in the DB.

DBrm removes all the files, including all tracks, realizing a given DB. It is safer to use this command than to attempt to remove a DB by calling rm on each hidden file. Lastly, this release includes a simple simulator, simulator, that creates a synthetic Pacbio data set useful for testing purposes.

A precise, detailed, and up-to-date command reference can be found here.


On Perfect Assembly

During the AGBT meeting this February at Marco Island I decided to “get modern” and joined Twitter, in part due to the encouragement of my younger colleagues, Ned Lederbragt and Pavel Tomancak.  I was at the AGBT meeting to present my new work towards an assembler for long-read-only data sets.  I’ve been out of the DNA sequencing “scene” for almost ten years now, mostly because the trend has been so much about cost versus quality that I just lost interest as a mathematician.  With the short read instruments, a great assembly is just not possible, especially in the face of strong sampling bias of the reads and consistently reproducible sequencing errors.  Yes, the results are still biologically useful, but I’m not a biologist.  What mathematician wants to work on a problem they know they cannot solve to their satisfaction?  Moreover, as a bioinformatician I know how much time is spent (wasted) dealing with the overwhelming number of caveats in the data sets.  Wouldn’t it be great if at least the source of much of our work, the DNA sequence, were perfect?

What I perceived early in 2013 was that the relatively new Pacbio “long read” DNA sequencer was reaching sufficient maturity that it could produce data sets that might make this possible, or at least get us much, much closer to the goal of near perfect, reference quality reconstructions of novel genomes.  While the accuracy of its reads is relatively low (85-88%), causing many to erroneously think it not a good machine, the Pacbio machine has two incredibly powerful offsetting properties, namely, that (i) the set of reads produced is a nearly Poisson sampling of the underlying genome, and (ii) the location of errors within reads is truly randomly distributed.  Given my previous experience, I immediately understood the implication of these two properties, and decided in the summer of 2013 to buy a machine and get back into the genome assembly game.

So with my new twitter account, I decided to float the following 140 character theorem, that encapsulates my reasoning for coming back to the DNA sequencing scene:

Thm: Perfect assembly possible iff a) errors random b) sampling is Poisson c) reads long enough 2 solve repeats. Note: e-rate not needed

There was much follow-on: some people didn’t believe it and some were (quite reasonably) confused about some of the conditions (especially c).  In addition I got some interesting links to work along these lines (here and here).  So how did I get to the conclusion of the theorem?

Condition (b), by the Poisson theory of Lander and Waterman, implies that for any minimum target coverage level k, there exists a level of sequencing coverage c that guarantees that every region of the underlying genome is covered k times.  In their 1988 paper they showed that the amount of the genome uncovered was roughly e^-c where e is Napier’s constant, and an easy generalization shows that the amount of the genome not covered at least k times is roughly (c^k/k!)·e^-c.  Certainly as c goes to infinity this quantity goes to 0, and since the genome is finite, at some point every part of the genome is covered k times.
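The exact Poisson tail is just as easy to evaluate numerically; a quick sketch (the single term (c^k/k!)·e^-c quoted above is the same order of magnitude):

from math import exp, factorial

def frac_undercovered(c, k):
    """Poisson: expected fraction of the genome covered fewer than k times
       when the sequencing coverage is c."""
    return exp(-c) * sum(c**i / factorial(i) for i in range(k))

for c in (10, 20, 30, 40):
    print(c, frac_undercovered(c, 10))   # fraction with < 10X at coverage c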

Condition (a), from the early work of Churchill and Waterman, implies that the accuracy of the consensus sequence of k sequences, each with average error err, is O(err^k), which goes to 0 as k increases.  This clearly implies that one can arrive at an arbitrarily accurate consensus by sampling enough sequences.  Note carefully that this is not true for a device that has reproducible errors, and every previous sequencing device makes such reproducible errors, the best ones at about a rate of 1 per 10Kbp.  This implies that with previous devices the best accuracy you could ever expect is a “Q40” reconstruction.  For the Pacbio, where error is truly random, any desired accuracy is possible.  For example, the Pacbio team has produced a Q60 reconstruction of E. coli with their machines.
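By the same back-of-envelope reasoning (a sketch that ignores the constant hidden in the O(err^k) bound), one can ask how deep the coverage of a position must be for a target consensus quality:

from math import ceil, log10

def depth_for_quality(err, q):
    """Reads needed at a position so that err**k <= 10**(-q/10),
       assuming errors are independent across reads."""
    return ceil((q / 10) / -log10(err))

print(depth_for_quality(0.15, 60))   # ~8 reads suffice for Q60 at 15% error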

So conditions (a) and (b) together imply that with sufficiently large coverage c, an arbitrarily accurate reconstruction of the entire genome is possible, provided the sequencing reads can be correctly assembled.  As is well known, it is the underlying repeat structure of a genome that confounds assembly.  Condition (c) is very loosely stated and basically says, if the reads are so long that every repetitive region is spanned, then the repeats aren’t a problem.  Of course, one could try to characterize more carefully how long the reads need to be (for example, they tried here), or one could simply capitulate and say all regions not involving repeats over a given size and fidelity are correctly assembled.  Practically, full-length retrotransposons, the most frequent interspersed repetitive element, are 7Kbp long, so most regions of any genome will not be problematic with 10Kbp reads.  Typically one finds that the euchromatic arms of most genomes assemble correctly into one or perhaps a few pieces with reads of this length.  While one can try to make theoretical statements, the truth of the matter is most assemblers are still not good at resolving repeats on the basis of the micro-variation within them, i.e. there is still significant room for improvement in the combinatorial aspects of genome assembly.  We intend to aggressively address this weakness going forward.

The theorem presupposes that there is a single haplotype under consideration.  Diploidy and polyploidy create additional complications that I believe are addressable.  But the really important issue is how to build an assembler that is efficient in coverage c.  That is, the theorem just says for c “big” enough.  But pragmatically, the smaller c, the better the cost and efficiency.  Currently most of the “Pacbio-only” projects have involved 50X or more.  But in principle, a 20X assembly should be pretty darn good.  Yet no one currently gets a good assembly at 20X because the 50X is needed to “correct” reads, i.e. understand read error, sufficiently to put the pieces together.  So there is lots of room for innovation and improvement in the arena of genome assembly algorithms: assembly is not a solved problem!

The Dextractor Module: Time and Space

Measurements of the compute time and space usage of the tools released in the recent post “The Dextractor Module: Save disk space for your Pacbio projects” are presented for a few SMRT cells to give the curious reader or user an idea of their performance.  All tests were performed on a MacBook Pro with 16Gb of memory and a 2.7GHz Intel Core i7 processor running OS X 10.7.5.

We selected four SMRT cells of data from a recent sequencing project that vary in their size and the amount of good data they contain.  Table 1 gives the sizes of the four sets of .bax.h5 files for each SMRT cell in the first data column (recall that the data from a SMRT cell is currently delivered in 3 .bax.h5 files; here we deal with their total combined size).  First note that the amount of good data coming from a cell varies, e.g. SMRT 3 yielded more data (in reads over 500bp and with quality better than 75%) than SMRT 2, despite being smaller.


Table 1: Original HDF5 file size, time to extract both .fasta and .quiva files, and the sizes of these in absolute terms and as a percentage of the original file.

The extraction time is roughly proportional to the size of the input file, but note that the system time (noted in parentheses in the second data column) dropped precipitously for the small input of SMRT 4.  We would guess that this is because the individual .bax.h5 files are smaller than 2Gb each for said SMRT, and greater than that for the others, and that there is some crossover in system file performance at this threshold.  On average the .fasta file is about 4.5% of the original file and the .quiva file is about 22%, for a total of 26.5% (3.8X less space).  The relative sizes of the files are also shown graphically in Chart 1.


Chart 1: Bar chart of the sizes of the .bax.h5, .quiva, and .fasta files for each of the 4 SMRT cells

Compression and decompression times between .fasta files and their losslessly compressed .dexta counterparts were very fast at roughly 1 or 2 seconds in each direction, with decompression about 1.32 times slower than compression.  Each base is packed into 2 bits, giving a very consistent compression factor of 4 (25% of the uncompressed size).
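For intuition, here is a minimal 2-bit packing sketch in Python (illustrative only; the real dexta/undexta codec is written in C and also handles the .fasta headers and on-disk format):

CODE = {'a': 0, 'c': 1, 'g': 2, 't': 3}
BASE = 'acgt'

def pack(seq):
    """Pack a lowercase DNA string into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i+4].ljust(4, 'a'):   # pad the final partial quad
            byte = (byte << 2) | CODE[ch]
        out.append(byte)
    return bytes(out)

def unpack(buf, n):
    """Recover the first n bases from a packed buffer."""
    seq = []
    for byte in buf:
        for shift in (6, 4, 2, 0):
            seq.append(BASE[(byte >> shift) & 3])
    return ''.join(seq[:n])

s = 'acgtacgtgg'
assert unpack(pack(s), len(s)) == s           # round trip is lossless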


Table 2: Compression sizes and compression/decompression times for both .fasta and .quiva files.

The compression of .quiva files involves computing Huffman and run-length statistics on a file-by-file basis and so takes considerably longer, with compression again averaging 1.07 times faster than decompression.  The compression ratio can vary a bit more but was quite consistent at 29% plus or minus 1% (3.45X).  The Pacbio engineers have released different versions of the base caller and it may well be that for data sets produced with older versions of this software the compression ratio is different (it was not tested here).  The relative sizes of the compressed files are shown graphically in Chart 2.


Chart 2: Bar chart of the sizes of the .quiva and .fasta files and their compressed counterparts, .dexqv and .dexta, for each of the 4 SMRT cells

In summary, extracting just the necessary data from the HDF5 files provided by the manufacturer saves one a factor of almost 4 in storage.  Compressing this data then provides another factor of 3.5 or so, and decompressing the data is quite fast, indeed, very fast for the primary .fasta data.

The Dextractor Module: Save disk space for your Pacbio projects

The Pacbio sequencers currently produce HDF5-formatted files that contain a lot of information about a sequencing run including the sequence, streams of quality values, and a lot of other stuff that most of us will never look at.  These files are quite big: the 3 .bax.h5 files produced for a given SMRT cell typically occupy 7-9 Gb.  The recent 54X human genome data set produced by the company consists of the output from 277 SMRT cells, so if you want to download it and work with it, you will need to set aside roughly 2+Tb of disk.  Here we present a set of UNIX commands that extract just the data you need for an assembly project, and allow you to compress said data, so that it occupies roughly 1/13th the space.  For example, the extracted and compressed human data set occupies only ~150Gb!  For our own projects we routinely extract and compress data as it is produced, saving the HDF5 files to a tape archive as backup.  This saves our organization money, as keeping such data sets on disk at these scales is not a cheap proposition.

The Dextractor module source code is available on Github here.  Grabbing the code and uttering “make” should produce 5 programs: dextract, dexta, undexta, dexqv, and undexqv.  Dextract takes .bax.h5 files as input and produces a .fasta file containing the sequences of all the reads, and a .quiva file that contains all the quality value (QV) streams needed by the Quiver program to produce a consensus sequence as the final phase of assembly.  The .fasta file can then be compressed (~ 4x) with dexta to produce a .dexta file, and this can be decompressed with undexta to restore the original .fasta file.  Similarly dexqv & undexqv compress (~ 3.3x) and decompress a .quiva file into a .dexqv file and back.  The figure below shows the flow of information:


How data flows through the dextractor modules from a source .bax.h5 file to the assembler. The yellow percentages indicate the relative amount of space occupied by the given file type relative to the source (100%).

As an example, the Pacbio RS II produces 3 .bax.h5 files as output for a given SMRT cell (there are currently 3 blades on the machine’s server each handling 1/3rd of the 150K zero-mode wells).  When we receive such an output on our disk storage system, say files G.1.bax.h5, G.2.bax.h5, G.3.bax.h5, we immediately execute the following commands:

dextract -o G.*.bax.h5
dexta G
dexqv G
mv G.*.bax.h5 <Our_tape_backup_system>

This leaves us with files G.dexta and G.dexqv on disk.  Whenever we are ready to begin an assembly we decompress G.dexta with undexta (for most phases only the .fasta files are needed), and then if we have an assembly we like and want a final consensus sequence from Quiver, we decompress G.dexqv back into a .quiva file with undexqv, which we then feed into our customized version of Quiver (not yet available, alas).

A precise, detailed, and up-to-date command reference can be found here.