The purpose of this post is to succinctly clarify the differences between Dazzler databases, or DBs, and Dazzler maps, or DAMs, and to clear up a number of the most common misunderstandings users have about them. We start with a brief review:
A Dazzler database, or DB, is the organizing data framework for the system (original post here). It holds all information about Pacbio reads at the start of, and during, their processing into an assembly by Dazzler modules. Most importantly, the initial sequencing read data must be imported into a DB with fasta2DB and quiva2DB before any Dazzler programs, such as the daligner, can be applied, and the input must be in Pacbio's format. A key feature is that you can always get your data back out of a DB exactly as you imported it with DB2fasta and DB2quiva, meaning that no data external to the system need be kept, saving you (a lot of) disk space. Since many users use only selected components of the system, I also made it as easy as possible to get all relevant information out of the system with programs like DBdump and LAdump.
Somewhat later on (original post here), I introduced a Dazzler map, or DAM, which is very much like a DB but designed to hold the scaffolds of a reference genome such as hg38 or Rel6 of the fly genome. This variation was introduced so that the daligner could be used to compare reads to a reference genome. But it also allowed other interesting possibilities, such as comparing a genome against itself to detect the repetitive elements within it. The NCBI-standard .fasta file format is imported into a DAM with fasta2DAM. The .fasta headers can be any string, and each entry is interpreted as a scaffold of contigs, where runs of N's are interpreted as scaffold gaps whose mean length is equal to the length of the run. The DAM stores each contig sequence as an entry and remembers how the contigs are organized into scaffolds. In symmetry with DB import/export, DAM2fasta recreates the original input, so the DAM can be the only record of the reference genome on your computer.
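As a concrete illustration of the gap convention just described, here is a small sketch of how a scaffold entry can be split on runs of N's. This is illustrative Python, not the DAM's actual internal code or on-disk representation:

```python
import re

def split_scaffold(seq):
    """Split a scaffold sequence into contigs and intervening gap lengths.

    Runs of N's are treated as scaffold gaps whose estimated mean length
    equals the length of the run (the convention fasta2DAM applies; this
    function is only a sketch of that idea).
    """
    contigs = []   # contig sequences, in scaffold order
    gaps = []      # gap length between contig i and contig i+1
    pos = 0
    for m in re.finditer(r"N+", seq, flags=re.IGNORECASE):
        contigs.append(seq[pos:m.start()])
        gaps.append(m.end() - m.start())
        pos = m.end()
    contigs.append(seq[pos:])
    return contigs, gaps

# Two contigs separated by a 5bp gap
contigs, gaps = split_scaffold("ACGTACGT" + "N" * 5 + "GGCC")
```

DAM2fasta's job is then simply the inverse: emit the contigs in order with each gap rendered as a run of N's of the recorded length, which is why the original input can be recreated exactly.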
So both data objects contain a collection of DNA sequences and at this level they appear identical. But there are differences and we enumerate them below:
- DB: .fasta headers must be in Pacbio format and are parsed to obtain well, pulse range, and quality score.
- DAM: .fasta headers are not interpreted.
- DB: may contain QV quality streams needed by Quiver for consensus.
- DAM: cannot contain QV quality streams.
- DAM: interprets N’s in input sequences as scaffold gaps and stores each entry as a collection of contig sequences, retaining the information needed to restore the scaffold.
- DB: does not allow N’s in the input sequence.
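The first pair of points deserves a concrete example. A Pacbio-format header typically looks like movie_name/well/pulse_start_pulse_end, optionally followed by an RQ= quality score. The sketch below shows the kind of parse fasta2DB performs on each header; the regex and function are illustrative, not fasta2DB's actual source:

```python
import re

# Illustrative Pacbio-style header:
#   movie_name/well/pulse_start_pulse_end RQ=0.850
# A sketch of the parse fasta2DB performs, not its actual code.
HEADER = re.compile(
    r"^(?P<movie>[^/]+)/(?P<well>\d+)/(?P<beg>\d+)_(?P<end>\d+)"
    r"(?:\s+RQ=(?P<rq>\d\.\d+))?$"
)

def parse_pacbio_header(header):
    """Return (movie, well, pulse range, quality) or raise ValueError."""
    m = HEADER.match(header)
    if m is None:
        raise ValueError("not a Pacbio-format header: " + header)
    rq = float(m.group("rq")) if m.group("rq") else None
    return (m.group("movie"), int(m.group("well")),
            (int(m.group("beg")), int(m.group("end"))), rq)

movie, well, pulses, rq = parse_pacbio_header(
    "m150000_000000_00000_c0000/42/0_11500 RQ=0.850")
```

A DAM skips this parse entirely, which is exactly why it tolerates arbitrary headers, and why a DB cannot.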
Until recently the system expected that each imported Pacbio file contained the data from a single SMRT cell, and hence that every header was identical save for the well number, pulse range, and quality score. Quite a number of users, however, placed the data from several SMRT cells in the same .fasta file, and were then disappointed when fasta2DB refused to import it. They discovered that fasta2DAM would accept such input just fine, as it makes no assumptions about the .fasta headers. Since a DAM can be used in almost every place a DB can, this might seem like a good solution, but it's not, because one can no longer trim the data (see DBsplit) so that only the best read in each well is used in downstream processing. As of this post, fasta2DB has been upgraded so that a user can pass it files containing multiple SMRT cell data sets, as long as the entries for each SMRT cell are consecutive in the file. So please:
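The "consecutive" condition is easy to state precisely: once the movie name (the part of the header before the first slash) changes, it must never reappear later in the file. The sketch below checks that condition; it is a model of the requirement, not the actual check fasta2DB performs:

```python
def cells_are_consecutive(headers):
    """Check that entries from each SMRT cell (identified here by the
    movie-name prefix before the first '/') form one consecutive run.
    A sketch of the condition the upgraded fasta2DB requires."""
    seen = set()
    last = None
    for h in headers:
        movie = h.split("/", 1)[0]
        if movie != last:
            if movie in seen:     # this cell appeared earlier: not consecutive
                return False
            seen.add(movie)
            last = movie
    return True

ok  = cells_are_consecutive(["cellA/1/0_100", "cellA/2/0_90", "cellB/1/0_80"])
bad = cells_are_consecutive(["cellA/1/0_100", "cellB/1/0_80", "cellA/2/0_90"])
```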
Use a DB for your Pacbio reads, and use a DAM for all secondary sequence data such as reference genomes.
A final point that some may have missed is that the import routines are incremental. That is, you can import data into a DB or DAM over any number of distinct calls, where the data in each new file imported is appended to the growing DB or DAM. While it's not so important for a DAM, this feature is very valuable for a Pacbio shotgun sequencing project in which many SMRT cells are produced over a period of days or weeks: you can add each new SMRT cell to your DB as it comes off the instrument. Moreover, this is performed efficiently, in time proportional to the size of the new data (and not the size of the DB). All the downstream programs and scripts are incremental too, so you can, for example, be performing the time-consuming daligner overlap computation as the data arrives, spreading the CPU time over the sequencing time. When the last SMRT cell comes off the machine, you will have a relatively short wait until you have all the overlaps you need for the relatively rapid scrubbing and assembly steps.
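The incremental contract can be shown in miniature. In this toy model (a conceptual sketch only, nothing like the DB's actual implementation), each import appends only the new records and reports the index range it added, so its cost is proportional to the new data and downstream jobs can target just the new range:

```python
class ToyDB:
    """A toy append-only read store illustrating incremental import:
    each add() touches only the newly imported records (cost ~ new data),
    never the records already present. A conceptual sketch only."""

    def __init__(self):
        self.reads = []        # all reads imported so far
        self.imports = []      # (start, end) index range of each import

    def add(self, new_reads):
        start = len(self.reads)
        self.reads.extend(new_reads)          # O(len(new_reads))
        self.imports.append((start, len(self.reads)))
        return self.imports[-1]               # range downstream jobs can target

db = ToyDB()
first = db.add(["read1", "read2"])   # SMRT cell 1 arrives
second = db.add(["read3"])           # SMRT cell 2 arrives later
```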
Stay tuned, upcoming posts/releases:
DAMAPPER: an optimized version of daligner specifically for read mapping
DAVIEWER: a QT-based viewing tool for looking at read piles including quality and scrubbing information
DASCRUBBER: a complete end-to-end scrubber for removing all artifacts and low quality segments from your read data set (at last!).