Moving to the Sequel


In the last year, Pacbio has, in conjunction with the introduction of the Sequel instrument, made two changes to their software environment that affect downstream applications including the Dazzler suite.  Namely, they have (1) moved from the HDF5 format to the BAM format for the primary output of the instrument, and (2) introduced a new alternative to the Quiver consensus base-caller, called Arrow, that is equally if not more accurate and requires less information about the raw signal.  Specifically, Quiver requires five quality streams, each the length of the associated read, whereas Arrow needs just the read’s pulse width stream, clipped to a maximum value of four, and a signal-to-noise (snr) ratio for each of the four nucleotide channels.  All .bax.h5 files currently and previously produced for the RS II contain the information needed by both Quiver and Arrow.  However, the new .subreads.bam files contain just the information needed by Arrow if they were produced by a Sequel instrument, but just the information needed by Quiver if they were produced by converting an RS II HDF5 file to a BAM file with the PacBio program bax2bam.

To accommodate these transitions, the Dazzler suite now interprets the .subreads.bam files produced by the instrument and can extract Quiver or Arrow information from either type of source file, and Dazzler DBs have been upgraded to hold either the information needed by Quiver, or the information needed by Arrow (but not both).  Most of the new programs and modifications are in the DEXTRACTOR module, although the upgrades also affect the DAZZ_DB module.  Below we outline the changes and new capabilities.

The original extraction program dextract can now also take a BAM file as input and from it produce a .fasta, a .quiva, an .arrow file, or some combination of these according to whether the command line options -f, -q, and -a, respectively, are set.  The .fasta and .quiva file types are unchanged, and the new .arrow file type contains the information needed by Arrow.  The dextract program has further been upgraded so that given an HDF5 file it too can produce any combination of the 3 files.  An example of a sub-read entry in each format is immediately below, and we explain the contents of each in the following paragraph.
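
The sketch below reconstructs one entry in each of the three formats using the well number, pulse range, read quality, and snr values discussed in the next paragraph.  The movie name is a placeholder, and the exact header field tags (RQ, SN) and the order of the five quality streams are illustrative guesses rather than verified dextract output:

>movie_name/44/0_6289 RQ=0.812
tgacctagcattaccaggttacca ... (DNA sequence, on any number of lines)

@movie_name/44/0_6289 RQ=0.812
(deletion QV stream)
(deletion tag stream)
(insertion QV stream)
(merge QV stream)
(substitution QV stream)

>movie_name SN=6.69,5.89,3.72,4.04
3121121421131124 ... (pulse width stream, on any number of lines)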

A .fasta file is in FASTA format, and an entry consists of a standard “Pacbio header” followed by the DNA sequence on any number of ensuing lines.  The standard header contains the movie name, well number (44), pulse range (0-6289), and sequence quality value (.812) to 3 digits, always with the same syntax.  A .quiva file uses a FASTQ-like format, and an entry consists of the same Pacbio header, followed by each of the 5 quality streams, one per line, with no line breaks within a stream.  Finally, the .arrow file is in FASTA format, and an entry consists of a header followed by the pulse width stream on any number of ensuing lines.  The header in this case contains just the movie name and the four snr values (6.69, 5.89, 3.72, 4.04) in a fixed syntax where the snr values are always given to 2 digits of precision and always listed in the order of the A, C, G, and T channels, respectively.  The pulse width stream is always a sequence of the symbols 1, 2, 3, and 4.

One might further recall that the DEXTRACTOR module also contains compression and decompression program pairs — dexta/undexta and dexqv/undexqv — that operate on .fasta and .quiva files, respectively.  There is now also a compressor/decompressor pair, dexar/undexar, that compresses an .arrow file into a .dexar file and decompresses it back.  The picture immediately below shows the existing programs in green and the new ones in orange.  The numbers in green in the picture give the approximate sizes of the various files in bytes, where N is the number of base pairs of data in the file.  The .bam file occupies roughly 2N bytes if it contains only Arrow information and 3.3N if it contains only Quiver information.  The larger .bax.h5 file always contains the information for both programs within it.  The main thing to note is that the compressed files represent a significant space savings, though by a smaller margin over the .bam source files, which are themselves gzip compressed.
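
As a concrete sketch (not a verified recipe — check the command reference for exact option behavior), extracting and compressing the Arrow-capable data from a hypothetical Sequel file G.subreads.bam might look like:

dextract -o -f -a G.subreads.bam
dexta G
dexar G

leaving the compressed files G.dexta and G.dexar on disk; for Quiver data one would use -q in place of -a and compress with dexqv instead.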

Previously, a Dazzler DB held the DNA sequence and header information for each subread, and optionally could store and organize all the Quiver information.  I have upgraded the system so that now a DB can also optionally hold all the Arrow information for a project.  But note carefully that a Dazzler DB cannot hold both.  So we term a DB that holds Quiver information a Quiver-DB (Q-DB), one that holds Arrow information an Arrow-DB (A-DB), and one that holds neither a Sequence-DB (S-DB).  One might think this forces a user to choose between Quiver and Arrow for consensus at the beginning of an assembly project, but this is not the case as one can at any time add or delete the Quiver or Arrow information independently of the sequence information.

Specifically, recall that one creates an S-DB, possibly incrementally, by adding .fasta files with fasta2DB.  Anytime thereafter one can add .quiva files with quiva2DB, in the process converting the S-DB into a Q-DB.  In exact analogy, one can now create an A-DB by incrementally adding .arrow files to an S-DB with the new program arrow2DB (found in the DAZZ_DB module).  As before, any files that were added to a DB can be recreated with the programs DB2fasta, DB2quiva, and a new DB2arrow program.  Since an A- or Q-DB contains everything you need for assembly (in compressed form) including producing the final consensus, you can choose to archive the source data files and only keep the DB, as long as you are certain about which consensus program you will use.  The figure below shows the additional programs.  Note that a Sequence DB consists of the files G.db, .G.idx, and .G.bps.  A Quiver DB additionally has the file .G.qvs, and an Arrow DB has a .G.arw file.  In the hopefully rare use-cases where one wants to decide which consensus program to use late in the assembly process, we recommend that you create an S-DB and keep both the Arrow and Quiver information as compressed .dexqv and .dexar files.  One can also convert an A- or Q-DB back into an S-DB by calling a new program DBwipe that removes all the auxiliary information for Quiver or Arrow from a DB.
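
For example, assuming extracted files G.fasta and G.arrow (names illustrative), the life cycle of an A-DB named DAZZ might look like:

fasta2DB DAZZ G.fasta     (create an S-DB from the sequence data)
arrow2DB DAZZ G.arrow     (add the Arrow information, giving an A-DB)
DB2arrow DAZZ             (recreate the added .arrow files on demand)
DBwipe DAZZ               (drop the Arrow information, reverting to an S-DB)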

While the above programs allow you to extract information from both forms of PacBio data files and produce a DB from the extracted data, I further observed that for most applications the intermediate production of .fasta, .quiva, and/or .arrow files or their compressed forms is unnecessary, and one saves both time and disk space with a program that both extracts the desired information and adds it to a DB in a single step, as shown in the figure below.  The program dex2DB has been developed and is now part of the DEXTRACTOR module.  The program can create either an S-, A-, or Q-DB directly as long as the necessary information is present in the source files.  I was tempted to remove all the programs that are faded out in the figure immediately above, but have decided to keep them (1) to help developers who want to work with Pacbio data but not with the Dazzler system, and (2) for those users who cannot yet decide whether they want to use Quiver or Arrow and so need to keep both types of information in .dexqv and .dexar files external to the DB.
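
A hypothetical invocation (the -a flag below is an assumption by analogy with dextract; consult the DEXTRACTOR documentation for the actual options of dex2DB) would then be:

dex2DB -a DAZZ G.subreads.bam

extracting the Arrow information from G.subreads.bam and depositing it directly into the DB DAZZ, yielding an A-DB with no intermediate files.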


The Dazzler DB: Organizing an Assembly Pipeline

Assembling DNA sequence data is not a simple problem. Conceptually it can generally be thought of as proceeding in a number of sequential steps, performed by what we call phases, arranged in a pipeline, each phase realizing one aspect of the overall process. For example, in the “OLC” assembly paradigm, the first phase (O) is to find all the overlaps between reads, the second phase (L) is to then use the overlaps to construct a layout of the reads, and the final phase (C) is to compute a consensus sequence over the arrangement of reads. One can easily imagine additional phases, such as a “scrubber” that detects chimeric reads, a “cleaner” that corrects read sequence, and so on. Indeed, the header banner for this blog shows a 7 phase pipeline that roughly realizes our current dazzler prototype. To facilitate the multiple and evolving phases of the dazzler assembler, we have chosen to organize all the read data into what is effectively a “database” of the reads and their meta-information. The design goals for this database are as follows:

  1. The database stores the source Pacbio read information in such a way that it can recreate the original input data, thus permitting a user to remove the (effectively redundant) source files. This avoids duplicating the same data, once in the source file and once in the database.
  2. The database can be built up incrementally, that is, new sequence data can be added to the database over time.
  3. The database flexibly allows one to store any meta-data desired for reads. This is accomplished with the concept of tracks that implementers can add as they need them.
  4. The data is held in a compressed form equivalent to the .dexta and .dexqv files of the data extraction module. Both the .fasta and .quiva information for each read is held in the database and can be recreated from it. The .quiva information can be added separately and later on if desired.
  5. To facilitate job-parallel, cluster operation of the phases of our assembler, the database has a concept of a current partitioning in which all the reads over a given length, and optionally only the longest read from each well, are divided up into blocks containing roughly a given number of bases, except possibly the last block, which may have a short count. Often programs can be run on blocks or pairs of blocks and each such job is reasonably well balanced as the blocks are all the same size. One must be careful about changing the partition during an assembly as doing so can void the structural validity of any interim block-based results.

The Dazzler DB module source code is available on Github here. Grabbing the code and uttering “make” should produce 13 programs: fast2DB, quiva2DB, DB2fasta, DB2quiva, fasta2DAM, DAM2fasta, DBsplit, DBdust, Catrack, DBshow, DBstats, DBrm, and simulator.

One can add the information from a collection of .fasta files to a given DB with fasta2DB, and one can subsequently add the quality streams in the associated .quiva files with quiva2DB. After adding these files to a data base one can then remove them as property 1 above guarantees that these files can be recreated from the DB, and the routines DB2fasta and DB2quiva do exactly that. Indeed, here in Dresden we typically extract .fasta and .quiva files with dextract from the Pacbio .bax.h5 source files and then immediately put the .fasta and .quiva files into the relevant DB, remove the .fasta and .quiva files, and send the .bax.h5 files to our tape backup library. We can do this because we know that we can recreate the .fasta and .quiva files exactly as they were when imported to the database.
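
Concretely, that routine looks something like the following for a movie G feeding a DB named DAZZ (names illustrative):

dextract -o G.*.bax.h5
fasta2DB DAZZ G.fasta
quiva2DB DAZZ G.quiva
rm G.fasta G.quiva
mv G.*.bax.h5 <Our_tape_backup_system>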

A Dazzler DB consists of one named, visible file, e.g. FOO.db, and several invisible secondary files encoding various elements of the DB. The secondary files are “invisible” to the UNIX OS in the sense that they begin with a “.” and hence are not listed by “ls” unless one specifies the -a flag. We chose to do this so that when a user lists the contents of a directory they just see a single name, e.g. FOO.db, that is the one used to refer to the DB in commands. The files associated with a database named, say FOO, are as follows:

(a) FOO.db
a text file containing (i) the list of input files added to the database so far, and (ii) how to partition the database into blocks (if the partition parameters have been set).

(b) .FOO.idx
a binary “index” of all the meta-data about each read allowing, for example, one to randomly access a read’s sequence (in the store .FOO.bps). It is 28N + 88 bytes in size where N is the number of reads in the database.

(c) .FOO.bps
a binary compressed “store” of all the DNA sequences. It is M/4 bytes in size where M is the total number of base pairs in the database.

(d) .FOO.qvs
a binary compressed “store” of the 5 Pacbio quality value streams for the reads. Its size is roughly 5M/3 bytes, depending on the compression achieved. This file only exists if .quiva files have been added to the database.

(e) .FOO.<track>.anno, .FOO.<track>.data
a track called <track> containing customized meta-data for each read. For example, the DBdust command annotates low complexity intervals of reads and records the intervals for each read in the two files .FOO.dust.anno and .FOO.dust.data that are together called the “dust” track. Any kind of information about a read can be recorded, such as micro-sat intervals, repeat intervals, corrected sequence, etc. Specific tracks will be described as phases that produce them are created and released.
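
So, for example, for a DB FOO to which .quiva files have been added and on which DBdust has been run, a listing with “ls -a” would show something like:

FOO.db  .FOO.idx  .FOO.bps  .FOO.qvs  .FOO.dust.anno  .FOO.dust.data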

If one does not like the convention of the secondary files being invisible, then un-defining the constant HIDE_FILES in DB.h before compiling the library creates commands that do not place a prefixing “.” before secondary file names, e.g. FOO.idx instead of .FOO.idx. One then sees all the files realizing a DB when listing the contents of a directory with ls.

While a Dazzler DB holds a collection of Pacbio reads, a Dazzler map DB or DAM holds a collection of contigs from a reference genome assembly. This special type of DB has been introduced in order to facilitate the mapping of reads to an assembly and has been given the suffix .dam to distinguish it from an ordinary DB. It is structurally identical to a .db except:

(a) there is no concept of quality values, and hence no .FOO.qvs file.

(b) every .fasta scaffold (a sequence with runs of N’s between contigs estimating the lengths of the gaps) is broken into its separate contig sequences in the DB, and the header for each scaffold is retained in a new .FOO.hdr file.

(c) the well number and first and last pulse fields in the meta-data records held in .FOO.idx hold instead the contig number and the interval of the contig within its original scaffold sequence.

A map DB can equally well be the argument of any of the commands below that operate on normal DBs. In general, a .dam can be an argument anywhere a .db can, with the exception of routines, or optioned calls to routines, that involve quality values, and of the special routines fasta2DAM and DAM2fasta, which create a DAM and reverse the process, just like the pair fasta2DB and DB2fasta. So in general when we refer to a database we are referring to either a DB or a DAM.
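
For instance, to prepare a reference assembly ref.fasta (name illustrative) for read mapping, and to later recreate the scaffolded .fasta file from the DAM:

fasta2DAM REF ref.fasta
DAM2fasta REF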

The command DBsplit sets or resets the current partition for a database, which is determined by 3 parameters: (i) the total number of basepairs to place in each block, (ii) the minimum length of reads to include within a block, and (iii) whether to include only the longest read from a given well or all reads from a well (NB: several reads of the same insert in a given well can be produced by the Pacbio instrument). Note that the length and uniqueness parameters effectively select a subset of the reads that contribute to the size of a block. We call this subset the trimmed database. Some commands operate on the entire database, others on the trimmed database, and yet others have an option flag that permits them to operate on either at the user’s discretion. Therefore, one should note carefully to which version of the database a command refers. This is especially important for any command that identifies reads by their index (ordinal position) in the database.
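
As a sketch, a typical call might look like the following, where -s gives the target block size in megabases and -x the minimum read length (the flag spellings here are assumptions — see the command reference for the definitive options):

DBsplit -s200 -x500 FOO

partitioning FOO into roughly 200Mb blocks drawn from reads at least 500bp long.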

Once the database has been split into blocks, the commands DBshow, DBstats, and DBdust below, and commands yet to come, such as the local alignment finder dalign, can take a block or blocks as arguments. On the command line this is indicated by supplying the name of the DB followed by a period and then a block number, e.g. FOO.3.db or simply FOO.3 refers to the 3rd block of DB FOO (assuming of course it has a current partition and said partition has a 3rd block). One should note carefully that a block is a contiguous range of reads such that once it is trimmed it has a given size in base pairs (as set by DBsplit). Thus like an entire database, a block can be either untrimmed or trimmed and one needs to again be careful when giving a read index to a command such as DBshow.

The command DBdust runs the symmetric DUST low complexity filter on every read in a given untrimmed DB or DB block, and it is the first example of a phase that produces a track. It is incremental in that you can call it repeatedly on a given DB and it will extend the track information to include any new reads added since the last time it was run on the DB. Reads are analyzed for low-complexity intervals and these intervals constitute the information stored in the “dust” track. These intervals are subsequently “soft”-masked by the local alignment command dalign, i.e. the low-complexity intervals are ignored for the purposes of finding read pairs that might have a significant local alignment between them, but any local alignment found will include the low-complexity intervals.

When DBdust is run on a block, say FOO.3, it is run on the untrimmed version of the block and produces a block track in the files .FOO.3.dust.anno and .FOO.3.dust.data, whereas running DBdust on the entire DB FOO produces .FOO.dust.anno and .FOO.dust.data. Thus one can run an independent job on each block of a DB, and then after the fact stitch all the block tracks together with a call to Catrack, e.g. “Catrack FOO dust“. Catrack is a general command that merges any sequence of block tracks together into a track for the specified DB. For example, a 50X human dataset would partition into roughly 375 blocks of 400Mb each, for which DBdust takes about 20 seconds per block on a typical processor. One can either call DBdust on the entire DB and wait 375*20 seconds ~ 2 hours, or run 375 jobs, one per block, on your local compute cluster, waiting 1 or 2 minutes depending on how loaded your cluster is, and then call Catrack on the block tracks, which takes another half a minute at most.
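
That is, the cluster recipe is simply to submit the per-block jobs and then merge (a sketch for the imaginary 375-block DB above):

DBdust FOO.1
DBdust FOO.2
  ...
DBdust FOO.375
Catrack FOO dust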

Tracks that encode intervals of the underlying reads in the database, of which the “dust” track is an example, are specifically designated interval tracks, and commands like DBshow, DBstats, and daligner can accept and utilize one or more such tracks, typically specified through a -m option on the command line.

DBshow and DBstats are routines that allow one to examine the current contents of a DB or DB block. DBstats gives a summary of the contents of an untrimmed or trimmed DB or DB block, including a histogram of read lengths and, if desired, histograms of interval tracks. DBshow allows you to display the sequence and/or quality streams of any specific set of reads or read ranges in the DB or DB block, where the output can be in Pacbio fasta or quiva format depending on which option flags are set. There is a flag that controls whether the command applies to the entire DB or the trimmed DB and one must be careful, as, say, the 5th read of the DB may not be the 5th read of the trimmed DB. A nice trick here is that since the output is in the Pacbio formats, one can then create a “mini-DB” of the set of reads output by DBshow, by calling fasta2DB and/or quiva2DB on its output. This conveniently allows a developer to test/debug a phase on a small, troublesome subset of reads in the DB.
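
A sketch of the mini-DB trick, assuming reads 3 through 10 are the troublesome ones (check the command reference for the exact read-range syntax):

DBshow FOO 3-10 > subset.fasta
fasta2DB MINI subset.fasta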

DBrm removes all the files, including all tracks, realizing a given DB. It is safer to use this command than to attempt to remove a DB by calling rm on each hidden file. Lastly, this release includes a simple simulator, simulator, that creates a synthetic Pacbio data set useful for testing purposes.

A precise, detailed, and up-to-date command reference can be found here.

 

The Dextractor Module: Time and Space

Measurements of the compute time and space usage of the tools released in the recent post “The Dextractor Module: Save disk space for your Pacbio projects” are presented for a few SMRT cells to give the curious reader or user an idea of their performance.  All tests were performed on a MacBook Pro with 16Gb of memory and a 2.7GHz Intel Core i7 processor running OS X 10.7.5.

We selected four SMRT cells of data from a recent sequencing project that vary in their size and the amount of good data they contain.  Table 1 gives the sizes of the four sets of .bax.h5 files for each SMRT cell in the first data column (recall that the data from a SMRT cell is currently delivered in 3 .bax.h5 files; here we deal with their total combined size).  First note that the amount of good data coming from a file varies, e.g. SMRT 3 yielded more data in reads over 500bp and of quality better than 75% than SMRT 2 did, despite being smaller.


Table 1: Original HDF5 file size, time to extract both .fasta and .quiva files, and the sizes of these in absolute terms and as a percentage of the original file.

The extraction time is roughly proportional to the size of the input file, but note that the system time (noted in parentheses in the second data column) dropped precipitously for the small input of SMRT 4.  We would guess that this is because the individual .bax.h5 files are smaller than 2Gb each for said SMRT, and greater than that for the others, and that there is some crossover in system file performance at this threshold.  On average the .fasta file is about 4.5% of the original file and the .quiva file is about 22%, for a total of 26.5% (3.8X less space).  The relative sizes of the files are also shown graphically in Chart 1.


Chart 1: Bar chart of the sizes of the .bax.h5, .quiva, and .fasta files for each of the 4 SMRT cells

Compression and decompression times between .fasta files and their losslessly compressed .dexta counterparts were very fast, at roughly 1 or 2 seconds in each direction, with decompression about 1.32 times slower than compression.  Each base is packed into 2 bits, giving a very consistent compression factor of 4 (25% of the uncompressed size).


Table 2: Compression sizes and compression/decompression times for both .fasta and .quiva files.

The compression of .quiva files involves computing Huffman and run-length statistics on a file-by-file basis and so takes considerably longer, with compression again averaging 1.07 times faster than decompression.  The compression ratio can vary a bit more but was quite consistent at 29% plus or minus 1% (3.45X).  The Pacbio engineers have released different versions of the base caller and it may well be that for data sets produced with older versions of this software the compression ratio is different (it was not tested here).  The relative sizes of the compressed files are shown graphically in Chart 2.


Chart 2: Bar chart of the sizes of the .quiva and .fasta files and their compressed counterparts, .dexqv and .dexta, for each of the 4 SMRT cells

In summary, extracting just the necessary data from the HDF5 files provided by the manufacturer saves one a factor of almost 4 in storage.  Compressing this data then provides another factor of 3.5 or so, and decompressing the data is quite fast, indeed, very fast for the primary .fasta data.

The Dextractor Module: Save disk space for your Pacbio projects

The Pacbio sequencers currently produce HDF5-formatted files that contain a lot of information about a sequencing run including the sequence, streams of quality values, and a lot of other stuff that most of us will never look at.  These files are quite big.  The 3 .bax.h5 files produced for a given SMRT cell typically occupy 7-9 Gb.  The recent 54X human genome data set produced by the company consists of the output from 277 SMRT cells, so if you want to download it and work with it, you will need to set aside roughly 2+Tb of disk.  Here we present a set of UNIX commands that extract just the data you need for an assembly project, and allow you to compress said data, so that it occupies roughly 1/13th the space.  For example, the extracted and compressed human data set occupies only ~150Gb!  For our own projects we routinely extract and compress data as it is produced, saving the HDF5 files to a tape archive as backup.  This saves our organization money, as keeping such data sets on disk at these scales is not a cheap proposition.

The Dextractor module source code is available on Github here.  Grabbing the code and uttering “make” should produce 5 programs: dextract, dexta, undexta, dexqv, and undexqv.  Dextract takes .bax.h5 files as input and produces a .fasta file containing the sequences of all the reads, and a .quiva file that contains all the quality value (QV) streams needed by the Quiver program to produce a consensus sequence as the final phase of assembly.  The .fasta file can then be compressed (~ 4x) with dexta to produce a .dexta file, and this can be decompressed with undexta to restore the original .fasta file.  Similarly dexqv & undexqv compress (~ 3.3x) and decompress a .quiva file into a .dexqv file and back.  The figure below shows the flow of information:


How data flows through the dextractor modules from a source .bax.h5 file to the assembler. The yellow percentages indicate the amount of space occupied by the given file type relative to the source (100%).

As an example, the Pacbio RS II produces 3 .bax.h5 files as output for a given SMRT cell (there are currently 3 blades on the machine’s server each handling 1/3rd of the 150K zero-mode wells).  When we receive such an output on our disk storage system, say files G.1.bax.h5, G.2.bax.h5, G.3.bax.h5, we immediately execute the following commands:

dextract -o G.*.bax.h5
dexta G
dexqv G
mv G.*.bax.h5 <Our_tape_backup_system>

This leaves us with files G.dexta and G.dexqv on spooled-up disk.  Whenever we are ready to begin an assembly we decompress G.dexta with undexta (for most phases only the .fasta files are needed), and then if we have an assembly we like and want a final consensus sequence from Quiver, we decompress G.dexqv back into a .quiva file with undexqv that we then feed into our customized version of Quiver (not yet available, alas).
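
That is, at assembly time the decompression steps are simply:

undexta G
undexqv G

restoring G.fasta and G.quiva from their compressed counterparts.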

A precise, detailed, and up-to-date command reference can be found here.