Moving to the Sequel

Featured

In the last year, Pacbio has, in conjunction with the introduction of the Sequel instrument, made two changes to their software environment that affect downstream applications including the Dazzler suite.  Namely, they have (1) moved from the HDF5 format to the BAM format for the primary output of the instrument, and (2) there is a new alternative to the Quiver consensus base-caller, called Arrow that is equally if not more accurate and requires less information about the raw signal to do so.  Specifically, Quiver requires five quality streams each the length of the associated read, whereas Arrow needs more simply just the read’s pulse width stream clipped to the smallest four values and the four signal-to-noise (snr) ratios for each nucleotide channel.  All .bax.h5 files currently and previously produced for the RS II contain the the information needed by both Quiver and Arrow.  However, the new .subreads.bam files contain just the information needed by Arrow if they were produced by a Sequel instrument, but they can contain just the information needed by Quiver if it was produced by converting an RS II HDF5 file to a BAM file with the PacBio program bax2bam.

To accommodate these transitions, the Dazzler suite now interprets the .subreads.bam files produced by the instrument and can extract Quiver or Arrow information from either type of source file, and Dazzler DB’s have been upgraded to hold either the information needed by Quiver, or the information needed by Arrow (but not both).  Most of the new programs and modifications have occurred in the DEXTRACTOR module, although the upgrades also affected the DAZZ_DB module.  Below we outline the changes and new capabilities.

The original extraction program dextract can now also take a BAM file as input and from it produce a .fasta, a .quiva, an .arrow file, or some combination of these according to whether or not the command line options -f, -q, and -a, respectively, are set.  The .fasta and .quiva file types are unchanged, and the new .arrow file type contains the information needed by Arrow.  The dextract program has further been upgraded so that given an HDF5 file it too can produce any combination of the 3 files.  An example, of a sub-read entry in each format is immediately below and we explain the contents of each in the following paragraph.

screen-shot-2016-11-06-at-10-57-45-amA .fasta file is in FASTA format, and an entry consists of a standard “Pacbio header” followed by the DNA sequence on any number of ensuing lines.  The standard header contains the movie name, well number (44), pulse range (0-6289), and sequence quality value (.812) to 3 digits, always with the same syntax.  A .quiva file uses a FASTQ-like format, and an entry consists of the same Pacbio header, followed by each of the 5 quality streams, one per line, with no line breaks within a stream.  Finally, the .arrow file is in FASTA format, and an entry consists of a header with the movie name and the 4 snr values, followed by the pulse width stream on any number of ensuing lines.  The header in this case contains just the movie name and the four snr values (6.69, 5.89, 3.72, 4.04) in  a fixed syntax where the snr values are always to 2 digits of precision and always listed in the order of the A, C, G, and T channels, respectively.  The pulse width stream is always a sequence of the symbols, 1, 2, 3, and 4.

One might further recall that the DEXTRACTOR module also contains compression and decompression program pairs — dexta/undexta and dexqv/undexqv — that operate on .fasta and .quiva files, respectively.  There is now also a compressor/decompressor pair, dexar/undexar, that compresses an .arrow file into a .dexar file and decompresses it back.  The picture immediately below shows the existing programs in green and the new ones in orange:screen-shot-2016-11-10-at-1-19-49-pmThe numbers in green in the picture above give the approximate sizes of the various files in bytes, where N is the number of base pairs of data in the file.  The .bam file occupies roughly 2N bytes if it contains only Arrow information and 3.3N if it contains only Quiver information.  The larger bax.h5 file always contains the information for both programs within it.  The main thing to note is that the compressed files represent a significant space savings, but by not as big a margin over the .bam source files which are themselves gzip compressed.

Previously, a Dazzler DB held the DNA sequence and header information for each subread, and optionally could store and organize all the Quiver information.  I have upgraded the system so that now a DB can also optionally hold all the Arrow information for a project.  But note carefully, that a Dazzler DB cannot hold both.  So we term a DB that holds Quiver information a Quiver-DB (Q-DB), one that holds Arrow information an Arrow-DB (A-DB), and one that holds neither a Sequence-DB (S-DB).  One might think this forces a user to choose between Quiver and Arrow for consensus at the beginning of an assembly project, but this is not the case as one can at any time add or delete the Quiver or Arrow information independently of the sequence information.

Specifically, recall that one creates an S-DB, possibly incrementally, by adding .fasta files with fast2DB.   Anytime there after one could then add .quiva files with quiva2DB, in the process converting the S-DB into a Q-DB.  In exact analogy, one can now create an A-DB by incrementally adding .arrow files to an S-DB with the new program arrow2DB (found in the DAZZ_DB module).  As before, any files that were added to a DB can be recreated with the programs DB2fasta, DB2quiva, and a new DB2arrow program.  Since an A- or Q-DB contains everything you need for assembly (in compressed form) including producing the final consensus, you can choose to archive the source data files and only keep the DB as long as you are certain about which consensus program you will use. The figure below shows the additional programs:screen-shot-2016-11-10-at-1-20-16-pmNote that a Sequence DB consists of the files G.db, .G.idx, .G.bpsA Quiver DB additionally has the file .G.qvs, and an Arrow DB has a .G.arw file. In the hopefully rare use-cases where one wants to decide which consensus program to use late in the assembly process, we recommend that you create an S-DB and keep both the Arrow and Quiver information as compressed .dexqv and .dexar files.  One can also convert an A- or Q-DB back into an S-DB by calling a new program DBwipe that removes all the auxilliary information for Quiver or Arrow from a DB.

While the above programs allow you to extract information from both forms of PacBio data files and produce a DB from the extracted data, I further observed that for most applications, the intermediate production of .fasta, .quiva, and/or .arrow files or their compressed forms is unnecessary, and one saves both time and disk space with a program that both extracts the desired information and adds it to a DB in a single step as shown in the figure below:screen-shot-2016-11-10-at-1-20-42-pmThe program dex2DB has been developed and is now part of the DEXTRACTOR module.  The program can create either an S-, A-, or Q-DB directly as long as the necessary information is present in the source files.  I was tempted to remove all the programs that are faded out in the figure immediately above, but have decided to keep them to (1) help developers that want to work with Pacbio data but not with the Dazzler system, and (2) for those users that can’t decide yet if they want to use Quiver or Arrow and so need to keep both types of information in .dexqv and .dexar files external to the DB.

The Dextractor Module: Save disk space for your Pacbio projects

The Pacbio sequencers currently produce HDF5-formatted files that contain a lot of information about a sequencing run including the sequence, streams of quality values, and a lot of other stuff that most of us will never look at.  These files are quite big.  The 3 .bax.h5 files produced for a given SMRT cell typically occupy 7-9 Gb.  The recent  54X human genome data set produced by the company consists of the output from 277 SMRT cells, so if you want to download it and work with it, you will need to set aside roughly 2+Tb of disk.  Here we present a set of UNIX commands that extract just the data you need for an assembly project, and allow you to compress said data, so that it occupies roughly 1/13th the space.  For example, the extracted and compressed human data set occupies only ~150Gb !  For our own projects we routinely extract and compress data as it is produced, saving the HDF5 files to a tape archive as backup.  This saves our organization money as keeping such data sets on disk at these scales is not a cheap proposition.

The Dextractor module source code is available on Github here.  Grabbing the code and uttering “make” should produce 5 programs: dextract, dexta, undexta, dexqv, and undexqvDextract takes .bax.h5 files as input and produces a .fasta file containing the sequences of all the reads, and a .quiva file that contains all the quality value (QV) streams needed by the Quiver program to produce a consensus sequence as the final phase of assembly.  The .fasta file can then be compressed (~ 4x) with dexta to produce a .dexta file, and this can be decompressed with undexta to restore the original .fasta file.  Similarly dexqv & undexqv compress (~ 3.3x) and decompress a .quiva file into a .dexqv file and back.  The figure below shows the flow of information:

Screen Shot 2014-03-04 at 11.35.53 AM

How data flows through the dextractor modules from a source .bax.h5 file to the assembler. The yellow percentages indicate the relative amount of space occupied by the given file type relative to the source (100%).

As an example, the Pacbio RS II produces 3 .bax.h5 files as output for a given SMRT cell (there are currently 3 blades on the machine’s server each handling 1/3rd of the 150K zero-mode wells).  When we receive such an output on our disk storage system, say files G.1.bax.h5, G.2.bax.h5, G.3.bax.h5, we immediately execute the following commands:

dextract -o G.*.bax.h5
dexta G
dexqv G
mv G.*.bax.h5 <Our_tape_backup_system>

This leaves us with files G.dexta and G.dexqv on spooled up disk.  Whenever we are ready to begin an assembly we decompress G.dexta with undexta (for most phases only the .fasta files are needed) and then if we have an assembly we like and we want a final consensus sequence from Quiver, we decompress G.dexqv back into a .quiva file  with undexqv that we then feed into our customized version of Quiver (not yet available, alas)

A precise, detailed, and up-to-date command reference can be found here.