The Dextractor Module: Save disk space for your Pacbio projects

The Pacbio sequencers currently produce HDF5-formatted files that contain a lot of information about a sequencing run, including the sequence, streams of quality values, and a lot of other stuff that most of us will never look at. These files are quite big: the 3 .bax.h5 files produced for a given SMRT cell typically occupy 7-9 GB. The recent 54X human genome data set produced by the company consists of the output from 277 SMRT cells, so if you want to download it and work with it, you will need to set aside 2 TB or more of disk. Here we present a set of UNIX commands that extract just the data you need for an assembly project and let you compress that data so that it occupies roughly 1/13th the space. For example, the extracted and compressed human data set occupies only ~150 GB! For our own projects we routinely extract and compress data as it is produced, saving the HDF5 files to a tape archive as backup. This saves our organization money, as keeping data sets of this scale on disk is not a cheap proposition.
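The arithmetic behind these figures is easy to check. Here is a minimal sketch in shell that uses only the numbers quoted above (277 SMRT cells, 7-9 GB per cell, a roughly 13-fold reduction). The variable names are purely illustrative, and integer division truncates the results.

```shell
#!/bin/sh
# Back-of-the-envelope check of the disk figures quoted in the text.
cells=277     # SMRT cells in the 54X human data set
min_gb=7      # low end of the per-cell .bax.h5 footprint, in GB
max_gb=9      # high end of the per-cell .bax.h5 footprint, in GB
ratio=13      # extracted and compressed data is ~1/13th of the source

raw_min=$((cells * min_gb))
raw_max=$((cells * max_gb))
packed_min=$((raw_min / ratio))   # integer division, rounds down
packed_max=$((raw_max / ratio))

echo "raw HDF5:   ${raw_min}-${raw_max} GB"
echo "compressed: ${packed_min}-${packed_max} GB"
```

The raw range straddles the 2 TB mark, and the low end of the compressed range lands on the ~150 GB quoted for the human data set.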

The Dextractor module source code is available on GitHub here. Grabbing the code and uttering “make” should produce 5 programs: dextract, dexta, undexta, dexqv, and undexqv. Dextract takes .bax.h5 files as input and produces a .fasta file containing the sequences of all the reads, and a .quiva file that contains all the quality value (QV) streams needed by the Quiver program to produce a consensus sequence as the final phase of assembly. The .fasta file can then be compressed (~4x) with dexta to produce a .dexta file, and this can be decompressed with undexta to restore the original .fasta file. Similarly, dexqv and undexqv compress (~3.3x) and decompress a .quiva file into a .dexqv file and back. The figure below shows the flow of information:

Figure: How data flows through the dextractor modules from a source .bax.h5 file to the assembler. The yellow percentages indicate the relative amount of space occupied by the given file type relative to the source (100%).

As an example, the Pacbio RS II produces 3 .bax.h5 files as output for a given SMRT cell (there are currently 3 blades on the machine’s server, each handling a third of the 150K zero-mode wells). When we receive such output on our disk storage system, say files G.1.bax.h5, G.2.bax.h5, and G.3.bax.h5, we immediately execute the following commands:

dextract -o G.*.bax.h5
dexta G
dexqv G
mv G.*.bax.h5 <Our_tape_backup_system>

This leaves us with the files G.dexta and G.dexqv on disk. Whenever we are ready to begin an assembly, we decompress G.dexta with undexta (for most phases only the .fasta files are needed). Then, once we have an assembly we like and want a final consensus sequence from Quiver, we decompress G.dexqv back into a .quiva file with undexqv and feed it into our customized version of Quiver (not yet available, alas).
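The round trip described above can be sketched as a pair of shell functions. This is only an illustration: pack_cell and unpack_cell are hypothetical helper names, and the sketch assumes that undexta and undexqv accept the same root-name argument as the dexta and dexqv commands shown earlier. By default the commands are only printed, not executed, so the sketch can be tried without the tools installed. The move to tape is left out because it depends on the local backup system.

```shell
#!/bin/sh
# Sketch only: pack_cell and unpack_cell are hypothetical helper names.
# RUN defaults to "echo", so commands are printed rather than executed;
# set RUN to the empty string (RUN= ./script.sh) to run them for real.
RUN=${RUN-echo}

# Extract and compress one SMRT cell, given the root name of its .bax.h5 files.
pack_cell() {
    $RUN dextract -o "$1".*.bax.h5 &&
    $RUN dexta "$1" &&
    $RUN dexqv "$1"
}

# Restore the .fasta for one cell; pass "quiva" as the second argument
# to restore the .quiva file as well (assumed argument form, see above).
unpack_cell() {
    $RUN undexta "$1" || return 1
    if [ "$2" = "quiva" ]; then
        $RUN undexqv "$1"
    fi
}

pack_cell G            # on arrival of G.1.bax.h5, G.2.bax.h5, G.3.bax.h5
unpack_cell G          # before assembly: only the .fasta is needed
unpack_cell G quiva    # before the Quiver consensus step
```

Keeping the .quiva restoration optional mirrors the workflow in the text, where the QV streams are only needed for the final consensus phase.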

A precise, detailed, and up-to-date command reference can be found here.



