The Dextractor Module: Time and Space

Measurements of the compute time and space usage of the tools released in the recent post “The Dextractor Module: Save disk space for your Pacbio projects” are presented for  a few SMRT cells to give the curious reader or user an idea of their performance.  All tests were performed on a MacBook Pro with 16Gb of memory and a 2.7GHz Intel Core i7 processor running OS X 10.7.5.

We selected four SMRT cells of data from a recent sequencing project that vary in their size and the amount of good data they contain.  Table 1 gives the sizes of the four sets of .bax.h5 files for each SMRT cell in the first data column (recall that the data from a SMRT cell is currently delivered in 3 .bax.h5 files, here we deal with their total combined size).  First note that the amount of good data coming from a file varies, e.g. SMRT 3 yielded more data in reads over 500bp and better than 75% than SMRT 2 despite being smaller.

Screen Shot 2014-03-22 at 9.11.40 PM

Table 1: Original HDF5 file size, time to extract both .fasta and .quiva files, and the sizes of these in absolute terms and as a percentage of the original file.

The extraction time is roughly proportional to the size of the input file, but note that the system time (noted in parentheses in the second data column) dropped precipitously for the small input of SMRT 4.  We would guess that this is because the individual .bax.h5 files are smaller than 2Gb each for said SMRT, and greater than that for the others, and that there is some crossover in system file performance at this threshold.  On average the .fasta file is about 4.5% of the original file and the .quiva file is about 22% for a total of 26.5% (3.8X less space).  The relative sizes of the files is also shown graphically in Chart 1.

Screen Shot 2014-03-23 at 8.50.56 AM

Chart 1: Bar chart of the sizes of the .bax.h5, .quiva, and .fasta files for each of the 4 SMRT cells

Compression and decompression times between .fasta files and their losslessly compressed .dexta counterparts was very fast at roughly 1 or 2 seconds in each direction, with decompression times being about 1.32 times slower than compression times.  Each base is packed into 2-bits giving a very consistent compression factor of 4 (25% of the uncompressed size).

Screen Shot 2014-03-23 at 8.47.04 AM

Table 2: Compression sizes and compression/decompression times for both .fasta and .quiva files.

The compression of .quiva files involves computing Huffman and run-length statistics on a file-by-file basis and so takes considerably longer with again compression averaging 1.07 times faster than decompression.  The compression ratio can vary a bit more but was quite consistent at 29% plus or minus 1% (3.45X).  The Pacbio engineers have released different versions of the base caller and it may well be that for data sets produced with older versions of this software the compression ratio is different (it was not tested here). The relative sizes of the compressed files is shown graphically in Chart 2.

Screen Shot 2014-03-23 at 8.51.09 AM

Chart 2: Bar chart of the sizes of the .quiva and .fasta files and their compressed counterparts, .dexqv and .dexta, for each of the 4 SMRT cells

In summary, extracting just the necessary data from the HDF5 files provided by the manufacturer saves one a factor of almost 4 in storage.  Compressing this data then provides another factor of 3.5 or so, and decompressing the data is quite fast, indeed, very fast for the primary .fasta data.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s