Recording Alignments with Trace Points

For a large genome the daligner often produces millions or even billions of alignments, and it records each of these in a .las file.  In a couple of previous posts I have alluded to trace points as the way alignments are recorded, without really stating what they are.  An alignment between two reads a and b must at a minimum record the interval of a, say [ab,ae], the interval of b, say [bb,be], and the orientation o (= n or c) of b relative to a.  In brief:

a[ab,ae] x b^o[bb,be]

However, at later stages one may want the actual alignment between these two substrings.  One choice is to simply compute the alignment whenever it is needed.  Letting n = ae-ab ~ be-bb, and assuming an error rate of say ε = 15%, this takes O(εn²) time, which can be rather expensive when n is 10Kbp or more.  Another choice is to encode the alignment (daligner has it in hand at one point in its work flow) as, say, a sequence of edit positions as in the SAM format.  This means the alignment is immediately available when needed, but unfortunately the encoding takes O(εn) space per alignment, which is a nontrivial amount.  For example, a 10Kbp alignment would involve 2,500-3,000 edits requiring conservatively 5-6Kb to encode, and that's per alignment!

Trace points are a concept I came up with to realize a parameter-controlled trade-off between the time to compute an alignment and the space to encode it.  Let Δ be a positive number; the daligner default is 100.  For an alignment with A-interval [ab,ae], imagine a trace point on every Δth base of the A-read, i.e. every 100th base with the default.  For example, if [ab,ae] = [57,432], then trace points are placed at 100, 200, 300, and 400.  Generalizing, there are:

 τ = ⌈ ae/Δ ⌉ – ⌊ ab/Δ ⌋ = O(n/Δ)

intervals between the end-points of the A-interval and the trace points, e.g. [57,100], [100,200], [200,300], [300,400], [400,432].  Note that only the parameter Δ and the A-interval are needed to know where the trace points are in the A-read.  What we record is the number of bases in the B-interval [bb,be] that align with each of the trace-point intervals in A.  For example, suppose the B-interval is [1002,1391], and in the full alignment trace that daligner has just computed:

A[ 57,100] aligns to B[1002,1050] with 10 diffs
A[100,200] aligns to B[1050,1155] with 25 diffs
A[200,300] aligns to B[1155,1250] with 18 diffs
A[300,400] aligns to B[1250,1351] with 22 diffs
A[400,432] aligns to B[1351,1391] with  7 diffs

Then daligner records the alignment as the sequence of pairs (10,48) (25,105) (18,95) (22,101) (7,40) in the .las file, along with the basic alignment information above.  In general, an alignment is encoded as a sequence (d1,b1) (d2,b2) (d3,b3) … (dτ,bτ) such that B[bb+Σ(i-1),bb+Σ(i)] aligns with the i’th Δ-spaced interval of A, where Σ(i) = b1 + b2 + … + bi, Σ(0) = 0, and by construction Σ(τ) = be-bb.  Note that if Δ is 100 and ε = 15% then one can be certain the bi’s and di’s fit in a single byte.  So the encoding is very sparse at roughly 2n/Δ bytes per alignment for daligner running with the default trace point spacing.
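
To make the encoding concrete, here is a small sketch (in Python, purely illustrative — not the actual daligner C code, and the function and variable names are hypothetical, not the .las record layout) that takes the A- and B-intervals together with the (d,b) pairs above and recovers the aligned trace-point interval pairs:

    # Illustrative sketch only; names here are hypothetical, not the .las format.
    def trace_intervals(ab, ae, bb, be, pairs, delta=100):
        # Trace points fall on every multiple of delta inside [ab,ae].
        a_cuts = [ab] + list(range((ab // delta + 1) * delta, ae, delta)) + [ae]
        assert len(a_cuts) - 1 == len(pairs)   # tau = ceil(ae/delta) - floor(ab/delta)
        result, b_lo = [], bb
        for (a_lo, a_hi), (diffs, blen) in zip(zip(a_cuts, a_cuts[1:]), pairs):
            b_hi = b_lo + blen                 # Sigma(i) = b1 + b2 + ... + bi
            result.append(((a_lo, a_hi), (b_lo, b_hi), diffs))
            b_lo = b_hi
        assert b_lo == be                      # by construction Sigma(tau) = be - bb
        return result

    # The worked example from the text:
    pairs = [(10, 48), (25, 105), (18, 95), (22, 101), (7, 40)]
    for (al, ah), (bl, bh), d in trace_intervals(57, 432, 1002, 1391, pairs):
        print("A[%d,%d] aligns to B[%d,%d] with %d diffs" % (al, ah, bl, bh, d))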

So now suppose you want the alignment and you have the trace-point distances bi.  Very simply, all you need to do is compute alignments between each pair of corresponding A- and B-trace-point intervals, and concatenate them together!  Each interval pair alignment takes O(εΔ²) time and there are O(n/Δ) intervals, for a total of O(εΔn) time.  For any fixed value of Δ, alignment delivery is thus linear in n and not quadratic!  One should note that the number of differences in each trace point interval is not actually needed to build an alignment.  The value of having these differences tucked away in the .las file will become apparent in a future post on “Intrinsic Quality Values”, but we introduce the exact encoding here for completeness.
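
A minimal sketch of that delivery step, reusing the trace_intervals() helper above and using a plain O(Δ²) edit-distance DP per segment (daligner’s actual routine is a banded O(εΔ²) alignment; this is only to show the concatenation idea):

    # Illustrative sketch, not daligner's delivery code.
    def align_segment(a_seg, b_seg):
        # Plain edit-distance DP with traceback; returns a string of M/D/I ops,
        # where M is a match or substitution column, D consumes an A base only,
        # and I consumes a B base only.
        n, m = len(a_seg), len(b_seg)
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = i
        for j in range(1, m + 1):
            dp[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = min(dp[i-1][j-1] + (a_seg[i-1] != b_seg[j-1]),
                               dp[i-1][j] + 1, dp[i][j-1] + 1)
        ops, i, j = [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (a_seg[i-1] != b_seg[j-1]):
                ops.append('M'); i -= 1; j -= 1
            elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
                ops.append('D'); i -= 1
            else:
                ops.append('I'); j -= 1
        return ''.join(reversed(ops))

    def full_alignment(a_read, b_read, intervals):
        # Align each trace-point interval pair independently and concatenate.
        return ''.join(align_segment(a_read[al:ah], b_read[bl:bh])
                       for (al, ah), (bl, bh), _ in intervals)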

With Δ=100 as the default, daligner uses only 200 bytes for every 10Kbp of aligned bases (compared to 6,000 bytes for BAM), but can still deliver alignments very rapidly in time that grows linearly with the number of aligned bases.  While the higher error rate of Pacbio data inspired this design, one should note that this scheme can, and I think should, be used for low error rate scenarios by simply using a much bigger Δ.  SAM was designed for short, almost perfect (ε = .5%) read alignments.  It is neither general nor space efficient, whereas the trace point concept scales to whatever read size and error rate one could imagine for a technology simply by adjusting Δ.  SAM is also a prime example of a bio-format that violates the “easy-to-read-by-a-program” philosophy I explained in the post “Seeing Dazzler Data & Results”.  The utility LAdump described in that post can be used to get the trace point sequence for any alignment from a daligner-produced .las file.
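
As a back-of-the-envelope illustration of how the encoding size scales with Δ (the byte widths assumed here are for the sake of the arithmetic only, not the .las specification):

    # Rough space estimate for encoding an n-bp alignment with trace spacing delta,
    # assuming 1 byte per recorded value when (1+eps)*delta fits in a byte and
    # 2 bytes otherwise -- an illustrative assumption, not the .las specification.
    def encoding_bytes(n, delta, eps=0.15):
        tau = -(-n // delta)                 # number of trace-point intervals
        width = 1 if (1 + eps) * delta <= 255 else 2
        return 2 * tau * width               # one (d,b) pair per interval

    print(encoding_bytes(10000, 100))              # ~200 bytes at the default spacing
    print(encoding_bytes(10000, 1000, eps=0.005))  # ~40 bytes with a much larger delta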

The Dextractor Module: Time and Space

Measurements of the compute time and space usage of the tools released in the recent post “The Dextractor Module: Save disk space for your Pacbio projects” are presented for a few SMRT cells to give the curious reader or user an idea of their performance.  All tests were performed on a MacBook Pro with 16GB of memory and a 2.7GHz Intel Core i7 processor running OS X 10.7.5.

We selected four SMRT cells of data from a recent sequencing project that vary in their size and in the amount of good data they contain.  Table 1 gives the size of the set of .bax.h5 files for each SMRT cell in the first data column (recall that the data from a SMRT cell is currently delivered in 3 .bax.h5 files; here we deal with their total combined size).  First note that the amount of good data coming from a file varies, e.g. SMRT 3 yielded more data in reads over 500bp with accuracy better than 75% than SMRT 2 did, despite being smaller.

Table 1: Original HDF5 file size, time to extract both .fasta and .quiva files, and the sizes of these in absolute terms and as a percentage of the original file.

The extraction time is roughly proportional to the size of the input file, but note that the system time (noted in parentheses in the second data column) dropped precipitously for the small input of SMRT 4.  We would guess that this is because the individual .bax.h5 files are smaller than 2GB each for that SMRT cell, and greater than that for the others, and that there is some crossover in file system performance at this threshold.  On average the .fasta file is about 4.5% of the original file and the .quiva file is about 22%, for a total of 26.5% (3.8X less space).  The relative sizes of the files are also shown graphically in Chart 1.

Chart 1: Bar chart of the sizes of the .bax.h5, .quiva, and .fasta files for each of the 4 SMRT cells

Compression and decompression between .fasta files and their losslessly compressed .dexta counterparts were both very fast, at roughly 1 or 2 seconds in each direction, with decompression taking about 1.32 times as long as compression.  Each base is packed into 2 bits, giving a very consistent compression factor of 4 (25% of the uncompressed size).
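
The factor of 4 follows directly from the 2-bit packing; here is a minimal sketch of the idea (illustrative only, not dexta’s actual on-disk format):

    # Illustrative 2-bit packing, not the actual .dexta encoding.
    CODE = {'a': 0, 'c': 1, 'g': 2, 't': 3}
    BASE = 'acgt'

    def pack(seq):
        # Pack 4 bases per byte; the caller must keep len(seq) to unpack the tail.
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for j, ch in enumerate(seq[i:i+4]):
                byte |= CODE[ch] << (2 * j)
            out.append(byte)
        return bytes(out)

    def unpack(data, length):
        return ''.join(BASE[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))

    s = 'acgtacgtgattaca'
    assert unpack(pack(s), len(s)) == s
    print(len(s), 'bases ->', len(pack(s)), 'bytes')   # ~4x smaller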

Table 2: Compression sizes and compression/decompression times for both .fasta and .quiva files.

The compression of .quiva files involves computing Huffman and run-length statistics on a file-by-file basis and so takes considerably longer, with compression again faster than decompression, by a factor of about 1.07 on average.  The compression ratio can vary a bit more but was quite consistent at 29% plus or minus 1% (3.45X).  The Pacbio engineers have released different versions of the base caller, and it may well be that for data sets produced with older versions of this software the compression ratio is different (this was not tested here).  The relative sizes of the compressed files are shown graphically in Chart 2.

Chart 2: Bar chart of the sizes of the .quiva and .fasta files and their compressed counterparts, .dexqv and .dexta, for each of the 4 SMRT cells

In summary, extracting just the necessary data from the HDF5 files provided by the manufacturer saves one a factor of almost 4 in storage.  Compressing this data then provides another factor of 3.5 or so, and decompressing the data is quite fast, indeed, very fast for the primary .fasta data.