Recording Alignments with Trace Points

The daligner often produces millions or billions of alignments for large genomes and records in .las files each of these alignments.  Thus far I’ve alluded to trace points as the way alignments are recorded in a couple of posts without real stating what they are.  An alignment between two reads a and b, must at a minimum record the interval of a, say [ab,ae], interval of b, say [bb,be], and the orientation o (= n or c) of b relative to a.  In brief:

a[ab,ae] x bo[bb,be]

However, at later stages one may want the actual alignment between these two substrings.  One choice, is to simply compute the alignment whenever it is needed.  Letting n = ae-ab ~ be-bb, and assuming an error rate of say ε = 15%, this takes O(εn2) time, which can be rather expensive when n is 10Kbp or more.  Another choice is to encode the alignment (daligner has it at one point in its work flow) as say a sequence of edit positions as in the SAM format.  This means that alignment is immediately available when needed, but unfortunately the encoding takes O(εn) space for each alignment which is a nontrivial amount.  For example, a 10Kbp alignment would involve 2500-3000 edits requiring conservatively 5-6Kb to encode, and that’s per alignment !

Trace points are a concept I came up with to realize a parameter-controlled trade off between the time to compute an alignment and the space to encode it.  Let Δ be a positive number. For instance, the daligner default is 100.  For an alignment with A-interval [ab,ae], imagine a trace point every 100bp, on every 100th base of the A-read.  For example, if [ab,ae] = [57,432], then place points at 100, 200, 300, and 400.  Generalizing, there are:

 τ = ⌈ ae/Δ ⌉ – ⌊ ab/Δ ⌋ = O(n/Δ)

intervals between the end-points of the A-interval and the trace points, e.g. [57,100], [100,200], [200,300], [300,400], [400,432].  Note that only the parameter Δ and the A-interval are needed to know where the trace points are in the A-read.  What we record is the number of bases in the B-interval [bb,be] that align with each of the trace-point intervals in A.  For example, suppose the B-interval is [1002,1391] , and in the full alignment trace that daligner has just computed:

A[ 57,100] aligns to B[1002,1050] with 10 diffs
A[100,200] aligns to B[1050,1155] with 25 diffs
A[200,300] aligns to B[1155,1250] with 18 diffs
A[300,400] aligns to B[1250,1351] with 22 diffs
A[400,432] aligns to B[1351,1391] with  7 diffs

Then the daligner records the alignment is recorded as the sequence of pairs (10,48) (25,105) (18,95) (22,101) (7,40) in the .las file along with the basic alignment information above.  In general, an alignment is encoded as a sequence (d1,b1) (d2,b2) (bd,b3) … (dτ,bτ) such that [bb+Σ(i-1),bb+Σ(i)] aligns with the Δ-spaced intervals of A, where Σ(i) = b1 + b2 + … + bi and by construction Σ(τ) = be-bb.  Note that if Δ is 100 and ε = 15% then one can be certain the bi‘s and di‘s fit in a single byte.  So the encoding is very sparse at exactly 2n/Δ bytes per alignment for the daligner running with the default trace point spacing.

So now suppose you want the alignment and you have the trace-point distances bi.  Very simply, all you need to do is compute alignments between each pair of A- and B- trace point intervals, and concatenate them together!  Each interval pair alignment takes O(εΔ2) time and there are O(n/Δ) intervals, for a total of O(εΔn) time.  For any fixed value of Δ, alignment delivery is thus linear in n and not quadratic!  One should note that the number of differences in each trace point interval are not actually needed to build an alignment.  The value of having these differences tucked away in the .las file will become apparent in a future post on “Intrinsic Quality Values”, but we introduce the exact encoding here for completeness.

With Δ=100 as the default, daligner uses only 200 bytes for every 10Kbp of aligned bases (compared to 6,000 bytes for BAM), but can still deliver alignments very rapidly with time that grows linearly in the number of aligned bases.  While the higher error rate of Pacbio data inspired this design, one should note that this scheme can and I think should be used for low error rate scenarios by simply using a much bigger Δ.  SAM was designed for short, almost perfect (ε = .5%) read alignments.  It is not general and space inefficient, whereas the trace point concept scales to whatever read size and error rate one could imagine for a technology simply by adjusting Δ.  SAM is also a really bad example of a bio-format that violates the “easy-to-read-by-a-program” philosophy I explained in the post “Seeing Dazzler Data & Results“.  The utility LAdump described in that post can be used to get the trace point sequence for any alignment from a daligner-produced .las file.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s