Seeing Dazzler Data & Results

Because the Dazzler is under development and not yet an end-to-end assembler, every Dazzler user at some point needs to get the information that tools like daligner have computed, so that they can carry on with the rest of their assembly process.  Currently, a few intrepid souls have either gotten into the code far enough to directly decode various files or have built ASCII parsers that pull the information from the outputs of commands like DBshow and LAshow that give users a printed representation of the information in a database or alignment file, respectively.  To rectify this, I have recently added some tools that make it very easy for a user to get any information they want from the Dazzler suite in an easy-to-understand and easy-to-parse format.

In a recent talk delivered at the PacBio developers conference this August, I railed about how awful formats like .fasta and .bam are.  Basically, I think that the programs that produce a dataset (I call them writers) should produce output that is as easy as possible for a consumer of the data to use (I call them readers).  For example, why should the reader not know in advance the size of the largest sequence in a data set?  The writer knows this and should give the information to the reader at the start of the encoding so that the reader can simply allocate a buffer of this maximum size, knowing that they need not check for overflow or have to reallocate and expand the buffer during input.  As another example, why put new-lines in the middle of a sequence?  I mean really, how often do you actually read a .fasta file?  And even if you did, every text manager I know of wraps overly long lines.  So at this point you get the idea, I’m sure, and I’ll stop the rant here 🙂

Thus far the Dazzler suite produces a database organized around read records (or contigs of a reference sequence) and one or more alignment files giving local alignments between reads.  The new utility DBdump gets information out of a database, including any mask tracks associated with reads, and LAdump gets information out of an alignment file.  The formal specifications can be found by following the link associated with each name; here we just give the intuition and idea behind the commands.  For example, the command “DBdump -hrs -mdust DB 10-12” will output the header (-h), read number (-r), sequence (-s), and dust track (-mdust) for reads 10 to 12 inclusive from DB.  The output might look like:

+ R 3
+ M 1
+ H 183
@ H 61
+ T0 4
@ T0 3
+ S 29186
@ S 11702
R 10
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 338 0 8952
T0 1  4370 4380
S 8952 aaaagagagatactg...ccctgcggt
R 11
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 347 0 8532
T0 1  6613 6647
S 8532 gaggggaaagatgat...ttcgacggc
R 12
H 61 m130403_204322_42204_c100505872550000001823074808081366_s1_p0
L 385 0 11702
T0 3  491 539 3363 3383 3817 3827
S 11702 agtgaaagagtgaa...atccgctgg

Each line begins with a single symbol or 1-code that designates what is to follow, e.g. R for read number, L for well and pulse range, H for header, and S for sequence.  Lines that have the 1-code + or @ are always at the beginning of the file as they give information about the total number of a given type of object in a file (+) or the maximum size of any given object over all reads (@).  For example, the first 8 lines tell you there are 3 reads, 1 mask track, 183 characters in all headers with 61 characters in the longest header, 4 mask.0 intervals altogether with at most 3 intervals for any given read, 29186 bases in all the reads with the longest being 11702bp long.  With this information you can basically allocate once all the buffers you need for reading and processing the file.  The records themselves should be easy to understand noting that (a) all items on a line are separated by a single space, (b) a string is a length/sequence pair (no newlines in the string!), and (c) track codes are T0, T1, T2, … where the number corresponds to their order on the command line.
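To make the reader's side concrete, here is a minimal sketch of a parser for this kind of output, written in Python.  It assumes only the line layout described above; the function name parse_dbdump, the dictionary keys, and the tiny sample text are made up for illustration and are not part of the Dazzler suite.

```python
# Hypothetical sample in the DBdump layout described above (not real data).
SAMPLE_DBDUMP = """\
+ R 2
+ M 1
+ H 10
@ H 5
+ T0 3
@ T0 2
+ S 14
@ S 8
R 10
H 5 readA
L 338 0 6
T0 1  2 4
S 6 acgtac
R 11
H 5 readB
L 347 0 8
T0 2  0 2 5 7
S 8 ggattaca
"""


def parse_dbdump(lines):
    """Return (totals, maxima, reads) from DBdump-style lines.

    totals holds the '+' counts, maxima the '@' counts, and reads is a
    list of dictionaries, one per R record (key names are illustrative).
    """
    totals, maxima, reads = {}, {}, []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        if not fields:
            continue
        code = fields[0]
        if code == "+":                      # total count of an object type
            totals[fields[1]] = int(fields[2])
        elif code == "@":                    # maximum size over all reads
            maxima[fields[1]] = int(fields[2])
        elif code == "R":                    # a new read record starts
            current = {"read": int(fields[1]), "tracks": {}}
            reads.append(current)
        elif code in ("H", "S"):             # string = length + text
            length = int(fields[1])
            text = line.split(None, 2)[2]
            if len(text) != length:
                raise ValueError("string length mismatch: " + line)
            current["header" if code == "H" else "seq"] = text
        elif code == "L":                    # well and pulse range
            well, beg, end = (int(x) for x in fields[1:4])
            current["well"], current["pulses"] = well, (beg, end)
        elif code[0] == "T" and code[1:].isdigit():
            n = int(fields[1])               # number of intervals that follow
            nums = [int(x) for x in fields[2:2 + 2 * n]]
            current["tracks"][code] = list(zip(nums[0::2], nums[1::2]))
    return totals, maxima, reads


totals, maxima, reads = parse_dbdump(SAMPLE_DBDUMP.splitlines())
print(totals["R"], "reads; longest sequence is", maxima["S"], "bases")
```

In a language like C, the + and @ lines would be read first and used to size every buffer once; in Python the same numbers are still handy as a sanity check on what was read.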

Similarly, one can access the information in a sorted alignment file with a call such as “LAdump [-cdt] DB.db DB.las 4-6”, which will output the overlap pairs, coordinate intervals (-c), number of differences (-d), and trace points (-t) for all the overlaps in DB.las whose A-read is 4, 5, or 6.  The set of all alignments with a given A-read is called a pile and many components of the Dazzler suite downstream of daligner analyze piles.

+ P 117
% P 44
+ T 6986
% T 2858
@ T 90
P 4 241 n
C 0 1315 2028 3327
D 252
T 14
25  84
17  97
16 103
...
1  14
P 4 274 n
C 0 1085 3335 4443
D 247
T 11
28  91
15 101
20  90
P 4 1740 c
C 0 1945 13444 15451
D 461
T 20
21  86
...
P 4 2248 n
C 244 2850 0 2383
D 590
T 27
...

The 1-codes used are: P for a read pair and orientation (c or n), C for the alignment coordinates, D for the number of differences in the alignment, and T for the start of a list of trace points.  In the example output above, the first 5 lines give information about the number of objects, where the new 1-code % designates information about the maximum size of something with respect to a pile.  In the example, there are 117 alignments altogether with 44 in the largest pile, and there are 6986 trace points for all alignments, where the largest alignment has 90, and the pile with the most trace points has 2858.  The second overlap record, for instance, is between reads 4[0..1085] and 274n[3335..4443] with 247 differences in the alignment and 11 trace points given in pairs in the 11 lines following the T line (the listing above is abbreviated).  A forthcoming post will give a precise description of what trace points are exactly.
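As with DBdump, a reader for this output takes only a few lines.  The following Python sketch groups overlap records into piles keyed by A-read.  It assumes the layout described above, with the C line read as A-begin, A-end, B-begin, B-end; the function name parse_ladump, the dictionary keys, and the sample text are invented for illustration.  Trace points are kept as raw integer pairs with no interpretation attached.

```python
# Hypothetical sample in the LAdump layout described above (not real data).
SAMPLE_LADUMP = """\
+ P 3
% P 2
+ T 5
% T 3
@ T 2
P 4 241 n
C 0 1315 2028 3327
D 252
T 2
25  84
17  97
P 4 274 n
C 0 1085 3335 4443
D 247
T 1
28  91
P 5 12 c
C 10 900 0 880
D 100
T 2
20  90
21  86
"""


def parse_ladump(lines):
    """Return (totals, pile_max, align_max, piles) from LAdump-style lines.

    piles maps each A-read number to the list of its overlap records.
    """
    totals, pile_max, align_max, piles = {}, {}, {}, {}
    current = None
    it = iter(lines)
    for line in it:
        fields = line.split()
        if not fields:
            continue
        code = fields[0]
        if code == "+":                      # total over the whole file
            totals[fields[1]] = int(fields[2])
        elif code == "%":                    # maximum over any one pile
            pile_max[fields[1]] = int(fields[2])
        elif code == "@":                    # maximum over any one alignment
            align_max[fields[1]] = int(fields[2])
        elif code == "P":                    # read pair and orientation
            current = {"a": int(fields[1]), "b": int(fields[2]),
                       "orient": fields[3]}
            piles.setdefault(current["a"], []).append(current)
        elif code == "C":                    # alignment coordinates
            ab, ae, bb, be = (int(x) for x in fields[1:5])
            current["a_range"], current["b_range"] = (ab, ae), (bb, be)
        elif code == "D":                    # number of differences
            current["diffs"] = int(fields[1])
        elif code == "T":                    # the next n lines are pairs
            n = int(fields[1])
            current["trace"] = [tuple(int(x) for x in next(it).split())
                                for _ in range(n)]
    return totals, pile_max, align_max, piles


totals, pile_max, align_max, piles = parse_ladump(SAMPLE_LADUMP.splitlines())
for a_read, overlaps in sorted(piles.items()):
    print("pile", a_read, "has", len(overlaps), "overlaps")
```

Because the file is sorted, all records of a pile arrive consecutively, so a streaming reader could equally well process one pile at a time instead of holding them all in a dictionary.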
