Seeing Your Reads: DaViewer

The old adage that “A picture is worth a 1000 words” might well be restated today as “A visualization of your data is worth a 1000 printouts”.  While there are many programs in the Dazzler system that produce ASCII text displays of your reads and their overlaps with each other, I have found that it is hugely more powerful to look at pile-o-grams and visualize the quality of the matches, the intrinsic QVs of a read, and any mask intervals produced by programs such as the repeat maskers or the forthcoming scrubber.  To this end I’m releasing today DaViewer, a Qt-implemented user interface for seeing Dazzler piles and associated information.  Unlike other modules, this one does have a dependency on the Qt library, which you can download and install for free here.  I developed it under Qt 5.4 and have recently tested it with Qt 5.6 (the latest version) under Mac OS X.  The library and my C++ code should compile and run under any OS, but this has not been tested, so I’d appreciate reports of porting problems and, of course, any patches.  In this post I give a brief overview of what DaViewer can do, and refer you to the manual page for detailed documentation.

In brief, DaViewer allows you to view any subset of the information in a given .las file of local alignments computed by the daligner or the forthcoming damapper, and any track information associated with the read database(s) that were compared to produce the local alignments (LAs).  As an example here is a screen showing several overlap piles (click on the image to see a bigger view of it):

Screen Shot 2016-05-29 at 1.42.44 PM

The viewer has a concept of a palette that controls the color of all the elements on a screen, and you can set this up according to how you like to see the data.  I prefer a black background with primary colors for the foreground, but in the example below I set the palette so that the background is white and adjusted the foreground colors accordingly:

Screen Shot 2016-05-29 at 1.44.13 PM

With the query panel you can zoom to any given read or range of reads, and when you zoom in far enough for the display to be meaningful you see (a) a heat map displaying the quality of all trace point interval matches, (b) a “tri-state” color map of the quality values of read 19 for each trace point (as computed by DASqv), and (c) a grid-spacer (if the appropriate option is turned on).  Moreover, you can ask the viewer to chain together local alignments that are consistently spaced with a dashed line connecting them and a color-coded ramp indicating the compression or expansion of the spacing.  For example, below I queried read “19”:

Screen Shot 2016-05-29 at 2.27.25 PM

You can see the grid-lines spaced every 1000bp, compressed dash lines across what must be a low quality gap in the read between bases 2000 and 3000, and numerous expanded dashes across what are most likely low quality gaps in matching B-reads.  The settings in the palette dialog producing this view were as follows:

Screen Shot 2016-05-29 at 2.28.06 PM
Screen Shot 2016-05-29 at 2.28.57 PM

In the panel at right the “Tri-State” option is chosen for “Show qual qv’s” and the colors are set so that trace point intervals with quality values not greater than 23 (i.e. “Good”) are colored green, intervals with values not less than 30 (i.e. “Bad”) are colored red, and otherwise the interval is colored yellow.  Further note that you can also ask to see the read’s QVs projected onto the B-reads of the pile, producing a view like this:

Screen Shot 2016-06-01 at 7.39.53 AM

A variety of programs such as DBdust, REPmask, TANmask, and the forthcoming DAStrim all produce interval tracks associated with the underlying DB(s).  When a new .las file is opened with DaViewer, the program automatically searches the specified DBs for any interval tracks associated with them and loads them for (possible) display.  The found tracks are listed in the “Masks” tab of the palette dialog as illustrated immediately below.

Screen Shot 2016-06-01 at 7.49.11 AM
Screen Shot 2016-06-01 at 7.48.17 AM

In the palette tab at left each available interval track or mask is listed.  One can choose whether a track is displayed, on which register line, its color, and whether or not you want to see the mask on the B-reads as well.  The tracks are displayed in the order given and the order can be adjusted with drag-and-drop initiated by depressing the up/down arrow at the far left.
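
As an aside, the “Tri-State” quality coloring described earlier is simple enough to state as code.  What follows is only a minimal Python sketch of the rule, not DaViewer's own C++ code: the function name `tri_state_color` is made up for illustration, and the default thresholds of 23 and 30 are simply the ones from the palette settings discussed above.

```python
def tri_state_color(qv, good_max=23, bad_min=30):
    """Map a trace point interval's quality value to a display color.

    Lower values are better here: values not greater than good_max are
    "Good", values not less than bad_min are "Bad", and anything in
    between gets the intermediate color.
    """
    if qv <= good_max:
        return "green"
    if qv >= bad_min:
        return "red"
    return "yellow"


# Show the color chosen for a few sample quality values (hypothetical data).
for qv in (10, 23, 24, 29, 30, 45):
    print(qv, tri_state_color(qv))
```

Both thresholds are inclusive, so with these settings only values 24 through 29 come out yellow.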

Finally, one can record all the settings of a particular palette arrangement in what is called a view.  At the bottom of the palette dialog is a combo-box that allows one to select any saved view, and the creation of views, their updating, and removal are controlled by the Add, Update, and Delete buttons, respectively.  In the palette example above, a view called “trim” has been created which, as seen in the “Quality” tab, has a different heat map for trace interval segments.  The screen capture below shows an example of this view and further illustrates the display of tracks.

Screen Shot 2016-05-30 at 5.07.03 PM

One should also know that clicking on items in the display area produces popups with information about the object, clicking below the coordinate axis at bottom zooms in or out (shift-click) by a factor of sqrt(2), and grabbing a background area allows you to scroll by hand (as well as with the scroll widgets).

Finally, one can have multiple, independently scrollable and zoomable views of the same data set by clicking the “Duplicate” button (+-sign in the tool bar at upper-left), and one can further arrange them into a tiling of your screen by clicking the “Tile” button (again in the tool bar).  As an example, the (entire) screen capture below was created by hitting “Duplicate” 3 times, then “Tile”, and then zooming and scrolling the four individual views.

Screen Shot 2016-06-01 at 10.56.21 AM

In closing, DaViewer is a full-featured and flexible visualization tool for the specific task of seeing your Pacbio reads, their quality, and their LAs with other reads.  It has so many features that a man page will take a while for me to produce.  In the meantime, I hope that you can just open up a data set and figure it out by trial-and-error button pushing and clicking.  Unlike other Dazzler modules it is, after all, a GUI.  Have you ever read the Microsoft Word manual?

DB’s and DAM’s: What’s the difference?

The purpose of this post is to succinctly clarify the differences between Dazzler data bases or DB’s, and Dazzler maps or DAM’s and to further clarify a number of the most common misunderstandings users have about them.  We start with a brief review:

A Dazzler data base or DB is the organizing data framework for the system (original post here).  It holds all information about Pacbio reads at the start of and during their processing into an assembly by Dazzler modules.  Most importantly, the initial sequencing read data must be imported into a DB with fasta2DB and quiva2DB before any Dazzler programs such as the daligner can be applied, and the input must be in Pacbio’s format.  A key feature is that you can always get your data back out of a DB exactly as you imported it with DB2fasta and DB2quiva, meaning that no data external to the system need be kept, saving you (a lot of) disk space.  Since many users use only selected components of the system, I also made it as easy as possible to get all relevant information out of the system with programs like DBdump and LAdump.

Somewhat later on (original post here), I introduced a Dazzler map or DAM, which is very much like a DB but designed to hold the scaffolds of a reference genome such as hg38 or Rel6 of the fly genome.  This variation was introduced so that the daligner could be used to compare reads to a reference genome.  But it also allowed other interesting possibilities such as comparing a genome against itself to detect the repetitive elements within it.  The NCBI standard .fasta file format is imported into a DAM with fasta2DAM, where the .fasta format headers can be any string, and each entry is interpreted as a scaffold of contigs where runs of N’s are interpreted as scaffold gaps whose mean length is equal to the length of the run.  The DAM stores each contig sequence as an entry and remembers how they are organized into scaffolds.  In symmetry with DB import/export, DAM2fasta recreates the original input, implying the DAM need be the only record on your computer of the genome reference.
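
To make the scaffold-to-contig idea concrete, here is a small Python sketch of splitting a scaffold at runs of N's and later restoring it.  It is only an illustration of the principle, not the DAM's actual on-disk representation: the names `split_scaffold` and `join_contigs` are hypothetical, and the sketch assumes gaps are written with upper-case N only.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into contigs at runs of N's.

    Returns (contigs, length), where contigs is a list of
    (start, sequence) pairs, one per maximal run of non-N bases,
    and length is the total scaffold length including gaps.
    """
    contigs = [(m.start(), m.group()) for m in re.finditer(r"[^N]+", scaffold)]
    return contigs, len(scaffold)


def join_contigs(contigs, length):
    """Rebuild the scaffold by filling the space between contigs with N's."""
    parts, pos = [], 0
    for start, seq in contigs:
        parts.append("N" * (start - pos))  # gap before this contig
        parts.append(seq)
        pos = start + len(seq)
    parts.append("N" * (length - pos))     # trailing gap, if any
    return "".join(parts)


# A toy scaffold with a gap of 5 and a gap of 2 (made-up sequence).
scaffold = "ACGTNNNNNGGCATTNNACG"
contigs, length = split_scaffold(scaffold)
print(contigs)
print(join_contigs(contigs, length) == scaffold)
```

Keeping each contig's start position alongside the total length is enough to recover every gap length, which is what makes the round trip back to the original scaffold possible.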

So both data objects contain a collection of DNA sequences and at this level they appear identical.  But there are differences and we enumerate them below:

  • DB: .fasta headers must be in Pacbio format and are parsed to obtain well, pulse range, and quality score.
  • DAM: .fasta headers are not interpreted.
  • DB: may contain QV quality streams needed by Quiver for consensus.
  • DAM: cannot contain QV quality streams.
  • DAM: interprets N’s in input sequences as scaffold gaps and stores said as a collection of contig sequences, retaining the information needed to restore the scaffold.
  • DB: does not allow N’s in the input sequence.
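
To illustrate the first difference, here is a rough Python sketch of pulling the well, pulse range, and quality score out of a Pacbio-style header.  It assumes the common header layout `>movie/well/begin_end RQ=score`; the regular expression, the function name `parse_pacbio_header`, and the example header are illustrative assumptions, not fasta2DB's actual parsing code.

```python
import re

# Assumed layout: >movie_name/well/begin_end RQ=0.xxx  (RQ field optional here)
HEADER = re.compile(
    r"^>(?P<movie>[^/\s]+)/(?P<well>\d+)/(?P<beg>\d+)_(?P<end>\d+)"
    r"(?:\s+RQ=(?P<rq>[0-9.]+))?"
)


def parse_pacbio_header(line):
    """Return the movie name, well, pulse range, and quality of a header line."""
    m = HEADER.match(line)
    if m is None:
        raise ValueError("not a Pacbio-style header: %r" % line)
    rq = m.group("rq")
    return {
        "movie": m.group("movie"),
        "well": int(m.group("well")),
        "pulse_range": (int(m.group("beg")), int(m.group("end"))),
        "quality": float(rq) if rq is not None else None,
    }


# Hypothetical header used purely as an example.
example = ">m140913_050931_42139_c100713652400000001823152404301535_s1_p0/9/0_6754 RQ=0.845"
print(parse_pacbio_header(example))
```

A header that does not fit this pattern raises an error in the sketch, which mirrors why a DB import is stricter than a DAM import, where headers are left uninterpreted.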

Until recently the system expected that each imported Pacbio file contained the data from a single SMRT cell and hence had exactly the same header sequence save for the well number, pulse range, and quality score.  Quite a number of users, however, placed the data from a number of SMRT cells all in the same .fasta file, and then were disappointed when fasta2DB didn’t allow them to import it.  But they discovered that fasta2DAM would accept such input just fine, as it makes no assumptions about the .fasta headers.  Since a DAM can be used in almost every place a DB can, this might seem like a good solution, but it’s not, because one can no longer trim the data (see DBsplit) so that only the best read in each well is used in downstream processing.  As of this post, fasta2DB has been upgraded so that a user can pass it a file with multiple SMRT cell data sets in it, as long as the entries for each SMRT cell are consecutive in the file.  So please:
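
If you are unsure whether a combined file satisfies the "consecutive" requirement, a quick check along the following lines may help.  This is a hedged Python sketch only: it assumes the part of each header before the first "/" identifies the SMRT cell, and the function name `cells_are_consecutive` is hypothetical.  It says nothing about how fasta2DB itself performs the check.

```python
def cells_are_consecutive(headers):
    """Return True if all entries of each SMRT cell form one contiguous run.

    Each header is assumed to look like ">cell_name/well/begin_end ...",
    so the text before the first "/" is taken as the cell identifier.
    """
    seen = set()
    previous = None
    for header in headers:
        cell = header.lstrip(">").split("/", 1)[0]
        if cell != previous:
            if cell in seen:
                # This cell already appeared earlier and was interrupted.
                return False
            seen.add(cell)
            previous = cell
    return True


# Made-up headers: the first list is grouped by cell, the second is interleaved.
grouped = [">cellA/1/0_100", ">cellA/2/0_50", ">cellB/1/0_70"]
interleaved = [">cellA/1/0_100", ">cellB/1/0_70", ">cellA/2/0_50"]
print(cells_are_consecutive(grouped))
print(cells_are_consecutive(interleaved))
```

In practice the headers would be the lines of the .fasta file that begin with ">".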

Use a DB for your Pacbio reads, and use a DAM for all secondary sequence data such as reference genomes.

A final point that some may have missed is that the import routines are incremental.  That is, you can import data into a DB or DAM over any number of distinct calls, where the data in each new file imported is appended to the growing DB or DAM.  While it’s not so important for a DAM, this feature is very valuable for a Pacbio shotgun sequencing project where many SMRT cells are being produced over a period of days or weeks.  You can add each new SMRT cell to your DB as it is output from the instrument.  Moreover, this is performed efficiently: in time proportional to the size of the new data (and not the size of the DB).  In addition, all the downstream programs and scripts are also incremental, so for example you can be performing the time-consuming daligner overlap computation as the data arrives, thus spreading the CPU time over the sequencing time so that when that last SMRT cell comes off the machine, you will have a relatively short wait until you have all the overlaps that you need for the relatively rapid scrubbing and assembly steps.

Stay tuned, upcoming posts/releases:

DAMAPPER: an optimized version of daligner specifically for read mapping

DAVIEWER: a Qt-based viewing tool for looking at read piles including quality and scrubbing information

DASCRUBBER: a complete end-to-end scrubber for removing all artifacts and low quality segments from your read data set (at last!).