Intrinsic Quality Values

One of the most vexing problems with Pacbio data is that it is very difficult to understand the quality of a given small segment or region of a read.  The Q-scores that come out of the instrument are not particularly informative in this regard, their value seems to be mostly in indicating if a given base is better or worse than the bases neighboring it (and this is of importance to multi-aligners and consensus callers like Quiver).  I tried every way I could think of to aggregate these Q-scores over a say 100bp interval so as to arrive at an informative statistic about the overall quality of that 100bp stretch and failed.  But having this information is very important because over the course of a long 10-40Kbp read, the quality of the base calls varies significantly, where there are even long internal stretches that are essentially junk.  I illustrate a typical Pacbio read error profile to help you further understand why I think it is extremely important to understand the “regional” quality of a read.

Error profile of a typical long read.
My solution to determining read quality begins with the idea of a pile, which is the set of all alignments that involve a given read, say a, as the A-read in an alignment pair (a,b).  Recall that the HPCdaligner script arranges to sort all the alignments between reads found by daligner so that they are in order of A-read, then B-read, then B-orientation, etc.  In particular, this means that for a given read, say 103, one encounters all the alignments with 103 as the A-read in consecutive order in a .las file.  That is the 103-pile and in fact every pile can be collected and processed consecutively as one scans a .las file!  All post-overlap and pre-assembly analyses in the Dazzler suite are based on looking at read piles.

The figure below shows a pile in a graphical form I call a pile-ogram.  The long yellow bar at the top represents the A-read and every white bar below it represents the A-interval of an alignment with another read.  The brown tips indicate that the B-read continues but does not align with the A-read  (note that some white bars have brown tips, others do not).

Screen Shot 2015-11-06 at 6.07.01 PMRecall from the previous post on trace points, that for each alignment we record the number of differences in the alignment for each section between tick marks spaced Δ base pairs apart in A.  Daligner uses a default of Δ = 100, so the number of differences is also the percentage mismatch between the two reads over each 100bp stretch of the A-read (save possibly at the very beginning and very end of an alignment that spans only a part of a 100bp stretch).   In the pile-ogram below, we have color-coded each 100bp interval of each alignment according to the number of difference in the segment, where the heat map scale is given just above the pile-ogram.

Screen Shot 2015-11-06 at 9.40.26 PMIt should immediately be obvious that in the 100bp “columns” that are all orange, red, and purple, every B-read disagrees with the corresponding segment of the A-read and the clear inference is that the quality of the A-read is very low in that segment.  Going a little further, if a column has say 40 alignments covering it and the average error rate of these is 12%, then it is very unlikely that 50% or more of the 40 B-reads are worse than average in their portions of the alignment spanning the column and indeed we expect the average quality of this 50% to be quite stable, perhaps from 8-12%.  The variable part that mostly determines the color code must then be the quality of the A segment so a good proxy statistic for the quality of each 100bp segment of the A-read is the average match fidelity of the best say 25-50% of the alignments covering the segment.  In the figure below we now color code the segments of the A-read according to their intrinsic quality value.  We use the adjective “intrinsic” to emphasize that this statistic was derived solely from the read data set itself.

Screen Shot 2015-11-06 at 6.04.16 PMIn the case that there is a large drop out in the quality of an A-read, this tends to “break” local alignments as the alignment software cannot find a good correspondence with any B-read in this region.  The pile below illustrates this effect where it is clear all alignments are interrupted over the interval [2100,2200].

Screen Shot 2015-11-08 at 9.43.24 AMThe pile below shows an even larger low quality region that is over 4000bp long.  However, one cannot immediately conclude that the segments in question are low quality, as another possibility is that the read is a chimer, and the drop out is due to the presence of the chimer junction.  Typically such breaks are small as in the pile above.  To distinguish the two possibilities, one needs to see if the B-reads are the same on both sides of the gap and that the gap distance is consistent with the gaps implied in the B-reads.  In the pile below, alignment pairs that are so consistent, are indicated by dashed lines across the gap.  One sees that about 1/2 the pairs are consistent across the gap.  For the others, a careful analysis reveals that the B-reads are not long enough to actually span the 4Kbp gap, terminating before reaching the other side.

Screen Shot 2015-11-08 at 9.44.32 AMHopefully, the examples and exposition above make it clear that analyzing read piles is pretty interesting and informative.  We have built a complete research prototype pipeline that uses the intrinsic quality values and the piles to trim poor prefixes and suffixes of each read, to detect and break all chimeric reads, to remove all adaptamers missed by the Pacbio software, and to identify and patch low quality gaps.  I am still not completely happy with elements of the current scrubbing pipeline and so will not release all of it at this time.  However, the intrinsic quality value caller is stable and I am releasing it now, so that at least you can have an error profile for your reads at this time.  Perhaps you can build a better scrubber.

The command “DASqv -c40 Project.db Project.las” will produce an intrinsic quality track, “qual” in the files .Project.qual.anno and .Project.qual.data.  In this example I was assuming that Project entailed 40X sequencing of a target genome and set the parameter -c accordingly.  While I could have tried to estimate the coverage of your project by looking at the piles, the assumption was that the user invariably knows this information and so can easily supply it.  For a given read, its quality vector has length ⌊ n/Δ ⌋ + 1 where Δ is the trace spacing (default 100), and it consists of one byte numbers between 0 and 50 where I capped the values at 50 as any segment with worse than an average 50% difference to its B-reads is obviously a hideously bad segment (two random DNA string will align with 52% differences).  As stated carefully previously, intrinsic quality values are statistics that correlate with the underlying quality, and not the actual error rate of the segment.  To help you calibrate this to the actual quality, DASqv can be asked to output a histogram of the match values and quality values with the -v option.  An example output, might be:

DASqv -c40 Project Project

Input:   28,347reads,   213,832,622 bases

Histogram of q-values (average 10 best)

         Input                 QV

50:     124591    0.2%       369417   17.2%

49:      30831    0.0%         1177    0.1%
48:      45052    0.1%         1304    0.1%
47:      55249    0.2%         1318    0.2%
46:      70125    0.3%         1576    0.3%
45:      88451    0.4%         1823    0.4%
44:     109586    0.6%         2019    0.5%
43:     138026    0.8%         2404    0.7%
42:     167315    1.0%         2891    0.8%
41:     206812    1.3%         3271    1.0%
40:     276479    1.7%         3718    1.2%
39:     273300    2.1%         4164    1.4%
38:     363039    2.7%         4507    1.7%
37:     439018    3.3%         5130    2.0%
36:     520598    4.1%         5792    2.3%
35:     629455    5.0%         6352    2.7%
34:     760599    6.1%         6980    3.1%
33:     893687    7.4%         8118    3.5%
32:    1109495    9.0%         9013    4.0%
31:    1309997   10.9%        10322    4.6%
30:    1599547   13.2%        12054    5.3%
29:    1881921   16.0%        14483    6.1%
28:    2230686   19.2%        17300    7.1%
27:    2659619   23.1%        21917    8.3%
26:    3071431   27.6%        27422    9.8%
25:    3660064   32.9%        34941   11.8%
24:    3751121   38.4%        45721   14.3%
23:    4299877   44.7%        58711   17.6%
22:    4550533   51.3%        75977   21.9%
21:    4729179   58.2%        96397   27.3%
20:    4818604   65.2%       118736   34.0%
19:    4445302   71.7%       142094   41.9%
18:    4232805   77.8%       160940   51.0%
17:    3771886   83.3%       172833   60.7%
16:    3209555   88.0%       173724   70.4%
15:    2585374   91.8%       161052   79.4%
14:    1974279   94.7%       135985   87.1%
13:    1416910   96.7%       100996   92.7%
12:     958661   98.1%        66090   96.4%
11:     599259   99.0%        37087   98.5%
10:     350258   99.5%        17349   99.5%
9:     185194   99.8%         6601   99.9%
8:      91231   99.9%         1966  100.0%
7:      39498  100.0%          518  100.0%
6:      15779  100.0%          114  100.0%
5:       5154  100.0%           23  100.0%
4:       1630  100.0%            3  100.0%

It is a matter of experience, but my general finding is that of the QV’s not capped at 50, about 80% of the data is definitely usable and 5-7% is definitely bad, leaving the rest in a “grey” zone.  From the histogram above anything under 22 is definitely usable, and anything over 28 is definitely bad.  Future scrubbing commands that trim and patch reads take these two user-selected thresholds as input (a -g and -b parameters) to control their function.  In this way, you can scrub harder or softer by adjusting these two thresholds.

To end, the routine DBdump now takes a -i option that asks it to output the quality vector for reads, provided, of course, the “qual” track exists.

A developing command guide for the scrubber module is available here.

Advertisements

2 thoughts on “Intrinsic Quality Values

  1. Pingback: New Blog Post from Gene Myers on “Intrinsic Quality Values” of Pacbio Sequences « Homolog.us – Bioinformatics

  2. I’d be interested to know how this compares with Nanopore reads. My gut expectation is that there will be more regions of 100% wrong agreement to the consensus, but the lengths of those regions will be shorter. FWIW, results from the most productive run in the MARC phase 1 paper can be found here (Fast5 files and fastq files, 200GB compressed):

    ftp://ftp.sra.ebi.ac.uk/vol1/ERA459/ERA459785/oxfordnanopore_native/E.coli_MARC1_run1.tar.gz

    In the brief flashes of spare time that I have I chip away at these data, but it’s probably going to take me about 10 years to process just that one run at the rate I’m going.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s