Recently we published the damapper module on github and an accompanying blog post. There is already an ample number of mappers for long noisy reads including blasr, bwa and others, so why would one come up with another one? The damapper blog post already gives a hint that speed is one of the reasons, and in this post we take a closer look at how damapper performs in comparison with other aligners. We will first discuss the run-time performance and then the sensitivity achieved.
First let us consider the speed of damapper and competitors on a set of simulated E.coli reads. The reads were generated to feature an average length of 15,000 bases with an average error rate of 10% (of which 80% are insertions, 13.3% deletions and 6.7% substitutions). We generated 300MB worth of simulated read data, which amounts to about 20,000 reads. The reads have interspersed short stretches of higher error rate (25%). The interested reader can find the generating program in the loresim2 package on github. We chose to run our benchmarks on simulated instead of real data in order to be able to check the sensitivity achieved. Benchmarking was done on compute cluster nodes equipped with 12 CPU cores (Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz) and 128G of RAM. The graph below plots the wall clock run-time of damapper, bwa, blasr, yalora (a flexible BWT/FM based prototype mapper we created while testing various seed chaining methods for damapper) and damapper_bwt (a damapper variant in which the first stage of seed finding is replaced by a BWT-based index) against the number of threads used. Note that the scale on the run-time (y) axis is logarithmic.
The run-time of damapper is a step function. While the number of threads to be used by damapper can be set via the -T switch, the program internally changes this value to the next lower power of 2. This is why the run-time for 12 threads is essentially the same as for 8 threads. The parameters used are
- damapper: block size 384m (use -s384 when running DBsplit), -M32 -n.85 -C -N -T<threads>
- damapper_bwt: -i384m -p<threads> –sasamplingrate4
- yalora: –sasamplingrate4 -t<threads>
- blasr: –nproc <threads>
- bwa: -t<threads> -xpacbio
damapper processes the reads in 16 seconds. The run-time of the other aligners as multiple of the damapper run-time are
- damapper_bwt: 2.0
- yalora: 36.9
- blasr: 57.4
- bwa: 64.4
On this data-set blasr is slower than damapper by a factor of almost 60.
The following plot shows the memory usage of the aligners for the E.coli test run.
damapper uses considerably more memory than the other aligners (it uses between 10 and 11GB), but there is no point of having a lot of RAM on a machine and then not use it.
Next consider mapping Drosophila melanogaster based simulated long reads of the same characteristics as given above. We keep the same parameters as above and obtain the following plot.
damapper is still the fastest by far. The run-time with 12 threads is 33.62 seconds. The run-time for the other aligners as a multiple of the damapper run-time are
- damapper_bwt: 1.85
- yalora: 31.1
- blasr: 35.0
- bwa: 40.3
As can be expected the difference gets smaller for larger genomes (there are a lot more spurious/false positive k-mer seed candidates, particularly so for reads originating from repetetive regions), but in comparison with blasr and bwa it is still more than an order of magnitude.
The memory usage situation is similar to the E.coli case in terms of the order of the mappers’ lines as they appear in the graph. damapper uses hardly more memory then for the E.coli case. The other aligners start to use some more memory.
Finally we consider mapping simulated long reads to the human reference, GRCh38 (excluding alternate loci scaffolds, handling of which may be added to damapper at some later stage). The timing is shown in the following plot:
damapper still beats the performance of bwa, blasr and yalora, albeit no longer by an order of magnitude. The speed factors are
- blasr: 5.0
- yalora: 5.6
- bwa: 8.8
- damapper_bwt: 0.64
At the scale of the human genome, we have reached the point where damapper’s approach of merging the sorted lists of all reference and read k-mers is no longer faster than any FM/suffix array based approach to finding seed k-mers. The chaining and alignment stages of damapper, however, are still faster than those of other aligners like blasr and bwa.
The memory usage of the aligners on the human reference is shown in the following plot.
It is of course trivial to write a faster program if it compromises sensitivity — in the absurd extreme one can take 0 seconds to report that no reads map. Given that we used simulated data, we know the ground truth and show below that damapper is competitive with all the rest.
bwa, blasr and yalora map all reads to the correct location, i.e. they are 100% sensitive. We define the term mapping to the correct location the following way: the first (primary) alignment produced by a mapper aligns at least one base to the reference in exactly the same way as the read was simulated. This is stricter than requiring a coordinate interval overlap between the ground truth and the alignment produced by the mapper as such an overlap could be obtained by mapping the read to the wrong instance of a tandem repeat. damapper and damapper_bwt map all but 1 read (which is 99.98%) of the reads correctly. The unmapped read has a length of merely 209 bases, not something which would typically be found in a set of long reads. Obviously aligning a single base would hardly be useful, so we also provide the percentage of bases used in the primary/first alignment of a read by each aligner:
- bwa: 99.9%
- blasr: 99.8%
- yalora: 99.9%
- damapper: 98.8%
- damapper_bwt: 98.8%
The amount of bases aligned by damapper is lower than what we see for the other aligners. Note however that this is intentional, as damapper deliberately does not report stretches of a read as aligned when it deems the error rate for a section as excessive. If a read has regions featuring an exceptionally high error rate, then damapper will, instead of producing an end to end alignment, produce a chain of alignments containing the stretches of acceptable error rate only. It will not align random noise to a reference just because there is good read data to the left and right of that noise.
Note that even the aligners which produce end to end alignments and map 100% of the reads do not have 100% of all bases aligned. This is down to effects at the read boundaries. Clipping off bases at the front and back of a read can yield a higher score than including those bases in an alignment if the clipped bases contain errors.
The percentage of correctly mapped reads for the aligners is:
- bwa: 99.4% (21960/22095)
- blasr: 99.5% (21994/22095)
- yalora: 99.3% (21955/22095)
- damapper_bwt: 99.5% (21996/22095)
- damapper: 99.6% (22000/22095)
If we include non primary alignments, i.e. we consider cases where a mapper manages to find a suitable alignment but does not report it as the preferred one, then the numbers become:
- bwa: 99.4% (21960/22095)
- blasr: 99.8% (22046/22095)
- yalora: 99.9% (22075/22095)
- damapper_bwt: 99.6% (22017/22095)
- damapper: 99.7% (22038/22095)
The percentage of read bases in correct alignments is:
- bwa: 99.5%
- blasr: 99.6%
- yalora: 99.7%
- damapper_bwt: 98.6%
- damapper: 98.5%
Percentage of correctly mapped reads:
- bwa: 97.4% (18155/18631)
- blasr: 98.8% (18403/18631)
- yalora: 99.0% (18455/18631)
- damapper_bwt: 98.8% (18411/18631)
- damapper: 98.7% (18396/18631)
Allowing non-primary alignments the numbers become:
- bwa: 97.4% (18155/18631)
- blasr: 99.6% (18559/18631)
- yalora: 99.9% (18614/18631)
- damapper_bwt: 99.7% (18577/18631)
- damapper: 99.5% (18542/18631)
The percentage of read bases in correct alignments is
- bwa: 97.2%
- blasr: 98.6%
- yalora: 99.0%
- damapper_bwt: 97.6%
- damapper: 97.4%