Detecting and Masking Repeats

The daligner was specifically designed to find local alignments (LAs) between reads as opposed to overlaps where in each end of the alignment reaches an end of one or the other read.  This is a necessity because raw PacBio reads can have very low quality or even random segments at any position, and one cannot therefore expect an alignment through these regions.  Finding LAs instead of overlaps, allows the Dazzler suite to then analyze pile-ograms in order to detect and address these regions (recall that a pile for a read x is the set of all LAs where x is the A-read and a pile-ogram is a depiction of all the A-intervals covering x).  Computing LAs also implies that if a read x is from a region of DNA that has a relatively conserved repetitive element in it, then every copy of the element in any read of the shotgun data set will align with the relevant interval of x.  For example, given a 50X (=c) data set and a repeat that occurs 20 (=r) times in the genome, one should expect every occurrence of the repeat in a read to be covered on average by 1,000 (=rc) LA’s in its pile.  The pile-ogram below depicts such a repeat stack:

A small repeat stackWhat’s good about this is that the excessive depth of “stacked” LA’s clearly signals the repetitive element, its approximate copy number, and its approximate boundaries.  What’s bad about this is that the daligner can spend 95% or more of its time finding all the local alignments in these repeat stacks.  This observation raises two questions: (a) Is it worth finding them all, and (b) If not, how can we avoid doing so?

The answer to the first question is “no” (i.e., it is not worth finding them all) and our argument is as follows.  First recall, that daligner can accept any number of interval tracks that it uses to soft mask the reads.  That is, it will only seed/seek alignments in regions that are not masked, but will extend an alignment through a masked region if the the two sequences involved continue to be similar.  Next observe, that if a repetitive element is flanked by unique sequence and it is short enough that with high probability there is one or more spanning reads that cover the repeat and at least say 1Kbp of the unique flank on either side, then there is no consequence to soft masking this repeat as the spanning reads will provide a coherent assembly across it as shown in the figure below as the left (small) scenario.  For Pacbio reads this is generally true of any interspersed repeat of 10Kb or less.  If a repeat is compound or so large that there are no spanning reads for it (as shown in the right (large) scenario), then all current assemblers frankly fail to assemble the region involving it, and so in some sense Effect of soft-masking repeatsno harm is done by soft-masking the long repeats too. We therefore think that soft-masking the repetitive parts of reads is a good initial strategy for the down stream assembly problem: most of the genome will come together with long reads, the “overlap”/daligner stage of assembly will be 10 to 20 times faster, and the big repeats can still be solved in a subsequent post-process by analyzing the much smaller number of reads that were completely masked as repetitive.

So on to the second question, which is how does one efficiently build the needed repeat masks?  Consider taking the hypothetical 50X data set above and comparing every 1X block of this data set against itself.  First observe that doing so is 2% of all the comparisons that would normally be performed by daligner.  Second, observe that unique parts of a read will have on average about 1 LA covering them, whereas a segment from a repeat that occurs, say, 10 times in the genome will already have have 10 LA’s covering it and will appear as a repeat stack in a read’s pile-ogram.  The probability of having 10 or more LA’s covering a unique segment of a read is exceedingly low, so simply specifying a threshold, such as 10, above which a region is labelled repetitive is an effective strategy for identifying high-copy number repetitive elements.  This mask can then be used by the daligner when performing the remaining 98% of its comparisons.  The resulting suppression of purely repeat-induced LAs will speed things up 10-20 times at almost no degradation in the performance of a downstream assembler like Falcon or the Celera Assembler as argued earlier.

For a very lScreen Shot 2016-04-04 at 9.44.23 AMarge data set one may want to repeat this process at different coverage levels and with different thresholds in order to achieve greater efficiency.  For example, we have a 30X data set of a 30Gbp genome, that requires dividing the Dazzler DB into 3,600 250Mbp blocks.  We first run each block against itself (.008X vs .008X) and create a mask .rep1 with a threshold of 20, that detects only super-repetitive elements.  Next we compare every group of 10 consecutive blocks (.08X) against themselves soft-maskiScreen Shot 2016-04-04 at 9.44.35 AMng with .rep1, and create a mask .rep10 with a threshold of 15 to detect fairly abundant repetitive elements.  Then we compare every consecutive group of 100 blocks (.8X) against themselves soft-masking with .rep1 and .rep10, and create a mask .rep100 with threshold 10 used to further soft-mask the final all-against-all of the 3,600 blocks in the entire data set.  The figures interspersed within this paragraph illustrate a 1×1 and 5×5 schema for a hypothetical set of blocks.

To implement this strategy we have created the programs HPC.REPmask and REPmask as part of a new DAMASKER module.  “HPC.REPmask -g#1 -c#2” produces a job-based UNIX script suitable for an HPC cluster that:

  1. compares each consecutive #1 blocks against themselves,
  2. sorts and merges these into .las files for each block, and then
  3. calls “REPmask -c#2 -mrep#1” on each block and associated .las file to produce a block interval track or mask, with the name .rep#1, of any read region covered #2 or more times.

As a running example consider a hypothetical Dazzler data base D that is divided into 150 blocks and contains an estimated 50X coverage of the target genome.  Then 3 blocks of data contains 1X of the genome, and to realize a 1X versus 1X repeat mask, we need to compare every consecutive group of 3 blocks against each other and then produce a mask of all reads covered more than say 10 deep.  Calling “HPC.REPmask -g3 -c10 D” will produce a shell script to do this, where the script has the following form:

# Daligner jobs (150)
daligner D.1 D.1
daligner D.2 D.1 D.2
daligner D.3 D.1 D.2 D.3
daligner D.4 D.4
daligner D.5 D.4 D.5
daligner D.6 D.4 D.5 D.6

daligner D.148 D.148
daligner D.149 D.148 D.149
daligner D.150 D.148 D.149 D.150
# Initial sort & merge jobs (450)
LAsort D.1.D.1.*.las && LAmerge L1.1.1 D.1.D.1.*.S.las
LAsort D.1.D.2.*.las && LAmerge L1.1.2 D.1.D.2.*.S.las
LAsort D.1.D.3.*.las && LAmerge L1.1.3 D.1.D.3.*.S.las
LAsort D.2.D.1.*.las && LAmerge L1.2.1 D.2.D.1.*.S.las
LAsort D.2.D.2.*.las && LAmerge L1.2.2 D.2.D.2.*.S.las
LAsort D.2.D.3.*.las && LAmerge L1.2.3 D.2.D.3.*.S.las

LAsort D.150.D.150.*.las && LAmerge L1.150.150 D.150.D.150.*.S.las
# Level 1 merge jobs (150)
LAmerge D.R3.1 L1.1.1 L1.1.2 L1.1.3
LAmerge D.R3.2 L1.2.1 L1.2.2 L1.2.3
LAmerge D.R3.3 L1.3.1 L1.3.2 L1.3.3
LAmerge D.R3.4 L1.4.1 L1.4.2 L1.4.3
LAmerge D.R3.5 L1.5.1 L1.5.2 L1.5.3
LAmerge D.R3.6 L1.6.1 L1.6.2 L1.6.3

# REPmask jobs (50)
REPmask -c10 -mrep3 D D.R3.1 D.R3.2 D.R3.3
REPmask -c10 -mrep3 D D.R3.4 D.R3.5 D.R3.6
REPmask -c10 -mrep3 D D.R3.7 D.R3.8 D.R3.9

REPmask -c10 -mrep3 D D.R3.148 D.R3.149 D.R3.150
# Produce final mask for entirety of D
Catrack D rep3

So the above would pretty much be the end of this post if it were not for one other kind of repetitive element that haunts single-molecule data sets that sample the entire genome including the centromeres and telomeres, namely tandem satellites.  Back in the days of Sanger sequencing, such repeats were rarely seen as they were unclonable, but with single molecule reads we see and sample all the genome including these regions.  Tandem repeats are even nastier than interspersed repeats at creating deep stacks in a read’s pile-ogram because they align with themselves at every offset of the tandem element as well as with each other!  So our masking strategy actually starts by detecting and building a mask for tandem repeats in the data set.

A tandem repeat can be detected in a read by comparing the read against itself, with the slight augmentation of not allowing the underlying alignment algorithm to use the main diagonal of the edit matrix/graph.  When a tandem is present, it induces off-diagonal alignments at intervalTandem dot plots equal to the tandem element length.  That is, if the tandem is x10 then the prefix of the first n copies of x aligns with the suffix of the last n copies of x creating a ladder of n-1 alignments as seen in the dot plot accompanying this paragraph.  Comparing a read against itself is already possible in our framework with the -I option to daligner.  To detect tandems, we built a special version of daligner that we call datander that given a DB block, compares each read in the block only against itself and outputs these alignments.  We further developed the program TANmask to use these self-alignments to detect tandem regions to mask: namely, when the two aligned segments of a self-LA overlap, the union of the two segments marks a tandem region.  Lastly, the program HPC.TANmask generates a script generator to call datander, sort and merge the resulting self-LAs, and finally call TANmask on the results.  For example, “HPC.TANmask D” for the 50X database D introduced earlier produces a script of the form:

# Datander jobs (38)
datander D.1 D.2 D.3 D.4
datander D.5 D.6 D.7 D.8

datander D.149 D.150
# Sort & merge jobs (150)
LAsort D.1.T*.las && LAmerge D.T.1 D.1.T*.S.las
LAsort D.2.T*.las && LAmerge D.T.2 D.2.T*.S.las

LAsort D.150.T*.las && LAmerge D.T.150 D.150.T*.S.las
# TANmask jobs (38)
TANmask D D.T.1 D.T.2 D.T.3 D.T.4
TANmask D D.T.5 D.T.6 D.T.7 D.T.8

TANmask D D.T.149 D.T.150
# Produce final mask for entirety of D
Catrack D tan

One should note carefully, that tandem repeat masking should be performed before the interspersed repeat masking discussed at the beginning.  To conclude, we close with the example of the super large 30X database of a 30Gbp genome consisting of 3,600 blocks.  Assuming the DB is called, say Big, all the scripts for performing the repeat-masked “overlap” phase of assembly would be produced by invoking the commands:

HPC.TANmask Big
HPC.REPmask -g1 -c20 -mtan Big
HPC.REPmask -g10 -c15 -mtan -mrep1 Big
HPC.REPmask -g100 -c10 -mtan -mrep1 -mrep10 Big
HPC.daligner -mtan -mrep1 -mrep10 -mrep100 Big

The scripts produced are quite huge for a task of this scale and include a variety of integrity checks and cleanup commands not mentioned here.  A newcomer would best be served by gaining experience on a small data set with small blocks, and only proceeding to large production runs with these “HPC.<x>” scripts when one has confidence that they can manage such a pipeline 😄  In a subsequent post on performance/results I will present some timing studies of the impact of repeat masking on the overlap phase of assembly.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s