Update: November 11, 2016: This module has undergone a major upgrade to handle the BAM files produced by Sequel instruments, as well as to extract the information needed by the Arrow consensus program in the form of .arrow files. dextract can now read either .bax.h5 files or .subreads.[bs]am files and output a .fasta, .quiva, or .arrow file, or any combination thereof. There is a new compressor/decompressor pair, dexar/undexar, for .arrow files. Most importantly, there is a new program, dex2DB, that extracts information from PacBio source files and adds it directly to a Dazzler database.
The latest source code can be found on GitHub here.
The DEXTRACTOR commands allow one to pull exactly and only the information needed for assembly and reconstruction from the source HDF5 files produced by the PacBio RS II sequencer, or from the source BAM files produced by the PacBio Sequel sequencer. The diagram below shows the flow of information through the programs of this module (in green) and the relevant DB creation routines of the DAZZ_DB module (in orange). The program dextract pulls sequence, Quiver, and/or Arrow information from source files, outputting it in the form of .fasta, .quiva, and .arrow files, respectively. These in turn can be compressed and decompressed, and incrementally added to a growing DB. If the intermediate files are not of value, one can more efficiently produce a DB directly from PacBio source files with dex2DB.
All programs add suffixes (e.g. .fasta, .dexta) as needed. The formal UNIX command line descriptions and options for the DEXTRACTOR module commands are as follows:
Dextract takes a series of .bax.h5 or .subreads.[bs]am files as input and, depending on the option flag settings, produces:
- -f: a .fasta file containing subread sequences, each with a “standard” PacBio header consisting of the movie name, well number, pulse range, and read quality value.
- -a: a FASTA format .arrow file containing the pulse width stream for each subread, with a header that contains the movie name and the 4 channel SNR values.
- -q: a FASTQ-like .quiva file containing for each subread the same header as the .fasta file above, save that it starts with an @-sign, followed by the 5 quality value streams used by Quiver, one per line, where the order of the streams is: deletion QVs, deletion Tags, insertion QVs, merge QVs, and last substitution QVs.
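The .quiva layout described above can be read with a few lines of code. The following is a minimal sketch, assuming each record occupies exactly six non-empty lines (the @-header followed by the five streams in the stated order). The names `STREAMS`, `QuivaRecord`, and `read_quiva` are hypothetical and not part of DEXTRACTOR.

```python
from typing import Dict, Iterable, Iterator, NamedTuple

# Stream order as given in the description of the -q output.
STREAMS = (
    "deletion_qv",
    "deletion_tag",
    "insertion_qv",
    "merge_qv",
    "substitution_qv",
)


class QuivaRecord(NamedTuple):
    header: str               # header text without the leading '@'
    streams: Dict[str, str]   # one string per stream, keyed by STREAMS


def read_quiva(lines: Iterable[str]) -> Iterator[QuivaRecord]:
    """Yield records from text lines, assuming 6 non-empty lines per record."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 6 != 0:
        raise ValueError("expected a multiple of 6 non-empty lines")
    for i in range(0, len(rows), 6):
        header = rows[i]
        if not header.startswith("@"):
            raise ValueError(f"record header must start with '@': {header!r}")
        yield QuivaRecord(header[1:], dict(zip(STREAMS, rows[i + 1:i + 6])))
```

Passing an open file handle (or any list of lines) to `read_quiva` yields one record per subread; the header content itself is left unparsed because its exact field syntax is not specified here.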
If the -v option is set, then the program reports the processing of each PacBio input file; otherwise it runs silently. If none of the -f, -a, or -q flags is set, then -f is assumed by default. The destination of the extracted information is controlled by the -o parameter as follows:
- If -o is absent, then for each input file X.bax.h5 or X.subreads.[bs]am, dextract will produce X.fasta, X.arrow, and/or X.quiva as per the option flags.
- If -o is present and followed by a path Y, then the concatenation of the output for the input files is placed in Y.fasta, Y.arrow, and/or Y.quiva as per the option flags.
- If -o is present but with no following path, then the output is sent to the standard output (to enable a UNIX pipe if desired). In this case only one of the flags -f, -a, or -q can be set.
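The three -o cases above amount to a small decision procedure. The sketch below models it in Python purely for illustration; the function `output_targets` and its parameters are hypothetical, and the string "<stdout>" is just a placeholder for the standard output.

```python
import re

SUFFIX = {"f": ".fasta", "a": ".arrow", "q": ".quiva"}

# Recognised input extensions: .bax.h5, .subreads.bam, .subreads.sam
_INPUT_EXT = re.compile(r"\.(bax\.h5|subreads\.[bs]am)$")


def output_targets(inputs, flags="", o_given=False, o_path=None):
    """Return the list of outputs implied by the -f/-a/-q and -o settings."""
    # With none of -f, -a, -q set, -f is assumed.
    kinds = [k for k in "faq" if k in flags] or ["f"]
    if not o_given:
        # One set of outputs per input file X.
        roots = [_INPUT_EXT.sub("", name) for name in inputs]
        return [root + SUFFIX[k] for root in roots for k in kinds]
    if o_path:
        # All inputs are concatenated into Y.<suffix>.
        return [o_path + SUFFIX[k] for k in kinds]
    # -o with no path: standard output, so only one stream is allowed.
    if len(kinds) != 1:
        raise ValueError("only one of -f, -a, -q may be set when writing to stdout")
    return ["<stdout>"]
```

For example, `output_targets(["m1.bax.h5"], "fq", o_given=True, o_path="all")` returns `["all.fasta", "all.quiva"]`.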
One can select which subreads are transferred to the output with the -e option, which takes a C-style boolean expression over integer constants and the following 8 variable names that, for a given subread, designate:
zm - well number
ln - length of the subread
rq - quality value of the subread (normalized to [0,1000])
bc1 - # of the first barcode
bc2 - # of the second barcode
bq - quality of the barcode detection (normalized to [0,100])
np - number of passes producing subread
qs - start pulse of subread
For each subread, the expression is evaluated and the subread is output only if the expression is true. The default filter is “ln >= 500 && rq >= 750”, that is, only subreads of at least 500bp with a quality score of .750 or better will be output. If a variable is undefined for a subread (e.g. barcodes are often not present), the value of the variable will be -1.
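To make the filter semantics concrete, here is a minimal Python model of the default expression and of the rule that undefined variables take the value -1. It does not parse C-style expressions; `subread_vars` and `default_filter` are hypothetical helpers written only for this illustration.

```python
VARIABLES = ("zm", "ln", "rq", "bc1", "bc2", "bq", "np", "qs")


def subread_vars(**known):
    """Build the variable table for one subread; undefined variables become -1."""
    unknown = set(known) - set(VARIABLES)
    if unknown:
        raise KeyError(f"not a filter variable: {sorted(unknown)}")
    return {name: known.get(name, -1) for name in VARIABLES}


def default_filter(v):
    """Python rendering of the default filter: ln >= 500 && rq >= 750."""
    return v["ln"] >= 500 and v["rq"] >= 750


# A subread of exactly 500bp with rq = 750 (i.e. .750) passes.
assert default_filter(subread_vars(ln=500, rq=750))
# A 499bp subread is rejected regardless of quality.
assert not default_filter(subread_vars(ln=499, rq=900))
# With no barcode information, bc1 is -1, so a test such as bc1 == 3 is false.
assert not (subread_vars(ln=800, rq=800)["bc1"] == 3)
```

One consequence of the -1 convention is that a filter testing a barcode for a specific non-negative value silently excludes every subread that lacks barcode information.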
Dexta compresses a set of .fasta files (produced by either PacBio’s software or dextract), replacing them with new files with a .dexta extension. That is, submitting G.fasta will result in a compressed image G.dexta, and G.fasta will no longer exist. With the -k option the .fasta source is not removed. If -v is set, then the program reports its progress on each file. Otherwise it runs completely silently (good for batch jobs to an HPC cluster). The compression factor is always slightly better than 4.0. Undexta reverses the compression of dexta, replacing the compressed image G.dexta with the uncompressed G.fasta. By default the sequences output by undexta are in lower case and 80 chars per line. The -U option specifies that upper case should be used, and the characters per line, or line width, can be set to any positive value with the -w option.
With the -i option set, dexta (undexta) runs as a UNIX pipe that takes .fasta (.dexta) input from the standard input and writes .dexta (.fasta) to the standard output. In this case the -k option has no effect.
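A compression factor near 4.0 is what one would expect from storing each base in 2 bits instead of one 8-bit character. The sketch below shows that idea only; it is an assumption for illustration and not the actual .dexta file layout, which is not described here. The `upper` and `width` parameters of the hypothetical `unpack` mirror the roles of -U and -w.

```python
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}
BASES = "acgt"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].lower()
        byte = 0
        for j, base in enumerate(chunk):
            # First base of the chunk goes into the two highest bits.
            byte |= CODE[base] << (6 - 2 * j)
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int, upper: bool = False, width: int = 80) -> str:
    """Recover `length` bases, lower case by default, `width` chars per line."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    seq = "".join(bases[:length])
    if upper:
        seq = seq.upper()
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
```

Under this scheme 100 bases occupy 25 bytes. Line breaks in the .fasta carry no sequence information and need not be stored, which is at least consistent with a factor slightly above 4.0.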
Dexqv compresses a set of .quiva files (produced by dextract) into new files with a .dexqv extension. That is, submitting G.quiva will result in a compressed image G.dexqv, and G.quiva will no longer exist. The -k flag prevents the removal of G.quiva. With -v set, progress is reported; otherwise the command runs silently. If slightly more compression is desired at the expense of being a bit “lossy”, then set the -l option. This option is experimental in that it remains to be seen whether Quiver gives the same results with the scaled values that induce the loss. Undexqv reverses the compression of dexqv, replacing the compressed image G.dexqv with the uncompressed G.quiva. Its -v and -k flags are analogous to those of dexqv. The compression factor is typically 3.3 or so (4.0 or so with -l set). By .fastq convention each QV vector is output by undexqv as a single line without intervening new-lines, and by default the Deletion Tag vector is in lower case letters. The -U option specifies that upper case letters should be used for that vector instead.
Dexar compresses a set of .arrow files and replaces them with new files with a .dexar extension. That is, submitting G.arrow will result in a compressed image G.dexar, and G.arrow will no longer exist. With the -k option the .arrow source is not removed. If -v is set, then the program reports its progress on each file. Otherwise it runs completely silently (good for batch jobs to an HPC cluster). The compression factor is always slightly better than 4.0. Undexar reverses the compression of dexar, replacing the compressed image G.dexar with the uncompressed G.arrow. By default the sequences output by undexar are 80 chars per line. The characters per line, or line width, can be set to any positive value with the -w option.
With the -i option set, dexar (undexar) runs as a UNIX pipe that takes .arrow (.dexar) input from the standard input and writes .dexar (.arrow) to the standard output. In this case the -k option has no effect.
Dex2DB builds an initial database, or adds to an existing database, directly from either (a) the list of .bax.h5 or .subreads.[bs]am files following the database name argument, or (b) the list of PacBio source files in <file> if the -f option is used. One can filter which reads are added to the DB with the -e option (see dextract above).
On a first call to dex2DB, i.e. one that creates the database, the settings of the -a and -q flags determine the type of the DB as follows. If the -a option is set, then Arrow information is added to the DB and the DB is an Arrow-DB (A-DB). If the -q option is set, then Quiver information is added to the DB and the DB is a Quiver-DB (Q-DB). If neither flag is set, then the DB is a simple Sequence-DB (S-DB). One can never specify both the -a and the -q flags together. After the first additions to the database, the settings of the -a and -q flags must be consistent with the type of the database, or be left unset, in which case the type of the database determines which information is pulled from the inputs.
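The rules for the database type can be summarized as a small function. This is a sketch of the logic as described above, not of dex2DB's implementation; `resolve_db_type` and the single-letter type codes "A", "Q", and "S" are names chosen for this illustration.

```python
from typing import Optional


def resolve_db_type(existing: Optional[str], a: bool = False, q: bool = False) -> str:
    """Return the DB type ("A", "Q", or "S") after a call with the given flags.

    `existing` is None when the call creates the database, otherwise the
    type the database already has.
    """
    if a and q:
        raise ValueError("-a and -q can never be specified together")
    requested = "A" if a else "Q" if q else None
    if existing is None:
        # First call: the flags decide; neither flag gives a Sequence-DB.
        return requested or "S"
    if requested is not None and requested != existing:
        raise ValueError(f"flag implies {requested}-DB but database is {existing}-DB")
    # Flags unset or consistent: the existing type governs what is pulled.
    return existing
```

For instance, `resolve_db_type(None, a=True)` gives `"A"`, and a later call `resolve_db_type("A")` with no flags still pulls Arrow information, whereas `resolve_db_type("A", q=True)` is rejected.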
To compile the programs you must have the HDF5 library installed on your system, and its library and include files must be on the appropriate search paths. The HDF5 library in turn depends on zlib, so make sure it too is installed on your system. The most recent version of the source for the HDF5 library can be obtained here.