|
3.4.12
|
report <report type> <other arguments>
The 'report' command is used to generate reports about the currently open
project. The type of report is determined by the '<report type>' argument.
The '<other arguments>' are determined by the report type.
The following report types are available. Run
'help report <report type>' for more detailed information.
alignment The alignments in the currently open project.
variantHits The variant hits in the currently open project.
|
3.4.12.1
|
report align[ment]
-sam[ple] <sample name>
-ref[erence] <reference sequence name>
[-readT[ype] <"con[sensus]" or "ind[ividual]">]
[-start <reference start position>] [-end <reference end position>]
[-mar[gin] <size>]
[-wrap[pingWidth] <width>]
[-makeDir[ectory] <"all", "last" or "none">]
[-outputFor[mat] <"fasta", "clustal", "ace", "sam", "bam",
"table" [-tableOutputFormat <tsv|csv>]> ]
[-outputDir[ectory] <directory path>]
[ [-outputFile <file>] |
[ [-outputPre[fix] <prefix>]
[-outputSuf[fix] <suffix>]
[-mappingFile <file>] ] ]
[-annot[ationFileSuffix] <suffix>]
[-fileFilter <"all", "linux", "mac", or "windows">]
[-file <file> [-format <format>]]
[<amplicon name 1> <amplicon name 2> ...]
The 'report alignment' command outputs sequence alignments in one of
several formats. FASTA format is the default, but Clustal, Ace, SAM, BAM
and Table may also be specified using the -outputFormat parameter.
Values for the '-sample' and '-reference' parameters are required, and if
specified as the names of a sample and reference sequence for which an
alignment has been computed in the project, then the corresponding
alignment will be output. If no '-outputFile' option is given, the
alignment is printed to the standard output of the interpreter. An output
file of "-" has the same effect. If an output file is given, the alignment
is written to that file. Run 'help general filePaths' for more information
about specifying files.
Alternatively, either or both of the '-sample' or '-reference' parameters
may be specified as the (wildcard) character '*', in which case all
alignments that have been computed in the project for the indicated
combination of samples and reference sequences will be output. When using
this form of the command, multiple alignments will typically be produced,
and so the output cannot be sent to standard output and the '-outputFile'
parameter cannot be used. As explained below, the alignments are written
to files in a directory structure according to a file naming convention
that can be customized using the '-outputPrefix' and '-outputSuffix'
parameters.
Using the '-file' parameter, one or more of the parameter values may be
supplied from tabular input. Run 'help general tabularCommands' for
information about the '-file' option.
The remaining parameters are described below, grouped by their use in
specifying the alignment region to output, formatting the alignment, and
determining where the output is to be written.
ALIGNMENT TYPE AND REGION PARAMETERS:
The '-readType' parameter specifies the type of read to include in the
alignment, and may be either "consensus" (the default if '-readType' is not
used) or "individual".
By default, the alignment output includes the target sequence regions of
all the amplicons for which there are computed alignment data for the given
'-sample' and '-reference' values. An optional, space separated, list of
amplicon names may be provided to restrict the alignment output to the
target sequence neighborhoods of those specific amplicons. The amplicon
names are interpreted relative to the given '-reference' value, and thus
this amplicon filtering ability is typically only useful if a non-wildcard
'-reference' value is supplied.
The '-start' and '-end' parameters may be used to precisely define
(in 1-based reference sequence positions) the bounds for the reads in the
alignment.
If '-start' and/or '-end' positions are specified along with a list of
specific amplicons (or all amplicons for the reference sequence, if a
specific list is not supplied), the alignment output will be restricted to
the region of reference base positions that constitutes the (smallest)
intersection of all the specifications.
Bases of reads that extend outside the specified alignment region will be
trimmed from the output, and reads that align within these positions will
be padded on either side, as applicable, with gap characters ('-'). Reads
whose alignments have no overlap with the specified alignment region will
not be included in the output at all.
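The trimming and padding rules above can be illustrated with a short Python
sketch. This is not the AVA implementation: the function name clip_read and
its arguments are hypothetical, and the sketch assumes one alignment
character per reference position (i.e., no insertions relative to the
reference).

```python
def clip_read(ref_start, aligned, region_start, region_end):
    """Trim or pad one aligned read to the region [region_start, region_end].

    ref_start is the 1-based reference position of the first character of
    `aligned`. Returns None when the read does not overlap the region.
    """
    ref_end = ref_start + len(aligned) - 1
    if ref_end < region_start or ref_start > region_end:
        return None  # no overlap: the read is left out of the output

    # Trim bases that extend outside the region.
    lo = max(ref_start, region_start)
    hi = min(ref_end, region_end)
    kept = aligned[lo - ref_start : hi - ref_start + 1]

    # Pad with gap characters so the row spans the whole region.
    return "-" * (lo - region_start) + kept + "-" * (region_end - hi)
```

For instance, under these assumptions a read starting at reference position
7 with bases "AC" and a region of [5, 10] would be emitted as "--AC--".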
FORMATTING PARAMETERS:
The '-margin' parameter specifies a number of additional reference bases
to include on either side of the alignment region (as determined by the
amplicons, '-start' and '-end' parameters described above). The bases
of the reads in the alignment will still be trimmed to the specified
alignment region, but the reference sequence, which is output as the first
sequence of the alignment output, will include the additional contextual
bases. Under these reference positions, the read alignments will be padded
with the gap character ('-'). If not specified, the default margin is
0 (zero).
The '-wrappingWidth' parameter defines the maximum number of alignment
characters to allow per line in the formatted alignment output. In FASTA
output only, the special value 0 (zero) may be given to indicate no
wrapping. If no value is supplied, then the default value of 50 will be
used. ACE and SAM/BAM ignore this option.
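As a rough illustration of the wrapping behavior for FASTA output, the
following Python sketch splits an aligned sequence into lines of at most
'width' characters and treats 0 as "no wrapping". The function name
wrap_alignment is hypothetical, and the sketch assumes that wrapping is a
plain fixed-width split of the alignment characters.

```python
def wrap_alignment(sequence, width=50):
    """Split an aligned sequence into lines of at most `width` characters.

    A width of 0 means no wrapping (the FASTA-only special case).
    """
    if width < 0:
        raise ValueError("width must be zero or positive")
    if width == 0:
        return [sequence]
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

With the default width of 50, a 120-character alignment row would be
written as three lines of 50, 50 and 20 characters.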
WRITING ALIGNMENT TO STANDARD OUTPUT:
If no wildcard ('*') specifiers are used for either the '-sample' or
'-reference' and no '-outputFile' parameter value is supplied (or one is
supplied, but it is the special value '-'), then the alignment
will be written to the standard output of the interpreter.
WRITING ALIGNMENT(S) TO FILE(S):
Alignment output may be written to files using a combination of the
'-outputDirectory' parameter and other parameters that depend on whether
or not a wildcard ('*') specification was provided for either of the
'-sample' or '-reference' parameters.
The '-outputDirectory' is optional, but can be used as a convenience to
factor out the specification of a containing directory from the remainder
of the output file path specification. The value given for
'-outputDirectory' follows all the rules as explained for specifying
paths in 'help general filePaths' and, in particular, allows the use of
path shortcuts like %homeDir at the beginning of the path specification.
When wildcard ('*') specifications for '-sample' and '-reference' are not
used, the '-outputFile' parameter may be used to specify a single file for
the alignment output. The file is placed under the path specified
by the '-outputDirectory' parameter, if given. If '-outputDirectory' is
not specified, then the file specified by '-outputFile' will be written
under the current directory unless the '-outputFile' itself contains some
additional, prefixed relative or absolute path specification as explained
in 'help general filePaths'.
When a wildcard ('*') specification for either '-sample' or '-reference' is
used, the output file for a given sample / reference combination is a file
in the directory:
outputDirectory/filteredSampleName/filteredReferenceName
where the outputDirectory is the current directory if '-outputDirectory'
is not specified. The filteredSampleName and filteredReferenceName are the
original sample and reference names from the project, possibly changed
according to the value of the '-fileFilter' parameter, which is explained
below.
Within that directory structure, the alignment is written to a file with
the automatically generated name:
outputPrefix +
filteredSampleName + "_vs_" + filteredReferenceName + outputSuffix
where "+" indicates concatenation of the values. The outputPrefix value
can be specified with the '-outputPrefix' parameter and defaults to the
empty string if not supplied. The outputSuffix may be specified with the
'-outputSuffix' parameter to provide a filename extension; when unspecified
it defaults to the filename-extension associated with the type given in
-outputFormat, i.e., 'fasta'=".fna", 'clustal'=".aln", 'ace'=".ace".
Note that the "." that separates the file extension from the rest of the
file name is explicitly supplied as part of the outputSuffix itself, and
so the extension can be effectively eliminated by supplying an empty string
("") for the '-outputSuffix' parameter value.
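The naming convention described above can be sketched in Python as follows.
The function name wildcard_output_path and its arguments are hypothetical,
the sample and reference names are assumed to have already been filtered,
and only the three default suffixes listed above are included. The path is
joined with "/" purely for illustration.

```python
# Default suffixes as listed in this help text for '-outputFormat'.
DEFAULT_SUFFIX = {"fasta": ".fna", "clustal": ".aln", "ace": ".ace"}


def wildcard_output_path(sample, reference, output_format="fasta",
                         output_dir=".", prefix="", suffix=None):
    """Build the output path for one sample / reference combination.

    `sample` and `reference` are the already filtered names.
    """
    if suffix is None:
        suffix = DEFAULT_SUFFIX[output_format]
    file_name = prefix + sample + "_vs_" + reference + suffix
    # outputDirectory/filteredSampleName/filteredReferenceName/fileName
    return "/".join([output_dir, sample, reference, file_name])
```

Under these assumptions, Sample1 and EGFR_Exon_19 with an output directory
of dirA would map to dirA/Sample1/EGFR_Exon_19/Sample1_vs_EGFR_Exon_19.fna,
and passing an empty string as the suffix drops the extension.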
When wildcards are used, the automatically generated filenames and the
directory structure that contains the alignment output, are based on the
names of the samples and reference sequences. It is possible that these
names contain characters that are not allowed in filenames according to
the operating system where the files are initially created or may
eventually be viewed (if the files were copied to another machine).
Consequently, these names must be filtered to be compatible with file
naming conventions of the intended operating systems.
Filename filtering is controlled by the '-fileFilter' parameter that
ensures that the automatically generated output filenames and paths use
legal file system characters. If this parameter is not supplied, then its
value defaults to "all" which provides the most strict filtering and should
produce filenames that are compatible across all major operating systems.
Illegal characters are replaced with a hyphen and a unique index (for the
one invocation of the report alignment command) that uniquely encodes the
characters. Less general, OS-specific filename filtering may be elected by
setting this parameter to "linux", "windows" or "mac". Note that this
setting does not filter the file-path value set by '-outputFile' when
wildcards are not used, where the user is in complete control of the
filename.
When wildcards are used, the '-mappingFile' parameter may optionally
designate the name of the file that should be created by the report
alignment command in the outputDirectory. This file will contain a row of
data for each sample/reference name pair and specify the relative path to
the corresponding alignment output file for that pair. Using this file,
a user, or automated process, can determine the alignment output file based
on the original sample and reference names, prior to any
filesystem-specific filename filtering. The mapping file will be in comma
separated format if specified with a ".csv" extension, and will be
tab-separated otherwise.
When using wildcards, it is possible that the directory specified by
'-outputDirectory' does not already exist. The '-makeDirectory' parameter
may be given to specify what to do in this case. Providing the value "all"
will allow all sub-directories in the -outputDirectory path to be created
(i.e., if they don't already exist on the disk). The value "last" will
allow the last directory on the path to be created, but if any of the
intermediate parent directories do not exist, the command will fail with
an error. When not supplied, the default value is "none", in which case
the entire '-outputDirectory' path must already exist. Regardless of this
value, the subdirectories based on the filtered sample and reference names
will automatically be created below the '-outputDirectory' location, and
do not have to pre-exist.
When not using wildcards, the '-makeDirectory' parameter is also available,
but is applied to the full directory path derived from the combination of
the values of the '-outputDirectory' and '-outputFile' parameters, rather
than just to the '-outputDirectory' value itself.
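The three '-makeDirectory' policies can be mimicked with the Python
standard library, as in the sketch below. It only illustrates the described
semantics and is not how the command itself is implemented; the function
name prepare_output_dir is hypothetical.

```python
import os


def prepare_output_dir(path, make_directory="none"):
    """Ensure `path` exists according to a '-makeDirectory'-like policy."""
    if os.path.isdir(path):
        return  # nothing to create
    if make_directory == "all":
        # Create every missing directory along the path.
        os.makedirs(path, exist_ok=True)
    elif make_directory == "last":
        # Create only the final directory; os.mkdir raises
        # FileNotFoundError if a parent directory is missing.
        os.mkdir(path)
    elif make_directory == "none":
        raise FileNotFoundError("output directory does not exist: " + path)
    else:
        raise ValueError("expected 'all', 'last' or 'none'")
```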
When writing to files, pre-existing files may be overwritten. Run
'help set outputFileOverwritePolicy' to learn how to be alerted to, or
prevent, such file overwrites.
SUPPLEMENTAL ANNOTATION FILES
The '-annotationFileSuffix' parameter may only be used in conjunction with
'-outputFormat clustal' or '-outputFormat ace' to generate two files: the
primary (i.e., clustal or ace) and the secondary, an 'annotation file' in
'table' format. The secondary file has the same name as the primary output
file plus the given annotation suffix. If the suffix ends with '.csv', the
annotation file will be a table in comma-separated value format, and in
tab-separated value format otherwise. NOTE: annotation files cannot be sent
to standard output, only to files.
BASIC EXAMPLES:
report alignment -sample Sample1 -reference EGFR_Exon_19
Reports the consensus read alignment (default) for all amplicons in the
EGFR_Exon_19 reference to the standard output of the command
interpreter in FASTA format. Default wrapping width of 50 characters
is used.
report align -sam Sample1 -ref EGFR_Exon_19 -readType individual \
-wrapping 0 -outputFile rpts/out.fna
Reports the alignment of individual reads with no line wrapping and
output going to the file:
%currDir/rpts/out.fna
report align -sam Sample2 -ref HLA_Long_Amps -readType consensus \
-wrappingWidth 60 -margin 15
Reports, to standard output, the alignment of the consensus reads with
a margin of 15 bases from the reference sequence added to both ends and
then line wrapped on every 60th character. Note: it is not necessary
to use '-readType consensus' as this is the default report output.
AMPLICON FILTERING EXAMPLES:
report align -sam Sample1 -ref HLA_Long_Amps GA9 DE15
Reports the consensus alignment for the amplicons GA9 and DE15
in the reference to the standard output of the command interpreter in
FASTA format.
report align -sam Sample1 -ref HLA_Long_Amps DD14 DE15 \
-start 50 -end 350
Reports the consensus alignment for the amplicons DD14 and DE15,
clipping output to the given reference sequence positions
[50, 350], inclusive.
WILDCARD SAMPLE AND REFERENCE EXAMPLES:
report align -sam * -ref *
Reports the consensus alignment for all valid sample and reference
pairs to a collection of files located in the current directory.
report align -sam Sample1 -ref * -outputDir dirA -makeDir last \
-fileFilter linux -mappingFile map.tsv
Reports the consensus alignment for all valid Sample1 and reference
pairs to files (whose auto-generated names are linux OS compliant) in
the %currDir/dirA directory, creating the 'dirA' directory if
necessary, and creating a mapping file called "map.tsv" in the dirA
directory as well.
FASTA ALIGNMENT OUTPUT FORMAT
The FASTA alignment output begins with an entry for the reference
sequence as trimmed according to the '-start', '-end', amplicon list, and
'-margin' parameter values. Subsequent entries are either the individual
or consensus reads (depending on the '-readType' parameter) that comprise
the alignment, padded as necessary with '-' gap characters. Each entry
consists of a definition line prefixed with a '>' followed by the aligned
sequence data, wrapped according to the '-wrappingWidth' parameter. The
definition line specifies the name of the reference sequence or read, as
applicable, followed by a set of keyword/value pairs that annotate the
sequence. The general form of the definition line is:
>name keyword1=value1 keyword2=value2 ...
The particular keyword value pairs that appear on the definition line
depend on whether or not the entry corresponds to the reference sequence
or an individual or consensus read. The keywords are as follows, depending
on the sequence type.
KEYWORD |R|C|I| DESCRIPTION OF CORRESPONDING VALUE
-----------------+-+-+-+----------------------------------------------
sample |x| | | name of the sample that is the read source
amplicon | |x|x| name of the amplicon that is the read source
consensusLabel | | |x| consensus read containing the individual read
strand |x|x|x| + = forward, - = reverse
forwardCount | |!| | # of + strand reads in consensus
reverseCount | |!| | # of - strand reads in consensus
refStart |x|x|x| start alignment position relative to reference
refEnd |x|x|x| end alignment position relative to reference
readStart | |~|x| position of base within read at alignment start
readEnd | |~|x| position of base within read at alignment end
alignedReadBases | |x|x| number of aligned read bases
NOTE: R(x) = key is shown for the Reference Sequence (first output line).
I(x) = key is shown for Individual alignment reads.
C(x) = key is shown for Consensus alignment reads.
C(!) = key is shown for Consensus alignment reads only if
value is non-zero.
C(~) = key is shown for Consensus alignment reads but positions
are synthesized as [1..alignedReadBases].
For a given alignment output, all the reads will be derived from the same
sample and so, for brevity, the sample keyword is only present on the
definition line of the reference sequence that appears at the start of the
output. All reported positions are given using a 1-based positioning
system (i.e., the first base is base #1). For reads with a strand of '-',
the readStart and readEnd are given relative to the original read
orientation, and so in this case readStart will be greater than the
readEnd.
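A downstream script could read these definition lines with a small parser
such as the Python sketch below. The function name parse_definition_line is
hypothetical, and the sketch assumes that neither names nor values contain
whitespace. Values are kept as strings, so numeric keywords such as
refStart would need to be converted by the caller.

```python
def parse_definition_line(line):
    """Split '>name key1=value1 key2=value2 ...' into (name, dict)."""
    if not line.startswith(">"):
        raise ValueError("a definition line must start with '>'")
    name, *fields = line[1:].split()
    # Split each field on the first '=' only, so values may contain '='.
    annotations = dict(field.split("=", 1) for field in fields)
    return name, annotations
```

For a made-up line such as ">read1 amplicon=GA9 strand=- refStart=12", the
sketch returns the name "read1" and a dictionary with the keys amplicon,
strand and refStart.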
TABLE OUTPUT FORMAT
The Table format is a tab or comma separated value table whose column
headers are identical to FASTA's keywords, but with the first letter of
each keyword in upper case (e.g., the "readEnd" values of the FASTA output
would appear in a column labeled "ReadEnd"). Two additional columns of
data are also included, 'Accno' and 'Alignment', specifying the identifier
of a sequence and its (gapped) sequence alignment, respectively. The first
row after the column labels contains data for the reference sequence and
subsequent rows contain the data for the consensus or individual reads
(depending on the value of the -readType parameter).
The '-tableOutputFormat' option controls the format of the table.
If 'tsv' is specified, a tab-delimited format is used. Alternatively if
'csv' is given, then a comma-delimited format is used. If not specified,
the table will be tab-delimited, unless an output file is given
(or is wildcard generated) with a ".csv" extension.
Example:
report alignment -sample Sample1 -reference EGFR_Exon_19 \
-outputFormat table -outputFile S1_E19.dat \
-tableOutputFormat csv
Reports the consensus read alignment (default) for all amplicons in the
EGFR_Exon_19 reference to the file S1_E19.dat in a Table format, with
data separated by commas.
The Table format can also, optionally, be used to supplement Clustal and
Ace output formats to compensate for sequence annotations that are not
fully supported by those formats. When used in this manner, the Alignment
column of data is not included in the output (see Clustal Output Format
documentation for an example).
CLUSTAL OUTPUT FORMAT
The Clustal output format is provided as another way to export AVA
nucleotide sequence alignments. Output produced in this format is
from the AVA alignments, and should not be misconstrued as being output
from an actual Clustal-based alignment implementation.
For more information on specifics of the Clustal output format, and the
basis of the AVA implementation of that format, see:
http://mcast.sdsc.edu/doc/clustalw-format.html
All 'report align' options used with CLUSTAL have effects similar to those
described for FASTA. One exception is -wrappingWidth, which for CLUSTAL
is limited to a range of [1..60] and defaults to 50 if left unspecified.
Clustal format does not include space for key information, such as the
forwardCount or reverseCount of reads contained within consensus reads or
the true refStart and refEnd position of the Reference sequence and the
readStart and readEnd positions of the reads in the type of local
alignments performed by AVA (post primer trimming). A Table format
output containing this additional information to annotate the Clustal
formatted output can be generated along with the Clustal output by
specifying a value for the '-annotationFileSuffix' option.
Example:
report align -sam * -ref * -outputFormat clustal \
-annotationFileSuffix _annot.csv
In the above example, the wildcard expansion will generate file names
based on the Sample and Reference names in the usual manner, and each
file will contain alignments in Clustal format. For each such output file
named X, an additional file named X_annot.csv will be generated in the
Table format (see Table Output Format above) and contains the supplemental
annotations.
NOTE: if -annotationFileSuffix is used, the report output cannot be
directed to the console's standard output.
ACE OUTPUT FORMAT
Using the option '-outputFormat ace', alignments are output in Ace format.
Alignments in this format are still those of the AVA alignment algorithm
and shouldn't be misconstrued as being output based on the Phrap
assembly/alignment algorithm.
For more information on specifics of the Ace output format, see:
http://www.phrap.org/consed/distributions/README.16.0.txt
In the current implementation, the "BQ" tagged quality score values are
placeholders: the constant value 30 is output for each base instead of an
actual quality score.
All 'report align' options used with ACE have effects similar to those
described for FASTA. One exception is -wrappingWidth, which is ignored for
ACE because the width is fixed at 50.
The -annotationFileSuffix option may be used with the Ace format
(see Clustal Output Format for an example) to generate separate file(s)
containing supplemental annotation information for each aligned sequence
in tabular form.
SAM / BAM OUTPUT FORMAT (Sequence Alignment/Map Format)
Using the option '-outputFormat sam', alignments are output in SAM format
as described in the v0.1.2 draft of the specification:
http://samtools.sourceforge.net/SAM1.pdf
Using the option '-outputFormat bam', alignments are output in a
compressed binary format.
Currently the reference sequence is added as the first sequence in the
output file. Writing BAM output to the console is not advised, because it
is a binary format.
All 'report align' options used with SAM/BAM have effects similar to those
described for FASTA. One exception is -wrappingWidth, which is ignored.
READ ORDER IN ALIGNMENT:
Every alignment begins with an entry for the reference sequence. Depending
on the specified '-readType', the consensus or individual reads that follow
are ordered as follows:
For the "consensus" reads:
1. Reads are grouped by amplicon, and the amplicon-based groups are
ordered so that amplicons with smaller target start values appear
first, and shorter (nested) amplicons with the same target start
appear before the longer (containing) amplicons: i.e., reads from
amplicons closest to the 5' end of the reference sequence appear
before reads from amplicons that are closer to the 3' end.
2. Within an amplicon-based group, the consensus reads are ordered by:
1. Constituent read count: consensi with the largest forwardCount
and reverseCount values appear first.
2. And if tied, then ordered by refStart: reads with fewer leading
gaps appear first.
3. And if tied, then ordered by the aligned nucleotide sequence:
these are sorted by their natural ASCII lexicographic order
(i.e., - < A < C < G < N < T).
4. And if tied, then ordered by the strand: forward reads appear
before reverse reads.
5. And finally, if necessary, ordered by the consensus read name.
For the "individual" reads:
1. Reads are first ordered by the refStart: reads with fewer leading
gaps appear first.
2. And if tied, then ordered by the aligned nucleotide sequence:
these are sorted by their natural ASCII lexicographic order
(i.e., - < A < C < G < N < T).
3. And if tied, then ordered by the strand: forward reads appear
before reverse reads.
4. And if tied, then ordered by the read identifier (i.e., as taken
from the SFF file).
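The tie-breaking rules above map naturally onto tuple sort keys. The Python
sketch below shows one way to reproduce the ordering within a single
amplicon group (consensus reads) or for a whole alignment (individual
reads). It is an illustration only: the dictionary field names are
hypothetical, and it assumes that "constituent read count" means the sum of
forwardCount and reverseCount. Ordering of the amplicon groups themselves
is not covered.

```python
def _strand_rank(strand):
    # Forward reads ('+') sort before reverse reads ('-').
    return 0 if strand == "+" else 1


def individual_read_key(read):
    """Sort key: refStart, aligned sequence (ASCII order), strand, name."""
    return (read["refStart"], read["aligned"],
            _strand_rank(read["strand"]), read["name"])


def consensus_read_key(read):
    """Sort key for consensus reads within one amplicon group."""
    # Assumption: constituent read count = forwardCount + reverseCount.
    count = read.get("forwardCount", 0) + read.get("reverseCount", 0)
    # Negate the count so that larger counts sort first.
    return (-count,) + individual_read_key(read)
```

A list of read records can then be ordered with, for example,
sorted(reads, key=consensus_read_key). Python compares strings by code
point, which matches the ASCII order - < A < C < G < N < T given above.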
|
3.4.12.2
|
report variantHits [-outputFile <file>] [-format <table format>]
Reports variant hits. Variant hits are reported in the form of a table.
The table has columns for the following.
Reference Name
Variant Name
Variant Status
Variant Pattern
Sample Name
Forward Hits
Forward Denom
Reverse Hits
Reverse Denom
Read Type
Data are provided for a Variant of a given Reference Sequence if there
are reads of a Sample that span the region of variation as described
by the Variant Pattern. The number of forward and reverse reads that
span the region are reported in the Forward Denom and Reverse Denom
columns, respectively. The number of these reads that have the variation
are given in the Forward Hits and Reverse Hits columns. The Hit / Denom
ratio provides an estimate of the Variant frequency in the Sample.
Two rows of data are given for each Variant based on the Read Type,
which is either Consensus or Individual.
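As a worked illustration of the Hit / Denom ratio, the Python sketch below
computes a frequency estimate from one row of the table. The function name
variant_frequency is hypothetical, and combining the forward and reverse
counts into a single ratio is an assumption; the per-strand ratios can be
computed in the same way.

```python
def variant_frequency(forward_hits, forward_denom,
                      reverse_hits, reverse_denom):
    """Estimate variant frequency as total hits over total spanning reads.

    Returns None when no reads span the region of variation.
    """
    denom = forward_denom + reverse_denom
    if denom == 0:
        return None
    return (forward_hits + reverse_hits) / denom
```

For example, 5 forward hits out of 100 forward reads and 3 reverse hits out
of 100 reverse reads give 8 / 200, i.e. an estimated frequency of 4%.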
If no '-outputFile' option is given, the table is printed in a
tab-delimited format to the standard output of the interpreter. An output
file of "-" has the same effect. If an output file is given, the table is
written to that file. Run 'help general filePaths' for more information
about specifying files.
The '-format' option controls the format of the printed table. If "tsv", a
tab-delimited format is used. If "csv", a comma-delimited format is used.
By default, the tab-delimited format is used, unless an output file is
given with a ".csv" extension.
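The delimiter selection rule can be summarized in a few lines of Python.
This sketch only restates the rule above; the function name table_delimiter
is hypothetical, and it assumes that an explicit '-format' value always
takes precedence over the output file extension.

```python
def table_delimiter(table_format=None, output_file=None):
    """Return the column delimiter for the given format and output file."""
    if table_format == "tsv":
        return "\t"
    if table_format == "csv":
        return ","
    if table_format is not None:
        raise ValueError("expected 'tsv' or 'csv'")
    # No explicit format: fall back to the file extension, else tabs.
    if output_file is not None and output_file.endswith(".csv"):
        return ","
    return "\t"
```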
Here are some examples.
report variantHits
Reports the variant hits table to the standard output of the command
interpreter in a tab-delimited format.
report variantHits -outputFile /reports/hits.csv
Reports the variant hits table to the /reports/hits.csv file in a
comma-delimited format.
report variantHits -outputFile -
Reports the variant hits table to the standard output of the command
interpreter in a tab-delimited format.