|
2.17.1
|
|
Alignment info selection:
|
|||||
|
Selection. If used, single read variations are also included
|
|||||
|
ACE Format Selection
Ace read mode Selection
|
|||||
|
A text file containing a section listing the high confidence rearrangement points, followed by a section listing the high confidence rearrangement regions. (The associated data describing the rearrangement points and rearrangement regions output is found in Section 2.17.1.15.) The GS Reference Mapper application uses a combination of flow signal information, quality score information and variant type information to determine if a variant is High-Confidence.
|
|||||
|
Pairwise alignment selection
|
|||||
|
Pairwise alignment selection
|
|||||
|
2.17.1.1
|
|
1.
|
Position – the position in the reference
|
|
2.
|
Consensus – the consensus nucleotide for that position in the reference
|
|
3.
|
Quality Score – the quality score of the consensus base
|
|
4.
|
Unique Depth – the number of non-duplicate, uniquely mapping reads that align at that location
|
|
5.
|
Align Depth – the number of uniquely mapping reads aligned at that location
|
|
6.
|
Total Depth – an estimated unique plus repeat mapping depth at that location, where the repeat depth is estimated. The estimate is made by randomly assigning each repeat read to one of its assigned locations and incrementing the existing count for that location.
|
|
7.
|
Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
|
|
8.
|
StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows
|
|
9.
|
Region Status (with –reg option) – identifies each mapped base as “IN” if it maps within the target regions or “EXT” if it maps in the extended target region, but not in the target regions (see Section 4.17 for more details). This column is not present without the –reg option.
|
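The Total Depth estimate described in item 6 can be sketched as follows. This is a minimal illustration, not the GS Reference Mapper implementation: the function name, the data layout (each repeat read given as a list of its candidate locations) and the seeded random generator are assumptions, and each location is treated as a single key rather than a span of bases.

```python
import random
from collections import Counter


def estimate_total_depth(unique_depth, repeat_reads, seed=0):
    """Add one randomly chosen location per repeat read to the unique counts.

    unique_depth: dict mapping location -> depth from uniquely mapping reads.
    repeat_reads: list of lists; each inner list holds the candidate
                  locations of one repeat read (hypothetical layout).
    """
    rng = random.Random(seed)      # seeded so the sketch is reproducible
    total = Counter(unique_depth)  # start from the existing counts
    for candidate_locations in repeat_reads:
        total[rng.choice(candidate_locations)] += 1
    return dict(total)
```

Whatever the random choices, the sum of the returned depths equals the unique depth total plus the number of repeat reads.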
|
2.17.1.3
|
|
1.
|
Read Accno – Accession number of the input read.
|
|
2.
|
Mapping Status – Status of the read in the mapping, which can be one of the following:
|
|
3.
|
Full – the read is fully aligned to the reference (every base)
|
|
4.
|
Partial – only part of the read aligned to the reference
|
|
5.
|
Chimeric – part of the read aligned to one location on the reference and a different part of the read aligned to a different reference or to a distant location on the same reference
|
|
6.
|
Repeat – the read aligned equally well to multiple locations in the reference
|
|
7.
|
Unmapped – the read did not align to the reference
|
|
8.
|
TooShort – the trimmed read was too short to be used in the computation (shorter than 50 bases and longer than minlen bases, unless 454 Paired End Reads are included in the data set, in which case, all reads at least “minlen” bases are used and 454NewblerMetrics.txt will report the value of numberTooShort as 0 since any shotgun reads at least as long as the minimum read length will be used in the mapping).
|
|
9.
|
Mapped Accuracy – The percentage identity of the alignment, rounded to the nearest whole number (reads with ‘Full’ and ‘Partial’ status only)
|
|
10.
|
% of Read Mapped – The percentage of the read that occurs in the alignment (reads with ‘Full’ or ‘Partial’ status only)
|
|
11.
|
Ref Accno – The accno of the reference sequence to which the read is aligned
|
|
12.
|
Ref Start – The position in the reference sequence where the read’s alignment begins
|
|
13.
|
Ref Stop – The position in the reference sequence where the read’s alignment ends
|
|
14.
|
Strand – The orientation of the read’s alignment relative to the reference sequence. A ‘+’ indicates the alignment orientation of the read is the same as the orientation of the reference. A ‘-’ indicates the alignment orientation of the read is opposite to the orientation of the reference.
|
|
15.
|
Region Status (with –reg option) – Indicates whether or not the read intersects a target region as defined by the parameter given with the –reg option: ‘InRegion’ means that the read intersects a target region, ‘InExtRegion’ means that the read is in the extended target region but not in the target region, and ‘OutOfRegion’ means that the read does not intersect any extended target regions. This column is not present without the –reg option.
|
|
2.17.1.4
|
|
1.
|
Accno – accession number of the input read
|
|
2.
|
Trimpoints Used – the final trimpoints used in the mapping, in #-# format
|
|
3.
|
Trimmed Length – the final trimmed length of the read
|
|
4.
|
Orig. Trimpoints – the original trimpoints of the read, found in the SFF or FASTA file
|
|
5.
|
Orig. Trimmed Length – the original trimmed length of the read
|
|
6.
|
Raw Length – the length of the raw read (without any trimming)
|
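As an illustration of the “#-#” trimpoint format listed above, the sketch below splits the field and derives a trimmed length. It assumes the two numbers are 1-based, inclusive positions, which the text does not state; the function name is hypothetical.

```python
def parse_trimpoints(field):
    """Parse a '#-#' trimpoint field into (start, end, trimmed_length).

    Assumes 1-based inclusive coordinates, so length = end - start + 1.
    """
    start_text, end_text = field.split("-")
    start, end = int(start_text), int(end_text)
    return start, end, end - start + 1
```

Under that assumption, a field of “5-254” corresponds to a trimmed length of 250 bases.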
|
2.17.1.6
|
|
2.17.1.6.1
|
|
•
|
referenceSequenceData group – contains information about the reference sequence file(s).
|
|
•
|
runData group – contains information about the read data used in the analysis (both Sanger and non-Paired-End 454 Sequencing read files are reported on in this section; not shown since only Paired End data files were used in this example).
|
|
•
|
pairedReadData group – contains information about the Paired End input data [Paired End only; 454 Sequencing reads only (not Sanger reads)].
|
|
2.17.1.6.2
|
|
•
|
runMetrics group – contains information about the mapping computation.
|
|
•
|
readMappingResults group – contains information about the mapping process for each input file [SFF, FASTA/FASTQ (including Sanger Paired End), or Run regions from wells file; not shown on Figure 78 since only Paired End data files were used in this example]. In the case of mapping performed with a region file, metrics are also provided for reads mapping uniquely in regions and out of regions. Any read whose mapping overlaps a region by at least one base will be included in NumUniqueInRegions. Other uniquely-mapping reads are included in NumUniqueOutOfRegions. The total of these two categories is reported as NumUniquelyMapped.
|
|
•
|
pairedReadResults group – contains information about the Paired End input data (Paired End only; GS Junior and GS FLX+ Systems) (Figure 78)
|
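The one-base overlap rule behind NumUniqueInRegions and NumUniqueOutOfRegions can be sketched as below. This is only an illustration of the rule as described, with reads and regions given as inclusive (start, stop) pairs on a single reference; the function names and data layout are assumptions.

```python
def overlaps(read_start, read_stop, region_start, region_stop):
    """True if two inclusive intervals share at least one base."""
    return read_start <= region_stop and region_start <= read_stop


def count_region_reads(read_spans, regions):
    """Tally uniquely mapped reads by whether they touch any region."""
    in_regions = sum(
        1
        for read_start, read_stop in read_spans
        if any(overlaps(read_start, read_stop, lo, hi) for lo, hi in regions)
    )
    return {
        "NumUniqueInRegions": in_regions,
        "NumUniqueOutOfRegions": len(read_spans) - in_regions,
        "NumUniquelyMapped": len(read_spans),
    }
```

A read ending at base 50 counts as in-region for a region starting at base 50, since the two share exactly one base.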
|
2.17.1.6.3
|
|
•
|
consensusDistribution group – contains information about the consensus signals and basecalling thresholds.
|
|
•
|
consensusResults group – contains summary information and statistics about reads, scaffolds, and contigs.
|
|
◦
|
readStatus – summary information about the reads
|
|
◦
|
pairedReadStatus – Paired End library statistics (if Paired End reads used).
|
|
◦
|
scaffoldMetrics – scaffold statistics (if Paired End reads are used).
|
|
◦
|
largeContigMetrics – contig statistics for large contigs (longer than ‘largeContigThreshold’; default is 500 bp).
|
|
◦
|
allContigMetrics – contig statistics for all contigs (longer than the minimum contig length; default is 100 bp).
|
|
2.17.1.7
|
|
2.17.1.8
|
|
1.
|
QueryAccno – accession number of the read used in the overlap detection search (the “query sequence”)
|
|
2.
|
QueryStart – starting position of the alignment in query sequence
|
|
3.
|
QueryEnd – ending position of the alignment in query sequence
|
|
4.
|
QueryLength – length of the query sequence
|
|
5.
|
SubjAccno – accession number of the other read (the “subject sequence”)
|
|
6.
|
SubjStart – starting position of the alignment in subject sequence
|
|
7.
|
SubjEnd – ending position of the alignment in subject sequence
|
|
8.
|
SubjLength – length of the subject sequence
|
|
9.
|
NumIdent – number of identities in the pairwise alignment, i.e. where query and subject characters match
|
|
10.
|
AlignLength – the length of the pairwise alignment
|
|
11.
|
QueryAlign – query alignment sequence
|
|
12.
|
SubjAlign – subject alignment sequence
|
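The relationship between the NumIdent, AlignLength, QueryAlign and SubjAlign columns can be illustrated with a short sketch. It assumes that the two alignment strings have equal length and use “-” as the gap character, neither of which is stated above; the function names are hypothetical.

```python
def count_identities(query_align, subj_align):
    """Count alignment columns where query and subject characters match.

    Assumes '-' marks a gap; a gap column is never counted as an identity.
    """
    return sum(
        1
        for q, s in zip(query_align, subj_align)
        if q == s and q != "-"
    )


def percent_identity(num_ident, align_length):
    """Percentage of alignment columns that are identities."""
    if align_length <= 0:
        raise ValueError("align_length must be positive")
    return 100.0 * num_ident / align_length
```

For example, aligning “ACG-T” against “ACGAT” gives 4 identities over an alignment length of 5, or 80% identity.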
|
2.17.1.9
|
|
1.
|
Template – template string for the pair (this will be the original 454 accession for 454 Paired End reads, and the “template” string for Sanger reads)
|
|
2.
|
Status – the status of the pair in the mapping, with the following possible values:
|
|
3.
|
BothUnmapped – both halves of the pair were unmapped
|
|
4.
|
OneUnmapped – one of the reads in the pair was unmapped
|
|
5.
|
MultiplyMapped – one or both of the reads in the pair were marked as Repeat
|
|
6.
|
TruePair – both halves of the pair were mapped into the same reference sequence, with the correct relative orientation, and are within the expected distance of each other
|
|
7.
|
FalsePair – the halves were mapped to the same reference sequence, but the orientation of their alignment is inconsistent with a Paired End pair or the distance between the halves is outside the expected distance
|
|
8.
|
Distance – for “TruePair” or “FalsePair” pairs, the distance between the halves
|
|
9.
|
Left Contig – the contig where the left half was mapped, or “-” if the read was Unmapped or Repeat
|
|
10.
|
Left Pos – the position in the contig where the 5’ end of the left half was mapped
|
|
11.
|
Left Dir – the direction (‘+’ for the forward strand of the reference sequence and ‘-’ for reverse strand) in which the left half was mapped
|
|
12.
|
Right Contig – the contig where the right half was mapped, or “-” if the read was Unmapped or Repeat
|
|
13.
|
Right Pos – the position in the contig where the 3’ end of the right half was mapped
|
|
14.
|
Right Dir – the direction (‘+’ for the forward strand of the reference sequence and ‘-’ for reverse strand) in which the right half was mapped
|
|
15.
|
Left Distance – the distance from the Left Pos to the respective end of the reference sequence (for forward matches, this is the distance to the 3’ end of the sequence; for reverse matches, to the 5’ end)
|
|
16.
|
Right Distance – the distance from the Right Pos to the respective end of the reference sequence (for forward matches, this is the distance to the 3’ end of the sequence; for reverse matches, to the 5’ end).
|
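The pair Status values listed above can be read as a decision sequence, sketched below. The order in which the checks are applied, the input representation, and the handling of pairs mapped to different reference sequences (returned as None here, since the list above does not cover that case) are all assumptions, not the application's documented logic.

```python
def classify_pair(left_status, right_status, same_reference=False,
                  orientation_ok=False, distance=None, expected_range=(0, 0)):
    """Return a pair status string following the descriptions above.

    left_status / right_status: 'Mapped', 'Unmapped' or 'Repeat' (hypothetical labels).
    orientation_ok: whether the halves have the correct relative orientation.
    expected_range: (low, high) bounds of the expected pair distance.
    """
    statuses = (left_status, right_status)
    unmapped = statuses.count("Unmapped")
    if unmapped == 2:
        return "BothUnmapped"
    if unmapped == 1:
        return "OneUnmapped"
    if "Repeat" in statuses:
        return "MultiplyMapped"
    if not same_reference:
        return None  # case not described in the list above
    low, high = expected_range
    if orientation_ok and distance is not None and low <= distance <= high:
        return "TruePair"
    return "FalsePair"
```

The expected distance bounds would come from the Paired End library information; they are passed in here rather than computed.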
|
2.17.1.10
|
|
2.17.1.11
|
|
a.
|
Num. Reads – the number of input reads used in the mapping computation
|
|
b.
|
Num. Bases – the number of bases in the input reads
|
|
c.
|
Mapped Reads – the number and percentage of reads that uniquely mapped to the reference, followed by the number and percentage of reads that uniquely or multiply mapped
|
|
d.
|
Mapped Bases – the number and percentage of bases that uniquely mapped to the reference, followed by the number and percentage of bases that uniquely or multiply mapped
|
|
e.
|
Inf. Read Error – the “inferred read error” percentage and quality score (calculated as the number of read alignment differences over the number of mapped bases), along with the counts of the number of read alignment differences and mapped bases
|
|
f.
|
Exp. Read Error – the expected read error computed from the input read quality scores, given as a percentage, quality score and expected number of alignment differences. This is computed by summing the expected number of errors for each quality score value (i.e. number of bases with a quality score times the error rate of that quality score).
|
|
g.
|
Last 100 Base IRE – the “inferred read error” numbers, using only the last (3’) 100 bases of each read
|
|
h.
|
Last 50 Base IRE – the “inferred read error” numbers, using only the last (3’) 50 bases of each read
|
|
i.
|
Last 20 Base IRE – the “inferred read error” numbers, using only the last (3’) 20 bases of each read
|
|
j.
|
Genome Size – the number of bases in the reference
|
|
k.
|
Num. Large Contigs – the number of large contigs reported in the 454LargeContigs.fna file
|
|
l.
|
Num. Large Contig Bases – the number of bases in the large contigs
|
|
m.
|
Avg. Depth – the average alignment depth (i.e. how many reads aligned to each position of the reference)
|
|
n.
|
Avg. Map Length – the average length of the alignment of a read (the read’s “map length”)
|
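The Inf. Read Error and Exp. Read Error entries above can be illustrated with a short calculation. The sketch assumes the conventional Phred relationship between a quality score Q and an error rate (error rate = 10^(-Q/10)); the text above does not state which scale the application uses, and the function names are hypothetical.

```python
import math


def inferred_read_error(num_differences, num_mapped_bases):
    """Return (error percentage, quality score) from alignment differences."""
    rate = num_differences / num_mapped_bases
    # Assumed Phred scale; a zero error rate has no finite quality score.
    quality = -10.0 * math.log10(rate) if rate > 0 else float("inf")
    return 100.0 * rate, quality


def expected_errors(bases_per_quality):
    """Sum, over quality scores, of base count times the assumed error rate.

    bases_per_quality: dict mapping quality score -> number of bases.
    """
    return sum(
        count * 10 ** (-quality / 10.0)
        for quality, count in bases_per_quality.items()
    )
```

Under this assumption, 10 differences over 10,000 mapped bases correspond to an error of 0.1% and a quality score of 30.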
|
ii.
|
The percentages shown in the first overcall/undercall table are given as a percentage of the column (e.g. what percent of the time at a reference 5-mer did the read have a 4-mer). Also, the percent table does not show the percentage of the correct alignments (e.g. 5-mer to 5-mer), nor does it show percentages less than 0.1% (in order to highlight the overcall/undercall trend).
|
|
iii.
|
Below the counts table, a “%Ident” row displays the percentages for each reference n-mer where the read was called correctly (i.e. its homopolymer length matched the reference).
|
|
a.
|
GC Observed/Expected – the two lines below this display the GC content percentages (from 0 to 100) and the observed over expected mapping depth. This is calculated by first counting the number of reads with particular GC content and counting the GC content of all windows of the reference (where the window length matches the average read flowspace or nucleotide length). Then the two counts (read and reference) for a specific GC content value are divided by the read/reference totals to compute the percentage of the reads/references with that GC content. The observed/expected value is the ratio of those two percentages.
|
|
b.
|
GC Std. Dev. – this is the standard deviation of the GC Observed/Expected (based on the sampling at that GC content value). The values on this line are useful for setting the “Y Error Bars” information in Excel, if an “XY (Scatter)” chart is made using the two GC Observed/Expected lines as the source data. This line can then be used as the “+” and “-” data of the “Custom” Error amount, found inside the “Y Error Bars” tab of the “Format Data Series” dialog box.
|
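The GC Observed/Expected calculation described above can be sketched as a ratio of two percentages. The inputs here are precomputed counts keyed by GC content value; the function name and data layout are assumptions, and GC values with no reference windows are skipped because the ratio is undefined for them.

```python
def gc_observed_over_expected(read_counts, window_counts):
    """Ratio of read fraction to reference-window fraction per GC value.

    read_counts:   dict GC% -> number of reads with that GC content.
    window_counts: dict GC% -> number of reference windows with that GC content.
    """
    read_total = sum(read_counts.values())
    window_total = sum(window_counts.values())
    ratios = {}
    for gc, windows in window_counts.items():
        if windows == 0:
            continue  # no expectation available at this GC value
        observed = read_counts.get(gc, 0) / read_total
        expected = windows / window_total
        ratios[gc] = observed / expected
    return ratios
```

A ratio above 1 at a given GC value means reads with that GC content are over-represented relative to the reference windows; a ratio below 1 means they are under-represented.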
|
a.
|
Predicted Score – The quality score values, from 0 to 60
|
|
b.
|
Observed Quality – The observed quality score obtained from the read alignments (computed as “Observed Num. Errors” over “Num. Bases With Score” values, see below)
|
|
c.
|
Observed Accuracy – The observed quality score expressed as an accuracy percentage
|
|
d.
|
Num. Bases With Score – the number of mapped bases having the Predicted Score (only mapped bases are used, because they can be evaluated for accuracy)
|
|
e.
|
Expected Num. Errors – the expected number of errors for a quality score, given the number of mapped bases with that quality score
|
|
f.
|
Observed Num. Errors – the number of bases which did not match in the read alignment (i.e. the alignment column containing that base was not an identity)
|
|
a.
|
Read Length and Map Length Histograms – histograms showing the number of reads of each read length (the “Read Length Histogram” column) and number of reads at each length of the aligned regions per the reference sequence, i.e. counting only the read bases in the alignment, not the alignment length (the “Map Length Histogram” column). Histogram values are displayed up to 400 bases.
|
|
b.
|
Errors by Base Position – plot values showing position-by-position errors in the reads, i.e., how many errors occurred at the N’th base across all the reads. The four columns show the accuracy percentage and equivalent quality score of the accuracy at a specific position (the “Errors by Base Position” columns) and the cumulative accuracy up to that position (the “Cumul. Errors by Base Position” columns).
|
|
c.
|
Note: if an alignment column contains a gap in the read, that is counted as an error at the previous base position (i.e., any alignment gaps between base 5 and 6 in a read are counted as errors at position 5)
|
|
d.
|
Cross-Reference Depth and GC Information
The last six columns of this section contain region-by-region statistics of the alignments across the reference, where the reference is evenly divided into 1000 regions. Important Note: This division of the reference into regions has no understanding of repeat regions, and simply reports on the alignments of the uniquely mapping reads. Since repeat reads are not aligned to the reference, the values in this column will count repeat regions as unaligned regions.
|
|
i.
|
The first column displays the position in the reference sequence at the center of the region
|
|
ii.
|
Avg. Depth – the average alignment depth in the region
|
|
iii.
|
Min. Depth – the minimum alignment depth in the region
|
|
iv.
|
Max. Depth – the maximum alignment depth in the region
|
|
v.
|
Depth Score – a score that is indicative of the shallowness of the alignment in the region. Each alignment column in the region is given a score of “max(0, 4-depth)” where “depth” is the alignment depth of the column. The Depth Score for a region is the sum of the column scores. This score is a very sensitive metric for use in resequencing projects, in order to gauge when enough sequencing has been performed (and the addition of more reads will not fill in any more unaligned or shallowly aligned regions of the reference)
|
|
vi.
|
GC – the average GC content of the region
|
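The Depth Score formula given above, “max(0, 4-depth)” summed over the alignment columns of a region, can be written directly. The function name and the list-of-depths input are assumptions made for illustration.

```python
def depth_score(column_depths):
    """Sum of max(0, 4 - depth) over the alignment columns of one region."""
    return sum(max(0, 4 - depth) for depth in column_depths)
```

Columns with a depth of 4 or more contribute nothing, so a region covered at least 4 deep throughout scores 0, while each completely unaligned column adds 4.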
|
2.17.1.12
|
|
1.
|
Reference Accession – accession number of a reference sequence
|
|
2.
|
Num Unique Matching Reads – how many reads mapped uniquely to the reference. To be considered unique, a given portion of a read may only map to a single reference location. If a portion of a read maps to multiple reference locations (or multiple transcript variants of the same gene in the case of cDNA mapping projects), the read is considered to be a repeat.
|
|
3.
|
Pct of All Unique Matches – the number of reads mapping uniquely to an individual reference sequence divided by the total number of reads that mapped uniquely to any reference sequence
|
|
4.
|
Pct of All Reads – number of reads that mapped uniquely to this reference divided by the total number of reads in this mapping project
|
|
5.
|
Pct Coverage of Reference – number of reference bases covered by at least one uniquely mapping read divided by the total number of bases in this reference
|
|
6.
|
Description – reference description obtained from the renaming file or annotation files
|
|
2.17.1.13
|
|
1.
|
Reference Accno - The accession number of the reference sequence in which the difference was detected
|
|
2.
|
Start Pos - The start position within the reference sequence, where the difference occurs
|
|
3.
|
End Pos - The end position within the reference sequence, where the difference occurs
|
|
4.
|
Ref Nuc - The reference nucleotide sequence at the difference location
|
|
5.
|
Var Nuc - The differing nucleotide sequence at the difference location
|
|
6.
|
Total Depth - The total number of reads that fully span the difference location
|
|
7.
|
Var Freq - The percentage of different reads versus total reads that fully span the difference location
|
|
8.
|
Ref AA - The reference amino acid sequence at the difference location, if it occurs within the coding region of an annotated gene
|
|
9.
|
Var AA - The differing amino acid sequence at the difference location, if it occurs within the coding region of an annotated gene
|
|
10.
|
Coding Frame - {-3, -2, -1, +1, +2, +3} - The reading frame, if the difference occurs within the coding region of an annotated gene
|
|
11.
|
Region name - The gene name at the difference location, if it occurs within the region of an annotated gene
|
|
12.
|
Known SNP’s - The list of known SNP IDs that occur at the difference location
|
|
13.
|
# Fwd w/ Var - The number of forward reads that include the difference (with –fd only)
|
|
14.
|
# Rev w/ Var - The number of reverse reads that include the difference (with –fd only)
|
|
15.
|
# Fwd Total - The total number of forward reads that fully span the difference location (with –fd only)
|
|
16.
|
# Rev Total - The total number of reverse reads that fully span the difference location (with ‑fd only)
|
|
17.
|
Tgt Region Status (with –reg option) – identifies each difference as “InRegion” if it maps within the target regions or “InExtRegion” if it maps in the extended target region, but not in the target regions (see Section 4.17 for more details). This column is not present without the –reg option.
|
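As a worked illustration of how the strand columns relate to Var Freq, the sketch below derives a variant frequency from the four –fd counts. It assumes that Total Depth equals # Fwd Total plus # Rev Total and that the reads with the variation are the sum of the two “w/ Var” counts; the text above does not state either relationship explicitly, and the function name is hypothetical.

```python
def variant_frequency(fwd_with_var, rev_with_var, fwd_total, rev_total):
    """Return (total depth, variant frequency in percent) from strand counts.

    Assumes total depth is the sum of the forward and reverse totals.
    """
    total_depth = fwd_total + rev_total
    if total_depth == 0:
        raise ValueError("no reads span the difference location")
    reads_with_var = fwd_with_var + rev_with_var
    return total_depth, 100.0 * reads_with_var / total_depth
```

With 6 forward and 4 reverse reads carrying the variation out of 20 spanning reads on each strand, the sketch gives a depth of 40 and a frequency of 25%.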
|
2.17.1.14
|
|
•
|
If the difference is a single-base overcall or undercall, then the reads with the difference must form the consensus of the sequenced reads (i.e., at that location, the overall consensus must differ from the reference) and the signal distribution of the differing reads must vary from the matching reads (and the number of bases in that homopolymer of the reference).
|
|
1.
|
Ref Accno1 – the accession number of the reference sequence on one side of the variation
|
|
2.
|
Ref Pos1 – the reference position on one side of the variation
|
|
3.
|
Var Side1 – a direction arrow “-->” or “<--” describing the direction of the variation on the reference (e.g., the 3’ end of the reads or paired-end clones that diverge from the reference occur in this direction)
|
|
4.
|
Region Name1 – the gene name, or annotated region name, covering the location in the reference denoted by Ref Accno1 and Ref Pos1 (Gene or Region annotation must be included in the project for this field to contain a value)
|
|
5.
|
Ref Accno2 – the accession number of the reference sequence on the other side of the variation, if known. (If only one side of the variation is known, a question mark is given here)
|
|
6.
|
Ref Pos2 – the reference position on the other side of the variation, or a question mark if only one side is known
|
|
7.
|
Var Side2 – a direction arrow “-->” or “<--” for the direction on the other side of the variation, or a question mark if only one side is known
|
|
8.
|
Region Name2 – the gene name, or annotated region name, covering the location in the reference denoted by Ref Accno2 and Ref Pos2 (Gene or Region annotation must be included in the project for this field to contain a value)
|
|
9.
|
Total Depth – the number of reads (for rearrangement points) or pairs (for rearrangement regions) covering the variation location(s)
|
|
10.
|
Var Freq – the percentage of the reads/pairs that support the variation
|
|
11.
|
Deviation Length – if both sides of the variation occur on the same reference, this is the distance between the two variation locations
|
|
12.
|
Type – the string “Point” or “Region” to denote whether the rearrangement is a rearrangement point identified by split-read alignments or a rearrangement region identified by paired-end reads
|
|
13.
|
# Fwd w/ var – number of reads on the forward orientation that contain the variation (requires ‑fd option)
|
|
14.
|
# Rev w/ var – number of reads on the reverse orientation that contain the variation (requires ‑fd option).
|
|
15.
|
# Fwd Total – total number of reads in the forward orientation that map to this area of the reference (requires ‑fd option).
|
|
16.
|
# Rev Total – total number of reads in the reverse orientation that map to this area of the reference (requires ‑fd option).
|
|
17.
|
Var ID (no heading in file) – a field that identifies each individual variation, in the format “Var#x”, where # is the ID number.
|
|
2.17.1.15.1
|
|
2.17.1.15.2
|
|
•
|
Label – states which type of rearrangement was found. The type can be a deletion, insertion, substitution, inversion, tandem duplication, interspersed duplication, translocation, or fusion. Duplications and translocations can also be inverted.
|
|
•
|
Reference accno – for intra-chromosomal rearrangements, there will only be one reference. Inter-chromosomal rearrangements will span two.
|
|
•
|
Reference positions – The rearrangements will involve one, two, or three points on the reference(s), depending on their type.
|
|
•
|
Confidence – Low or High. Confidence is considered High if at least one of the individual variations comprising the given rearrangement is also High. If none of the variations are high confidence, the rearrangement will be marked as Low confidence, and will only appear in the 454AllStructRearrangements file.
|
|
•
|
Support and context – The number of supporting and non-supporting shotgun reads at each involved position will be shown where available, along with the reference context of that position. Note that rearrangements supported only by Paired End reads, with no shotgun reads at a given position, will have no support information available at that position.
|
|
•
|
Paired End lengths – The deviation length of supporting Paired End reads from their expected library lengths will be shown. If there are fewer than 10 such reads, their lengths will be listed. If there are 10 or more, their lengths will be shown in a histogram. There will be a separate listing/histogram for groups of Paired Ends mapping to different strands.
|
|
•
|
Individual Variation IDs – This is a list of the var ID numbers of the individual variations in the 454AllStructVars.txt and 454HCStructVars.txt files that make up the given rearrangement. Note that a rearrangement can and usually will be made up of more than one variation. Note also that in the 454HCStructVars.txt and 454AllStructVars.txt files, the summary line for every variation has been appended with a field in the format “Var#x”, where # is the var ID number.
|