2. GS Reference Mapper : 2.17 GS Reference Mapper Output
Alignment info selection:
Selection. If used, single read variations are also included
ACE Format Selection
A text file containing a section listing the high confidence rearrangement points, followed by a section listing the high confidence rearrangement regions. (The associated data describing the rearrangement points and rearrangement regions output is found in section 2.17.1.15.) The GS Reference Mapper application uses a combination of flow signal information, quality score information and variant type information to determine if a variant is High-Confidence.
2.17.1.1
454AlignmentInfo.tsv
The 454AlignmentInfo.tsv file (Figure 71) contains position-by-position summary information about the consensus sequence for the contigs generated by the GS Reference Mapper application, listed one nucleotide per line (in a tab-delimited format). Lines are generated for any position of the reference for which any of the reported Depths is greater than 0. The 454AlignmentInfo.tsv file is output conditionally (depending on the -info/-infoall/-noinfo options or the selection made on the GUI Parameters Tab Output Sub tab). By default, this file is only output if there are fewer than 4 million input reads and the total length of reference sequences is less than 40 Mbp. For larger projects, -info or -infoall or the corresponding GUI Output selection for Alignment Info must be used to generate this file. From the command line, the -infoall option tells the mapper to report all lines of the 454AlignmentInfo.tsv file, even if there is no coverage. (This option guarantees that every reference position is reported in the 454AlignmentInfo.tsv file, except that if regions are specified in the mapper, only the regions are reported, although every position in the regions will be reported.) Each line contains the following columns:
1. Position – the position in the reference
2. Consensus – the consensus nucleotide for that position in the reference
3. Quality Score – the quality score of the consensus base
4. Unique Depth – the number of non-duplicate, uniquely mapping reads that align at that location
5. Align Depth – the number of uniquely mapping reads aligned at that location
6. Total Depth – an estimated unique plus repeat mapping depth at that location, where the repeat depth is estimated. The estimate is made by randomly assigning each repeat read to one of its assigned locations and incrementing the existing count for that location.
7. Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
8. StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows
9. Region Status (with –reg option) – identifies each mapped base as “IN” if they map within the target regions or “EXT” if they map in the extended target region, but not in the target regions (see Section 4.17 for more details). This column is not present without the –reg option.
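The Total Depth estimate described above (random assignment of each repeat read to one of its candidate locations) can be illustrated with a minimal sketch. This is not the mapper's own code: the function name, the data layout (a list of candidate intervals per repeat read) and the seeded random generator are all assumptions made for illustration.

```python
import random
from collections import Counter

def estimate_total_depth(align_depth, repeat_reads, seed=0):
    """Add an estimated repeat depth to per-position unique depths.

    align_depth  -- dict: reference position -> depth from uniquely mapping reads
    repeat_reads -- list of reads; each read is a list of candidate
                    (start, stop) intervals (1-based, inclusive) where it
                    aligns equally well (hypothetical layout)
    """
    rng = random.Random(seed)       # seeded so the sketch is reproducible
    total = Counter(align_depth)    # start from the existing counts
    for candidates in repeat_reads:
        start, stop = rng.choice(candidates)   # pick one location at random
        for pos in range(start, stop + 1):
            total[pos] += 1                    # increment the count there
    return dict(total)
```

Because the assignment is random, the per-position values for repeat regions are estimates; only the overall number of added bases is fixed.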
2.17.1.2
fna and qual files: 454AllContigs, 454LargeContigs, 454TrimmedReads
These files contain the nucleotide sequences of all the contigs and trimmed reads (Figure 72), and associated nucleotide Quality Scores (Phred-equivalent; Figure 73) produced by the GS Reference Mapper application. The AllContig and LargeContig output lengths are specified in the GUI or by CLI options described in Table 3, above. The TrimmedReads output is generated by specifying the CLI option –tr or by checking the “Output trimmed reads” checkbox on the Parameters tab/Output sub-tab. With the –reg option, the output is restricted to reads in the extended target regions (see Section 4.17 for more details).
2.17.1.3
454ReadStatus.txt
The 454ReadStatus.txt file (Figure 74) contains the status identifiers for all the reads used in the mapping computation, plus the position for each mapped read’s alignment within the reference (unless the mapping status is “chimeric”). The reads are listed one per line, in tab-delimited format. Furthermore, if the “-reg” option is given (to specify a set of regions of the reference, such as in a NimbleGen sequence capture experiment), then the per-read "InRegion", “InExtRegion”, or "OutOfRegion" status is given, to describe which reads aligned in the target regions, in the regions flanking the target regions, and which ones aligned elsewhere in the genome (see Section 4.17 for more details). Each line contains the following information:
1. Read Accno – Accession number of the input read.
2. Mapping Status – Status of the read in the mapping, which can be one of the following:
   a. Full – the read is fully aligned to the reference (every base)
   b. Partial – only part of the read aligned to the reference
   c. Chimeric – part of the read aligned to one location on the reference and a different part of the read aligned to a different reference or to a distant location on the same reference
   d. Repeat – the read aligned equally well to multiple locations in the reference
   e. Unmapped – the read did not align to the reference
   f. TooShort – the trimmed read was too short to be used in the computation (shorter than 50 bases and longer than minlen bases, unless 454 Paired End Reads are included in the data set, in which case, all reads at least “minlen” bases are used and 454NewblerMetrics.txt will report the value of numberTooShort as 0 since any shotgun reads at least as long as the minimum read length will be used in the mapping).
3. Mapped Accuracy – The percentage identity of the alignment, rounded to the nearest whole number (reads with ‘Full’ or ‘Partial’ status only)
4. % of Read Mapped – The percentage of the read that occurs in the alignment (reads with ‘Full’ or ‘Partial’ status only)
5. Ref Accno – The accno of the reference sequence to which the read is aligned
6. Ref Start – The position in the reference sequence where the read’s alignment begins
7. Ref Stop – The position in the reference sequence where the read’s alignment ends
8. Strand – The orientation of the read’s alignment relative to the reference sequence. A ‘+’ indicates the alignment orientation of the read is the same as the orientation of the reference. A ‘-’ indicates the alignment orientation of the read is opposite to the orientation of the reference.
9. Region Status (with –reg option) – Indicates whether or not the read intersects a target region as defined by the parameter given with the –reg option: ‘InRegion’ means that the read intersects a target region, ‘InExtRegion’ means that the read is in the extended target region but not in the target region, and ‘OutOfRegion’ means that the read does not intersect any extended target regions. This column is not present without the –reg option.
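As an illustration of how the tab-delimited layout above might be consumed, the following sketch tallies reads by mapping status. It assumes the columns appear in the order listed, with the mapping status values (Full, Partial, and so on) as the contents of the second column, and that rows for reads without an alignment simply carry fewer fields. The short key names are hypothetical, and any header row in a real file would need to be skipped before calling these functions.

```python
from collections import Counter

# Hypothetical short names for the columns described above, in order.
COLUMNS = ["accno", "status", "accuracy", "pct_mapped",
           "ref_accno", "ref_start", "ref_stop", "strand", "region_status"]

def parse_read_status_line(line):
    """Split one tab-delimited line into a dict keyed by COLUMNS.

    Trailing columns that are absent (for example Region Status when
    -reg was not used) are simply left out of the dict.
    """
    fields = line.rstrip("\n").split("\t")
    return dict(zip(COLUMNS, fields))

def count_by_status(lines):
    """Count reads per mapping status, ignoring blank lines."""
    return Counter(parse_read_status_line(line)["status"]
                   for line in lines if line.strip())
```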
2.17.1.4
454TrimStatus.txt
1. Accno – accession number of the input read
2. Trimpoints Used – the final trimpoints used in the mapping, in #-# format
3. Trimmed Length – the final trimmed length of the read
4. Orig. Trimpoints – the original trimpoints of the read, found in the SFF or FASTA file
5. Orig. Trimmed Length – the original trimmed length of the read
6. Raw Length – the length of the raw read (without any trimming)
2.17.1.5
454Contigs.ace or ace/ContigName.ace or consed/…
This viewer-ready genome file shows all the reference sequences to which reads mapped. The file allows the display of how the individual reads aligned to those reference sequences, in an ACE format file suitable for use in various third-party sequence finishing programs (Figure 76). (The freeware “clview” application can be downloaded from: http://compbio.dfci.harvard.edu/tgi/software/; a full description of the .ace file format can be found at: http://bozeman.mbt.washington.edu/consed/consed.html.) It should be noted, however, that such third-party viewing software will not be able to make full use of the flowspace mapping information available with 454 Sequencing reads, and that conversely some of the third-party program’s functions (e.g. involving sequence chromatogram input) are not usable with 454 Sequencing data sets. Nonetheless, these programs may be useful to view and assess read characteristics and coverage depth in regions of interest.
2.17.1.6
454NewblerMetrics.txt
2.17.1.6.1
Input information (Figure 77)
referenceSequenceData group – contains information about the reference sequence file(s).
runData group – contains information about the read data used in the analysis (both Sanger and non-Paired-End 454 Sequencing read files are reported on in this section; not shown since only Paired End data files were used in this example).
pairedReadData group – contains information about the Paired End input data [Paired End only; 454 Sequencing reads only (not Sanger reads)].
2.17.1.6.2
Operation metrics (Figure 78)
runMetrics group – contains information about the mapping computation.
readMappingResults group – contains information about the mapping process for each input file [SFF, FASTA/FASTQ (including Sanger Paired End), or Run regions from wells file; not shown on Figure 78 since only Paired End data files were used in this example]. In the case of mapping performed with a region file, metrics are also provided for reads mapping uniquely in regions and out of regions. Any read whose mapping overlaps a region by at least one base is included in NumUniqueInRegions. Other uniquely mapping reads are included in NumUniqueOutOfRegions. The total of these two categories is reported as NumUniquelyMapped.
pairedReadResults group – contains information about the Paired End input data (Paired End only; GS Junior and GS FLX+ Systems) (Figure 78)
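The in-region/out-of-region accounting described for the readMappingResults group (a read counts as in-region if its mapping overlaps a region by at least one base) can be sketched as follows. The function names and the tuple layout are assumptions for illustration; coordinates are taken to be 1-based and inclusive.

```python
def overlaps(read_start, read_stop, region_start, region_stop):
    """True if two 1-based inclusive intervals share at least one base."""
    return read_start <= region_stop and region_start <= read_stop

def count_region_hits(unique_reads, regions):
    """Tally uniquely mapped reads in and out of regions.

    unique_reads and regions are lists of (ref_accno, start, stop) tuples.
    """
    in_regions = 0
    for accno, start, stop in unique_reads:
        if any(accno == r_acc and overlaps(start, stop, r_start, r_stop)
               for r_acc, r_start, r_stop in regions):
            in_regions += 1
    out_of_regions = len(unique_reads) - in_regions
    return {"NumUniqueInRegions": in_regions,
            "NumUniqueOutOfRegions": out_of_regions,
            "NumUniquelyMapped": in_regions + out_of_regions}
```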
2.17.1.6.3
Consensus results (Figure 79)
consensusDistribution group – contains information about the consensus signals and basecalling thresholds.
consensusResults group – contains summary information and statistics about reads, scaffolds, and contigs.
readStatus – summary information about the reads
pairedReadStatus – Paired End library statistics (if Paired End reads used).
scaffoldMetrics – scaffold statistics (if Paired End reads are used).
largeContigMetrics – contig statistics for large contigs (longer than ‘largeContigThreshold’; default is 500 bp).
allContigMetrics – contig statistics for all contigs (longer than the minimum length for all contigs; default is 100 bp).
2.17.1.7
454NewblerProgress.txt
2.17.1.8
454PairAlign.txt
This file contains the pairwise alignments of the overlaps that were found during the mapping computation (Figure 80). By default, this file is not generated, but if the “-p” or “-pt” options are given on the runProject command line, it will be generated either in a human-readable text format (“-p”) or in tab-delimited format (“-pt”).
1. QueryAccno – accession number of the read used in the overlap detection search (the “query sequence”)
2. QueryStart – starting position of the alignment in query sequence
3. QueryEnd – ending position of the alignment in query sequence
4. QueryLength – length of the query sequence
5. SubjAccno – accession number of the other read (the “subject sequence”)
6. SubjStart – starting position of the alignment in subject sequence
7. SubjEnd – ending position of the alignment in subject sequence
8. SubjLength – length of the subject sequence
9. NumIdent – number of identities in the pairwise alignment, i.e. where query and subject characters match
10. AlignLength – the length of the pairwise alignment
11. QueryAlign – query alignment sequence
12. SubjAlign – subject alignment sequence
2.17.1.9
454PairStatus.txt
1. Template – template string for the pair (this will be the original 454 accession for 454 Paired End reads, and the “template” string for Sanger reads)
2. Status – the status of the pair in the mapping, with the following possible values:
   a. BothUnmapped – both halves of the pair were unmapped
   b. OneUnmapped – one of the reads in the pair was unmapped
   c. MultiplyMapped – one or both of the reads in the pair were marked as Repeat
   d. TruePair – both halves of the pair were mapped into the same reference sequence, with the correct relative orientation, and are within the expected distance of each other
   e. FalsePair – the halves were mapped to the same reference sequence, but the orientation of their alignment is inconsistent with a Paired End pair or the distance between the halves is outside the expected distance
3. Distance – for “TruePair” or “FalsePair” pairs, the distance between the halves
4. Left Contig – the contig where the left half was mapped, or “-” if the read was Unmapped or Repeat
5. Left Pos – the position in the contig where the 5’ end of the left half was mapped
6. Left Dir – the direction (‘+’ for the forward strand of the reference sequence and ‘-’ for reverse strand) in which the left half was mapped
7. Right Contig – the contig where the right half was mapped, or “-” if the read was Unmapped or Repeat
8. Right Pos – the position in the contig where the 3’ end of the right half was mapped
9. Right Dir – the direction (‘+’ for the forward strand of the reference sequence and ‘-’ for reverse strand) in which the right half was mapped
10. Left Distance – the distance from the Left Pos to the respective end of the reference sequence (for forward matches, this is the distance to the 3’ end of the sequence; for reverse matches, to the 5’ end)
11. Right Distance – the distance from the Right Pos to the respective end of the reference sequence (for forward matches, this is the distance to the 3’ end of the sequence; for reverse matches, to the 5’ end).
2.17.1.10
454TagPairAlign.txt
2.17.1.11
454MappingQC.xls
a. Num. Reads – the number of input reads used in the mapping computation
b. Num. Bases – the number of bases in the input reads
c. Mapped Reads – the number and percentage of reads that uniquely mapped to the reference, followed by the number and percentage of reads that uniquely or multiply mapped
d. Mapped Bases – the number and percentage of bases that uniquely mapped to the reference, followed by the number and percentage of bases that uniquely or multiply mapped
e. Inf. Read Error – the “inferred read error” percentage and quality score (calculated as the number of read alignment differences over the number of mapped bases), along with the counts of the number of read alignment differences and mapped bases
f. Exp. Read Error – the expected read error computed from the input read quality scores, given as a percentage, quality score and expected number of alignment differences. This is computed by summing the expected number of errors for each quality score value (i.e. number of bases with a quality score times the error rate implied by that quality score).
g. Last 100 Base IRE – the “inferred read error” numbers, using only the last (3’) 100 bases of each read
h. Last 50 Base IRE – the “inferred read error” numbers, using only the last (3’) 50 bases of each read
i. Last 20 Base IRE – the “inferred read error” numbers, using only the last (3’) 20 bases of each read
j. Genome Size – the number of bases in the reference
k. Num. Large Contigs – the number of large contigs reported in the 454LargeContigs.fna file
l. Num. Large Contig Bases – the number of bases in the large contigs
m. Avg. Depth – the average alignment depth (i.e. how many reads aligned to each position of the reference)
n. Avg. Map Length – the average length of the alignment of a read (the read’s “map length”)
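The inferred and expected read error figures in this list can be reproduced from raw counts. The sketch below assumes the quality scores are Phred-equivalent, so that a score Q implies an error probability of 10^(-Q/10); the function names and the input dictionary are hypothetical and are not part of the mapper.

```python
import math

def phred(error_rate):
    """Convert an error rate (0 < rate <= 1) to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)

def inferred_read_error(num_differences, num_mapped_bases):
    """Return (percentage, quality score) for the inferred read error."""
    rate = num_differences / num_mapped_bases
    return 100.0 * rate, phred(rate)

def expected_read_error(bases_per_score):
    """Return (percentage, quality score, expected number of errors).

    bases_per_score maps a quality score Q to the number of bases with
    that score; each base contributes an expected 10^(-Q/10) errors.
    """
    total_bases = sum(bases_per_score.values())
    expected_errors = sum(n * 10 ** (-q / 10.0)
                          for q, n in bases_per_score.items())
    rate = expected_errors / total_bases
    return 100.0 * rate, phred(rate), expected_errors
```

For example, 10 alignment differences over 10,000 mapped bases give an inferred read error of 0.1%, equivalent to a quality score of 30.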
ii. The percentages shown in the first overcall/undercall table are given as a percentage of the column (e.g. what percent of the time, at a reference 5-mer, did the read have a 4-mer). Also, the percent table does not show the percentage of the correct alignments (e.g. 5-mer to 5-mer), nor does it show percentages less than 0.1% (in order to highlight the overcall/undercall trend).
a. GC Observed/Expected – the two lines below this display the GC content percentages (from 0 to 100) and the observed over expected mapping depth. This is calculated by first counting the number of reads with particular GC content and counting the GC content of all windows of the reference (where the window length matches the average read flowspace or nucleotide length). Then the two counts (read and reference) for a specific GC content value are divided by the read/reference totals to compute the percentage of the reads/references with that GC content. The observed/expected value is the ratio of those two percentages.
b. GC Std. Dev. – this is the standard deviation of the GC Observed/Expected (based on the sampling at that GC content value). The values on this line are useful for setting the “Y Error Bars” information in Excel, if an “XY (Scatter)” chart is made using the two GC Observed/Expected lines as the source data. This line can then be used as the “+” and “-” data of the “Custom” Error amount, found inside the “Y Error Bars” tab of the “Format Data Series” dialog box.
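The GC Observed/Expected calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the mapper's implementation: GC content is binned to whole percentages, reads and windows are plain nucleotide strings (non-empty), and the window length is passed in by the caller instead of being derived from the average read length.

```python
from collections import Counter

def gc_percent(seq):
    """GC content of a non-empty sequence, rounded to a whole percentage."""
    seq = seq.upper()
    return round(100 * sum(seq.count(base) for base in "GC") / len(seq))

def gc_observed_over_expected(reads, reference, window):
    """Return {gc_percent: observed/expected} for GC values seen in both inputs."""
    # Observed: how many reads fall into each GC bin.
    read_counts = Counter(gc_percent(r) for r in reads)
    # Expected: how many reference windows fall into each GC bin.
    ref_counts = Counter(gc_percent(reference[i:i + window])
                         for i in range(len(reference) - window + 1))
    n_reads = sum(read_counts.values())
    n_windows = sum(ref_counts.values())
    # Ratio of the two fractions for each GC value present in both counts.
    return {gc: (read_counts[gc] / n_reads) / (ref_counts[gc] / n_windows)
            for gc in read_counts if gc in ref_counts}
```

A value above 1 for a GC bin means reads with that GC content are over-represented relative to the reference; a value below 1 means they are under-represented.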
a. Predicted Score – The quality score values, from 0 to 60
b. Observed Quality – The observed quality score obtained from the read alignments (computed as “Observed Num. Errors” over “Num. Bases With Score” values, see below)
c. Observed Accuracy – The observed quality score expressed as an accuracy percentage
d. Num. Bases With Score – the number of mapped bases having the Predicted Score (only mapped bases are used, because they can be evaluated for accuracy)
e. Expected Num. Errors – the expected number of errors for a quality score, given the number of mapped bases with that quality score
f. Observed Num. Errors – the number of bases which did not match in the read alignment (i.e. the alignment column containing that base was not an identity)
a. Read Length and Map Length Histograms – histograms showing the number of reads of each read length (the “Read Length Histogram” column) and number of reads at each length of the aligned regions per the reference sequence, i.e. counting only the read bases in the alignment, not the alignment length (the “Map Length Histogram” column). Histogram values are displayed up to 400 bases.
b. Errors by Base Position – plot values showing position-by-position errors in the reads, i.e., how many errors occurred at the N’th base across all the reads. The four columns show the accuracy percentage and equivalent quality score of the accuracy at a specific position (the “Errors by Base Position” columns) and the cumulative accuracy up to that position (the “Cumul. Errors by Base Position” columns)
c. Note: if an alignment column contains a gap in the read, that is counted as an error at the previous base position (i.e., any alignment gaps between base 5 and 6 in a read are counted as errors at position 5)
d. Cross-Reference Depth and GC Information

The last six columns of this section contain region-by-region statistics of the alignments across the reference, where the reference is evenly divided into 1000 regions

Important Note: This division of the reference into regions has no understanding of repeat regions, and simply reports on the alignments of the uniquely mapping reads. Since repeat reads are not aligned to the reference, the values in this column will count repeat regions as unaligned regions.
i. The first column displays the position in the reference sequence at the center of the region
ii. Avg. Depth – the average alignment depth in the region
iii. Min. Depth – the minimum alignment depth in the region
iv. Max. Depth – the maximum alignment depth in the region
v. Depth Score – a score that is indicative of the shallowness of the alignment in the region. Each alignment column in the region is given a score of “max(0, 4-depth)” where “depth” is the alignment depth of the column. The Depth Score for a region is the sum of the column scores. This score is a very sensitive metric for use in resequencing projects, in order to gauge when enough sequencing has been performed (and the addition of more reads will not fill in any more unaligned or shallowly aligned regions of the reference)
vi. GC – the average GC content of the region
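The Depth Score formula given above, a sum of “max(0, 4-depth)” over the alignment columns of a region, is small enough to show directly. The function name and the list-of-depths input are assumptions for illustration.

```python
def depth_score(column_depths, target=4):
    """Sum of max(0, target - depth) over the alignment columns of a region.

    A region in which every column has depth >= target scores 0; each
    uncovered column (depth 0) contributes the full target value.
    """
    return sum(max(0, target - depth) for depth in column_depths)

# Two shallow columns (depth 0 and 1) and two well-covered columns.
print(depth_score([0, 1, 4, 10]))  # 4 + 3 + 0 + 0 = 7
```

A score that stops decreasing as reads are added suggests that further sequencing is no longer filling in shallow or unaligned columns.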
2.17.1.12
454RefStatus.txt
1. Reference Accession – accession number of a reference sequence
2. Num Unique Matching Reads – how many reads mapped uniquely to the reference. To be considered unique, a given portion of a read may only map to a single reference location. If a portion of a read maps to multiple reference locations (or multiple transcript variants of the same gene in the case of cDNA mapping projects), the read is considered to be a repeat.
3. Pct of All Unique Matches – the number of reads mapping uniquely to an individual reference sequence divided by the total number of reads that mapped uniquely to any reference sequence
4. Pct of All Reads – number of reads that mapped uniquely to this reference divided by the total number of reads in this mapping project
5. Pct Coverage of Reference – number of reference bases covered by at least one uniquely mapping read divided by the total number of bases in this reference
6. Description – reference description obtained from the renaming file or annotation files
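The three percentage columns above follow directly from read counts and covered positions. The sketch below is illustrative only; the function names and the interval representation (1-based, inclusive start and stop positions of uniquely mapped reads) are assumptions.

```python
def covered_bases(intervals):
    """Number of reference positions covered by at least one interval.

    intervals is a list of (start, stop) pairs, 1-based and inclusive.
    """
    covered = 0
    current_stop = 0  # right-most position counted so far
    for start, stop in sorted(intervals):
        if stop > current_stop:
            # Count only the part of this interval not already covered.
            covered += stop - max(start, current_stop + 1) + 1
            current_stop = stop
    return covered

def ref_status_row(unique_here, unique_all_refs, total_reads,
                   n_covered, ref_length):
    """Compute the percentage columns for one reference sequence."""
    return {
        "Pct of All Unique Matches": 100.0 * unique_here / unique_all_refs,
        "Pct of All Reads": 100.0 * unique_here / total_reads,
        "Pct Coverage of Reference": 100.0 * n_covered / ref_length,
    }
```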
2.17.1.13
454AllDiffs.txt
1. Reference Accno – The accession number of the reference sequence in which the difference was detected
2. Start Pos – The start position within the reference sequence, where the difference occurs
3. End Pos – The end position within the reference sequence, where the difference occurs
4. Ref Nuc – The reference nucleotide sequence at the difference location
5. Var Nuc – The differing nucleotide sequence at the difference location
6. Total Depth – The total number of reads that fully span the difference location
7. Var Freq – The percentage of different reads versus total reads that fully span the difference location
8. Ref AA – The reference amino acid sequence at the difference location, if it occurs within the coding region of an annotated gene
9. Var AA – The differing amino acid sequence at the difference location, if it occurs within the coding region of an annotated gene
10. Coding Frame – {-3, -2, -1, +1, +2, +3} – The reading frame, if the difference occurs within the coding region of an annotated gene
11. Region name – The gene name at the difference location, if it occurs within the region of an annotated gene
12. Known SNP’s – The list of known SNP IDs that occur at the difference location
13. # Fwd w/ Var – The number of forward reads that include the difference (with –fd only)
14. # Rev w/ Var – The number of reverse reads that include the difference (with –fd only)
15. # Fwd Total – The total number of forward reads that fully span the difference location (with –fd only)
16. # Rev Total – The total number of reverse reads that fully span the difference location (with –fd only)
17. Tgt Region Status (with –reg option) – identifies each difference as “InRegion” if they map within the target regions or “InExtRegion” if they map in the extended target region, but not in the target regions (see Section 4.17 for more details). This column is not present without the –reg option.
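As a small worked example of how the depth and frequency columns relate, the sketch below derives Var Freq from the four strand counts that are reported with the –fd option. It assumes that Total Depth equals the sum of the forward and reverse totals, which the text above does not state explicitly; the function names and the "both_strands" key are hypothetical.

```python
def var_freq(reads_with_variant, total_depth):
    """Percentage of spanning reads that carry the difference."""
    return 100.0 * reads_with_variant / total_depth

def strand_summary(fwd_var, rev_var, fwd_total, rev_total):
    """Combine per-strand counts into depth, frequency and a strand check."""
    total = fwd_total + rev_total  # assumption: Total Depth = Fwd + Rev totals
    return {
        "Total Depth": total,
        "Var Freq": var_freq(fwd_var + rev_var, total),
        # Hypothetical convenience flag: is the difference seen on both strands?
        "both_strands": fwd_var > 0 and rev_var > 0,
    }
```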
2.17.1.14
454HCDiffs.txt
This file contains the same type of information as the 454AllDiffs.txt file (section 2.17.1.13, above), but restricted to the “High-Confidence” differences. The GS Reference Mapper application uses a combination of flow signal information, quality score information and difference type information to determine if a difference is High-Confidence. The general rules are:
If the difference is a single-base overcall or undercall, then the reads with the difference must form the consensus of the sequenced reads (i.e., at that location, the overall consensus must differ from the reference) and the signal distribution of the differing reads must vary from the matching reads (and the number of bases in that homopolymer of the reference).
2.17.1.15
454HCStructVars.txt and 454AllStructVars.txt
1. Ref Accno1 – the accession number of the reference sequence on one side of the variation
2. Ref Pos1 – the reference position on one side of the variation
3. Var Side1 – a direction arrow “-->” or “<--” describing the direction of the variation on the reference (e.g., the 3’ end of the reads or paired-end clones that diverge from the reference occur in this direction)
4. Region Name1 – the gene name, or annotated region name, covering the location in the reference denoted by Ref Accno1 and Ref Pos1 (Gene or Region annotation must be included in the project for this field to contain a value)
5. Ref Accno2 – the accession number of the reference sequence on the other side of the variation, if known. (If only one side of the variation is known, a question mark is given here)
6. Ref Pos2 – the reference position on the other side of the variation, or a question mark if only one side is known
7. Var Side2 – a direction arrow “-->” or “<--” for the direction on the other side of the variation, or a question mark if only one side is known
8. Region Name2 – the gene name, or annotated region name, covering the location in the reference denoted by Ref Accno2 and Ref Pos2 (Gene or Region annotation must be included in the project for this field to contain a value)
9. Total Depth – the number of reads (for rearrangement points) or pairs (for rearrangement regions) covering the variation location(s)
10. Var Freq – the percentage of the reads/pairs that support the variation
11. Deviation Length – if both sides of the variation occur on the same reference, this is the distance between the two variation locations
12. Type – the string “Point” or “Region” to denote whether the rearrangement is a rearrangement point identified by split-read alignments or a rearrangement region identified by paired-end reads
13. # Fwd w/ var – number of reads in the forward orientation that contain the variation (requires –fd option)
14. # Rev w/ var – number of reads in the reverse orientation that contain the variation (requires –fd option).
15. # Fwd Total – total number of reads in the forward orientation that map to this area of the reference (requires –fd option).
16. # Rev Total – total number of reads in the reverse orientation that map to this area of the reference (requires –fd option).
17. Var ID (no heading in file) – a field that identifies each individual variation, in the format “Var#x”, where # is the ID number.
2.17.1.15.1
Rearrangement Points
2.17.1.15.2
Rearrangement Regions
Rearrangement Regions are represented by clusters of paired-end reads that are considered False Pairs (see section 4.8.4 for a brief discussion of True Pair vs. False Pair reads). The detection of Rearrangement Regions depends on finding clusters of False Pairs of Paired End reads near each other. When a cluster shows a consistent deviation from the reference (varying either in size or expected orientation), a rearrangement region is reported. An example rearrangement region is shown in Figure 89.
2.17.1.16
454HCStructRearrangements.txt / 454AllStructRearrangements
Label – states which type of rearrangement was found. The type can be a deletion, insertion, substitution, inversion, tandem duplication, interspersed duplication, translocation, or fusion. Duplications and translocations can also be inverted.
Reference accno – for intra-chromosomal rearrangements, there will only be one reference. Inter-chromosomal rearrangements will span two.
Reference positions – The rearrangements will involve one, two, or three points on the reference(s), depending on their type.
Length – (where applicable): Most but not all rearrangements will have an associated length. See the individual specifications detailed in Appendix 4.16.
Confidence – Low or High. Confidence is considered High if at least one of the individual variations comprising the given rearrangement is also High. If none of the variations are high confidence, the rearrangement will be marked as Low confidence, and will only appear in the 454AllStructRearrangements file.
Support and context – The number of supporting and non-supporting shotgun reads at each involved position will be shown where available, along with the reference context of that position. Note that rearrangements supported only by Paired End reads, with no shotgun reads at a given position, will have no support information available at that position.
Paired End lengths – The deviation length of supporting Paired End reads from their expected library lengths will be shown. If there are fewer than 10 such reads, their lengths will be listed. If there are 10 or more, their lengths will be shown in a histogram. There will be a separate listing/histogram for groups of Paired Ends mapping to different strands.
Individual Variation IDs – This is a list of the varID numbers of the individual variations in the 454AllStructVars.txt and 454HCStructVars.txt files that make up the given rearrangement. Note that a rearrangement can and usually will be made up of more than one variation. Note also that in the 454HCStructVars.txt and 454AllStructVars.txt files, the summary line for every variation has been appended with a field in the format “Var#x”, where # is the var ID number.
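Two of the rules in the field descriptions above lend themselves to a short sketch: the Confidence rule (High if at least one constituent variation is High) and the Paired End lengths rule (a plain listing for fewer than 10 reads, a histogram otherwise). The function names and return values are hypothetical and only illustrate the stated logic.

```python
def rearrangement_confidence(variation_confidences):
    """Return "High" if any constituent variation is "High", else "Low"."""
    if any(conf == "High" for conf in variation_confidences):
        return "High"
    return "Low"

def paired_end_length_display(num_supporting_pairs):
    """Return how deviation lengths are shown for one strand group."""
    return "list" if num_supporting_pairs < 10 else "histogram"
```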