1.15
GS De Novo Assembler Output
Output Files produced by the GS De Novo Assembler are described briefly below. GUI settings and command line options that control the content type of a file are given when applicable.
Table 1: GS De Novo Assembly Output Files
454NewblerProgress.txt: If Runs are added incrementally and multiple executions of runProject occur, the output messages are appended to this file. If the “Incremental De Novo assembly analysis” checkbox option is not selected in the GUI application, or the “-r” option is given on the runProject command line, the GS De Novo Assembler deletes intermediate data and restarts the assembly computation, and the 454NewblerProgress.txt file is deleted and restarted as well.
1.15.1.1
454AlignmentInfo.tsv
The 454AlignmentInfo.tsv file (Figure 23) contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line in tab-delimited format. Whether this file is output is controlled by the -info/-infoall/-noinfo options or by the selection made on the Output sub-tab of the GUI Parameters tab. By default, the file is output only if there are fewer than 4 million input reads and the total length of assembled contigs is less than 40 Mbp. For larger projects, -info or -infoall, or the corresponding GUI Output selection for Alignment Info, must be used to generate this file. Each line contains the following columns:
1. Position – the position in the contig
2. Consensus – the consensus nucleotide for that position in the contig
3. Quality Score – the quality score of the consensus base
4. Unique Depth – the number of non-duplicate reads that align at that position in the alignment
5. Align Depth – the number of reads (including duplicates) that align at that position in the alignment
6. Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
7. StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows
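As a minimal sketch of reading the seven columns listed above, the Python below parses the data lines and reports positions whose Unique Depth falls below a chosen cutoff. It assumes that any line without exactly seven tab-separated fields, or with non-numeric values in the numeric columns, is a header or separator and can be skipped. The names `AlignmentRow`, `parse_alignment_rows`, and `low_coverage_positions` are hypothetical helpers, not part of the GS De Novo Assembler software.

```python
from typing import Iterable, Iterator, List, NamedTuple


class AlignmentRow(NamedTuple):
    position: int
    consensus: str
    quality: int
    unique_depth: int
    align_depth: int
    signal: float
    std_deviation: float


def parse_alignment_rows(lines: Iterable[str]) -> Iterator[AlignmentRow]:
    """Yield one AlignmentRow per seven-column data line; skip everything else."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 7:
            continue  # assumption: not a per-position data line
        try:
            yield AlignmentRow(
                int(fields[0]), fields[1], int(fields[2]),
                int(fields[3]), int(fields[4]),
                float(fields[5]), float(fields[6]),
            )
        except ValueError:
            continue  # assumption: a column-title row or other non-numeric line


def low_coverage_positions(rows: Iterable[AlignmentRow], min_unique_depth: int) -> List[int]:
    """Return positions whose Unique Depth is below min_unique_depth."""
    return [row.position for row in rows if row.unique_depth < min_unique_depth]
```

With a real project, the lines would typically come from `open("454AlignmentInfo.tsv")`; the cutoff is a choice for the analyst, not a value taken from the software.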
1.15.1.2
fna and qual files: 454AllContigs, 454LargeContigs, 454Scaffolds, 454TrimmedReads
These files contain the nucleotide sequences of all the contigs, large contigs, scaffolds, scaffoldContigs or trimmed reads (Figure 24) and associated nucleotide Quality Scores (Phred-equivalent; Figure 25) produced by the GS De Novo Assembler application. The AllContig and LargeContig minimum length thresholds are specified in the GUI or by CLI options described in Table 1, above.
1.15.1.3
454ReadStatus.txt
The 454ReadStatus.txt file (Figure 26) contains the status identifiers for all the reads used in the assembly computation, plus the 3’ and 5’ positions for each assembled read’s alignment within the contig results. The reads are listed one per line, in tab-delimited format. Each line contains the following information (these are the columns in the tab-delimited format):
1. Accno – accession number of the input read. If this is a Paired End read, the accno is followed by an underscore character and “left” or “right”, indicating which half of the pair the read comes from.
2. Read Status – status of the read in the assembly, which can be one of the following:
a. Assembled – the read is fully incorporated into the assembly
b. PartiallyAssembled – only part of the read was included in the assembly; the rest was deemed to have diverged sufficiently to not be included
c. Singleton – the read did not overlap with any other reads in the input
d. Repeat – the read was either:
e. Outlier – the read was identified by the GS De Novo Assembler as problematic, and was excluded from the final contigs (one explanation for these outliers is chimeric sequences, but sequences may be identified as outliers simply as an assembler artifact)
f. TooShort – the trimmed read was too short to be used in the computation (shorter than 50 bases and longer than the value of the minlen parameter, unless 454 Paired End Reads are included in the data set, in which case all reads of at least “minlen” bases are used).
3. 5’ Contig – the accno of the contig in which the 5’ end of the read’s alignment begins.
4. 5’ Position – the position in the 5’ contig where the 5’ end of the read’s alignment begins.
5. 5’ Strand – the orientation of the read’s alignment relative to the 5’ contig. A ‘+’ indicates the alignment orientation of the read is the same as the orientation of the 5’ contig. A ‘-’ indicates the alignment orientation of the read is opposite to the orientation of the 5’ contig.
6. 3’ Contig – the accno of the contig in which the 3’ end of the read’s alignment ends.
7. 3’ Position – the position in the 3’ contig where the 3’ end of the read’s alignment ends.
8. 3’ Strand – the orientation of the read’s alignment relative to the 3’ contig. A ‘+’ indicates the alignment orientation of the read is the same as the orientation of the 3’ contig. A ‘-’ indicates the alignment orientation of the read is opposite to the orientation of the 3’ contig.
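The column layout above lends itself to a quick tally of read statuses. The following Python is a sketch under stated assumptions: a header line, if present, starts with “Accno”; reads that were not assembled may carry only the first two columns; and the helper name `summarize_read_status` is hypothetical. It also lists reads whose 5’ and 3’ ends fall in different contigs.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_read_status(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count reads per status and collect accnos of reads spanning two contigs."""
    counts: Counter = Counter()
    spanning: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] == "Accno":
            continue  # assumption: blank line or column-title row
        accno, status = fields[0], fields[1]
        counts[status] += 1
        # Columns 3 and 6 are the 5' Contig and 3' Contig, when present.
        if len(fields) >= 8 and fields[2] != fields[5]:
            spanning.append(accno)
    return counts, spanning
```

For Paired End data, the `_left` and `_right` suffixes described above are part of the accno, so each half of a pair is counted as its own read.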
1.15.1.4
454TrimStatus.txt
1. Accno – accession number of the input read
2. Trimpoints Used – the final trimpoints used in the assembly, in #-# format
3. Trimmed Length – the final trimmed length of the read
4. Orig. Trimpoints – the original trimpoints of the read, found in the SFF or FASTA
5. Orig. Trimmed Length – the original trimmed length of the read
6. Raw Length – the length of the raw read (without any trimming)
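To illustrate the #-# trimpoint format, the Python sketch below compares the Trimpoints Used column with the Orig. Trimpoints column and lists the reads whose trimpoints changed. It assumes the six columns appear tab-delimited in the order given above; `parse_trimpoints` and `retrimmed_reads` are hypothetical names.

```python
from typing import Iterable, List, Tuple

Trimpoints = Tuple[int, int]


def parse_trimpoints(text: str) -> Trimpoints:
    """Split a '#-#' trimpoint string into (start, end) integers."""
    start, end = text.split("-")
    return int(start), int(end)


def retrimmed_reads(lines: Iterable[str]) -> List[Tuple[str, Trimpoints, Trimpoints]]:
    """Return (accno, original trimpoints, used trimpoints) for reads whose trimpoints changed."""
    changed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # assumption: not a data line
        try:
            used = parse_trimpoints(fields[1])
            original = parse_trimpoints(fields[3])
        except ValueError:
            continue  # assumption: column-title row, which has no '#-#' values
        if used != original:
            changed.append((fields[0], original, used))
    return changed
```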
1.15.1.5
454Contigs.ace or ace/ContigName.ace or consed/…
This viewer-ready genome file shows all the contigs contained in 454AllContigs.fna and how the individual reads aligned to those contigs, in an ACE format file suitable for use in various third-party sequence finishing programs (Figure 28). (The freeware “clview” application can be downloaded from: http://compbio.dfci.harvard.edu/tgi/software/; a full description of the .ace file format can be found at: http://bozeman.mbt.washington.edu/consed/consed.html.) Note, however, that such third-party viewing software cannot make full use of the flowspace assembly information available with 454 Sequencing reads and, conversely, that some functions of the third-party programs (e.g. those involving sequence chromatogram input) are not usable with 454 Sequencing data sets. Nonetheless, these programs may be useful to view and assess read characteristics and coverage depth in regions of interest.
The read information included in the ACE file produced by the GS De Novo Assembler application differs from traditional files in that a single read may appear in multiple contigs of the assembly. This occurs because the objective of the 454 Sequencing System software is to first identify and partition the repeat and non-repeat regions of the genome, and then output the consensus sequences for those regions. Therefore, if a single read spans the boundary between two contigs, and either one of these contigs consists of a repeat region, the read will be displayed in both the repeat contig and the non-repeat contig on the other side of the boundary.
1.15.1.6
454NewblerMetrics.txt
1.15.1.6.1
Input information (Figure 29)
runData group – contains information about the read data used in the analysis; both Sanger and non-Paired-End 454 Sequencing read files are reported in this section. (This group is not shown in Figure 29, since only Paired End data files were used in that example.)
pairedReadData group – contains information about the Paired End input data (Paired End only; 454 Sequencing System and Sanger if any).
1.15.1.6.2
Operation metrics (Figure 30)
runMetrics group – contains information about the assembly computation.
readAlignmentResults group – contains information about the alignments for each input file (SFF, FASTA/FASTQ, or Run regions from a wells file).
pairedReadResults group – contains information about the Paired End input data (Paired End only; 454 Sequencing System and Sanger, if any).
1.15.1.6.3
Consensus distribution information
consensusDistribution group – contains information about the consensus signals and basecalling thresholds
1.15.1.6.4
Alignment depths
alignmentDepths group – provides a histogram of the number of alignment positions at each coverage depth (including the gaps at all positions)
1.15.1.6.5
Consensus results (Figure 31)
consensusResults group – contains summary information and statistics about reads, scaffolds, and contigs.
readStatus – summary information about the reads
pairedReadStatus – Paired End library statistics (if Paired End reads used).
scaffoldMetrics – scaffold statistics (if Paired End reads are used).
largeContigMetrics – contig statistics for large contigs (longer than ‘largeContigThreshold’; default is 500 bp).
allContigMetrics – contig statistics for all contigs (the minimum length threshold defaults to 100 bp).
1.15.1.7
454NewblerProgress.txt
This file represents the text log of the messages sent to standard output by the runProject command (showing the progress of the execution of the assembly computation). If Runs are added incrementally and multiple executions of runProject occur, the output messages are appended to this file. If the “Incremental De Novo assembler analysis” checkbox option is not selected in the GUI application, or if the “-r” option is given on the runProject command line, the GS De Novo Assembler deletes intermediate data and “restarts” the assembly computation, and this file is deleted and restarted as well.
1.15.1.8
454PairAlign.txt
This file contains the pairwise alignments of the overlaps that were found during the assembly computation (Figure 32). By default, this file is not generated, but if the “-p” or “-pt” options are given on the runProject command line, it will be generated either in a human-readable text format (“-p”) or in tab-delimited format (“-pt”).
1. QueryAccno – accession number of the read used in the overlap detection search (the “query sequence”)
2. QueryStart – starting position of the alignment in query sequence
3. QueryEnd – ending position of the alignment in query sequence
4. QueryLength – length of the query sequence
5. SubjAccno – accession number of the other read (the “subject sequence”)
6. SubjStart – starting position of the alignment in subject sequence
7. SubjEnd – ending position of the alignment in subject sequence
8. SubjLength – length of the subject sequence
9. NumIdent – number of identities in the pairwise alignment, i.e. where query and subject characters match
10. AlignLength – the length of the pairwise alignment
11. QueryAlign – query alignment sequence
12. SubjAlign – subject alignment sequence
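For the tab-delimited (“-pt”) form, the NumIdent and AlignLength columns together give the fraction of identical positions in each overlap. The Python below is a sketch, assuming the twelve columns appear in the order listed above; the function name `overlap_identities` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def overlap_identities(lines: Iterable[str]) -> Iterator[Tuple[str, str, float]]:
    """Yield (QueryAccno, SubjAccno, NumIdent / AlignLength) for each data line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 12:
            continue  # assumption: not a tab-delimited overlap record
        try:
            num_ident = int(fields[8])     # column 9: NumIdent
            align_length = int(fields[9])  # column 10: AlignLength
        except ValueError:
            continue  # assumption: column-title row
        if align_length == 0:
            continue  # avoid dividing by zero on a malformed record
        yield fields[0], fields[4], num_ident / align_length
```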
1.15.1.9
454PairStatus.txt
1. Template – template string for the pair (this will be the original 454 accession for 454 Paired End reads, and the “template” string for Sanger reads)
2. Status – the status of the pair in the assembly, with the following possible values:
a. BothUnmapped – both halves of the pair were unmapped
b. OneUnmapped – one of the reads in the pair was unmapped
c. MultiplyMapped – one or both of the reads in the pair were marked as Repeat
d. SameContig – both halves of the pair were assembled into the same contig, with the correct relative orientation, and are within the expected distance of each other
e. Link – the halves were assembled to different contigs, and are near enough to the ends of those contigs that they could be used as a link in a scaffold
f. FalsePair – the halves were assembled, but the orientation of the aligned contigs is inconsistent with a Paired End pair or the distance between the halves is outside the expected distance
3. Distance – for “SameContig” or “Link” pairs, the distance between the halves (where the “Link” distance is simply the sum of the distances from each half to the end of its respective contig).
4. Left Contig – the contig where the left half was assembled, or ‘-’ if the read was Unmapped or Repeat. (If the read aligns across multiple contigs, the location of the first base in the read that is aligned is used, to denote where the end of the corresponding clone occurs.)
5. Left Pos – the position in the contig where the 5’ end of the left half was assembled
6. Left Dir – the direction (‘+’ for the forward strand of the contig and ‘-’ for reverse strand) in which the left half was assembled
7. Right Contig – the contig where the right half was assembled, or ‘-’ if the read was Unmapped or Repeat. (If the read aligns across multiple contigs, the location of the first base of the read that is aligned is used, to denote where the end of the corresponding clone occurs.)
8. Right Pos – the position in the contig where the 3’ end of the right half was assembled
9. Right Dir – the direction (‘+’ for the forward strand of the contig and ‘-’ for reverse strand) in which the right half was assembled
10. Left Distance – the distance from the Left Pos to the respective end of the contig (for forward matches, this is the distance to the 3’ end of the contig; for reverse matches, to the 5’ end)
11. Right Distance – the distance from the Right Pos to the respective end of the contig (for forward matches, this is the distance to the 3’ end of the contig; for reverse matches, to the 5’ end).
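Since a “Link” distance is described above as the sum of the Left Distance and Right Distance, the file can be checked for that relationship while counting how many pairs link each pair of contigs. The Python below is a sketch, assuming the eleven columns are tab-delimited in the order listed; `summarize_links` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_links(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count Link pairs per contig pair; list templates whose Distance
    differs from Left Distance + Right Distance."""
    links: Counter = Counter()
    mismatches: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11 or fields[1] != "Link":
            continue  # only complete rows with Status "Link" are of interest here
        # Sort the two contig names so that A-B and B-A count as the same link.
        contig_pair = tuple(sorted((fields[3], fields[6])))
        links[contig_pair] += 1
        try:
            distance = int(fields[2])
            left_distance = int(fields[9])
            right_distance = int(fields[10])
        except ValueError:
            continue  # assumption: non-numeric distances cannot be checked
        if distance != left_distance + right_distance:
            mismatches.append(fields[0])
    return links, mismatches
```

Contig pairs supported by several Link pairs are the natural candidates to inspect when reviewing how scaffolds were joined.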
1.15.1.10
454TagPairAlign.txt
1.15.1.11
454Scaffolds.txt and 454ContigScaffolds.txt
This file contains AGP-formatted information describing how the contigs included in the 454LargeContigs.fna file (see section 1.15.1.2) are scaffolded into the sequence scaffolds of 454Scaffolds.fna (section 1.15.1.2). A description of the AGP format can be found on NCBI’s web site.
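As an illustration of working with AGP-formatted lines, the Python sketch below groups components by scaffold. It assumes the standard NCBI AGP column layout (object, start, end, part number, component type, then either component ID, start, end, orientation or, for gap types “N” and “U”, the gap length in column 6); whether the assembler's output uses every feature of that layout is not confirmed here, and `scaffold_layout` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, Union

Part = Tuple[str, Union[str, int]]


def scaffold_layout(lines: Iterable[str]) -> Dict[str, List[Part]]:
    """Map each scaffold to its ordered parts: (contig, orientation) or ("gap", length)."""
    layout: Dict[str, List[Part]] = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # AGP comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # assumption: not an AGP record
        scaffold, component_type = fields[0], fields[4]
        if component_type in ("N", "U"):
            layout[scaffold].append(("gap", int(fields[5])))
        else:
            orientation = fields[8] if len(fields) > 8 else "?"
            layout[scaffold].append((fields[5], orientation))
    return dict(layout)
```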
1.15.1.12
454ContigGraph.txt
1. The number of the contig (i.e., for contig “contig00002”, this column would contain the number “2”)
2. The full name of the contig (e.g., “contig00002”)
6. The depth of the multiple alignment that spans between the two contig ends (i.e., the number of reads whose alignments cross the two contig ends).
C    1    5’    2    3’   21
S    6    3’    8    5’
I    24   CAAGAGATTGGCTCTTCCAGCTTAAACG    23:25-3'..712-5'; 48:25-3'..838-5'