1.15 GS De Novo Assembler Output

1.15.1

454NewblerProgress.txt: If Runs are added incrementally and runProject is executed multiple times, the output messages of each execution are appended to this file. If the “Incremental De Novo assembly analysis” checkbox option is not selected in the GUI application, or the “-r” option is given on the runProject command line, the GS De Novo Assembler deletes the intermediate data and “restarts” the assembly computation; the 454NewblerProgress.txt file is then deleted and restarted as well.

1.15.1.1

1. Position – the position in the contig
2. Consensus – the consensus nucleotide for that position in the contig
3. Quality Score – the quality score of the consensus base
4. Unique Depth – the number of non-duplicate reads that align at that position in the alignment
5. Align Depth – the number of reads (including duplicates) that align at that position in the alignment
6. Signal – the average signal of the read flowgrams, for the flows that correspond to that position in the alignment
7. StdDeviation – the standard deviation of the read flowgram signals at the corresponding flows

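The seven columns above map naturally onto a small record type. The following is a minimal sketch, assuming each data row is tab-delimited and carries the columns in the order listed; the class name, function name, and sample values are illustrative and do not come from the assembler itself.

```python
from dataclasses import dataclass


@dataclass
class AlignmentPosition:
    position: int         # position in the contig
    consensus: str        # consensus nucleotide at that position
    quality: int          # quality score of the consensus base
    unique_depth: int     # non-duplicate reads aligned at this position
    align_depth: int      # all reads aligned here, duplicates included
    signal: float         # average flowgram signal
    std_deviation: float  # standard deviation of the flowgram signals


def parse_alignment_row(line: str) -> AlignmentPosition:
    """Parse one row, assuming seven tab-delimited columns in the listed order."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    pos, base, qual, uniq, depth, signal, stddev = fields
    return AlignmentPosition(int(pos), base, int(qual), int(uniq), int(depth),
                             float(signal), float(stddev))


# Hypothetical row; the values are made up for illustration.
row = parse_alignment_row("12\tA\t64\t18\t21\t1.03\t0.08\n")
# Align Depth counts duplicates and Unique Depth does not, so the
# difference is the number of duplicate reads at this position.
duplicates = row.align_depth - row.unique_depth
```

Keeping Unique Depth and Align Depth as separate integer fields makes it easy to spot positions whose coverage is dominated by duplicate reads.
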
1.15.1.3

1. Accno – accession number of the input read. If this is a Paired End read, the accno is followed by an underscore character and the word “left” or “right”, indicating which half of the pair the read comes from.
2. Read Status – status of the read in the assembly, which can be one of the following:
   • Assembled – the read is fully incorporated into the assembly
   • PartiallyAssembled – only part of the read was included in the assembly; the rest was deemed to have diverged too much to be included
   • Singleton – the read did not overlap with any other reads in the input
   • Repeat – the read was either:
   • Outlier – the read was identified by the GS De Novo Assembler as problematic and was excluded from the final contigs (one explanation for these outliers is chimeric sequences, but sequences may also be identified as outliers simply as an assembler artifact)
   • TooShort – the trimmed read was too short to be used in the computation (shorter than 50 bases and longer than the value of the minlen parameter, unless 454 Paired End Reads are included in the data set, in which case all reads of at least “minlen” bases are used)
3. 5’ Contig – the accno of the contig in which the 5’ end of the read’s alignment begins
4. 5’ Position – the position in the 5’ contig where the 5’ end of the read’s alignment begins
5. 5’ Strand – the orientation of the read’s alignment relative to the 5’ contig. A ‘+’ indicates that the alignment orientation of the read is the same as the orientation of the 5’ contig. A ‘-’ indicates that the alignment orientation of the read is opposite to the orientation of the 5’ contig.
6. 3’ Contig – the accno of the contig in which the 3’ end of the read’s alignment ends
7. 3’ Position – the position in the 3’ contig where the 3’ end of the read’s alignment ends
8. 3’ Strand – the orientation of the read’s alignment relative to the 3’ contig. A ‘+’ indicates that the alignment orientation of the read is the same as the orientation of the 3’ contig. A ‘-’ indicates that the alignment orientation of the read is opposite to the orientation of the 3’ contig.

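As an illustration of how the read status table might be consumed, the sketch below tallies reads per status and lists reads whose 5’ and 3’ ends fall in different contigs. It assumes a tab-delimited file with one header line, with the accno and status in the first two columns followed by the six contig, position, and strand columns, and it assumes that reads without an alignment leave the last six columns empty. The header text and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; real files may use different header text.
SAMPLE = (
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand"
    "\t3' Contig\t3' Position\t3' Strand\n"
    "read0001\tAssembled\tcontig00001\t10\t+\tcontig00001\t250\t+\n"
    "read0002_left\tAssembled\tcontig00002\t300\t-\tcontig00002\t61\t-\n"
    "read0003\tSingleton\n"
    "read0004\tPartiallyAssembled\tcontig00001\t400\t+\tcontig00003\t35\t+\n"
)


def tally_read_status(handle):
    """Count reads per status; collect reads whose ends lie in different contigs."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line (assumed to be present)
    counts = Counter()
    spanning = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        accno, status = row[0], row[1]
        counts[status] += 1
        # Columns 3 and 6 hold the 5' and 3' contig names when the read aligned.
        if len(row) >= 8 and row[2] != row[5]:
            spanning.append(accno)
    return counts, spanning


counts, spanning = tally_read_status(io.StringIO(SAMPLE))
```

With the sample above, `counts` reports two Assembled reads, one Singleton, and one PartiallyAssembled read, and `spanning` contains only `read0004`.
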
1.15.1.4

1. Accno – accession number of the input read
2. Trimpoints Used – the final trimpoints used in the assembly, in #-# format
3. Trimmed Length – the final trimmed length of the read
4. Orig. Trimpoints – the original trimpoints of the read, found in the SFF or FASTA
5. Orig. Trimmed Length – the original trimmed length of the read
6. Raw Length – the length of the raw read (without any trimming)

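The relationship between the trimpoint and trimmed-length columns can be sketched with a little arithmetic. This example assumes that the “#-#” trimpoints are 1-based and inclusive, so that a trimmed length equals end minus start plus one; that convention is an assumption and should be checked against real output. The function names and values are illustrative only.

```python
def parse_trimpoints(text: str) -> tuple[int, int]:
    """Split a '#-#' trimpoint string into (start, end) integers."""
    start, end = text.split("-")
    return int(start), int(end)


def trimmed_length(start: int, end: int) -> int:
    """Length of the retained region, assuming 1-based inclusive trimpoints."""
    return end - start + 1


# Hypothetical values for one read.
used = parse_trimpoints("5-254")      # Trimpoints Used
original = parse_trimpoints("5-260")  # Orig. Trimpoints

# Bases removed by the assembler beyond the original trimming.
bases_removed = trimmed_length(*original) - trimmed_length(*used)
```

Here the assembler's final trimming keeps 250 bases out of an originally trimmed 256, so `bases_removed` is 6.
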
The read information included in the ACE file produced by the GS De Novo Assembler application differs from that of traditional ACE files in that a single read may appear in multiple contigs of the assembly. This occurs because the objective of the 454 Sequencing System software is to first identify and partition the repeat and non-repeat regions of the genome, and then output the consensus sequences for those regions. Therefore, if a single read spans the boundary between two contigs, and either one of these contigs consists of a repeat region, the read will be displayed in both the repeat contig and the non-repeat contig on the other side of the boundary.

1.15.1.6

1.15.1.6.1

• runData group – contains information about the read data used in the analysis. Both Sanger and non-Paired-End 454 Sequencing read files are reported in this section. (Not shown in Figure 29, since only Paired End data files were used in this example.)
• pairedReadData group – contains information about the Paired End input data (Paired End only; 454 Sequencing System and Sanger, if any).

1.15.1.6.2

• runMetrics group – contains information about the assembly computation.
• readAlignmentResults group – contains information about the alignments for each input file (SFF, FASTA/FASTQ, or Run regions from a wells file).
• pairedReadResults group – contains information about the Paired End input data (Paired End only; 454 Sequencing System and Sanger, if any).

1.15.1.6.3

• consensusDistribution group – contains information about the consensus signals and basecalling thresholds.

1.15.1.6.4

• alignmentDepths group – provides a histogram of the number of alignment positions at each coverage depth (including the gaps at all positions).

1.15.1.6.5

• consensusResults group – contains summary information and statistics about reads, scaffolds, and contigs.
   ◦ readStatus – summary information about the reads.
   ◦ pairedReadStatus – Paired End library statistics (if Paired End reads are used).
   ◦ scaffoldMetrics – scaffold statistics (if Paired End reads are used).
   ◦ largeContigMetrics – contig statistics for large contigs (longer than ‘largeContigThreshold’; default is 500 bp).
   ◦ allContigMetrics – contig statistics for all contigs (default minimum contig length is 100 bp).

1.15.1.7

1.15.1.8

1. QueryAccno – accession number of the read used in the overlap detection search (the “query sequence”)
2. QueryStart – starting position of the alignment in the query sequence
3. QueryEnd – ending position of the alignment in the query sequence
4. QueryLength – length of the query sequence
5. SubjAccno – accession number of the other read (the “subject sequence”)
6. SubjStart – starting position of the alignment in the subject sequence
7. SubjEnd – ending position of the alignment in the subject sequence
8. SubjLength – length of the subject sequence
9. NumIdent – number of identities in the pairwise alignment, i.e. positions where the query and subject characters match
10. AlignLength – the length of the pairwise alignment
11. QueryAlign – query alignment sequence
12. SubjAlign – subject alignment sequence

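NumIdent and AlignLength together give the percent identity of an overlap, and NumIdent can be cross-checked against the two alignment strings. The sketch below assumes that QueryAlign and SubjAlign have equal length and use “-” for gaps; both points are assumptions about the file rather than documented facts, and the function names are illustrative.

```python
def count_identities(query_align: str, subj_align: str) -> int:
    """Count alignment columns where query and subject characters match."""
    if len(query_align) != len(subj_align):
        raise ValueError("aligned sequences must have the same length")
    # A gap character ('-', assumed) is never counted as an identity.
    return sum(q == s and q != "-" for q, s in zip(query_align, subj_align))


def percent_identity(num_ident: int, align_length: int) -> float:
    """Percent identity of a pairwise alignment: 100 * NumIdent / AlignLength."""
    if align_length <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * num_ident / align_length


# Hypothetical alignment strings (10 columns, one gap, one mismatch).
query_align = "ACGT-ACGTA"
subj_align = "ACGTTACCTA"

num_ident = count_identities(query_align, subj_align)
identity = percent_identity(num_ident, len(query_align))
```

For this pair, eight of the ten columns match, giving an identity of 80 percent.
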
1.15.1.9

1. Template – template string for the pair (this will be the original 454 accession for 454 Paired End reads, and the “template” string for Sanger reads)
2. Status – the status of the pair in the assembly, with the following possible values:
   a. BothUnmapped – both halves of the pair were unmapped
   b. OneUnmapped – one of the reads in the pair was unmapped
   c. MultiplyMapped – one or both of the reads in the pair were marked as Repeat
   d. SameContig – both halves of the pair were assembled into the same contig, with the correct relative orientation, and are within the expected distance of each other
   e. Link – the halves were assembled to different contigs, and are near enough to the ends of those contigs that they could be used as a link in a scaffold
   f. FalsePair – the halves were assembled, but the orientation of the aligned contigs is inconsistent with a Paired End pair or the distance between the halves is outside the expected distance
3. Distance – for “SameContig” or “Link” pairs, the distance between the halves (where the “Link” distance is simply the sum of the distances from each half to the end of its respective contig)
4. Left Contig – the contig where the left half was assembled, or ‘-’ if the read was Unmapped or Repeat. (If the read aligns across multiple contigs, the location of the first base in the read that is aligned is used, to denote where the end of the corresponding clone occurs.)
5. Left Pos – the position in the contig where the 5’ end of the left half was assembled
6. Left Dir – the direction (‘+’ for the forward strand of the contig and ‘-’ for the reverse strand) in which the left half was assembled
7. Right Contig – the contig where the right half was assembled, or ‘-’ if the read was Unmapped or Repeat. (If the read aligns across multiple contigs, the location of the first base of the read that is aligned is used, to denote where the end of the corresponding clone occurs.)
8. Right Pos – the position in the contig where the 3’ end of the right half was assembled
9. Right Dir – the direction (‘+’ for the forward strand of the contig and ‘-’ for the reverse strand) in which the right half was assembled
10. Left Distance – the distance from the Left Pos to the respective end of the contig (for forward matches, this is the distance to the 3’ end of the contig; for reverse matches, to the 5’ end)
11. Right Distance – the distance from the Right Pos to the respective end of the contig (for forward matches, this is the distance to the 3’ end of the contig; for reverse matches, to the 5’ end)

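The “Link” distance described above is the sum of the two end distances. The sketch below works through that arithmetic under stated assumptions: positions are taken to be 1-based, and the contig length is a hypothetical extra input, since it is not one of the listed columns. The exact off-by-one convention used by the assembler is not specified here, so the numbers are only illustrative.

```python
def distance_to_contig_end(pos: int, direction: str, contig_length: int) -> int:
    """Distance from a read half to the contig end it points toward.

    Forward ('+') matches are measured to the 3' end of the contig and
    reverse ('-') matches to the 5' end. Positions are assumed 1-based.
    """
    if direction == "+":
        return contig_length - pos
    if direction == "-":
        return pos - 1
    raise ValueError(f"unknown direction: {direction!r}")


def link_distance(left_distance: int, right_distance: int) -> int:
    """Link distance: the sum of both halves' distances to their contig ends."""
    return left_distance + right_distance


# Hypothetical pair whose halves lie near the ends of two different contigs.
left = distance_to_contig_end(pos=9800, direction="+", contig_length=10000)
right = distance_to_contig_end(pos=351, direction="-", contig_length=5000)
distance = link_distance(left, right)
```

In this example the left half lies 200 bases from the 3’ end of its contig and the right half 350 bases from the 5’ end of its contig, so the link distance is 550.
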
1.15.1.10

1.15.1.12

1. The number of the contig (e.g., for contig “contig00002”, this column would contain the number “2”)
2. The full name of the contig (e.g., “contig00002”)
6. The depth of the multiple alignment that spans between the two contig ends (i.e., the number of reads whose alignments cross the two contig ends).

C 1 5’ 2 3’ 21
S 6 3’ 8 5’
I 24 CAAGAGATTGGCTCTTCCAGCTTAAACG 23:25-3'..712-5'; 48:25-3'..838-5'
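
As a rough illustration, a “C” line like the first sample above can be split into its fields. The field meanings assumed here (record type, contig number, contig end, second contig number, second contig end, read depth) are inferred from the sample line and from the depth column described above; they are an assumption, not a specification of the format. The “S” and “I” lines are not handled.

```python
from typing import NamedTuple


class ContigEdge(NamedTuple):
    from_contig: int  # number of the first contig
    from_end: str     # "5'" or "3'" end of the first contig
    to_contig: int    # number of the second contig
    to_end: str       # "5'" or "3'" end of the second contig
    depth: int        # reads whose alignments cross the two contig ends


def parse_edge_line(line: str) -> ContigEdge:
    """Parse a six-column 'C' record (layout assumed from the sample line)."""
    # Normalize typographic apostrophes so 5’ and 5' are treated alike.
    fields = line.replace("\u2019", "'").split()
    if len(fields) != 6 or fields[0] != "C":
        raise ValueError("not a six-column 'C' record")
    _, first, first_end, second, second_end, depth = fields
    return ContigEdge(int(first), first_end, int(second), second_end, int(depth))


edge = parse_edge_line("C 1 5' 2 3' 21")
```

Under these assumptions, the sample line describes 21 reads spanning from the 5’ end of contig 1 to the 3’ end of contig 2.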