MMETSP0322-20121206
-------------------

stats.txt:  Summary  statistics  associated  with contigs.fa. In-
cludes the total number of sequences and bases in the contig set,
N50,  etc.   Q1,  Q2, Q3 are the quartiles of the reported contig
lengths. B1000 and B2000 indicate the  percentage  of  bases  in-
volved in contigs at least 1000 bp and 2000 bp, respectively.
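
The statistics described above can be recomputed from a list of
contig lengths. The following is a minimal sketch, not the code
used to produce stats.txt: the function name contig_stats is
hypothetical, N50 is taken in its usual sense (the contig length
at which the length-sorted running sum first reaches half of all
bases), and the quartile interpolation method is an assumption,
since this file does not state which one was used.

```python
from statistics import quantiles


def contig_stats(lengths):
    """Summarize contig lengths: counts, N50, quartiles, B1000/B2000."""
    if len(lengths) < 2:
        raise ValueError("need at least two contig lengths")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: walk contigs from longest to shortest and stop at the
    # first one where the running sum covers half of all bases.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    # Quartiles of the contig lengths. The "inclusive" method is an
    # assumption; other interpolation rules give slightly different values.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")

    def pct_bases_at_least(cutoff):
        # Percentage of all bases that sit in contigs >= cutoff bp.
        return 100.0 * sum(x for x in ordered if x >= cutoff) / total

    return {
        "sequences": len(ordered),
        "bases": total,
        "N50": n50,
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "B1000": pct_bases_at_least(1000),
        "B2000": pct_bases_at_least(2000),
    }
```

Small differences from the reported values are to be expected if
the original computation used other conventions.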

--

contigs.fa: Contigs from the assembly, minimum 150 bp. Possibly
includes UTRs. Sequences contain IUPAC ambiguity codes
representing ambiguous bases; see
http://www.bioinformatics.org/sms/iupac.html.

--

cds.fa: Coding regions associated with contigs, as predicted by
ESTScan, minimum 50 bp. Sequence identifiers for these predicted
CDS are given the suffixes _1, _2, etc., to accommodate multiple
predictions per contig. Sequences contain IUPAC ambiguity codes
representing ambiguous bases; see
http://www.bioinformatics.org/sms/iupac.html. Note that the total
number of predicted CDS might be higher or lower than the number
of contigs: lower because predictions below the 50 bp reporting
threshold are not reported, higher because a contig can have
multiple predictions.

--

peptides.fa: Protein products associated with contigs, as
predicted by ESTScan, minimum 30 aa. Sequence identifiers for
these predicted products correspond to the associated nucleotide
sequence in contigs.fa, and are given the suffixes _1, _2, etc.,
to accommodate multiple predictions per contig. Note that the
total number of predicted peptides might be higher or lower than
the number of contigs: lower because predictions below the 30 aa
reporting threshold are not reported, higher because a contig can
have multiple predictions.
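
To relate an entry in cds.fa or peptides.fa back to its contig,
the numeric suffix can be stripped from the identifier. The
sketch below assumes that a prediction identifier is simply the
contig identifier followed by "_" and a number (for example,
MMETSP0322-20121206|1234_2); that exact layout, and the function
name split_prediction_id, are assumptions for illustration.

```python
import re

# Contig identifier, then "_", then the prediction number at the end.
_PREDICTION_ID = re.compile(r"^(?P<contig>.+)_(?P<index>\d+)$")


def split_prediction_id(seq_id):
    """Return (contig_id, prediction_number) for a CDS or peptide identifier."""
    match = _PREDICTION_ID.match(seq_id)
    if match is None:
        raise ValueError(f"no _N suffix found in identifier: {seq_id!r}")
    return match.group("contig"), int(match.group("index"))


if __name__ == "__main__":
    # Hypothetical identifier, shown only to illustrate the call.
    print(split_prediction_id("MMETSP0322-20121206|1234_2"))
```

Grouping predictions by the returned contig identifier shows
which contigs have several predictions and which have none.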

--

readcounts/contigs.dat,  readcounts/cds.dat: Read counts obtained
by post hoc alignment of reads using BWA to reported contigs  and
CDS, respectively, with default parameters. Tab-delimited columns
with the format

  contig_id all_aligned all_aligned_fraction unique_aligned paired_aligned contig_len

where contig_id is the contig identifier, for example,
MMETSP0322-20121206|1234; all_aligned is the number of reads
aligned to this contig, including multimapped reads;
all_aligned_fraction is the number of reads aligned to this
contig, with each multimapped read assigned fractionally to the
contigs it hits, so that the sum of the all_aligned_fraction
counts equals the total number of reads that aligned;
unique_aligned is the number of reads that aligned uniquely to
this contig; paired_aligned is the number of read pairs aligned
to this contig; and contig_len is the length of the contig in bp.
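
The column layout above can be read with a few lines of Python.
This is a minimal sketch under stated assumptions: the files have
no header line, the count columns other than
all_aligned_fraction are integers, and the names ReadCount and
parse_readcounts are hypothetical.

```python
from typing import NamedTuple


class ReadCount(NamedTuple):
    contig_id: str
    all_aligned: int
    all_aligned_fraction: float
    unique_aligned: int
    paired_aligned: int
    contig_len: int


def parse_readcounts(lines):
    """Yield one ReadCount per non-empty tab-delimited line."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {number}: expected 6 columns, got {len(fields)}")
        yield ReadCount(
            contig_id=fields[0],
            all_aligned=int(fields[1]),
            all_aligned_fraction=float(fields[2]),
            unique_aligned=int(fields[3]),
            paired_aligned=int(fields[4]),
            contig_len=int(fields[5]),
        )


def total_aligned_reads(rows):
    # Summing the fractional counts gives the number of aligned reads,
    # because each multimapped read contributes a total of 1 across contigs.
    return sum(row.all_aligned_fraction for row in rows)
```

For example, with open("readcounts/contigs.dat") as handle, then
rows = list(parse_readcounts(handle)), loads the whole table.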

                     * * * PLEASE NOTE * * *
While this information is sufficient to compute common normalized
values such as RPKM (reads per kilobase of transcript per million
mapped reads) and FPKM (fragments per kilobase of transcript  per
million mapped reads), these read counts are provided for quality
assessment of the contig set only. For differential expression
analyses, it is recommended that more sophisticated estimators of
relative expression level be employed. See, for example: Salzman
J, Jiang H, Wong WH. Statistical modeling of RNA-Seq data.
Statistical Science 26 (2011).
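
As an illustration of the RPKM definition given in this note, the
value for one contig can be computed from its read count, its
length, and the total number of mapped reads. This is a minimal
sketch: the function name rpkm is hypothetical, and the choice of
all_aligned_fraction as the read count (so that the column sum
can serve as the total of mapped reads) is an assumption, not a
recommendation made by this file.

```python
def rpkm(read_count, contig_len_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if contig_len_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # reads / (length in kb) / (mapped reads in millions)
    # = reads * 1e9 / (length in bp * mapped reads)
    return read_count * 1e9 / (contig_len_bp * total_mapped_reads)


def rpkm_table(rows):
    """Map contig_id -> RPKM for (contig_id, count, contig_len) tuples.

    The total of mapped reads is taken as the sum of the counts, which
    matches the all_aligned_fraction column (an assumed choice).
    """
    rows = list(rows)
    total = sum(count for _, count, _ in rows)
    return {cid: rpkm(count, length, total) for cid, count, length in rows}
```

As stated above, such values are suited to quality assessment of
the contig set only.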

--

annot/*.gff3: Models and Swiss-Prot accessions matching predicted
protein products (peptides.fa), reported in GFF3 format. HMMER3
reports are based on searches against the Pfam-A, Superfamily,
and TIGRFAMs model sets, and are restricted to hits with
full-sequence E-value <= 1.0e-5, with the top five hits reported.
BLASTP reports are based on a search against Swiss-Prot, and are
restricted to the top five HSP bit scores with E-value <= 1.0e-20.

Association with InterPro terms is indicated in the Ontology_term
attribute, and is based on the assertions (InterPro -> model,
InterPro -> protein accession) made by InterPro. Currently,
InterPro associations from Superfamily hits are not computed.
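
The Ontology_term attribute can be pulled out of a GFF3 record by
splitting the ninth column, which in GFF3 holds
semicolon-separated key=value pairs with comma-separated,
URL-escaped values. The following is a minimal sketch: the
function names are hypothetical, and the sample record in the
usage note is invented for illustration and not taken from the
annot files.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse a GFF3 attributes column into {key: [values]}."""
    attributes = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # Multiple values for one key are comma-separated in GFF3.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def ontology_terms(gff3_line):
    """Return the Ontology_term values of one GFF3 feature line."""
    fields = gff3_line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    return parse_gff3_attributes(fields[8]).get("Ontology_term", [])
```

For a hypothetical record whose ninth column reads
Name=PF00069;Ontology_term=InterPro:IPR000719, ontology_terms
returns a list with the single entry InterPro:IPR000719; records
without the attribute yield an empty list.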

------------------------------------
National Center for Genome Resources
http://www.ncgr.org