Emiliania-huxleyi-374-20130904
------------------------------
Samples included in the assembly:
MMETSP1006
MMETSP1007
MMETSP1008
MMETSP1009
--

stats.txt: Summary statistics associated with contigs.fa.
Includes the total number of sequences and bases in the contig
set, N50, etc. Q1, Q2, and Q3 are the quartiles of the reported
contig lengths. B1000 and B2000 indicate the percentage of bases
in contigs of at least 1000 bp and 2000 bp, respectively.
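
The statistics described above can be recomputed from a list of
contig lengths. The following is a minimal sketch, not the code
used to produce stats.txt: the function name contig_stats, the
dictionary keys, and the "inclusive" quartile convention are
assumptions made for illustration, and the values in stats.txt
may follow a different quartile convention.

```python
from statistics import quantiles


def contig_stats(lengths):
    """Summarize a list of contig lengths (in bp); needs >= 2 contigs."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: walking from the longest contig down, the length of the
    # contig at which the running total first reaches half of all bases.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    # Quartiles of the contig lengths (convention is an assumption).
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")

    def pct_bases_at_least(cutoff):
        # Percentage of all bases that sit in contigs >= cutoff bp.
        return 100.0 * sum(l for l in ordered if l >= cutoff) / total

    return {
        "sequences": len(ordered),
        "bases": total,
        "N50": n50,
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "B1000": pct_bases_at_least(1000),
        "B2000": pct_bases_at_least(2000),
    }


if __name__ == "__main__":
    # Made-up lengths, not taken from this assembly.
    print(contig_stats([150, 500, 1000, 2000, 3000]))
```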

contigs.fa: Contigs from the assembly, minimum 150 bp. May
include UTRs.

cds.fa: Coding regions associated with contigs, as predicted by
ESTScan, minimum 150 bp. Sequence identifiers for these predicted
CDS are given the suffixes _1, _2, etc., to accommodate multiple
predictions and to indicate association with predicted protein
products. Note that the total number of predicted CDS might be
higher or lower than the number of contigs. This can be due to
the reporting threshold of 150 bp or to multiple predictions per
contig.

peptides.fa: Protein products associated with contigs, as
predicted by ESTScan, minimum 30 aa. Sequence identifiers for
these predicted products correspond to the associated nucleotide
sequence in contigs.fa, and are given the suffixes _1, _2, etc.,
to accommodate multiple predictions. Note that the total number
of predicted peptides might be higher or lower than the number of
contigs. This can be due to the reporting threshold of 30 aa or
to multiple predictions per contig.
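
Because CDS and peptide identifiers carry a numeric suffix, a
prediction can be traced back to its contig by stripping the last
"_<number>" from the identifier. The following is a minimal
sketch under that reading of the naming scheme; the function
names and the example identifiers are hypothetical and do not
come from this assembly.

```python
import re
from collections import defaultdict

# Greedy ".+" ensures only the final "_<digits>" is treated as the
# prediction suffix, even if the contig name itself contains "_".
_SUFFIX = re.compile(r"^(?P<contig>.+)_(?P<index>\d+)$")


def split_prediction_id(seq_id):
    """Return (contig_id, prediction_number) for an ID such as 'x_2'."""
    match = _SUFFIX.match(seq_id)
    if match is None:
        raise ValueError(f"no _N prediction suffix in {seq_id!r}")
    return match.group("contig"), int(match.group("index"))


def group_by_contig(seq_ids):
    """Map each contig ID to the list of its prediction numbers."""
    groups = defaultdict(list)
    for seq_id in seq_ids:
        contig, index = split_prediction_id(seq_id)
        groups[contig].append(index)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    ids = ["contig_12_1", "contig_12_2", "contig_40_1"]
    print(group_by_contig(ids))
```

Contigs with no prediction above the reporting threshold simply
do not appear in the result, which is one reason the number of
predictions can differ from the number of contigs.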

--
--

The "readcounts" directory contains read counts from each
individual sample.

contigs.dat: Read counts generated using default RSEM parameters
to align sequence reads to contigs.fa. Because the counts
delivered by RSEM are fragments (corresponding to read pairs),
they are multiplied by two and rounded to the nearest integer.
The tab-delimited file contains a row for each contig and a
column of read counts for each sample in the assembly. Read
counts are not normalized.

cds.dat: Read counts generated using default RSEM parameters to
align sequence reads to cds.fa. Because the counts delivered by
RSEM are fragments (corresponding to read pairs), they are
multiplied by two and rounded to the nearest integer. The
tab-delimited file contains a row for each CDS and a column of
read counts for each sample in the assembly. Read counts are not
normalized.
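
The fragment-to-read conversion and the layout of the .dat files
can be sketched as follows. This is an illustration only: the
function names are hypothetical, the presence of a header row
with sample names and of the sequence ID in the first column are
assumptions about the file layout, and Python's round() resolves
exact ties to the even integer, which may differ from the
rounding used to build these files.

```python
import csv
import io


def fragments_to_reads(expected_fragments):
    """Convert an RSEM fragment count to a read count (pairs -> reads)."""
    return round(2 * expected_fragments)


def load_counts(handle):
    """Read a tab-delimited count table into {sequence_id: {sample: count}}.

    Assumes the first row is a header and the first column holds the
    sequence ID; both are assumptions about the file layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts = [int(float(value)) for value in row[1:]]
        table[row[0]] = dict(zip(samples, counts))
    return samples, table


if __name__ == "__main__":
    # Made-up rows; sample names are those listed for this assembly.
    demo = (
        "id\tMMETSP1006\tMMETSP1007\n"
        "contig_a\t10\t0\n"
        "contig_b\t4\t6\n"
    )
    samples, table = load_counts(io.StringIO(demo))
    # Per-sample totals, useful because the counts are not normalized.
    for sample in samples:
        print(sample, sum(row[sample] for row in table.values()))
```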

--
--

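The "annot" directory contains annotations in GFF3 format for the
predicted protein products, produced by two primary methods:
HMMER3 searches against protein family model sets and NCBI-BLASTP
searches against SwissProt.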

pfam.gff3, tigrfams.gff3, superfamily.gff3: Models matching
predicted protein products (peptides.fa), reported in GFF3
format; based on HMMER3 searches against the Pfam-A, TIGRFAMs,
and Superfamily model sets. Hits are restricted to a
full-sequence E-value <= 1.0e-5, with the top five hits reported.

Association with InterPro terms is indicated in the Ontology_term
attribute, and is based on the assertions (InterPro -> model,
InterPro -> protein accession) made by InterPro. Currently,
InterPro associations from Superfamily hits are not computed.
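
The Ontology_term attribute sits in the ninth (attributes) column
of each GFF3 feature line, as tag=value pairs separated by ";"
with multiple values separated by ",". The following is a minimal
sketch of extracting it; the function names are hypothetical, and
the example line, including the form of the term values, is made
up for illustration and is not copied from these files.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse a GFF3 attributes column into {tag: [values, ...]}."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        tag, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters inside tags/values.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def ontology_terms(gff3_line):
    """Return the Ontology_term values of one GFF3 line, or []."""
    if gff3_line.startswith("#"):
        return []  # directive or comment line
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return []  # not a feature line
    return parse_gff3_attributes(columns[8]).get("Ontology_term", [])


if __name__ == "__main__":
    # Hypothetical feature line for illustration only.
    line = (
        "contig_a_1\tPfam\tprotein_match\t5\t120\t1e-30\t.\t.\t"
        "ID=m1;Name=PF00069;Ontology_term=InterPro:IPR000719"
    )
    print(ontology_terms(line))
```

Lines from superfamily.gff3 would be expected to return an empty
list here, since InterPro associations are not computed for
Superfamily hits.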

swissprot.gff3: Protein sequence accessions matching predicted
protein products (peptides.fa), reported in GFF3 format; based on
NCBI-BLASTP searches against SwissProt. Hits are restricted to
E-value <= 1.0e-20, and the five HSPs with the highest bit scores
are reported.
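
The reporting rule for SwissProt hits (an E-value cutoff followed
by a top-five selection on bit score) can be sketched as below.
This is an illustration under assumptions: the function name, the
dictionary keys, and the example values are hypothetical, and
applying the rule per query peptide is one reading of the
description above.

```python
def top_hits(hits, max_evalue=1.0e-20, limit=5):
    """Keep hits passing the E-value cutoff, best bit scores first.

    `hits` is a list of dicts with 'evalue' and 'bitscore' keys for
    a single query peptide (hypothetical structure).
    """
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    kept.sort(key=lambda hit: hit["bitscore"], reverse=True)
    return kept[:limit]


if __name__ == "__main__":
    # Made-up hits for one query peptide.
    hits = [
        {"accession": "A", "evalue": 1e-50, "bitscore": 200.0},
        {"accession": "B", "evalue": 1e-10, "bitscore": 60.0},
        {"accession": "C", "evalue": 1e-30, "bitscore": 120.0},
    ]
    for hit in top_hits(hits):
        print(hit["accession"], hit["bitscore"])
```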

--
--

The "extras" directory contains the underlying program output for
the annotations and coding-region predictions.

hmmer3_pfam.hits, hmmer3_superfam.hits, hmmer3_tigrfams.hits:
Program output from HMMER3 searches using default parameters
against the Pfam-A, Superfamily, and TIGRFAMs model sets. For
convenience, results are provided in GFF3 format in the "annot"
directory.

blastp_swissprot.xml: Program output from NCBI-BLASTP searches
against the SwissProt/UniProtKB database. For convenience,
results are provided in GFF3 format in the "annot" directory.
These are restricted to evalue <= 1.0e-20.

estscan.gff3: Coding regions (cds.fa) of full-length contigs
(contigs.fa), as predicted by ESTScan, in GFF3 format.

------------------------------------
National Center for Genome Resources
http://www.ncgr.org