README

Thalassiosira pseudonana Genome Resequencing Project (May 2021)
-----------------------------------------------------

The raw fast5 Oxford Nanopore MinION data were basecalled using Albacore (v2.1.7). The 1D sequencing adapters were removed from the "pass" (q-score > 7) fastq files using Porechop (v0.2.3). Filtlong (v0.1.0) was used to create a subset of the long-read data, including reads ≥30 Kbp (--mean-q-weight=8), for 100x coverage of the expected genome size of ~35 Mbp based on Armbrust et al. (2004) and Bowler et al. (2008).
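
The subsetting target follows from simple arithmetic: 100x coverage of a ~35 Mbp genome is about 3.5 Gbp of read data. A minimal Python sketch of that calculation, with a simplified length-only read selection, is given here. The function names are hypothetical, and the selection ignores the quality weighting that Filtlong applies, so it illustrates the idea and does not reproduce Filtlong's behavior.

```python
def target_bases(genome_size_bp, coverage):
    """Bases of read data needed to reach the requested coverage."""
    return genome_size_bp * coverage


def select_long_reads(read_lengths, min_length_bp, max_bases):
    """Keep the longest reads >= min_length_bp until max_bases is reached.

    Simplified illustration (assumption): length only, no quality weighting.
    """
    kept = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if length < min_length_bp or total >= max_bases:
            break
        kept.append(length)
        total += length
    return kept


# Values taken from the text: ~35 Mbp genome, 100x coverage, reads >= 30 Kbp.
GENOME_SIZE_BP = 35_000_000
TARGET = target_bases(GENOME_SIZE_BP, 100)
print(f"Target: {TARGET / 1e9:.1f} Gbp")
```

The selection stops as soon as the running total reaches the target, so the kept subset may slightly exceed it by at most one read.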

Nanopore long reads ≥30 Kbp were assembled using Flye (v2.3). The Flye genome assembly was corrected using two iterations of Racon (v1.3.1), followed by Nanopolish (v0.10.1). Final polishing was done using Illumina short-read data via Pilon (v1.22). Note that the final nuclear genome assembly is an amalgamation of the two haplotypes.

The final polished Flye assembly of the T. pseudonana nuclear genome is located in the following file: Thal_pseudonana_FLYE_assembly.fasta
Flye-assembled contigs that were identified as the plastid genome are located in the following file: Thal_pseudonana_FLYE_chloroplast_contigs.fasta

-----------------------------------------------------

FLYE ASSEMBLY STATISTICS
Total assembly length: 33.8 Mbp
Number of contigs: 52
Largest contig: 2761696 bp
Contig N50: 1.38 Mbp
Contig L50: 8
G+C content: 47%
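
For readers who want to check the contig statistics, N50 is the length of the contig at which the cumulative length, taken from the largest contig down, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The Python sketch below is a hypothetical helper, not part of the original pipeline. It applies that definition to the contig lengths listed in the contig table of this README.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")


# Contig lengths (bp) copied from the contig table in this README.
CONTIG_LENGTHS = [
    2761696, 2737527, 2652838, 2536014, 2056298, 1699792, 1505999, 1384551,
    1216303, 1188749, 1112199, 1055523, 1015641, 950771, 861658, 810110,
    788881, 733359, 711435, 692985, 629637, 577983, 572364, 474235,
    422608, 350128, 332712, 271849, 234828, 174840, 146685, 131800,
    125558, 120315, 104872, 95520, 91967, 71322, 70349, 69073,
    66137, 62249, 47647, 23422, 14904, 13959, 13664, 12174,
    9414, 6940, 6832, 6674,
]

n50, l50 = n50_l50(CONTIG_LENGTHS)
print(f"contigs={len(CONTIG_LENGTHS)} "
      f"total={sum(CONTIG_LENGTHS) / 1e6:.1f} Mbp N50={n50} L50={l50}")
```

With these lengths the helper returns an N50 of 1384551 bp and an L50 of 8, consistent with the values listed above.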


CONTIG TABLE
Contig        Length (bp)   Telomere(s)   Telomere Sequence
contig4       2761696       0
contig10      2737527       0
contig12      2652838       2             TAACC(C), TA(G)CC(CC) …… TAGGGT, TAGG(G)AGT
contig6       2536014       1             TAGG(G)T
contig23      2056298       1             TAACC(C)
contig9b      1699792       1             TAGGGT
contig15      1505999       0
contig11      1384551       1             TAACCC
contig24      1216303       1             TAACCC
contig2       1188749       1             TAACCC
contig20      1112199       0
contig5       1055523       1             TAGGGT
contig1       1015641       0
contig8       950771        1             TAACC(C)
contig7       861658        0
contig26      810110        1             TAGGGT
contig34a     788881        0
contig3       733359        1             TAACCC
scaffold21    711435        0
contig18      692985        0
contig36      629637        0
contig90      577983        1             TAGGGT
contig9a      572364        1             TAGGGT
contig13      474235        1             TAACCC
contig29      422608        1             TAGGGT
contig27      350128        1             TAGGGT
contig16      332712        1             TAACC(C)
contig37      271849        1             TAGGGT
contig46      234828        0
contig19      174840        0
contig35      146685        1             TAGGGT
contig17      131800        0
contig33      125558        1             TAGG(G)AGT
contig70      120315        0
contig28      104872        1             TAGG(GGC)T
contig82      95520         0
contig40      91967         0
contig45      71322         0
contig14      70349         0
contig48      69073         1             TAACC(C)
contig30      66137         0
contig22      62249         0
contig34b     47647         0
contig47      23422         0
contig69      14904         1             TAACC(C)
contig49      13959         1             TAGGGT
contig57      13664         1             TAACC(CC)
contig31      12174         0
contig52      9414          1             TAACCC(C)
contig87      6940          0
contig61      6832          0
contig67      6674          0

-----------------------------------------------------

The protein-coding gene dataset (16491 genes) was based on the Flye assembly and a transcriptome assembled with Trinity (v2.9.1) from four sets of paired Illumina RNA-Seq SRA datasets (SRR9042946, SRR9042947, SRR9042958, SRR9042959). Gene prediction was performed using an in-house pipeline based on BRAKER (v2) with increased attention to chimeric gene models and real intron boundaries. The gene set was then corrected using PASA.

The amino acid sequences of the 16491 proteins are located here: Thal_pseudonana_PASA_gene_predictions.fasta
Protein coding gene annotations are located in the following file: Thal_pseudonana_PASA_gene_predictions.gff
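
As a quick integrity check after download, the number of records in the two files can be compared against the 16491 genes stated above. The Python sketch below uses hypothetical helper names. It assumes a standard FASTA layout (one ">" header per sequence) and a tab-separated GFF with the feature type in the third column; the feature label "gene" is an assumption and may differ in the actual PASA output.

```python
import os


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_gff_features(path, feature_type="gene"):
    """Count GFF rows whose third column equals feature_type.

    Assumption: the feature label used in the file is 'gene'.
    """
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comment/directive lines and blank lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 3 and columns[2] == feature_type:
                count += 1
    return count


if __name__ == "__main__":
    fasta = "Thal_pseudonana_PASA_gene_predictions.fasta"
    gff = "Thal_pseudonana_PASA_gene_predictions.gff"
    if os.path.exists(fasta):
        print("proteins:", count_fasta_records(fasta))
    if os.path.exists(gff):
        print("gene features:", count_gff_features(gff))
```

If the protein count differs from 16491, the download is likely incomplete. A different gene-feature count may simply mean the GFF uses another feature label.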

-----------------------------------------------------
The raw fast5 MinION data have been deposited in the NCBI SRA database: accession SRX4617979.
The raw Illumina sequence data have been deposited in the NCBI SRA database: accession SRX4617978.