README
Thalassiosira pseudonana Genome Resequencing Project (May 2021)
-----------------------------------------------------

The raw fast5 Oxford Nanopore MinION data were basecalled using Albacore (v2.1.7). The 1D sequencing adapters were removed from the "pass" (q-score > 7) fastq files using Porechop (v0.2.3). Filtlong (v0.1.0) was used to create a subset of the long-read data, including reads ≥30 Kbp (--mean-q-weight=8), for 100x coverage of the expected genome size of ~35 Mbp based on Armbrust et al. (2004) and Bowler et al. (2008).

Nanopore long reads ≥30 Kbp were assembled using Flye (v2.3). The Flye genome assembly was corrected using two iterations of Racon (v1.3.1), followed by Nanopolish (v0.10.1). Final polishing was done using Illumina short read data via Pilon (v1.22). Note that the final nuclear genome assembly is an amalgamation of the two haplotypes.

The final polished Flye assembly of the T. pseudonana nuclear genome is located in the following file:
Thal_pseudonana_FLYE_assembly.fasta

Flye assembled contigs that were identified as the plastid genome are in the file:
Thal_pseudonana_FLYE_chloroplast_contigs.fasta

-----------------------------------------------------
FLYE ASSEMBLY STATISTICS

Total assembly length: 33.8 Mbp
Number of contigs: 52
Largest contig: 2761696 bp
Contig N50: 1.38 Mbp
Contig L50: 8
G+C content: 47%

Contig       Length (bp)  Telomere(s)  Telomere Sequence
contig4      2761696      0
contig10     2737527      0
contig12     2652838      2            TAACC(C), TA(G)CC(CC) ... TAGGGT, TAGG(G)AGT
contig6      2536014      1            TAGG(G)T
contig23     2056298      1            TAACC(C)
contig9b     1699792      1            TAGGGT
contig15     1505999      0
contig11     1384551      1            TAACCC
contig24     1216303      1            TAACCC
contig2      1188749      1            TAACCC
contig20     1112199      0
contig5      1055523      1            TAGGGT
contig1      1015641      0
contig8      950771       1            TAACC(C)
contig7      861658       0
contig26     810110       1            TAGGGT
contig34a    788881       0
contig3      733359       1            TAACCC
scaffold21   711435       0
contig18     692985       0
contig36     629637       0
contig90     577983       1            TAGGGT
contig9a     572364       1            TAGGGT
contig13     474235       1            TAACCC
contig29     422608       1            TAGGGT
contig27     350128       1            TAGGGT
contig16     332712       1            TAACC(C)
contig37     271849       1            TAGGGT
contig46     234828       0
contig19     174840       0
contig35     146685       1            TAGGGT
contig17     131800       0
contig33     125558       1            TAGG(G)AGT
contig70     120315       0
contig28     104872       1            TAGG(GGC)T
contig82     95520        0
contig40     91967        0
contig45     71322        0
contig14     70349        0
contig48     69073        1            TAACC(C)
contig30     66137        0
contig22     62249        0
contig34b    47647        0
contig47     23422        0
contig69     14904        1            TAACC(C)
contig49     13959        1            TAGGGT
contig57     13664        1            TAACC(CC)
contig31     12174        0
contig52     9414         1            TAACCC(C)
contig87     6940         0
contig61     6832         0
contig67     6674         0

-----------------------------------------------------

The protein coding gene dataset (16491 genes) was based on the Flye assembly and Trinity (v2.9.1) assembled transcriptome (four sets of paired Illumina RNA-Seq SRA datasets: SRR9042946, SRR9042947, SRR9042958, SRR9042959). Gene prediction was performed using an in-house pipeline based on BRAKER (v2) with increased attention to chimeric gene models and real intron boundaries. The gene set was then corrected using PASA.

The amino acid sequences of the 16491 proteins are located here:
Thal_pseudonana_PASA_gene_predictions.fasta

Protein coding gene annotations are located in the following file:
Thal_pseudonana_PASA_gene_predictions.gff

-----------------------------------------------------

The raw fast5 MinION data have been deposited in the NCBI SRA database: accession SRX4617979.
The raw Illumina sequence data have been deposited in the NCBI SRA database: accession SRX4617978.
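
-----------------------------------------------------

The contig N50 and L50 values listed under FLYE ASSEMBLY STATISTICS can be cross-checked against the contig length table. The Python sketch below is an illustration only and was not part of the original analysis. The function name n50_l50 and the variable CONTIG_LENGTHS are hypothetical; the lengths are copied from the table in this README. It assumes the usual definition of N50: the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length. L50 is the number of contigs needed to reach that point.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # This contig pushes the running total past half the assembly.
            return length, count
    raise ValueError("no contig lengths given")


# Contig lengths (bp) copied from the table in this README.
CONTIG_LENGTHS = [
    2761696, 2737527, 2652838, 2536014, 2056298, 1699792, 1505999,
    1384551, 1216303, 1188749, 1112199, 1055523, 1015641, 950771,
    861658, 810110, 788881, 733359, 711435, 692985, 629637, 577983,
    572364, 474235, 422608, 350128, 332712, 271849, 234828, 174840,
    146685, 131800, 125558, 120315, 104872, 95520, 91967, 71322,
    70349, 69073, 66137, 62249, 47647, 23422, 14904, 13959, 13664,
    12174, 9414, 6940, 6832, 6674,
]

if __name__ == "__main__":
    n50, l50 = n50_l50(CONTIG_LENGTHS)
    total = sum(CONTIG_LENGTHS)
    print(f"contigs: {len(CONTIG_LENGTHS)}")
    print(f"total length: {total / 1e6:.1f} Mbp")
    print(f"N50: {n50} bp, L50: {l50}")
```

With the lengths listed above, the N50 contig is contig11 (1384551 bp) at L50 = 8, in agreement with the reported 1.38 Mbp and 8. To run the same check on the assembly itself, the lengths would be read from Thal_pseudonana_FLYE_assembly.fasta instead of the hard-coded list.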