4.6 Serial I/O
4.6.1 Overview
4.6.2 Options for Serial I/O
4.6.3 Guidelines for Use
4.6.1 Overview
The serial I/O options invoke an optimized method of processing the read files during the later phases of the assembly or mapping process. In the assembler, the order of sequences required in these later phases is determined by the structure of the scaffolds created. In the mapper, the order of sequences required in the last phase is determined by the order of the references to which the reads map. In both cases, the order differs from the order of the sequences in the input read files. By default, random access I/O is used to obtain the sequences from the read files on disk while computing signals and producing the output files. That is, read operations are scattered across all of the input read files. For projects using a small number of input read files, this is an adequate approach. However, for larger projects (e.g. de novo assembly of large genomes) there may be hundreds of read files and tens of millions of sequences. Consequently, the inefficiency of random access file I/O is magnified many times and results in prohibitively long execution times for the I/O-intensive, later part of the assembly/mapping operation.
The approach used in serial I/O is to iterate over the scaffolds or references to create a list of the required sequences in the order in which they will be accessed. This list is then used to create a single read file containing all the reads in the order specified by the list. Doing so allows the use of more efficient sequential I/O operations when the sequences are actually read from disk. The costs of this approach are the memory footprint and disk space (up to three times the space for the input read files) required to implement the algorithms used in creating the sequential read file. For sufficiently large projects, these costs are far outweighed by the greater efficiency of the I/O operations enabled by this approach.
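The reordering idea described above can be illustrated with a minimal sketch. This is not the software's actual implementation: the function names (access_order, build_serial_reads), the in-memory lists standing in for read files, and the read IDs are all hypothetical. The in-memory sort also stands in for whatever disk-based method the real algorithm uses, which is presumably where the extra memory and disk space are spent.

```python
def access_order(scaffolds):
    """Return read IDs in the order the later phases would request them.

    Each scaffold (or reference) is modeled as a list of read IDs; a read
    is listed once, at the position where it is first needed.
    """
    order = []
    seen = set()
    for scaffold in scaffolds:
        for read_id in scaffold:
            if read_id not in seen:
                seen.add(read_id)
                order.append(read_id)
    return order


def build_serial_reads(input_files, order):
    """Scan every input file once, then emit the reads sorted by access rank.

    `input_files` is a list of files, each modeled as a list of
    (read_id, sequence) records. The result models the single reordered
    read file that can later be consumed front to back.
    """
    rank = {read_id: i for i, read_id in enumerate(order)}
    ranked = []
    for records in input_files:  # one sequential pass per input file
        for read_id, sequence in records:
            if read_id in rank:  # skip reads no scaffold requires
                ranked.append((rank[read_id], read_id, sequence))
    ranked.sort()  # ranks are unique, so this orders purely by access rank
    return [(read_id, sequence) for _, read_id, sequence in ranked]


# Hypothetical data: two input read files and two scaffolds.
input_files = [
    [("r1", "ACGT"), ("r2", "GGCA")],
    [("r3", "TTAG"), ("r4", "CCAT")],
]
scaffolds = [["r3", "r1"], ["r4", "r2"]]

serial = build_serial_reads(input_files, access_order(scaffolds))
```

In this sketch the inputs are only ever read in file order, and the output is only ever read in scaffold order, so neither step needs to seek between files. With random access I/O, the same scaffold order would instead require jumping between the two input files for almost every read.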