2. GS Reference Mapper
2.1 Overview of the GS Reference Mapper
2.2 GS Reference Mapper GUI
2.3 Launching the GS Reference Mapper GUI
2.4 Opening a Project
2.5 View Project Summary with the Overview Tab
2.6 Add/Remove Read Data and References with the Project Tab
2.7 Customize Project with the Parameters Tab
2.8 Computing the Mapping
2.9 Viewing Mapping Output with the Result Files Tab
2.10 Using the Variants Tab
2.11 Profile Tab
2.12 Viewing Reads Mapped to the Reference with the Alignment Results Tab
2.13 The Flowgrams Tab
2.14 Project Error Indicators
2.15 GS Reference Mapper Command Line Interface
2.16 GS Reference Mapper cDNA / Transcriptome Options
2.17 GS Reference Mapper Output
2.1 Overview of the GS Reference Mapper
The GS Reference Mapper application aligns sequencing reads against a reference consisting of one or more sequences or a GoldenPath genome, with or without associated annotations. It is an interactive application used to create mapping projects, add or remove reads from a project, specify reference sequences, annotations, and other project parameters, run the mapping algorithms on the project data, and view the output produced by the mapping computations. The application can be accessed through a Graphical User Interface (GUI) or a command line interface (CLI).
Input data can come from one or several regions of one or several Runs of interest. Additional read data (for example, from Sanger sequencing) can be imported from external files in one or more formats, including FASTA and FASTQ. Mapping generates consensus sequences of the reads that align against the reference and also computes statistics for variations found in the reads, relative to the reference. Data are output in a variety of file formats, including FASTA, ACE, BAM, or consed files.
The GS Reference Mapper application allows a user to:
• create mapping projects for genomic or cDNA reads
• add and remove read data sets from the project
• add or remove reference sequences and annotations
• specify mapping parameters
• specify special library preparations (such as the use of MIDs)
• run the mapping algorithms on the project data
• view the output produced by the mapping computation
When the mapping algorithms run, the software performs the following operations:
• For each read, search for its best alignment to the reference sequence(s) (a read may align to multiple positions in the reference); this is done in ‘nucleotide’ space.
• Perform multiple alignments for the reads that align contiguously to the reference in order to form “contigs.” From each contig’s multiple alignment, produce a consensus basecall sequence using the signals of the reads in the multiple alignment; this is done in ‘flowspace’.
• Identify subsets of the reads that vary relative to the reference to form lists of putative variations (nucleotide differences and structural variants). For each putative variation, the subset contains the reads that support it.
• Evaluate these lists of putative variations to identify High-Confidence nucleotide differences (HCDiffs), structural variations (HCStructVars), and larger-scale structural rearrangements (HCStructRearrangements).
• Output the following information:
    ◦ contig consensus sequence(s) and associated quality values
    ◦ alignments of the reads to the reference
    ◦ position-by-position metrics of the depth and consensus accuracy (quality values) for each position in the aligned reference
    ◦ the positions and alignments of identified differences
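As an illustration of two of the quantities listed above, the following minimal Python sketch computes a per-position depth and the fraction of reads that differ from the reference at one position. It is not the GS Reference Mapper implementation: the function names, the 0-based half-open `(start, end)` alignment representation, and the toy data are all assumptions made for this example.

```python
def depth_per_position(ref_length, alignments):
    """Count how many aligned reads cover each reference position.

    `alignments` is a list of (start, end) pairs on the reference,
    0-based with `end` exclusive (a hypothetical representation).
    """
    depth = [0] * ref_length
    for start, end in alignments:
        # Clip each alignment to the reference bounds before counting.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth


def variant_frequency(ref_base, read_bases):
    """Fraction of the reads covering one position that differ from the reference."""
    if not read_bases:
        return 0.0
    differing = sum(1 for base in read_bases if base != ref_base)
    return differing / len(read_bases)


if __name__ == "__main__":
    # Three toy reads aligned to a 10-base reference.
    toy_alignments = [(0, 5), (2, 8), (4, 10)]
    print(depth_per_position(10, toy_alignments))
    # prints [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]

    # Four reads cover a position whose reference base is A; two carry G.
    print(variant_frequency("A", ["A", "G", "G", "A"]))
    # prints 0.5
```

The criteria the software actually applies to call a difference High-Confidence are not described in this section, so the sketch stops at raw depth and frequency.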
Read overlaps and multiple alignments are made in ‘nucleotide’ space, while the consensus basecalling and quality value determination for contigs are performed in ‘flowspace’. Working in flowspace allows the averaging of processed flow signals (a continuous variable) at each nucleotide flow of the sequencing Run(s) and allows the use of information from the “negative flows”, i.e., flows where no nucleotide incorporation is detected. Using flowspace to determine the properties of the consensus sequence improves the accuracy of the final basecalls.
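To make the idea of averaging in flowspace concrete, here is a deliberately simplified Python sketch. It assumes the reads have already been aligned flow by flow, that each flow signal is roughly proportional to the homopolymer length, and that the flow order is T, A, C, G. The function names and signal values are hypothetical, and the software's actual basecalling model is not described here.

```python
def consensus_homopolymer_lengths(flow_signals):
    """Average per-flow signals across reads and round to integer lengths.

    `flow_signals` holds one list of floats per read, all of equal length
    and assumed to be aligned flow by flow.
    """
    n_reads = len(flow_signals)
    n_flows = len(flow_signals[0])
    lengths = []
    for i in range(n_flows):
        mean_signal = sum(read[i] for read in flow_signals) / n_reads
        # Round half up; a mean near 0 is a "negative flow" (no incorporation).
        lengths.append(max(0, int(mean_signal + 0.5)))
    return lengths


def flows_to_bases(flow_order, lengths):
    """Expand per-flow homopolymer lengths into a nucleotide string."""
    return "".join(base * n for base, n in zip(flow_order, lengths))


if __name__ == "__main__":
    flow_order = "TACG"  # assumed flow order for this toy example
    signals = [
        [1.1, 0.1, 2.6, 0.9],  # taken alone, this read would call 3 Cs
        [0.9, 0.0, 1.8, 1.1],
        [1.0, 0.2, 2.1, 1.0],
    ]
    lengths = consensus_homopolymer_lengths(signals)
    print(lengths)
    # prints [1, 0, 2, 1]
    print(flows_to_bases(flow_order, lengths))
    # prints TCCG
```

In this toy data the first read's C signal of 2.6 would round to three bases on its own, but the averaged signal across the three reads rounds to two. The second flow, with a mean near zero, also contributes evidence that no A is present at that point.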
• The GS Reference Mapper application can be run from the Attendant PC or a DataRig when using the GS Junior Instrument.
• The GS Reference Mapper application is not available on the GS FLX+ Instrument and must be run on a DataRig for users of this instrument.
The GS Reference Mapper allows users to create, modify, and run mapping projects. Both the GUI and the command line interface (CLI) provide this functionality. Projects may be set up to map all reads at once. Alternatively, incremental operation allows additional reads to be added to an existing mapping project. Results are written as output files whether the GUI or the CLI is used. The GUI also provides a graphical view of many of the mapping results, whether the mapping was performed using the GUI or the CLI.
The GS Reference Mapper application uses a folder on the file system to hold the mapping project information (whether the mapping is performed through the GUI application or through the newMapping and related commands) and to hold the output files generated during the mapping computation. The contents of this folder and the names of the files generated by the application are the same for any mapping project or output folder.
The operation of the GUI is described in several of the subsequent sections. A description of the CLI is then presented followed by a discussion of transcriptome mapping. Finally, output files produced by the GS Reference Mapper are described.