An Amplicon Project is the main container of an Amplicon Sequencing experiment. In it, you specify the Reference Sequence(s) to which the sequencing reads will be compared in search of Variants; the Amplicon(s) that constitute the library(ies) you sequenced [and hence, the reads in the Read Data Set(s)]; the Variant(s) that you specifically want the software to search for and report on; and the Sample(s) that constitute the organizational basis for the analysis. If the Amplicon library(ies) contain Multiplex Identifiers (MIDs), the Project should further specify the MIDs used, and Multiplexers to define the relationship between MIDs and Samples. All these terms correspond to “elements” that constitute the Amplicon Project, and are further defined in the following sub-sections. The Project format allows the user to incrementally add new information (Read Data Sets, of course, but also Sample, Amplicon, Variant and even new Reference Sequence or MID/Multiplexer definitions) to a Project, e.g. as the sequencing results from new Runs/regions become available.
The basic definition of a Reference Sequence is quite straightforward: it is simply a string of A, T, G, C (or N) characters representing a DNA sequence against which the sequencing reads will be aligned and compared so variations can be identified and reported. The Reference Sequence(s) also provide the coordinates used to localize other elements defined in the Project (Amplicons and Variants; each Reference Sequence starts at coordinate “1”). You can define any number of Reference Sequences in a Project.
It is important to note that only “nucleotide” characters (A, T, G, C, or N) are accepted when you enter a Reference Sequence into the AVA software (by typing or pasting). For convenience, when pasting sequences, characters that are neither nucleotide characters nor IUPAC ambiguity characters (such as R for purine, Y for pyrimidine, etc.) are removed from the pasted entry. This is useful when pasting sequences from sources that may include non-sequence information (such as white space or numerical position information in the margin of each line). During such pastes, any IUPAC ambiguity characters are converted to “N” characters, as the other ambiguity characters are not supported by the software. Typing individual “ambiguity” characters, however, does not result in their conversion to “N”; these are simply ignored, and the text “Only ATGC and N” at the top of the Edit Sequence window turns bold and red to alert you that an invalid character was used. The restriction that no ambiguity characters other than N be present in a sequence is a requirement of many alignment algorithms and is not unique to the 454 Sequencing System software.
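The paste behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the AVA implementation: the function name `sanitize_pasted_sequence` is hypothetical, and the conversion of lower-case input to upper case is an assumption that the text above does not state.

```python
# Illustrative sketch only; names and case handling are assumptions.
NUCLEOTIDES = set("ATGCN")
IUPAC_AMBIGUITY = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N


def sanitize_pasted_sequence(text: str) -> str:
    """Apply the paste rule: keep A/T/G/C/N, turn other IUPAC codes into N,
    and drop everything else (digits, white space, punctuation)."""
    cleaned = []
    for ch in text.upper():  # assumption: lower-case input is accepted
        if ch in NUCLEOTIDES:
            cleaned.append(ch)
        elif ch in IUPAC_AMBIGUITY:
            cleaned.append("N")
        # any other character is silently removed
    return "".join(cleaned)


if __name__ == "__main__":
    # A line copied from a source with position numbers in the margin.
    print(sanitize_pasted_sequence("1 atgcr ytga 10\n"))  # ATGCNNTGA
```

Note that this sketch models pasting only; typed ambiguity characters are ignored rather than converted, as described above.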
The term Amplicon is used in the AVA software to represent essentially the same entity (sequence) as in the preparation of an Amplicon library, except that it does not include the 19 bp “Primer A” and “Primer B” parts of the Fusion Primers. Defined this way, Amplicons match the sequencing reads from the Read Data Set(s).
In the AVA software, however, an Amplicon is a virtual entity defined relative to a Reference Sequence by specifying two primers (the “template-specific” parts of the Fusion Primers). This relative definition is also directional: the AVA software names the two template-specific primers “Primer 1” and “Primer 2” in the 5’-Primer 1 --> Primer 2-3’ orientation of the Reference Sequence. Therefore, Amplicon orientation is internal to the AVA software, and is NOT dependent upon the “Primer A” and “Primer B” parts of the Fusion Primers used in library construction.
The term Target specifies the part of an Amplicon that is between the two primers (i.e., the non-primer portion of the Amplicon). This is the sequence that is actually aligned to the Reference Sequence during the computations. It is important to trim the primers before alignment because any variant found therein would be a reflection of primer design (or errors in primer synthesis) rather than representing variations in the DNA sample used to prepare the Amplicon library, and therefore would not have any biological significance.
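As a rough illustration of how an Amplicon and its Target relate to a Reference Sequence, the sketch below locates the two primers and returns the Target between them, using 1-based coordinates as in the Reference Sequence. It makes several assumptions that are not taken from the AVA software: the function names are hypothetical, primers are matched exactly (no mismatches tolerated), and Primer 2 is supplied as the primer's own 5’ to 3’ sequence, so that its reverse complement is what appears on the Reference Sequence strand.

```python
# Illustrative sketch only; exact matching and the Primer 2 convention
# (given 5'->3', found as its reverse complement) are assumptions.
_COMPLEMENT = str.maketrans("ATGCN", "TACGN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string of A/T/G/C/N."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_target(reference: str, primer1: str, primer2: str):
    """Return (start, end, target) for the region between the primers.

    start and end are 1-based, inclusive Reference Sequence coordinates.
    """
    p1_at = reference.find(primer1)
    if p1_at < 0:
        raise ValueError("Primer 1 not found in the Reference Sequence")
    target_start0 = p1_at + len(primer1)  # 0-based index of first Target base

    p2_site = reverse_complement(primer2)
    p2_at = reference.find(p2_site, target_start0)
    if p2_at < 0:
        raise ValueError("Primer 2 not found downstream of Primer 1")

    target = reference[target_start0:p2_at]
    return target_start0 + 1, p2_at, target


if __name__ == "__main__":
    ref = "TTAAACCGATTACAGGTTCTT"
    print(locate_target(ref, "AAACC", "GAACC"))  # (8, 14, 'GATTACA')
```

Only the returned Target would take part in the alignment; the primer regions are excluded for the reasons given above.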
A Read Data Set is a group of sequencing reads derived from an Amplicon library. In a Project, Read Data Sets exist within a Read Group (this helps to organize the data) and are associated with pairings of Amplicons and Samples:
In the current release of the AVA software, a Read Data Set is equivalent to an SFF file, e.g. as output by the data processing pipeline of the 454 Sequencing System, each file corresponding to a region of the PTP Device. On the GS Junior System, there is only one region per run, while on the GS FLX+ System, there can be two or more regions per run depending on the gasket format employed. Using the SFFTools (see Part C, Section 3) from the command line, a user may reorganize the SFF files into multiple separate files prior to importing them as Read Data Sets into a Project. More typically, the SFF files are taken as-is from the data processing pipeline. For the GS Junior System, there will therefore typically be one Read Data Set for each Amplicon sequencing Run you import into the Project; for Amplicon Sequencing performed on a GS FLX+ System, there will usually be one Read Data Set for each region of the PTP Device of the Run you import.
Simply put, a Variant is a sequence difference relative to a Reference Sequence. Like Amplicons, Variants are thus defined relative to a Reference Sequence. Four kinds of variations can be defined in the AVA software: substitutions, deletions, insertions, and required matches; and a defined Variant can include any number of these, in any combination (haplotypic variations). You can define any number of Variants in a Project, each associated with a specific Reference Sequence; you can also associate any number of Variants to a given Reference Sequence.
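One way to picture a haplotypic Variant is as a list of components that must all be observed in the same read. The sketch below is a hypothetical data model, not the AVA software's internal representation: the class names, the `kind` strings, and the way an aligned read is described (a dictionary of observed bases per reference position, plus a dictionary of insertions) are all assumptions made for illustration.

```python
# Hypothetical data model; none of these names come from the AVA software.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    kind: str        # "substitution", "deletion", "insertion" or "match"
    position: int    # 1-based Reference Sequence coordinate
    bases: str = ""  # new base, inserted bases, or base required to match


@dataclass
class Variant:
    name: str
    reference: str   # name of the Reference Sequence it is defined on
    components: list = field(default_factory=list)


def read_has_variant(variant, aligned, insertions):
    """Return True only if the read shows every component of the Variant.

    aligned:    {reference position: base seen in the read, "-" if deleted}
    insertions: {reference position: bases inserted after that position}
    """
    for c in variant.components:
        if c.kind in ("substitution", "match"):
            ok = aligned.get(c.position) == c.bases
        elif c.kind == "deletion":
            ok = aligned.get(c.position) == "-"
        elif c.kind == "insertion":
            ok = insertions.get(c.position) == c.bases
        else:
            raise ValueError(f"unknown component kind: {c.kind}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    hap = Variant("hap1", "Ref1", [
        Component("substitution", 10, "T"),
        Component("deletion", 15),
        Component("match", 20, "G"),
    ])
    print(read_has_variant(hap, {10: "T", 15: "-", 20: "G"}, {}))  # True
    print(read_has_variant(hap, {10: "T", 15: "A", 20: "G"}, {}))  # False
```

The second read carries the substitution but not the deletion, so it does not count toward the combined (haplotypic) Variant.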
Though the multiple alignment views of the AVA software show all variations between the reads displayed and their Reference Sequence, a Variant must be defined in the Project to be reported in the application’s Variants tab. Known Variants (e.g. from the scientific literature) can be defined directly in a Project, and putative substitution and deletion Variants will be automatically identified and defined by the AVA software if they are detected at a preset minimum abundance during computation of the Project; alignments of these putative Variants can be examined in detail, to allow you to formally “accept” them as legitimate Variants or “reject” them as noise. You can also define new Variants from the variations observed between the Reference Sequence(s) and the reads included in your Project.
The term Sample, in the context of the AVA software, can be defined very generically as a virtual “container” specified by the user only as a name (and an optional annotation), and used to group reads for analysis and reporting. The Samples thus represent the organizational foundation for the analysis, whose primary output is the Variants tab, such that the frequency of any or all defined Variants can be compared between the different “Samples” defined in the Project. You can define any number of Samples in a Project, each associated with one or more Read Data Sets and with one or more Amplicons. For example, the reads associated with one Sample could correspond to sequencing data from an Amplicon library prepared from a “control” DNA sample, and those associated with a second Sample, to a library prepared from the DNA of an “experimental” tissue or individual. Or, different Samples could correspond to multiple replicate libraries of a biological sample, e.g. to allow for statistical comparison between them.
Within a Read Data Set, reads may correspond to one or more Samples. In order to demultiplex the reads, i.e. assign them each to the proper Sample, the reads must contain reliably identifiable Sample-specific features. The AVA software can use either of two mechanisms to assign reads to Samples:
To perform the read-to-Sample assignments, the AVA software relies on user-specified, three-way associations between Read Data Sets – Samples – Amplicons (first mechanism), or Read Data Sets – Multiplexers – Amplicons (second mechanism). In the second case, the Multiplexers (see sections 1.1.1.7 and 1.1.1.8) provide the MID-to-Sample assignment information. Within one Read Data Set, a given Amplicon cannot belong to more than one such three-way association because the software would then be unable to unambiguously determine which association mechanism to use in order to assign reads from that Amplicon to their proper Samples.
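The uniqueness restriction on three-way associations can be expressed as a simple check. The sketch below is a hypothetical model, not code from the AVA software: `Association` and `find_conflicts` are illustrative names, and each association is reduced to plain strings.

```python
# Hypothetical model of the three-way associations; illustrative names only.
from collections import defaultdict
from typing import NamedTuple


class Association(NamedTuple):
    read_data_set: str
    kind: str      # "sample" (first mechanism) or "multiplexer" (second)
    name: str      # name of the Sample or Multiplexer
    amplicon: str


def find_conflicts(associations):
    """Return the (Read Data Set, Amplicon) pairs that appear in more than
    one association, together with the competing Samples/Multiplexers."""
    seen = defaultdict(set)
    for a in associations:
        seen[(a.read_data_set, a.amplicon)].add((a.kind, a.name))
    return {key: sorted(vias) for key, vias in seen.items() if len(vias) > 1}


if __name__ == "__main__":
    project = [
        Association("R1", "sample", "SampleA", "Amp1"),
        Association("R1", "multiplexer", "Mux1", "Amp1"),  # conflicts with above
        Association("R1", "sample", "SampleA", "Amp2"),
        Association("R2", "sample", "SampleB", "Amp1"),    # other Read Data Set: fine
    ]
    print(find_conflicts(project))
```

In this example only Amp1 in R1 is reported, because it is claimed both directly by a Sample and by a Multiplexer; the same Amplicon in a different Read Data Set is not a conflict.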
Once the read-to-Sample assignment is made, the AVA software can compute the prevalence of Variants found in the reads, broken out by Sample. These statistics are reported in the Variants tab (section 1.5). Be aware, however, that while you can examine Variant frequency statistics for all the Samples of the Project in the Variants tab, you can view read alignments of only one Sample at a time (e.g. in the Global Align tab).
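To make the idea of per-Sample prevalence concrete, the following sketch computes a simple proportion of reads carrying a Variant within each Sample. It is an assumption that a plain proportion is a fair stand-in here; the exact statistics shown in the Variants tab are described in section 1.5, and the function name and input layout below are hypothetical.

```python
# Illustrative sketch; a plain per-Sample proportion is an assumption.
from collections import Counter


def variant_frequencies(read_calls):
    """read_calls: iterable of (sample_name, has_variant) pairs, one per read
    already assigned to a Sample. Returns {sample_name: fraction of reads}."""
    totals, hits = Counter(), Counter()
    for sample, has_variant in read_calls:
        totals[sample] += 1
        if has_variant:
            hits[sample] += 1
    return {sample: hits[sample] / totals[sample] for sample in totals}


if __name__ == "__main__":
    calls = (
        [("control", True)] * 2 + [("control", False)] * 98
        + [("experimental", True)] * 15 + [("experimental", False)] * 85
    )
    print(variant_frequencies(calls))
    # {'control': 0.02, 'experimental': 0.15}
```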
An MID (or Multiplex Identifier) is a short, recognizable sequence tag that can be added to the design of the Adaptors used for library preparation, between the sequencing key and the template-specific primer, to help determine the provenance of the read (see section 4.6). Multiple Amplicon libraries (the Project’s Samples) can be prepared that include the same Amplicon target sequences (with the same template-specific primers), each labeled with different MID tags. The MID sequences provide extra context that, in concert with the template-specific primers, allows flexible demultiplexing options, and specifically enables the sequencing of the same Amplicon across multiple Samples within the same Read Data Set: when using MIDs, the Sample-Amplicon associations are indirectly specified in the software by associating Amplicons with Multiplexers (see section 1.1.1.8), which themselves specify the relationship between MIDs and Samples and then apply that information to the associated Amplicons. Note that both non-MID and MID-tagged Amplicons may be used in a Project, but within a given Read Data Set, all the reads for any individual Amplicon must be of one type or the other.
If multiple sets of MIDs are used in a laboratory, it may be useful to define MID Groups for each set, allowing them to be referred to as a group. A common grouping may be by length of the MID tags, because there is a restriction that all MIDs used at one end of any given Amplicon be the same length (see section 1.3.2.6). The AVA software is delivered with an MID Group named “454Standard”, containing 14 MIDs carefully chosen to be resilient to sequencing and primer synthesis errors.
A Multiplexer specifies the association between MIDs and Samples, i.e. how the MIDs should be used to assign reads to Samples. Depending on the design of the Amplicon libraries, Multiplexers allow four types of encoding (see section 4.6 for a description of Amplicon library design, in the context of MIDs):
• Primer 1 MID: This encoding provides an MID signature only on the end of the read that contains the template-specific primer defined as “Primer 1” in the Project. This will be at the beginning of the “forward” reads, or at the end of “reverse” (complemented) reads. These MIDs are then used to assign the reads to the proper Sample, as defined by the Multiplexer.
• Primer 2 MID: This encoding is the same as Primer 1 MID encoding, except that the MID appears at the “Primer 2” end of the Amplicons.
• Both: This encoding provides MIDs at both ends of the Amplicons and requires that read length be sufficient to read through to the distal MID, in both orientations. The paired combination of MIDs located on the Primer 1 and Primer 2 sides is used to assign reads to their proper Sample, as defined by the Multiplexer.
• Either: This encoding also provides MIDs at both ends of the Amplicons, but assigns the reads to their proper Sample on the basis of only the proximal MID on the read, in either orientation. This allows for proper assignment of both forward and reverse reads even if the Amplicon is longer than the read length provided by the sequencing Run script. Note that even if full read-through to the distal end of the read is possible, only the proximal MID will be used for Sample assignment (and any contradiction between the MIDs seen at the two ends will be assumed to be the effect of sequencing artifacts at the distal end of the read).
Selecting the proper encoding: It is crucially important to select the encoding method that truly corresponds to the way the libraries were prepared. For example, if a library was prepared with the ‘Either’ chemistry in mind, it may be tempting to use a ‘Primer 1 MID’ or ‘Primer 2 MID’ encoded Multiplexer, since the distal MID gets discounted in favor of the proximal MID in ‘Either’ encoding. However, the AVA software needs to know that MIDs are expected to be found at both ends: without that knowledge, the trimmer might get a suboptimal alignment of the distal primer, which in certain cases could drop valid reads out of the analysis.
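The four encodings differ only in which MID (or pair of MIDs) is looked up to find the Sample. The sketch below summarizes that decision under stated assumptions; it is not the AVA algorithm. The function name, the encoding strings, and the lookup-table layout are hypothetical, MID recognition itself is assumed to have happened already, and for ‘Either’ encoding a given MID is assumed to map to the same Sample regardless of the end at which it is seen.

```python
# Illustrative sketch of the four Multiplexer encodings; names are assumptions.
def assign_sample(encoding, mid_p1, mid_p2, forward, table):
    """Pick the Sample for one read.

    encoding: "primer1", "primer2", "both" or "either"
    mid_p1:   MID recognised at the Primer 1 end (None if not seen)
    mid_p2:   MID recognised at the Primer 2 end (None if not seen)
    forward:  True if the read starts at the Primer 1 end
    table:    {MID: Sample}, or {(MID at Primer 1, MID at Primer 2): Sample}
              for "both" encoding
    Returns the Sample name, or None if the read cannot be assigned.
    """
    if encoding == "primer1":
        key = mid_p1
    elif encoding == "primer2":
        key = mid_p2
    elif encoding == "both":
        # The pair of MIDs is needed, so the read must reach the distal MID.
        key = (mid_p1, mid_p2) if mid_p1 and mid_p2 else None
    elif encoding == "either":
        # Only the proximal MID counts, whatever is seen at the distal end.
        key = mid_p1 if forward else mid_p2
    else:
        raise ValueError(f"unknown encoding: {encoding}")
    return table.get(key) if key is not None else None


if __name__ == "__main__":
    single = {"MID1": "SampleA", "MID2": "SampleB"}
    paired = {("MID1", "MID2"): "SampleC"}
    print(assign_sample("either", "MID1", "MID2", True, single))   # SampleA
    print(assign_sample("either", "MID1", "MID2", False, single))  # SampleB
    print(assign_sample("both", "MID1", "MID2", True, paired))     # SampleC
    print(assign_sample("both", "MID1", None, True, paired))       # None
```

The first two calls show the ‘Either’ rule for a read whose two ends disagree: the proximal MID decides, so the forward and reverse cases yield different Samples. The sketch covers Sample lookup only and does not model the primer trimming that also depends on the encoding.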
Multiplexers specify the assignment of reads that contain each defined MID (or MID pair) to each specific Sample, within a Read Data Set. Different Amplicons within a Read Data Set may simultaneously be sequenced even if they use different Multiplexer encoding methods, or no encoding at all (i.e. are sequenced without the use of MIDs), but any given Amplicon can only be sequenced in a single manner within a given Read Data Set. In the software, Multiplexers are associated with Read Data Sets and then one or more Amplicons are associated with those Multiplexers, in the context of the Read Data Sets (creating Read Data Sets – Multiplexers – Amplicons triads). The software then assigns the reads from those Amplicons to Samples according to the rules of the Multiplexer encoding. Operationally, the same restriction exists regarding the association of Amplicons to Multiplexers as exists regarding the association of Amplicons to non-MID Samples (see section 1.1.1.6): a given Amplicon cannot belong to more than one Multiplexer within one Read Data Set, because the software would then be unable to unambiguously resolve which Multiplexer to use to determine the proper Sample assignment for the Amplicon reads.