Models segmented copy ratios from denoised read counts and segmented minor-allele fractions from allelic counts
Possible inputs are: 1) denoised copy ratios for the case sample, 2) allelic counts for the case sample, and 3) allelic counts for a matched-normal sample. All available inputs will be used to perform segmentation and model inference.
If allelic counts are available, the first step in the inference process is to genotype heterozygous sites, as the allelic counts at these sites will subsequently be modeled to infer segmented minor-allele fraction. We perform a relatively simple and naive genotyping based on the allele counts (i.e., pileups), which is controlled by a small number of parameters (minimum-total-allele-count-case, minimum-total-allele-count-normal, genotyping-homozygous-log-ratio-threshold, and genotyping-base-error-rate). If the matched normal is available, its allelic counts will be used to genotype the sites, and we will simply assume these genotypes are the same in the case sample. (This can be critical, for example, for determining sites with loss of heterozygosity in high-purity case samples; such sites will be genotyped as homozygous if the matched-normal sample is not available.)
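The genotyping step can be illustrated with a minimal sketch. It assumes a simple binomial pileup model: the heterozygous likelihood uses an alternate-allele probability of 1/2, and the homozygous likelihood is averaged over base-error rates from zero up to the maximum, for both homozygous-reference and homozygous-alternate genotypes. The function names, the midpoint integration grid, and the exact form of the likelihoods are assumptions made for illustration and may differ from what the tool computes; only the default values are taken from the argument list.

```python
import math


def log_binomial_pmf(k, n, p):
    """Log-probability of k successes in n trials, for 0 < p < 1."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total) - math.log(len(log_values))


def homozygous_log_ratio(alt_count, total_count, max_error_rate=0.05, steps=200):
    """log[L(homozygous) / L(heterozygous)] under the assumed binomial model."""
    log_terms = []
    for i in range(steps):
        # Midpoint grid over (0, max_error_rate); assumed integration scheme.
        error = (i + 0.5) * max_error_rate / steps
        log_terms.append(log_binomial_pmf(alt_count, total_count, error))        # hom-ref
        log_terms.append(log_binomial_pmf(alt_count, total_count, 1.0 - error))  # hom-alt
    log_hom = log_mean_exp(log_terms)
    log_het = log_binomial_pmf(alt_count, total_count, 0.5)
    return log_hom - log_het


def is_heterozygous(alt_count, total_count, min_total_count=30,
                    log_ratio_threshold=-10.0, max_error_rate=0.05):
    """Keep a site as heterozygous if it passes the count filter and the
    homozygous log ratio falls below the threshold."""
    if total_count == 0 or total_count < min_total_count:
        return False
    ratio = homozygous_log_ratio(alt_count, total_count, max_error_rate)
    return ratio < log_ratio_threshold
```

Under these assumptions, a balanced site such as `is_heterozygous(50, 100)` is retained, while a site with no alternate reads such as `is_heterozygous(0, 100)` is filtered out as homozygous. Raising the threshold admits more sites as heterozygous, which matches the documented behavior of genotyping-homozygous-log-ratio-threshold.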
Next, we segment, if available, the denoised copy ratios and the alternate-allele fractions at the genotyped heterozygous sites. This is done using kernel segmentation (see KernelSegmenter). Various segmentation parameters control the sensitivity of the segmentation and should be selected appropriately for each analysis.
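One of the segmentation parameters, number-of-changepoints-penalty-factor, adds a term of the form A * C * [1 + log (N / C)] to the cost function for each chromosome. The sketch below evaluates that term. It takes N to be the number of data points in the chromosome and uses the natural logarithm; both are assumptions, as is the function name.

```python
import math


def changepoint_penalty(num_changepoints, num_data_points, penalty_factor=1.0):
    """Evaluate A * C * [1 + log(N / C)], taken to be 0 when C is 0."""
    if penalty_factor < 0:
        raise ValueError("penalty factor must be non-negative")
    if num_changepoints == 0:
        return 0.0
    ratio = num_data_points / num_changepoints
    return penalty_factor * num_changepoints * (1.0 + math.log(ratio))
```

For example, with 1000 data points and a factor of 2.0, ten changepoints cost about 112.1, so a larger factor makes each additional changepoint more expensive and yields a coarser segmentation.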
If both copy ratios and allele fractions are available, we perform segmentation using a combined kernel that is sensitive to changes in either quantity alone as well as to changes in both. In this case, however, we simply discard all allele fractions at sites that lie outside the available copy-ratio intervals (rather than imputing the missing copy-ratio data); these sites are filtered out during the genotyping step discussed above. This can have implications for analyses involving the sex chromosomes; see the comments in CreateReadCountPanelOfNormals.
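The combined kernel can be sketched from the argument descriptions: each data type gets a Gaussian kernel when its variance is positive and a linear kernel when its variance is zero, and the total kernel is K_CR + S * K_AF. The Gaussian parameterization exp(-(x - y)^2 / (2 * variance)), the representation of a data point as a (copy ratio, allele fraction) pair, and the function names are assumptions made for illustration.

```python
import math


def make_kernel(variance):
    """Gaussian kernel for positive variance, linear kernel for zero variance."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    if variance == 0:
        return lambda x, y: x * y
    return lambda x, y: math.exp(-((x - y) ** 2) / (2.0 * variance))


def make_combined_kernel(variance_cr=0.0, variance_af=0.025, scaling_af=1.0):
    """Build K_CR + S * K_AF; defaults mirror the documented argument defaults."""
    kernel_cr = make_kernel(variance_cr)
    kernel_af = make_kernel(variance_af)

    def combined(point1, point2):
        # Each point is an assumed (copy_ratio, allele_fraction) pair.
        (cr1, af1), (cr2, af2) = point1, point2
        return kernel_cr(cr1, cr2) + scaling_af * kernel_af(af1, af2)

    return combined
```

With the defaults, `make_combined_kernel()((2.0, 0.5), (3.0, 0.5))` evaluates to 7.0: the linear copy-ratio term contributes 6.0 and the Gaussian allele-fraction term contributes 1.0 because the allele fractions are identical.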
After segmentation is complete, we run Markov-chain Monte Carlo (MCMC) to determine posteriors for segmented models for the log2 copy ratio and the minor-allele fraction; see CopyRatioModeller and AlleleFractionModeller, respectively. After the first run of MCMC is complete, smoothing of the segmented posteriors is performed by merging adjacent segments whose posterior credible intervals sufficiently overlap according to specified segmentation-smoothing parameters. Then, additional rounds of segmentation smoothing (with intermediate MCMC optionally performed in between rounds) are performed until convergence, at which point a final round of MCMC is performed.
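The smoothing loop can be illustrated with a rough sketch. It assumes that each segment is represented by a list of posterior samples, that two adjacent segments "sufficiently overlap" when their posterior medians differ by no more than the threshold number of 10% equal-tailed credible-interval widths (summed over both segments), and that pooling samples stands in for the MCMC refit of a merged segment. All three are simplifying assumptions, and the function names are hypothetical; the tool's actual similarity criterion may differ.

```python
def summarize(samples, level=0.10):
    """Return (median, lower, upper) for an equal-tailed credible interval."""
    ordered = sorted(samples)

    def quantile(q):
        # Linear interpolation between neighboring order statistics.
        index = q * (len(ordered) - 1)
        lo = int(index)
        hi = min(lo + 1, len(ordered) - 1)
        frac = index - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    return quantile(0.5), quantile(0.5 - level / 2), quantile(0.5 + level / 2)


def are_similar(samples_a, samples_b, num_widths=2.0):
    """True if intervals widened to num_widths interval widths overlap."""
    center_a, lower_a, upper_a = summarize(samples_a)
    center_b, lower_b, upper_b = summarize(samples_b)
    reach_a = num_widths * (upper_a - lower_a)
    reach_b = num_widths * (upper_b - lower_b)
    return abs(center_a - center_b) <= reach_a + reach_b


def smooth(segments, num_widths=2.0, max_iterations=25):
    """Merge adjacent similar segments until nothing changes."""
    segments = [list(s) for s in segments]
    if not segments:
        return segments
    for _ in range(max_iterations):
        merged = [segments[0]]
        for samples in segments[1:]:
            if are_similar(merged[-1], samples, num_widths):
                merged[-1] = merged[-1] + samples  # pooled samples replace a refit
            else:
                merged.append(samples)
        if len(merged) == len(segments):
            break  # converged: no merges in this round
        segments = merged
    return segments
```

In this sketch, two neighboring segments with nearly identical posteriors collapse into one, while a segment whose posterior sits far away is left alone. Larger values of the threshold merge more aggressively.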
gatk ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--allelic-counts tumor.allelicCounts.tsv \
--normal-allelic-counts normal.allelicCounts.tsv \
--output-prefix tumor \
-O output_dir

gatk ModelSegments \
--denoised-copy-ratios normal.denoisedCR.tsv \
--allelic-counts normal.allelicCounts.tsv \
--output-prefix normal \
-O output_dir

gatk ModelSegments \
--allelic-counts tumor.allelicCounts.tsv \
--normal-allelic-counts normal.allelicCounts.tsv \
--output-prefix tumor \
-O output_dir

gatk ModelSegments \
--denoised-copy-ratios normal.denoisedCR.tsv \
--output-prefix normal \
-O output_dir

gatk ModelSegments \
--allelic-counts tumor.allelicCounts.tsv \
--output-prefix tumor \
-O output_dir
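The usage examples above differ only in which inputs are supplied, so a small helper can assemble the command from whatever files are available. This is a hedged sketch: the function name is hypothetical, it uses only flags that appear in this document, and the two validation rules (at least one case input, and no matched normal without case allelic counts) are assumptions about what the tool accepts.

```python
def build_model_segments_command(output_dir, output_prefix,
                                 denoised_copy_ratios=None,
                                 allelic_counts=None,
                                 normal_allelic_counts=None):
    """Return the gatk ModelSegments command as a list of arguments."""
    if denoised_copy_ratios is None and allelic_counts is None:
        raise ValueError("provide denoised copy ratios and/or allelic counts")
    if normal_allelic_counts is not None and allelic_counts is None:
        raise ValueError("normal allelic counts require case allelic counts")

    command = ["gatk", "ModelSegments"]
    optional_inputs = [
        ("--denoised-copy-ratios", denoised_copy_ratios),
        ("--allelic-counts", allelic_counts),
        ("--normal-allelic-counts", normal_allelic_counts),
    ]
    for flag, value in optional_inputs:
        if value is not None:
            command += [flag, str(value)]
    command += ["--output-prefix", output_prefix, "-O", output_dir]
    return command
```

The returned list can be passed to `subprocess.run` or joined with spaces for logging; building a list rather than a string avoids shell-quoting problems with unusual file names.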
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Arguments | | |
| --output, -O | null | Output directory. |
| --output-prefix | null | Prefix for output files. |
| Optional Tool Arguments | | |
| --allelic-counts | null | Input file containing allelic counts (output of CollectAllelicCounts). |
| --arguments_file | [] | Read one or more arguments files and add them to the command line. |
| --denoised-copy-ratios | null | Input file containing denoised copy ratios (output of DenoiseReadCounts). |
| --gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection. |
| --gcs-project-for-requester-pays | "" | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. |
| --genotyping-base-error-rate | 0.05 | Maximum base-error rate for genotyping and filtering homozygous allelic counts, if available. The likelihood for an allelic count to be generated from a homozygous site will be integrated from zero base-error rate up to this value. Decreasing this value will increase the number of sites assumed to be heterozygous for modeling. |
| --genotyping-homozygous-log-ratio-threshold | -10.0 | Log-ratio threshold for genotyping and filtering homozygous allelic counts, if available. Increasing this value will increase the number of sites assumed to be heterozygous for modeling. |
| --help, -h | false | Display the help message. |
| --kernel-approximation-dimension | 100 | Dimension of the kernel approximation. A subsample containing this number of data points will be used to construct the approximation for each chromosome. If the total number of data points in a chromosome is less than this number, then all data points in the chromosome will be used. Time complexity scales quadratically and space complexity scales linearly with this parameter. |
| --kernel-scaling-allele-fraction | 1.0 | Relative scaling S of the kernel K_AF for allele-fraction segmentation to the kernel K_CR for copy-ratio segmentation. If multidimensional segmentation is performed, the total kernel used will be K_CR + S * K_AF. |
| --kernel-variance-allele-fraction | 0.025 | Variance of Gaussian kernel for allele-fraction segmentation, if performed. If zero, a linear kernel will be used. |
| --kernel-variance-copy-ratio | 0.0 | Variance of Gaussian kernel for copy-ratio segmentation, if performed. If zero, a linear kernel will be used. |
| --maximum-number-of-segments-per-chromosome | 1000 | Maximum number of segments allowed per chromosome. |
| --maximum-number-of-smoothing-iterations | 25 | Maximum number of iterations allowed for segmentation smoothing. |
| --minimum-total-allele-count-case | 0 | Minimum total count for filtering allelic counts in the case sample, if available. The default value of zero is appropriate for matched-normal mode; increase to an appropriate value for case-only mode. |
| --minimum-total-allele-count-normal | 30 | Minimum total count for filtering allelic counts in the matched-normal sample, if available. |
| --minor-allele-fraction-prior-alpha | 25.0 | Alpha hyperparameter for the 4-parameter beta-distribution prior on segment minor-allele fraction. The prior for the minor-allele fraction f in each segment is assumed to be Beta(alpha, 1, 0, 1/2). Increasing this hyperparameter will reduce the effect of reference bias at the expense of sensitivity. |
| --normal-allelic-counts | null | Input file containing allelic counts for a matched normal (output of CollectAllelicCounts). |
| --number-of-burn-in-samples-allele-fraction | 50 | Number of burn-in samples to discard for allele-fraction model. |
| --number-of-burn-in-samples-copy-ratio | 50 | Number of burn-in samples to discard for copy-ratio model. |
| --number-of-changepoints-penalty-factor | 1.0 | Factor A for the penalty on the number of changepoints per chromosome for segmentation. Adds a penalty of the form A * C * [1 + log (N / C)], where C is the number of changepoints and N is the number of data points in the chromosome, to the cost function for each chromosome. Must be non-negative. |
| --number-of-samples-allele-fraction | 100 | Total number of MCMC samples for allele-fraction model. |
| --number-of-samples-copy-ratio | 100 | Total number of MCMC samples for copy-ratio model. |
| --number-of-smoothing-iterations-per-fit | 0 | Number of segmentation-smoothing iterations per MCMC model refit. (Increasing this will decrease runtime, but the final number of segments may be higher. Setting this to 0 will completely disable model refitting between iterations.) |
| --smoothing-credible-interval-threshold-allele-fraction | 2.0 | Number of 10% equal-tailed credible-interval widths to use for allele-fraction segmentation smoothing. |
| --smoothing-credible-interval-threshold-copy-ratio | 2.0 | Number of 10% equal-tailed credible-interval widths to use for copy-ratio segmentation smoothing. |
| --version | false | Display the version number for this tool. |
| --window-size | [8, 16, 32, 64, 128, 256] | Window sizes to use for calculating local changepoint costs. For each window size, the cost for each data point to be a changepoint will be calculated assuming that the point demarcates two adjacent segments of that size. Including small (large) window sizes will increase sensitivity to small (large) events. Duplicate values will be ignored. |
| Optional Common Arguments | | |
| --gatk-config-file | null | A configuration file to use with the GATK. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --tmp-dir | null | Temp directory to use. |
| --use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater). |
| --use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater). |
| --verbosity | INFO | Control verbosity of logging. |
| Advanced Arguments | | |
| --showHidden | false | Display hidden arguments. |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--allelic-counts
Input file containing allelic counts (output of CollectAllelicCounts).
File null

--arguments_file
Read one or more arguments files and add them to the command line.
List[File] []

--denoised-copy-ratios
Input file containing denoised copy ratios (output of DenoiseReadCounts).
File null

--gatk-config-file
A configuration file to use with the GATK.
String null

--gcs-max-retries, -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
int 20 [ [ -∞ ∞ ] ]

--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
String ""

--genotyping-base-error-rate
Maximum base-error rate for genotyping and filtering homozygous allelic counts, if available. The likelihood for an allelic count to be generated from a homozygous site will be integrated from zero base-error rate up to this value. Decreasing this value will increase the number of sites assumed to be heterozygous for modeling.
double 0.05 [ [ -∞ ∞ ] ]

--genotyping-homozygous-log-ratio-threshold
Log-ratio threshold for genotyping and filtering homozygous allelic counts, if available. Increasing this value will increase the number of sites assumed to be heterozygous for modeling.
double -10.0 [ [ -∞ ∞ ] ]

--help, -h
Display the help message.
boolean false

--kernel-approximation-dimension
Dimension of the kernel approximation. A subsample containing this number of data points will be used to construct the approximation for each chromosome. If the total number of data points in a chromosome is less than this number, then all data points in the chromosome will be used. Time complexity scales quadratically and space complexity scales linearly with this parameter.
int 100 [ [ 1 ∞ ] ]

--kernel-scaling-allele-fraction
Relative scaling S of the kernel K_AF for allele-fraction segmentation to the kernel K_CR for copy-ratio segmentation. If multidimensional segmentation is performed, the total kernel used will be K_CR + S * K_AF.
double 1.0 [ [ 0 ∞ ] ]

--kernel-variance-allele-fraction
Variance of Gaussian kernel for allele-fraction segmentation, if performed. If zero, a linear kernel will be used.
double 0.025 [ [ 0 ∞ ] ]

--kernel-variance-copy-ratio
Variance of Gaussian kernel for copy-ratio segmentation, if performed. If zero, a linear kernel will be used.
double 0.0 [ [ 0 ∞ ] ]

--maximum-number-of-segments-per-chromosome
Maximum number of segments allowed per chromosome.
int 1000 [ [ 1 ∞ ] ]

--maximum-number-of-smoothing-iterations
Maximum number of iterations allowed for segmentation smoothing.
int 25 [ [ 0 ∞ ] ]

--minimum-total-allele-count-case
Minimum total count for filtering allelic counts in the case sample, if available. The default value of zero is appropriate for matched-normal mode; increase to an appropriate value for case-only mode.
int 0 [ [ 0 ∞ ] ]

--minimum-total-allele-count-normal
Minimum total count for filtering allelic counts in the matched-normal sample, if available.
int 30 [ [ 0 ∞ ] ]

--minor-allele-fraction-prior-alpha
Alpha hyperparameter for the 4-parameter beta-distribution prior on segment minor-allele fraction. The prior for the minor-allele fraction f in each segment is assumed to be Beta(alpha, 1, 0, 1/2). Increasing this hyperparameter will reduce the effect of reference bias at the expense of sensitivity.
double 25.0 [ [ 1 ∞ ] ]

--normal-allelic-counts
Input file containing allelic counts for a matched normal (output of CollectAllelicCounts).
File null

--number-of-burn-in-samples-allele-fraction
Number of burn-in samples to discard for allele-fraction model.
int 50 [ [ 0 ∞ ] ]

--number-of-burn-in-samples-copy-ratio
Number of burn-in samples to discard for copy-ratio model.
int 50 [ [ 0 ∞ ] ]

--number-of-changepoints-penalty-factor
Factor A for the penalty on the number of changepoints per chromosome for segmentation. Adds a penalty of the form A * C * [1 + log (N / C)], where C is the number of changepoints and N is the number of data points in the chromosome, to the cost function for each chromosome. Must be non-negative.
double 1.0 [ [ 0 ∞ ] ]

--number-of-samples-allele-fraction
Total number of MCMC samples for allele-fraction model.
int 100 [ [ 1 ∞ ] ]

--number-of-samples-copy-ratio
Total number of MCMC samples for copy-ratio model.
int 100 [ [ 1 ∞ ] ]

--number-of-smoothing-iterations-per-fit
Number of segmentation-smoothing iterations per MCMC model refit. (Increasing this will decrease runtime, but the final number of segments may be higher. Setting this to 0 will completely disable model refitting between iterations.)
int 0 [ [ 0 ∞ ] ]

--output, -O
Output directory.
R String null

--output-prefix
Prefix for output files.
R String null

--QUIET
Whether to suppress job-summary info on System.err.
Boolean false

--showHidden
Display hidden arguments.
boolean false

--smoothing-credible-interval-threshold-allele-fraction
Number of 10% equal-tailed credible-interval widths to use for allele-fraction segmentation smoothing.
double 2.0 [ [ 0 ∞ ] ]

--smoothing-credible-interval-threshold-copy-ratio
Number of 10% equal-tailed credible-interval widths to use for copy-ratio segmentation smoothing.
double 2.0 [ [ 0 ∞ ] ]

--tmp-dir
Temp directory to use.
String null

--use-jdk-deflater, -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater).
boolean false

--use-jdk-inflater, -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater).
boolean false

--verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values: ERROR, WARNING, INFO, DEBUG.
LogLevel INFO

--version
Display the version number for this tool.
boolean false

--window-size
Window sizes to use for calculating local changepoint costs. For each window size, the cost for each data point to be a changepoint will be calculated assuming that the point demarcates two adjacent segments of that size. Including small (large) window sizes will increase sensitivity to small (large) events. Duplicate values will be ignored.
List[Integer] [8, 16, 32, 64, 128, 256]
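
As an illustration of the minor-allele-fraction-prior-alpha argument described in the list above, the Beta(alpha, 1, 0, 1/2) prior has the closed-form density 2 * alpha * (2 f)^(alpha - 1) on the interval [0, 1/2]. This follows from the standard four-parameter beta density with second shape parameter 1, lower bound 0, and upper bound 1/2. The sketch below evaluates it; the function name is hypothetical, and the way the tool uses this prior internally is not shown here.

```python
def minor_allele_fraction_prior_pdf(f, alpha=25.0):
    """Density of Beta(alpha, 1, 0, 1/2) at minor-allele fraction f."""
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    if f < 0.0 or f > 0.5:
        return 0.0  # no support outside [0, 1/2]
    return 2.0 * alpha * (2.0 * f) ** (alpha - 1.0)
```

With alpha = 1 the prior is uniform on [0, 1/2] (density 2.0 everywhere), while the default alpha = 25 gives a density of 50.0 at f = 1/2 and very little mass at small f. Concentrating prior mass near 1/2 in this way is consistent with the documented trade-off: balanced segments are pulled less by reference bias, at the cost of sensitivity to small deviations from 1/2.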
GATK version 4.1.0.0 built at Tue, 29 Jan 2019 22:20:41 -0500.