ENCSR356KRQ

ATAC-seq on primary keratinocytes in day 0.0 of differentiation

Pipeline version: v1.4.2

Report generated at 2019-07-16 13:31:57

Paired-end: [True, True]

Pipeline type: ATAC-Seq

Genome: hg38_google.tsv

Peak caller: MACS2

Alignment


Flagstat (raw BAM)

Metric                        rep1 (PE)    rep2 (PE)
Total                         576473539    739416534
Total (QC-failed)             0            0
Dupes                         0            0
Dupes (QC-failed)             0            0
Mapped                        574328804    738217348
Mapped (QC-failed)            0            0
% Mapped                      99.6300      99.8400
Paired                        276305638    338971378
Paired (QC-failed)            0            0
Read1                         138152819    169485689
Read1 (QC-failed)             0            0
Read2                         138152819    169485689
Read2 (QC-failed)             0            0
Properly Paired               242145586    298823586
Properly Paired (QC-failed)   0            0
% Properly Paired             87.6400      88.1600
With itself                   273774038    337183440
With itself (QC-failed)       0            0
Singletons                    386865       588752
Singletons (QC-failed)        0            0
% Singleton                   0.1400       0.1700
Diff. Chroms                  209673       290550
Diff. Chroms (QC-failed)      0            0

Marking duplicates (filtered BAM)

Filtered out (samtools view -F 1804):


Metric               rep1 (PE)    rep2 (PE)
Unpaired Reads       0            0
Paired Reads         88122254     105427549
Unmapped Reads       0            0
Unpaired Dupes       0            0
Paired Dupes         22623276     20176431
Paired Opt. Dupes    642560       907390
% Dupes/100          0.2567       0.1914
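
The value 1804 passed to `samtools view -F` is a bitmask over SAM flag bits. The sketch below decomposes it and recomputes the rep1 duplicate fraction from the table. The bit meanings follow the SAM specification; the helper name `decode_sam_flag` is made up for illustration and is not part of samtools or the pipeline.

```python
# SAM flag bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_sam_flag(flag):
    """Return the names of all SAM flag bits set in `flag` (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# `-F 1804` drops any alignment that has at least one of these bits set.
excluded = decode_sam_flag(1804)
for name in excluded:
    print(name)

# "% Dupes/100" in the table is Paired Dupes / Paired Reads (rep1 values).
rep1_dupe_fraction = 22623276 / 88122254
print(round(rep1_dupe_fraction, 4))
```

Read this way, `-F 1804` keeps only primary, QC-passed, non-duplicate alignments whose read and mate are both mapped.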

Library complexity (filtered non-mito BAM)

Metric                     rep1 (PE)    rep2 (PE)
Total Reads (Pairs)        83912684     99281933
Distinct Reads (Pairs)     64423966     83759127
One Read (Pair)            50177239     72137004
Two Reads (Pairs)          11061894     9758709
NRF = Distinct/Total       0.7678       0.8436
PBC1 = OnePair/Distinct    0.7789       0.8612
PBC2 = OnePair/TwoPair     4.5360       7.3921

Mitochondrial reads are filtered out.

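NRF: non-redundant fraction
PBC1: PCR bottleneck coefficient 1
PBC2: PCR bottleneck coefficient 2
PBC1 is the primary measure. Provisionally, 0-0.5 is severe bottlenecking,
0.5-0.8 is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, and 0.9-1.0
is no bottlenecking.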
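
The three library complexity ratios can be recomputed directly from the pair counts in the table above. This is a minimal sketch that uses only the ratios named in the table rows; the function name `library_complexity` is illustrative and not taken from the pipeline.

```python
def library_complexity(total, distinct, one_read, two_reads):
    """Compute NRF, PBC1 and PBC2 from pair counts, using the ratios in the table."""
    return {
        "NRF": distinct / total,        # Distinct / Total
        "PBC1": one_read / distinct,    # OnePair / Distinct
        "PBC2": one_read / two_reads,   # OnePair / TwoPair
    }


# Pair counts from the "Library complexity (filtered non-mito BAM)" table.
rep1 = library_complexity(83912684, 64423966, 50177239, 11061894)
rep2 = library_complexity(99281933, 83759127, 72137004, 9758709)

for name, metrics in (("rep1", rep1), ("rep2", rep2)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```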


Flagstat (filtered/deduped BAM)

Filtered and duplicates removed

Metric                        rep1 (PE)    rep2 (PE)
Total                         130997956    170502236
Total (QC-failed)             0            0
Dupes                         0            0
Dupes (QC-failed)             0            0
Mapped                        130997956    170502236
Mapped (QC-failed)            0            0
% Mapped                      100.0000     100.0000
Paired                        130997956    170502236
Paired (QC-failed)            0            0
Read1                         65498978     85251118
Read1 (QC-failed)             0            0
Read2                         65498978     85251118
Read2 (QC-failed)             0            0
Properly Paired               130997956    170502236
Properly Paired (QC-failed)   0            0
% Properly Paired             100.0000     100.0000
With itself                   130997956    170502236
With itself (QC-failed)       0            0
Singletons                    0            0
Singletons (QC-failed)        0            0
% Singleton                   0.0000       0.0000
Diff. Chroms                  0            0
Diff. Chroms (QC-failed)      0            0

Peak calling


IDR (Irreproducible Discovery Rate) plots

Plots (images not reproduced here): rep1-rep2, rep1-pr, rep2-pr, ppr

Reproducibility QC and peak detection statistics

The number of peaks is capped at 300K for peak-caller MACS2


Metric                    overlap      IDR
Nt                        271142       194344
N1                        262006       174428
N2                        265492       184698
Np                        276280       199429
N optimal                 276280       199429
N conservative            271142       194344
Optimal Set               ppr          ppr
Conservative Set          rep1-rep2    rep1-rep2
Rescue Ratio              1.0189       1.0262
Self Consistency Ratio    1.0133       1.0589
Reproducibility           pass         pass
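
The derived rows of the table above can be reproduced from the four peak counts. The sketch below makes three assumptions that this report does not state: both ratios are max/min, a ratio above 2 counts against the experiment (pass if neither exceeds 2, borderline if one does, fail if both do), and N optimal is the larger of Np and Nt. These follow ENCODE's published guidance as best understood here, and the function name is illustrative.

```python
def reproducibility(n1, n2, nt, np_):
    """Derive the summary rows from N1, N2, Nt and Np.

    Assumptions (not stated in this report): ratios are max/min, and a ratio
    above 2 counts as one strike against reproducibility.
    """
    rescue = max(np_, nt) / min(np_, nt)
    self_consistency = max(n1, n2) / min(n1, n2)
    strikes = int(rescue > 2) + int(self_consistency > 2)
    return {
        "N optimal": max(np_, nt),
        "N conservative": nt,
        "Rescue Ratio": rescue,
        "Self Consistency Ratio": self_consistency,
        "Reproducibility": ["pass", "borderline", "fail"][strikes],
    }


overlap = reproducibility(n1=262006, n2=265492, nt=271142, np_=276280)
idr = reproducibility(n1=174428, n2=184698, nt=194344, np_=199429)

for name, result in (("overlap", overlap), ("IDR", idr)):
    print(name, result)
```

With the counts in this report both ratios stay close to 1, which is consistent with the reported "pass" in both columns.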

Overlapping peaks


IDR (Irreproducible Discovery Rate) peaks


Enrichment


Strand cross-correlation measures

Performed on subsampled reads (25M)

Metric                      rep1        rep2
Reads                       25000000    25000000
Est. Fragment Len.          0           0
Corr. Est. Fragment Len.    0.3949      0.3776
Phantom Peak                70          75
Corr. Phantom Peak          0.3367      0.3303
Argmin. Corr.               1500        1500
Min. Corr.                  0.1905      0.2008
NSC                         2.0725      1.8805
RSC                         1.3984      1.3655

NOTE1: For SE datasets, reads from replicates are randomly subsampled.
NOTE2: For PE datasets, the first end of each read-pair is selected and the reads are then randomly subsampled.
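
NSC and RSC in the table above are ratios of the three correlation values listed there. The sketch below uses the commonly cited definitions (NSC = fragment-length correlation / minimum correlation; RSC = (fragment-length correlation - minimum) / (phantom-peak correlation - minimum)). These definitions are an assumption, since this report does not spell them out, and because the inputs are rounded to four decimals the results match the reported values only approximately.

```python
def nsc_rsc(corr_fragment, corr_phantom, corr_min):
    """Return (NSC, RSC) from cross-correlation values.

    Assumed definitions, not stated in this report:
      NSC = corr_fragment / corr_min
      RSC = (corr_fragment - corr_min) / (corr_phantom - corr_min)
    """
    nsc = corr_fragment / corr_min
    rsc = (corr_fragment - corr_min) / (corr_phantom - corr_min)
    return nsc, rsc


# Rounded correlation values from the table above.
rep1_nsc, rep1_rsc = nsc_rsc(0.3949, 0.3367, 0.1905)
rep2_nsc, rep2_rsc = nsc_rsc(0.3776, 0.3303, 0.2008)

print(f"rep1 NSC={rep1_nsc:.3f} RSC={rep1_rsc:.3f}")
print(f"rep2 NSC={rep2_nsc:.3f} RSC={rep2_rsc:.3f}")
```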


Cross-correlation plots (images not reproduced here): rep1, rep2

Fraction of reads in overlapping peaks

Metric                       rep1-rep2    rep1-pr    rep2-pr    ppr
Fraction of Reads in Peak    0.3580       0.3699     0.3395     0.3603


Fraction of reads in IDR peaks

Metric                       rep1-rep2    rep1-pr    rep2-pr    ppr
Fraction of Reads in Peak    0.3161       0.3189     0.2959     0.3200


ATAQC


Summary table

Metric                                                            rep1                                                  rep2
Genome                                                            GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz  GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/single-ended                                               Paired-ended                                          Paired-ended
Read length                                                       76                                                    76
Read count from sequencer                                         276305638                                             338971378
Read count successfully aligned                                   274160903                                             337772192
Read count after filtering for mapping quality                    230224388                                             276338606
Read count after removing duplicate reads                         207601112                                             256162175
Read count after removing mitochondrial reads (final read count)  130997956                                             170502236
Mapping quality > q30 (out of total)                              230224388, 0.833223634763                             276338606, 0.815226959959
Duplicates (after filtering)                                      22623276, 0.256726                                    20176431, 0.191377
Mitochondrial reads (out of total)                                19474557, 0.0339083759414                             28112577, 0.0380817073402
Duplicates that are mitochondrial (out of all dups)               6617292, 0.146249641299                               10082178, 0.249850382359
Final reads (after all filters)                                   130997956, 0.474105258757                             170502236, 0.502998916917
NRF = Distinct/Total                                              0.76775, out of range [0.8, inf]                      0.843649, OK
PBC1 = OnePair/Distinct                                           0.77886, out of range [0.8, inf]                      0.861244, OK
PBC2 = OnePair/TwoPair                                            4.536044, OK                                          7.392064, OK
Picard est library size                                           218626965                                             367870336
Fraction of reads in nfr                                          0.502667213295, OK                                    0.558274761995, OK
Nfr / mono-nuc reads                                              1.64013857555, out of range [2.5, inf]                1.99516853611, out of range [2.5, inf]
Presence of nfr peak                                              OK                                                    OK
Presence of mono-nuc peak                                         OK                                                    OK
Presence of di-nuc peak                                           OK                                                    OK
Naive overlap peaks                                               276280, OK                                            276280, OK
Idr peaks                                                         199429, OK                                            199429, OK
Naive peak stats: min size                                        73.0000                                               73.0000
Naive peak stats: 25 percentile                                   291.0000                                              291.0000
Naive peak stats: 50 percentile (median)                          514.0000                                              514.0000
Naive peak stats: 75 percentile                                   797.0000                                              797.0000
Naive peak stats: max size                                        3287.0000                                             3287.0000
Naive peak stats: mean                                            585.3929                                              585.3929
Idr peak stats: min size                                          73.0000                                               73.0000
Idr peak stats: 25 percentile                                     419.0000                                              419.0000
Idr peak stats: 50 percentile (median)                            632.0000                                              632.0000
Idr peak stats: 75 percentile                                     894.0000                                              894.0000
Idr peak stats: max size                                          3287.0000                                             3287.0000
Idr peak stats: mean                                              685.3150                                              685.3150
Tss enrichment                                                    18.8960                                               18.1123
Fraction of reads in universal dhs regions                        62416237, 0.483112362797                              75774611, 0.450253599697
Fraction of reads in blacklist regions                            1946, 1.50623732412e-05                               2366, 1.40587988882e-05
Fraction of reads in promoter regions                             20876818, 0.161590146353                              24906360, 0.147993874167
Fraction of reads in enhancer regions                             52561389, 0.406834151691                              66060635, 0.392533043911
Fraction of reads in called peak regions                          41196412, 0.318867283525                              49800995, 0.29591807825

Replicate 1

Sample Information

Sample
Genome GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended Paired-ended
Read length 76

Summary

Read count from sequencer 276,305,638
Read count successfully aligned 274,160,903
Read count after filtering for mapping quality 230,224,388
Read count after removing duplicate reads 207,601,112
Read count after removing mitochondrial reads (final read count) 130,997,956
Note that all these read counts are determined using 'samtools view'; as such,
they count every read found in the file, whether it is one end of a pair or a
single-end read. In other words, if your file is paired-end, you should divide
these counts by two. Each step follows the previous step; for example, the
duplicate reads were removed after reads were removed for low mapping quality.
The bar chart also shows the filtering process and where reads were lost along
the way. Because each step is sequential, there may have been more
mitochondrial reads that were already filtered out because of high duplication
or low mapping quality.
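
The sequential filtering described above can be tabulated as a simple funnel. The sketch below uses the replicate 1 read counts from this summary and shows the divide-by-two conversion for paired-end data at the first and last steps, where every read has a mate. The names `STEPS` and `funnel` are illustrative only.

```python
# Replicate 1 read counts in pipeline order (values from the summary above).
STEPS = [
    ("from sequencer", 276305638),
    ("successfully aligned", 274160903),
    ("after mapping-quality filter", 230224388),
    ("after duplicate removal", 207601112),
    ("after mitochondrial removal (final)", 130997956),
]


def funnel(steps):
    """Return (label, reads, fraction_of_first, fraction_of_previous) per step."""
    first = steps[0][1]
    previous = first
    rows = []
    for label, reads in steps:
        rows.append((label, reads, reads / first, reads / previous))
        previous = reads
    return rows


rows = funnel(STEPS)
for label, reads, of_first, of_previous in rows:
    print(f"{label:38s} {reads:>11d} {of_first:6.3f} {of_previous:6.3f}")

# Paired-end data: reads / 2 gives read pairs where both mates are present.
pairs_sequenced = STEPS[0][1] // 2
final_pairs = STEPS[-1][1] // 2
print(pairs_sequenced, final_pairs)
```

The final fraction of the first step corresponds to the "Final reads (after all filters)" fraction reported for replicate 1.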

Alignment statistics

Bowtie alignment log

138152819 reads; of these:
  138152819 (100.00%) were paired; of these:
    17080026 (12.36%) aligned concordantly 0 times
    75296892 (54.50%) aligned concordantly exactly 1 time
    45775901 (33.13%) aligned concordantly >1 times
    ----
    17080026 pairs aligned concordantly 0 times; of these:
      13825615 (80.95%) aligned discordantly 1 time
    ----
    3254411 pairs aligned 0 times concordantly or discordantly; of these:
      6508822 mates make up the pairs; of these:
        2144735 (32.95%) aligned 0 times
        452748 (6.96%) aligned exactly 1 time
        3911339 (60.09%) aligned >1 times
99.22% overall alignment rate
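
The overall alignment rate in the log above can be checked by hand: count every mate that aligned, whether as part of a concordant pair, a discordant pair, or on its own, and divide by the total number of mates. A minimal sketch using the replicate 1 numbers (variable names are illustrative):

```python
# Counts copied from the replicate 1 alignment log above.
total_pairs = 138152819
concordant_once = 75296892
concordant_multi = 45775901
discordant_once = 13825615
mates_once = 452748       # mates of unaligned pairs that aligned exactly 1 time
mates_multi = 3911339     # mates of unaligned pairs that aligned >1 times

# Each aligned pair contributes two mates; leftover mates are counted singly.
aligned_pairs = concordant_once + concordant_multi + discordant_once
aligned_mates = 2 * aligned_pairs + mates_once + mates_multi
overall_rate = aligned_mates / (2 * total_pairs)

print(aligned_mates, f"{overall_rate:.2%}")
```

The resulting mate count equals the "Read count successfully aligned" value in the replicate 1 summary.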

  

Samtools flagstat

576473539 + 0 in total (QC-passed reads + QC-failed reads)
300167901 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
574328804 + 0 mapped (99.63%:-nan%)
276305638 + 0 paired in sequencing
138152819 + 0 read1
138152819 + 0 read2
242145586 + 0 properly paired (87.64%:-nan%)
273774038 + 0 with itself and mate mapped
386865 + 0 singletons (0.14%:-nan%)
756774 + 0 with mate mapped to a different chr
209673 + 0 with mate mapped to a different chr (mapQ>=5)

  
Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.
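
To see why flagstat totals exceed the read counts, subtract the secondary and supplementary alignment records from the total. The sketch below parses a few lines of the replicate 1 flagstat output shown above; the parser is a hypothetical helper written for this layout, not a samtools API.

```python
import re

# A subset of the replicate 1 flagstat output shown above.
FLAGSTAT_REP1 = """\
576473539 + 0 in total (QC-passed reads + QC-failed reads)
300167901 + 0 secondary
0 + 0 supplementary
276305638 + 0 paired in sequencing
"""


def parse_flagstat(text):
    """Map each flagstat category label to its QC-passed count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if match:
            passed, _failed, label = match.groups()
            # Drop trailing parenthesised details such as "(QC-passed reads ...)".
            counts[label.split(" (")[0]] = int(passed)
    return counts


counts = parse_flagstat(FLAGSTAT_REP1)
# Primary records = all alignment records minus secondary and supplementary ones.
primary_records = counts["in total"] - counts["secondary"] - counts["supplementary"]
print(primary_records, counts["paired in sequencing"])
```

For replicate 1 the primary record count equals the "paired in sequencing" count, which is the number of reads from the sequencer.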

Filtering statistics

Mapping quality > q30 (out of total) 230,224,388 0.833
Duplicates (after filtering) 22,623,276 0.257
Mitochondrial reads (out of total) 19,474,557 0.034
Duplicates that are mitochondrial (out of all dups) 6,617,292 0.146
Final reads (after all filters) 130,997,956 0.474
Mapping quality refers to the quality of the read being aligned to that
particular location in the genome. A standard quality score is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often high in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 is a good library.
  

Library complexity statistics

ENCODE library complexity metrics

Metric Result
NRF 0.76775 out of range [0.8, inf]
PBC1 0.77886 out of range [0.8, inf]
PBC2 4.536044 - OK
The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.
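
The provisional PBC1 ranges above translate into a small lookup. The ranges as written share their endpoints, so this sketch assigns each boundary value to the better category (0.8 counts as mild, 0.9 as none); that tie-breaking rule is an assumption, and the function name is illustrative.

```python
def bottleneck_level(pbc1):
    """Classify PBC1 using the provisional ranges quoted above.

    Assumption: a value exactly on a boundary falls into the better category.
    """
    if not 0.0 <= pbc1 <= 1.0:
        raise ValueError(f"PBC1 must lie in [0, 1], got {pbc1}")
    if pbc1 < 0.5:
        return "severe"
    if pbc1 < 0.8:
        return "moderate"
    if pbc1 < 0.9:
        return "mild"
    return "none"


# PBC1 values reported for the two replicates.
print("rep1", bottleneck_level(0.77886))
print("rep2", bottleneck_level(0.861244))
```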

Picard EstimateLibraryComplexity

218,626,965

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.
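
The subsampling step described above can be illustrated with a toy complexity curve: take growing random subsets of the reads and count the distinct fragments in each. This is not Preseq's algorithm, and the extrapolation and confidence intervals are not reproduced. The toy library and all names are invented for illustration.

```python
import random


def distinct_curve(fragments, steps=5, seed=0):
    """Return (reads_sampled, distinct_fragments) for growing random subsamples."""
    rng = random.Random(seed)
    shuffled = list(fragments)
    rng.shuffle(shuffled)
    curve = []
    for i in range(1, steps + 1):
        n = len(shuffled) * i // steps
        # A prefix of one shuffled list is a random subsample of size n.
        curve.append((n, len(set(shuffled[:n]))))
    return curve


# Toy library: 1,000 distinct fragments, each sequenced 1 to 5 times.
toy_rng = random.Random(1)
toy_reads = [frag for frag in range(1000) for _ in range(toy_rng.randint(1, 5))]

curve = distinct_curve(toy_reads)
for sampled, distinct in curve:
    print(sampled, distinct)
```

As the subsample grows, the number of distinct fragments rises more and more slowly; a yield prediction extrapolates that flattening beyond the observed depth.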

Fragment length statistics

Metric Result
Fraction of reads in NFR 0.502667213295 - OK
NFR / mono-nuc reads 1.64013857555 out of range [2.5, inf]
Presence of NFR peak OK
Presence of Mono-Nuc peak OK
Presence of Di-Nuc peak OK
Open chromatin assays show distinct fragment length enrichments, as the cut
sites are only in open chromatin and not in nucleosomes. As such, peaks
representing different n-nucleosomal (e.g., mono-nucleosomal, di-nucleosomal)
fragment lengths will arise. Good libraries will show these peaks in a
fragment length distribution and will show specific peak ratios.
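
The two numeric fragment-length metrics can be sketched as counts over length bins. The cutoffs used here (nucleosome-free below 150 bp, mono-nucleosomal from 150 bp up to 300 bp) are assumptions chosen for illustration; this report does not state the boundaries the pipeline uses, and the function name is hypothetical.

```python
def fragment_length_metrics(lengths, nfr_max=150, mono_max=300):
    """Compute the NFR fraction and the NFR / mono-nucleosomal ratio.

    Assumed bins (not taken from this report):
      NFR:              length < nfr_max
      mono-nucleosomal: nfr_max <= length < mono_max
    """
    total = len(lengths)
    nfr = sum(1 for length in lengths if length < nfr_max)
    mono = sum(1 for length in lengths if nfr_max <= length < mono_max)
    return {
        "fraction_nfr": nfr / total,
        "nfr_to_mono": nfr / mono if mono else float("inf"),
    }


# Synthetic fragment lengths: 5 short, 2 mono-nucleosomal, 3 longer.
example = fragment_length_metrics([50] * 5 + [200] * 2 + [400] * 3)
print(example)
```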

Peak statistics

Metric Result
Naive overlap peaks 276280 - OK
IDR peaks 199429 - OK

Naive overlap peak file statistics

Min size 73.0
25 percentile 291.0
50 percentile (median) 514.0
75 percentile 797.0
Max size 3287.0
Mean 585.392912987

IDR peak file statistics

Min size 73.0
25 percentile 419.0
50 percentile (median) 632.0
75 percentile 894.0
Max size 3287.0
Mean 685.315034423
For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.
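
Size statistics like those listed above can be derived from peak intervals. The sketch below assumes half-open (start, end) coordinates as in BED files and uses linear-interpolation quantiles; whether the pipeline computes its percentiles the same way is not stated in this report. The function name and example intervals are made up.

```python
import statistics


def peak_width_stats(intervals):
    """Summarise peak widths for (start, end) intervals with end exclusive."""
    widths = sorted(end - start for start, end in intervals)
    q1, median, q3 = statistics.quantiles(widths, n=4, method="inclusive")
    return {
        "Min size": widths[0],
        "25 percentile": q1,
        "50 percentile (median)": median,
        "75 percentile": q3,
        "Max size": widths[-1],
        "Mean": statistics.fmean(widths),
    }


# Made-up intervals with widths 100, 200, 300, 400 and 500 bp.
peaks = [(0, 100), (1000, 1200), (5000, 5300), (9000, 9400), (20000, 20500)]
stats = peak_width_stats(peaks)
for key, value in stats.items():
    print(key, value)
```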

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS
enrichment is above 10. For other references, please see
https://www.encodeproject.org/atac-seq/
  

Annotated genomic region enrichments

Fraction of reads in universal DHS regions 62,416,237 0.483
Fraction of reads in blacklist regions 1,946 0.000
Fraction of reads in promoter regions 20,876,818 0.162
Fraction of reads in enhancer regions 52,561,389 0.407
Fraction of reads in called peak regions 41,196,412 0.319
Signal to noise can be assessed by considering whether reads fall into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all)
should fall into the promoter regions, and a high fraction (though not all)
should fall into the enhancer regions. The promoter regions should not take
up all reads, as open chromatin assays are known to be biased toward
promoters.

Comparison to Roadmap DNase

This bar chart shows the correlation between the Roadmap DNase samples and
your sample when the signal in the universal DNase peak region sets is
compared. The closer a Roadmap sample's signal distribution in these regions
is to that of your sample, the higher the correlation.

Replicate 2

Sample Information

Sample
Genome GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
Paired/Single-ended Paired-ended
Read length 76

Summary

Read count from sequencer 338,971,378
Read count successfully aligned 337,772,192
Read count after filtering for mapping quality 276,338,606
Read count after removing duplicate reads 256,162,175
Read count after removing mitochondrial reads (final read count) 170,502,236
Note that all these read counts are determined using 'samtools view'; as such,
they count every read found in the file, whether it is one end of a pair or a
single-end read. In other words, if your file is paired-end, you should divide
these counts by two. Each step follows the previous step; for example, the
duplicate reads were removed after reads were removed for low mapping quality.
The bar chart also shows the filtering process and where reads were lost along
the way. Because each step is sequential, there may have been more
mitochondrial reads that were already filtered out because of high duplication
or low mapping quality.

Alignment statistics

Bowtie alignment log

169485689 reads; of these:
  169485689 (100.00%) were paired; of these:
    20073896 (11.84%) aligned concordantly 0 times
    88534373 (52.24%) aligned concordantly exactly 1 time
    60877420 (35.92%) aligned concordantly >1 times
    ----
    20073896 pairs aligned concordantly 0 times; of these:
      16513261 (82.26%) aligned discordantly 1 time
    ----
    3560635 pairs aligned 0 times concordantly or discordantly; of these:
      7121270 mates make up the pairs; of these:
        1199186 (16.84%) aligned 0 times
        630984 (8.86%) aligned exactly 1 time
        5291100 (74.30%) aligned >1 times
99.65% overall alignment rate

  

Samtools flagstat

739416534 + 0 in total (QC-passed reads + QC-failed reads)
400445156 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
738217348 + 0 mapped (99.84%:-nan%)
338971378 + 0 paired in sequencing
169485689 + 0 read1
169485689 + 0 read2
298823586 + 0 properly paired (88.16%:-nan%)
337183440 + 0 with itself and mate mapped
588752 + 0 singletons (0.17%:-nan%)
1069062 + 0 with mate mapped to a different chr
290550 + 0 with mate mapped to a different chr (mapQ>=5)

  
Note that the flagstat command counts alignments, not reads. Please
use the read counts table to get accurate counts of reads at each
stage of the pipeline.

Filtering statistics

Mapping quality > q30 (out of total) 276,338,606 0.815
Duplicates (after filtering) 20,176,431 0.191
Mitochondrial reads (out of total) 28,112,577 0.038
Duplicates that are mitochondrial (out of all dups) 10,082,178 0.250
Final reads (after all filters) 170,502,236 0.503
Mapping quality refers to the quality of the read being aligned to that
particular location in the genome. A standard quality score is > 30.
Duplications are often due to PCR duplication rather than two unique reads
mapping to the same location. High duplication is an indication of poor
libraries. Mitochondrial reads are often high in chromatin accessibility
assays because the mitochondrial genome is very open. A high mitochondrial
fraction is an indication of poor libraries. Based on prior experience, a
final read fraction above 0.70 is a good library.
  

Library complexity statistics

ENCODE library complexity metrics

Metric Result
NRF 0.843649 - OK
PBC1 0.861244 - OK
PBC2 7.392064 - OK
The non-redundant fraction (NRF) is the fraction of non-redundant mapped reads
in a dataset; it is the ratio between the number of positions in the genome
that uniquely mapped reads map to and the total number of uniquely mappable
reads. The NRF should be > 0.8. The PBC1 is the ratio of genomic locations
with EXACTLY one read pair over the genomic locations with AT LEAST one read
pair. PBC1 is the primary measure, and the PBC1 should be close to 1.
Provisionally 0-0.5 is severe bottlenecking, 0.5-0.8 is moderate bottlenecking,
0.8-0.9 is mild bottlenecking, and 0.9-1.0 is no bottlenecking. The PBC2 is
the ratio of genomic locations with EXACTLY one read pair over the genomic
locations with EXACTLY two read pairs. The PBC2 should be significantly
greater than 1.

Picard EstimateLibraryComplexity

367,870,336

Yield prediction

Preseq performs a yield prediction by subsampling the reads, calculating the
number of distinct reads, and then extrapolating out to see where the
expected number of distinct reads no longer increases. The confidence interval
gives a gauge as to the validity of the yield predictions.

Fragment length statistics

Metric Result
Fraction of reads in NFR 0.558274761995 - OK
NFR / mono-nuc reads 1.99516853611 out of range [2.5, inf]
Presence of NFR peak OK
Presence of Mono-Nuc peak OK
Presence of Di-Nuc peak OK
Open chromatin assays show distinct fragment length enrichments, as the cut
sites are only in open chromatin and not in nucleosomes. As such, peaks
representing different n-nucleosomal (e.g., mono-nucleosomal, di-nucleosomal)
fragment lengths will arise. Good libraries will show these peaks in a
fragment length distribution and will show specific peak ratios.

Peak statistics

Metric Result
Naive overlap peaks 276280 - OK
IDR peaks 199429 - OK

Naive overlap peak file statistics

Min size 73.0
25 percentile 291.0
50 percentile (median) 514.0
75 percentile 797.0
Max size 3287.0
Mean 585.392912987

IDR peak file statistics

Min size 73.0
25 percentile 419.0
50 percentile (median) 632.0
75 percentile 894.0
Max size 3287.0
Mean 685.315034423
For a good ATAC-seq experiment in human, you expect to get 100k-200k peaks
for a specific cell type.

Sequence quality metrics

GC bias

Open chromatin assays are known to have significant GC bias. Please take this
into consideration as necessary.

Annotation-based quality metrics

Enrichment plots (TSS)

Open chromatin assays should show enrichment in open chromatin sites, such as
TSSs. An average TSS enrichment in human (hg19) is above 6. A strong TSS
enrichment is above 10. For other references, please see
https://www.encodeproject.org/atac-seq/
  

Annotated genomic region enrichments

Fraction of reads in universal DHS regions 75,774,611 0.450
Fraction of reads in blacklist regions 2,366 0.000
Fraction of reads in promoter regions 24,906,360 0.148
Fraction of reads in enhancer regions 66,060,635 0.393
Fraction of reads in called peak regions 49,800,995 0.296
Signal to noise can be assessed by considering whether reads fall into
known open regions (such as DHS regions) or not. A high fraction of reads
should fall into the universal (across cell type) DHS set. A small fraction
should fall into the blacklist regions. A high fraction (though not all)
should fall into the promoter regions, and a high fraction (though not all)
should fall into the enhancer regions. The promoter regions should not take
up all reads, as open chromatin assays are known to be biased toward
promoters.

Comparison to Roadmap DNase

This bar chart shows the correlation between the Roadmap DNase samples and
your sample when the signal in the universal DNase peak region sets is
compared. The closer a Roadmap sample's signal distribution in these regions
is to that of your sample, the higher the correlation.