\documentclass[10pt]{article}
\usepackage[hmargin=1.5cm,top=2cm,bottom=2cm]{geometry}
\usepackage{multicol}
\setlength\columnsep{15pt}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{array}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage[auth-sc]{authblk}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{enumerate}
\usepackage[labelfont=bf]{caption}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{mdframed}
\usepackage{graphics}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{array}
\usepackage{capt-of}
\usepackage{lscape}
\usepackage{caption}
\usepackage{breakurl}
\usepackage{todonotes}
\usepackage{hanging}
\usepackage{pagecolor}
\usepackage[final]{pdfpages}
\usepackage[leftFloats,CaptionAfterwards]{fltpage}
\usepackage[numbers,super,sort&compress]{natbib}
\setlength{\bibsep}{0pt plus 0.3ex}
\usepackage{abstract}
\usepackage{enumitem}
\usepackage{soul}
\usepackage{titlesec}
\titleformat{\section}[block]{\large\bfseries\filcenter}{\thesection.}{0.4em}{}
\titleformat{\subsection}[block]{\normalsize\sc\bfseries\filcenter}{\thesubsection.}{0.4em}{}
\titleformat{\subsubsection}[block]{\normalsize\sc\itshape\filright}{\thesubsection.}{0.4em}{}
\setcounter{secnumdepth}{5}

\usepackage[hang]{footmisc}
\setlength\footnotemargin{0em}

\setlength{\skip\footins}{0.75cm}

\makeatletter
\renewcommand\footnoterule{%
  \kern-3\p@
  \hrule\@width \textwidth height 1.5pt
  \kern2.6\p@}
\makeatother

\makeatletter
\def\@biblabel#1{\@ifnotempty{#1}{#1.}}
\makeatother

\newcommand{\filllastline}[1]{
\setlength\leftskip{0pt}
\setlength\rightskip{0pt}
\setlength\parfillskip{0pt}
#1}

\newenvironment{Figure}
{\par\medskip\noindent\minipage{\linewidth}}
{\endminipage\par\medskip}


\title{\bf Chromatin accessibility profiling methods}
\renewcommand\Authfont{\scshape\normalsize}
\author[1]{AUTHORS}
% \author[1]{Georgi K. Marinov}
\renewcommand\Affilfont{\itshape\normalsize}
\affil[1]{Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305}
% \affil[*]{These authors contributed equally}
\date{}

\begin{document}
\maketitle

% \centerline{}
% \centerline{}
\begin{abstract}

\noindent {\normalsize \textbf{}
}
\centerline{}
\centerline{}
\end{abstract}

\begin{multicols}{2}

\section*{Introduction}

Chromatin accessibility refers to the level of physical compaction of chromatin, a complex formed by DNA and associated proteins consisting mainly of histones and DNA-binding transcription factors (TF)\cite{Kornberg1974a,Kornberg1974b,Klemm2019}. Although eukaryotic chromatin is generally tightly packed into nucleosomes, which comprise $\sim$147 bp of DNA wrapped around an octamer of histones\cite{Luger1997,Woodcock1976a,Woodcock1976b} (Historical review in Trojer and Reinberg\cite{Troier2007}), chromatin accessibility is not spatiotemporally stable, nor uniform across the genome. Histones are typically depleted at genomic locations that interact with transcriptional regulators (e.g. TFs), such as at enhancers, promoters and other regulatory elements\cite{ENCODE2007,ENCODE2012,Lee2004,Thurman2012,Ozsolak2007,Sheffield2012}. Therefore, profiling chromatin accessibility on a genome-wide scale can serve as an excellent tool to identify all putative regulatory elements in a cell type or state. Note that not only the compaction but also post-translational modifications of the chromatin, including DNA methylation and histone tail acetylation, are dynamic and change in different cell states, and can reflect specific functions of genomic regions\cite{Turner2007,Suzuki2008}. Initial changes in accessibility may be facilitated by chromatin-binding TFs, specifically so-called pioneer factors, that thermodynamically outcompete histones and recruit ATP-dependent chromatin remodelers, allowing other TFs to co-bind and further stabilize the nucleosome depleted region and cooperatively regulate gene expression of target genes\cite{Zaret2011,Cirillo1999,Sherwood2014}. Consequently, the analysis of TF binding sites within accessible regions can bring insight into the master regulators and gene regulatory networks of the studied cell type.
 
Changes in the chromatin landscape as well as mutations in chromatin remodelers and non-coding mutations in accessible regions have been linked to a range of diseases\cite{Schwartzentruber2012,Hendrich2001,Vinagre2013,Matsumoto2010}. In fact, many causal GWAS variants are located in accessible regulatory elements (Maurano et al., 2012) and TF-bound DNA harbors increased mutation rates since TFs and DNA repair enzymes compete for damaged regulatory regions\cite{Moore2016,Sabarinathan2016}. In order to improve our understanding of chromatin dynamics during development and in disease contexts, since several years, researchers and large consortia such as the ENCODE Consortium and the NIH Roadmap Epigenomics Mapping Consortium have embarked on collecting and comparing chromatin landscapes across cell types and during disease development\cite{ENCODE2007,ENCODE2012,Roadmap2015,Yue2014}.

Over the past decades, we have witnessed the development and widespread use of several chromatin accessibility profiling methods\cite{Wu1979a,Wu1979b,Buenrostro2013,Corces2016,Corces2017,Boyle2008,Giresi2007,Schones2008,Taberlay2011,Kelly2012}. Generally, these methods are based on the physical accessibility to enzymes that fragment, tagment or methylate DNA in chromatin. Initial screens in the 1970s, showed that regions of active transcription were particularly sensitive to digestion by DNA endonucleases, indicating a more permissive form of the chromatin\cite{Weintraub1976}, and that the chromatin digested at regularly spaced sites due to nucleosome phasing\cite{Hewish1973,Kornberg1974a,Kornberg1974b}. Note that DNase I was, and still is, the reagent of choice for TF footprinting, which can determine the location of TF binding sites\cite{Jackson1985,Galas1978,Vierstra2020}. Technological advances have led to the first genome-wide assessments of accessible chromatin in 2008 by sequencing genomic DNA fragments following DNase I digestion, a technique referred to as DNase I hypersensitive site sequencing (DNase-seq)\cite{Boyle2008,Hesselberth2009}. This was followed by the development of a handful more genome-wide chromatin accessibility profiling methods\cite{Giresi2009,Schone2008,Buenrostro2013,Corces2017,Kelly2012}, of which Assay for Transposase-Accessible Chromatin using sequencing ATAC-seq (and variants)\cite{Buenrostro2013,Corces2016,Corces2017}, together with DNase-seq are the two most used chromatin accessibility profiling methods today. As these methods are high-throughput sequencing-based, the analysis of the generated omics data heavily relies on bioinformatics, not only for the initial processing but also to biologically interpret chromatin accessibility profiles and to perform more intricate downstream analyses. 

Importantly, as regulatory regions co-define a cell type, their accessibility is cell type-dependent, especially for distal regulatory regions\cite{Banerji1981,West2014,Thurman2012}. When investigating heterogeneous samples, it is therefore advisable to measure chromatin accessibility at a single-cell level as bulk methods will yield population-averaged accessibility profiles (Fig. 1). Currently, the field of single-cell omics, including single-cell epigenomics such as single-cell ATAC-seq (scATAC-seq), is booming due to the unprecedented opportunities to assess genome regulation in complex tissues such as the brain, whole embryos and tumors\cite{Fullard2018,Preissl2018,Cusanovich2018a,Cusanovich2018b,Al-Ali2019,Pijuan-Sala2020}. Accompanied with the rise of several single-cell chromatin accessibility profiling technologies, bioinformatic tools have been developed that allow analysis of the generated data which is intrinsically sparse \hl{XXX}.

Although chromatin accessibility profiling methods may serve as an analytic foundation to identify regulatory regions, it is reported that only around 10-26\% of accessible or predicted regulatory regions in human are in fact active in \cite{Kwasnieski2014}. In addition, linking (active) accessible regulatory regions to their target genes solely based on accessibility data remains elusive. Additional data, including transcriptomics, reporter assays and 3D chromatin architecture maps, especially when combined in a multi-omics fashion, may help (in the future) to determine an accessible region's functionality and putative target genes\cite{Kempfer2020} \hl{Hafez et al., 2017; Moore et al., 2020; Bravo et al., 2020; Sanyal et al., 2012; Ron et al., 2017}.

This Primer on chromatin accessibility profiling methods provides an overview of the most used and latest methods to profile chromatin accessibility, both on bulk and single-cell level. In addition, it provides an in-depth outline of bioinformatic analysis tools and examples of applications in diverse fields. Lastly, the Primer discusses standards for data deposition and examines currently unmet needs and future possibilities to increase our understanding of chromatin accessibility landscapes and their functional role in gene regulation during development, evolution and in disease contexts.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=18.5cm]{Fig1.png}
\end{center}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{
{\bf Broad overview of chromatin accessibility profiling approaches}. Regulatory elements and linker regions between positioned nucleosomes are characterized by increased susceptibility to enzymatic cleavage/modification, which property is the basis of all chromatin accessibility profiling methods, as it allows for the specific enrichment/labeling of such regions. Chromatin accessibility can be profiled both in bulk as well as at the single-cell level, by applying various strategies to index sequencing library fragments according to their cell of origin. \hl{Accessibility can also be profiled at the single-molecule level}.
} 
\label{Fig1}
\end{figure*}

\section*{Experimentation}

\subsection*{Experimental assays for analyzing bulk cell chromatin accessibility }

The emergence of next generation sequencing techniques (NGS) has revolutionized the way that chromatin is investigated. Chromatin accessibility is traditionally probed by assays such as deoxyribonuclease I (DNase) digestion, micrococcal nuclease (MNase), or restriction enzyme digestion, typically at a few selected genomic regions each time. Chromatin modifications and occupancy can be analyzed by immunoprecipitation (ChIP). The combination of these assays with NGS results in the ``seq'' techniques such as ChIP-seq\cite{Barski2007,Mikkelsen2007,Robertson2007,Johnson2007}, DNase-seq\cite{Boyle2008,Thurman2012}, ATAC-seq\cite{Buenrostro2013}, MNase-seq\cite{Schones2008}, which have enabled the analysis of chromatin states to a whole genome level. We briefly describe the principles, pros and cons of these techniques in this section.

\subsubsection*{DNase-seq}

DNase-seq analysis can and has been applied to both fresh cells or fixed samples\cite{Boyle2008,Thurman2012}. First, nuclei are isolated and permeabilized using mild detergent such as 0.1\% Triton X-100, so that DNase I can enter the nucleus efficiently. DNase I is an endonuclease that preferentially introduces double-stranded breaks in accessible chromatin\cite{Hesselberth2009}. Second, since DNase I digestion is a continuous process, it is necessary to titrate the amount of DNase I to achieve optimal activity when using a new type of cells, or when using DNase I from a different manufacturer or from a different batch. After digestion, the small DNA fragments (50-100 bp) are purified and size-selected for downstream library construction and sequencing.
 
Major limitations of the traditional DNase-seq include the large number of cells (millions) as input materials and its tedious and lengthy procedures\cite{Song2010}. Recently, a modified DNase-seq assay (scDNase-seq) has been developed to analyze a small number of or even single cells\cite{Jin2015,Lu2016}. The scDNase-seq requires only hundreds to thousands of either fresh or fixed cells for a bulk cell assay and less than two days for library construction, without the need of fractionation of DNA fragments in the procedure\cite{Cooper2017}.   
 
Caution must be taken when interpreting DNase-seq results because they show some intrinsic bias in cleavage sites\cite{Suck1988,Lazarovici2013}, which should be considered when interpreting the footprint of transcription factor\cite{He2014}. Besides, the biological significance of potential regulatory elements identified by this assay, they preferably should be interpreted in combination with other assays\cite{Song2010}.

\subsubsection*{ATAC-seq}

Assay for Transposase Accessible Chromatin Sequencing (ATAC-seq) emerged as an alternative assay to investigate accessible chromatin profiles\cite{Buenrostro2013}. In this assay, a genetically engineered hyperactive DNA transposase (Tn5) transposes preloaded monovalent mosaic end (ME) adapters to accessible or nucleosome-depleted chromatin regions and tags the DNA with the ME sequence simultaneously\cite{Goryshin1998,Adey2010,Buenrostro2013}. The target DNA fragments are purified, PCR-amplified, and sequenced by NGS platforms. ATAC-seq detected sequences are highly enriched in DHSs and ATAC-seq has been applied to analyze changes of chromatin accessibility in numerous disease or developmental systems\cite{Wu2016,Wu2018,Qu2017}. Similar to ATAC-seq, a technique termed as TrAC-looping, which utilizes Tn5 and a bivalent ME adaptor, also detects genome-wide chromatin accessibility in addition to providing genome-wide chromatin interaction information on regulatory regions\cite{Lai2018}.
 
ATAC-seq and its variants\cite{Corces2017,Corces2016} is a sensitive assay that works well on low-input samples(for example 500-50,000 cells) and requires a simplified library preparation procedure due to the simultaneous chromatin fragmentation and tagging. Although, the original ATAC-seq protocol works most efficiently with fresh cells and slowly cooled cryopreserved cells, it is possible to generate high signal-to-background profiles from snap-frozen pellets using the improved Omni-ATAC protocol\cite{Corces2017}.  

Some limitations of ATAC-seq are related to the intrinsic properties of Tn5: (1) it shows steric hindrance and sequence bias in chromatin tagmentation\cite{Adey2010,Sato2019,Meyer2014}, which would be a challenge for the mapping resolution on both chromatin accessibility and transcription factors footprint. (2) The contamination from mitochondrial DNA can also increase the sequencing costs, especially for some cell types which have large amount of mitochondrial DNA. However, this contamination can be significantly reduced either by improved lysis condition (as is the case in Omni-ATAC\cite{Corces2017}) or by applying the clustered regularly interspaced short palindromic repeats (CRISPR) technology to cleavage the mitochondrial ribosomal DNA prior to the experiment\cite{Wu2016,Montefiori2017}.

\subsubsection*{ATAC-seq derivatives}

The versatility provided bu the simultaneous fragmentation and adapter attachment in the ATAC procedure has allowed the development of several derivatives of the assay, which employ modified versions of the Tn5 adapter sequences.

One deficiency of the original procedure is that half of all fragments are lost due to the fact that they contain two adapter sequences of the same kind (i.e. AA and BB, while it is only AB fragments that can successfully PCR amplify). The THS-seq version of ATAC-seq attempts to rescue the other half of fragments by utilizing a T7 RNA Polymerase linear amplification protocol\cite{Sos2016}. 

Several protocols now exist for simultaneous profiling of accessibility and methylation (EpiMethylTag\cite{Lhoumaud2019}, methyl-ATAC-seq\cite{Spektor2019}, and ATAC-Me\cite{Barnett2020}), while repli-ATAC-seq measures accessibility on newly synthesized chromatin fibers\cite{Stewart-Morgan2019}. 
 
\subsubsection*{MNase-seq}

Nucleosome position and occupancy in the genome play key roles in chromatin accessibility.   MNase is an endo-exonuclease that cleaves the DNA regions without nucleosome protection and leaves the nucleosome core particles undigested, which can be purified, ligated to adaptors, PCR-amplified and sequenced by NGS to reveal genome-wide nucleosome positions (MNase-seq\cite{Schones2008}).
 
In MNase-seq, 10,000 to 100,000 of either fresh or formaldehyde crosslinked cells are used for library construction. Digestion of chromatin by MNase typically results in a nucleosome ladder consisting of mononucleosome, dinucleosome, trinucleosome etc., depending on the concentration of MNase in the reaction. The optimal range of digestion usually leads to about 70-80\% mononucleosomes and 20-30\% higher nucleosome ladders\cite{Schones2008}.

MNase-seq has been applied to investigate the dynamics of nucleosome landscape and their function in transcriptional regulation\cite{Lai2017}. However, since the nucleosome position and occupancy revealed by MNase-seq are based on the average profile of a large number of cells, caution should be taken when interpreting the results, particularly at inactive chromatin regions\cite{Lai2018b}.

% (4) ChIP-seq
 
% The N terminal tails of core histones are enriched with various covalent modifications, which serve as the docking sites for many chromatin-binding proteins. Chromatin immunoprecipitation and sequencing (ChIP-seq) is developed to analyze the occupancy of chromatin-binding factors, as well as the landscapes of various histone modifications at a genome-wide level (PMID: 17512414, 17603471, 17540862, 17558387)
 
% In ChIP-seq, chromatin in formaldehyde fixed or non-fixed cells is fragmented to a range from 100bp to 500bp by sonication or enzymatic digestion (PMID: 10694875, 10579938, 12893176). Using specific antibodies, the target proteins or histone modifications are captured along with the associated DNA fragments by protein A/G coupled agarose beads or magnetic beads. Then, the DNA fragments are eluted from beads, end-repaired, ligated to adaptors, PCR-amplified and sequenced by NGS.
 
% Traditionally, ChIP-seq needs hundreds of thousands of cells for profiling histone modifications and millions of cells for profiling transcription factors. The ChIP-seq data quality critically depends on the antibody specificity and the efficiency of chromatin fragmentation. Moreover, the whole procedure for ChIP-seq is time-consuming and laborious. In the past decade, several ChIP-seq derivatives are developed that work with a lower number of cell input and detect TF binding at higher resolution (PMID: 27626382, 27626377, 25607992, 2628033, 25103404, 23352811). In addition, techniques without the need of prior chromatin fragmentation became available for profiling chromatin modification and TF occupancy on chromatin using hundreds or thousands of cells (PMID: 31431618, 30923384, 28079019, 31036827, 31471188, 30532068), which mainly replaced the chromatin fragmentation and immunoprecipitation steps of ChIP-seq with one step of antibody guided MNase cleavage or Tn5 tagmentation of chromatin to simplify the procedure of library construction.

\subsubsection*{Chemical mapping of nucleosome positioning}

While MNase is a powerful and straightforward method for mapping nucleosome positions, it suffers in accuracy from the random and imprecise nature of digestion. To address these limitations, chemical approaches for direct mapping of nucleosome positions have been developed. The first such method is based on replacing endogenous histone H4 genes with a H4S47C protein variant. The cysteine in position 47 is located close to the nucleosome center position and can be chemically modified and, using copper and hydrogen peroxide catalysis, used to trigger the cleavage of the DNA backbone close to it\cite{Flaus1996}. This method was first used to precisely map nucleosome positions in the budding yeast \textit{S. cerevisiae}\cite{Brogaard2012}, and more recently in mouse embryonic stem cells\cite{Voong2016}, though its application is somewhat limited in more complex eukaryotes by the large number of copies of histone genes.

A more recent conceptually similar approach relies on the H3Q85C mutation, which generates cleavage at positions close to the nucleosome flanks\cite{Chereji2018}

\subsection*{Single-molecule chromatin accessibility profiling}

An important emerging class of methods aim to map chromatin accessibility within single molecules. The major advantage of such an approach is that it does not rely on enrichment and provides information about the distribution of accessibility states within the population of chromatin fibers. The assays in this class rely on methyltransferase enzymes that preferentially modify accessible DNA. For years, the only readout that such methods could rely on was bisulfite conversion of DNA followed by Sanger (for localized analysis of particular loci) and later Illumina (for both local and genome-wide coverage) sequencing, which also dictated the enzymes used to modify DNA. The first genome-wide assay of this kind was NOMe-seq\cite{Kelly2012}, which uses a m5C methyltransferase that modifies cytosines in a GpC context. 

As mammalian genomes, as well as the genomes of many other eukaryotes contain abundant endogenous methylation in CpG context, and no non-specific m5C methyltransferases are available, this is the only modification that can be used in mammals. This presents a serious limitation, as GpC nucleotides are rare in the genome, only found once every 20 to 30 bp, with much larger stretches of sequence with no informative positions at all being quite common. However, in species such as yeast and \textit{Drosophila}, which lack endogenous methylation, a combination of both a GpC and a CpG methyltransferase can be used, which increases assay resolution down to $\sim$10 bp, a method termed dSMF\cite{Krebs2017} (digital Single Molecule Footprinting). This approach has proven to be very powerful in enumerating the distinct functional states of individual regulatory elements\cite{Krebs2017}, down to the ability to footprint the occupancy of individual transcription factors \hl{CITE RECENT PAPER}. 

Still, there are limitations -- such footprinting is again only possible where there are sufficiently many informative positions, and these are inherently rare when relying on m5C modification in dinucleotide context, and it only provides information about the state of individual molecules within at most 600 bp (the current limit of combined paired-end read length for Illumina sequencing). 

The latter issue has been resolved by the advent of long-read sequencing platforms such as PacBio and Oxford Nanopore, which are capable of reading modified bases directly within individual long molecules, though with significantly decreased accuracy compared to bisulfite sequencing. The nanoNOMe\cite{nanoNOMe}/MeSMLR-seq\cite{MeSMLR-seq} assay uses GpC methylation and nanopore sequencing to map accessibility on a multikilobase scale, though it is still limited in its resolution by available informative positions. 

That limitation has been overcome by taking advantage of the ability of long read platforms to read any modification, and the use of non-specific methyltransferases, such the m$6$A depositing enzyme EcoGII combined with nanopore or PacBio sequencing, either on total genomic DNA (SMAC-seq\cite{SMAC,Fiber-seq2020}) or in combination with a phasing MNase digestion step (SAMOSA\cite{SAMOSA}). The large number of informative positions allows for fine-scale footprinting almost everywhere in the genome, subject to the limitations imposed by the higher error rate of single-molecule sequencing.

\subsection*{Other methods}

A variety of other methods have been developed to probe accessibility. 

An orthogonal approach to obtain absolute occupancy/protection values along the genome, though not at the single molecule level, is based on the susceptibility of accessible DNA to restriction enzyme cleavage\cite{Almer1986,Gregory1999}. It has been adapted to a high-throughput sequencing format in the form of assays such as NA-seq\cite{Gargiulo2009}, RED-seq\cite{RED-seq}, qDA-seq\cite{Chereji2019}, and ORE-seq\cite{Oberbeckmann2019}, and has been used to estimate absolute accessibility levels in yeast and mammalian genomes.

Nucleosome positioning has now also been probed using long-read methods too, which allow the mapping of the ends of larger nucleosome arrays rather than the single, di-, or at most trinucleosomes measurable with short reads.

NicE-seq\cite{Ponnaluri2017} uses a nicking enzyme to probe accessible DNA. 

FAIRE-seq\cite{Giresi2007,Giresi2009} was based on the preferential release of accessible chromatin during sonication of crosslinked cells. Protect-seq\cite{Spracklin2020} was recently developed to assay the inverse of accessible chromatin, strongly heterochromatinized genomic regions, based on their resistance to nuclease digestion.

DIVA\cite{Tchasovnikarova2017,Timms2019} utilizes the preferential viral insertion into accessible DNA to map open chromatin regions.

CATaDa\cite{CATaDa} labels open chromatin using ectopic expression of the \textit{E. coli} Dam methyltransferase.

Finally, reactive small molecules have also been applied to probe the fine-grained features of accessibility, such as DMS (DMS-seq\cite{DMS-seq}) and MPE (MPE-seq\cite{MPE-seq}).

\subsection*{Single-cell chromatin accessibility profiling}
 
Innovation in barcoding and microfluidic strategies have recently enabled high-throughput biochemical profiling of chromatin accessibility at single-cell resolution, including scDNase-seq\cite{Jin2015}, scMNase-seq\cite{Lai2018b}) and scATAC-seq \hl{XXXXCITEXXX}. Of these protocols, scATAC-seq has emerged as a popular and relatively simple approach to profile chromatin accessibility across hundreds to thousands of individual cells. Current scATAC-seq methods rely on either microfluidic or fluorescence cytometrical/plate-based partitioning to uniquely label nuclei in isolation. Procedures characteristic to both flavors of scATAC-seq, as well as consideration for experimental design, are described below:
 
\subsubsection*{Microfluidic scATAC-seq}
 
Droplet-based single-cell partitioning via microfluidic devices has emerged as a powerful approach for single-cell data generation owing to its reproducibly and relative ease of use. 

The initial version of scATAC-seq employed the Fluidigm platform\cite{Buenrostro2015,Buenrostro2018}, which allowed the profiling a few hundred cells. Higher throughput platforms have emerged since then and become widely used, such from 10X Genomics\cite{Satpathy2019} (Chromium Next Gem Single Cell ATAC-seq Library Kit; 1000176) and BioRad\cite{Lareau2019} (SureCell ATAC-seq Library Preparation Kit; 17004620), provide all required reagents necessary to produce scATAC-seq libraries. However, these commercial applications require the acquisition of proprietary robotic sample processing devices (Chromium Controller, 10X Genomics and ddSEQ single-cell isolator, BioRad) that are non-standard in most laboratories. 
 
Microfluidic-based scATAC-seq methods are generally composed of four major steps. First, Tn5 adapter integration is performed on the bulk nuclei suspension, similar to traditional ATAC-seq. Second, transposed nuclei are loaded onto an aqueous channel with PCR reagents and proprietary buffers and mixed with gel-beads containing distinct barcodes. To encapsulate individual nuclei in picolitre reactions with a single gel-bead, the aqueous flow is restricted to channels measuring $\sim$55 uM in width. Droplets are produced by exposing the aqueous flow to a continuous stream of oil.  To assure that the vast majority of droplets contain one or no nucleus, nNuclei droplet loading follows a Poisson distribution and nuclei ar thusloaded at low concentrations. Third, barcoded sequences with P5 adapters and tail sequences complementary to Tn5-inserted adapters are released from gel-beads following droplet generation; enabling PCR amplification of accessible chromatin fragments within each droplet in isolation. Finally, the droplet-oil mixture is emulsified, purified with magnetic beads, and subjected to bulk PCR to attach sequencing indices and P7 sequences. 
 
\subsubsection*{Plate-based scATAC-seq}

An alternative to microfluidics approach is to physically separate cells into the wells of plates. Straightforward 96- and 384-well scATAC-seq protocols have been published\cite{Chen2018}, however, their throughput remains limited by the low number of wells available. The adaptation of scATAC-seq to the ICELL8 Single Cell System (Takara Bio), which has 5084 nanoliter wells, in the form of $\mu$ATAC-seq\cite{Mezger2018}, increased the throughput of the assay to a few thousand cells. 

\subsubsection*{Combinatorial indexing (sciATAC-seq)}

Higher throughput can be achieved using combinatorial indexing, or sciATAC-seq\cite{Cusanovich2015,Cusanovich2018a,Cusanovich2018b}. In contrast to microfluidic approaches, sciATAC-seq via combinatorial indexing can be performed with access to standard instruments and reagents found on the campuses of most research institutions, with the exception that it requires homemade Tn5. In lieu of micrometer width channels to restrict barcoding and PCR reactions to individual cells, combinatorial indexing leverages fluorescence-activated nuclei sorting (FANS) to dispense low numbers of nuclei to multiplexed well plates. First, nuclei are stained with DAPI and 2500 nuclei are sorted in each well of a 96-well plate. ATAC-seq with 96 uniquely indexed Tn5 transposomes is then performed such that nuclei from each well are tagged by a distinct combination of adapters. Nuclei are then pooled and redispensed onto a new 96-well plate with each well containing up to 25 nuclei and a unique secondary barcode attached to Illumina-compatible adapters. Two rounds of nuclei shuffling and barcode attachment with 96 uniquely-indexed Tn5 transposomes allows for unique tagging of up to 9,216 nuclei. 

Multiple rounds of barcoding at also possible, utilizing ligation of barcodes to transposed fragments\cite{Yin2019,Zhu2019,Ma2020}, vastly increasing potential throughput.

Another approach for increasing throughput is to combine upstream transposition of barcoded Tn5 with a droplet-based scATAC platfrom such as 10X or BioRad, in the form of droplet combinatorial indexing, or dsciATAC\cite{Lareau2019}.

\subsubsection*{Single-cell multiomics}

The development of single-cell multiomics assays, which probe multiple molecular characteristics of individual cells at the same time, is major emerging area of single cell genomics, and chromatin accessibility is a key component of most such assays. 

Multiple protocols have been published recently from simultaneous single cell ATAC-seq and transcriptome profiling. These include sci-CAR\cite{Cao2018}, Paired-seq\cite{Zhu2019}, and SHARE-seq\cite{Ma2020}, which are all based on combinatorial indexing to achieve high-throughput, as well as droplet-based methods, such as SNARE-seq\cite{SNARE-seq}. 

Single-cell versions of the NOMe-seq assay (scNOMe-seq\cite{Pott2017}/COOL-seq\cite{Guo2017})allow simutaneous measurement of chromatin accessibility and DNA methylation in single cells, and have also been further extended in combination with transcriptome measurements in the form of assays such as scNMT-seq\cite{Clark2018} and snmC2T-seq\cite{Luo2019}. 

Perturb-ATAC has been developed to measure the effect of various CRISPR perturbations on chromatin accessibility at the single-cell level using libraries of sgRNA\cite{Rubin2019}

Protocols are also being developed for simultaneously measuring protein epitopes and accessibility at the single-cell level (e.g. Pi-ATAC\cite{Pi-ATAC})

Finally, spatially resolved single-cell chromatin accessibility measurements are emerging as an important area of method development (e.g. sciMAP-ATAC\cite{Thornton2020}).
 
\subsubsection*{Experimental design}
 
Similar to other sequencing-based profiling methods, scATAC-seq is susceptible to batch effects that can obscure biological variation. Careful attention to experimental design is central to mitigating batch and other sources of technical variation that strongly depend on the goals of the researcher. For example, in atlas and test versus control studies, a common objective is to contrast regulatory patterns among and within cell-types found in different tissues and organs, or between treatments and control samples. For such cases, scATAC-seq libraries should be constructed in parallel from as many sample types as possible, preferably including biological replicates, permitting resources. Prioritizing sample type diversity in preparations from individual batches aids in the mitigation of technical effects and allows researchers to average environmental and genotype influences across replicates. In contrast, comparison of two scATAC-seq libraries produced from separate preparations and from different samples will be confounded by batch effects, resulting in misleading or even erroneous results due to inflated variance between samples (ref). Furthermore, all libraries apart of the intended analysis should be sequenced on the same flow-cell, as sequencing scATAC-seq libraries on separate runs could also lead to increased technical variation (ref). Computational removal of batch effects from single-cell data has been a major focus of many informatics laboratories and shows promise in correcting mistakes stemming from poorly constructed experimental design (see Results). However, there is currently no accepted method to reliably remove all batch effects while preserving biological variation in the absence of true biological replicates. Thus, in cases where generating and sequencing scATAC-seq libraries in different batches is unavoidable, it is pertinent that the researcher takes note of possible sources of variation among samples.  

% -- DNAse-seq, ATAC-seq, NOMe-seq, long-read methods
% -- mention other techniques too

% -- ATAC-seq, 3D ATAC-PALM

% \subsection*{Single-cell approaches}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=18.5cm]{Fig2.png}
\end{center}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{
{\bf Primary experimental approaches for measuring chromatin accessibility and nucleosome positioning}. 
a) In ATAC-seq, hyperactive version of the Tn5 transposase is used to preferentially insert into accessible chromatin while simultaneously attaching adapters to the resulting fragments that can be used to directly amplify sequencing libraries,
b) In DNase-seq, the DNase I enzyme is used to preferentially cleave accessible chromatin, generating fragments that can then be amplified into sequencing libraries. Both ATAC-seq and DNase-seq generate peaks in read coverage over accessible regions in the genome.
c) In MNase-seq, the MNAse enzyme is used to digest unprotected DNA, leaving intact fragments protected by protein occupancy (primarily nucleosomes). These fragments are then amplified, resulting in increased read coverage over positioned nucleosomes
d) Methyltransferase-based approaches, such as NOMe-seq, dSMF, SMAC-seq, nanoNOMe/MeSLMR-seq and SAMOSA, rely on the labeling of accessible DNA within open chromatin regions and over nucleosome linkers with DNA methylation modifications. These modifications can be m5C methylation in GpC and CpG contexts and also m$^6$A methylation. Bisulfite conversion or the EM-seq assay can be used to convert fragmented DNA into Illumina-compatible libraries, resulting in short-range and sparse-coverage single-molecule footprints. Alternatively, long-read sequencing, which can also directly read m$^6$A methylation and take advantage of its much higher density in the genome, can be used, resulting in multikilobase-scale single-molecule footprints. Methyltransferase-based approaches tend to provide a simultaneous readout of both nucleosome positioning and open chromatin regions, appearing as small ``bumps'' in the methylated fraction of bases over linker regions and larger peaks over regulatory elements, respectively.
}
\label{Fig2}
\end{figure*}

\section*{Results}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=18.5cm]{Fig3.png}
\end{center}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{
{\bf Overview of common bulk chromatin accessibility measurement processing and analysis tasks}. 
a) Outline of key steps in processing ATAC/DNase-seq, MNase-seq, NOMe-seq/dSMF and SMAC-seq and other long-read datasets;
b) Fragment distribution is a key informative characteristic of ATAC-seq (shown here), DNase-seq and MNase-seq datasets;
c) Enrichment around TSSs is a useful ATAC-seq/DNase-seq quality control metric to assess the degree of global enrichment over open chromatin regions in the genome that is independent of peak calling or other arbitrary thresholds (shown here for an ATAC-seq dataset). It can be formalized in a single number as the ratio of signal immediately around the TSS versus signal in the flanks $\pm$2kbp away. Analogous plots serve the same function for methylation-based methods;
d) Dimensionality reduction techniques, such as PCA, $t$-SNE and UMAP, allow for intuitive visualization of the global similarities between datasets;
e) Identifying peaks differentially accessible between two conditions is a common task in the analysis of ATAC-seq/DNase-seq datasets;
f) Clustering analysis charts the relationships between datasets in a data matrix;
g) ChromVar\cite{Schep2017} analysis identifies TF motifs with elevated/decreased accessibility across samples;
h-k) Transcription factor footprinting. TF footprints are most often measured globally, in the aggregate over a set of motif instances and across conditions (h). Deep DNase-seq datasets also enable local footprint analysis at the level of individual motif instances (i). Short- and long-read single-molecule footprinting techniques enable the quantitative analysis of distinct footprint states (j--k).
}
\label{Fig3}
\end{figure*}

In general, a chromatin accessibility analysis workflow consists of three main steps (1) pre-processing, (2) peak calling and (3) downstream analysis, in which the later can include differential accessibility analysis, annotation, footprinting, motif enrichment and integration with other omics data. We will discuss each of the steps in more detail and mention commonly used bioinformatic tools. Although there is not yet a golden standard in the field, some general pipelines, such as the ENCODE pipeline for ATAC-seq analysis (\burl{https://www.encodeproject.org/pipelines/ENCPL792NWO/}), exist and propose specific tools and a guided workflow for the analysis of chromatin accessibility data. These steps differ for bulk, single-cell and single-molecule analysis and, they are thus discussed separately.

\subsection*{Bulk accessibility analysis} 
 
\subsubsection*{Pre-processing and QC}
 
Like most high-throughput sequencing data, pre-alignment quality control is recommended for chromatin accessibility data and can for instance be done using FastQC (ref) to examine sequencing quality, GC bias and overrepresented sequences. Next, sequencing adaptors should be removed using tools such as cutadapt (ref), trimmonmatic (ref) and fastq-mcf (ref), which require the input of known Illumina adaptor sequences. Depending on the experimental techniques and when paired-end reads are available, a size selection can be performed at this point. For instance, removal of multi-nucleosomal reads is advised for MNase-seq data, and for the ‘double-hit’ DNase-seq protocol an additional in silico filtering for fragment inserts between 50-100 bp for TFs binding site detection can be performed on top of the gel-based or SPRI-based experimental size selection (Cooper et al., 2017: https://doi.org/10.1038/nprot.2017.099; He et al., 2014). The trimmed and filtered reads are mapped to the organism-specific reference genome. The most widely used aligners for chromatin accessibility data are Bowtie2 (used in the ENCODE ATAC-seq pipeline) (ref), MEM (ref) or STAR (used in CellRanger [check])(ref). Following alignment, some additional filtering steps are advised to discard reads with low mapping quality or multi-mapped reads, PCR-duplicated reads, ENCODE blacklisted regions (Amemiya et al., 2019) and mitochondrial reads (specifically important for ATAC-seq data in which these can make up as high as 75\% of the total amount of mapped reads when using the original protocol).
 
An additional quality control step is recommended at this point by visualizing accumulated read abundance around transcription start sites, which are generally highly accessible (ref). In addition, visually inspecting the distribution of reads across the genome using genome browsers such as IGV (ref) or UCSC (ref) can further increase insight in the quality of the samples.
 
\subsubsection*{Peak calling}
 
Following initial read processing and quality control comes one of the crucial steps in chromatin accessibility data analysis, namely defining so-called ‘peaks’ or locations with a high accumulation of reads. These peaks form the basic units in most of the downstream analyses. The most widely used tool for peak calling is MACS, which is also the default in the ENOCDE ATAC-seq pipeline (Zhang et al., 2008). MACS is a model-based algorithm originally designed for ChIP-seq data analysis, and implements a dynamic Poisson distribution to capture local background biases in the genome and to effectively detect peaks (Zhang et al., 2008). Other general (e.g. ZINBA (ref)) or more technology-specific peak callers exist, e.g. HMMRATAC for ATAC-seq; F-seq and Hotspot for DNase-seq and ATAC-seq. Note that MNase-seq is actually an orthogonal assay compared to the other discussed chromatin accessibility profiling methods as it measures nucleosome-occupied regions. It is therefore the method of choice to map nucleosome positions genome-wide, for which specific tools have been developed such as GeneTrack (ref) and DANPOS (Chen et al., 2013). Note that ATAC-seq also lends itself for nucleosome-positioning by for instance using the tool NucleoATAC (Schep et al., 2015). An important parameter to consider during the peak calling step is the signal threshold, which influences the sensitivity and specificity of peak retrieval (Koohy et al., 2014). The default minimum FDR cutoff of 0.05 for MACS has been shown to be optimal for a range of DNase-seq datasets (Koohy et al., 2014).
 
As datasets often comprise different samples, the construction of a common set of features is crucial in order to be able to compare samples to each other in downstream steps. Mostly, a consensus peak file comprising a set of merged peaks across the samples is used. The ENCODE pipeline provides a possible workflow with merge and filter steps for this purpose (ref), although other tools serve the same purpose (e.g. consensusSeekeR (Samb et al., 2015)) Alternatively, a pre-defined set of regions or a binned genome can be used as features in downstream analyses (Bravo et al., 2020; Cusanovich et al., 2018).
 
Lastly, to ensure reproducibility in the data, ENCODE guidelines recommend that each ATAC-seq experiment should have two or more biological replicates and that replicate concordance should be checked by calculating Irreproducible Discovery Rate (IDR) values (Qunua et al., 2011). An additional quality control step is to calculate measures that represent the signal-to-noise ratio, namely the fraction of reads in called peaks (FRiP score) for ATAC-seq, which should preferably be greater than 0.3 (or at least greater than 0.2), and the signal proportion of tags (SPOT score) for DNase-seq, which should exceed 0.4 (i.e. 40\% of mapped reads within DHSs) (ENCODE projects; Roadmap Epigenomics).
 
\subsubsection*{Downstream analysis}
 
Often, the experimental set-up of chromatin accessibility profiling methods includes several samples, such as treatment versus control or cells undergoing differentiation, for one wishes to study the changes in chromatin accessibility profiles. Therefore, defining the set of peaks that is differentially accessible between conditions is often a central question. Most differentially accessibility tools, including DiffBind (ref), HOMER (ref), DBChIP (ref), use as input a read count matrix on a feature file (e.g. consensus peaks) across the different conditions. These methods rely on bioinformatic tools that were designed for differential expression analysis of RNA-seq data, such as DESeq2 (ref) and edgeR (ref). Differential peak calling can also be done using MACS in which mapped bam files of treatment and control samples are provided rather than a count matrix (ref).
 
These differential accessibility tools are well-suited for pairwise comparison of chromatin accessibility. However, differential accessibility analysis becomes more tricky when comparing three or more samples. In this case, a one-versus-the-rest approach using aforementioned tools can reveal regulatory regions that are specific for a condition or cell type, however, a clustering analysis is more versatile. In general, clustering can both be used to query which samples in a dataset resemble eachother, as well as to define groups of specifically accessible regions per sample class. Both hierarchical clustering and k-means clustering have been applied to a (Pearson or Spearman correlation matrix on) a normalized (reads per million mapped reads or RPM) accessibility count matrix using a consensus peak set (; other refs). Such clustering algorithms are for instance implemented in the deepTools package (ref). The differentially accessible regions are often visualized in a heatmap representing the accessibility of the regions within each cluster across the compared samples. Other researchers have drawn inspiration from tools designed for clustering of regions in single-cell epigenomics data using factor analysis and unsupervised learning (ref). For instance, topic modelling or non-negative matrix factorization, in which a high-dimensional dataset is approximated by a reduced number of representative components, can be applied to bulk datasets after a bootstrapping/subsampling procedure in which simulated single-cells are created from bulk samples (ref).
 
To gain biological insight in the sets of cell type specific regions identified via differential accessibility analysis, peak annotation tools such as GREAT, ChIPseeker, ChIPpeakAnno, [clusterProfiler?] are used to couple peaks to the nearest gene (and to report associated Gene Ontology terms) as well as to annotated peaks based on their location relative to genes (for instance as ‘promoter’, ‘intron’ or ‘intergenic’). In addition, chromatin segmentation approaches such as ChromHMM (Ernst and Kellis 2017), EpicSeg (Mammana and Chung, 2015) and Segway (Hoffman et al., 2012) are used for genome-wide classification of genomic regions based on epigenomic marks (mostly based on histone modification ChIP-seq) into chromatin states, such as ‘active promoter’ or ‘weak/poised enhancer’ per cell type. These annotations can be useful to aid interpretation of gained or lost accessible regions in a study.
 
As combinatorial binding of TFs to accessible regulatory regions forms the basis of gene regulation, one of the major downstream analysis steps is unravelling which TFs are bound to the set of cell type-specific or differentially accessible regions. Since TFs recognize and bind to TF-specific DNA sequences, we can leverage the enrichment of so-called ‘motifs’ in a set of sequences to detect potential binding of TFs. Two major groups of motif analysis tools exist. The first class of  tools, e.g. HOMER, MEME and i-cisTarget, use pre-defined databases (e.g. JASPAR, CIS-BP, TRANSFAC and HOCOMOCO) containing motif sequences, modelled in the form of position-weight matrices (PWMs). These approaches use motif scanner tools to identify enriched cognate TF binding sequences in a provided peak set via the comparison with a tool- or user-provided background set of sequences. On the contrary, other bioinformatic tools, such as RSAT, MEME, Weeder and HOMER, perform de novo motif discovery, allowing an unsupervised identification of TF binding motifs. Next to these classical de novo motif discovery tools, recently, with the advent of deep learning algorithms in regulatory genomics, several convolutional neural network models have been applied to this purpose, including DeepATAC (Hiranuma et al., 2017), DeepLIFT (Shrikumar et al., 2019), DeepMEL (Minnoye et al., 2019). Often, these models not only de novo capture important motifs across the training regions but are also able to predict their importance at single-nucleotide resolution within the sequences. Note that most of the motif discovery tools require a set of regions-of-interest as input that was priorly identified by for instance differential accessibility or clustering analysis. In contrast, MEDEA extracts cell-type-specific peaks from just an input sample using a panel of peaks from reference cell types (e.g. ENCODE-DREAM) prior to a TF motif enrichment analysis (Mariani et al., 2020). Altogether motif detection on a set of specifically accessible regulatory regions allows to decode the genome sequences and may reveal possible master regulators that bind to these regions.
 
One can also go further and aim to detect the precise location of TF binding events from chromatin accessibility data via TF footprinting. Footprints are small regions (8-30bp) that display relative protection from cleavage due to binding of a TF, and thus correspond to dips in the accessibility peak (Baek and Sung, 2016;….). DNase I has been and is still the preferred footprinting reagent. A study from Schwessinger et al., 2017 showed that ATAC-seq footprinting was less accurate than DNase-seq footprints, which might be attributed to the large size of the Tn5 dimer and Tn5-specific cleavage biases that are not accounted for in DNase-seq-designed footprinting algorithms (Buenrostro et al., 2013; Li et al., 2019). Analytic genomic footprinting approaches either de novo annotate DNase I footprints (e.g. the Wellington algorithm (ref), HINT (ref), DBFP (ref) and DNase2TF (PMID 25242143)); or determine TF occupancy at specific genomic location (e.g. CENTIPEDE (ref) and FLR, (ref)) (Vierstra and Stamatoyannopoulos, 2016). Nevertheless to popularity of DNase-seq data for footprinting, footprinting analysis on ATAC-seq data has also been attempted by several groups with success, for instance in the initial ATAC-seq publication (Buenrostro et al., 2013), using DeFCoM (Quach and Furey, 2017) or ATAC-seq-specific footprinting algorithms such as HINT-ATAC that consider ATAC-seq artefacts (Li et al., 2019). Note that TF footprinting comes with some limitations as it requires extremely deep sequencing, ideally at least 200 million uniquely mapped reads from a DNase-seq experiment (Vierstra and Stamatoyannopoulos, 2016), and it is biased by short residence times for some TFs (ref) and by intrinsic sequence preferences of DNase I (Sung et al., 2016).
 
\subsection*{Single-molecule accessibility analysis} 



\subsection*{Single-cell chromatin accessibility data analysis}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=18.5cm]{Fig4.png}
\end{center}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{
{\bf Overview of common scATAC-seq processing and analysis tasks}.
a) Outline of key steps in processing scATAC-seq datasets;
b) Clustering of cell types and UMAP embedding of single cells;
c) Identification of marker genes and/or peaks;
d) Identification of peak-to-gene links;
e) Assessing peak coaccessibility;
f) Generation of pseudobulk genome browser tracks for each cell type;
g-h) Analysis of gene activity scores (g) and motif deviations (h) within embeddings;
i) Differentiation trajectories analysis.
}
\label{Fig4}
\end{figure*}


\subsubsection*{Pre-processing and QC}

Single-cell chromatin accessibility data requires similar upstream processing steps as bulk data, including alignment, feature definition and the generation of a count matrix. However due to the substantial scale and sparsity of the feature-by-cell matrix, specialized bioinformatic tools have been developed, mostly for scATAC-seq data, to handle these assay-specific challenges (refs for all tools). One major point in which these tools differ is the way they define features [e.g. peaks from bulk or aggregated single-cell data (chromVar, Cicero, cisTopic, Gene Scoring, scABC, Scasat), pseudo-bulk samples (Cusanovich2018) or fixed size bins (Cusanovich2018, SnapATAC))] and what the count features represent [e.g. counting reads in peaks (cisTopic, Cusanovich2018, scABC, Scasat), counting (gapped) k-mers under peaks or around transposase cut sites (BROCKMAN, chromVAR) or counting reads overlapping TF motifs in peaks or genome-wide (chromVar, SCRAT)] (Chen et al., 2019). Important follow-up steps are transformation (e.g. by binarization) and dimensionality reduction of the feature-by-cell matrix to visualize the cells into a 2D- or 3D-space and to perform further downstream analyses such as clustering to uncover the different populations in the sample and their specifically accessible regions. Recently, 10 computational methods for the analysis of scATAC-seq data have been benchmarked by Chen et al., demonstrating that SnapATAC (ref), Cusanovich2018 (ref), and cisTopic (ref) performed best in distinguishing cell populations in both synthetic and real datasets (Chen et al., 2019). Note that compared to scRNA-seq frameworks, there are no designated tools that correct for batch effects in scATAC-seq data, but batch correction is performed inexplicitly during the processing steps such as during feature selection or dimensionality reduction (Baek and Lee, 2020). Batch correction tools designed for scRNA-seq data can be used with precautions to not remove biological variance. Batch effect removal becomes especially important when combining multiple runs into atlases or when integration with scRNA-seq data, for which BBKNN, Scanorama and scVI performed best in a recent benchmark (Luecken et al., 2020).
 
As the complexity of a system or disease exists across all molecular layers, computationally integrating multiple omics modalities on the system of interest holds the promise to gain a systems biology view and to reconstruct gene regulatory networks. Especially the integration of chromatin accessibility profiles with ChIP-seq and RNA-seq data are of interest. As TFBS enrichment within regulatory regions may elude to TF binding, correlation with TF ChIP-seq tracks (cisTarget/ other methods??); or enrichment/overlap of TF ChIP-seq signal/peaks within accessible regions can validate the predicted target sites. For the reconstruction of regulatory networks, specifically the integration of epigenomics and transcriptomics is of interest as this may predict links between accessible regulatory regions and target genes (Ackermann et al., 2016). An example from the single-cell field is the study by Cao et al. where they used a least absolute shrinkage and selection operator (LASSO) model to correlate a gene’s expression level with the accessibility of all peak within 100kB around its TSS, linking 1,260 distal regions to 321 potential target genes, which improved predictions of gene expression based on accessibility profiles by a fourfold as compared to only using chromatin accessibility at promoters (Cao et al., 2018).

\section*{Applications}

Chromatin accessibility profiling is widely useful for applications in biology and biomedicine, ranging from the analysis of gene regulation and cellular states (section 1 and 2 below) over the dissection of healthy and diseased tissues and organs (section 3 and 4) to the investigation of populations and species (section 5 and 6). These applications profit from the high genomic resolution of chromatin accessibility profiling and from it relative ease and throughput of these assay.

\subsection*{Regulation of chromatin accessibility}

<Enhancers/super-enhancer, 3D structure, GRNs etc., Georgi to write a draft?>
Chromatin accessibility at regulatory regions such as promoters and enhancers plays an important role for gene regulation, and it is itself controlled by a complex interplay of transcription factors, chromatin remodeling complexes, non-coding RNAs and other factors. Chromatin accessibility profiling has provided important insights into <...>

\subsection*{Cellular dynamics of chromatin accessibility}

<Embryogenesis, stem cells, etc., Georgi to write a draft?>
Chromatin accessibility captures developmental cell states such as pluripotency and controls how cells respond to stimuli. Changes in chromatin accessibility often concur with changes in the expression of the corresponding genes, although the correlation is far from perfect. For example, pluripotent stem cells retain a high level of chromatin accessibility at key regulatory regions <...>

\subsection*{Chromatin accessibility across cell types and organs}

Chromatin accessibility at gene-regulatory regions is highly dynamic during cellular differentiation and organ development (PMID: 21116306, 20110991). Chromatin accessibility profiling has contributed to our understanding of chromatin regulation across a broad range of cells in human and mouse (PMID: 25693563, 30078704) and in specific organs and cell types. The hematopoietic lineage in particular has served as a blueprint for deciphering the role of chromatin accessibility and epigenetic changes in cellular differentiation (PMID: 29364285, 22398613). Application of ATAC-seq and/or ChIP-seq to FACS-purified hematopoietic cell populations established comprehensive maps of regulatory regions and their dynamic changes in the hematopoietic lineage of human and mouse (PMID: 25103404, 30686579, 27526324, 29706549). Detailed investigations of macrophages connected the regulation of these immune cells to their tissue environment (PMID: 25480296, 25480297), while analyses of CD4+ T cells (19144320, 24097267, 29686426) and innate lymphoid cells (PMID: 27156451, 27156452) uncovered a striking degree of plasticity in these immune cell populations. Chromatin regulation in immune cells also contributes to the generation of memory T cells (PMID: 29236683), which are poised to implement effector functions upon re-exposure to pathogens, and to the more limited memory of inflammation in regulatory T cells (PMID: 27499023). Importantly, immune cell memory is not restricted to B cells and T cells, but also includes monocytes and NK cells (PMID: 32132681), and the regulation of such trained immunity appears to involve tightly regulated changes in the epigenomes of the affected cells (PMID: 25258085, 27863248).

Reaching beyond the hematopoietic lineage, RNA-seq, ATAC-seq and ChIPmentation profiling of epithelial cells, endothelial cells and fibroblasts from 12 different organs uncovered widespread immune gene regulation in these non-hematopoietic, structural cells, as well as an epigenetic potential that appears to pre-program these cells for contributing to pathogen response (DOI: 10.1038/s41586-020-2424-4). Chromatin accessibility has also been studied in neural development (PMID: 31974223, 29307494, 26365491, 29434377) as well as in brain samples of humans (PMID: 31727856, 29945882, 29227469) and non-human primates (PMID: 31980617). Notable applications of chromatin accessibility profiling to other cell types and organs have included the analysis of cardiac development (PMID: 31271750, 30451828), epidermal progenitor cells in skin (30220568), and mammary gland development (PMID: 30174241). Finally, initial single-cell atlases of chromatin accessibility across tissues and organs are emerging (PMID: 30078704, 32231307), which have the potential to discover new cell types and to define the chromatin states of cell types that are difficult to purify or enrich using FACS. In summary, chromatin accessibility profiling has uncovered a transcription-regulatory landscape that is cell-type-specific and organ-specific, and dynamically changed over the course of cellular differentiation and organ development.

\subsection*{Chromatin accessibility in human diseases}

Changes in chromatin accessibility have been implicated in multiple diseases, where they reflect disease-linked changes in cell composition, gene regulation and/or epigenetic cell states. Aberrant epigenetic states are ubiquitous in cancer and often linked to the developmental abnormalities of cancer cells (PMID: 30361341). In blood cancers, chromatin accessibility patterns have been shown to reflect the cancer’s cell-of-origin as well as regulatory changes that appear to contribute to the process of malignant transformation and cancer evolution (PMID: 27526324, 31996669, 29785028, 30503705, 30673601). Changes in chromatin accessibility have been investigated over the course of targeted therapy in patients with chronic lymphocytic leukemia (PMID: 31996669) and combined with chemosensitivity screening to identify promising drug combination therapies (PMID: 30692684). Chromatin accessibility landscapes have also been mapped in solid tumors, including breast cancer (PMID: 31152164), colon cancer (PMID: 22499810, 28169291), glioblastoma (PMID: 30275445, DOI: 10.1101/370726), gastric cancer (27677335), and lung cancer (PMID: 31209061, 27374332). Pediatric cancers tend to carry particularly pronounced epigenetic changes, contrasting with their comparatively low rate of somatic mutations. For example, the EWS-FLI1 fusion oncogene in Ewing sarcoma has been shown to impose de novo enhancers and super-enhancers on the tumor cells (PMID: 25704812, 25453903); and epigenome profiling has uncovered subtype-specific regulatory mechanisms in atypical teratoid rhabdoid tumors (PMID: 27960086) and in Langerhans cell histiocytosis (31345789).

An interesting line of research has investigated the role of the tumour-infiltrating/tumour-associated cells in solid tumors. Epigenetic changes have been implicated in T cell exhaustion in the context of chronic inflammation and the tumor microenvironment (PMID: 27789799, 30778252), which compromises the ability of these T cells to fight the tumor. Immunotherapy, most notably blocking of the PD1/PD-L1 pathway, has been shown to revert some of the epigenetic changes associated with T cell exhaustion (PMID: 27789795, 31375813, 28648661) and is widely useful for the treatment of those solid tumors that have a high degree of immunogenicity (32433532). However, not all exhausted T cells can be rejuvenated by immune checkpoint blockade, as some T cells appear to transition to a fixed epigenetic state that renders them resistant to reprogramming (30778252). In addition to immunotherapy, selective epigenetic reprogramming can be used to alter the T-cell landscape resulting in enhanced treatment efficiency (PMID: 26503055, PMID: 28625481).

Beyond cancer, where chromatin accessibility has been studied most extensively, changes in chromatin accessibility have also been observed in immune diseases such as inflammatory bowel disease (29695774) and rheumatoid arthritis (29765031). Moreover, changes in epigenome and chromatin accessibility profiles have been observed in post-mortem brain tissue from patients with Alzheimer’s disease (PMID: 30559478), schizophrenia (PMID: 30087329) and autism spectrum disorder (PMID: 27863250). In summary, chromatin accessibility profiling of primary patient samples is already widely used for identifying disease-linked changes in chromatin structure and transcription regulation, and there is substantial scope for new discoveries as researchers move beyond cancer and are investigating regulatory mechanisms in many diseases that have yet received little attention.

\subsection*{Chromatin accessibility variation across populations}
 
Extension of chromatin accessibility assays to populations of diverse genetic backgrounds has proven valuable for advancing our understanding of how sequence variation impacts cis-regulation within a species. A striking 90\% of disease-associated variants in humans identified via GWAS localize to gene-distal non-coding loci, obfuscating functional predictions (PMID: 22955828, 22955986, 28039028). Mounting evidence has implicated alteration of gene regulation as a key driver of phenotypic evolution and disease proliferation. Quantitative trait loci (QTL) mapping of molecular traits, such as expression variation (eQTL), provides an attractive approach for deciphering the gene regulatory potential of genetic variants within a population. Leveraging a molecular QTL framework, a large-scale DNase-seq panel of 70 lymphoblastoid cell lines from the Yoruba HapMap showed that approximately 50\% of chromatin accessibility associated variants coincide with variants associated with expression variation, with the allele conferring increased accessibility generally associated with increased gene expression (PMID: 22307276). This study also provided evidence that sequence alterations underlying cis-elements perturb transcription factor binding affinities, leading to weakened or ablated binding. An analysis of CD4+ T cell chromatin accessibility from 105 healthy donors revealed that only 15\% of genetic variants embedded within accessible chromatin regions affect the relative accessibility of the cognate regions (PMID: 29988122). Thus, the majority of genetic variants located within accessible chromatin appear to lack functional consequences. The same study further demonstrated that pairwise correlations of accessible regions (co-accessible regions) readily recapitulates three-dimensional higher-order chromatin interactions as defined by in situ HiC data, suggesting that local chromatin accessibility among pairs of regions are coordinated with higher-order genome structure, particularly within the same topologically-associated domains (TADs). In line with these findings, local chromatin accessibility in a subset of regions were associated with variants located 10’s to 100’s of kilobases away, reflecting putative interactions. Importantly, integration of population-scale accessibility data captured 10-30\% of previously reported autoimmune-associated variants and explained 1-7\% of disease heritability. Taken together, population-based analysis of chromatin accessibility provides a powerful approach for dissecting the regulatory potential of genetic variants associated with a trait of interest. Additional studies in other tissues and disease states leveraging single-cell technologies have the potential to systematically map all chromatin accessibility modifying variants in a cell-type specific fashion.
 
\subsection*{Evolution of chromatin accessibility}
 
The use of chromatin accessibility data has greatly facilitated the identification of causal genetic variants underlying disease and trait variation; however, it is also proving useful to study the evolution of gene regulation and morphological evolution between species. A major advantage of incorporating chromatin accessibility data into these studies is that DNA sequence variation is often too high in intergenic regions to identify cis-regulatory elements using sequence-based alignments alone (ref?). This is especially problematic for studies in plants, where sequence turnover between related plant species is much greater than what is observed between related animal species (ref?). Regardless, sequence-based methods can be used to identify conserved non-coding sequences, which are highly enriched in accessible chromatin. Comparative epigenomics is revealing important clues about the evolution of gene regulation. For example, major morphological transitions, such as the loss of limbs in snakes and eye degeneration in subterranean mammals, have been linked to loss of CREs (ref?). These CREs were discovered using a combination of tissue-specific ATAC-seq and comparative genomics. In another study, chromatin accessibility data in combination with H3K27ac and H3K4me3 was used to identify promoters and enhancers in liver tissue of 20 mammalian species (Villar et al., 2015). It was determined that the rate of sequence variation is much greater for enhancers in comparison to promoters. This was reflected by a lower conservation of enhancers between species, yet, newly evolved enhancers were more likely to be under positive selection in a lineage specific manner. Rapid evolution of cis-regulatory regions has also identified in a comparative epigenomics study of numerous flowering plant species ranging in genome size from ~150 Mb to ~5,000 Mb (ref?). The frequency of distal chromatin accessible regions was correlated with genome size and their distal location from genes was mostly likely due to transposon and repeat expansion in these plants. The lack of distal CREs in Capsaspora owczarzaki, a unicellular organism sister to other animal species, has led to the hypothesis that distal regulation is a feature of animal multicellularity, however, with the increase in profiles of chromatin accessibility across taxa it seems more likely distal regulation is a consequence of genome size (ref?). Additional comparative epigenomic studies of chromatin accessibility across diverse taxa and of species that represent key nodes in the tree-of-life will further unveil diverse mechanisms in the evolution of gene regulatory mechanisms.

\section*{Reproducibility and Resources}

The genomics community has been leading the way in creating standards for data information, data quality and data deposition for decades. This reflects that many genome-wide datasets serve as community resources and, as a result, they are repeatedly used and incorporated into future studies by individual labs. These efforts also improve reproducibility, especially for data analyses. To increase the usability of epigenomics data, it must be available in public data repositories and it should include the original unprocessed sequencing data, in addition to processed output files from analyzing the data. For example, it is common practice to include files that contain regions enriched for chromatin modification and or accessibility for ChIP-seq and chromatin accessibility mapping assays, respectively. There are numerous ways to publicly distribute raw and processed epigenome datasets ranging from custom websites from consortia or individual labs to national data archives. At a minimum, epigenome sequencing datasets should be replicated and deposited to a well-funded and stable data archive facility such as the National Center for Biotechnology Information (i.e. Sequence Read Archive or Gene Expression Omnibus), European Nucleotide Archive, DNA Data Bank of Japan, or Dryad Digital Repository to name a few examples. Although not required, it is also useful if data are hosted in publicly accessible genome browsers. This increases data dissemination and provides a user-friendly tool for scientists not as familiar with computational methods for analyzing data. The deposited data should include supplementary information to facilitate interpretation and reproducibility. These data entry requirements will range from information regarding the sample used to technical aspects of experimental design to the type of instrument used for sequencing. For example, data entry requirements that are useful to addresses issues associated with reproducibility could include sources of possible biological variation (i.e. genotype, sex of samples, age, tissue/organ/cell type) and technical variation (i.e. antibodies – lot number, nucleases/integrases – lot number, sequencing library procedure, instrument used for sequencing and type of sequencing run). They are also important variables that can be incorporated into data analyses as covariates or to correct for batch effects. Genome assembly and genome annotation versions used in data analyses should also be provided. Altogether, this information is useful to assess reproducibility. Lastly, distribution of custom code and descriptions of computational methods are also paramount to reproducibility. As one example, the ENCODE Consortia has developed extensive open source software that is accompanied with ‘best practices’ and descriptive details on the rationale for data processing steps, thresholds and quality metrics for data evaluation. In general, software used for data analyses should include the software version and parameter options applied. Custom code should be disseminated through public hosts such as GitHub, as opposed to personal communication with the developer. Efforts to address the biological, experimental and computational variables described above will increase reproducibility in addition to the usability of these data for years to come.

\section*{Limitations and optimizations}


\section*{Outlook}

The past decade has seen an explosion in studies examining chromatin accessibility and its variation in different cell types, tissues, organs and organisms. The current and future challenge is to dissect the function of these regulatory regions in relation to other regulatory layers and gene expression. Accessibility alone does not reveal the activity state or the functional properties of the region (whether it acts as a promoter, enhancer, silencer), or which factors are bound to the region or its potential role in other functions such as 3D genome topology or replication origins. Moreover, information on the identity of the target genes, and whether a regulatory region is functionally required for gene expression, is also missing.

Many of these challenges can be overcome by a more holistic multi-omics approach, by profiling multiple molecular layers from the same sample, such as the transcriptome, chromatin modifications and transcription factor occupancy, in addition to chromatin accessibility. A common approach is to run multiple omics methods on fractions of the same sample, using protocols optimized separately for each assay, thus generating comparable datasets. However, running separate assays can introduce batch effects that are difficult to mitigate computationally, which can be a drawback of this strategy.

Chromatin accessibility profiling in single cells has surged dramatically in recent years, in part due to combinatorial indexing (sciATAC-seq) and the recent availability of commercial kits for droplet-based scATAC-seq. We expect further improvements to the assay in the coming years as this trend keeps increasing. In contrast to RNA, which has a high dynamic range, there are only two loci that can be measured simultaneously in a diploid genome by single cell regulatory genomics based methods. As a result, the data is mostly binary and still very sparse due to the low coverage per cell, making the analysis of accessibility and other regulatory features at the single cell level extremely challenging and a certain degree of data aggregation across cells or features is usually required. It is also difficult to estimate the sensitivity of scATAC-seq. Roughly ~10-15% of known peaks are recovered per single cell, but it is actually not known how many regulatory elements are accessible in any given cell at any instance in time. Technical improvements during the past couple of years have boosted cell coverage, which ameliorated both issues to some extent and resulted in a significant increase in data resolution, allowing a sharper distinction between cell types as well as regulatory changes. Given this inherent difference between scRNA-seq and scATAC-seq (and scChIP) data, specialised computational tools have been developed that address the sparsity and binary nature of scATAC-seq data and facilitate more integrated analyses across groups of cells (cisTopic (Bravo González-Blas et al. 2019, PMID: 30962623), snapATAC (Fang et al. bioRxiv 2019), ArchR (Granja et al. bioRxiv 2020)). However, the availability of tools designed for scATAC-seq is still very limited when it comes to specific analysis tasks, such as pseudotime and trajectory inference. While comparisons of performance and applicability of scATAC-seq methods have been performed (Chen et al. 2019, PMID: 31739806), there are no uniform pipelines adapted widely by the community, which complicates a systematic comparison and interpretation of results coming from different labs. We foresee major efforts in the coming years towards standardizing comprehensive computational pipelines for analysis.

Recent advances in single cell methods are pushing technologies to perform multi-omic measurements from the same cell. Multiple methods already exist for the joint profiling of chromatin accessibility and DNA methylation (scNOMe-seq (Pott. 2017, PMID: 28653622)), gene expression (sci-CAR (Cao et al. 2018, PMID: 30166440)) and protein levels (Pi-ATAC (Chen et al. 2018, PMID: 30389926)). Several technical challenges have so far limited the widespread application of these methods. Sample fixation, reaction conditions and other experimental parameters are often not compatible for multiple assays, complicating the optimization of joint protocols. Moreover, the resulting data is limited by the combined sensitivity of the methods, for example running two assays each having a 10% capture rate could result in a very small set of overlapping features. Profiling multiple molecular layers raises the non-trivial computational challenge of integrating the datasets. Methods that can handle the harmonization of bulk and single cell multi-omic measurements have recently been developed (MOFA (Argelaguet et al. 2018, PMID: 29925568), Seurat v3 (Stuart et al. 2019, PMID: 31178118)). A key feature required for future computational methods is flexibility; methods need to handle datasets coming from very different modalities, coming from the same cell or from the same sample and will need to impute missing molecular layers based on the ones that were profiled. Measuring multiple parameters from the same single cell should greatly advance our ability to link regulatory properties and deconstruct regulatory connections. Having information on coordinated changes in distal open chromatin regions (putative enhancers) and gene transcription from the same cell, for example, would greatly help to link enhancers to their potential target genes. We anticipate important developments in both experimental and computational multi-omic approaches in the coming years.

Functionality of accessible chromatin regions can also be probed by perturbation, for example by mutation of key transcription factors. The high degree of cellular heterogeneity in complex systems, such as developing embryos, has limited the usefulness of this approach.  However, single cell accessibility profiling could solve this issue, by identifying the impact of the mutations directly in the affected cell types, revealing both changes in regulation as well as alterations in cell fate decisions. Large-scale perturbation and profiling of regulatory networks has been performed in cell culture models by coupling CRISPR screening with scATAC-seq readout (Perturb-ATAC (Rubin et al. 2019, PMID: 30580963)). In more complex systems, where high-throughput targeted mutagenesis is not feasible, natural sequence variation could be exploited as a large-scale perturbation tool. In this context, profiling accessibility both intra- and inter- species can give insights into regulatory variation and functionality, as discussed above.

Finally, a particularly exciting area of future development is the integration of chromatin accessibility profiling with imaging-based approaches. Current protocols involve tissue dissociation to extract cells or nuclei, which leads to the loss of the native spatial context. ATAC-see (Chen et al. 2016, PMID: 27749837) mitigates this problem by performing the Tn5 reaction in-situ on microscopy slides and using fluorescent adaptors that are compatible with both imaging and sequencing. Further integration of ATAC-seq with high-throughput FISH and other imaging-based methods will lead to new ways of interrogating the genome of complex systems in situ after stimuli and perturbations.

\section*{Author contributions}

\hl{XXX}

\section*{Acknowledgments}

\hl{XXX}

\begin{thebibliography}{100}

% \section*{References}

\input{references}

\end{thebibliography}

\end{multicols}

\end{document}
