\documentclass[10pt]{article}
\usepackage{marginnote}
\usepackage[paperheight=25cm,paperwidth=18cm,lmargin=1.7cm,rmargin=1.7cm,top=2.2cm,bottom=2cm,marginparwidth=3.2cm,marginparsep=-3.2cm]{geometry}
% \usepackage[paperheight=25cm,paperwidth=18cm,lmargin=1.7cm,rmargin=1.7cm,top=2.2cm,bottom=2cm]{geometry}
\setlength\columnsep{30pt}
\usepackage{multicol}
\usepackage{amsmath}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{array}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage[auth-sc]{authblk}
\usepackage{longtable}
\usepackage{multirow}
\usepackage{hyperref}
\usepackage{enumerate}
\usepackage[labelfont=bf]{caption}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{mdframed}
\usepackage{graphics}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{array}
\usepackage{lscape}
\usepackage{caption}
\usepackage{breakurl}
\usepackage{todonotes}
\usepackage{hanging}
\usepackage[final]{pdfpages}
\usepackage[leftFloats,CaptionAfterwards]{fltpage}
\usepackage[numbers,sort&compress]{natbib}
\setlength{\bibsep}{3pt}
\usepackage{abstract}
\usepackage{enumitem}
\usepackage{titlesec}
\usepackage{etoolbox}
\usepackage{soul}
\patchcmd{\thebibliography}{\section*}{\section}{}{}
\titleformat{\section}[block]{\bf \fontfamily{phv}\selectfont{\Large\bfseries\filcenter}}{\thesection}{0.6em}{}
\titleformat{\subsection}[block]{\bf\fontfamily{phv}\selectfont{\normalsize\bfseries\filcenter}}{\thesubsection}{0.4em}{}

\hypersetup{
  colorlinks,
  citecolor=Blue,
  linkcolor=Red,
  urlcolor=Violet}
  
\usepackage{helvet}

\usepackage{titlesec}

% \titleformat{\section}[leftmargin]{\normalfont\sffamily\bfseries\filleft}{}{0pt}{}
% \titlespacing{\section}{4pc}{1.5ex plus .1ex minus .2ex}{1pc}

\makeatletter
\def\@biblabel#1{\@ifnotempty{#1}{#1.}}
\makeatother

\newenvironment{Figure}
{\par\medskip\noindent\minipage{\linewidth}}
{\endminipage\par\medskip}

\makeatletter
\renewcommand{\maketitle}{\bgroup\setlength{\parindent}{0pt}
\begin{flushleft}
  \textbf{\@title}
  \@author
\end{flushleft}\egroup
}
\makeatother


\title{\bf \begin{flushleft}\fontfamily{phv}\selectfont{\Large Single-molecule multikilobase-scale profiling of chromatin accessibility using m$^6$A-SMAC-seq and m$^6$A-CpG-GpC-SMAC-seq}
\end{flushleft}}
\renewcommand\Authfont{\normalsize}
\author[1,*,$\#$]{\fontfamily{phv}\selectfont{\textbf{Georgi K. Marinov}}}
\author[1,*,$\#$]{\fontfamily{phv}\selectfont{\textbf{Zohar Shipony}}}
\author[1,2]{\fontfamily{phv}\selectfont{\textbf{Anshul Kundaje}}}
\author[1,3,4,5]{\fontfamily{phv}\selectfont{\textbf{William J. Greenleaf}}}
\renewcommand\Affilfont{\itshape\normalsize}
\affil[1]{Department of Genetics, Stanford University, Stanford, CA 94305, USA}
\affil[2]{Department of Computer Science, Stanford University, Stanford, CA 94305, USA}
\affil[3]{Center for Personal Dynamic Regulomes, Stanford University, Stanford, California 94305, USA}
\affil[4]{Department of Applied Physics, Stanford University, Stanford, California 94305, USA}
\affil[5]{Chan Zuckerberg Biohub, San Francisco, California, USA}
\affil[*]{These authors contributed equally to this work}
\affil[$\#$]{Corresponding authors}
\date{}

\def\changemargin#1#2{\list{}{\rightmargin#2\leftmargin#1}\item[]}
\let\endchangemargin=\endlist 

\theoremstyle{definition}
\newtheorem{note}{}

\begin{document}
\maketitle

\renewcommand{\abstractname}{\noindent\fontfamily{phv}\selectfont{\centerline{}
Summary}}

\renewenvironment{abstract}
 {\small
  \begin{flushleft}
  \bfseries \noindent{\large\abstractname}\par\nobreak\smallskip\vspace{-.5em}\vspace{0pt}
  \end{flushleft}
  \list{}{
    \setlength{\leftmargin}{.0cm}%
    \setlength{\rightmargin}{\leftmargin}%
  }%
  \item\relax}
 {\endlist}
 
% \renewenvironment{abstract}
%   {\small\quotation
%   {\bfseries\noindent{\large\abstractname}\par\nobreak\smallskip}}
%   {\endquotation}

\renewcommand{\figurename}{Fig.}

\centerline{}
\begin{abstract}
\noindent\noindent{\normalsize A hallmark feature of active \textit{cis}-regulatory elements (CREs) in eukaryotes is their nucleosomal depletion and, accordingly, higher accessibility to enzymatic treatment. This property has been the basis of a number of sequencing-based assays for genome-wide identification and tracking the activity of CREs across different biological conditions, such as DNAse-seq, ATAC-seq, NOMe-seq and others. However, the fragmentation of DNA inherent to many of these assays and the limited read length of short-read sequencing platforms have so far not allowed the simultaneous measurement of the chromatin accessibility state of CREs located distally from each other. The combination of labeling accessible DNA with DNA modifications and nanopore sequencing has made it possible to develop such assays. Here, we provide a detailed protocol for carrying out the SMAC-seq assay (\underline{\textbf{S}}ingle-\underline{\textbf{M}}olecule long-read \underline{\textbf{A}}ccessible \underline{\textbf{C}}hromatin mapping \underline{\textbf{seq}}uencing), in its m$^6$A-SMAC-seq and m$^6$A-CpG-GpC-SMAC-seq variants, together with methods for data processing and analysis, and discuss key experimental and analytical considerations for working with SMAC-seq datasets.
\centerline{}
\centerline{}
\indent\indent\textbf{Key words:} Chromatin accessibility, SMAC-seq, Nanopore sequencing, DNA modifications, m$^6$A, EcoGII.}
\end{abstract}
\centerline{}
\centerline{}
\noindent\makebox[\textwidth]{\rule{\textwidth}{1.5pt}}

\begin{changemargin}{3.7cm}{0cm}

\reversemarginpar\marginpar{\section{Introduction}}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig1.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Overview of the SMAC-seq experimental protocol}. Chromatin is treated with m$^6$A and optionally also with CpG and GpC 5mC methyltransferases, which preferentially methylate DNA bases within accessible chromatin. HMW DNA is then isolated and subjected to nanopore sequencing. After read mapping and identifying modified bases, the accessibility state within individual chromatin fibers can be reconstructed.}
\label{Fig1}
\end{center}
\end{figure*}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig2.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Impact of the use of dense modifications on the theoretical resolution of methylation-based chromatin accessibility assays}. (A and B) Theoretical average resolution of the SMAC-seq assay for different modification sequence contexts in the \textit{S. cerevisiae} and \textit{H. sapiens} genomes. (C and D) Limitations of using GpC m$ ^5$C modifications alone due to the non-uniform distribution of GpC dinucleotides in the genome, which results in many large regions without any informative positions.}
\label{Fig2}
\end{center}
\end{figure*}

Chromatin accessibility is a key feature of the regulation of gene expression and many other aspects of chromatin biology in eukaryotes. Nearly all eukaryote genomes are packaged by nucleosomes, with each nucleosome being a dimer of two tetramers composed of the four core nucleosomal histones H3, H4, H2A and H2B. Packaging by nucleosomes has a generally inhibitory effect on RNA polymerase activity and to the occupancy of DNA by regulatory proteins. Accordingly, active regulatory regions in the genome are characterized by depleted nucleosomal occupancy and increased chromatin accessibility. This has turned out to be a highly useful property enabling the identification of candidate \textit{cis}-regulatory elements and the tracking of their activity across cell types and conditions.

Mapping accessible chromatin relies on the preferential enzymatic action of various reagents whose access to DNA is occluded by the presence of nucleosomes. Four decades ago it was initially recognized that active cREs are hypersensitive to cleavage by DNase enzymes \cite{Wu1980,Keene1981,McGhe1981}. DNase hypersensitivity remained the primary approach for mapping cREs well into the genomic era, being first coupled to microarrays \cite{Dorschner2004,Sabo2004,Sabo2006}, and eventually high-throughput massively parallel sequencing \cite{Crawford2006,Boyle2008,Thurman2012}. 

The advent of high-throughput sequencing enabled the development of numerous novel strategies for mapping active CREs. ATAC-seq \cite{Buenrostro2013}, which relies on the preferential insertion of the Tn5 transposase enzyme into open chromatin, has emerged as the most convenient, versatile, and widely used method for studying the chromatin state of the eukaryotic cell, including down to single cell level \cite{Buenrostro2015,Cusanovich2015}.

Other methods have also been developed, using restriction enzymes \cite{Chereji2019}, nicking enzymes \cite{Ponnaluri2017}, small molecules \cite{Umeyama2017}, viral integration \cite{Timms2019}, and others. 

All of these methods share two common features -- they involve fragmentation of DNA and they enrich for accessible DNA during sequencing library generation. Consequently, it is first, not possible to enumerate accessibility states within the cellular population, i.e. how often is a given CRE accessible, and second, there is no way to study the relationship between the chromatin states of distant regulatory elements, as the linkage between them is lost during fragmentation.

An alternative strategy to cleavage-based methods is to label accessible DNA with methyltransferase enzymes, then read out methylation states using high-throughput sequencing. This is the basic idea behind the NOMe-seq assay \cite{Kelly2012} and its later dSMF extention \cite{Krebs2017}. NOMe-seq uses the GpC methyltransferase M.CviPI to label accessible DNA at GpC positions. Genomic DNA is then subjected to bisulfite readout, providing single-molecule and fractional methylation (and thus accessibility) maps genome-wide. Only M.CviPI can be used in mammalian genomes due to the presence of endogenous CpG methylation, and only the m$^5$C modification can be utilized as this is what can be read out with basepair resolution using short-read sequencing. This presents a limitation, as GpC nucleotides are only found once every $\sim$25 bp in a mammalian genome. In organisms such as \textit{Drosophila} that do not have endogenous methylation, both a GpC and a CpG methyltransferase (M.SssI) can be used, increasing resolution to $\sim$10 bp on average, in the form of the dSMF assay. This has allowed the enumeration of protein occupancy states at unprecedented resolution at a single-molecule level \cite{Krebs2017}. Yet short-read approaches of this kind are still quite limited in their capabilities. 

First, these resolution values are averages. In reality genomes contains some quite large stretches with no informative positions (Figure \ref{Fig2}), and not much can be done to address that limitation as long as m$^5$C in GpC/CpG contexts is the only available modification. 

Second, it is only possible to analyze fragments no longer than 600 bp due to read-length limitations of short-read sequencers. Even this has been very difficult to achieve, as DNA methylation has traditionally been mapped using bisulfite sequencing, and bisulfite treatment severely degrades DNA to lengths considerably shorter than 600 bp. The introduction of the EM-seq method \cite{EMSeq2019} as an alternative to bisulfite conversion has largely eliminated the degradation issue, but short reads are still short reads, making it impossible to study chromatin states on the scale of many kilobases along the chromatin fiber. 

With the advent of long read sequencing technologies, and especially nanopore sequencing, these limitations have been overcome. Nanopore sequencing is capable of reading out arbitrary DNA modifications \cite{Simpson2017,Rand2017}, and of doing so along the length of DNA molecules tens of kilobases long, allowing the simultaneous capture of the chromatin states of CREs located far apart. This has enabled the development of a qualitatively new class of functional genomic assays \cite{Shipony2018,Wang2019,Aughey2018}
 
The MeSMLR-seq\cite{Wang2019} and nanoNOMe \cite{Aughey2018} assays have adapted the NOMe-seq approach to nanopore sequencing, using a GpC methyltransferase to label accessible DNA, then reading it out using nanopore sequencing. However, while this approach preserves long-range contiguity, it still suffers from the limitations imposed by the density of informative modification positions in the genome (Figure \ref{Fig2}). 

In contrast, SMAC-seq \cite{Shipony2018} uses dense modifications, found once every few nucleotides in the genome. Accessible DNA is enzymatically labeled using a methyltransferase enzyme (or multiple such enzymes), high-molecular weight (HMW) DNA is isolated, then subjected to nanopore sequencing, which allows for the direct detection of DNA modifications and thus the assembly of an accessibility map at the single molecule level and on multikilobase scales (Figure \ref{Fig1}). In addition, the dense modifications that SMAC-seq is based on also provide information about nucleosome occupancy/positioning \cite{Schones2008} and even transcription factor footprints  \cite{Hesselberth2009,Neph2012b}. Finally, the long reads provided by nanopore sequencing allow chromatin accessibility and nucleosome positioning to be profiled within repetitive regions of the genome that are otherwise not uniquely mappable using short reads.

Here, we describe an m$^6$A-SMAC-seq protocol based on the m$^6$A (N$^6$-Methyladenosine) methyltransferase EcoGII \cite{Murray2018}, which labels A bases nonspecifically in all contexts (\textit{see} \textbf{Note \ref{EcoGIIefficiency}}) as well as a m$^6$A-CpG-GpC-SMAC-seq protocol, which uses multiple modifications (m$^6$A and m$^5$C modifications in CpG and GpC contexts) and which can be used in organisms without endogenous DNA methylation. 

We also describe basic data processing and analysis procedures for working with SMAC-seq datasets.

\end{changemargin} 

\noindent\makebox[\textwidth]{\rule{\textwidth}{1.5pt}}

\begin{changemargin}{3.7cm}{0cm} 

\reversemarginpar\marginpar{\section{Materials}}

SMAC-seq uses standard laboratory reagents with the exception of the m$^6$A methyltransferase in the m$^6$A version of the assay (\textit{see} \textbf{Note \ref{EcoGII}}). Other versions of the assay involving different modifications may also require custom reagents.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{SMAC-seq \newline buffers \newline and \newline reagents}}}

\begin{enumerate}
\item Nuclei Lysis Buffer: 10 mM Tris pH 7.4, 10 mM NaCl, 3 mM MgCl$_2$, 0.1 mM EDTA, 0.5\% NP-40

\item Nuclei Wash Buffer. This is the same as the Lysis Buffer except for the absence of NP-40: 10 mM Tris pH 7.4, 10 mM NaCl, 3 mM MgCl$_2$, 0.1 mM EDTA

\item M.CviPI Reaction Buffer: 50 mM Tris-HCl pH 8.5, 50 mM NaCl, 10 mM DTT, 

\item CutSmart Reaction buffer: 1$\times$ NEB CutSmart buffer, 0.3 M sucrose

\item Stop Buffer: 20 mM Tris-HCl pH 8.5, 600 mM NaCl, 1\% SDS, 10 mM EDTA

\item Sorbitol Buffer: 1.4 M Sorbitol, 40 mM HEPES-KOH pH 7.5, 0.5 mM MgCl$_2$

\item 100T Zymolase (Zymo Cat. \# E1005)

\item M.CviPI methyltransferase (NEB Cat. \# M0227)

\item EcoGII methyltransferase (NEB Cat. \# M0603S) (\textit{see} \textbf{Note \ref{EcoGII}})

\item M.SssI methyltransferase (NEB Cat. \# M0226)

\item 32 mM S-adenosylmethionine (SAM)

\item Sucrose solution (prepare a highly concentrated solution, e.g. 2 M)

\item Molecular biology-grade 1 M MgCl$_2$ solution

\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{HMW \newline DNA \newline isolation}}}

We have most often used the MagAttract HMW DNA Kit (Qiagen ), but other approaches for isolating HMW can also be applied, such as the NEB Monarch Genomic DNA Purification Kit, the Nanobind CBB Big DNA Kit, and others. 

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{HMW DNA \newline size \newline selection}}}

Several solutions now also exist for HMW size selection that eliminates shorter fragments. We have used the Short Read Eliminator Kit (Circulomics) with fairly consistent levels of success, but equivalent approaches are also applicable.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Nanopore sequencers and flowcells}}}

SMAC-seq data can be generated using any of the Oxford Nanopore Technologies (ONT) platforms (Flongle, MinION, GridION or PromethION). Which one to use is a decision to be made on the basis of the desired output, which in turn is determined by the needed coverage based on genome size, the properties of the genome studied, etc. (\textit{see} \textbf{Note \ref{ONT}} and \textbf{Note \ref{coverage}}).

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Nanopore \newline sequencing \newline reagents}}}

ONT offers a variety of library preparation options, the two main ones relevant to SMAC-seq being:

\begin{enumerate}
\item The Ligation Sequencing Kits (Cat \# SQK-LSK109 and SQK-LSK109-XL) are to be used if maximum read length is desired. These require $\sim$1000 ng of input HMW DNA. 
\item The Rapid Barcoding Kit (Cat \# SQK-RBK004) uses a transposases to simultaneously fragment DNA and attach adapters to the resulting pieces. Thus it will yield shorter molecules ($\leq \sim$10 kbp) but it allows the pooling of multiple samples in the same run (which is useful if, for example, working with an organism with a small genome and on a PromethION) and works with smaller amount of input HMW DNA ($\sim$400 ng). 
\end{enumerate}

Which kit is to be used depends on what the optimal choice is with respect to the particular research question and experimental system.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{General \newline materials and \newline equipment}}}

\begin{enumerate}
\item 1.5-mL microcentrifuge tubes, preferably low protein and DNA binding (\textit{see} \textbf{Note \ref{Tubes}})
\item 2-mL, 15-mL and 50-mL tubes
\item Magnetic stands r 1.5 mL and 2 mL tubes
\item Thermomixer
\item Molecular biology-grade 200 proof EtOH
\item Tabletop centrifuge
\item Nuclease-free H$_2$O
\item 1$\times$ PBS
\item AMPure beads
\item QuBit fluorometer (ThermoFisher Scientific) or equivalent
\item QuBit dsDNA HS kit (ThermoFisher Scientific)
\item TapeStation (Agilent)
\item TapeStation Genomic DNA Reagents (Agilent)
\item TapeStation Genomic DNA Screentape  (Agilent)
% \item 1$\times$ PBS buffer solution
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Computational \newline resources}}}

The computational analyses described are designed to run on standard Linux systems through the UNIX command line. The maximal memory usage depends on the size of the datasets but is usually less than $\sim$50GB. However, note that nanopore sequencing datasets can occupy very large amounts of disk space (i.e. many terabytes), thus it is advisable to use a computing system with ample storage (\textit{see} \textbf{Note \ref{diskspace}})

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Genomic \newline sequence and \newline annotation files}}}

\begin{enumerate}
\item A FASTA file containing the GRCh38 version of the human genome can be downloaded from the UCSC Genome Browser at \burl{http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz}. Genome files can also be obtained from ENSEMBL (\burl{http://ensemblgenomes.org/}) and from the NCBI website (\burl{http://www.ncbi.nlm.nih.gov/assembly/}). However, it has to be noted that in the case of the human genome, reference FASTA files available in public repositories contain alternative haplotype contigs, i.e. alternative versions of sequences already present in the assembly. These alternative haplotypes should be removed from reference files before use. The ENCODE Project \cite{ENCODE2012} provides such filtered files from its portal at \burl{https://www.encodeproject.org/data-standards/reference-sequences/}. 

The \verb|sacCer3| version of the \textit{Saccharomyces cerevisiae} genome can be obtained from \burl{http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.fa.gz}

\item Genome annotations in GTF format can be obtained from UCSC, ENSEMBL, NCBI or ENCODE. 
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Software \newline packages}}}

\begin{enumerate}
\item UCSC Genome Browser \cite{Kuhn2013,Kent2010} utilities: \burl{http://hgdownload.cse.ucsc.edu/admin/exe/}
\item R: \burl{https://www.r-project.org/}
\item Python (version 2.7 or higher) \burl{https://www.python.org/}
\item TGL Kmeans: \burl{https://github.com/tanaylab/tglkmeans}
\item SciPy: \burl{https://www.scipy.org/}
\item Matplotlib: \burl{https://matplotlib.org/}
\item Minimap2 \cite{Li2016} (version 2.17) \burl{https://github.com/lh3/minimap2}
\item Tombo (version 1.5) \burl{https://nanoporetech.github.io/tombo/}
\item Albacore  \burl{https://nanoporetech.com/}
\item Megalodon  \burl{https://github.com/nanoporetech/megalodon}
\item Guppy  \burl{https://nanoporetech.com/}
\item Rerio \burl{https://github.com/nanoporetech/rerio}
\item \verb|tabix|: \burl{http://www.htslib.org/doc/tabix.html} (\textit{see} \textbf{Note \ref{tabix}})
\item Additional scripts:  

\burl{https://github.com/georgimarinov/SMAC-seq-scripts}. 

Contains python scripts for processing and post-processing of SMAC-seq data used in the examples shown below.
\end{enumerate}

\end{changemargin} 

\noindent\makebox[\textwidth]{\rule{\textwidth}{1.5pt}}

\begin{changemargin}{3.7cm}{0cm} 
\reversemarginpar\marginpar{\section{Methods}}

The principle behind the assay and the typical SMAC-seq experimental procedure are outlined in Figure \ref{Fig1}. SMAC-seq consists of the following basic steps:

\begin{enumerate}
\item Nuclei isolation.
\item Enzymatic treatment of chromatin.
\item HMW DNA extraction.
\item Nanopore sequencing.
\item Read mapping and calling modified basis.
\item Aggregate and single-molecule accessibility analysis.
\end{enumerate}

We provide several slightly different protocols for working with yeast (\textit{see} \textbf{Note \ref{yeast}}) as well as with mammalian and fly cells.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Nuclei isolation \newline (budding yeast)}}}

Start with $2.5 \times 10^8$ yeast cells (the equivalent to $1 \times 10^6$ human cells).

\begin{enumerate}
\item Spin cells for 1 min at 13000 rpm. Remove supernatant
\item Wash cells with 100 $\mu$L Sorbitol Buffer 
\item Spin cells 1 min at 13000 rpm. Remove supernatant
\item Resuspend pellet in 200 $\mu$L Sorbitol Buffer + 10 mM DTT + 0.5 mg/mL 100T Zymolase
\item Incubate for 5 min at 30 degrees, shaking 300 rpm
\item Centrifuge for 2 min at 5000 rpm. Remove supernatant
\item Add 100 $\mu$L SB buffer (no DTT) and resuspend gently
\item Centrifuge for 2 min at 5000 rpm. Remove supernatant
\item Add 100 $\mu$L ice-cold lysis buffer
\item Incubate on ice for 10 minutes
\item Spin down at 5000 rpm for 5 min at $4\,^{\circ}\mathrm{C}$
\item Wash with 100 $\mu$L cold wash buffer 
\item Spin at 5000 rpm for 5 min at $4\,^{\circ}\mathrm{C}$
\item Resuspend in M.CviPI Reaction Buffer (100 $\mu$L for now?)
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Enzymatic \newline treatment of \newline chromatin for \newline m$^6$A-GpC-CpG-SMAC-seq \newline  (budding yeast)}}}

\begin{enumerate}
\item Add 200 U of M.CviPI and 200 U of EcoGII
\item Add SAM at a final concentration of 0.6 mM and sucrose at a final concentration of 300 mM
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\item Add 128 pmol SAM (= 4 $\mu$L 32 mM solution) and another 100 U of both enzymes 
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\item Add 60 U of M.SssI
\item Add 128 pmol SAM (= 4 $\mu$L 32 mM solution)  (\textit{see} \textbf{Note \ref{SAM}})
\item Add MgCl$_2$ at a final concentration of 10 mM
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\end{enumerate}

Stop reaction by adding an equal volume of Stop Buffer

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Nuclei \newline  isolation for \newline human, \textit{Drosophila}, \newline and other cells \newline without cell walls}}}

Start with $1 \times 10^6$ diploid human cells. Scale accordingly according to genome size, variations in cell ploidy, the aimed for amount of sequencing (\textit{see} \textbf{Note \ref{CellNumnber}}), etc.

\begin{enumerate}
\item Wash cells with 1$\times$ PBS 
\item Centrifuge for 5 min at 500 $g$ at $4\,^{\circ}\mathrm{C}$. Remove supernatant
\item Resuspend cells in 200 $\mu$L ice-cold Nuclei Lysis Buffer
\item Incubate on ice for 10 minutes
\item Centrifuge for 5 min at 500 $g$ at $4\,^{\circ}\mathrm{C}$. Remove supernatant
\item Resuspend nuclei in 200 $\mu$L cold Nuclei Wash Buffer
\item Centrifuge for 5 min at 500 $g$ at $4\,^{\circ}\mathrm{C}$. Remove supernatant
\item Resuspend nuclei in 200 $\mu$L  CutSmart Reaction buffer
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Enzymatic \newline treatment of \newline chromatin for \newline m$^6$A-SMAC-seq \newline (human cells)}}}

\begin{enumerate}
\item Add 200 U of EcoGII
\item Add SAM at 0.6 mM and sucrose at 300 mM
\item Incubate at $37\,^{\circ}\mathrm{C}$ for 10 min. 
\end{enumerate}

Stop reaction by adding SDS to a concentration of 0.2\%.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Enzymatic \newline treatment of \newline chromatin for \newline m$^6$A-GpC-CpG-SMAC-seq \newline (\textit{Drosophila} cells)}}}

\begin{enumerate}
\item Add 200 U of M.CviPI and 200 U of EcoGII
\item Add SAM at a final concentration of 0.6 mM and sucrose at a final concentration of 300 mM
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\item Add 128 pmol SAM (= 4 $\mu$L 32 mM solution) and another 100 U of both enzymes 
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\item Add 60 U of M.SssI
\item Add 128 pmol SAM
\item Add MgCl$_2$ at 10 mM
\item Incubate at $30\,^{\circ}\mathrm{C}$ for 7.5 min. 
\end{enumerate}

Stop reaction by adding SDS to a concentration of 0.2\%.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{HMW DNA \newline isolation}}}

Here we describe HMW DNA using the Qiagen MagAttract HMW DNA Kit. Many other kits/protocols can also be used with similar success

\begin{enumerate}
\item Add 20 $\mu$L Proteinase K into a 2 mL tube.
\item Add 200 $\mu$l of sample
\item Add 4 $\mu$L RNase A solution and 150 $\mu$L Buffer AL. Mix by vortexing.
\item Incubate at room temperature for 30 min.
\item Add 15 $\mu$L MagAttract Suspension G beads.
\item Add 280 $\mu$L Buffer MB and incubate at room temperature for 3 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully and completely remove the supernatant
\item Add 700 $\mu$L Buffer MW1 and incubate at room temperature for 1 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully and completely remove the supernatant
\item Add 700 $\mu$L Buffer MW1 and incubate at room temperature for 1 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully and completely remove the supernatant
\item Add 700 $\mu$L Buffer PE and incubate at room temperature for 1 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully and completely remove the supernatant
\item Add 700 $\mu$L Buffer PE and incubate at room temperature for 1 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully and completely remove the supernatant
\item Add 700 $\mu$L nuclear-free H$_2$O by slowly pipetting on the side of the tube opposite to the beads while on the magnetic stand. Do not disturb the pellet, otherwise DNA loss an ensue.
\item Remove H$_2$O, and repeat the H$_2$O wash step
\item Add an appropriate volume of Buffer AE, i.e. 100--200 $\mu$L (\textit{see} \textbf{Note \ref{Elution}}). 
\item Incubate at room temperature for 3 min at 1400 rpm in a Thermomixer.
\item Separate the beads on a magnetic stand, carefully transfer the supernatant to a new DNA lo-bind tube using a wide bore tip.
\item Measure DNA concentration using a Qubit dsDNA HS assay.
\item Evaluate the DNA size distribution profile on the TapeStaion using the gDNA screentape and reagents
\item Store the DNA at $4\,^{\circ}\mathrm{C}$ (\textit{see} \textbf{Note \ref{DNAStorage}})
\end{enumerate}

\begin{figure*}
\begin{center}
\includegraphics[width=14cm]{Fig3.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{HMW DNA isolation and size selection for long-read sequencing}. It is of critical importance for the success of SMAC-seq experiments (and many other long read-based assays) to use high quality HMW DNA as input to sequencing. Numerous protocols exist for isolating HMW DNA and HMW DNA size selection. Shown are TapeStation gDNA profiles for a DNA sample with poor size distribution (A), a DNA sample with good size distribution (B), and a DNA sample after size selection using the Circulomics Short Read Eliminator Kit (C).}
\label{Fig3}
\end{center}
\end{figure*}

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{DNA size \newline selection}}}

Selection of very HMW DNA using the Circulomics Short Read Eliminator Kit is described here. Use either the SRE or the SRE XL version depending on the properties of the genome studied and the input DNA size distribution. The SRE XL version will remove fragments $\leq$40 kbp while the SRE one will eliminate fragments $\leq$25 kbp.

\begin{enumerate}
\item Start with a total volume of 60 $\mu$L at DNA concentration between 50 and 150 ng/$\mu$L in a 1.5 mL DNA LoBind tube.
\item Add 60 $\mu$L of Buffer SRE or Buffer SRE XL. Mix by tapping
\item Centrifuge at 10,000 $g$ for 30 minutes at room temperature.
\item Carefully remove the supernatant without disturbing the DNA pellet (note that the pellet is not visible; always place the tube with the hinge facing outwards to ensure reliable positioning of the pellet at the bottom of the tube)
\item Add 200 $\mu$L of 70\% EtOH (make it fresh immediately before use). Do not tap or mix. Centrifuge at 10,000 $g$ for 2 minutes at RT.
\item Carefully remove the supernatant without disturbing the DNA pellet.
\item Repeat the 70\% EtOH wash and centrifugation step
\item Add at least 50 $\mu$L Buffer EB and incubate at room temperature for 20 minutes (\textit{see} \textbf{Note \ref{Elution}}). 
\item Resuspend well by tapping
\item Measure DNA concentration using a Qubit dsDNA HS assay.
\item Evaluate the DNA size distribution profile on the TapeStaion using the gDNA screentape and reagents
\item Store the DNA at $4\,^{\circ}\mathrm{C}$ (\textit{see} \textbf{Note \ref{DNAStorage}})
\end{enumerate}

Example TapeStation results for poor-quality, high-quality and post-size selection HMW DNA are shown in Figure \ref{Fig3}.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Nanopore library construction \& sequencing}}}

Carry out nanopore library construction and sequencing according to the manufacturer's instructions depending on the particular kit and flowcell/sequencer being used.

\centerline{}
\reversemarginpar\marginpar{\subsection{\textit{Computational analysis}}}

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig4V2.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Summary of the SMAC-seq analysis workflow}. For Tombo processing, raw nanopore read traces are first subjected to base calling, mapped to the reference genome, and modified bases are then identified after ``resquiggling'' of the reads. The newer Megalodon-based processing combines these steps in one. Per-read modification calls are then extracted, and converted into a common file format that allows for downstream tasks to be carried out.}
\label{Fig4}
\end{center}
\end{figure*}

The basic processing of SMAC-seq data described here consists of the following steps:

\begin{enumerate}
\item Initial base calling
\item Read mapping 
\item Generating modification calls
\item Compilation of basic data statistics
\item Generation of aggregate modification scores
\item Generation of averaged coverage tracks
\end{enumerate}

Analysis at the single molecule level can be subsequently carried out.

The overall workflow is summarized in Figure \ref{Fig4}.

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Read \newline mapping \newline and \newline modification \newline calling}}}

There are two different ways to extract modifications. 

Historically, SMAC-seq per-read modification calls were extracted using Tombo, which is a non-model based DNA modification caller for any context. It is no longer updated and requires basecalling using the older and less accurate base calling software Albacore.

The state-of-art way to call modifications is Megalodon. Megalodon is a command-line tool that combines base calling using Guppy with modified base calling based on pre-trained modification calling models in the Rerio package, in which all-context m$^6$A and 5mC models are available. Because of the higher accuracy of these models and the ease of use of Megalodon, this is at present the preferred method for calling modifications.

Calling modifications with Tombo involves the following steps (run these commands for each individual fast5 file in parallel to speed up the process):

\begin{enumerate}
\item Basecalling using Albacore. Tombo requires that reads are first base-called using Albacore. Running Albacore requires the user to specify the exact type of flowcell and the kit used to build the library, as follows:

\begin{verbatim}
    read_fast5_basecaller.py --flowcell {FLOW_CELL}
    --kit {RUN_KIT}  -i {FAST5_DIR} 
    -t {NUMBER_OF_THREADS} -s {OUTPUT_DIR}
    -o fastq,fast5 --disable_filtering
\end{verbatim}

\item Read preprocessing. Following base calling at the read level using Albacore, Tombo maps every read to its corresponding fast5 signal track, as follows:

\begin{verbatim}
    tombo preprocess annotate_raw_with_fastqs 
    --processes {NUMBER_OF_THREADS} --overwrite 
    --fast5-basedir {FAST5_DIR} 
    --fastq-filenames {ALBACORE_PRODUCED_FAST5}
\end{verbatim}

\item Tombo resquiggling. Next, the reads are mapped and nanopore signal is ``resquiggled'' against the reference genome as follows (note that Tombo uses \verb|minimap2| to carry out the mapping):

\begin{verbatim}
    tombo resquiggle  --ignore-read-locks
    --processes {NUMBER_OF_THREADS} --overwrite
    {FAST5_DIR} {REFERENCE_GENOME}
\end{verbatim}

\item Tombo \textit{de novo} modification calling. To call m$^6$A and 5mC modifications in all contexts we use the \textit{de novo} mode of Tombo as follows:

\begin{verbatim}
    tombo detect_modifications de_novo 
    --statistics-file-basename {STATS_FILE_NAME}
    --per-read-statistics-basename {MODS_FILE_NAME}
    --processes {NUMBER_OF_THREADS}
    --multiprocess-region-size 2000000
    --fast5-basedirs {FAST5_DIR}
\end{verbatim}
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Tombo \newline extraction}}}

Default Tombo outputs do not include information about modification at the basepair single-molecule level. These need to be extracted using the Tombo Python API using custom-written scripts.

Run the \verb|TomboSingleReadsExtract-tombo_de_novo-1.5.py| script in order to convert Tombo \verb|per_read_stats| files into text files. The script has multiple options for different sequence contexts, excluding certain sequence contexts, etc.:

\begin{verbatim}
python TomboSingleReadsExtract-tombo_de_novo-1.5.py 
   tombo.per_read_stats genome.fa outfile_prefix 
   [-m5C-only] [-m6A-only] [-CG-only] [-CG-CG-only] 
   [-GC-only] [-m6A-CG-only] [-m6A-GC-only] 
   [-m6A-GC-CG-only] [-doT] [-T-only]
   [-generic bases(comma-separated)]
   [-excludeContext string(...,stringN) radius]
   [-excludeChr chr1[...,chrN]] 
   [-chrPrefix string]
\end{verbatim}

Example for \verb|A| positions:

\begin{verbatim}
python TomboSingleReadsExtract-tombo_de_novo.py 
   0.tombo.per_read_stats genome.fa 0.tombo.m6A-only 
   -m6A-only
\end{verbatim}

Example for \verb|A|, CpG and GpC positions:

\begin{verbatim}
python TomboSingleReadsExtract-tombo_de_novo.py 
   0.tombo.per_read_stats genome.fa 0.tombo.m6A-GC-CG-only 
   -m6A-GC-CG-only
\end{verbatim}

Run the script for each individual \verb|tombo.per_read_stats| file.

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Read \newline mapping and \newline modification \newline calling \newline using \newline Megalodon}}}

Megalodon is run in one step as follows:

\begin{verbatim}
megalodon {LOCATION_OF_FAST5_FILES} 
--guppy-params "-d {PATH_TO_RERIO_MODELS}"
--guppy-config res_dna_r941_min_modbases-all-context_v001.cfg 
--outputs basecalls,mod_basecalls,per_read_mods 
--reference {REFRENCE_GENOME}
--write-mods-text --output-directory {OUTPUT_DIR} 
--guppy-server-path {LOCATION_OF_GUPPY_BIN}
\end{verbatim}

For the purposes of downstream single-molecule analysis the \verb|--outputs| \verb|basecalls|\verb|,mod_basecalls|\verb|,per_read_mods| and \verb|--write-mods-text| options need to be specified. These will result in output of per-read modifications in a text format.

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Megalodon per-read modification extraction}}}

To extract the per-read modification, we run the following script:

\begin{verbatim}
python megalodon-to-single_line.py 
*.per_read_modified_base_calls.txt 
*megalodon.reads.tsv
\end{verbatim}

Run the script for each individual Megalodon file.

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Merging \newline and \newline indexing}}}

Merge the converted files into a single file, and sort by coordinates in the same step:

\begin{verbatim}
cat *.reads.tsv | sort -k1,1 -k2,2n -k3,3n 
     | bgzip > merged.reads.tsv.bgz 
\end{verbatim}

Then \verb|tabix|-index the file:

\begin{verbatim}
tabix -s 1 -b 2 -e 3 merged.reads.tsv.bgz
\end{verbatim}

This will create a \verb|tabix|-index |.bgz| file in the following format, with one entry for each read:

\begin{enumerate}
\item Column 1: chromosome
\item Column 2: left-most modified/informative position within the read
\item Column 3: right-most modified/informative position within the read
\item Column 4: \verb|.| character (for legacy reasons)
\item Column 5: nanopore read ID
\item Column 6: \verb|nan| (for legacy reasons)
\item Column 7: comma-separated list of modified/informative positions
\item Column 8: comma-separated list of Tombo probabilities, matching the order of the positions in Column 7.
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Calculate \newline mapping \newline statistics}}}

Calculate read mapping statistics as follows:

\begin{verbatim}
python NanoporeTSVMappingStats.py 
       merged.reads.tsv.bgz 
       NanoporeTSVMappingStats-merged
\end{verbatim}

This will produce a short report with the total number of mapped reads, the total number of mapped bases, the mean mapped read length and the median read length.

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Create \newline coverage \newline file}}}

While the true strength of SMAC-seq lies in the single-molecule analysis, SMAC-seq data can also be highly informative at an aggregate level, which allows for CREs and positioned nucleosomes to be discerned by visualization of average SMAC-seq profiles on a genome browser. 

For the purpose of such analyses, a coverage file in the style of the output from the popular bisulfite sequencing analysis tool Bismark \cite{Krueger2011} is created, using the \verb|methylation-reads-tsv-to_coverage.py| script:

\begin{verbatim}
python methylation_reads_all.tsv threshold outfile 
 [-stranded +|-] [-minAbsLogLike float] 
 [-minAbsPValue float] 
 [-BayesianIntegration window(bp) step alpha beta 
 pseudosamplesize] [-N6mAweight pseudosamplesize genome.fa] 
 [-saveNewSingleMoleculeFile filename]
\end{verbatim}

Nanopore DNA modification data is not binary, instead it is recorded as probabilities. It thus has to be binarized at some threshold. We have found, through exploration of the parameter space and comparison to known biological truths, that the most intuitive threshold of 0.5 works optimally \cite{Shipony2018}. Example:

\begin{verbatim}
python methylation-reads-tsv-to_coverage.py 
  merged.reads.tsv.bgz 0.5 
  merged.cutoff_0.5.coverage
\end{verbatim}

Convert the resulting plain text file to a \verb|.bgz| file:

\begin{verbatim}
cat merged.cutoff_0.5.coverage | 
  bgzip > merged.cutoff_0.5.coverage.bgz
\end{verbatim}

Then \verb|tabix|-index it:

\begin{verbatim}
tabix -s 1 -b 2 -e 3 merged.cutoff_0.5.coverage.bgz
\end{verbatim}

The format of the \verb|coverage| file is as follows:

\begin{enumerate}
\item Column 1: chromosome
\item Column 2: left-most position of the modified/informative sequence context
\item Column 3: right-most position of the modified/informative sequence context
\item Column 4: number reads in which the sequence context is methylated
\item Column 5: number reads in which the sequence context is unmethylated
\item Column 6: total number of reads
\end{enumerate}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Bayseian \newline integration}}}

Even when using m$^6$A, SMAC-seq still does not cover every single nucleotide in the genome, and coverage varies substantially between different locations depending on local sequence content differences. In addition, base calling for ONT data is still far from perfectly accurate (\textit{see} \textbf{Note \ref{ERRORRATE}}), and detecting modifications is particularly challenging. On the other hand, the biologically meaningful length scale for DNA accessibility is not necessary the individual basepair, but somewhat larger sequence contexts. 

For these reasons we often use aggregate accessibility scores over fixed-length windows, which combine information over all available informative positions in the window, thus providing more reliable, even if coarser-grained, views of accessibility patterns. This is done using a simple Bayesian procedure, as follows.

For a given window of width $w$, specified by coordinates $c,i,i+w$ (where $c$ is the chromosome, and $i$ is the leftmost coordinate of the window), and for all reads $r \in R_{c,i,i+w}$ fully spanning the window, we obtain all Tombo probabilities $p_{r,(c,j)}$ such that $j \in [i,i+w)$ for the assayed sequence contexts on the corresponding genomic strand (\textit{see} \textbf{Note \ref{stranded}}). We usually use a Beta prior $B$($\alpha$,$\beta$), with $\alpha = \beta = 10$, which is updated based on each probability $p_{r,(c,j)}$ for all $j \in [i,i+w)$ (but the prior can be easily changed if necessary, see below), in order to obtain a final accessibility score $p_{r,(c,i,i+w)}$ for read $r$ and window $c,i,i+w$. 

This Bayseian integration calculation is also carried out using the same \verb|methylation-reads-tsv-to_coverage.py| script. For efficiency of calculation, compute it in parallel on the individual converted tombo files, as follows (for a 10-bp context and (10,10) prior):

\begin{verbatim}
python methylation-reads-tsv-to_coverage.py 
  0.tombo 0.5 
  0.tombo.all0.cutoff_0.5.coverage.BI_w10_a10_b10 
  -minAbsPValue 0.4 -BayesianIntegration 10 1 10 10 50 
  -saveNewSingleMoleculeFile 
  0.tombo.BI_w10_a10_b10.reads.tsv
\end{verbatim}

Merge the Bayesian integration files:

\begin{verbatim}
cat *tombo.BI_w10_a10_b10.reads.tsv 
  | sort -k1,1 -k2,2n -k3,3n 
  | bgzip > merged.BI_w10_a10_b10.reads.tsv.bgz
\end{verbatim}

Then \verb|tabix|-index the resulting \verb|.bgz| file:

\begin{verbatim}
tabix -s 1 -b 2 -e 3 merged.BI_w10_a10_b10.reads.tsv.bgz
\end{verbatim}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Filtering \newline fully \newline methylated \newline reads}}}

On occasions, we observe a population of reads that appear as fully methylated across their whole length or over large segments of it. They are most likely derived from dead cells or represent some other undesired artifact. In order to remove such potentially artifactual reads, we obtain a ``filtered'' read set by removing all reads containing a $\geq$1-kbp stretch that is $\geq$75\% methylated (while also filtering out reads shorter than 1 kb).

This operation can be carried out using the \verb|filterFullyMethylatedReads.py| script as follows:

\begin{verbatim}
python filterFullyMethylatedReads.py 
  methylation_reads_all.tsv WindowSize minFraction 
  [-keepShort] [-missingBasesFilter genome.fa 
  basecontexts(comma-separated) minFraction 
  [-doMBFSet]]
\end{verbatim}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Create \newline genome \newline browser \newline tracks}}}

In order to create average-methylation (and thus accessibility) tracks that can be visualized on a genome browser such the UCSC or the WashU ones, use the following script:

\begin{verbatim}
python coverage_to_wig.py coverage.bgz window step chrField 
  MfieldID UfieldID chrom.sizes outprefix [-minCov N_reads]
\end{verbatim}

Where the \verb|M| and the \verb|U| fields indicate the column IDs of the numbers of methyllated and unmethylated reads, respectively, and the \verb|window| and \verb|step| parameters specify the width and the stride used for averaging the signal (i.e. window of 50 and step of 5 means that the average methylation level over 50 bp windows tiling the genome every 5 bp will be outputted).

This script will output two \verb|bedGraph| files -- a \verb|coverage.wig| one (which contains the number of reads covering a position) and a \verb|meth.wig| one (which contains the fraction of methylated reads). These can then be converted into \verb|bigWig| files that can in turn be displayed on a genome browser using the \verb|wigToBigWig| program from the UCSC utilities:

\begin{verbatim}
wigToBigWig meth.wig chrom.sizes meth.wig
\end{verbatim}

Where the \verb|chrom.sizes| files contains one line per chromosomes including the chromosome name and its length in bp (tab-separated).

An example of an average SMAC-seq profile is shown in Figure \ref{Fig5}.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig5.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Examples of average m$^6$A-CpG-GpC-SMAC-seq profiles visualized on the UCSC Genome Browser}. Shown is a subtelomeric regions on chrXVI. SMAC-seq signal provides information both about accessible open chromatin measures (peaks in DNAse-seq data) and positioned nucleosomes. The latter are shown here in the form of H4S47C chemical nucleosome mapping \cite{Brogaard2012}, which maps the positions of dyads (SMAC-seq signal is enriched on nucleosome linkers, thus the inverse relationship between the two). SMAC-seq, being a long-read assay, also provides information about repetitive regions of the genome (in this case, telomeres, which are not uniquely mappable with short reads as shown by the 36-mer unique mappability track).}
\label{Fig5}
\end{center}
\end{figure*}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Making \newline metaplots \newline around a \newline position}}}

A common analysis task is to generate a metaplot around a given set of genomic features (such as TSSs, positioned nucleosomes, TF binding motifs, and others). The \verb|coverage.bgz| can be used to make such metaplots, as follows, with a variety of parameters (window size, minimal coverage per position, different input file formats, stranded or unstranded, and others):

\begin{verbatim}
python signalAroundPeaks-nano.py inputfilename chrFieldID 
 posField strandField radius window coverage.bgz 
 outputfilename [-bismark.cov] [-bed] [-minCov N] 
 [-unstranded] [-narrowPeak]
\end{verbatim}

Examples of such plots around yeast transcription starts sites and human occupied CTCF motifs are shown in Figure \ref{Fig6}.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig6.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Examples of average SMAC-seq metaprofiles over predefined genomic features}. (A and B) Average m$^6$-CpG-GpC SMAC-seq profiles for the top 20\% and bottom 20\% of genes (ranked by expression levels) in \textit{S. cerevisiae}. Profiles are split by modification channel. (C) Average m$^6$-SMAC-seq profile around CTCF ChIP-seq peaks in the human GM12878 cell line. CTCF is known to strongly position nucleosomes in the vicinity of its occupancy sites \cite{Fu2008}. ChIP-seq peaks were obtained from the ENCODE Project Consortium \cite{ENCODE2012}.}
\label{Fig6}
\end{center}
\end{figure*}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Making \newline single \newline molecule \newline plots}}}

One of the two key strengths of SMAC-seq is the ability to analyze accessibility at the single molecule level. There are many ways to do that, due to the non-binary nature of raw nanopore data and of the long length of nanopore reads, which allows/requires analysis at different resolution levels. Single molecule maps can be generated using the continuous modification probability values or they can be binarized.

The \verb|SMAC-footprints-from-methylation-reads-tsv-tabix.py| and \verb|SMAC-footprints-from-methylation-reads-tsv-tabix-kmeans.py| scripts can be used to generate such plots. The first script will apply hierarchical clustering while the second one will use $k$-means (in our experience, we obtain decidedly better results using the $k$-means approach). The commands are otherwise the same. There is a wide variety of options regarding the input list of region (which can be in any format), the display (averaging over arbitrary number of basepairs), subsampling of reads, color schemes, binarization or continuous display, and others:

\begin{verbatim}
python methylation_reads_all.tsv peak_list chrFieldID 
  leftFieldID rightFieldID strandFieldID tabix_path 
  outfile_prefix  [-resize factor] [-subset N] 
  [-label fieldID] [-minCov fraction]
  [-minPassingBases fraction] [-minReads N] 
  [-unstranded]  [-minAbsLogLike float] 
  [-scatterPlot colorscheme minScore maxScore color|none]
  [-window bp] [-readStrand +|-] 
  [-printMatrix] [-deleteMatrix] [-binarize threshold]' 
\end{verbatim}
 
The following command will generate binarized single molecule maps retaining only reads that completely span the input set of regions, averaging over 10bp windows:

\begin{verbatim}
python SMAC-footprints-from-methylation-reads-tsv-tabix-kmeans.py 
  SMAC-seq.reads.tsv.bgz regions.bed 0 1 2 3 tabix 
  SMAC-seq.regions.binary-0.5-gist_heat.10bp 
  -window 10 -minCov 1 -binarize 0.5 
  -scatterPlot gist_heat 0 1.1 w -unstranded
\end{verbatim}

An example of such a single-molecule level visualization for yeast m$^6$A-CpG-GpC-SMAC-seq data is shown in Figure \ref{Fig7}A.

The following command will generate continuous-signal single molecule maps retaining only reads that completely span the input set of regions, averaging over 10bp windows:

\begin{verbatim}
python SMAC-footprints-from-methylation-reads-tsv-tabix-kmeans.py 
  SMAC-seq.reads.tsv.bgz regions.bed 0 1 2 3 tabix 
  SMAC-seq.regions.binary-0.5-RdYlBu.10bp 
  -window 10 -minCov 1 
  -scatterPlot RdYlBu 0 1 w -unstranded
\end{verbatim}

An example of such a single-molecule level visualization for yeast m$^6$A-CpG-GpC-SMAC-seq data is shown in Figure \ref{Fig7}B.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig7.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Examples of single molecule m$^6$A-CpG-GpC-SMAC-seq maps in \textit{S. cerevisiae}}. Shown is the yeast rDNA locus, binarized (A) and as a continuous display (B). Yeast rDNA is organized into multicopy ($\sim$150) arrays, consisting of $\sim$9.1 kb units, each containing a copy of the 35S precursor pre-rRNA, transcribed by Pol I, a 5S RNA, transcribed by Pol III, and a replication origin ARS element, located in non-transcribed (NTS) regions of the array. The rDNA chromatin structure adopts two distinct conformations\cite{Conconi1989,Goetze2010} -- an inactive nucleosomal state and an extremely highly transcriptionally active, largely devoid of nucleosomes (and thus highly accessible) state. Note that 1,000 reads were sampled at random for each plot, and that different samplings are shown in (A) and (B).}
\label{Fig7}
\end{center}
\end{figure*}

% \centerline{}
% \reversemarginpar\marginpar{\subsubsection{\textit{Calculating \newline single-molecule \newline correlations}}}

% Coordinated accessibility was evaluated as follows. For each pair of locations $(c,i_1,i_1 + r_1)$ and $(c,i_2,i_2 + r_2)$ (usually $r_1 = r_2$), a minimum number of reads $N$ was required that fully spans the $(c,i_1,i_2 + r)$ interval. All such reads were then obtained for each pair, and then subsampled multiple times down to $N$ reads (in order not to introduce bias in coordinated accessibility tests arising due to differential read coverage between locations closer/further apart). For each subsampling, the fraction of accessible regions $p_1$ and $p_2$ was estimated for each of the two locations using the Bayesian procedure described above, as well as the distribution of joint accessibilities over the four states $(0,0)$, $(1,0)$, $(0,1)$, and $(1,1)$. The two halves of the reads were then virtually split in half and recombined for a total of $10^3$ random combinations. The empirical distribution $\mathcal{N}(\mu,\sigma)$ of the four states was then estimated from these random combinations, where $\mu = N*(p_{(0,0)} + p_{(1,1)})$  if $p_{(0,0)} + p_{(1,1)} > 0.5$ and $\mu = N*(p_{(1,0)} + p_{(1,0)})$  if $p_{(0,0)} + p_{(1,1)} \leq 0.5$. Empirical coordinated accessibility $p$-values were then estimated based on the observed counts $|(0,0)| + |(1,1)|$ if $|(0,0)| + |(1,1)| > 0.5*N$ or $|(0,1)| + |(1,0)|$ if $|(0,0)| + |(1,1)| \leq 0.5*N$. Bonferroni correction was applied to account for multiple hypothesis testing.

% To estimate coaccessibility between all pairs of regions (with sufficient coverage), use the \verb|SingleMoleculeCorrelation-empirical-quantiles.py| script:

% \begin{verbatim}
% python SingleMoleculeCorrelation-empirical-quantiles.py 
%   methylation_reads_all.tsv peaks chrFieldID leftFiled 
%   RightFieldID minCoverage maxDist N_samplings 
%   tabix_location outfile [-subsample N] [-quantiles N]
% \end{verbatim}

% Example:

% \hl{EXPLAIN EXAMPLE}

% \begin{verbatim}
% python SingleMoleculeCorrelation-empirical-quantiles.py 
%  SMAC-seq.reads.tsv.bgz regions.bed 0 1 2 100 20000 1000 
%  tabix SMCorrelation.SMAC.regions.q5.ss50 
%  -quantiles 5 -subsample 50
% \end{verbatim}

% \begin{figure*}
% \begin{center}
% \includegraphics[width=14cm]{Fig8.png}
% \captionsetup{singlelinecheck=off,justification=justified}
% \caption{\small \textbf{Examples of coordinated accessibility estimates in \textit{S. cerevisiae}}. \hl{XXXXXX}.}
% \label{Fig8}
% \end{center}
% \end{figure*}

\centerline{}
\reversemarginpar\marginpar{\subsubsection{\textit{Calculating NMI \newline matrices}}}

Finally, another common analysis task when working with SMAC-seq data is to estimate the degree of single-molecule coaccessibility along the chromatin fiber. 

To this end, we apply a Normalized Mutual Information as follows. Each chromosome $c$ is split into windows of size $w$. For each such window $(c,i,i+w)$, the maximum range to the right of it, $(c,j,j+w)$ such that the span $(c,i,j+w)$ is covered by $\geq M$ reads, is identified. All reads spanning $(c,i,j+w)$ are then extracted and subsampled down to $M$ reads (usually $M = 100$). Accessibility scores are then aggregated and binarized for all windows located in the span $(c,i,j+w)$, and for all $M$ reads fully spanning it, resulting in a local co-accessibility matrix $LCM$ of size $M \times (j+w-i)/w$. A Normalized Mutual Information (NMI) score for each pair of columns $LCM_k$ and $LCM_l$ is then calculated as follows:

\begin{equation}
\begin{split}
MI(LCM_k,LCM_l) & = p(0,0) \log_2{ \left(\cfrac{p(0,0)}{p_k(0)\,p_l(0)}\right) } \\
                & + p(1,1) \log_2{ \left(\cfrac{p(1,1)}{p_k(1)\,p_l(1)}\right) } \\
                & + p(0,1) \log_2{ \left(\cfrac{p(0,1)}{p_k(0)\,p_l(1)}\right) } \\
                & + p(1,0) \log_2{ \left(\cfrac{p(1,0)}{p_k(1)\,p_l(0)}\right) } \\
\end{split}
\end{equation}

While in principle mutual information cannot be negative, NMI scores are normalized and rescaled in the interval $(-1,1)$ so that anti-correlated regions are given negative scores (this is done for visualization and interpretation purposes):

\begin{equation}
NMI(LCM_k,LCM_l) = \begin{cases}
\cfrac{MI(LCM_k,LCM_l)}{\sqrt{H(LCM_k)H(LCM_l)}} & \text{ for } p(0,0) + p(1,1) \geq 0.5 \\
-\cfrac{MI(LCM_k,LCM_l)}{\sqrt{H(LCM_k)H(LCM_l)}} & \text{ for } p(0,0) + p(1,1) < 0.5
\end{cases}
\end{equation}

Where $H$ refers to the entropy of each individual distribution. 

To calculate NMI matrices, the \verb|SingleMoleculeCorrelation-NMI-matrix.py| script can be used.

\begin{verbatim}
python SingleMoleculeCorrelation-NMI-matrix.py 
  SMAC-seq.reads.tsv.bgz regions.bed
 chrFieldID leftField rightFieldID minCoverage 
 windowsize stepsize tabix_location outfileprefix 
 [-subsample N] [-expectedMaxDist bp] [-label fieldID]
\end{verbatim}

Example:

\begin{verbatim}
python SingleMoleculeCorrelation-NMI-matrix.py 
  SMAC-seq.reads.tsv.bgz regions.bed 0 1 2 50 1 1200 tabix 
  NMI.min50cov.1bp.regions.SMAC-seq -expectedMaxDist 1500
\end{verbatim}

If running genome-wide, split the genome into overlapping bins for paralellization efficiency, e.g. 50-kbp in size with a 10-kbp stride, and calculate a separate matrix for each, then take the average NMI values for each pair of coordinates for downstream analyses.

An example of the results of NMI analysis for yeast m$^6$A-CpG-GpC-SMAC-seq data is shown in Figure \ref{Fig9}.

\begin{figure*}[!ht]
\begin{center}
\includegraphics[width=14cm]{Fig9.png}
\captionsetup{singlelinecheck=off,justification=justified}
\caption{\small \textbf{Example of а m$^6$A-CpG-GpC-SMAC-seq coaccessibility maps (measured in terms of NMI) in \textit{S. cerevisiae}}. Shown is the promoter region of the \textit{DCS1} gene, together with chemical mapping nucleosome positioning data (top), the single-molecule SMAC-seq map (middle), and the coaccessibility map (bottom).}
\label{Fig9}
\end{center}
\end{figure*}

\end{changemargin} 

% \centerline{}
% \end{changemargin} 

% \noindent\makebox[\textwidth]{\rule{\textwidth}{1.5pt}}

\begin{changemargin}{3.7cm}{0cm} 
\reversemarginpar\marginpar{\section{Notes}}

\begin{enumerate}

\item \label{EcoGIIefficiency} EcoGII deposits m$^6$A modifications without a sequence preference, but it does not do so with perfect efficiency. It is not conclusively established why that is, i.e. whether the presence of neighboring already modified bases prevents further methylation or whether perhaps the enzyme is highly non-processive and stays bound to DNA for a prolonged period after completion of the reaction, thus occluding neighboring bases from further enzymatic action. The methylation efficiency reported by NEB is $\sim$50\%, however, this is based on a relatively short treatment (5 minutes). On the other hand, based on the original more detailed study describing EcoGII \cite{Murray2018}, methylation efficiency seems to be closer to 80\% for a prolonged treatment of about an hour. The incubation times during a SMAC-seq experiment would place the expected efficiency somewhere in between these values.

Unfortunately, the most straightforward imaginable experiment that would properly establish EcoGII methylation effiencies in the context of a nanopore-based experiment -- nanopore sequencing of naked DNA subjected to EcoGII treatment -- is not in fact possible because of the strong bias of the Oxford Nanopore platform against fully methylated templates, which simply do not sequence well and are mostly discarded. 

\item \label{EcoGII} EcoGII is commercially available as a solution at relatively low concentration, and if sufficiently many units of it are to be used, the volume needed becomes too large and could interfere with the labeling reaction. For these reasons, we are using a custom-made highly concentrated EcoGII from NEB.

\item \label{SAM} SAM is unstable. This is one of the reasons why it is added twice to the reaction, and it is also why it should be handled carefully, avoiding repeated freeze-thaw cycles.

\item \label{ONT} ONT offers multiple sequencing platforms, and it is advisable to be familiar with the properties of each and the tradeoffs between them.

The MinION is the main ONT sequencing platform, typically generating $\sim$10 Gbp data (although higher throughput runs have also been observed in practice, up to 20 Gbp) and several million reads. In the context of SMAC-seq, a single MinION is often sufficient to generate adequate coverage over a small genome such as that of yeasts.

The Flongle is a miniaturized flowcell that also runs on the MinION instrument, typically generating $\sim$100,000 reads. It is not sufficient for production-scale runs, but as it is priced at $\sim$ 1/9 of the cost of a MinION flow cell, it is ideal for testing protocol, carrying out QC runs, etc. 

The GridION can use either MinION or Flongle flowcells and run five of them in parallel.

The PromethION is the high-throughout ONT sequencer. It uses different flowcells, each of which can generate up to $\sim$100 Gbp of data (and more than 10 million reads), and can run 48 such flowcells in parallel at the same time. Each such flowcell is priced at more than twice the cost of a MinION flowcell. To study larger and more complex eukaryotic genomes using SMAC-seq, the throughput of the PromethION becomes necessary, and often multiple such flowcells are needed.

\item \label{coverage} It is important to note that ``coverage'' means very different things in the contexts of genome sequencing and SMAC-seq. Usually, ``coverage'' refers to how many reads cover a given position in the genome on average. However, the more relevant metric for SMAC-seq is instead ``coverage at length $L$'', i.e. how many reads cover two position spread apart at a given instance. One of the main goals of SMAC-seq is to capture the coordinated behavior of distal CREs and this is only possible when sufficiently many single molecules containing both CREs have been sequenced. Across eukaryotes a clear trend is observed -- as genome size increases, CREs become spread apart more and more. Thus while two yeast promoters are on average 1.5-2 kbp apart, the distance between an enhancer and its cognate promoter in a mammalian genome is often tens of kilobases. Thus the required sequencing throughput to achieve the same effective ``coverage at legnth $L$'' does not scale linearly once the fact that even with careful size selection there are still many more shorter nanopore reads than very long ones is taken into account.

\item \label{Tubes}Low-binding tubes are preferable in order to minimize DNA loss.

\item \label{tabix} The files containing single-molecule SMAC-seq information can be huge in size, surpassing 1 TB on occasions. Random access is critical for downstream analysis to be practical. The workflows described here achieve this by using \verb|tabix| indexing of coordinate-sorted files.

\item \label{diskspace} The sheer volume of nanopore sequencing data presents a different level of challenge in terms of computational infrastructure compared to short-read sequencing. A single PromethION flowcell can produce 100 Gbp of data within 48 hours, and a PromethION instrument can in principle run 48 such flowcells in parallel. 

However, base calls are far from the only information that needs to be stored. For analysis of SMAC-seq datasets (and of DNA modifications in general), the nanopore current signal itself is what is most important, as it is used during the resquiggling and DNA modification detection steps. Thus the actual disk space footprint of such a flowcell is between one and two orders of magnitude higher than storing the base calls alone. 

In addition, a separate challenge has historically been posed by the number of files. This has changed with more recent versions of the ONT processing software, but historically ONT data has been stored in a large number of individual small files, which could be so large that it reached the limit on the number of files per use that many shared computational clusters have in place, necessitating sequential processing of datasets in batches and cleaning of files in between each.

\item \label{CellNumnber} Nanopore sequencing involves no amplification of DNA while having strict constraints on the minimum amount of DNA that is to be used as input to each sequencing run. A typical PromethION run uses at least 1 $\mu$g of DNA, but if size selection is to be applied prior to it, this corresponds to several times more input DNA per run. A typical diploid human cell contains $\sim$6 pg of DNA, thus 1 $\times$ $10^6$ cells contain $\sim$6 ug of DNA. Multiple PromethION runs are required to obtain good coverage for a mammalian-sized genome, thus tens of micrograms of DNA are needed as input to size selection and then sequencing. Scale up reactions accordingly based on the specifics of the experiment with these considerations in mind.

\item \label{Elution} Elution volumes are important for nanopore sequencing. All ONT sequencing kits have a minimum requirement for the amount of input DNA but also a maximum limit to the volume in which it is contained. Concentrating DNA using beads will result in significant losses while doing so by evaporation leads to its degradation. Thus it is best to have a large amount of DNA in a small volume. However, there is a tradeoff between the elution volume and the efficiency of elution -- larger elution volumes lead to better overall yields. Thus the optimal elution volume is to be decided based on the number of cells used for the SMAC-seq reaction and the exact ONT kits that are to be used for sequencing.

\item \label{DNAStorage} HMW DNA is stable for a long time at 4$\,^{\circ}\mathrm{C}$, but it is strongly recommended not to freeze it at -20$\,^{\circ}\mathrm{C}$ or -80$\,^{\circ}\mathrm{C}$ as this will likely result in fragmentation. Also, highly concentrated HWM DNA can sometimes precipitate out of solution after prolonged storage so make sure to inspect tubes before use. Resuspend by tapping the tubes with your fingers, do not pipette up and down as this is also thought to lead to HMW DNA degradation.

In addition, always transfer HMW DNA using wide bore tips to prevent shearing.

\item \label{yeast} Yeast (and fungal cells in general) have thick cell walls comprised of polysaccharides, lipids and chitin in various proportions. They present a barrier to the access of most enzymes to the nucleus, thus protocols tailored to such cells involve treatment with zymolyase or chitinase enzymes \cite{Schep2015}, with the exact details varying depending on the species studied.

\item \label{ERRORRATE} Nanopore sequencing is a powerful tool for detecting DNA modifications, but discerning modified bases from raw nanopore signal is not yet a fully resolved problem, especially for methylation modifications, which do not provide a huge shift in current signal relative to the unmodified base. Detection of m$^6$A is more challenging that detection of m$^5$C, possibly because a single methyl group changes the overall properties of a purine base to a lesser extent than it does for a pyrimidine. In addition, it should be noted that current implementations of nanopore sequencing do not actually read out a single bases at a time. Instead, they read several bases at a time and the problem of base calling and modification detection is solved not in the small space of bases but in the much larger space of $k$-mers of size 5 or 6. 

Base calling errors are therefore at present an unavoidable part of the reality of dealing with nanopore datasets.

In our experience, the error rate for calling m$^6$A at the level of a single base within a single molecule in the context of SMAC-seq is in the 20--25\% range, while that for m$^5$C is somewhere around 15\%.

However, we expect the performance to improve significantly in the future through a combination of computational and experimental approaches.

\item \label{stranded} Unlike CpG and GpC sequence contexts, which are symmetric, and therefore bases that are to be modified are present at the same position on both strand, m$^6$A provides different information on the forward and reverse strand, as it is not a symmetric sequence context. This is a partial limitation of m$^6$A-SMAC-seq, because different  profiles can be generated from the two strands in some situations. 

\end{enumerate}

\end{changemargin} 

\noindent\makebox[\textwidth]{\rule{\textwidth}{1.5pt}}
\centerline{}
\begin{changemargin}{3.7cm}{0cm} 
\reversemarginpar\marginpar{\section*{Acknowledgements}} 

The authors thank members of the Greenleaf and Kundaje labs for many helpful discussions. This work was supported by NIH grants UM1HG009436 and P50HG007735 (to W.J.G.). WJG is a Chan Zuckerberg investigator. Z.S. is supported by EMBO Long-Term Fellowship EMBO ALTF 1119-2016 and by Human Frontier Science Program Long-Term Fellowship HFSP LT 000835/2017-L. G.K.M. was supported by the Stanford School of Medicine Dean's Fellowship. 

\end{changemargin} 

\begin{thebibliography}{100}

\begin{multicols}{2}
\begin{small}

\input{references}

\end{small}
\end{multicols}

\end{thebibliography}

\end{document}
