% Documentation describing writing Tests for Biopython modules
\documentclass{article}
\usepackage{url}
\usepackage{fullpage}
\usepackage{hevea}
\usepackage{graphicx}

% Make links between references
\usepackage{hyperref}
\newif\ifpdf
\ifx\pdfoutput\undefined
  \pdffalse
\else
  \pdfoutput=1
  \pdftrue
\fi
\ifpdf
  \hypersetup{colorlinks=true, hyperindex=true, citecolor=red, urlcolor=blue}
\fi

\begin{document}

\title{Sequence motif analysis using  Biopython}
\author{Bartek Wilczynski (bartek@mimuw.edu.pl)}

\maketitle
\tableofcontents

\section{Short introduction}
\label{sec:intro}
This short tutorial describes some of the functionality of the
\verb|Bio.Motif| package included in Biopython distribution. It is intended
for people who are involved in analysis of sequence motif, so I'll
assume that you are familiar with basic notions of motif analysis. In
case something is unclear, please look into Section \ref{sec:links}
for some relevant links.

It should be also noted, that \verb|Bio.Motif| is based on two
Biopython modules, \verb|Bio.AlignAce| and \verb|Bio.MEME| and is
meant to replace them. It provides (almost) all of the functionality
of these modules and unifies the basic motif object implementation.

Speaking of other libraries, if you are reading this you might be
interested in the TAMO \cite{tamo}
(\url{http://fraenkel.mit.edu/TAMO/}), which is another python library
designed to deal with sequence motifs. It supports more \emph{de-novo}
motif finders, but it is not a part of biopython (so requires a bit of
work) and has some restrictions on commercial use.

\section{Motif objects}
\label{sec:object}
Since we are interested in motif analysis, we need to take a look at
\verb|Motif| objects in the first place. For that we need to import 
the Motif library:
\begin{verbatim}
from  Bio import Motif
\end{verbatim}
and we can start creating our first motif objects. Let's create a DNA motif:
\begin{verbatim}
from Bio.Alphabet import IUPAC
m=Motif.Motif(alphabet=IUPAC.unambiguous_dna)
\end{verbatim}
This is for now just an empty container, so let's add some sequences to our newly created motif:
\begin{verbatim}
from Bio.Seq import Seq
m.add_instance(Seq("TATAA",m.alphabet))
m.add_instance(Seq("TATTA",m.alphabet))
m.add_instance(Seq("TATAA",m.alphabet))
m.add_instance(Seq("TATAA",m.alphabet))
\end{verbatim}
Now we have a full \verb|Motif| instance, so we can try to get some
basic information about it. Let's start with length and consensus
sequence:
\begin{verbatim}
>>> m.length
5
>>> m.consensus()
Seq('TATAA', IUPACUnambiguousDNA())
\end{verbatim}
In case of DNA motifs, we can also get a reverse complement of a motif:
\begin{verbatim}
>>> m.reverse_complement()
<Bio.Motif.Motif.Motif object at 0xb39890>
>>> m.reverse_complement().consensus()
Seq('TTATA', IUPACUnambiguousDNA())
>>> m.reverse_complement().instances  
[Seq('TTATA', IUPACUnambiguousDNA()), 
 Seq('TAATA', IUPACUnambiguousDNA()), 
 Seq('TTATA', IUPACUnambiguousDNA()), 
 Seq('TTATA', IUPACUnambiguousDNA())]
\end{verbatim}

We can also calculate the information content of a motif with a simple call:
\begin{verbatim}
>>> m.ic()
5.2735010263278932
\end{verbatim}
This gives us a number of bits of information provided by the motif,
which tells us how much we are different from background.

The most common representation of a motif is a PWM (Position Weight
Matrix). It summarizes the probabilities of finding any symbol (in
this case nucleotide) in any position of a motif. It can be computed by calling the \verb|.pwm()| method:
\begin{verbatim}
>>> m.pwm()
[{'A': 0.05, 'C': 0.05, 'T': 0.85, 'G': 0.05}, 
 {'A': 0.85, 'C': 0.05, 'T': 0.05, 'G': 0.05}, 
 {'A': 0.05, 'C': 0.05, 'T': 0.85, 'G': 0.05}, 
 {'A': 0.65, 'C': 0.05, 'T': 0.25, 'G': 0.05}, 
 {'A': 0.85, 'C': 0.05, 'T': 0.05, 'G': 0.05}]
\end{verbatim}
The probabilities in the motif's PWM are based on the counts in the
instances, but we can see, that even though there were no Gs and no Cs
in the instances, we still have non-zero probabilities assigned to
them. These come from pseudo-counts which are, roughly speaking, a
commonly used way to acknowledge the incompleteness of our knowledge
and avoid technical problems with calculating logarithms of $0$.

We can control the way that pseudo-counts are added with two
properties of Motif objects \verb|.background| is the probability
distribution over all symbols in the alphabet that we assume is found
in background (usually based on the GC content of the respective
genome). It is by default set to a uniform distribution upon creation of a motif:
\begin{verbatim}
>>> m.background  
{'A': 0.25, 'C': 0.25, 'T': 0.25, 'G': 0.25}
\end{verbatim}
The other parameter is \verb|.beta|, which states the amount of
pseudo-counts we should add to the PWM. By default it is set to $1.0$,
\begin{verbatim}
>>> m.beta
1.0
\end{verbatim}
so that the total input of pseudo-counts is equal to that of one instance. 

Using the background distribution and pwm with pseudo-counts added,
it's easy to compute the log-odds ratios, telling us what are the log
odds of a particular symbol to be coming from a motif against the
background. We can use the \verb|.log_odds()| method:

\begin{verbatim}
 >>> m.log_odds() 
[{'A': -2.3219280948873622, 
  'C': -2.3219280948873622, 
  'T': 1.7655347463629771, 
  'G': -2.3219280948873622}, 
 {'A': 1.7655347463629771, 
  'C': -2.3219280948873622, 
  'T': -2.3219280948873622, 
  'G': -2.3219280948873622}, 
 {'A': -2.3219280948873622, 
  'C': -2.3219280948873622, 
  'T': 1.7655347463629771, 
  'G': -2.3219280948873622}, 
 {'A': 1.3785116232537298, 
  'C': -2.3219280948873622, 
  'T': 0.0, 
  'G': -2.3219280948873622}, 
 {'A': 1.7655347463629771, 
  'C': -2.3219280948873622, 
  'T': -2.3219280948873622, 
  'G': -2.3219280948873622}
]
\end{verbatim}
Here we can see positive values for symbols more frequent in the motif
than in the background and negative for symbols more frequent in the
background. $0.0$ means that it's equally likely to see a symbol in
background and in the motif (e.g. 'T' in the second-last position).

\subsection{Reading and writing}
\label{sec:io}

Creating motifs from instances by hand is useful but boring, so it's
useful to have some I/O functions for reading and writing
motifs. There are no really well established standards for storing
motifs, but there's a couple of formats which are more used than
others. The most important distinction is whether the motif
representation is based on instances or on some version of PWM matrix.
On of the most popular motif databases JASPAR
(\url{http://jaspar.genereg.net}) stores motifs in both formats, so
let's look at how we can import JASPAR motifs from instances:
\begin{verbatim}
arnt=Motif.read(open("Arnt.sites"),"jaspar-sites")
\end{verbatim}
and from a count matrix:
\begin{verbatim}
srf=Motif.read(open("SRF.pfm"),"jaspar-pfm")
\end{verbatim}

The \verb|arnt| and \verb|srf| motifs can both do the same things for
us, but they use different internal representations of the motif. We
can tell that by inspecting the \verb|has_counts| and
\verb|has_instances| properties:
\begin{verbatim}
>>> arnt.has_instances
True
>>> srf.has_instances
False
>>> srf.has_counts
True
>>> srf.counts
{'A': [2, 9, 0, 1, 32, 3, 46, 1, 43, 15, 2, 2],
 'C': [1, 33, 45, 45, 1, 1, 0, 0, 0, 1, 0, 1],
 'G': [39, 2, 1, 0, 0, 0, 0, 0, 0, 0, 44, 43],
 'T': [4, 2, 0, 0, 13, 42, 0, 45, 3, 30, 0, 0]}
\end{verbatim}

There are conversion functions, which can help us convert between
different representations:
\begin{verbatim}
>>> arnt.make_counts_from_instances()
{'A': [8, 38, 0, 0, 0, 0],
 'C': [32, 0, 40, 0, 0, 0],
 'G': [0, 2, 0, 40, 0, 40],
 'T': [0, 0, 0, 0, 40, 0]}

>>> srf.make_instances_from_counts()
[Seq('GGGAAAAAAAGG', IUPACUnambiguousDNA()),
 Seq('GGCCAAATAAGG', IUPACUnambiguousDNA()),
 Seq('GACCAAATAAGG', IUPACUnambiguousDNA()),
....
\end{verbatim}
The important thing to remember here is that the method
\verb|make_instances_from_counts()| creates fake instances, because
usually there are very many possible sets of instances which give rise
to the same pwm, and if we have only the count matrix, we cannot
reconstruct the original one. This does not make any difference if we
are using the PWM as the representation of the motif, but one should
be careful with exporting instances from count-based motifs.

Speaking of exporting, let's look at export functions. We can export to fasta:
\begin{verbatim}
>>> print m.format("fasta")
> instance 0
TATAA
> instance 1
TATTA
> instance 2
TATAA
> instance 3
TATAA
\end{verbatim}
or to TRANSFAC-like matrix format (used by some motif processing software)
\begin{verbatim}
>>> print m.format("transfac")
XX
TY Motif
ID 
BF undef
P0 G A T C
01 0 0 4 0
02 0 4 0 0
03 0 0 4 0
04 0 3 1 0
05 0 4 0 0
XX
\end{verbatim}

Finally, if we have internet access, we can create a weblogo using a nice service at \url{http://weblogo.berkeley.edu} by Crooks et al. \cite{crooks2004}:
\begin{verbatim}
>>> arnt.weblogo("Arnt.png")
\end{verbatim}
We should get our logo saved as a png in the specified file.

\subsection{Searching for instances}
\label{sec:search}

The most frequent use for a motif is to find its instances in some
sequence. For the sake of this section, we will use an artificial sequence like this:

\begin{verbatim}
test_seq=Seq("TATGATGTAGTATAATATAATTATAA",m.alphabet)
\end{verbatim}

The simplest way to find instances, is to look for exact matches of
the true instances of the motif:
\begin{verbatim}
>>> for pos,seq in m.search_instances(test_seq):
...     print pos,seq.tostring()
... 
10 TATAA
15 TATAA
21 TATAA
\end{verbatim}
We can do the same with the reverse complement (to find instances on the complementary strand):
\begin{verbatim}
>>> for pos,seq in m.reverse_complement().search_instances(test_seq):
...     print pos,seq.tostring()
... 
12 TAATA
20 TTATA
\end{verbatim}

It's just as easy to look for positions, giving rise to high log-odds scores against our motif:
\begin{verbatim}
>>> for pos,score in m.search_pwm(test_seq,threshold=5.0):
...     print pos,score
... 
10 8.44065060871
-12 7.06213898545
15 8.44065060871
-20 8.44065060871
21 8.44065060871
\end{verbatim}
You may notice the threshold parameter, here set arbitrarily to
$5.0$. This is in $log_2$, so we are now looking only for words, which
are 32 times more likely to occur under the motif model than in the
background. The default threshold is $0.0$, which selects everything
that looks more like the motif, than the background.

If you want to use a less arbitrary way of selecting thresholds, you
can explore the \verb|Motif.score_distribution| class implementing an
distribution of scores for a given motif. Since the space for a score
distribution grows exponentially with motif.length, we are using an
approximation with a given precision to keep computation cost manageable:
\begin{verbatim}
>>> sd = Motif.score_distribution(m,precision=10**4)
\end{verbatim}
The sd object can be used to determine a number of different thresholds.

We can specify the requested false-positive rate (probability of ``finding'' a motif instance in background generated sequence):
\begin{verbatim}
>>> sd.threshold_fpr(0.01)
4.3535838726139886
\end{verbatim}

or the false-negative rate (probability of ``not finding'' an instance generated from the motif):
\begin{verbatim}
>>> sd.threshold_fnr(0.1)
0.26651713652234044
\end{verbatim}

or a threshold (approximately) satisfying some relation between fpr
and fnr ($\frac{fnr}{fpr}$, as proposed by Rahmann \cite{Rahmann2003}):
\begin{verbatim}
>>> sd.threshold_balanced(1000)
8.4406506087056368
\end{verbatim}

or a threshold satisfying (roughly) the equality between the
false-positive rate and the $-log$ of the information content (as used
in patser software by Hertz and Stormo).

For example, in case of our motif, you can get the threshold giving
you exactly the same results (for this sequence) as searching for
instances with balanced threshold with rate of $1000$.
\begin{verbatim}
>>> for pos,score in m.search_pwm(test_seq,threshold=sd.threshold_balanced(1000)):
...     print pos,score
... 
10 8.44065060871
15 8.44065060871
-20 8.44065060871
21 8.44065060871
\end{verbatim}

\subsection{Comparing motifs}
\label{sec:comp}
Once we have more than one motif, we might want to compare them. For
that, we have currently three different methods of \verb|Bio.Motif|
objects.

Before we start comparing motifs, I should point out that motif
boundaries are usually quite arbitrary. This means, that we often need
to compare motifs of different lengths, so comparison needs to involve
some kind of alignment.  This means, that we have to take into account two things:
\begin{itemize}
\item alignment of motifs
\item some function to compare aligned motifs
\end{itemize}
In \verb|Bio.Motif| we have 3 different functions for motif
comparison, which are based on the same idea behind motif alignment,
but use different functions to compare aligned motifs. Briefly
speaking, we are using ungapped alignment of PWMs and substitute the
missing columns at the beginning and end of the matrices with
background distribution. All three comparison functions are written in
such a way, that they can be interpreted as distance measures, however
only one (\verb|dist_dpq|) satisfies the triangle inequality. All of
them return the minimal distance and the corresponding offset between
motifs.

To show how these functions work, let us first load another motif,
which is similar to our test motif \verb|m|:
\begin{verbatim}
>>> ubx=Motif.read(open("Ubx.pfm"),"jaspar-pfm")
<Bio.Motif.Motif.Motif object at 0xc29b90>
>>> ubx.consensus()
Seq('TAAT', IUPACUnambiguousDNA())
\end{verbatim}

The first function we'll use to compare these motifs is based on
Pearson correlation. Since we want it to resemble a distance
measure. we actually take $1-r$, where r is the Pearson correlation
coefficient (PCC):
\begin{verbatim}
>>> m.dist_pearson(ubx)
(0.41740393308237722, 2)
\end{verbatim}
This means, that the best PCC between motif \verb|m| and  \verb|Ubx| is obtained with the following alignment:
\begin{verbatim}
bbTAAT
TATAAb
\end{verbatim}
where \verb|b| stands for background distribution. The PCC itself is
roughly $1-0.42=0.58$. If we try the reverse complement of the Ubx motif:

\begin{verbatim}
>>> m.dist_pearson(ubx.reverse_complement())
(0.25784180151584823, 1)
\end{verbatim}
We can see that the PCC is better (almost $0.75$), and the alignment is also different:
\begin{verbatim}
bATTA
TATAA
\end{verbatim}

There are two other functions \verb|dist_dpq|, which is a true metric based on the Kullback-Leibler divergence and proposed by Endres and Sendelin \cite{Endres2003}
\begin{verbatim}
>>> m.dist_dpq(ubx.reverse_complement())
(0.49292358382899853, 1)
\end{verbatim}

In case you need yet another way of comparing motifs, you can use the
\verb|dist_product| method, which is based on the product of
probabilities which can be interpreted as the probability of
independently generating the same instance by both motifs.

\begin{verbatim}
>>> m.dist_product(ubx.reverse_complement())
(0.16224587301064275, 1)
\end{verbatim}

\section{\emph{De novo} motif finding}
\label{sec:find}

Currently, biopython has only limited support for \emph{de novo} motif
finding. Namely, we support running and parsing of AlignAce \cite{Hughes2000} and
MEME\cite{Bailey1994}. Since the number of motif finding tools is growing rapidly, 
contributions of new parsers are welcome. 

\subsection{MEME}
\label{sec:meme}

Let's assume, you have run MEME on sequences of your choice with your
favorite parameters and saved the output in the file
\verb|meme.out|. You can retrieve the motifs reported by MEME by
running the following piece of code:

\begin{verbatim}
>>> motifsM = list(Motif.parse(open("meme.out"),"MEME"))
>>> motifsM
[<Bio.Motif.MEMEMotif.MEMEMotif object at 0xc356b0>]
\end{verbatim}

Besides the most wanted list of motifs, the result object contains more useful information, accessible through properties with self-explanatory names:
\begin{itemize}
\item \verb|.alphabet|
\item \verb|.datafile|
\item \verb|.sequence_names|
\item \verb|.version|
\item \verb|.command|
\end{itemize}

The motifs returned by MEMEParser can be treated exactly like regular
Motif objects (with instances), they also provide some extra
functionality, by adding additional information about the instances. 

\begin{verbatim}
>>> motifsM[0].consensus()
Seq('CTCAATCGTA', IUPACUnambiguousDNA())

>>> motifsM[0].instances[0].pvalue
8.71e-07
>>> motifsM[0].instances[0].sequence_name
'SEQ10;'
>>> motifsM[0].instances[0].start
3
>>> motifsM[0].instances[0].strand
'+'
\end{verbatim}


\subsection{AlignAce}
\label{sec:alignace}

We can do very similar things with AlignACE program. Assume, you have
your output in the file \verb|alignace.out|. You can parse your output
with the following code:

\begin{verbatim}
>>> motifsA=list(Motif.parse(open("alignace.out"),"AlignAce"))
\end{verbatim}

Again, your motifs behave as they should:
\begin{verbatim}
>>> motifsA[0].consensus()
Seq('TCTACGATTGAG', IUPACUnambiguousDNA())
\end{verbatim}

In fact you can even see, that AlignAce found a very similar motif as
MEME, it is just a longer version of a reverse complement of MEME
motif:
\begin{verbatim}
>>> motifsM[0].reverse_complement().consensus()
Seq('TACGATTGAG', IUPACUnambiguousDNA())
\end{verbatim}

If you have AlignAce installed on the same machine, you can also run
it directly from Biopython. Short example of how this can be done is
shown below (other parameters can be specified as keyword parameters):

\begin{verbatim}
>>> command="/opt/bin/AlignACE"
>>> input_file="test.fa"
>>> result=Motif.AlignAce(input_file,cmd=command,gcback=0.6,numcols=10)
>>> result
(<Bio.Application.ApplicationResult instance at 0xf49d78>,
 <Bio.File.UndoHandle instance at 0xf49d00>,
 <Bio.File.UndoHandle instance at 0xf49dc8>)
\end{verbatim}

Since AlignAce prints all its output to standard output, you can get
to your motifs by parsing the second member of the result:
\begin{verbatim}
motifs=list(Motif.parse(result[1],"AlignAce"))
\end{verbatim}


\section{Useful links }
\label{sec:links}

Wikipedia definitions:

\begin{itemize}
\item \url{http://en.wikipedia.org/wiki/Sequence_motif}
\item \url{http://en.wikipedia.org/wiki/Position_weight_matrix}
\item \url{http://en.wikipedia.org/wiki/Consensus_sequence}
\end{itemize}


Motif finding and comparison methods:

\begin{itemize}
\item AlignAce and CompareAce\url{http://www.psc.edu/general/software/packages/alignace/}
\item MEME and MAST \url{http://meme.sdsc.edu/meme/}
\item BioProspector \url{http://bioprospector.stanford.edu/}
\item MDScan \url{http://ai.stanford.edu/~xsliu/MDscan/}
\item Weeder
\item STAMP \url{http://www.benoslab.pitt.edu/stamp/}
\item WebMotifs \url{http://fraenkel.mit.edu/webmotifs/}
\item Comparison of different programs \url{http://bio.cs.washington.edu/assessment/}
\end{itemize}

\begin{thebibliography}{10}

\bibitem{tamo}
Gordon DB, Nekludova L, McCallum S, Fraenkel E: \textbf{TAMO: a flexible,
  object-oriented framework for analyzing transcriptional regulation using
  DNA-sequence motifs.} \emph{Bioinformatics} 2005, \textbf{21}(14):3164--5.

\bibitem{crooks2004}
Crooks GE, Hon G, Chandonia JM, steven E~Brenner: \textbf{WebLogo: A Sequence
  Logo Generator}. \emph{Genome Research} 2004, \textbf{14}:1188--1190.

\bibitem{Rahmann2003}
Rahmann S, Mueller T, Vingron M: \textbf{On the power of profiles for
  transcription factor binding site detection.} \emph{Stat Appl Genet Mol Biol}
  2003, \textbf{2}:Article7.

\bibitem{Endres2003}
Endres D, Schindelin J: \textbf{A new metric for probability distributions}.
  \emph{IEEE transactions on Information Theory} 2003,
  \textbf{49}(7):1858--1860.

\bibitem{Bailey1994}
Bailey TL, Elkan C: \textbf{{Fitting a mixture model by expectation
  maximization to discover motifs in biopolymers}}. \emph{Proc Int Conf Intell
  Syst Mol Biol} 1994, \textbf{2}:28--36.

\bibitem{Hughes2000}
Hughes JD, Estep PW, Tavazoie S, Church GM: \textbf{{Computational
  identification of cis-regulatory elements associated with groups of
  functionally related genes in Saccharomyces cerevisiae}}. \emph{J Mol Biol}
  2000, \textbf{296}(5):1205--1214.

\end{thebibliography}

\end{document}