# esl-msaweight

The `msaweight` module implements different _ad hoc_ sequence
weighting algorithms for compensating for overrepresentation in a
multiple sequence alignment.

The default weighting scheme in HMMER and Infernal is position-based
(PB) weighting
[[Henikoff & Henikoff, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/7966282).
Easel also provides tree-based GSC weights
[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887)
and filtering-based BLOSUM weights
[[Henikoff & Henikoff, 1992]](https://www.ncbi.nlm.nih.gov/pubmed/1438297).
All three weighting schemes result in similar performance in terms of
sensitivity/specificity of the model parameterizations they
produce. We tend to default to PB weights because they're the most
efficient to compute.


## Position-based weights

The original Henikoff & Henikoff PB scheme was defined for _ungapped_
alignments. Let $x_{ij}$ represent the residue in aligned sequence $i$
and column $j$; let $c_j(a)$ be the observed count of residue $a$ in
column $j$; and let $r_j$ be the number of different residues in
column $j$ (the number of nonzero $c_j(a)$ in column $j$). Then the
weight on sequence $i$ is set to:

\begin{equation}
    w_i = \frac{N}{L} \sum_{j=1}^{L} \frac{1}{r_j c_{j}(x_{ij})}
\end{equation}

The total weight $\sum_i w_i$ is $N$. 

What's this rule trying to do?  If the alignment consisted of only a
single column, PB weighting is trying to assign weights that make all
the weighted frequencies for the observed residues _equal_. For
example, if we observed 10 A's, 4 C's, and 1 T in 15 sequences of
length 1, the A sequences would each get weight 0.5, the C sequences
get weight 1.25, and the T sequence gets weight 5.0.  The weighted
counts would then be 5, 5, and 5, resulting in weighted frequencies of
0.33.  PB weighting is a simple rule in the spirit of a maximum
entropy approach, without the iterative optimization that a true
maximum entropy approach requires
[[Krogh & Mitchison, 1995]](https://www.ncbi.nlm.nih.gov/pubmed/7584440).
When averaged over all columns, sequences that use less abundant
residues in each column (i.e. more dissimilar sequences) will get
higher weights.

In the limit of very deep alignments ($N \rightarrow \infty$),
eventually every residue will be observed in every column ($r_j
\rightarrow 20$) but PB weights still work fine. I mention this
because every few years I wake up with a nightmare that PB weights
break on deep alignments. They don't. 

We need it to work for _gapped_ sequence alignments.  We extend the
Henikoff scheme to gapped sequence alignments by ignoring gaps and
renormalizing by raw (unaligned) sequence lengths.  Let function
$f(a)$ be either the PB weight increment or 0 depending on whether $a$
is a residue or a gap:

\begin{equation}
f(a) = \left\{ \begin{array}{r@{\quad}l}
                 \frac{1}{r_j c_{j}(a)} & \mbox{if } a \mbox{ is a residue} \\
                 0                      & \mbox{if } a \mbox{ is a gap}\\
                \end{array}
       \right.
\end{equation}

Define a sequence weight $w'_i$ as an average of these terms over all
non-gap residues in sequence $i$ (analogous to the ungapped PB
weights):

\begin{equation}
    w'_i = \frac{1}{\ell_i} \sum_{j=1}^{L} f(x_{ij})
\end{equation}

where $\ell_i$ is the raw (unaligned) length of sequence $i$ in
residues. (If $\ell_i = 0$ set $w'_i = 0$; yes, horribly, sequences
with no residues in them do appear in people's alignments.) Unlike in
the ungapped case, the sum of the $w'_i$ isn't a constant, so
finally renormalize so they sum to $N$:

\begin{equation}
   w_i = N \frac{w'_i}{\sum_k w'_k}
\end{equation}

(If pathologically all $w'_i = 0$, set $w_i = 0$ for all $i$.)

(Alternatively, one could use the original PB equation and treat gaps
as symbols. The resulting weights would be different. It is unclear in
principle which scheme should be favored, since we haven't defined a
principled rationale for what "correct" weights would look like for
gapped sequence alignments. I don't think it matters enough to worry
about it.)

Given a multiple sequence alignment of $N$ sequences and $L$ columns
(requiring $O(NL)$ space itself), computation of PB weights requires
$O(NL)$ time and $O(K)$ space for alphabet size $K$ (4 or 20).

## BLOSUM weights

The BLOSUM weighting scheme:
  * clusters sequences by single-linkage clustering at a given %
    identity threshold; 
  * evenly divides a total weight of 1 per cluster across all
    the sequences in the cluster;
  * renormalizes so $\sum_i w_i = N$.

Pairwise identity is calculated by `esl_dst_*PairId()` using the
minimum unaligned length of the two sequences as the denominator:

\begin{equation}
     \mathrm{pid}(ij) = \frac{\mbox{number of identical aligned pairs for } x_i, x_j} 
	                         { \mbox{MIN}( \ell_i, \ell_j ) }
\end{equation}

Single linkage clustering (using the efficient algorithm in
`esl_cluster_SingleLinkage()` requires $O(N)$ space and worst case
$O(LN^2)$ time, but typically requires about $O(LN \log N)$ time.


## GSC tree weights

GSC tree weights
[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887)
require $O(N^2)$ extra space for an all-vs-all pairwise distance
matrix and $O(LN^2 + N^3)$ time for constructing the distance matrix
followed by a UPGMA tree construction.