Tutorial A1: Sequence Manipulation Being able to manipulate sequences is central to analyzing what machine learning models have learned because it allows you to easily ask hypotheticals. Many common downstream analyses, such as in silico marginalizations and feature attributions, rely on sequence manipulations to handle motif insertion or construct background sequences, respectively. But these forms of manipulation are also invaluable tools for dissecting individual regions and doing sequence design. For instance, given a locus of interest, one can shuffle regions (such as known motifs) and see what impact the shuffling has on the prediction. Alternatively, one can iteratively insert motifs into a sequence to see whether a predictive model yields a desired result. Given the foundational nature of these operations, they are central to tangermeme and are implemented in tangermeme.ersatz. Why ersatz? The word “ersatz” refers to an artificial substitute or imitation. Accordingly, “ersatz sequences” in this context are those that are not natural sequences – either because they are alterations of a native sequence, or they are fully synthetic. Substitutions The simplest operation is the substitution, where some positions are switched out for others. Consider the following, where an AP-1 motif (https://jaspar.elixir.no/matrix/MA1144.1/) is inserted into uniformly random nucleotides. Colloquially, “substitutions” are sometimes called “insertions”, and while there are similarities between the two they are not exactly the same thing. Let’s begin by randomly generating one sequence. Following the PyTorch format, the one-hot encoded sequences should have the shape (batch_size, alphabet_size, sequence_length). Generation can be made deterministic using the random_state parameter. As a note, all of the operations in tangermeme.ersatz happen on one-hot encoded tensors but, for simplicity of seeing the effect of each operation in this tutorial, we will convert these tensors back to characters. 
from tangermeme.utils import random_one_hot from tangermeme.utils import characters X = random_one_hot((1, 4, 20), random_state=0) characters(X[0]) 'ATCATTTTCTCGATGAAAGC' Now, let’s put the motif in. We can either pass in a string, in which case the same motif is added to each sequence in the batch at the same position, or we can pass in a one-hot encoding. If the one-hot encoding has shape (1, alphabet_size, motif_size), the same motif is added to each sequence just as if a string were used, but if the one-hot encoding has the shape (batch_size, alphabet_size, motif_size) then the i-th sequence in the batch has the i-th one-hot encoding substituted in. from tangermeme.ersatz import substitute X_ap1 = substitute(X, "TGACTCA") characters(X_ap1[0]) 'ATCATTTTGACTCAGAAAGC' By default, the substitution will happen in the middle of the sequence. If you’d like to control where it happens, you can pass in a parameter start with the index at which to start the substitution. X_ap1 = substitute(X, "TGACTCA", start=2) characters(X_ap1[0]) 'ATTGACTCATCGATGAAAGC' If we have a one-hot encoding rather than a string sequence, we can pass that in instead without needing to first convert it back to characters. from tangermeme.utils import one_hot_encode motif = one_hot_encode("TGACTCA").unsqueeze(0) X_ap1 = substitute(X, motif) characters(X_ap1[0]) 'ATCATTTTGACTCAGAAAGC' Multisubstitutions Sometimes one would like to make multiple substitutions in the same sequence given some spacing between substitutions. Although this could be achieved by calling substitute multiple times, tangermeme provides a convenient wrapper for this with the multisubstitute function. This function has a very similar signature to substitute except that it takes in a list of motifs and the spacing between them. Let’s try it out first with no spacing between the two motifs. 
from tangermeme.ersatz import multisubstitute X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 0, start=2) characters(X_ap12[0]) 'ATTGACTCATGACTCAAAGC' Now, let’s add a little bit of spacing. X_ap12 = multisubstitute(X, ["TGACTCA", "TGACTCA"], 2, start=2) characters(X_ap12[0]) 'ATTGACTCATCTGACTCAGC' Finally, if we have more than two motifs, we can optionally provide a separate spacing value between each pair of adjacent motifs. Note that if we keep the spacing value as an integer but provide more than two motifs, the same spacing is used between all motif pairs. X_ap12 = multisubstitute(X, ["TGA", "TCA", "TGA", "TGAC"], [0, 2, 1], start=2) characters(X_ap12[0]) 'ATTGATCACTTGATTGACGC' Insertions Related to substitutions are insertions. As mentioned before, sometimes people say “insertions” when what they mean is “substitutions”, but insertions involve adding the new sequence without modifying or deleting any of the existing sequence. Essentially, the returned sequence will be longer than the original sequence because it now additionally contains the sequence being added. In contrast, substitutions preserve the length of the sequence because characters are explicitly being changed from those in the original sequence to those in the new sequence. Let’s see that in action. from tangermeme.ersatz import insert X_ap1 = insert(X, "TGACTCA") characters(X_ap1[0]) 'ATCATTTTCTTGACTCACGATGAAAGC' X.shape[-1], len("TGACTCA"), len(characters(X_ap1[0])) (20, 7, 27) Deletion In direct contrast to insertions are deletions: insertions involve adding new sequence to an existing sequence, while deletions remove existing sequence without replacing it with anything. Accordingly, one only passes in sequences and the coordinates to delete instead of passing in any form of new sequence. from tangermeme.ersatz import delete X_del = delete(X, start=0, end=5) characters(X[0]), characters(X_del[0]) ('ATCATTTTCTCGATGAAAGC', 'TTTCTCGATGAAAGC') Next, let’s try deleting a portion from the middle. 
X_del = delete(X, start=10, end=15) characters(X_del[0]) 'ATCATTTTCTAAAGC' X.shape[-1], X_del.shape[-1] (20, 15) Randomize Deleting positions is one way that we can remove information from a sequence. However, this can pose some issues – both practically, in terms of needing a sequence of the same length when using machine learning models, and conceptually, in that removing a motif entirely isn’t really a biologically plausible alternative to observed sequence. An alternative approach to deleting positions is to replace those positions with randomly generated characters. This would keep the sequence the same length but remove the motif. Here, we replace the first five positions with randomly generated characters and keep the remaining characters the same. from tangermeme.ersatz import randomize X_rand = randomize(X, start=0, end=5, random_state=0) characters(X[0]), characters(X_rand[0, 0]) ('ATCATTTTCTCGATGAAAGC', 'GGGGCTTTCTCGATGAAAGC') Frequently, when using randomly generated sequences, one wishes to generate many randomizations so that one can average over the randomness induced by the sequences. To make this easy, tangermeme allows you to pass in a parameter n specifying the number of randomizations to perform, and returns a tensor with one more dimension than the original tensor whenever randomness is involved. Specifically, the returned tensor will have shape (n_orig_sequences, n_shuffles, alphabet_size, seq_len) so that you can shuffle each of many sequences many times. X_rand = randomize(X, start=0, end=5, n=10, random_state=0) for i in range(10): print(characters(X_rand[0, i])) GGGGCTTTCTCGATGAAAGC GCTTCTTTCTCGATGAAAGC TGGTATTTCTCGATGAAAGC AATTTTTTCTCGATGAAAGC TTCTATTTCTCGATGAAAGC GATGCTTTCTCGATGAAAGC CTCGATTTCTCGATGAAAGC GGGTGTTTCTCGATGAAAGC CCGAGTTTCTCGATGAAAGC GAACCTTTCTCGATGAAAGC Shuffle A problem with independently generating sequence at each position is that the sampled sequences might have unrealistic compositions. 
For instance, when you use uniformly randomly generated sequences, the GC content is fairly high compared to naturally occurring sequences. When trying to create backgrounds for specific loci, you might prefer to instead shuffle the positions to ensure that the composition of these backgrounds is the same, while any motifs present are disrupted. We can use the shuffle function to completely shuffle a batch of sequences. Each returned sequence will have the same composition as the original sequence that was shuffled, but a different ordering of the elements. In the genomics setting, this means that the same number of each nucleotide will be present but motifs will likely be removed. from tangermeme.ersatz import shuffle X_shuf = shuffle(X, random_state=0) characters(X[0]), characters(X_shuf[0, 0]) ('ATCATTTTCTCGATGAAAGC', 'GTCCCATTTCTGTTAGAAAA') Similar to the randomize function, if we want to shuffle a sequence many times, we can use the parameter n. X_shuf = shuffle(X, n=10, random_state=0) for i in range(10): print(characters(X_shuf[0, i])) GTCCCATTTCTGTTAGAAAA GTGACACACAATACTTTGTT ATATGCCTAATCAGTTATGC GATCAATAGAGCTATCCTTT TTCCTGCAAATTGTCTAAGA ATCCATTCGGGACTTTATAA GGTTTACTAACGCCAATTAT TGTAAGTACCCTATTCTGAA TTTTTATACAGCAGAGACTC CAAACTTTCGCTTAGTATAG Furthermore, we can restrict our shuffling to only a portion of the sequence. This can be valuable if you want to knock out a portion of the sequence, such as a known motif or broader regulatory region. All you need to do is specify the start (inclusive) and end (not inclusive) positions between which the shuffling should occur. X_shuf = shuffle(X, start=5, end=15, random_state=3) characters(X[0]), characters(X_shuf[0, 0]) ('ATCATTTTCTCGATGAAAGC', 'ATCATCTTTGGATCTAAAGC') Dinucleotide Shuffle In the genomics setting, the CG dinucleotide plays an outsized role compared to other dinucleotides and so is significantly underrepresented in the genome. 
Because normal shuffling will disrupt dinucleotide content, and hence change the proportion of CGs in the sequence, sometimes one wants to use a shuffling strategy that explicitly preserves dinucleotide content. In tangermeme, the dinucleotide_shuffle function operates similarly to the shuffle function. For instance, we can shuffle entire sequences: from tangermeme.ersatz import dinucleotide_shuffle X_shuf = dinucleotide_shuffle(X, random_state=0) characters(X[0]), characters(X_shuf[0, 0]) ('ATCATTTTCTCGATGAAAGC', 'ATCATTTCTCGATTGAAAGC') We can generate many shuffles using the n parameter. X_shuf = dinucleotide_shuffle(X, n=10, random_state=0) for i in range(10): print(characters(X_shuf[0, i])) ATCATTTCTCGATTGAAAGC ATCTCAAATTTTCGATGAGC ATTTTCTCAATCGAATGAGC AAATTTTCTCATCGATGAGC ATCTTCATCGATTTGAAAGC ATCAATTCTTTCGAATGAGC AATTCTCAATTCGATTGAGC AAATCTCATTTCGATTGAGC ATCATTTCTTCGAAATGAGC ATCTTTCAATTCGAATGAGC And we can shuffle only a portion of the sequence. X_shuf = dinucleotide_shuffle(X, start=5, end=15, random_state=3) characters(X[0]), characters(X_shuf[0, 0]) ('ATCATTTTCTCGATGAAAGC', 'ATCATTCTTTCGATGAAAGC') As a note, the implemented strategy for dinucleotide shuffling will always keep the first and last nucleotides the same. Depending on the sequence composition and the length of the region being shuffled, it can be impossible to produce new dinucleotide shuffled sequences. Passing in verbose=True will print a warning when at least one position (other than the first and last positions) is always the same character. Regardless of the value of verbose, an error will be raised when all returned sequences are identical. X_shuf = dinucleotide_shuffle(X, start=10, end=15, random_state=0, n=100, verbose=True) characters(X[0]), characters(X_shuf[0, 0]) Warning: At least one position in dinucleotide shuffle is identical across all positions. 
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[21], line 1 ----> 1 X_shuf = dinucleotide_shuffle(X, start=10, end=15, random_state=0, n=100, verbose=True) 2 characters(X[0]), characters(X_shuf[0, 0]) File /shared/aemurphy/tangermeme/tangermeme/ersatz.py:612, in dinucleotide_shuffle(X, start, end, n, random_state, verbose) 610 X_shufs = [] 611 for i in range(X.shape[0]): --> 612 insert_ = _dinucleotide_shuffle(X[i, :, start:end], n_shuffles=n, 613 random_state=random_state+i, verbose=verbose) 615 X_shuf = torch.clone(X[i:i+1]).repeat(n, 1, 1) 616 X_shuf[:, :, start:end] = insert_ File /shared/aemurphy/tangermeme/tangermeme/ersatz.py:552, in _dinucleotide_shuffle(X, n_shuffles, random_state, verbose) 549 print("Warning: At least one position in dinucleotide shuffle " + 550 "is identical across all positions.") 551 if conserved.max(dim=0).values.min() == n_shuffles and n_shuffles > 1: --> 552 raise ValueError("All dinucleotide shuffles yield identical " + 553 "sequences, potentially due to a lack of diversity in sequence.") 555 return shuffled_sequences ValueError: All dinucleotide shuffles yield identical sequences, potentially due to a lack of diversity in sequence. Reverse Complement A common approach to training/testing genomic deep learning models is to use the reverse complement of a DNA sequence. 
In tangermeme this can be done with character representations of DNA or with one-hot encodings using reverse_complement. from tangermeme.ersatz import reverse_complement reverse_complement("ATCG",ohe=False) 'CGAT' In the same manner as one_hot_encode, reverse_complement is designed to be used on one sequence (Tensor shape (alphabet length, sequence length)). motif = one_hot_encode("TGACTCA") print(motif.shape) rev_comp_motif = reverse_complement(motif) print(rev_comp_motif.shape) characters(rev_comp_motif) torch.Size([4, 7]) torch.Size([4, 7]) 'TGAGTCA' This function can also handle missing values in DNA sequences, represented by ‘N’ or by [0,0,0,0]: reverse_complement("ATCGN",ohe=False) 'NCGAT' motif = one_hot_encode("TGACTCAN") characters(reverse_complement(motif),allow_N=True) 'NTGAGTCA' --- import os os.environ['CUDA_VISIBLE_DEVICES'] = '0' Tutorial A2: Predictions When one has a machine learning model, a natural next step is to make predictions with that model. Accordingly, tangermeme implements general-purpose functions for making predictions from PyTorch models in a memory-efficient manner, regardless of the number of inputs or outputs from the model. These functions can be used by themselves but their primary purpose is as building blocks for more complicated analysis functions. Although making predictions using a model is conceptually simple to understand, there are several technical issues with doing so efficiently in practice. First, when your data is too big to fit in GPU memory you cannot simply move it all over at once and so you must make predictions in a batched manner. Here, a small number of examples are moved from the CPU to the GPU, predictions are made, and the results are moved back to the CPU. The batch size can be tuned to the largest number of examples that fit in GPU memory. Second, when a model has multiple outputs, making predictions in a batched manner yields a set of tensors for each batch. 
These tensors must be correctly concatenated across batches to make sure that the final output from the function matches the shape of the data as if all examples were run through the model at the same time. Third, some models have multiple inputs and so this function must be able to handle an optional set of additional arguments. Predict The simplest function that implements these ideas is predict, which takes in a model, data, and optional additional arguments, and makes batched predictions on the data given the model. The function can be run on GPUs, CPUs, or any other devices that work with PyTorch. To demonstrate it, let’s use a model that takes its inputs and flattens them before feeding them into a dense layer to make three predictions per example. The forward function takes in two optional arguments: alpha, which gets added to the predictions, and beta, which multiplies the predictions (but not alpha). By default, these are set such that the predictions are returned without modification. import torch class FlattenDense(torch.nn.Module): def __init__(self, length=10): super(FlattenDense, self).__init__() self.dense = torch.nn.Linear(length*4, 3) def forward(self, X, alpha=0, beta=1): X = X.reshape(X.shape[0], -1) return self.dense(X) * beta + alpha Now let’s generate some random sequence and see what the model output looks like for it. We need to use torch.manual_seed because of the random initializations used in the torch.nn.Linear layer. As a side note, even though we are not training the model here, the usage doesn’t change based on whether the model is randomly initialized or trained. 
from tangermeme.utils import random_one_hot torch.manual_seed(0) X = random_one_hot((5, 4, 10), random_state=0).float() model = FlattenDense() y = model(X) y tensor([[-0.3154, -0.1625, -0.3183], [-0.0866, 0.5461, -0.0244], [ 0.3089, -0.2828, -0.1485], [ 0.1671, -0.1341, -0.3094], [-0.0627, 0.0088, 0.3471]], grad_fn=) This is simple enough to do for a simple model on a small amount of data. Let’s try using the built-in predict function with different batch sizes, to demonstrate how one would do batched predictions. from tangermeme.predict import predict y0 = predict(model, X, batch_size=2) y0 tensor([[-0.3154, -0.1625, -0.3183], [-0.0866, 0.5461, -0.0244], [ 0.3089, -0.2828, -0.1485], [ 0.1671, -0.1341, -0.3094], [-0.0627, 0.0088, 0.3471]]) y0 = predict(model, X, batch_size=100) y0 tensor([[-0.3154, -0.1625, -0.3183], [-0.0866, 0.5461, -0.0244], [ 0.3089, -0.2828, -0.1485], [ 0.1671, -0.1341, -0.3094], [-0.0627, 0.0088, 0.3471]]) Note that the tensor no longer has a grad_fn, meaning that gradients were not calculated or stored. Specifically, the prediction loop is wrapped in torch.no_grad. By default, this function will move each batch to the GPU. However, it doesn’t have to. You can pass device='cpu' to have the predictions be made on the CPU. y0 = predict(model, X, device='cpu') y0 tensor([[-0.3154, -0.1625, -0.3183], [-0.0866, 0.5461, -0.0244], [ 0.3089, -0.2828, -0.1485], [ 0.1671, -0.1341, -0.3094], [-0.0627, 0.0088, 0.3471]]) Next, let’s consider the setting where you want to pass additional arguments into the forward function because the model is multi-input. Remember that our model can optionally take in alpha and beta parameters. All we have to do is pass in a tuple of args to the predict function where each element in args is a tensor containing values for one of the inputs to the model. Let’s start off by looking at just passing in alpha to the model. 
torch.manual_seed(0) alpha = torch.randn(5, 1) y + alpha tensor([[ 1.2256, 1.3785, 1.2227], [-0.3800, 0.2527, -0.3178], [-1.8699, -2.4616, -2.3273], [ 0.7355, 0.4344, 0.2591], [-1.1472, -1.0757, -0.7374]], grad_fn=) predict(model, X, args=(alpha,)) tensor([[ 1.2256, 1.3785, 1.2227], [-0.3800, 0.2527, -0.3178], [-1.8699, -2.4616, -2.3273], [ 0.7355, 0.4344, 0.2591], [-1.1472, -1.0757, -0.7374]]) Now, let’s try passing in both alpha and beta. torch.manual_seed(1) beta = torch.randn(5, 1) y * beta + alpha tensor([[ 1.3324, 1.4336, 1.3305], [-0.3165, -0.1477, -0.2999], [-2.1597, -2.1962, -2.1879], [ 0.6723, 0.4851, 0.3762], [-1.0562, -1.0885, -1.2414]], grad_fn=) predict(model, X, args=(alpha, beta)) tensor([[ 1.3324, 1.4336, 1.3305], [-0.3165, -0.1477, -0.2999], [-2.1597, -2.1962, -2.1879], [ 0.6723, 0.4851, 0.3762], [-1.0562, -1.0885, -1.2414]]) This implementation is extremely flexible. It makes no assumptions on the shape of the underlying data (except that the batch_size dimension is the same), and so we could pass in bigger tensors if we wanted to without having to modify the code. torch.manual_seed(0) alpha = torch.randn(5, 3) y + alpha tensor([[ 1.2256, -0.4559, -2.4970], [ 0.4818, -0.5384, -1.4230], [ 0.7123, 0.5552, -0.8678], [-0.2362, -0.7307, -0.1273], [-0.9193, 1.1094, -0.7241]], grad_fn=) predict(model, X, args=(alpha,)) tensor([[ 1.2256, -0.4559, -2.4970], [ 0.4818, -0.5384, -1.4230], [ 0.7123, 0.5552, -0.8678], [-0.2362, -0.7307, -0.1273], [-0.9193, 1.1094, -0.7241]]) This means that if you have a model with one input that is biological sequence and another input that is something more complicated – like an image, for instance – you can easily pass both into the model. They just need to be passed in in the same order as defined by the forward function. 
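The batching behavior described in this tutorial can be sketched in plain Python. This is a toy stand-in, not tangermeme's implementation; the names batched_predict and toy_model are hypothetical and used only for illustration.

```python
# A minimal pure-Python sketch of the batching logic described above --
# not tangermeme's implementation.
def batched_predict(model_fn, X, args=(), batch_size=2):
    outputs = []
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        # slice each optional argument the same way as the data
        batch_args = tuple(a[start:start + batch_size] for a in args)
        outputs.extend(model_fn(batch, *batch_args))
    # concatenating per-batch results recovers the unbatched output
    return outputs

def toy_model(batch, alpha):
    # one output per example: double the input and add a per-example alpha
    return [x * 2 + a for x, a in zip(batch, alpha)]

X = [1, 2, 3, 4, 5]
alpha = [10, 10, 10, 10, 10]
preds = batched_predict(toy_model, X, args=(alpha,), batch_size=2)
# preds == [12, 14, 16, 18, 20]
```

Regardless of the batch size, the concatenated result matches what running everything at once would produce, which is exactly the property the real predict function provides for tensors and PyTorch models.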
import os os.environ['CUDA_VISIBLE_DEVICES'] = '0' Tutorial A3: DeepLIFT/SHAP Although machine learning models – and, more specifically, deep learning models – are frequently referred to as “black box” and “uninterpretable”, these models can be invaluable tools for discovery. At a high level, if one rephrases the problem not as understanding what the sets of learned weights mean but rather as how changes in each input to the model affect the output of the model, principled methods can be devised. These methods aim to attribute to each feature in an example a value that corresponds in some meaningful manner to that feature’s “importance” to the predictions. In genomics, each feature is a nucleotide at a position, so there are 4*seq_len total features. Each attribution method has its own set of assumptions and interpretation of the output, so it is important to fully understand the method being used. Perhaps the most conceptually simple attribution method is in silico saturation mutagenesis (ISM). In this method, each potential substitution is considered and the attribution value is derived from the difference in prediction before and after making the substitution. The simplest such value is just the Euclidean distance between the predictions before and after making the substitution; frequently, this value is then Z-score normalized independently for each position in the sequence. Nucleotides that drive predictions higher, such as those within protein binding sites, are usually assigned a higher score and those that are irrelevant for the prediction are assigned near-zero values. Importantly, this method measures the individual effect that each nucleotide has on predictions and does not consider epistatic interactions. You can read more about it in the corresponding tutorial. 
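The ISM procedure just described can be sketched in a few lines of plain Python. Here a hypothetical toy_score function stands in for a real model's prediction; in real usage you would call the model instead.

```python
# A toy sketch of in silico saturation mutagenesis (ISM), with a
# hypothetical scoring function standing in for a real model.
def toy_score(seq):
    # pretend the "model" simply counts occurrences of a 3-mer motif
    return float(seq.count("TGA"))

def ism(seq, alphabet="ACGT"):
    base = toy_score(seq)
    attributions = []
    for i, ref in enumerate(seq):
        row = {}
        for alt in alphabet:
            if alt == ref:
                row[alt] = 0.0  # the observed character: no substitution made
            else:
                # substitute one character and record the change in score
                row[alt] = toy_score(seq[:i] + alt + seq[i + 1:]) - base
        attributions.append(row)
    return attributions

attr = ism("ATGACT")  # 3 * len(seq) substitutions are evaluated in total
```

Substitutions that destroy the "motif" receive negative values (e.g. mutating the T at position 1 to A removes the only TGA occurrence), while substitutions that leave the score unchanged receive zeros, mirroring how ISM highlights nucleotides that drive predictions.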
A challenge with ISM is that it requires 3*seq_len forward passes through the model; in contrast, gradient-based methods work by assigning an importance to each feature simultaneously based on the gradient of an output with respect to the input. Put another way, perturbation-based methods require a number of forward passes proportional to the number of perturbations/inputs, whereas gradient-based methods require a number of backward passes proportional to the number of outputs of the model. If you have a model with a single output, this means that you can calculate attributions for each input with a single backward pass. If you have a multi-task model with many tasks, you will need one backward pass for each task that you care about. Unfortunately, basic gradient-based/saliency methods are unstable and suffer from not having a reference to compare against. Imagine if you were trying to run ISM but did not have an original sequence to compare the predictions to. DeepLIFT provides a “rescale” correction for the instability of gradients and also calculates the gradient with respect to a reference sequence. The choice of reference sequence is critical for getting meaningful attributions but, fortunately, generating such references in genomics is fairly straightforward. DeepSHAP, which was developed concurrently, extends this idea to using multiple reference sequences and averaging over them. The methods are commonly called DeepLIFT/SHAP to recognize the connection between them and their concurrent development. tangermeme provides an implementation of DeepLIFT/SHAP that is based on the implementation found in Captum. 
However, this implementation has several advantages: (1) it is significantly more compact and better documented, (2) it is thoroughly unit tested to ensure convergence (see below) for common layers and models, (3) it allows vectorized calculation of attributions when the references are different for each sequence, (4) it automatically checks the convergence of each run and raises a warning if convergence is not reached, and (5) it can calculate attributions even when the full set of references does not fit in memory, by running batches that do fit in memory and then aggregating the results after they have all been run. Let’s take a look at how to do this using the Beluga model. We start by loading it up as usual. import torch from model import Beluga model = Beluga() model.load_state_dict(torch.load("deepsea.beluga.torch")) Now, let’s take a random one-hot encoding and put an AP-1 motif right in the middle. from tangermeme.utils import random_one_hot from tangermeme.ersatz import substitute X = random_one_hot((1, 4, 2000)).type(torch.float32) X = substitute(X, "GTGACTCATC") Now we can run DeepLIFT/SHAP using the Beluga model, focusing specifically on a task that predicts JunD binding in GM12878. The signature of deep_lift_shap somewhat mirrors that of the predict function, with the notable changes being that an optional target parameter allows you to focus on a specific output and the random_state argument allows you to force DeepLIFT/SHAP to be deterministic when generating the randomly shuffled sequences used as references. from tangermeme.deep_lift_shap import deep_lift_shap X_attr = deep_lift_shap(model, X, target=267, device='cpu', random_state=0) X_attr.shape torch.Size([1, 4, 2000]) We will get back a tensor that is the same shape as the input, where the values are the attributions averaged across all references. As a technical note, these values are not the multipliers that are usually returned by other DeepLIFT/SHAP implementations. 
Rather, these are the multipliers after they have been corrected for the difference-from-reference property. Specifically, the multipliers can be thought of as the importance of each character to the prediction. But, in a one-hot encoding, multiple values cannot be 1 at the same position, and so the presence of a 1 explicitly means that the other values at that position are 0. So, the true influence of a character at a position is the attribution of that character at that position minus the attribution of the character at that position in the reference sequence. Here, tangermeme makes several decisions for the user, which diverges from my goal of being as assumption-free as possible. However, this is because calculating attributions correctly is challenging, there is a well-used way to do it, and I wanted to make sure that people – particularly those new to genomics – were doing it right. Anyway, we can visualize these attributions using a logo where the height of the characters corresponds to the attribution values. from matplotlib import pyplot as plt import seaborn; seaborn.set_style('whitegrid') from tangermeme.plot import plot_logo plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 JunD") plt.ylim(-0.05, 0.35) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_8_0.png This looks good. When we substitute in an AP-1 motif that Beluga is known to respond to, the attributions highlight the motif and are pretty low everywhere else. We can check and see that there are other tasks where this motif does not drive predictions. Specifically, we can look at the binding of a very different protein in the same cell type and make sure that the attributions are low. IMPORTANT: when comparing attributions at the same location across different tasks, it is crucial that they be computed against the exact same reference sequences. 
When you set random_state to an integer and use the same number of shuffles, this is taken care of for you. But if you do not set a random state, you will get slightly different attributions even for the same task. X_attr = deep_lift_shap(model, X, target=214, device='cpu',random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 ETS1") plt.ylim(-0.05, 0.35) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_10_0.png Looks good. This result shows that the inclusion of an AP-1 binding site does not drive predictions of ETS1. Conversely, we can show that tasks that should be influenced by the inclusion of an AP-1 binding site also highlight the added motif. Specifically, because most AP-1 factors also drive chromatin accessibility, we can look at a DNase target in the same cell type. X_attr = deep_lift_shap(model, X, target=55, device='cpu',random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 DNase") plt.ylim(-0.05, 0.35) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_12_0.png DeepLIFT/SHAP is most useful because it can identify all the motifs that are driving model predictions in a region – not just a single one. To demonstrate this, we will insert two copies of the AP-1 motif as well as two other non-informative sequences. 
from tangermeme.ersatz import multisubstitute X = random_one_hot((1, 4, 2000)).type(torch.float32) X = multisubstitute(X, ["GTGACTCATC", "GTGACTCATC", "CCTCTG", "AACTCTCC"], [12, 8, 15]) X_attr = deep_lift_shap(model, X, target=267, random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 JunD") plt.ylim(-0.05, 0.35) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_14_0.png As you might expect, the attributions highlight the two AP-1 motifs and are low for the other two sequences. Large Model Handling Working with massive models like Enformer can be challenging due to their large memory footprint. When making predictions, large memory requirements can be circumvented by simply setting the batch size to a small number or even just 1. Running DeepLIFT/SHAP is a bit more challenging because a number of references need to be operated on and averaged over for each example, and this number of references should, ideally, grow with the length of the sequence. Some implementations require that an example and all of its reference examples fit in memory at the same time, so even when the batch size is set to 1 this requires fitting 2*n_shuffles sequences in memory, where n_shuffles is the number of reference sequences used (multiplied by 2 because, in practice, the example being explained is copied once for each reference, since differences from the reference are calculated internally). The tangermeme implementation of DeepLIFT/SHAP overcomes this issue by having the batch size correspond to the number of example-reference pairs being processed at the same time. So, when batch_size=1, two sequences are being put in memory: the example and one of the references for that sequence. 
Batches of example-reference pairs are then processed with each batch being moved to the specified device and the results being moved back to the CPU. After each batch, if an example and all of its references have been processed, the average across all references is taken. Basically, this implementation allows one to run DeepLIFT/SHAP with a number of references larger than what fits in GPU memory. We can see here what the attributions look like when setting the batch size to less than the number of shuffles. X_attr = deep_lift_shap(model, X, target=267, n_shuffles=20, device='cpu', batch_size=1, random_state=0) X_attr[0, :, :10] tensor([[-0.0000e+00, -0.0000e+00, -0.0000e+00, 0.0000e+00, -0.0000e+00, 5.6139e-06, 2.4708e-05, -8.2555e-06, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 4.2981e-06, 0.0000e+00, 3.3393e-05, -0.0000e+00, 0.0000e+00, -0.0000e+00, -7.3389e-05, 0.0000e+00], [-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, 0.0000e+00, -8.8689e-05], [-0.0000e+00, -3.2803e-06, -0.0000e+00, 1.4074e-05, -0.0000e+00, -0.0000e+00, 0.0000e+00, -0.0000e+00, 0.0000e+00, 0.0000e+00]]) X_attr = deep_lift_shap(model, X, target=267, n_shuffles=20, device='cpu', batch_size=20, random_state=0) X_attr[0, :, :10] tensor([[-0.0000e+00, -0.0000e+00, -0.0000e+00, 0.0000e+00, -0.0000e+00, 5.6139e-06, 2.4708e-05, -8.2555e-06, 0.0000e+00, 0.0000e+00], [ 0.0000e+00, 0.0000e+00, 4.2981e-06, 0.0000e+00, 3.3393e-05, -0.0000e+00, 0.0000e+00, -0.0000e+00, -7.3389e-05, 0.0000e+00], [-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00, 0.0000e+00, -8.8689e-05], [-0.0000e+00, -3.2803e-06, -0.0000e+00, 1.4074e-05, -0.0000e+00, -0.0000e+00, 0.0000e+00, -0.0000e+00, 0.0000e+00, 0.0000e+00]]) The results are the same! So, if you’re using a massive model you can still run a reasonable number of references without being limited by only having modest hardware. 
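The pairing scheme described above can be made concrete with a small sketch. This is not tangermeme's code; pair_batches is a hypothetical helper showing how example-reference pairs can be streamed in fixed-size batches so that memory use depends on batch_size rather than on the number of references requested.

```python
# A small sketch (not tangermeme's implementation) of streaming
# example-reference pairs in fixed-size batches.
def pair_batches(n_examples, n_shuffles, batch_size):
    # enumerate every (example index, reference index) pair...
    pairs = [(i, j) for i in range(n_examples) for j in range(n_shuffles)]
    # ...and yield them batch_size pairs at a time
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

# With batch_size=1, only one example-reference pair is resident at a time,
# no matter how many references each example uses.
batches = list(pair_batches(n_examples=2, n_shuffles=3, batch_size=1))
```

Once every pair belonging to an example has been processed, its per-reference attributions can be averaged and the memory released, which is what allows the number of references to exceed what fits on the device.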
References The choice of reference is hugely important for getting intuitive results. Ideally, a reference should be a biologically plausible sequence that is not predicted to have the activity that you care about. In tangermeme, the default reference function is dinucleotide_shuffle. Dinucleotide shuffles are reasonable choices because they preserve GC dinucleotide content – important due to their role in being methylated – while being disruptive to any motifs present in the sequence. Let’s take a look at what happens when we increase the number of shuffles used when analyzing the sequence with multiple motifs substituted into it. X_attr1 = deep_lift_shap(model, X, target=267, device='cpu', n_shuffles=1, random_state=0) X_attr2 = deep_lift_shap(model, X, target=267, device='cpu', n_shuffles=2, random_state=0) X_attr5 = deep_lift_shap(model, X, target=267, device='cpu', n_shuffles=5, random_state=0) X_attr20 = deep_lift_shap(model, X, target=267, device='cpu', n_shuffles=20, random_state=0) X_attr50 = deep_lift_shap(model, X, target=267, device='cpu', n_shuffles=50, random_state=0) plt.figure(figsize=(10, 6)) for i, X_attr in enumerate([X_attr1, X_attr2, X_attr5, X_attr20, X_attr50]): ax = plt.subplot(5, 1, i+1) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.title("{} dinucleotide shuffle/s".format([1, 2, 5, 20, 50][i]), fontsize=10) plt.ylim(-0.05, 0.5) if i == 2: plt.ylabel("Attributions") if i == 4: plt.xlabel("Genomic Coordinate") plt.tight_layout() plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_22_0.png As the number of references increase we can see that the two motif instances start to mirror each other more, and that the effect of additional shuffles seems to diminish as more are done. So, it is important to make sure that you are using a large enough number of sequences that the attributions have stabilized. 
Something worth noting is that when only one reference sequence is used, some positions – even multiple within the second motif instance – appear to have a zero value. This is because, due to DeepLIFT/SHAP’s difference-from-reference nature, whenever the reference sequence has the same character as the original sequence the attribution is definitionally zero. When using multiple shuffles this effect is averaged out, but it is worth keeping this property in mind. But, dinucleotide shuffles are not the only reference-generation algorithm that can be used. The references parameter can be any function (default ersatz.dinucleotide_shuffle) that takes in a batch of sequences, a number of shuffles, and an integer seed for controlling random state, and returns a tensor with shape (n_examples, n_references, len(alphabet), length). As a simple example, we can pass in a function that just returns all zeroes. zeros = lambda X, n, random_state: torch.zeros(X.shape[0], n, *X.shape[1:]) X_attr = deep_lift_shap(model, X, target=267, references=zeros, device='cpu', n_shuffles=1, random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 JunD") plt.ylim(-0.1, 0.6) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_24_0.png The all-zeroes reference and its friend, the uniform reference (either 0.25 or 0.5 everywhere), have been proposed as an alternative to dinucleotide shuffling. These may be reasonable baselines in some cases. I do not think we actually know what the best reference is, in part because there is not really a clear way to evaluate references. Finally, if you have preconstructed references and want to just pass in a tensor instead of a function you can do that. 
zeros = torch.zeros(X.shape[0], 1, *X.shape[1:]) X_attr = deep_lift_shap(model, X, target=267, references=zeros, device='cpu', n_shuffles=1, random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 JunD") plt.ylim(-0.1, 0.6) plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_26_0.png Being able to use custom references gives you a large amount of flexibility when it comes to doing subtle analyses. For example, imagine that you know a large portion of the sequence is not relevant for your analysis. You can hold that portion of the sequence constant and make the shuffles you do more meaningful in the area that you do care about (remember that the number of possible permutations increases combinatorially with the number of positions you are shuffling). Here, we can focus on the left AP-1 motif substitution. from tangermeme.ersatz import dinucleotide_shuffle ref = dinucleotide_shuffle(X, n=20, start=960, end=985, random_state=0) X_attr = deep_lift_shap(model, X, target=267, device='cpu', references=ref, random_state=0) plt.figure(figsize=(10, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 950:1050], ax=ax) plt.xlabel("Genomic Coordinate") plt.ylabel("Attributions") plt.title("DeepLIFT Attributions for GM12878 JunD") plt.show() ../_images/tutorials_Tutorial_A3_Deep_LIFT_SHAP_28_0.png Remember that DeepLIFT/SHAP uses references and that you are explaining the difference from reference. If the references are identical to the example being explained at some positions – for example, because you only shuffled a portion of the sequence when constructing the references – the attributions must definitionally be zero wherever you did not shuffle.
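This difference-from-reference zeroing can be seen in a toy sketch (illustrative values, not tangermeme's internals): the attribution at a position is the per-position multiplier times (x - ref), so it vanishes wherever the reference matches the input.

```python
# Toy sketch of the difference-from-reference property: wherever the reference
# matches the input, (x - ref) is zero, so the attribution at that position is
# zero no matter what multiplier the backward pass produces.
x    = [1.0, 0.0, 1.0, 1.0]
ref  = [1.0, 0.0, 0.0, 1.0]        # identical to x except at position 2
mult = [0.8, -0.3, 0.5, 1.2]       # hypothetical per-position multipliers

attr = [m * (xi - ri) for m, xi, ri in zip(mult, x, ref)]
# only position 2 receives a non-zero attribution
```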
This means you should not use the attribution values (which will be zero) in the non-shuffled portions of the sequence and that the attributions in the shuffled portion of the sequence refer to the additional increase in model prediction from just the shuffled portion. Convergence Deltas A key property of the DeepLIFT/SHAP algorithm is that the sum of attributions is equal to the difference in predictions between the original sequence and the reference sequence. However, there can be challenges with implementing DeepLIFT/SHAP correctly, or with ensuring that every layer in a neural network is handled properly. For example, max pooling layers require special care due to the maximum operation. If you mistakenly have dropout turned on, the algorithm will not work. Accordingly, one can check the “convergence deltas” to see whether the algorithm is working correctly. Importantly (and somewhat confusingly), DeepLIFT/SHAP is not an iterative algorithm and so “convergence” does not mean the same thing here as it does during training. What these measure is the difference between (the sum of the attributions) and (the difference in predictions between the original sequence and the reference sequence). Parentheses added to make the sentence more comprehensible given two uses of the word “difference.” These two quantities should be equal, and so when they are not something has gone wrong. You can have the convergence deltas printed for each example-reference pair (one print statement per batch) using the print_convergence_deltas parameter. X_attr = deep_lift_shap(model, X, target=267, device='cpu', random_state=0, print_convergence_deltas=True) tensor([2.3842e-07, 4.7684e-07, 2.3842e-07, 2.3842e-07, 4.7684e-07, 4.7684e-07, 4.7684e-07, 2.3842e-07, 4.7684e-07, 4.7684e-07, 1.1921e-06, 4.7684e-07, 1.1921e-06, 4.7684e-07, 9.5367e-07, 4.7684e-07, 2.3842e-07, 2.3842e-07, 7.1526e-07, 1.1921e-06], grad_fn=<...>) These values seem reasonable.
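The quantity being checked can be illustrated with a toy linear model, where the identity holds exactly. This is a sketch of the summation property only, not of how any particular implementation computes attributions.

```python
# Toy sketch: for a purely linear model, multiplier-times-difference
# attributions sum exactly to f(x) - f(ref), so the convergence delta
# (the gap between those two quantities) is zero up to floating point.
w   = [0.5, -1.0, 2.0]
x   = [1.0, 0.0, 1.0]
ref = [0.25, 0.25, 0.25]

f = lambda v: sum(wi * vi for wi, vi in zip(w, v))
attr = [wi * (xi - ri) for wi, xi, ri in zip(w, x, ref)]

delta = sum(attr) - (f(x) - f(ref))   # ~0
```

For non-linear networks the same identity is what the convergence delta tests, and deviations point at a mishandled layer.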
Usually, you would expect to see values around machine precision. However, there are two caveats to this statement. The first is that machine precision on a GPU seems to be lower than on a CPU. I do not know exactly why this is, but across several algorithms the difference of numbers that should be equal is around 1e-6 to 1e-8 on a CPU and around 1e-4 to 1e-5 on a GPU. A second caveat is that some layers, such as max pooling/unpooling operations in CUDNN, do not have deterministic implementations even when explicitly setting the use of deterministic algorithms in PyTorch. I would recommend reading https://pytorch.org/docs/stable/notes/randomness.html for more details. In general, the convergence deltas should be small w.r.t. the predictions. If you are getting a convergence delta around 1e-4 and predictions above 0.1, the delta is three orders of magnitude smaller than the predictions and likely not too significant. If your convergence deltas are 1e-5 but the predictions are also in that range, that is likely a bigger problem. Anyway, let’s take a look at what happens when we run the same thing on a GPU. X_attr = deep_lift_shap(model, X, target=267, random_state=0, print_convergence_deltas=True) tensor([2.4080e-05, 1.0657e-04, 1.9288e-04, 4.7278e-04, 1.8740e-04, 1.5259e-04, 3.2091e-04, 2.3675e-04, 5.2857e-04, 4.6015e-05, 1.5044e-04, 1.0705e-04, 5.5075e-05, 3.9053e-04, 4.7612e-04, 1.5640e-04, 7.3910e-05, 7.5817e-05, 3.0828e-04, 2.6321e-04], device='cuda:0', grad_fn=<...>) Looks like we are getting convergence deltas in the range that I mentioned you would see on a GPU if everything was fine. In practice, if you are concerned that there may be a bug with your model – either because of an undiagnosed bug in the algorithm somewhere or because a layer in the model has not yet been registered – you should try running the model on a small number of examples on the CPU to see whether the deltas get much smaller.
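The relative check described above can be captured in a small hypothetical helper (not part of tangermeme): judge the delta against the prediction magnitude rather than in absolute terms.

```python
# Hypothetical helper: a convergence delta is only worrying relative to the
# size of the predictions being explained.
def delta_is_small(delta, prediction, ratio=1e-2):
    # accept deltas at most `ratio` of the prediction magnitude
    return abs(delta) <= ratio * abs(prediction)

ok  = delta_is_small(1e-4, 0.1)    # delta three orders smaller: fine
bad = delta_is_small(1e-5, 1e-5)   # delta the same size as the prediction
```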
Note that this implementation of DeepLIFT/SHAP will raise a non-terminating warning when the convergence deltas get too high. This is useful for automatically flagging when something has gone wrong. Additional Non-Linear Operations A critical part of DeepLIFT/SHAP working (and having low convergence deltas) is knowing what to do when it encounters each type of layer in the model. Specifically, for linear layers nothing needs to be done, and for non-linear layers the “rescale” rule needs to be applied. The way the rescale rule is implemented differs between activation functions and max-pooling layers (because the maximum operation and the pooling operation yield very different implementation strategies from an element-wise non-linearity), and may differ again for other forms of layers. In tangermeme there is an internal dictionary mapping operations to how the rescale rule is implemented. If the operation is not in the dictionary, the implementation assumes that it is a linear layer and nothing needs to be done. If you want to add additional operations, or override how currently-implemented layers are handled, you can pass in a dictionary of additional_nonlinear_ops. Each key should be the layer class and each value the function to apply. To demonstrate how to do this, let’s try overriding the ReLU non-linearity operation with just doing nothing instead of rescaling the gradient.
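As a point of reference before removing it: for an element-wise non-linearity, the rescale rule replaces the gradient with the secant slope between the input and the reference. A toy sketch (illustrative only, not tangermeme's internal code):

```python
# Toy sketch of the "rescale" rule for an element-wise non-linearity:
# the local gradient is replaced by the secant slope between the
# activation at the input and the activation at the reference.
relu = lambda v: max(v, 0.0)

x, ref = 2.0, -1.0
secant = (relu(x) - relu(ref)) / (x - ref)   # (2 - 0) / 3
```

Multiplying this secant by (x - ref) recovers the exact change in the layer's output, which is what keeps the convergence deltas low.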
_nonlinear = lambda x, grad_input, grad_output: grad_input additional_nonlinear_ops = {torch.nn.ReLU: _nonlinear} X_attr = deep_lift_shap(model, X, target=267, additional_nonlinear_ops=additional_nonlinear_ops, random_state=0, print_convergence_deltas=True) tensor([10.4793, 11.5042, 3.3573, 8.9456, 2.0517, 10.3768, 4.9426, 6.8905, 5.2843, 12.0783, 2.2528, 4.5493, 9.3421, 2.6807, 1.8102, 6.6792, 6.8412, 3.7978, 7.0050, 8.7039], device='cuda:0', grad_fn=<...>) /users/jmschr/github/tangermeme/tangermeme/deep_lift_shap.py:436: RuntimeWarning: Convergence deltas too high: tensor([10.4793, 11.5042, 3.3573, 8.9456, 2.0517, 10.3768, 4.9426, 6.8905, 5.2843, 12.0783, 2.2528, 4.5493, 9.3421, 2.6807, 1.8102, 6.6792, 6.8412, 3.7978, 7.0050, 8.7039], device='cuda:0', grad_fn=<...>) warnings.warn("Convergence deltas too high: " + As you might expect, replacing the real rescaling operation with doing nothing causes massive convergence deltas. Extending this idea, if you are getting large convergence deltas it is possible that there are non-linear operations in your model that are not in the dictionary, as not all activation functions or pooling layers are currently supported. Most activation functions will be covered by the built-in _nonlinear function (not the one above, but the one in tangermeme.deep_lift_shap), so if you are using a custom activation you should first try using that. Comparison with Other Tools As a sanity check, we can confirm that this implementation of DeepLIFT/SHAP matches other implementations. See the vignette “Attribution Trickiness and DeepLiftShap Implementations” for a thorough discussion of comparisons across implementations. Tutorial A4: Seqlets Ultimately, our goal is to better understand the genomic sequences driving biological activity.
To do this, we train machine learning models to make accurate predictions and then we use attribution methods (DeepLIFT/SHAP, in silico saturation mutagenesis, whatever) to highlight the individual nucleotides driving those model predictions. However, simply assigning some weight to each nucleotide is not particularly informative, nor is it a scalable way to understand what a model has learned. Accordingly, after calculating attribution scores, a common task is to identify the contiguous spans of high attribution characters that drive model predictions and, potentially, form protein binding sites. These spans are called “seqlets” and form the basis of subsequent algorithms that aim to distill what these machine learning methods have learned. For example, seqlets can be clustered into recurring patterns that drive model predictions (e.g., as TF-MoDISco does), but seqlets can also be used as an automated way of annotating sequences and extracting spatial relationships between motifs. At a high level, one can think of a seqlet as a base cis-regulatory unit identified by the model. Like with our other tutorials, let’s begin by loading the Beluga model. import torch from model import Beluga model = Beluga() model.load_state_dict(torch.load("deepsea.beluga.torch")) To demonstrate seqlet calling, we need clean attribution scores. In practice, one will usually run a seqlet caller on attributions from real genomic sequences. However, for demonstration purposes, we will randomly generate a sequence and insert three known motifs into it. Specifically, we will insert two identical AP-1 motifs and a YY-1 motif, at equal spacings from each other. Then, we will run DeepLIFT/SHAP using a DNase target. 
from matplotlib import pyplot as plt import seaborn; seaborn.set_style('whitegrid') from tangermeme.utils import random_one_hot from tangermeme.ersatz import multisubstitute from tangermeme.deep_lift_shap import deep_lift_shap from tangermeme.plot import plot_logo X = random_one_hot((1, 4, 2000), random_state=0).type(torch.float32) X = multisubstitute(X, ["GTGACTCATC", "GTGACTCATC", "ACTTCCGCG"], [25, 25]) X_attr = deep_lift_shap(model, X, device='cpu', n_shuffles=20, target=57, random_state=0) plt.figure(figsize=(12, 2)) ax = plt.subplot(111) plot_logo(X_attr[0, :, 925:1075], ax=ax) plt.ylim(-0.025, 0.25) plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_3_0.png Looks like the two AP-1 motifs are being pulled out neatly and the YY1 motif is a bit less clean. Hopefully, a seqlet caller will be able to identify all three patterns based on their attributions being much stronger than the surrounding background. Recursive Seqlets A new seqlet calling algorithm implemented in tangermeme is the “recursive seqlet” caller. This method is meant to be a simplification – both conceptually and implementation-wise – of the seqlet calling algorithm used in TF-MoDISco. Conceptually, it works by first constructing a background distribution of attribution scores for all span lengths (between user-specified minimum and maximum lengths) and then scoring all potential spans according to these background distributions. This process results in a (n_lengths, sequence_length) matrix of p-values. There are several ways that one could decode this matrix to get actual seqlets, many of which introduce several new parameters that would need to be optimized, but the recursive algorithm relies on a simple definition: a seqlet is the longest span whose p-value is below a threshold and whose internal spans are all also below that threshold.
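The recursive condition can be checked with a small toy function. This is a sketch of the definition only, not tangermeme's implementation; the `pvals[(length, start)]` dictionary layout is hypothetical.

```python
# Toy sketch of the recursive seqlet property: a span only qualifies if it and
# every internal sub-span (down to min_len) have p-values below the threshold.
def passes_recursive(pvals, start, length, min_len, threshold):
    for l in range(min_len, length + 1):
        for s in range(start, start + length - l + 1):
            # missing spans default to p=1.0, i.e. not significant
            if pvals.get((l, s), 1.0) >= threshold:
                return False
    return True

# A length-3 span at position 0 whose two internal length-2 spans also pass:
pvals = {(2, 0): 0.001, (2, 1): 0.004, (3, 0): 0.002}
good = passes_recursive(pvals, start=0, length=3, min_len=2, threshold=0.01)

# The same span fails once one internal sub-span is no longer significant:
pvals[(2, 1)] = 0.5
bad = passes_recursive(pvals, start=0, length=3, min_len=2, threshold=0.01)
```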
This definition gives a precise meaning to what a seqlet actually is and also circumvents several problems that can arise in decoding with other definitions, e.g., the flanks can be too long when you say a seqlet is simply the longest span below a threshold if there’s a central core that has really high attributions. This definition results in a method that has very few parameters and is quite fast. Let’s see it in action. from tangermeme.seqlet import recursive_seqlets seqlets = recursive_seqlets(X_attr.sum(dim=1)) seqlets example_idx start end attribution p-value 1 0 1578 1583 -0.024013 4.662937e-15 0 0 1033 1039 0.446336 4.798464e-03 2 0 932 937 -0.020814 5.285412e-03 3 0 962 967 0.357937 6.717850e-03 4 0 998 1002 0.278854 8.579600e-03 Note that the function is run on the attributions corresponding to the observed characters, and so the input has only two dimensions (one being the examples and the other being the length of the examples). A dataframe is returned that is mostly BED formatted, with the first column being the index of the example and the second and third columns being the start and end of the seqlet call, respectively. The attribution column is the sum of the attributions across the seqlet and the p-value is the one calculated for the full span (at that position and length). The recursive property means that all internal spans must also have a p-value lower than the threshold, but those values do not affect the p-value of the seqlet returned here. plt.figure(figsize=(12, 2)) ax = plt.subplot(111) plot_logo(X_attr[0], ax=ax, start=925, end=1100, annotations=seqlets, score_key='attribution') plt.ylim(-0.025, 0.25) plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_8_0.png This example encapsulates the strengths and weaknesses of the recursive seqlet function.
Starting with the strengths: it looks like it captured all three motifs, and did so quickly and with only one example to build background distributions with. Beyond those, it called only two other seqlets in the entire sequence, both low-attribution negative seqlets. As for weaknesses: those two negative seqlet calls have very low attribution values, and the spans for all the seqlets are relatively conservative. The negative seqlet calls are likely an artifact of having to construct background distributions for negative seqlets using only a single example where most spans have positive attributions – a huge dearth of data. When calling seqlets using more examples, this issue should resolve itself, but one could also get around it by requiring that the magnitude of the seqlet attributions is above some threshold. As for having conservative boundaries… this is a consequence of the recursive seqlet property and the noise in the data. If too many characters in a row have low – or even negative – attributions, the recursive property may not hold for the shorter spans. Further, calling significant seqlets requires good background distributions of attribution values, and a single sequence may not provide enough background spans across all attribution strengths and span lengths. I would expect that this has a saturation property, where adding more examples significantly expands seqlet boundaries up to a certain point, after which it levels off. The reasoning for focusing on finding the cores of seqlets, and maybe missing some of the flanks, is that seqlet calling is very imprecise and subjective. Where exactly does the YY1 seqlet on the right begin and end? Does it include the GCGGG toward the end? What about the subsequent CTA?
In principle, this can be resolved by clustering YY1 seqlets together to find consensus patterns (as TF-MoDISco does), but when just calling seqlets with no other information, the exact borders can be difficult to define. So, if one views the recursive seqlet algorithm as good at finding the cores of seqlets that may need additional flanks on either side – or if you are in a setting where you need the surrounding context of the seqlet for your downstream application – you can expand the flanks of each seqlet with the additional_flanks parameter. seqlets = recursive_seqlets(X_attr.sum(dim=1), additional_flanks=5) plt.figure(figsize=(12, 2)) ax = plt.subplot(111) plot_logo(X_attr[0], ax=ax, start=925, end=1100, annotations=seqlets, score_key='attribution') plt.ylim(-0.025, 0.25) plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_10_0.png Here, all we are doing is offsetting the start and end of each seqlet by 5 positions beyond the original call. The calls look much better for most purposes, especially ones where it is okay to have some uninformative positions on either end. Adding additional flanks does not affect the seqlet calling procedure, and it does not influence the returned attribution score or the p-values, which remain those of the original span. Further, seqlet calls cannot overlap each other, but the flanks of seqlets are not considered an “overlap.” Literally, when you add additional flanks, the start and the end just get changed at the very end of the procedure. As mentioned before, the recursive seqlet procedure has very few parameters: min_seqlet_len, max_seqlet_len, and threshold (and additional_flanks, of course).
The minimum and maximum seqlet lengths define the range of spans to calculate p-values for, with the minimum length (default 4) also defining the smallest span that must have a p-value below the threshold, and the maximum length simply being the largest size a seqlet can be. Note that the minimum length should probably not be smaller than 4 because of noise in the data. For instance, if you set it to 1, then any individual position with an attribution that is not very high will ruin the entire seqlet, even if it is just a single weak character in an otherwise-strong motif. Increasing the maximum seqlet length will increase runtime and memory usage a little bit, but it should be set as large as you think a seqlet might potentially be (default 25). Also, note that the threshold will depend a bit on the number of examples that seqlets are being calculated for. A threshold of 0.001 means, definitionally, that there should be at least 1000 examples of spans of that length (and that the observed seqlet has a higher attribution than all of them). Let’s take a look at what happens when we set the threshold to a much more permissive value, though. seqlets = recursive_seqlets(X_attr.sum(dim=1), threshold=0.1) plt.figure(figsize=(12, 2)) ax = plt.subplot(111) plot_logo(X_attr[0], ax=ax, start=925, end=1100, annotations=seqlets, score_key='attribution') plt.ylim(-0.025, 0.25) plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_12_0.png Looks like the main seqlets are basically being discovered again, but some of the other small spans of modest-attribution characters are also being picked up as seqlets. Depending on how clean your data is, this may be desirable, but it could also be noise. Ultimately, seqlet calling in one pass like this is very challenging to do without any additional information.
TF-MoDISco Seqlets TF-MoDISco is an algorithm for identifying repeating seqlet patterns. The first step in this algorithm is to call seqlets. Accordingly, it uses an approach that is somewhat similar to the recursive algorithm in that it defines a background distribution for positive and negative seqlets separately, but it has several more steps that are meant to be helpful in the final goal of finding repeating patterns. For instance, this seqlet calling method will first smooth the attribution values, uses a wide fixed window, and fits isotonic regressions to estimate background probabilities instead of the binning strategy employed by the recursive algorithm. Although the seqlet calls from this algorithm are also done using a greedy procedure, the recursive seqlet property does not hold for the seqlets called here. We can run this seqlet calling method using the tfmodisco_seqlets function. This function takes in an attribution track – which should not contain hypothetical attributions – and returns a set of seqlets with positive attribution and a set of seqlets with negative attributions as pandas DataFrames with the same format as the other motif calling methods. from tangermeme.seqlet import tfmodisco_seqlets seqlets = tfmodisco_seqlets(X_attr.sum(dim=1)) print(seqlets.shape) seqlets.head() (19, 4) example_idx start end attribution 0 0 1021 1062 0.788878 1 0 941 982 0.685255 2 0 975 1016 0.623512 3 0 1000 1041 0.146656 4 0 885 926 0.128889 It looks like the attribution scores drop significantly after the third seqlet. This is likely because the first three seqlets correspond to the three known motifs that we inserted into the sequence and the ones after that are just false positives. Something to note about this approach to seqlet calling is that it assumes a minimum number of regions will be seqlets and so will adaptively change the thresholds internally to make sure that number of seqlets are called. 
This means that the method can overcall seqlets when examples do not have enough going on in them. Here, we can see that 19 seqlets are called, in comparison to just 5 when using the recursive seqlet caller. For TF-MoDISco, overcalling seqlets is not a big problem because the seqlets are more rigorously filtered in subsequent steps, but you may need to implement similar filtering (e.g., based on attribution scores) to remove these spuriously called seqlets from your downstream analyses. plt.figure(figsize=(12, 3)) ax = plt.subplot(111) plot_logo(X_attr[0], ax=ax, start=925, end=1075, annotations=seqlets, score_key="attribution") plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_17_0.png Looks like all three of the motifs we added in are being underlined by the seqlet caller. Similar to the recursive seqlet caller, the TF-MoDISco seqlet caller will first identify core components of seqlets that cannot overlap and then subsequently expand the boundaries in a way that can overlap with other seqlets. The difference here is just that TF-MoDISco will expand out the flanks by default because that is useful for the subsequent clustering algorithm – i.e., the seqlet calling algorithm is purpose-driven for the rest of the algorithm, rather than being general-purpose. You might notice that the boundaries seem a little bit… offset… with a longer span to the left of the seqlet than to the right. This is because, internally, the attribution at each position is replaced with the sum of the attributions starting at that position and moving to the right. This means that the chosen position for each seqlet is the left side of the seqlet. From this chosen position, a fixed-width span is chosen on either end as the seqlet boundaries. By default, the TF-MoDISco seqlet caller calls seqlets that are quite long. 
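The left-anchored transform described above can be sketched with toy values (hypothetical numbers, not real attributions): each position's score becomes the sum of the next few attribution values, so a chosen position marks the left edge of its window.

```python
# Sketch of the left-anchored window sum: the score at position i summarizes
# the window of attributions starting at i and extending to the right.
attr = [0.1, 0.4, 0.3, 0.0, 0.2]
window = 3

scores = [sum(attr[i:i + window]) for i in range(len(attr) - window + 1)]
# scores[0] covers positions 0-2, scores[1] covers 1-3, and so on
```

Because the score is attached to the window's left edge, expanding a fixed-width span around the chosen position skews the reported seqlet leftward, matching the offset seen in the plots.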
These long default calls are deliberate: the subsequent TF-MoDISco pattern discovery algorithm uses an alignment-based similarity score, so it is better for the seqlets to include longer context and capture everything than to miss something important. If you would like to change the width you can alter the window_size and flank arguments. The window_size argument defines how many positions to the right to sum over when replacing the position-specific attribution, and the flank argument defines the additional number of nucleotides to add to the seqlet for padding. seqlets = tfmodisco_seqlets(X_attr.sum(dim=1), window_size=11) plt.figure(figsize=(12, 3)) ax = plt.subplot(111) plot_logo(X_attr[0], ax=ax, start=925, end=1075, annotations=seqlets, score_key="attribution") plt.xlabel("Genomic Coordinate") plt.ylabel("Attribution") plt.title("DeepLIFT/SHAP of Marginalization") plt.show() ../_images/tutorials_Tutorial_A4_Seqlets_19_0.png By decreasing the window size to 11 we get shorter seqlets, but we also begin to overcall low-attribution seqlets. This is largely because, when a seqlet is called, a window surrounding the calling position is blanked out, preventing additional seqlets from being called at that position. When the window size is decreased, additional seqlets may be called that pick up on flanking nucleotides that do not themselves have very high attribution scores but have relatively high attributions given a background sequence. However, the seqlets called for the known patterns are still several times stronger than any of the overcalled seqlets. Taken together, seqlet calling and plotting of these annotations can be a powerful tool for understanding what your model is picking up on when it makes predictions.
As we will see in later notebooks, this automatic partitioning of the genome into units that drive predictions is a valuable step for many downstream analyses that summarize the features driving predictions globally.