Goal

TODO

  • [x] implement Winner-take-all autoencoder: https://arxiv.org/pdf/1409.2752.pdf
    • [x] Spatial Sparsity
      • for each feature map, keep only the neuron with the highest activation across the spatial dimensions
    • [x] Lifetime Sparsity
      • for each feature map, keep only the top k% of activations across the batch (the paper uses k ∈ {20, 5})
  • [x] Evaluate each on the ChipNexus data
    • visualize the deconv filters (also called atoms in the paper)
    • visualize the activation layer (with and without the sparsity regularization)
    • visualize the predicted track (with and without the sparsity regularization)
  • [x] Train seq->chipnexus, and (seq, chipnexus) -> chipnexus models
    • simply stack sequence+signal
  • [ ] Run the final prediction turning off the regularization
    • compare the two cases (on and off)
  • [ ] compute the PWM w.r.t. every output footprint
    • can we see any clear binding footprints?
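The two sparsity constraints above can be sketched in numpy (a minimal illustration, not the basepair `SpatialLifetimeSparsity` layer itself; activations are assumed to be shaped `(batch, width, channels)`):

```python
import numpy as np

def spatial_sparsity(a):
    """Per sample and feature map, keep only the spatially maximal activation."""
    # a: (batch, width, channels); mask everything below the per-map spatial max
    mask = a == a.max(axis=1, keepdims=True)
    return a * mask

def lifetime_sparsity(a, rate=0.2):
    """Per feature map, keep activations only for the top `rate` fraction of samples."""
    per_sample = a.max(axis=1)                    # (batch, channels) best activation per sample
    k = max(1, int(np.ceil(rate * a.shape[0])))   # how many samples survive per channel
    thresh = np.sort(per_sample, axis=0)[-k]      # (channels,) k-th largest per channel
    mask = per_sample >= thresh                   # (batch, channels)
    return a * mask[:, None, :]
```

In training, lifetime sparsity would typically be applied on top of the spatially sparsified maps, matching the paper's combined scheme.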

Technical improvements

  • [ ] use larger window size as input with valid convolutions

Organizational

  • [ ] set up the model function with easy hyper-parameters

Potential problems

  • the width of the footprint could vary, so a single deconv filter might not be enough to represent it
In [2]:
import keras
import pyBigWig
import matplotlib.pyplot as plt
from concise.preprocessing import encodeDNA
from concise.utils.fasta import read_fasta
from pybedtools import BedTool
from tqdm import tqdm
import pandas as pd
import numpy as np

from basepair.config import get_data_dir, create_tf_session
from basepair.datasets import get_sox2_data, seq_inp_exo_out
from basepair.math import softmax
from basepair.losses import twochannel_multinomial_nll
import basepair as bp
from basepair.layers import SpatialLifetimeSparsity
from basepair.plots import *
ddir = get_data_dir()
Using TensorFlow backend.
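`twochannel_multinomial_nll` presumably scores each strand's cut profile as a multinomial likelihood over positions; a hedged numpy sketch of the per-strand term follows (the actual basepair implementation may differ, e.g. in constants or reduction):

```python
import numpy as np

def multinomial_nll(counts, logits):
    """Per-sample multinomial NLL of observed counts under softmax(logits).

    Illustrative sketch only; it drops the logits-independent log multinomial
    coefficient, which does not affect gradients w.r.t. the logits.
    """
    # counts, logits: (batch, width)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    return -(counts * logp).sum(axis=1)
```

A two-channel version would compute this once per strand and sum the two terms.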
In [3]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History
from keras.models import Model, load_model
In [4]:
def get_model(mfn, mkwargs):
    """Instantiate the model builder named `mfn` with kwargs `mkwargs`.

    Returns (model, run_name, checkpoint_path).
    """
    import datetime
    mdir = f"{ddir}/processed/chipnexus/exp/models/wta-ae"
    # run name = builder name + hyper-parameters + timestamp
    name = mfn + "_" + \
            ",".join([f'{k}={v}' for k, v in mkwargs.items()]) + \
            "." + str(datetime.datetime.now()).replace(" ", "::")
    !mkdir -p {mdir}
    ckp_file = f"{mdir}/{name}.h5"
    # NB: eval() resolves the builder function from the notebook namespace
    return eval(mfn)(**mkwargs), name, ckp_file
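`eval(mfn)` is convenient interactively but resolves names from the global namespace; outside a notebook, a hypothetical explicit-registry variant is safer (sketch only, names are illustrative):

```python
import datetime

def get_model_safe(mfn, mkwargs, registry, mdir="."):
    """Like get_model, but looks the builder up in an explicit dict
    instead of eval()-ing its name."""
    name = (mfn + "_"
            + ",".join(f"{k}={v}" for k, v in mkwargs.items())
            + "." + datetime.datetime.now().isoformat())
    ckp_file = f"{mdir}/{name}.h5"
    return registry[mfn](**mkwargs), name, ckp_file
```

Callers would pass e.g. `registry={"seq_dense_wta": seq_dense_wta}`, so an unknown `mfn` fails loudly with a `KeyError` rather than evaluating arbitrary code.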
In [5]:
create_tf_session(1)
keras.backend.get_session().list_devices()
Out[5]:
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 5760073728)]
In [5]:
dfc = get_sox2_data()
100%|██████████| 9396/9396 [00:02<00:00, 3444.38it/s]
In [6]:
train, valid, test = seq_inp_exo_out()
100%|██████████| 9396/9396 [00:02<00:00, 3455.34it/s]
In [7]:
[x.shape for x in train]
Out[7]:
[(5561, 201, 4), (5561, 201, 2), (5561, 7)]
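The three arrays are presumably the one-hot DNA sequence `(N, 201, 4)`, the two-strand ChIP-nexus cut counts `(N, 201, 2)`, and per-sample auxiliary features `(N, 7)`. A quick structural sanity check on dummy stand-ins (the one-hot convention is assumed here, not verified against concise's `encodeDNA`):

```python
import numpy as np

# dummy arrays with the same layout as train[0] and train[1]
seq = np.eye(4)[np.random.randint(0, 4, size=(8, 201))]  # (8, 201, 4) one-hot bases
cuts = np.abs(np.random.randn(8, 201, 2))                # (8, 201, 2) strand cut counts

assert seq.shape == (8, 201, 4)
assert np.allclose(seq.sum(axis=-1), 1)  # exactly one base per position
assert cuts.shape[-1] == 2               # pos/neg strand channels
```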
In [8]:
dfc.head(2)
Out[8]:
chr cuts_neg cuts_pos end start seq seq_id
0 chr17 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 17409072 17408871 CCAGGAAGCTGCCTCAGGCTAGCCTCCGGAAACACTCCACAGTATC... chr17:17408871-17409072
1 chr5 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 110284853 110284652 ATGAAAAGAGAGAGAAAACGCGTGCGGTTCAGCTAGCTCACTTTTT... chr5:110284652-110284853
In [9]:
from basepair.models import seq_dense, seq_dense_wta, wta_ae
In [10]:
# hyper-parameters
mfn = "seq_dense_wta"
mkwargs = dict(filters=21,
               conv1_kernel_size=25,
               tconv_kernel_size=25,
               n_dil_layers=6,
               wta_lifetime_rate=1.0,
               n_bottleneck=10,
               #use_seq=True,
               lr=0.004)
In [16]:
# best valid so far: 108238.6558
model, name, ckp_file = get_model(mfn, mkwargs)
history = model.fit({"seq": train[0], "chip_nexus": train[1]}, train[1], 
          batch_size=256, 
          epochs=5,
          validation_split=0.2,
          callbacks=[EarlyStopping(patience=20),
                     History(),
                     ModelCheckpoint(ckp_file, save_best_only=True)]
         )
# get the best model
model = load_model(ckp_file, custom_objects={"twochannel_multinomial_nll": twochannel_multinomial_nll, 
                                             "SpatialLifetimeSparsity": SpatialLifetimeSparsity})
Train on 4448 samples, validate on 1113 samples
Epoch 1/5
4448/4448 [==============================] - 6s 1ms/step - loss: 61900.9574 - val_loss: 128206.6146
Epoch 2/5
4448/4448 [==============================] - 1s 117us/step - loss: 61581.3062 - val_loss: 126621.4253
Epoch 3/5
4448/4448 [==============================] - 1s 126us/step - loss: 61288.0131 - val_loss: 125790.6921
Epoch 4/5
4448/4448 [==============================] - 1s 144us/step - loss: 61156.8059 - val_loss: 124908.5045
Epoch 5/5
4448/4448 [==============================] - 1s 133us/step - loss: 60862.9582 - val_loss: 124121.3946
In [20]:
dfh = pd.DataFrame(history.history)
In [ ]:
model.history
In [137]:
p = Seq2NexusTrack({"seq": train[0], "chip_nexus": train[1]}, #train[1], #
                   train[1], model)
In [138]:
p.plot()
In [139]:
p = Seq2NexusTrack({"seq": test[0], "chip_nexus": test[1]}, 
                   test[1], model)
In [140]:
p.plot(sort='top')
In [ ]:
p = CAENexus({"seq": test[0], "chip_nexus": test[1]},
             test[1], test[2], model)
In [144]:
p.plot(10, sort='top', bottleneck_type='after', show_seq=True)
In [145]:
deconv_filters(model, figsize=(15, 4), ncol=5)
In [153]:
a = model.evaluate(test[0], test[1])
1951/1951 [==============================] - 0s 199us/step
In [159]:
a
Out[159]:
8488.407087911648