Goal

TODO

  • [x] implement Winner-take-all autoencoder: https://arxiv.org/pdf/1409.2752.pdf
    • [x] Spatial Sparsity
      • for each feature map, keep only the neuron with the highest activation across the spatial dimensions
    • [x] Lifetime Sparsity
      • for each feature map, keep only the top k% of activations across the batch (the paper uses k ∈ {20, 5})
  • [x] Evaluate each on the ChipNexus data
    • visualize the deconv filters (also called atoms in the paper)
    • visualize the activation layer (with and without the sparsity regularization)
    • visualize the predicted track (with and without the sparsity regularization)
  • [x] Train seq->chipnexus, and (seq, chipnexus) -> chipnexus models
    • simply stack sequence+signal
  • [ ] Run the final prediction turning off the regularization
    • compare the two cases (on and off)
  • [ ] compute the PWM w.r.t. every output footprint
    • can we see any clear binding footprints?
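The two sparsity constraints above can be sketched in numpy (a minimal illustration, not the basepair `SpatialLifetimeSparsity` layer itself; activations are assumed to be shaped `(batch, width, channels)`):

```python
import numpy as np

def spatial_sparsity(a):
    """Per sample and feature map, keep only the spatially maximal activation."""
    # a: (batch, width, channels); mask everything below the per-map spatial max
    mask = a == a.max(axis=1, keepdims=True)
    return a * mask

def lifetime_sparsity(a, rate=0.2):
    """Per feature map, keep activations only for the top `rate` fraction of samples."""
    per_sample = a.max(axis=1)                    # (batch, channels) best activation per sample
    k = max(1, int(np.ceil(rate * a.shape[0])))   # how many samples survive per channel
    thresh = np.sort(per_sample, axis=0)[-k]      # (channels,) k-th largest per channel
    mask = per_sample >= thresh                   # (batch, channels)
    return a * mask[:, None, :]
```

In training, lifetime sparsity would typically be applied on top of the spatially sparsified maps, matching the paper's combined scheme.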

Technical improvements

  • [ ] use larger window size as input with valid convolutions

Organizational

  • [ ] set up the model function with easy hyper-parameters

Potential problems

  • the width of the footprint could vary, so a single deconv filter might not be enough to represent it
In [2]:
import keras
import pyBigWig
import matplotlib.pyplot as plt
from concise.preprocessing import encodeDNA
from concise.utils.fasta import read_fasta
from pybedtools import BedTool
from tqdm import tqdm
import pandas as pd
import numpy as np

from basepair.config import get_data_dir, create_tf_session
from basepair.datasets import get_sox2_data, seq_inp_exo_out
from basepair.math import softmax
from basepair.losses import twochannel_multinomial_nll
import basepair as bp
from basepair.layers import SpatialLifetimeSparsity
from basepair.plots import *
ddir = get_data_dir()
Using TensorFlow backend.
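`twochannel_multinomial_nll` presumably scores each strand's cut profile as a multinomial likelihood over positions; a hedged numpy sketch of the per-strand term follows (the actual basepair implementation may differ, e.g. in constants or reduction):

```python
import numpy as np

def multinomial_nll(counts, logits):
    """Per-sample multinomial NLL of observed counts under softmax(logits).

    Illustrative sketch only; it drops the logits-independent log multinomial
    coefficient, which does not affect gradients w.r.t. the logits.
    """
    # counts, logits: (batch, width)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
    return -(counts * logp).sum(axis=1)
```

A two-channel version would compute this once per strand and sum the two terms.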
In [3]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History
from keras.models import Model, load_model
In [4]:
def get_model(mfn, mkwargs):
    """Instantiate the model builder named `mfn` with kwargs `mkwargs`.

    Returns (model, run_name, checkpoint_path).
    """
    import datetime
    mdir = f"{ddir}/processed/chipnexus/exp/models/wta-ae"
    # run name = builder name + hyper-parameters + timestamp
    name = mfn + "_" + \
            ",".join([f'{k}={v}' for k, v in mkwargs.items()]) + \
            "." + str(datetime.datetime.now()).replace(" ", "::")
    !mkdir -p {mdir}
    ckp_file = f"{mdir}/{name}.h5"
    # NB: eval() resolves the builder function from the notebook namespace
    return eval(mfn)(**mkwargs), name, ckp_file
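`eval(mfn)` is convenient interactively but resolves names from the global namespace; outside a notebook, a hypothetical explicit-registry variant is safer (sketch only, names are illustrative):

```python
import datetime

def get_model_safe(mfn, mkwargs, registry, mdir="."):
    """Like get_model, but looks the builder up in an explicit dict
    instead of eval()-ing its name."""
    name = (mfn + "_"
            + ",".join(f"{k}={v}" for k, v in mkwargs.items())
            + "." + datetime.datetime.now().isoformat())
    ckp_file = f"{mdir}/{name}.h5"
    return registry[mfn](**mkwargs), name, ckp_file
```

Callers would pass e.g. `registry={"seq_dense_wta": seq_dense_wta}`, so an unknown `mfn` fails loudly with a `KeyError` rather than evaluating arbitrary code.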
In [5]:
create_tf_session(1)
keras.backend.get_session().list_devices()
Out[5]:
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 5760073728)]
In [5]:
dfc = get_sox2_data()
100%|██████████| 9396/9396 [00:02<00:00, 3444.38it/s]
In [6]:
train, valid, test = seq_inp_exo_out()
100%|██████████| 9396/9396 [00:02<00:00, 3455.34it/s]
In [7]:
[x.shape for x in train]
Out[7]:
[(5561, 201, 4), (5561, 201, 2), (5561, 7)]
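The three arrays are presumably the one-hot DNA sequence `(N, 201, 4)`, the two-strand ChIP-nexus cut counts `(N, 201, 2)`, and per-sample auxiliary features `(N, 7)`. A quick structural sanity check on dummy stand-ins (the one-hot convention is assumed here, not verified against concise's `encodeDNA`):

```python
import numpy as np

# dummy arrays with the same layout as train[0] and train[1]
seq = np.eye(4)[np.random.randint(0, 4, size=(8, 201))]  # (8, 201, 4) one-hot bases
cuts = np.abs(np.random.randn(8, 201, 2))                # (8, 201, 2) strand cut counts

assert seq.shape == (8, 201, 4)
assert np.allclose(seq.sum(axis=-1), 1)  # exactly one base per position
assert cuts.shape[-1] == 2               # pos/neg strand channels
```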
In [8]:
dfc.head(2)
Out[8]:
chr cuts_neg cuts_pos end start seq seq_id
0 chr17 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 17409072 17408871 CCAGGAAGCTGCCTCAGGCTAGCCTCCGGAAACACTCCACAGTATC... chr17:17408871-17409072
1 chr5 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... 110284853 110284652 ATGAAAAGAGAGAGAAAACGCGTGCGGTTCAGCTAGCTCACTTTTT... chr5:110284652-110284853
In [9]:
from basepair.models import seq_dense, seq_dense_wta, wta_ae
In [10]:
# hyper-parameters
mfn = "seq_dense_wta"
mkwargs = dict(filters=21,
               conv1_kernel_size=25,
               tconv_kernel_size=25,
               n_dil_layers=6,
               wta_lifetime_rate=1.0,
               n_bottleneck=10,
               #use_seq=True,
               lr=0.004)
In [16]:
# best valid so far: 108238.6558
model, name, ckp_file = get_model(mfn, mkwargs)
history = model.fit({"seq": train[0], "chip_nexus": train[1]}, train[1], 
          batch_size=256, 
          epochs=5,
          validation_split=0.2,
          callbacks=[EarlyStopping(patience=20),
                     History(),
                     ModelCheckpoint(ckp_file, save_best_only=True)]
         )
# get the best model
model = load_model(ckp_file, custom_objects={"twochannel_multinomial_nll": twochannel_multinomial_nll, 
                                             "SpatialLifetimeSparsity": SpatialLifetimeSparsity})
Train on 4448 samples, validate on 1113 samples
Epoch 1/5
4448/4448 [==============================] - 6s 1ms/step - loss: 61900.9574 - val_loss: 128206.6146
Epoch 2/5
4448/4448 [==============================] - 1s 117us/step - loss: 61581.3062 - val_loss: 126621.4253
Epoch 3/5
4448/4448 [==============================] - 1s 126us/step - loss: 61288.0131 - val_loss: 125790.6921
Epoch 4/5
4448/4448 [==============================] - 1s 144us/step - loss: 61156.8059 - val_loss: 124908.5045
Epoch 5/5
4448/4448 [==============================] - 1s 133us/step - loss: 60862.9582 - val_loss: 124121.3946
In [20]:
dfh = pd.DataFrame(history.history)
In [ ]:
model.history
In [137]:
p = Seq2NexusTrack({"seq": train[0], "chip_nexus": train[1]}, #train[1], #
                   train[1], model)
In [138]:
p.plot()
In [139]:
p = Seq2NexusTrack({"seq": test[0], "chip_nexus": test[1]}, 
                   test[1], model)
In [140]:
p.plot(sort='top')
In [ ]:
p = CAENexus({"seq": test[0], "chip_nexus": test[1]},
             test[1], test[2], model)
In [144]:
p.plot(10, sort='top', bottleneck_type='after', show_seq=True)
In [145]:
deconv_filters(model, figsize=(15, 4), ncol=5)
In [153]:
a = model.evaluate(test[0], test[1])
1951/1951 [==============================] - 0s 199us/step
In [159]:
a
Out[159]:
8488.407087911648