Questions

  • can we discover events when Sox2 and Oct4 are both bound?
  • could we discover/distinguish those events when using only Sox2 or only Oct4?
  • does the multi-task model do better than the single-task one?

Goal

  • make a multi-task model predicting Sox2 and Oct4 ChIP-nexus signal (git issue)

TODO

  • [x] set up a new dataset function based on Sox2 peaks
  • [x] set up a new model with two output branches
  • [x] train the model
  • [ ] add the training curve plot below the training section
  • [x] modify the plotting code
    • show also the output tracks for Oct4
    • show also the sequence importance for Oct4
  • [ ] try out a few models to figure out the optimal hyper-parameters
  • [x] figure out how to best visualize the sequence importances
    • max w.r.t. each individual track
      • the problem here is that Oct4 might have multiple peaks present in the sequence
    • average?
    • top2 peaks?
  • [x] run modisco - what do the motifs look like?
  • [x] plot the average sequence distribution for each motif

set up a new dataset function based on Sox2 peaks

In [1]:
import pandas as pd
import numpy as np
from pybedtools import BedTool
from basepair.config import get_data_dir, create_tf_session
from tqdm import tqdm
from concise.preprocessing import encodeDNA
from basepair.datasets import get_sox2_data
from basepair.plots import plot_loss
import pyBigWig
from basepair.math import softmax
import matplotlib.pyplot as plt

from basepair.datasets import *
from basepair.datasets import sox2_oct4_peaks_sox2
from basepair import samplers
ddir = get_data_dir()
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
In [2]:
create_tf_session(1)
Out[2]:
<tensorflow.python.client.session.Session at 0x7f581a4aeb00>
In [3]:
train, valid, test = sox2_oct4_peaks_sox2()

set up a new model with two output branches

In [4]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.models import Model
import keras.backend as K
from concise.utils.helper import get_from_module
from basepair.losses import twochannel_multinomial_nll
from basepair.layers import SpatialLifetimeSparsity
from basepair.models import seq_mutlitask
In [5]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History
from keras.models import Model, load_model
In [6]:
i=1
In [7]:
def get_model(mfn, mkwargs, i):
    """Build the model named `mfn` with kwargs `mkwargs`; return (model, name, ckp_file)."""
    mdir = f"{ddir}/processed/chipnexus/exp/models/p-multi-task"
    name = mfn + "_" + \
            ",".join([f'{k}={v}' for k, v in mkwargs.items()]) + \
            "." + str(i)
    !mkdir -p {mdir}
    ckp_file = f"{mdir}/{name}.h5"
    # look up the model-building function by name (e.g. seq_mutlitask)
    return eval(mfn)(**mkwargs), name, ckp_file

train the model

  • add the training curve plot below training
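A minimal sketch of the still-missing training-curve plot, working from the standard `History.history` dict that `model.fit` returns (`plot_training_curve` is a hypothetical helper, not `basepair.plots.plot_loss`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def plot_training_curve(history_dict):
    """Plot train/validation loss per epoch from a Keras History.history dict."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(history_dict["loss"], label="train")
    ax.plot(history_dict["val_loss"], label="valid")
    # mark the epoch that ModelCheckpoint(save_best_only=True) would have kept
    best = int(np.argmin(history_dict["val_loss"]))
    ax.axvline(best, linestyle="--", color="gray", label=f"best epoch ({best})")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.legend()
    return fig
```

`history.history` from the fit call below could be passed in directly.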
In [22]:
# hyper-parameters
mfn = "seq_mutlitask"
use_profile = True
use_counts = True
mkwargs = dict(filters=32, 
               conv1_kernel_size=21,
               tconv_kernel_size=25,
               n_dil_layers=6,
               use_profile=use_profile,
               use_counts=use_counts,
               c_task_weight=10,
               lr=0.004)
In [23]:
data = sox2_oct4_peaks_sox2()
In [24]:
from basepair.preproc import transform_data
In [25]:
train, valid, test = transform_data(data, use_profile, use_counts)
In [59]:
i
Out[59]:
2
In [65]:
# best valid so far: 108238.6558
i += 1
model, name, ckp_file = get_model(mfn, mkwargs, i)
history = model.fit(train[0],
                    train[1],
                    batch_size=256,
                    epochs=100,
                    validation_data=valid[:2],
                    callbacks=[EarlyStopping(patience=5),
                               History(),
                               ModelCheckpoint(ckp_file, save_best_only=True)])
# get the best model
model = load_model(ckp_file, custom_objects={"twochannel_multinomial_nll": twochannel_multinomial_nll, 
                                             "SpatialLifetimeSparsity": SpatialLifetimeSparsity})
Train on 5561 samples, validate on 1884 samples
Epoch 1/100
5561/5561 [==============================] - 8s 1ms/step - loss: 34.0286 - c_sox2_loss: 1.4291 - c_oct4_loss: 1.9738 - val_loss: 30.3724 - val_c_sox2_loss: 1.0194 - val_c_oct4_loss: 2.0179
Epoch 2/100
5561/5561 [==============================] - 1s 121us/step - loss: 28.9959 - c_sox2_loss: 1.0023 - c_oct4_loss: 1.8973 - val_loss: 30.6307 - val_c_sox2_loss: 1.0112 - val_c_oct4_loss: 2.0518
Epoch 3/100
5561/5561 [==============================] - 1s 123us/step - loss: 29.0111 - c_sox2_loss: 1.0002 - c_oct4_loss: 1.9010 - val_loss: 30.2526 - val_c_sox2_loss: 1.0102 - val_c_oct4_loss: 2.0150
Epoch 4/100
5561/5561 [==============================] - 1s 123us/step - loss: 28.6205 - c_sox2_loss: 0.9995 - c_oct4_loss: 1.8626 - val_loss: 30.4121 - val_c_sox2_loss: 1.0116 - val_c_oct4_loss: 2.0296
Epoch 5/100
5561/5561 [==============================] - 1s 119us/step - loss: 28.0229 - c_sox2_loss: 0.9977 - c_oct4_loss: 1.8045 - val_loss: 29.4393 - val_c_sox2_loss: 1.0111 - val_c_oct4_loss: 1.9329
Epoch 6/100
5561/5561 [==============================] - 1s 120us/step - loss: 27.1719 - c_sox2_loss: 0.9932 - c_oct4_loss: 1.7240 - val_loss: 27.4856 - val_c_sox2_loss: 1.0009 - val_c_oct4_loss: 1.7477
Epoch 7/100
5561/5561 [==============================] - 1s 161us/step - loss: 25.4774 - c_sox2_loss: 0.9799 - c_oct4_loss: 1.5679 - val_loss: 25.5659 - val_c_sox2_loss: 0.9798 - val_c_oct4_loss: 1.5768
Epoch 8/100
5561/5561 [==============================] - 1s 117us/step - loss: 24.3656 - c_sox2_loss: 0.9661 - c_oct4_loss: 1.4705 - val_loss: 24.5091 - val_c_sox2_loss: 0.9576 - val_c_oct4_loss: 1.4933
Epoch 9/100
5561/5561 [==============================] - 1s 118us/step - loss: 22.7692 - c_sox2_loss: 0.9430 - c_oct4_loss: 1.3339 - val_loss: 22.8212 - val_c_sox2_loss: 0.9290 - val_c_oct4_loss: 1.3532
Epoch 10/100
5561/5561 [==============================] - 1s 113us/step - loss: 21.5834 - c_sox2_loss: 0.9223 - c_oct4_loss: 1.2360 - val_loss: 22.6300 - val_c_sox2_loss: 0.9237 - val_c_oct4_loss: 1.3393
Epoch 11/100
5561/5561 [==============================] - 1s 117us/step - loss: 21.1704 - c_sox2_loss: 0.9135 - c_oct4_loss: 1.2035 - val_loss: 22.0185 - val_c_sox2_loss: 0.9102 - val_c_oct4_loss: 1.2917
Epoch 12/100
5561/5561 [==============================] - 1s 119us/step - loss: 21.4850 - c_sox2_loss: 0.9147 - c_oct4_loss: 1.2338 - val_loss: 21.8453 - val_c_sox2_loss: 0.9048 - val_c_oct4_loss: 1.2798
Epoch 13/100
5561/5561 [==============================] - 1s 156us/step - loss: 20.8029 - c_sox2_loss: 0.9042 - c_oct4_loss: 1.1761 - val_loss: 22.0641 - val_c_sox2_loss: 0.9056 - val_c_oct4_loss: 1.3008
Epoch 14/100
5561/5561 [==============================] - 1s 118us/step - loss: 20.3333 - c_sox2_loss: 0.8961 - c_oct4_loss: 1.1372 - val_loss: 22.2420 - val_c_sox2_loss: 0.9063 - val_c_oct4_loss: 1.3179
Epoch 15/100
5561/5561 [==============================] - 1s 117us/step - loss: 20.3050 - c_sox2_loss: 0.8948 - c_oct4_loss: 1.1357 - val_loss: 21.7690 - val_c_sox2_loss: 0.9019 - val_c_oct4_loss: 1.2750
Epoch 16/100
5561/5561 [==============================] - 1s 121us/step - loss: 20.0435 - c_sox2_loss: 0.8872 - c_oct4_loss: 1.1172 - val_loss: 22.2925 - val_c_sox2_loss: 0.9161 - val_c_oct4_loss: 1.3131
Epoch 17/100
5561/5561 [==============================] - 1s 125us/step - loss: 20.0018 - c_sox2_loss: 0.8872 - c_oct4_loss: 1.1130 - val_loss: 21.8078 - val_c_sox2_loss: 0.8997 - val_c_oct4_loss: 1.2811
Epoch 18/100
5561/5561 [==============================] - 1s 117us/step - loss: 19.6798 - c_sox2_loss: 0.8798 - c_oct4_loss: 1.0882 - val_loss: 21.8361 - val_c_sox2_loss: 0.9101 - val_c_oct4_loss: 1.2735
Epoch 19/100
5561/5561 [==============================] - 1s 118us/step - loss: 19.6996 - c_sox2_loss: 0.8755 - c_oct4_loss: 1.0945 - val_loss: 21.6809 - val_c_sox2_loss: 0.8948 - val_c_oct4_loss: 1.2732
Epoch 20/100
5561/5561 [==============================] - 1s 141us/step - loss: 19.9378 - c_sox2_loss: 0.8755 - c_oct4_loss: 1.1183 - val_loss: 21.6244 - val_c_sox2_loss: 0.8996 - val_c_oct4_loss: 1.2628
Epoch 21/100
5561/5561 [==============================] - 1s 121us/step - loss: 19.0159 - c_sox2_loss: 0.8586 - c_oct4_loss: 1.0430 - val_loss: 21.5820 - val_c_sox2_loss: 0.8966 - val_c_oct4_loss: 1.2616
Epoch 22/100
5561/5561 [==============================] - 1s 125us/step - loss: 18.6693 - c_sox2_loss: 0.8489 - c_oct4_loss: 1.0181 - val_loss: 21.6680 - val_c_sox2_loss: 0.8956 - val_c_oct4_loss: 1.2712
Epoch 23/100
5561/5561 [==============================] - 1s 129us/step - loss: 18.5973 - c_sox2_loss: 0.8381 - c_oct4_loss: 1.0216 - val_loss: 21.6395 - val_c_sox2_loss: 0.8930 - val_c_oct4_loss: 1.2710
Epoch 24/100
5561/5561 [==============================] - 1s 119us/step - loss: 18.5174 - c_sox2_loss: 0.8368 - c_oct4_loss: 1.0149 - val_loss: 22.0117 - val_c_sox2_loss: 0.8935 - val_c_oct4_loss: 1.3077
Epoch 25/100
5561/5561 [==============================] - 1s 119us/step - loss: 17.9606 - c_sox2_loss: 0.8212 - c_oct4_loss: 0.9748 - val_loss: 22.2764 - val_c_sox2_loss: 0.9013 - val_c_oct4_loss: 1.3264
Epoch 26/100
5561/5561 [==============================] - 1s 135us/step - loss: 18.2569 - c_sox2_loss: 0.8225 - c_oct4_loss: 1.0032 - val_loss: 24.3111 - val_c_sox2_loss: 0.9659 - val_c_oct4_loss: 1.4653
In [90]:
yv_pred = model.predict(test[0])
In [97]:
ckp_model
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-97-a00eedc35392> in <module>()
----> 1 ckp_model

NameError: name 'ckp_model' is not defined
In [98]:
ckp_file
Out[98]:
'/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/seq_mutlitask_filters=32,conv1_kernel_size=21,tconv_kernel_size=25,n_dil_layers=6,use_profile=False,use_counts=True,c_task_weight=10,lr=0.004.4.h5'
In [95]:
regression_eval(test[1][1][:,0], yv_pred[1][:,0])
In [103]:
# x10
# val_p_sox2_loss: 258.7645 - val_p_oct4_loss: 363.4225 - val_c_sox2_loss: 0.8482 - val_c_oct4_loss: 1.2600

modify the plotting code

  • show also the output tracks for Oct4
  • show also the sequence importance for Oct4
In [8]:
model = load_model("/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/seq_mutlitask_filters=32,conv1_kernel_size=21,tconv_kernel_size=25,n_dil_layers=6,use_profile=True,use_counts=True,c_task_weight=10,lr=0.004.5.h5")
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
2018-05-22 19:02:05,843 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2018-05-22 19:02:11,585 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
In [9]:
yt_pred = model.predict(test[0])
yv_pred = model.predict(valid[0])
In [10]:
from basepair.plots import regression_eval
In [11]:
print(ckp_file)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-4c5a7b2cbee4> in <module>()
----> 1 print(ckp_file)

NameError: name 'ckp_file' is not defined

Valid

TODO: the Sox2 performance could be better

In [ ]:
# Sox2
regression_eval(valid[1][2].sum(1), yv_pred[2].sum(1))
In [ ]:
# Oct4
regression_eval(valid[1][3].sum(1), yv_pred[3].sum(1))

Test

In [ ]:
# Sox2
regression_eval(test[1][2].sum(1), yt_pred[2].sum(1))
In [ ]:
# Oct4
regression_eval(test[1][3].sum(1), yt_pred[3].sum(1))
In [12]:
from basepair.plots import *
In [123]:
p = Seq2Sox2Oct4(test[0], test[1], model)
In [124]:
p.plot(sort='max_sox2', figsize=(20, 2))
In [125]:
p.plot(sort='max_oct4', figsize=(20, 2))

try out a few models to figure out the optimal hyper-parameters
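This is still open; as a sketch, the candidate configurations could be enumerated with a small grid helper and then passed one by one to `get_model` (the `grid` function is an assumption, not a `basepair` utility):

```python
from itertools import product

def grid(**param_lists):
    """Yield one kwargs dict per combination of the given parameter lists."""
    keys = list(param_lists)
    for values in product(*(param_lists[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(filters=[16, 32, 64],
                    n_dil_layers=[4, 6],
                    lr=[0.004, 0.001]))
# each entry could then be merged into mkwargs and trained via get_model(mfn, config, i)
```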

figure out how to best visualize the sequence importances

  • max w.r.t. each individual track
    • the problem here is that Oct4 might have multiple peaks present in the sequence
  • average?
  • top2 peaks?
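Each option above amounts to choosing which output positions the sequence importance is differentiated with respect to ("average" would instead average gradients over all positions). A numpy sketch of the position selection (the helper name is illustrative, not `basepair` API):

```python
import numpy as np

def summary_positions(profile, method="max", k=2):
    """Pick the output positions w.r.t. which sequence importance is computed.

    profile: (seq_len,) predicted signal for a single track.
    'max'  -> only the strongest position (one peak per sequence)
    'topk' -> the k strongest positions (e.g. the top-2 Oct4 peaks)
    """
    order = np.argsort(profile)[::-1]  # positions sorted by descending signal
    if method == "max":
        return order[:1]
    if method == "topk":
        return order[:k]
    raise ValueError(method)
```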
In [19]:
from basepair import samplers
import pandas as pd
from basepair.math import softmax
import numpy as np
import keras.backend as K
from keras.models import Model
from concise.utils.plot import seqlogo_fig, seqlogo
import matplotlib.pyplot as plt
In [127]:
seqlogo
Out[127]:
<function concise.utils.plot.seqlogo(letter_heights, vocab='DNA', ax=None)>
In [158]:
p = Seq2Nexus(test[0], test[1],test[2], model)
In [159]:
p.plot(sort='max_sox2', seq_grad='max', figsize=(20,12))

run modisco - what do the motifs look like?

In [27]:
# Functions to get the gradients
out = model.outputs[0]
inp = model.inputs[0]
pos_strand_ginp_max = K.function([inp], K.gradients(K.max(out[:,:, 0], axis=1), inp))
neg_strand_ginp_max = K.function([inp], K.gradients(K.max(out[:,:, 1], axis=1), inp))

Get predictions and gradients

In [28]:
from basepair.data import numpy_minibatch

# Pre-compute the predictions and bottlenecks
x = valid[0]
y_true = valid[1][0]
y_pred = model.predict(x)[0]
grads_pos = np.concatenate([pos_strand_ginp_max([batch])[0]
                            for batch in numpy_minibatch(x, 512)])
grads_neg = np.concatenate([neg_strand_ginp_max([batch])[0]
                            for batch in numpy_minibatch(x, 512)])
igrads_pos = grads_pos * x
igrads_neg = grads_neg * x

Correlate pos- and neg-strand gradients

In [29]:
from scipy.spatial.distance import cosine, correlation

grads_pos_ext = grads_pos.reshape((grads_pos.shape[0], -1))
grads_neg_ext = grads_neg.reshape((grads_neg.shape[0], -1))

distances = np.array([correlation(grads_neg_ext[i], grads_pos_ext[i]) for i in range(len(grads_neg_ext))])
In [30]:
import numpy as np
import seaborn as sns
In [42]:
fig = plt.figure(figsize=(4,2))
plt.hist(distances, bins=50);
plt.xlabel("Importance score correlation distance\n(pos., neg. strand)")
plt.ylabel("Frequency");
plt.savefig('fig/icml18/sox2-similarity.png', dpi=600)
plt.savefig('fig/icml18/sox2-similarity.pdf', dpi=600)
plt.close(fig)    # close the figure
fig
Out[42]:
In [31]:
plt.figure(figsize=(6,6))
plt.subplot(211)
plt.hist(distances, bins=50);
plt.subplot(212)
values, base = np.histogram(distances, bins=40)
plt.plot(base[:-1], np.cumsum(values));
plt.grid()
plt.xlabel("Correlation Distance");
plt.ylabel("Fraction of data points");
In [29]:
include = distances<0.2
In [230]:
# We get roughly 1/3 of the points with high correlation
In [231]:
top10_idx = pd.Series(np.where(distances<0.5)[0]).sample(10)
In [232]:
top10_idx = pd.Series(distances).sort_values(ascending=True).index[:10]
In [233]:
distances
Out[233]:
array([0.0087468 , 0.04102677, 0.01819867, ..., 0.01070821, 0.00906867,
       0.02070087])
In [40]:
(distances<0.2).dump(f"{ddir}/processed/chipnexus/motifs/oct4/modisco/valid_include.npy")
In [ ]:
f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.hdf5"

Plot some examples

In [174]:
# Top max-count indices
top10_idx = pd.Series(y_true.max(1).sum(1)).sort_values(ascending=False).index[10:30]
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-174-3f561dfad0c0> in <module>()
      1 # Top maxcount indicies
----> 2 top10_idx = pd.Series(y_true.max(1).sum(1)).sort_values(ascending=False).index[10:30]

~/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32     return umr_sum(a, axis, dtype, out, keepdims)
     33 
     34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):

AxisError: axis 1 is out of bounds for array of dimension 1
In [155]:
# Top count indices
top10_idx = pd.Series(y_true.sum(1).sum(1)).sort_values(ascending=False).index[:10]
In [156]:
# Random indices
top10_idx = pd.Series(np.arange(len(y_true))).sample(10)
In [157]:
for i in top10_idx:
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True, figsize=(20,6))
    
    ax1.plot(np.arange(1, 202), y_true[i, :, 0], label="pos")
    ax1.plot(np.arange(1, 202), y_true[i, :, 1], label="neg")
    ax1.set_ylabel("Observed\ncounts")
    ax1.legend()
    ax2.plot(np.arange(1, 202), y_pred[i, :, 0], label="pos")
    ax2.plot(np.arange(1, 202), y_pred[i, :, 1], label="neg")
    ax2.set_ylabel("Predicted\ncounts")
    ax2.legend()
    ax3.set_ylabel("Pos. strand")
    seqlogo(igrads_pos[i], ax=ax3);
    ax4.set_ylabel("Neg. strand")
    seqlogo(igrads_neg[i], ax=ax4);
    ax4.set_xticks(list(range(0, 201, 5)));

Run modisco

In [266]:
top_distances = distances<0.2

hyp_scores = grads_pos + grads_neg

hyp_scores = hyp_scores[top_distances]

hyp_scores = hyp_scores - hyp_scores.mean(-1, keepdims=True)

onehot_data = valid[0][top_distances]

scores = hyp_scores * onehot_data
In [267]:
len(hyp_scores)
Out[267]:
1884
In [268]:
# Visualize
import modisco.visualization
from modisco.visualization import viz_sequence

viz_sequence.plot_weights(scores[0])
viz_sequence.plot_weights(hyp_scores[0])
viz_sequence.plot_weights(onehot_data[0])
In [269]:
from imp import reload
In [270]:
%env MKL_THREADING_LAYER=GNU
env: MKL_THREADING_LAYER=GNU
In [271]:
import theano
In [272]:
onehot_data.shape
Out[272]:
(1884, 201, 4)
In [273]:
import h5py
import numpy as np
%matplotlib inline
import modisco
reload(modisco)
import modisco.backend
reload(modisco.backend.theano_backend)
reload(modisco.backend)
import modisco.nearest_neighbors
reload(modisco.nearest_neighbors)
import modisco.affinitymat
reload(modisco.affinitymat.core)
reload(modisco.affinitymat.transformers)
import modisco.tfmodisco_workflow.seqlets_to_patterns
reload(modisco.tfmodisco_workflow.seqlets_to_patterns)
import modisco.tfmodisco_workflow.workflow
reload(modisco.tfmodisco_workflow.workflow)
import modisco.aggregator
reload(modisco.aggregator)
import modisco.cluster
reload(modisco.cluster.core)
reload(modisco.cluster.phenograph.core)
reload(modisco.cluster.phenograph.cluster)
import modisco.core
reload(modisco.core)
import modisco.coordproducers
reload(modisco.coordproducers)
import modisco.metaclusterers
reload(modisco.metaclusterers)

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
    sliding_window_size=21, 
    flank_size=10,
    histogram_bins=100, 
    percentiles_in_bandwidth=10,
    overlap_portion=0.5,     
    min_cluster_size=50,    
    threshold_for_counting_sign=1.0,
    weak_threshold_for_counting_sign=0.7)(
        task_names=["task0"],
        contrib_scores={'task0': scores},
        hypothetical_contribs={'task0': hyp_scores},
        one_hot=onehot_data)
On task task0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Got 11104 coords
Computing thresholds
Bandwidth calculated: 0.7376611232757568
Computed threshold 1.2768272635247557
6282 coords remaining after thresholding
After resolving overlaps, got 6282 seqlets
2 activity patterns with support >= 50 out of 3 possible patterns
Metacluster sizes:  [6227, 55]
Idx to activities:  {0: '1', 1: '-1'}
On metacluster 1
Metacluster size 55
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 55
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 0.12 s
Starting affinity matrix computations
Normalization computed in 0.03 s
Cosine similarity mat computed in 0.04 s
Normalization computed in 0.01 s
Cosine similarity mat computed in 0.01 s
Finished affinity matrix computations in 0.05 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.0 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 6.78 s
Launching nearest neighbors affmat calculation job
Job completed in: 7.23 s
(Round 1) Computed affinity matrix on nearest neighbors in 14.05 s
Filtered down to 51 of 55
(Round 1) Retained 51 rows out of 55 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 51 samples in 0.000s...
[t-SNE] Computed neighbors for 51 samples in 0.001s...
[t-SNE] Computed conditional probabilities for sample 51 / 51
[t-SNE] Mean sigma: 0.310619
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.03046703338623047 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    1.5s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   11.6s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   14.4s finished
Louvain completed 200 runs in 21.682221174240112 seconds
Wrote graph to binary file in 0.0162203311920166 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.66697
Louvain completed 51 runs in 36.357438802719116 seconds
Preproc + Louvain took 58.16375923156738 s
Got 4 clusters after round 1
Counts:
{1: 13, 3: 7, 2: 12, 0: 19}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 19 seqlets
Trimmed 2 out of 19
Skipped 1 seqlets
Dropping cluster 0 with 16 seqlets due to sign disagreement
Aggregating for cluster 1 with 13 seqlets
Trimmed 2 out of 13
Dropping cluster 1 with 11 seqlets due to sign disagreement
Aggregating for cluster 2 with 12 seqlets
Trimmed 1 out of 12
Dropping cluster 2 with 11 seqlets due to sign disagreement
Aggregating for cluster 3 with 7 seqlets
Trimmed 0 out of 7
Dropping cluster 3 with 7 seqlets due to sign disagreement
len(seqlets) is 0 - bailing!
On metacluster 0
Metacluster size 6227
Relevant tasks:  ('task0',)
Relevant signs:  (1,)
(Round 1) num seqlets: 6227
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 9.65 s
Starting affinity matrix computations
Normalization computed in 0.83 s
Cosine similarity mat computed in 4.66 s
Normalization computed in 0.95 s
Cosine similarity mat computed in 4.93 s
Finished affinity matrix computations in 9.6 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 1.04 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 96.87 s
(Round 1) Computed affinity matrix on nearest neighbors in 217.28 s
Filtered down to 3033 of 6227
(Round 1) Retained 3033 rows out of 6227 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 3033 samples in 0.008s...
[t-SNE] Computed neighbors for 3033 samples in 0.089s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3033
[t-SNE] Computed conditional probabilities for sample 2000 / 3033
[t-SNE] Computed conditional probabilities for sample 3000 / 3033
[t-SNE] Computed conditional probabilities for sample 3033 / 3033
[t-SNE] Mean sigma: 0.220301
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.9918856620788574 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   14.1s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   17.3s finished
Louvain completed 200 runs in 31.39292621612549 seconds
Wrote graph to binary file in 12.127521753311157 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.727572
After 6 runs, maximum modularity is Q = 0.727584
Louvain completed 56 runs in 52.783995389938354 seconds
Preproc + Louvain took 98.1414110660553 s
Got 14 clusters after round 1
Counts:
{8: 173, 4: 311, 7: 197, 2: 334, 3: 315, 12: 27, 10: 91, 5: 243, 0: 508, 1: 358, 6: 241, 9: 142, 11: 81, 13: 12}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 508 seqlets
Trimmed 54 out of 508
Skipped 59 seqlets
Aggregating for cluster 1 with 358 seqlets
Trimmed 22 out of 358
Skipped 44 seqlets
Aggregating for cluster 2 with 334 seqlets
Trimmed 55 out of 334
Skipped 38 seqlets
Aggregating for cluster 3 with 315 seqlets
Trimmed 18 out of 315
Skipped 49 seqlets
Aggregating for cluster 4 with 311 seqlets
Trimmed 26 out of 311
Skipped 40 seqlets
Aggregating for cluster 5 with 243 seqlets
Trimmed 33 out of 243
Skipped 41 seqlets
Aggregating for cluster 6 with 241 seqlets
Trimmed 33 out of 241
Skipped 32 seqlets
Aggregating for cluster 7 with 197 seqlets
Trimmed 18 out of 197
Skipped 28 seqlets
Aggregating for cluster 8 with 173 seqlets
Trimmed 14 out of 173
Skipped 38 seqlets
Skipped 6 seqlets
Aggregating for cluster 9 with 142 seqlets
Trimmed 9 out of 142
Skipped 17 seqlets
Aggregating for cluster 10 with 91 seqlets
Trimmed 4 out of 91
Skipped 15 seqlets
Aggregating for cluster 11 with 81 seqlets
Trimmed 25 out of 81
Skipped 1 seqlets
Skipped 1 seqlets
Aggregating for cluster 12 with 27 seqlets
Trimmed 1 out of 27
Skipped 1 seqlets
Aggregating for cluster 13 with 12 seqlets
Trimmed 0 out of 12
Skipped 3 seqlets
(Round 2) num seqlets: 2305
(Round 2) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 3.85 s
Starting affinity matrix computations
Normalization computed in 0.38 s
Cosine similarity mat computed in 0.84 s
Normalization computed in 0.32 s
Cosine similarity mat computed in 0.77 s
Finished affinity matrix computations in 1.61 s
(Round 2) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.21 s
(Round 2) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 69.83 s
Launching nearest neighbors affmat calculation job
Job completed in: 69.37 s
(Round 2) Computed affinity matrix on nearest neighbors in 145.46 s
Not applying filtering for rounds above first round
(Round 2) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 2305 samples in 0.004s...
[t-SNE] Computed neighbors for 2305 samples in 0.063s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2305
[t-SNE] Computed conditional probabilities for sample 2000 / 2305
[t-SNE] Computed conditional probabilities for sample 2305 / 2305
[t-SNE] Mean sigma: 0.219928
(Round 2) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 1.0518958568572998 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   14.2s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   17.3s finished
Louvain completed 200 runs in 29.665878295898438 seconds
Wrote graph to binary file in 7.174563407897949 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.722526
After 2 runs, maximum modularity is Q = 0.730711
After 3 runs, maximum modularity is Q = 0.730801
After 4 runs, maximum modularity is Q = 0.730984
After 6 runs, maximum modularity is Q = 0.732045
Louvain completed 56 runs in 50.18093490600586 seconds
Preproc + Louvain took 88.62935519218445 s
Got 13 clusters after round 2
Counts:
{6: 144, 0: 434, 1: 357, 9: 115, 10: 45, 2: 291, 8: 116, 3: 237, 5: 153, 12: 12, 7: 124, 4: 233, 11: 44}
(Round 2) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 434 seqlets
Trimmed 45 out of 434
Skipped 22 seqlets
Aggregating for cluster 1 with 357 seqlets
Trimmed 58 out of 357
Skipped 9 seqlets
Aggregating for cluster 2 with 291 seqlets
Trimmed 106 out of 291
Skipped 16 seqlets
Aggregating for cluster 3 with 237 seqlets
Trimmed 74 out of 237
Skipped 4 seqlets
Aggregating for cluster 4 with 233 seqlets
Trimmed 32 out of 233
Skipped 11 seqlets
Aggregating for cluster 5 with 153 seqlets
Trimmed 33 out of 153
Skipped 4 seqlets
Aggregating for cluster 6 with 144 seqlets
Trimmed 34 out of 144
Skipped 4 seqlets
Aggregating for cluster 7 with 124 seqlets
Trimmed 18 out of 124
Skipped 11 seqlets
Aggregating for cluster 8 with 116 seqlets
Trimmed 31 out of 116
Skipped 2 seqlets
Aggregating for cluster 9 with 115 seqlets
Trimmed 42 out of 115
Aggregating for cluster 10 with 45 seqlets
Trimmed 5 out of 45
Skipped 1 seqlets
Aggregating for cluster 11 with 44 seqlets
Trimmed 0 out of 44
Skipped 36 seqlets
Aggregating for cluster 12 with 12 seqlets
Trimmed 0 out of 12
Got 13 clusters
Splitting into subclusters...
Inspecting for spurious merging
Wrote graph to binary file in 0.8585577011108398 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00796871
Louvain completed 21 runs in 16.888229370117188 seconds
Similarity is 0.9369463592028393; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.5535683631896973 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00915527
After 9 runs, maximum modularity is Q = 0.00915528
Louvain completed 29 runs in 23.062270641326904 seconds
Similarity is 0.9749927220847351; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2352142333984375 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0124411
Louvain completed 21 runs in 17.319679260253906 seconds
Similarity is 0.9718953897396376; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2210524082183838 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00699388
Louvain completed 21 runs in 16.389430046081543 seconds
Similarity is 0.9153464971620826; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2778606414794922 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0102081
Louvain completed 21 runs in 16.981793880462646 seconds
Similarity is 0.9678379259664387; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.12695574760437012 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0107158
Louvain completed 21 runs in 16.91047763824463 seconds
Similarity is 0.7620583134057276; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.06804585456848145 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00670445
Louvain completed 21 runs in 16.014127254486084 seconds
Similarity is 0.9015959238247756; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.05251049995422363 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00931806
After 6 runs, maximum modularity is Q = 0.00934598
Louvain completed 26 runs in 20.417680740356445 seconds
Similarity is 0.8063749107025145; is_dissimilar is False
Got 2 subclusters
Inspecting for spurious merging
Wrote graph to binary file in 0.1024315357208252 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0108229
Louvain completed 21 runs in 16.562198162078857 seconds
Similarity is 0.9592693482354967; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.09054327011108398 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0113031
Louvain completed 21 runs in 16.434577465057373 seconds
Similarity is 0.9490734747843734; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.07478022575378418 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0105235
Louvain completed 21 runs in 16.81160855293274 seconds
Similarity is 0.9306806135433177; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.08950996398925781 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.01083
Louvain completed 21 runs in 16.266108512878418 seconds
Similarity is 0.9045891394134682; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.0588986873626709 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = -0.0167635
After 2 runs, maximum modularity is Q = 0.00610611
After 3 runs, maximum modularity is Q = 0.00642396
After 4 runs, maximum modularity is Q = 0.00789468
Louvain completed 24 runs in 20.76047968864441 seconds
Similarity is 0.9077495375985554; is_dissimilar is False
Merging on 14 clusters
On merging iteration 1
Computing pattern to seqlet distances
Computing pattern to pattern distances
Collapsing 0 & 1 with prob 0.005213325762187972 and sim 0.9876791751232061
Collapsing 2 & 8 with prob 8.690870354286062e-06 and sim 0.9338340315269065
Collapsing 1 & 9 with prob 4.393169010983065e-05 and sim 0.9305307932726811
Collapsing 3 & 10 with prob 4.2267009505083e-06 and sim 0.9240314715581518
Collapsing 0 & 9 with prob 5.155265632299075e-05 and sim 0.917609114054792
Collapsing 4 & 8 with prob 2.0666627920554385e-06 and sim 0.9171718035410608
Aborting collapse as 2 & 4 have prob 1.154466979909516e-06 and sim 0.3987002160044585
Collapsing 1 & 3 with prob 0.00013036782422250225 and sim 0.9136723758111194
Aborting collapse as 9 & 10 have prob 1.4887673966122782e-07 and sim 0.7969590744628001
Collapsing 0 & 3 with prob 3.812735472045252e-05 and sim 0.9060132492910623
Aborting collapse as 9 & 10 have prob 1.4887673966122782e-07 and sim 0.7969590744628001
Collapsing 0 & 11 with prob 3.682883491126881e-05 and sim 0.8826333671319047
Trimmed 19 out of 644
Trimmed 3 out of 263
Skipped 5 seqlets
Trimmed 6 out of 707
Skipped 1 seqlets
Trimmed 1 out of 232
Skipped 2 seqlets
Trimmed 3 out of 738
On merging iteration 2
Computing pattern to seqlet distances
Computing pattern to pattern distances
Collapsing 0 & 2 with prob 0.0006245711808186924 and sim 0.9105909804374275
Collapsing 2 & 4 with prob 0.0002955546574045961 and sim 0.8832851960368194
Trimmed 8 out of 964
Skipped 2 seqlets
Trimmed 7 out of 1058
On merging iteration 3
Computing pattern to seqlet distances
Computing pattern to pattern distances
Got 7 patterns after merging
Performing seqlet reassignment
Cross contin jaccard time taken: 7.56 s
Cross contin jaccard time taken: 7.33 s
Discarded 5 seqlets
Skipped 13 seqlets
Skipped 6 seqlets
Skipped 3 seqlets
Got 5 patterns after reassignment
Total time taken is 972.97s
In [279]:
!mkdir -p {ddir}/processed/chipnexus/motifs/sox2-oct4/modisco
In [275]:
!du -sh {ddir}/processed/chipnexus/motifs/sox2/modisco
58M	/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/motifs/sox2/modisco
In [280]:
modisco_file = f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.hdf5"
modisco_file_pkl = f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.pkl"
In [281]:
from basepair.utils import write_pkl
In [192]:
write_pkl(tfmodisco_results, modisco_file_pkl)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-192-87dc9ce2ca7e> in <module>()
----> 1 write_pkl(tfmodisco_results, modisco_file_pkl)

~/workspace/basepair/basepair/utils.py in write_pkl(obj, fname, create_dirs, protocol)
      8             os.makedirs(os.path.dirname(fname), exist_ok=True)
      9     import pickle
---> 10     pickle.dump(obj, open(fname, 'wb'), protocol=protocol)
     11 
     12 

TypeError: can't pickle dict_values objects
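The `TypeError` above comes from a `dict_values` view object somewhere inside `tfmodisco_results`; `pickle` cannot serialize dict views. A hedged workaround, if the offending views sit in plain dicts/lists, is to convert them to lists before dumping (`listify_views` below is a hypothetical helper, not part of `basepair.utils`); for the MoDISco result object itself the `save_hdf5` route used in a later cell is the reliable path.

```python
import pickle

def listify_views(obj):
    """Recursively replace dict_keys/dict_values views with plain lists
    so the structure becomes picklable (sketch, assumes nested
    dicts/lists/tuples only)."""
    if isinstance(obj, (type({}.keys()), type({}.values()))):
        return list(obj)
    if isinstance(obj, dict):
        return {k: listify_views(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(listify_views(v) for v in obj)
    return obj

# pickle.dumps({"a": {"x": 1}.values()}) raises the TypeError above;
# after conversion it pickles fine:
blob = pickle.dumps(listify_views({"a": {"x": 1}.values()}))
```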
In [249]:
!rm {modisco_file}
In [282]:
import h5py
from importlib import reload  # reload lives in importlib in Python 3
import modisco.util
reload(modisco.util)
grp = h5py.File(modisco_file, "w")  # open explicitly in write mode
tfmodisco_results.save_hdf5(grp)
# TODO - write pickle
In [286]:
hdf5_results.close()
In [283]:
from collections import Counter
from modisco.visualization import viz_sequence
reload(viz_sequence)
from matplotlib import pyplot as plt

import modisco.affinitymat.core
reload(modisco.affinitymat.core)
import modisco.cluster.phenograph.core
reload(modisco.cluster.phenograph.core)
import modisco.cluster.phenograph.cluster
reload(modisco.cluster.phenograph.cluster)
import modisco.cluster.core
reload(modisco.cluster.core)
import modisco.aggregator
reload(modisco.aggregator)

import sklearn.decomposition
import sklearn.manifold

hdf5_results = h5py.File(modisco_file, "r")  # read-only

#patterns = (tfmodisco_results
#            .metacluster_idx_to_submetacluster_results[0]
#            .seqlets_to_patterns_result.patterns);
patterns = (list(hdf5_results
                 ["metacluster_idx_to_submetacluster_results"]
                 ["metacluster0"]
                 ["seqlets_to_patterns_result"]
                 ["patterns"]["all_pattern_names"]))
print(len(patterns))
pattern_grp = (hdf5_results
                 ["metacluster_idx_to_submetacluster_results"]
                 ["metacluster0"]
                 ["seqlets_to_patterns_result"]
                 ["patterns"])

for pattern_name in patterns:
    pattern = pattern_grp[pattern_name]
    print(pattern_name)
    print("total seqlets:", len(pattern["seqlets_and_alnmts"]["seqlets"]))
    #pattern.plot_counts(counts=aggregated_seqlet.get_per_position_seqlet_center_counts())
    background = np.array([0.27, 0.23, 0.23, 0.27])
    print("fwd:")
    viz_sequence.plot_weights(pattern["task0_contrib_scores"]["fwd"])
    viz_sequence.plot_weights(pattern["task0_hypothetical_contribs"]["fwd"])
    viz_sequence.plot_weights(viz_sequence.ic_scale(np.array(pattern["sequence"]["fwd"]),
                                                    background=background))

    print("reverse:")
    viz_sequence.plot_weights(pattern["task0_contrib_scores"]["rev"])
    viz_sequence.plot_weights(pattern["task0_hypothetical_contribs"]["rev"])
    viz_sequence.plot_weights(viz_sequence.ic_scale(np.array(pattern["sequence"]["rev"]),
                                                    background=background))
5
b'pattern_0'
total seqlets: 1054
fwd:
reverse:
b'pattern_1'
total seqlets: 242
fwd:
reverse:
b'pattern_2'
total seqlets: 184
fwd:
reverse:
b'pattern_3'
total seqlets: 64
fwd:
reverse:
b'pattern_4'
total seqlets: 48
fwd:
reverse:
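`viz_sequence.ic_scale` in the loop above rescales each PWM position by its information content against the background, so letter heights sum to the per-position IC. A minimal re-implementation sketch of that idea (the standard IC formula; not necessarily MoDISco's exact code, and `pseudocount` is my own addition to avoid `log2(0)`):

```python
import numpy as np

def ic_scale(pwm, background, pseudocount=1e-6):
    """Scale each PWM row (position) by its information content
    relative to the background distribution."""
    pwm = np.asarray(pwm, dtype=float)
    bg = np.asarray(background, dtype=float)
    # per-position IC: sum_b p_b * (log2 p_b - log2 bg_b)
    ic = np.sum(pwm * (np.log2(pwm + pseudocount) - np.log2(bg)), axis=1)
    return pwm * ic[:, None]

# a fully certain position gets ~2 bits under a uniform background,
# while a uniform position collapses to ~0:
pwm = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.25, 0.25, 0.25, 0.25]])
scaled = ic_scale(pwm, background=np.array([0.25] * 4))
```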
In [285]:
hdf5_results
Out[285]:
[]
In [270]:
sl = tfmodisco_results.multitask_seqlet_creation_results.final_seqlets[0]
In [278]:
cl0 = hdf5_results.metacluster_idx_to_submetacluster_results[0]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-278-f54e30fb67a0> in <module>()
----> 1 cl0 = hdf5_results.metacluster_idx_to_submetacluster_results[0]

AttributeError: 'File' object has no attribute 'metacluster_idx_to_submetacluster_results'
In [276]:
sl.coor.example_idx
Out[276]:
2
In [277]:
len(pd.Series([sl.coor.example_idx for sl in cl0.seqlets]).unique())
Out[277]:
811
In [255]:
cl0.seqlets[0]
Out[255]:
<modisco.core.Seqlet at 0x7f47291decc0>
In [254]:
len(cl0.seqlets)
Out[254]:
1845
In [252]:
len(cl0.seqlets)
Out[252]:
250
In [ ]:
(hdf5_results["metacluster_idx_to_submetacluster_results"]
             ["metacluster0"]
             ["seqlets_to_patterns_result"]
             ["patterns"]["all_pattern_names"])
In [193]:
sl.coor.start
Out[193]:
91
In [197]:
sl.exidx_start_end_string
Out[197]:
'2_91_132'
In [222]:
tfmodisco_results.metaclustering_results.metacluster_indices
Out[222]:
array([0, 0, 1, ..., 0, 1, 1])
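A quick way to see how the seqlets split across the two metaclusters is to tally the index array; a sketch with `collections.Counter`, using a toy array in place of the real `metacluster_indices`:

```python
import numpy as np
from collections import Counter

# toy stand-in for
# tfmodisco_results.metaclustering_results.metacluster_indices
metacluster_indices = np.array([0, 0, 1, 0, 1, 1, 0])

# seqlets per metacluster, e.g. Counter({0: 4, 1: 3})
counts = Counter(metacluster_indices.tolist())
```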
In [231]:
sl = tfmodisco_results.multitask_seqlet_creation_results.final_seqlets[0]
In [236]:
sl.coor.
Out[236]:
False
In [217]:
tfmodisco_results.metaclustering_results.metacluster_indices.max()
Out[217]:
1
In [219]:
len(tfmodisco_results.metaclustering_results.metacluster_indices)
Out[219]:
2095
In [198]:
sl.
Out[198]:
OrderedDict([('task0_contrib_scores',
              <modisco.core.Snippet at 0x7f47291de1d0>),
             ('task0_hypothetical_contribs',
              <modisco.core.Snippet at 0x7f47291de2e8>),
             ('sequence', <modisco.core.Snippet at 0x7f47291dea90>)])

plot the average sequence distribution for each motif
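One way to compute this is to average the aligned one-hot seqlet sequences per pattern into a per-position base distribution and plot that. A hedged sketch on toy arrays (in the real notebook the `(n, length, 4)` stack would come from the pattern's aligned seqlet sequences, e.g. via `pattern["sequence"]` or the seqlet coordinates; `average_seqlet_distribution` is a hypothetical helper):

```python
import numpy as np

def average_seqlet_distribution(onehot_seqlets):
    """Average a stack of aligned one-hot seqlets (n, length, 4)
    into a per-position base distribution (length, 4)."""
    arr = np.asarray(onehot_seqlets, dtype=float)
    # each position sums to ~1 if the inputs are proper one-hots
    return arr.mean(axis=0)

# toy example: two aligned 3-bp seqlets, ACG and ACT (columns = A,C,G,T)
seqlets = np.array([
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # ACG
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]],   # ACT
])
avg = average_seqlet_distribution(seqlets)
# position 0 is all-A; position 2 is split 50/50 between G and T
```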