Questions

  • can we discover events when Sox2 and Oct4 are both bound?
  • could we discover/distinguish those events when using only Sox2 or only Oct4?
  • does the multi-task model do better than the single-task one?

Goal

  • make a multi-task model predicting Sox2 and Oct4 ChIP-nexus signal (git issue)

TODO

  • [x] set up a new dataset function based on Sox2 peaks
  • [x] set up a new model with two output branches
  • [x] train the model
  • [ ] add the training curve plot below the training section
  • [x] modify the plotting code
    • show also the output tracks for Oct4
    • show also the sequence importance for Oct4
  • [ ] try out a few models to figure out the optimal hyper-parameters
  • [x] figure out how to best visualize the sequence importances
    • max w.r.t. each individual track
      • the problem here is that Oct4 might have multiple peaks present in the sequence
    • average?
    • top2 peaks?
  • [x] run modisco - what do the motifs look like?
  • [x] plot the average sequence distribution for each motif

set up a new dataset function based on Sox2 peaks

In [1]:
import pandas as pd
import numpy as np
from pybedtools import BedTool
from basepair.config import get_data_dir, create_tf_session
from tqdm import tqdm
from concise.preprocessing import encodeDNA
from basepair.datasets import get_sox2_data
from basepair.plots import plot_loss
import pyBigWig
from basepair.math import softmax
import matplotlib.pyplot as plt

from basepair.datasets import *
from basepair.datasets import sox2_oct4_peaks_sox2
from basepair import samplers
ddir = get_data_dir()
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
In [2]:
create_tf_session(1)
Out[2]:
<tensorflow.python.client.session.Session at 0x7f581a4aeb00>
In [3]:
train, valid, test = sox2_oct4_peaks_sox2()

set up a new model with two output branches

In [4]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.models import Model
import keras.backend as K
from concise.utils.helper import get_from_module
from basepair.losses import twochannel_multinomial_nll
from basepair.layers import SpatialLifetimeSparsity
from basepair.models import seq_mutlitask
In [5]:
import keras.layers as kl
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History
from keras.models import Model, load_model
In [6]:
i=1
In [7]:
def get_model(mfn, mkwargs, i):
    """Build the model named `mfn` with kwargs `mkwargs`; return (model, name, ckp_file)."""
    mdir = f"{ddir}/processed/chipnexus/exp/models/p-multi-task"
    name = mfn + "_" + \
            ",".join([f'{k}={v}' for k, v in mkwargs.items()]) + \
            "." + str(i)
    !mkdir -p {mdir}
    ckp_file = f"{mdir}/{name}.h5"
    # look up the model-building function by name (e.g. seq_mutlitask)
    return eval(mfn)(**mkwargs), name, ckp_file

train the model

  • add the training curve plot below training
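A minimal sketch of the still-missing training-curve plot, working from the standard `History.history` dict that `model.fit` returns (`plot_training_curve` is a hypothetical helper, not `basepair.plots.plot_loss`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

def plot_training_curve(history_dict):
    """Plot train/validation loss per epoch from a Keras History.history dict."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(history_dict["loss"], label="train")
    ax.plot(history_dict["val_loss"], label="valid")
    # mark the epoch that ModelCheckpoint(save_best_only=True) would have kept
    best = int(np.argmin(history_dict["val_loss"]))
    ax.axvline(best, linestyle="--", color="gray", label=f"best epoch ({best})")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.legend()
    return fig
```

`history.history` from the fit call below could be passed in directly.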
In [22]:
# hyper-parameters
mfn = "seq_mutlitask"
use_profile = True
use_counts = True
mkwargs = dict(filters=32, 
               conv1_kernel_size=21,
               tconv_kernel_size=25,
               n_dil_layers=6,
               use_profile=use_profile,
               use_counts=use_counts,
               c_task_weight=10,
               lr=0.004)
In [23]:
data = sox2_oct4_peaks_sox2()
In [24]:
from basepair.preproc import transform_data
In [25]:
train, valid, test = transform_data(data, use_profile, use_counts)
In [59]:
i
Out[59]:
2
In [65]:
# best valid so far: 108238.6558
i += 1
model, name, ckp_file = get_model(mfn, mkwargs, i)
history = model.fit(train[0],
                    train[1],
                    batch_size=256,
                    epochs=100,
                    validation_data=valid[:2],
                    callbacks=[EarlyStopping(patience=5),
                               History(),
                               ModelCheckpoint(ckp_file, save_best_only=True)])
# get the best model
model = load_model(ckp_file, custom_objects={"twochannel_multinomial_nll": twochannel_multinomial_nll, 
                                             "SpatialLifetimeSparsity": SpatialLifetimeSparsity})
Train on 5561 samples, validate on 1884 samples
Epoch 1/100
5561/5561 [==============================] - 8s 1ms/step - loss: 34.0286 - c_sox2_loss: 1.4291 - c_oct4_loss: 1.9738 - val_loss: 30.3724 - val_c_sox2_loss: 1.0194 - val_c_oct4_loss: 2.0179
Epoch 2/100
5561/5561 [==============================] - 1s 121us/step - loss: 28.9959 - c_sox2_loss: 1.0023 - c_oct4_loss: 1.8973 - val_loss: 30.6307 - val_c_sox2_loss: 1.0112 - val_c_oct4_loss: 2.0518
Epoch 3/100
5561/5561 [==============================] - 1s 123us/step - loss: 29.0111 - c_sox2_loss: 1.0002 - c_oct4_loss: 1.9010 - val_loss: 30.2526 - val_c_sox2_loss: 1.0102 - val_c_oct4_loss: 2.0150
Epoch 4/100
5561/5561 [==============================] - 1s 123us/step - loss: 28.6205 - c_sox2_loss: 0.9995 - c_oct4_loss: 1.8626 - val_loss: 30.4121 - val_c_sox2_loss: 1.0116 - val_c_oct4_loss: 2.0296
Epoch 5/100
5561/5561 [==============================] - 1s 119us/step - loss: 28.0229 - c_sox2_loss: 0.9977 - c_oct4_loss: 1.8045 - val_loss: 29.4393 - val_c_sox2_loss: 1.0111 - val_c_oct4_loss: 1.9329
Epoch 6/100
5561/5561 [==============================] - 1s 120us/step - loss: 27.1719 - c_sox2_loss: 0.9932 - c_oct4_loss: 1.7240 - val_loss: 27.4856 - val_c_sox2_loss: 1.0009 - val_c_oct4_loss: 1.7477
Epoch 7/100
5561/5561 [==============================] - 1s 161us/step - loss: 25.4774 - c_sox2_loss: 0.9799 - c_oct4_loss: 1.5679 - val_loss: 25.5659 - val_c_sox2_loss: 0.9798 - val_c_oct4_loss: 1.5768
Epoch 8/100
5561/5561 [==============================] - 1s 117us/step - loss: 24.3656 - c_sox2_loss: 0.9661 - c_oct4_loss: 1.4705 - val_loss: 24.5091 - val_c_sox2_loss: 0.9576 - val_c_oct4_loss: 1.4933
Epoch 9/100
5561/5561 [==============================] - 1s 118us/step - loss: 22.7692 - c_sox2_loss: 0.9430 - c_oct4_loss: 1.3339 - val_loss: 22.8212 - val_c_sox2_loss: 0.9290 - val_c_oct4_loss: 1.3532
Epoch 10/100
5561/5561 [==============================] - 1s 113us/step - loss: 21.5834 - c_sox2_loss: 0.9223 - c_oct4_loss: 1.2360 - val_loss: 22.6300 - val_c_sox2_loss: 0.9237 - val_c_oct4_loss: 1.3393
Epoch 11/100
5561/5561 [==============================] - 1s 117us/step - loss: 21.1704 - c_sox2_loss: 0.9135 - c_oct4_loss: 1.2035 - val_loss: 22.0185 - val_c_sox2_loss: 0.9102 - val_c_oct4_loss: 1.2917
Epoch 12/100
5561/5561 [==============================] - 1s 119us/step - loss: 21.4850 - c_sox2_loss: 0.9147 - c_oct4_loss: 1.2338 - val_loss: 21.8453 - val_c_sox2_loss: 0.9048 - val_c_oct4_loss: 1.2798
Epoch 13/100
5561/5561 [==============================] - 1s 156us/step - loss: 20.8029 - c_sox2_loss: 0.9042 - c_oct4_loss: 1.1761 - val_loss: 22.0641 - val_c_sox2_loss: 0.9056 - val_c_oct4_loss: 1.3008
Epoch 14/100
5561/5561 [==============================] - 1s 118us/step - loss: 20.3333 - c_sox2_loss: 0.8961 - c_oct4_loss: 1.1372 - val_loss: 22.2420 - val_c_sox2_loss: 0.9063 - val_c_oct4_loss: 1.3179
Epoch 15/100
5561/5561 [==============================] - 1s 117us/step - loss: 20.3050 - c_sox2_loss: 0.8948 - c_oct4_loss: 1.1357 - val_loss: 21.7690 - val_c_sox2_loss: 0.9019 - val_c_oct4_loss: 1.2750
Epoch 16/100
5561/5561 [==============================] - 1s 121us/step - loss: 20.0435 - c_sox2_loss: 0.8872 - c_oct4_loss: 1.1172 - val_loss: 22.2925 - val_c_sox2_loss: 0.9161 - val_c_oct4_loss: 1.3131
Epoch 17/100
5561/5561 [==============================] - 1s 125us/step - loss: 20.0018 - c_sox2_loss: 0.8872 - c_oct4_loss: 1.1130 - val_loss: 21.8078 - val_c_sox2_loss: 0.8997 - val_c_oct4_loss: 1.2811
Epoch 18/100
5561/5561 [==============================] - 1s 117us/step - loss: 19.6798 - c_sox2_loss: 0.8798 - c_oct4_loss: 1.0882 - val_loss: 21.8361 - val_c_sox2_loss: 0.9101 - val_c_oct4_loss: 1.2735
Epoch 19/100
5561/5561 [==============================] - 1s 118us/step - loss: 19.6996 - c_sox2_loss: 0.8755 - c_oct4_loss: 1.0945 - val_loss: 21.6809 - val_c_sox2_loss: 0.8948 - val_c_oct4_loss: 1.2732
Epoch 20/100
5561/5561 [==============================] - 1s 141us/step - loss: 19.9378 - c_sox2_loss: 0.8755 - c_oct4_loss: 1.1183 - val_loss: 21.6244 - val_c_sox2_loss: 0.8996 - val_c_oct4_loss: 1.2628
Epoch 21/100
5561/5561 [==============================] - 1s 121us/step - loss: 19.0159 - c_sox2_loss: 0.8586 - c_oct4_loss: 1.0430 - val_loss: 21.5820 - val_c_sox2_loss: 0.8966 - val_c_oct4_loss: 1.2616
Epoch 22/100
5561/5561 [==============================] - 1s 125us/step - loss: 18.6693 - c_sox2_loss: 0.8489 - c_oct4_loss: 1.0181 - val_loss: 21.6680 - val_c_sox2_loss: 0.8956 - val_c_oct4_loss: 1.2712
Epoch 23/100
5561/5561 [==============================] - 1s 129us/step - loss: 18.5973 - c_sox2_loss: 0.8381 - c_oct4_loss: 1.0216 - val_loss: 21.6395 - val_c_sox2_loss: 0.8930 - val_c_oct4_loss: 1.2710
Epoch 24/100
5561/5561 [==============================] - 1s 119us/step - loss: 18.5174 - c_sox2_loss: 0.8368 - c_oct4_loss: 1.0149 - val_loss: 22.0117 - val_c_sox2_loss: 0.8935 - val_c_oct4_loss: 1.3077
Epoch 25/100
5561/5561 [==============================] - 1s 119us/step - loss: 17.9606 - c_sox2_loss: 0.8212 - c_oct4_loss: 0.9748 - val_loss: 22.2764 - val_c_sox2_loss: 0.9013 - val_c_oct4_loss: 1.3264
Epoch 26/100
5561/5561 [==============================] - 1s 135us/step - loss: 18.2569 - c_sox2_loss: 0.8225 - c_oct4_loss: 1.0032 - val_loss: 24.3111 - val_c_sox2_loss: 0.9659 - val_c_oct4_loss: 1.4653
In [90]:
yv_pred = model.predict(test[0])
In [97]:
ckp_model
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-97-a00eedc35392> in <module>()
----> 1 ckp_model

NameError: name 'ckp_model' is not defined
In [98]:
ckp_file
Out[98]:
'/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/seq_mutlitask_filters=32,conv1_kernel_size=21,tconv_kernel_size=25,n_dil_layers=6,use_profile=False,use_counts=True,c_task_weight=10,lr=0.004.4.h5'
In [95]:
regression_eval(test[1][1][:,0], yv_pred[1][:,0])
In [103]:
# x10
# val_p_sox2_loss: 258.7645 - val_p_oct4_loss: 363.4225 - val_c_sox2_loss: 0.8482 - val_c_oct4_loss: 1.2600

modify the plotting code

  • show also the output tracks for Oct4
  • show also the sequence importance for Oct4
In [8]:
model = load_model("/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/seq_mutlitask_filters=32,conv1_kernel_size=21,tconv_kernel_size=25,n_dil_layers=6,use_profile=True,use_counts=True,c_task_weight=10,lr=0.004.5.h5")
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
2018-05-22 19:02:05,843 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2018-05-22 19:02:11,585 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
In [9]:
yt_pred = model.predict(test[0])
yv_pred = model.predict(valid[0])
In [10]:
from basepair.plots import regression_eval
In [11]:
print(ckp_file)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-4c5a7b2cbee4> in <module>()
----> 1 print(ckp_file)

NameError: name 'ckp_file' is not defined

Valid

TODO: the Sox2 performance could be better

In [ ]:
# Sox2
regression_eval(valid[1][2].sum(1), yv_pred[2].sum(1))
In [ ]:
# Oct4
regression_eval(valid[1][3].sum(1), yv_pred[3].sum(1))

Test

In [ ]:
# Sox2
regression_eval(test[1][2].sum(1), yt_pred[2].sum(1))
In [ ]:
# Oct4
regression_eval(test[1][3].sum(1), yt_pred[3].sum(1))
In [12]:
from basepair.plots import *
In [123]:
p = Seq2Sox2Oct4(test[0], test[1], model)
In [124]:
p.plot(sort='max_sox2', figsize=(20, 2))
In [125]:
p.plot(sort='max_oct4', figsize=(20, 2))

try out a few models to figure out the optimal hyper-parameters
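This is still open; as a sketch, the candidate configurations could be enumerated with a small grid helper and then passed one by one to `get_model` (the `grid` function is an assumption, not a `basepair` utility):

```python
from itertools import product

def grid(**param_lists):
    """Yield one kwargs dict per combination of the given parameter lists."""
    keys = list(param_lists)
    for values in product(*(param_lists[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(filters=[16, 32, 64],
                    n_dil_layers=[4, 6],
                    lr=[0.004, 0.001]))
# each entry could then be merged into mkwargs and trained via get_model(mfn, config, i)
```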

figure out how to best visualize the sequence importances

  • max w.r.t. each individual track
    • the problem here is that Oct4 might have multiple peaks present in the sequence
  • average?
  • top2 peaks?
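Each option above amounts to choosing which output positions the sequence importance is differentiated with respect to ("average" would instead average gradients over all positions). A numpy sketch of the position selection (the helper name is illustrative, not `basepair` API):

```python
import numpy as np

def summary_positions(profile, method="max", k=2):
    """Pick the output positions w.r.t. which sequence importance is computed.

    profile: (seq_len,) predicted signal for a single track.
    'max'  -> only the strongest position (one peak per sequence)
    'topk' -> the k strongest positions (e.g. the top-2 Oct4 peaks)
    """
    order = np.argsort(profile)[::-1]  # positions sorted by descending signal
    if method == "max":
        return order[:1]
    if method == "topk":
        return order[:k]
    raise ValueError(method)
```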
In [19]:
from basepair import samplers
import pandas as pd
from basepair.math import softmax
import numpy as np
import keras.backend as K
from keras.models import Model
from concise.utils.plot import seqlogo_fig, seqlogo
import matplotlib.pyplot as plt
In [127]:
seqlogo
Out[127]:
<function concise.utils.plot.seqlogo(letter_heights, vocab='DNA', ax=None)>
In [158]:
p = Seq2Nexus(test[0], test[1],test[2], model)
In [159]:
p.plot(sort='max_sox2', seq_grad='max', figsize=(20,12))

run modisco - what do the motifs look like?

In [27]:
# Functions to get the gradients
out = model.outputs[0]
inp = model.inputs[0]
pos_strand_ginp_max = K.function([inp], K.gradients(K.max(out[:,:, 0], axis=1), inp))
neg_strand_ginp_max = K.function([inp], K.gradients(K.max(out[:,:, 1], axis=1), inp))

Get predictions and gradients

In [28]:
from basepair.data import numpy_minibatch

# Pre-compute the predictions and bottlenecks
x = valid[0]
y_true = valid[1][0]
y_pred = model.predict(x)[0]
grads_pos = np.concatenate([pos_strand_ginp_max([batch])[0]
                            for batch in numpy_minibatch(x, 512)])
grads_neg = np.concatenate([neg_strand_ginp_max([batch])[0]
                            for batch in numpy_minibatch(x, 512)])
igrads_pos = grads_pos * x
igrads_neg = grads_neg * x

Correlate pos- and neg-strand gradients

In [29]:
from scipy.spatial.distance import cosine, correlation

grads_pos_ext = grads_pos.reshape((grads_pos.shape[0], -1))
grads_neg_ext = grads_neg.reshape((grads_neg.shape[0], -1))

distances = np.array([correlation(grads_neg_ext[i], grads_pos_ext[i]) for i in range(len(grads_neg_ext))])
In [30]:
import numpy as np
import seaborn as sns
In [42]:
fig = plt.figure(figsize=(4,2))
plt.hist(distances, bins=50);
plt.xlabel("Importance score correlation distance\n(pos., neg. strand)")
plt.ylabel("Frequency");
plt.savefig('fig/icml18/sox2-similarity.png', dpi=600)
plt.savefig('fig/icml18/sox2-similarity.pdf', dpi=600)
plt.close(fig)    # close the figure
fig
Out[42]:
In [31]:
plt.figure(figsize=(6,6))
plt.subplot(211)
plt.hist(distances, bins=50);
plt.subplot(212)
values, base = np.histogram(distances, bins=40)
plt.plot(base[:-1], np.cumsum(values));
plt.grid()
plt.xlabel("Correlation Distance");
plt.ylabel("Fraction of data points");
In [29]:
include = distances<0.2
In [230]:
# We get roughly 1/3 of the points with high correlation
In [231]:
top10_idx = pd.Series(np.where(distances<0.5)[0]).sample(10)
In [232]:
top10_idx = pd.Series(distances).sort_values(ascending=True).index[:10]
In [233]:
distances
Out[233]:
array([0.0087468 , 0.04102677, 0.01819867, ..., 0.01070821, 0.00906867,
       0.02070087])
In [40]:
(distances<0.2).dump(f"{ddir}/processed/chipnexus/motifs/oct4/modisco/valid_include.npy")
In [ ]:
f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.hdf5"

Plot some examples

In [174]:
# Top max-count indices
top10_idx = pd.Series(y_true.max(1).sum(1)).sort_values(ascending=False).index[10:30]
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-174-3f561dfad0c0> in <module>()
      1 # Top maxcount indicies
----> 2 top10_idx = pd.Series(y_true.max(1).sum(1)).sort_values(ascending=False).index[10:30]

~/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32     return umr_sum(a, axis, dtype, out, keepdims)
     33 
     34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):

AxisError: axis 1 is out of bounds for array of dimension 1
In [155]:
# Top count indices
top10_idx = pd.Series(y_true.sum(1).sum(1)).sort_values(ascending=False).index[:10]
In [156]:
# Random indices
top10_idx = pd.Series(np.arange(len(y_true))).sample(10)
In [157]:
for i in top10_idx:
    fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, sharex=True, figsize=(20,6))
    
    ax1.plot(np.arange(1, 202), y_true[i, :, 0], label="pos")
    ax1.plot(np.arange(1, 202), y_true[i, :, 1], label="neg")
    ax1.set_ylabel("Observed\ncounts")
    ax1.legend()
    ax2.plot(np.arange(1, 202), y_pred[i, :, 0], label="pos")
    ax2.plot(np.arange(1, 202), y_pred[i, :, 1], label="neg")
    ax2.set_ylabel("Predicted\ncounts")
    ax2.legend()
    ax3.set_ylabel("Pos. strand")
    seqlogo(igrads_pos[i], ax=ax3);
    ax4.set_ylabel("Neg. strand")
    seqlogo(igrads_neg[i], ax=ax4);
    ax4.set_xticks(list(range(0, 201, 5)));

Run modisco

In [266]:
top_distances = distances<0.2

hyp_scores = grads_pos + grads_neg

hyp_scores = hyp_scores[top_distances]

hyp_scores = hyp_scores - hyp_scores.mean(-1, keepdims=True)

onehot_data = valid[0][top_distances]

scores = hyp_scores * onehot_data
In [267]:
len(hyp_scores)
Out[267]:
1884
In [268]:
# Visualize
import modisco.visualization
from modisco.visualization import viz_sequence

viz_sequence.plot_weights(scores[0])
viz_sequence.plot_weights(hyp_scores[0])
viz_sequence.plot_weights(onehot_data[0])
In [269]:
from imp import reload
In [270]:
%env MKL_THREADING_LAYER=GNU
env: MKL_THREADING_LAYER=GNU
In [271]:
import theano
In [272]:
onehot_data.shape
Out[272]:
(1884, 201, 4)
In [273]:
import h5py
import numpy as np
%matplotlib inline
import modisco
reload(modisco)
import modisco.backend
reload(modisco.backend.theano_backend)
reload(modisco.backend)
import modisco.nearest_neighbors
reload(modisco.nearest_neighbors)
import modisco.affinitymat
reload(modisco.affinitymat.core)
reload(modisco.affinitymat.transformers)
import modisco.tfmodisco_workflow.seqlets_to_patterns
reload(modisco.tfmodisco_workflow.seqlets_to_patterns)
import modisco.tfmodisco_workflow.workflow
reload(modisco.tfmodisco_workflow.workflow)
import modisco.aggregator
reload(modisco.aggregator)
import modisco.cluster
reload(modisco.cluster.core)
reload(modisco.cluster.phenograph.core)
reload(modisco.cluster.phenograph.cluster)
import modisco.core
reload(modisco.core)
import modisco.coordproducers
reload(modisco.coordproducers)
import modisco.metaclusterers
reload(modisco.metaclusterers)

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
    sliding_window_size=21, 
    flank_size=10,
    histogram_bins=100, 
    percentiles_in_bandwidth=10,
    overlap_portion=0.5,     
    min_cluster_size=50,    
    threshold_for_counting_sign=1.0,
    weak_threshold_for_counting_sign=0.7)(
        task_names=["task0"],
        contrib_scores={'task0': scores},
        hypothetical_contribs={'task0': hyp_scores},
        one_hot=onehot_data)
On task task0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Done 0
Got 11104 coords
Computing thresholds
Bandwidth calculated: 0.7376611232757568
Computed threshold 1.2768272635247557
6282 coords remaining after thresholding
After resolving overlaps, got 6282 seqlets
2 activity patterns with support >= 50 out of 3 possible patterns
Metacluster sizes:  [6227, 55]
Idx to activities:  {0: '1', 1: '-1'}
On metacluster 1
Metacluster size 55
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 55
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 0.12 s
Starting affinity matrix computations
Normalization computed in 0.03 s
Cosine similarity mat computed in 0.04 s
Normalization computed in 0.01 s
Cosine similarity mat computed in 0.01 s
Finished affinity matrix computations in 0.05 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.0 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 6.78 s
Launching nearest neighbors affmat calculation job
Job completed in: 7.23 s
(Round 1) Computed affinity matrix on nearest neighbors in 14.05 s
Filtered down to 51 of 55
(Round 1) Retained 51 rows out of 55 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 51 samples in 0.000s...
[t-SNE] Computed neighbors for 51 samples in 0.001s...
[t-SNE] Computed conditional probabilities for sample 51 / 51
[t-SNE] Mean sigma: 0.310619
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.03046703338623047 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    1.5s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   11.6s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   14.4s finished
Louvain completed 200 runs in 21.682221174240112 seconds
Wrote graph to binary file in 0.0162203311920166 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.66697
Louvain completed 51 runs in 36.357438802719116 seconds
Preproc + Louvain took 58.16375923156738 s
Got 4 clusters after round 1
Counts:
{1: 13, 3: 7, 2: 12, 0: 19}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 19 seqlets
Trimmed 2 out of 19
Skipped 1 seqlets
Dropping cluster 0 with 16 seqlets due to sign disagreement
Aggregating for cluster 1 with 13 seqlets
Trimmed 2 out of 13
Dropping cluster 1 with 11 seqlets due to sign disagreement
Aggregating for cluster 2 with 12 seqlets
Trimmed 1 out of 12
Dropping cluster 2 with 11 seqlets due to sign disagreement
Aggregating for cluster 3 with 7 seqlets
Trimmed 0 out of 7
Dropping cluster 3 with 7 seqlets due to sign disagreement
len(seqlets) is 0 - bailing!
On metacluster 0
Metacluster size 6227
Relevant tasks:  ('task0',)
Relevant signs:  (1,)
(Round 1) num seqlets: 6227
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 9.65 s
Starting affinity matrix computations
Normalization computed in 0.83 s
Cosine similarity mat computed in 4.66 s
Normalization computed in 0.95 s
Cosine similarity mat computed in 4.93 s
Finished affinity matrix computations in 9.6 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 1.04 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 96.87 s
(Round 1) Computed affinity matrix on nearest neighbors in 217.28 s
Filtered down to 3033 of 6227
(Round 1) Retained 3033 rows out of 6227 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 3033 samples in 0.008s...
[t-SNE] Computed neighbors for 3033 samples in 0.089s...
[t-SNE] Computed conditional probabilities for sample 1000 / 3033
[t-SNE] Computed conditional probabilities for sample 2000 / 3033
[t-SNE] Computed conditional probabilities for sample 3000 / 3033
[t-SNE] Computed conditional probabilities for sample 3033 / 3033
[t-SNE] Mean sigma: 0.220301
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.9918856620788574 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   14.1s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   17.3s finished
Louvain completed 200 runs in 31.39292621612549 seconds
Wrote graph to binary file in 12.127521753311157 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.727572
After 6 runs, maximum modularity is Q = 0.727584
Louvain completed 56 runs in 52.783995389938354 seconds
Preproc + Louvain took 98.1414110660553 s
Got 14 clusters after round 1
Counts:
{8: 173, 4: 311, 7: 197, 2: 334, 3: 315, 12: 27, 10: 91, 5: 243, 0: 508, 1: 358, 6: 241, 9: 142, 11: 81, 13: 12}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 508 seqlets
Trimmed 54 out of 508
Skipped 59 seqlets
Aggregating for cluster 1 with 358 seqlets
Trimmed 22 out of 358
Skipped 44 seqlets
Aggregating for cluster 2 with 334 seqlets
Trimmed 55 out of 334
Skipped 38 seqlets
Aggregating for cluster 3 with 315 seqlets
Trimmed 18 out of 315
Skipped 49 seqlets
Aggregating for cluster 4 with 311 seqlets
Trimmed 26 out of 311
Skipped 40 seqlets
Aggregating for cluster 5 with 243 seqlets
Trimmed 33 out of 243
Skipped 41 seqlets
Aggregating for cluster 6 with 241 seqlets
Trimmed 33 out of 241
Skipped 32 seqlets
Aggregating for cluster 7 with 197 seqlets
Trimmed 18 out of 197
Skipped 28 seqlets
Aggregating for cluster 8 with 173 seqlets
Trimmed 14 out of 173
Skipped 38 seqlets
Skipped 6 seqlets
Aggregating for cluster 9 with 142 seqlets
Trimmed 9 out of 142
Skipped 17 seqlets
Aggregating for cluster 10 with 91 seqlets
Trimmed 4 out of 91
Skipped 15 seqlets
Aggregating for cluster 11 with 81 seqlets
Trimmed 25 out of 81
Skipped 1 seqlets
Skipped 1 seqlets
Aggregating for cluster 12 with 27 seqlets
Trimmed 1 out of 27
Skipped 1 seqlets
Aggregating for cluster 13 with 12 seqlets
Trimmed 0 out of 12
Skipped 3 seqlets
(Round 2) num seqlets: 2305
(Round 2) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 3.85 s
Starting affinity matrix computations
Normalization computed in 0.38 s
Cosine similarity mat computed in 0.84 s
Normalization computed in 0.32 s
Cosine similarity mat computed in 0.77 s
Finished affinity matrix computations in 1.61 s
(Round 2) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.21 s
(Round 2) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 69.83 s
Launching nearest neighbors affmat calculation job
Job completed in: 69.37 s
(Round 2) Computed affinity matrix on nearest neighbors in 145.46 s
Not applying filtering for rounds above first round
(Round 2) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 2305 samples in 0.004s...
[t-SNE] Computed neighbors for 2305 samples in 0.063s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2305
[t-SNE] Computed conditional probabilities for sample 2000 / 2305
[t-SNE] Computed conditional probabilities for sample 2305 / 2305
[t-SNE] Mean sigma: 0.219928
(Round 2) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 1.0518958568572998 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    2.2s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:   14.2s
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:   17.3s finished
Louvain completed 200 runs in 29.665878295898438 seconds
Wrote graph to binary file in 7.174563407897949 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.722526
After 2 runs, maximum modularity is Q = 0.730711
After 3 runs, maximum modularity is Q = 0.730801
After 4 runs, maximum modularity is Q = 0.730984
After 6 runs, maximum modularity is Q = 0.732045
Louvain completed 56 runs in 50.18093490600586 seconds
Preproc + Louvain took 88.62935519218445 s
Got 13 clusters after round 2
Counts:
{6: 144, 0: 434, 1: 357, 9: 115, 10: 45, 2: 291, 8: 116, 3: 237, 5: 153, 12: 12, 7: 124, 4: 233, 11: 44}
(Round 2) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 434 seqlets
Trimmed 45 out of 434
Skipped 22 seqlets
Aggregating for cluster 1 with 357 seqlets
Trimmed 58 out of 357
Skipped 9 seqlets
Aggregating for cluster 2 with 291 seqlets
Trimmed 106 out of 291
Skipped 16 seqlets
Aggregating for cluster 3 with 237 seqlets
Trimmed 74 out of 237
Skipped 4 seqlets
Aggregating for cluster 4 with 233 seqlets
Trimmed 32 out of 233
Skipped 11 seqlets
Aggregating for cluster 5 with 153 seqlets
Trimmed 33 out of 153
Skipped 4 seqlets
Aggregating for cluster 6 with 144 seqlets
Trimmed 34 out of 144
Skipped 4 seqlets
Aggregating for cluster 7 with 124 seqlets
Trimmed 18 out of 124
Skipped 11 seqlets
Aggregating for cluster 8 with 116 seqlets
Trimmed 31 out of 116
Skipped 2 seqlets
Aggregating for cluster 9 with 115 seqlets
Trimmed 42 out of 115
Aggregating for cluster 10 with 45 seqlets
Trimmed 5 out of 45
Skipped 1 seqlets
Aggregating for cluster 11 with 44 seqlets
Trimmed 0 out of 44
Skipped 36 seqlets
Aggregating for cluster 12 with 12 seqlets
Trimmed 0 out of 12
Got 13 clusters
Splitting into subclusters...
Inspecting for spurious merging
Wrote graph to binary file in 0.8585577011108398 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00796871
Louvain completed 21 runs in 16.888229370117188 seconds
Similarity is 0.9369463592028393; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.5535683631896973 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00915527
After 9 runs, maximum modularity is Q = 0.00915528
Louvain completed 29 runs in 23.062270641326904 seconds
Similarity is 0.9749927220847351; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2352142333984375 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0124411
Louvain completed 21 runs in 17.319679260253906 seconds
Similarity is 0.9718953897396376; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2210524082183838 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00699388
Louvain completed 21 runs in 16.389430046081543 seconds
Similarity is 0.9153464971620826; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.2778606414794922 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0102081
Louvain completed 21 runs in 16.981793880462646 seconds
Similarity is 0.9678379259664387; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.12695574760437012 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0107158
Louvain completed 21 runs in 16.91047763824463 seconds
Similarity is 0.7620583134057276; is_dissimilar is True
Inspecting for spurious merging
Wrote graph to binary file in 0.06804585456848145 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00670445
Louvain completed 21 runs in 16.014127254486084 seconds
Similarity is 0.9015959238247756; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.05251049995422363 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.00931806
After 6 runs, maximum modularity is Q = 0.00934598
Louvain completed 26 runs in 20.417680740356445 seconds
Similarity is 0.8063749107025145; is_dissimilar is False
Got 2 subclusters
Inspecting for spurious merging
Wrote graph to binary file in 0.1024315357208252 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0108229
Louvain completed 21 runs in 16.562198162078857 seconds
Similarity is 0.9592693482354967; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.09054327011108398 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0113031
Louvain completed 21 runs in 16.434577465057373 seconds
Similarity is 0.9490734747843734; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.07478022575378418 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.0105235
Louvain completed 21 runs in 16.81160855293274 seconds
Similarity is 0.9306806135433177; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.08950996398925781 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.01083
Louvain completed 21 runs in 16.266108512878418 seconds
Similarity is 0.9045891394134682; is_dissimilar is False
Inspecting for spurious merging
Wrote graph to binary file in 0.0588986873626709 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = -0.0167635
After 2 runs, maximum modularity is Q = 0.00610611
After 3 runs, maximum modularity is Q = 0.00642396
After 4 runs, maximum modularity is Q = 0.00789468
Louvain completed 24 runs in 20.76047968864441 seconds
Similarity is 0.9077495375985554; is_dissimilar is False
Merging on 14 clusters
On merging iteration 1
Computing pattern to seqlet distances
Computing pattern to pattern distances
Collapsing 0 & 1 with prob 0.005213325762187972 and sim 0.9876791751232061
Collapsing 2 & 8 with prob 8.690870354286062e-06 and sim 0.9338340315269065
Collapsing 1 & 9 with prob 4.393169010983065e-05 and sim 0.9305307932726811
Collapsing 3 & 10 with prob 4.2267009505083e-06 and sim 0.9240314715581518
Collapsing 0 & 9 with prob 5.155265632299075e-05 and sim 0.917609114054792
Collapsing 4 & 8 with prob 2.0666627920554385e-06 and sim 0.9171718035410608
Aborting collapse as 2 & 4 have prob 1.154466979909516e-06 and sim 0.3987002160044585
Collapsing 1 & 3 with prob 0.00013036782422250225 and sim 0.9136723758111194
Aborting collapse as 9 & 10 have prob 1.4887673966122782e-07 and sim 0.7969590744628001
Collapsing 0 & 3 with prob 3.812735472045252e-05 and sim 0.9060132492910623
Aborting collapse as 9 & 10 have prob 1.4887673966122782e-07 and sim 0.7969590744628001
Collapsing 0 & 11 with prob 3.682883491126881e-05 and sim 0.8826333671319047
Trimmed 19 out of 644
Trimmed 3 out of 263
Skipped 5 seqlets
Trimmed 6 out of 707
Skipped 1 seqlets
Trimmed 1 out of 232
Skipped 2 seqlets
Trimmed 3 out of 738
On merging iteration 2
Computing pattern to seqlet distances
Computing pattern to pattern distances
Collapsing 0 & 2 with prob 0.0006245711808186924 and sim 0.9105909804374275
Collapsing 2 & 4 with prob 0.0002955546574045961 and sim 0.8832851960368194
Trimmed 8 out of 964
Skipped 2 seqlets
Trimmed 7 out of 1058
On merging iteration 3
Computing pattern to seqlet distances
Computing pattern to pattern distances
Got 7 patterns after merging
Performing seqlet reassignment
Cross contin jaccard time taken: 7.56 s
Cross contin jaccard time taken: 7.33 s
Discarded 5 seqlets
Skipped 13 seqlets
Skipped 6 seqlets
Skipped 3 seqlets
Got 5 patterns after reassignment
Total time taken is 972.97s
In [279]:
!mkdir -p {ddir}/processed/chipnexus/motifs/sox2-oct4/modisco
In [275]:
!du -sh {ddir}/processed/chipnexus/motifs/sox2/modisco
58M	/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/motifs/sox2/modisco
In [280]:
modisco_file = f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.hdf5"
modisco_file_pkl = f"{ddir}/processed/chipnexus/motifs/sox2-oct4/modisco/counts.pkl"
In [281]:
from basepair.utils import write_pkl
In [192]:
write_pkl(tfmodisco_results, modisco_file_pkl)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-192-87dc9ce2ca7e> in <module>()
----> 1 write_pkl(tfmodisco_results, modisco_file_pkl)

~/workspace/basepair/basepair/utils.py in write_pkl(obj, fname, create_dirs, protocol)
      8             os.makedirs(os.path.dirname(fname), exist_ok=True)
      9     import pickle
---> 10     pickle.dump(obj, open(fname, 'wb'), protocol=protocol)
     11 
     12 

TypeError: can't pickle dict_values objects
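The `TypeError` above comes from a `dict_values` view object somewhere inside `tfmodisco_results`; `pickle` cannot serialize dict views. A hedged workaround, if the offending views sit in plain dicts/lists, is to convert them to lists before dumping (`listify_views` below is a hypothetical helper, not part of `basepair.utils`); for the MoDISco result object itself the `save_hdf5` route used in a later cell is the reliable path.

```python
import pickle

def listify_views(obj):
    """Recursively replace dict_keys/dict_values views with plain lists
    so the structure becomes picklable (sketch, assumes nested
    dicts/lists/tuples only)."""
    if isinstance(obj, (type({}.keys()), type({}.values()))):
        return list(obj)
    if isinstance(obj, dict):
        return {k: listify_views(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(listify_views(v) for v in obj)
    return obj

# pickle.dumps({"a": {"x": 1}.values()}) raises the TypeError above;
# after conversion it pickles fine:
blob = pickle.dumps(listify_views({"a": {"x": 1}.values()}))
```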
In [249]:
!rm {modisco_file}
In [282]:
import h5py
from importlib import reload  # reload lives in importlib in Python 3
import modisco.util
reload(modisco.util)
grp = h5py.File(modisco_file, "w")  # open explicitly in write mode
tfmodisco_results.save_hdf5(grp)
# TODO - write pickle
In [286]:
hdf5_results.close()
In [283]:
from collections import Counter
from modisco.visualization import viz_sequence
reload(viz_sequence)
from matplotlib import pyplot as plt

import modisco.affinitymat.core
reload(modisco.affinitymat.core)
import modisco.cluster.phenograph.core
reload(modisco.cluster.phenograph.core)
import modisco.cluster.phenograph.cluster
reload(modisco.cluster.phenograph.cluster)
import modisco.cluster.core
reload(modisco.cluster.core)
import modisco.aggregator
reload(modisco.aggregator)

import sklearn.decomposition
import sklearn.manifold

hdf5_results = h5py.File(modisco_file, "r")  # read-only

#patterns = (tfmodisco_results
#            .metacluster_idx_to_submetacluster_results[0]
#            .seqlets_to_patterns_result.patterns);
patterns = (list(hdf5_results
                 ["metacluster_idx_to_submetacluster_results"]
                 ["metacluster0"]
                 ["seqlets_to_patterns_result"]
                 ["patterns"]["all_pattern_names"]))
print(len(patterns))
pattern_grp = (hdf5_results
                 ["metacluster_idx_to_submetacluster_results"]
                 ["metacluster0"]
                 ["seqlets_to_patterns_result"]
                 ["patterns"])

for pattern_name in patterns:
    pattern = pattern_grp[pattern_name]
    print(pattern_name)
    print("total seqlets:", len(pattern["seqlets_and_alnmts"]["seqlets"]))
    #pattern.plot_counts(counts=aggregated_seqlet.get_per_position_seqlet_center_counts())
    background = np.array([0.27, 0.23, 0.23, 0.27])
    print("fwd:")
    viz_sequence.plot_weights(pattern["task0_contrib_scores"]["fwd"])
    viz_sequence.plot_weights(pattern["task0_hypothetical_contribs"]["fwd"])
    viz_sequence.plot_weights(viz_sequence.ic_scale(np.array(pattern["sequence"]["fwd"]),
                                                    background=background))

    print("reverse:")
    viz_sequence.plot_weights(pattern["task0_contrib_scores"]["rev"])
    viz_sequence.plot_weights(pattern["task0_hypothetical_contribs"]["rev"])
    viz_sequence.plot_weights(viz_sequence.ic_scale(np.array(pattern["sequence"]["rev"]),
                                                    background=background))
5
b'pattern_0'
total seqlets: 1054
fwd:
reverse:
b'pattern_1'
total seqlets: 242
fwd:
reverse:
b'pattern_2'
total seqlets: 184
fwd:
reverse:
b'pattern_3'
total seqlets: 64
fwd:
reverse:
b'pattern_4'
total seqlets: 48
fwd:
reverse:
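`viz_sequence.ic_scale` in the loop above rescales each PWM position by its information content against the background, so letter heights sum to the per-position IC. A minimal re-implementation sketch of that idea (the standard IC formula; not necessarily MoDISco's exact code, and `pseudocount` is my own addition to avoid `log2(0)`):

```python
import numpy as np

def ic_scale(pwm, background, pseudocount=1e-6):
    """Scale each PWM row (position) by its information content
    relative to the background distribution."""
    pwm = np.asarray(pwm, dtype=float)
    bg = np.asarray(background, dtype=float)
    # per-position IC: sum_b p_b * (log2 p_b - log2 bg_b)
    ic = np.sum(pwm * (np.log2(pwm + pseudocount) - np.log2(bg)), axis=1)
    return pwm * ic[:, None]

# a fully certain position gets ~2 bits under a uniform background,
# while a uniform position collapses to ~0:
pwm = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.25, 0.25, 0.25, 0.25]])
scaled = ic_scale(pwm, background=np.array([0.25] * 4))
```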
In [285]:
hdf5_results
Out[285]:
[]
In [270]:
sl = tfmodisco_results.multitask_seqlet_creation_results.final_seqlets[0]
In [278]:
cl0 = hdf5_results.metacluster_idx_to_submetacluster_results[0]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-278-f54e30fb67a0> in <module>()
----> 1 cl0 = hdf5_results.metacluster_idx_to_submetacluster_results[0]

AttributeError: 'File' object has no attribute 'metacluster_idx_to_submetacluster_results'
In [276]:
sl.coor.example_idx
Out[276]:
2
In [277]:
len(pd.Series([sl.coor.example_idx for sl in cl0.seqlets]).unique())
Out[277]:
811
In [255]:
cl0.seqlets[0]
Out[255]:
<modisco.core.Seqlet at 0x7f47291decc0>
In [254]:
len(cl0.seqlets)
Out[254]:
1845
In [252]:
len(cl0.seqlets)
Out[252]:
250
In [ ]:
(hdf5_results["metacluster_idx_to_submetacluster_results"]
             ["metacluster0"]
             ["seqlets_to_patterns_result"]
             ["patterns"]["all_pattern_names"])
In [193]:
sl.coor.start
Out[193]:
91
In [197]:
sl.exidx_start_end_string
Out[197]:
'2_91_132'
In [222]:
tfmodisco_results.metaclustering_results.metacluster_indices
Out[222]:
array([0, 0, 1, ..., 0, 1, 1])
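A quick way to see how the seqlets split across the two metaclusters is to tally the index array; a sketch with `collections.Counter`, using a toy array in place of the real `metacluster_indices`:

```python
import numpy as np
from collections import Counter

# toy stand-in for
# tfmodisco_results.metaclustering_results.metacluster_indices
metacluster_indices = np.array([0, 0, 1, 0, 1, 1, 0])

# seqlets per metacluster, e.g. Counter({0: 4, 1: 3})
counts = Counter(metacluster_indices.tolist())
```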
In [231]:
sl = tfmodisco_results.multitask_seqlet_creation_results.final_seqlets[0]
In [236]:
sl.coor.
Out[236]:
False
In [217]:
tfmodisco_results.metaclustering_results.metacluster_indices.max()
Out[217]:
1
In [219]:
len(tfmodisco_results.metaclustering_results.metacluster_indices)
Out[219]:
2095
In [198]:
sl.
Out[198]:
OrderedDict([('task0_contrib_scores',
              <modisco.core.Snippet at 0x7f47291de1d0>),
             ('task0_hypothetical_contribs',
              <modisco.core.Snippet at 0x7f47291de2e8>),
             ('sequence', <modisco.core.Snippet at 0x7f47291dea90>)])

plot the average sequence distribution for each motif
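One way to compute this is to average the aligned one-hot seqlet sequences per pattern into a per-position base distribution and plot that. A hedged sketch on toy arrays (in the real notebook the `(n, length, 4)` stack would come from the pattern's aligned seqlet sequences, e.g. via `pattern["sequence"]` or the seqlet coordinates; `average_seqlet_distribution` is a hypothetical helper):

```python
import numpy as np

def average_seqlet_distribution(onehot_seqlets):
    """Average a stack of aligned one-hot seqlets (n, length, 4)
    into a per-position base distribution (length, 4)."""
    arr = np.asarray(onehot_seqlets, dtype=float)
    # each position sums to ~1 if the inputs are proper one-hots
    return arr.mean(axis=0)

# toy example: two aligned 3-bp seqlets, ACG and ACT (columns = A,C,G,T)
seqlets = np.array([
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],   # ACG
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]],   # ACT
])
avg = average_seqlet_distribution(seqlets)
# position 0 is all-A; position 2 is split 50/50 between G and T
```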