Goal

  • develop an activity model predicting:
    • H3K27ac
    • Pol II
    • GRO-seq

See also: metaplots.html

Design decisions

  • dataset = all ChIP-nexus peaks
  • output = total number of counts within ±1 kb of the region center (see the sketch below)
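
A minimal sketch of this target computation (illustrative only; assumes pyBigWig rather than the extractors actually used below):

    import numpy as np
    import pyBigWig

    def region_counts(bw_paths, chrom, center, flank=1000):
        # Sum bigwig coverage in [center - flank, center + flank),
        # adding replicate tracks together.
        total = 0.0
        for path in bw_paths:
            bw = pyBigWig.open(path)
            vals = np.nan_to_num(bw.values(chrom, center - flank, center + flank))
            total += float(np.sum(vals))
            bw.close()
        return total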

Tasks

  • [x] write the dataloader
  • [x] load all the data
  • [x] plot the response variable distribution (histogram, QQ-plots)
  • [x] get the predictions across all the regions from bpnet
    • [x] save to the hdf5 file
  • [x] train a simple model on top
  • [x] evaluate using simple scatterplots
  • [ ] exclude promoter regions and re-train

Other tasks

  • [ ] fine-tune the whole model
  • [ ] get the importance scores
In [1]:
from basepair.imports import *
import pybedtools
from genomelake.extractors import BigwigExtractor
from basepair.extractors import StrandedBigWigExtractor, bw_extract
from kipoiseq.transforms import ResizeInterval
from basepair.modisco.table import ModiscoData
from pybedtools import BedTool
from basepair.cli.modisco import load_ranges
from basepair.plot.heatmaps import (heatmap_importance_profile, normalize, multiple_heatmap_stranded_profile,
                                    heatmap_stranded_profile)
from basepair.plot.profiles import plot_stranded_profile
import hvplot.pandas
from basepair.plots import regression_eval
Using TensorFlow backend.
In [2]:
create_tf_session(0)
Out[2]:
<tensorflow.python.client.session.Session at 0x7f4fe72926a0>

Load the data

In [3]:
df = pd.read_csv(f"{ddir}/processed/chipnexus/external-data.tsv", sep='\t')
df = df.set_index('assay')
In [4]:
df
Out[4]:
             axis                       path
assay
DNase           0  /srv/scratch/avsec/wo...
DNase           0  /srv/scratch/avsec/wo...
DNase-HINT      0  /srv/scratch/avsec/wo...
...           ...                        ...
Groseq          0  /srv/scratch/avsec/wo...
MNase-wt        0  /srv/scratch/avsec/wo...
MNase-h4        0  /srv/scratch/avsec/wo...

16 rows × 2 columns

Write the BED file

In [5]:
from basepair.cli.schemas import DataSpec

ds = DataSpec.load(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/dataspec.yml")

!mkdir -p {ddir}/processed/activity/data/

def read_factor(factor, filename):
    # Read a peak BED file and label every region with the factor name.
    df = pd.read_table(filename, header=None, usecols=[0, 1, 2])
    df[3] = factor
    df.columns = ['chrom', 'start', 'end', 'name']
    return df

dfc = pd.concat([read_factor(k, ds.task_specs[k].peaks) for k in ds.task_specs], axis=0)

dfc.to_csv(f"{ddir}/processed/activity/data/peaks.bed", sep='\t', index=False, header=False)
In [6]:
!head -n 2 {ddir}/processed/activity/data/peaks.bed
chrX	143483044	143483045	Oct4
chr3	122145563	122145564	Oct4

Prepare the bigwigs

In [7]:
assays = ['H3K27ac', 'PolII', 'Groseq']
In [8]:
def tolist(s):
    if isinstance(s, str):
        return [s]
    else:
        return list(s)
In [9]:
bigwigs = {a: tolist(df.loc[a].path) for a in assays}
In [10]:
bigwigs
Out[10]:
{'H3K27ac': ['/srv/scratch/avsec/workspace/chipnexus/data/raw/2018-10-13-histone-chipseq-PMID-28483418/H3K27ac_ChIP-seq_WT_rep1_blacklisted.bw',
  '/srv/scratch/avsec/workspace/chipnexus/data/raw/2018-10-13-histone-chipseq-PMID-28483418/H3K27ac_ChIP-seq_WT_rep2_blacklisted.bw'],
 'PolII': ['/srv/scratch/avsec/workspace/chipnexus/data/raw/2018-10-13-histone-chipseq-PMID-28483418/PolII_ChIP-seq_WT_rep1_blacklisted.bw',
  '/srv/scratch/avsec/workspace/chipnexus/data/raw/2018-10-13-histone-chipseq-PMID-28483418/PolII_ChIP-seq_WT_rep2_blacklisted.bw'],
 'Groseq': ['/srv/scratch/avsec/workspace/chipnexus/data/raw/2018-10-15-groseq-from-melanie/GRO-seq_WT_1_blacklisted.bw']}

Write the dataloader

In [13]:
from basepair.config import valid_chr, test_chr

from basepair.datasets import *
In [14]:
dl = ActivityDataset(f"{ddir}/processed/activity/data/peaks.bed", ds.fasta_file, bigwigs, 
                     excl_chromosomes=valid_chr + test_chr)
In [15]:
# load all training examples (validation/test chromosomes are excluded above)
train = dl.load_all(num_workers=10)
100%|██████████| 1913/1913 [00:11<00:00, 172.21it/s]
In [37]:
dfy = pd.DataFrame(train['targets'])
In [92]:
valid = ActivityDataset(f"{ddir}/processed/activity/data/peaks.bed", ds.fasta_file, bigwigs, 
                        incl_chromosomes=valid_chr).load_all(num_workers=10)
100%|██████████| 599/599 [00:06<00:00, 92.94it/s]
In [94]:
dfy_valid = pd.DataFrame(valid['targets'])
In [93]:
test = ActivityDataset(f"{ddir}/processed/activity/data/peaks.bed", ds.fasta_file, bigwigs, 
                       incl_chromosomes=test_chr).load_all(num_workers=10)
100%|██████████| 566/566 [00:05<00:00, 96.32it/s]
In [95]:
dfy_test = pd.DataFrame(test['targets'])
In [38]:
dfy.head()
Out[38]:
    H3K27ac     PolII    Groseq
0   72177.0  230052.0    1577.0
1  440983.0  164513.0  320665.0
2  135278.0   83775.0  228978.0
3   44406.0   22818.0   77130.0
4  246385.0   90010.0  282323.0
In [16]:
len(dl)
Out[16]:
61205
In [17]:
dl[0]
Out[17]:
{'inputs': {'seq': array([[0., 0., 0., 1.],
         [0., 0., 0., 1.],
         [0., 1., 0., 0.],
         ...,
         [0., 1., 0., 0.],
         [0., 1., 0., 0.],
         [1., 0., 0., 0.]], dtype=float32)},
 'targets': {'H3K27ac': 72177.0, 'PolII': 230052.0, 'Groseq': 1577.0},
 'metadata': {'ranges': GenomicRanges(chr='chrX', start=143482544, end=143483544, id='0', strand='*'),
  'ranges_wide': GenomicRanges(chr='chrX', start=143482044, end=143484044, id='Oct4', strand='.'),
  'name': 'Oct4'}}

Output distribution

Histograms

In [58]:
np.log10(1 + dfy).hvplot.hist(bins=100, subplots=True, height=200, shared_axes=True).cols(1)
Out[58]: [histograms of log10(1 + counts) for each assay]

QQ-plots

In [59]:
import scipy.stats as stats
In [72]:
fig, axes = plt.subplots(1, len(assays), figsize=(9, 3), sharex=True, sharey=True)
for a, ax in zip(assays, axes):
    stats.probplot(np.log10(1 + dfy[a]), dist="norm", plot=ax);
    ax.set_title(a)
plt.tight_layout()

Get bottleneck predictions

In [34]:
from basepair.BPNet import BPNetPredictor
from keras.models import Model, Sequential
import keras.layers as kl
In [19]:
mdir = f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/"
In [20]:
bpnet = BPNetPredictor.from_mdir(mdir)
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
In [21]:
bpnet.tasks
Out[21]:
['Oct4', 'Sox2', 'Nanog', 'Klf4']
In [26]:
bpnet.model.get_layer("add_9").output
Out[26]:
<tf.Tensor 'add_9/add_8:0' shape=(?, 1000, 64) dtype=float32>
In [27]:
bottleneck_model = Model(bpnet.model.inputs, bpnet.model.get_layer("add_9").output)
In [32]:
bottleneck_predictions = bottleneck_model.predict(train['inputs'], batch_size=32, verbose=1)
61205/61205 [==============================] - 33s 545us/step
In [97]:
bottleneck_predictions_valid = bottleneck_model.predict(valid['inputs'], batch_size=32, verbose=1)
bottleneck_predictions_test = bottleneck_model.predict(test['inputs'], batch_size=32, verbose=1)
19137/19137 [==============================] - 10s 508us/step
18086/18086 [==============================] - 9s 513us/step
In [33]:
bottleneck_predictions.shape
Out[33]:
(61205, 1000, 64)
In [35]:
top_model = Sequential([
    # linear head: average the 64 bottleneck channels over the 1 kb window
    kl.GlobalAvgPool1D(input_shape=(1000, 64)),
    kl.Dense(3)  # one output per assay
])
In [111]:
# deeper head: coarse max-pooling + a small MLP (replaces the linear head above)
top_model = Sequential([
    kl.MaxPool1D(pool_size=50, input_shape=(1000, 64)),
    kl.Flatten(),
    kl.Dense(64, activation='relu'),
    # kl.Dropout(.5),
    kl.Dense(3)  # one output per assay
])
In [112]:
from concise.metrics import var_explained
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import StandardScaler
In [113]:
top_model.compile("adam", 'mse', metrics=[var_explained])
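
var_explained measures the fraction of target variance explained by the predictions. A minimal batch-wise sketch in Keras backend ops (an illustration, not concise's exact implementation):

    import keras.backend as K

    def var_explained_sketch(y_true, y_pred):
        # 1 - Var(y_true - y_pred) / Var(y_true), computed per batch
        return 1 - K.var(y_true - y_pred) / (K.var(y_true) + K.epsilon())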
In [114]:
preproc = StandardScaler()
In [116]:
y = preproc.fit_transform(np.log10(1 + dfy).values)
y_valid = preproc.transform(np.log10(1 + dfy_valid).values)
y_test = preproc.transform(np.log10(1 + dfy_test).values)
In [117]:
top_model.fit(bottleneck_predictions, y, batch_size=512,
              epochs=100,
              validation_data=(bottleneck_predictions_valid, y_valid),
              callbacks=[EarlyStopping(patience=5)]
             )
Train on 61205 samples, validate on 19137 samples
Epoch 1/100
61205/61205 [==============================] - 35s 567us/step - loss: 1.0262 - var_explained: 0.0540 - val_loss: 0.8771 - val_var_explained: 0.1242
Epoch 2/100
61205/61205 [==============================] - 32s 527us/step - loss: 0.8564 - var_explained: 0.1476 - val_loss: 0.8612 - val_var_explained: 0.1382
Epoch 3/100
61205/61205 [==============================] - 31s 514us/step - loss: 0.8412 - var_explained: 0.1635 - val_loss: 0.8582 - val_var_explained: 0.1515
Epoch 4/100
61205/61205 [==============================] - 31s 506us/step - loss: 0.8357 - var_explained: 0.1726 - val_loss: 0.8517 - val_var_explained: 0.1523
Epoch 5/100
61205/61205 [==============================] - 30s 495us/step - loss: 0.8334 - var_explained: 0.1783 - val_loss: 0.8417 - val_var_explained: 0.1596
Epoch 6/100
61205/61205 [==============================] - 28s 458us/step - loss: 0.8240 - var_explained: 0.1838 - val_loss: 0.8515 - val_var_explained: 0.1522
Epoch 7/100
61205/61205 [==============================] - 33s 538us/step - loss: 0.8193 - var_explained: 0.1873 - val_loss: 0.8367 - val_var_explained: 0.1633
Epoch 8/100
61205/61205 [==============================] - 32s 521us/step - loss: 0.8150 - var_explained: 0.1932 - val_loss: 0.8634 - val_var_explained: 0.1593
Epoch 9/100
61205/61205 [==============================] - 29s 472us/step - loss: 0.8088 - var_explained: 0.1965 - val_loss: 0.8740 - val_var_explained: 0.1691
Epoch 10/100
61205/61205 [==============================] - 28s 462us/step - loss: 0.8105 - var_explained: 0.1988 - val_loss: 0.8332 - val_var_explained: 0.1686
Epoch 11/100
61205/61205 [==============================] - 31s 504us/step - loss: 0.8042 - var_explained: 0.2015 - val_loss: 0.8476 - val_var_explained: 0.1692
Epoch 12/100
61205/61205 [==============================] - 32s 522us/step - loss: 0.8018 - var_explained: 0.2040 - val_loss: 0.8312 - val_var_explained: 0.1677
Epoch 13/100
61205/61205 [==============================] - 29s 467us/step - loss: 0.8005 - var_explained: 0.2075 - val_loss: 0.8310 - val_var_explained: 0.1737
Epoch 14/100
61205/61205 [==============================] - 28s 455us/step - loss: 0.7957 - var_explained: 0.2091 - val_loss: 0.8292 - val_var_explained: 0.1715
Epoch 15/100
61205/61205 [==============================] - 27s 437us/step - loss: 0.7926 - var_explained: 0.2120 - val_loss: 0.8330 - val_var_explained: 0.1731
Epoch 16/100
61205/61205 [==============================] - 29s 469us/step - loss: 0.7934 - var_explained: 0.2132 - val_loss: 0.8474 - val_var_explained: 0.1708
Epoch 17/100
61205/61205 [==============================] - 26s 429us/step - loss: 0.7918 - var_explained: 0.2146 - val_loss: 0.8307 - val_var_explained: 0.1758
Epoch 18/100
61205/61205 [==============================] - 29s 478us/step - loss: 0.7911 - var_explained: 0.2158 - val_loss: 0.8271 - val_var_explained: 0.1715
Epoch 19/100
61205/61205 [==============================] - 27s 445us/step - loss: 0.7884 - var_explained: 0.2181 - val_loss: 0.8252 - val_var_explained: 0.1786
Epoch 20/100
61205/61205 [==============================] - 29s 481us/step - loss: 0.7856 - var_explained: 0.2207 - val_loss: 0.8222 - val_var_explained: 0.1758
Epoch 21/100
61205/61205 [==============================] - 28s 462us/step - loss: 0.7856 - var_explained: 0.2198 - val_loss: 0.8512 - val_var_explained: 0.1682
Epoch 22/100
61205/61205 [==============================] - 30s 489us/step - loss: 0.7876 - var_explained: 0.2195 - val_loss: 0.8361 - val_var_explained: 0.1764
Epoch 23/100
61205/61205 [==============================] - 29s 470us/step - loss: 0.7840 - var_explained: 0.2211 - val_loss: 0.8253 - val_var_explained: 0.1767
Epoch 24/100
61205/61205 [==============================] - 29s 470us/step - loss: 0.7804 - var_explained: 0.2236 - val_loss: 0.8260 - val_var_explained: 0.1773
Epoch 25/100
61205/61205 [==============================] - 28s 460us/step - loss: 0.7881 - var_explained: 0.2220 - val_loss: 0.8404 - val_var_explained: 0.1784
Out[117]:
<keras.callbacks.History at 0x7f05b393a400>

Evaluate

In [118]:
ypred_valid = top_model.predict(bottleneck_predictions_valid)
In [100]:
ypred_valid.shape
Out[100]:
(19137, 3)
In [120]:
fig, axes = plt.subplots(len(assays), 1, figsize=(5, 11), sharex=True, sharey=True)
for i, (a, ax) in enumerate(zip(assays, axes)):
    regression_eval(ypred_valid[:,i], y_valid[:,i], alpha=0.05, task=a, ax=ax);
plt.tight_layout()

Fine-tune

In [122]:
whole_model = Sequential([bottleneck_model, top_model])
In [123]:
whole_model.compile("adam", 'mse', metrics=[var_explained])
In [127]:
whole_model.fit(train['inputs']['seq'], y, batch_size=512,
                epochs=100,
                validation_data=(valid['inputs']['seq'], y_valid),
                callbacks=[EarlyStopping(patience=5)])
Train on 61205 samples, validate on 19137 samples
Epoch 1/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.6910 - var_explained: 0.3158 - val_loss: 0.8573 - val_var_explained: 0.1735
Epoch 2/100
61205/61205 [==============================] - 55s 898us/step - loss: 0.6619 - var_explained: 0.3485 - val_loss: 0.8192 - val_var_explained: 0.1749
Epoch 3/100
61205/61205 [==============================] - 55s 900us/step - loss: 0.6298 - var_explained: 0.3824 - val_loss: 0.8426 - val_var_explained: 0.1515
Epoch 4/100
61205/61205 [==============================] - 55s 901us/step - loss: 0.5947 - var_explained: 0.4155 - val_loss: 0.9512 - val_var_explained: 0.1563
Epoch 5/100
61205/61205 [==============================] - 55s 901us/step - loss: 0.5648 - var_explained: 0.4474 - val_loss: 0.8477 - val_var_explained: 0.1414
Epoch 6/100
61205/61205 [==============================] - 55s 902us/step - loss: 0.5450 - var_explained: 0.4743 - val_loss: 0.8559 - val_var_explained: 0.1323
Epoch 7/100
61205/61205 [==============================] - 55s 901us/step - loss: 0.5167 - var_explained: 0.5014 - val_loss: 0.8765 - val_var_explained: 0.1221
Out[127]:
<keras.callbacks.History at 0x7f08c54f6400>
In [128]:
ypred_valid = whole_model.predict(valid['inputs']['seq'])
In [129]:
ypred_valid.shape
Out[129]:
(19137, 3)
In [130]:
fig, axes = plt.subplots(len(assays), 1, figsize=(5, 11), sharex=True, sharey=True)
for i, (a, ax) in enumerate(zip(assays, axes)):
    regression_eval(ypred_valid[:,i], y_valid[:,i], alpha=0.05, task=a, ax=ax);
plt.tight_layout()

Train from scratch

In [141]:
import keras.backend as K
from keras.models import load_model

def reset_weights(model):
    # Re-run each layer's kernel initializer to re-randomize its weights.
    session = K.get_session()
    for layer in model.layers:
        if hasattr(layer, 'kernel_initializer'):
            layer.kernel.initializer.run(session=session)
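
Note that only kernels are re-initialized above; biases keep their fine-tuned values. A hypothetical fuller variant (not what was run here) would also re-run bias initializers:

    def reset_weights_full(model):
        # Re-initialize both kernels and biases where present.
        session = K.get_session()
        for layer in model.layers:
            for attr in ('kernel', 'bias'):
                var = getattr(layer, attr, None)
                if var is not None:
                    var.initializer.run(session=session)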
In [137]:
whole_model.save(f"{ddir}/processed/activity/models/dense/fine-tuned.h5")
In [145]:
reinitialized_model = load_model(f"{ddir}/processed/activity/models/dense/fine-tuned.h5")
In [163]:
reset_weights(reinitialized_model.layers[0])  # bottleneck (BPNet body)
reset_weights(reinitialized_model.layers[1])  # top model (head)
In [165]:
reinitialized_model.compile("adam", 'mse', metrics=[var_explained])
In [166]:
reinitialized_model.fit(train['inputs']['seq'], y, batch_size=512,
                epochs=100,
                validation_data=(valid['inputs']['seq'], y_valid),
                callbacks=[EarlyStopping(patience=5)])
Train on 61205 samples, validate on 19137 samples
Epoch 1/100
61205/61205 [==============================] - 58s 942us/step - loss: 1.2780 - var_explained: -0.2762 - val_loss: 0.9018 - val_var_explained: 0.0977
Epoch 2/100
61205/61205 [==============================] - 55s 895us/step - loss: 0.8525 - var_explained: 0.1490 - val_loss: 0.8654 - val_var_explained: 0.1464
Epoch 3/100
61205/61205 [==============================] - 55s 898us/step - loss: 0.8386 - var_explained: 0.1638 - val_loss: 0.8479 - val_var_explained: 0.1573
Epoch 4/100
61205/61205 [==============================] - 55s 897us/step - loss: 0.8318 - var_explained: 0.1739 - val_loss: 0.8420 - val_var_explained: 0.1624
Epoch 5/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.8190 - var_explained: 0.1854 - val_loss: 0.8296 - val_var_explained: 0.1719
Epoch 6/100
61205/61205 [==============================] - 55s 900us/step - loss: 0.8099 - var_explained: 0.1979 - val_loss: 0.8332 - val_var_explained: 0.1779
Epoch 7/100
61205/61205 [==============================] - 55s 902us/step - loss: 0.7944 - var_explained: 0.2122 - val_loss: 0.8246 - val_var_explained: 0.1844
Epoch 8/100
61205/61205 [==============================] - 55s 898us/step - loss: 0.7846 - var_explained: 0.2242 - val_loss: 0.8429 - val_var_explained: 0.1800
Epoch 9/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7831 - var_explained: 0.2324 - val_loss: 0.8144 - val_var_explained: 0.1911
Epoch 10/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7681 - var_explained: 0.2446 - val_loss: 0.8132 - val_var_explained: 0.1915
Epoch 11/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7558 - var_explained: 0.2576 - val_loss: 0.8215 - val_var_explained: 0.1958
Epoch 12/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7397 - var_explained: 0.2733 - val_loss: 0.8182 - val_var_explained: 0.1868
Epoch 13/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7221 - var_explained: 0.2911 - val_loss: 0.8235 - val_var_explained: 0.1914
Epoch 14/100
61205/61205 [==============================] - 55s 899us/step - loss: 0.7095 - var_explained: 0.3094 - val_loss: 0.8200 - val_var_explained: 0.1966
Epoch 15/100
61205/61205 [==============================] - 55s 901us/step - loss: 0.6958 - var_explained: 0.3258 - val_loss: 0.8235 - val_var_explained: 0.1777
Out[166]:
<keras.callbacks.History at 0x7f0551c6a0f0>
In [167]:
ypred_valid = reinitialized_model.predict(valid['inputs']['seq'])
In [168]:
ypred_valid.shape
Out[168]:
(19137, 3)
In [169]:
fig, axes = plt.subplots(len(assays), 1, figsize=(5, 11), sharex=True, sharey=True)
for i, (a, ax) in enumerate(zip(assays, axes)):
    regression_eval(ypred_valid[:,i], y_valid[:,i], alpha=0.05, task=a, ax=ax);
plt.tight_layout()