Goal

  • [x] analyze the data with BPNet

Notes

  • to run BPNet, we should probably embed each construct in the middle of a random sequence and average the predictions across multiple random flanking sequences
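A minimal numpy sketch of this scheme (`random_dna`, `embed_center`, and `predict_fn` are hypothetical stand-ins; basepair's `random_seq`/`generate_seq` presumably do the embedding for real):

```python
import numpy as np

def random_dna(n, rng):
    # hypothetical helper mirroring basepair's random_seq
    return "".join(rng.choice(list("ACGT"), size=n))

def embed_center(construct, seqlen, rng):
    # place the construct in the middle of a random background sequence
    pad = seqlen - len(construct)
    left = pad // 2
    return random_dna(left, rng) + construct + random_dna(pad - left, rng)

def averaged_prediction(predict_fn, construct, seqlen=1000, n_backgrounds=8, seed=0):
    # average model predictions over several random flanking backgrounds
    rng = np.random.default_rng(seed)
    seqs = [embed_center(construct, seqlen, rng) for _ in range(n_backgrounds)]
    return float(np.mean([predict_fn(s) for s in seqs]))
```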

Tasks

  • [x] get the importance scores for all these sequences
  • [x] visualize the importance scores as well as the footprints
  • [x] correlate importance scores with activity
  • [x] build a simple predictive model
  • [x] compare to their RF model
    • genomic sequences: they trained a gkm-SVM classifier (top 25% vs. bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
      • See Figure 4

Conclusions

  • our model is not as good as theirs on synthetic constructs (however, their model greatly overfits the data: it knows the motif locations)
  • our model trained on genomic sequences achieves AUC=0.83 vs. their 0.75
In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
from basepair.exp.chipnexus.data import load_BarakCohnen_mESC_MPRA, load_BarakCohnen_mESC_MPRA_raw
Using TensorFlow backend.
In [3]:
from basepair.config import create_tf_session
In [5]:
create_tf_session(0)
Out[5]:
<tensorflow.python.client.session.Session at 0x7f93f3c3a0b8>
In [91]:
figures = f"{ddir}/figures/model-evaluation/ChIP-nexus/MPRA"
In [57]:
fig_genome = f"{ddir}/figures/activity/genome"
In [56]:
!mkdir {fig_genome}
In [4]:
dfs_gen, dfs_syn = load_BarakCohnen_mESC_MPRA()
df_gen, df_syn = load_BarakCohnen_mESC_MPRA_raw()
In [6]:
dfs_gen.head()
Out[6]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM interval CRE_type chrom start end expression_fc expression_fc_SEM expression_fc_log2
0 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 1.2789 1.3300 1.1202 0.1134 0.0980 chr10:108456620-10845... All_Mutated chr10 108456620 108456701 1.3045 0.1134 0.3835
1 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 2.5366 2.0188 2.4672 1.0134 1.3304 chr10:108456620-10845... Genomic chr10 108456620 108456701 2.2777 1.0134 1.1876
2 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.8912 0.7321 0.8639 0.0673 0.1322 chr10:110940873-11094... All_Mutated chr10 110940873 110940954 0.8116 0.0673 -0.3011
3 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.9943 1.0337 1.1185 0.1044 0.0994 chr10:110940873-11094... Genomic chr10 110940873 110940954 1.0140 0.1044 0.0201
4 chr10:115312950-11531... cagctatagtcccagtcTGTT... 0.9246 1.0169 1.0652 0.1435 0.0881 chr10:115312950-11531... All_Mutated chr10 115312950 115313031 0.9707 0.1435 -0.0429
In [7]:
dfs_syn.head()
Out[7]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM CRE_elements expression_fc expression_fc_SEM expression_fc_log2 Klf4 Klf4_orientation Sox2 Sox2_orientation Essrb Essrb_orientation Oct4 Oct4_orientation
0 Ef-Kf-Of-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 2.9462 3.5538 3.6890 0.2847 0.2054 Ef-Kf-Of-Sf 3.2500 0.2847 1.7004 True fwd True fwd True fwd True fwd
1 Ef-Kf-Of-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 3.7496 5.1441 4.6180 0.6682 0.6283 Ef-Kf-Of-Sr 4.4469 0.6682 2.1528 True fwd True rev True fwd True fwd
2 Ef-Kf-Of_synthetic AGCTACGTTCAAGGTCACGTA... 2.0628 2.7998 2.2773 0.2142 0.3677 Ef-Kf-Of 2.4313 0.2142 1.2817 True fwd False rev True fwd True fwd
3 Ef-Kf-Or-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 4.5477 5.1814 4.9207 0.7599 0.7516 Ef-Kf-Or-Sf 4.8646 0.7599 2.2823 True fwd True fwd True fwd True rev
4 Ef-Kf-Or-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 5.4429 7.1354 6.6851 0.9533 0.9347 Ef-Kf-Or-Sr 6.2892 0.9533 2.6529 True fwd True rev True fwd True rev
In [8]:
df_gen.head()
Out[8]:
CRE_id BC_seq Sequence RNA_counts_rep1 RNA_counts_rep2 RNA_counts_rep3 DNA_counts BC_exp_rep1 BC_exp_rep2 BC_exp_rep3 Rep1_BC_norm Rep2_BC_norm Rep3_BC_norm Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM
0 Basal AAAGACGCG NaN 247.2739 179.0611 160.3011 229.3460 1.0782 0.7807 0.6989 NaN NaN NaN NaN NaN NaN NaN NaN
1 Basal AAATCAAGC NaN 88.0702 107.7321 140.7916 181.2556 0.4859 0.5944 0.7768 NaN NaN NaN NaN NaN NaN NaN NaN
2 Basal AAATGGTCC NaN 205.2365 170.3033 173.5353 313.6059 0.6544 0.5430 0.5534 NaN NaN NaN NaN NaN NaN NaN NaN
3 Basal AACAAGAGG NaN 172.0582 236.9896 114.2612 200.5631 0.8579 1.1816 0.5697 NaN NaN NaN NaN NaN NaN NaN NaN
4 Basal AACGACCCT NaN 339.0788 209.2388 217.3385 374.8905 0.9045 0.5581 0.5797 NaN NaN NaN NaN NaN NaN NaN NaN
In [9]:
df_gen.groupby("CRE_id").size().value_counts()
Out[9]:
8      810
112      1
dtype: int64

There are 8 different barcodes (BC_seq) per CRE sequence

Create full sequences

Gen

Genomic sequences were represented in the pool by 150 bp oligos with the following layout:

GACTTACATTAGGGCCCGT[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCGGACTACGATACTG

where [SEQ] is either a reference (gWT) or mutated (gMUT) genomic sequence of 81-82 bp.

Syn

CTTCTACTACTAGGGCCCA[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCTACATGCTAGTTCATG

where [SEQ] is a 40-80 bp synthetic element comprised of concatenated 20 bp building blocks of pluripotency sites, as described previously, with the fifth position of the KLF4 site changed to ‘T’ to facilitate cloning (Fiore and Cohen 2016) . [FILL] is a random filler sequence of variable length to bring the total length of each sequence to 150 bp, and [BC] is a random 9 bp barcode.

In [10]:
from basepair.exp.chipnexus.simulate import generate_seq, random_seq

def gen_padded_sequence(sequence, barcode, seqlen=1000):
    """Generate the MPRA construct sequence for genomic sequences."""
    # 59 bp of fixed adapter: 19 (prefix) + 6 (AAGCTT) + 13 (GAATTCTCTAGAC) + 21 (suffix)
    fill = random_seq(150 - len(sequence) - len(barcode) - 59)
    core_seq = f"GACTTACATTAGGGCCCGT{sequence}AAGCTT{fill}GAATTCTCTAGAC{barcode}TGAGCTCGGACTACGATACTG"
    assert len(core_seq) == 150
    return generate_seq(core_seq, seqlen=seqlen)


def syn_padded_sequence(sequence, barcode, seqlen=1000):
    """Generate the MPRA construct sequence for synthetic sequences."""
    # 61 bp of fixed adapter: 19 (prefix) + 6 (AAGCTT) + 13 (GAATTCTCTAGAC) + 23 (suffix)
    fill = random_seq(150 - len(sequence) - len(barcode) - 61)
    core_seq = f'CTTCTACTACTAGGGCCCA{sequence}AAGCTT{fill}GAATTCTCTAGAC{barcode}TGAGCTCTACATGCTAGTTCATG'
    assert len(core_seq) == 150
    return generate_seq(core_seq, seqlen=seqlen)

def get_core_seq_ij(seqlen, core_seq_len=150):
    start = (seqlen - core_seq_len) // 2
    return start, start + core_seq_len
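A quick sanity check of the index arithmetic (the function is copied here so the snippet is self-contained): for the default 1000 bp input, the 150 bp core occupies positions 425-575.

```python
def get_core_seq_ij(seqlen, core_seq_len=150):
    # copied from the cell above
    start = (seqlen - core_seq_len) // 2
    return start, start + core_seq_len

i, j = get_core_seq_ij(1000)
# i, j == 425, 575: the 150 bp construct sits in the middle of the 1000 bp input
```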
In [11]:
a = gen_padded_sequence("ACGTA", "ACGST")
assert len(a) == 1000
In [12]:
b = syn_padded_sequence("ACGTA", "ACGST")
assert len(b) == 1000

Get the importance scores for all these sequences

In [13]:
from basepair.seqmodel import SeqModel
In [14]:
mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE,[1,50],TRUE/'
In [15]:
# mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,gw,OSNK,1,0,0,FALSE,same,0.5,64,25,0.001,9,FALSE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/seq,peaks,OSN,0,10,1,FALSE,same,0.5,64,50,0.004,9,FALSE,[1,50],TRUE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE-2/'
In [16]:
bpnet = SeqModel.from_mdir(mdir)
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
2019-03-13 09:04:50,906 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2019-03-13 09:04:58,927 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

First, don't average across multiple random flanking sequences

Build a simple predictive model based on motif instance locations

In [17]:
from keras.models import Model, Sequential
import keras.layers as kl
from concise.preprocessing import encodeDNA
from concise.preprocessing.sequence import one_hot2string, DNA
In [18]:
bottleneck_model = bpnet.bottleneck_model()
In [19]:
bpnet_seq_syn = encodeDNA([syn_padded_sequence(s, "AAAGACGCG") for s in dfs_syn.Sequence.str.upper()])
bpnet_seq_gen = encodeDNA([gen_padded_sequence(s, "AAAGACGCG") for s in dfs_gen.Sequence.str.upper()])
In [20]:
import skimage.measure
In [21]:
%time bpnet_bottleneck_syn = bottleneck_model.predict(bpnet_seq_syn)
CPU times: user 3.08 s, sys: 772 ms, total: 3.85 s
Wall time: 2.65 s
In [22]:
bpnet_bottleneck_syn.shape
Out[22]:
(622, 1000, 64)
In [23]:
%time bpnet_bottleneck_gen = bottleneck_model.predict(bpnet_seq_gen)
CPU times: user 2.72 s, sys: 224 ms, total: 2.94 s
Wall time: 603 ms
In [24]:
bpnet_bottleneck_syn.shape
Out[24]:
(622, 1000, 64)
In [25]:
import numpy as np

# get the features: max-pool the bottleneck activations over the core sequence
def pool_bottleneck(bottleneck, pool_size=37, outlen=80, agg_fn=np.max):
    i, j = get_core_seq_ij(bottleneck.shape[1])
    i += 19  # skip the 19 bp constant adapter prefix; keep only the [SEQ] region
    j = i + outlen
    out = skimage.measure.block_reduce(bottleneck[:, i:j, :, np.newaxis], (1, pool_size, 1, 1), agg_fn)
    return out.reshape((out.shape[0], -1))
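To make the pooling step concrete, here is an equivalent pure-numpy version for the evenly divisible case (`block_reduce` additionally pads ragged tails, e.g. 80 positions with pool_size=37 yield 3 blocks):

```python
import numpy as np

def max_pool_flatten(x, pool_size):
    # non-overlapping max pooling over the position axis, then flatten per example;
    # matches skimage.measure.block_reduce when positions % pool_size == 0
    n, length, channels = x.shape
    blocks = length // pool_size
    return x.reshape(n, blocks, pool_size, channels).max(axis=2).reshape(n, -1)

x = np.arange(24, dtype=float).reshape(1, 12, 2)   # (batch, positions, channels)
feat = max_pool_flatten(x, pool_size=4)
# feat == [[ 6.,  7., 14., 15., 22., 23.]]  (block maxima per channel)
```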
In [26]:
## train a simple model
In [27]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

from basepair.plots import regression_eval

Synthetic

In the paper, they achieved R-squared of 0.87

In [28]:
bpnet_bottleneck_syn.shape
Out[28]:
(622, 1000, 64)
In [6]:
def regression_eval(y_true, y_pred, alpha=0.5, markersize=2, task="", ax=None, same_lim=False, loglog=False):
    from scipy.stats import pearsonr, spearmanr
    if ax is None:
        fig, ax = plt.subplots(1)
    xmax = max([y_true.max(), y_pred.max()])
    xmin = min([y_true.min(), y_pred.min()])
    
    if loglog:
        pearson, pearson_pval = pearsonr(np.log10(y_true), np.log10(y_pred))
        spearman, spearman_pval = spearmanr(np.log10(y_true), np.log10(y_pred))
    else:
        pearson, pearson_pval = pearsonr(y_true, y_pred)
        spearman, spearman_pval = spearmanr(y_true, y_pred)
    if loglog:
        plt_fn = ax.loglog
    else:
        plt_fn = ax.plot
        
    plt_fn(y_pred, y_true, ".", 
           markersize=markersize, 
           rasterized=True,
           alpha=alpha)
    ax.set_xlabel("Predicted (FC)")
    ax.set_ylabel("Observed (FC)")
    
    if same_lim:
        ax.set_xlim((xmin, xmax))
        ax.set_ylim((xmin, xmax))
    rp = r"$R_{p}$"
    rs = r"$R_{s}$"
    ax.set_title(task)
    ax.text(.95, .2, f"{rp}={pearson:.2f}",
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes)
    ax.text(.95, .05, f"{rs}={spearman:.2f}", 
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes)
In [7]:
from basepair.imports import *
In [8]:
paper_config()
In [85]:
bpnet_bottleneck_syn_feat = pool_bottleneck(bpnet_bottleneck_syn, pool_size=15, outlen=80, agg_fn=np.max)

preds = cross_val_predict(RandomForestRegressor(100), bpnet_bottleneck_syn_feat, dfs_syn.expression_fc_log2.values, cv=5, n_jobs=-1)
In [93]:
!mkdir -p {figures}
In [108]:
import matplotlib.ticker as ticker
fig, ax = plt.subplots(1, 1, figsize=get_figsize(0.25, aspect=1))
regression_eval(2**dfs_syn.expression_fc_log2.values, 2**preds, markersize=5, alpha=0.4, ax=ax);
xrange = [0.6, 2**4]
ax.set_ylim(xrange)
ax.set_xlim(xrange)
ax.plot(xrange, xrange, c='grey', alpha=0.2)
ax.set_xscale('log', basex=2)
ax.set_yscale('log', basey=2)
ax.xaxis.set_major_locator(ticker.LogLocator(base=2, numticks=5))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%.0f'))
ax.yaxis.set_major_locator(ticker.LogLocator(base=2, numticks=5))
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.0f'))
plt.minorticks_off()
fig.savefig(f"{figures}/syn.obs-vs-pred.pdf")
fig.savefig(f"{figures}/syn.obs-vs-pred.png")

Genomic

In [32]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 10, outlen=80)

# bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)

preds = cross_val_predict(RandomForestRegressor(100), bpnet_bottleneck_gen_feat, dfs_gen.expression_fc_log2.values, cv=5, n_jobs=-1)

regression_eval(dfs_gen.expression_fc_log2, preds, alpha=0.4);
In [33]:
subset = dfs_gen.CRE_type=='Genomic'
regression_eval(dfs_gen.expression_fc_log2[subset], preds[subset], alpha=0.4);

Cross-data comparison

Trained on genomic data, evaluated on synthetic

In [34]:
rf_gen = RandomForestRegressor(100, n_jobs=-1)
rf_gen.fit(bpnet_bottleneck_gen_feat, dfs_gen.expression_fc_log2.values)
Out[34]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [35]:
regression_eval(dfs_syn.expression_fc_log2, rf_gen.predict(bpnet_bottleneck_syn_feat), alpha=0.4);

Trained on synthetic data, evaluated on genomic

In [36]:
rf_syn = RandomForestRegressor(100)
rf_syn.fit(bpnet_bottleneck_syn_feat, dfs_syn.expression_fc_log2.values)

regression_eval(dfs_gen.expression_fc_log2, rf_syn.predict(bpnet_bottleneck_gen_feat), alpha=0.4);

Compare to their RF model

  • genomic sequences: they trained a gkm-SVM classifier (top 25% vs. bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
    • See Figure 4
In [36]:
from basepair.stats import perc

from concise.eval_metrics import auc, auprc
In [618]:
genomic = dfs_gen.CRE_type=='Genomic'
In [619]:
neg = perc(dfs_gen.expression_fc_log2.values[genomic]) < 25
pos = perc(dfs_gen.expression_fc_log2.values[genomic]) > 75
In [620]:
subset = pos | neg
In [621]:
y = pos[subset]
In [622]:
y.mean()
Out[622]:
0.5024875621890548
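`basepair.stats.perc` is assumed here to return percentile ranks in [0, 100]; under that assumption, a pure-numpy sketch reproduces the ~50% positive rate seen above:

```python
import numpy as np

def perc(x):
    # assumed behaviour of basepair.stats.perc: percentile rank of each value,
    # 0 for the minimum and 100 for the maximum
    ranks = np.argsort(np.argsort(x))
    return 100.0 * ranks / (len(x) - 1)

vals = np.random.default_rng(0).normal(size=1000)
neg = perc(vals) < 25   # bottom quartile
pos = perc(vals) > 75   # top quartile
subset = pos | neg
y = pos[subset]
# y.mean() == 0.5: top vs. bottom quartile gives balanced classes
```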

Bottleneck features

In [623]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 5, outlen=80)
#bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)

preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')

print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[623]:
(0.8318316831683168, 0.7877723788876447)
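As a sanity check on the metric itself: AUROC is the probability that a randomly chosen positive outranks a randomly chosen negative (the Mann-Whitney U formulation), which a few lines of numpy reproduce on a toy example:

```python
import numpy as np

def auroc(y_true, y_score):
    # fraction of positive-negative pairs ranked correctly (no ties here)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
auroc(y, p)   # 8 of the 9 positive-negative pairs are correctly ordered -> 8/9
```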

Both

In [624]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 5, outlen=80)
bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)
preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')
print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[624]:
(0.8327227722772278, 0.7798736186062161)

Only importance scores

In [625]:
bpnet_bottleneck_gen_feat = values_gen.values
preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')
print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[625]:
(0.7976732673267327, 0.7900365048297909)

Genomic scatterplots

In [9]:
from basepair.utils import read_pkl
In [49]:
mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE,[1,50],TRUE/'
In [46]:
#mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,gw,OSNK,1,0,0,FALSE,same,0.5,64,25,0.001,9,FALSE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/seq,peaks,OSN,0,10,1,FALSE,same,0.5,64,50,0.004,9,FALSE,[1,50],TRUE/'

mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE-2/'
In [50]:
preds = read_pkl(f"{mdir}/activity//peaks/H3K27ac;PolII/model_pool/pool_type=max;n_hidden=[64];pool_size=50/test.preds.pkl")
In [51]:
from sklearn.linear_model import LinearRegression
In [58]:
fig, axes = plt.subplots(1, 2, figsize=get_figsize(.7, 1/2))

for i, c in enumerate(preds['columns']):
    ax = axes[i]
    #lm = LinearRegression()
    #lm.fit(preds['y_pred'][:, i, np.newaxis], preds['y_true'][:, i])
    #y_pred = lm.predict(preds['y_pred'][:,i, np.newaxis])
    y_pred = preds['y_pred'][:, i]
    regression_eval(np.exp(preds['y_true'][:, i]),
                    np.exp(y_pred),
                    same_lim=True,
                    markersize=2, alpha=0.4, ax=ax, loglog=True);
    ax.set_title(c)
    plt.tight_layout()
fig.savefig(f"{fig_genome}/H3K27ac;PolII.obs-vs-pred.pdf")
fig.savefig(f"{fig_genome}/H3K27ac;PolII.obs-vs-pred.png")
In [116]:
ls {mdir}/'activity//peaks/H3K27ac;PolII/model_pool/pool_type=global_avg;n_hidden=[64];pool_size=null/test.preds.pkl'
kwargs.json  test.metrics.json   train.preds.pkl
model.h5     test.preds.pkl      valid.metrics.json
preproc.pkl  train.metrics.json  valid.preds.pkl