Goal

  • [x] analyze the data with BPNet

Notes

  • to run BPNet, we should probably embed each construct in the middle of a random sequence and average the predictions across multiple random flanking sequences
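A minimal numpy sketch of this scheme (`random_dna`, `embed_center`, and `predict_fn` are hypothetical stand-ins; basepair's `random_seq`/`generate_seq` presumably do the embedding for real):

```python
import numpy as np

def random_dna(n, rng):
    # hypothetical helper mirroring basepair's random_seq
    return "".join(rng.choice(list("ACGT"), size=n))

def embed_center(construct, seqlen, rng):
    # place the construct in the middle of a random background sequence
    pad = seqlen - len(construct)
    left = pad // 2
    return random_dna(left, rng) + construct + random_dna(pad - left, rng)

def averaged_prediction(predict_fn, construct, seqlen=1000, n_backgrounds=8, seed=0):
    # average model predictions over several random flanking backgrounds
    rng = np.random.default_rng(seed)
    seqs = [embed_center(construct, seqlen, rng) for _ in range(n_backgrounds)]
    return float(np.mean([predict_fn(s) for s in seqs]))
```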

Tasks

  • [x] get the importance scores for all these sequences
  • [x] visualize the importance scores as well as the footprints
  • [x] correlate importance scores with activity
  • [x] build a simple predictive model
  • [x] compare to their RF model
    • genomic sequences: they trained a gkm-SVM classifier (top 25% vs. bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
      • See Figure 4

Conclusions

  • our model is not as good as theirs on synthetic constructs (however, their model greatly overfits the data: it knows the motif locations)
  • our model trained on genomic sequences achieves AUC=0.83 vs. their 0.75
In [1]:
import warnings
warnings.filterwarnings("ignore")
In [2]:
from basepair.exp.chipnexus.data import load_BarakCohnen_mESC_MPRA, load_BarakCohnen_mESC_MPRA_raw
Using TensorFlow backend.
In [3]:
from basepair.config import create_tf_session
In [5]:
create_tf_session(0)
Out[5]:
<tensorflow.python.client.session.Session at 0x7f93f3c3a0b8>
In [91]:
figures = f"{ddir}/figures/model-evaluation/ChIP-nexus/MPRA"
In [57]:
fig_genome = f"{ddir}/figures/activity/genome"
In [56]:
!mkdir {fig_genome}
In [4]:
dfs_gen, dfs_syn = load_BarakCohnen_mESC_MPRA()
df_gen, df_syn = load_BarakCohnen_mESC_MPRA_raw()
In [6]:
dfs_gen.head()
Out[6]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM interval CRE_type chrom start end expression_fc expression_fc_SEM expression_fc_log2
0 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 1.2789 1.3300 1.1202 0.1134 0.0980 chr10:108456620-10845... All_Mutated chr10 108456620 108456701 1.3045 0.1134 0.3835
1 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 2.5366 2.0188 2.4672 1.0134 1.3304 chr10:108456620-10845... Genomic chr10 108456620 108456701 2.2777 1.0134 1.1876
2 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.8912 0.7321 0.8639 0.0673 0.1322 chr10:110940873-11094... All_Mutated chr10 110940873 110940954 0.8116 0.0673 -0.3011
3 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.9943 1.0337 1.1185 0.1044 0.0994 chr10:110940873-11094... Genomic chr10 110940873 110940954 1.0140 0.1044 0.0201
4 chr10:115312950-11531... cagctatagtcccagtcTGTT... 0.9246 1.0169 1.0652 0.1435 0.0881 chr10:115312950-11531... All_Mutated chr10 115312950 115313031 0.9707 0.1435 -0.0429
In [7]:
dfs_syn.head()
Out[7]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM CRE_elements expression_fc expression_fc_SEM expression_fc_log2 Klf4 Klf4_orientation Sox2 Sox2_orientation Essrb Essrb_orientation Oct4 Oct4_orientation
0 Ef-Kf-Of-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 2.9462 3.5538 3.6890 0.2847 0.2054 Ef-Kf-Of-Sf 3.2500 0.2847 1.7004 True fwd True fwd True fwd True fwd
1 Ef-Kf-Of-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 3.7496 5.1441 4.6180 0.6682 0.6283 Ef-Kf-Of-Sr 4.4469 0.6682 2.1528 True fwd True rev True fwd True fwd
2 Ef-Kf-Of_synthetic AGCTACGTTCAAGGTCACGTA... 2.0628 2.7998 2.2773 0.2142 0.3677 Ef-Kf-Of 2.4313 0.2142 1.2817 True fwd False rev True fwd True fwd
3 Ef-Kf-Or-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 4.5477 5.1814 4.9207 0.7599 0.7516 Ef-Kf-Or-Sf 4.8646 0.7599 2.2823 True fwd True fwd True fwd True rev
4 Ef-Kf-Or-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 5.4429 7.1354 6.6851 0.9533 0.9347 Ef-Kf-Or-Sr 6.2892 0.9533 2.6529 True fwd True rev True fwd True rev
In [8]:
df_gen.head()
Out[8]:
CRE_id BC_seq Sequence RNA_counts_rep1 RNA_counts_rep2 RNA_counts_rep3 DNA_counts BC_exp_rep1 BC_exp_rep2 BC_exp_rep3 Rep1_BC_norm Rep2_BC_norm Rep3_BC_norm Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM
0 Basal AAAGACGCG NaN 247.2739 179.0611 160.3011 229.3460 1.0782 0.7807 0.6989 NaN NaN NaN NaN NaN NaN NaN NaN
1 Basal AAATCAAGC NaN 88.0702 107.7321 140.7916 181.2556 0.4859 0.5944 0.7768 NaN NaN NaN NaN NaN NaN NaN NaN
2 Basal AAATGGTCC NaN 205.2365 170.3033 173.5353 313.6059 0.6544 0.5430 0.5534 NaN NaN NaN NaN NaN NaN NaN NaN
3 Basal AACAAGAGG NaN 172.0582 236.9896 114.2612 200.5631 0.8579 1.1816 0.5697 NaN NaN NaN NaN NaN NaN NaN NaN
4 Basal AACGACCCT NaN 339.0788 209.2388 217.3385 374.8905 0.9045 0.5581 0.5797 NaN NaN NaN NaN NaN NaN NaN NaN
In [9]:
df_gen.groupby("CRE_id").size().value_counts()
Out[9]:
8      810
112      1
dtype: int64

There are 8 different barcodes (BC_seq) per CRE sequence

Create full sequences

Gen

Genomic sequences were represented in the pool by 150 bp oligos with the following layout:

GACTTACATTAGGGCCCGT[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCGGACTACGATACTG

where [SEQ] is either a reference (gWT) or mutated (gMUT) genomic sequence of 81-82 bp.

Syn

CTTCTACTACTAGGGCCCA[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCTACATGCTAGTTCATG

where [SEQ] is a 40-80 bp synthetic element comprised of concatenated 20 bp building blocks of pluripotency sites, as described previously, with the fifth position of the KLF4 site changed to ‘T’ to facilitate cloning (Fiore and Cohen 2016) . [FILL] is a random filler sequence of variable length to bring the total length of each sequence to 150 bp, and [BC] is a random 9 bp barcode.

In [10]:
from basepair.exp.chipnexus.simulate import generate_seq, random_seq

def gen_padded_sequence(sequence, barcode, seqlen=1000):
    """Generate the MPRA construct sequence for genomic sequences."""
    # 59 bp of fixed adapter: 19 (prefix) + 6 (AAGCTT) + 13 (GAATTCTCTAGAC) + 21 (suffix)
    fill = random_seq(150 - len(sequence) - len(barcode) - 59)
    core_seq = f"GACTTACATTAGGGCCCGT{sequence}AAGCTT{fill}GAATTCTCTAGAC{barcode}TGAGCTCGGACTACGATACTG"
    assert len(core_seq) == 150
    return generate_seq(core_seq, seqlen=seqlen)


def syn_padded_sequence(sequence, barcode, seqlen=1000):
    """Generate the MPRA construct sequence for synthetic sequences."""
    # 61 bp of fixed adapter: 19 (prefix) + 6 (AAGCTT) + 13 (GAATTCTCTAGAC) + 23 (suffix)
    fill = random_seq(150 - len(sequence) - len(barcode) - 61)
    core_seq = f'CTTCTACTACTAGGGCCCA{sequence}AAGCTT{fill}GAATTCTCTAGAC{barcode}TGAGCTCTACATGCTAGTTCATG'
    assert len(core_seq) == 150
    return generate_seq(core_seq, seqlen=seqlen)

def get_core_seq_ij(seqlen, core_seq_len=150):
    start = (seqlen - core_seq_len) // 2
    return start, start + core_seq_len
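A quick sanity check of the index arithmetic (the function is copied here so the snippet is self-contained): for the default 1000 bp input, the 150 bp core occupies positions 425-575.

```python
def get_core_seq_ij(seqlen, core_seq_len=150):
    # copied from the cell above
    start = (seqlen - core_seq_len) // 2
    return start, start + core_seq_len

i, j = get_core_seq_ij(1000)
# i, j == 425, 575: the 150 bp construct sits in the middle of the 1000 bp input
```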
In [11]:
a = gen_padded_sequence("ACGTA", "ACGST")
assert len(a) == 1000
In [12]:
b = syn_padded_sequence("ACGTA", "ACGST")
assert len(b) == 1000

Get the importance scores for all these sequences

In [13]:
from basepair.seqmodel import SeqModel
In [14]:
mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE,[1,50],TRUE/'
In [15]:
# mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,gw,OSNK,1,0,0,FALSE,same,0.5,64,25,0.001,9,FALSE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/seq,peaks,OSN,0,10,1,FALSE,same,0.5,64,50,0.004,9,FALSE,[1,50],TRUE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE-2/'
In [16]:
bpnet = SeqModel.from_mdir(mdir)
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
2019-03-13 09:04:50,906 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2019-03-13 09:04:58,927 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

First, don't average across multiple random flanking sequences

Build a simple predictive model based on motif instance locations

In [17]:
from keras.models import Model, Sequential
import keras.layers as kl
from concise.preprocessing import encodeDNA
from concise.preprocessing.sequence import one_hot2string, DNA
In [18]:
bottleneck_model = bpnet.bottleneck_model()
In [19]:
bpnet_seq_syn = encodeDNA([syn_padded_sequence(s, "AAAGACGCG") for s in dfs_syn.Sequence.str.upper()])
bpnet_seq_gen = encodeDNA([gen_padded_sequence(s, "AAAGACGCG") for s in dfs_gen.Sequence.str.upper()])
In [20]:
import skimage.measure
In [21]:
%time bpnet_bottleneck_syn = bottleneck_model.predict(bpnet_seq_syn)
CPU times: user 3.08 s, sys: 772 ms, total: 3.85 s
Wall time: 2.65 s
In [22]:
bpnet_bottleneck_syn.shape
Out[22]:
(622, 1000, 64)
In [23]:
%time bpnet_bottleneck_gen = bottleneck_model.predict(bpnet_seq_gen)
CPU times: user 2.72 s, sys: 224 ms, total: 2.94 s
Wall time: 603 ms
In [24]:
bpnet_bottleneck_syn.shape
Out[24]:
(622, 1000, 64)
In [25]:
import numpy as np

# get the features: max-pool the bottleneck activations over the core sequence
def pool_bottleneck(bottleneck, pool_size=37, outlen=80, agg_fn=np.max):
    i, j = get_core_seq_ij(bottleneck.shape[1])
    i += 19  # skip the 19 bp constant adapter prefix; keep only the [SEQ] region
    j = i + outlen
    out = skimage.measure.block_reduce(bottleneck[:, i:j, :, np.newaxis], (1, pool_size, 1, 1), agg_fn)
    return out.reshape((out.shape[0], -1))
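To make the pooling step concrete, here is an equivalent pure-numpy version for the evenly divisible case (`block_reduce` additionally pads ragged tails, e.g. 80 positions with pool_size=37 yield 3 blocks):

```python
import numpy as np

def max_pool_flatten(x, pool_size):
    # non-overlapping max pooling over the position axis, then flatten per example;
    # matches skimage.measure.block_reduce when positions % pool_size == 0
    n, length, channels = x.shape
    blocks = length // pool_size
    return x.reshape(n, blocks, pool_size, channels).max(axis=2).reshape(n, -1)

x = np.arange(24, dtype=float).reshape(1, 12, 2)   # (batch, positions, channels)
feat = max_pool_flatten(x, pool_size=4)
# feat == [[ 6.,  7., 14., 15., 22., 23.]]  (block maxima per channel)
```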
In [26]:
## train a simple model
In [27]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

from basepair.plots import regression_eval

Synthetic

In the paper, they achieved R-squared of 0.87

In [28]:
bpnet_bottleneck_syn.shape
Out[28]:
(622, 1000, 64)
In [6]:
def regression_eval(y_true, y_pred, alpha=0.5, markersize=2, task="", ax=None, same_lim=False, loglog=False):
    from scipy.stats import pearsonr, spearmanr
    if ax is None:
        fig, ax = plt.subplots(1)
    xmax = max([y_true.max(), y_pred.max()])
    xmin = min([y_true.min(), y_pred.min()])
    
    if loglog:
        pearson, pearson_pval = pearsonr(np.log10(y_true), np.log10(y_pred))
        spearman, spearman_pval = spearmanr(np.log10(y_true), np.log10(y_pred))
    else:
        pearson, pearson_pval = pearsonr(y_true, y_pred)
        spearman, spearman_pval = spearmanr(y_true, y_pred)
    if loglog:
        plt_fn = ax.loglog
    else:
        plt_fn = ax.plot
        
    plt_fn(y_pred, y_true, ".", 
           markersize=markersize, 
           rasterized=True,
           alpha=alpha)
    ax.set_xlabel("Predicted (FC)")
    ax.set_ylabel("Observed (FC)")
    
    if same_lim:
        ax.set_xlim((xmin, xmax))
        ax.set_ylim((xmin, xmax))
    rp = r"$R_{p}$"
    rs = r"$R_{s}$"
    ax.set_title(task)
    ax.text(.95, .2, f"{rp}={pearson:.2f}",
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes)
    ax.text(.95, .05, f"{rs}={spearman:.2f}", 
            verticalalignment='bottom',
            horizontalalignment='right',
            transform=ax.transAxes)
In [7]:
from basepair.imports import *
In [8]:
paper_config()
In [85]:
bpnet_bottleneck_syn_feat = pool_bottleneck(bpnet_bottleneck_syn, pool_size=15, outlen=80, agg_fn=np.max)

preds = cross_val_predict(RandomForestRegressor(100), bpnet_bottleneck_syn_feat, dfs_syn.expression_fc_log2.values, cv=5, n_jobs=-1)
In [93]:
!mkdir -p {figures}
In [108]:
import matplotlib.ticker as ticker
fig, ax = plt.subplots(1, 1, figsize=get_figsize(0.25, aspect=1))
regression_eval(2**dfs_syn.expression_fc_log2.values, 2**preds, markersize=5, alpha=0.4, ax=ax);
xrange = [0.6, 2**4]
ax.set_ylim(xrange)
ax.set_xlim(xrange)
ax.plot(xrange, xrange, c='grey', alpha=0.2)
ax.set_xscale('log', basex=2)
ax.set_yscale('log', basey=2)
ax.xaxis.set_major_locator(ticker.LogLocator(base=2, numticks=5))
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%.0f'))
ax.yaxis.set_major_locator(ticker.LogLocator(base=2, numticks=5))
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.0f'))
plt.minorticks_off()
fig.savefig(f"{figures}/syn.obs-vs-pred.pdf")
fig.savefig(f"{figures}/syn.obs-vs-pred.png")

Genomic

In [32]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 10, outlen=80)

# bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)

preds = cross_val_predict(RandomForestRegressor(100), bpnet_bottleneck_gen_feat, dfs_gen.expression_fc_log2.values, cv=5, n_jobs=-1)

regression_eval(dfs_gen.expression_fc_log2, preds, alpha=0.4);
In [33]:
subset = dfs_gen.CRE_type=='Genomic'
regression_eval(dfs_gen.expression_fc_log2[subset], preds[subset], alpha=0.4);

Cross-data comparison

Trained on genomic data, evaluated on synthetic

In [34]:
rf_gen = RandomForestRegressor(100, n_jobs=-1)
rf_gen.fit(bpnet_bottleneck_gen_feat, dfs_gen.expression_fc_log2.values)
Out[34]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [35]:
regression_eval(dfs_syn.expression_fc_log2, rf_gen.predict(bpnet_bottleneck_syn_feat), alpha=0.4);

Trained on synthetic data, evaluated on genomic

In [36]:
rf_syn = RandomForestRegressor(100)
rf_syn.fit(bpnet_bottleneck_syn_feat, dfs_syn.expression_fc_log2.values)

regression_eval(dfs_gen.expression_fc_log2, rf_syn.predict(bpnet_bottleneck_gen_feat), alpha=0.4);

Compare to their RF model

  • genomic sequences: they trained a gkm-SVM classifier (top 25% vs. bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
    • See Figure 4
In [36]:
from basepair.stats import perc

from concise.eval_metrics import auc, auprc
In [618]:
genomic = dfs_gen.CRE_type=='Genomic'
In [619]:
neg = perc(dfs_gen.expression_fc_log2.values[genomic]) < 25
pos = perc(dfs_gen.expression_fc_log2.values[genomic]) > 75
In [620]:
subset = pos | neg
In [621]:
y = pos[subset]
In [622]:
y.mean()
Out[622]:
0.5024875621890548
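`basepair.stats.perc` is assumed here to return percentile ranks in [0, 100]; under that assumption, a pure-numpy sketch reproduces the ~50% positive rate seen above:

```python
import numpy as np

def perc(x):
    # assumed behaviour of basepair.stats.perc: percentile rank of each value,
    # 0 for the minimum and 100 for the maximum
    ranks = np.argsort(np.argsort(x))
    return 100.0 * ranks / (len(x) - 1)

vals = np.random.default_rng(0).normal(size=1000)
neg = perc(vals) < 25   # bottom quartile
pos = perc(vals) > 75   # top quartile
subset = pos | neg
y = pos[subset]
# y.mean() == 0.5: top vs. bottom quartile gives balanced classes
```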

Bottleneck features

In [623]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 5, outlen=80)
#bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)

preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')

print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[623]:
(0.8318316831683168, 0.7877723788876447)
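As a sanity check on the metric itself: AUROC is the probability that a randomly chosen positive outranks a randomly chosen negative (the Mann-Whitney U formulation), which a few lines of numpy reproduce on a toy example:

```python
import numpy as np

def auroc(y_true, y_score):
    # fraction of positive-negative pairs ranked correctly (no ties here)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
auroc(y, p)   # 8 of the 9 positive-negative pairs are correctly ordered -> 8/9
```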

Both

In [624]:
bpnet_bottleneck_gen_feat = pool_bottleneck(bpnet_bottleneck_gen, 5, outlen=80)
bpnet_bottleneck_gen_feat = np.concatenate([values_gen.values, bpnet_bottleneck_gen_feat], axis=1)
preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')
print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[624]:
(0.8327227722772278, 0.7798736186062161)

Only importance scores

In [625]:
bpnet_bottleneck_gen_feat = values_gen.values
preds = cross_val_predict(RandomForestClassifier(1000), bpnet_bottleneck_gen_feat[genomic][subset], y, cv=3, n_jobs=-1, method='predict_proba')
print("auc, auprc")
auc(y, preds[:,1]), auprc(y, preds[:,1])
auc, auprc
Out[625]:
(0.7976732673267327, 0.7900365048297909)

Genomic scatterplots

In [9]:
from basepair.utils import read_pkl
In [49]:
mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE,[1,50],TRUE/'
In [46]:
#mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,gw,OSNK,1,0,0,FALSE,same,0.5,64,25,0.001,9,FALSE/'

# mdir = '../../../src/chipnexus/train/seqmodel/output/seq,peaks,OSN,0,10,1,FALSE,same,0.5,64,50,0.004,9,FALSE,[1,50],TRUE/'

mdir = '../../../src/chipnexus/train/seqmodel/output/nexus,peaks,OSNK,0,10,1,FALSE,same,0.5,64,25,0.004,9,FALSE-2/'
In [50]:
preds = read_pkl(f"{mdir}/activity//peaks/H3K27ac;PolII/model_pool/pool_type=max;n_hidden=[64];pool_size=50/test.preds.pkl")
In [51]:
from sklearn.linear_model import LinearRegression
In [58]:
fig, axes = plt.subplots(1, 2, figsize=get_figsize(.7, 1/2))

for i, c in enumerate(preds['columns']):
    ax = axes[i]
    #lm = LinearRegression()
    #lm.fit(preds['y_pred'][:, i, np.newaxis], preds['y_true'][:, i])
    #y_pred = lm.predict(preds['y_pred'][:,i, np.newaxis])
    y_pred = preds['y_pred'][:, i]
    regression_eval(np.exp(preds['y_true'][:, i]),
                    np.exp(y_pred),
                    same_lim=True,
                    markersize=2, alpha=0.4, ax=ax, loglog=True);
    ax.set_title(c)
    plt.tight_layout()
fig.savefig(f"{fig_genome}/H3K27ac;PolII.obs-vs-pred.pdf")
fig.savefig(f"{fig_genome}/H3K27ac;PolII.obs-vs-pred.png")
In [116]:
ls {mdir}/'activity//peaks/H3K27ac;PolII/model_pool/pool_type=global_avg;n_hidden=[64];pool_size=null/test.preds.pkl'
kwargs.json  test.metrics.json   train.preds.pkl
model.h5     test.preds.pkl      valid.metrics.json
preproc.pkl  train.metrics.json  valid.preds.pkl