Goal

  • develop modisco hit scoring

Tasks

  • find a nice example in which to implement the function

TODO

  • sample a few points from the importance-vs-match scatterplot to see what you are getting
  • try the same analysis for a few factors
    • try to find the right cutoff based on the number of motifs you expect
  • eyeball a few regions (highlight the found seqlets in there)
    • before doing the search, merge all the metaclusters (ignore the importance score magnitude)
In [1]:
from basepair.imports import *
from basepair.samplers import top_sum_count, top_max_count
from basepair.plot.tracks import plot_tracks, filter_tracks
from kipoi.data_utils import get_dataset_item
Using TensorFlow backend.
In [2]:
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")
In [3]:
modisco_dir = model_dir / "modisco/by_peak_tasks/weighted/Oct4"
In [4]:
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
In [5]:
mr.plot_pssm("metacluster_0", "pattern_0", trim_frac=0.08);
mr.plot_pssm("metacluster_0", "pattern_2", trim_frac=0.08);
In [6]:
pattern_name = 'metacluster_0/pattern_0'
In [7]:
pattern = mr.get_pattern(pattern_name)
In [8]:
pattern.plot(rotate_y=0);
In [9]:
im = HDF5Reader(model_dir / "grad.valid.h5")
im.open()
In [10]:
d = im.load_all()
In [11]:
tasks = ['Oct4', 'Sox2', 'Klf4', 'Nanog']
In [12]:
all_hyp_contrib = (im.f['/grads/Oct4/weighted/0'][:] + im.f['/grads/Oct4/weighted/1'][:]) / 2
all_seq = im.f['/inputs/'][:]
all_contrib = all_hyp_contrib * all_seq
all_profile = im.f['/targets/profile/Oct4'][:]
In [13]:
idx = top_max_count(all_profile[:, 400:600], 80)
In [14]:
example_idx = 11128
In [15]:
gt = 'weighted'
hyp_contrib_scores = {task: mean(d['grads'][task][gt]) for task in tasks}
contrib_scores = {task: hyp_contrib_scores[task] * all_seq for task in tasks}
targets = {f"t/{task}": d['targets']['profile'][task] for task in tasks}
In [16]:
viz_dict = filter_tracks({**get_dataset_item(targets, example_idx), **get_dataset_item(contrib_scores, example_idx)}, xlim=[400, 600])
In [17]:
viz_dict.keys()
Out[17]:
dict_keys(['t/Oct4', 't/Sox2', 't/Klf4', 't/Nanog', 'Oct4', 'Sox2', 'Klf4', 'Nanog'])
In [18]:
plot_tracks(viz_dict);

Focus on one specific example

In [19]:
contrib = all_contrib[example_idx, 400:600]
hyp_contrib = all_hyp_contrib[example_idx, 400:600]
seq = all_seq[example_idx, 400:600]

t = dict(contrib=contrib,
         hyp_contrib=hyp_contrib,
         seq=seq)
In [20]:
# target
plot_tracks(dict(contrib=contrib, hyp_contrib=hyp_contrib));
In [21]:
from basepair.modisco.results import *
In [22]:
pattern.trim_seq_ic(0.08).plot(['seq', 'contrib/Oct4', 'hyp_contrib/Oct4'], rotate_y=0, height=2, letter_width=0.5);
In [23]:
pattern = pattern.trim_seq_ic(0.08)

Requirements

  • for each sequence, get 5 scores:
    • contrib
      • match
      • magnitude
    • hyp_contrib
      • match
      • magnitude
    • seq
      • match
  • extract these scores for all the sequences for the oct-sox motif
  • plot the distribution of these scores
  • repeat the same for sox2

Tasks

  • implement jaccard scanning
  • implement pwm scanning
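For the PWM-scanning task, a naive dense scanner could look like the following. This is a sketch only: the actual scanner in `basepair` is presumably convolution-based and also handles the reverse strand, and `pwm_scan` here is a hypothetical name.

```python
import numpy as np

def pwm_scan(pwm, onehot):
    """Minimal dense PWM scan (sketch).

    pwm:    (L, 4) log-odds matrix
    onehot: (N, seqlen, 4) one-hot encoded sequences
    returns (N, seqlen - L + 1) per-position match scores
    """
    L = pwm.shape[0]
    n, seqlen, _ = onehot.shape
    out = np.empty((n, seqlen - L + 1))
    for i in range(seqlen - L + 1):
        # score each window by summing the PWM entries of the observed bases
        out[:, i] = np.einsum('nlk,lk->n', onehot[:, i:i + L], pwm)
    return out
```

The Jaccard scan would replace the inner sum-of-log-odds with a continuous Jaccard similarity between importance tracks.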

Open questions

  • we get only ~10 motifs (and no long ones) when not using multiple tasks
In [24]:
pattern.scan_importance?
Signature: pattern.scan_importance(contrib, hyp_contrib, tasks, pad_mode='median', n_jobs=8)
Docstring:
Scan the tracks using this pattern

Args:
  seq
  contrib
  hyp_contrib
  profile
  n_jobs (int): number of cores to use

Returns:
  a tuple containing match, importance scans, each with shape: (batch, seqlen, tasks, strand)
File:      ~/workspace/basepair/basepair/modisco/core.py
Type:      method

Oct4 match

In [25]:
pattern = mr.get_pattern("metacluster_0/pattern_0").trim_seq_ic(0.08)
In [26]:
pattern.plot("seq");
In [28]:
match, importance = pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, n_jobs=20)
In [29]:
# optional
seq_match = pattern.scan_seq(all_seq, n_jobs=20)
In [38]:
dfm = pattern.get_instances(match, importance, seq_match, fdr=0.01, verbose=True)
Keeping 132276/19137000 (0.69%) of the instances. Cuttof=0.101
In [39]:
dfm.head()
Out[39]:
pattern example_idx pattern_start pattern_end strand pattern_len pattern_center imp_max_task match_max_task match_score seq_match match/Klf4 match/Nanog match/Oct4 match/Sox2 imp/Klf4 imp/Nanog imp/Oct4 imp/Sox2
0 metacluster_0/pattern_0 0 10 26 - 16 18 Nanog Nanog 0.131307 1.674306 -0.111460 0.131307 -0.108184 -0.033806 0.010756 0.011936 0.003235 0.007702
1 metacluster_0/pattern_0 0 12 28 + 16 20 Nanog Nanog 0.105893 1.087035 -0.076629 0.105893 -0.043110 -0.003427 0.010653 0.012066 0.003670 0.008551
2 metacluster_0/pattern_0 0 137 153 + 16 145 Nanog Nanog 0.111272 -4.522111 0.110133 0.111272 -0.049516 0.070673 0.017080 0.019810 0.007401 0.009422
3 metacluster_0/pattern_0 0 263 279 - 16 271 Nanog Nanog 0.103279 -11.100975 0.079367 0.103279 -0.090794 0.033805 0.012562 0.013120 0.009638 0.012889
4 metacluster_0/pattern_0 0 434 450 - 16 442 Sox2 Klf4 0.121448 -0.358489 0.137963 0.118600 0.117063 0.121448 0.621632 0.567201 0.343207 0.871215

Run for all major patterns

In [27]:
pattern_names = [
    ("Oct4-Sox2", "metacluster_0/pattern_0"),
    ("Errb", "metacluster_0/pattern_1"),
    ("Sox2", "metacluster_0/pattern_2"),
    ("Nanog", "metacluster_0/pattern_3"),
    ("Klf4", "metacluster_2/pattern_0"),
]
In [33]:
dfl = []
for tf, pattern_name in pattern_names:
    pattern = mr.get_pattern(pattern_name).trim_seq_ic(0.08)
    match, importance = pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, 
                                                n_jobs=20, verbose=True)
    seq_match = pattern.scan_seq(all_seq, n_jobs=20, verbose=True)
    dfm = pattern.get_instances(match, importance, seq_match, fdr=0.01, verbose=True)
    dfm['tf'] = tf
    dfl.append(dfm)
dfp = pd.concat(dfl)
Keeping 189652/19137000 (0.99%) of the instances. Match score cutoff=4.06
Keeping 67333/19137000 (0.35%) of the instances. Match score cutoff=4.68
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median for 19137 results
  r = func(a, **kwargs)
Keeping 164470/19137000 (0.86%) of the instances. Match score cutoff=4.22
Keeping 194863/19137000 (1.02%) of the instances. Match score cutoff=4.27
Keeping 168821/19137000 (0.88%) of the instances. Match score cutoff=4.25
In [41]:
!mkdir -p {ddir}/cache/dev/
In [42]:
!rmdir {ddir}/cache/dev/hit-scoring.feather
In [39]:
dfp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 785139 entries, 0 to 168820
Data columns (total 20 columns):
pattern           785139 non-null object
example_idx       785139 non-null int64
pattern_start     785139 non-null int64
pattern_end       785139 non-null int64
strand            785139 non-null object
pattern_len       785139 non-null int64
pattern_center    785139 non-null int64
imp_max_task      785139 non-null object
match_max_task    785139 non-null object
match_score       785139 non-null float64
seq_match         260256 non-null float64
match/Klf4        785139 non-null float64
match/Nanog       785139 non-null float64
match/Oct4        785139 non-null float64
match/Sox2        785139 non-null float64
imp/Klf4          785139 non-null float64
imp/Nanog         785139 non-null float64
imp/Oct4          785139 non-null float64
imp/Sox2          785139 non-null float64
tf                785139 non-null object
dtypes: float64(10), int64(5), object(5)
memory usage: 125.8+ MB
In [46]:
dfp.to_hdf(f"{ddir}/cache/dev/hit-scoring.hdf5", "/scores")
In [49]:
!du -sh {ddir}/cache/dev/hit-scoring.hdf5
105M	/users/avsec/workspace/basepair/data/cache/dev/hit-scoring.hdf5
In [34]:
dfp.head()
Out[34]:
pattern example_idx pattern_start pattern_end strand pattern_len pattern_center imp_max_task match_max_task match_score seq_match match/Klf4 match/Nanog match/Oct4 match/Sox2 imp/Klf4 imp/Nanog imp/Oct4 imp/Sox2 tf
0 metacluster_0/pattern_0 0 10 26 - 16 18 Nanog Nanog 6.314957 1.674306 -2.904895 6.314957 -2.215001 0.675620 0.359269 0.410711 0.078473 0.127572 Oct4-Sox2
1 metacluster_0/pattern_0 0 12 28 + 16 20 Nanog Nanog 4.105405 1.087035 -1.943211 4.105405 -0.179062 0.789083 0.358538 0.411136 0.081749 0.139172 Oct4-Sox2
2 metacluster_0/pattern_0 0 434 450 - 16 442 Klf4 Klf4 5.085370 -0.358489 5.085370 4.306160 4.397326 3.336833 25.292760 18.074471 6.595614 10.993076 Oct4-Sox2
3 metacluster_0/pattern_0 0 436 452 + 16 444 Klf4 Klf4 9.589859 10.260480 9.589859 9.110400 7.749240 4.635757 26.049998 18.633804 6.611798 10.866910 Oct4-Sox2
4 metacluster_0/pattern_0 0 440 456 - 16 448 Klf4 Klf4 4.125266 -9.033820 4.125266 3.806090 3.721674 1.889072 23.658365 17.657431 6.011966 10.061822 Oct4-Sox2

Eyeball some of the results

Dev

Convert to pd.DataFrame

In [378]:
seq_matchf  = seq_match.reshape((-1, 2))
choose_max_strand = np.argmax(matchf[idx_list], axis=-1)
matchfs = matchf[idx_list, choose_max_strand]
importancefs = importancef[idx_list]
seq_matchfs = seq_matchf[idx_list, choose_max_strand]

Sox2 match

In [35]:
sox2_pattern = mr.get_pattern("metacluster_0/pattern_2").trim_seq_ic(0.08)
In [36]:
sox2_pattern.plot("seq");
In [37]:
sox2_match, sox2_importance = sox2_pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, n_jobs=20)
In [38]:
sox2_seq_match = sox2_pattern.scan_seq(all_seq, n_jobs=20)

Plots

In [41]:
match.shape
Out[41]:
(19137, 1000, 4, 2)
In [45]:
import random
In [46]:
idx_list = np.random.randint(0, 19137000, 100000)
In [86]:
plt.hist(match.flatten()[idx_list], 100);
In [138]:
match.shape
Out[138]:
(19137, 1000, 4, 2)

Correlation among tasks

In [378]:
matchf = match.reshape((-1, 4, 2)).mean(axis=-2)
importancef = importance.reshape((-1, 4, 2)).mean(axis=-1)
seq_matchf  = seq_match.reshape((-1, 2))
choose_max_strand = np.argmax(matchf[idx_list], axis=-1)
matchfs = matchf[idx_list, choose_max_strand]
importancefs = importancef[idx_list]
seq_matchfs = seq_matchf[idx_list, choose_max_strand]
In [356]:
np.sum(matchf.max(axis=-1) > 0.0673)
Out[356]:
248034
In [367]:
thr = fdr_threshold_norm_right(matchf.max(axis=-1), 99, fdr=0.001)
thr
Out[367]:
0.0883088459560064
In [380]:
keep = matchf.max(axis=-1) > thr
In [401]:
thr
Out[401]:
0.0883088459560064
In [383]:
plt.hist(importancef[keep,0].max(axis=-1), 100);
In [394]:
plt.plot(matchf[keep].max(axis=-1),importancef[keep,0], "." , alpha=0.05);
plt.ylabel("Importance")
plt.xlabel("Match");
In [400]:
g = sns.jointplot(matchf[keep].max(axis=-1), importancef[keep,0], alpha=.01);
g.set_axis_labels("Match", "Importance");
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

TODO

  • sample a few points from this plot to get a sense what you are picking
In [387]:
np.sum(importancef[keep,0] > 0.2)
Out[387]:
24967
In [385]:
plt.hist(importancef[keep,0], 100);
In [384]:
plt.hist(importancef.max(axis=-1), 100);
In [368]:
np.sum(matchf.max(axis=-1) > thr)
Out[368]:
103100
In [369]:
np.mean(matchf.max(axis=-1) > thr)
Out[369]:
0.005387469300308303
In [84]:
sox2_matchf = sox2_match.reshape((-1, 4, 2)).mean(axis=-2)
sox2_importancef = sox2_importance.reshape((-1, 4, 2))
sox2_seq_matchf  = sox2_seq_match.reshape((-1, 2))
sox2_choose_max_strand = np.argmax(sox2_matchf[idx_list], axis=-1)
sox2_matchfs = sox2_matchf[idx_list, sox2_choose_max_strand]
sox2_importancefs = sox2_importancef[idx_list, :, sox2_choose_max_strand]
sox2_seq_matchfs = sox2_seq_matchf[idx_list, sox2_choose_max_strand]
In [42]:
seq_matchf.shape
Out[42]:
(19137000, 2)
In [43]:
matchf.shape
Out[43]:
(19137000, 4, 2)

Correlation between tasks

In [49]:
import seaborn as sns
In [58]:
sns.pairplot(pd.DataFrame(matchf[idx_list[:10000], :, 0], columns=tasks))
Out[58]:
<seaborn.axisgrid.PairGrid at 0x7fafd7b65550>
In [87]:
sns.pairplot(pd.DataFrame(importancefs, columns=tasks))
Out[87]:
<seaborn.axisgrid.PairGrid at 0x7f6a0554af28>

Task

  • for now, let's just try to get solid motif instances
    • importance scores sufficiently high
    • match sufficiently high

Simplifications

  • for jaccard match average all the matches across tasks
  • choose the strand based on the jaccard match
  • select the right match threshold
    • use a normal distribution fitted to the [0, 99] percentile range, with FDR=0.1
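The thresholding rule above can be sketched as follows. This is an assumption about what `fdr_threshold_norm_right` does (the real implementation may differ): fit a null normal on the bulk of the scores below the given percentile, so that true hits in the right tail don't inflate the fit, then apply a Benjamini-Hochberg-style cutoff to the right-tail p-values.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold_norm_right_sketch(x, pct=99, fdr=0.1):
    """Sketch of a right-tail FDR threshold under a fitted normal null."""
    bulk = x[x < np.percentile(x, pct)]
    loc, scale = norm.fit(bulk)
    xs = np.sort(x)[::-1]                       # descending scores
    p = norm.sf(xs, loc, scale)                 # ascending right-tail p-values
    n = len(xs)
    # BH: largest rank k with p_(k) <= fdr * k / n; threshold at that score
    passed = np.where(p <= fdr * np.arange(1, n + 1) / n)[0]
    return xs[passed.max()] if passed.size else np.inf
```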

Conclusions

  • pure PWM match performs poorly
In [90]:
plt.hist(matchfs, 100);
In [97]:
plt.hist(sox2_matchfs, 100);
In [99]:
plt.hist(importancefs.max(axis=-1), 100);
In [100]:
plt.hist(sox2_importancefs.max(axis=-1), 100);
In [334]:
from basepair.stats import fdr_threshold_norm_right
In [335]:
fdr_threshold_norm_right(matchfs)
Out[335]:
0.06211395940893545
In [92]:
g = sns.jointplot(matchfs, importancefs[:, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");
In [94]:
g = sns.jointplot(matchfs, importancefs.max(axis=-1), alpha=.1);
g.set_axis_labels("Match", "Importance");
In [322]:
fdr_threshold_norm_right(sox2_matchfs)
Out[322]:
0.09175216612992457
In [95]:
g = sns.jointplot(sox2_matchfs, sox2_importancefs[:, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");
In [96]:
g = sns.jointplot(sox2_matchfs, sox2_importancefs.max(axis=-1), alpha=.1);
g.set_axis_labels("Match", "Importance");
In [346]:
g = sns.jointplot(matchfs, seq_matchfs, alpha=0.05);
g.set_axis_labels("Match", "PWM-scan");
In [344]:
fdr_threshold_norm_right(seq_matchfs)
Out[344]:
11.241350809716671
In [517]:
stats.probplot(seq_matchfs, dist="norm", plot=plt);
plt.title("asd");
Out[517]:
Text(0.5,1,'asd')

Interestingly, the PWM is a very poor predictor

In [78]:
g = sns.jointplot(matchf[idx_list, :,choose_max].mean(axis=-1), seq_matchf[idx_list,choose_max], alpha=0.05);
g.set_axis_labels("Match", "PWM-scan");

Figure out the right thresholding strategy

QQ-plots

In [101]:
import scipy.stats as stats
stats.probplot(matchfs, dist="norm", plot=plt);
In [102]:
stats.probplot(sox2_matchfs, dist="norm", plot=plt);
  • Experimented with fitting Gaussian mixture models; it didn't work at all
In [310]:
np.sum(matchfs > threshold(matchfs) )
Out[310]:
1717
In [336]:
from scipy.stats import norm
In [295]:
norm.ppf(0.8, loc=loc, scale=scale)
Out[295]:
0.02518700846030738
In [265]:
loc, scale = norm.fit(matchfs)
loc, scale
Out[265]:
(0.01119161561186437, 0.02010419697630615)
In [327]:
upper = np.percentile(matchfs,99)

loc, scale = norm.fit(matchfs[(matchfs < upper)])

p = norm.cdf(matchfs, loc, scale)
In [271]:
from statsmodels.stats.multitest import fdrcorrection  # missing import

keep, padj = fdrcorrection(1 - p, alpha=0.1)
In [272]:
keep.sum()
Out[272]:
1717
In [273]:
keep.mean()
Out[273]:
0.01717
In [274]:
import scipy.stats as stats
#measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(matchfs[~keep], dist="norm", plot=plt);
In [275]:
import scipy.stats as stats
#measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(matchfs[(matchfs < upper)], dist="norm", plot=plt);
In [164]:
skip.mean()
Out[164]:
0.02149
In [165]:
plt.hist(p[skip], 100);
In [166]:
plt.hist(p[p> 0.9], 100);
In [138]:
plt.hist(padj, 100);
In [117]:
var
Out[117]:
0.02010419697630615

Explore called instances

In [349]:
keep = matchfs > fdr_threshold_norm_right(matchfs)
In [350]:
plt.hist(matchfs[keep], 100);
In [351]:
g = sns.jointplot(matchfs[keep], importancefs[keep, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");

Instance calling

  • sufficiently good match in one of the tasks
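The rule above ("sufficiently good match in one of the tasks") amounts to thresholding the max over tasks and strands; a minimal sketch (the function name `call_instances` is hypothetical):

```python
import numpy as np

def call_instances(match, thr):
    """Call motif instances from a match scan of shape
    (batch, seqlen, tasks, strands): a position is a hit if its best
    match over tasks and strands exceeds the threshold."""
    best = match.max(axis=(-2, -1))            # (batch, seqlen)
    example_idx, position = np.where(best > thr)
    return example_idx, position
```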
In [92]:
hyp_contrib_score_scan.shape
Out[92]:
(19137, 985)
In [115]:
np.all(contrib_score_scan==np.maximum(contrib_score_scan_fwd, contrib_score_scan_rev))
Out[115]:
False
In [116]:
np.all(hyp_contrib_score_scan==np.maximum(hyp_contrib_score_scan_fwd, hyp_contrib_score_scan_rev))
Out[116]:
False
In [122]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd)[:100000], ".")
Out[122]:
[<matplotlib.lines.Line2D at 0x7f622772ba58>]
In [119]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(hyp_contrib_score_scan_fwd)[:100000], ".")
Out[119]:
[<matplotlib.lines.Line2D at 0x7f622c077780>]
In [123]:
plt.hist(np.ravel(pwm_scan_fwd), 100);
plt.xlabel("pwm scan score");
In [128]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd_scale)[:100000], ".")
Out[128]:
[<matplotlib.lines.Line2D at 0x7f622cf470f0>]
In [224]:
plt.plot(np.ravel(contrib_score_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd_scale)[:100000], ".");
plt.xlabel("Match")
plt.ylabel("scale");
In [226]:
plt.hist(np.ravel(contrib_score_scan_fwd), 100);
In [134]:
plt.plot(np.ravel(contrib_score_scan_fwd)[:100000], np.ravel(hyp_contrib_score_scan_fwd)[:100000], ".")
Out[134]:
[<matplotlib.lines.Line2D at 0x7f622db85ba8>]
In [135]:
plt.plot(np.ravel(contrib_score_scan_fwd_scale)[:100000], np.ravel(hyp_contrib_score_scan_fwd_scale)[:100000], ".")
Out[135]:
[<matplotlib.lines.Line2D at 0x7f622d9482e8>]

Plot scores for the example

In [90]:
viz_dict = dict(contrib=all_contrib[example_idx, 400:600],
                hyp_contrib=all_hyp_contrib[example_idx, 400:600],
                pwm_match=seq_match[example_idx, 400:600],
                contrib_match=match[example_idx, 400:600, 0],
                contrib_scale=importance[example_idx, 400:600, 0])
In [91]:
from basepair.modisco.results import Seqlet
In [121]:
pos_hits = np.argsort(-match[example_idx, 400:600, 0].max(axis=-1))[:2]  # top 2 hits
In [106]:
match[example_idx, 400:600, 0].sum()
Out[106]:
0.00830483751167788
In [109]:
importance[example_idx, 400:600, 0].sum()
Out[109]:
0.11577525138663863
In [96]:
pos_hits
Out[96]:
array([], dtype=int64)
In [137]:
from basepair.utils import halve

def idx2seqlet(pos, pattern):
    # TODO handle the reverse-complementation
    i, j = halve(len(pattern))
    return Seqlet(None, pos-i, pos+j, name=pattern.name)
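`halve` comes from `basepair.utils`; judging from how `idx2seqlet` uses it to center a window of length `len(pattern)` on `pos`, the assumed behavior is an integer split into two near-equal halves:

```python
def halve(n):
    """Assumed behavior of basepair.utils.halve: split an integer length
    into two near-equal parts, e.g. (8, 8) for 16 and (8, 9) for 17."""
    return n // 2, n - n // 2
```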
In [131]:
seqlets = [idx2seqlet(pos, pattern) for pos in pos_hits]
In [129]:
pattern.plot("seq");
In [132]:
plot_tracks(viz_dict, seqlets, rotate_y=0, legend=True);

Benchmarking the functions

In [373]:
%timeit score_region_cont_jaccard(qa, ta)
300 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [375]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[:100])
362 ms ± 44.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [376]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[:1000])
3.69 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [377]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[0]) # 1kb
2.14 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [378]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0], qa) # 1kb
18.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [379]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[0][:, :100]) # 100bp
293 µs ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [380]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0][:, :100], qa) # 100bp
1.73 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [315]:
a, a_norm = score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0][:, :1000], qa)
b, b_norm = score_region_cont_jaccard(qa, ta_all_contrib[0][:, :1000]) # 100bp
In [336]:
assert b_norm[0] == a_norm[0]
In [382]:
%timeit np.stack([score_region_cont_jaccard(qa, ta_all_contrib[i]) for i in range(1000)])
2.34 s ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [390]:
scanned =  np.stack(Parallel(10)(delayed(score_region_cont_jaccard)(qa, ta_all_contrib[i]) for i in tqdm(range(len(ta_all_contrib)))))
In [391]:
ta_all_contrib.shape
Out[391]:
(19137, 4, 1000)
In [415]:
# TODO - add unit-tests
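One such unit test could plant the query pattern in a random track and check that a naive reference scanner recovers it. The continuous-Jaccard definition below is an assumption about what `score_region_cont_jaccard` computes, and the function names are hypothetical:

```python
import numpy as np

def cont_jaccard(q, t):
    """Assumed continuous Jaccard: signed sum of element-wise minima of the
    magnitudes over the sum of maxima; equals 1 for identical tracks."""
    num = np.sum(np.minimum(np.abs(q), np.abs(t)) * np.sign(q) * np.sign(t))
    den = np.sum(np.maximum(np.abs(q), np.abs(t)))
    return num / den if den else 0.0

def scan_cont_jaccard(q, track):
    """Naive per-position scan of query (channels, L) over track (channels, seqlen)."""
    L = q.shape[1]
    n_pos = track.shape[1] - L + 1
    return np.array([cont_jaccard(q, track[:, i:i + L]) for i in range(n_pos)])

def test_planted_pattern_is_top_hit():
    rng = np.random.RandomState(0)
    q = rng.randn(4, 16)
    track = rng.randn(4, 200) * 0.1
    track[:, 7:23] = q                         # plant the pattern at position 7
    scores = scan_cont_jaccard(q, track)
    assert scores.argmax() == 7
    assert abs(scores[7] - 1.0) < 1e-9

test_planted_pattern_is_top_hit()
```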
In [337]:
other_jacc = jaccard_sim_func(ta_all_contrib[0][:, :16][np.newaxis]*qa_L1_norm / a_norm[0], qa[np.newaxis])[0,0]
In [343]:
np.allclose(other_jacc, a[0])
Out[343]:
False
In [344]:
np.allclose(other_jacc, b[0])
Out[344]:
True
In [348]:
other_jacc = jaccard_sim_func(ta_all_contrib[0][:, 1:17][np.newaxis]*qa_L1_norm / b_norm[1], qa[np.newaxis])[0,0]
In [349]:
np.allclose(other_jacc, a[1])
Out[349]:
False
In [350]:
np.allclose(other_jacc, b[1])
Out[350]:
True
In [316]:
plt.scatter(a_norm, b_norm)
Out[316]:
<matplotlib.collections.PathCollection at 0x7f40fa670198>
In [318]:
plt.scatter(np.ravel(a), np.ravel(b));
In [ ]:
%timeit score_region_cont_jaccard(qa, ta)
In [159]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta, qa)
3.04 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pattern scanning for the whole group

In [187]:
from basepair.modisco.utils import ic_scale

def scan_pattern(pattern_grp, contrib_scores, hypothetical_contribs, one_hot, task_names, trim_frac=0, verbose=True):
    """
    final modisco results scan structure
    - sequence
      - fwd
      - rev
    - <task>
      - <imp score>
        - match
           - fwd
           - rev
        - scale
    """
    # TODO - add i and j to the output?

    pwm = pattern_grp['sequence']['fwd'][:]
    trim_ij = trim_pssm_idx(ic_scale(pwm), frac=trim_frac)
    pattern_len = len(pwm)
    
    def get_group_match(grp, name):
        for n in grp.keys():
            if name in n:
                return grp[n]
        raise ValueError("{name} doesn't match any group keys: {l}".format(name=name, l=list(grp.keys())))
        
    def parallel_score_continousjaccard_restructure(qa, ta, **kwargs):
        fwd, fwd_scale = parallel_score_continousjaccard(qa, ta, **kwargs)
        rev, rev_scale = parallel_score_continousjaccard(qa[::-1, ::-1], ta, **kwargs)
        return dict(match=dict(fwd=fwd, rev=rev),
                    scale=dict(fwd=fwd_scale, rev=rev_scale),
                   )
        
    trim_ij_rc = (pattern_len - trim_ij[1], pattern_len - trim_ij[0])
    return dict(sequence=dict(fwd=parallel_score_pssm(pwm[trim_ij[0]:trim_ij[1]], one_hot, verbose=verbose),
                              rev=parallel_score_pssm(pwm[trim_ij[0]:trim_ij[1]][::-1, ::-1], one_hot, verbose=verbose)),
                **{task: {"contrib_scores": parallel_score_continousjaccard_restructure(get_group_match(pattern_grp[task], "contrib_scores")['fwd'][trim_ij[0]:trim_ij[1]], contrib_scores),
                          "hypothetical_contribs": parallel_score_continousjaccard_restructure(get_group_match(pattern_grp[task], "hypothetical_contribs")['fwd'][trim_ij[0]:trim_ij[1]], hypothetical_contribs),}
                   for task in tqdm(task_names, disable=not verbose)}
               )
In [190]:
tasks
Out[190]:
['Oct4', 'Sox2', 'Klf4', 'Nanog']
In [191]:
mp1 = scan_pattern(mr.get_pattern_grp("metacluster_0", "pattern_0"), all_contrib, all_hyp_contrib, all_seq, tasks, trim_frac=0.08)
In [199]:
all_seq.size
Out[199]:
76548000
In [217]:
idx = pd.Series(np.arange(all_seq.shape[0])).sample(10000).values
In [218]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Sox2']['contrib_scores']['match']['fwd'])[idx], ".")
Out[218]:
[<matplotlib.lines.Line2D at 0x7f622d3dc940>]
In [219]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Nanog']['contrib_scores']['match']['fwd'])[idx], ".")
Out[219]:
[<matplotlib.lines.Line2D at 0x7f622b7a72b0>]
In [220]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Klf4']['contrib_scores']['match']['fwd'])[idx], ".")
Out[220]:
[<matplotlib.lines.Line2D at 0x7f622c90aeb8>]
In [221]:
mp1['Oct4']['contrib_scores']['match']['fwd'].shape
Out[221]:
(19137, 985)

Ideas

  • how much of the blob can be explained by other motifs?
    • what are the important motifs
  • first try to assign the
  • IDR: check Nathan's code - take from kundajelab/idr
    • plot rank-rank
  • affinity propagation
    • run on the predictions as well as real data
  • footprint -> DNA binding domain
    • crystal structure?
      • oct-sox
  • orthogonal validation?

    • expression - MPRA
    • Nanog SELEX dataset
      • re-analyze the data?
    • MPRA - H3K27ac
      • try to predict the enhancer activity
      • stratify the peaks into acetylated vs non-acetylated
  • higher fraction of the

    • allele-specific binding?
    • train in human on iPS lines
  • transposable elements are not conserved in human

  • simulations -> the profile only gives a sense of what might be going on

  • affinity with SELEX
  • MPRA
  • acetylation
  • Human <-> mouse allelic effects
  • overlap with the histone data

  • question:

    • mixture model also predicting things that are not bound
    • repressors
      • question: are there any repressors?
    • idea:
      • add more tasks to the output