Goal

  • develop modisco hit scoring

Tasks

  • find a nice example in which to implement the function

TODO

  • sample a few points from the importance-vs-match scatterplot to see what you are getting
  • try the same analysis for a few factors
    • try to find the right cutoff based on the number of motifs you expect
  • eyeball a few regions (highlight the found seqlets in there)
    • before doing the search, merge all the metaclusters (ignore the importance score magnitude)
In [1]:
from basepair.imports import *
from basepair.samplers import top_sum_count, top_max_count
from basepair.plot.tracks import plot_tracks, filter_tracks
from kipoi.data_utils import get_dataset_item
Using TensorFlow backend.
In [2]:
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")
In [3]:
modisco_dir = model_dir / "modisco/by_peak_tasks/weighted/Oct4"
In [4]:
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
In [5]:
mr.plot_pssm("metacluster_0", "pattern_0", trim_frac=0.08);
mr.plot_pssm("metacluster_0", "pattern_2", trim_frac=0.08);
In [6]:
pattern_name = 'metacluster_0/pattern_0'
In [7]:
pattern = mr.get_pattern(pattern_name)
In [8]:
pattern.plot(rotate_y=0);
In [9]:
im = HDF5Reader(model_dir / "grad.valid.h5")
im.open()
In [10]:
d = im.load_all()
In [11]:
tasks = ['Oct4', 'Sox2', 'Klf4', 'Nanog']
In [12]:
all_hyp_contrib = (im.f['/grads/Oct4/weighted/0'][:] + im.f['/grads/Oct4/weighted/1'][:]) / 2
all_seq = im.f['/inputs/'][:]
all_contrib = all_hyp_contrib * all_seq
all_profile = im.f['/targets/profile/Oct4'][:]
In [13]:
idx = top_max_count(all_profile[:, 400:600], 80)
In [14]:
example_idx = 11128
In [15]:
gt = 'weighted'
hyp_contrib_scores = {task: mean(d['grads'][task][gt]) for task in tasks}
contrib_scores = {task: hyp_contrib_scores[task] * all_seq for task in tasks}
targets = {f"t/{task}": d['targets']['profile'][task] for task in tasks}
In [16]:
viz_dict = filter_tracks({**get_dataset_item(targets, example_idx), **get_dataset_item(contrib_scores, example_idx)}, xlim=[400, 600])
In [17]:
viz_dict.keys()
Out[17]:
dict_keys(['t/Oct4', 't/Sox2', 't/Klf4', 't/Nanog', 'Oct4', 'Sox2', 'Klf4', 'Nanog'])
In [18]:
plot_tracks(viz_dict);

Focus on one specific example

In [19]:
contrib = all_contrib[example_idx, 400:600]
hyp_contrib = all_hyp_contrib[example_idx, 400:600]
seq = all_seq[example_idx, 400:600]

t = dict(contrib=contrib,
         hyp_contrib=hyp_contrib,
         seq=seq)
In [20]:
# target
plot_tracks(dict(contrib=contrib, hyp_contrib=hyp_contrib));
In [21]:
from basepair.modisco.results import *
In [22]:
pattern.trim_seq_ic(0.08).plot(['seq', 'contrib/Oct4', 'hyp_contrib/Oct4'], rotate_y=0, height=2, letter_width=0.5);
In [23]:
pattern = pattern.trim_seq_ic(0.08)

Requirements

  • for each sequence, get 5 scores:
    • contrib
      • match
      • magnitude
    • hyp_contrib
      • match
      • magnitude
    • seq
      • match
  • extract these scores for all the sequences for the oct-sox motif
  • plot the distribution of these scores
  • repeat the same for sox2

Tasks

  • implement jaccard scanning
  • implement pwm scanning
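For the PWM-scanning task, a naive dense scanner could look like the following. This is a sketch only: the actual scanner in `basepair` is presumably convolution-based and also handles the reverse strand, and `pwm_scan` here is a hypothetical name.

```python
import numpy as np

def pwm_scan(pwm, onehot):
    """Minimal dense PWM scan (sketch).

    pwm:    (L, 4) log-odds matrix
    onehot: (N, seqlen, 4) one-hot encoded sequences
    returns (N, seqlen - L + 1) per-position match scores
    """
    L = pwm.shape[0]
    n, seqlen, _ = onehot.shape
    out = np.empty((n, seqlen - L + 1))
    for i in range(seqlen - L + 1):
        # score each window by summing the PWM entries of the observed bases
        out[:, i] = np.einsum('nlk,lk->n', onehot[:, i:i + L], pwm)
    return out
```

The Jaccard scan would replace the inner sum-of-log-odds with a continuous Jaccard similarity between importance tracks.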

Open questions

  • we get only ~10 motifs (and no long ones) when not using multiple tasks
In [24]:
pattern.scan_importance?
Signature: pattern.scan_importance(contrib, hyp_contrib, tasks, pad_mode='median', n_jobs=8)
Docstring:
Scan the tracks using this pattern

Args:
  seq
  contrib
  hyp_contrib
  profile
  n_jobs (int): number of cores to use

Returns:
  a tuple containing match, importance scans, each with shape: (batch, seqlen, tasks, strand)
File:      ~/workspace/basepair/basepair/modisco/core.py
Type:      method

Oct4 match

In [25]:
pattern = mr.get_pattern("metacluster_0/pattern_0").trim_seq_ic(0.08)
In [26]:
pattern.plot("seq");
In [28]:
match, importance = pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, n_jobs=20)
In [29]:
# optional
seq_match = pattern.scan_seq(all_seq, n_jobs=20)
In [38]:
dfm = pattern.get_instances(match, importance, seq_match, fdr=0.01, verbose=True)
Keeping 132276/19137000 (0.69%) of the instances. Cuttof=0.101
In [39]:
dfm.head()
Out[39]:
pattern example_idx pattern_start pattern_end strand pattern_len pattern_center imp_max_task match_max_task match_score seq_match match/Klf4 match/Nanog match/Oct4 match/Sox2 imp/Klf4 imp/Nanog imp/Oct4 imp/Sox2
0 metacluster_0/pattern_0 0 10 26 - 16 18 Nanog Nanog 0.131307 1.674306 -0.111460 0.131307 -0.108184 -0.033806 0.010756 0.011936 0.003235 0.007702
1 metacluster_0/pattern_0 0 12 28 + 16 20 Nanog Nanog 0.105893 1.087035 -0.076629 0.105893 -0.043110 -0.003427 0.010653 0.012066 0.003670 0.008551
2 metacluster_0/pattern_0 0 137 153 + 16 145 Nanog Nanog 0.111272 -4.522111 0.110133 0.111272 -0.049516 0.070673 0.017080 0.019810 0.007401 0.009422
3 metacluster_0/pattern_0 0 263 279 - 16 271 Nanog Nanog 0.103279 -11.100975 0.079367 0.103279 -0.090794 0.033805 0.012562 0.013120 0.009638 0.012889
4 metacluster_0/pattern_0 0 434 450 - 16 442 Sox2 Klf4 0.121448 -0.358489 0.137963 0.118600 0.117063 0.121448 0.621632 0.567201 0.343207 0.871215

Run for all major patterns

In [27]:
pattern_names = [
    ("Oct4-Sox2", "metacluster_0/pattern_0"),
    ("Errb", "metacluster_0/pattern_1"),
    ("Sox2", "metacluster_0/pattern_2"),
    ("Nanog", "metacluster_0/pattern_3"),
    ("Klf4", "metacluster_2/pattern_0"),
]
In [33]:
dfl = []
for tf, pattern_name in pattern_names:
    pattern = mr.get_pattern(pattern_name).trim_seq_ic(0.08)
    match, importance = pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, 
                                                n_jobs=20, verbose=True)
    seq_match = pattern.scan_seq(all_seq, n_jobs=20, verbose=True)
    dfm = pattern.get_instances(match, importance, seq_match, fdr=0.01, verbose=True)
    dfm['tf'] = tf
    dfl.append(dfm)
dfp = pd.concat(dfl)
Keeping 189652/19137000 (0.99%) of the instances. Match score cutoff=4.06
Keeping 67333/19137000 (0.35%) of the instances. Match score cutoff=4.68
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median for 19137 results
  r = func(a, **kwargs)
Keeping 164470/19137000 (0.86%) of the instances. Match score cutoff=4.22
Keeping 194863/19137000 (1.02%) of the instances. Match score cutoff=4.27
Keeping 168821/19137000 (0.88%) of the instances. Match score cutoff=4.25
In [41]:
!mkdir -p {ddir}/cache/dev/
In [42]:
!rmdir {ddir}/cache/dev/hit-scoring.feather
In [39]:
dfp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 785139 entries, 0 to 168820
Data columns (total 20 columns):
pattern           785139 non-null object
example_idx       785139 non-null int64
pattern_start     785139 non-null int64
pattern_end       785139 non-null int64
strand            785139 non-null object
pattern_len       785139 non-null int64
pattern_center    785139 non-null int64
imp_max_task      785139 non-null object
match_max_task    785139 non-null object
match_score       785139 non-null float64
seq_match         260256 non-null float64
match/Klf4        785139 non-null float64
match/Nanog       785139 non-null float64
match/Oct4        785139 non-null float64
match/Sox2        785139 non-null float64
imp/Klf4          785139 non-null float64
imp/Nanog         785139 non-null float64
imp/Oct4          785139 non-null float64
imp/Sox2          785139 non-null float64
tf                785139 non-null object
dtypes: float64(10), int64(5), object(5)
memory usage: 125.8+ MB
In [46]:
dfp.to_hdf(f"{ddir}/cache/dev/hit-scoring.hdf5", "/scores")
In [49]:
!du -sh {ddir}/cache/dev/hit-scoring.hdf5
105M	/users/avsec/workspace/basepair/data/cache/dev/hit-scoring.hdf5
In [34]:
dfp.head()
Out[34]:
pattern example_idx pattern_start pattern_end strand pattern_len pattern_center imp_max_task match_max_task match_score seq_match match/Klf4 match/Nanog match/Oct4 match/Sox2 imp/Klf4 imp/Nanog imp/Oct4 imp/Sox2 tf
0 metacluster_0/pattern_0 0 10 26 - 16 18 Nanog Nanog 6.314957 1.674306 -2.904895 6.314957 -2.215001 0.675620 0.359269 0.410711 0.078473 0.127572 Oct4-Sox2
1 metacluster_0/pattern_0 0 12 28 + 16 20 Nanog Nanog 4.105405 1.087035 -1.943211 4.105405 -0.179062 0.789083 0.358538 0.411136 0.081749 0.139172 Oct4-Sox2
2 metacluster_0/pattern_0 0 434 450 - 16 442 Klf4 Klf4 5.085370 -0.358489 5.085370 4.306160 4.397326 3.336833 25.292760 18.074471 6.595614 10.993076 Oct4-Sox2
3 metacluster_0/pattern_0 0 436 452 + 16 444 Klf4 Klf4 9.589859 10.260480 9.589859 9.110400 7.749240 4.635757 26.049998 18.633804 6.611798 10.866910 Oct4-Sox2
4 metacluster_0/pattern_0 0 440 456 - 16 448 Klf4 Klf4 4.125266 -9.033820 4.125266 3.806090 3.721674 1.889072 23.658365 17.657431 6.011966 10.061822 Oct4-Sox2

Eyeball some of the results

Dev

Convert to pd.DataFrame

In [378]:
seq_matchf  = seq_match.reshape((-1, 2))
choose_max_strand = np.argmax(matchf[idx_list], axis=-1)
matchfs = matchf[idx_list, choose_max_strand]
importancefs = importancef[idx_list]
seq_matchfs = seq_matchf[idx_list, choose_max_strand]

Sox2 match

In [35]:
sox2_pattern = mr.get_pattern("metacluster_0/pattern_2").trim_seq_ic(0.08)
In [36]:
sox2_pattern.plot("seq");
In [37]:
sox2_match, sox2_importance = sox2_pattern.scan_importance(contrib_scores, hyp_contrib_scores, tasks, n_jobs=20)
In [38]:
sox2_seq_match = sox2_pattern.scan_seq(all_seq, n_jobs=20)

Plots

In [41]:
match.shape
Out[41]:
(19137, 1000, 4, 2)
In [45]:
import random
In [46]:
idx_list = np.random.randint(0, 19137000, 100000)
In [86]:
plt.hist(match.flatten()[idx_list], 100);
In [138]:
match.shape
Out[138]:
(19137, 1000, 4, 2)

Correlation among tasks

In [378]:
matchf = match.reshape((-1, 4, 2)).mean(axis=-2)
importancef = importance.reshape((-1, 4, 2)).mean(axis=-1)
seq_matchf  = seq_match.reshape((-1, 2))
choose_max_strand = np.argmax(matchf[idx_list], axis=-1)
matchfs = matchf[idx_list, choose_max_strand]
importancefs = importancef[idx_list]
seq_matchfs = seq_matchf[idx_list, choose_max_strand]
In [356]:
np.sum(matchf.max(axis=-1) > 0.0673)
Out[356]:
248034
In [367]:
thr = fdr_threshold_norm_right(matchf.max(axis=-1), 99, fdr=0.001)
thr
Out[367]:
0.0883088459560064
In [380]:
keep = matchf.max(axis=-1) > thr
In [401]:
thr
Out[401]:
0.0883088459560064
In [383]:
plt.hist(importancef[keep,0].max(axis=-1), 100);
In [394]:
plt.plot(matchf[keep].max(axis=-1),importancef[keep,0], "." , alpha=0.05);
plt.ylabel("Importance")
plt.xlabel("Match");
In [400]:
g = sns.jointplot(matchf[keep].max(axis=-1), importancef[keep,0], alpha=.01);
g.set_axis_labels("Match", "Importance");
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

TODO

  • sample a few points from this plot to get a sense what you are picking
In [387]:
np.sum(importancef[keep,0] > 0.2)
Out[387]:
24967
In [385]:
plt.hist(importancef[keep,0], 100);
In [384]:
plt.hist(importancef.max(axis=-1), 100);
In [368]:
np.sum(matchf.max(axis=-1) > thr)
Out[368]:
103100
In [369]:
np.mean(matchf.max(axis=-1) > thr)
Out[369]:
0.005387469300308303
In [84]:
sox2_matchf = sox2_match.reshape((-1, 4, 2)).mean(axis=-2)
sox2_importancef = sox2_importance.reshape((-1, 4, 2))
sox2_seq_matchf  = sox2_seq_match.reshape((-1, 2))
sox2_choose_max_strand = np.argmax(sox2_matchf[idx_list], axis=-1)
sox2_matchfs = sox2_matchf[idx_list, sox2_choose_max_strand]
sox2_importancefs = sox2_importancef[idx_list, :, sox2_choose_max_strand]
sox2_seq_matchfs = sox2_seq_matchf[idx_list, sox2_choose_max_strand]
In [42]:
seq_matchf.shape
Out[42]:
(19137000, 2)
In [43]:
matchf.shape
Out[43]:
(19137000, 4, 2)

Correlation between tasks

In [49]:
import seaborn as sns
In [58]:
sns.pairplot(pd.DataFrame(matchf[idx_list[:10000], :, 0], columns=tasks))
Out[58]:
<seaborn.axisgrid.PairGrid at 0x7fafd7b65550>
In [87]:
sns.pairplot(pd.DataFrame(importancefs, columns=tasks))
Out[87]:
<seaborn.axisgrid.PairGrid at 0x7f6a0554af28>

Task

  • for now, let's just try to get solid motif instances
    • importance scores sufficiently high
    • match sufficiently high

Simplifications

  • for jaccard match average all the matches across tasks
  • choose the strand based on the jaccard match
  • select the right match threshold
    • use a normal distribution fitted to the [0, 99] percentile range, with FDR=0.1
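The thresholding rule above can be sketched as follows. This is an assumption about what `fdr_threshold_norm_right` does (the real implementation may differ): fit a null normal on the bulk of the scores below the given percentile, so that true hits in the right tail don't inflate the fit, then apply a Benjamini-Hochberg-style cutoff to the right-tail p-values.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold_norm_right_sketch(x, pct=99, fdr=0.1):
    """Sketch of a right-tail FDR threshold under a fitted normal null."""
    bulk = x[x < np.percentile(x, pct)]
    loc, scale = norm.fit(bulk)
    xs = np.sort(x)[::-1]                       # descending scores
    p = norm.sf(xs, loc, scale)                 # ascending right-tail p-values
    n = len(xs)
    # BH: largest rank k with p_(k) <= fdr * k / n; threshold at that score
    passed = np.where(p <= fdr * np.arange(1, n + 1) / n)[0]
    return xs[passed.max()] if passed.size else np.inf
```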

Conclusions

  • pure PWM match performs poorly
In [90]:
plt.hist(matchfs, 100);
In [97]:
plt.hist(sox2_matchfs, 100);
In [99]:
plt.hist(importancefs.max(axis=-1), 100);
In [100]:
plt.hist(sox2_importancefs.max(axis=-1), 100);
In [334]:
from basepair.stats import fdr_threshold_norm_right
In [335]:
fdr_threshold_norm_right(matchfs)
Out[335]:
0.06211395940893545
In [92]:
g = sns.jointplot(matchfs, importancefs[:, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");
In [94]:
g = sns.jointplot(matchfs, importancefs.max(axis=-1), alpha=.1);
g.set_axis_labels("Match", "Importance");
In [322]:
fdr_threshold_norm_right(sox2_matchfs)
Out[322]:
0.09175216612992457
In [95]:
g = sns.jointplot(sox2_matchfs, sox2_importancefs[:, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");
In [96]:
g = sns.jointplot(sox2_matchfs, sox2_importancefs.max(axis=-1), alpha=.1);
g.set_axis_labels("Match", "Importance");
In [346]:
g = sns.jointplot(matchfs, seq_matchfs, alpha=0.05);
g.set_axis_labels("Match", "PWM-scan");
In [344]:
fdr_threshold_norm_right(seq_matchfs)
Out[344]:
11.241350809716671
In [517]:
stats.probplot(seq_matchfs, dist="norm", plot=plt);
plt.title("asd");
Out[517]:
Text(0.5,1,'asd')

Interestingly, the PWM is a very poor predictor

In [78]:
g = sns.jointplot(matchf[idx_list, :,choose_max].mean(axis=-1), seq_matchf[idx_list,choose_max], alpha=0.05);
g.set_axis_labels("Match", "PWM-scan");

Figure out the right thresholding strategy

QQ-plots

In [101]:
import scipy.stats as stats
stats.probplot(matchfs, dist="norm", plot=plt);
In [102]:
stats.probplot(sox2_matchfs, dist="norm", plot=plt);
  • Experimented with fitting Gaussian mixture models; it didn't work at all
In [310]:
np.sum(matchfs > threshold(matchfs) )
Out[310]:
1717
In [336]:
from scipy.stats import norm
In [295]:
norm.ppf(0.8, loc=loc, scale=scale)
Out[295]:
0.02518700846030738
In [265]:
loc, scale = norm.fit(matchfs)
loc, scale
Out[265]:
(0.01119161561186437, 0.02010419697630615)
In [327]:
upper = np.percentile(matchfs,99)

loc, scale = norm.fit(matchfs[(matchfs < upper)])

p = norm.cdf(matchfs, loc, scale)
In [271]:
from statsmodels.stats.multitest import fdrcorrection  # missing import

keep, padj = fdrcorrection(1 - p, alpha=0.1)
In [272]:
keep.sum()
Out[272]:
1717
In [273]:
keep.mean()
Out[273]:
0.01717
In [274]:
import scipy.stats as stats
#measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(matchfs[~keep], dist="norm", plot=plt);
In [275]:
import scipy.stats as stats
#measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(matchfs[(matchfs < upper)], dist="norm", plot=plt);
In [164]:
skip.mean()
Out[164]:
0.02149
In [165]:
plt.hist(p[skip], 100);
In [166]:
plt.hist(p[p> 0.9], 100);
In [138]:
plt.hist(padj, 100);
In [117]:
var
Out[117]:
0.02010419697630615

Explore called instances

In [349]:
keep = matchfs > fdr_threshold_norm_right(matchfs)
In [350]:
plt.hist(matchfs[keep], 100);
In [351]:
g = sns.jointplot(matchfs[keep], importancefs[keep, 0], alpha=.1);
g.set_axis_labels("Match", "Importance");

Instance calling

  • sufficiently good match in one of the tasks
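The rule above ("sufficiently good match in one of the tasks") amounts to thresholding the max over tasks and strands; a minimal sketch (the function name `call_instances` is hypothetical):

```python
import numpy as np

def call_instances(match, thr):
    """Call motif instances from a match scan of shape
    (batch, seqlen, tasks, strands): a position is a hit if its best
    match over tasks and strands exceeds the threshold."""
    best = match.max(axis=(-2, -1))            # (batch, seqlen)
    example_idx, position = np.where(best > thr)
    return example_idx, position
```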
In [92]:
hyp_contrib_score_scan.shape
Out[92]:
(19137, 985)
In [115]:
np.all(contrib_score_scan==np.maximum(contrib_score_scan_fwd, contrib_score_scan_rev))
Out[115]:
False
In [116]:
np.all(hyp_contrib_score_scan==np.maximum(hyp_contrib_score_scan_fwd, hyp_contrib_score_scan_rev))
Out[116]:
False
In [122]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd)[:100000], ".")
Out[122]:
[<matplotlib.lines.Line2D at 0x7f622772ba58>]
In [119]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(hyp_contrib_score_scan_fwd)[:100000], ".")
Out[119]:
[<matplotlib.lines.Line2D at 0x7f622c077780>]
In [123]:
plt.hist(np.ravel(pwm_scan_fwd), 100);
plt.xlabel("pwm scan score");
In [128]:
plt.plot(np.ravel(pwm_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd_scale)[:100000], ".")
Out[128]:
[<matplotlib.lines.Line2D at 0x7f622cf470f0>]
In [224]:
plt.plot(np.ravel(contrib_score_scan_fwd)[:100000], np.ravel(contrib_score_scan_fwd_scale)[:100000], ".");
plt.xlabel("Match")
plt.ylabel("scale");
In [226]:
plt.hist(np.ravel(contrib_score_scan_fwd), 100);
In [134]:
plt.plot(np.ravel(contrib_score_scan_fwd)[:100000], np.ravel(hyp_contrib_score_scan_fwd)[:100000], ".")
Out[134]:
[<matplotlib.lines.Line2D at 0x7f622db85ba8>]
In [135]:
plt.plot(np.ravel(contrib_score_scan_fwd_scale)[:100000], np.ravel(hyp_contrib_score_scan_fwd_scale)[:100000], ".")
Out[135]:
[<matplotlib.lines.Line2D at 0x7f622d9482e8>]

Plot scores for the example

In [90]:
viz_dict = dict(contrib=all_contrib[example_idx, 400:600],
                hyp_contrib=all_hyp_contrib[example_idx, 400:600],
                pwm_match=seq_match[example_idx, 400:600],
                contrib_match=match[example_idx, 400:600, 0],
                contrib_scale=importance[example_idx, 400:600, 0])
In [91]:
from basepair.modisco.results import Seqlet
In [121]:
pos_hits = np.argsort(-match[example_idx, 400:600, 0].max(axis=-1))[:2]  # top 2 hits
In [106]:
match[example_idx, 400:600, 0].sum()
Out[106]:
0.00830483751167788
In [109]:
importance[example_idx, 400:600, 0].sum()
Out[109]:
0.11577525138663863
In [96]:
pos_hits
Out[96]:
array([], dtype=int64)
In [137]:
from basepair.utils import halve

def idx2seqlet(pos, pattern):
    # TODO handle the reverse-complementation
    i, j = halve(len(pattern))
    return Seqlet(None, pos-i, pos+j, name=pattern.name)
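`halve` comes from `basepair.utils`; judging from how `idx2seqlet` uses it to center a window of length `len(pattern)` on `pos`, the assumed behavior is an integer split into two near-equal halves:

```python
def halve(n):
    """Assumed behavior of basepair.utils.halve: split an integer length
    into two near-equal parts, e.g. (8, 8) for 16 and (8, 9) for 17."""
    return n // 2, n - n // 2
```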
In [131]:
seqlets = [idx2seqlet(pos, pattern) for pos in pos_hits]
In [129]:
pattern.plot("seq");
In [132]:
plot_tracks(viz_dict, seqlets, rotate_y=0, legend=True);

Benchmarking the functions

In [373]:
%timeit score_region_cont_jaccard(qa, ta)
300 µs ± 33.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [375]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[:100])
362 ms ± 44.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [376]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[:1000])
3.69 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [377]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[0]) # 1kb
2.14 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [378]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0], qa) # 1kb
18.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [379]:
%timeit score_region_cont_jaccard(qa, ta_all_contrib[0][:, :100]) # 100bp
293 µs ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [380]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0][:, :100], qa) # 100bp
1.73 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [315]:
a, a_norm = score_full_regionarr_with_perpos_continjaccard(ta_all_contrib[0][:, :1000], qa)
b, b_norm = score_region_cont_jaccard(qa, ta_all_contrib[0][:, :1000]) # 100bp
In [336]:
assert b_norm[0] == a_norm[0]
In [382]:
%timeit np.stack([score_region_cont_jaccard(qa, ta_all_contrib[i]) for i in range(1000)])
2.34 s ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [390]:
scanned =  np.stack(Parallel(10)(delayed(score_region_cont_jaccard)(qa, ta_all_contrib[i]) for i in tqdm(range(len(ta_all_contrib)))))
In [391]:
ta_all_contrib.shape
Out[391]:
(19137, 4, 1000)
In [415]:
# TODO - add unit-tests
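One such unit test could plant the query pattern in a random track and check that a naive reference scanner recovers it. The continuous-Jaccard definition below is an assumption about what `score_region_cont_jaccard` computes, and the function names are hypothetical:

```python
import numpy as np

def cont_jaccard(q, t):
    """Assumed continuous Jaccard: signed sum of element-wise minima of the
    magnitudes over the sum of maxima; equals 1 for identical tracks."""
    num = np.sum(np.minimum(np.abs(q), np.abs(t)) * np.sign(q) * np.sign(t))
    den = np.sum(np.maximum(np.abs(q), np.abs(t)))
    return num / den if den else 0.0

def scan_cont_jaccard(q, track):
    """Naive per-position scan of query (channels, L) over track (channels, seqlen)."""
    L = q.shape[1]
    n_pos = track.shape[1] - L + 1
    return np.array([cont_jaccard(q, track[:, i:i + L]) for i in range(n_pos)])

def test_planted_pattern_is_top_hit():
    rng = np.random.RandomState(0)
    q = rng.randn(4, 16)
    track = rng.randn(4, 200) * 0.1
    track[:, 7:23] = q                         # plant the pattern at position 7
    scores = scan_cont_jaccard(q, track)
    assert scores.argmax() == 7
    assert abs(scores[7] - 1.0) < 1e-9

test_planted_pattern_is_top_hit()
```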
In [337]:
other_jacc = jaccard_sim_func(ta_all_contrib[0][:, :16][np.newaxis]*qa_L1_norm / a_norm[0], qa[np.newaxis])[0,0]
In [343]:
np.allclose(other_jacc, a[0])
Out[343]:
False
In [344]:
np.allclose(other_jacc, b[0])
Out[344]:
True
In [348]:
other_jacc = jaccard_sim_func(ta_all_contrib[0][:, 1:17][np.newaxis]*qa_L1_norm / b_norm[1], qa[np.newaxis])[0,0]
In [349]:
np.allclose(other_jacc, a[1])
Out[349]:
False
In [350]:
np.allclose(other_jacc, b[1])
Out[350]:
True
In [316]:
plt.scatter(a_norm, b_norm)
Out[316]:
<matplotlib.collections.PathCollection at 0x7f40fa670198>
In [318]:
plt.scatter(np.ravel(a), np.ravel(b));
In [ ]:
%timeit score_region_cont_jaccard(qa, ta)
In [159]:
%timeit score_full_regionarr_with_perpos_continjaccard(ta, qa)
3.04 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pattern scanning for the whole group

In [187]:
from basepair.modisco.utils import ic_scale

def scan_pattern(pattern_grp, contrib_scores, hypothetical_contribs, one_hot, task_names, trim_frac=0, verbose=True):
    """
    final modisco results scan structure
    - sequence
      - fwd
      - rev
    - <task>
      - <imp score>
        - match
           - fwd
           - rev
        - scale
    """
    # TODO - add i and j to the output?

    pwm = pattern_grp['sequence']['fwd'][:]
    trim_ij = trim_pssm_idx(ic_scale(pwm), frac=trim_frac)
    pattern_len = len(pwm)
    
    def get_group_match(grp, name):
        for n in grp.keys():
            if name in n:
                return grp[n]
        raise ValueError("{name} doesn't match any group keys: {l}".format(name=name, l=list(grp.keys())))
        
    def parallel_score_continousjaccard_restructure(qa, ta, **kwargs):
        fwd, fwd_scale = parallel_score_continousjaccard(qa, ta, **kwargs)
        rev, rev_scale = parallel_score_continousjaccard(qa[::-1, ::-1], ta, **kwargs)
        return dict(match=dict(fwd=fwd, rev=rev),
                    scale=dict(fwd=fwd_scale, rev=rev_scale),
                   )
        
    trim_ij_rc = (pattern_len - trim_ij[1], pattern_len - trim_ij[0])
    return dict(sequence=dict(fwd=parallel_score_pssm(pwm[trim_ij[0]:trim_ij[1]], one_hot, verbose=verbose),
                              rev=parallel_score_pssm(pwm[trim_ij[0]:trim_ij[1]][::-1, ::-1], one_hot, verbose=verbose)),
                **{task: {"contrib_scores": parallel_score_continousjaccard_restructure(get_group_match(pattern_grp[task], "contrib_scores")['fwd'][trim_ij[0]:trim_ij[1]], contrib_scores),
                          "hypothetical_contribs": parallel_score_continousjaccard_restructure(get_group_match(pattern_grp[task], "hypothetical_contribs")['fwd'][trim_ij[0]:trim_ij[1]], hypothetical_contribs),}
                   for task in tqdm(task_names, disable=not verbose)}
               )
In [190]:
tasks
Out[190]:
['Oct4', 'Sox2', 'Klf4', 'Nanog']
In [191]:
mp1 = scan_pattern(mr.get_pattern_grp("metacluster_0", "pattern_0"), all_contrib, all_hyp_contrib, all_seq, tasks, trim_frac=0.08)
In [199]:
all_seq.size
Out[199]:
76548000
In [217]:
idx = pd.Series(np.arange(all_seq.shape[0])).sample(10000).values
In [218]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Sox2']['contrib_scores']['match']['fwd'])[idx], ".")
Out[218]:
[<matplotlib.lines.Line2D at 0x7f622d3dc940>]
In [219]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Nanog']['contrib_scores']['match']['fwd'])[idx], ".")
Out[219]:
[<matplotlib.lines.Line2D at 0x7f622b7a72b0>]
In [220]:
plt.plot(np.ravel(mp1['Oct4']['contrib_scores']['match']['fwd'])[idx], np.ravel(mp1['Klf4']['contrib_scores']['match']['fwd'])[idx], ".")
Out[220]:
[<matplotlib.lines.Line2D at 0x7f622c90aeb8>]
In [221]:
mp1['Oct4']['contrib_scores']['match']['fwd'].shape
Out[221]:
(19137, 985)

Ideas

  • how much of the blob can be explained by other motifs?
    • what are the important motifs
  • first try to assign the
  • IDR: check Nathan's code - take from kundajelab/idr
    • plot rank-rank
  • affinity propagation
    • run on the predictions as well as real data
  • footprint -> DNA binding domain
    • crystal structure?
      • oct-sox
  • orthogonal validation?

    • expression - MPRA
    • Nanog SELEX dataset
      • re-analyze the data?
    • MPRA - H3K27ac
      • try to predict the enhancer activity
      • stratify the peaks into acetylated vs non-acetylated
  • higher fraction of the

    • allele-specific binding?
    • train in human on iPS lines
  • transposable elements are not conserved in human

  • simulations -> the profile only gives a sense of what might be going on

  • affinity with SELEX
  • MPRA
  • acetylation
  • Human <-> mouse allelic effects
  • overlap with the histone data

  • question:

    • mixture model also predicting things that are not bound
    • repressors
      • question: are there any repressors?
    • idea:
      • add more tasks to the output