Motivation and goal

pattern table hows that many motifs are very similar to each other

  • Goal: how much of the motif heterogeneity is real and how much is just noise?
    • Rank motifs by how different they are in the marginal sense
  • Can you confidently merge or discard motifs?

Although the sequence motif is the same, the context changes hence the way this motif influences the binding are differnet.

Tasks

  • [ ] Find a single compelling example of a motif in differnet contexts by eyeballing the pattern table
  • [ ] design the statistical test for it
    • can it answer your questions / intuition?
  • [x] implement the three motif visualizations
    • predicted (same context)
    • predicted (random context, dinucl-shuffled)
    • observed
  • [ ] Compute the distance between two profiles (including a p-value using bootstrap)
    • think if we have to collapse them
  • [ ] think how to scale this up and merge similar motifs
    • add the major motifs (most frequent) within each group
    • measure the statistical difference to the major motifs w.r.t.
      • the profile counts
      • occurrence of other motifs

Questions

  • is there an issue with visualization
In [1]:
from basepair.imports import *
from pybedtools import BedTool
create_tf_session(1)
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")
modisco_dir = model_dir / f"modisco/all/profile/"
output_dir = Path("/srv/www/kundaje/avsec/chipnexus/oct-sox-nanog-klf/models/n_dil_layers=9/modisco/all/profile")
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
In [2]:
# load
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
pattern_table = pd.read_csv(output_dir / "pattern_table.csv")
df_clusters = pd.read_csv(output_dir / 'motif_clustering/short-motif.seq-cluster.csv')  # cluster table

# read only the non-TE patterns
patterns = read_pkl(output_dir / 'patterns.pkl')
patterns = [p for p in patterns if p.name in df_clusters.pattern.tolist()]
tasks = patterns[0].tasks()
In [3]:
len(patterns)
Out[3]:
96
In [27]:
df_clusters.head()
Out[27]:
pattern cluster_order cluster
0 metacluster_7/pattern_6 0 8
1 metacluster_2/pattern_12 1 8
2 metacluster_3/pattern_3 2 8
3 metacluster_2/pattern_0 3 8
4 metacluster_7/pattern_1 4 8

Implementation: Model prediction for differnet context values

In [28]:
p = patterns[0]

Measured

In [29]:
p.plot("profile", rotate_y=0)
Out[29]:

Actual seqlets

In [30]:
from basepair.preproc import resize_interval
from basepair.plot.tracks import plot_tracks, filter_tracks
In [31]:
bpnet = BPNetPredictor.from_mdir(model_dir)
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
2018-11-28 09:11:41,784 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2018-11-28 09:11:54,918 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
In [32]:
intervals = [resize_interval(x, bpnet.input_seqlen()) for x in BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed")]
preds = bpnet.predict(intervals, compute_grads=False, use_strand=True)

scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in tasks}
In [33]:
plot_tracks(filter_tracks(scaled_preds, [400, 600]), fig_height_per_track=0.8, fig_width=15)
Out[33]:

Seqlets with di-nucleotide shuffled background

In [34]:
orig_intervals = list(BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed"))
In [35]:
widths = np.array([i.end - i.start for i in orig_intervals])
In [36]:
assert widths.std() == 0
In [37]:
motif_width = widths[0]
In [38]:
intervals = [resize_interval(x, bpnet.input_seqlen()) for x in orig_intervals]
In [39]:
input_seq = bpnet.get_seq(intervals, use_strand=True)
In [40]:
# figure out the right start and end in the whole array
center = bpnet.input_seqlen() // 2
start = center - motif_width // 2
end = center + motif_width // 2 + motif_width % 2
In [41]:
assert end - start == motif_width
In [42]:
input_seq.shape
Out[42]:
(9373, 1000, 4)
In [43]:
# one-hot -> strings
from concise.preprocessing.sequence import one_hot2string, DNA, encodeDNA

up_seqs = one_hot2string(input_seq[:,:start], DNA)
down_seqs = one_hot2string(input_seq[:,end:], DNA)
center_seqs = one_hot2string(input_seq[:,start:end], DNA)

assert len(up_seqs[0]) == start
assert len(down_seqs[0]) == bpnet.input_seqlen() - end
In [44]:
# di-nucleotide shuffle the sequences
from deeplift.dinuc_shuffle import dinuc_shuffle

shuffled_up_seqs = [dinuc_shuffle(s) for s in up_seqs]
shuffled_down_seqs = [dinuc_shuffle(s) for s in down_seqs]
In [45]:
shuffled_seqs = encodeDNA([shuffled_up_seqs[i] + center_seqs[i] + shuffled_down_seqs[i] 
                           for i in range(len(up_seqs))])

# shape still the same
assert shuffled_seqs.shape == input_seq.shape

assert not np.allclose(shuffled_seqs[0], input_seq[0])
In [46]:
# Make predictions
sh_preds = bpnet.seq_predict(shuffled_seqs, compute_grads=False)
In [47]:
sh_scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in tasks}
In [48]:
plot_tracks(filter_tracks(sh_scaled_preds, [400, 600]), fig_height_per_track=0.8, fig_width=15)
Out[48]:

Package into a function

In [49]:
from pybedtools import BedTool
import numpy as np
from basepair.preproc import resize_interval
from concise.preprocessing.sequence import one_hot2string, DNA, encodeDNA
from deeplift.dinuc_shuffle import dinuc_shuffle
from basepair.plot.tracks import filter_tracks


def predict_pattern_footprint_random_ctx(p, bpnet, output_dir):
    # load the original intervals and get the motif width
    orig_intervals = list(BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed"))
    widths = np.array([i.end - i.start for i in orig_intervals])
    assert widths.std() == 0
    motif_width = widths[0]

    # get sequences
    intervals = [resize_interval(x, bpnet.input_seqlen()) for x in orig_intervals]
    input_seq = bpnet.get_seq(intervals, use_strand=True)

    # figure out the right start and end in the whole array
    center = bpnet.input_seqlen() // 2
    start = center - motif_width // 2
    end = center + motif_width // 2 + motif_width % 2
    assert end - start == motif_width

    # one-hot -> strings
    up_seqs = one_hot2string(input_seq[:, :start], DNA)
    down_seqs = one_hot2string(input_seq[:, end:], DNA)
    center_seqs = one_hot2string(input_seq[:, start:end], DNA)
    assert len(up_seqs[0]) == start
    assert len(down_seqs[0]) == bpnet.input_seqlen() - end

    # di-nucleotide shuffle the sequences
    shuffled_up_seqs = [dinuc_shuffle(s) for s in up_seqs]
    shuffled_down_seqs = [dinuc_shuffle(s) for s in down_seqs]
    shuffled_seqs = encodeDNA([shuffled_up_seqs[i] + center_seqs[i] + shuffled_down_seqs[i]
                               for i in range(len(up_seqs))])
    # shape still the same
    assert shuffled_seqs.shape == input_seq.shape
    assert not np.allclose(shuffled_seqs[0], input_seq[0])

    # Make predictions
    sh_preds = bpnet.seq_predict(shuffled_seqs, compute_grads=False)
    sh_scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in sh_preds]).mean(0) for t in p.tasks()}
    return sh_scaled_preds


def predict_pattern_footprint(p, bpnet, output_dir):
    intervals = [resize_interval(x, bpnet.input_seqlen()) for x in BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed")]
    preds = bpnet.predict(intervals, compute_grads=False, use_strand=True)
    return {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in p.tasks()}


def footprint_context_pred(p, bpnet, output_dir):
    return {"measured": p.profile,
            "predicted": filter_tracks(predict_pattern_footprint(p, bpnet, output_dir), [400, 600]),
            "predicted_random_ctx": filter_tracks(predict_pattern_footprint_random_ctx(p, bpnet, output_dir), [400, 600])}
In [168]:
out = footprint_context_pred(p, bpnet, output_dir)
In [60]:
# FIX - matplotlib not showing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
In [186]:
for k,v in out.items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[186]:
Out[186]:
Out[186]:

Compute it for multiple motifs

Choose cluster 3 (Nanog).

In [50]:
idx = df_clusters[df_clusters.cluster==8].cluster_order.values
In [56]:
results_nanog = OrderedDict([(row.pattern, footprint_context_pred(patterns[row.cluster_order], bpnet, output_dir) )
                        for i,row in tqdm(df_clusters[df_clusters.cluster==3].iterrows())])
0it [00:00, ?it/s]
1it [00:01,  1.45s/it]
2it [00:02,  1.03s/it]
3it [00:02,  1.22it/s]
4it [00:19,  4.99s/it]
5it [00:25,  5.10s/it]
6it [00:29,  5.00s/it]
7it [00:33,  4.74s/it]
8it [00:35,  4.50s/it]
9it [00:37,  4.17s/it]
10it [00:38,  3.85s/it]
11it [00:39,  3.55s/it]
In [65]:
patterns_d = {p.name:p for p in patterns}
In [60]:
# FIX - matplotlib not showing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Nanog

Predict for three motifs

  • m0_p3
  • m4_p1
  • m1_p2
In [68]:
pname = 'metacluster_0/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[68]:
Out[68]:
Out[68]:
Out[68]:
In [69]:
pname = 'metacluster_4/pattern_1'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[69]:
Out[69]:
Out[69]:
Out[69]:
In [70]:
pname = 'metacluster_1/pattern_2'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[70]:
Out[70]:
Out[70]:
Out[70]:

Klf4

  • m2/p0
  • m6/p3
  • m2/p3
In [54]:
results_klf4 = OrderedDict([(row.pattern, footprint_context_pred(patterns[row.cluster_order], bpnet, output_dir) )
                            for i,row in tqdm(df_clusters[df_clusters.cluster==8].iterrows())])
0it [00:00, ?it/s]
1it [00:32, 32.89s/it]
2it [00:44, 22.48s/it]
3it [00:50, 16.90s/it]
4it [00:54, 13.72s/it]
5it [00:58, 11.76s/it]
6it [01:01, 10.29s/it]
7it [01:03,  9.10s/it]
8it [01:05,  8.16s/it]
9it [01:06,  7.39s/it]
10it [01:07,  6.73s/it]
11it [01:08,  6.19s/it]
12it [01:08,  5.73s/it]
13it [01:09,  5.33s/it]
14it [01:09,  4.99s/it]
15it [01:10,  4.69s/it]
16it [01:10,  4.43s/it]
17it [01:31,  5.37s/it]
18it [01:40,  5.59s/it]
19it [01:42,  5.40s/it]
In [72]:
pname = 'metacluster_2/pattern_0'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[72]:
Out[72]:
Out[72]:
Out[72]:
In [73]:
pname = 'metacluster_6/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[73]:
Out[73]:
Out[73]:
Out[73]:
In [74]:
pname = 'metacluster_2/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)
Out[74]:
Out[74]:
Out[74]:
Out[74]: