Motivation and goal¶

pattern table hows that many motifs are very similar to each other

Goal: how much of the motif heterogeneity is real and how much is just noise?
- Rank motifs by how different they are in the marginal sense
Can you confidently merge or discard motifs?

Although the sequence motif is the same, the context changes hence the way this motif influences the binding are differnet.

Tasks¶

[ ] Find a single compelling example of a motif in differnet contexts by eyeballing the pattern table
[ ] design the statistical test for it
- can it answer your questions / intuition?
[x] implement the three motif visualizations
- predicted (same context)
- predicted (random context, dinucl-shuffled)
- observed
[ ] Compute the distance between two profiles (including a p-value using bootstrap)
- think if we have to collapse them
[ ] think how to scale this up and merge similar motifs
- add the major motifs (most frequent) within each group
- measure the statistical difference to the major motifs w.r.t.
  - the profile counts
  - occurrence of other motifs

Questions¶

is there an issue with visualization

from basepair.imports import *
from pybedtools import BedTool
create_tf_session(1)
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")
modisco_dir = model_dir / f"modisco/all/profile/"
output_dir = Path("/srv/www/kundaje/avsec/chipnexus/oct-sox-nanog-klf/models/n_dil_layers=9/modisco/all/profile")

/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.

# load
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
pattern_table = pd.read_csv(output_dir / "pattern_table.csv")
df_clusters = pd.read_csv(output_dir / 'motif_clustering/short-motif.seq-cluster.csv')  # cluster table

# read only the non-TE patterns
patterns = read_pkl(output_dir / 'patterns.pkl')
patterns = [p for p in patterns if p.name in df_clusters.pattern.tolist()]
tasks = patterns[0].tasks()

len(patterns)

96

df_clusters.head()

Implementation: Model prediction for differnet context values¶

p = patterns[0]

Measured¶

p.plot("profile", rotate_y=0)

Actual seqlets¶

from basepair.preproc import resize_interval
from basepair.plot.tracks import plot_tracks, filter_tracks

bpnet = BPNetPredictor.from_mdir(model_dir)

WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead

2018-11-28 09:11:41,784 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:497: calling conv1d (from tensorflow.python.ops.nn_ops) with data_format=NHWC is deprecated and will be removed in a future version.
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead

WARNING:tensorflow:From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

2018-11-28 09:11:54,918 [WARNING] From /users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

intervals = [resize_interval(x, bpnet.input_seqlen()) for x in BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed")]
preds = bpnet.predict(intervals, compute_grads=False, use_strand=True)

scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in tasks}

plot_tracks(filter_tracks(scaled_preds, [400, 600]), fig_height_per_track=0.8, fig_width=15)

Seqlets with di-nucleotide shuffled background¶

orig_intervals = list(BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed"))

widths = np.array([i.end - i.start for i in orig_intervals])

assert widths.std() == 0

motif_width = widths[0]

intervals = [resize_interval(x, bpnet.input_seqlen()) for x in orig_intervals]

input_seq = bpnet.get_seq(intervals, use_strand=True)

# figure out the right start and end in the whole array
center = bpnet.input_seqlen() // 2
start = center - motif_width // 2
end = center + motif_width // 2 + motif_width % 2

assert end - start == motif_width

input_seq.shape

(9373, 1000, 4)

# one-hot -> strings
from concise.preprocessing.sequence import one_hot2string, DNA, encodeDNA

up_seqs = one_hot2string(input_seq[:,:start], DNA)
down_seqs = one_hot2string(input_seq[:,end:], DNA)
center_seqs = one_hot2string(input_seq[:,start:end], DNA)

assert len(up_seqs[0]) == start
assert len(down_seqs[0]) == bpnet.input_seqlen() - end

# di-nucleotide shuffle the sequences
from deeplift.dinuc_shuffle import dinuc_shuffle

shuffled_up_seqs = [dinuc_shuffle(s) for s in up_seqs]
shuffled_down_seqs = [dinuc_shuffle(s) for s in down_seqs]

shuffled_seqs = encodeDNA([shuffled_up_seqs[i] + center_seqs[i] + shuffled_down_seqs[i] 
                           for i in range(len(up_seqs))])

# shape still the same
assert shuffled_seqs.shape == input_seq.shape

assert not np.allclose(shuffled_seqs[0], input_seq[0])

# Make predictions
sh_preds = bpnet.seq_predict(shuffled_seqs, compute_grads=False)

sh_scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in tasks}

plot_tracks(filter_tracks(sh_scaled_preds, [400, 600]), fig_height_per_track=0.8, fig_width=15)

Package into a function¶

from pybedtools import BedTool
import numpy as np
from basepair.preproc import resize_interval
from concise.preprocessing.sequence import one_hot2string, DNA, encodeDNA
from deeplift.dinuc_shuffle import dinuc_shuffle
from basepair.plot.tracks import filter_tracks


def predict_pattern_footprint_random_ctx(p, bpnet, output_dir):
    # load the original intervals and get the motif width
    orig_intervals = list(BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed"))
    widths = np.array([i.end - i.start for i in orig_intervals])
    assert widths.std() == 0
    motif_width = widths[0]

    # get sequences
    intervals = [resize_interval(x, bpnet.input_seqlen()) for x in orig_intervals]
    input_seq = bpnet.get_seq(intervals, use_strand=True)

    # figure out the right start and end in the whole array
    center = bpnet.input_seqlen() // 2
    start = center - motif_width // 2
    end = center + motif_width // 2 + motif_width % 2
    assert end - start == motif_width

    # one-hot -> strings
    up_seqs = one_hot2string(input_seq[:, :start], DNA)
    down_seqs = one_hot2string(input_seq[:, end:], DNA)
    center_seqs = one_hot2string(input_seq[:, start:end], DNA)
    assert len(up_seqs[0]) == start
    assert len(down_seqs[0]) == bpnet.input_seqlen() - end

    # di-nucleotide shuffle the sequences
    shuffled_up_seqs = [dinuc_shuffle(s) for s in up_seqs]
    shuffled_down_seqs = [dinuc_shuffle(s) for s in down_seqs]
    shuffled_seqs = encodeDNA([shuffled_up_seqs[i] + center_seqs[i] + shuffled_down_seqs[i]
                               for i in range(len(up_seqs))])
    # shape still the same
    assert shuffled_seqs.shape == input_seq.shape
    assert not np.allclose(shuffled_seqs[0], input_seq[0])

    # Make predictions
    sh_preds = bpnet.seq_predict(shuffled_seqs, compute_grads=False)
    sh_scaled_preds = {t: np.stack([x['preds']['scaled_profile'][t] for x in sh_preds]).mean(0) for t in p.tasks()}
    return sh_scaled_preds


def predict_pattern_footprint(p, bpnet, output_dir):
    intervals = [resize_interval(x, bpnet.input_seqlen()) for x in BedTool(f"{output_dir}/modisco_seqlets/{p.name}.bed")]
    preds = bpnet.predict(intervals, compute_grads=False, use_strand=True)
    return {t: np.stack([x['preds']['scaled_profile'][t] for x in preds]).mean(0) for t in p.tasks()}


def footprint_context_pred(p, bpnet, output_dir):
    return {"measured": p.profile,
            "predicted": filter_tracks(predict_pattern_footprint(p, bpnet, output_dir), [400, 600]),
            "predicted_random_ctx": filter_tracks(predict_pattern_footprint_random_ctx(p, bpnet, output_dir), [400, 600])}

out = footprint_context_pred(p, bpnet, output_dir)

# FIX - matplotlib not showing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

for k,v in out.items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

Compute it for multiple motifs¶

Choose cluster 3 (Nanog).

idx = df_clusters[df_clusters.cluster==8].cluster_order.values

results_nanog = OrderedDict([(row.pattern, footprint_context_pred(patterns[row.cluster_order], bpnet, output_dir) )
                        for i,row in tqdm(df_clusters[df_clusters.cluster==3].iterrows())])

0it [00:00, ?it/s]
1it [00:01,  1.45s/it]
2it [00:02,  1.03s/it]
3it [00:02,  1.22it/s]
4it [00:19,  4.99s/it]
5it [00:25,  5.10s/it]
6it [00:29,  5.00s/it]
7it [00:33,  4.74s/it]
8it [00:35,  4.50s/it]
9it [00:37,  4.17s/it]
10it [00:38,  3.85s/it]
11it [00:39,  3.55s/it]

patterns_d = {p.name:p for p in patterns}

# FIX - matplotlib not showing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Nanog¶

Predict for three motifs¶

m0_p3
m4_p1
m1_p2

pname = 'metacluster_0/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

pname = 'metacluster_4/pattern_1'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

pname = 'metacluster_1/pattern_2'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_nanog[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

Klf4¶

m2/p0
m6/p3
m2/p3

results_klf4 = OrderedDict([(row.pattern, footprint_context_pred(patterns[row.cluster_order], bpnet, output_dir) )
                            for i,row in tqdm(df_clusters[df_clusters.cluster==8].iterrows())])

0it [00:00, ?it/s]
1it [00:32, 32.89s/it]
2it [00:44, 22.48s/it]
3it [00:50, 16.90s/it]
4it [00:54, 13.72s/it]
5it [00:58, 11.76s/it]
6it [01:01, 10.29s/it]
7it [01:03,  9.10s/it]
8it [01:05,  8.16s/it]
9it [01:06,  7.39s/it]
10it [01:07,  6.73s/it]
11it [01:08,  6.19s/it]
12it [01:08,  5.73s/it]
13it [01:09,  5.33s/it]
14it [01:09,  4.99s/it]
15it [01:10,  4.69s/it]
16it [01:10,  4.43s/it]
17it [01:31,  5.37s/it]
18it [01:40,  5.59s/it]
19it [01:42,  5.40s/it]

pname = 'metacluster_2/pattern_0'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

pname = 'metacluster_6/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

pname = 'metacluster_2/pattern_3'
patterns_d[pname].plot(['seq', 'contrib'], rotate_y=0)
for k,v in results_klf4[pname].items():
    plot_tracks(v, fig_height_per_track=0.8, fig_width=15, title=k)

	pattern	cluster_order	cluster
0	metacluster_7/pattern_6	0	8
1	metacluster_2/pattern_12	1	8
2	metacluster_3/pattern_3	2	8
3	metacluster_2/pattern_0	3	8
4	metacluster_7/pattern_1	4	8