Goal

  • determine when to merge and when not to merge different TF-MoDISco patterns

Tasks

  • [x] plot the distribution of importance scores for each pattern for each task
    • shall we use one big boxplot or violin plot?
      • x-axis = different patterns / tasks
      • y-axis = importance
  • [x] for each leaf in the hierarchical clustering, compare the distribution between two patterns vs. a randomly shuffled version
    • null distribution = randomly assign seqlets to different groups (keep the same number of seqlets in each group)
      • do this many times
        • compare with the test statistic
    • test statistic options:
      • distance between two distributions
        • distance between two seq. PWM's / centroids
          • assume independence between bases (?)
        • do the same for the two profile distributions (KL divergence)
        • compare the two total-count distributions
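
The shuffling scheme above can be sketched as a generic permutation test (a minimal NumPy sketch; `statistic`, `a` and `b` are stand-ins for the actual aligned-distance statistic and the per-pattern seqlet scores):

```python
import numpy as np

def permutation_pvalue(a, b, statistic, n_permute=1000, seed=0):
    """Permutation test: randomly re-assign seqlets to the two groups
    (keeping group sizes fixed), recompute the statistic many times,
    and compare the observed statistic against this null distribution."""
    rng = np.random.RandomState(seed)
    observed = statistic(a, b)
    pooled = np.concatenate([a, b])
    n_a = len(a)
    null = np.empty(n_permute)
    for i in range(n_permute):
        perm = rng.permutation(len(pooled))
        null[i] = statistic(pooled[perm[:n_a]], pooled[perm[n_a:]])
    # fraction of shuffles at least as extreme as the observed statistic
    return float(np.mean(null >= observed))

# toy example: distance between group means as the test statistic
stat = lambda x, y: abs(x.mean() - y.mean())
a = np.random.RandomState(1).normal(0.0, 1, 50)
b = np.random.RandomState(2).normal(1.5, 1, 50)
pval = permutation_pvalue(a, b, stat)
```

With clearly separated groups the observed statistic exceeds essentially all shuffled statistics, so the permutation p-value approaches zero.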

Implement the stat. test for pairwise distance

  • [x] implement method p1.aligned_distance(p2, which='seq'), where which can be 'seq', 'counts_profile' or 'counts_total'
    • this compares values from seq, contrib, hyp_contrib or profile using e.g. KL divergence
  • [x] implement function shuffle_seqlets(p1, p2) which returns two new seqlet sets derived from shuffled seqlet values
  • [x] run this for two closest patterns
    • are any of the features significantly different from the null while controlling for multiple testing?
  • [x] can we visualize this somehow in a matrix?
    • compare two neighbours
    • visualize the p-values as a heatmap, with significance shown as stars
      • (maybe) visualize the cluster/pattern group boundaries
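
For the sequence test statistic, one concrete choice (assuming independence between bases, as noted above) is the position-averaged symmetrized KL divergence between the two aligned PWMs. A minimal sketch, which may differ from the actual `aligned_distance` implementation:

```python
import numpy as np

def pwm_kl_distance(pwm1, pwm2, pseudocount=1e-6):
    """Position-averaged symmetrized KL divergence between two aligned
    PWMs of shape (length, 4), assuming independence between bases."""
    p = pwm1 + pseudocount
    q = pwm2 + pseudocount
    p = p / p.sum(axis=1, keepdims=True)
    q = q / q.sum(axis=1, keepdims=True)
    kl_pq = np.sum(p * np.log(p / q), axis=1)
    kl_qp = np.sum(q * np.log(q / p), axis=1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))

# identical PWMs have distance ~0; a peaked vs. uniform PWM does not
uniform = np.full((5, 4), 0.25)
peaked = np.tile([0.85, 0.05, 0.05, 0.05], (5, 1))
d_same = pwm_kl_distance(uniform, uniform)
d_diff = pwm_kl_distance(uniform, peaked)
```

The pseudocount avoids division by zero for bases with probability 0; symmetrizing makes the distance independent of argument order.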

Conclusions

  • motifs differ significantly at the sequence level
  • the profile count distributions differ much less

Open questions

  • is it possible that two patterns are not very similar, yet show very different importance profiles?

Next steps

  • [ ] implement distance-weighted importance scores which would compute the average distance over which a pattern affects the profile
    • compare the values across different metaclusters and see if some act at longer distances
  • [ ] check the importance scores of two patterns and see whether the same bases get highlighted or not
  • [ ] can we generalize this step of computing features to any derived feature?
    • this would require storing the seqlets in the pattern as well and deriving all properties from the seqlets / importance scores themselves
In [1]:
# Imports
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from basepair.imports import *
hv.extension('bokeh')
Using TensorFlow backend.
In [2]:
# Common paths
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")
modisco_dir = model_dir / f"modisco/all/deeplift/profile/"
In [3]:
mr = ModiscoResult(modisco_dir / 'modisco.h5')
mr.open()
In [4]:
patterns = read_pkl(modisco_dir / 'patterns.pkl')
In [5]:
patterns[0].attrs.keys()
Out[5]:
dict_keys(['features', 'align', 'cluster', 'stacked_seqlet_imp'])
In [6]:
tasks = patterns[0].tasks()
In [7]:
tasks
Out[7]:
['Klf4', 'Nanog', 'Oct4', 'Sox2']

Append metacluster activity

In [8]:
dfa = mr.metacluster_activity()
dfa.columns = dfa.columns.str.replace("/weighted", "")
In [9]:
def get_metacluster(pname):
    return int(pname.split("/")[0].split("_")[1])
In [10]:
patterns = [p.add_attr('metacluster', get_metacluster(p.name)) for p in patterns]
patterns = [p.add_attr('metacluster_activity', dict(dfa.loc[p.attrs['metacluster']])) for p in patterns]

Stat-test for difference between two groups

  • [x] for each leaf in the hierarchical clustering, compare the distribution between two patterns vs. a randomly shuffled version
    • null distribution = randomly assign seqlets to different groups (keep the same number of seqlets in each group)
      • do this many times
        • compare with the test statistic
    • test statistic options:
      • distance between two distributions
        • distance between two seq. PWM's / centroids
          • assume independence between bases (?)
        • do the same for the two profile distributions (KL divergence)
        • compare the two total-count distributions
In [11]:
p1 = patterns[0].attrs['stacked_seqlet_imp'].aggregate()
p2 = patterns[1].attrs['stacked_seqlet_imp'].aggregate()
In [12]:
p1.plot("seq_ic")
p2.plot("seq_ic");
TF-MoDISco is using the TensorFlow backend.

Implement the stat. test for pairwise distance

  • [x] implement method p1.aligned_distance(p2, which='seq'), where which can be 'seq', 'counts_profile' or 'counts_total'
    • this compares values from seq, contrib, hyp_contrib or profile using e.g. KL divergence
In [13]:
p1.aligned_distance_seq(p2)
Out[13]:
0.1506452696564208
In [14]:
p1.aligned_distance_profile(p2, pseudo_p=0)
Out[14]:
{'Klf4': 0.03329021017998457,
 'Nanog': 0.03392407321371138,
 'Oct4': 0.06577608967199922,
 'Sox2': 0.18537728488445282}
  • [x] implement function shuffle_seqlets(p1, p2) which returns two new seqlet sets derived from shuffled seqlet values
In [15]:
s1 = patterns[0].attrs['stacked_seqlet_imp']
s2 = patterns[1].attrs['stacked_seqlet_imp']
In [16]:
from basepair.modisco.core import StackedSeqletImp, shuffle_seqlets
In [17]:
s1s, s2s = shuffle_seqlets(s1, s2)
In [18]:
s1.aggregate().plot('seq_ic');
s2.aggregate().plot('seq_ic');
In [19]:
s1s.aggregate().plot('seq_ic');
s2s.aggregate().plot('seq_ic');
  • [x] run this for two closest patterns
    • are any of the features significantly different from the null while controlling for multiple testing?
In [195]:
from basepair.utils import flatten
from basepair.modisco.core import StackedSeqletImp, shuffle_seqlets


def compare_patterns(p1, p2):
    d = {"seq": p1.aligned_distance_seq(p2),
         "profile": p1.aligned_distance_profile(p2)}
    return flatten(d, separator='/')

def null_dist(s12, len_s1):
    s1s, s2s = s12.shuffle().split(len_s1)
    return compare_patterns(s1s.aggregate(), s2s.aggregate())

def dist_pval(s1, s2, permute=100, parallel=None, verbose=True):
    if parallel is None:
        parallel = Parallel(n_jobs=20)
        
    s12 = s1.append(s2)
    len_s1 = len(s1)
    df_null = pd.DataFrame(parallel(delayed(null_dist)(s12, len_s1) 
                                    for i in tqdm(range(permute), disable=not verbose)))

    alt = compare_patterns(s1.aggregate(), s2.aggregate())
    # what's the probability that we would observe the event or more extreme by chance?
    pvals = {t: np.mean(df_null[t] >= alt[t]) for t in alt}
    return pvals
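
To control for multiple testing across the per-feature p-values returned by `dist_pval`, a Benjamini-Hochberg FDR adjustment could be applied. A minimal NumPy sketch, not tied to any particular library:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment of a 1-D array of p-values."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    # scale the i-th smallest p-value by n / i
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.clip(adjusted, 0, 1)
    # return values in the original order
    out = np.empty(n)
    out[order] = adjusted
    return out

adj = bh_adjust([0.001, 0.02, 0.03, 0.8])
```

Features whose adjusted p-value stays below the chosen FDR threshold would then count as significantly different.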
In [210]:
%tqdm_restart
In [203]:
%time dist_pval(s1, s2, 1000, Parallel(n_jobs=40))
100%|██████████| 1000/1000 [00:31<00:00, 31.65it/s]
CPU times: user 4.8 s, sys: 5.07 s, total: 9.87 s
Wall time: 33.2 s
Out[203]:
{'seq': 0.0,
 'profile/Klf4': 0.306,
 'profile/Nanog': 0.427,
 'profile/Oct4': 0.083,
 'profile/Sox2': 0.548}
In [211]:
dfp_bak = dfp
In [212]:
out = []
#with Parallel(n_jobs=20) as parallel:  # re-use the pool of workers
for i in tqdm(range(len(patterns) - 1)):
    p1 = patterns[i]
    p2 = patterns[i+1]
    s1 = p1.attrs['stacked_seqlet_imp']
    s2 = p2.attrs['stacked_seqlet_imp']

    c1 = p1.attrs['cluster']
    c2 = p2.attrs['cluster']
    out.append({"patterns": f"{shorten_pattern(p1.name)}-{shorten_pattern(p2.name)}",
                "clusters": f"{c1}-{c2}",
                "same_cluster": c1 == c2,
                **dist_pval(s1, s2, permute=10000, parallel=Parallel(n_jobs=40), verbose=False)
               })
dfp = pd.DataFrame(out)
100%|██████████| 121/121 [58:28<00:00, 26.26s/it]
In [213]:
dfp = dfp.set_index(['clusters', 'patterns'])
In [219]:
ls {modisco_dir}
centroid_seqlet_matches.csv  kwargs.json                  pattern_table.csv
cluster-patterns.ipynb       log/                         pattern_table.html
figures/                     modisco.h5                   plots/
footprints.pkl               motif_clustering/            results.html
hparams.yaml                 patterns.pkl                 results.ipynb
instances.parq               pattern_table.clustered.csv  seqlets/
In [220]:
dfp.to_csv(f"{modisco_dir}/motif_clustering/motif-similarity-test.csv")
In [214]:
same_cluster = dfp['same_cluster']
del dfp['same_cluster']
In [207]:
%matplotlib inline

TODO - wilcoxon test on the different counts
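
The TODO above could start from the Wilcoxon rank-sum test in SciPy; a toy sketch with simulated per-seqlet total counts (the real count arrays would come from the seqlets of the two patterns):

```python
import numpy as np
from scipy.stats import ranksums

# toy stand-ins for the per-seqlet total counts of two patterns
rng = np.random.RandomState(0)
counts_a = rng.poisson(20, size=200)
counts_b = rng.poisson(25, size=200)

# Wilcoxon rank-sum test for a shift between the two count distributions
stat, pval = ranksums(counts_a, counts_b)
```

Being rank-based, this test is robust to the heavy-tailed count distributions typical of ChIP-nexus data.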

In [38]:
dfp.head()
Out[38]:
profile/Klf4 profile/Nanog profile/Oct4 profile/Sox2 same_cluster seq
clusters patterns
8-8 m1_p7-m1_p8 0.052 0.262 0.002 0.000 True 0.0
m1_p8-m6_p2 0.209 0.201 0.999 0.007 True 0.0
m6_p2-m1_p0 1.000 0.801 0.365 0.655 True 1.0
m1_p0-m1_p1 1.000 1.000 1.000 1.000 True 1.0
m1_p1-m6_p8 1.000 1.000 0.912 0.000 True 1.0
In [34]:
# Why is this thing still slow as hell?
# Does it copy a lot of the memory? -> what's stored in attrs?
# Try out aggregate(idx) -> is it any faster?

Visualize the p-values as a heatmap, with significance shown as stars

  • [x] compare the p-values of two neighbours
  • (maybe) visualize the cluster/pattern group boundaries
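
The "p-values as stars" idea can be realized by converting the p-value matrix into significance labels and passing them as heatmap annotations, e.g. via `sns.heatmap(..., annot=stars, fmt='')`. A minimal sketch of the labeling step:

```python
import numpy as np

def pval_stars(pvals):
    """Map p-values to conventional significance stars."""
    pvals = np.asarray(pvals, dtype=float)
    stars = np.full(pvals.shape, '', dtype=object)
    stars[pvals < 0.05] = '*'
    stars[pvals < 0.01] = '**'
    stars[pvals < 0.001] = '***'
    return stars

stars = pval_stars([[0.0005, 0.03], [0.2, 0.009]])
# e.g. sns.heatmap(-np.log10(dfp + 1e-5), annot=stars, fmt='', cmap='Blues')
```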
In [217]:
fig, ax = plt.subplots(figsize=(3, 25))
sns.heatmap(-np.log10(dfp+1e-5), ax=ax, cmap='Blues')
Out[217]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb83860c4a8>

Major patterns from each group

In [79]:
clusters = pd.Series([p.attrs['cluster'] for p in patterns])

cluster_idx = {c: clusters[clusters == c].index.values for c in clusters.unique()}
cluster_idx
Out[79]:
{8: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]),
 7: array([14, 15, 16, 17, 18, 19, 20, 21]),
 6: array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]),
 5: array([43, 44, 45, 46, 47, 48, 49, 50, 51, 52]),
 2: array([53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]),
 1: array([76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]),
 4: array([90, 91, 92, 93, 94, 95]),
 0: array([ 96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106]),
 3: array([107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121])}
In [181]:
pattern_groups = {c: [p for i,p in enumerate(patterns) if i in idx] for c,idx in cluster_idx.items()}

# Select the major pattern from each group by max n_seqlets
major_patterns = {c: cp[np.argmax([p.attrs['features']['n seqlets'] for p in cp])] 
                  for c,cp in pattern_groups.items()}
In [173]:
for c, mp in major_patterns.items():
    mp.plot("seq_ic");
    plt.title(f"{mp.name} {c}")
In [177]:
named_cluster = {8: "Klf4", 4: "Esrrb", 0: "Oct4-Sox2", 1: "Sox2", 2: "Nanog"}

plot the distribution of importance scores for each pattern for each task

  • shall we use one big boxplot or violin plot?
    • x-axis = different patterns
    • color = tasks
    • y-axis = importance
In [167]:
# slice(*p._trim_seq_ic_ij(trim_frac=0.08))
# 15:55
df = pd.concat([pd.DataFrame({"imp": p.attrs['stacked_seqlet_imp'].contrib[t][:,15:55].mean(axis=(1,2)),  # [:,slice(*p._trim_seq_ic_ij(trim_frac=0.08))]
                              "task": t, 
                              "pattern": shorten_pattern(p.name), 
                              "metacluster_act": p.attrs['metacluster_activity'][t],
                              "cluster": c}) 
                for t in tasks 
                for c,cpatterns in pattern_groups.items()
                for p in cpatterns])

df['ma'] = df.metacluster_act.map({1: "+", 0: ".", "-1": "-"})

df.pattern = pd.Categorical(df.pattern, categories=df.pattern.unique())

df.head()
Out[167]:
imp task pattern metacluster_act cluster ma
0 0.0083 Klf4 m1_p7 1 8 +
1 0.0043 Klf4 m1_p7 1 8 +
2 0.0065 Klf4 m1_p7 1 8 +
3 0.0037 Klf4 m1_p7 1 8 +
4 0.0047 Klf4 m1_p7 1 8 +
In [165]:
cluster=8

dodge_text = position_dodge(width=0.9)
data = df[df.cluster==cluster]
g = ggplot(aes(x='pattern', color='task', y='imp'), data=data) + \
  geom_violin(position='dodge') + \
  geom_text(aes(label='ma', y=data.imp.min()), position=dodge_text, data=data[['pattern', 'task', 'ma']].drop_duplicates()) +\
  theme_bw() + ggtitle(f"Cluster {cluster}")
g
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/plotnine/layer.py:517: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  return not cbook.iterable(value) and (cbook.is_numlike(value) or
Out[165]:
<ggplot: (-9223363279706182396)>

Subset to most conserved bases

In [182]:
df = pd.concat([pd.DataFrame({"imp": p.attrs['stacked_seqlet_imp'].contrib[t][:,slice(*major_patterns[c]._trim_seq_ic_ij(trim_frac=0.08))].mean(axis=(1,2)),
                              "task": t, 
                              "pattern": shorten_pattern(p.name), 
                              "metacluster_act": p.attrs['metacluster_activity'][t],
                              "cluster": c}) 
                for t in tasks 
                for c,cpatterns in pattern_groups.items()
                for p in cpatterns])

df['ma'] = df.metacluster_act.map({1: "+", 0: ".", "-1": "-"})

df.pattern = pd.Categorical(df.pattern, categories=df.pattern.unique())

df.head()
Out[182]:
imp task pattern metacluster_act cluster ma
0 0.0226 Klf4 m1_p7 1 8 +
1 0.0157 Klf4 m1_p7 1 8 +
2 0.0188 Klf4 m1_p7 1 8 +
3 0.0135 Klf4 m1_p7 1 8 +
4 0.0181 Klf4 m1_p7 1 8 +
In [183]:
plotnine.options.figure_size=(18, 3)
for cluster in df.cluster.unique():
    if cluster not in named_cluster:
        continue
    dodge_text = position_dodge(width=0.9)
    data = df[df.cluster==cluster]
    g = ggplot(aes(x='pattern', color='task', y='imp'), data=data) + \
      geom_violin(position='dodge') + \
      geom_text(aes(label='ma', y=data.imp.min()), position=dodge_text, data=data[['pattern', 'task', 'ma']].drop_duplicates()) +\
      theme_bw() + ggtitle(f"Cluster {cluster} {named_cluster[cluster]}")
    g.draw()
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/plotnine/layer.py:517: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  return not cbook.iterable(value) and (cbook.is_numlike(value) or

[:, 15:55]

In [184]:
# slice(*p._trim_seq_ic_ij(trim_frac=0.08))
# 15:55
df = pd.concat([pd.DataFrame({"imp": p.attrs['stacked_seqlet_imp'].contrib[t][:,15:55].mean(axis=(1,2)),  # [:,slice(*p._trim_seq_ic_ij(trim_frac=0.08))]
                              "task": t, 
                              "pattern": shorten_pattern(p.name), 
                              "metacluster_act": p.attrs['metacluster_activity'][t],
                              "cluster": c}) 
                for t in tasks 
                for c,cpatterns in pattern_groups.items()
                for p in cpatterns])

df['ma'] = df.metacluster_act.map({1: "+", 0: ".", "-1": "-"})

df.pattern = pd.Categorical(df.pattern, categories=df.pattern.unique())

df.head()
Out[184]:
imp task pattern metacluster_act cluster ma
0 0.0083 Klf4 m1_p7 1 8 +
1 0.0043 Klf4 m1_p7 1 8 +
2 0.0065 Klf4 m1_p7 1 8 +
3 0.0037 Klf4 m1_p7 1 8 +
4 0.0047 Klf4 m1_p7 1 8 +
In [185]:
plotnine.options.figure_size=(18, 3)
for cluster in df.cluster.unique():
    if cluster not in named_cluster:
        continue
    dodge_text = position_dodge(width=0.9)
    data = df[df.cluster==cluster]
    g = ggplot(aes(x='pattern', color='task', y='imp'), data=data) + \
      geom_violin(position='dodge') + \
      geom_text(aes(label='ma', y=data.imp.min()), position=dodge_text, data=data[['pattern', 'task', 'ma']].drop_duplicates()) +\
      theme_bw() + ggtitle(f"Cluster {cluster} {named_cluster[cluster]}")
    g.draw()
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/plotnine/layer.py:517: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  return not cbook.iterable(value) and (cbook.is_numlike(value) or

Experiment with dask

In [130]:
import dask
import dask.dataframe as dd
import dask.multiprocessing

out = []
permute = 1000

for i in tqdm(range(len(patterns) - 1)):
    p1 = patterns[i]
    p2 = patterns[i+1]
    s1 = p1.attrs['stacked_seqlet_imp']
    s2 = p2.attrs['stacked_seqlet_imp']

    c1 = p1.attrs['cluster']
    c2 = p2.attrs['cluster']
    
    s12 = dask.delayed(s1.append(s2))
    len_s1 = len(s1)
    null_objects = [dask.delayed(null_dist)(s12, len_s1) for i in range(permute)]
    alt = compare_patterns(s1.aggregate(), s2.aggregate())
    out.append({"metadata": {"patterns": f"{shorten_pattern(p1.name)}-{shorten_pattern(p2.name)}",
                             "clusters": f"{c1}-{c2}",
                              "same_cluster": c1 == c2
                            },
                "null_objects": null_objects,
                "alt": alt
               })

outl = []
from dask.diagnostics import ProgressBar

with ProgressBar():
    objects, = dask.compute(out, scheduler='processes', num_workers=40)
for d in objects:
    df_null = pd.DataFrame(d['null_objects'])
    alt = d['alt']
    pvals = {t: np.mean(df_null[t] >= alt[t]) for t in alt}
    outl.append({**d['metadata'],  **pvals})
dfp = pd.DataFrame(outl)