Goal

  • improve the aggregated plot
  • investigate the aggregated plot for new motif instances

Tasks

Example to start with:

  • Sox2 motif (m0/p1): metacluster_0/pattern_1
  • and degenerate Sox2 motif (m1/p2): metacluster_1/pattern_2

Simple plot scale

  • [x] the importance scores in the aggregated plots are scaled properly
  • [x] negative strand on the negative side
  • [x] 0 is always centered
  • [x] take max across all tasks
  • [ ] split the aggregated importance scores from the gradients
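The scaling rules in this checklist (0 always centered, negative strand below the axis, limit taken as the max across all tasks) boil down to one shared symmetric y-limit. A minimal sketch, with illustrative names and data shapes (not basepair API):

```python
import numpy as np

def symmetric_ylim(profiles):
    """Return (ymin, ymax) so that 0 is centered and the limit is the
    max absolute signal across all tasks and both strands."""
    m = max(np.abs(p).max() for p in profiles.values())
    return -m, m

# toy example: two tasks, negative-strand signal stored as negative values
profiles = {"Oct4": np.array([[1.0, -0.5], [2.0, -3.0]]),
            "Sox2": np.array([[0.5, -4.0], [1.0, -0.2]])}
ymin, ymax = symmetric_ylim(profiles)
```

Applying the same `(ymin, ymax)` to every task's axis keeps the panels visually comparable.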

Entropy calculation

  • [ ] plot ~10-100 examples of motif + importance scores to get a feel (choose one direct and one indirect motif)
  • [ ] entropy vs count distribution

    • for each seqlet
    • globally
  • calculate entropy for each region and the final aggregated distribution

    • show a histogram
    • or a scatterplot against the total counts in the particular region or across all regions
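The per-region entropy mentioned above can be sketched as the Shannon entropy of the normalized positional count distribution; names here are illustrative and each region's counts are assumed to be a 1D non-negative array:

```python
import numpy as np

def profile_entropy(counts, eps=1e-8):
    """Shannon entropy (in bits) of the positional count distribution
    within a region; low entropy = sharp/concentrated profile."""
    p = counts / (counts.sum() + eps)
    return float(-(p * np.log2(p + eps)).sum())

# sharp profile -> low entropy; flat profile -> high entropy (= log2(n))
sharp = np.array([0., 0., 100., 0., 0.])
flat = np.ones(5) * 20.
```

Computing this per seqlet window and once on the aggregated profile gives both distributions needed for the histogram / scatterplot above.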

Conclusions

  • Plotting the profiles at inferred seqlet locations
    • the aggregate does not match the PWM unless we use a stringent cutoff: percnormed_score > 0.9

TODO

  • [ ] implement plot_predict_grad with motif instance annotation
  • [ ] plot tens of instances with the Sox2 core and degenerate motif
    • Sox2 motif (m0/p1): metacluster_0/pattern_1
    • and degenerate Sox2 motif (m1/p2): metacluster_1/pattern_2
In [393]:
# common imports
from basepair.imports import *

Load the data

In [394]:
pattern1 = "metacluster_1/pattern_2"
pattern2 = "metacluster_0/pattern_1"

Modisco

In [367]:
ddir = Path(get_data_dir())
model_dir = ddir / "processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/"
modisco_dir = model_dir / "modisco/valid"
In [368]:
# Modisco
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
tasks = mr.tasks()
In [369]:
ds = DataSpec.load(model_dir / "dataspec.yaml")

d = HDF5Reader(model_dir / "grad.all.h5")
d.open()

id_hash = pd.DataFrame({"peak_id": d.f['/metadata/interval_from_task'][:], "example_idx": np.arange(d.f['/metadata/interval_from_task'].shape[0])})
In [370]:
d_valid = HDF5Reader(model_dir / "grad.valid.h5")
d_valid.open()
d_test = HDF5Reader(model_dir / "grad.test.h5")
d_test.open()

d_valid = d_valid.load_all()

d_test = d_test.load_all()

Labelled instances

In [371]:
# load the instances data frame
df = pd.read_csv(modisco_dir / "all.instances.tsv", sep='\t')
df['pattern_id'] = "metacluster_" + df.metacluster.astype(str) + "/pattern_" + df.pattern.astype(str)
df['seqlet_center'] = (df.seqlet_start + df.seqlet_end) / 2
patterns = df.pattern_id.unique().tolist()
pattern_pssms = {pattern: mr.get_pssm(*pattern.split("/")) for pattern in patterns}
append_pattern_loc(df, pattern_pssms, trim_frac=0.08)
df['pattern_center'] = (df.pattern_start + df.pattern_end) / 2
df = df.merge(id_hash, on="example_idx")
In [372]:
# Use thresholded values from now on
df_unfiltered = df
df = df[df.percnormed_score > .2]
In [373]:
df.seqlet_is_revcomp.max()
Out[373]:
False
In [374]:
df.seqlet_is_revcomp.mean()
Out[374]:
0.0

Setup contrib scores

In [375]:
grad_type='weighted'
In [376]:
d = d_valid
In [377]:
included_samples = HDF5Reader.load(f"{modisco_dir}/strand_distances.h5")['included_samples']
In [378]:
one_hot = d['inputs']
In [379]:
grad_type="count,weighted" # always plot both importance scores

thr_hypothetical_contribs = OrderedDict([(f"{task}/{gt}", mean(d['grads'][task][gt])[included_samples])
                             for task in tasks 
                             for gt in grad_type.split(",")])

thr_one_hot = one_hot[included_samples]
thr_contrib_scores = OrderedDict([(f"{task}/{gt}", thr_hypothetical_contribs[f"{task}/{gt}"] * thr_one_hot )
                      for task in tasks
                      for gt in grad_type.split(",")])

# importance_scores = {"grad": hyp_scores,
#                      "grad*input": scores}

# -------------------------------------------------
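The cell above multiplies hypothetical contribution scores by the one-hot sequence, which keeps only the contribution of the base actually present at each position. A toy sketch of that operation (shapes are illustrative):

```python
import numpy as np

# hypothetical contributions: a score for each of the 4 bases at each position
hyp = np.array([[0.1, 0.9, -0.2, 0.0],
                [0.5, -0.1, 0.3, 0.2]])   # (positions, 4)
# one-hot sequence: position 0 is C, position 1 is A
one_hot = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0]], dtype=float)

# actual contribution = hypothetical * one-hot, summed over the base axis
contrib = (hyp * one_hot).sum(axis=-1)   # -> [0.9, 0.5]
```

The same broadcasting works batched: with arrays of shape `(examples, positions, 4)` the elementwise product and `sum(axis=-1)` collapse the base axis per example.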
In [380]:
pattern = "metacluster_0/pattern_1"
In [ ]:
# Get the range

Functions

In [22]:
from basepair.plot.profiles import *

Plot

In [381]:
plot_profiles(mr.seqlets(), 
              thr_one_hot,
              tracks=tracks,#{},
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=['metacluster_2/pattern_0'] + mr.patterns()[:10],#[pattern1, pattern2],
              n_bootstrap=100,
              figsize=(14, 14))

Plot results across all the sequences

Use grad.all.h5 to plot the aggregated profile.

Q: Is the profile as sharp as before?

In [16]:
d = HDF5Reader.load(model_dir / "grad.all.h5")
In [18]:
# TODO - get seqlets
df.head()
Out[18]:
Unnamed: 0 example_idx seqlet_start seqlet_end seqlet_is_revcomp seqlet_score metacluster pattern percnormed_score score ... example_end example_id example_start example_strand pattern_id seqlet_center pattern_start pattern_end pattern_center peak_id
0 0 0 516 557 False 1.899616 0 2 0.323415 0.178182 ... 143483544 0 143482544 * metacluster_0/pattern_2 536.5 522 531 526.5 Oct4
1 1 0 476 517 False 1.073116 0 1 0.327721 0.254454 ... 143483544 0 143482544 * metacluster_0/pattern_1 496.5 494 502 498.0 Oct4
3 524313 0 416 457 False 0.456975 4 1 0.356674 0.185247 ... 143483544 0 143482544 * metacluster_4/pattern_1 436.5 444 452 448.0 Oct4
5 589123 0 593 634 False 0.191342 6 6 0.247059 0.202925 ... 143483544 0 143482544 * metacluster_6/pattern_6 613.5 610 630 620.0 Oct4
7 4 1 477 518 False 1.489748 0 0 0.211156 0.249695 ... 122146063 1 122145063 * metacluster_0/pattern_0 497.5 477 491 484.0 Oct4

5 rows × 23 columns

In [34]:
df.columns
Out[34]:
Index(['Unnamed: 0', 'example_idx', 'seqlet_start', 'seqlet_end',
       'seqlet_is_revcomp', 'seqlet_score', 'metacluster', 'pattern',
       'percnormed_score', 'score', 'offset', 'revcomp', 'example_chr',
       'example_end', 'example_id', 'example_start', 'example_strand',
       'pattern_id', 'seqlet_center', 'pattern_start', 'pattern_end',
       'pattern_center', 'peak_id'],
      dtype='object')
In [ ]:
# pd.DataFrame -> seqlets
seqlets = {pattern: 
           [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=pattern,
                   strand="-" if row.revcomp else "+") 
           for i, row in df[(df.pattern_id == pattern) & \
                            (df.pattern_start > 20) & \
                            (df.pattern_start + 30 < 1000)].iterrows() ]
           for pattern in tqdm(df.pattern_id.unique())}
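The window logic repeated in this and the later seqlet-building cells (`pattern_start - 20` to `pattern_start + 30`, dropping instances too close to the example boundary) can be factored into a small helper; this is an illustrative sketch, not a basepair function:

```python
def seqlet_window(pattern_start, flank_left=20, flank_right=30, seq_len=1000):
    """Return (start, end) for a fixed-width window around pattern_start,
    or None if the window would run off the example. Matches the filter
    used above: pattern_start > flank_left and
    pattern_start + flank_right < seq_len."""
    start, end = pattern_start - flank_left, pattern_start + flank_right
    if start <= 0 or end >= seq_len:
        return None
    return start, end
```

Using one helper for all the threshold variants below would avoid the copy-pasted boundary conditions drifting apart.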
In [24]:
all_patterns = df.pattern_id.unique().tolist()
In [37]:
one_hot = d['inputs']

grad_type="count,weighted" # always plot both importance scores

thr_hypothetical_contribs = OrderedDict([(f"{task}/{gt}", mean(d['grads'][task][gt]))
                             for task in tasks 
                             for gt in grad_type.split(",")])

thr_one_hot = one_hot
thr_contrib_scores = OrderedDict([(f"{task}/{gt}", thr_hypothetical_contribs[f"{task}/{gt}"] * thr_one_hot )
                      for task in tasks
                      for gt in grad_type.split(",")])
In [52]:
df.pattern_end.max()
Out[52]:
992
In [ ]:
# TODO - is it possible to get different seqlet locations?
In [60]:
pattern = 'metacluster_0/pattern_0'
In [55]:
plot_profiles(seqlets, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=mr.patterns()[:1], #['metacluster_2/pattern_0'] + mr.patterns()[:10],#[pattern1, pattern2],
              n_bootstrap=100,
              figsize=(14, 14))
In [58]:
df.columns
Out[58]:
Index(['Unnamed: 0', 'example_idx', 'seqlet_start', 'seqlet_end',
       'seqlet_is_revcomp', 'seqlet_score', 'metacluster', 'pattern',
       'percnormed_score', 'score', 'offset', 'revcomp', 'example_chr',
       'example_end', 'example_id', 'example_start', 'example_strand',
       'pattern_id', 'seqlet_center', 'pattern_start', 'pattern_end',
       'pattern_center', 'peak_id'],
      dtype='object')
In [57]:
df.head()
Out[57]:
Unnamed: 0 example_idx seqlet_start seqlet_end seqlet_is_revcomp seqlet_score metacluster pattern percnormed_score score ... example_end example_id example_start example_strand pattern_id seqlet_center pattern_start pattern_end pattern_center peak_id
0 0 0 516 557 False 1.899616 0 2 0.323415 0.178182 ... 143483544 0 143482544 * metacluster_0/pattern_2 536.5 522 531 526.5 Oct4
1 1 0 476 517 False 1.073116 0 1 0.327721 0.254454 ... 143483544 0 143482544 * metacluster_0/pattern_1 496.5 494 502 498.0 Oct4
3 524313 0 416 457 False 0.456975 4 1 0.356674 0.185247 ... 143483544 0 143482544 * metacluster_4/pattern_1 436.5 444 452 448.0 Oct4
5 589123 0 593 634 False 0.191342 6 6 0.247059 0.202925 ... 143483544 0 143482544 * metacluster_6/pattern_6 613.5 610 630 620.0 Oct4
7 4 1 477 518 False 1.489748 0 0 0.211156 0.249695 ... 122146063 1 122145063 * metacluster_0/pattern_0 497.5 477 491 484.0 Oct4

5 rows × 23 columns

In [66]:
df[df.pattern_id == pattern].percnormed_score.plot.hist()
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2f7756fd30>

Very stringent threshold

In [67]:
# pd.DataFrame -> seqlets
seqlets = {pattern: 
           [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=pattern,
                   strand="-" if row.revcomp else "+") 
           for i, row in df[(df.pattern_id == pattern) & \
                            (df.percnormed_score > 0.9) & \
                            (df.pattern_start > 20) & \
                            (df.pattern_start + 30 < 1000)].iterrows() ]}
In [68]:
plot_profiles(seqlets, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=[pattern],
              n_bootstrap=100,
              figsize=(14, 14))

percnormed_score > 0.2

In [69]:
# pd.DataFrame -> seqlets
seqlets = {pattern: 
           [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=pattern,
                   strand="-" if row.revcomp else "+") 
           for i, row in df[(df.pattern_id == pattern) & \
                            (df.percnormed_score > 0.2) & \
                            (df.pattern_start > 20) & \
                            (df.pattern_start + 30 < 1000)].iterrows() ]}
In [70]:
plot_profiles(seqlets, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=[pattern],
              n_bootstrap=100,
              figsize=(14, 14))

percnormed_score < 0.5

In [73]:
# pd.DataFrame -> seqlets
seqlets = {pattern: 
           [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=pattern,
                   strand="-" if row.revcomp else "+") 
           for i, row in df[(df.pattern_id == pattern) & \
                            (df.percnormed_score < 0.5) & \
                            (df.pattern_start > 20) & \
                            (df.pattern_start + 30 < 1000)].iterrows() ]}
In [74]:
plot_profiles(seqlets, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=[pattern],
              n_bootstrap=100,
              figsize=(14, 14))

seqlet_score > 0.5 (on top of percnormed_score > 0.9)

In [76]:
df.seqlet_score.plot.hist(100);
In [78]:
plt.plot(df.seqlet_score, df.score, ".", alpha=0.1)
Out[78]:
[<matplotlib.lines.Line2D at 0x7f2588121e80>]
In [81]:
# pd.DataFrame -> seqlets
seqlets = {pattern: 
           [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=pattern,
                   strand="-" if row.revcomp else "+") 
           for i, row in df[(df.pattern_id == pattern) & \
                            (df.percnormed_score > 0.9) & \
                            (df.seqlet_score > 0.5) & \
                            (df.pattern_start > 20) & \
                            (df.pattern_start + 30 < 1000)].iterrows() ]}
In [82]:
plot_profiles(seqlets, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=[pattern],
              n_bootstrap=100,
              figsize=(14, 14))

Entropy calculation

  • [ ] plot ~10-100 examples of motif + importance scores to get a feel (choose one direct and one indirect motif)
  • [ ] entropy vs count distribution

    • for each seqlet
    • globally
  • calculate entropy for each region and the final aggregated distribution

    • show a histogram
    • or a scatterplot against the total counts in the particular region or across all regions
In [30]:
d = HDF5Reader.load(model_dir / "grad.valid.h5")

included_samples = HDF5Reader.load(f"{modisco_dir}/strand_distances.h5")['included_samples']

one_hot = d['inputs']

grad_type="count,weighted" # always plot both importance scores

thr_hypothetical_contribs = OrderedDict([(f"{task}/{gt}", mean(d['grads'][task][gt])[included_samples])
                             for task in tasks 
                             for gt in grad_type.split(",")])

thr_one_hot = one_hot[included_samples]
thr_contrib_scores = OrderedDict([(f"{task}/{gt}", thr_hypothetical_contribs[f"{task}/{gt}"] * thr_one_hot )
                      for task in tasks
                      for gt in grad_type.split(",")])

tracks = {task: d['targets']['profile'][task][included_samples] for task in tasks}
In [31]:
thr_one_hot.shape
Out[31]:
(12388, 1000, 4)
In [32]:
thr_hypothetical_contribs['Klf4/count'].shape
Out[32]:
(12388, 1000, 4)
In [34]:
def filter_seqlets(seqlets, only_idx):
    return [s for s in seqlets if s.seqname == only_idx]

def df2seqlets(df):
    return [Seqlet(row.example_idx, 
                   row.pattern_start - 20, 
                   row.pattern_start + 30, 
                   name=row.pattern_id,
                   strand="-" if row.revcomp else "+") 
           for i, row in df.iterrows()]
In [35]:
df2seqlets(df[df.example_idx == 333])
Out[35]:
[Seqlet(seqname=333, start=488, end=538, name='metacluster_0/pattern_6', strand='+'),
 Seqlet(seqname=333, start=412, end=462, name='metacluster_21/pattern_0', strand='+'),
 Seqlet(seqname=333, start=513, end=563, name='metacluster_4/pattern_1', strand='+')]

TODO

  • modify the original BPNet plot by adding the modisco locations to it
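One way to do this annotation with plain matplotlib (the signature of `basepair.plot.utils.draw_box` isn't shown here, so this sketch shades instance spans directly on an existing profile axis; names are illustrative):

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def annotate_instances(ax, instances, color="purple", alpha=0.2):
    """Shade motif instance spans on an existing profile axis.
    `instances` is a list of (start, end, label) tuples in x coordinates."""
    ymin, ymax = ax.get_ylim()
    for start, end, label in instances:
        # translucent box spanning the full y-range of the axis
        ax.add_patch(Rectangle((start, ymin), end - start, ymax - ymin,
                               color=color, alpha=alpha, lw=0))
        ax.text(start, ymax, label, fontsize=6, va="top")
```

Calling this on each profile axis after `plot_predict_grad` (or `plot_profiles_single`) with the `(pattern_start, pattern_end, pattern_id)` columns of `df` would overlay the modisco locations.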
In [37]:
from basepair.plot.utils import draw_box
In [257]:
# randomly sample a few examples
pattern = 'metacluster_0/pattern_0'
seqlets = mr._get_seqlets(pattern)

mr.plot_pssm(*pattern.split("/"));

for sid in pd.Series(np.arange(len(seqlets))).sample(10):

#sid = 0
    plot_profiles_single(seqlets[sid].resize(width=200),
                  thr_one_hot,
                  tracks=tracks,
                  importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
                  legend=False,
                  flip_neg=False, #True,
                  rotate_y=0, #75,
                  seq_height=1,#.5,
                  figsize=(14, 14))
In [161]:
from basepair.BPNet import BPNetPredictor

from basepair.config import create_tf_session
In [ ]:
create_tf_session(1)
In [ ]:
bp = BPNet
In [ ]:
plot_predict_grad
In [136]:
# randomly sample a few examples
pattern = 'metacluster_0/pattern_0'
seqlets = mr._get_seqlets(pattern)

#for sid in pd.Series(np.arange(len(seqlets))).sample(10):
sid = 0
plot_profiles({pattern:[seqlets[sid]]}, 
              thr_one_hot,
              tracks=d['targets']['profile'],
              importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
              legend=False,
              flip_neg=True,
              rotate_y=0, #75,
              seq_height=.5,
              patterns=[pattern],
              n_bootstrap=None,
              only_idx=0,
              figsize=(14, 14))

Plotting a heatmap of counts

Matplotlib example - overlay two heatmaps

In [59]:
m = np.ones(20)[np.newaxis] * np.arange(20)[:, np.newaxis]

fig, ax = plt.subplots(1, 3, sharey=True)
ax[0].imshow(m, alpha=1, cmap=plt.cm.Reds)
ax[1].imshow(m.T, alpha=1, cmap=plt.cm.Blues)
ax[2].imshow(m, alpha=1, cmap=plt.cm.Reds)
ax[2].imshow(m.T, alpha=.5, cmap=plt.cm.Blues);
fig.subplots_adjust(wspace=0)  # no space between plots
plt.setp([a.get_yaxis() for a in fig.axes[1:]], visible=False);  # no ticks
In [209]:
pattern = 'metacluster_0/pattern_2'
seqlets = mr._get_seqlets(pattern)
In [210]:
h2("Results for: ", pattern)
Out[210]:

Results for: metacluster_0/pattern_2

In [ ]:
# Load required data
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
tasks = mr.tasks()
d = HDF5Reader.load(model_dir / "grad.valid.h5")
included_samples = HDF5Reader.load(f"{modisco_dir}/strand_distances.h5")['included_samples']
In [404]:
from basepair.plot.vdom import fig2vdom
from vdom.helpers import *

from basepair.plot.heatmaps import multiple_heatmap_stranded_profile, multiple_heatmap_importance_profile, heatmap_sequence

def get_signal(seqlets, d, included_samples, tasks, resize_width=200):
    thr_one_hot = d['inputs'][included_samples]
    seq_len = thr_one_hot.shape[1]
    
    # filter seqlets
    valid_seqlets = [s.resize(resize_width) 
                     for s in seqlets 
                     if (s.center() > resize_width/2) and \
                        (s.center() + resize_width/2 < seq_len + 1)]
    
    # prepare data
    ex_signal = {task: extract_signal(d['targets']['profile'][task][included_samples], valid_seqlets)
                 for task in tasks}
    
    # grads
    ex_contrib_profile = {task: extract_signal(mean(d['grads'][task]["weighted"])[included_samples] * thr_one_hot, 
                                               valid_seqlets).sum(axis=-1)
                          for task in tasks}
    ex_contrib_counts = {task: extract_signal(mean(d['grads'][task]["count"])[included_samples] * thr_one_hot, 
                                           valid_seqlets).sum(axis=-1)
                         for task in tasks}
    
    ex_seq = extract_signal(thr_one_hot, valid_seqlets)

    total_counts = sum([x.sum(axis=-1).sum(axis=-1) for x in ex_signal.values()])
    sort_idx = np.argsort(-total_counts)
    return ex_signal, ex_contrib_profile, ex_contrib_counts, ex_seq, sort_idx


#plot_profiles_single(seqlet, x, tracks, importance_scores={}, figsize=(20, 2), legend=True,
#                     rotate_y=90,seq_height=1,flip_neg=False)
def get_signal_agg(d, included_samples, tasks):
    """Prepare the signal for plot_profiles_single
    """
    grad_type="count,weighted" # always plot both importance scores
    one_hot = d['inputs']
    thr_hypothetical_contribs = OrderedDict([(f"{task}/{gt}", mean(d['grads'][task][gt])[included_samples])
                                 for task in tasks 
                                 for gt in grad_type.split(",")])

    thr_one_hot = one_hot[included_samples]
    thr_contrib_scores = OrderedDict([(f"{task}/{gt}", thr_hypothetical_contribs[f"{task}/{gt}"] * thr_one_hot )
                          for task in tasks
                          for gt in grad_type.split(",")])

    tracks = {task: d['targets']['profile'][task][included_samples] for task in tasks}
    return thr_one_hot, tracks, thr_hypothetical_contribs, thr_contrib_scores


def vdm_heatmaps(seqlets, tasks, d, included_samples, pattern, pssm_fig=None, opened=False):
    ex_signal, ex_contrib_profile, ex_contrib_counts, ex_seq, sort_idx = get_signal(seqlets, d, included_samples, tasks)
    return div(details(summary("Sequence:"),
                pssm_fig,
                br(),
                fig2vdom(heatmap_sequence(ex_seq, sort_idx=sort_idx, figsize_tmpl=(10, 15), aspect='auto')),
                open=opened
               ),

        details(summary("ChIP-nexus counts:"),
               fig2vdom(multiple_plot_stranded_profile(ex_signal, figsize_tmpl=(20/len(ex_signal), 3))),
                fig2vdom(multiple_heatmap_stranded_profile(ex_signal, pmin=50, pmax=99, sort_idx=sort_idx, figsize=(20, 20))),
                open=opened
              ),
        details(summary("Importance scores (profile)"),
               fig2vdom(multiple_heatmap_importance_profile(ex_contrib_profile, sort_idx=sort_idx, figsize=(20, 20))),
                open=opened
              ),
        details(summary("Importance scores (counts)"),
               fig2vdom(multiple_heatmap_importance_profile(ex_contrib_counts, sort_idx=sort_idx, figsize=(20, 20))),
                open=opened
              )
       )

def write_pattern_pngs(seqlets, d, included_samples, tasks, pattern, output_dir):
    # get the data
    ex_signal, ex_contrib_profile, ex_contrib_counts, ex_seq, sort_idx = get_signal(seqlets, d, included_samples, tasks)
    # get the plots
    figs = dict(
        heatmap_seq = heatmap_sequence(ex_seq, sort_idx=sort_idx, figsize_tmpl=(10, 15), aspect='auto'),
        profile_aggregated = multiple_plot_stranded_profile(ex_signal, figsize_tmpl=(20/len(ex_signal), 3)),
        profile_heatmap = multiple_heatmap_stranded_profile(ex_signal, pmin=50, pmax=99, sort_idx=sort_idx, figsize=(20, 20)),
        contrib_profile = multiple_heatmap_importance_profile(ex_contrib_profile, sort_idx=sort_idx, figsize=(20, 20)),
        contrib_counts = multiple_heatmap_importance_profile(ex_contrib_counts, sort_idx=sort_idx, figsize=(20, 20)),
    )
    # write the figures
    for k, fig in figs.items():
        fig.savefig(os.path.join(output_dir, f + ".png"), bbox_inches='tight')
In [384]:
thr_one_hot, tracks, thr_hypothetical_contribs, thr_contrib_scores = get_signal_agg(d, included_samples, tasks)
In [ ]:
# TODO - plot profile
In [385]:
pattern
Out[385]:
'metacluster_0/pattern_1'
In [386]:
seqlets = mr._get_seqlets(pattern)
In [390]:
pattern
Out[390]:
'metacluster_0/pattern_1'
In [396]:
figs = plot_profiles({pattern: seqlets}, 
                      thr_one_hot,
                      tracks=tracks,#{},
                      importance_scores=thr_contrib_scores, #thr_hypothetical_contribs,
                      legend=False,
                      flip_neg=True,
                      rotate_y=0, #75,
                      seq_height=.5,
                      patterns=[pattern],
                      n_bootstrap=100,
                      figsize=(10, 10))
In [400]:
plot_profiles({pattern: seqlets}, 
                      thr_one_hot,
                      tracks={},
                      importance_scores=thr_hypothetical_contribs,
                      legend=False,
                      flip_neg=True,
                      rotate_y=0, #75,
                      seq_height=1,
                      patterns=[pattern],
                      n_bootstrap=100,
                      figsize=(10, 10)),
Out[400]:
([<Figure size 720x720 with 9 Axes>],)
In [ ]:
def modisco_plot(modisco_dir,
                 output_dir,
                 use_importance_scores=False,
                 figsize=(10, 10),
                 gpu=None):
    """Plot the results of a modisco run

    Args:
      modisco_dir: modisco directory
      output_dir: Output directory for writing the results
      figsize: Output figure size
    """
    if gpu is not None:
        create_tf_session(gpu)
    else:
        # Don't use any GPUs
        os.environ['CUDA_VISIBLE_DEVICES'] = ''

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # load modisco
    mr = ModiscoResult(f"{modisco_dir}/modisco.h5")
    modisco_kwargs = read_json(f"{modisco_dir}/kwargs.json")
    grad_type = modisco_kwargs['grad_type']

    # load used strand distance filter
    included_samples = HDF5Reader.load(f"{modisco_dir}/strand_distances.h5")['included_samples']

    # load importance scores
    d = HDF5Reader.load(modisco_kwargs['imp_scores'])
    tasks = list(d['targets']['profile'])
    if isinstance(d['inputs'], dict):
        one_hot = d['inputs']['seq']
    else:
        one_hot = d['inputs']
    thr_hypothetical_contribs = {f"{task}/{gt}": mean(d['grads'][task][gt])[included_samples] 
                                 for task in tasks 
                                 for gt in grad_type.split(",")}
    thr_one_hot = one_hot[included_samples]
    thr_contrib_scores = {f"{task}/{gt}": thr_hypothetical_contribs[f"{task}/{gt}"] * thr_one_hot 
                          for task in tasks
                          for gt in grad_type.split(",")}
    tracks = {task: d['targets']['profile'][task][included_samples] for task in tasks}
    # -------------------------------------------------

    all_seqlets = mr.seqlets()
    all_patterns = mr.patterns()
    tasks = mr.tasks()
    
    # 1. Plots with tracks and contrib scores
    print("Writing results for contribution scores")
    plot_profiles(all_seqlets, 
                  thr_one_hot,
                  tracks=tracks,
                  importance_scores=thr_contrib_scores,
                  legend=False,
                  flip_neg=True,
                  rotate_y=0,
                  seq_height=.5,
                  patterns=all_patterns,
                  n_bootstrap=100,
                  fpath_template=str(output_dir / "{pattern}/agg_profile_contribcores"),
                  figsize=figsize)
    
    # 2. Plots only with hypothetical contrib scores
    print("Writing results for hypothetical contribution scores")
    plot_profiles(all_seqlets, 
                  thr_one_hot,
                  tracks={},
                  importance_scores=thr_hypothetical_contribs,
                  legend=False,
                  flip_neg=True,
                  rotate_y=0,
                  seq_height=1,
                  patterns=all_patterns,
                  n_bootstrap=100,
                  fpath_template=str(output_dir / "{pattern}/agg_profile_hypcontribscores"),
                  figsize=figsize)
    
    # Plot others
    print("Plotting heatmaps")
    for pattern in tqdm(all_patterns):
        write_pattern_pngs(all_seqlets[pattern], 
                           d, 
                           included_samples, 
                           tasks, 
                           pattern, 
                           output_dir=str(output_dir / pattern))
    
    mr.close()
In [405]:
os.makedirs?
Signature: os.makedirs(name, mode=511, exist_ok=False)
Docstring:
makedirs(name [, mode=0o777][, exist_ok=False])

Super-mkdir; create a leaf directory and all intermediate ones.  Works like
mkdir, except that any intermediate path segment (not just the rightmost)
will be created if it does not exist. If the target directory already
exists, raise an OSError if exist_ok is False. Otherwise no exception is
raised.  This is recursive.
File:      ~/bin/anaconda3/envs/chipnexus/lib/python3.6/os.py
Type:      function
In [ ]:
        agg_profile_ig = plot_profiles({pattern: seqlets}, 
                      thr_one_hot,
                      tracks=tracks,
                      importance_scores=thr_contrib_scores,
                      legend=False,
                      flip_neg=True,
                      rotate_y=0, #75,
                      seq_height=.5,
                      patterns=[pattern],
                      n_bootstrap=100,
                      figsize=(10, 10))[0],
In [363]:
ex_signal, ex_contrib_profile, ex_contrib_counts, ex_seq, sort_idx = get_signal(mr._get_seqlets("metacluster_0/pattern_0"), d, included_samples, tasks)
In [364]:
multiple_heatmap_importance_profile(ex_contrib_counts, sort_idx=sort_idx, figsize=(20, 20))
Out[364]:
In [344]:
vdm_heatmaps(seqlets, tasks, d, included_samples, pattern, opened=True)
Out[344]:
Sequence:

ChIP-nexus counts:
Importance scores (profile)
Importance scores (counts)