Goal

  • plot the strand distance distribution for DeepLIFT

Tasks

  • [x] load the data

Conclusions

  • the DeepLIFT distances between strands are generally slightly higher, but this is because the low-importance regions have very high distances
    • these get filtered out anyway by deeplift

Required files

-

In [2]:
# Imports
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from basepair.imports import *
hv.extension('bokeh')
Using TensorFlow backend.
In [3]:
from basepair.plot.config import paper_config
paper_config()
In [4]:
# Common paths
model_dir = Path(f"{ddir}/processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/")

Old distribution

In [5]:
modisco_dir = model_dir / "modisco/all/profile/"
old_dist = HDF5Reader.load(modisco_dir / 'strand_distances.h5')
In [6]:
plt.hist(old_dist['distances'], 100);
plt.title("Old distribution");
In [7]:
np.percentile(old_dist['distances'], [10, 25, 50, 75, 90])
Out[7]:
array([0.0382, 0.0513, 0.0764, 0.119 , 0.1667])

New distribution

In [8]:
modisco_dir = model_dir / "modisco/all/deeplift/profile/"
new_dist = HDF5Reader.load(modisco_dir / 'strand_distances.h5')
In [9]:
plt.hist(new_dist['distances'], 100);
plt.title("New distribution");
In [10]:
np.percentile(new_dist['distances'], [10, 25, 50, 75, 90])
Out[10]:
array([0.074 , 0.0985, 0.1329, 0.1719, 0.2092])
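
To put the two sets of percentiles side by side, the sketch below copies the values printed in Out[7] and Out[10] and computes the new/old ratio at each percentile. The names `old_pct`, `new_pct` and `ratio` are introduced here for illustration only.

```python
import numpy as np

percentiles = [10, 25, 50, 75, 90]
# Values copied from Out[7] (old) and Out[10] (new)
old_pct = np.array([0.0382, 0.0513, 0.0764, 0.119, 0.1667])
new_pct = np.array([0.074, 0.0985, 0.1329, 0.1719, 0.2092])

# Ratio of new to old distance at each percentile
ratio = new_pct / old_pct
for p, o, n, r in zip(percentiles, old_pct, new_pct, ratio):
    print(f"p{p}: old={o:.4f} new={n:.4f} ratio={r:.2f}")
```

The ratio shrinks from roughly 1.9 at the 10th percentile to roughly 1.25 at the 90th, so the shift is largest at the low end of the distribution.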

Plot some examples with higher discrepancy

In [11]:
from basepair.cli.imp_score import ImpScoreFile
from basepair.cli.modisco import load_imp_scores, load_included_samples
In [12]:
# TODO - add the following methods
#  - plot(idx, 'inputs'), ... which plots all the examples. 
In [13]:
imp_scores = ImpScoreFile(model_dir / "deeplift.all.h5")
In [14]:
# Sequence indices sorted by decreasing strand distance (largest first)
worst_sequences = np.argsort(-new_dist['distances'])
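
Negating the array before `np.argsort` yields indices ordered from largest to smallest distance, so the first entries of `worst_sequences` are the most discrepant examples and `worst_sequences[-1]` is the least discrepant one. A small sketch with made-up values (the array `d` is illustrative, not data from this notebook):

```python
import numpy as np

d = np.array([0.10, 0.40, 0.05, 0.25])  # made-up distances
order = np.argsort(-d)                   # descending order of d
print(order)      # [1 3 0 2]
print(d[order])   # [0.4  0.25 0.1  0.05]
```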
In [34]:
idx = worst_sequences[1]
In [35]:
tasks = imp_scores.get_tasks()
In [36]:
seq = imp_scores.f.f['/inputs'][idx]
In [37]:
from basepair.plot.tracks import plot_tracks, filter_tracks
In [38]:
hyp_contrib = [(f"{t}/{s}", imp_scores.f.f[f'/hyp_imp/{t}/count/{si}'][idx])
  for t in imp_scores.get_tasks()
  for si,s in enumerate(['pos', 'neg'])]
contrib = [(s, v*seq) for s,v in hyp_contrib]
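
The multiplication `v*seq` relies on `seq` being one-hot encoded: it keeps the hypothetical score of the base actually present at each position and zeroes the others. The sketch below illustrates this on a toy array, assuming a (length, 4) layout; the shapes, base order, and values are illustrative and not taken from the file.

```python
import numpy as np

# Toy one-hot sequence "ACG", shape (length, 4), assumed A/C/G/T column order
seq = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0]], dtype=float)
# Toy hypothetical scores: one value per position and possible base
hyp = np.array([[0.2, -0.1, 0.0, 0.3],
                [0.5, 0.1, -0.4, 0.0],
                [-0.2, 0.6, 0.05, 0.1]])

contrib = hyp * seq                  # only the observed base keeps its score
per_position = contrib.sum(axis=-1)  # one contribution value per position
print(per_position)
```

Because every entry of `contrib` is either an entry of the hypothetical scores or zero, its maximum absolute value can never exceed that of the hypothetical scores.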
In [39]:
contrib[0][1].max()
Out[39]:
0.0010006177
In [40]:
a=1
In [41]:
[np.abs(x).max() for k,x in hyp_contrib]
Out[41]:
[0.004002471,
 0.0035649897,
 0.0061649983,
 0.0053784396,
 0.004787744,
 0.00435633,
 0.005078398,
 0.004768429]
In [42]:
[np.abs(x).max() for k,x in contrib]
Out[42]:
[0.0010006177,
 0.0008912474,
 0.0015412496,
 0.0013446099,
 0.001196936,
 0.0010890824,
 0.0012695995,
 0.0011921072]

Best sequences

In [43]:
idx = worst_sequences[-1]
In [44]:
tasks = imp_scores.get_tasks()
In [45]:
seq = imp_scores.f.f['/inputs'][idx]
In [46]:
from basepair.plot.tracks import plot_tracks, filter_tracks
In [53]:
hyp_contrib = [(f"{t}/{s}", imp_scores.f.f[f'/hyp_imp/{t}/weighted/{si}'][idx])
  for t in imp_scores.get_tasks()
  for si,s in enumerate(['pos', 'neg'])]
contrib = [(s, v*seq) for s,v in hyp_contrib]
In [54]:
contrib[0][1].max()
Out[54]:
0.105655484
In [56]:
plot_tracks(filter_tracks(contrib, [400, 600]), fig_height_per_track=0.5, fig_width=10)
Out[56]:
In [50]:
[np.abs(x).max() for k,x in hyp_contrib]
Out[50]:
[0.11109871,
 0.12027497,
 0.33767745,
 0.3235743,
 0.35692567,
 0.35611764,
 0.2534035,
 0.25925487]
In [51]:
[np.abs(x).max() for k,x in contrib]
Out[51]:
[0.06418837,
 0.06882809,
 0.19583368,
 0.1867047,
 0.19571249,
 0.19269861,
 0.1443874,
 0.14243364]
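
The two examples above are single sequences. One way to check the pattern across all sequences would be a rank correlation between strand distance and peak contribution magnitude. The sketch below uses synthetic arrays and a hypothetical helper `rank_corr`; in this notebook, `distances` would come from `new_dist['distances']`, and `max_contrib` would have to be computed per sequence from the importance scores.

```python
import numpy as np

def rank_corr(x, y):
    """Spearman-style rank correlation (hypothetical helper, assumes no ties)."""
    rx = np.argsort(np.argsort(x))  # rank of each element of x
    ry = np.argsort(np.argsort(y))  # rank of each element of y
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
distances = rng.uniform(0.05, 0.3, size=200)  # synthetic strand distances
max_contrib = 0.01 / distances                # synthetic, decreasing in distance
rho = rank_corr(distances, max_contrib)
print(rho)
```

A value close to -1 on the real data would indicate that the sequences with the largest strand distances are the ones with the smallest contribution scores.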