%load_ext autoreload
%autoreload 2
from matlas.fimo_hits import make_motif_df, combine_motif_dfs

logdir = "/mnt/lab_data/kundaje/msharmin/mouse_hem/mtf"
outdir = "/mnt/lab_data/kundaje/msharmin/mouse_hem/mtf/specific_hits/task_{}".format(273)


nonz_dfs = combine_motif_dfs(logdir, outdir, zscored=False)
z_dfs = combine_motif_dfs(logdir, outdir, zscored=True)

len(z_dfs)

TF-MoDISco is using the TensorFlow backend.
/users/msharmin/anaconda2/envs/basepair13/lib/python3.6/site-packages/scipy/stats/stats.py:2253: RuntimeWarning: invalid value encountered in true_divide
  return (a - mns) / sstd

897
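
The `RuntimeWarning` above comes from the z-scoring step: the line `return (a - mns) / sstd` divides by the standard deviation, which yields 0/0 when all values in a group are identical. The snippet below is a minimal standalone sketch of that behaviour, not the code used inside `combine_motif_dfs`; the helper name `zscore` and the sample values are hypothetical.

```python
import math

def zscore(values):
    """Population z-score (ddof=0). Hypothetical helper for illustration only."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # Zero spread: every z-score is undefined, so return NaN explicitly
        # instead of dividing 0 by 0.
        return [float("nan")] * n
    return [(v - mean) / std for v in values]

varied = zscore([1.0, 2.0, 3.0])     # well-defined z-scores
constant = zscore([0.5, 0.5, 0.5])   # all NaN, the case behind the warning
```

Motifs whose scores are all identical would therefore carry NaN z-scores, which is worth keeping in mind when reading the z-score panels.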

In the following plots, the first panel shows z-scores and the second panel shows raw scores for instances with a DeepLIFT sum score and a FIMO score.

Only instances that fall inside a MEL peak are considered; all other instances are discarded. In each violin plot, the FIMO and DeepLIFT density curves are drawn from the same number of instances.
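
The peak restriction can be pictured as a simple interval test. The following is a rough sketch only: it assumes that "inside" means the motif instance is fully contained in a peak, and the function name, the tuple layout `(chrom, start, end)`, and the coordinates are all hypothetical rather than taken from `matlas`.

```python
def inside_any_peak(chrom, start, end, peaks):
    """True when [start, end) lies entirely within one peak on the same chromosome."""
    return any(
        p_chrom == chrom and p_start <= start and end <= p_end
        for p_chrom, p_start, p_end in peaks
    )

# Hypothetical MEL peaks and motif instances, each as (chrom, start, end).
peaks = [("chr1", 100, 600), ("chr2", 50, 300)]
instances = [
    ("chr1", 150, 160),  # fully inside the chr1 peak
    ("chr1", 590, 610),  # straddles the peak boundary
    ("chr2", 10, 20),    # outside any peak
    ("chr2", 60, 72),    # fully inside the chr2 peak
]

# Keep only the instances contained in a peak; discard the rest.
kept = [inst for inst in instances if inside_any_peak(*inst, peaks)]
```

A linear scan over the peaks is enough for illustration; a genome-wide run would more likely rely on an interval index or a tool such as bedtools.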

from matlas.fimo_hits import plot_violins_alternate

plot_violins_alternate(z_dfs, 'zscore', nonz_dfs, 'raw_score')

/users/msharmin/anaconda2/envs/basepair13/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
/users/msharmin/anaconda2/envs/basepair13/lib/python3.6/site-packages/matplotlib/pyplot.py:514: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

The following plots are similar to the previous ones, except that the DeepLIFT-scored instances are filtered.

After selecting instances inside a MEL peak, no further filtering is applied to FIMO instances. For DeepLIFT-based instances, only those with a high DeepLIFT score are kept. Here, a high score means that the total importance (the DeepLIFT sum score) exceeds per_base_importance * motif_length; in our experience, per_base_importance = 0.0625 works well. The value 0.0625 comes from the 75th percentile of all possible sum_scores in the DeepLIFT track, divided by motif_length.
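
As a worked illustration of the threshold described above, the sketch below applies the rule `sum_score > per_base_importance * motif_length` and shows one way a 75th-percentile cutoff could be computed. It is not the implementation behind `filter_low_deeplift=True`: the function names, the linear-interpolation percentile, and the sample per-base values are assumptions made for the example.

```python
PER_BASE_IMPORTANCE = 0.0625  # value quoted in the text

def passes_deeplift_filter(sum_score, motif_length, per_base=PER_BASE_IMPORTANCE):
    """Keep an instance only if its total importance exceeds the length-scaled cutoff."""
    return sum_score > per_base * motif_length

def percentile(values, q):
    """q-th percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# A 10 bp motif needs a sum score above 0.0625 * 10 = 0.625.
keep_strong = passes_deeplift_filter(0.7, 10)
keep_weak = passes_deeplift_filter(0.5, 10)

# Hypothetical per-base importances (sum_score / motif_length) from a track.
per_base_scores = [0.01, 0.02, 0.05, 0.07, 0.10]
cutoff = percentile(per_base_scores, 75)
```

Because the cutoff scales with motif length, longer motifs need proportionally more total importance to be retained.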

nonz_dfs2 = combine_motif_dfs(logdir, outdir, zscored=False, filter_low_deeplift=True)
z_dfs2 = combine_motif_dfs(logdir, outdir, zscored=True, filter_low_deeplift=True)

/users/msharmin/anaconda2/envs/basepair13/lib/python3.6/site-packages/scipy/stats/stats.py:2253: RuntimeWarning: invalid value encountered in true_divide
  return (a - mns) / sstd

plot_violins_alternate(z_dfs2, 'zscore', nonz_dfs2, 'raw_score')

The number of motif instances retained after DeepLIFT filtering indicates the presence of Ar, Bach, Bcl, Ets, Ctcf, Klf, Gata, Sp, Zfx, and other motifs in MEL.

from matlas.fimo_hits import plot_motif_counts

plot_motif_counts(nonz_dfs2, size=(25, 200))

The following plots can be zoomed in for closer inspection.

from matlas.fimo_hits import interactive_plot_motif_counts

interactive_plot_motif_counts(nonz_dfs2, height=400, width=1000)