Goal

  • analyze instance locations

Conclusions / questions

  • does Klf4 compete with the other factors for the site?
    • at many sites the model seems to predict that if Klf4 binds, binding of all the other factors is depleted
  • Oct4-Sox2 is relatively infrequent in the dataset. Is this expected?
  • 3/2, 4/1 -> Nanog motif in 2 different contexts, one occurring in the middle and the other on the side
  • m1/p0 and m1/p1 are frequently together
  • instance location assignment could be improved
  • how to distinguish co-binding vs. nearby binding?
    • idea: have a look at the specific profile
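One way to probe the co-binding vs. nearby-binding question is the spacing distribution between two motifs within the same regions. A minimal sketch (`spacing_histogram` is a hypothetical helper, not part of basepair; it assumes motif centers per region are available as arrays):

```python
import numpy as np

def spacing_histogram(centers_a, centers_b, max_dist=150, bin_size=5):
    """Histogram of all pairwise center offsets between two motifs.

    A sharp peak at a fixed offset hints at co-binding (a composite
    site with constrained spacing); a broad, flat distribution is more
    consistent with independent binding that merely happens nearby.
    """
    diffs = (np.asarray(centers_a)[:, None] - np.asarray(centers_b)[None, :]).ravel()
    diffs = diffs[np.abs(diffs) <= max_dist]
    bins = np.arange(-max_dist, max_dist + bin_size, bin_size)
    counts, edges = np.histogram(diffs, bins=bins)
    return counts, edges
```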

Ideas

Shall we just run modisco for each peak-set separately?

  • conditioned on the factor we ChIPped, what are the other factors in the vicinity doing?
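A minimal sketch of this conditioning, assuming the instances table carries the originating peak set in a `peak_id` column (as merged in later via `interval_from_task`); `pattern_freq_by_peak_set` is a hypothetical helper:

```python
import pandas as pd

def pattern_freq_by_peak_set(df):
    """Fraction of each peak set's seqlets assigned to each pattern.

    Rows: pattern_id; columns: the ChIPped factor (peak_id). This
    conditions on which factor the peak was called for and asks which
    motifs show up in its vicinity.
    """
    counts = df.groupby("peak_id").pattern_id.value_counts().unstack(level=0, fill_value=0)
    return counts / counts.sum(axis=0)
```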

Improvements

  • Aggregation plot
    • plot the negative strand as negative values
    • always start at 0
    • ylim = [0, max(abs(pos), abs(neg))]
  • calculate entropy for each region and the final aggregated distribution

    • show a histogram
    • or a scatterplot with the total counts in the particular region or in the total region
  • threshold all the computations w.r.t. the total importance score in the region

  • [ ] compare the PWM and the aggregated profile across all pattern locations (not just those identified by clustering with modisco)

    • can you improve the results by filtering either by seqlet_score or percnormed_score?
  • [ ] scatterplot the spread or bi-modality of the distribution within the sequence vs. seqlet_score

    • are high seqlet scores mostly in the center?
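The entropy idea above can be sketched as follows (`positional_entropy` is a hypothetical helper; it assumes a vector of per-position importance scores for one region):

```python
import numpy as np

def positional_entropy(scores, eps=1e-12):
    """Shannon entropy (in bits) of the positional distribution of
    absolute importance scores within one region.

    Low entropy: importance concentrated at a few positions (e.g. one
    strong motif); high entropy: importance spread across the region.
    """
    p = np.abs(np.asarray(scores, dtype=float))
    p = p / (p.sum() + eps)
    return float(-(p * np.log2(p + eps)).sum())
```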
In [5]:
from tqdm import tqdm

from basepair.config import get_data_dir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from basepair.modisco import ModiscoResult, append_pattern_loc
from basepair.cli.schemas import DataSpec
from basepair.utils import read_pkl

from kipoi.readers import HDF5Reader
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.

Load the data

Modisco

In [6]:
ddir = Path(get_data_dir())
model_dir = ddir / "processed/chipnexus/exp/models/oct-sox-nanog-klf/models/n_dil_layers=9/"
modisco_dir = model_dir / "modisco/valid"

!ls {model_dir}

!ls {modisco_dir}  # modisco dir
clustering     figures	       grad.test.h5   hparams.yaml	 modisco
cometml.json   grad.all.h5     grad.valid.h5  Intervene_results  results.html
dataspec.yaml  grad.test.2.h5  history.csv    model.h5		 results.ipynb
all.instances_bed	 kwargs.json  strand_distances.h5
all.instances.tsv	 modisco.h5   test.instances.tsv
all.seqlets.pkl		 modisco.log  test.seqlets.pkl
head.test.instances.tsv  plots
In [7]:
# Modisco
mr = ModiscoResult(modisco_dir / "modisco.h5")
mr.open()
tasks = mr.tasks()
In [9]:
ds = DataSpec.load(model_dir / "dataspec.yaml")
In [69]:
d = HDF5Reader(model_dir / "grad.all.h5")
d.open()

interval_from_task = d.f['/metadata/interval_from_task'][:]
id_hash = pd.DataFrame({"peak_id": interval_from_task,
                        "example_idx": np.arange(len(interval_from_task))})

Labelled instances

In [70]:
# load the instances data frame
df = pd.read_csv(modisco_dir / "all.instances.tsv", sep='\t')
df['pattern_id'] = "metacluster_" + df.metacluster.astype(str) + "/pattern_" + df.pattern.astype(str)
df['seqlet_center'] = (df.seqlet_start + df.seqlet_end) / 2
patterns = df.pattern_id.unique().tolist()
pattern_pssms = {pattern: mr.get_pssm(*pattern.split("/")) for pattern in patterns}
append_pattern_loc(df, pattern_pssms, trim_frac=0.08)
df['pattern_center'] = (df.pattern_start + df.pattern_end) / 2
df = df.merge(id_hash, on="example_idx")
In [71]:
# Use thresholded values from now on
df_unfiltered = df
df = df[df.percnormed_score > .2]
In [178]:
ls {outdir}
Klf4.npy  Nanog.npy  Oct4.npy  Sox2.npy
In [17]:
seqlets = read_pkl(modisco_dir / "all.seqlets.pkl")
In [ ]:
assert len(df) == len(seqlets)

# total number of mapped seqlets
len(seqlets)

Grads and inputs

In [11]:
n_examples = d.f['/metadata/range/chr'].shape[0]
In [12]:
# total number of peaks / examples
n_examples
Out[12]:
98428
In [92]:
# sanity check the raw peak file
for task, ts in ds.task_specs.items():
    n_peaks = !zcat {ts.peaks} | wc -l
    print(f"{task}: {n_peaks[0]}")
Oct4: 21841
Sox2: 9396
Nanog: 18017
Klf4: 49174
In [ ]:
df.head()
In [ ]:
# almost every peak has a seqlet
len(df.example_id.unique()) / n_examples

Write the interval_from_task vectors

In [175]:
d.f['/metadata/interval_from_task'][:4]
Out[175]:
array(['Oct4', 'Oct4', 'Oct4', 'Oct4'], dtype=object)
In [177]:
outdir = model_dir / "clustering/peak-set/"
outdir.mkdir(parents=True, exist_ok=True)

metadata_task = d.f['/metadata/interval_from_task'][:]

for task in tasks:
    mask = metadata_task == task
    np.save(outdir / f"{task}.npy", mask)

Stats

Number of seqlets per example

In [21]:
from plotnine import *
import plotnine
In [22]:
plotnine.options.figure_size = (10, 1)
ggplot(aes(x="size"), df.groupby("example_id").pattern.agg(['size']).reset_index()) + \
  geom_bar() + theme_bw() + \
  xlab("Number of seqlets per example")
Out[22]:
<ggplot: (-9223363286210021102)>

Pattern frequency

In [23]:
# 30 most frequent patterns
fig = plt.figure(figsize=(3, 8))
df.pattern_id.value_counts().head(30).iloc[::-1].plot(kind='barh');
In [24]:
len(df.pattern_id.value_counts())
Out[24]:
70

Functions

In [88]:
from basepair.vdom_viz import fig2vdom
from vdom.helpers import *

def plot_n_per_example_ax(dfs, ax):
    dfs.groupby("example_id").size().value_counts().plot(kind='barh', ax=ax)
    ax.set_title("# per example")

def plot_col_hist_ax(dfs, ax, col='score', xlim=None):
    dfs[col].plot.hist(bins=100, ax=ax)
    #ax.set_xlabel("# per example")
    if xlim is not None:
        ax.set_xlim(xlim)
    ax.set_title(col)

def plot_all_dist(df, pattern_id, figsize=(12,2), add_title="", **kwargs):
    """Merge all plots into a single one 
    """
    dfs = df[df.pattern_id == pattern_id]
    fig, ax = plt.subplots(1, 5, figsize=figsize)
    plot_col_hist_ax(dfs, ax[0], "seqlet_center", xlim=[0, 1000])
    ax[0].set_title(f"seqlet_center {add_title}")
    plot_n_per_example_ax(dfs, ax[1])
    plot_col_hist_ax(dfs, ax[2], "percnormed_score")
    plot_col_hist_ax(dfs, ax[3], "score")
    plot_col_hist_ax(dfs, ax[4], "seqlet_score")
    plt.tight_layout()
    return fig2vdom(fig, **kwargs)

import matplotlib.pyplot as plt
from concise.utils.plot import seqlogo_fig, seqlogo
from basepair.math import mean
from basepair.plots import draw_box

# helper functions for plotting the results
def get_scores(d, example_idx, task, grad_type="weighted"):
    return mean([d.f[f'/grads/{task}/{grad_type}/0'][example_idx],
                 d.f[f'/grads/{task}/{grad_type}/1'][example_idx]])

def get_sequence(d, example_idx):
    return d.f['/inputs'][example_idx]

def relevant_patterns(df, example_idx, xlim):
    patterns = []
    for i,row in df[df.example_idx == example_idx].sort_values("pattern_start").iterrows():
        xmin = row.pattern_start - xlim[0] + 0.5
        xmax = row.pattern_end - xlim[0] + 0.5
        if xmin < 0:
            continue
        if xmax > xlim[1] - xlim[0]:
            continue
        patterns.append((row.pattern_id, row.revcomp))
    return patterns

def draw_patterns(df, example_idx, xlim, ax, add_label=True):
    for i,row in df[df.example_idx == example_idx].iterrows():
        xmin = row.pattern_start - xlim[0] + 0.5
        xmax = row.pattern_end - xlim[0] + 0.5
        if xmin < 0:
            continue
        if xmax > xlim[1] - xlim[0]:
            continue
        draw_box(xmin, xmax, ax, 'g')  # trimmed pattern location
        if add_label:
            y = ax.get_ylim()[1] + (ax.get_ylim()[1] - ax.get_ylim()[0]) * 0.1
            if row.revcomp:
                prefix = "-"
            else:
                prefix = ""
            ax.text(row.pattern_start - xlim[0] + 0.5, y, 
                    s=prefix + f"mc_{row.metacluster}/p_{row.pattern}")

def plot_region(df, d, tasks, example_idx, xlim=[400,600]):
    fig, axes = plt.subplots(len(tasks), 1, figsize=(20, 2* len(tasks)), sharex=True)
    seq = get_sequence(d, example_idx)
    for i, task in enumerate(tasks):
        ax = axes[i]
        seqlogo((get_scores(d, example_idx, task) * seq)[xlim[0]:xlim[1]], ax=ax)
        draw_patterns(df, example_idx, xlim, ax, add_label=i==0)
        ax.set_ylabel(task)
        ax.set_xticks(list(range(0, xlim[1] - xlim[0] + 1, 5)));    

def rc_prefix(rc):
    if rc:
        return "-"
    else:
        return ""
        
def vdom_patterns(mr, patterns_rc_vec, trim_frac=0.08):
    """Get vdom patterns in a single row
    
    Args:
      mr: ModiscoResults object
      patterns_rc_vec: list of tuples (pattern_id, rc). Example: ("metacluster_0/pattern_1", True)
      trim_frac: how much to trim the pssm
    """
    return p([fig2vdom(mr.plot_pssm(*pattern_idx.split("/"), rc=rc, trim_frac=trim_frac, title=rc_prefix(rc) + pattern_idx, 
                                    letter_width=0.35, height=1.7), height=75)
              for pattern_idx, rc in patterns_rc_vec])

def dist_pairwise_hist(df, a, b, ax=None):
    if a != b:
        dfc_ab = df[df.pattern_id.isin([a,b])]
        counts = dfc_ab.groupby("example_idx").pattern_id.value_counts().unstack(fill_value=0)
        relevant_idx = counts.index[(counts==1).all(axis=1)]
        distances = dfc_ab[dfc_ab.example_idx.isin(relevant_idx)].set_index(['example_idx', 'pattern_id']).pattern_center.unstack()
        diff = distances[a] - distances[b]
        title = f"{a} - {b}"
    if a == b:
        examples = df[df.pattern_id == a].groupby("example_idx").filter(lambda x: len(x) == 2)
        diff = examples.groupby("example_idx").apply(lambda x: x.pattern_center.iloc[1] - x.pattern_center.iloc[0])
        title = f"{a}"
    if ax is None:
        fig,ax = plt.subplots(figsize=(15,2))
    ax.hist(diff, bins=100);
    ax.set_title(title);
In [60]:
df.head()
Out[60]:
Unnamed: 0 example_idx seqlet_start seqlet_end seqlet_is_revcomp seqlet_score metacluster pattern percnormed_score score ... example_chr example_end example_id example_start example_strand pattern_id seqlet_center pattern_start pattern_end pattern_center
0 0 0 516 557 False 1.899616 0 2 0.323415 0.178182 ... chrX 143483544 0 143482544 * metacluster_0/pattern_2 536.5 522 531 526.5
1 1 0 476 517 False 1.073116 0 1 0.327721 0.254454 ... chrX 143483544 0 143482544 * metacluster_0/pattern_1 496.5 494 502 498.0
4 4 1 477 518 False 1.489748 0 0 0.211156 0.249695 ... chr3 122146063 1 122145063 * metacluster_0/pattern_0 497.5 477 491 484.0
5 5 1 507 548 False 1.381093 0 0 0.781497 0.335416 ... chr3 122146063 1 122145063 * metacluster_0/pattern_0 527.5 519 533 526.0
10 10 2 569 610 False 0.461497 0 2 0.676585 0.220768 ... chr17 35504551 2 35503551 * metacluster_0/pattern_2 589.5 581 590 585.5

5 rows × 22 columns

The plots below show the following:

  • positional distribution within the 1kb window
  • number of seqlets per example
  • PSSM
In [96]:
## fig_height = 100
div(*[div(span(str(i), ". ", row['index'] + f" (n={row.pattern_id})", 
            fig2vdom(mr.plot_pssm(*row['index'].split("/"), height=0.5, letter_width=0.15, trim_frac=0.08))),
            br(),
            *[plot_all_dist(df[df.peak_id == task], row['index'], add_title=task,
                            figsize=(16,1.8), height=100) for task in tasks],
           )
      for i, row in df.pattern_id.value_counts().reset_index().iterrows()])
Out[96]:
0. metacluster_1/pattern_0 (n=59068)

1. metacluster_2/pattern_0 (n=52170)

2. metacluster_0/pattern_0 (n=40564)

3. metacluster_0/pattern_1 (n=35094)

4. metacluster_2/pattern_3 (n=31215)

5. metacluster_0/pattern_2 (n=25087)

6. metacluster_1/pattern_1 (n=20328)

7. metacluster_3/pattern_2 (n=19276)

8. metacluster_4/pattern_1 (n=16044)

9. metacluster_1/pattern_2 (n=15511)

10. metacluster_3/pattern_6 (n=13137)

11. metacluster_4/pattern_0 (n=12320)

12. metacluster_4/pattern_2 (n=11480)

13. metacluster_3/pattern_1 (n=10655)

14. metacluster_4/pattern_3 (n=10154)

15. metacluster_7/pattern_0 (n=9145)

16. metacluster_6/pattern_4 (n=6858)

17. metacluster_1/pattern_4 (n=6775)

18. metacluster_10/pattern_0 (n=6037)

19. metacluster_1/pattern_5 (n=5770)

20. metacluster_3/pattern_0 (n=4978)

21. metacluster_3/pattern_5 (n=4455)

22. metacluster_28/pattern_0 (n=4423)

23. metacluster_0/pattern_4 (n=4301)

24. metacluster_0/pattern_3 (n=4243)

25. metacluster_15/pattern_0 (n=3882)

26. metacluster_0/pattern_8 (n=3825)

27. metacluster_21/pattern_0 (n=3779)

28. metacluster_6/pattern_2 (n=3603)

29. metacluster_2/pattern_1 (n=3526)

30. metacluster_2/pattern_4 (n=3384)

31. metacluster_3/pattern_7 (n=3371)

32. metacluster_22/pattern_0 (n=3371)

33. metacluster_31/pattern_0 (n=2979)

34. metacluster_6/pattern_1 (n=2965)

35. metacluster_7/pattern_2 (n=2949)

36. metacluster_1/pattern_3 (n=2770)

37. metacluster_7/pattern_1 (n=2767)

38. metacluster_0/pattern_6 (n=2755)

39. metacluster_0/pattern_5 (n=2597)

40. metacluster_9/pattern_1 (n=2592)

41. metacluster_7/pattern_3 (n=2567)

42. metacluster_3/pattern_4 (n=2468)

43. metacluster_9/pattern_0 (n=2197)

44. metacluster_9/pattern_2 (n=1940)

45. metacluster_25/pattern_0 (n=1922)

46. metacluster_6/pattern_3 (n=1833)

47. metacluster_0/pattern_9 (n=1778)

48. metacluster_7/pattern_4 (n=1673)

49. metacluster_18/pattern_0 (n=1631)

50. metacluster_13/pattern_0 (n=1629)

51. metacluster_10/pattern_1 (n=1527)

52. metacluster_0/pattern_11 (n=1519)

53. metacluster_6/pattern_6 (n=1511)

54. metacluster_6/pattern_0 (n=1477)

55. metacluster_0/pattern_10 (n=1369)

56. metacluster_3/pattern_3 (n=1108)

57. metacluster_2/pattern_2 (n=1062)

58. metacluster_1/pattern_6 (n=974)

59. metacluster_10/pattern_2 (n=886)

60. metacluster_23/pattern_1 (n=857)

61. metacluster_9/pattern_3 (n=791)

62. metacluster_18/pattern_2 (n=789)

63. metacluster_10/pattern_4 (n=688)

64. metacluster_18/pattern_1 (n=570)

65. metacluster_6/pattern_5 (n=438)

66. metacluster_23/pattern_0 (n=431)

67. metacluster_10/pattern_3 (n=423)

68. metacluster_9/pattern_4 (n=381)

69. metacluster_0/pattern_7 (n=239)

Analysis

Co-occurrence stats

Compute the co-occurrence stats.

Filter

In [28]:
from upsetplot import plot as upsetplt
In [14]:
all_patterns = mr.patterns()
In [15]:
# restrict to the central 500bp
df_center = df[(df.pattern_center > 250) & (df.pattern_center < 750)]
In [17]:
len(df)
Out[17]:
516881
In [18]:
len(df_center)
Out[18]:
443810
In [19]:
dfp = pd.DataFrame.from_dict([dict(pd.Series(all_patterns, 
                                             index=all_patterns).
                                   isin(group.pattern_id.unique()))
                               for name, group in tqdm(df_center.groupby("example_id"))])
100%|██████████| 96822/96822 [02:05<00:00, 768.82it/s]
In [97]:
# make the upsetplot
data_dict = dfp.groupby(all_patterns).size()
dfc = data_dict.reset_index().rename(columns={0:"n"})
dfcp = dfc.sort_values("n", ascending=False).head(40)  # keep only the 40 most frequent combinations
col_incl = dfcp.iloc[:, :-1].max()
dfcp_plot = dfcp.iloc[:, col_incl.tolist() + [True]]
dfcp_plot = dfcp_plot.set_index(list(dfcp_plot.columns[:-1]))

Plot

In [36]:
r = upsetplt(dfcp_plot.n)
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/upsetplot/plotting.py:24: PerformanceWarning: indexing past lexsort depth may impact performance.
  totals.append(data.loc[idxslice].sum())
/users/avsec/bin/anaconda3/envs/chipnexus/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:107: MatplotlibDeprecationWarning: Passing one of 'on', 'true', 'off', 'false' as a boolean is deprecated; use an actual boolean (True/False) instead.
  warnings.warn(message, mplDeprecation, stacklevel=1)
In [98]:
# note interesting pairs to plot pairwise distances
pairs = [
     ["metacluster_2/pattern_0", "metacluster_1/pattern_0"],
     ["metacluster_2/pattern_0", "metacluster_0/pattern_0"],
     ["metacluster_2/pattern_0", "metacluster_28/pattern_0"],
     ["metacluster_1/pattern_0", "metacluster_2/pattern_3"],
     ["metacluster_0/pattern_0", "metacluster_0/pattern_1"],
     ["metacluster_0/pattern_0", "metacluster_0/pattern_2"],
     ["metacluster_0/pattern_0", "metacluster_3/pattern_2"],
     ["metacluster_1/pattern_0", "metacluster_1/pattern_1"],
     ["metacluster_1/pattern_2", "metacluster_0/pattern_0"],
     ["metacluster_1/pattern_2", "metacluster_0/pattern_1"],
]
In [37]:
# Visualize motifs for the most frequent patterns from above
dfcp_plot_melt = pd.melt(dfcp_plot.reset_index().assign(idx=np.arange(len(dfcp_plot))), id_vars=["n","idx"])
patterns = dfcp_plot_melt.query('value').groupby("variable").n.sum().sort_values(ascending=False).index.tolist()

display(vdom_patterns(mr, zip(patterns, [False]*len(patterns)), trim_frac=0.08))

Pairwise distances with self

  • restrict to sites with exactly 2 instances of the same pattern and plot the pairwise distance
In [54]:
nrow = int(np.ceil(len(patterns)/2))
fig, axes = plt.subplots(nrow,2, figsize=(20, nrow * 1.5), sharex=True)
for pattern, ax in tqdm(zip(patterns, axes.flat)):
    dist_pairwise_hist(df[~df[['example_id', 'pattern_center']].duplicated()], pattern, pattern, ax)
    ax.set_xlim([-1000, 1000])
plt.tight_layout()
12it [00:43,  3.64s/it]

Pairwise distances with a pair

  • restrict to sites with exactly 1 instance of each motif and plot their pairwise distance
In [93]:
## TODO: stratify these plots by peak_id
In [95]:
nrow = int(np.ceil(len(pairs)/2))
fig, axes = plt.subplots(nrow, 2, figsize=(20, nrow * 1.5), sharex=True)
for pair, ax in tqdm(zip(pairs, axes.flat)):
    dist_pairwise_hist(df[~df[['example_id', 'pattern_center']].duplicated()], pair[0], pair[1], ax)
    ax.set_xlim([-1000, 1000])
plt.tight_layout()
10it [00:03,  3.03it/s]

Eyeballing

  • plots below this point are used to check how well the assignment of seqlets to patterns works
In [36]:
example_idx = 12
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, 12, xlim)

In [38]:
example_idx = 29934
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)

In [39]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)
44980
In [40]:
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)

In [41]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
11682

In [601]:
# This example doesn't make any sense to me
In [66]:
df[df.example_idx == 5472]
Out[66]:
Unnamed: 0 example_idx seqlet_start seqlet_end seqlet_is_revcomp seqlet_score metacluster pattern percnormed_score score ... example_id example_start example_strand pattern_id pattern_start pattern_end seqlet_center pattern_start_rel pattern_end_rel strand
13626 13626 5472 486 527 False 0.361572 0 0 0.113469 0.221166 ... 5472 21224842 * metacluster_0/pattern_0 501 515 506.5 21225343 21225357 -
162727 162727 5472 551 592 False 0.433490 1 0 0.188333 0.156902 ... 5472 21224842 * metacluster_1/pattern_0 575 583 571.5 21225417 21225425 +
162728 162728 5472 574 615 False 0.273384 1 0 0.076667 0.132894 ... 5472 21224842 * metacluster_1/pattern_0 594 602 594.5 21225436 21225444 -
318708 318708 5472 425 466 False 0.580759 2 0 0.744276 0.353541 ... 5472 21224842 * metacluster_2/pattern_0 451 459 445.5 21225293 21225301 +
527944 527944 5472 510 551 False 0.354864 4 3 0.927461 0.239710 ... 5472 21224842 * metacluster_4/pattern_3 521 529 530.5 21225363 21225371 +
638799 638799 5472 457 498 False 0.696911 9 0 0.224932 0.347820 ... 5472 21224842 * metacluster_9/pattern_0 465 480 477.5 21225307 21225322 +

6 rows × 24 columns

In [72]:
sl = seqlets[638799]
In [98]:
s = sl.track_name_to_snippet['Sox2_contrib_scores']
In [99]:
seqlogo_fig(s.fwd, figsize=(10,2));
In [42]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)
example_idx = 5472
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
83953

In [99]:
seqlogo_fig(s.fwd, figsize=(10,2));
In [101]:
mr.plot_pssm("metacluster_9","pattern_0");
In [43]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
28157

In [44]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
55674

In [45]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
47460

In [46]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
11844

In [47]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
20764

In [49]:
example_idx = df.example_idx.sample(1).iloc[0]
print(example_idx)

xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)
36709

In [30]:
df[df.example_idx == 36709].iloc[0][['example_chr', 'example_start', 'example_end']]
Out[30]:
example_chr         chr13
example_start    99015585
example_end      99016585
Name: 78128, dtype: object

Comparing models

Load old Oct-Sox BPNet

In [34]:
from basepair.BPNet import BPNetPredictor
from keras.models import load_model
from basepair.config import create_tf_session
from pybedtools import Interval
In [4]:
create_tf_session(0)
Out[4]:
<tensorflow.python.client.session.Session at 0x7f5f5feeffd0>
In [5]:
model = load_model("/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/seq_mutlitask_filters=32,conv1_kernel_size=21,tconv_kernel_size=25,n_dil_layers=6,use_profile=True,use_counts=True,c_task_weight=10,lr=0.004.5.h5")
preproc = read_pkl("/users/avsec/workspace/basepair/basepair/../data/processed/chipnexus/exp/models/p-multi-task/preproc.pkl")
bp = BPNetPredictor(model, 
                    "/mnt/data/pipeline_genome_data/mm10/mm10_no_alt_analysis_set_ENCODE.fasta",
                    tasks = ["Sox2", "Oct4"],
                    preproc=preproc)
bws = {task: {s: f"/users/avsec/workspace/basepair-workflow/data/{task}/5prime.counts.{s}.bw" for s in ['pos', 'neg']} for task in ["Sox2", "Oct4"]}
genome_file = "/mnt/data/pipeline_genome_data/mm10/mm10.chrom.sizes"

Make predictions

In [39]:
df[(df.example_chr == "chr17") & (df.example_start < 35503946) & (df.example_end >35503946+201 )]
Out[39]:
Unnamed: 0 example_idx seqlet_start seqlet_end seqlet_is_revcomp seqlet_score metacluster pattern percnormed_score score offset revcomp example_chr example_end example_id example_start example_strand pattern_id pattern_start pattern_end
10 10 2 569 610 False 0.461497 0 2 0.676585 0.220768 2.0 0.0 chr17 35504551 2 35503551 * metacluster_0/pattern_2 581 590
11 11 2 490 531 False 2.261329 0 2 0.613195 0.211543 -11.0 1.0 chr17 35504551 2 35503551 * metacluster_0/pattern_2 523 532
63295 63295 31190 532 573 False 0.517389 0 2 0.652005 0.216822 2.0 0.0 chr17 35504588 31190 35503588 * metacluster_0/pattern_2 544 553
63296 63296 31190 453 494 False 2.241407 0 2 0.626132 0.212729 -11.0 1.0 chr17 35504588 31190 35503588 * metacluster_0/pattern_2 486 495
64345 64345 31492 559 600 False 0.467501 0 2 0.673997 0.220096 2.0 0.0 chr17 35504561 31492 35503561 * metacluster_0/pattern_2 571 580
64346 64346 31492 480 521 False 2.270571 0 2 0.614489 0.211709 -11.0 1.0 chr17 35504561 31492 35503561 * metacluster_0/pattern_2 513 522
104313 104313 49274 567 608 False 0.461385 0 2 0.679172 0.221136 2.0 0.0 chr17 35504553 49274 35503553 * metacluster_0/pattern_2 579 588
104314 104314 49274 488 529 False 2.259731 0 2 0.610608 0.211251 -11.0 1.0 chr17 35504553 49274 35503553 * metacluster_0/pattern_2 521 530
156141 156141 2 465 506 False 0.388818 1 0 0.338333 0.178223 5.0 0.0 chr17 35504551 2 35503551 * metacluster_1/pattern_0 481 489
197116 197116 31190 428 469 False 0.347628 1 0 0.306111 0.174058 5.0 0.0 chr17 35504588 31190 35503588 * metacluster_1/pattern_0 444 452
197364 197364 31492 455 496 False 0.374927 1 0 0.337222 0.177821 5.0 0.0 chr17 35504561 31492 35503561 * metacluster_1/pattern_0 471 479
220622 220622 49274 463 504 False 0.385134 1 0 0.337222 0.177787 5.0 0.0 chr17 35504553 49274 35503553 * metacluster_1/pattern_0 479 487
343236 343236 31190 368 409 False 0.496035 2 3 0.297872 0.374152 -5.0 0.0 chr17 35504588 31190 35503588 * metacluster_2/pattern_3 374 401
343476 343476 31492 395 436 False 0.514911 2 3 0.297872 0.376492 -5.0 0.0 chr17 35504561 31492 35503561 * metacluster_2/pattern_3 401 428

Core motif not recognized

In [47]:
# Filtered seqlets (percnormed_score > 0.2)
example_idx = 2
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df, example_idx, xlim), trim_frac=0.08))
plot_region(df, d, tasks, example_idx, xlim)

In [50]:
# Unfiltered seqlets
example_idx = 2
xlim=[400,600]
display(vdom_patterns(mr, relevant_patterns(df_unfiltered, example_idx, xlim), trim_frac=0.08))
plot_region(df_unfiltered, d, tasks, example_idx, xlim)

In [37]:
# old BPNet prediction
bp.plot_predict_grad([Interval("chr17", 35503946, 35503946+201)], bws=bws, 
                      profile_grad='weighted',
                      figsize=(20, 10))