# Parameters
data_name = "CH12.LX cell line"
task_idx = 271
tf = "JUN-mouse"
tf_biosample = "CH12.LX"
motif = "Jun"
chippeaks = "/mnt/lab_data2/msharmin/tf-mouse-atlas/chip_idrs/ENCFF612GBQ.bed.gz"
instancedir = "/mnt/lab_data2/msharmin/oc-atlas/mtf/atac_pval_hits/task_271/Jun"
agnostic_bed = "/mnt/lab_data2/msharmin/oc-atlas/mtf/agnostic_hits/Jun.bed.gz"
fimodir = "/mnt/lab_data2/msharmin/oc-atlas/mtf/atac_specific_hits/task_271"

%matplotlib inline

4 kinds of called instances¶

The plots in this report compare 4 sets of the motif instances. Fimo Hits refers to the all FIMO reported hits within Open Chromatin (OC) peaks, i.e. ATAC/DHS peaks. Hits with high Deeplift (Hits with low Deeplift) refers to the FIMO hits which are greater (less) than a threshold value, $K$ for the probability of a Fimo hit; the method of probability assignment is described later. Finally, Fimo Hits within 50b of OC summits refers to FIMO reported hits which are within 50b around the ATAC/DHS peak summit.

Assigning p-value to each Fimo instances¶

All sequences are divided into one of the 100 bins based on the percentage of GC-content, e.g. 45% GC-content of the central 200bp of a 1kb sequence goes to the 45th bin. Next for each 1kb sequence, we generate 10 dinucleotide shuffled reference and compute importance scores using the same model - this comprise the null importance scores. We generate one null distributions per GC-bin, i.e. nth null distribution is composed from the null sequences of n-th GC-bin sequences. We again used the importances from central 200bp of a 1kb sequence. The null distribution gives us a empirical distribution function. From the empirical distribution, each importance score of the original sequence can be assigned a p-value by 1-CDF. Now, a Fimo instance/hit belonging to that sequence gets a p-value by 1-CDF(mean_deeplift_score) of the nth empirical distribution when the sequence falls under nth GC-bin. Thus, an instance/hit with high/low deeplift score gets low/high p-value with respect to the null GC matched distribution.

The choice of $K$¶

Next, we slide the p-value from low to high (for example 0.05 to 0.25) and divide the set into High Deeplift hits and Low Deeplift hits. We can now measure the auPRC, Precision, Recall etc based on ChIPSeq peaks as gold standard. We chose a p-value as best $K$, which gives better precision/Recall and maximum auPRC difference between instance with high and low deeplift scores. The Following plots shows the choice of best probability value for each instance.

from matlas.fimo_hits import decide_optimum_threshold, check_performance_competition
best_key = decide_optimum_threshold(instancedir)
import numpy as np
tmp = np.load("{}/pval_pval.npz".format(instancedir), allow_pickle=True)
check_performance_competition(tmp.f.a.tolist(), tmp.f.b.tolist(), tmp.f.c.tolist(),
                              info="{}, {}, {}".format(task_idx, tf_biosample, motif),
                              selected_key=best_key
                             )

TF-MoDISco is using the TensorFlow backend.

load data from labcluster

Statistics of instance counts in the cell type specific OC peaks for the motif¶

The box plots for per peak instance counts shows the removal of redundancy by deeplift

from matlas.fimo_hits import collect_instance_statistics
hit_metadata = collect_instance_statistics(
    best_key, instancedir=instancedir, agnostic_bed=agnostic_bed, 
    atac_bed="{}/peaks.bed".format(fimodir), verbose=False
)

from matlas.fimo_hits import display_instance_statistics
display_instance_statistics(hit_metadata)

Performance comparison plot¶

For the following plot, we divided the Open Chromatin (ATAC/DHS) regions into two sets: a) True regions: regions which overlaps with the ChIPSeq peaks and b) False Regions: regions with no ChIPseq Peaks. Next, we computed the number of peaks covered by at least one of the called instances by each method - these gives us true positives, false positives, true negatives and false negatives. We expect to see a higher precision for Hits with high Deeplift and higher recall for Fimo Hits.

auPRC_scored refers to the auPRC scores using probability values instead of 1/0 and auPRC_labeled refers to the auPRC scores using 1/0 labels for the true and false instances. Imbalance referes to the ratio of True regions and (True regions + False Regions) from ChIPSeq peaks. Imbalance should be same fro any method and added here only for sanity check.

from matlas.fimo_hits import collect_performance_info
performances = collect_performance_info(best_key=best_key, 
                                        instancedir=instancedir)

from matlas.fimo_hits import plot_performance_info
plot_performance_info(performances, motif)

<Figure size 576x432 with 0 Axes>

Positional Distribution for the called instances¶

The following plots show CDF and distributions of the distances of each (true positive) instances.

In the CDF plot, we expect to see that higher fraction of the total number of instances for Hits with high Deeplift falls within $n$ base-pair distance of ChIPSeq peak summits.

In the distribution plot, we expect to see higher spike towards zero and smaller tail for the distribution of Hits with high Deeplift.

from matlas.fimo_hits import collect_distance_info
distances = collect_distance_info(best_key=best_key, instancedir=instancedir,
                                  null_key='1-cdf', key_to_split='1-cdf',
                                  fimobed="{}/{}.bed.gz".format(fimodir, motif),
                                  verbose=False
                                 )

from matlas.fimo_hits import plot_positional_information
plot_positional_information(distances, motif)

Enrichment plot¶

This plot uses the signal-pvalue bigwig files of the TF for the similar cell type from ENCODE. If importance scores (deeplift scores) of the model in question provides information of motif presence, we should see higher enrichment for the Hits with high Deeplift.

from matlas.fimo_hits import display_enrichment_plot
display_enrichment_plot(instancedir, task_idx, motif, tf, fileprefix='best-pval')

Previous Enrichment plot 1¶

For each 1kb sequence, we generate a dinucleotide shuffled reference and compute importance scores using the same model - this comprise the null importance scores. Next, t-distribution is the fitted distribution taking the null importance scores from central 200b. From the fitted distribution, each importance score of the original sequence can be assigned a probability. Now, a Fimo instance/hit belonging to that sequence gets a probability by the median probability score of the sub-seqeunce. Thus, an instance/hit with high/low deeplift score gets high/low probability with respect to the null sequence.

Next, we slide the value of probability from low to high (for example 0.25 to 0.95) and divide the set into High Deeplift hits and Low Deeplift hits. We can now measure the auPRC, Precision, Recall etc based on ChIPSeq peaks as gold standard. We chose a probability as best 𝐾 , which gives better precision/Recall and maximum auPRC difference between instance with high and low deeplift scores. The Following plots shows the choice of best probability value for each instance.

from matlas.fimo_hits import display_enrichment_plot
display_enrichment_plot(instancedir, task_idx, motif, tf, fileprefix='best-cdf')

Previous Enrichment plot 2¶

The following plot was drawn with different threshold for deeplift hits and removing the overlapping instances (added for sanity checks). Here, $K$ was decided by the 95-percentile of the per_base_deeplift score in the background sequences (di-nucleotide shuffled references).

#adding previous results to verify improvement and debug
from matlas.fimo_hits import display_enrichment_plot
display_enrichment_plot(instancedir, task_idx, motif, tf, fileprefix='all-pval')