Bias model training and quality check report

Preprocessing report

The image below should closely resemble a Tn5 or DNase enzyme bias motif.

Training report

The validation loss (val loss) should decrease and then saturate after a few epochs.

Bias model performance in peaks and non-peaks

Counts Metrics: The pearsonr in non-peaks should be greater than 0 (the higher the better). The pearsonr in peaks should be greater than -0.3 (otherwise the bias model could be capturing AT bias). The MSE (mean squared error) is expected to be high in peaks.

Profile Metrics: Median JSD (Jensen-Shannon divergence between the observed and predicted profiles) is better when lower. Median norm JSD is the median of the min-max normalized JSD, where the worst case is the JSD between the observed profile and a uniform profile, and the best case is a JSD of 0. Median norm JSD is better when higher. Both median JSD and median norm JSD are sensitive to read depth: higher read depth results in better metrics.
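
As a rough illustration of the two profile metrics described above, the sketch below computes a JSD between an observed and a predicted profile and then rescales it against the uniform-profile worst case. It makes several assumptions that the report does not confirm: the divergence is taken in base 2, the normalization is (worst - JSD) / worst, and the function names (jsd, norm_jsd, to_profile) are made up for this example. The pipeline's actual formula may differ.

```python
import math

def _kl(p, q):
    # KL divergence in bits; terms where p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, range 0 to 1) of two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def to_profile(counts):
    """Turn per-base read counts into a probability profile."""
    total = sum(counts)
    return [c / total for c in counts]

def norm_jsd(observed, predicted):
    """Assumed normalization: 0 = no better than uniform, 1 = perfect match."""
    uniform = [1 / len(observed)] * len(observed)
    worst = jsd(observed, uniform)
    if worst == 0:
        # Observed profile is itself uniform, so the rescaling is undefined.
        return 0.0
    return (worst - jsd(observed, predicted)) / worst

observed = to_profile([8, 1, 1, 0])
predicted = to_profile([6, 2, 1, 1])
print(jsd(observed, predicted), norm_jsd(observed, predicted))
```

Under this assumed normalization, a prediction identical to the observed profile scores 1 and a uniform prediction scores 0, which matches the "higher is better" reading of median norm JSD.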

What to do if your pearsonr in peaks is less than -0.3? In the range of -0.3 to -0.5, be wary of the TFModisco results for your chrombpnet_wo_bias.h5 (the model that will potentially be trained with this bias model) showing many GC-rich motifs (> 3 in the top 10). If that is not the case, you can continue using chrombpnet_wo_bias.h5. If you do see many GC-rich motifs, the bias model has likely learnt a GC distribution different from the GC content of your peaks. You may benefit from increasing the bias_threshold_factor argument passed to the chrombpnet bias pipeline or chrombpnet bias train command used to train the bias model, and then retraining a new bias model. For more intuition about this argument, refer to the FAQ section of the wiki. If the value is less than -0.5, chrombpnet training will automatically throw an error.

                 nonpeaks.pearsonr  nonpeaks.mse  peaks.pearsonr  peaks.mse
counts_metrics   0.6                1.41          0.19            15.1

                 nonpeaks.median_jsd  nonpeaks.median_norm_jsd  peaks.median_jsd  peaks.median_norm_jsd
profile_metrics  0.7                  0.11                      0.46              0.3
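
The pearsonr thresholds for peaks described above can be written as a small check. This is only a sketch: the function name and the "ok" / "warn" / "fail" labels are invented for illustration, and treating a value of exactly -0.3 or -0.5 as the milder case is an assumption, since the report does not say how boundaries are handled.

```python
def check_peaks_pearsonr(peaks_pearsonr, warn_below=-0.3, fail_below=-0.5):
    """Classify the peaks pearsonr of a bias model (hypothetical helper).

    "fail": below -0.5, where chrombpnet training is said to throw an error.
    "warn": between -0.5 and -0.3, inspect TFModisco output for GC-rich motifs.
    "ok":   anything higher.
    """
    if peaks_pearsonr < fail_below:
        return "fail"
    if peaks_pearsonr < warn_below:
        return "warn"
    return "ok"

# peaks.pearsonr from the counts_metrics table in this report
print(check_peaks_pearsonr(0.19))  # prints "ok"
```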

TFModisco motifs learnt from the bias model (bias.h5)

TFModisco motifs generated from the profile contribution scores of the bias model. cwm_fwd and cwm_rev are the forward and reverse-complemented consolidated motifs (CWMs) derived from contribution scores in a subset of random peaks. These CWMs should be free of any transcription factor (TF) motifs and should contain only bias motifs or random repeats. For each motif, we use TOMTOM to find the top 3 closest matches (match0, match1, match2) in a database containing both MEME TF motifs and heterogeneous enzyme bias motifs that we have repeatedly seen in our datasets. If the closest hit is a TF motif, its q-value (qval0, qval1, qval2) should be high (> 0.0001), indicating that the closest match is not a true match; this can usually be verified by eye, since the closest match will look nothing like the CWM. If the closest hit is an enzyme bias motif, the q-value should be low, and the top match should visibly resemble the CWM. The first 3-5 motifs in the list below should look like enzyme bias motifs.

What to do if you find an obvious TF motif in the list?
Do not use this bias model, as it will regress out the contribution of the TF motifs (along with the bias motifs) from chrombpnet_nobias.h5. Reduce the bias_threshold_factor argument passed to the chrombpnet bias pipeline or chrombpnet bias train command used to train the bias model, and retrain a new bias model. For more intuition about this argument, refer to the FAQ section of the wiki.

What to do if you are unsure whether a given CWM resembles, for example, the match0 logo?
Compute the marginal footprint on the match0 motif (using the chrombpnet footprints command) and make sure that the bias model's footprint is close to that of the controls with no motif inserted. For examples, see the FAQ.

(The cwm_fwd, cwm_rev, match0_logo, match1_logo and match2_logo columns contain logo images and are not shown.)

pattern  NumSeqs  match0                qval0         match1                qval1         match2                qval2
pos__0   7725     TN5_2                 5.985320e-08  TN5_1                 2.342440e-07  TN5_7                 1.010820e-03
pos__1   4606     TN5_1                 1.411760e-05  TN5_2                 1.411760e-05  TN5_3                 9.458130e-04
pos__2   4546     TN5_8                 1.440170e-02  TN5_2                 1.206640e-01  TN5_1                 1.829830e-01
pos__3   3074     TN5_3                 2.082130e-01  KLF4_MA0039.3         2.082130e-01  SP5_MOUSE.H11MO.0.C   2.082130e-01
pos__4   3041     TN5_2                 4.047850e-13  TN5_4                 3.189740e-05  TN5_5                 3.189740e-05
pos__5   2636     TN5_3                 9.455050e-09  TN5_1                 3.167830e-05  TN5_7                 5.405470e-02
pos__6   973      TN5_3                 5.694430e-10  TN5_4                 1.216380e-05  TN5_5                 1.216380e-05
pos__7   859      TN5_3                 7.334250e-06  TN5_4                 7.424830e-05  TN5_5                 7.424830e-05
pos__8   689      SPI1_HUMAN.H11MO.0.A  1.181530e-07  SPIB_MOUSE.H11MO.0.A  1.181530e-07  SPIB_HUMAN.H11MO.0.A  7.386650e-07
pos__9   662      TN5_1                 1.651190e-06  TN5_3                 4.388370e-06  TN5_2                 2.251310e-04
pos__10  281      CTCF_MA0139.1         2.777850e-03  CTCF_HUMAN.H11MO.0.A  2.971750e-03  CTCF_MOUSE.H11MO.0.A  3.143240e-03
pos__11  250      TN5_4                 9.944940e-04  TN5_5                 9.944940e-04  KLF4_MA0039.3         9.294810e-02
pos__12  216      CTCF_C2H2_1           1.186410e-06  CTCF_MA0139.1         9.499530e-04  TN5_3                 1.394240e-03
pos__13  55       TN5_6                 9.903500e-04  TN5_2                 9.528820e-03  TN5_8                 9.685810e-02
pos__14  23       CTCF_C2H2_1           1.133350e-04  CTCF_MA0139.1         1.332970e-04  CTCF_MOUSE.H11MO.0.A  3.858170e-04
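
The q-value rule described above can be turned into a quick screen over the table: flag any pattern whose top match is not an enzyme bias motif and whose qval0 is at or below 0.0001. The sketch below is illustrative only. It assumes that bias motifs in the database are the ones named with a TN5_ or DNASE_ prefix (inferred from the names in this report, not confirmed), and the function name flag_tf_hits is hypothetical. The example rows are copied from the table above.

```python
# Assumption: enzyme bias motifs are identified by these name prefixes.
BIAS_PREFIXES = ("TN5_", "DNASE_")
QVAL_CUTOFF = 1e-4  # threshold quoted in the report

def flag_tf_hits(rows):
    """Return patterns whose top match is a non-bias motif with a low q-value."""
    flagged = []
    for pattern, match0, qval0 in rows:
        is_bias = match0.startswith(BIAS_PREFIXES)
        if not is_bias and qval0 <= QVAL_CUTOFF:
            flagged.append(pattern)
    return flagged

# (pattern, match0, qval0) for a few rows of the profile table
rows = [
    ("pos__0", "TN5_2", 5.985320e-08),
    ("pos__8", "SPI1_HUMAN.H11MO.0.A", 1.181530e-07),
    ("pos__10", "CTCF_MA0139.1", 2.777850e-03),
    ("pos__12", "CTCF_C2H2_1", 1.186410e-06),
]
print(flag_tf_hits(rows))  # prints ['pos__8', 'pos__12']
```

A flagged pattern is a candidate for visual inspection of the CWM against the match logo rather than a verdict on the bias model.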

TFModisco motifs generated from the counts contribution scores of the bias model. cwm_fwd and cwm_rev are the forward and reverse-complemented consolidated motifs (CWMs) derived from contribution scores in a subset of random peaks. These motifs should be free of any transcription factor (TF) motifs and should contain only motifs weakly related to bias motifs or random repeats. For each motif, we use TOMTOM to find the top 3 closest matches (match0, match1, match2) in a database containing both MEME TF motifs and heterogeneous enzyme bias motifs that we have repeatedly seen in our datasets. If the closest hit is a TF motif, its q-value should be high (> 0.0001), indicating that the closest match is not a true match; this can usually be verified by eye by making sure that the closest match looks nothing like the CWM.

What to do if you find an obvious TF motif in the list?
Do not use this bias model, as it will regress out the contribution of the TF motifs (along with the bias motifs) from chrombpnet_nobias.h5. Reduce the bias_threshold_factor argument passed to the chrombpnet bias pipeline or chrombpnet bias train command used to train the bias model, and retrain a new bias model. For more intuition about this argument, refer to the FAQ section of the wiki.

What to do if you are unsure whether a given CWM resembles, for example, the match0 logo?
Compute the marginal footprint on the match0 motif (using the chrombpnet footprints command) and make sure that the bias model's footprint is close to that of the controls with no motif inserted. For examples, see the FAQ.

(The cwm_fwd, cwm_rev, match0_logo, match1_logo and match2_logo columns contain logo images and are not shown.)

pattern  NumSeqs  match0                      qval0         match1                      qval1         match2                      qval2
pos__0   4447     TN5_2                       1.516890e-06  TN5_1                       2.906540e-04  TN5_8                       6.138520e-03
pos__1   2928     SPIB_HUMAN.H11MO.0.A        4.497340e-05  SPI1_MOUSE.H11MO.0.A        9.032240e-05  SPI1_HUMAN.H11MO.0.A        9.032240e-05
pos__2   1649     TBX1_TBX_1                  8.533350e-01  FOS_HUMAN.H11MO.0.A         8.533350e-01  FOSL1_MA0477.1              8.533350e-01
pos__3   884      ZNF384_MA1125.1             1.000000e+00  None                        NaN           None                        NaN
pos__4   807      DNASE_5                     1.873980e-01  RXRA_MOUSE.H11MO.0.A        1.873980e-01  RARA_MOUSE.H11MO.0.A        4.803490e-01
pos__5   768      DNASE_5                     6.597840e-07  RXRA_MOUSE.H11MO.0.A        1.011660e-03  RARA_MOUSE.H11MO.0.A        1.394140e-03
pos__6   700      CTCF_MA0139.1               1.239750e-11  CTCF_HUMAN.H11MO.0.A        9.631760e-09  CTCF_MOUSE.H11MO.0.A        9.899570e-08
pos__7   605      TN5_6                       2.239280e-17  PAX5_HUMAN.H11MO.0.A        4.296710e-02  ZN121_HUMAN.H11MO.0.C       4.511070e-01
pos__8   570      PATZ1_HUMAN.H11MO.0.C       2.910190e-03  ZN467_HUMAN.H11MO.0.C       2.910190e-03  ZN281_HUMAN.H11MO.0.A       2.910190e-03
pos__9   484      Hoxc10.mouse_homeodomain_1  1.000000e+00  HOXD11_MA0908.1             1.000000e+00  HOXD11_homeodomain_1        1.000000e+00
pos__10  472      DNASE_5                     3.851210e-05  TN5_6                       1.276710e-01  GLI1_MOUSE.H11MO.0.C        7.228690e-01
pos__11  456      STAT1+STAT2_MA0517.1        8.340120e-01  STAT2_HUMAN.H11MO.0.A       8.340120e-01  STAT2_MOUSE.H11MO.0.A       8.340120e-01
pos__12  447      ZN770_HUMAN.H11MO.0.C       4.905620e-01  TFAP2A_AP2_4                4.905620e-01  TFAP2C_MA0814.1             4.905620e-01
pos__13  283      ZN250_HUMAN.H11MO.0.C       9.999990e-01  IKZF1_HUMAN.H11MO.0.C       9.999990e-01  IKZF1_MOUSE.H11MO.0.C       9.999990e-01
pos__14  244      ZNF384_MA1125.1             4.187880e-01  HOXA13_homeodomain_4        1.000000e+00  CPEB1_RRM_1                 1.000000e+00
pos__15  242      TCF4_bHLH_2                 7.780760e-01  Tcfl5_MA0632.1              7.780760e-01  Hes1_MA1099.1               7.780760e-01
pos__16  229      MAZ_HUMAN.H11MO.0.A         1.916520e-04  MAZ_MOUSE.H11MO.0.A         1.916520e-04  KLF15_HUMAN.H11MO.0.A       1.916520e-04
pos__17  218      VDR_MA0693.2                1.000000e+00  ZN134_HUMAN.H11MO.0.C       1.000000e+00  ZNF524_C2H2_1               1.000000e+00
pos__18  210      TN5_6                       1.470600e-04  PAX5_HUMAN.H11MO.0.A        7.431570e-02  ZN121_HUMAN.H11MO.0.C       7.431570e-02
pos__19  207      HOXD12_MA0873.1             9.899780e-01  HOXD12_homeodomain_2        9.899780e-01  HOXB13_homeodomain_2        9.899780e-01
pos__20  207      VDR_MA0693.2                1.000000e+00  ZNF524_C2H2_1               1.000000e+00  ZN134_HUMAN.H11MO.0.C       1.000000e+00
pos__21  202      PRDM6_HUMAN.H11MO.0.C       2.528850e-02  ZNF384_MA1125.1             2.654620e-02  FOXJ3_HUMAN.H11MO.0.A       1.154040e-01
pos__22  182      TEAD2_MOUSE.H11MO.0.C       2.680350e-01  TEAD2_MA1121.1              2.680350e-01  SALL4_HUMAN.H11MO.0.B       2.858480e-01
pos__23  182      STAT1_MOUSE.H11MO.0.A       5.658240e-01  STAT2_HUMAN.H11MO.0.A       6.949530e-01  STAT2_MOUSE.H11MO.0.A       6.949530e-01
pos__24  153      ZN770_HUMAN.H11MO.0.C       7.393630e-02  EGR1_HUMAN.H11MO.0.A        1.591330e-01  EGR2_HUMAN.H11MO.0.A        3.063960e-01
pos__25  136      SPI1_ETS_1                  1.015940e-01  SPI1_MA0080.4               1.015940e-01  SPIB_ETS_1                  1.154970e-01
pos__26  113      STAT1+STAT2_MA0517.1        1.000000e+00  NFAC1_HUMAN.H11MO.0.B       1.000000e+00  NFATC2_MA0152.1             1.000000e+00
pos__27  90       TN5_3                       1.000000e+00  HOXC10_homeodomain_1        1.000000e+00  TN5_4                       1.000000e+00
pos__28  89       SPIC_ETS_1                  1.225180e-03  SPIC_MA0687.1               1.225180e-03  Spic.mouse_ETS_1            1.225180e-03
pos__29  84       ZNF384_MA1125.1             1.646210e-02  PRDM6_HUMAN.H11MO.0.C       2.383060e-01  FOXJ3_HUMAN.H11MO.0.A       4.904350e-01
pos__30  80       ITF2_HUMAN.H11MO.0.C        4.868300e-03  HTF4_MOUSE.H11MO.0.A        1.239690e-02  ASCL2_MOUSE.H11MO.0.C       1.239690e-02
pos__31  65       ZN121_HUMAN.H11MO.0.C       6.633850e-06  PAX5_HUMAN.H11MO.0.A        2.347580e-04  TN5_6                       1.051600e-03
pos__32  36       KLF4_MA0039.3               6.039690e-01  TBX20_TBX_5                 6.039690e-01  TBX20_TBX_1                 6.039690e-01
pos__33  30       ZN770_HUMAN.H11MO.0.C       1.908970e-09  ZN341_HUMAN.H11MO.0.C       1.912730e-02  ZSC22_HUMAN.H11MO.0.C       6.262330e-02
pos__34  22       ZN770_HUMAN.H11MO.0.C       5.568200e-01  ZN331_HUMAN.H11MO.0.C       5.568200e-01  TFAP4_HUMAN.H11MO.0.A       1.000000e+00