Description

ProCapNet[1] is a deep learning model that predicts transcription initiation from DNA sequence alone. The model uses a ~2kb-wide DNA sequence as input and outputs a 1kb-wide window of predicted transcription start sites, as measured by PRO-cap experiments, at base-resolution. Here, ProCapNet was trained across PRO-cap datasets in six different cell types and then used to generate six genome-wide prediction tracks.

Because the local sequence regulation that drives transcription initiation specifically (i.e. motifs such as the TATA box) is predominantly cell-type-agnostic, ProCapNet predictions are also largely cell-type-agnostic. This means that genome-wide ProCapNet predictions may highlight genomic regions that are transcriptionally active in cell types that are not included in the model's training datasets, allowing for imputation of PRO-cap readouts across cell types where the experiment has not yet been performed. The cell-type specificity of ProCapNet predictions can then be calibrated effectively using a cell-type-specific marker of active chromatin state, such as experimental measurements of chromatin accessibility or histone modifications like H3K27ac. The ProCapNet manuscript[1] contains more details.

PRO-cap[2] is a nascent RNA run-on assay that measures transcription initiation events, regardless of the stability or final product of the transcript. This means that PRO-cap data, and by extension ProCapNet predictions, can be found at regions besides promoters where initiation produces unstable transcripts, such as active enhancers. PRO-cap data also often shows regions of bidirectional transcription initiation, which occurs commonly at mammalian promoters. PRO-cap data is strand-specific and at base resolution, so its measurements indicate the exact bases that were the 5' ends of nascent RNA transcripts.

Display Convention

Each track shows predicted PRO-cap signal wiggle data for a specific cell type. The y-axis values correspond directly to the predicted number of PRO-cap reads with their 5' end at each base (so differences in coverage between experiments may be reflected in different overall y-axis scales). As PRO-cap data is strand-specific, each track is an overlay of the predictions across both DNA strands: the forward/positive/plus strand predictions are depicted as positive values, while the reverse/negative/minus strand predictions are depicted as negative values. Cell types are color-coded and ordered alphabetically.
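The sign convention described above can be illustrated with a minimal sketch. The function name and the list-based representation are hypothetical and chosen only for illustration; the file format and code actually used to build the tracks are not described here.

```python
def signed_strand_values(plus_counts, minus_counts):
    """Return display values for the two strands of one track.

    Plus-strand predictions are kept as positive values; minus-strand
    predictions are negated so they are drawn below the axis.
    (Hypothetical helper, not taken from the ProCapNet repository.)
    """
    plus_display = [float(v) for v in plus_counts]
    # Avoid emitting -0.0 for positions with no predicted reads.
    minus_display = [-float(v) if v else 0.0 for v in minus_counts]
    return plus_display, minus_display


if __name__ == "__main__":
    plus, minus = signed_strand_values([0, 2.5, 1], [1.0, 0, 3])
    print(plus)
    print(minus)
```

Overlaying the two returned lists on one axis gives the mirrored appearance seen at bidirectional promoters.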

Methods

For complete information on ProCapNet model design, training, performance evaluation, and interpretation analysis, please see the manuscript[1]. Below is a brief summary, with further information on how these tracks were generated. Code for model training and all manuscript analyses is available on GitHub: https://github.com/kundajelab/ProCapNet. Code used specifically for generating these tracks is located within that repository, in the "predict_genomewide" directory.

Model Design & Training

ProCapNet is a "BPNet-like" model -- it has an architecture and training loss design very similar to that of the BPNet model for prediction of ChIP-seq and ChIP-nexus TF binding data[3].

One ProCapNet model was trained for each of 6 cell types where PRO-cap data was available from ENCODE[4]: K562, A673, Caco-2, Calu3, HUVEC, and MCF10A. Models were trained using 7-fold cross-validation, split by chromosome. Each ProCapNet model was trained on all training-set PRO-cap peaks in that cell type, plus randomly sampled, cell-type-matched transcriptionally inactive DNase-hypersensitive sites from the training chromosomes. The model takes as input 2,114bp of one-hot encoded DNA sequence from the reference genome and outputs a prediction of the number of PRO-cap reads at each of 1,000bp, in a window centered on the input sequence.
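To make the input and output geometry concrete, here is a minimal sketch of one-hot encoding and of locating the centered 1,000bp output window inside a 2,114bp input. The function names, the A/C/G/T channel order, and the all-zero encoding of N are assumptions for illustration, not details confirmed by this description.

```python
IN_WIDTH = 2114   # input sequence length (from the text)
OUT_WIDTH = 1000  # output prediction length (from the text)
BASES = "ACGT"    # assumed channel order


def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values.

    Characters outside A/C/G/T (for example N) become all-zero rows,
    which is an assumption made for this sketch.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


def output_interval(input_start):
    """Half-open [start, end) interval of the output window, centered
    in the input window that begins at input_start."""
    flank = (IN_WIDTH - OUT_WIDTH) // 2  # 557bp of context on each side
    return input_start + flank, input_start + flank + OUT_WIDTH


if __name__ == "__main__":
    print(one_hot("ACGTN"))
    print(output_interval(0))
```

With these widths, each side of the output window is flanked by 557bp of sequence that the model sees but does not predict over.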

Track Generation

Genome-wide ProCapNet prediction tracks were generated for each of the six cell types. To make a single prediction for one sequence, we averaged the predictions from the models trained on all 7 cross-validation folds, and also averaged the predictions for the forward- and reverse-strand sequences of the reference genome, to make the prediction as robust as possible.
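The fold and strand averaging can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: each model is a hypothetical callable returning a (plus, minus) pair of per-base lists, and the reverse-complement prediction is assumed to be realigned by reversing positions and swapping strand labels before averaging. The toy model exists only so the sketch runs.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mean_tracks(tracks):
    """Element-wise mean of equal-length lists."""
    n = len(tracks)
    return [sum(vals) / n for vals in zip(*tracks)]


def ensemble_predict(models, seq):
    """Average (plus, minus) predictions over models and both orientations.

    `models` is a list of hypothetical callables: seq -> (plus, minus).
    """
    plus_preds, minus_preds = [], []
    for model in models:
        fwd_plus, fwd_minus = model(seq)
        rc_plus, rc_minus = model(reverse_complement(seq))
        plus_preds.append(fwd_plus)
        minus_preds.append(fwd_minus)
        # Assumed realignment: flip positions and swap strand labels.
        plus_preds.append(rc_minus[::-1])
        minus_preds.append(rc_plus[::-1])
    return mean_tracks(plus_preds), mean_tracks(minus_preds)


def toy_model(seq):
    """Stand-in for a trained fold model (illustration only)."""
    plus = [1.0 if b == "A" else 0.0 for b in seq]
    minus = [1.0 if b == "T" else 0.0 for b in seq]
    return plus, minus


if __name__ == "__main__":
    folds = [toy_model] * 7  # seven fold models, as in the text
    print(ensemble_predict(folds, "AAT"))
```

Because the toy model is strand-symmetric, averaging leaves its output unchanged; with real models the two orientations generally differ slightly, which is what the averaging smooths out.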

To extend this genome-wide, we extracted from the reference genome all possible sliding sequence windows of size 2,114bp, with a stride of 250bp. Since the model outputs a 1kb prediction window, the final prediction for every base is the average of the predictions from 4 overlapping windows, each offset from the previous one by 250bp.
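The four-fold coverage follows from dividing the 1,000bp output width by the 250bp stride. The sketch below checks this by counting how many output windows cover each base of a toy chromosome. The function names are hypothetical, and handling of chromosome ends (where fewer windows overlap) is an assumption of this sketch rather than a documented detail.

```python
IN_WIDTH = 2114   # input window size (from the text)
OUT_WIDTH = 1000  # output window size (from the text)
STRIDE = 250      # window stride (from the text)


def window_starts(chrom_len):
    """Start coordinates of all full-length input windows."""
    return range(0, chrom_len - IN_WIDTH + 1, STRIDE)


def coverage(chrom_len):
    """Number of output windows covering each base of the chromosome."""
    flank = (IN_WIDTH - OUT_WIDTH) // 2
    counts = [0] * chrom_len
    for start in window_starts(chrom_len):
        out_start = start + flank
        for pos in range(out_start, out_start + OUT_WIDTH):
            counts[pos] += 1
    return counts


if __name__ == "__main__":
    counts = coverage(5000)
    print(max(counts))   # most windows covering any single base
    print(counts[2000])  # coverage at an interior base
```

Away from the chromosome ends, every base is covered by exactly OUT_WIDTH // STRIDE = 4 windows, so the final value at each base is a mean of four predictions.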

For hg38, we avoided generating predictions over any sequence where the majority of basepairs were unresolved (N) in the reference sequence.
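A minimal sketch of such a filter is shown below. It assumes that "majority" means strictly more than half of the bases in a window are N; the function name is hypothetical and the exact rule used for the tracks may differ.

```python
def mostly_unresolved(seq):
    """Return True if more than half of the bases in seq are N.

    The strict "more than half" threshold is an assumption.
    """
    return seq.upper().count("N") > len(seq) / 2


if __name__ == "__main__":
    windows = ["ACGTACGT", "NNNNNACG", "NNNNACGT"]
    kept = [w for w in windows if not mostly_unresolved(w)]
    print(kept)
```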

Credits

ProCapNet was trained on ENCODE[4] PRO-cap datasets generated by the Yu Lab and Lis Lab at Cornell.

ProCapNet genome-wide predictions were generated within the GENCODE consortium[5] to assess how well ProCapNet predictions agree with current gene annotations (particularly TSS annotations).

Contacts

For questions, contact:

Data Release

All ProCapNet data hosted through this track is freely available to the public. Please cite the ProCapNet manuscript when using this data in your own work. You may also be interested in other ProCapNet artifacts hosted through the ENCODE portal, such as trained models. These are available through the accession IDs ENCSR740IPL, ENCSR072YCM, ENCSR182QNJ, ENCSR797DEF, ENCSR801ECP, ENCSR860TYZ for cell types K562, A673, Caco-2, Calu3, HUVEC, and MCF10A, respectively.

References

[1] Dissecting the cis-regulatory syntax of transcription initiation with deep learning.
Cochran, K., Yin, M., Mantripragada, A., Schreiber, J., Marinov, G., Shah, S.R., Yu, H., Lis, J.T., & Kundaje, A. 2024. bioRxiv. Under review. doi: 10.1101/2024.05.28.596138

[2] Precise maps of RNA polymerase reveal how promoters direct initiation and pausing.
Kwak, H., Fuda, N.J., Core, L.J., & Lis, J.T. 2013. Science 339(6122): pp. 950-953. doi: 10.1126/science.122938

[3] Base-resolution models of transcription-factor binding reveal soft motif syntax.
Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., & Zeitlinger, J. 2021. Nature Genetics 53: pp. 354-366. doi: 10.1038/s41588-021-00782-6

[4] New developments on the Encyclopedia of DNA Elements (ENCODE) data portal.
Luo, Y., Hitz, B.C., Gabdank, I., Hilton, J.A., Kagda, M.S., Lam, B., Myers, Z., Sud, P., Jou, J., Lin, K., Baymuradov, U.K., Graham, K., Litton, C., Miyasato, S.R., Strattan, J.S., Jolanki, O., Lee, J., Tanaka, F.Y., Adenekan, P., O’Neill, E., & Cherry, J.M. 2020. Nucleic Acids Research 48(D1): pp. D882-D889. doi: 10.1093/nar/gkz1062

[5] GENCODE: the reference annotation for the ENCODE Project.
Harrow J., Frankish A., Gonzalez J.M., Tapanari E., Diekhans M., Kokocinski F., Aken B.L., Barrell D., Zadissa A., Searle S., et al. 2012. Genome Research 22: pp. 1775-1789. doi: 10.1101/gr.135350.111