PRO-cap & ProCapNet

Description

PRO-cap[1] is a nascent RNA run-on assay that measures transcription initiation events. PRO-cap data is strand-specific and base-resolution, so each PRO-cap read captures the exact 5' end or start of a nascent RNA transcript. Unlike RNA-seq or CAGE/RAMPAGE experiments, PRO-cap uniquely measures both stable and unstable RNAs, including transcription at enhancers.

ProCapNet[2] is a deep learning model that is trained to predict PRO-cap data from DNA sequence alone. The model uses a ~2kb-wide DNA sequence window as input and outputs a 1kb-wide window of predicted transcription start site (TSS) usage at base-resolution. Here, one ProCapNet model was trained on each of six PRO-cap datasets from six different cell lines. Then, each model was used to generate two types of tracks: 1) ProCapNet predictions of transcription initiation, and 2) ProCapNet sequence contribution scores.

Because the local sequence features that regulate transcription initiation (e.g., the TATA box) are predominantly cell-type-agnostic, ProCapNet predictions are also partially cell-type-agnostic. This means that ProCapNet trained in any cell type may predict transcription at genomic regions that are transcriptionally active only in certain cell types. This trait allows ProCapNet to effectively impute PRO-cap data in contexts where the necessary PRO-cap experiment has not yet been performed. The ProCapNet manuscript[2] contains more details.

Sequence contribution scores represent how much each base in the sequence "contributed" to ProCapNet's prediction of TSS usage at this genomic region. Higher scores indicate more "important" bases and can often highlight the sequence motifs of canonically promoter-associated transcription factors (TFs) (see the ProCapNet manuscript[2]).

Display Convention

PRO-cap Data

Each track shows PRO-cap wiggle data for a specific cell type. As PRO-cap data is strand-specific, each track is an overlay of the data across both DNA strands: forward-strand reads are depicted as positive values, while reverse-strand reads are depicted as negative values. The y-axis shows the number of TSSs measured at each base. Cell types are color-coded and ordered alphabetically.
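The sign convention above can be illustrated with a minimal sketch. It assumes the per-base counts for each strand are already available as plain Python lists; the function name overlay_strands is hypothetical and is not part of any track-generation code.

```python
def overlay_strands(forward_counts, reverse_counts):
    """Return (positive, negative) series for a two-strand overlay.

    Forward-strand counts are kept as positive values and
    reverse-strand counts are negated, matching the display
    convention described above.
    """
    if len(forward_counts) != len(reverse_counts):
        raise ValueError("both strands must cover the same positions")
    positive = list(forward_counts)
    negative = [-count for count in reverse_counts]
    return positive, negative


# Toy example: three bases, with reads on both strands.
pos, neg = overlay_strands([0, 5, 2], [3, 0, 1])
print(pos, neg)
```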

Predicted PRO-cap

Each track shows predicted PRO-cap wiggle data for a specific cell type. As PRO-cap data is strand-specific, each track is an overlay of the predictions across both DNA strands: the forward-strand predictions are depicted as positive values, while the reverse-strand predictions are depicted as negative values. The y-axis shows the predicted number of PRO-cap-measured TSSs at each base. Note that different PRO-cap experiments have unequal read-depth or coverage, and this is reflected in the y-axis scales of predictions from models trained on different datasets. Cell types are color-coded and ordered alphabetically.

Sequence Contribution Scores

Each track shows ProCapNet sequence-contribution scores for a specific cell type in sequence logo wiggle format. The y-axis values are the scores for each individual base. Note that sequence contribution scores scale with the magnitude of predicted PRO-cap, which is impacted by training data read depth or coverage, so global y-axis scales may differ between ProCapNet models. Cell types are color-coded and ordered alphabetically.

Methods: PRO-cap Experiments

The full PRO-cap (CoPRO) experimental protocol is described in Tome et al. 2018. Briefly, the PRO-cap protocol involves permeabilizing cells and performing a run-on reaction with biotin NTPs (which RNA polymerase incorporates into nascent transcripts), followed by selection for capped RNAs, and finally reverse transcription and paired-end sequencing.

The tracks hosted here are the merged (summed) bigWigs of all experimental replicates available for each cell type.
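As a rough illustration of what "merged (summed)" means, the sketch below adds per-base signal across replicates. It assumes each replicate has already been read into a list of per-base values over the same region; reading bigWig files is out of scope here, and the function name sum_replicates is hypothetical.

```python
def sum_replicates(replicates):
    """Sum per-base signal across replicates.

    `replicates` is a list of equal-length lists, one per
    experimental replicate, covering the same genomic region.
    """
    lengths = {len(rep) for rep in replicates}
    if len(lengths) != 1:
        raise ValueError("replicates must be non-empty and equal in length")
    return [sum(values) for values in zip(*replicates)]


# Two toy replicates over a three-base region.
merged = sum_replicates([[0, 1, 2], [3, 0, 1]])
print(merged)
```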

Methods: ProCapNet

For complete information on ProCapNet model design, training, performance evaluation, and interpretation, please see the manuscript[2]. Below is a brief summary and further info on how these tracks were generated.

Code for model training and all manuscript analyses is available on GitHub: https://github.com/kundajelab/ProCapNet.

Code used specifically for generating these tracks is located within that repository, in the "predict_genomewide" directory.

ProCapNet Model Design & Training

ProCapNet is a "BPNet-like" model -- the architecture and training loss design are very similar to that of the BPNet model for prediction of ChIP-seq and ChIP-nexus TF binding data[3].

One ProCapNet model was trained for each of 6 cell lines with PRO-cap data available from ENCODE[4]: K562, A673, Caco-2, Calu3, HUVEC, and MCF10A. Models were trained using 7-fold cross-validation, split by chromosome. Each ProCapNet model was trained on all training-set PRO-cap peaks in that cell type, plus random cell-type-matched DNase-hypersensitive sites from the training chromosomes. The model takes as input 2,114 bp of one-hot encoded DNA sequence from the reference genome and outputs a prediction of the number of PRO-cap reads at each of 1,000 bp, in a window centered on the input sequence.
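The input and output geometry described above can be sketched as follows. This is an illustration only, not the repository's code: the A, C, G, T channel order, the all-zero encoding of N, and the 0-based half-open coordinates are assumptions, and the names one_hot_encode and output_window are hypothetical.

```python
INPUT_LEN = 2114   # bp of sequence given to the model
OUTPUT_LEN = 1000  # bp of predicted signal
FLANK = (INPUT_LEN - OUTPUT_LEN) // 2  # 557 bp on each side

BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq):
    """Encode a DNA string as one 4-element row per base.

    Bases other than A, C, G, T (such as N) become all zeros,
    which is an assumption made for this sketch.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def output_window(input_start):
    """Return the (start, end) of the centered 1,000 bp output window
    for an input window beginning at `input_start`."""
    return input_start + FLANK, input_start + FLANK + OUTPUT_LEN


print(one_hot_encode("ACGTN"))
print(output_window(0))
```

With these numbers, 557 bp of flanking sequence on each side of the output window inform the prediction but receive no prediction of their own.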

Track Generation

Predicted PRO-cap

Genome-wide ProCapNet prediction tracks were generated for each of the six cell types. To make a single prediction for one sequence, we averaged the predictions from the 7 ProCapNet models trained on different cross-validation folds, and also averaged the predictions for the forward and reverse-complemented versions of the reference sequence.
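A minimal sketch of this averaging is shown below. It assumes each model call returns a pair of per-base tracks (forward strand, reverse strand), and that a prediction made on the reverse-complemented sequence is mapped back to reference orientation by swapping the two strands and reversing positions. That reorientation step is an assumption of this sketch, and all function names are hypothetical.

```python
def mean_tracks(tracks):
    """Average equal-length per-base tracks position by position."""
    n = len(tracks)
    return [sum(values) / n for values in zip(*tracks)]


def reorient_rc_prediction(pred):
    """Map a prediction made on a reverse-complemented sequence back
    to reference orientation (assumed: swap strands, reverse positions)."""
    fwd, rev = pred
    return rev[::-1], fwd[::-1]


def ensemble_prediction(fold_preds_fwd, fold_preds_rc):
    """Average over folds and over forward / reverse-complement inputs.

    Each argument is a list with one (forward_track, reverse_track)
    pair per cross-validation fold.
    """
    all_preds = list(fold_preds_fwd)
    all_preds += [reorient_rc_prediction(p) for p in fold_preds_rc]
    fwd = mean_tracks([p[0] for p in all_preds])
    rev = mean_tracks([p[1] for p in all_preds])
    return fwd, rev


# Toy example with 2 folds and a 2 bp output (the real setup uses 7 folds).
on_forward = [([1.0, 2.0], [0.0, 4.0]), ([3.0, 2.0], [2.0, 4.0])]
on_revcomp = [([4.0, 1.0], [2.0, 2.0]), ([4.0, 1.0], [2.0, 2.0])]
print(ensemble_prediction(on_forward, on_revcomp))
```

Because every fold contributes one forward and one reverse-complement prediction, taking the mean over all of them at once gives the same result as averaging over folds first and orientations second.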

To generate predictions genome-wide, we extracted all possible sliding sequence windows of size 2,114 bp, with a stride of 250 bp, from the reference genome. Since the model outputs a 1 kb prediction window, the final prediction for each base is the average of ProCapNet's predictions over 4 different input sequences, each offset from the previous one by 250 bp.
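The tiling arithmetic can be checked with a small sketch: a 1,000 bp output window advanced in 250 bp steps covers each interior base 1000 / 250 = 4 times. The code below is illustrative only; it assumes 0-based coordinates and windows starting at position 0, and predict is a hypothetical stand-in for the model.

```python
INPUT_LEN = 2114
OUTPUT_LEN = 1000
STRIDE = 250
FLANK = (INPUT_LEN - OUTPUT_LEN) // 2  # 557


def window_starts(chrom_len):
    """Start positions of all full-length input windows."""
    return range(0, chrom_len - INPUT_LEN + 1, STRIDE)


def coverage_count(position, chrom_len):
    """Number of output windows that include `position`."""
    count = 0
    for start in window_starts(chrom_len):
        out_start = start + FLANK
        if out_start <= position < out_start + OUTPUT_LEN:
            count += 1
    return count


def average_overlapping(chrom_len, predict):
    """Average per-base predictions over all overlapping windows.

    `predict(start)` is a hypothetical callable returning OUTPUT_LEN
    values for the input window that begins at `start`. Bases not
    covered by any output window are reported as None.
    """
    totals = [0.0] * chrom_len
    counts = [0] * chrom_len
    for start in window_starts(chrom_len):
        out_start = start + FLANK
        for offset, value in enumerate(predict(start)):
            totals[out_start + offset] += value
            counts[out_start + offset] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# An interior base of a toy 10 kb sequence is covered by 4 windows.
print(coverage_count(5000, 10000))
```

In this sketch, bases near the ends of the sequence fall in fewer than 4 output windows, and the outermost bases fall in none.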

For hg38, we did not generate predictions over any sequence where the majority of basepairs were unresolved (N) in the reference sequence.
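One way to express this filter is sketched below, interpreting "majority" as strictly more than half of the bases in a window. That threshold and the function name is_mostly_unresolved are assumptions for illustration.

```python
def is_mostly_unresolved(seq):
    """Return True if more than half of the bases in `seq` are N."""
    n_count = seq.upper().count("N")
    return n_count > len(seq) / 2


# Windows that are mostly N would be skipped; others would be kept.
windows = ["NNNA", "NNAA", "ACGT"]
kept = [w for w in windows if not is_mostly_unresolved(w)]
print(kept)
```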

Sequence Contribution Scores

Sequence contribution score tracks were generated for each of the six cell types using the DeepSHAP algorithm. ProCapNet can generate two types of contribution scores: "profile" task scores, which explain the model's prediction of TSS positioning, and "coverage" or "read counts" task scores, which explain the model's prediction of total TSS count in a region. These tracks are "profile" or TSS-positioning scores. To compute the scores for one sequence, we averaged the scores from the 7 ProCapNet models trained on different cross-validation folds, and also averaged the scores for the forward and reverse-complemented versions of the reference sequence.

Sequence contribution scores were generated across all sequence windows centered on a MANE Select TSS, which is a subset of the GENCODE annotation of all transcripts in the genome[5]. Generally, this means that one 2,114 bp window of contribution scores was generated for each gene, approximately centered on the gene's primary promoter.
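The window placement can be sketched as simple coordinate arithmetic. The example assumes 0-based, half-open coordinates and places the TSS at the midpoint of the window; the exact centering used for these tracks may differ slightly, and the function name tss_centered_window is hypothetical.

```python
def tss_centered_window(tss_pos, window_len=2114):
    """Return (start, end) of a window of `window_len` bp with the
    TSS at its midpoint, in 0-based half-open coordinates."""
    half = window_len // 2
    start = tss_pos - half
    return start, start + window_len


# A TSS at position 10,000 gives a window of 1,057 bp on each side.
print(tss_centered_window(10000))
```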

Credits

PRO-cap datasets were generated as part of ENCODE[4] by Sagar Shah and the Yu and Lis Labs at Cornell.

ProCapNet models were trained by Kelly Cochran in the Kundaje Lab at Stanford. ProCapNet genome-wide predictions and sequence contribution scores were generated by Kelly in collaboration with the GENCODE consortium[5].

Contacts

For questions, contact:

Sagar Shah, sshah@cornell.edu (PRO-cap data generation)
Haiyuan Yu, haiyuan.yu@cornell.edu (PRO-cap data generation)
John Lis, jtl10@cornell.edu (PRO-cap data generation)
Kelly Cochran, kelly.cochran36@gmail.com (ProCapNet models)
Anshul Kundaje, akundaje@stanford.edu (ProCapNet models)

Data Release

All PRO-cap data hosted through this track is freely available through the ENCODE portal at accession IDs ENCSR046BCI, ENCSR100LIJ, ENCSR935RNW, ENCSR261KBX, ENCSR098LLB, and ENCSR799DGV for cell types A673, Caco-2, Calu3, K562, HUVEC, and MCF10A, respectively.

All ProCapNet data hosted through this track is freely available to the public. Please cite the ProCapNet manuscript[2] when using this data in your own work. You may also be interested in other ProCapNet artifacts hosted through the ENCODE portal, such as trained models. These are available through the accession IDs ENCSR072YCM, ENCSR182QNJ, ENCSR797DEF, ENCSR740IPL, ENCSR801ECP, and ENCSR860TYZ for cell types A673, Caco-2, Calu3, K562, HUVEC, and MCF10A, respectively.

References

[1] Precise maps of RNA polymerase reveal how promoters direct initiation and pausing. Kwak, H., Fuda, N.J., Core, L.J., & Lis, J.T. 2013. Science 339(6122): pp. 950-953. doi: 10.1126/science.122938

[2] Dissecting the cis-regulatory syntax of transcription initiation with deep learning.
Cochran, K., Yin, M., Mantripragada, A., Schreiber, J., Marinov, G., Shah, S.R., Yu, H., Lis, J.T., & Kundaje, A. 2024. bioRxiv. Under review. doi: 10.1101/2024.05.28.596138

[3] Base-resolution models of transcription-factor binding reveal soft motif syntax.
Avsec, Ž., Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., & Zeitlinger, J. 2021. Nature Genetics 53: pp. 354-366. doi: 10.1038/s41588-021-00782-6

[4] New developments on the Encyclopedia of DNA Elements (ENCODE) data portal.
Luo, Y., Hitz, B.C., Gabdank, I., Hilton, J.A., Kagda, M.S., Lam, B., Myers, Z., Sud, P., Jou, J., Lin, K., Baymuradov, U.K., Graham, K., Litton, C., Miyasato, S.R., Strattan, J.S., Jolanki, O., Lee, J., Tanaka, F.Y., Adenekan, P., O’Neill, E., & Cherry, J.M. 2020. Nucleic Acids Research 48(D1): pp. D882-D889. doi: 10.1093/nar/gkz1062

[5] GENCODE: the reference annotation for the ENCODE Project.
Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., Barrell, D., Zadissa, A., Searle, S., et al. 2012. Genome Research 22: pp. 1775-1789. doi: 10.1101/gr.135350.111