Goal

  • explore the MPRA data

Conclusions

  • Sequences are mostly ~80bp long (min=40, max=82)

Description

Paper: https://www.biorxiv.org/content/biorxiv/early/2018/08/22/398107.full.pdf

Imports

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fs import open_fs
In [2]:
fs = open_fs("zip:///s/project/avsec/basepair/BarakCohen-mESC-MPRA/BarakCohen-mESC-MPRA.zip")  # TODO - put the zip file elsewhere
In [3]:
fs.listdir("/")  # available files
Out[3]:
['GEN_RPMsExpression_plusSeqs.txt', 'Syn_RPMsExpression_plusSeqs.txt']

Genomic sequences

400 genomic sequences with combinations of Oct/Sox/Klf/Essrb sites, along with a matched set of 400 sequences in which the binding sites have been mutated. Each sequence has three of these four sites. In this set about 28% of the sequences were statistically different from basal levels and for all of those their activity depended on in tact binding sites. It is not a huge data set but it is probably the one that is most useful for you. We ran the gkSVM package on this data (top 25% vs bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.

GEN_RPMsExpression_plusSeqs.txt

data table for the GEN library of (mm10) genomic sequences selected to have three TFBS for no more than 1 of each of OCT4, SOX2, KLF4, & ESRRB, plus matched sequences with all three binding site mutated, as Barak previously described. These sequences likely have TFBSs for NANOG but this was not part of our selection process and therefore these sites were not explicitly mutated in the matched sequences.

MPRA sequence

Genomic sequences were represented in the pool by 150 bp oligos with the following sequences:

GACTTACATTAGGGCCCGT[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCGGACTACGATACTG

Where [SEQ] is either a reference (gWT) or mutated (gMUT) genomic sequence of 81-82 bps. Reference gWT sequences were selected by choosing regions of the genome within 100 bps of previously identified ChIP­seq peaks for these four pluripotency factors (Chen et al. 2008). A

Column description

The columns are as follows:

  • CRE_id = name of element tested; for SYN library, encoded as ordered forward(f) or reverse(r) TFBS abbreviations; for GEN library, mm10 coordinates + designation as "Genomic", for reference sequences, or "Mutated", where each TFBSs in the sequence has 2 substitutions made at high information positions.
  • BC_seq = barcode sequence matched for determining counts from RNA and DNA sequencing.
  • Sequence = sequence synthesized and tested upstream of a pou5f1 minimal promoter.
  • `RNA_counts_rep{1-3} = reads per million (RPM) for each barcode per transfection replicate.
  • DNA_counts = RPM for each barcode in sequenced DNA plasmid pool.
  • BC_exp_rep{1-3} = (RPM RNA/ RPM DNA) per barcode.
  • Rep{1-3}BCnorm = (BC_exp_rep{1-3} / average of Basal expression) per replicate; activity relative to basal.
  • Rep{1-3}CREnormalized = average of normalized BC expression for each 'CRE_id' per replicate.
  • Rep{1-3}CREnorm_SEM = standard error of mean for BC expression for each 'CRE_id' per replicate.
In [4]:
df_gen = pd.read_csv(fs.open("/GEN_RPMsExpression_plusSeqs.txt"), sep='\t')
In [5]:
df_gen.head()
Out[5]:
CRE_id BC_seq Sequence RNA_counts_rep1 RNA_counts_rep2 RNA_counts_rep3 DNA_counts BC_exp_rep1 BC_exp_rep2 BC_exp_rep3 Rep1_BC_norm Rep2_BC_norm Rep3_BC_norm Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM
0 Basal AAAGACGCG NaN 247.2739 179.0611 160.3011 229.3460 1.0782 0.7807 0.6989 NaN NaN NaN NaN NaN NaN NaN NaN
1 Basal AAATCAAGC NaN 88.0702 107.7321 140.7916 181.2556 0.4859 0.5944 0.7768 NaN NaN NaN NaN NaN NaN NaN NaN
2 Basal AAATGGTCC NaN 205.2365 170.3033 173.5353 313.6059 0.6544 0.5430 0.5534 NaN NaN NaN NaN NaN NaN NaN NaN
3 Basal AACAAGAGG NaN 172.0582 236.9896 114.2612 200.5631 0.8579 1.1816 0.5697 NaN NaN NaN NaN NaN NaN NaN NaN
4 Basal AACGACCCT NaN 339.0788 209.2388 217.3385 374.8905 0.9045 0.5581 0.5797 NaN NaN NaN NaN NaN NaN NaN NaN
In [6]:
reps = [1, 2, 3]
In [64]:
dfs_gen = df_gen[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()
In [65]:
def expand_CRE_id(dfs_gen):    
    dfs_gen = pd.concat([dfs_gen, dfs_gen.CRE_id.str.split("_", n=1, expand=True).rename(columns={0: 'interval', 1: 'CRE_type'})], axis=1)
    dfs_gen = pd.concat([dfs_gen, dfs_gen['interval'].str.split(":|-", expand=True).rename(columns={0: "chrom", 1: "start", 2: "end"})], 1)
    dfs_gen['start'] = dfs_gen['start'].astype(int)
    dfs_gen['end'] = dfs_gen['end'].astype(int)
    dfs_gen['expression_fc'] = dfs_gen[[f'Rep{i}_CRE_normalized' for i in range(1,3)]].mean(axis=1)
    dfs_gen['expression_fc_SEM'] = dfs_gen[[f'Rep{i}_CRE_norm_SEM' for i in range(2,3)]].mean(axis=1)
    dfs_gen['expression_fc_log2'] = np.log2(dfs_gen['expression_fc'])
    return dfs_gen

dfs_gen = expand_CRE_id(dfs_gen)
In [47]:
df_gen.shape
Out[47]:
(6592, 18)
In [48]:
df_gen.BC_seq.unique().shape  # barcode
Out[48]:
(6592,)
In [49]:
df_gen.Sequence.unique().shape
Out[49]:
(811,)
In [50]:
df_gen.CRE_id.unique().shape
Out[50]:
(811,)
In [51]:
dfs_gen
Out[51]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM interval CRE_type chrom start end expression_fc expression_fc_SEM expression_fc_log2
112 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 1.2789 1.3300 1.1202 0.1134 0.0980 chr10:108456620-10845... All_Mutated chr10 108456620 108456701 1.3045 0.1134 0.3835
120 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 2.5366 2.0188 2.4672 1.0134 1.3304 chr10:108456620-10845... Genomic chr10 108456620 108456701 2.2777 1.0134 1.1876
128 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.8912 0.7321 0.8639 0.0673 0.1322 chr10:110940873-11094... All_Mutated chr10 110940873 110940954 0.8116 0.0673 -0.3011
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6568 chrX:42432096-4243217... ATTCAGTTTCAGATGCAGATC... 1.3218 1.3660 1.6200 0.1891 0.1751 chrX:42432096-42432178 Genomic chrX 42432096 42432178 1.3439 0.1891 0.4264
6576 chrX:51684780-5168486... CAAAAATCCCATCTCCCCACT... 0.8897 1.0264 0.8753 0.1101 0.1214 chrX:51684780-51684862 All_Mutated chrX 51684780 51684862 0.9581 0.1101 -0.0618
6584 chrX:51684780-5168486... CAAAAATCCCATCTCCCCACA... 1.0267 0.9596 0.8702 0.0857 0.1037 chrX:51684780-51684862 Genomic chrX 51684780 51684862 0.9932 0.0857 -0.0099

809 rows × 15 columns

In [59]:
dfs_gen.expression_fc_log2.plot.hist(100);
In [62]:
dfs_gen.plot.scatter("Rep1_CRE_normalized", "Rep2_CRE_normalized");
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
In [14]:
dfs_gen.Sequence.str.len().value_counts()  # sequence lengths
Out[14]:
81    434
82    375
Name: Sequence, dtype: int64
In [15]:
# Interval widths match
assert np.all(dfs_gen.Sequence.str.len() == dfs_gen['end'] - dfs_gen['start'])
In [16]:
dfs_gen.CRE_type.value_counts()
Out[16]:
All_Mutated    406
Genomic        403
Name: CRE_type, dtype: int64
In [17]:
dfs_gen['interval'].value_counts().value_counts()  # only 5 sequences don't have the matched pair
Out[17]:
2    402
1      5
Name: interval, dtype: int64

Synthetic sequences

~700 synthetic sequences which are just concatamers of the binding sites in different orientations and orders. This set contains all possible 2-mers, 3-mers, and 4-mers of sites. We performed machine learning on this set in a number of different ways, some of which yielded very strong models. You can see the paper for details.

Syn_RPMsExpression_plusSeqs.txt

data table for the SYN library of concatenated combinations of 'reference' TFBSs for OCT4 (O), SOX2 (S), KLF4 (K), & ESRRB (E), as Barak previously described.

MPRA sequence

CTTCTACTACTAGGGCCCA[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCTACATGCTAGTTCATG

where [SEQ] is a 40-80 bp synthetic element comprised of concatenated 20 bp building blocks of pluripotency sites, as described previously, with the fifth position of the KLF4 site changed to ‘T’ to facilitate cloning (Fiore and Cohen 2016) . [FILL] is a random filler sequence of variable length to bring the total length of each sequence to 150 bp, and [BC] is a random 9 bp barcode.

In [18]:
df_syn = pd.read_csv(fs.open("/Syn_RPMsExpression_plusSeqs.txt"), sep='\t')
In [19]:
df_syn.head()
Out[19]:
CRE_id BC_seq Sequence RNA_counts_rep1 RNA_counts_rep2 RNA_counts_rep3 DNA_counts BC_exp_rep1 BC_exp_rep2 BC_exp_rep3 Rep1_BC_norm Rep2_BC_norm Rep3_BC_norm Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep1_CRE_norm_SEM Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM
0 Basal AACCTGAAA NaN 36.5613 42.5753 37.1954 184.9038 0.1977 0.2303 0.2012 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Basal AATCCAGGG NaN 43.1518 60.5907 25.8516 197.2409 0.2188 0.3072 0.1311 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Basal AATGACCGA NaN 27.2248 21.1038 18.6749 153.2742 0.1776 0.1377 0.1218 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Basal ACACAACGT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Basal ACACATGCG NaN 89.7557 82.1357 92.8342 375.5954 0.2390 0.2187 0.2472 NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [20]:
df_syn.shape
Out[20]:
(5104, 19)
In [21]:
df_syn.BC_seq.unique().shape  # barcode
Out[21]:
(5104,)
In [22]:
df_syn.Sequence.unique().shape
Out[22]:
(625,)
In [23]:
df_syn.CRE_id.unique().shape
Out[23]:
(625,)

Tidy table

In [66]:
# get only the sequence -> value mapping
dfs_syn = df_syn[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()

# all have a label synthetic
assert np.all(dfs_syn.CRE_id.str.split("_", expand=True)[1] == 'synthetic')

dfs_syn['CRE_elements'] = dfs_syn.CRE_id.str.replace("_synthetic", "")

# add expression_fc
dfs_syn['expression_fc'] = dfs_syn[[f'Rep{i}_CRE_normalized' for i in range(1,3)]].mean(axis=1)
dfs_syn['expression_fc_SEM'] = dfs_syn[[f'Rep{i}_CRE_norm_SEM' for i in range(2,3)]].mean(axis=1)
dfs_syn['expression_fc_log2'] = np.log2(dfs_syn['expression_fc'])

# all sequence designs are unique
assert dfs_syn['CRE_elements'].unique().shape == dfs_syn['CRE_elements'].shape

factor_hash = {"K": "Klf4",
               "S": "Sox2",
               "E": "Essrb",
               "O": "Oct4"}
factors = list(factor_hash.values())

def expand_factors(df):
    def extract_cols(s, k,v):
        return pd.DataFrame({v: s.str.contains(k), 
                             v + "_orientation": np.where(s.str.contains(k+'f'), 'fwd', 'rev')})
    return pd.concat([df] + [extract_cols(df.CRE_elements, k,v) for k,v in factor_hash.items()], 1)

dfs_syn = expand_factors(dfs_syn)
In [67]:
dfs_syn.head()
Out[67]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM CRE_elements expression_fc expression_fc_SEM expression_fc_log2 Klf4 Klf4_orientation Sox2 Sox2_orientation Essrb Essrb_orientation Oct4 Oct4_orientation
112 Ef-Kf-Of-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 2.9462 3.5538 3.6890 0.2847 0.2054 Ef-Kf-Of-Sf 3.2500 0.2847 1.7004 True fwd True fwd True fwd True fwd
120 Ef-Kf-Of-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 3.7496 5.1441 4.6180 0.6682 0.6283 Ef-Kf-Of-Sr 4.4469 0.6682 2.1528 True fwd True rev True fwd True fwd
128 Ef-Kf-Of_synthetic AGCTACGTTCAAGGTCACGTA... 2.0628 2.7998 2.2773 0.2142 0.3677 Ef-Kf-Of 2.4313 0.2142 1.2817 True fwd False rev True fwd True fwd
136 Ef-Kf-Or-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 4.5477 5.1814 4.9207 0.7599 0.7516 Ef-Kf-Or-Sf 4.8646 0.7599 2.2823 True fwd True fwd True fwd True rev
144 Ef-Kf-Or-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 5.4429 7.1354 6.6851 0.9533 0.9347 Ef-Kf-Or-Sr 6.2892 0.9533 2.6529 True fwd True rev True fwd True rev
In [75]:
dfs_syn.expression_fc.plot.hist(100);
plt.xlabel("expression_fc")
plt.figure()
dfs_syn.expression_fc_log2.plot.hist(100);
plt.xlabel("expression_fc_log2");
In [26]:
# number of sequences containing that factor
dfs_syn[factors].sum()
Out[26]:
Klf4     550
Sox2     551
Essrb    551
Oct4     551
dtype: int64
In [29]:
# number of used elements
dfs_syn[factors].sum(axis=1).value_counts()
Out[29]:
4    384
3    191
2     47
dtype: int64
In [27]:
# sequence lengths
dfs_syn.Sequence.str.len().value_counts()
Out[27]:
80    384
60    191
40     47
Name: Sequence, dtype: int64

Wrap everything into a simple function

In [82]:
import basepair
Using TensorFlow backend.
In [98]:
# function available at
from basepair.exp.chipnexus.data import load_BarakCohnen_mESC_MPRA
In [94]:
df_gen, df_syn = load_BarakCohnen_mESC_MPRA()
In [95]:
df_gen.head()
Out[95]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM interval CRE_type chrom start end expression_fc expression_fc_SEM expression_fc_log2
112 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 1.2789 1.3300 1.1202 0.1134 0.0980 chr10:108456620-10845... All_Mutated chr10 108456620 108456701 1.3045 0.1134 0.3835
120 chr10:108456620-10845... AGTCATGGAGAACAGGGGTTA... 2.5366 2.0188 2.4672 1.0134 1.3304 chr10:108456620-10845... Genomic chr10 108456620 108456701 2.2777 1.0134 1.1876
128 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.8912 0.7321 0.8639 0.0673 0.1322 chr10:110940873-11094... All_Mutated chr10 110940873 110940954 0.8116 0.0673 -0.3011
136 chr10:110940873-11094... CCAGCCCTCAACAGAGCAGCT... 0.9943 1.0337 1.1185 0.1044 0.0994 chr10:110940873-11094... Genomic chr10 110940873 110940954 1.0140 0.1044 0.0201
144 chr10:115312950-11531... cagctatagtcccagtcTGTT... 0.9246 1.0169 1.0652 0.1435 0.0881 chr10:115312950-11531... All_Mutated chr10 115312950 115313031 0.9707 0.1435 -0.0429
In [96]:
df_syn.head()
Out[96]:
CRE_id Sequence Rep1_CRE_normalized Rep2_CRE_normalized Rep3_CRE_normalized Rep2_CRE_norm_SEM Rep3_CRE_norm_SEM CRE_elements expression_fc expression_fc_SEM expression_fc_log2 Klf4 Klf4_orientation Sox2 Sox2_orientation Essrb Essrb_orientation Oct4 Oct4_orientation
112 Ef-Kf-Of-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 2.9462 3.5538 3.6890 0.2847 0.2054 Ef-Kf-Of-Sf 3.2500 0.2847 1.7004 True fwd True fwd True fwd True fwd
120 Ef-Kf-Of-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 3.7496 5.1441 4.6180 0.6682 0.6283 Ef-Kf-Of-Sr 4.4469 0.6682 2.1528 True fwd True rev True fwd True fwd
128 Ef-Kf-Of_synthetic AGCTACGTTCAAGGTCACGTA... 2.0628 2.7998 2.2773 0.2142 0.3677 Ef-Kf-Of 2.4313 0.2142 1.2817 True fwd False rev True fwd True fwd
136 Ef-Kf-Or-Sf_synthetic AGCTACGTTCAAGGTCACGTA... 4.5477 5.1814 4.9207 0.7599 0.7516 Ef-Kf-Or-Sf 4.8646 0.7599 2.2823 True fwd True fwd True fwd True rev
144 Ef-Kf-Or-Sr_synthetic AGCTACGTTCAAGGTCACGTA... 5.4429 7.1354 6.6851 0.9533 0.9347 Ef-Kf-Or-Sr 6.2892 0.9533 2.6529 True fwd True rev True fwd True rev