Paper: https://www.biorxiv.org/content/biorxiv/early/2018/08/22/398107.full.pdf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fs import open_fs
fs = open_fs("zip:///s/project/avsec/basepair/BarakCohen-mESC-MPRA/BarakCohen-mESC-MPRA.zip") # TODO - put the zip file elsewhere
fs.listdir("/") # available files
400 genomic sequences with combinations of Oct/Sox/Klf/Essrb sites, along with a matched set of 400 sequences in which the binding sites have been mutated. Each sequence has three of these four sites. In this set, about 28% of the sequences were statistically different from basal levels, and for all of those, their activity depended on intact binding sites. It is not a huge data set, but it is probably the one that is most useful for you. We ran the gkSVM package on this data (top 25% vs bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
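As a sketch of the top-25%-vs-bottom-75% framing described above (synthetic stand-in data and a rank-based AUROC computation, not the gkSVM itself):

```python
import numpy as np

rng = np.random.default_rng(0)
activity = rng.lognormal(size=400)                        # stand-in for measured activities
labels = (activity >= np.quantile(activity, 0.75)).astype(int)   # top 25% = positive class
scores = activity + rng.normal(scale=activity.std(), size=400)   # a noisy predictor

# AUROC via the Mann-Whitney U statistic (valid here since scores are continuous, no ties)
order = scores.argsort()
ranks = np.empty(len(scores))
ranks[order] = np.arange(1, len(scores) + 1)
n_pos = labels.sum()
n_neg = len(labels) - n_pos
auroc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

With 400 sequences this yields exactly 100 positives; a real evaluation would use the actual model scores in place of the noisy stand-in.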
GEN_RPMsExpression_plusSeqs.txt
data table for the GEN library of (mm10) genomic sequences, selected to have three TFBSs with no more than one each of OCT4, SOX2, KLF4, & ESRRB, plus matched sequences with all three binding sites mutated, as Barak previously described. These sequences likely have TFBSs for NANOG, but this was not part of our selection process, and therefore these sites were not explicitly mutated in the matched sequences.
Genomic sequences were represented in the pool by 150 bp oligos with the following sequences:
GACTTACATTAGGGCCCGT[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCGGACTACGATACTG
Where [SEQ] is either a reference (gWT) or mutated (gMUT) genomic sequence of 81-82 bps. Reference gWT sequences were selected by choosing regions of the genome within 100 bps of previously identified ChIPseq peaks for these four pluripotency factors (Chen et al. 2008).
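A minimal sketch of recovering [SEQ] from a GEN oligo, assuming the constant flanks above occur only once in the oligo (`extract_seq` is a hypothetical helper, shown on a toy oligo):

```python
UP = "GACTTACATTAGGGCCCGT"   # upstream constant region
HIND = "AAGCTT"              # HindIII site separating [SEQ] from [FILL]

def extract_seq(oligo):
    # take the bases between the upstream flank and the first AAGCTT after it
    start = oligo.index(UP) + len(UP)
    end = oligo.index(HIND, start)
    return oligo[start:end]

# toy oligo: flank + 80 bp dummy insert + HindIII site + filler
oligo = UP + "ACGT" * 20 + HIND + "T" * 30
```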
The columns are as follows:
CRE_id = name of element tested; for SYN library, encoded as ordered forward(f) or reverse(r) TFBS abbreviations; for GEN library, mm10 coordinates + designation as "Genomic", for reference sequences, or "Mutated", where each TFBSs in the sequence has 2 substitutions made at high information positions.BC_seq = barcode sequence matched for determining counts from RNA and DNA sequencing.Sequence = sequence synthesized and tested upstream of a pou5f1 minimal promoter.`RNA_counts_rep{1-3} = reads per million (RPM) for each barcode per transfection replicate.DNA_counts = RPM for each barcode in sequenced DNA plasmid pool.BC_exp_rep{1-3} = (RPM RNA/ RPM DNA) per barcode.Rep{1-3}BCnorm = (BC_exp_rep{1-3} / average of Basal expression) per replicate; activity relative to basal.Rep{1-3}CREnormalized = average of normalized BC expression for each 'CRE_id' per replicate.Rep{1-3}CREnorm_SEM = standard error of mean for BC expression for each 'CRE_id' per replicate.df_gen = pd.read_csv(fs.open("/GEN_RPMsExpression_plusSeqs.txt"), sep='\t')
df_gen.head()
reps = [1, 2, 3]
dfs_gen = df_gen[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()
def expand_CRE_id(dfs_gen):
    # split "chrom:start-end_CRE_type" ids into interval and CRE_type
    dfs_gen = pd.concat([dfs_gen,
                         dfs_gen.CRE_id.str.split("_", n=1, expand=True)
                                .rename(columns={0: 'interval', 1: 'CRE_type'})], axis=1)
    # split the interval into chrom / start / end
    dfs_gen = pd.concat([dfs_gen,
                         dfs_gen['interval'].str.split(":|-", expand=True)
                                .rename(columns={0: "chrom", 1: "start", 2: "end"})], axis=1)
    dfs_gen['start'] = dfs_gen['start'].astype(int)
    dfs_gen['end'] = dfs_gen['end'].astype(int)
    # mean activity across all three replicates (range(1, 4) covers reps 1-3)
    dfs_gen['expression_fc'] = dfs_gen[[f'Rep{i}_CRE_normalized' for i in range(1, 4)]].mean(axis=1)
    # only the Rep2/Rep3 SEM columns were kept above, so average those
    dfs_gen['expression_fc_SEM'] = dfs_gen[[f'Rep{i}_CRE_norm_SEM' for i in range(2, 4)]].mean(axis=1)
    dfs_gen['expression_fc_log2'] = np.log2(dfs_gen['expression_fc'])
    return dfs_gen
dfs_gen = expand_CRE_id(dfs_gen)
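The id-parsing step above can be checked on toy CRE ids in the GEN naming scheme (hypothetical coordinates):

```python
import pandas as pd

# toy CRE ids: "chrom:start-end_<CRE_type>"
ids = pd.Series(["chr1:100-182_Genomic", "chr1:100-182_Mutated"])
parts = ids.str.split("_", n=1, expand=True).rename(columns={0: "interval", 1: "CRE_type"})
coords = parts["interval"].str.split(":|-", expand=True)
coords.columns = ["chrom", "start", "end"]
coords[["start", "end"]] = coords[["start", "end"]].astype(int)
width = coords["end"] - coords["start"]
```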
df_gen.shape
df_gen.BC_seq.unique().shape # barcode
df_gen.Sequence.unique().shape
df_gen.CRE_id.unique().shape
dfs_gen
dfs_gen.expression_fc_log2.plot.hist(bins=100);
dfs_gen.plot.scatter("Rep1_CRE_normalized", "Rep2_CRE_normalized");
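Replicate agreement of the kind the scatter shows can be summarized as a log-scale Pearson correlation; a sketch on synthetic stand-in data (not the MPRA values themselves):

```python
import numpy as np

rng = np.random.default_rng(1)
true_activity = rng.lognormal(size=200)                    # shared underlying activity
rep1 = true_activity * rng.lognormal(sigma=0.2, size=200)  # replicate 1 with noise
rep2 = true_activity * rng.lognormal(sigma=0.2, size=200)  # replicate 2 with noise
r = np.corrcoef(np.log2(rep1), np.log2(rep2))[0, 1]        # log-scale Pearson r
```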
dfs_gen.Sequence.str.len().value_counts() # sequence lengths
# Interval widths match
assert np.all(dfs_gen.Sequence.str.len() == dfs_gen['end'] - dfs_gen['start'])
dfs_gen.CRE_type.value_counts()
dfs_gen['interval'].value_counts().value_counts() # only 5 sequences don't have the matched pair
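The Genomic/Mutated pairing can be made explicit by pivoting on the interval, dropping the few intervals that lack a partner (toy values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "interval": ["chr1:1-82", "chr1:1-82", "chr2:5-87"],
    "CRE_type": ["Genomic", "Mutated", "Genomic"],  # the chr2 interval lacks its Mutated partner
    "expression_fc": [3.0, 1.0, 2.0],
})
wide = toy.pivot(index="interval", columns="CRE_type", values="expression_fc")
paired = wide.dropna().copy()                # keep only intervals with both pair members
paired["log2_WT_vs_MUT"] = np.log2(paired["Genomic"] / paired["Mutated"])
```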
~700 synthetic sequences which are just concatemers of the binding sites in different orientations and orders. This set contains all possible 2-mers, 3-mers, and 4-mers of sites. We performed machine learning on this set in a number of different ways, some of which yielded very strong models. You can see the paper for details.
Syn_RPMsExpression_plusSeqs.txt
data table for the SYN library of concatenated combinations of 'reference' TFBSs for OCT4 (O), SOX2 (S), KLF4 (K), & ESRRB (E), as Barak previously described.
CTTCTACTACTAGGGCCCA[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCTACATGCTAGTTCATG
where [SEQ] is a 40-80 bp synthetic element comprised of concatenated 20 bp building blocks
of pluripotency sites, as described previously, with the fifth position of the KLF4 site changed to
‘T’ to facilitate cloning (Fiore and Cohen 2016). [FILL] is a random filler sequence of variable
length to bring the total length of each sequence to 150 bp, and [BC] is a random 9 bp barcode.
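The 150 bp arithmetic can be checked directly from the constant regions above; the [FILL] length is whatever remains once the flanks, barcode, and synthetic element are accounted for:

```python
UP = "CTTCTACTACTAGGGCCCA"            # upstream constant region (19 bp)
HIND = "AAGCTT"                       # HindIII site (6 bp)
MID = "GAATTCTCTAGAC"                 # constant region before the barcode (13 bp)
DOWN = "TGAGCTCTACATGCTAGTTCATG"      # downstream constant region (23 bp)
BC_LEN, TOTAL = 9, 150                # barcode length and total oligo length

const = len(UP) + len(HIND) + len(MID) + BC_LEN + len(DOWN)
# filler length required for each possible synthetic element length (40-80 bp)
fill_len = {seq_len: TOTAL - const - seq_len for seq_len in (40, 60, 80)}
```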
df_syn = pd.read_csv(fs.open("/Syn_RPMsExpression_plusSeqs.txt"), sep='\t')
df_syn.head()
df_syn.shape
df_syn.BC_seq.unique().shape # barcode
df_syn.Sequence.unique().shape
df_syn.CRE_id.unique().shape
# get only the sequence -> value mapping
dfs_syn = df_syn[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()
# all have a label synthetic
assert np.all(dfs_syn.CRE_id.str.split("_", expand=True)[1] == 'synthetic')
dfs_syn['CRE_elements'] = dfs_syn.CRE_id.str.replace("_synthetic", "")
# add expression_fc
dfs_syn['expression_fc'] = dfs_syn[[f'Rep{i}_CRE_normalized' for i in range(1, 4)]].mean(axis=1)
dfs_syn['expression_fc_SEM'] = dfs_syn[[f'Rep{i}_CRE_norm_SEM' for i in range(2, 4)]].mean(axis=1)
dfs_syn['expression_fc_log2'] = np.log2(dfs_syn['expression_fc'])
# all sequence designs are unique
assert dfs_syn['CRE_elements'].unique().shape == dfs_syn['CRE_elements'].shape
factor_hash = {"K": "Klf4",
               "S": "Sox2",
               "E": "Essrb",
               "O": "Oct4"}
factors = list(factor_hash.values())
def expand_factors(df):
    # add a presence flag and an orientation column for each factor
    def extract_cols(s, k, v):
        return pd.DataFrame({v: s.str.contains(k),
                             v + "_orientation": np.where(s.str.contains(k + 'f'), 'fwd', 'rev')})
    return pd.concat([df] + [extract_cols(df.CRE_elements, k, v)
                             for k, v in factor_hash.items()], axis=1)
dfs_syn = expand_factors(dfs_syn)
dfs_syn.head()
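The factor-presence logic can be checked on toy element strings (hypothetical designs; note that, as in the function above, the orientation column defaults to 'rev' even when a factor is absent):

```python
import numpy as np
import pandas as pd

elements = pd.Series(["Of_Sr_Kf", "Er_Kf"])  # toy designs: factor letter + f/r orientation
codes = {"K": "Klf4", "S": "Sox2", "E": "Essrb", "O": "Oct4"}
flags = pd.DataFrame({name: elements.str.contains(code)
                      for code, name in codes.items()})
orient = pd.DataFrame({name + "_orientation": np.where(elements.str.contains(code + "f"),
                                                       "fwd", "rev")
                       for code, name in codes.items()})
```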
dfs_syn.expression_fc.plot.hist(bins=100);
plt.xlabel("expression_fc")
plt.figure()
dfs_syn.expression_fc_log2.plot.hist(bins=100);
plt.xlabel("expression_fc_log2");
# number of sequences containing that factor
dfs_syn[factors].sum()
# number of used elements
dfs_syn[factors].sum(axis=1).value_counts()
# sequence lengths
dfs_syn.Sequence.str.len().value_counts()
import basepair
# convenience loader function available in the basepair package:
from basepair.exp.chipnexus.data import load_BarakCohnen_mESC_MPRA
df_gen, df_syn = load_BarakCohnen_mESC_MPRA()
df_gen.head()
df_syn.head()