Paper: https://www.biorxiv.org/content/biorxiv/early/2018/08/22/398107.full.pdf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fs import open_fs
fs = open_fs("zip:///s/project/avsec/basepair/BarakCohen-mESC-MPRA/BarakCohen-mESC-MPRA.zip") # TODO - put the zip file elsewhere
fs.listdir("/") # available files
400 genomic sequences with combinations of Oct/Sox/Klf/Essrb sites, along with a matched set of 400 sequences in which the binding sites have been mutated. Each sequence has three of these four sites. In this set, about 28% of the sequences were statistically different from basal levels, and for all of those, their activity depended on intact binding sites. It is not a huge data set, but it is probably the one that is most useful for you. We ran the gkSVM package on this data (top 25% vs bottom 75%) and got AUROC and AUPRC in the neighborhood of 0.75.
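As a sketch of the top-25%-vs-bottom-75% framing described above (synthetic stand-in data and a rank-based AUROC computation, not the gkSVM itself):

```python
import numpy as np

rng = np.random.default_rng(0)
activity = rng.lognormal(size=400)                        # stand-in for measured activities
labels = (activity >= np.quantile(activity, 0.75)).astype(int)   # top 25% = positive class
scores = activity + rng.normal(scale=activity.std(), size=400)   # a noisy predictor

# AUROC via the Mann-Whitney U statistic (valid here since scores are continuous, no ties)
order = scores.argsort()
ranks = np.empty(len(scores))
ranks[order] = np.arange(1, len(scores) + 1)
n_pos = labels.sum()
n_neg = len(labels) - n_pos
auroc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

With 400 sequences this yields exactly 100 positives; a real evaluation would use the actual model scores in place of the noisy stand-in.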
GEN_RPMsExpression_plusSeqs.txt
data table for the GEN library of (mm10) genomic sequences, selected to have three TFBSs with no more than one each of OCT4, SOX2, KLF4, & ESRRB, plus matched sequences with all three binding sites mutated, as Barak previously described. These sequences likely have TFBSs for NANOG, but this was not part of our selection process, and therefore these sites were not explicitly mutated in the matched sequences.
Genomic sequences were represented in the pool by 150 bp oligos with the following sequences:
GACTTACATTAGGGCCCGT[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCGGACTACGATACTG
Where [SEQ] is either a reference (gWT) or mutated (gMUT) genomic sequence of 81-82 bps. Reference gWT sequences were selected by choosing regions of the genome within 100 bps of previously identified ChIPseq peaks for these four pluripotency factors (Chen et al. 2008).
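A minimal sketch of recovering [SEQ] from a GEN oligo, assuming the constant flanks above occur only once in the oligo (`extract_seq` is a hypothetical helper, shown on a toy oligo):

```python
UP = "GACTTACATTAGGGCCCGT"   # upstream constant region
HIND = "AAGCTT"              # HindIII site separating [SEQ] from [FILL]

def extract_seq(oligo):
    # take the bases between the upstream flank and the first AAGCTT after it
    start = oligo.index(UP) + len(UP)
    end = oligo.index(HIND, start)
    return oligo[start:end]

# toy oligo: flank + 80 bp dummy insert + HindIII site + filler
oligo = UP + "ACGT" * 20 + HIND + "T" * 30
```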
The columns are as follows:
CRE_id = name of element tested; for SYN library, encoded as ordered forward(f) or reverse(r) TFBS abbreviations; for GEN library, mm10 coordinates + designation as "Genomic", for reference sequences, or "Mutated", where each TFBSs in the sequence has 2 substitutions made at high information positions.BC_seq = barcode sequence matched for determining counts from RNA and DNA sequencing.Sequence = sequence synthesized and tested upstream of a pou5f1 minimal promoter.`RNA_counts_rep{1-3} = reads per million (RPM) for each barcode per transfection replicate.DNA_counts = RPM for each barcode in sequenced DNA plasmid pool.BC_exp_rep{1-3} = (RPM RNA/ RPM DNA) per barcode.Rep{1-3}BCnorm = (BC_exp_rep{1-3} / average of Basal expression) per replicate; activity relative to basal.Rep{1-3}CREnormalized = average of normalized BC expression for each 'CRE_id' per replicate.Rep{1-3}CREnorm_SEM = standard error of mean for BC expression for each 'CRE_id' per replicate.df_gen = pd.read_csv(fs.open("/GEN_RPMsExpression_plusSeqs.txt"), sep='\t')
df_gen.head()
reps = [1, 2, 3]
dfs_gen = df_gen[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()
def expand_CRE_id(dfs_gen):
    # split "chrom:start-end_CRE_type" ids into interval and CRE_type
    dfs_gen = pd.concat([dfs_gen,
                         dfs_gen.CRE_id.str.split("_", n=1, expand=True)
                                .rename(columns={0: 'interval', 1: 'CRE_type'})], axis=1)
    # split the interval into chrom / start / end
    dfs_gen = pd.concat([dfs_gen,
                         dfs_gen['interval'].str.split(":|-", expand=True)
                                .rename(columns={0: "chrom", 1: "start", 2: "end"})], axis=1)
    dfs_gen['start'] = dfs_gen['start'].astype(int)
    dfs_gen['end'] = dfs_gen['end'].astype(int)
    # mean activity across all three replicates (range(1, 4) covers reps 1-3)
    dfs_gen['expression_fc'] = dfs_gen[[f'Rep{i}_CRE_normalized' for i in range(1, 4)]].mean(axis=1)
    # only the Rep2/Rep3 SEM columns were kept above, so average those
    dfs_gen['expression_fc_SEM'] = dfs_gen[[f'Rep{i}_CRE_norm_SEM' for i in range(2, 4)]].mean(axis=1)
    dfs_gen['expression_fc_log2'] = np.log2(dfs_gen['expression_fc'])
    return dfs_gen
dfs_gen = expand_CRE_id(dfs_gen)
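The id-parsing step above can be checked on toy CRE ids in the GEN naming scheme (hypothetical coordinates):

```python
import pandas as pd

# toy CRE ids: "chrom:start-end_<CRE_type>"
ids = pd.Series(["chr1:100-182_Genomic", "chr1:100-182_Mutated"])
parts = ids.str.split("_", n=1, expand=True).rename(columns={0: "interval", 1: "CRE_type"})
coords = parts["interval"].str.split(":|-", expand=True)
coords.columns = ["chrom", "start", "end"]
coords[["start", "end"]] = coords[["start", "end"]].astype(int)
width = coords["end"] - coords["start"]
```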
df_gen.shape
df_gen.BC_seq.unique().shape # barcode
df_gen.Sequence.unique().shape
df_gen.CRE_id.unique().shape
dfs_gen
dfs_gen.expression_fc_log2.plot.hist(bins=100);
dfs_gen.plot.scatter("Rep1_CRE_normalized", "Rep2_CRE_normalized");
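Replicate agreement of the kind the scatter shows can be summarized as a log-scale Pearson correlation; a sketch on synthetic stand-in data (not the MPRA values themselves):

```python
import numpy as np

rng = np.random.default_rng(1)
true_activity = rng.lognormal(size=200)                    # shared underlying activity
rep1 = true_activity * rng.lognormal(sigma=0.2, size=200)  # replicate 1 with noise
rep2 = true_activity * rng.lognormal(sigma=0.2, size=200)  # replicate 2 with noise
r = np.corrcoef(np.log2(rep1), np.log2(rep2))[0, 1]        # log-scale Pearson r
```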
dfs_gen.Sequence.str.len().value_counts() # sequence lengths
# Interval widths match
assert np.all(dfs_gen.Sequence.str.len() == dfs_gen['end'] - dfs_gen['start'])
dfs_gen.CRE_type.value_counts()
dfs_gen['interval'].value_counts().value_counts() # only 5 sequences don't have the matched pair
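The Genomic/Mutated pairing can be made explicit by pivoting on the interval, dropping the few intervals that lack a partner (toy values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "interval": ["chr1:1-82", "chr1:1-82", "chr2:5-87"],
    "CRE_type": ["Genomic", "Mutated", "Genomic"],  # the chr2 interval lacks its Mutated partner
    "expression_fc": [3.0, 1.0, 2.0],
})
wide = toy.pivot(index="interval", columns="CRE_type", values="expression_fc")
paired = wide.dropna().copy()                # keep only intervals with both pair members
paired["log2_WT_vs_MUT"] = np.log2(paired["Genomic"] / paired["Mutated"])
```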
~700 synthetic sequences which are just concatemers of the binding sites in different orientations and orders. This set contains all possible 2-mers, 3-mers, and 4-mers of sites. We performed machine learning on this set in a number of different ways, some of which yielded very strong models. You can see the paper for details.
Syn_RPMsExpression_plusSeqs.txt
data table for the SYN library of concatenated combinations of 'reference' TFBSs for OCT4 (O), SOX2 (S), KLF4 (K), & ESRRB (E), as Barak previously described.
CTTCTACTACTAGGGCCCA[SEQ]AAGCTT[FILL]GAATTCTCTAGAC[BC]TGAGCTCTACATGCTAGTTCATG
where [SEQ] is a 40-80 bp synthetic element comprised of concatenated 20 bp building blocks
of pluripotency sites, as described previously, with the fifth position of the KLF4 site changed to
‘T’ to facilitate cloning (Fiore and Cohen 2016). [FILL] is a random filler sequence of variable
length to bring the total length of each sequence to 150 bp, and [BC] is a random 9 bp barcode.
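The 150 bp arithmetic can be checked directly from the constant regions above; the [FILL] length is whatever remains once the flanks, barcode, and synthetic element are accounted for:

```python
UP = "CTTCTACTACTAGGGCCCA"            # upstream constant region (19 bp)
HIND = "AAGCTT"                       # HindIII site (6 bp)
MID = "GAATTCTCTAGAC"                 # constant region before the barcode (13 bp)
DOWN = "TGAGCTCTACATGCTAGTTCATG"      # downstream constant region (23 bp)
BC_LEN, TOTAL = 9, 150                # barcode length and total oligo length

const = len(UP) + len(HIND) + len(MID) + BC_LEN + len(DOWN)
# filler length required for each possible synthetic element length (40-80 bp)
fill_len = {seq_len: TOTAL - const - seq_len for seq_len in (40, 60, 80)}
```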
df_syn = pd.read_csv(fs.open("/Syn_RPMsExpression_plusSeqs.txt"), sep='\t')
df_syn.head()
df_syn.shape
df_syn.BC_seq.unique().shape # barcode
df_syn.Sequence.unique().shape
df_syn.CRE_id.unique().shape
# get only the sequence -> value mapping
dfs_syn = df_syn[['CRE_id', 'Sequence', 'Rep1_CRE_normalized', 'Rep2_CRE_normalized','Rep3_CRE_normalized', 'Rep2_CRE_norm_SEM', 'Rep3_CRE_norm_SEM']].drop_duplicates().dropna()
# all have a label synthetic
assert np.all(dfs_syn.CRE_id.str.split("_", expand=True)[1] == 'synthetic')
dfs_syn['CRE_elements'] = dfs_syn.CRE_id.str.replace("_synthetic", "")
# add expression_fc
dfs_syn['expression_fc'] = dfs_syn[[f'Rep{i}_CRE_normalized' for i in range(1, 4)]].mean(axis=1)
dfs_syn['expression_fc_SEM'] = dfs_syn[[f'Rep{i}_CRE_norm_SEM' for i in range(2, 4)]].mean(axis=1)
dfs_syn['expression_fc_log2'] = np.log2(dfs_syn['expression_fc'])
# all sequence designs are unique
assert dfs_syn['CRE_elements'].unique().shape == dfs_syn['CRE_elements'].shape
factor_hash = {"K": "Klf4",
               "S": "Sox2",
               "E": "Essrb",
               "O": "Oct4"}
factors = list(factor_hash.values())
def expand_factors(df):
    # add a presence flag and an orientation column for each factor
    def extract_cols(s, k, v):
        return pd.DataFrame({v: s.str.contains(k),
                             v + "_orientation": np.where(s.str.contains(k + 'f'), 'fwd', 'rev')})
    return pd.concat([df] + [extract_cols(df.CRE_elements, k, v)
                             for k, v in factor_hash.items()], axis=1)
dfs_syn = expand_factors(dfs_syn)
dfs_syn.head()
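The factor-presence logic can be checked on toy element strings (hypothetical designs; note that, as in the function above, the orientation column defaults to 'rev' even when a factor is absent):

```python
import numpy as np
import pandas as pd

elements = pd.Series(["Of_Sr_Kf", "Er_Kf"])  # toy designs: factor letter + f/r orientation
codes = {"K": "Klf4", "S": "Sox2", "E": "Essrb", "O": "Oct4"}
flags = pd.DataFrame({name: elements.str.contains(code)
                      for code, name in codes.items()})
orient = pd.DataFrame({name + "_orientation": np.where(elements.str.contains(code + "f"),
                                                       "fwd", "rev")
                       for code, name in codes.items()})
```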
dfs_syn.expression_fc.plot.hist(bins=100);
plt.xlabel("expression_fc")
plt.figure()
dfs_syn.expression_fc_log2.plot.hist(bins=100);
plt.xlabel("expression_fc_log2");
# number of sequences containing that factor
dfs_syn[factors].sum()
# number of used elements
dfs_syn[factors].sum(axis=1).value_counts()
# sequence lengths
dfs_syn.Sequence.str.len().value_counts()
import basepair
# convenience loader function available in the basepair package:
from basepair.exp.chipnexus.data import load_BarakCohnen_mESC_MPRA
df_gen, df_syn = load_BarakCohnen_mESC_MPRA()
df_gen.head()
df_syn.head()