Documentation
Modules
1. generators
This module contains the classes for all the sequence data generators.
Classes
MSequenceGenerator - The main base class for all generators.
Multi-task batch data generation for training deep neural networks on high-throughput sequencing data from various genomics assays.
MBPNetSequenceGenerator - Derives from MSequenceGenerator.
Multi-task batch data generation for training BPNet on high-throughput sequencing data from various genomics assays.
1.1 MSequenceGenerator
class mseqgen.generators.MSequenceGenerator(input_config, batch_gen_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=1, batch_size=64, samples=None)
Multi-task batch data generation for training deep neural networks on high-throughput sequencing data from various genomics assays.
- Parameters
input_config (dict) –
python dictionary with information about the input data. Contains the following keys -
- data (str)
path to the json file containing task information. See README for more information on the format of the json file
- stranded (boolean)
True if data is stranded
- has_control (boolean)
True if control data has been included
batch_gen_params (dictionary) –
python dictionary with batch generation parameters. Contains the following keys -
- input_seq_len (int)
length of input DNA sequence
- output_len (int)
length of output profile
- max_jitter (int)
maximum value for randomized jitter to offset the peaks from the exact center of the input
- rev_comp_aug (boolean)
enable reverse complement augmentation
- negative_sampling_rate (float)
the fraction of batch_size that determines how many negative samples are added to each batch
- sampling_mode (str)
the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class
- shuffle (boolean)
specify whether input data is shuffled at the beginning of each epoch
- mode (str)
’train’, ‘val’ or ‘test’
- num_positions (int)
specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’; it has no effect if present.
- step_size (int)
specify the step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’; it has no effect if present.
reference_genome (str) – the path to the reference genome fasta file
chrom_sizes (str) – path to the chromosome sizes file
chroms (list) – the list of chromosomes that will be sampled for batch generation
num_threads (int) – number of parallel threads for batch generation, default = 10
epochs (int) – number of iterations for looping over input data, default = 1
batch_size (int) – size of each generated batch of data, default = 64
samples (pandas.DataFrame) – two column pandas dataframe with chromosome position information. Required column names are ‘chrom’ (column 1) and ‘pos’ (column 2). Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
Members
gen() – Generator function to yield batches of data
len() – The number of batches per epoch
- Returns
number of batches of data generated in each epoch
- Return type
int
set_early_stopping() – Set early stopping flag to True
set_ready_for_next_epoch() – Set the variable that controls batch generation for the next epoch to True
set_stop() – Set stop flag to True
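As a rough sketch of how the constructor arguments documented above fit together, the following builds the two dictionaries and a `samples` dataframe for ‘manual’ sampling mode. All file paths and numeric values are placeholders, and the helper `check_batch_gen_params` is hypothetical (it is not part of mseqgen). The constructor call itself is left as a comment because it needs real data files.

```python
import pandas as pd

# Placeholder paths and values; substitute your own data.
input_config = {
    "data": "tasks.json",   # json file with task information (see README)
    "stranded": True,
    "has_control": False,
}

batch_gen_params = {
    "input_seq_len": 2114,          # placeholder value
    "output_len": 1000,             # placeholder value
    "max_jitter": 128,              # placeholder value
    "rev_comp_aug": True,
    "negative_sampling_rate": 0.0,
    "sampling_mode": "manual",
    "shuffle": True,
    "mode": "train",
}

# 'manual' sampling mode expects a dataframe with 'chrom' and 'pos' columns.
samples = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "pos": [10500, 20750, 5000],
})

VALID_SAMPLING_MODES = {"peaks", "sequential", "random", "manual"}


def check_batch_gen_params(params, samples=None):
    """Hypothetical sanity check on the documented keys; not part of mseqgen."""
    mode = params["sampling_mode"]
    if mode not in VALID_SAMPLING_MODES:
        raise ValueError(f"unknown sampling_mode: {mode!r}")
    if mode == "manual":
        if samples is None or list(samples.columns) != ["chrom", "pos"]:
            raise ValueError(
                "'manual' mode needs a samples dataframe with columns "
                "['chrom', 'pos']"
            )
    return True


check_batch_gen_params(batch_gen_params, samples)

# With real files in place, the generator would be constructed as documented:
# from mseqgen.generators import MSequenceGenerator
# generator = MSequenceGenerator(input_config, batch_gen_params,
#                                "genome.fa", "chrom.sizes", ["chr1", "chr2"],
#                                samples=samples)
```

Validating the dictionaries before constructing the generator is only a convenience here; the library may perform its own checks.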
1.2 MBPNetSequenceGenerator
class mseqgen.generators.MBPNetSequenceGenerator(input_config, batch_gen_params, bpnet_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=100, batch_size=64, samples=None)
Multi-task batch data generation for training BPNet on high-throughput sequencing data from various genomics assays.
- Parameters
input_config (dict) –
python dictionary with information about the input data. Contains the following keys -
- data (str)
path to the json file containing task information. See README for more information on the format of the json file
- stranded (boolean)
True if data is stranded
- has_control (boolean)
True if control data has been included
batch_gen_params (dictionary) –
python dictionary with batch generation parameters. Contains the following keys -
- input_seq_len (int)
length of input DNA sequence
- output_len (int)
length of output profile
- max_jitter (int)
maximum value for randomized jitter to offset the peaks from the exact center of the input
- rev_comp_aug (boolean)
enable reverse complement augmentation
- negative_sampling_rate (float)
the fraction of batch_size that determines how many negative samples are added to each batch
- sampling_mode (str)
the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class
- shuffle (boolean)
specify whether input data is shuffled at the beginning of each epoch
- mode (str)
’train’, ‘val’ or ‘test’
- num_positions (int)
specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’; it has no effect if present.
- step_size (int)
specify the step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’; it has no effect if present.
bpnet_params (dictionary) –
python dictionary containing parameters specific to BPNet. Contains the following keys -
- name (str)
model architecture name
- filters (int)
number of filters for BPNet
- control_smoothing (list)
nested list of Gaussian smoothing parameters for the supplemental control tracks. Each inner list has two values: [sigma, window_size]
reference_genome (str) – the path to the reference genome fasta file
chrom_sizes (str) – path to the chromosome sizes file
chroms (list) – the list of chromosomes that will be sampled for batch generation
num_threads (int) – number of parallel threads for batch generation
epochs (int) – number of iterations for looping over input data
batch_size (int) – size of each generated batch of data
samples (pandas.DataFrame) – two column pandas dataframe with chromosome position information. Required column names are ‘chrom’ (column 1) and ‘pos’ (column 2). Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
Members
gen() – Generator function to yield batches of data
len() – The number of batches per epoch
- Returns
number of batches of data generated in each epoch
- Return type
int
set_early_stopping() – Set early stopping flag to True
set_ready_for_next_epoch() – Set the variable that controls batch generation for the next epoch to True
set_stop() – Set stop flag to True
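To make the `control_smoothing` entry of `bpnet_params` more concrete, here is a minimal sketch of a `bpnet_params` dictionary and of what smoothing a control track with one [sigma, window_size] pair could look like. The dictionary values are placeholders, and `gaussian_smooth` is a hypothetical helper written for illustration; how mseqgen actually applies the smoothing is not stated in this documentation.

```python
import numpy as np

bpnet_params = {
    "name": "BPNet",                            # placeholder architecture name
    "filters": 64,                              # placeholder value
    "control_smoothing": [[1, 1], [7.0, 81]],   # [sigma, window_size] pairs (placeholders)
}


def gaussian_smooth(signal, sigma, window_size):
    """Smooth a 1-D signal with a normalized Gaussian window (illustrative only)."""
    # Offsets of each window element from the window center.
    offsets = np.arange(window_size) - (window_size - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so the total signal is preserved
    return np.convolve(signal, kernel, mode="same")


# A toy control track with a single spike in the middle.
control = np.zeros(200)
control[100] = 10.0

# One smoothed track per [sigma, window_size] pair.
smoothed = [
    gaussian_smooth(control, sigma, window_size)
    for sigma, window_size in bpnet_params["control_smoothing"]
]
```

In this sketch a window of size 1 leaves the track unchanged, while the wider window spreads the spike over neighbouring positions.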
2. sequtils
mseqgen.sequtils.one_hot_encode(sequences)
One hot encoding of a list of DNA sequences
- Parameters
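sequences (list) – python list of strings of DNA sequence (the return shape implies that all sequences have the same length)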
- Returns
3-dimensional numpy array with shape (len(sequences), len(list_item), 4), where len(list_item) is the length of each sequence in the list
- Return type
numpy.ndarray
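The return shape described above can be illustrated with a small stand-alone sketch. This is not the mseqgen implementation: the function name `one_hot_encode_sketch`, the A, C, G, T channel order, and the choice to encode unknown bases such as N as all zeros are assumptions made for the example.

```python
import numpy as np

# Assumed channel order; the library's actual order is not documented here.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode_sketch(sequences):
    """One hot encode equal-length DNA strings into (n, seq_len, 4)."""
    if not sequences:
        return np.zeros((0, 0, 4), dtype=np.float32)
    seq_len = len(sequences[0])
    encoded = np.zeros((len(sequences), seq_len, 4), dtype=np.float32)
    for i, seq in enumerate(sequences):
        if len(seq) != seq_len:
            raise ValueError("all sequences must have the same length")
        for j, base in enumerate(seq.upper()):
            k = BASE_TO_INDEX.get(base)
            if k is not None:  # unknown bases (e.g. N) stay all zeros
                encoded[i, j, k] = 1.0
    return encoded


encoded = one_hot_encode_sketch(["ACGT", "NNTA"])
```

Here `encoded.shape` is `(len(sequences), len(sequences[0]), 4)`, matching the documented return shape.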
mseqgen.sequtils.reverse_complement_of_sequences(sequences)
Reverse complement of DNA sequences
- Parameters
sequences (list) – python list of strings of DNA sequence of arbitrary length
- Returns
python list of strings
- Return type
list
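For readers unfamiliar with the operation, reverse complementing a DNA string means complementing each base (A with T, C with G) and reversing the result. A minimal sketch, using the hypothetical name `reverse_complement_sketch` and assuming that N is left as N, could look like this:

```python
# Translation table for complementing bases; N maps to itself (assumption).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(sequences):
    """Return the reverse complement of each DNA string in a list."""
    return [seq.translate(COMPLEMENT)[::-1] for seq in sequences]


result = reverse_complement_sketch(["AAGC", "ACGT", "NNTA"])
# result == ["GCTT", "ACGT", "TANN"]
```

Because each string is handled independently, the sequences may have different lengths.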
mseqgen.sequtils.reverse_complement_of_profiles(profiles, stranded=True)
Reverse complement of a genomics assay signal profile
- Parameters
profiles (numpy.ndarray) – 3-dimensional numpy array, a batch of multitask profiles of shape (#examples, seq_len, #assays) if unstranded and (#examples, seq_len, #assays*2) if stranded. In the stranded case, the positive and negative strands are assumed to occur in pairs on axis=2 (i.e. the 3rd dimension), e.g. 0th & 1st index, 2nd & 3rd, …
- Returns
3-dimensional numpy array
- Return type
numpy.ndarray
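One plausible reading of this operation, sketched below, is to reverse the profiles along the sequence axis and, in the stranded case, to swap the two tracks within each plus/minus pair on axis 2. The function name `reverse_complement_of_profiles_sketch` is hypothetical and the strand swap is an assumption based on the pairing described above, not a confirmed description of the library's behaviour.

```python
import numpy as np


def reverse_complement_of_profiles_sketch(profiles, stranded=True):
    """Reverse profiles along axis 1; swap strand pairs on axis 2 if stranded."""
    flipped = profiles[:, ::-1, :]
    if stranded:
        num_tracks = profiles.shape[2]
        if num_tracks % 2 != 0:
            raise ValueError("stranded profiles need an even number of tracks")
        # Index order [1, 0, 3, 2, ...] swaps each plus/minus pair.
        swapped = np.arange(num_tracks).reshape(-1, 2)[:, ::-1].ravel()
        flipped = flipped[:, :, swapped]
    return flipped


# One example, three positions, two stranded assays (four tracks).
profiles = np.arange(12).reshape(1, 3, 4)
rc = reverse_complement_of_profiles_sketch(profiles, stranded=True)
```

Under these assumptions, applying the function twice returns the original array, which is a convenient sanity check.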
mseqgen.sequtils.getChromPositions(chroms, chrom_sizes, flank, mode='sequential', num_positions=-1, step=50)
Chromosome positions spanning the entire chromosome at a) regular intervals or b) random locations
- Parameters
chroms (list) – The list of required chromosomes
chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns
flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size
mode (str) – mode of the returned positions: ‘sequential’ (from the beginning) or ‘random’
num_positions (int) – number of chromosome positions to return on each chromosome; use -1 to return positions across the entire chromosome for all given chromosomes in chroms. mode=’random’ cannot be used with num_positions=-1
step (int) – the interval between consecutive chromosome positions in ‘sequential’ mode
- Returns
two column dataframe of chromosome positions (chrom, pos)
- Return type
pandas.DataFrame
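The two sampling modes can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `chrom_positions_sketch` is a hypothetical name, valid positions are taken to be `flank <= pos < size - flank`, and the `seed` argument is added only to make the random mode reproducible.

```python
import numpy as np
import pandas as pd


def chrom_positions_sketch(chroms, chrom_sizes, flank, mode="sequential",
                           num_positions=-1, step=50, seed=0):
    """Return a (chrom, pos) dataframe of sequential or random positions."""
    if mode not in ("sequential", "random"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "random" and num_positions == -1:
        raise ValueError("mode='random' requires num_positions != -1")
    rng = np.random.default_rng(seed)
    sizes = chrom_sizes.set_index("chrom")["size"]
    frames = []
    for chrom in chroms:
        low, high = flank, int(sizes[chrom]) - flank  # low <= pos < high
        if high <= low:
            continue  # chromosome too short for the requested flank
        if mode == "sequential":
            pos = np.arange(low, high, step)
            if num_positions != -1:
                pos = pos[:num_positions]
        else:
            pos = rng.integers(low, high, size=num_positions)
        frames.append(pd.DataFrame({"chrom": chrom, "pos": pos}))
    if not frames:
        return pd.DataFrame({"chrom": [], "pos": []})
    return pd.concat(frames, ignore_index=True)


chrom_sizes = pd.DataFrame({"chrom": ["chr1", "chr2"], "size": [1000, 500]})
sequential = chrom_positions_sketch(["chr1", "chr2"], chrom_sizes, flank=100)
random_pos = chrom_positions_sketch(["chr1", "chr2"], chrom_sizes, flank=100,
                                    mode="random", num_positions=5)
```

The flank keeps every returned position far enough from both chromosome ends that a window of that half-width can be fetched around it.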
mseqgen.sequtils.getPeakPositions(tasks, chroms, chrom_sizes, flank, drop_duplicates=False)
Peak positions for all the tasks, filtered based on required chromosomes and other qc filters. Since ‘task’ here refers to one strand of input/output, if the data is stranded the peaks will be duplicated for the plus and minus strand.
- Parameters
tasks (dict) – A python dictionary containing the task information. Each task in tasks should have the key ‘peaks’ that has the path to the peaks file
chroms (list) – The list of required test chromosomes
chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns
flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size
drop_duplicates (boolean) – True if duplicates should be dropped from returned dataframe.
- Returns
two column dataframe of peak positions (chrom, pos)
- Return type
pandas.DataFrame
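The filtering and the effect of `drop_duplicates` can be illustrated with a simplified sketch. To stay self-contained it takes already-loaded peak dataframes instead of file paths, and it assumes narrowPeak-style ‘start’ and ‘summit’ columns with the peak position computed as start + summit. The name `peak_positions_sketch`, the column names, and the exact boundary checks are all assumptions for the example.

```python
import pandas as pd


def peak_positions_sketch(peaks_by_task, chroms, chrom_sizes, flank,
                          drop_duplicates=False):
    """Collect (chrom, pos) peak positions across tasks, filtered by flank."""
    sizes = chrom_sizes.set_index("chrom")["size"]
    frames = []
    for peaks in peaks_by_task.values():
        df = peaks[peaks["chrom"].isin(chroms)].copy()
        df["pos"] = df["start"] + df["summit"]   # assumed peak position
        df["size"] = df["chrom"].map(sizes)
        # Keep peaks whose flanking window stays inside the chromosome.
        keep = (df["pos"] >= flank) & (df["pos"] + flank <= df["size"])
        frames.append(df.loc[keep, ["chrom", "pos"]])
    if not frames:
        return pd.DataFrame({"chrom": [], "pos": []})
    result = pd.concat(frames, ignore_index=True)
    if drop_duplicates:
        result = result.drop_duplicates().reset_index(drop=True)
    return result


chrom_sizes = pd.DataFrame({"chrom": ["chr1", "chr2"], "size": [1000, 500]})
plus_peaks = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr3"],
    "start": [200, 10, 380, 100],
    "summit": [50, 20, 40, 5],
})
# Stranded data: the minus-strand task points at the same peaks.
peaks_by_task = {"task_plus": plus_peaks, "task_minus": plus_peaks.copy()}

all_peaks = peak_positions_sketch(peaks_by_task, ["chr1", "chr2"],
                                  chrom_sizes, flank=100)
unique_peaks = peak_positions_sketch(peaks_by_task, ["chr1", "chr2"],
                                     chrom_sizes, flank=100,
                                     drop_duplicates=True)
```

In this toy input only the chr1 peak at position 250 survives the filters; it appears once per strand in `all_peaks` and once in `unique_peaks`.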