Documentation

Modules

1. generators

This module contains the classes for all sequence data generators.

Classes

MSequenceGenerator - The main base class for all generators.

Multi task batch data generation for training deep neural networks on high-throughput sequencing data of various genomics assays

MBPNetSequenceGenerator - Derives from MSequenceGenerator.

Multi task batch data generation for training BPNet on high-throughput sequencing data of various genomics assays

1.1 MSequenceGenerator

class mseqgen.generators.MSequenceGenerator(input_config, batch_gen_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=1, batch_size=64, samples=None)

Multi task batch data generation for training deep neural networks on high-throughput sequencing data of various genomics assays

Parameters

input_config (dict) –
python dictionary with information about the input data. Contains the following keys -

data (str)
path to the json file containing task information. See README for more information on the format of the json file

stranded (boolean)
True if data is stranded

has_control (boolean)
True if control data has been included
batch_gen_params (dictionary) –
python dictionary with batch generation parameters. Contains the following keys -

input_seq_len (int)
length of input DNA sequence

output_len (int)
length of output profile

max_jitter (int)
maximum value for randomized jitter to offset the peaks from the exact center of the input

rev_comp_aug (boolean)
enable reverse complement augmentation

negative_sampling_rate (float)
the fraction of batch_size that determines how many negative samples are added to each batch

sampling_mode (str)
the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class

shuffle (boolean)
specify whether input data is shuffled at the beginning of each epoch

mode (str)
‘train’, ‘val’ or ‘test’

num_positions (int)
specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’, where it has no effect.

step_size (int)
specify step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’, where it has no effect.
reference_genome (str) – the path to the reference genome fasta file
chrom_sizes (str) – path to the chromosome sizes file
chroms (list) – the list of chromosomes that will be sampled for batch generation
num_threads (int) – number of parallel threads for batch generation, default = 10
epochs (int) – number of iterations for looping over input data, default = 1
batch_size (int) – size of each generated batch of data, default = 64
samples (pandas.DataFrame) – two column pandas dataframe with chromosome position information. Required column names are column 1: ‘chrom’, column 2: ‘pos’. Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
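
As an illustration, the two dictionaries described above might be assembled as follows. This is a minimal sketch: only the key names come from the parameter list, every path and numeric value is a hypothetical placeholder, and `missing_keys` is a helper invented for this example rather than part of mseqgen. The constructor call is left commented out because it needs real input files.

```python
# Key names are taken from the parameter list above; all values are
# hypothetical placeholders chosen for illustration.
input_config = {
    "data": "input_data.json",  # hypothetical path to the task json file
    "stranded": True,
    "has_control": False,
}

batch_gen_params = {
    "input_seq_len": 2114,
    "output_len": 1000,
    "max_jitter": 128,
    "rev_comp_aug": True,
    "negative_sampling_rate": 0.1,
    "sampling_mode": "peaks",
    "shuffle": True,
    "mode": "train",
    # num_positions and step_size are left out: the list above says they
    # can be omitted when sampling_mode is 'peaks'.
}

# Keys documented for each dictionary (optional sampling keys excluded).
INPUT_CONFIG_KEYS = {"data", "stranded", "has_control"}
BATCH_GEN_KEYS = {
    "input_seq_len", "output_len", "max_jitter", "rev_comp_aug",
    "negative_sampling_rate", "sampling_mode", "shuffle", "mode",
}


def missing_keys(config, documented):
    """Return the documented keys absent from config (sketch helper)."""
    return sorted(documented - set(config))


# With real files in place, the generator would be built along these lines:
# from mseqgen.generators import MSequenceGenerator
# generator = MSequenceGenerator(
#     input_config, batch_gen_params,
#     reference_genome="genome.fa",   # hypothetical path
#     chrom_sizes="chrom.sizes",      # hypothetical path
#     chroms=["chr1", "chr2"],
#     num_threads=10, epochs=1, batch_size=64)
```

Checking the dictionaries before constructing the generator keeps a misspelled key from surfacing only once batch generation has started.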

Members

gen(): Generator function to yield batches of data

len()

The number of batches per epoch

Returns: number of batches of data generated in each epoch
Return type: int

set_early_stopping(): Set early stopping flag to True

set_ready_for_next_epoch(): Set the variable that controls batch generation for the next epoch to True

set_stop(): Set stop flag to True

1.2 MBPNetSequenceGenerator

class mseqgen.generators.MBPNetSequenceGenerator(input_config, batch_gen_params, bpnet_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=100, batch_size=64, samples=None)

Multi task batch data generation for training BPNet on high-throughput sequencing data of various genomics assays

Parameters

input_config (dict) –
python dictionary with information about the input data. Contains the following keys -

data (str)
path to the json file containing task information. See README for more information on the format of the json file

stranded (boolean)
True if data is stranded

has_control (boolean)
True if control data has been included
batch_gen_params (dictionary) –
python dictionary with batch generation parameters. Contains the following keys -

input_seq_len (int)
length of input DNA sequence

output_len (int)
length of output profile

max_jitter (int)
maximum value for randomized jitter to offset the peaks from the exact center of the input

rev_comp_aug (boolean)
enable reverse complement augmentation

negative_sampling_rate (float)
the fraction of batch_size that determines how many negative samples are added to each batch

sampling_mode (str)
the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class

shuffle (boolean)
specify whether input data is shuffled at the beginning of each epoch

mode (str)
‘train’, ‘val’ or ‘test’

num_positions (int)
specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’, where it has no effect.

step_size (int)
specify step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’, where it has no effect.
bpnet_params (dictionary) –
python dictionary containing parameters specific to BPNet. Contains the following keys -

name (str)
model architecture name

filters (int)
number of filters for BPNet

control_smoothing (list)
nested list of Gaussian smoothing parameters for supplemental control tracks. Each inner list has two values: [sigma, window_size]
reference_genome (str) – the path to the reference genome fasta file
chrom_sizes (str) – path to the chromosome sizes file
chroms (list) – the list of chromosomes that will be sampled for batch generation
num_threads (int) – number of parallel threads for batch generation, default = 10
epochs (int) – number of iterations for looping over input data, default = 100
batch_size (int) – size of each generated batch of data, default = 64
samples (pandas.DataFrame) – two column pandas dataframe with chromosome position information. Required column names are column 1: ‘chrom’, column 2: ‘pos’. Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
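
To make the control_smoothing entry concrete, the sketch below builds a bpnet_params dictionary and applies one Gaussian smoothing pass per [sigma, window_size] pair to a toy control track. It is an illustration under stated assumptions, not the library's implementation: the kernel is normalised to sum to 1, positions outside the track count as zero, window sizes are odd, and all values (including the architecture name) are hypothetical.

```python
import math


def gaussian_kernel(sigma, window_size):
    """Normalised 1-D Gaussian kernel with window_size taps (odd size assumed)."""
    center = (window_size - 1) / 2.0
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2)
               for i in range(window_size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma, window_size):
    """Same-length smoothing; positions outside the signal are treated as zero."""
    kernel = gaussian_kernel(sigma, window_size)
    half = window_size // 2
    smoothed = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        smoothed.append(acc)
    return smoothed


# Key names come from the parameter list above; the values are hypothetical.
bpnet_params = {
    "name": "BPNet",
    "filters": 64,
    "control_smoothing": [[1.0, 5], [2.0, 9]],  # [sigma, window_size] pairs
}

# Toy control track: a single spike of 10 reads in the middle of 21 positions.
control = [0.0] * 10 + [10.0] + [0.0] * 10

# One smoothed track per [sigma, window_size] pair.
smoothed_tracks = [smooth(control, sigma, window_size)
                   for sigma, window_size in bpnet_params["control_smoothing"]]
```

Each inner list therefore yields one additional control track, with larger sigma and window_size values spreading the signal over more positions.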

Members

gen(): Generator function to yield batches of data

len()

The number of batches per epoch

Returns: number of batches of data generated in each epoch
Return type: int

set_early_stopping(): Set early stopping flag to True

set_ready_for_next_epoch(): Set the variable that controls batch generation for the next epoch to True

set_stop(): Set stop flag to True

2. sequtils

mseqgen.sequtils.one_hot_encode(sequences)

One hot encoding of a list of DNA sequences

Parameters: sequences (list) – python list of strings of DNA sequence
Returns: 3-dimensional numpy array with shape (len(sequences), len(list_item), 4)
Return type: numpy.ndarray
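
The shape of the result can be illustrated with a small pure-Python sketch. It is not the library's code: it returns nested lists instead of a numpy array, and both the A, C, G, T channel order and the all-zero encoding of other characters (such as N) are assumptions made for this example.

```python
BASES = "ACGT"  # assumed channel order; the documentation does not state it


def one_hot_encode_sketch(sequences):
    """Encode equal-length DNA strings as nested lists of shape (n, length, 4)."""
    encoded = []
    for seq in sequences:
        rows = []
        for base in seq.upper():
            row = [0, 0, 0, 0]
            index = BASES.find(base)
            if index >= 0:  # unknown characters stay all-zero (assumption)
                row[index] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded


encoded = one_hot_encode_sketch(["ACGT", "NNAA"])
```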

mseqgen.sequtils.reverse_complement_of_sequences(sequences)

Reverse complement of DNA sequences

Parameters: sequences (list) – python list of strings of DNA sequence of arbitrary length
Returns: python list of strings
Return type: list
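
The operation itself is easy to sketch with the standard library. The version below is an illustration, not the library's implementation; leaving characters other than A, C, G and T unchanged is an assumption.

```python
# Translation table for complementing bases; other characters (e.g. N)
# are passed through unchanged, which is an assumption of this sketch.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequences):
    """Complement each base, then reverse each sequence."""
    return [seq.translate(_COMPLEMENT)[::-1] for seq in sequences]


result = reverse_complement_sketch(["AACG", "ttN"])
```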

mseqgen.sequtils.reverse_complement_of_profiles(profiles, stranded=True)

Reverse complement of a genomics assay signal profile

Parameters: profiles (numpy.ndarray) – 3-dimensional numpy array, a batch of multitask profiles of shape (#examples, seq_len, #assays) if unstranded and (#examples, seq_len, #assays*2) if stranded. In the stranded case the assumption is that the positive & negative strands occur in pairs on axis=2 (i.e. 3rd dimension), e.g. 0th & 1st index, 2nd & 3rd…
Returns: 3-dimensional numpy array
Return type: numpy.ndarray
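
One plausible reading of this operation, sketched on nested lists rather than numpy arrays: reverse every profile along the sequence axis and, for stranded data, swap the two members of each strand pair. That interpretation is an assumption drawn from the parameter description above, not a copy of the library's code.

```python
def reverse_complement_of_profiles_sketch(profiles, stranded=True):
    """Reverse along axis 1; if stranded, swap each (plus, minus) pair on axis 2."""
    result = []
    for example in profiles:
        flipped = []
        for row in example[::-1]:  # reverse the sequence axis
            if stranded:
                swapped = []
                for i in range(0, len(row), 2):
                    # plus/minus strands are assumed to sit side by side
                    swapped.extend([row[i + 1], row[i]])
                flipped.append(swapped)
            else:
                flipped.append(list(row))
        result.append(flipped)
    return result


# One example, three positions, one stranded assay (plus, minus).
profiles = [[[1, 2], [3, 4], [5, 6]]]
stranded_rc = reverse_complement_of_profiles_sketch(profiles, stranded=True)
unstranded_rc = reverse_complement_of_profiles_sketch(profiles, stranded=False)
```

Under this reading the operation is its own inverse, which matches what one expects of a reverse complement.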

mseqgen.sequtils.getChromPositions(chroms, chrom_sizes, flank, mode='sequential', num_positions=-1, step=50)

Chromosome positions spanning the entire chromosome at a) regular intervals or b) random locations

Parameters

chroms (list) – The list of required chromosomes
chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns
flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size
mode (str) – mode of returned position ‘sequential’ (from the beginning) or ‘random’
num_positions (int) – number of chromosome positions to return on each chromosome, use -1 to return positions across the entire chromosome for all given chromosomes in chroms. mode=’random’ cannot be used with num_positions=-1
step (int) – the interval between consecutive chromosome positions in ‘sequential’ mode

Returns

two column dataframe of chromosome positions (chrom, pos)

Return type

pandas.DataFrame
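
The two modes can be sketched in plain Python as follows. This is an approximation under stated assumptions, not the library's implementation: chromosome sizes are passed as a dict rather than a dataframe, positions are returned as (chrom, pos) tuples, the valid window is taken to be [flank, size - flank), and the `seed` argument is a hypothetical addition for reproducibility.

```python
import random


def chrom_positions_sketch(chroms, chrom_sizes, flank, mode="sequential",
                           num_positions=-1, step=50, seed=None):
    """Return (chrom, pos) tuples; chrom_sizes maps chromosome name to size."""
    if mode == "random" and num_positions == -1:
        raise ValueError("mode='random' cannot be used with num_positions=-1")
    rng = random.Random(seed)  # seed is a hypothetical convenience argument
    positions = []
    for chrom in chroms:
        # Keep a flank-sized buffer at both ends of the chromosome.
        start, end = flank, chrom_sizes[chrom] - flank
        if mode == "sequential":
            candidates = list(range(start, end, step))
            if num_positions != -1:
                candidates = candidates[:num_positions]
        elif mode == "random":
            candidates = [rng.randrange(start, end)
                          for _ in range(num_positions)]
        else:
            raise ValueError("mode must be 'sequential' or 'random'")
        positions.extend((chrom, pos) for pos in candidates)
    return positions


sizes = {"chr1": 1000, "chr2": 500}
sequential = chrom_positions_sketch(["chr1"], sizes, flank=100, step=300)
random_positions = chrom_positions_sketch(["chr1", "chr2"], sizes, flank=100,
                                          mode="random", num_positions=5, seed=0)
```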

mseqgen.sequtils.getPeakPositions(tasks, chroms, chrom_sizes, flank, drop_duplicates=False)

Peak positions for all the tasks filtered based on required chromosomes and other qc filters. Since ‘task’ here refers to one strand of input/output, if the data is stranded the peaks will be duplicated for the plus and minus strand.

Parameters

tasks (dict) – A python dictionary containing the task information. Each task in tasks should have the key ‘peaks’ that has the path to the peaks file
chroms (list) – The list of required test chromosomes
chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns
flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size
drop_duplicates (boolean) – True if duplicates should be dropped from returned dataframe.

Returns

two column dataframe of peak positions (chrom, pos)

Return type

pandas.DataFrame
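
The filtering step can be sketched as follows, assuming the peak positions have already been read from the peaks files into (chrom, pos) tuples. The sketch is not the library's code: the exact boundary test, the dict of chromosome sizes and the tuple output are assumptions, and the other qc filters mentioned above are not modelled.

```python
def filter_peaks_sketch(peaks, chroms, chrom_sizes, flank, drop_duplicates=False):
    """Keep peaks on the requested chromosomes with a full flank on both sides."""
    kept = []
    seen = set()
    for chrom, pos in peaks:
        if chrom not in chroms:
            continue
        # Drop peaks whose window would run past either chromosome end.
        if pos - flank < 0 or pos + flank > chrom_sizes[chrom]:
            continue
        if drop_duplicates:
            if (chrom, pos) in seen:
                continue
            seen.add((chrom, pos))
        kept.append((chrom, pos))
    return kept


# The repeated chr1:500 entry mimics a stranded task listed for both strands.
peaks = [("chr1", 50), ("chr1", 500), ("chr1", 500), ("chr2", 300), ("chr1", 950)]
sizes = {"chr1": 1000, "chr2": 500}
all_peaks = filter_peaks_sketch(peaks, ["chr1"], sizes, flank=100)
unique_peaks = filter_peaks_sketch(peaks, ["chr1"], sizes, flank=100,
                                   drop_duplicates=True)
```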