Documentation

Modules

1. generators

This module contains classes for all the sequence data generators

Classes

MSequenceGenerator - The main base class for all generators.

Multi task batch data generation for training deep neural networks on high-throughput sequencing data of various genomics assays

MBPNetSequenceGenerator - Derives from MSequenceGenerator.

Multi task batch data generation for training BPNet on high-throughput sequencing data of various genomics assays

1.1 MSequenceGenerator

class mseqgen.generators.MSequenceGenerator(input_config, batch_gen_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=1, batch_size=64, samples=None)

Multi task batch data generation for training deep neural networks on high-throughput sequencing data of various genomics assays

Parameters
  • input_config (dict) –

    python dictionary with information about the input data. Contains the following keys -

    data (str)

    path to the json file containing task information. See README for more information on the format of the json file

    stranded (boolean)

    True if data is stranded

    has_control (boolean)

    True if control data has been included

  • batch_gen_params (dictionary) –

    python dictionary with batch generation parameters. Contains the following keys -

    input_seq_len (int)

    length of input DNA sequence

    output_len (int)

    length of output profile

    max_jitter (int)

    maximum value for randomized jitter to offset the peaks from the exact center of the input

    rev_comp_aug (boolean)

    enable reverse complement augmentation

    negative_sampling_rate (float)

    the fraction of batch_size that determines how many negative samples are added to each batch

    sampling_mode (str)

    the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class

    shuffle (boolean)

    specify whether input data is shuffled at the beginning of each epoch

    mode (str)

    ‘train’, ‘val’ or ‘test’

    num_positions (int)

    specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’; it has no effect if present.

    step_size (int)

    specify step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’; it has no effect if present.

  • reference_genome (str) – the path to the reference genome fasta file

  • chrom_sizes (str) – path to the chromosome sizes file

  • chroms (list) – the list of chromosomes that will be sampled for batch generation

  • num_threads (int) – number of parallel threads for batch generation, default = 10

  • epochs (int) – number of iterations for looping over input data, default = 1

  • batch_size (int) – size of each generated batch of data, default = 64

  • samples (pandas.DataFrame) – two-column pandas dataframe with chromosome position information. Required column names are column 1: ‘chrom’, column 2: ‘pos’. Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
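
As an illustration of the parameters above, the following is a minimal sketch of how the two dictionaries and the ‘samples’ dataframe might be assembled for ‘manual’ sampling mode. All file paths and numeric values are hypothetical placeholders chosen for the example, not defaults of the library, and the constructor call is left commented out because it needs the package and real input files.

```python
import pandas as pd

# All paths and numeric values below are hypothetical placeholders.
input_config = {
    "data": "input_data.json",  # path to the json file with task information
    "stranded": True,           # data is stranded
    "has_control": True,        # control data is included
}

batch_gen_params = {
    "input_seq_len": 2114,      # illustrative input sequence length
    "output_len": 1000,         # illustrative output profile length
    "max_jitter": 128,
    "rev_comp_aug": True,
    "negative_sampling_rate": 0.0,
    "sampling_mode": "manual",  # positions come from the 'samples' dataframe
    "shuffle": True,
    "mode": "train",
}

# Two-column dataframe with the required column names 'chrom' and 'pos'.
samples = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "pos": [100000, 250000, 75000]}
)

generator_kwargs = dict(
    input_config=input_config,
    batch_gen_params=batch_gen_params,
    reference_genome="genome.fa",   # hypothetical fasta path
    chrom_sizes="chrom.sizes",      # hypothetical chromosome sizes path
    chroms=["chr1", "chr2"],
    num_threads=10,
    epochs=1,
    batch_size=64,
    samples=samples,
)

# With mseqgen installed and the files in place, the generator would be
# created along these lines:
# from mseqgen.generators import MSequenceGenerator
# generator = MSequenceGenerator(**generator_kwargs)
```

num_positions and step_size are left out here because, per the descriptions above, they apply to the ‘sequential’ and ‘random’ sampling modes.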

Members

gen()

Generator function to yield batches of data

len()

The number of batches per epoch

Returns

number of batches of data generated in each epoch

Return type

int

set_early_stopping()

Set early stopping flag to True

set_ready_for_next_epoch()

Set the variable that controls batch generation for the next epoch to True

set_stop()

Set stop flag to True

1.2 MBPNetSequenceGenerator

class mseqgen.generators.MBPNetSequenceGenerator(input_config, batch_gen_params, bpnet_params, reference_genome, chrom_sizes, chroms, num_threads=10, epochs=100, batch_size=64, samples=None)

Multi task batch data generation for training BPNet on high-throughput sequencing data of various genomics assays

Parameters
  • input_config (dict) –

    python dictionary with information about the input data. Contains the following keys -

    data (str)

    path to the json file containing task information. See README for more information on the format of the json file

    stranded (boolean)

    True if data is stranded

    has_control (boolean)

    True if control data has been included

  • batch_gen_params (dictionary) –

    python dictionary with batch generation parameters. Contains the following keys -

    input_seq_len (int)

    length of input DNA sequence

    output_len (int)

    length of output profile

    max_jitter (int)

    maximum value for randomized jitter to offset the peaks from the exact center of the input

    rev_comp_aug (boolean)

    enable reverse complement augmentation

    negative_sampling_rate (float)

    the fraction of batch_size that determines how many negative samples are added to each batch

    sampling_mode (str)

    the mode of sampling chromosome positions - one of [‘peaks’, ‘sequential’, ‘random’, ‘manual’]. In ‘peaks’ mode the data samples are fetched from the peaks bed file specified in the json file input_config[‘data’]. In ‘manual’ mode, the two column pandas dataframe containing the chromosome position information is passed to the ‘samples’ argument of the class

    shuffle (boolean)

    specify whether input data is shuffled at the beginning of each epoch

    mode (str)

    ‘train’, ‘val’ or ‘test’

    num_positions (int)

    specify how many chromosome positions to sample if sampling_mode is ‘sequential’ or ‘random’. Can be omitted if sampling_mode is ‘peaks’; it has no effect if present.

    step_size (int)

    specify step size for sampling chromosome positions if sampling_mode is ‘sequential’. Can be omitted if sampling_mode is ‘peaks’ or ‘random’; it has no effect if present.

  • bpnet_params (dictionary) –

    python dictionary containing parameters specific to BPNet. Contains the following keys -

    name (str)

    model architecture name

    filters (int)

    number of filters for BPNet

    control_smoothing (list)

    nested list of Gaussian smoothing parameters. Each inner list has two values - [sigma, window_size] for supplemental control tracks

  • reference_genome (str) – the path to the reference genome fasta file

  • chrom_sizes (str) – path to the chromosome sizes file

  • chroms (list) – the list of chromosomes that will be sampled for batch generation

  • num_threads (int) – number of parallel threads for batch generation

  • epochs (int) – number of iterations for looping over input data

  • batch_size (int) – size of each generated batch of data

  • samples (pandas.DataFrame) – two-column pandas dataframe with chromosome position information. Required column names are column 1: ‘chrom’, column 2: ‘pos’. Use this parameter if you set batch_gen_params[‘sampling_mode’] to ‘manual’. default = None
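
To make the [sigma, window_size] pairs in control_smoothing concrete, here is a minimal sketch of a bpnet_params dictionary together with a plain numpy Gaussian smoothing of a 1-D control track. The helper gaussian_smooth_1d and all values in the dictionary are assumptions made for illustration; the library's own smoothing routine may differ in kernel construction and edge handling.

```python
import numpy as np

# Illustrative values only; none of these are confirmed library defaults.
bpnet_params = {
    "name": "BPNet",
    "filters": 64,
    "control_smoothing": [[1, 3], [80, 501]],  # [sigma, window_size] pairs
}


def gaussian_smooth_1d(signal, sigma, window_size):
    """Smooth a 1-D signal with a normalized Gaussian kernel.

    Hypothetical helper: assumes an odd window_size so the kernel is
    centered on each position.
    """
    half = window_size // 2
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so total signal is preserved
    return np.convolve(signal, kernel, mode="same")


# A toy control track with a single spike in the middle.
control = np.zeros(1001)
control[500] = 100.0

# One smoothed track per [sigma, window_size] pair.
smoothed_tracks = [
    gaussian_smooth_1d(control, sigma, window_size)
    for sigma, window_size in bpnet_params["control_smoothing"]
]
```

Each pair yields one smoothed version of the control track; a larger sigma and window spread the spike over more positions while keeping the total signal the same away from the edges.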

Members

gen()

Generator function to yield batches of data

len()

The number of batches per epoch

Returns

number of batches of data generated in each epoch

Return type

int

set_early_stopping()

Set early stopping flag to True

set_ready_for_next_epoch()

Set the variable that controls batch generation for the next epoch to True

set_stop()

Set stop flag to True

2. sequtils

mseqgen.sequtils.one_hot_encode(sequences)

One hot encoding of a list of DNA sequences

Parameters

sequences (list) – python list of strings of DNA sequence, all of the same length

Returns

3-dimensional numpy array with shape (len(sequences), len(list_item), 4), where list_item is any one sequence in the list

Return type

numpy.ndarray
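
The output shape described above can be illustrated with a small stand-alone sketch. This is not the library's implementation: the function name one_hot_encode_sketch, the A, C, G, T channel order, and the all-zero encoding of any other character (such as N) are assumptions made for the example.

```python
import numpy as np


def one_hot_encode_sketch(sequences):
    """One-hot encode equal-length DNA strings into (n, seq_len, 4).

    Assumed channel order: A, C, G, T. Other characters stay all-zero.
    """
    base_to_index = {"A": 0, "C": 1, "G": 2, "T": 3}
    if not sequences:
        return np.zeros((0, 0, 4), dtype=np.float32)

    seq_len = len(sequences[0])
    encoded = np.zeros((len(sequences), seq_len, 4), dtype=np.float32)
    for i, seq in enumerate(sequences):
        if len(seq) != seq_len:
            raise ValueError("all sequences must have the same length")
        for j, base in enumerate(seq.upper()):
            index = base_to_index.get(base)
            if index is not None:
                encoded[i, j, index] = 1.0
    return encoded


encoded = one_hot_encode_sketch(["ACGT", "NNAC"])
```

For two sequences of length four the result has shape (2, 4, 4), matching (len(sequences), len(list_item), 4).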

mseqgen.sequtils.reverse_complement_of_sequences(sequences)

Reverse complement of DNA sequences

Parameters

sequences (list) – python list of strings of DNA sequence of arbitrary length

Returns

python list of strings

Return type

list
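
As a sketch of the operation, the reverse complement of each string can be computed by complementing every base and reversing the result. The function name reverse_complement_sketch is hypothetical, and leaving characters other than A, C, G, T unchanged is an assumption; the library function may treat them differently.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequences):
    """Return the reverse complement of each DNA string in a list."""
    return [seq.translate(_COMPLEMENT)[::-1] for seq in sequences]


result = reverse_complement_sketch(["AACG", "TTN"])
```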

mseqgen.sequtils.reverse_complement_of_profiles(profiles, stranded=True)

Reverse complement of a genomics assay signal profile

Parameters

profiles (numpy.ndarray) – 3-dimensional numpy array, a batch of multitask profiles of shape (#examples, seq_len, #assays) if unstranded and (#examples, seq_len, #assays*2) if stranded. In the stranded case the assumption is that the positive & negative strands occur in pairs on axis=2 (i.e. the 3rd dimension), e.g. 0th & 1st index, 2nd & 3rd…

Returns

3-dimensional numpy array

Return type

numpy.ndarray
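
One way to picture this operation, given the strand-pairing assumption stated above, is to reverse the sequence axis and, for stranded data, swap the two members of each strand pair on the last axis. The sketch below is an interpretation under that assumption, not the library's code; the function name is hypothetical.

```python
import numpy as np


def reverse_complement_of_profiles_sketch(profiles, stranded=True):
    """Reverse profiles along seq_len and swap strand pairs if stranded.

    profiles has shape (#examples, seq_len, #tracks). In the stranded
    case, tracks are assumed to be ordered in pairs: (0, 1), (2, 3), ...
    """
    flipped = profiles[:, ::-1, :]  # reverse the sequence axis
    if not stranded:
        return flipped.copy()

    num_tracks = profiles.shape[2]
    if num_tracks % 2 != 0:
        raise ValueError("stranded profiles need an even number of tracks")

    # Index order [1, 0, 3, 2, ...] swaps the strands within each pair.
    swapped_order = np.arange(num_tracks).reshape(-1, 2)[:, ::-1].reshape(-1)
    return flipped[:, :, swapped_order]


# One example, seq_len 3, one stranded assay (two tracks).
profiles = np.arange(6).reshape(1, 3, 2)
stranded_rc = reverse_complement_of_profiles_sketch(profiles, stranded=True)
unstranded_rc = reverse_complement_of_profiles_sketch(profiles, stranded=False)
```

Applying the function twice returns the original array, which is a quick sanity check for any reverse-complement routine.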

mseqgen.sequtils.getChromPositions(chroms, chrom_sizes, flank, mode='sequential', num_positions=-1, step=50)

Chromosome positions spanning the entire chromosome at a) regular intervals or b) random locations

Parameters
  • chroms (list) – The list of required chromosomes

  • chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns

  • flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size

  • mode (str) – mode of returned position ‘sequential’ (from the beginning) or ‘random’

  • num_positions (int) – number of chromosome positions to return on each chromosome, use -1 to return positions across the entire chromosome for all given chromosomes in chroms. mode=‘random’ cannot be used with num_positions=-1

  • step (int) – the interval between consecutive chromosome positions in ‘sequential’ mode

Returns

two column dataframe of chromosome positions (chrom, pos)

Return type

pandas.DataFrame
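
The behaviour described by these parameters can be approximated with the stand-alone sketch below. It is not the library's implementation: the function name, the seed argument, and the exact handling of interval endpoints (positions are drawn from flank up to but not including size - flank) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def chrom_positions_sketch(chroms, chrom_sizes, flank, mode="sequential",
                           num_positions=-1, step=50, seed=0):
    """Return a (chrom, pos) dataframe of positions on each chromosome."""
    if mode == "random" and num_positions == -1:
        raise ValueError("mode='random' cannot be used with num_positions=-1")

    rng = np.random.default_rng(seed)  # hypothetical seeding for the sketch
    frames = []
    for chrom in chroms:
        size = int(chrom_sizes.loc[chrom_sizes["chrom"] == chrom, "size"].iloc[0])
        if mode == "sequential":
            # Regular intervals, staying 'flank' away from both ends.
            positions = np.arange(flank, size - flank, step)
            if num_positions != -1:
                positions = positions[:num_positions]
        elif mode == "random":
            positions = rng.integers(flank, size - flank, size=num_positions)
        else:
            raise ValueError("mode must be 'sequential' or 'random'")
        frames.append(pd.DataFrame({"chrom": chrom, "pos": positions}))
    return pd.concat(frames, ignore_index=True)


sizes = pd.DataFrame({"chrom": ["chr1", "chr2"], "size": [1000, 500]})
sequential = chrom_positions_sketch(["chr1"], sizes, flank=100, step=200)
random_positions = chrom_positions_sketch(
    ["chr1", "chr2"], sizes, flank=100, mode="random", num_positions=5
)
```

The returned dataframe has the same two-column (chrom, pos) layout that the generator classes expect for the ‘samples’ argument in ‘manual’ sampling mode.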

mseqgen.sequtils.getPeakPositions(tasks, chroms, chrom_sizes, flank, drop_duplicates=False)

Peak positions for all the tasks, filtered based on required chromosomes and other qc filters. Since ‘task’ here refers to one strand of input/output, if the data is stranded the peaks will be duplicated for the plus and minus strand.

Parameters
  • tasks (dict) – A python dictionary containing the task information. Each task in tasks should have the key ‘peaks’ that has the path to the peaks file

  • chroms (list) – The list of required test chromosomes

  • chrom_sizes (pandas.DataFrame) – dataframe of chromosome sizes with ‘chrom’ and ‘size’ columns

  • flank (int) – Buffer size before & after the position to ensure we don’t fetch values at index < 0 or > chrom size

  • drop_duplicates (boolean) – True if duplicates should be dropped from returned dataframe.

Returns

two column dataframe of peak positions (chrom, pos)

Return type

pandas.DataFrame