Command Line Interface

bpnettrainer

usage: bpnettrainer [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
                    [--learning-rate LEARNING_RATE] [--min-learning-rate MIN_LEARNING_RATE]
                    [--early-stopping-patience EARLY_STOPPING_PATIENCE]
                    [--early-stopping-min-delta EARLY_STOPPING_MIN_DELTA]
                    [--reduce-lr-on-plateau-patience REDUCE_LR_ON_PLATEAU_PATIENCE]
                    [--model-arch-name MODEL_ARCH_NAME]
                    [--sequence-generator-name SEQUENCE_GENERATOR_NAME] [--filters FILTERS]
                    [--counts-loss-weight COUNTS_LOSS_WEIGHT]
                    [--control-smoothing CONTROL_SMOOTHING] [--threads THREADS] [--gpus GPUS]
                    --reference-genome REFERENCE_GENOME --chrom-sizes CHROM_SIZES --chroms
                    CHROMS [CHROMS ...] [--exclude-chroms EXCLUDE_CHROMS [EXCLUDE_CHROMS ...]]
                    [--splits SPLITS] [--output-dir OUTPUT_DIR] [--tag-length TAG_LENGTH]
                    [--time-zone TIME_ZONE] [--automate-filenames]
                    [--model-output-filename MODEL_OUTPUT_FILENAME]
                    [--input-seq-len INPUT_SEQ_LEN] [--output-len OUTPUT_LEN]
                    [--max-jitter MAX_JITTER] [--reverse-complement-augmentation]
                    [--negative-sampling-rate NEGATIVE_SAMPLING_RATE] --input-data INPUT_DATA
                    [--stranded] [--has-control] [--sampling-mode {peaks,sequential,random}]
                    [--shuffle]

Named Arguments

--batch-size, -b

training batch size

Default: 64

--epochs, -e

number of training epochs

Default: 100

--learning-rate, -L

learning rate for Adam optimizer

Default: 0.004

--min-learning-rate, -l

min learning rate for Adam optimizer

Default: 0.0001

--early-stopping-patience

patience value for early stopping callback

Default: 5

--early-stopping-min-delta

minimum change in the validation loss to qualify as an improvement

Default: 0.001

--reduce-lr-on-plateau-patience

patience value for ReduceLROnPlateau callback

Default: 2

--model-arch-name

the name of the model architecture that will be used in training (the name that will be used to fetch the model from model_archs)

Default: “BPNet”

--sequence-generator-name

the name of the sequence generator from mseqgen library that will be used to generate batches of data

Default: “BPNet”

--filters, -f

number of filters to use in BPNet

Default: 64

--counts-loss-weight, -w

Weight for counts mse loss

Default: 100.0

--control-smoothing

sigma and window size for gaussian smoothing of the control (pairs of [sigma, window size])

Default: [[7.5, 80]]

--threads, -t

number of parallel threads for batch generation

Default: 10

--gpus, -p

number of gpus to use

Default: 1

--reference-genome, -g

path to the reference genome fasta file

--chrom-sizes, -c

path to chromosome sizes file

--chroms

master list of chromosomes for the genome

--exclude-chroms

list of chromosomes to be excluded

Default: []

--splits, -s

path to the json file containing the train/validation/test chromosome splits

--output-dir, -d

destination directory to store the model

Default: “.”

--tag-length

length of the alphanumeric tag for the model file name (applies if --automate-filenames option is used)

Default: 6

--time-zone

time zone to use for timestamping model directories (applies if --automate-filenames option is used)

Default: “US/Pacific”

--automate-filenames

specify if the model output directory and filename should be auto generated

Default: False

--model-output-filename

basename of the model file without the .h5 extension (required if --automate-filenames is not used)

Default: “”

--input-seq-len

length of input DNA sequence

Default: 3088

--output-len

length of output profile

Default: 1000

--max-jitter

maximum value for randomized jitter to offset the peaks from the exact center of the input

Default: 128

--reverse-complement-augmentation

enable reverse complement augmentation

Default: True

--negative-sampling-rate

number of negatives to sample for every positive peak

Default: 0.0

--input-data, -i

path to json file containing task information

--stranded

specify if the input data is stranded or unstranded

Default: False

--has-control

specify if the input data has controls

Default: False

--sampling-mode

Possible choices: peaks, sequential, random

Default: “peaks”

--shuffle

Default: False
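
A minimal invocation, using only documented flags, might look like the following; all file names are placeholders for your own data, and the optional flags shown are illustrative:

```shell
# Placeholder file names; substitute your own reference, sizes, and task json.
bpnettrainer \
    --reference-genome hg38.fa \
    --chrom-sizes hg38.chrom.sizes \
    --chroms chr1 chr2 chr3 \
    --exclude-chroms chrX chrY \
    --input-data input_data.json \
    --output-dir models \
    --model-output-filename bpnet_model
```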

predict

usage: predict [-h] [--batch-size BATCH_SIZE] [--input-seq-len INPUT_SEQ_LEN]
               [--output-len OUTPUT_LEN] [--predict-peaks] --reference-genome REFERENCE_GENOME
               --chrom-sizes CHROM_SIZES --chroms CHROMS [CHROMS ...] --input-data INPUT_DATA
               [--stranded] [--has-control] [--model MODEL] [--model-name MODEL_NAME]
               [--model-dir MODEL_DIR] --output-dir OUTPUT_DIR [--automate-filenames]
               [--time-zone TIME_ZONE] [--exponentiate-counts]
               [--output-window-size OUTPUT_WINDOW_SIZE]
               [--other-tags OTHER_TAGS [OTHER_TAGS ...]]
               [--write-buffer-size WRITE_BUFFER_SIZE]

Named Arguments

--batch-size, -b

test batch size

Default: 64

--input-seq-len

length of input DNA sequence

Default: 3088

--output-len

length of output profile

Default: 1000

--predict-peaks

generate predictions only on the peaks contained in the peaks.bed files

Default: False

--reference-genome, -g

the path to the reference genome fasta file

--chrom-sizes, -s

path to chromosome sizes file

--chroms, -c

list of test chromosomes for prediction

--input-data, -i

path to json file containing task information

--stranded

specify if the input data is stranded or unstranded (i.e. in case --has-control is True)

Default: False

--has-control

specify if the input data has controls

Default: False

--model, -m

path to the .h5 model file

--model-name

the name of the model that will be used for predictions

Default: “BPNet”

--model-dir

directory where .h5 model files are stored

--output-dir, -o

destination directory to store predictions as a bigWig file

--automate-filenames

specify if the predictions output should be stored in a timestamped subdirectory within --output-dir

Default: False

--time-zone

time zone to use for timestamping model directories

Default: “US/Pacific”

--exponentiate-counts

specify if the predicted counts should be exponentiated before writing to the bigWig files

Default: False

--output-window-size

size of the central window of the output profile predictions that will be written to the bigWig files

Default: 1000

--other-tags

list of additional tags to be added as suffix to the filenames

Default: []

--write-buffer-size

size of the write buffer to store predictions before writing to bigWig files

Default: 10000
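
A sketch of a typical invocation, assuming a trained model from bpnettrainer; all paths are placeholders:

```shell
# Placeholder paths; --model points to a trained .h5 model file.
predict \
    --reference-genome hg38.fa \
    --chrom-sizes hg38.chrom.sizes \
    --chroms chr21 chr22 \
    --input-data input_data.json \
    --model models/bpnet_model.h5 \
    --output-dir predictions \
    --predict-peaks
```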

bounds

usage: bounds [-h] [--input-profiles INPUT_PROFILES [INPUT_PROFILES ...]]
              [--output-names OUTPUT_NAMES [OUTPUT_NAMES ...]] --output-directory
              OUTPUT_DIRECTORY --peaks PEAKS [--peak-width PEAK_WIDTH]
              [--chroms CHROMS [CHROMS ...]]
              [--smoothing-params SMOOTHING_PARAMS [SMOOTHING_PARAMS ...]]

Named Arguments

--input-profiles

list of input bigWig profiles

Default: []

--output-names

list of output names for the bounds output corresponding to each of the input profiles

Default: []

--output-directory

Path to the output directory

--peaks

Path to the bed file containing the chromosome coordinates. The bed file should have at least 3 columns, the first 3 being ‘chrom’, ‘start’, and ‘end’

--peak-width

the span of the peak to be considered for bounds computation

Default: 1000

--chroms, -c

list of chromosomes to be considered from peaks file

--smoothing-params

sigma and window size for gaussian 1D smoothing of ‘observed’ and ‘predicted’ profiles

Default: [7.0, 81]
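
A minimal invocation sketch; the bigWig and bed file names are placeholders:

```shell
# Placeholder file names; one output name per input profile.
bounds \
    --input-profiles sample.bigWig \
    --output-names sample \
    --output-directory bounds_out \
    --peaks peaks.bed \
    --chroms chr21 chr22
```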

metrics

usage: metrics [-h] --profileA PROFILEA --profileB PROFILEB
               [--smooth-profileA SMOOTH_PROFILEA [SMOOTH_PROFILEA ...]]
               [--smooth-profileB SMOOTH_PROFILEB [SMOOTH_PROFILEB ...]] [--countsA COUNTSA]
               [--countsB COUNTSB] [--apply-softmax-to-profileA] [--apply-softmax-to-profileB]
               [--metrics-seq-len METRICS_SEQ_LEN] [--peaks PEAKS] [--bounds-csv BOUNDS_CSV]
               [--step-size STEP_SIZE] --chroms CHROMS [CHROMS ...] [--exclude-zero-profiles]
               --output-dir OUTPUT_DIR [--automate-filenames] [--time-zone TIME_ZONE]
               [--other-tags OTHER_TAGS [OTHER_TAGS ...]] --chrom-sizes CHROM_SIZES

Named Arguments

--profileA, -A

the bigWig with ground truth values or a replicate

--profileB, -B

the bigWig with predicted values or the second replicate

--smooth-profileA

a list of two items, sigma and window width, for gaussian smoothing of profileA before computing metrics. An empty list indicates no smoothing.

Default: []

--smooth-profileB

a list of two items, sigma and window width, for gaussian smoothing of profileB before computing metrics. An empty list indicates no smoothing.

Default: []

--countsA

the bigWig with region counts assigned to each base (the counts track that is produced by the predict script). This is optional.

--countsB

the bigWig with region counts assigned to each base (the counts track that is produced by the predict script). This is optional.

--apply-softmax-to-profileA

apply softmax to profileA before computing metrics (in cases where profileA is logits)

Default: False

--apply-softmax-to-profileB

apply softmax to profileB before computing metrics (in cases where profileB is logits)

Default: False

--metrics-seq-len

the length of the sequence over which to compute the metrics

Default: 1000

--peaks

the path to the bed file containing peaks

--bounds-csv

the path to the file containing upper and lower bounds for mnll, cross entropy, jsd, pearson & spearman correlation

--step-size

the step size for genome wide metrics

Default: 50

--chroms, -c

list of test chromosomes to compute metrics

--exclude-zero-profiles

exclude observed or predicted profiles that are all zeros

Default: False

--output-dir, -o

destination directory to store metrics results

--automate-filenames

specify if the metrics output should be stored in a timestamped subdirectory within --output-dir

Default: False

--time-zone

time zone to use for timestamping output directories

Default: “US/Pacific”

--other-tags

list of additional tags to be added as suffix to the filenames

Default: []

--chrom-sizes, -s

path to chromosome sizes file
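
A sketch comparing an observed track against predictions; all file names are placeholders:

```shell
# Placeholder paths; profileA and profileB are bigWig tracks.
metrics \
    --profileA observed.bigWig \
    --profileB predicted.bigWig \
    --chroms chr21 chr22 \
    --chrom-sizes hg38.chrom.sizes \
    --output-dir metrics_out \
    --peaks peaks.bed
```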

shap_scores

usage: shap_scores [-h] --reference-genome REFERENCE_GENOME --input-seq-len INPUT_SEQ_LEN
                   --control-len CONTROL_LEN --model MODEL [--task-id TASK_ID] --bed-file
                   BED_FILE [--sample SAMPLE] [--chroms CHROMS [CHROMS ...]]
                   [--presort-bed-file] [--control-info CONTROL_INFO]
                   [--control-smoothing CONTROL_SMOOTHING [CONTROL_SMOOTHING ...]]
                   [--num-shuffles NUM_SHUFFLES] [--gen-null-dist] [--seed SEED]
                   --output-directory OUTPUT_DIRECTORY [--automate-filenames]
                   [--time-zone TIME_ZONE]

Named Arguments

--reference-genome, -g

path to the reference genome file

--input-seq-len

the length of the input sequence to the model

--control-len

the length of the control input to the model

--model, -m

the path to the model (.h5) file

--task-id, -t

In the multitask case, the integer sequence number of the task for which the interpretation scores should be computed. For single-task models use 0.

Default: 0

--bed-file, -b

the path to the bed file containing positions at which the model should be interpreted

--sample, -s

the number of samples to randomly sample from the bed file. Only one of --sample or --chroms can be used.

--chroms, -c

list of chroms on which the contribution scores are to be computed. If not specified all chroms in --bed-file will be processed.

--presort-bed-file

specify if the --bed-file should be sorted in descending order of enrichment. It is assumed that the --bed-file has ‘signalValue’ in column 7 to use for sorting.

Default: False

--control-info

path to the input json file that has paths to control bigWigs. The --task-id is matched with ‘task_id’ in the json file to get the list of control bigWigs

--control-smoothing

sigma and window width for gaussian 1d smoothing of the control

Default: [7.0, 81]

--num-shuffles

the number of dinucleotide shuffles to perform on each input sequence

Default: 20

--gen-null-dist

generate null distribution of shap scores by using a dinucleotide shuffled input sequence

Default: False

--seed

seed to create a NumPy RandomState object used for performing shuffles

Default: 20210304

--output-directory, -o

destination directory to store the interpretation scores

--automate-filenames

specify if the interpretation output should be stored in a timestamped subdirectory within --output-directory

Default: False

--time-zone

time zone to use for timestamping output directories

Default: “US/Pacific”
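
A sketch of the required arguments; all paths are placeholders, and the input and control lengths shown are illustrative values that must match the trained model:

```shell
# Placeholder paths; lengths must match the model's input and control layers.
shap_scores \
    --reference-genome hg38.fa \
    --model models/bpnet_model.h5 \
    --input-seq-len 3088 \
    --control-len 1000 \
    --bed-file peaks.bed \
    --output-directory shap_out
```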

motif_discovery

usage: motif_discovery [-h] [--scores-path SCORES_PATH] [--scores-locations SCORES_LOCATIONS]
                       [--output-directory OUTPUT_DIRECTORY]
                       [--modisco-window-size MODISCO_WINDOW_SIZE]

Named Arguments

--scores-path

Path to the importance scores hdf5 file

--scores-locations

path to bed file containing the locations that match the scores

--output-directory

Path to the output directory

--modisco-window-size

size of the window around the peak coordinate that will be considered for motif discovery

Default: 400
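
A minimal sketch; the hdf5 and bed file names are hypothetical stand-ins for outputs of the shap_scores step:

```shell
# Hypothetical file names for the importance scores and matching locations.
motif_discovery \
    --scores-path shap_out/scores.h5 \
    --scores-locations shap_out/peaks_valid_scores.bed \
    --output-directory modisco_out
```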

logits2profile

usage: logits2profile [-h] --logits-file LOGITS_FILE --counts-file COUNTS_FILE
                      --output-directory OUTPUT_DIRECTORY --output-filename OUTPUT_FILENAME
                      --peaks PEAKS --chroms CHROMS [CHROMS ...] --chrom-sizes CHROM_SIZES
                      [--window-size WINDOW_SIZE]

Named Arguments

--logits-file

Path to the logits bigWig file that was generated by the predict script

--counts-file

Path to the exponentiated counts bigWig file that was generated by the predict script

--output-directory

Path to the output directory

--output-filename

output file name excluding extension

--peaks

Path to the bed file containing the chromosome coordinates at which the logits to counts conversion should take place

--chroms

list of chroms for the output bigWig header

--chrom-sizes

Path to the chromosome sizes file

--window-size

size of the window around the chromosome coordinate that will be considered for logits to counts conversion

Default: 1000
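
A sketch of a full invocation; the bigWig names are hypothetical stand-ins for outputs of the predict step:

```shell
# Placeholder paths; the logits and counts bigWigs come from the predict step.
logits2profile \
    --logits-file predictions/logits.bigWig \
    --counts-file predictions/exponentiated_counts.bigWig \
    --output-directory profile_out \
    --output-filename sample_profile \
    --peaks peaks.bed \
    --chroms chr21 chr22 \
    --chrom-sizes hg38.chrom.sizes
```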

embeddings

usage: embeddings [-h] --model MODEL --reference-genome REFERENCE_GENOME
                  [--input-layer-name INPUT_LAYER_NAME] --input-layer-shape INPUT_LAYER_SHAPE
                  [INPUT_LAYER_SHAPE ...] [--embeddings-layer-name EMBEDDINGS_LAYER_NAME]
                  [--cropped-size CROPPED_SIZE]
                  [--numbered-embeddings-layers-prefix NUMBERED_EMBEDDINGS_LAYERS_PREFIX]
                  [--num-numbered-embeddings-layers NUM_NUMBERED_EMBEDDINGS_LAYERS]
                  [--flatten-embeddings-layer] --peaks PEAKS [--batch-size BATCH_SIZE]
                  [--output-directory OUTPUT_DIRECTORY] [--output-filename OUTPUT_FILENAME]

Named Arguments

--model, -m

the path to the model (.h5) file

--reference-genome, -g

path to the reference genome fasta file

--input-layer-name

name of the input sequence layer

Default: “sequence”

--input-layer-shape

shape of the input sequence layer (specify a list of values and omit the batch dimension)

--embeddings-layer-name

full name of layer for embeddings output. Cannot be combined with --numbered-embeddings-layers-prefix.

--cropped-size

the size to which all embeddings outputs should be cropped

--numbered-embeddings-layers-prefix

common prefix string of the required layers, used for matching. Cannot be combined with --embeddings-layer-name

--num-numbered-embeddings-layers

number of embeddings layers with common prefix specified by –numbered-embeddings-layers-prefix.

Default: 8

--flatten-embeddings-layer

specify if the embeddings layers should be flattened

Default: False

--peaks

10-column narrowPeak bed file containing the chromosome positions at which to compute embeddings

--batch-size

batch size for processing the chromosome positions

Default: 64

--output-directory

output directory path

Default: “.”

--output-filename

name of compressed numpy file to store the embeddings

Default: “embeddings.h5”
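
A sketch of the required arguments; all paths are placeholders, and the input layer shape shown is an illustrative value that must match the trained model:

```shell
# Placeholder paths; shape is sequence length x 4 one-hot channels.
embeddings \
    --model models/bpnet_model.h5 \
    --reference-genome hg38.fa \
    --input-layer-shape 3088 4 \
    --peaks peaks.narrowPeak \
    --output-directory embeddings_out
```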