Command Line Interface
bpnettrainer
usage: bpnettrainer [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
[--learning-rate LEARNING_RATE] [--min-learning-rate MIN_LEARNING_RATE]
[--early-stopping-patience EARLY_STOPPING_PATIENCE]
[--early-stopping-min-delta EARLY_STOPPING_MIN_DELTA]
[--reduce-lr-on-plateau-patience REDUCE_LR_ON_PLATEAU_PATIENCE]
[--model-arch-name MODEL_ARCH_NAME]
[--sequence-generator-name SEQUENCE_GENERATOR_NAME] [--filters FILTERS]
[--counts-loss-weight COUNTS_LOSS_WEIGHT]
[--control-smoothing CONTROL_SMOOTHING] [--threads THREADS] [--gpus GPUS]
--reference-genome REFERENCE_GENOME --chrom-sizes CHROM_SIZES --chroms
CHROMS [CHROMS ...] [--exclude-chroms EXCLUDE_CHROMS [EXCLUDE_CHROMS ...]]
[--splits SPLITS] [--output-dir OUTPUT_DIR] [--tag-length TAG_LENGTH]
[--time-zone TIME_ZONE] [--automate-filenames]
[--model-output-filename MODEL_OUTPUT_FILENAME]
[--input-seq-len INPUT_SEQ_LEN] [--output-len OUTPUT_LEN]
[--max-jitter MAX_JITTER] [--reverse-complement-augmentation]
[--negative-sampling-rate NEGATIVE_SAMPLING_RATE] --input-data INPUT_DATA
[--stranded] [--has-control] [--sampling-mode {peaks,sequential,random}]
[--shuffle]
Named Arguments
- --batch-size, -b
training batch size
Default: 64
- --epochs, -e
number of training epochs
Default: 100
- --learning-rate, -L
learning rate for Adam optimizer
Default: 0.004
- --min-learning-rate, -l
min learning rate for Adam optimizer
Default: 0.0001
- --early-stopping-patience
patience value for early stopping callback
Default: 5
- --early-stopping-min-delta
minimum change in the validation loss to qualify as an improvement
Default: 0.001
- --reduce-lr-on-plateau-patience
patience value for ReduceLROnPlateau callback
Default: 2
- --model-arch-name
the name of the model architecture that will be used in training (the name that will be used to fetch the model from model_archs)
Default: “BPNet”
- --sequence-generator-name
the name of the sequence generator from mseqgen library that will be used to generate batches of data
Default: “BPNet”
- --filters, -f
number of filters to use in BPNet
Default: 64
- --counts-loss-weight, -w
weight for counts MSE loss
Default: 100.0
- --control-smoothing
Default: [[7.5, 80]]
- --threads, -t
number of parallel threads for batch generation
Default: 10
- --gpus, -p
number of gpus to use
Default: 1
- --reference-genome, -g
path to the reference genome fasta file (required)
- --chrom-sizes, -c
path to chromosome sizes file
- --chroms
master list of chromosomes for the genome
- --exclude-chroms
list of chromosomes to be excluded
Default: []
- --splits, -s
path to json file
- --output-dir, -d
destination directory to store the model
Default: “.”
- --tag-length
length of the alphanumeric tag for the model file name (applies if --automate-filenames option is used)
Default: 6
- --time-zone
time zone to use for timestamping model directories (applies if --automate-filenames option is used)
Default: “US/Pacific”
- --automate-filenames
specify if the model output directory and filename should be auto generated
Default: False
- --model-output-filename
basename of the model file without the .h5 extension (required if --automate-filenames is not used)
Default: “”
- --input-seq-len
length of input DNA sequence
Default: 3088
- --output-len
length of output profile
Default: 1000
- --max-jitter
maximum value for randomized jitter to offset the peaks from the exact center of the input
Default: 128
- --reverse-complement-augmentation
enable reverse complement augmentation
Default: True
- --negative-sampling-rate
number of negatives to sample for every positive peak
Default: 0.0
- --input-data, -i
path to json file containing task information
- --stranded
specify if the input data is stranded or unstranded
Default: False
- --has-control
specify if the input data has controls
Default: False
- --sampling-mode
Possible choices: peaks, sequential, random
Default: “peaks”
- --shuffle
Default: False
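As a rough illustration of how the required and optional flags above combine, the following Python sketch assembles a bpnettrainer invocation as an argument list. Only the flag names come from the listing above; every path, chromosome name, and file name is a hypothetical placeholder, and the helper `build_trainer_command` is not part of the tool. The command is printed rather than executed.

```python
import shlex


def build_trainer_command(reference_genome, chrom_sizes, chroms, input_data,
                          output_dir=".", model_output_filename="",
                          automate_filenames=False, extra_args=()):
    """Assemble a bpnettrainer argument list from documented flags only."""
    # Per the help text, --model-output-filename is required unless
    # --automate-filenames is used.
    if not automate_filenames and not model_output_filename:
        raise ValueError(
            "pass model_output_filename or set automate_filenames=True")

    cmd = ["bpnettrainer",
           "--reference-genome", reference_genome,
           "--chrom-sizes", chrom_sizes,
           "--chroms", *chroms,
           "--input-data", input_data,
           "--output-dir", output_dir]
    if automate_filenames:
        cmd.append("--automate-filenames")
    else:
        cmd += ["--model-output-filename", model_output_filename]
    cmd += list(extra_args)
    return cmd


# Hypothetical paths and values, for illustration only.
cmd = build_trainer_command(
    reference_genome="genome.fa",
    chrom_sizes="genome.chrom.sizes",
    chroms=["chr1", "chr2"],
    input_data="input_data.json",
    output_dir="models",
    model_output_filename="my_model",
    extra_args=["--epochs", "50", "--has-control"],
)
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.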
predict
usage: predict [-h] [--batch-size BATCH_SIZE] [--input-seq-len INPUT_SEQ_LEN]
[--output-len OUTPUT_LEN] [--predict-peaks] --reference-genome REFERENCE_GENOME
--chrom-sizes CHROM_SIZES --chroms CHROMS [CHROMS ...] --input-data INPUT_DATA
[--stranded] [--has-control] [--model MODEL] [--model-name MODEL_NAME]
[--model-dir MODEL_DIR] --output-dir OUTPUT_DIR [--automate-filenames]
[--time-zone TIME_ZONE] [--exponentiate-counts]
[--output-window-size OUTPUT_WINDOW_SIZE]
[--other-tags OTHER_TAGS [OTHER_TAGS ...]]
[--write-buffer-size WRITE_BUFFER_SIZE]
Named Arguments
- --batch-size, -b
test batch size
Default: 64
- --input-seq-len
length of input DNA sequence
Default: 3088
- --output-len
length of output profile
Default: 1000
- --predict-peaks
generate predictions only on the peaks contained in the peaks.bed files
Default: False
- --reference-genome, -g
the path to the reference genome fasta file
- --chrom-sizes, -s
path to chromosome sizes file
- --chroms, -c
list of test chromosomes for prediction
- --input-data, -i
path to json file containing task information
- --stranded
specify if the input data is stranded or unstranded (i.e., in case --has-control is True)
Default: False
- --has-control
specify if the input data has controls
Default: False
- --model, -m
path to the .h5 model file
- --model-name
the name of the model that will be used for predictions
Default: “BPNet”
- --model-dir
directory where .h5 model files are stored
- --output-dir, -o
destination directory to store predictions as a bigWig file
- --automate-filenames
specify if the predictions output should be stored in a timestamped subdirectory within --output-dir
Default: False
- --time-zone
time zone to use for timestamping model directories
Default: “US/Pacific”
- --exponentiate-counts
specify if the predicted counts should be exponentiated before writing to the bigWig files
Default: False
- --output-window-size
size of the central window of the output profile predictions that will be written to the bigWig files
Default: 1000
- --other-tags
list of additional tags to be added as suffix to the filenames
Default: []
- --write-buffer-size
size of the write buffer to store predictions before writing to bigWig files
Default: 10000
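The relationship between --output-len and --output-window-size can be pictured as a central crop of each predicted profile. The sketch below is one plausible reading of "central window", with the crop centred by integer division; how the tool itself handles an odd difference between the two lengths is an assumption. The helper name `central_window` is hypothetical.

```python
def central_window(profile, window_size):
    """Return the central `window_size` values of `profile`."""
    if window_size > len(profile):
        raise ValueError("window_size cannot exceed the profile length")
    start = (len(profile) - window_size) // 2
    return profile[start:start + window_size]


output_len = 1000                   # default --output-len
profile = list(range(output_len))   # stand-in for a predicted profile
window = central_window(profile, 400)
print(window[0], window[-1], len(window))
```

With the default of 1000 for both options, the crop returns the whole profile unchanged.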
bounds
usage: bounds [-h] [--input-profiles INPUT_PROFILES [INPUT_PROFILES ...]]
[--output-names OUTPUT_NAMES [OUTPUT_NAMES ...]] --output-directory
OUTPUT_DIRECTORY --peaks PEAKS [--peak-width PEAK_WIDTH]
[--chroms CHROMS [CHROMS ...]]
[--smoothing-params SMOOTHING_PARAMS [SMOOTHING_PARAMS ...]]
Named Arguments
- --input-profiles
list of input bigWig profiles
Default: []
- --output-names
list of output names for the bounds output corresponding to each of the input profiles
Default: []
- --output-directory
Path to the output directory
- --peaks
Path to the bed file containing the chromosome coordinates. The bed file should have at least 3 columns, the first 3 being ‘chrom’, ‘start’, and ‘end’
- --peak-width
the span of the peak to be considered for bounds computation
Default: 1000
- --chroms, -c
list of chromosomes to be considered from peaks file
- --smoothing-params
sigma and window size for gaussian 1D smoothing of ‘observed’ and ‘predicted’ profiles
Default: [7.0, 81]
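To make the two --smoothing-params values concrete, here is a minimal pure-Python sketch of Gaussian 1D smoothing with a sigma of 7.0 and a window of 81 positions. It is not taken from the tool's source: the zero padding at the edges, the kernel normalization, and the function names are all assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma, window_size):
    """Normalized 1D Gaussian kernel with `window_size` taps."""
    half = (window_size - 1) / 2.0
    weights = [math.exp(-0.5 * ((i - half) / sigma) ** 2)
               for i in range(window_size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma=7.0, window_size=81):
    """Convolve `profile` with the kernel; values beyond the ends count as 0.

    Assumes an odd window_size so that the kernel is centred on each position.
    """
    kernel = gaussian_kernel(sigma, window_size)
    half = window_size // 2
    n = len(profile)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                acc += w * profile[j]
        smoothed.append(acc)
    return smoothed


# A single spike of 100 reads in the middle of a 201-bp profile.
profile = [0.0] * 201
profile[100] = 100.0
smoothed = smooth(profile)
print(round(sum(smoothed), 3), smoothed.index(max(smoothed)))
```

Because the kernel sums to 1, smoothing spreads the spike over neighbouring positions while keeping the total signal the same, as long as the kernel stays inside the profile.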
metrics
usage: metrics [-h] --profileA PROFILEA --profileB PROFILEB
[--smooth-profileA SMOOTH_PROFILEA [SMOOTH_PROFILEA ...]]
[--smooth-profileB SMOOTH_PROFILEB [SMOOTH_PROFILEB ...]] [--countsA COUNTSA]
[--countsB COUNTSB] [--apply-softmax-to-profileA] [--apply-softmax-to-profileB]
[--metrics-seq-len METRICS_SEQ_LEN] [--peaks PEAKS] [--bounds-csv BOUNDS_CSV]
[--step-size STEP_SIZE] --chroms CHROMS [CHROMS ...] [--exclude-zero-profiles]
--output-dir OUTPUT_DIR [--automate-filenames] [--time-zone TIME_ZONE]
[--other-tags OTHER_TAGS [OTHER_TAGS ...]] --chrom-sizes CHROM_SIZES
Named Arguments
- --profileA, -A
the bigWig with ground truth values or a replicate
- --profileB, -B
the bigWig with predicted values or the second replicate
- --smooth-profileA
a list of two items, sigma and window width for gaussian smoothing of profileA before computing metrics. Empty list indicates no smoothing
Default: []
- --smooth-profileB
a list of two items, sigma and window width for gaussian smoothing of profileB before computing metrics. Empty list indicates no smoothing
Default: []
- --countsA
the bigWig with region counts assigned to each base (the counts track that is produced by the predict script). This is optional.
- --countsB
the bigWig with region counts assigned to each base (the counts track that is produced by the predict script). This is optional.
- --apply-softmax-to-profileA
apply softmax to profileA before computing metrics (in cases where profileA is logits)
Default: False
- --apply-softmax-to-profileB
apply softmax to profileB before computing metrics (in cases where profileB is logits)
Default: False
- --metrics-seq-len
the length of the sequence over which to compute the metrics
Default: 1000
- --peaks
the path to the file containing
- --bounds-csv
the path to the file containing upper and lower bounds for mnll, cross entropy, jsd, pearson & spearman correlation
- --step-size
the step size for genome wide metrics
Default: 50
- --chroms, -c
list of test chromosomes to compute metrics
- --exclude-zero-profiles
exclude observed or predicted profiles that are all zeros
Default: False
- --output-dir, -o
destination directory to store metrics results
- --automate-filenames
specify if the metrics output should be stored in a timestamped subdirectory within --output-dir
Default: False
- --time-zone
time zone to use for timestamping output directories
Default: “US/Pacific”
- --other-tags
list of additional tags to be added as suffix to the filenames
Default: []
- --chrom-sizes, -s
path to chromosome sizes file
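The softmax options and one of the metrics named above (jsd) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the base-2 Jensen-Shannon divergence used here is one common definition and may differ from what the metrics script computes (some libraries report its square root, for example). The toy profiles and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(profile):
    """Scale a non-negative profile so that it sums to 1."""
    total = sum(profile)
    return [v / total for v in profile]


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


observed = normalize([1, 4, 10, 4, 1])          # profileA given as counts
predicted = softmax([0.0, 1.4, 2.3, 1.4, 0.0])  # profileB given as logits
print(round(jsd(observed, predicted), 4))
print(round(jsd(observed, observed), 4))
```

In this sketch, the softmax step plays the role of --apply-softmax-to-profileB: it turns the logits into a probability distribution that can be compared with the normalized observed profile.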
shap_scores
usage: shap_scores [-h] --reference-genome REFERENCE_GENOME --input-seq-len INPUT_SEQ_LEN
--control-len CONTROL_LEN --model MODEL [--task-id TASK_ID] --bed-file
BED_FILE [--sample SAMPLE] [--chroms CHROMS [CHROMS ...]]
[--presort-bed-file] [--control-info CONTROL_INFO]
[--control-smoothing CONTROL_SMOOTHING [CONTROL_SMOOTHING ...]]
[--num-shuffles NUM_SHUFFLES] [--gen-null-dist] [--seed SEED]
--output-directory OUTPUT_DIRECTORY [--automate-filenames]
[--time-zone TIME_ZONE]
Named Arguments
- --reference-genome, -g
path to the reference genome file
- --input-seq-len
the length of the input sequence to the model
- --control-len
the length of the control input to the model
- --model, -m
the path to the model (.h5) file
- --task-id, -t
In the multitask case the integer sequence number of the task for which the interpretation scores should be computed. For single task use 0.
Default: 0
- --bed-file, -b
the path to the bed file containing positions at which the model should be interpreted
- --sample, -s
the number of samples to randomly sample from the bed file. Only one of --sample or --chroms can be used.
- --chroms, -c
list of chroms on which the contribution scores are to be computed. If not specified all chroms in --bed-file will be processed.
- --presort-bed-file
specify if the --bed-file should be sorted in descending order of enrichment. It is assumed that the --bed-file has ‘signalValue’ in column 7 to use for sorting.
Default: False
- --control-info
path to the input json file that has paths to control bigWigs. The --task-id is matched with ‘task_id’ in the json file to get the list of control bigWigs
- --control-smoothing
sigma and window width for gaussian 1d smoothing of the control
Default: [7.0, 81]
- --num-shuffles
the number of dinucleotide shuffles to perform on each input sequence
Default: 20
- --gen-null-dist
generate null distribution of shap scores by using a dinucleotide shuffled input sequence
Default: False
- --seed
seed to create a NumPy RandomState object used for performing shuffles
Default: 20210304
- --output-directory, -o
destination directory to store the interpretation scores
- --automate-filenames
specify if the interpretation output should be stored in a timestamped subdirectory within --output-directory
Default: False
- --time-zone
time zone to use for timestamping output directories
Default: “US/Pacific”
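The effect of --presort-bed-file can be pictured with a small sketch that orders narrowPeak-style records by column 7 (signalValue) in descending order. The records below are made up for illustration, and the helper `presort_by_signal` is hypothetical rather than the tool's own code.

```python
def presort_by_signal(bed_lines):
    """Sort narrowPeak-style lines by column 7 (signalValue), descending."""
    rows = [line.rstrip("\n").split("\t") for line in bed_lines if line.strip()]
    # Column 7 is index 6 when counting from zero.
    return sorted(rows, key=lambda row: float(row[6]), reverse=True)


# Hypothetical records: chrom, start, end, name, score, strand, signalValue, ...
bed_lines = [
    "chr1\t100\t600\tpeak_a\t0\t.\t12.5\t-1\t-1\t250\n",
    "chr2\t900\t1400\tpeak_b\t0\t.\t48.0\t-1\t-1\t250\n",
    "chr1\t5000\t5500\tpeak_c\t0\t.\t30.1\t-1\t-1\t250\n",
]
for row in presort_by_signal(bed_lines):
    print(row[3], row[6])
```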
motif_discovery
usage: motif_discovery [-h] [--scores-path SCORES_PATH] [--scores-locations SCORES_LOCATIONS]
[--output-directory OUTPUT_DIRECTORY]
[--modisco-window-size MODISCO_WINDOW_SIZE]
Named Arguments
- --scores-path
Path to the importance scores hdf5 file
- --scores-locations
path to bed file containing the locations that match the scores
- --output-directory
Path to the output directory
- --modisco-window-size
size of the window around the peak coordinate that will be considered for motif discovery
Default: 400
logits2profile
usage: logits2profile [-h] --logits-file LOGITS_FILE --counts-file COUNTS_FILE
--output-directory OUTPUT_DIRECTORY --output-filename OUTPUT_FILENAME
--peaks PEAKS --chroms CHROMS [CHROMS ...] --chrom-sizes CHROM_SIZES
[--window-size WINDOW_SIZE]
Named Arguments
- --logits-file
Path to the logits bigWig file that was generated by the predict script
- --counts-file
Path to the exponentiated counts bigWig file that was generated by the predict script
- --output-directory
Path to the output directory
- --output-filename
output file name excluding extension
- --peaks
Path to the bed file containing the chromosome coordinates at which the logits to counts conversion should take place
- --chroms
list of chroms for the output bigWig header
- --chrom-sizes
Path to the chromosome sizes file
- --window-size
size of the window around the chromosome coordinate that will be considered for logits to counts conversion
Default: 1000
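A common way to turn profile logits and a total count into a per-base counts profile is to apply a softmax over the window and scale the result by the total. The sketch below assumes that this is what the logits to counts conversion amounts to. That assumption is not confirmed by the text above, and the function name and toy values are hypothetical.

```python
import math


def logits_to_profile(logits, total_counts):
    """Scale softmax(logits) by the total counts for the window."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [total_counts * e / norm for e in exps]


logits = [0.0, 1.0, 2.0, 1.0, 0.0]   # toy window of 5 positions
total_counts = 50.0                   # already exponentiated
profile = logits_to_profile(logits, total_counts)
print([round(v, 2) for v in profile], round(sum(profile), 6))
```

Under this reading, the per-base values sum to the total counts for the window, and the logits control only how those counts are distributed across positions.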
embeddings
usage: embeddings [-h] --model MODEL --reference-genome REFERENCE_GENOME
[--input-layer-name INPUT_LAYER_NAME] --input-layer-shape INPUT_LAYER_SHAPE
[INPUT_LAYER_SHAPE ...] [--embeddings-layer-name EMBEDDINGS_LAYER_NAME]
[--cropped-size CROPPED_SIZE]
[--numbered-embeddings-layers-prefix NUMBERED_EMBEDDINGS_LAYERS_PREFIX]
[--num-numbered-embeddings-layers NUM_NUMBERED_EMBEDDINGS_LAYERS]
[--flatten-embeddings-layer] --peaks PEAKS [--batch-size BATCH_SIZE]
[--output-directory OUTPUT_DIRECTORY] [--output-filename OUTPUT_FILENAME]
Named Arguments
- --model, -m
the path to the model (.h5) file
- --reference-genome, -g
path to the reference genome fasta file
- --input-layer-name
name of the input sequence layer
Default: “sequence”
- --input-layer-shape
shape of the input sequence layer (specify list of values and omit the batch(?) dimension)
- --embeddings-layer-name
full name of layer for embeddings output. Cannot be combined with --numbered-embeddings-layers-prefix.
- --cropped-size
the size to which all embeddings outputs should be cropped
- --numbered-embeddings-layers-prefix
common prefix string, of all required layers, for matching. Cannot be combined with --embeddings-layer-name
- --num-numbered-embeddings-layers
number of embeddings layers with common prefix specified by --numbered-embeddings-layers-prefix.
Default: 8
- --flatten-embeddings-layer
specify if the embeddings layers should be flattened
Default: False
- --peaks
10 column bed narrowPeak file containing chromosome positions to compute embeddings
- --batch-size
batch size for processing the chromosome positions
Default: 64
- --output-directory
output directory path
Default: “.”
- --output-filename
name of compressed numpy file to store the embeddings
Default: “embeddings.h5”