Preparing input for a simulation
================================

countErrorsAndQualityScores.py
------------------------------
Counts the frequency of each quality score by base position for correct and
erroneous bases and Ns in aligned reads and calculates empirical probability
tables.

Usage:
    python countErrorsAndQualityScores.py input_withMD.bam maximum_read_length
        results_out [maximum_errors_per_read [sample_size]]

Positional arguments:
    input_withMD.bam         File of mapped reads with MD tags in SAM or BAM
                             format
    maximum_read_length      Length of the longest read in the input
    results_out              Path and name of file to receive output
    maximum_errors_per_read  Reads with more mismatch errors than this are
                             ignored. Defaults to maximum_read_length.
    sample_size              The number of reads to process. Defaults to all.

SAM2origin.py
-------------
Converts read alignments to a list of fragment origins.

Usage:
    SAM2origin.py input.sam output.txt

Positional arguments:
    input.sam    Sorted file of mapped reads in SAM format; - for stdin
    output.txt   Path and name of file to receive output; - for stdout

calcTranscriptFragmentationProbabilities.py
-------------------------------------------
For each input transcript model, calculates the empirical probability of
fragment initiation at each position in the transcript.

Usage:
    python calcTranscriptFragmentationProbabilities.py transcripts.gff3
        origins.txt.gz output.shelf

Positional arguments:
    transcripts.gff3   Transcript models in GFF3 format
    origins.txt.gz     List of read origin counts prepared with SAM2origin.py,
                       compressed with bgzip and indexed with tabix
    output.shelf       Dictionary of read origin probabilities keyed by
                       transcript ID, in shelf format.

get_transcript_coverage_counts.py
---------------------------------
Accumulates counts of reads mapping within annotated transcripts and produces
a list for use by simulateRNA_Seq.py.

Usage:
    python get_transcript_coverage_counts.py genes.gff3 output_counts.txt
        mapped_reads.bam

Positional arguments:
    genes.gff3          Gene or transcript models in GFF3 format
    output_counts.txt   Output file
    mapped_reads.bam    Sorted file of mapped reads in BAM format

get_isoforms_from_coverage.py
-----------------------------
Derives a set of splicing isoforms that can reproduce the pattern of intron
readthrough shown by the read coverage of gene models.

usage:
    python get_isoforms_from_coverage.py [-h] [--version] [--verbose]
        [--in [INPUT]] [--out [OUT]] --coverage COVERAGE --counts COUNTS
        --shelf SHELF

Arguments:
    INPUT      Gene or transcript models in GFF3 format; stdin if - or omitted
    OUT        File for output of isoforms in GFF3 format; stdout if - or
               omitted
    COVERAGE   File of read coverage depths in bedgraph format, compressed
               with bgzip and indexed with tabix
    COUNTS     List of read origin counts prepared with SAM2origin.py,
               compressed with bgzip and indexed with tabix
    SHELF      Dictionary of read origin probabilities keyed by isoform ID,
               in shelf format

Running a simulation
====================

simulateRNA_Seq.py
------------------
Usage:
    python simulateRNA_Seq.py [options] gene_models.gff3
        gene_id-copy_numbers.txt origin_prob_dict.shelf output_filename_base
        probability_files_directory genome.fa

Options:
    -h, --help            show this help message and exit
    -l READ_LENGTH, --length=READ_LENGTH
                          length of a read; default: 38
    -1, --single          single or paired reads; default: paired
    -2, --pair
    --minsize=SIZE_MIN    lower bound for size filter; default: 150
    --lowsize=SIZE_LOWER  lower end of pass range for size filter;
                          default: 175
    --highsize=SIZE_UPPER
                          upper end of pass range for size filter;
                          default: 250
    --maxsize=SIZE_MAX    upper bound for size filter; default: 300
    -i INDEL_RATE, --indelrate=INDEL_RATE
                          probability of an indel error at any position;
                          default: 0
    -s SUBST_RATE, --subrate=SUBST_RATE
                          probability of a substitution error at any
                          position; default: 0
    -N N_RATE, --Nrate=N_RATE
                          probability that a substitution will introduce
                          an N; default: 0

Positional arguments:
    gene_models.gff3             Gene or transcript models in GFF3 format, or
                                 isoforms.gff3 from
                                 get_isoforms_from_coverage.py
    gene_id-copy_numbers.txt     List of gene IDs and desired number of
                                 simulated fragments, as produced by
                                 get_transcript_coverage_counts.py
    origin_prob_dict.shelf       Dictionary of read origin probabilities,
                                 prepared by
                                 calcTranscriptFragmentationProbabilities.py
                                 or get_isoforms_from_coverage.py
    output_filename_base         Path/basename for simulated read files
    probability_files_directory  Directory containing probability files
                                 prepared by countErrorsAndQualityScores.py
    genome.fa                    Genome sequence file

postprocess_simulation.sh
-------------------------
Usage:
    postprocess_simulation.sh output_filename_base genome.fa

Positional arguments:
    output_filename_base   Path/basename for simulated read files
    genome.fa              Genome sequence file

Extras
======

compareMaps.py
--------------
Compares two mappings of the same read set onto the same genome.

Usage:
    python compareMaps.py map1.sam map2.sam output_stem/
        [discard ID suffix (True/False)
        [include secondary alignments (True/False) ]]

Arguments:
    map1.sam                       Mapped reads in SAM or BAM format, sorted
                                   by read name
    map2.sam                       Mapped reads in SAM or BAM format, sorted
                                   by read name
    output_stem/                   Directory for output of classified reads
    discard ID suffix              Discard /1, /2 (True/False); default False
    include secondary alignments   (True/False); default False

truncateSAM.py
--------------
Adjusts a SAM file of true mappings of simulated reads for 3' truncation of
the reads.

usage:
    python truncateSAM.py [-h] [--version] [--verbose] --in INPUT --out OUT
        --length LENGTH

Arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    --verbose, -v         Omit to see only fatal error messages; -v to see
                          warnings; -vv to see warnings and progress messages
    --in INPUT, -i INPUT  Path to the input file; required
    --out OUT, -o OUT     Path to the output file; required
    --length LENGTH, -l LENGTH
                          Target read length; required

trimSAMbyRead.py
----------------
Adjusts a SAM file of true mappings of simulated reads for 3' trimming by
trimmomatic.

usage:
    python trimSAMbyRead.py [-h] [--version] [--verbose] --in INPUT --out OUT
        --length LENGTH

Arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    --verbose, -v         Omit to see only fatal error messages; -v to see
                          warnings; -vv to see warnings and progress messages
    --in INPUT, -i INPUT  Path to the input file; required
    --out OUT, -o OUT     Path to the output file; required
    --length LENGTH, -l LENGTH
                          Path to the trimmomatic log file showing trimmed
                          lengths; required

get_transcript_origin_counts.py
-------------------------------
Extracts the read origin counts for specific transcripts from a genomic read
origins file.

usage:
    python get_transcript_origin_counts.py [-h] [--version] [--verbose]
        [--in [INPUT]] [--out [OUT]] --origins ORIGINS

Arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    --verbose, -v         Omit to see only fatal error messages; -v to see
                          warnings; -vv to see warnings and progress messages
    --in [INPUT], -i [INPUT]
                          Path to the transcript input file; if omitted or -,
                          input is read from stdin
    --out [OUT], -o [OUT]
                          Path to the output file; if omitted or -, output is
                          written to stdout
    --origins ORIGINS, -c ORIGINS
                          Path to the genomic counted origins tabix file

accumulateFragmentationProbabilityProfilesForSelectedTranscripts.py
-------------------------------------------------------------------
For each transcript, extracts origin counts from all files in a list and
outputs stranded probability vectors in a format suitable for R.

usage:
    python accumulateFragmentationProbabilityProfilesForSelectedTranscripts.py
        [-h] [--version] [--verbose] [--in [INPUT]] [--out_dir OUT_DIR]
        [--counts COUNTS [COUNTS ...]]

Arguments:
    -h, --help            show this help message and exit
    --version             show program's version number and exit
    --verbose, -v         Omit to see only fatal error messages; -v to see
                          warnings; -vv to see warnings and progress messages
    --in [INPUT], -i [INPUT]
                          Path to the transcript input gff3 file; if omitted
                          or -, input is read from stdin
    --out_dir OUT_DIR, -o OUT_DIR
                          Path to the output file directory; required
    --counts COUNTS [COUNTS ...], -c COUNTS [COUNTS ...]
                          Space-delimited list of tabix-indexed origin count
                          file names; required
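
As an illustration of what "accumulating origin counts into a probability
vector" can mean, here is a minimal sketch. It is not the code used by the
scripts above: the function name origin_probabilities, the dict-based input
layout, the 0-based positions, and the uniform fallback for transcripts with
no observed origins are all assumptions made for the example.

```python
from collections import Counter


def origin_probabilities(count_tables, length):
    """Sum per-position origin counts and normalize them to probabilities.

    count_tables: iterable of dicts mapping a 0-based transcript position
                  to the number of fragments starting there (hypothetical
                  layout, one dict per origin count file).
    length:       transcript length in bases.
    """
    totals = Counter()
    for table in count_tables:
        totals.update(table)  # adds counts position by position

    counts = [totals.get(pos, 0) for pos in range(length)]
    total = sum(counts)
    if total == 0:
        # Assumption: with no observed origins, fall back to a uniform profile.
        return [1.0 / length] * length
    return [c / float(total) for c in counts]


# Two hypothetical count tables for a 4-base transcript.
profile = origin_probabilities([{0: 3, 2: 1}, {2: 4}], length=4)
print(profile)  # [0.375, 0.0, 0.625, 0.0]
```

The resulting list sums to 1 and could be written out one value per line for
reading into R; strand handling is omitted here and would need a separate
vector per strand.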