Preparing input for a simulation
================================

countErrorsAndQualityScores.py
------------------------------
For aligned reads, counts how often each quality score occurs at each base position, 
separately for correct bases, erroneous bases and Ns, and calculates empirical 
probability tables from the counts.

Usage: python countErrorsAndQualityScores.py input_withMD.bam  maximum_read_length  
		results_out  [maximum_errors_per_read [sample_size]]

Positional arguments:
	input_withMD.bam 		File of mapped reads with MD tags in SAM or BAM format 
	maximum_read_length		Length of the longest read in the input
	results_out				Path and name of file to receive output  
	maximum_errors_per_read	Reads with more mismatch errors than this are ignored. 
							Defaults to maximum_read_length. 
	sample_size				The number of reads to process. Defaults to all.
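
The tally described above can be illustrated with a minimal sketch. It is not the 
script's own code: the function names and the layout of the tables are assumptions 
made for this example, and it assumes ungapped, unclipped reads whose stored 
sequence runs in sequencing order. It uses the MD tag to find mismatched bases, 
then counts quality scores per position for correct bases, errors and Ns, and 
normalizes the counts into empirical probabilities.

```python
import re
from collections import Counter, defaultdict

# One MD token: a run of matches, a deletion (^ plus bases), or one mismatch.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")

def mismatch_offsets(md):
    """Return 0-based offsets of mismatched bases along the aligned read."""
    offsets, offset = [], 0
    for matches, _deletion, mismatch in MD_TOKEN.findall(md):
        if matches:
            offset += int(matches)
        elif mismatch:
            offsets.append(offset)
            offset += 1
        # A deletion consumes reference bases only, so the offset is unchanged.
    return offsets

def tally_read(counts, seq, quals, md):
    """Add one read's quality scores to counts[(kind, position)]."""
    errors = set(mismatch_offsets(md))
    for position, (base, quality) in enumerate(zip(seq, quals)):
        if base == "N":
            kind = "N"
        elif position in errors:
            kind = "error"
        else:
            kind = "correct"
        counts[(kind, position)][quality] += 1

def to_probabilities(counts):
    """Normalize each (kind, position) tally so that it sums to 1."""
    return {
        key: {quality: n / sum(tally.values()) for quality, n in tally.items()}
        for key, tally in counts.items()
    }

# Two made-up 5-base reads: the first has a mismatch and an N at its 3' end.
counts = defaultdict(Counter)
tally_read(counts, "ACGTN", [40, 40, 30, 2, 2], "3A0C0")
tally_read(counts, "ACGTA", [40, 30, 30, 40, 40], "5")
tables = to_probabilities(counts)
print(tables[("correct", 1)])
```

Reads with insertions, soft clipping or reverse-strand alignments would need extra 
handling that this sketch leaves out.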


SAM2origin.py
-------------
Converts read alignments to a list of fragment origins.

Usage: python SAM2origin.py  input.sam  output.txt
Positional arguments:
	input.sam	Sorted file of mapped reads in SAM format; - for stdin  
	output.txt	Path and name of file to receive output; - for stdout
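
Later steps expect the origin list to be compressed with bgzip and indexed with 
tabix. The sketch below only assembles the three command lines so they can be 
inspected before being run. The file names are placeholders, and the tabix column 
numbers (sequence name in column 1, position in column 2) are an assumption about 
the layout of the origins file that should be checked against real output.

```python
import shlex

def origin_index_commands(sam_path, origins_path, seq_col=1, pos_col=2):
    """Build the commands that create and index an origins file.

    seq_col and pos_col are assumed column numbers, not documented values.
    """
    return [
        ["python", "SAM2origin.py", sam_path, origins_path],
        # bgzip replaces origins_path with origins_path + ".gz"
        ["bgzip", origins_path],
        ["tabix", "-s", str(seq_col), "-b", str(pos_col), "-e", str(pos_col),
         origins_path + ".gz"],
    ]

commands = origin_index_commands("mapped.sorted.sam", "origins.txt")
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to subprocess.run(command, check=True) once the column 
numbers have been confirmed.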


calcTranscriptFragmentationProbabilities.py
-------------------------------------------
For each input transcript model, calculates the empirical probability of fragment 
initiation at each position in the transcript.

Usage: python calcTranscriptFragmentationProbabilities.py transcripts.gff3  
		origins.txt.gz  output.shelf
Positional arguments:
	transcripts.gff3	Transcript models in GFF3 format
	origins.txt.gz		List of read origin counts prepared with SAM2origin.py, 
						compressed with bgzip and indexed with tabix
	output.shelf		Dictionary of read origin probabilities keyed by transcript 
						ID, in shelf format.
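
As a rough illustration of the calculation, the sketch below turns per-position 
origin counts into probabilities and stores them keyed by transcript ID. It assumes 
that "shelf format" means a file written with Python's shelve module and that each 
value is a plain list of floats. Both are assumptions, as are the function name, 
the uniform fallback for transcripts without any observed origins, and the example 
counts.

```python
import os
import shelve
import tempfile

def origin_probabilities(counts):
    """Scale per-position origin counts so that they sum to 1."""
    if not counts:
        return []
    total = sum(counts)
    if total == 0:
        # No origins observed: fall back to a uniform profile (an assumption).
        return [1.0 / len(counts)] * len(counts)
    return [count / total for count in counts]

# Hypothetical origin counts along two very short transcripts.
counts_by_transcript = {"tx1": [0, 3, 1, 0], "tx2": [0, 0, 0]}

shelf_path = os.path.join(tempfile.mkdtemp(), "origins_demo.shelf")
with shelve.open(shelf_path) as shelf:
    for transcript_id, counts in counts_by_transcript.items():
        shelf[transcript_id] = origin_probabilities(counts)

# Reopen the shelf and look up one transcript by its ID.
with shelve.open(shelf_path) as shelf:
    tx1 = shelf["tx1"]
    tx2 = shelf["tx2"]
print(tx1)
```

A shelf written by one Python version or dbm backend is not always readable by 
another, so it is safest to create and read the file in the same environment.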


get_transcript_coverage_counts.py
---------------------------------
Accumulates counts of reads mapping within annotated transcripts and produces a 
list for use by simulateRNA_Seq.py.

Usage: python get_transcript_coverage_counts.py genes.gff3 output_counts.txt 
		mapped_reads.bam
Positional arguments:
	genes.gff3			Gene or transcript models in GFF3 format 
	output_counts.txt 	Output file
	mapped_reads.bam	Sorted file of mapped reads in BAM format
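
The idea of accumulating read counts per annotated feature can be sketched as 
follows. This is only an illustration: it reads GFF3 lines (nine tab-separated 
columns, 1-based inclusive coordinates) and counts read start positions supplied 
as simple tuples, whereas the script itself works from a BAM file and may use a 
different counting rule. The function names and the example data are hypothetical.

```python
def parse_gff3_line(line):
    """Extract location and ID from one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, _score, strand, _phase, attributes = fields
    attrs = dict(
        item.split("=", 1) for item in attributes.split(";") if "=" in item
    )
    return {
        "seqid": seqid, "type": ftype, "start": int(start), "end": int(end),
        "strand": strand, "id": attrs.get("ID"),
    }

def count_read_starts(features, read_starts):
    """Count read starts (seqid, 1-based position) falling inside each feature."""
    counts = {feature["id"]: 0 for feature in features}
    for seqid, position in read_starts:
        for feature in features:
            if (feature["seqid"] == seqid
                    and feature["start"] <= position <= feature["end"]):
                counts[feature["id"]] += 1
    return counts

gff3_lines = [
    "chr1\tdemo\tgene\t100\t200\t.\t+\t.\tID=geneA;Name=a",
    "chr1\tdemo\tgene\t300\t400\t.\t-\t.\tID=geneB",
]
features = [parse_gff3_line(line) for line in gff3_lines]
reads = [("chr1", 100), ("chr1", 150), ("chr1", 250), ("chr1", 400), ("chr2", 150)]
coverage_counts = count_read_starts(features, reads)
print(coverage_counts)
```

The nested loop is fine for a handful of features; a real annotation would call 
for an interval index or a sorted sweep.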


get_isoforms_from_coverage.py
-----------------------------
Derives a set of splicing isoforms that can reproduce the pattern of intron 
readthrough shown by the read coverage of gene models.

usage: python get_isoforms_from_coverage.py [-h] [--version] [--verbose]
                                     [--in [INPUT]] [--out [OUT]] --coverage COVERAGE
                                     --counts COUNTS --shelf SHELF
Arguments:
	INPUT		Gene or transcript models in GFF3 format; stdin if - or omitted
	OUT			File for output of isoforms in GFF3 format; stdout if - or omitted
	COVERAGE	File of read coverage depths in bedgraph format, compressed with 
				bgzip and indexed with tabix	
	COUNTS		List of read origin counts prepared with SAM2origin.py, compressed 
				with bgzip and indexed with tabix
	SHELF		Output dictionary of read origin probabilities keyed by isoform ID, 
				in shelf format
                                       

Running a simulation
====================

simulateRNA_Seq.py
------------------
Usage: python simulateRNA_Seq.py  [options] gene_models.gff3 gene_id-copy_numbers.txt 
	origin_prob_dict.shelf output_filename_base probability_files_directory genome.fa

Options:
  -h, --help            show this help message and exit
  -l READ_LENGTH, --length=READ_LENGTH
                        length of a read; default: 38
  -1, --single          simulate single reads
  -2, --pair            simulate paired reads; default
  --minsize=SIZE_MIN    lower bound for size filter; default: 150
  --lowsize=SIZE_LOWER  lower end of pass range for size filter; default: 175
  --highsize=SIZE_UPPER
                        upper end of pass range for size filter; default: 250
  --maxsize=SIZE_MAX    upper bound for size filter; default: 300
  -i INDEL_RATE, --indelrate=INDEL_RATE
                        probability of an indel error at any position;
                        default: 0
  -s SUBST_RATE, --subrate=SUBST_RATE
                        probability of a substitution error at any position;
                        default: 0
  -N N_RATE, --Nrate=N_RATE
                        probability that a substitution will introduce an N;
                        default: 0
Positional arguments:
	gene_models.gff3  			Gene or transcript models in GFF3 format, or 
								isoforms.gff3 from get_isoforms_from_coverage.py
	gene_id-copy_numbers.txt 	List of gene IDs and desired number of simulated 
								fragments, as produced by 
								get_transcript_coverage_counts.py
	origin_prob_dict.shelf 		Dictionary of read origin probabilities, prepared 
								by calcTranscriptFragmentationProbabilities.py or 
								get_isoforms_from_coverage.py
	output_filename_base  		Path/basename for simulated read files
	probability_files_directory Directory containing probability files prepared 
								by countErrorsAndQualityScores.py
	genome.fa 					Genome sequence file
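
The size filter and error rate options can be pictured with the sketch below. 
It is one plausible reading of the option descriptions, not the simulator's code. 
It assumes that fragments inside the pass range always survive, that fragments 
outside the outer bounds never do, and that the chance of survival ramps linearly 
in between. It also assumes that a substitution replaces a base with a different 
base unless it introduces an N. The function names are hypothetical.

```python
import random

def pass_probability(size, size_min=150, size_lower=175,
                     size_upper=250, size_max=300):
    """Chance that a fragment survives an assumed trapezoidal size filter."""
    if size < size_min or size > size_max:
        return 0.0
    if size_lower <= size <= size_upper:
        return 1.0
    if size < size_lower:
        # Rising ramp between the lower bound and the pass range.
        return (size - size_min) / (size_lower - size_min)
    # Falling ramp between the pass range and the upper bound.
    return (size_max - size) / (size_max - size_upper)

def add_substitutions(seq, subst_rate, n_rate, rng):
    """Substitute each base with probability subst_rate.

    A substitution becomes an N with probability n_rate, otherwise a
    different base chosen uniformly from the other three.
    """
    out = []
    for base in seq:
        if rng.random() < subst_rate:
            if rng.random() < n_rate:
                out.append("N")
            else:
                out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)
print(pass_probability(160), pass_probability(200), pass_probability(275))
print(add_substitutions("ACGTACGT", subst_rate=0.25, n_rate=0.1, rng=rng))
```

With the default rates of 0, add_substitutions returns its input unchanged.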


postprocess_simulation.sh
-------------------------
Usage: postprocess_simulation.sh output_filename_base genome.fa

Positional arguments:
	output_filename_base  		Path/basename for simulated read files
	genome.fa 					Genome sequence file


Extras
======

compareMaps.py
--------------
Compares two mappings of the same read set onto the same genome.

Usage: python compareMaps.py map1.sam map2.sam output_stem/ 
	[discard ID suffix (True/False) [include secondary alignments (True/False) ]]
Arguments:
	map1.sam 		Mapped reads in SAM or BAM format, sorted by read name
	map2.sam 		Mapped reads in SAM or BAM format, sorted by read name 
	output_stem/ 	Directory for output of classified reads
	discard ID suffix 	Discard /1, /2 read ID suffixes (True/False); default False
	include secondary alignments 	Include secondary alignments (True/False); 
					default False
	
	
truncateSAM.py
--------------
Adjusts a SAM file of true mappings of simulated reads for 3' truncation of the reads.

usage: python truncateSAM.py [-h] [--version] [--verbose] --in INPUT --out OUT 
		--length LENGTH
Arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --verbose, -v         Omit to see only fatal error messages; -v to see
                        warnings; -vv to see warnings and progress messages
  --in INPUT, -i INPUT  Path to the input file; required
  --out OUT, -o OUT     Path to the output file; required
  --length LENGTH, -l LENGTH
                        Target read length; required
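
Why the mapping needs adjusting can be seen from a small sketch. SAM records the 
leftmost aligned reference position, and for a read on the reverse strand that 
position belongs to the read's 3' end, so cutting bases off the 3' end moves the 
start of the alignment. The function below is a hypothetical illustration limited 
to ungapped alignments with a single match block. It is not the script's 
implementation, which presumably also has to deal with spliced alignments.

```python
import re

def truncate_ungapped(pos, cigar, is_reverse, target_length):
    """Return (pos, cigar) after cutting the read's 3' end to target_length.

    Sketch only: accepts a single match block such as "38M".
    """
    match = re.fullmatch(r"(\d+)M", cigar)
    if match is None:
        raise ValueError("only ungapped alignments are handled: " + cigar)
    length = int(match.group(1))
    removed = max(0, length - target_length)
    # SAM POS is the leftmost aligned reference base. For a reverse-strand
    # read that base is the read's 3' end, so trimming shifts POS rightwards.
    new_pos = pos + removed if is_reverse else pos
    return new_pos, "{}M".format(length - removed)

print(truncate_ungapped(100, "38M", False, 30))  # (100, '30M')
print(truncate_ungapped(100, "38M", True, 30))   # (108, '30M')
```

The SEQ and QUAL fields would be shortened to match: from the right-hand end for 
forward-strand reads and from the left-hand end for reverse-strand reads, because 
SAM stores both in reference orientation.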


trimSAMbyRead.py
----------------
Adjusts a SAM file of true mappings of simulated reads for 3' trimming by Trimmomatic.

usage: python trimSAMbyRead.py [-h] [--version] [--verbose] --in INPUT --out OUT 
		--length LENGTH
Arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --verbose, -v         Omit to see only fatal error messages; -v to see
                        warnings; -vv to see warnings and progress messages
  --in INPUT, -i INPUT  Path to the input file; required
  --out OUT, -o OUT     Path to the output file; required
  --length LENGTH, -l LENGTH
                        Path to the Trimmomatic log file showing trimmed lengths 
                        (a file path, not a number, despite the option name); 
                        required
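
For orientation, a Trimmomatic trim log has one line per read containing the read 
name, the surviving length, the number of bases trimmed from the start, the 
position of the last surviving base and the number of bases trimmed from the end. 
The sketch below shows one way to read such lines into a lookup of surviving 
lengths. The function names and example lines are made up, and whether 
trimSAMbyRead.py parses the log in this way is not something the usage text 
confirms.

```python
def parse_trimlog_line(line):
    """Split one trim log line into its five fields."""
    # Read names may contain spaces, so split the four numbers off the right.
    name, length, start, last, end_trim = line.rstrip("\n").rsplit(" ", 4)
    return {
        "name": name,
        "length": int(length),
        "trimmed_start": int(start),
        "last_base": int(last),
        "trimmed_end": int(end_trim),
    }

def trimmed_lengths(lines):
    """Map each read name to its surviving length."""
    return {
        record["name"]: record["length"]
        for record in map(parse_trimlog_line, lines)
    }

log_lines = [
    "read1/1 30 0 30 8",
    "read2 1:N:0:ACGT 38 0 38 0",
]
lengths = trimmed_lengths(log_lines)
print(lengths)
```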


get_transcript_origin_counts.py
-------------------------------
Extracts the read origin counts for specific transcripts from a genomic read origins file.

usage: python get_transcript_origin_counts.py [-h] [--version] [--verbose]
                                  [--in [INPUT]] [--out [OUT]] --origins ORIGINS
Arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --verbose, -v         Omit to see only fatal error messages; -v to see
                        warnings; -vv to see warnings and progress messages
  --in [INPUT], -i [INPUT]
                        Path to the transcript input file; if omitted or -, input 
                        is read from stdin
  --out [OUT], -o [OUT]
                        Path to the output file; if omitted or -, output is 
                        written to stdout
  --origins ORIGINS, -c ORIGINS
                        Path to the genomic counted origins tabix file


accumulateFragmentationProbabilityProfilesForSelectedTranscripts.py
-------------------------------------------------------------------
For each transcript, extracts origin counts from all files in a list and outputs 
stranded probability vectors in a format suitable for R.

usage: python accumulateFragmentationProbabilityProfilesForSelectedTranscripts.py
       [-h] [--version] [--verbose] [--in [INPUT]] [--out_dir OUT_DIR] 
       [--counts COUNTS [COUNTS ...]]
Arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --verbose, -v         Omit to see only fatal error messages; -v to see
                        warnings; -vv to see warnings and progress messages
  --in [INPUT], -i [INPUT]
                        Path to the transcript input gff3 file; if omitted or
                        -, input is read from stdin
  --out_dir OUT_DIR, -o OUT_DIR
                        Path to the output file directory; required
  --counts COUNTS [COUNTS ...], -c COUNTS [COUNTS ...]
                        Space-delimited list of tabix-indexed origin count file 
                        names; required