Preparing and running a RNA-Seq read simulation Packaged with RNA-Seq-Simulator is a small example data set that you can use to test your installation. The sequence of commands shown below will produce pairs of simulated reads 50 bases long with 0.5% base error rate in the simulation directory. You can choose to ignore intron retention or to simulate the intron retention in the example read set. In addition to the RNA-Seq-Simulator package, you will need bedtools, bam2fastq, samtools, bgzip, and tabix on your PATH. 0. Gather the input data: A genome sequence in FASTA format, e.g. example_data/genome.fa A set of mapped reads in BAM format, e.g. example_data/reads.bam. The reads should be sorted by position; running samtools index example_data/reads.bam is a quick way to check. The reads don't need to be of uniform length, but you will need to known the maximum read length. The reads should have MD tags; if they don't, run samtools fillmd example_data/reads.bam example_data/genome.fa > example_data/reads.md.bam and use example_data/reads.md.bam in place of example_data/reads.bam. A set of gene (or transcript) models in GFF3 format, e.g. example_data/genes.gff3 1. Extract the empirical read error and quality distributions: python countErrorsAndQualityScores.py example_data/reads.bam max_readlength example_data/errors_and_quals.txt 2. Extract the empirical read origin distribution: samtools view example_data/reads.bam | python SAM2origin.py - - | bgzip -c > example_data/origins.txt.gz tabix -s1 -b2 -e2 -0 example_data/origins.txt.gz 3. Prepare the transcript read-numbers file: python get_transcript_coverage_counts.py example_data/genes.gff3 example_data/copy-numbers.txt example_data/reads.bam 4A. Calculate read origin probability vectors: python calcTranscriptFragmentationProbabilities.py example_data/genes.gff3 example_data/origins.txt.gz example_data/transcripts.origin-prob.shelf OR 4B. To simulate partial intron retention, create pseudo-isoforms: genomeCoverageBed -bga -ibam example_data/reads.bam -split | bgzip -c > example_data/reads.coverage.wig.gz tabix -s1 -b2 -e3 -S0 -0 example_data/reads.coverage.wig.gz python get_isoforms_from_coverage.py --in example_data/genes.gff3 --out example_data/isoforms.gff3 --coverage example_data/reads.coverage.wig.gz --counts example_data/origins.txt.gz --shelf example_data/isoform.origin-prob.shelf 5. Run the simulation: python simulateRNA_Seq.py -l 50 -2 --minsize 200 --lowsize 225 --highsize 300 --maxsize 350 -s 0.005 -N 0.2 example_data/genes.gff3 example_data/copy-numbers.txt example_data/transcripts.origin-prob.shelf simulation/simul-reads example_data example_data/genome.fa OR python simulateRNA_Seq.py -l 50 -2 --minsize 200 --lowsize 225 --highsize 300 --maxsize 350 -s 0.005 -N 0.2 example_data/isoforms.gff3 example_data/copy-numbers.txt example_data/isoform.origin-prob.shelf simulation/simul-reads example_data example_data/genome.fa 6. Postprocess the simulated reads: ./postprocess_simulation.sh simulation/simul-reads example_data/genome.fa To view your simulated reads with IGV, GBrowse, or another genome browser, first sort and index them: samtools view -uS simulation/simul-reads.true_mappings.sam | samtools sort - simulation/simul-reads.true_mappings samtools index simulation/simul-reads.true_mappings.bam