Preparing and running a RNA-Seq read simulation

Packaged with RNA-Seq-Simulator is a small example data set that you can use to test your installation. The sequence of commands shown below will produce pairs of simulated reads 50 bases long with 0.5% base error rate in the simulation directory. You can choose to ignore intron retention or to simulate the intron retention in the example read set.

In addition to the RNA-Seq-Simulator package, you will need bedtools, bam2fastq,  samtools, bgzip, and tabix on your PATH.

0. Gather the input data:
    A genome sequence in FASTA format, e.g. example_data/genome.fa
    A set of mapped reads in BAM format, e.g. example_data/reads.bam. The reads should be sorted by position; running
        samtools index example_data/reads.bam
is a quick way to check.  The reads don't need to be of uniform length, but you will need to known the maximum read length. The reads should have MD tags; if they don't, run 
        samtools fillmd example_data/reads.bam example_data/genome.fa > example_data/reads.md.bam 
and use example_data/reads.md.bam in place of example_data/reads.bam.
    A set of gene (or transcript) models in GFF3 format, e.g. example_data/genes.gff3

1. Extract the empirical read error and quality distributions:
    python countErrorsAndQualityScores.py example_data/reads.bam max_readlength example_data/errors_and_quals.txt

2. Extract the empirical read origin distribution:
    samtools view example_data/reads.bam | python SAM2origin.py - - | bgzip -c > example_data/origins.txt.gz
    tabix -s1 -b2 -e2 -0 example_data/origins.txt.gz

3. Prepare the transcript read-numbers file:
    python get_transcript_coverage_counts.py example_data/genes.gff3 example_data/copy-numbers.txt example_data/reads.bam

4A. Calculate read origin probability vectors:
     python calcTranscriptFragmentationProbabilities.py example_data/genes.gff3 example_data/origins.txt.gz example_data/transcripts.origin-prob.shelf

 OR

4B. To simulate partial intron retention, create pseudo-isoforms:
     genomeCoverageBed -bga -ibam example_data/reads.bam -split | bgzip -c > example_data/reads.coverage.wig.gz
     tabix -s1 -b2 -e3 -S0 -0 example_data/reads.coverage.wig.gz
     python get_isoforms_from_coverage.py --in example_data/genes.gff3 --out example_data/isoforms.gff3 --coverage example_data/reads.coverage.wig.gz --counts example_data/origins.txt.gz --shelf example_data/isoform.origin-prob.shelf

 5. Run the simulation:
     python simulateRNA_Seq.py  -l 50 -2 --minsize 200 --lowsize 225 --highsize 300 --maxsize 350 -s 0.005 -N 0.2 example_data/genes.gff3 example_data/copy-numbers.txt example_data/transcripts.origin-prob.shelf simulation/simul-reads example_data example_data/genome.fa
 OR
     python simulateRNA_Seq.py  -l 50 -2 --minsize 200 --lowsize 225 --highsize 300 --maxsize 350 -s 0.005 -N 0.2 example_data/isoforms.gff3 example_data/copy-numbers.txt example_data/isoform.origin-prob.shelf simulation/simul-reads example_data example_data/genome.fa

6. Postprocess the simulated reads:
    ./postprocess_simulation.sh simulation/simul-reads example_data/genome.fa
    
To view your simulated reads with IGV, GBrowse, or another genome browser, first sort and index them:
samtools view -uS simulation/simul-reads.true_mappings.sam | samtools sort - simulation/simul-reads.true_mappings 
samtools index simulation/simul-reads.true_mappings.bam