pIRS (profile basd Illumina pair-end Reads Simulator) Contents ======== 1. Introduction 2. Program framework 3. Usage 4. Output files 5. Output file format 6. Note 1 Introduction ============== pIRS is a program for simulating Illumina PE reads, with a series of characters generated by Illumina sequencing platform, such as insert size distribution, sequencing error(substitution, insertion, deletion), quality score and GC content-coverage bias. As all know, the insert size follows a normal distribution, so users should set the mean value and standard deviation. We usually set the standard deviation as 1/20 of the mean value. We simulate the normal distribution by Box-Muller method. The program simulates sequencing error, quality score and GC content-coverage bias according to the empirical distribution profile. We provide some default profiles counted from lots of real sequencing data. To simulate reads from diploid genome, users should simulate the diploid genome sequence firstly by setting the ratio of heterozygosis SNP, heterozygosis InDel and structure variation. 2 Program framework =================== 2.1 Profile Generator Six tools are supplied. SOAP2 or BWA, soap.coverage (http://soap.genomics.org.cn/soapaligner.html) are required. The full process is shown in getprofile.sh.example as an example. 2.1.1 GC%-Depth Profile Stat. a). Run soap and soap.coverage to get .depth single file(s). gzip is OK to over it. b). Run gc_coverage_bias on all depth single files. You will get gc-depth stat by 1 GC% and other files. c). Run gc_coverage_bias_plot on the gc-depth stat file. You'll get PNG plot and a .gc file by 5 GC%. d). Manually check the .gc file for any abnormal levels due to the lower depth on certain GC% windows. 2.1.2 Base-Calling Profile Stat: a). Run soap or bwa to get .{soap,single} or .sam file(s). b). Run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix . c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files. However, it is up to you to keep the read length matches. 2.1.3 InDel Profile Stat: a). Choose samples with NO polymorphism InDel, such as the Coliphage samples that shipped with Illumina Sequencers. b). Run bwa to get .sam/.bam file. c). Run indelstat_sam_bam to get the profile. 2.1.4 Insert size & mapping ratio stat: a). Run soap or bwa to get .{soap,single} or .sam file(s). b). Run alignment_stator *. * alignment_stator cannot stat. mapping ratio for sam files now. 2.2 Simulator Two commands: pirs diploid: use for generating diploid genome sequence. Read the input genome sequence and then simulate SNP, InDel, SV(structure variation) on it. At last, output the result genome sequence. pirs simulate: use for simulating Illumina data, output PE-read file. Note: a) If you only want to simulate the diploid genome sequence, "pirs diploid" is enough. b) If you want to simulate sequencing data of haploid genome, only you need is "pirs simulate". c) If you want to simulate sequencing data of diploid genome, you first need to run "pirs diploid" to get the other diploid genome sequence, and then run "pirs simulate" using both the original genome sequence and the previous output sequence as the input. 3 Usage ======= pirs [option] diploid generate diploid genome. simulate simulate Illumina reads. 3.1 pirs diploid Generate the diploid genome sequence Usage: pirs diploid [OPTIONS] -i input reference sequence (file.fa / file.fa.gz) -s heterozygous SNP rate (default=0.001) -a value of transition divided by transvertion for heterSNP (default=2) -d small heterozygous InDel rate (default=0.0001) -v structure variation (large insertion, deletion, invertion) rate (default=1e-6) -c output file type, 0: text, 1: gz compressed (default=1) -o output prefix (default=ref_sequence) -h output help information and exit 3.2 pirs simualte Simulate sequencing data of Illumina sequencing platform Usage: pirs simualte [OPTIONS] -i input reference sequence (file.fa / file.fa.gz) -I another input reference for diploid genome (generated by "pirs diploid", set only when simulating reads of diploid genome) -s input Base-calling profile (default=$Bin/Profiles/Base-Calling_Profiles/humNew.PE100.matrix.gz) -d input GC content-coverage depth(GC% depth) profile for simulating GC content-coverage bias, the default GC content-coverage bias profile are determined based on the twice of read length -b input InDel-error profile,input InDel-error profile for simulating InDel-error of reads, (default=(exe_path)/Profiles/InDel_Profiles/phixv2.InDel.matrix) -l read length, read1 and read2 are the same length (default=100) -x sequencing coverage depth (default=5) -m average insert size (default=500) -v the standard deviation of insert size (default=5% of average insert size) -a simulate reads InDel-error, 0:no, 1:yes. (default=1) -g simulate GC content-coverage bias, 0: no, 1: yes (default=1) -q simulate quality value, 0: no(fasta), 1: yes(fastq) (default=1) -M simulate quality value by Quality-transition mode, 0:no, 1:yes, (default:1) -Q ASCII shift of quality value, generally 64 or 33 for Illumina data, (default:64) -E mode of read-end masking by EAMSS algorithm, 0:none, 1:Quality=2, 2:lowercase Base, (default:0) -f cyclize insert fragment (influence on PE reads direction) 0:read1-forward, read2-reverse, 1: contrary to parameter 0 (default=0) -c output file type, 0: text, 1: gz compressed (default=1) -o output prefix (default=Illumina) -h output help information and exit 4 Output files ============== 4.1 Simulation result of diploid genome sequence. Example command line: pirs diploid -i Human_ref.fa -s 0.001 -a 2 -d 0.00001 -v 0.000001 -o Human >simulate_seq.o 2>simulate_seq.e Output files: a) Human.snp.indel.invertion.fa.gz: another diploid genome sequence. b) Human_indel.lst: InDel information list. c) Human_snp.lst:SNP information list. d) Human_invertion.lst:invertion information list. e) simulate_seq.o, simulate_seq.e: records of the program running information. 4.2 Simulation result of Illumina sequencing data for haploid genome. Example command line: pirs simualte -i Human_ref.fa -m 170 -l 90 -x 5 -v 10 -o Human >simulate_170.o 2>simulate_170.e Output files: a) Human_90_170_1.fq.gz, Human_90_170_2.fq.gz: the pair-end read files. b) Human_90_170.error_rate.distr: the error distribution file. c) Human_90_170.insertsize.distr: the insert size distribution file. d) Human_90_170.read.info.gz: record the information of every reads. e) simulate_170.o, simulate_170.e: record the running information of program. 4.3 Simulation result of Illumina sequencing data for diploid genome. Example command line: pirs simualte -i Human_ref.fa -I Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10 -o Human >simulate_800.o 2>simulate_800.e Output files: a) Human_70_800_1.fq.gz, Human_70_800_2.fq.gz: the pair-end read files. b) Human_70_800.error_rate.distr: the error distribution file. c) Human_70_800.insertsize.distr: the insert size distribution file. d) Human_70_800.read.info.gz: record the information of every reads. e) simulate_800.e, simulate_800.o: records of the program running information. 5 Output file format ==================== a) *.fq/*.fq.gz For example: the output file name: *_70_800_1.fq.gz, it mean 70bp reads, the average length of insert size is 800bp. In read file1: @read_800_21/1 ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG + egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb @read_800_22/1 CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT + gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg note: In Row-1, "800" represent for the mean insert size; "21" represent for the order of read in file; The next "1" represent for read file1. b) *.read.info.gz Column 1: reads ID. Column 2: "1" represent for this read generates from sequence which is input by option -i, "2" represent for this read generates from the reference sequence which is input by -I; for simulated data of haploid genome, it will be always "1". Column 3: chromosome sequence ID. Column 4: position(1-based) in chromosome. Column 5: "+" forward direction; "-" reverse direction. Column 6: real insert size. Column 7: length of read-end masking by EAMSS algorithm. Column 8: read-position(1-based) of substitution error and raw base(->)error base. Column 9: read-position(1-based) of insertion error and the base of insertion. Column 10: read-position(1-based) of deletion error and the base of deletion c) *_snp.lst For example: I 3 G A I 45342 C T I 104775 C T ..... Column 1: chromosome sequence ID. Column 2: position(1-based) of SNP in raw chromosome. Column 3: base of raw chromosome sequence. Column 4: the base of SNP. I 3 G A Ref: a t G c a // 3 is the position of G base in the chromosome base. : a t A c a // G base is a A in another diploid sequence. d) *_indel.lst For example: I - 3 1 C IV + 5 2 AC ..... Column 1: chromosomesequence ID. Column 2: "-" Deletion; "+" Insertion. Column 3: position(1-based) of InDel in raw chromosome sequence. Note that this position represent for the prev-base of InDel region. Column 4: the base number of InDel. Column 5: the base of InDel. I - 3 1 C Ref: a t G c a // 3 is the position of G base in the chromosome base. : a t G - a // following the G base is a deletion of "C" base. IV + 5 2 A C Ref: t t g c A - - g t t// 5 is the position of A base in the chromosome base. : t t g c A A C g t t// following the A base is a insertion of "AC" bases. e) *_invertion.lst For example: I 50191 100 I 948984 200 Column 1: chromosome sequence ID. Column 2: position(1-based) of invertion in raw chromosome sequence. Note that this position represent for the prev-base of invertion region. Column 3: length of invertion. 6 Note ====== a) When simulating InDel in genome: they go halves on the total rate respectively, 1~6bp bases as the number of InDel, and the rate distribution of each type as below: 1bp-64.82%, 2bp-17.17%, 3bp-7.20%, 4bp-7.29%, 5bp-2.18%, 6bp-1.34%, which is the empirical distribution from panda re-sequencing data, we can set the total InDel rate with option -d. When simulating SV (structural variation): insertion, deletion and invertion, each of them share 1/3 of total rate, here we use this rate distribution for simplicity: 100bp-70%, 200bp-20%, 500bp-7%, 1000bp-2%, 2000bp-1%, we can set the total SV rate with option -v. b) Program do not simulate reads contain "N" char. If the input reference contains "N" char, reads which generated in this region will be discarded. c) The allowed max length of read depends on the cycle number of profile. User should set read length less than half of cycle number. d) When masking quality values, the program using the same EAMSS algorithm from CASAVA v1.8.0 . e) The program parses one chromosome at a time, so dose the output. In diploid mode, each reference file is sampled with half depth. f) Random-seed in this program is set with system time, so when you need to simulate data in batches,it needs a interval for each task(such as one second). For update & support, please refer to http://code.google.com/p/pirs/ .