PeakSeq Version 1.01
Paper by Joel Rozowsky et al.
Coded in C by Theodore Gibson.

This is the program described in "PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to Controls" by Rozowsky et al.

To run it, first open the file config.h. This file contains all parameters to the program as well as all input and output file locations and filenames. Suggested values for the parameters are already listed in the file.

The parameters are as follows:

READ_LENGTH: The assumed length of a single read. Used for interpretation of both ChIP-Seq and simulated data.
W_SIZE: The size in nucleotides of a single window to be considered separately in the simulation and analysis.
MAX_GAP: The maximum gap in nucleotides allowed between peaks for them to be merged together. Hits with greater separations are not merged.
MIN_FDR: The required false discovery rate.
N_SIMS: The number of simulations per window used to estimate the FDR.
MIN_CHR: The lowest chromosome analysed. (23 is X, 24 is Y, 25 is M.)
MAX_CHR: The highest chromosome analysed. (23 is X, 24 is Y, 25 is M.)
W_PER_C: The number of windows in a single chromosome. Any windows with 0 reads will be skipped.
BIN_SIZE: Bin size for linear regression.
BIN_SIZE_M: Bin size for linear regression of chromosome M.
MAX_COUNT: The maximum number of reads that will be counted that begin at the same nucleotide position.
EXTENDED_REGION_SIZE: The amount by which regions are extended on each side when extended regions are used.
PVAL_THRESH: The p-value threshold for a peak to be written to the final file.

The input and output filenames are as follows. Files with "prefixes" and "suffixes" are files that differ between chromosomes. These files are assumed to be named FILENAME_PREFIX + chr# + FILENAME_SUFFIX. For example, a typical sgr file would be "PolII.chr12.sgr" or "PolII.chrY.sgr".

ELAND_PREFIX / ELAND_SUFFIX: The ChIP-Seq data.
SGR_PREFIX / SGR_SUFFIX: The sgr file corresponding to the same data as the Eland file.
MAP_FILENAME: The file containing the number of reads in each window to be analyzed. This data is used to compute the fraction of mappable nucleotides in that window.
INPUT_PREFIX / INPUT_SUFFIX: The control Eland file.
OUTPUT: The output file for all peaks, sorted by starting position. File format:
    Column 1: The chromosome number.
    Column 2: The start position of the peak on this chromosome.
    Column 3: The stop position of the peak on this chromosome.
    Column 4: The number of reads from the Eland file located in the peak.
    Column 5: The adjusted number of reads from the control input file located in the peak.
    Column 6: The enrichment of the peak (reads from sample file / (reads from control file * scaling factor)).
    Column 7: The excess of the peak (reads from sample file - reads from control file).
    Column 8: The p-value.
FINAL: The output file for all peaks after BH correction, sorted by the resulting q-values. Peaks with q-values of 0 (smaller than can be computed) are sorted by z-score. The file format is identical to that of OUTPUT, except that column 8 contains the q-value instead of the p-value.

Once the config.h file has been edited, type "make all" in the terminal window. Once the program compiles, type "./PeakSeq_v1.01" to run it. After completion, the output will be in the two files named in the OUTPUT and FINAL fields of the config.h file.

Header files are arranged in the following order:
1) Public structure and type definitions.
2) Public function prototypes.

Implementation files are arranged in the following order:
1) Private structure and type definitions
2) Private function prototypes
3) Public functions
4) Private functions

Within each category all definitions are in the order of first use in the program.

A brief explanation of the contents of each module:

analyze.c / analyze.h: This module deals with output to the FINAL file.
It sorts all the peaks by p-value and adjusts the p-value using BH correction for multiple hypothesis testing.

config.h: This module contains all parameters for the program.

filter.c / filter.h: This module deals with the output to the OUTPUT file. As each peak is found, it counts the number of reads from both the sample Eland and the control input files located within the peak and outputs statistics based on these numbers to the OUTPUT file.

io.c / io.h: This module deals with input/output concerns. It deals with both directions of the conversion from chr1 - chrM format to 1-25 format for chromosome numbering as well as scanning lines in a file.

main.c: The main program. All functions are called from this module.

random.c / random.h: This module deals with the random number generator. Code was copied and comments were paraphrased from the following computer science textbook:
    Roberts, Eric. Programming Abstractions in C: a second course in computer science. Reading, Massachusetts: Addison Wesley Longman, Inc., 1998.

simulator.c / simulator.h: This module deals with the simulated data and the simulation. It also finds the threshold for each window based on the simulations.

util.c / util.h: This module deals with memory management and program termination. It is Copyright (C) 2008 by Michael Fischer and was distributed for use in his 2008 Spring CS 223 course.

window.c / window.h: This module deals with all data that differs between windows, including number of reads, the locations of the peaks, and the mappability fractions.
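
As an illustration of the BH (Benjamini-Hochberg) correction mentioned for the analyze module, the sketch below turns a list of p-values, already sorted in ascending order, into q-values. It is not taken from analyze.c. The function name bh_adjust and its signature are hypothetical, and the actual module may differ in details such as tie handling or how q-values of 0 are treated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from analyze.c): standard Benjamini-Hochberg
 * adjustment.
 *
 * pvals: n p-values, assumed to be sorted in ascending order.
 * qvals: output array of n q-values, in the same order as pvals.
 */
void bh_adjust(const double *pvals, double *qvals, size_t n)
{
    double running_min = 1.0;   /* q-values are capped at 1 */
    size_t rank;

    assert(pvals != NULL && qvals != NULL);

    /*
     * Walk from the largest p-value (rank n) down to the smallest (rank 1).
     * Each q-value is the smallest value of p * n / rank seen at this rank
     * or any higher rank, which keeps the q-values non-decreasing.
     */
    for (rank = n; rank > 0; rank--) {
        double scaled = pvals[rank - 1] * (double)n / (double)rank;
        if (scaled < running_min)
            running_min = scaled;
        qvals[rank - 1] = running_min;
    }
}
```

For example, the sorted p-values 0.01, 0.02, 0.03 and 0.5 give q-values of 0.04, 0.04, 0.04 and 0.5. A peak list would then be reported in order of these q-values, as described for the FINAL file.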