PeakSeq
Version 1.01
Paper by Joel Rozowsky, et al.
Coded in C by Theodore Gibson.


This is the program described in "PeakSeq: Systematic Scoring of ChIP-Seq
Experiments Relative to Controls" by Rozowsky et al.

To run it, first open the file config.h.  This file contains all parameters to
the program as well as all input and output file locations and filenames.  The
suggested values for the parameters are already listed in the file.  The
parameters are as follows:

READ_LENGTH: The assumed length of a single read.  Used for interpretation of
	both ChIP-Seq and simulated data.
W_SIZE: The size in nucleotides of a single window to be considered separately
	in the simulation and analysis.
MAX_GAP: The maximum gap in nucleotides allowed between peaks for them to be
	merged together.  Hits with greater separations are not merged.
MIN_FDR: The required false discovery rate.
N_SIMS: The number of simulations per window to estimate the FDR.
MIN_CHR: The lowest chromosome analysed. (23 is X, 24 is Y, 25 is M.)
MAX_CHR: The highest chromosome analysed. (23 is X, 24 is Y, 25 is M.)
W_PER_C: The number of windows in a single chromosome.  Any windows with 0
	reads will be skipped.
BIN_SIZE: Bin size for linear regression.
BIN_SIZE_M: Bin size for linear regression of chromosome M.
MAX_COUNT: The maximum number of reads that will be counted that begin at the
	same nucleotide position.
EXTENDED_REGION_SIZE: The amount on each side that regions are extended when
	extended regions are used.
PVAL_THRESH: The threshold p-value for a peak to be written to the final
	file.
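
The merging rule described for MAX_GAP can be sketched as follows.  This is a
minimal illustration, not the code used in PeakSeq: the Region type and the
merge_regions function are hypothetical names, the regions are assumed to be
sorted by start position, and the gap is assumed to be measured from the stop
of one region to the start of the next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical region type; not taken from the PeakSeq sources. */
typedef struct {
    long start;  /* first nucleotide of the region */
    long stop;   /* last nucleotide of the region */
} Region;

/*
 * Merge regions in place.  The input must be sorted by start position.
 * Two neighbours are merged when the gap between them
 * (next.start - current.stop) is at most max_gap.
 * Returns the number of regions left after merging.
 */
size_t merge_regions(Region *r, size_t n, long max_gap)
{
    assert(max_gap >= 0);
    if (n == 0)
        return 0;

    size_t out = 0;  /* index of the region currently being extended */
    for (size_t i = 1; i < n; i++) {
        if (r[i].start - r[out].stop <= max_gap) {
            /* Close enough: extend the current region if needed. */
            if (r[i].stop > r[out].stop)
                r[out].stop = r[i].stop;
        } else {
            /* Too far apart: start a new output region. */
            r[++out] = r[i];
        }
    }
    return out + 1;
}
```

With an illustrative gap of 200 (not necessarily the value in config.h),
regions 100-200 and 350-400 would be merged into 100-400, while a region at
700-800 would stay separate.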


The input and output filenames are as follows.  Files with "prefixes" and
"suffixes" are files that differ between chromosomes.  These files are assumed
to be in the format FILENAME_PREFIX + chr# + FILENAME_SUFFIX.  For example, a
typical sgr file would be "PolII.chr12.sgr" or "PolII.chrY.sgr".
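
The naming scheme above can be sketched in C as follows.  The function names
chr_label and chr_filename are hypothetical, and the sketch assumes that the
prefix does not itself contain "chr" (so the prefix for the example above
would be "PolII." and the suffix ".sgr").  The program's own io module may
build the names differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the chromosome label for a number in the 1-25 scheme into buf:
 * 1-22 map to "1".."22", 23 to "X", 24 to "Y", 25 to "M".
 * Returns 0 on success, -1 if chr is out of range or buf is too small.
 */
int chr_label(int chr, char *buf, size_t size)
{
    int n;
    if (chr >= 1 && chr <= 22)
        n = snprintf(buf, size, "%d", chr);
    else if (chr == 23)
        n = snprintf(buf, size, "X");
    else if (chr == 24)
        n = snprintf(buf, size, "Y");
    else if (chr == 25)
        n = snprintf(buf, size, "M");
    else
        return -1;
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/*
 * Build prefix + "chr" + label + suffix into out.
 * Returns 0 on success, -1 on a bad chromosome number or a short buffer.
 */
int chr_filename(const char *prefix, int chr, const char *suffix,
                 char *out, size_t size)
{
    char label[4];  /* large enough for "22" plus the terminator */
    assert(prefix != NULL && suffix != NULL && out != NULL);
    if (chr_label(chr, label, sizeof label) != 0)
        return -1;

    /* Check the total length (including the terminator) before writing. */
    size_t need = strlen(prefix) + 3 + strlen(label) + strlen(suffix) + 1;
    if (need > size)
        return -1;
    sprintf(out, "%schr%s%s", prefix, label, suffix);
    return 0;
}
```

Under these assumptions, chr_filename("PolII.", 12, ".sgr", ...) yields
"PolII.chr12.sgr" and chromosome 24 yields "PolII.chrY.sgr".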

ELAND_PREFIX / ELAND_SUFFIX: The ChIP-Seq data.
SGR_PREFIX / SGR_SUFFIX: The sgr file corresponding to the same data as the
	Eland file.
MAP_FILENAME: The file containing the number of reads in each window to be
	analyzed.  This data is used to compute the fraction of mappable
	nucleotides in that window.
INPUT_PREFIX / INPUT_SUFFIX: The control Eland file.
OUTPUT: The output file for all peaks sorted by starting position.
	File format:
		Column 1: The chromosome number.
		Column 2: The start position of the peak on this chromosome.
		Column 3: The stop position of the peak on this chromosome.
		Column 4: The number of reads from the eland file located in the peak.
		Column 5: The adjusted number of reads from the control input file
			located in the peak.
		Column 6: The enrichment of the peak ( reads from sample file /
			(reads from control file * scaling factor) ).
		Column 7: The excess of the peak (reads from sample file -
			reads from control file).
		Column 8: The p-value.
FINAL: The output file for all peaks after BH correction, sorted by the
	resulting q-values.  Peaks with q-values of 0 (too small to be computed)
	are sorted by z-score.  The file format is identical to that of OUTPUT
	except that column 8 contains the q-value instead of the p-value.
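
For readers unfamiliar with BH (Benjamini-Hochberg) correction, the textbook
procedure can be sketched as follows.  This is an illustration of the standard
method only; the function name bh_adjust is hypothetical, and the details of
the implementation in analyze.c may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Standard Benjamini-Hochberg adjustment.
 * p: array of m p-values; it is sorted ascending in place.
 * q: output array of m q-values, in the same (sorted) order as p.
 * For the i-th smallest p-value (1-based rank i), the q-value is the
 * minimum of p[j] * m / j over all ranks j >= i, capped at 1.
 */
void bh_adjust(double *p, double *q, size_t m)
{
    assert(p != NULL && q != NULL);
    if (m == 0)
        return;
    qsort(p, m, sizeof *p, cmp_double);

    /* Walk from the largest p-value down, keeping a running minimum. */
    double running_min = 1.0;
    for (size_t i = m; i-- > 0; ) {
        double scaled = p[i] * (double)m / (double)(i + 1);
        if (scaled < running_min)
            running_min = scaled;
        q[i] = running_min;
    }
}
```

For the p-values 0.01, 0.2 and 0.5, the scaled values are 0.01 * 3 / 1,
0.2 * 3 / 2 and 0.5 * 3 / 3, giving q-values of about 0.03, 0.3 and 0.5.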


Once the config.h file has been edited, type "make all" in the terminal
window.  Once the program compiles, type "./PeakSeq_v1.01" to run it.  After
completion, the output will be written to the two files named in the
OUTPUT and FINAL fields of the config.h file.


Header files are arranged in the following order:
	1)  Public structure and type definitions.
	2)  Public function prototypes.
Implementation files are arranged in the following order:
	1)  Private structure and type definitions
	2)  Private function prototypes
	3)  Public functions
	4)  Private functions
Within each category all definitions are in the order of first use in the
program.


A brief explanation of the contents of each module:
analyze.c / analyze.h: This module deals with output to the FINAL file.  It
	sorts all the peaks by p-value and adjusts the p-value using BH correction
	for multiple hypothesis testing.
config.h: This module contains all parameters for the program.
filter.c / filter.h: This module deals with the output to the OUTPUT file.  As
	each peak is found, it counts the number of reads from both the sample
	Eland and the control input files located within the peak and outputs
	statistics based on these numbers to the OUTPUT file.
io.c / io.h: This module deals with input/output concerns.  It converts
	chromosome names in both directions between the chr1 - chrM format and
	the 1-25 numbering, and it scans lines in a file.
main.c: The main program.  All functions are called from this module.
random.c / random.h: This module deals with the random number generator.  Code
	was copied and comments were paraphrased from the following computer
	science textbook:
		Roberts, Eric. Programming Abstractions in C: a second course in
		computer science.  Reading, Massachusetts: Addison Wesley Longman,
		Inc., 1998.
simulator.c / simulator.h: This module deals with the simulated data and the
	simulation.  It also finds the threshold for each window based on the
	simulations.
util.c / util.h: This module deals with memory management and program
	termination.  It is Copyright (C) 2008 by Michael Fischer and was
	distributed for use in his 2008 Spring CS 223 course.
window.c / window.h: This module deals with all data that differs between
	windows, including number of reads, the locations of the peaks, and the
	mappability fractions.
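
As an illustration of the chromosome-name conversion attributed to the io
module above, the direction from chr1 - chrM names to the 1-25 numbering can
be sketched as follows.  The function name chr_number is hypothetical and the
error handling is an assumption; the actual functions in io.c may have
different names and behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert a name such as "chr7", "chrX" or "chrM" to the 1-25 numbering
 * (23 is X, 24 is Y, 25 is M).  Returns -1 for anything else.
 */
int chr_number(const char *name)
{
    assert(name != NULL);
    if (strncmp(name, "chr", 3) != 0)
        return -1;
    const char *label = name + 3;

    if (strcmp(label, "X") == 0) return 23;
    if (strcmp(label, "Y") == 0) return 24;
    if (strcmp(label, "M") == 0) return 25;

    /* Otherwise expect a plain decimal number with no sign or whitespace. */
    if (label[0] < '0' || label[0] > '9')
        return -1;
    char *end;
    long n = strtol(label, &end, 10);
    if (*end != '\0' || n < 1 || n > 22)
        return -1;
    return (int)n;
}
```

In this sketch, "chr12" maps to 12 and "chrX" to 23, while names outside the
scheme, such as "chr23" or a bare "12", are rejected with -1.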