PEAK-SEQ -- PREPROCESSING

Paper by Joel Rozowsky, et al.
Coded in Perl and Shell Script by Joel Rozowsky.
Coded in C by Theodore Gibson.

This archive contains two implementations of the same procedure. The first is written entirely in C; the second is a combination of shell script and Perl. The two implementations produce identical output.

The program takes an Eland file containing data for all 25 chromosomes as input and outputs a separate Eland file and a separate SGR file for each chromosome. Splitting the data by chromosome allows for parallelized computation. These Eland and SGR files are the input taken by both implementations of the Peak-Seq program.

To run the Perl/Shell version, open the script files and ensure that the input and output files named in them are correct. Then type "sh parse_script_PolII" or "sh parse_script_Input" in the terminal to run the shell script. These scripts split the Eland file into 25 separate Eland files, one for each chromosome. Next, run the Perl script using either "perl create_signal_map_Input.pl" or "perl create_signal_map_PolII.pl". This converts each Eland file into an SGR file.

To run the C implementation, first open the file config.h. This file contains all parameters to the program as well as all input and output file locations and filenames. Suggested values for the parameters are already listed in the file.

The parameters are as follows:

READ_LENGTH: The assumed length of a single read.

The input and output filenames are as follows. Files with "prefixes" and "suffixes" are files that differ between chromosomes. These files are assumed to be named in the format FILENAME_PREFIX + chr# + FILENAME_SUFFIX. For example, a typical SGR file would be "PolII.chr12.sgr" or "PolII.chrY.sgr".

INPUT_FILENAME: The name of the Eland file containing data for all the chromosomes.
ELAND_PREFIX / ELAND_SUFFIX: The output Eland files that have been separated by chromosome.
SGR_PREFIX / SGR_SUFFIX: The output SGR files that have been separated by chromosome.

Once the config.h file has been edited, type "make all" in the terminal window. Once the program compiles, type "./Preprocess" to run it. After completion, the 50 output files (25 Eland and 25 SGR) will be in the locations specified in the ELAND and SGR fields of the config.h file.

Header files are arranged in the following order:
1) Public structure and type definitions
2) Public function prototypes

Implementation files are arranged in the following order:
1) Private structure and type definitions
2) Private function prototypes
3) Public functions
4) Private functions

Within each category, all definitions are in the order of first use in the program.

A brief explanation of the contents of each module:

config.h: This module contains all parameters for the program.
io.c / io.h: This module deals with input/output concerns. It handles conversion in both directions between the chr1 - chrM format and the 1-25 format for chromosome numbering, as well as scanning lines in a file.
main.c: The main program. All functions are called from this module.
sgr.c / sgr.h: This module deals with sorting and output related to the SGR files.
util.c / util.h: This module deals with memory management and program termination. It is Copyright (C) 2008 by Michael Fischer and was distributed for use in his 2008 Spring CS 223 course.
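
As an illustration of the chromosome numbering conversion described for io.c and of the FILENAME_PREFIX + chr# + FILENAME_SUFFIX naming scheme, here is a minimal sketch in C. It is not the code from this archive. The function names (chr_index_to_name, chr_name_to_index, build_sgr_filename) are hypothetical, the SGR_PREFIX and SGR_SUFFIX values are example values chosen to reproduce the "PolII.chr12.sgr" example, and the assignment X = 23, Y = 24, M = 25 is an assumption; check io.c and config.h for the actual definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_CHROMOSOMES 25

/* Example values only; the real ones are set in config.h. */
#define SGR_PREFIX "PolII."
#define SGR_SUFFIX ".sgr"

/* Convert an index in 1..25 to a label such as "chr12" or "chrY".
 * ASSUMPTION: 1..22 are the autosomes, 23 = X, 24 = Y, 25 = M.
 * Returns 0 on success, -1 if the index is out of range or buf is too small. */
int chr_index_to_name(int index, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);
    if (index < 1 || index > NUM_CHROMOSOMES)
        return -1;

    if (index <= 22)
        n = snprintf(buf, size, "chr%d", index);
    else if (index == 23)
        n = snprintf(buf, size, "chrX");
    else if (index == 24)
        n = snprintf(buf, size, "chrY");
    else
        n = snprintf(buf, size, "chrM");

    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/* Convert a label such as "chr7" or "chrM" back to an index in 1..25.
 * Returns -1 if the label is not recognized. */
int chr_name_to_index(const char *name)
{
    const char *s;
    int value = 0;

    assert(name != NULL);
    if (strncmp(name, "chr", 3) != 0)
        return -1;
    s = name + 3;

    if (strcmp(s, "X") == 0) return 23;
    if (strcmp(s, "Y") == 0) return 24;
    if (strcmp(s, "M") == 0) return 25;

    if (*s == '\0')
        return -1;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return -1;
        value = value * 10 + (*s - '0');
        if (value > 22)
            return -1;
    }
    return (value >= 1) ? value : -1;
}

/* Build an output name in the form SGR_PREFIX + chr# + SGR_SUFFIX.
 * Returns 0 on success, -1 on a bad index or a buffer that is too small. */
int build_sgr_filename(int index, char *buf, size_t size)
{
    char name[8];
    int n;

    if (chr_index_to_name(index, name, sizeof name) != 0)
        return -1;
    n = snprintf(buf, size, "%s%s%s", SGR_PREFIX, name, SGR_SUFFIX);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

With these example values, calling build_sgr_filename with index 12 and a sufficiently large buffer fills it with "PolII.chr12.sgr". Looping the index from 1 to 25 would produce one filename per chromosome, which is the same pattern the ELAND_PREFIX / ELAND_SUFFIX fields describe for the Eland output files.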