This directory holds the code and makefile required to build working copies of the
novel parts of the core calculation engine of
GREAT (McLean CY et al., Nature Biotechnology, 2010).
These tools perform the following steps:
 1. Calculating the regulatory domain of every gene in a gene set
 2. Calculating a genomic region-based binomial p-value for an ontology term, given the
regulatory domains of the genes annotated with that term.

The GREAT codebase depends on the UCSC Kent source tree.  Consequently, the
UCSC Kent source tree must be downloaded and built before attempting to build the GREAT
core calculation engine tools.  All steps required to get the GREAT core calculation engine
tools up and running are documented below.  These instructions are intended for Unix/Linux
systems only; building the GREAT core calculation engine tools on other systems is beyond
the scope of these instructions.


### Building the provided GREAT core calculation engine tools ###
1. Download and install the UCSC Kent source tree
     a. A current copy of the Kent source tree is freely available for academic, nonprofit,
           and personal use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, and can also
           be obtained via CVS (see http://genome.ucsc.edu/admin/cvs.html).
     b. Build and installation instructions for the Kent source tree are available at
           http://genome.ucsc.edu/admin/jk-install.html.  Only the first six steps are required;
           in fact only the library jkweb.a is necessary.

2. Update the GREAT core calculation engine tools makefile to access the required Kent source library functions
     a. From step 1, you have installed the Kent source tree in some base directory (hereafter
           referenced as $BASE_DIR).  Within the $BASE_DIR/kent/src/lib/$MACHTYPE/ directory
           there should be a library file named jkweb.a.  Verify that this file exists.  If it
           does not exist, step 1 was not performed properly and should be redone.
     b. Open the GREAT core calculation engine tools makefile using a text editor.
           The first line reads "KENT_DIR = path/to/your/kent/src".
           Update this path definition with the actual location of your Kent source directory.  This will be
           $BASE_DIR/kent/src for whatever value of $BASE_DIR is appropriate.

3. Build the GREAT core calculation engine tools
     a. If the KENT_DIR assignment within the makefile is set up correctly, simply typing 'make' in the
           GREAT directory will build the GREAT core calculation engine tools.
              * An executable named createRegulatoryDomains can be used to calculate regulatory domains.
              * An executable named calculateBinomialP can be used to calculate genomic region-based binomial p-values.
           Calling either program with no arguments prints its usage message.
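
The check described in step 2a can be scripted before running make.  The following is a
minimal sketch only: the helper names jkweb_path and kent_library_present are hypothetical
and not part of GREAT or the Kent tree, and it assumes that BASE_DIR and MACHTYPE are set
in the environment as described above.

```python
import os
from pathlib import Path


def jkweb_path(base_dir, machtype):
    """Return the location where step 2a expects to find jkweb.a."""
    return Path(base_dir) / "kent" / "src" / "lib" / machtype / "jkweb.a"


def kent_library_present(base_dir, machtype):
    """Return True if jkweb.a exists at the expected location."""
    return jkweb_path(base_dir, machtype).is_file()


if __name__ == "__main__":
    # BASE_DIR and MACHTYPE are read from the environment (an assumption of
    # this sketch); missing values fall back to the current directory.
    base = os.environ.get("BASE_DIR", ".")
    mach = os.environ.get("MACHTYPE", "")
    path = jkweb_path(base, mach)
    status = "found" if kent_library_present(base, mach) else "missing"
    print(f"{path}: {status}")
```

If the file is reported as missing, redo step 1 before editing the makefile.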


### Running the createRegulatoryDomains tool ###
The createRegulatoryDomains tool is used to generate computationally defined regulatory domains for all genes in a gene set.
The tool requires four arguments:

1. TSS.in
	This is a file listing all genes to which you want to assign regulatory domains.  Each line of the file
	corresponds to a single gene that will be assigned a regulatory domain.  Each line should have four
	tab-delimited fields:
              chromosome      transcription start site      strand      geneName

2. chrom.sizes
	This is a tab-delimited file holding the number of basepairs in each chromosome, in the following two-field format:
              chromosome      chromosome size

3. oneClosest|twoClosest|basalPlusExtension
	This argument corresponds to the type of association rule desired (see the
	"Association rules from genomic regions to genes" section of the Online Methods of our paper at
	 http://dx.doi.org/10.1038/nbt.1630 for a full description of each method).

4. regDoms.out
	This is the output file listing all of the computationally defined regulatory domains of each gene.  This file is
	in the following tab-delimited format:

              chromosome      chromStart      chromEnd      geneName     transcription start site      strand

	The chromosome, geneName, transcription start site, and strand are all identical to those in the input TSS.in file.
	The span of [chromStart, chromEnd) is the computationally defined regulatory domain of the gene.  Note that this
	file is a valid BED file (http://genome.ucsc.edu/FAQ/FAQformat.html#format1).
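
To make the two file layouts concrete, here is a minimal sketch that parses one line of
TSS.in and one line of regDoms.out.  The record types, function names, and the example
gene "geneA" are hypothetical.  The sketch also assumes that the strand field is written
as "+" or "-", which this document does not state explicitly.

```python
from typing import NamedTuple


class TssRecord(NamedTuple):
    chrom: str
    tss: int
    strand: str
    gene_name: str


class RegDomRecord(NamedTuple):
    chrom: str
    chrom_start: int
    chrom_end: int
    gene_name: str
    tss: int
    strand: str


def _check_strand(strand):
    # Assumption of this sketch: strands are encoded as "+" or "-".
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")


def parse_tss_line(line):
    """Parse one TSS.in line: chromosome, TSS, strand, geneName."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}")
    chrom, tss, strand, gene_name = fields
    _check_strand(strand)
    return TssRecord(chrom, int(tss), strand, gene_name)


def parse_reg_dom_line(line):
    """Parse one regDoms.out line: chromosome, chromStart, chromEnd,
    geneName, TSS, strand."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited fields, got {len(fields)}")
    chrom, start, end, gene_name, tss, strand = fields
    _check_strand(strand)
    return RegDomRecord(chrom, int(start), int(end), gene_name, int(tss), strand)


if __name__ == "__main__":
    print(parse_tss_line("chr1\t1000\t+\tgeneA\n"))
    print(parse_reg_dom_line("chr1\t500\t1500\tgeneA\t1000\t+\n"))
```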


Options:
	The -maxExtension, -basalUpstream, and -basalDownstream options allow users to vary the amount of genome
	associated with each gene.  Note that the -basalUpstream and -basalDownstream options are only relevant to the
	basalPlusExtension association rule.
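
For intuition about how a maximum extension limits a regulatory domain, the sketch below
shows one simplified reading of a single-nearest-gene rule: each domain extends from the
TSS to the midpoint between it and the neighboring TSS, but never farther than the maximum
extension or past the ends of the chromosome.  This is an illustration under those stated
assumptions, not the createRegulatoryDomains implementation.  The function name is
hypothetical, and the tool's handling of ties, strands, and boundaries may differ; the
paper's Online Methods remain the authoritative description.

```python
def one_closest_domains(tss_list, chrom_size, max_extension):
    """Illustrative single-nearest-gene domains on ONE chromosome.

    tss_list: list of (tss, gene_name) pairs.
    Returns a list of half-open (chrom_start, chrom_end, gene_name) tuples.
    """
    genes = sorted(tss_list)
    domains = []
    for i, (tss, name) in enumerate(genes):
        # Start from the widest allowed span, clamped to the chromosome.
        start = max(0, tss - max_extension)
        end = min(chrom_size, tss + max_extension)
        # Do not cross the midpoint to the neighboring TSS on either side.
        if i > 0:
            start = max(start, (genes[i - 1][0] + tss) // 2)
        if i + 1 < len(genes):
            end = min(end, (tss + genes[i + 1][0]) // 2)
        domains.append((start, end, name))
    return domains


if __name__ == "__main__":
    # Hypothetical genes "a" and "b" on a 10,000 bp chromosome.
    for row in one_closest_domains([(1000, "a"), (3000, "b")], 10000, 500):
        print(row)
```

With a small maximum extension the two domains stay separate; with a large one they meet
at the midpoint between the two transcription start sites.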


### Running the calculateBinomialP tool ###
The calculateBinomialP tool is used to calculate the genomic region-based binomial p-value of enrichment for a particular
ontology term, based on the fraction of the genome associated with the term, the number of genomic regions in the input
set, and the number of genomic regions that are associated with genes annotated with the term.  The tool requires four
arguments:

1. regdoms.in
	This is a file of fully-specified regulatory domains of all genes in the genome that are annotated with the term
	of interest.  The format of the file follows the output of the createRegulatoryDomains tool:

              chromosome      chromStart      chromEnd      geneName     transcription start site      strand

	Note that these regulatory domains may overlap each other.

2. antigap.bed
	This is a BED file (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) specifying all regions of the genome in which
	input genomic regions may land (e.g., all base pairs in the genome that are not in assembly gaps).

	Important note:  antigap.bed is required to consist entirely of non-overlapping regions.

3. numTotalRegions
	The total number of genomic regions in the input set.

4. numRegionsHit
	The number of input genomic regions that are annotated with the ontology term of interest (due to their midpoints
	overlapping the regulatory domain of one or more genes annotated with the term).
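
The two counts can be derived from a set of input regions as in the following sketch.
The function name is hypothetical, regions and domains are assumed to be half-open
(chromosome, start, end) tuples, and the midpoint is assumed to be rounded down; the
rounding used to produce your own counts may differ.

```python
from collections import defaultdict


def count_regions_hit(regions, reg_doms):
    """Return (numTotalRegions, numRegionsHit).

    A region is counted as hit at most once, when its midpoint falls inside
    at least one regulatory domain on the same chromosome.
    """
    doms_by_chrom = defaultdict(list)
    for chrom, start, end in reg_doms:
        doms_by_chrom[chrom].append((start, end))

    hits = 0
    for chrom, start, end in regions:
        midpoint = (start + end) // 2  # assumption: midpoint rounds down
        if any(s <= midpoint < e for s, e in doms_by_chrom.get(chrom, [])):
            hits += 1
    return len(regions), hits


if __name__ == "__main__":
    regions = [("chr1", 10, 20), ("chr1", 300, 400), ("chr2", 10, 20)]
    reg_doms = [("chr1", 0, 100), ("chr1", 10, 50)]
    print(count_regions_hit(regions, reg_doms))
```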


The binomial p-value of enrichment for the term, given the four inputs, is printed to standard output.
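
The calculation can be sketched as follows.  This is an illustration, not the
calculateBinomialP source: it assumes that the fraction of the genome associated with the
term is the number of antigap base pairs covered by the union of the regulatory domains
divided by the total number of antigap base pairs, and that the p-value is the binomial
upper tail P(X >= numRegionsHit).  All function names are hypothetical.

```python
import math
from collections import defaultdict


def check_non_overlapping(antigap):
    """Raise ValueError if any two antigap regions on a chromosome overlap."""
    last_end = {}
    for chrom, start, end in sorted(antigap):
        if start < last_end.get(chrom, 0):
            raise ValueError(f"overlapping antigap regions on {chrom}")
        last_end[chrom] = end


def merge_intervals(intervals):
    """Merge half-open (start, end) intervals into a non-overlapping list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def annotated_fraction(reg_doms, antigap):
    """Fraction of antigap base pairs covered by at least one domain."""
    check_non_overlapping(antigap)
    by_chrom = defaultdict(list)
    for chrom, start, end in reg_doms:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}

    total = covered = 0
    for chrom, start, end in antigap:
        total += end - start
        for dom_start, dom_end in merged.get(chrom, []):
            covered += max(0, min(end, dom_end) - max(start, dom_start))
    return covered / total


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


if __name__ == "__main__":
    # Overlapping domains are merged before measuring coverage.
    p = annotated_fraction([("chr1", 0, 50), ("chr1", 40, 100)],
                           [("chr1", 0, 200)])
    print(p, binomial_upper_tail(2, 1, p))
```

Merging the domains first matters because, as noted above, regulatory domains may overlap
each other, and each covered base pair should be counted only once.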


### Additional help ###
Please direct any questions about compilation or usage to great@cs.stanford.edu.