========================
Ready-made khmer scripts
========================

k-mer counting
==============

**load-into-counting.py**: build a counting hash.

Usage::

   load-into-counting.py [ options ] <output.kh> <input1> <input2> ...

Build a counting hash table using the sequences in <input1-N>, and save
it to <output.kh>.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa

**abundance-dist.py**: calculate the abundance distribution.

Usage::

   scripts/abundance-dist.py [ options ] <input.kh> <datafile> <histout>

Use a counting hash table to count the k-mer abundance distribution in
<datafile>; output the distribution to <histout>.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa
   scripts/abundance-dist.py -z out.kh data/100k-filtered.fa out.hist

The file 'out.hist' will contain the histogram of k-mer abundances.
Columns are (1) k-mer abundance, (2) k-mer count, (3) cumulative count,
and (4) fraction of total distinct k-mers.

**filter-abund.py**: trim sequences at a min k-mer abundance.

Usage::

   filter-abund.py [ -C <cutoff> ] <input.kh> <file1> <file2> ...

Load a counting hash table from <input.kh> and use it to trim the
sequences in <file1-N>.  Trimmed sequences will be placed in
<fileN>.abundfilt.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 table.kh data/100k-filtered.fa
   scripts/filter-abund.py -C 2 table.kh data/100k-filtered.fa

**count-median.py**: count median, average, and stddev of k-mer
abundance in sequences.

Usage::

   count-median.py <input.kh> <input sequences> <output counts>

Load a counting hash table from <input.kh> and use it to calculate the
median, average, and stddev k-mer count for each sequence in
<input sequences>.  Output counts are placed in <output counts>, in the
format "name median average stddev sequence_length".
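No example is given above for count-median.py, so here is a minimal
sketch built from the usage line; the file names ``counts.kh``,
``reads.fa``, and ``reads.medians`` are hypothetical::

   scripts/load-into-counting.py -k 20 -x 5e7 counts.kh reads.fa
   scripts/count-median.py counts.kh reads.fa reads.medians

Each line of ``reads.medians`` would then hold one
"name median average stddev sequence_length" record per input sequence.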
**count-overlap.py**: count the overlap k-mers, which are the k-mers
appearing in two sequence datasets.

Usage::

   count-overlap.py [options] <first_dataset> <second_dataset> <report_filename> [ <curve_report_filename> ]

Note: "curve_report_filename" is optional.  If it is provided, the
script will generate a report of the increase in overlap k-mers as the
number of sequences in the second dataset increases; this report can be
used to plot a curve showing that trend.  The report file contains the
numbers of unique k-mers in each of the two datasets separately, and the
number of overlap k-mers appearing in both datasets.

Partitioning
============

**load-graph.py**: load sequences into the compressible graph format.

Usage::

   load-graph.py [ options ] <basename> <file1> [ <file2> ... ]

Load in a set of sequences, marking waypoints as you go, and save into
a ht/tagset pair of files.  See 'extract-partitions' for a complete
workflow.

**partition-graph.py**: partition a graph based on waypoint
connectivity.

Usage::

   partition-graph.py [ options ] <basename>

Partition the given graph (ht + tagset) into disconnected subgraphs, and
save the resulting partitionmap(s) as *.pmap files.  See 'Artifact
removal' to understand the stoptags argument.

**merge-partitions.py**: merge pmap files into a single merged pmap
file.

Usage::

   merge-partitions.py [ options ] <basename>

Take the <basename>.subset.N.pmap files and merge them all into a single
<basename>.pmap.merged file for annotate-partitions to use.

**annotate-partitions.py**: annotate sequences with partition IDs.

Usage::

   annotate-partitions.py [ -k <ksize> ] <basename> <file1> [ <file2> ... ]

Load in a partitionmap (generally produced by partition-graph.py /
merge-partitions.py) and annotate the sequences in the given files with
their partition IDs.  Use 'extract-partitions.py' to actually extract
sequences into separate group files.

Example (results will be in ``random-20-a.fa.part``)::

   scripts/load-graph.py -k 20 example tests/test-data/random-20-a.fa
   scripts/partition-graph.py example
   scripts/merge-partitions.py -k 20 example
   scripts/annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa

**extract-partitions.py**: separate sequences annotated with partitions
into group files.

Usage::

   extract-partitions.py [ options ] <prefix> <file1.part> [ <file2.part> ... ]

Example (results will be in ``example.group0000.fa``)::

   scripts/load-graph.py -k 20 example tests/test-data/random-20-a.fa
   scripts/partition-graph.py example
   scripts/merge-partitions.py -k 20 example
   scripts/annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
   scripts/extract-partitions.py example random-20-a.fa.part

Artifact removal
----------------

The following scripts are specialized scripts for finding and removing
highly-connected k-mers (HCKs).  See :doc:`partitioning-big-data`.

**make-initial-stoptags.py**: find an initial set of HCKs.

Usage::

   make-initial-stoptags.py [ options ] <basename>

Load a ht/tagset created by load-graph.py, and do a small set of
traversals from graph waypoints; on these traversals, look for k-mers
that are repeatedly traversed in high-density regions of the graph,
i.e. are highly connected.  Output those k-mers as an initial set of
stoptags, which can be fed into partition-graph.py, find-knots.py, and
filter-stoptags.py.

The hashtable size parameters are for a counting hash used to keep track
of repeatedly-traversed k-mers.  The subset size option specifies the
number of waypoints from which to traverse; for highly connected data
sets, the default (1000) is probably ok.

**find-knots.py**: find all HCKs.

Usage::

   find-knots.py [ options ] <basename>

Load an ht/tagset created by load-graph, and a set of pmap files created
by partition-graph.  Go through each pmap file, select the largest
partition in each, and do the same kind of traversal as in
make-initial-stoptags from each of the waypoints in that partition; this
should identify all of the HCKs in that partition.  These HCKs are
output to <basename>.stoptags after each pmap file.

Parameter choice is reasonably important.  See the pipeline in
:doc:`partitioning-big-data` for an example run.

This script is not very scalable and may blow up memory and die
horribly.  You should be able to use the intermediate stoptags to
restart the process, and if you eliminate the already-processed pmap
files, you can continue where you left off.

**filter-stoptags.py**: trim sequences at stoptags.

Usage::

   filter-stoptags.py [ -k <ksize> ] <stoptags file> <file1> [ <file2> ... ]

Load stoptags in from the given .stoptags file and use them to trim or
remove the sequences in <file1-N>.  Trimmed sequences will be placed in
<fileN>.stopfilt.
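No example is given above for filter-stoptags.py; a minimal sketch,
assuming a stoptags file ``example.stoptags`` produced by
make-initial-stoptags.py or find-knots.py and a hypothetical input file
``reads.fa``, might be::

   scripts/filter-stoptags.py -k 20 example.stoptags reads.fa

Per the description above, trimmed sequences would then appear in
``reads.fa.stopfilt``.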
Digital normalization
=====================

**normalize-by-min.py**: do digital normalization (remove entirely
redundant sequences).

Usage::

   normalize-by-min.py [ options ] <input file1> <input file2> ...

Discard sequences based on whether or not their **minimum** k-mer
abundance lies above a specified cutoff.  Kept sequences will be placed
in <fileN>.keep.

Note: this is, in theory, a lossless operation.  Only sequences that
contribute no novel k-mers will be discarded.

Example::

   scripts/normalize-by-min.py -k 17 tests/test-data/test-abund-read-2.fa

**normalize-by-median.py**: do digital normalization (remove mostly
redundant sequences).

Usage::

   normalize-by-median.py [ options ] <input file1> <input file2> ...

Discard sequences based on whether or not their **median** k-mer
abundance lies above a specified cutoff.  Kept sequences will be placed
in <fileN>.keep.

Example::

   scripts/normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa
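Since kept sequences land in <fileN>.keep, they can be fed straight into
the other scripts above.  As a hypothetical follow-on (a sketch only;
``reads.fa`` and the downstream file names are assumptions, not files
shipped with khmer)::

   scripts/normalize-by-median.py -k 20 reads.fa
   scripts/load-into-counting.py -k 20 -x 5e7 reads.kh reads.fa.keep
   scripts/abundance-dist.py -z reads.kh reads.fa.keep reads.hist

This simply chains the documented commands: the normalized output is
counted and its abundance histogram written to ``reads.hist``.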