========================
Ready-made khmer scripts
========================

k-mer counting
==============

**load-into-counting.py**: build a counting hash.

Usage::

   load-into-counting.py [ options ] <output.kh> <input1> <input2> ...

Build a counting hash table using the sequences in <input1-N>, and save
it to <output.kh>.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa

**abundance-dist.py**: calculate the abundance distribution.

Usage::

   scripts/abundance-dist.py [ options ] <input.kh> <datafile> <histout>

Use a counting hash table to count the k-mer abundance distribution in
<datafile>; output the distribution to <histout>.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa
   scripts/abundance-dist.py -z out.kh data/100k-filtered.fa out.hist

The file 'out.hist' will contain the histogram of k-mer abundances.
Columns are (1) k-mer abundance, (2) k-mer count, (3) cumulative count,
and (4) fraction of total distinct k-mers.

**filter-abund.py**: trim sequences at a min k-mer abundance.

Usage::

   filter-abund.py [ -C <cutoff> ] <input.kh> <file1> <file2> ...

Load a counting hash table from <input.kh> and use it to trim the
sequences in <file1-N>.  Trimmed sequences will be placed in
<fileN>.abundfilt.

Example::

   scripts/load-into-counting.py -k 20 -x 5e7 table.kh data/100k-filtered.fa
   scripts/filter-abund.py -C 2 table.kh data/100k-filtered.fa

**count-median.py**: count median, average, and stddev of k-mer
abundance in sequences.

Usage::

   count-median.py <input.kh> <input sequences> <output counts>

Load a counting hash table from <input.kh> and use it to calculate the
median, average, and stddev k-mer count for each sequence in
<input sequences>.  Output counts are placed in <output counts>, in the
format "name median average stddev sequence_length".
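No example is given above for count-median.py, so here is a minimal
sketch built from the usage line; the file names ``counts.kh``,
``reads.fa``, and ``reads.medians`` are hypothetical::

   scripts/load-into-counting.py -k 20 -x 5e7 counts.kh reads.fa
   scripts/count-median.py counts.kh reads.fa reads.medians

Each line of ``reads.medians`` would then hold one
"name median average stddev sequence_length" record per input sequence.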
**count-overlap.py**: count the overlap k-mers, which are the k-mers
appearing in two sequence datasets.

Usage::

   count-overlap.py [options] <first_dataset> <second_dataset> <report_filename> [ <curve_report_filename> ]

Note: "curve_report_filename" is optional.  If it is provided, the
script will generate a report of the increase in overlap k-mers as the
number of sequences in the second dataset increases; this report can be
used to plot a curve showing that trend.  The report file contains the
numbers of unique k-mers in each of the two datasets separately, and the
number of overlap k-mers appearing in both datasets.

Partitioning
============

**load-graph.py**: load sequences into the compressible graph format.

Usage::

   load-graph.py [ options ] <basename> <file1> [ <file2> ... ]

Load in a set of sequences, marking waypoints as you go, and save into
a ht/tagset pair of files.  See 'extract-partitions' for a complete
workflow.

**partition-graph.py**: partition a graph based on waypoint
connectivity.

Usage::

   partition-graph.py [ options ] <basename>

Partition the given graph (ht + tagset) into disconnected subgraphs, and
save the resulting partitionmap(s) as *.pmap files.  See 'Artifact
removal' to understand the stoptags argument.

**merge-partitions.py**: merge pmap files into a single merged pmap
file.

Usage::

   merge-partitions.py [ options ] <basename>

Take the <basename>.subset.N.pmap files and merge them all into a single
<basename>.pmap.merged file for annotate-partitions to use.

**annotate-partitions.py**: annotate sequences with partition IDs.

Usage::

   annotate-partitions.py [ -k <ksize> ] <basename> <file1> [ <file2> ... ]

Load in a partitionmap (generally produced by partition-graph.py /
merge-partitions.py) and annotate the sequences in the given files with
their partition IDs.  Use 'extract-partitions.py' to actually extract
sequences into separate group files.

Example (results will be in ``random-20-a.fa.part``)::

   scripts/load-graph.py -k 20 example tests/test-data/random-20-a.fa
   scripts/partition-graph.py example
   scripts/merge-partitions.py -k 20 example
   scripts/annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa

**extract-partitions.py**: separate sequences annotated with partitions
into group files.

Usage::

   extract-partitions.py [ options ] <prefix> <file1.part> [ <file2.part> ... ]

Example (results will be in ``example.group0000.fa``)::

   scripts/load-graph.py -k 20 example tests/test-data/random-20-a.fa
   scripts/partition-graph.py example
   scripts/merge-partitions.py -k 20 example
   scripts/annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
   scripts/extract-partitions.py example random-20-a.fa.part

Artifact removal
----------------

The following scripts are specialized scripts for finding and removing
highly-connected k-mers (HCKs).  See :doc:`partitioning-big-data`.

**make-initial-stoptags.py**: find an initial set of HCKs.

Usage::

   make-initial-stoptags.py [ options ] <basename>

Load a ht/tagset created by load-graph.py, and do a small set of
traversals from graph waypoints; on these traversals, look for k-mers
that are repeatedly traversed in high-density regions of the graph,
i.e. are highly connected.  Output those k-mers as an initial set of
stoptags, which can be fed into partition-graph.py, find-knots.py, and
filter-stoptags.py.

The hashtable size parameters are for a counting hash used to keep track
of repeatedly-traversed k-mers.  The subset size option specifies the
number of waypoints from which to traverse; for highly connected data
sets, the default (1000) is probably ok.

**find-knots.py**: find all HCKs.

Usage::

   find-knots.py [ options ] <basename>

Load an ht/tagset created by load-graph, and a set of pmap files created
by partition-graph.  Go through each pmap file, select the largest
partition in each, and do the same kind of traversal as in
make-initial-stoptags from each of the waypoints in that partition; this
should identify all of the HCKs in that partition.  These HCKs are
output to <basename>.stoptags after each pmap file.

Parameter choice is reasonably important.  See the pipeline in
:doc:`partitioning-big-data` for an example run.

This script is not very scalable and may blow up memory and die
horribly.  You should be able to use the intermediate stoptags to
restart the process, and if you eliminate the already-processed pmap
files, you can continue where you left off.

**filter-stoptags.py**: trim sequences at stoptags.

Usage::

   filter-stoptags.py [ -k <ksize> ] <stoptags file> <file1> [ <file2> ... ]

Load stoptags in from the given .stoptags file and use them to trim or
remove the sequences in <file1-N>.  Trimmed sequences will be placed in
<fileN>.stopfilt.
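No example is given above for filter-stoptags.py; a minimal sketch,
assuming a stoptags file ``example.stoptags`` produced by
make-initial-stoptags.py or find-knots.py and a hypothetical input file
``reads.fa``, might be::

   scripts/filter-stoptags.py -k 20 example.stoptags reads.fa

Per the description above, trimmed sequences would then appear in
``reads.fa.stopfilt``.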
Digital normalization
=====================

**normalize-by-min.py**: do digital normalization (remove entirely
redundant sequences).

Usage::

   normalize-by-min.py [ options ] <input file1> <input file2> ...

Discard sequences based on whether or not their **minimum** k-mer
abundance lies above a specified cutoff.  Kept sequences will be placed
in <fileN>.keep.

Note: this is, in theory, a lossless operation.  Only sequences that
contribute no novel k-mers will be discarded.

Example::

   scripts/normalize-by-min.py -k 17 tests/test-data/test-abund-read-2.fa

**normalize-by-median.py**: do digital normalization (remove mostly
redundant sequences).

Usage::

   normalize-by-median.py [ options ] <input file1> <input file2> ...

Discard sequences based on whether or not their **median** k-mer
abundance lies above a specified cutoff.  Kept sequences will be placed
in <fileN>.keep.

Example::

   scripts/normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa
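Since kept sequences land in <fileN>.keep, they can be fed straight into
the other scripts above.  As a hypothetical follow-on (a sketch only;
``reads.fa`` and the downstream file names are assumptions, not files
shipped with khmer)::

   scripts/normalize-by-median.py -k 20 reads.fa
   scripts/load-into-counting.py -k 20 -x 5e7 reads.kh reads.fa.keep
   scripts/abundance-dist.py -z reads.kh reads.fa.keep reads.hist

This simply chains the documented commands: the normalized output is
counted and its abundance histogram written to ``reads.hist``.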