======================== Ready-made khmer scripts ======================== While khmer is research software (and will probably remain research software for a while!) there are a bunch of scripts that can be used out of the box. Below is a quick-n-dirty guide to these scripts. Basic k-mer filtering ===================== filter-exact.py Quality trimming ~~~~~~~~~~~~~~~~ quality-trim.py -- Filter Illumina data sets by removing reads containing too many Ns, and truncate reads at QV=2 bases. Preprocessing before any further steps. Usage:: %% quality-trim.py This script can be applied to multiple data sets in parallel or in series. Removing reads by k-mer match ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ filter-if-present.py -- Remove reads from a FASTA/FASTQ file that contain any of the k-mers present in another FASTA/FASTQ file. Used to remove rRNA, repeats, Illumina primers, etc. Usage:: %% filter-if-present.py Output is placed in a file named after the FASTA/FASTQ file filtered, with the suffix '.masked'. This script can be applied to multiple data sets in parallel or in series, as long as the same mask file is used. Abundance filtering ------------------- To use a probabilistic counting hash table to filter out reads containing low-abundance k-mers, see the scripts 'filter-inexact-*.py'. An "All k-mers" filter ~~~~~~~~~~~~~~~~~~~~~~ filter-inexact-all.py -- removes reads containing any k-mer with an abundance of 1 (technically, < 2 ;). This is an all-by-all process; filter **all your data** together in one go, or else you will erroneously eliminate reads. (Consider the situation where a k-mer is present in precisely two reads, one in `file1` and one in `file2`. Both reads will be eliminated if `file1` and `file2` are filtered separately. A few points: - this code uses only a few hundred megabytes beyond the hash table, so make the hashtable size as close to your available physical memory as you can. - the one-sided counting error means that some reads with low-abundance k-mers *might* be kept, but no reads will erroneously be removed. You can run it iteratively, with different hash table sizes each time, to decrease the error rate. - increase MIN_ABUNDANCE beyond 2 to eliminate higher abundance reads. For example, setting MIN_ABUNDANCE to 5 will eliminate reads that contain k-mers with 4 or fewer abundance across the data set. - if you run this script iteratively on a data set, it will continue to eliminate reads. Consider the situation where a k-mer A is present in the same read as a k-mer B, but k-mer B is also in another read. If the first read (containing both k-mers A and B) is eliminated because k-mer A is low-abundance, then in the filtered data set, k-mer B will now be low-abundance and the second read will be eliminated in the next pass. In comparison, 'filter-inexact-any.py' (below) is invariant. Other k-mer filters ~~~~~~~~~~~~~~~~~~~ We have not found these filters to be as useful as 'filter-inexact-all.py', but they should work: filter-inexact-any.py -- remove filter-inexact-once.py filter-inexact-run.py De Bruijn graph analysis and manipulation ========================================= Graph size filtering ~~~~~~~~~~~~~~~~~~~~ graph-size-py.py - eliminate reads belonging to small de Bruijn graphs Partitioning ~~~~~~~~~~~~ do-th-subset-calc.py (-save, -load below, but in memory) Pipeline: do-th-subset-save.py - to do local graph traversal and produce partition maps do-subset-merge.py - to merge partition maps (optional, inexact) filter-subsets-by-partsize.py - to eliminate small partitions do-th-subset-load.py - to annotate sequences with their partition ID (optional) combine-pe.py - to combine partitions that are joined by paired ends extract-partitions.py - to extract reads into groups, grouped by partition (utility) multi-abyss.py / multi-velvet.py - to assemble (reporting) subset-report.py Hashtable/tagset utilities ========================== load-ht-and-tags.py Miscellaneous utilities ======================= assemstats.py assemstats2.py fastq-to-fasta.py strip-partition.py join_pe.py extract-long-sequences.py Miscellaneous read analysis =========================== abundance-hist-by-position.py hi-lo-abundance-by-position.py Support code ============ test_scripts.py velvet-assemble.sh make-random.py To remove? ========== agglom.py ctb-iterative-bench* do-intertable-part.py extract-surrender.py To classify =========== extract-single-partition.py fasta-to-abundance-hist.py get-occupancy.py graph-partition-separate.py graph-size-py-th.py graph-size.py kmer-abundance-hist.py length-dist.py occupy.py partition-size-dist.py readmask-do-filter.py test.sh