200m reads / 20gb of Iowa prairie GPGC reads.

Started Aug 23, with commit 69b71c1de56, using do-th-subset-save.py.

real    1188m36.685s (20 hrs)
user    2489m9.995s (41 hrs)
sys     38m56.133s

K=32
HTSIZE=4**17 + 1
N_THREADS=4
SUBSET_SIZE=1m

One major slowdown for the subset calculation was that the subset-save
I/O functions did not release the GIL; that was the next commit!

Used approximately 30gb of RAM (2x x-large EC2)

---

FAILURE! iowa corn: 520340000 reads, 21837974692 k-mers
hashtable occupancy: 0.290823292997

started Aug 24, with commit cb65c122edc, using do-th-subset-save.py.

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 5m

busted 68gb of RAM (4x x-large EC2)

--

FAILURE! iowa prairie: 641190000 reads, 27472734843 k-mers
hashtable occupancy: 0.318750168071

started Aug 24, with commit cb65c122edc, using do-th-subset-save.py.

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 5m

busted 68gb of RAM (4x x-large EC2)

--

iowa corn / all reads
subset size 1m
htsize = 2 * 4**17 + 1
64gb of ram standing!!

--

RUNNING: iowa corn

started August 25th, with commit 6d2eaab378, using do-th-subset-save.py.
(subsets, independently saved, using set for all_tags).

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 1m

=> 447 subsets of 1m (i.e. 447m tags)

ht & tags saved.

holding steady at 56.5gb.

447 subsets => 224 (pairwise merge) in 8 threads/~8gb of RAM (1gb/thread),

real    29m24.087s
user    93m54.879s
sys     4m32.190s

224 subsets => 117 (pairwise merge) in 8 threads/17gb of RAM (2gb/thread)

real    30m27.233s
user    98m54.571s
sys     5m48.366s

117 subsets => 59 in 8 threads/36gb of RAM (4.5gb/thread)

real    32m34.097s
user    91m50.299s
sys     4m18.347s

59 subsets => 28 (??) in 6 threads/??gb of RAM

real    35m30.460s
user    85m20.804s
sys     4m57.506s

28 subsets => 14 in 2 threads/40 gb of RAM (20gb/thread)

...

Get it down to seven, then:

  % python ~/khmer/scripts/filter-subsets-by-partsize.py /mnt/iowa-corn-*.tagset m7.merge.*.pmap

which runs in 46gb of RAM in about 3 hours.
To quote,

  TM size: 446639081
  maxifying: m7.merge.0.pmap
  maxifying: m7.merge.1.pmap
  maxifying: m7.merge.2.pmap
  maxifying: m7.merge.3.pmap
  maxifying: m7.merge.4.pmap
  maxifying: m7.merge.5.pmap
  maxifying: m7.merge.6.pmap
  discarding
  filtering: m7.merge.0.pmap
  OLD partition map size: 149326590
  NEW partition map size: 93784167
  filtering: m7.merge.1.pmap
  OLD partition map size: 149508451
  NEW partition map size: 93784167
  filtering: m7.merge.2.pmap
  OLD partition map size: 149639032
  NEW partition map size: 93784167
  filtering: m7.merge.3.pmap
  OLD partition map size: 148603585
  NEW partition map size: 93784167
  filtering: m7.merge.4.pmap
  OLD partition map size: 149595619
  NEW partition map size: 93784167
  filtering: m7.merge.5.pmap
  OLD partition map size: 149625331
  NEW partition map size: 93784167
  filtering: m7.merge.6.pmap
  OLD partition map size: 149455440
  NEW partition map size: 93784167

Now doing merge, in about 22gb of RAM, with do-th-subset-load
(id 4e15e692de6); ~2 hrs.

---

RUNNING: iowa prairie

started August 25th, with commit 6d2eaab378, using do-th-subset-save.py.
(subsets, independently saved, using set for all_tags).

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 1m

=> 532 subsets of 1m (i.e. 532m tags)

ht & tags saved.

holding steady at 60.7gb.

532 subsets => 266 (pairwise merge) in 8 threads/~8gb of RAM (1gb/thread),

real    34m56.468s
user    115m54.872s
sys     6m26.410s

266 subsets => 133 in 8 threads/22gb of RAM (2.5gb/thread)

real    36m50.139s
user    119m41.566s
sys     6m3.270s

133 subsets => 67 in 8 threads/40 gb of RAM (~5 gb/thread)

real    37m51.513s
user    104m36.663s
sys     6m30.359s

** update to 7e0948bd835, removed memcpy in favor of pointer manipulation

67 subsets => 34 subsets in 6 threads/55 gb of RAM (~10gb/thread)

real    45m43.421s
user    121m8.939s
sys     7m50.598s

34 subsets => 17 subsets in 3 threads/?? gb of RAM

real    50m18.937s
user    89m19.661s
sys     4m8.407s

17 subsets => 9 in 1 thread/36 gb of RAM

...
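
Side note on the merge rounds above: each pairwise round should roughly halve the subset count, with an odd leftover carried into the next round. A minimal sketch of that arithmetic follows. `merge_rounds` is a hypothetical helper, not part of khmer. It reproduces the prairie sequence logged above (532 down to 9). The corn log records 117 and 59 where pure halving gives 112 and 56, so treat this as an idealization.

```python
def merge_rounds(n_subsets, target):
    """Subset counts after each pairwise-merge round, stopping at <= target.

    Assumes every round pairs subsets two at a time and carries an odd
    leftover subset forward unmerged (i.e. ceil(n / 2) per round).
    """
    counts = [n_subsets]
    while counts[-1] > target:
        counts.append((counts[-1] + 1) // 2)  # integer ceil(n / 2)
    return counts


# Prairie run: 532 subsets of 1m tags each, merged down to 9.
print(merge_rounds(532, 9))
# Corn run: 447 subsets, merged down to 7 before filtering.
print(merge_rounds(447, 7))
```

Both runs need six rounds to reach their targets under this assumption.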
  filtering: subsets-merge6.merge.12.pmap
  OLD partition map size: 157617196
  NEW partition map size: 130577864
  filtering: subsets-merge7.merge.0.pmap
  OLD partition map size: 184833576
  NEW partition map size: 130577864

...

28gb to load 7+ of 9 subsets with do-th-subset-load.py.

---

Sep 4, 2010; Iowa corn.

Initial partitioning (32gb ht, 1m subset size)

54gb RAM standing

real    3107m47.636s
user    23305m52.488s
sys     3m39.383s

Sep 4, 2010; Iowa prairie.

Initial partitioning (32gb ht, 1m subset size)

58gb RAM standing

real    3734m7.595s
user    27826m49.692s
sys     3m23.419s

---

22906493 is a good way to over-surrender on 250k reads.
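
The `time` figures in these notes are in minutes and seconds, which gets hard to read at thousands of minutes. Below is a small sketch for converting them to hours and getting the user/real ratio as a rough measure of how many cores were kept busy. `to_seconds` and `summarize` are hypothetical helper names, and the inputs are the Sep 4 initial-partitioning timings copied from above.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '3107m47.636s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def summarize(real, user):
    """Return (wall hours, CPU hours, user/real ratio) for one run."""
    real_s = to_seconds(real)
    user_s = to_seconds(user)
    return real_s / 3600, user_s / 3600, user_s / real_s


runs = [
    ("iowa corn", "3107m47.636s", "23305m52.488s"),
    ("iowa prairie", "3734m7.595s", "27826m49.692s"),
]
for label, real, user in runs:
    wall_h, cpu_h, ratio = summarize(real, user)
    print(f"{label}: {wall_h:.1f} h wall, {cpu_h:.1f} h CPU, {ratio:.2f}x")
```

Both Sep 4 runs come out at a ratio of about 7.5. That would be consistent with the N_THREADS = 8 setting of the earlier runs, though the thread count for the Sep 4 runs is not recorded here.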