200m reads / 20gb of Iowa prairie GPGC reads.
Started Aug 23, with commit 69b71c1de56, using do-th-subset-save.py.

real    1188m36.685s   (20 hrs)
user    2489m9.995s    (41 hrs)
sys     38m56.133s

K=32
HTSIZE=4**17 + 1
N_THREADS=4
SUBSET_SIZE=1m

One major slowdown for the subset calculation was that the subset-save
I/O functions did not release the GIL; that was the next commit!

Used approximately 30gb of RAM (2x x-large EC2)

---

FAILURE!

iowa corn:

520340000 reads, 21837974692 k-mers
hashtable occupancy: 0.290823292997

started Aug 24, with commit cb65c122edc, using do-th-subset-save.py.

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 5m

busted 68gb of RAM (4x x-large EC2)

--

FAILURE!

iowa prairie:

641190000 reads, 27472734843 k-mers
hashtable occupancy: 0.318750168071

started Aug 24, with commit cb65c122edc, using do-th-subset-save.py.

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 5m

busted 68gb of RAM (4x x-large EC2)

--

iowa corn / all reads
subset size 1m
htsize = 2* 4**17 + 1

64gb of ram standing!!

--

RUNNING:

iowa corn

started August 25th, with commit 6d2eaab378, using do-th-subset-save.py.
(subsets, independently saved, using set for all_tags).

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 1m

=> 447 subsets of 1m (i.e. 447m tags)

ht & tags saved.

holding steady at 56.5gb.

447 subsets => 224 (pairwise merge) in 8 threads/~8gb of RAM (1gb/thread),
real    29m24.087s
user    93m54.879s
sys     4m32.190s

224 subsets => 117 (pairwise merge) in 8 threads/17gb of RAM (2gb/thread)
real    30m27.233s
user    98m54.571s
sys     5m48.366s

117 subsets => 59 in 8 threads/36gb of RAM (4.5gb/thread)
real    32m34.097s
user    91m50.299s
sys     4m18.347s

59 subsets => 28 (??) in 6 threads/??gb of RAM
real    35m30.460s
user    85m20.804s
sys     4m57.506s

28 subsets => 14 in 2 threads/40 gb of RAM (20gb/thread)
...

Get it down to seven, then:

  % python ~/khmer/scripts/filter-subsets-by-partsize.py /mnt/iowa-corn-*.tagset m7.merge.*.pmap

which runs in 46gb of RAM in about 3 hours.

To quote,

TM size: 446639081
maxifying: m7.merge.0.pmap
maxifying: m7.merge.1.pmap
maxifying: m7.merge.2.pmap
maxifying: m7.merge.3.pmap
maxifying: m7.merge.4.pmap
maxifying: m7.merge.5.pmap
maxifying: m7.merge.6.pmap
discarding
filtering: m7.merge.0.pmap
OLD partition map size: 149326590
NEW partition map size: 93784167
filtering: m7.merge.1.pmap
OLD partition map size: 149508451
NEW partition map size: 93784167
filtering: m7.merge.2.pmap
OLD partition map size: 149639032
NEW partition map size: 93784167
^R
filtering: m7.merge.3.pmap
OLD partition map size: 148603585
NEW partition map size: 93784167
filtering: m7.merge.4.pmap
OLD partition map size: 149595619
NEW partition map size: 93784167
filtering: m7.merge.5.pmap
OLD partition map size: 149625331
NEW partition map size: 93784167
filtering: m7.merge.6.pmap
OLD partition map size: 149455440
NEW partition map size: 93784167

Now doing merge, in about 22gb of RAM, with do-th-subset-load (id
4e15e692de6); ~2 hrs.

---

RUNNING:

iowa prairie

started August 25th, with commit 6d2eaab378, using do-th-subset-save.py.
(subsets, independently saved, using set for all_tags).

K=32
HTSIZE=2 * 4**17 + 1
N_THREADS = 8
SUBSET_SIZE = 1m

=> 532 subsets of 1m (i.e. 532m tags)

ht & tags saved.

holding steady at 60.7gb.

532 subsets => 266 (pairwise merge) in 8 threads/~8gb of RAM (1gb/thread),
real    34m56.468s
user    115m54.872s
sys     6m26.410s

266 subsets => 133 in 8 threads/22gb of RAM (2.5gb/thread)
real    36m50.139s
user    119m41.566s
sys     6m3.270s

133 subsets => 67 in 8 threads/40 gb of RAM (~5 gb/thread)
real    37m51.513s
user    104m36.663s
sys     6m30.359s

** update to 7e0948bd835, removed memcpy in favor of pointer manipulation

67 subsets => 34 subsets in 6 threads/55 gb of RAM (~10gb/thread)
real    45m43.421s
user    121m8.939s
sys     7m50.598s

34 subsets => 17 subsets in 3 threads/?? gb of RAM
real    50m18.937s
user    89m19.661s
sys     4m8.407s

17 subset => 9 in 1 thread/36 gb of RAM

...


filtering: subsets-merge6.merge.12.pmap
OLD partition map size: 157617196
NEW partition map size: 130577864
filtering: subsets-merge7.merge.0.pmap
OLD partition map size: 184833576
NEW partition map size: 130577864

...

28gb to load 7+ of 9 subsets with do-th-subset-load.py.

---

Sep 4, 2010; Iowa corn.  Initial partitioning (32gb ht, 1m subset size)

54gb RAM standing

real    3107m47.636s
user    23305m52.488s
sys     3m39.383s

Sep 4 , 2010: Iowa prairie.  Initial partitioning (32gb ht, 1m subset size)

58gb RAM standing

real    3734m7.595s
user    27826m49.692s
sys     3m23.419s

---

22906493 is a good way to over-surrender on 250k reads.