Downloads - executables, source code, and corpora
MSufSort 4 - Parallel Suffix Array Construction Algorithm (Preview/Demo)
An early demonstration of MSufSort 4 which was missing some core algorithms (induction sorting). This release was intended simply to demonstrate its generally superior performance, but on some inputs the current absence of MSufSort's core induction sorting algorithm will lead to poor performance.
The Gauntlet Corpus
A special corpus that I put together as a robustness corpus for suffix array construction algorithms (SACAs). The collection is designed to stress SACAs with difficult edge cases and with inputs which are known to be problematic for various approaches to suffix array construction. This is an open corpus that anyone can contribute to if they can demonstrate that their contribution is problematic for one or more modern SACAs.
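For anyone wanting to try a small edge-case input against a SACA before proposing it, a brute-force reference is a handy sanity check. The sketch below is not part of MSufSort or of the corpus tooling; the function names `naive_suffix_array` and `matches_reference` are hypothetical and chosen only for illustration. It sorts suffix start positions by direct comparison, which is far too slow for real corpus files but is easy to trust on short inputs.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Brute-force reference suffix array (illustrative only, not MSufSort's method).
// Sorts suffix start positions by comparing the suffixes directly.
// Worst case is roughly O(n^2 log n), so use it only on small inputs.
std::vector<std::int32_t> naive_suffix_array(const std::string& text) {
    std::vector<std::int32_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, 2, ..., n-1
    std::sort(sa.begin(), sa.end(),
              [&text](std::int32_t a, std::int32_t b) {
                  // Lexicographic comparison of suffix a against suffix b;
                  // a suffix that is a proper prefix of another sorts first.
                  return text.compare(a, std::string::npos,
                                      text, b, std::string::npos) < 0;
              });
    return sa;
}

// Returns true when a candidate suffix array (for example, the output of the
// SACA under test) agrees with the brute-force reference.
bool matches_reference(const std::string& text,
                       const std::vector<std::int32_t>& candidate) {
    return candidate == naive_suffix_array(text);
}
```

Short, highly repetitive strings such as a run of a single symbol ("aaaa") or a tandem repeat ("abababab") are the kind of input where a comparison like this is cheap to run and a mismatch is easy to inspect by hand.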
MSufSort 3.1.1 - Suffix Array Construction
The last stable MSufSort release, dating back to 2007, introduced several new concepts such as the tandem repeat sort, a cache-friendly second-stage ITS, and BWT construction directly from the first-stage ITS. These ideas have subsequently been adopted by other top suffix array construction algorithms.
M99 - Fast Burrows/Wheeler Compressor
A BWT compressor from 1999 which demonstrated high compression and speed using a unique entropy encoding algorithm for direct encoding of the BWT.
M03 - Context Aware Burrows/Wheeler Compressor
The first and only context-aware BWT compressor. Achieves very high compression. It actually took until 2009 to implement and is a beta that I wrote before the birth of my first daughter (I knew I would not have any time for development after that). I plan to reimplement M03 in the future to include a technique I call 'context skipping' which, in early experiments, doubled the speed of context modeling. It's still unclear how much this will improve compression, though. Only time will tell.