**Dejean's conjecture**

If you take a finite string over some finite alphabet and repeat it twice, you get a

*square*(e.g., abc^2 = abcabc). Similarly, a string repeated three times is a

*cube*(e.g., abc^3 = abcabcabc). It's not too hard to extend this operation to allow the exponent to be a rational number (e.g., abc^(4/3) = abca). In general, if X = Y^r (X and Y are strings, r is a number), then X is called an r-power.

What we can do next is start building infinite strings (the alphabet is still finite) that

*avoid*r-powers for some given r (i.e. no substring of our infinite string is an r-power). The well-known Thue-Morse string over the binary alphabet, for instance, avoids cubes and all powers strictly greater than two. No infinite binary string can avoid squares, though, so we say that the

*repetitive threshold*for the binary alphabet is 2 (put another way, RT(2) = 2).

Françoise Dejean's conjecture gives the value of RT for all alphabet sizes (RT(3) = 7/4, RT(4) = 7/5, RT(k > 4) = k/(k - 1)). The absolutely stunning thing about the conjecture is the sheer number of researchers it took to prove it for various values of k (strictly speaking, it's not a conjecture anymore as it was proved, but old habits die hard, I guess, and the old name still prevails). The final result for the remaining values of k was obtained just recently, in 2009.

The proof techniques get progressively more complex for each alphabet size, so I'm not sure a layman like me can grasp the more recent papers, but the 1984 one from Pansiot (k = 4) seems approachable. The only problem is that it's in French. :-)

**Burrows–Wheeler Transform**

The BWT is fantastically beautiful and extremely useful algorithmic technique. The idea seems deceptively simple: you take a string, sort its cyclic shifts, and then form a new string consisting of the last characters of the shifts (in sorted order). The non-trivial observation is that this transform is invertible, and the inverse, as well as the BWT itself, can be computed efficiently (the naive implementation requires quadratic time/space and is not of much interest).

I used to think the BWT was mostly used for data compression, but it turned out it can do much more than that:

1. The FM-index data structure based on the BWT allows you to build a full-text index for a large string (the go-to example is the human genome, as always) that is way more efficient than suffix arrays/trees/DAWGs.

2. You can solve the inexact string matching problem for short patterns, which is exactly what is needed for short DNA read alignment.

mehasSeptember 2 2014, 03:33:04 UTC 4 years ago

dfyzSeptember 4 2014, 20:10:08 UTC 4 years ago