Иван Комаров (dfyz) wrote,
Иван Комаров


So the CSEDays 2014 conference (dubbed "Strings, Languages, Automata") is over, and I thought I'd write briefly about two things that fascinated me the most. I'm no expert on neither combinatorics on words nor pattern matching algorithms, so my understanding doesn't go beyond superficial, but here it goes anyway.

Dejean's conjecture

If you take a finite string over some finite alphabet and repeat it twice, you get a square (e.g., abc^2 = abcabc). Similarly, a string repeated three times is a cube (e.g., abc^3 = abcabcabc). It's not too hard to extend this operation to allow the exponent to be a rational number (e.g., abc^(4/3) = abca). In general, if X = Y^r (X and Y are strings, r is a number), then X is called an r-power.

What we can do next is start building infinite strings (the alphabet is still finite) that avoid r-powers for some given r (i.e. no substring of our infinite string is an r-power). The well-known Thue-Morse string over the binary alphabet, for instance, avoids cubes and all powers strictly greater than two. No infinite binary string can avoid squares, though, so we say that the repetitive threshold for the binary alphabet is 2 (put another way, RT(2) = 2).

Françoise Dejean's conjecture gives the value of RT for all alphabet sizes (RT(3) = 7/4, RT(4) = 7/5, RT(k > 4) = k/(k - 1)). The absolutely stunning thing about the conjecture is the sheer number of researchers it took to prove it for various values of k (strictly speaking, it's not a conjecture anymore as it was proved, but old habits die hard, I guess, and the old name still prevails). The final result for the remaining values of k was obtained just recently, in 2009.

The proof techniques get progressively more complex for each alphabet size, so I'm not sure a layman like me can grasp the more recent papers, but the 1984 one from Pansiot (k = 4) seems approachable. The only problem is that it's in French. :-)

Burrows–Wheeler Transform

The BWT is fantastically beautiful and extremely useful algorithmic technique. The idea seems deceptively simple: you take a string, sort its cyclic shifts, and then form a new string consisting of the last characters of the shifts (in sorted order). The non-trivial observation is that this transform is invertible, and the inverse, as well as the BWT itself, can be computed efficiently (the naive implementation requires quadratic time/space and is not of much interest).

I used to think the BWT was mostly used for data compression, but it turned out it can do much more than that:

1. The FM-index data structure based on the BWT allows you to build a full-text index for a large string (the go-to example is the human genome, as always) that is way more efficient than suffix arrays/trees/DAWGs.

2. You can solve the inexact string matching problem for short patterns, which is exactly what is needed for short DNA read alignment.
  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your IP address will be recorded 

Спасибо, было интересно. А насколько (в разных типичных ситуациях) FM-индекс лучше других структур?
Мне на самом деле известно только одно практическое применение (биоинформатика). Как утверждал докладчик, в биоинформатике FM-index рвёт всех на китайский флаг, ибо в несколько раз компактнее, чем всё остальное.