Jump to content

Essays/Burrows-Wheeler Transform

From J Wiki

The Burrows-Wheeler transform is a brilliantly simple way to group together characters in a string, making it more amenable to compression. Crucially, transform is bijective.

The forward transform works by arranging all rotations of the string, sorting the rotations, and taking the last column of the resultant array. Let us follow the example in the paper in J:

   S=:'abraca'
   N=.#S
   (i.N)|."0 1 S
abraca
bracaa
racaab
acaabr
caabra
aabrac
   M=./:~ (i.N)|."0 1 S
   M
aabrac
abraca
acaabr
bracaa
caabra
racaab
   {:"1 M
caraab

We will also need the index of the original string in the sorted array to be able to undo the transform, viz.

   I=.M i. S
   I
1

Putting it all together in a monad:

bw =: monad define
    S=.y
    N=.#S
    M=./:~ (i.N)|."0 1 S
    ({:"1 M);M i. S
)
   bw 'abraca'
┌──────┬─┐
│caraab│1│
└──────┴─┘

; (link) is a J idiom where one would return multiple values as a tuple in other languages.

Let L be the transformed string; to undo the transform, begin by forming the vector T of indices where the characters of L appear in the sorted string. The J implementation offers some clarity:

   L=.'caraab'
   T=.(/:~L) pi L
   T
4 0 5 1 2 3

where pi is referenced from "progressive index-of."

oc=: [: ((] - {) /:@/:) i.~
pi=: #@[ ({. i.&(,.oc) }.) [ i.,

The paper then defines (somewhat confusingly) Ti(x) to be the image of x by permutation according to T applied i times. Again J offers some clarity; we tabulate the iterated permutations:

   N=.#L
   1 |. ({/\(T"0 (i.N)))
2 4 3 0 5 1
5 2 1 4 3 0
3 5 0 2 1 4
1 3 4 5 0 2
0 1 2 3 4 5
4 0 5 1 2 3

Extracting the Ith column (recall I is the index returned by bw, defined to be M i. S):

   ix=.(I&{)"1 (1 |. {/\(T"0 (i.N)))
   ix
4 2 5 3 1 0

which defines a permutation of L to recover the original string (backwards and with a little twiddling of the indices):

   2 |. |. ix { L
abraca

We can define ix in one step using F:.

   ix=.(I&{) F:. { (T"0 i.(N+1))
   ix
4 2 5 3 1 0

or (the below avoids copying/filling an array with T)

    ix=. T (I&{) F:. (] T&{) (i.N)
    ix
4 2 5 3 1 0

which extracts the Ith index as we go.

So we have recovered the original string S with L and I.

Putting it all together in a dyad:

bw_d =: dyad define
    L=.x
    I=.y
    T=.(/:~L) pi L
    N=.#L
    ix=. T (I&{) F:. (] T&{) (i.N)
    2 |. |. ix { L
)
   'caraab' bw_d 1
abraca

Contributed by Vanessa McHale, including suggested improvements by Michael Day.