[{"@context":"http:\/\/schema.org\/","@type":"BlogPosting","@id":"https:\/\/wiki.edu.vn\/en\/wiki12\/burrows-wheeler-transform-wikipedia\/#BlogPosting","mainEntityOfPage":"https:\/\/wiki.edu.vn\/en\/wiki12\/burrows-wheeler-transform-wikipedia\/","headline":"Burrows\u2013Wheeler transform – Wikipedia","name":"Burrows\u2013Wheeler transform – Wikipedia","description":"Algorithm used in data compression techniques The Burrows\u2013Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs","datePublished":"2022-10-03","dateModified":"2022-10-03","author":{"@type":"Person","@id":"https:\/\/wiki.edu.vn\/en\/wiki12\/author\/lordneo\/#Person","name":"lordneo","url":"https:\/\/wiki.edu.vn\/en\/wiki12\/author\/lordneo\/","image":{"@type":"ImageObject","@id":"https:\/\/secure.gravatar.com\/avatar\/cd810e53c1408c38cc766bc14e7ce26a?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/cd810e53c1408c38cc766bc14e7ce26a?s=96&d=mm&r=g","height":96,"width":96}},"publisher":{"@type":"Organization","name":"Enzyklop\u00e4die","logo":{"@type":"ImageObject","@id":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/11\/book.png","url":"https:\/\/wiki.edu.vn\/wiki4\/wp-content\/uploads\/2023\/11\/book.png","width":600,"height":60}},"image":{"@type":"ImageObject","@id":"https:\/\/wikimedia.org\/api\/rest_v1\/media\/math\/render\/svg\/887d46efbb40f6f35ac2b8a5a3a60e42cef84bb2","url":"https:\/\/wikimedia.org\/api\/rest_v1\/media\/math\/render\/svg\/887d46efbb40f6f35ac2b8a5a3a60e42cef84bb2","height":"","width":""},"url":"https:\/\/wiki.edu.vn\/en\/wiki12\/burrows-wheeler-transform-wikipedia\/","wordCount":9742,"articleBody":"Algorithm used in data compression techniques The Burrows\u2013Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. 
This is useful for compression, since a string that has runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible without storing any additional data except the position of the first original character. The BWT is thus a “free” method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows\u2013Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, thus reaching linear time complexity.[1]
Table of Contents: Description; Example; Explanation; Optimization; Bijective variant; Dynamic Burrows\u2013Wheeler transform; Sample implementation; BWT applications; BWT for sequence alignment; BWT for image compression; BWT for compression of genomic databases; BWT for sequence prediction; References; External links
Description
When a character string is transformed by the BWT, the transformation permutes the order of the characters. 
If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.
For example:
Input: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT[2]
The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX, SS, PP, .., II, and III, which together make up 13 of the 44 characters.
Example
The transform is done by sorting all the circular shifts of a text in lexicographic order and extracting the last column, together with the index of the original string in the sorted list of rotations. Given an input string S = ^BANANA| (step 1 in the table below), rotate it N times (step 2), where N = 8 is the length of S, counting both the symbol ^ that marks the start of the string and the | character that serves as the ‘EOF’ pointer; these rotations, or circular shifts, are then sorted lexicographically (step 3). The output of the encoding phase is the last column L = BNN^AA|A after step 3, and the index (0-based) I of the row containing the original string S, in this case I = 6.
Transformation
1. Input: ^BANANA|
2. All rotations    3. Sort into lexical order    4. Take the last column
^BANANA|            ANANA|^B                      B
|^BANANA            ANA|^BAN                      N
A|^BANAN            A|^BANAN                      N
NA|^BANA            BANANA|^                      ^
ANA|^BAN            NANA|^BA                      A
NANA|^BA            NA|^BANA                      A
ANANA|^B            ^BANANA|                      |
BANANA|^            |^BANANA                      A
5. Output: BNN^AA|A
The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. 
It assumes that the input string s contains a special character ‘EOF’ which is the last character and occurs nowhere else in the text.
function BWT (string s)
    create a table, where the rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)
function inverseBWT (string s)
    create empty table
    repeat length(s) times
        \/\/ first insert creates first column
        insert s as a column of table before first column of the table
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
Explanation
To understand why this creates more easily compressible data, consider transforming a long English text that frequently contains the word “the”. Sorting the rotations of this text will group rotations starting with “he ” together, and the last character of each such rotation (which is also the character before the “he ”) will usually be “t”, so the result of the transform will contain a run of “t” characters with the less common exceptions (such as when the text contains “ache ”) mixed in. The success of this transform therefore depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
The remarkable thing about the BWT is not that it generates a more easily encoded output (an ordinary sort would do that) but that it does this reversibly, allowing the original document to be regenerated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. 
Then, the last and first columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the “end of file” character at the end is the original text. Reversing the example above is done like this:
Inverse transformation
Input: BNN^AA|A

Add 1   Sort 1   Add 2   Sort 2
B       A        BA      AN
N       A        NA      AN
N       A        NA      A|
^       B        ^B      BA
A       N        AN      NA
A       N        AN      NA
|       ^        |^      ^B
A       |        A|      |^

Add 3   Sort 3   Add 4   Sort 4
BAN     ANA      BANA    ANAN
NAN     ANA      NANA    ANA|
NA|     A|^      NA|^    A|^B
^BA     BAN      ^BAN    BANA
ANA     NAN      ANAN    NANA
ANA     NA|      ANA|    NA|^
|^B     ^BA      |^BA    ^BAN
A|^     |^B      A|^B    |^BA

Add 5   Sort 5   Add 6    Sort 6
BANAN   ANANA    BANANA   ANANA|
NANA|   ANA|^    NANA|^   ANA|^B
NA|^B   A|^BA    NA|^BA   A|^BAN
^BANA   BANAN    ^BANAN   BANANA
ANANA   NANA|    ANANA|   NANA|^
ANA|^   NA|^B    ANA|^B   NA|^BA
|^BAN   ^BANA    |^BANA   ^BANAN
A|^BA   |^BAN    A|^BAN   |^BANA

Add 7     Sort 7    Add 8      Sort 8
BANANA|   ANANA|^   BANANA|^   ANANA|^B
NANA|^B   ANA|^BA   NANA|^BA   ANA|^BAN
NA|^BAN   A|^BANA   NA|^BANA   A|^BANAN
^BANANA   BANANA|   ^BANANA|   BANANA|^
ANANA|^   NANA|^B   ANANA|^B   NANA|^BA
ANA|^BA   NA|^BAN   ANA|^BAN   NA|^BANA
|^BANAN   ^BANANA   |^BANANA   ^BANANA|
A|^BANA   |^BANAN   A|^BANAN   |^BANANA

Output: ^BANANA|
Optimization
A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. 
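The table-free decoder described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular compressor: the function names bwt_encode, bwt_decode and bwt_decode_table are hypothetical, the input is assumed to be non-empty, characters are compared by code point, and the decoder takes the row index I from the Example section rather than searching for the EOF character. The encoder sorts row indices instead of storing a table, although its sort key still builds each rotation for comparison, which a real encoder would avoid. The table-building inverse is included only as a reference to compare against.

```python
from collections import Counter

def bwt_encode(s):
    # Represent each rotation by its start index and sort the indices.
    # Simplification: the key still builds each rotation for comparison;
    # an efficient encoder would compare in place or use a suffix array.
    # Assumes s is non-empty.
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = ''.join(s[i - 1] for i in order)  # character preceding each rotation
    return last, order.index(0)              # start index 0 is the original string

def bwt_decode(last, index):
    # first_row[c]: first sorted row whose rotation starts with character c.
    counts = Counter(last)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # lf[i]: row of the rotation obtained by moving last[i] of row i to the front.
    seen = Counter()
    lf = []
    for c in last:
        lf.append(first_row[c] + seen[c])
        seen[c] += 1
    # Start at the row holding the original string and emit its characters
    # from right to left; no table and no sort of rows is needed.
    out = []
    i = index
    for _ in range(len(last)):
        out.append(last[i])
        i = lf[i]
    return ''.join(reversed(out))

def bwt_decode_table(last, eof):
    # Reference version of the table-building inverse, for comparison only.
    table = [''] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith(eof))
```

With the example above, bwt_encode('^BANANA|') gives ('BNN^AA|A', 6), and both bwt_decode('BNN^AA|A', 6) and bwt_decode_table('BNN^AA|A', '|') return ^BANANA|. The only sorting done by bwt_decode is over the distinct characters of the alphabet, which matches the stated cost of time proportional to the alphabet size and string length.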
A “character” in the algorithm can be a byte, or a bit, or any other convenient size. One may also observe that, mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed in linear time and memory. The BWT can be defined with regard to the suffix array SA of text T as (1-based indexing):"},{"@context":"http:\/\/schema.org\/","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/wiki.edu.vn\/en\/wiki12\/#breadcrumbitem","name":"Enzyklop\u00e4die"}},{"@type":"ListItem","position":2,"item":{"@id":"https:\/\/wiki.edu.vn\/en\/wiki12\/burrows-wheeler-transform-wikipedia\/#breadcrumbitem","name":"Burrows\u2013Wheeler transform – Wikipedia"}}]}]