2017-01-08 Variable size integer encoding

Posted on January 8, 2017

It is a known fact of nature that most integers are small. When transferring data over the network, or to/from disk, or pretty much anywhere the IO is slow, keeping data compact is a plus. Obviously, compression is an option, but it is fairly CPU heavy. Taking advantage of the fact that, in practice, most integer values are small is an option to effectively reduce the amount of data for a modest cost in term of CPU.

LEB128

LEB128 is a variable length encoding used in the file format DWARF, used for debug data and exception unwinding. Variation of LEB128 are used in many software, including git, protocol buffer and many others. The technique involve splitting the integer in groups of 7bits. Each bytes in the encoding contains 7 bits of the integer, plus one bit indicating if there is a next group of 7 bits coming.

Range	Encoding
0 .. 2^7 – 1	0xxxxxxx
2^7 .. 2^14 – 1	1xxxxxxx 0xxxxxxx
2^14 .. 2^21 – 1	1xxxxxxx 1xxxxxxx 0xxxxxxx
2^21 .. 2^28 – 1	1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
2^28 .. 2^35 – 1	1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx

UTF-8

UTF-8 was made with a few constraints in mind. First, it had to be compatible with ASCII, which means all value bellow 128 must be encoded as this. In addition, none of the remaining bytes in the multibyte encoding must be in the 0 – 127 range, so legacy software wouldn’t confuse them for ASCII characters. As text processing is often performance critical, it needed to be decodable quickly. Finally, decoding would happen forward, but also backward.

UTF-8 ‘s first byte is made of a variable numbers of 1, followed by a 0, and then actual bits from the data. The number of ones indicate the number of bytes this value uses. For one byte values, the number of ones would be 0, and, as a result, integer in the 0 – 127 range can be encoded as themselves (they all start with a 0 bit). This means that, contrary to the LEB style of encoding, the number of bytes is known up-front and this can be leveraged to increase ILP when forward decoding.

Following bytes are started by 10, and then 6 buts of payload. This is necessary so that none of theses bytes ends up in the 0 – 127 range, but also to help backward decoding. Because 1 bytes integers start with 0 and 2 bytes ones starts with 110, a bytes starting with 10 is a sure sign that it is within a multi-byte encoding.

Range	Encoding
0 .. 2^7 – 1	0xxxxxxx
2^7 .. 2^11 – 1	110xxxxx 10xxxxxx
2^11 .. 2^16 – 1	1110xxxx 10xxxxxx 10xxxxxx
2^16 .. 2^21 – 1	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
2^21 .. 2^26 – 1	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The UTF-8 encoding is not as efficient in term of space as the previously mentioned encoding, however, it is remarkable that it can achieve such a level of compatibility with raw ASCII, while remaining fairly compact and efficient.

Bitcoin’s var_int

The bitcoin protocol uses a form of variable size encoding for integer, mostly for expressing array like structures. The first byte of the integer is either 0xFD for 3 bytes encoding, 0xFE for 5 bytes encoding, 0xFF for 8 bytes encoding and is considered to be a 1 bytes integer for anything less than 0xFD .

this scheme is quite primitive but is very easy to understand. It encode integer on 1 bytes for values up to 252, but pessimize everything past that point because it require a 1 byte prefix.

Range	Encoding
0 .. 252	(1 byte integer)
253 .. 2^16 – 1	0xFD (2 byte integer)
2^16 .. 2^32 – 1	0xFE (4 byte integer)
2^32 .. 2^64 – 1	0xFF (8 byte integer)

Dlugosz’s Variable-Length Integer Encoding

This encoding is used in the ZIP format. It strive to encode small integers efficiently, but also try to have efficient encoding for larger yet common integer formats, such as 128 bits UUID, arbitrarily large integers and has extension points for future encoding. The format uses a header as does UTF-8 , but the header is somewhat more complex and subsequent bytes do not need to start with 10.

Follow that link to learn more about Dlugosz’ Variable-Length Integer Encoding.

Putting it all together

All these formats strike different trade offs. However, most do have some inherent inefficiencies. For instance, most of these formats offer redundant ways to encode small numbers as large number encoding can also be used to store small numbers. The range that large numbers can encode could by offsetting it by the largest value the previous range can encode. This comes at the cost of one addition per encoding/decoding, which is relatively cheap. While the range that one would win isn’t that great, it is sometime a problem to not have a canonical representation of numbers, and, in these case, a range check needs to be performed anyway, and that’s no less expensive than the addition, but do not extend the range.