Byte Size

Crates.io Docs.rs License

A short-string compressor/decompressor that can store any of 20,000+ words in three bytes or fewer.

Like smaz, byte-size can compress small strings, something that conventional compression algorithms struggle with.

However, byte-size is typically better than smaz, especially for very common words: of the 10,000 most common words, fewer than 1% compressed better with smaz. byte-size also represents numbers, repeated sequences, and non-alphanumeric characters more efficiently than smaz. It can encode Unicode characters, though not very efficiently: if your text includes a few Unicode characters it should still compress well, but if your strings are mostly Unicode, other schemes such as Unishox are a better fit.

Cost

byte-size uses several tables with over 18,000 total entries. This incurs a significant runtime-memory and binary-size cost, but if you can spare the memory, the improved compression is worth it.

Examples

Using the example strings from smaz, we have:

[Insert examples]

Every example compresses smaller with byte-size than with smaz.

How?

It's basically just two tables: one of roughly 1,800 of the most commonly used lemmas (each expressible as 2 bytes) and another of tens of thousands of less common lemmas (each expressible as 3 bytes).

On top of that, a few commonly used 2- and 3-byte sequences are expressible as just 1 byte; these can serve as lemma prefixes/suffixes, or be combined to construct words not found in either list.
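One way to picture building a word from such pieces is greedy longest-match segmentation. This is a sketch only: the tiny table and the `segment` function here are illustrative stand-ins, not the crate's actual tables or API.

```rust
/// Greedy longest-match segmentation sketch. `table` stands in for the
/// byte-size lists; the entries and names here are illustrative only.
fn segment(input: &str, table: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Prefer the longest table entry that prefixes the remaining input.
        let best = table
            .iter()
            .copied()
            .filter(|e| rest.starts_with(e))
            .max_by_key(|e| e.len());
        match best {
            Some(e) => {
                out.push(e.to_string());
                rest = &rest[e.len()..];
            }
            None => {
                // Fall back to emitting a single character literally.
                let ch = rest.chars().next().unwrap();
                out.push(ch.to_string());
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}

fn main() {
    // "unbreaking" is not in the table, but its pieces are.
    let table = ["un", "break", "ing", "th", "er"];
    println!("{:?}", segment("unbreaking", &table));
}
```

A real encoder would also weigh whether a two- or three-byte whole-lemma code beats a sequence of one-byte pieces; greedy matching is just the simplest illustration.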

There are 3 lists:

- One byte wonders (OBW): the printable ASCII characters (plus a few control sequences), common prefixes/suffixes, and common bigrams.
- Two byte common (TBC): 1793 of the most common lemmas (that aren't already in the OBW list).
- Three byte uncommon (TBU): 16512 of the next most common lemmas (that aren't in either previous list).
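The one/two/three-byte split implies a variable-length code in which the first byte of a token decides which list it indexes. The sketch below is an assumption for illustration only: the split points 0xC0 and 0xE0 are invented here (32 two-byte lead values give 32 × 256 = 8192 ≥ 1793 TBC slots, and the three-byte range has far more room than 16512 TBU slots); the crate's actual layout is defined in the Encoding section.

```rust
/// Which list a decoded token indexes into.
#[derive(Debug, PartialEq)]
enum Token {
    Obw(u8),  // one byte wonders
    Tbc(u16), // two byte common
    Tbu(u32), // three byte uncommon
}

/// Read one token from the front of `bytes`, returning it and its width.
/// Lead values below 0xC0 are one-byte codes, 0xC0..0xE0 start two-byte
/// codes, and 0xE0.. start three-byte codes (ranges are illustrative).
/// Truncated input yields None.
fn next_token(bytes: &[u8]) -> Option<(Token, usize)> {
    match bytes {
        [b, ..] if *b < 0xC0 => Some((Token::Obw(*b), 1)),
        [b, c, ..] if *b < 0xE0 => {
            Some((Token::Tbc((((*b - 0xC0) as u16) << 8) | *c as u16), 2))
        }
        [b, c, d, ..] => Some((
            Token::Tbu((((*b - 0xE0) as u32) << 16) | ((*c as u32) << 8) | *d as u32),
            3,
        )),
        _ => None,
    }
}

fn main() {
    println!("{:?}", next_token(&[0x61]));             // one-byte code
    println!("{:?}", next_token(&[0xC1, 0x05]));       // two-byte code
    println!("{:?}", next_token(&[0xE0, 0x01, 0x02])); // three-byte code
}
```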

These lists are stored in the package root directory and can be modified; a modified list will be picked up and used. Each list is a file with one lemma per line, percent-encoded so that non-printable characters and Unicode sequences can be represented.
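A minimal sketch of reading such a list, assuming each line is ASCII percent-encoding; the `percent_decode` and `load_list` names are illustrative, not the crate's loader.

```rust
/// Decode one percent-encoded list line into raw bytes. Assumes the line
/// itself is ASCII; malformed escapes are passed through literally.
fn percent_decode(line: &str) -> Vec<u8> {
    let bytes = line.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            // Try to parse the two hex digits after '%'.
            let hex = std::str::from_utf8(&bytes[i + 1..i + 3]).ok();
            if let Some(v) = hex.and_then(|s| u8::from_str_radix(s, 16).ok()) {
                out.push(v);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// Load a whole list: one lemma per line.
fn load_list(contents: &str) -> Vec<Vec<u8>> {
    contents.lines().map(percent_decode).collect()
}

fn main() {
    // "the", a literal newline (%0A), and U+00E9 'é' as UTF-8 (%C3%A9).
    for lemma in load_list("the\n%0A\n%C3%A9") {
        println!("{:?}", lemma);
    }
}
```

Decoding to `Vec<u8>` rather than `String` keeps non-UTF-8 control bytes representable, which is the point of percent-encoding the files.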

Encoding

The Snaz encoding is as follows: