sqlite-zstd

Extension for sqlite that provides transparent dictionary-based row-level compression for sqlite. This basically allows you to compress entries in a sqlite database almost as well as if you were compressing the whole DB file, but while retaining random access.

Depending on the data, this can reduce the size of the database by 90% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).

Transparent Compression

Basic Functionality

md5-21f416f23e9275fe1a7ca44fb18e5237

if compact is true, the output will be without magic header, without checksums, and without dictids. This will save 4 bytes when not using dictionaries and 8 bytes when using dictionaries. this also means the data will not be decodeable as a normal zstd archive with the standard tools. The same compact argument must also be passed to the decompress function.

  • zstd_decompress(data: blob, is_text: bool, dictionary: blob | int | null = null, compact: bool = false) -> text|blob

    Decompresses the given data. if the dictionary is wrong, the result is undefined

    Note that passing dictionary as an int is recommended, since then the dictionary only has to be prepared once.

    is_text specifies whether to output the data as text or as a blob. Note that when outputting as text the encoding depends on the sqlite database encoding. sqlite-zstd is only tested with UTF-8.

    compact must be specified when the compress function was also called with compact.

  • zstd_train_dict(agg, dict_size: int, sample_count: int) -> blob

    Aggregate function (like sum() or count()) to train a zstd dictionary on sample_count samples of the given aggregate data

    Example use: select zstd_train_dict(tbl.data, 100000, 1000) from tbl will return a dictionary of size 100kB trained on 1000 samples in tbl

    The recommended number of samples is 100x the target dictionary size. As an example, you can train a dict of 100kB with the "optimal" sample count as follows:

    sql select zstd_train_dict(data, 100000, (select (100000 * 100 / avg(length(data))) as sample_count from tbl)) as dict from tbl

    Note that dictsize and samplecount are assumed to be constants.

  • zstd_train_dict_and_save(agg, dict_size: int, sample_count: int) -> int

    Same as zstd_train_dict, but the dictionary is saved to the _zstd_dicts table and the id is returned.

  • Compiling

    This project can be built in two modes: (a) as a Rust library and (b) as a pure SQLite extension (with --features build_extension).

    You can get the SQLite extension binaries from the GitHub releases. Alternatively, you can build the extension by hand:

    ``` cargo build --release --features build_extension

    should give you target/release/libsqlite_zstd.so

    ```

    Usage

    You can either load this library as SQLite extension or as a Rust library. Note that sqlite extensions are not persistent, so you need to load it each time you connect to the database.

    Sqlite CLI

    Either load it in the REPL:

    sh $ sqlite3 file.db SQLite version 3.34.0 2020-12-01 16:14:00 sqlite> .load .../libsqlite_zstd.so [2020-12-23T21:30:02Z INFO sqlite_zstd::create_extension] [sqlite-zstd] initialized sqlite>

    Or alternatively:

    sqlite3 -cmd '.load libsqlite_zstd.so' 'select * from foo'

    C Api

    c int success = sqlite3_load_extension(db, "libsqlite_zstd.so", NULL, NULL);

    See here for more information.

    Rust

    The recommended method is to add sqlite_zstd as a dependency to your project, then load it using

    rust let conn: rusqlite::Connection; sqlite_zstd::load(&conn)?;

    Alternatively, you can load the extension like any other extension:

    rust let conn: rusqlite::Connection; conn.load_extension("libsqlite_zstd.so", None)?;

    See here for more information.

    Verbosity / Debugging

    You can change the log level by setting the environment variable SQLITE_ZSTD_LOG=error for less logging and SQLITE_ZSTD_LOG=debug for more logging.

    Future Work / Ideas / Todo

    TODO: describe

    select zstdenabletransparent('{"table": "events", "column": "data", "compressionlevel": 19, "dictchooser": "case when date(timestamp, ''weekday 0'') < date(''now'', ''weekday 0'') then data_type || ''.'' || date(timestamp, ''weekday 0'') else null end"}');

    select case when date(timestamp, 'weekday 0') < date('now', 'weekday 0') then data_type || '.' || date(timestamp, 'weekday 0') else null END from events limit 10000