Sonic


Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried against.

Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query.

Sonic was designed with close attention to performance and code cleanliness. It aims to be crash-free and super-fast, and to put minimal strain on server resources (our measurements have shown that Sonic, when under load, responds to search queries in the μs range, eats ~30MB RAM and has a near-zero CPU footprint; see our benchmarks).

🇫🇷 Crafted in Nantes, France.

:newspaper: The Sonic project was initially announced in a post on my personal journal.


« Sonic » is the mascot of the Sonic project. I drew it to look like a psychedelic hipster hedgehog.

Who uses it?

Crisp

👋 Do you use Sonic and want to be listed here? Contact me.

Features

Limitations

How to use it?

Installation

Sonic is built in Rust. To install it, either download a version from the Sonic releases page, use cargo install or pull the source code from master.

Install from source:

If you pulled the source code from Git, you can build it using cargo:

```bash
cargo build --release
```

You can find the built binaries in the ./target/release directory.

Install clang to be able to compile the required RocksDB dependency.

Install from Cargo:

You can install Sonic directly with cargo install:

```bash
cargo install sonic-server
```

Ensure that your $PATH is properly configured to source the Crates binaries, and then run Sonic using the sonic command.

Install from Docker Hub:

You might find it convenient to run Sonic via Docker. You can find the pre-built Sonic image on Docker Hub as valeriansaliou/sonic.

First, pull the valeriansaliou/sonic image:

```bash
docker pull valeriansaliou/sonic:v1.0.2
```

Then, provide it with a configuration file and run it (replace /path/to/your/sonic/config.cfg with the path to your configuration file):

```bash
docker run -p 1491:1491 -v /path/to/your/sonic/config.cfg:/etc/sonic.cfg -v /path/to/your/sonic/store/:/var/lib/sonic/store/ valeriansaliou/sonic:v1.0.2
```

In the configuration file, ensure that:

Sonic Channel will be reachable from tcp://localhost:1491.

Configuration

Use the sample config.cfg configuration file and adjust it to your own environment.

Available configuration options are commented below, with allowed values:

[server]

[channel]

[channel.search]

[store]

[store.kv]

[store.kv.pool]

[store.kv.database]

[store.fst]

[store.fst.pool]

[store.fst.graph]

Run Sonic

Sonic can be run as follows:

./sonic -c /path/to/config.cfg

Perform searches and manage objects

Both searches and object management (i.e. data ingestion) are handled via the Sonic Channel protocol only. As we want to keep things simple with Sonic (similarly to how Redis does), connecting to Sonic Channel is the way to go when you need to interact with the Sonic search database.

Sonic Channel can be accessed via the telnet utility from your computer. The very same system is also used by all Sonic Channel libraries (e.g. NodeJS).
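
Since Sonic Channel is a plain line-based protocol over TCP, the telnet session can be reproduced with a few lines of socket code. The following is a minimal sketch, not an official binding: the `ChannelConnection` class and `open_channel` helper are hypothetical names, and terminating each command with CRLF is an assumption based on what telnet sends.

```python
import socket


class ChannelConnection:
    """Tiny line-oriented wrapper around an already-connected socket."""

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock
        self._pending = b""  # bytes received but not yet returned as a line

    def send_line(self, command: str) -> None:
        # Assumption: one command per line, terminated by CRLF (as telnet does).
        self._sock.sendall(command.encode("utf-8") + b"\r\n")

    def read_line(self) -> str:
        # Keep reading until at least one complete line is buffered.
        while b"\n" not in self._pending:
            chunk = self._sock.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed by peer")
            self._pending += chunk
        line, _, self._pending = self._pending.partition(b"\n")
        return line.rstrip(b"\r").decode("utf-8")


def open_channel(host: str = "localhost", port: int = 1491) -> ChannelConnection:
    # 1491 is the port used in the Docker example of this README.
    return ChannelConnection(socket.create_connection((host, port)))
```

With a running Sonic instance, `open_channel().read_line()` would be expected to return the `CONNECTED <sonic-server …>` greeting shown in the telnet transcripts.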


1️⃣ Sonic Channel (uninitialized)

Issuing any other command (e.g. QUIT) in this mode will abort the TCP connection, effectively resulting in a QUIT with the ENDED not_recognized response.
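
The telnet transcripts in this README show the handshake that leaves the uninitialized state: a `START <mode> <password>` command answered by `STARTED <mode> protocol(N) buffer(N)`. A minimal sketch of building that command and parsing the reply could look as follows; `build_start` and `parse_started` are hypothetical helper names, and the reply format is inferred from the transcripts only.

```python
import re

# The three modes documented in this README.
MODES = {"search", "ingest", "control"}

# Reply shape inferred from the transcripts, e.g.:
#   STARTED search protocol(1) buffer(20000)
_STARTED = re.compile(r"^STARTED (\w+) protocol\((\d+)\) buffer\((\d+)\)$")


def build_start(mode: str, password: str) -> str:
    """Build the START command that selects a channel mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"START {mode} {password}"


def parse_started(line: str) -> dict:
    """Extract mode, protocol version and buffer size from a STARTED reply."""
    match = _STARTED.match(line.strip())
    if match is None:
        raise ValueError(f"not a STARTED reply: {line!r}")
    return {
        "mode": match.group(1),
        "protocol": int(match.group(2)),
        "buffer": int(match.group(3)),
    }
```

Keeping the parsed `buffer` value around is useful later, since it can be used to reject commands that would be too long before sending them.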


2️⃣ Sonic Channel (Search mode)

The Sonic Channel Search mode is used for querying the search index. Once in this mode, you cannot switch to other modes or gain access to commands from other modes.

➡️ Available commands:

⏩ Syntax terminology:

Notice: the bucket terminology may confuse some Sonic users. As we are well aware that Sonic may be used in an environment where end-users each hold their own search index in a given collection, we made it possible to manage per-end-user search indexes with bucket. If you only have a single index per collection (as most Sonic users will), we advise you to use a static generic name for your bucket, for instance: default.

⬇️ Search flow example (via telnet):

```bash
T1: telnet sonic.local 1491
T2: Trying ::1...
T3: Connected to sonic.local.
T4: Escape character is '^]'.
T5: CONNECTED <sonic-server v1.0.0>
T6: START search SecretPassword
T7: STARTED search protocol(1) buffer(20000)
T8: QUERY messages user:0dcde3a6 "valerian saliou" LIMIT(10)
T9: PENDING Bt2m2gYa
T10: EVENT QUERY Bt2m2gYa conversation:71f3d63b conversation:6501e83a
T11: QUERY helpdesk user:0dcde3a6 "gdpr" LIMIT(50)
T12: PENDING y57KaB2d
T13: QUERY helpdesk user:0dcde3a6 "law" LIMIT(50) OFFSET(200)
T14: PENDING CjPvE5t9
T15: PING
T16: PONG
T17: EVENT QUERY CjPvE5t9
T18: EVENT QUERY y57KaB2d article:28d79959
T19: SUGGEST messages user:0dcde3a6 "val"
T20: PENDING z98uDE0f
T21: EVENT SUGGEST z98uDE0f valerian valala
T22: QUIT
T23: ENDED quit
T24: Connection closed by foreign host.
```
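
In the search transcript, each QUERY or SUGGEST is acknowledged with a `PENDING <marker>`, and the results arrive later in an `EVENT` line carrying the same marker, not necessarily in the order the queries were sent (see T12, T14, T17 and T18). A client therefore has to correlate events with markers. The sketch below illustrates one way to do it; the function names are hypothetical and the line formats are inferred from the transcript alone.

```python
def parse_pending(line: str) -> str:
    """Return the marker from a 'PENDING <marker>' line."""
    parts = line.split()
    if len(parts) != 2 or parts[0] != "PENDING":
        raise ValueError(f"not a PENDING reply: {line!r}")
    return parts[1]


def parse_event(line: str) -> tuple:
    """Split 'EVENT <kind> <marker> [items...]' into (kind, marker, items)."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "EVENT":
        raise ValueError(f"not an EVENT line: {line!r}")
    return parts[1], parts[2], parts[3:]


def collect_results(lines: list) -> dict:
    """Match every EVENT to a marker announced by an earlier PENDING."""
    pending = set()
    results = {}
    for line in lines:
        if line.startswith("PENDING "):
            pending.add(parse_pending(line))
        elif line.startswith("EVENT "):
            _kind, marker, items = parse_event(line)
            if marker not in pending:
                raise ValueError(f"unexpected marker: {marker}")
            pending.discard(marker)
            results[marker] = items
        # Other replies (e.g. PONG) are unrelated to queries and ignored here.
    return results
```

An event with no items, such as T17, simply maps its marker to an empty list.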

Notes on what happens:


3️⃣ Sonic Channel (Ingest mode)

The Sonic Channel Ingest mode is used for altering the search index (push, pop and flush). Once in this mode, you cannot switch to other modes or gain access to commands from other modes.

➡️ Available commands:

⏩ Syntax terminology:

Notice: the bucket terminology may confuse some Sonic users. As we are well aware that Sonic may be used in an environment where end-users each hold their own search index in a given collection, we made it possible to manage per-end-user search indexes with bucket. If you only have a single index per collection (as most Sonic users will), we advise you to use a static generic name for your bucket, for instance: default.

⬇️ Ingest flow example (via telnet):

```bash
T1: telnet sonic.local 1491
T2: Trying ::1...
T3: Connected to sonic.local.
T4: Escape character is '^]'.
T5: CONNECTED <sonic-server v1.0.0>
T6: START ingest SecretPassword
T7: STARTED ingest protocol(1) buffer(20000)
T8: PUSH messages user:0dcde3a6 conversation:71f3d63b Hey Valerian
T9: ERR invalid_format(PUSH <collection> <bucket> <object> "<text>")
T10: PUSH messages user:0dcde3a6 conversation:71f3d63b "Hello Valerian Saliou, how are you today?"
T11: OK
T12: COUNT messages user:0dcde3a6
T13: RESULT 43
T14: COUNT messages user:0dcde3a6 conversation:71f3d63b
T15: RESULT 1
T16: FLUSHO messages user:0dcde3a6 conversation:71f3d63b
T17: RESULT 1
T18: FLUSHB messages user:0dcde3a6
T19: RESULT 42
T20: PING
T21: PONG
T22: QUIT
T23: ENDED quit
T24: Connection closed by foreign host.
```
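
The transcript shows that PUSH is rejected with `invalid_format` when the text is not wrapped in double quotes (T8 and T9). A small helper that always quotes the text avoids that class of error. The sketch below is illustrative only: `quote_text` and `build_push` are hypothetical names, the escaping rules (backslash, double quote, newline) are an assumption rather than something stated in this README, and treating `buffer(20000)` as a byte limit on the whole command is likewise an assumption.

```python
def quote_text(text: str) -> str:
    """Wrap text in double quotes, escaping characters that would break the line."""
    # Assumption: backslash-style escaping is what the server expects.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def _check_token(name: str, value: str) -> None:
    # Collection, bucket and object are space-separated, so they cannot hold whitespace.
    if not value or any(ch.isspace() for ch in value):
        raise ValueError(f"{name} must be non-empty and contain no whitespace")


def build_push(collection: str, bucket: str, obj: str, text: str,
               buffer_size: int = 20000) -> str:
    """Build a PUSH command in the form: PUSH <collection> <bucket> <object> "<text>"."""
    for name, value in (("collection", collection), ("bucket", bucket), ("object", obj)):
        _check_token(name, value)
    command = f"PUSH {collection} {bucket} {obj} {quote_text(text)}"
    # Assumption: the buffer size announced by STARTED bounds the command length in bytes.
    if len(command.encode("utf-8")) > buffer_size:
        raise ValueError("command exceeds the channel buffer; split the text first")
    return command
```

For users with a single index per collection, the bucket argument would simply be a static name such as `default`, as advised in the notice above.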

Notes on what happens:


4️⃣ Sonic Channel (Control mode)

The Sonic Channel Control mode is used for administration purposes. Once in this mode, you cannot switch to other modes or gain access to commands from other modes.

➡️ Available commands:

⏩ Syntax terminology:

⬇️ Control flow example (via telnet):

```bash
T1: telnet sonic.local 1491
T2: Trying ::1...
T3: Connected to sonic.local.
T4: Escape character is '^]'.
T5: CONNECTED <sonic-server v1.0.0>
T6: START control SecretPassword
T7: STARTED control protocol(1) buffer(20000)
T8: TRIGGER consolidate
T9: OK
T10: PING
T11: PONG
T12: QUIT
T13: ENDED quit
T14: Connection closed by foreign host.
```
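
Across the three transcripts, server replies always start with a keyword (`CONNECTED`, `STARTED`, `PENDING`, `EVENT`, `PONG`, `OK`, `RESULT`, `ERR` or `ENDED`), optionally followed by details. A client can dispatch on that first word. The following sketch assumes this list is complete, which this README does not guarantee; `Reply` and `classify_reply` are hypothetical names.

```python
from typing import NamedTuple

# Reply keywords observed in the telnet transcripts of this README.
KNOWN_KINDS = {
    "CONNECTED", "STARTED", "PENDING", "EVENT",
    "PONG", "OK", "RESULT", "ERR", "ENDED",
}


class Reply(NamedTuple):
    kind: str    # first word of the line, e.g. "RESULT"
    detail: str  # rest of the line, possibly empty


def classify_reply(line: str) -> Reply:
    """Split a reply line into its keyword and the remaining detail."""
    kind, _, detail = line.strip().partition(" ")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unrecognized reply: {line!r}")
    return Reply(kind, detail)


def result_count(reply: Reply) -> int:
    """Read the integer carried by a RESULT reply (e.g. from COUNT or FLUSHB)."""
    if reply.kind != "RESULT":
        raise ValueError(f"expected RESULT, got {reply.kind}")
    return int(reply.detail)
```

For instance, the `OK` that follows `TRIGGER consolidate` classifies as a reply with an empty detail, while `ENDED quit` carries the reason `quit`.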

Notes on what happens:


📦 Sonic Channel Libraries

Sonic distributes official Sonic Channel bindings for your programming language:

👉 Cannot find the library for your programming language? Build your own and be referenced here! (contact me)

Which text languages are supported?

Sonic supports a wide range of languages in its lexing system. If a language is not in this list, you can still push text in that language to the search index, but stop-words will not be removed, which could lead to lower-quality search results.

The languages supported by the lexing system are:

How fast & lightweight is it?

Sonic was built for Crisp from the start. As Crisp was growing and indexing more and more search data into a full-text search SQL database, we decided it was time to switch to a proper search backend system. When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.

In the end, we decided to build our own search backend, designed to be simple and lightweight on resources.

You can run function-level benchmarks with the command: cargo bench --features benchmark

👩‍🔬 Benchmark #1

➡️ Scenario

We extracted all the messages that the Crisp team uses for Crisp's own customer support.

We want to import all those messages into a clean Sonic instance, and then perform searches on the index we built. We will measure the time that Sonic spends executing each operation (i.e. each PUSH and QUERY command over Sonic Channel), and group results per 1,000 operations (this outputs a mean time per 1,000 operations).
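
The grouping step described above (one mean per 1,000 operations) can be sketched as follows. This is not the benchmark script itself, only an illustration of the arithmetic; `mean_per_batch` is a hypothetical name and the timings are assumed to be per-operation durations in microseconds.

```python
from statistics import mean


def mean_per_batch(durations_us: list, batch_size: int = 1000) -> list:
    """Return the mean duration of each consecutive batch of operations."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # A trailing partial batch is averaged over the operations it contains.
    return [
        mean(durations_us[start:start + batch_size])
        for start in range(0, len(durations_us), batch_size)
    ]
```

With 1,000,000 timed operations and the default batch size, this yields 1,000 mean values, one per measurement report.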

➡️ Context

Our benchmark was run on the following computer:

Sonic was compiled as follows:

Our dataset is as follows:

➡️ Scripts

The scripts we used to perform the benchmark are:

  1. PUSH script: sonic-benchmark_batch-push.js
  2. QUERY script: sonic-benchmark_batch-query.js

⏬ Results

Our findings:

Compared results per operation (on a single object):

We took a sample of 8 results from our batched operations, which produced a total of 1,000 results (1,000,000 items, with 1,000 items batched per measurement report).

This is not very scientific, but it should give you a clear idea of Sonic's performance.

Time spent per operation:

Operation | Average | Best  | Worst
--------- | ------- | ----- | -----
PUSH      | 275μs   | 190μs | 363μs
QUERY     | 880μs   | 852μs | 1ms

Batch PUSH results as seen from our terminal (from initial index of: 0 objects):

Batch PUSH benchmark

Batch QUERY results as seen from our terminal (on index of: 1,000,000 objects):

Batch QUERY benchmark

:fire: Report A Vulnerability

If you find a vulnerability in Sonic, you are more than welcome to report it directly to @valeriansaliou by sending an encrypted email to valerian@valeriansaliou.name. Do not report vulnerabilities in public GitHub issues, as they may be exploited by malicious people to target production servers running an unpatched Sonic instance.

:warning: You must encrypt your email using @valeriansaliou's GPG public key: :key:valeriansaliou.gpg.pub.asc.

:gift: Based on the severity of the vulnerability, I may offer a $200 (US) bounty to whoever reports it.