Rustpotter

An open source wakeword spotter forged in Rust.

Overview

The aim of this project is to detect specific keywords in a live audio stream.

Rustpotter offers two detection methods, both based on the mel-frequency cepstral coefficients (MFCCs) of the audio. They are exposed as two kinds of wakewords: wakeword references and wakeword models.

Rustpotter supports WAV audio at any sample rate and uses only the data of the first channel, but it only supports the following sample encodings:

Detection Mechanism Overview

When you feed Rustpotter an audio stream, it keeps a window of MFCC vectors (which can be seen as a matrix of MFCCs) that grows until it reaches the length needed by the loaded wakewords.

The input length requested by Rustpotter varies depending on the configured format, but it is constant and equivalent to 30ms of audio. Internally, Rustpotter generates one vector of MFCCs for each 10ms of audio, so the window is updated 3 times on each call to the process method.
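The arithmetic behind those numbers can be sketched as follows. This is only an illustration that assumes a single channel; the helper names are hypothetical and are not part of the Rustpotter API. In practice you would query the sizes through `get_samples_per_frame()` and `get_bytes_per_frame()`, as shown in the Basic Usage section.

```rust
/// Duration of audio consumed per processing call (from the text above).
const FRAME_MS: usize = 30;
/// Duration of audio covered by one MFCC vector (from the text above).
const MFCC_STEP_MS: usize = 10;

/// Hypothetical helper: samples needed per call, assuming one channel.
fn samples_per_frame(sample_rate_hz: usize) -> usize {
    sample_rate_hz * FRAME_MS / 1000
}

/// Hypothetical helper: the same frame expressed in bytes.
fn bytes_per_frame(sample_rate_hz: usize, bits_per_sample: usize) -> usize {
    samples_per_frame(sample_rate_hz) * (bits_per_sample / 8)
}

/// Number of MFCC vectors (window updates) produced per call.
fn window_updates_per_call() -> usize {
    FRAME_MS / MFCC_STEP_MS
}
```

Under these assumptions, 16 kHz audio with 16-bit samples needs 480 samples (960 bytes) per call, and each call updates the window 3 times.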

From the moment the window has the correct size, Rustpotter starts scoring the window on each update in order to find a successful detection (score is over the defined threshold).
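A minimal sketch of such a window is shown below. The type `MfccWindow` is hypothetical, and the sketch assumes that once the window is full the oldest vector is dropped on every update; the real implementation may differ.

```rust
use std::collections::VecDeque;

/// Hypothetical window of MFCC vectors (one vector per 10ms of audio).
struct MfccWindow {
    frames: VecDeque<Vec<f32>>,
    /// Length needed by the loaded wakewords.
    required_len: usize,
}

impl MfccWindow {
    fn new(required_len: usize) -> Self {
        Self {
            frames: VecDeque::with_capacity(required_len),
            required_len,
        }
    }

    /// Adds one MFCC vector, dropping the oldest one if the window is full.
    /// Returns true when the window has the size required for scoring.
    fn push(&mut self, mfcc: Vec<f32>) -> bool {
        if self.frames.len() == self.required_len {
            self.frames.pop_front();
        }
        self.frames.push_back(mfcc);
        self.frames.len() == self.required_len
    }
}
```

Scoring would only run on the updates for which `push` returns true.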

A detection is considered a partial detection (not emitted) until n more updates are processed, where n is half of the length of the feature window. If a detection with a higher score is found during this time, it replaces the current partial detection and the countdown is reset.
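The countdown can be pictured with the following sketch. `PartialTracker` is a hypothetical type written for illustration only; it follows the description above and is not taken from the Rustpotter source.

```rust
/// Hypothetical tracker for the partial-detection countdown.
struct PartialTracker {
    threshold: f32,
    /// Updates to wait before emitting (half of the window length).
    wait_updates: usize,
    /// Best score so far and the remaining countdown.
    partial: Option<(f32, usize)>,
}

impl PartialTracker {
    fn new(threshold: f32, window_len: usize) -> Self {
        Self {
            threshold,
            wait_updates: window_len / 2,
            partial: None,
        }
    }

    /// Feeds the score of one window update.
    /// Returns the score of the emitted detection once the countdown expires.
    fn update(&mut self, score: f32) -> Option<f32> {
        match self.partial.take() {
            // A higher score replaces the partial and resets the countdown.
            Some((best, _)) if score > best => {
                self.partial = Some((score, self.wait_updates));
                None
            }
            // Countdown finished: emit the partial detection.
            Some((best, remaining)) if remaining <= 1 => Some(best),
            // Still waiting: decrease the countdown.
            Some((best, remaining)) => {
                self.partial = Some((best, remaining - 1));
                None
            }
            // No partial yet: start one if the score is over the threshold.
            None if score > self.threshold => {
                self.partial = Some((score, self.wait_updates));
                None
            }
            None => None,
        }
    }
}
```

With a threshold of 0.5 and a window length of 4, the scores 0.6, 0.7, 0.3, 0.1 produce a single detection with score 0.7, emitted on the last update.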

The Score

The score is a numeric value in range 0 - 1 that represents the accuracy of the detection.

When using a wakeword model the score represents the inverse similarity between the predicted label and the prediction for the none label.

When using a wakeword reference the score represents the aggregated similarity against the MFCCs of each of the records used on creation. It is calculated according to the score mode option.

Score Mode

When using a wakeword reference, Rustpotter needs to unify the scores against the MFCCs of each of the records used on creation.

You can configure how this is done using the score_mode option. The following modes are available:

The Averaged Score

Another numeric value in range 0 - 1 calculated on detection.

When using a wakeword reference the averaged score represents the similarity of the current audio MFCCs against a single MFCC matrix generated by averaging the MFCCs of the records used on creation. The averaged threshold can be used to reduce CPU usage, as the detection is aborted when the averaged score does not surpass it.

When using a wakeword model it is the inverse similarity between the predicted label and the prediction for the next matched label. It will match the score unless you are using a model trained on more than one label, so in that case it's better to set the averaged threshold to 0 to disable it.

Remember you can set the avg_threshold config to zero to disable using this score.
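The CPU saving comes from running the cheap averaged comparison first and skipping the full scoring when it fails. A minimal sketch of that idea, assuming a hypothetical `score_window` function that receives both scoring steps as closures (none of these names come from the Rustpotter API):

```rust
/// Hypothetical two-stage scoring: the averaged check runs first and the
/// full scoring only runs when it passes. Returns `(avg_score, score)`
/// for a successful detection.
fn score_window<A, F>(
    avg_threshold: f32,
    threshold: f32,
    averaged_score: A,
    full_score: F,
) -> Option<(f32, f32)>
where
    A: Fn() -> f32,
    F: Fn() -> f32,
{
    // The averaged score stays at zero when the check is disabled.
    let mut avg = 0.0;
    if avg_threshold > 0.0 {
        avg = averaged_score();
        if avg < avg_threshold {
            // Abort early: the expensive scoring is never executed.
            return None;
        }
    }
    let score = full_score();
    if score > threshold {
        Some((avg, score))
    } else {
        None
    }
}
```

Setting `avg_threshold` to zero skips the first stage entirely, which matches the `avg_score` of zero reported when the option is disabled.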

Audio Filters

Rustpotter includes two audio filter implementations: a gain-normalizer filter and a band-pass filter.

These filters are disabled by default, and their main purpose is to improve the detector's performance in the presence of noise.
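To give an idea of what a gain normalizer does, here is a generic RMS-based sketch. It is not the algorithm used by Rustpotter; the function names, the target level and the gain limits are all assumptions made for illustration.

```rust
/// Root mean square level of a block of samples in the range [-1, 1].
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_squares: f32 = samples.iter().map(|s| s * s).sum();
    (sum_squares / samples.len() as f32).sqrt()
}

/// Hypothetical helper: gain that brings the block to `target_rms`,
/// limited to the range `[min_gain, max_gain]`. Returns 1 for silence.
fn normalizing_gain(samples: &[f32], target_rms: f32, min_gain: f32, max_gain: f32) -> f32 {
    let level = rms(samples);
    if level == 0.0 {
        return 1.0;
    }
    (target_rms / level).clamp(min_gain, max_gain)
}

/// Applies the gain in place.
fn apply_gain(samples: &mut [f32], gain: f32) {
    for sample in samples.iter_mut() {
        *sample *= gain;
    }
}
```

Limiting the gain keeps very quiet blocks (mostly noise) from being amplified without bound.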

Partial detections

To discard false detections, you can require a certain number of partial detections to occur. This is configured through the min_scores config option.
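As a rough sketch of the idea, assuming that the counter is the number of window scores over the threshold seen while a detection was pending (both helpers are hypothetical, not part of the Rustpotter API):

```rust
/// Hypothetical helper: counts the window scores that are over the threshold.
fn count_partials(scores: &[f32], threshold: f32) -> usize {
    scores.iter().filter(|&&score| score > threshold).count()
}

/// Hypothetical helper: a detection is kept only when enough partial
/// detections were counted.
fn passes_min_scores(counter: usize, min_scores: usize) -> bool {
    counter >= min_scores
}
```

A short noise burst that crosses the threshold only a few times would be discarded, while a real utterance usually stays over the threshold for many consecutive updates.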

Detection

A successful Rustpotter detection provides you with some relevant information about the detection process, so you can tune the detector configuration to minimize the number of misses and false detections.

A detection looks like this when using a wakeword reference:

```rust
RustpotterDetection {
    /// Detected wakeword name.
    name: "hey home",
    /// Detection score against the averaged features matrix. (zero if disabled)
    avg_score: 0.41601,
    /// Detection score. (calculated from the scores using the selected score mode).
    score: 0.6618781,
    /// Detection score against the mfccs of each record used on creation.
    scores: {
        "hey_home_g_5.wav": 0.63050425,
        "hey_home_g_3.wav": 0.6301979,
        "hey_home_g_4.wav": 0.61404395,
        "hey_home_g_1.wav": 0.6618781,
        "hey_home_g_2.wav": 0.62885964
    },
    /// Number of partial detections.
    counter: 40,
    /// Gain applied by the gain-normalizer or 1.
    gain: 1.,
}
```
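In the example above the score equals the highest of the per-record scores. The following sketch shows how per-record scores could be unified into a single score. The `Aggregation` variants are illustrative only and are not necessarily the modes offered by the score_mode option.

```rust
/// Illustrative aggregation strategies (not necessarily the crate's modes).
enum Aggregation {
    Max,
    Average,
    Median,
}

/// Unifies the per-record scores into a single score.
fn aggregate(scores: &[f32], mode: Aggregation) -> f32 {
    if scores.is_empty() {
        return 0.0;
    }
    match mode {
        Aggregation::Max => scores.iter().copied().fold(f32::MIN, f32::max),
        Aggregation::Average => scores.iter().sum::<f32>() / scores.len() as f32,
        Aggregation::Median => {
            let mut sorted = scores.to_vec();
            sorted.sort_by(|a, b| a.total_cmp(b));
            let mid = sorted.len() / 2;
            if sorted.len() % 2 == 0 {
                (sorted[mid - 1] + sorted[mid]) / 2.0
            } else {
                sorted[mid]
            }
        }
    }
}
```

Applied to the five scores of the example, the maximum is 0.6618781 (the reported score) and the median is 0.6301979.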

A detection looks like this when using a wakeword model:

```rust
RustpotterDetection {
    /// Detected wakeword name.
    name: "hey home",
    /// Inverse similarity against the second most probable label. (zero if disabled)
    avg_score: 0.9994159,
    /// Inverse similarity against the none label probability.
    score: 0.9994159,
    /// Label probabilities.
    scores: {
        "hey home": 7.999713,
        "none": -10.5787945
    },
    /// Number of partial detections.
    counter: 28,
    /// Gain applied by the gain-normalizer or 1.
    gain: 1.,
}
```

Rustpotter exposes a reference to the current partial detection that allows read access to it for debugging purposes.

Model Types

The model types use the same names as the models of a well-known speech-to-text (STT) project. The following sizes are for model files trained on 1950ms of audio:

Web Demos

The spot demo is available so you can quickly try out Rustpotter using a web browser.

It includes some models generated using multiple voices from a text-to-speech service. You can also load your own.

The wakeword reference generator demo is available so you can quickly record samples and generate Rustpotter wakeword references using your own voice.

Please note that both demos run entirely in your browser and your voice is not sent anywhere. They are hosted using GitHub Pages.

Related projects

Changelog overview

A minimal overview of the changes introduced on each major version.

v3:

v2:

v1:

Basic Usage

```rust
use rustpotter::{Rustpotter, RustpotterConfig, Wakeword};
// assuming the audio input format match the rustpotter defaults
let mut rustpotter_config = RustpotterConfig::default();
// Configure format/filters/detection options ...
// Instantiate rustpotter
let mut rustpotter = Rustpotter::new(&rustpotter_config).unwrap();
// load a wakeword
rustpotter.add_wakeword_from_file("./tests/resources/hey_home.rpw").unwrap();
// You need a buffer of size `rustpotter.get_samples_per_frame()` when using samples.
// You need a buffer of size `rustpotter.get_bytes_per_frame()` when using bytes.
let mut samples_buffer: Vec<i16> = vec![0; rustpotter.get_samples_per_frame()];
// while true { Iterate forever
    // fill the buffer with the required samples ...
    let detection = rustpotter.process(samples_buffer);
    if let Some(detection) = detection {
        println!("{:?}", detection);
    }
// }
```

References

This project started as a port of the project node-personal-wakeword. It is based on publicly available articles and uses a ton of amazing crates.

Motivation

The motivation behind this project is to learn about audio analysis and the Rust language/ecosystem.

As such, this is not intended to be a production-grade tool, but with a well-trained wakeword model it achieves the expected quality.

Contributing

Feel free to suggest or contribute any improvements that you have in mind, either to the code or the detection process.

If you need any assistance, please feel free to open an issue.

Best regards!