A Rust implementation of mel spectrograms aligned to the results of the whisper.cpp, PyTorch and librosa reference implementations, and suited to streaming audio.
(images: mel spectrogram crops of the JFK sample, each transcribed individually by whisper: `[_BEG_] fellow Americans.[_TT_43]`, `[_BEG_] ASK![_TT_51]`, `[_BEG_] NOT![_TT_25]`)
Mel spectrograms can be used as the primary API for inference: encode once, retrieve for inference whenever needed. There is zero information loss with respect to the model's view of the original audio, yet roughly a 10x saving in data size, before any compression.
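As a back-of-envelope check on that saving (a sketch assuming whisper's usual parameters - 16 kHz input, a 160-sample hop and 80 mel bins, stored at one byte per bin after 8-bit quantisation):

```rust
// 30 s of 16 kHz f32 PCM vs the same audio as an 8-bit, 80-bin mel spectrogram
let audio_bytes = 30 * 16_000 * 4; // 1_920_000 bytes of raw samples
let frames = 30 * 16_000 / 160;    // one frame per 160-sample hop -> 3_000 frames
let mel_bytes = frames * 80;       // one byte per mel bin -> 240_000 bytes
println!("{}x smaller", audio_bytes / mel_bytes); // ~8x before image compression
```

The exact ratio depends on the sample format of the source audio.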
Mel filter banks are within 1.0e-7 of `librosa.filters.mel` and identical to the GGML model-embedded filters used by whisper.cpp.
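A sketch of how such a claim can be checked, assuming the two filter banks are available as `ndarray` matrices (the names here are illustrative, not this crate's API):

```rust
use ndarray::Array2;

// largest element-wise deviation between two filter bank matrices
fn max_abs_diff(a: &Array2<f64>, b: &Array2<f64>) -> f64 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f64::max)
}

// e.g. against a reference exported from librosa.filters.mel:
// assert!(max_abs_diff(&our_filters, &librosa_filters) < 1.0e-7);
```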
An STFT implementation that allows creating spectrograms from an audio stream - near identical to those produced internally by whisper.cpp.
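Streaming is the awkward part: whisper's hop (160 samples) is shorter than its window (400 samples), so consecutive frames overlap and arbitrarily sized input chunks need to be buffered. A minimal sketch of that pattern using `rustfft` - illustrative only, not this crate's actual types:

```rust
use rustfft::{num_complex::Complex, Fft, FftPlanner};
use std::sync::Arc;

// Minimal streaming STFT sketch: push samples in arbitrary-sized chunks,
// get one windowed spectrum back for every completed hop.
struct StreamingStft {
    window: Vec<f32>,  // Hann window, `fft_size` samples long
    buf: Vec<f32>,     // the most recent `fft_size` samples
    pending: Vec<f32>, // samples waiting to complete the next hop
    hop: usize,
    fft: Arc<dyn Fft<f32>>,
}

impl StreamingStft {
    fn new(fft_size: usize, hop: usize) -> Self {
        let window = (0..fft_size)
            .map(|i| {
                let x = 2.0 * std::f32::consts::PI * i as f32 / fft_size as f32;
                0.5 * (1.0 - x.cos())
            })
            .collect();
        Self {
            window,
            buf: vec![0.0; fft_size],
            pending: Vec::new(),
            hop,
            fft: FftPlanner::new().plan_fft_forward(fft_size),
        }
    }

    fn add(&mut self, samples: &[f32]) -> Vec<Vec<Complex<f32>>> {
        let mut frames = Vec::new();
        self.pending.extend_from_slice(samples);
        while self.pending.len() >= self.hop {
            // slide the analysis buffer forward by one hop
            let hop: Vec<f32> = self.pending.drain(..self.hop).collect();
            self.buf.drain(..self.hop);
            self.buf.extend_from_slice(&hop);
            // window, then FFT in place
            let mut frame: Vec<Complex<f32>> = self
                .buf
                .iter()
                .zip(&self.window)
                .map(|(s, w)| Complex::new(s * w, 0.0))
                .collect();
            self.fft.process(&mut frame);
            frames.push(frame);
        }
        frames
    }
}
```

Partial chunks are simply held in `pending` until enough samples arrive, so the caller can feed audio at whatever granularity the stream delivers it.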
An example of whisper inference from mel spectrograms via whisper-rs can be found in the tests.
Mel spectrograms can be saved in TGA format - an uncompressed image format supported natively by macOS and Windows. TGA images can be created from any audio input, cropped in Photoshop and reused:
```rust
// save the mel spectrogram as an 8-bit TGA image; `range` records the
// min/max values needed to dequantise it again later
let range = save_tga_8bit(&mel_spectrogram, n_mels, file_path).unwrap();
// do something with the image - maybe splice it to isolate a particular feature,
// and resave it
let dequantized_mel = load_tga_8bit(file_path, &range).unwrap();
// hand the dequantised frames straight to whisper-rs
state.set_mel(&dequantized_mel).unwrap();
```
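For context, here is a sketch of the surrounding whisper-rs calls - the model path is illustrative, and `set_mel` plus calling `full` with an empty sample buffer assume the patched builds from the PRs below:

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext};

let ctx = WhisperContext::new("models/ggml-base.en.bin").unwrap();
let mut state = ctx.create_state().unwrap();

// hand whisper the dequantised mel directly - no audio required
state.set_mel(&dequantized_mel).unwrap();

let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
state.full(params, &[]).unwrap(); // empty sample buffer: the mel is already set

for i in 0..state.full_n_segments().unwrap() {
    println!("{}", state.full_get_segment_text(i).unwrap());
}
```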
Both whisper.cpp and whisper-rs require small changes; please refer to these PRs:
https://github.com/tazz4843/whisper-rs/pull/75 - merged, use master.
https://github.com/ggerganov/whisper.cpp/pull/1130
whisper.cpp produces mel spectrograms with 1.0e-6 precision. However, these spectrograms are invariant to 8-bit quantisation: we can save them as 8-bit images without losing useful information - indeed, without losing any actual information about the sound wave at all.
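To make that concrete, here is a generic linear 8-bit round trip - a sketch of the idea, not necessarily the exact scheme behind `save_tga_8bit`: each value is snapped to one of 256 levels across the spectrogram's own min/max range, so the worst-case error is half a step.

```rust
// generic linear 8-bit quantisation round trip
fn quantize(mel: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = mel.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = mel.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let q = mel
        .iter()
        .map(|&x| ((x - min) / (max - min) * 255.0).round() as u8)
        .collect();
    (q, min, max)
}

fn dequantize(q: &[u8], min: f32, max: f32) -> Vec<f32> {
    q.iter()
        .map(|&b| min + (b as f32 / 255.0) * (max - min))
        .collect()
}
```

For whisper-style normalised log-mel values, whose whole range spans only a couple of units, that half-step error works out to around 0.004 - comfortably below the 1.0e-1 tolerance discussed below.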
Heisenberg's Uncertainty Principle puts a limit on how much resolution a spectrogram can have - the more precisely we localise a wave in time, the blurrier its frequency content becomes.
Overlapping windows (whisper overlaps by 60%) mitigate this, up to a point. Beyond that, more numerical precision doesn't mean more accuracy, and may actually introduce noise:
Indeed, we only need 1.0e-1 precision to get accurate results, and rounding to 1.0e-1 seems more accurate for some difficult transcriptions.
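The rounded values in the second dump below were produced by snapping each value to one decimal place, i.e. something like:

```rust
// round every mel value to steps of 0.1
let mel_spectrogram_rounded: Vec<f32> = mel_spectrogram
    .iter()
    .map(|x| (x * 10.0).round() / 10.0)
    .collect();
```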
Consider these samples from the JFK speech used in the original whisper.py tests:
```
[src/lib.rs:93] &mel_spectrogram[10..20] = [
    0.15811597,
    0.26561865,
    0.07558561,
    0.19564378,
    0.16745868,
    0.21617787,
    -0.29193184,
    0.12279237,
    0.13897367,
    -0.17434756,
]
[src/lib.rs:92] &mel_spectrogram_rounded[10..20] = [
    0.2,
    0.3,
    0.1,
    0.2,
    0.2,
    0.2,
    -0.3,
    0.1,
    0.1,
    -0.2,
]
```
Once quantised, the spectrograms are the same:
(top: not rounded, bottom: rounded to 1.0e-1)
A lot of this comes down to how speech is captured almost entirely in the frequency domain, and how effectively the mel scale divides those frequencies into 80 bins. 8 bits of 0-255 grayscale per bin is probably overkill even to measure the total power in each of those bins - it could likely be compressed even further.