Our models support both 8000 Hz and 16000 Hz sampling rates. Other values are not directly supported, but multiples of 16000 (e.g. 32000 or 48000) are cast to 16000 inside the JIT model!
Though for the majority of use cases no tuning is necessary by design, a good start would be to plot the probabilities and then select the threshold, min_speech_duration_ms and min_silence_duration_ms. See this discussion and the docstrings for examples.
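To make the role of the three parameters concrete, here is a minimal sketch of how per-chunk probabilities could be turned into speech segments. It is not the library's own implementation: the function name `probs_to_segments`, the `chunk_ms` argument and the default values are assumptions made for illustration, and the real utils expose more options than shown here.

```python
def probs_to_segments(probs, chunk_ms, threshold=0.5,
                      min_speech_duration_ms=250,
                      min_silence_duration_ms=100):
    """Turn per-chunk speech probabilities into (start_ms, end_ms) segments.

    Hypothetical helper for illustration only; defaults are assumptions.
    """
    segments = []
    start = None          # start of the segment being built, in ms
    silence_start = None  # start of the current run of silence, in ms

    for i, p in enumerate(probs):
        t = i * chunk_ms
        if p >= threshold:
            # Speech chunk: open a segment if needed, cancel pending silence.
            if start is None:
                start = t
            silence_start = None
        elif start is not None:
            # Silence inside an open segment.
            if silence_start is None:
                silence_start = t
            # Close the segment once the silence has lasted long enough.
            if t + chunk_ms - silence_start >= min_silence_duration_ms:
                # Keep it only if the speech part is long enough.
                if silence_start - start >= min_speech_duration_ms:
                    segments.append((start, silence_start))
                start = None
                silence_start = None

    # Flush a segment that is still open at the end of the audio.
    if start is not None:
        end = silence_start if silence_start is not None else len(probs) * chunk_ms
        if end - start >= min_speech_duration_ms:
            segments.append((start, end))
    return segments
```

Raising the threshold makes the detector stricter, a larger min_silence_duration_ms merges segments separated by short pauses, and a larger min_speech_duration_ms drops short blips.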
This should give you some idea. Also please see the docstring for some base values. Typically anything higher than 16 kHz is not required for speech. The model will most likely have problems with extremely long chunks.
Yes. Though the models were designed for streaming, they can also be used to process long audio files. Please see the provided utils; the JIT model, for example, has a model.reset_states() method.
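As a rough sketch of chunk-by-chunk processing of a long recording: the helper below resets the model state once and then feeds fixed-size windows in order. The name `iter_speech_probs` and the `window_size` default are assumptions, and the code also assumes the model is callable as `model(chunk, sampling_rate)` and returns a single probability; check the provided utils for the actual calling convention.

```python
def iter_speech_probs(model, audio, sampling_rate=16000, window_size=512):
    """Yield one speech probability per fixed-size window of `audio`.

    Hypothetical helper: assumes `model(chunk, sampling_rate)` returns a
    single-element value and that `model.reset_states()` clears its state.
    """
    # Start from a clean state so a previous file cannot leak into this one.
    model.reset_states()
    # Walk the audio in non-overlapping windows; a trailing partial
    # window is skipped in this sketch.
    for start in range(0, len(audio) - window_size + 1, window_size):
        chunk = audio[start:start + window_size]
        yield float(model(chunk, sampling_rate))
```

Calling reset_states() once per file, rather than once per chunk, lets the model carry context across the chunks of the same recording.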
Link.
As of this moment, we have not published any of those for lack of time and motivation. For citations and further reading please see links in the README.
The JIT model actually contains two separate models in one package, one for 8 kHz, one for 16 kHz. Yes, we tried all of the sensible model sharing variants and arrived at this one after a lot of experiments.
ONNX supports quantization out of the box, so you can just try it. As for JIT, there have always been some problems running quantized models on ARM / mobile, so we decided to drop quantization support until the ecosystems mature a bit.
8 kHz is standard phone quality, while 16 kHz is usually used for ASR purposes. Theoretically, voice has very few frequencies that cannot be covered by 16 kHz audio, though for TTS 24 kHz or 48 kHz audio still sounds better. 32 kHz and 48 kHz can be easily downsampled to 16 kHz by averaging or omitting some samples.
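The two downsampling options mentioned above can be sketched in a few lines of plain Python. The function name `downsample` and its arguments are made up for this example, and it only handles sampling rates that are integer multiples of the target; it applies no low-pass filter, so it is a simplification rather than a production-quality resampler.

```python
def downsample(samples, orig_sr, target_sr=16000, average=False):
    """Reduce `samples` from orig_sr to target_sr by an integer factor.

    Hypothetical helper: either keeps every N-th sample (omitting the
    rest) or replaces each group of N samples with its mean.
    """
    if orig_sr % target_sr != 0:
        raise ValueError("orig_sr must be an integer multiple of target_sr")
    step = orig_sr // target_sr  # e.g. 3 for 48000 -> 16000

    if not average:
        # Omit samples: keep only every `step`-th one.
        return list(samples[::step])

    # Average samples: drop the incomplete tail, then average each group.
    usable = len(samples) - len(samples) % step
    return [sum(samples[i:i + step]) / step for i in range(0, usable, step)]
```

For example, `downsample(audio, 48000)` keeps every third sample, while `downsample(audio, 32000, average=True)` averages each pair of neighbouring samples.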
This is due to the ONNX runtime itself. We have also noticed that for small models with small input tensors there is a visible speed boost, i.e. 30 - 60%. Our JIT models do not contain modules that can be easily "merged together" (like ConvBnRelu for example), so this is kind of unavoidable. As for WebRTC speed, the ONNX model is already at least within the same order of magnitude (< 1 ms per chunk), albeit still slower.