Is your feature request related to a problem? Please describe.
This is related to #13284.
#17592 closed that issue by making SAPI5 voices output via WASAPI. That improved responsiveness, but we can improve it further by removing the leading silence from the audio.
Take Microsoft Zira Desktop (SAPI5) as an example. When speaking at 1X speed, the leading silence is 100ms long. When speaking at its maximum rate (3X speed), the leading silence becomes about 30ms long. If we can remove the leading silence, it will respond even faster.
Other voices, such as OneCore voices, also have a few milliseconds of leading silence.
Describe the solution you'd like
We can detect and remove the silent part of the audio in WavePlayer, either on the Python side or on the C++ side. Since eSpeak, OneCore and SAPI5 (plus MSSP) all use WavePlayer now, they would all benefit from this. The synthesizer may need to tell WavePlayer when the audio starts or ends, so that WavePlayer can locate the leading silence more easily.
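To make the idea concrete, here is a minimal sketch of the detection step in Python. It assumes 16-bit signed little-endian PCM and a simple amplitude threshold; the function name `trim_leading_silence` and the default threshold of 64 are hypothetical choices for illustration, not existing WavePlayer code.

```python
import array
import sys


def trim_leading_silence(pcm: bytes, channels: int = 1, threshold: int = 64) -> bytes:
    """Drop leading frames whose samples all lie within +/- threshold.

    Assumes 16-bit signed little-endian PCM (an assumption of this sketch).
    """
    bytes_per_frame = 2 * channels
    # Ignore a trailing partial frame while scanning.
    usable = len(pcm) - len(pcm) % bytes_per_frame
    samples = array.array("h")
    samples.frombytes(pcm[:usable])
    if sys.byteorder == "big":
        samples.byteswap()
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            # Cut on a frame boundary so channels stay aligned.
            frame_start = (index // channels) * channels
            return pcm[frame_start * 2:]
    # Everything was below the threshold.
    return b""
```

A threshold above zero is used because synthesizers may emit low-level noise rather than exact zeros; the right value would need to be tuned per voice.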
Describe alternatives you've considered
Create a stand-alone module for detecting and removing the silent part of the audio, either in Python or in C++. The synthesizers would pass the audio data to this module before feeding it to WavePlayer.
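A rough sketch of what such a stand-alone module could look like, since synthesizers usually deliver audio in chunks. The class `LeadingSilenceTrimmer` and its methods are hypothetical names, and the sketch assumes mono 16-bit signed little-endian PCM. The synthesizer would call `start_utterance()` when a new utterance begins, pass each chunk through `feed()`, and hand the returned bytes on to WavePlayer.

```python
class LeadingSilenceTrimmer:
    """Streaming filter that drops silence at the start of each utterance.

    Hypothetical sketch: assumes mono 16-bit signed little-endian PCM.
    """

    def __init__(self, threshold: int = 64):
        self.threshold = threshold
        self._in_leading_silence = True
        self._carry = b""  # odd byte left over from the previous chunk

    def start_utterance(self) -> None:
        """Tell the trimmer that the next chunk begins a new utterance."""
        self._in_leading_silence = True
        self._carry = b""

    def feed(self, chunk: bytes) -> bytes:
        """Return the part of chunk that should be played."""
        if not self._in_leading_silence:
            # Past the leading silence: pass audio through untouched.
            return chunk
        data = self._carry + chunk
        usable = len(data) - len(data) % 2
        for offset in range(0, usable, 2):
            sample = int.from_bytes(data[offset:offset + 2], "little", signed=True)
            if abs(sample) > self.threshold:
                self._in_leading_silence = False
                self._carry = b""
                return data[offset:]
        # Still silent: keep any incomplete sample for the next chunk.
        self._carry = data[usable:]
        return b""
```

Only the silence before the first audible sample is removed, so pauses inside an utterance are preserved.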
Additional context
I'm not sure what the best approach to implementing this is.