speech recognition sample #20291

Merged
alalek merged 12 commits into opencv:master from spazewalker:master
Oct 4, 2021

Conversation

@spazewalker
Contributor

@spazewalker spazewalker commented Jun 21, 2021

GSoC 2021: Speech Recognition using OpenCV AudioIO

Project details

PR details

Creating ONNX model

NVIDIA trained Jasper with FP16 precision, but OpenCV's DNN module needs FP32, so the ONNX model's graph must be converted. This is done with the script convert_jasper_to_FP32.py. The pre-trained converted ONNX model can be found here; the original pre-trained model by NVIDIA can be found here.

Usage

usage: speech_recognition.py [-h] --input_audio INPUT_AUDIO [--show_spectrogram] [--model MODEL] [--output OUTPUT] [--backend {0,2,3}] [--target {0,1,2}]

This script runs Jasper Speech recognition model

optional arguments:
  -h, --help            show this help message and exit
  --input_audio INPUT_AUDIO
                        Path to input audio file, OR path to a txt file listing relative paths to multiple audio files, one per line (default: None)
  --show_spectrogram    Whether to show a spectrogram of the input audio. (default: False)
  --model MODEL         Path to the onnx file of Jasper. default="jasper.onnx" (default: jasper.onnx)
  --output OUTPUT       Path to file where recognized audio transcript must be saved. Leave this to print on console. (default: None)
  --backend {0,2,3}     Select a computation backend: 0: automatically (by default) 2: OpenVINO Inference Engine 3: OpenCV Implementation (default: 0)
  --target {0,1,2}      Select a target device: 0: CPU target (by default) 1: OpenCL 2: OpenCL FP16 (default: 0)
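The --show_spectrogram option implies computing a spectrogram from the input waveform. A minimal NumPy sketch of a log-magnitude spectrogram follows; the window and hop sizes here are illustrative assumptions, not the sample's actual parameters (Jasper's preprocessing also applies a mel filterbank, which is omitted here).

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=160):
    # Slice the waveform into overlapping Hann-windowed frames
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude of the one-sided FFT per frame, log-compressed for display
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mag)

sr = 16000
t = np.arange(sr) / sr               # 1 second of audio
tone = np.sin(2 * np.pi * 440 * t)   # 440 Hz test tone
spec = log_spectrogram(tone)
print(spec.shape)  # (97, 257): n_frames x (frame_len // 2 + 1)
```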

Todo

  • Use AudioIO instead of soundfile.
  • Check performance.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Docs

@l-bat l-bat added the GSoC label Jun 21, 2021
@spazewalker spazewalker changed the title from "speech recognition sample added.(initial commit)" to "speech recognition sample" Jun 22, 2021
@l-bat
Contributor

l-bat commented Jun 23, 2021

Please add a description at the beginning of the sample, as in:

'''
You can download the converted pb model from https://www.dropbox.com/s/qag9vzambhhkvxr/lip_jppnet_384.pb?dl=0
or convert the model yourself.
Follow these steps if you want to convert the original model yourself:
To get original .meta pre-trained model download https://drive.google.com/file/d/1BFVXgeln-bek8TCbRjN6utPAgRE0LJZg/view
For correct convert .meta to .pb model download original repository https://github.com/Engineering-Course/LIP_JPPNet
Change script evaluate_parsing_JPPNet-s2.py for human parsing
1. Remove preprocessing to create image_batch_origin:
with tf.name_scope("create_inputs"):
...
Add
image_batch_origin = tf.placeholder(tf.float32, shape=(2, None, None, 3), name='input')
2. Create input
image = cv2.imread(path/to/image)
image_rev = np.flip(image, axis=1)
input = np.stack([image, image_rev], axis=0)
3. Hardcode image_h and image_w shapes to determine output shapes.
We use default INPUT_SIZE = (384, 384) from evaluate_parsing_JPPNet-s2.py.
parsing_out1 = tf.reduce_mean(tf.stack([tf.image.resize_images(parsing_out1_100, INPUT_SIZE),
tf.image.resize_images(parsing_out1_075, INPUT_SIZE),
tf.image.resize_images(parsing_out1_125, INPUT_SIZE)]), axis=0)
Do similarly with parsing_out2, parsing_out3
4. Remove postprocessing. Last net operation:
raw_output = tf.reduce_mean(tf.stack([parsing_out1, parsing_out2, parsing_out3]), axis=0)
Change:
parsing_ = sess.run(raw_output, feed_dict={'input:0': input})
5. To save model after sess.run(...) add:
input_graph_def = tf.get_default_graph().as_graph_def()
output_node = "Mean_3"
output_graph_def = tf.graph_util.convert_variables_to_constants(sess, input_graph_def, output_node)
output_graph = "LIP_JPPNet.pb"
with tf.gfile.GFile(output_graph, "wb") as f:
f.write(output_graph_def.SerializeToString())
'''

  1. How to get FP32 ONNX model from pre-trained model
  2. Provide link to the converted model

if __name__ == '__main__':

# Computation backends supported by layers
backends = (cv.dnn.DNN_BACKEND_DEFAULT, cv.dnn.DNN_BACKEND_OPENCV)
Contributor

Could you try forward net with OpenVINO (cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)?

Contributor Author

I tried. It gave this error: error: (-213:The function/feature is not implemented) Unknown backend identifier in function 'cv::dnn::dnn4_v20210301::wrapMat'
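For context, the sample's numeric --backend/--target choices correspond to OpenCV's DNN enum values. A small stdlib sketch of that mapping; the integer values are assumed to match cv2.dnn's DNN_BACKEND_*/DNN_TARGET_* constants and should be verified against your OpenCV build.

```python
# Assumed enum values (check cv2.dnn in your OpenCV build):
# DNN_BACKEND_DEFAULT = 0, DNN_BACKEND_INFERENCE_ENGINE = 2, DNN_BACKEND_OPENCV = 3
# DNN_TARGET_CPU = 0, DNN_TARGET_OPENCL = 1, DNN_TARGET_OPENCL_FP16 = 2
BACKENDS = {0: "DNN_BACKEND_DEFAULT",
            2: "DNN_BACKEND_INFERENCE_ENGINE",
            3: "DNN_BACKEND_OPENCV"}
TARGETS = {0: "DNN_TARGET_CPU",
           1: "DNN_TARGET_OPENCL",
           2: "DNN_TARGET_OPENCL_FP16"}

def describe(backend, target):
    """Render a --backend/--target pair as enum names (KeyError on a bad choice)."""
    return BACKENDS[backend], TARGETS[target]

print(describe(0, 0))  # ('DNN_BACKEND_DEFAULT', 'DNN_TARGET_CPU')
```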


parser = argparse.ArgumentParser(description='This script runs Jasper Speech recognition model',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--input_audio', type=str, help='Path to input audio file.')
Contributor

Do we need to specify supported audio formats?

Contributor

I think we need to add required=True

Contributor Author

Eventually we need to use AudioIO, so should I add the formats supported there? I suppose mp3, wav, and mp4 are supported.
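The review suggestions above (make the argument required, accept a txt file listing multiple inputs) can be sketched with stdlib argparse. The names below are illustrative, not the sample's exact code.

```python
import argparse
import os

parser = argparse.ArgumentParser(
    description='This script runs Jasper Speech recognition model',
    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# required=True per review: fail fast instead of crashing later on a None path
parser.add_argument('--input_audio', type=str, required=True,
                    help='Path to input audio file, or to a txt file '
                         'listing one audio path per line.')
args = parser.parse_args(['--input_audio', 'inputs.txt'])

def expand_inputs(path):
    """Return the list of audio files: the path itself, or, for a .txt
    file, the paths it lists (one per line, relative to the txt file)."""
    if not path.endswith('.txt'):
        return [path]
    base = os.path.dirname(path)
    with open(path) as f:
        return [os.path.join(base, line.strip()) for line in f if line.strip()]
```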

@alalek
Member

alalek commented Aug 22, 2021

@spazewalker Could you please check if approach from #20558 works for this case?

@spazewalker
Contributor Author

@spazewalker Could you please check if approach from #20558 works for this case?

@alalek Just tested it. It works for this case.

@spazewalker spazewalker marked this pull request as ready for review August 22, 2021 16:57
@spazewalker spazewalker marked this pull request as draft August 22, 2021 16:58
support for multiple files at once
Co-authored-by: Liubov Batanina  <piccione-mail@yandex.ru>

fix whitespaces
@alalek
Member

alalek commented Sep 27, 2021

Let's merge it with the soundfile workaround.

@spazewalker Please set the PR to "Ready for review" if it is ready for merging.

@alalek
Member

alalek commented Oct 2, 2021

"Ready for review"

@spazewalker Ping. Or let us know if you want to improve something else.

@spazewalker
Contributor Author

@alalek I'm actually waiting for #19721 to get merged. I think videoio would replace soundfile.

Member

@alalek alalek left a comment

Thank you 👍

@spazewalker spazewalker marked this pull request as ready for review October 3, 2021 06:33
@alalek alalek merged commit 4938765 into opencv:master Oct 4, 2021
@alalek alalek mentioned this pull request Oct 15, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
speech recognition sample

* speech recognition sample added.(initial commit)

* fixed typos, removed plt

* trailing whitespaces removed

* masking removed and using opencv for displaying spectrogram

* description added

* requested changes and add opencl fp16 target

* parenthesis and halide removed

* workaround 3d matrix issue

* handle multi channel audio

support for multiple files at once

* suggested changes

fix whitespaces