
if output is tuple like facebook/hf-seamless-m4t-medium, waveform is …#29722

Merged
ArthurZucker merged 3 commits into huggingface:main from sywangyi:text-to-audio
Apr 5, 2024

Conversation

@sywangyi
Contributor

…the first element

@sywangyi
Contributor Author

test code

import torch
from transformers import pipeline, set_seed
import scipy.io.wavfile


def main():
    set_seed(555)

    generator = pipeline(
        "text-to-speech",
        model="facebook/hf-seamless-m4t-medium",
        torch_dtype=torch.float32,
        device="cuda",
    )

    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
    print(f"speech = {speech}")
    scipy.io.wavfile.write("techno2.wav", rate=speech["sampling_rate"], data=speech["audio"][0])


if __name__ == "__main__":
    main()

@sywangyi
Contributor Author

Traceback (most recent call last):
  File "/mnt/disk1/wangyi/transformers/test.py", line 23, in <module>
    main()
  File "/mnt/disk1/wangyi/transformers/test.py", line 17, in main
    speech = generator("Hello, my dog is cute", forward_params={"tgt_lang": "rus"})
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 182, in __call__
    return super().__call__(text_inputs, **forward_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1206, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/base.py", line 1214, in run_single
    outputs = self.postprocess(model_outputs, **postprocess_params)
  File "/mnt/disk1/wangyi/transformers/src/transformers/pipelines/text_to_audio.py", line 204, in postprocess
    output_dict["audio"] = waveform.cpu().float().numpy()
AttributeError: 'tuple' object has no attribute 'cpu'
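The failure above can be reproduced in miniature without torch. The `FakeTensor` stub and `postprocess_before_fix` below are hypothetical stand-ins, not the actual transformers code; they only mirror the failing line: M4T's generate() returns a `(waveform, waveform_lengths)` tuple, and a tuple has no `.cpu()` method.

```python
class FakeTensor:
    """Stand-in for a torch.Tensor exposing only the methods postprocess calls."""

    def __init__(self, data):
        self.data = data

    def cpu(self):
        return self

    def float(self):
        return self

    def numpy(self):
        return self.data


def postprocess_before_fix(waveform):
    # Mirrors the failing line in text_to_audio.py: assumes a tensor-like input.
    return {"audio": waveform.cpu().float().numpy()}


# M4T-style output: a (waveform, waveform_lengths) tuple, not a tensor.
model_output = (FakeTensor([0.1, 0.2]), FakeTensor([2]))

try:
    postprocess_before_fix(model_output)
    raised = False
except AttributeError as err:
    raised = True
    print(err)  # 'tuple' object has no attribute 'cpu'
```

Unwrapping the first tuple element before the `.cpu()` chain, as this PR does, avoids the error.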

@sywangyi
Contributor Author

@amyeroberts please have a review.

@amyeroberts
Contributor

Hmmm, this is funny, as it indicates either unexpected outputs from the model, that it shouldn't be used with this pipeline, or that it wasn't properly tested for the pipeline. @ylacombe could you please take a look into this?

…the first element

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@sywangyi
Contributor Author

could anyone help review the pr?

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for the fix. Could you add a test?
What is weird is that this model has a vocoder (`(hifi_gan): SeamlessM4THifiGan`) but the pipeline does not. So it's not properly integrated (or incompatible with `SpeechT5Vocoder`)?
You are right, the output of `SeamlessM4TForTextToSpeech` is a wave and length. The fix should be enough.
We need a test to make sure the waveform generated is correct.

@ArthurZucker ArthurZucker requested a review from ylacombe March 30, 2024 18:14
Contributor

@ylacombe ylacombe left a comment

Hey @sywangyi, thanks for spotting this! Could you also add a test to test_pipeline_text_to_audio.py so that we make sure it works?

What is weird is that this model has a vocoder (`(hifi_gan): SeamlessM4THifiGan`) but the pipeline does not. So it's not properly integrated (or incompatible with `SpeechT5Vocoder`)?

This would be the case if it were in MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES here!
Since the vocoder here is specific to M4T, I made the choice to add the full model (+ vocoder) to MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES

@sywangyi
Contributor Author

sywangyi commented Apr 1, 2024

Test added. I also found some issues when testing with batch_size=2 @ylacombe

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
def postprocess(self, waveform):
    output_dict = {}

    if isinstance(waveform, tuple):
Collaborator

It would probably make more sense to use return_dict=True to make sure we rely on naming rather than indexing? This assumes the first element is always a waveform.

Contributor Author

There is no return_dict for the seamless_m4t case, see https://github.com/huggingface/transformers/blob/main/src/transformers/models/seamless_m4t/modeling_seamless_m4t.py#L3520. Also, if the output class is returned, postprocess in text_to_audio.py lacks handling for class inputs.

Contributor

This is a shortcoming of the modeling code of M4T (and M4Tv2), which doesn't use return_dict=True:

if return_intermediate_token_ids:
    return SeamlessM4TGenerationOutput(
        waveform=waveform,
        waveform_lengths=waveform_lengths,
        sequences=sequences,
        unit_sequences=output_unit_ids,
    )
return waveform, waveform_lengths

Would you like to add return_dict to the M4T modeling code @sywangyi? It would look something like this where relevant:

if not return_dict:
    return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    past_key_values=encoder_outputs.past_key_values,
    hidden_states=encoder_outputs.hidden_states,
    attentions=encoder_outputs.attentions,
    cross_attentions=encoder_outputs.cross_attentions,
)

If not, I can take care of this, but the current PR relies on this fix to have batch_size > 2 working

Contributor Author

@sywangyi sywangyi Apr 2, 2024

Hi @ylacombe, actually return_dict_in_generate should be used, because this is used in the generate() func. SpeechT5 also has this issue; see the return of SpeechT5, where a tuple or a tensor will be returned: https://github.com/huggingface/transformers/blob/main/src/transformers/models/speecht5/modeling_speecht5.py#L2561-L2593

Contributor Author

So I think the pipeline should support all output formats, like dict, tuple, and tensor. WDYT?
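The three output shapes discussed here can be normalized with a small dispatch. This is a minimal sketch of the idea, not the actual transformers code: `extract_waveform` is a hypothetical helper, and plain lists stand in for tensors.

```python
def extract_waveform(output):
    """Normalize a model output to just the waveform.

    Handles the three shapes discussed in this thread:
    - a dict-like output class (e.g. SeamlessM4TGenerationOutput behaves like a dict),
    - a plain (waveform, waveform_lengths) tuple,
    - a bare waveform tensor.
    """
    if isinstance(output, dict):
        return output["waveform"]
    if isinstance(output, tuple):
        return output[0]
    return output


# All three shapes resolve to the same waveform.
print(extract_waveform({"waveform": [0.1, 0.2]}))   # dict-like output class
print(extract_waveform(([0.1, 0.2], [2])))          # (waveform, lengths) tuple
print(extract_waveform([0.1, 0.2]))                 # bare tensor
```

This mirrors the dispatch that the merged fix adds to postprocess in text_to_audio.py (lines +203 to +206 in the diff).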

Contributor

@ylacombe ylacombe left a comment

Thanks for iterating @sywangyi, and for spotting the return_dict shortcoming. Let me know if you want to take care of it; otherwise I can do it today to allow this PR to be merged!


@sywangyi
Contributor Author

sywangyi commented Apr 2, 2024

I also added "return_intermediate_token_ids" in the test case and fixed the issue in postprocess.

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Collaborator

@ArthurZucker ArthurZucker left a comment

A good catch; thanks for adding a small test! 🤗

Comment on lines +203 to +206

if isinstance(waveform, dict):
    waveform = waveform["waveform"]
elif isinstance(waveform, tuple):
    waveform = waveform[0]
Collaborator

good! Thanks

@ArthurZucker ArthurZucker merged commit 79d62b2 into huggingface:main Apr 5, 2024
@sywangyi sywangyi deleted the text-to-audio branch November 19, 2025 04:42