Conversation
ylacombe
left a comment
Hey @ArthurZucker, I've done some refactoring on the tokenizer, which I believe will please you 🤗
I still have to add copied from statements in the tests, but I think you can already finish your review here, if that's ok with you!
Thanks!
<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

- [BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+ [BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
I actually don't know why it happens, do you have any ideas?
represented by the `inputs_ids` passed when calling the Text-To-Units sub-model of [`~SeamlessM4TModel`],
[`~SeamlessM4TForSpeechToSpeech`] or [`~SeamlessM4TForTextToSpeech`].

> Parameters shared across sub-models
This seems to break `check_repository_consistency`, any ideas why @ydshieh? I think `PretrainedConfig` uses something similar.
Sorry for being late. I will check this afternoon!
It's `conv_depthwise_kernel_size (int, defaults to 31, *optional*, defaults to 31):` that's causing the problem.
Change it to `conv_depthwise_kernel_size (int, *optional*, defaults to 31):`
then run
`python utils/check_docstrings.py --fix_and_overwrite`
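For reference, the expected shape of such a parameter line can be sanity-checked with a quick regex (a sketch only, not the actual `check_docstrings.py` logic):

```python
import re

# Rough pattern for the `name (type, *optional*, defaults to X):` convention
# that the docstring checker expects (illustrative, not the real check).
PARAM_RE = re.compile(r"^\w+ \(\w+, \*optional\*, defaults to [^)]+\):$")

broken = "conv_depthwise_kernel_size (int, defaults to 31, *optional*, defaults to 31):"
fixed = "conv_depthwise_kernel_size (int, *optional*, defaults to 31):"

assert not PARAM_RE.match(broken)  # duplicated "defaults to" breaks the shape
assert PARAM_RE.match(fixed)
```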
For lines like `> Text-To-Unit (t2u) model specific parameters`, I think `utils/check_docstrings.py` could not handle it. Is this a standard way to add a comment while we are in an `Args` block? Could this be rendered correctly?
> For lines like `> Text-To-Unit (t2u) model specific parameters`, I think `utils/check_docstrings.py` could not handle it. Is this a standard way to add a comment while we are in an `Args` block? Could this be rendered correctly?
Hi @ydshieh, thanks for your feedback! This is something that I've mimicked from `PretrainedConfig`.
How should I handle this? I'd like to keep these lines!
Added to `OBJECTS_TO_IGNORE` in `utils/check_docstrings.py` after checking that it passes our requirements without the weird `>` markdown.
Thanks for your offline support on this @ydshieh
As requested, I've done a big cleanup; I believe it's much better now. Thanks for your guidance @ArthurZucker!
if src_lang is not None:
    self.src_lang = src_lang
if tgt_lang is not None:
    self.tgt_lang = tgt_lang
I actually think they should be in the docstrings, WDYT?
@classmethod
def _from_pretrained(
    cls,
    resolved_vocab_files,
    pretrained_model_name_or_path,
    init_configuration,
    *init_inputs,
    token=None,
    cache_dir=None,
    local_files_only=False,
    _commit_hash=None,
    _is_local=False,
    **kwargs,
):
    tokenizer = super()._from_pretrained(
        resolved_vocab_files,
        pretrained_model_name_or_path,
        init_configuration,
        *init_inputs,
        token=token,
        cache_dir=cache_dir,
        local_files_only=local_files_only,
        _commit_hash=_commit_hash,
        _is_local=_is_local,
        **kwargs,
    )

    # ensure also set after from pretrained
    tokenizer.set_src_lang_special_tokens(tokenizer._src_lang)
    tokenizer.set_tgt_lang_special_tokens(tokenizer._tgt_lang)

    return tokenizer
I unfortunately had to add this to pass the tests; otherwise, it won't use `set_src_lang_special_tokens` and `set_tgt_lang_special_tokens`.
But it should be enough in the init, no?
Well, it should, but `from_pretrained` first calls `__init__` and then adds the special tokens, so the lang tokens are set to `unk_token_id`.
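The ordering problem can be illustrated with a toy class (all names here are hypothetical, just to show why the lang ids need to be re-resolved after `__init__`):

```python
class ToyTokenizer:
    """Toy illustration (not the real API): special tokens are only added
    *after* __init__, so anything resolved from them in __init__ is stale."""

    def __init__(self, src_lang="eng"):
        self.vocab = {"<unk>": 0}
        self._src_lang = src_lang
        # "__eng__" is not in the vocab yet at this point -> resolves to <unk>.
        self.cur_lang_id = self.vocab.get(f"__{src_lang}__", 0)

    def add_special_tokens(self, tokens):
        for tok in tokens:
            self.vocab.setdefault(tok, len(self.vocab))

    def set_src_lang_special_tokens(self, src_lang):
        # Re-resolve once the lang tokens actually exist.
        self.cur_lang_id = self.vocab.get(f"__{src_lang}__", 0)


# Mimic from_pretrained: __init__ first, special tokens second.
tok = ToyTokenizer("eng")
tok.add_special_tokens(["__eng__", "__fra__"])
assert tok.cur_lang_id == 0  # stale: still points at <unk>
tok.set_src_lang_special_tokens(tok._src_lang)
assert tok.cur_lang_id == 1  # fixed by the post-init call
```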
If the fast tokenizer is properly saved, this should not happen (meaning the additional special tokens are only added afterwards if you don't have a fast tokenizer file saved; otherwise they're directly passed). This is something I've got to take care of in the initialization, meaning yes, you should be able to set the target language normally without having to call this.
I'm quite sure this happens during the regular test suite and not the integration tests though!
pass

@unittest.skip(
    reason="Expected missing keys serve when using SeamlessM4TForXXX.from_pretrained from a checkpoint saved by SeamlessM4TModel.save_pretrained."
Well, for example, I've set the `SeamlessM4TForSpeechToSpeech` keys to ignore like this:
`_keys_to_ignore_on_load_missing = ["text_encoder"]`
But the model doesn't have a `text_encoder`; this is used to make it compatible with loading weights from a `SeamlessM4TModel` checkpoint without warnings.
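A rough sketch of how such a prefix list suppresses the warnings (simplified from what `from_pretrained` does internally; the function name here is illustrative):

```python
import re


def filter_missing_keys(missing_keys, keys_to_ignore_on_load_missing):
    # Drop any missing key matching one of the ignore patterns, so loading a
    # task-specific model from a full-model checkpoint stays warning-free.
    patterns = [re.compile(p) for p in keys_to_ignore_on_load_missing]
    return [k for k in missing_keys if not any(p.search(k) for p in patterns)]


missing = ["text_encoder.layers.0.weight", "vocoder.conv.weight"]
# Only the genuinely unexpected key survives the filter.
assert filter_missing_keys(missing, ["text_encoder"]) == ["vocoder.conv.weight"]
```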
pass

@unittest.skip(
    reason="SeamlessM4TModel has actually a bigger architecture than seamlessM4T models for specific tasks."
The base model is `SeamlessM4TModel`, but it actually has more components than the non-base models, a.k.a. the task-specific models.
utils/not_doctested.txt
docs/source/en/model_doc/roformer.md
docs/source/en/model_doc/rwkv.md
docs/source/en/model_doc/sam.md
docs/source/en/model_doc/seamless_m4t.md
Just slow, I believe; I'll move it to the right file then.
ArthurZucker
left a comment
Looks a lot better indeed! Left a few small nits
# fairseq | '<pad>' | '<unk>' | '<s>' | '</s>' | 'an' | 'en' | '▁d' | 'er' | 'in' | '▁s'

# Mimic fairseq token-to-id alignment for the first 4 token
self.fairseq_tokens_to_ids = {pad_token: 0, unk_token: 1, bos_token: 2, eos_token: 3}
Suggested change:
- self.fairseq_tokens_to_ids = {pad_token: 0, unk_token: 1, bos_token: 2, eos_token: 3}
+ self._added_tokens_decoder = {AddedToken(pad_token): 0, AddedToken(unk_token): 1, AddedToken(bos_token): 2, AddedToken(eos_token): 3}
if token in self.fairseq_tokens_to_ids:
    return self.fairseq_tokens_to_ids[token]
spm_id = self.sp_model.PieceToId(token)
This should not be needed: tokens that are part of the `added_tokens_decoder` never reach this code path.
inputs = self(raw_inputs, add_special_tokens=True, return_tensors=return_tensors, **extra_kwargs)
if "__" not in tgt_lang:
    tgt_lang = f"__{tgt_lang}__"
tgt_lang_id = self.convert_tokens_to_ids(tgt_lang)
`convert_tokens_to_ids` calls `super().convert_tokens_to_ids_with_added_voc`, so no worries if you don't have the fairseq token id.
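The lookup order being described can be sketched like this (toy structures, not the real base-class code):

```python
def toy_convert_token_to_id(token, added_tokens_encoder, spm_lookup, unk_id=1):
    # Added/special tokens are resolved first; only unknown strings fall
    # through to the sentencepiece-style lookup (which may return unk).
    if token in added_tokens_encoder:
        return added_tokens_encoder[token]
    return spm_lookup.get(token, unk_id)


added = {"<pad>": 0, "<unk>": 1, "__fra__": 256001}  # illustrative ids
spm = {"▁hello": 42}
assert toy_convert_token_to_id("__fra__", added, spm) == 256001  # never hits spm
assert toy_convert_token_to_id("▁hello", added, spm) == 42
assert toy_convert_token_to_id("missing", added, spm) == 1  # falls back to unk
```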
        self._convert_id_to_token(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)
    }
    # need to ensure that fairseq_tokens_to_id are placed at the beginning of the vocabulary
    vocab.update(self.fairseq_tokens_to_ids)
Suggested change:
- vocab.update(self.fairseq_tokens_to_ids)
def get_vocab(self):
    vocab = {
        self._convert_id_to_token(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)

Suggested change:
- self._convert_id_to_token(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)
+ self.convert_ids_to_tokens(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)
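The offset bookkeeping under discussion can be sketched as follows (illustrative specials and offset value; the real layout depends on the checkpoint):

```python
# Toy offset scheme: the lowest ids are pinned to fairseq-style specials, and
# every sentencepiece id is shifted up by a fixed offset so the ranges never
# collide (values here are illustrative, not the checkpoint's real layout).
FAIRSEQ_SPECIALS = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
FAIRSEQ_OFFSET = len(FAIRSEQ_SPECIALS)


def get_vocab(spm_pieces):
    vocab = {piece: spm_id + FAIRSEQ_OFFSET for spm_id, piece in enumerate(spm_pieces)}
    vocab.update(FAIRSEQ_SPECIALS)  # specials stay at the beginning of the vocabulary
    return vocab


vocab = get_vocab(["▁hello", "▁world"])
assert vocab == {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "▁hello": 4, "▁world": 5}
```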
def _tokenize(self, text: str) -> List[str]:
    return self.sp_model.encode(text, out_type=str)
(By that I mean: check whether the sentencepiece model always adds an extra space after the special tokens!)
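A toy illustration of the space artifact in question (hypothetical encode/decode helpers, not the real `sp_model`): sentencepiece-style pieces mark word starts with `▁`, so naive decoding can inject a space that was never in the input, e.g. right after a special token:

```python
def toy_spm_encode(text):
    # Hypothetical metaspace-style encoding: every word gets a "▁" prefix.
    return ["▁" + w for w in text.split(" ")]


def toy_spm_decode(pieces):
    # Naive decode: replace every "▁" with a space, including the first one.
    return "".join(pieces).replace("▁", " ")


pieces = toy_spm_encode("hello world")
assert pieces == ["▁hello", "▁world"]
# Note the injected leading space: appended after a special token, this would
# read "<special> hello world" even if no space followed the special token.
assert toy_spm_decode(pieces) == " hello world"
```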
Thanks for the quick review, I'll address most of your comments this morning. I still have these pending questions if you have time:

Hey @ArthurZucker, your suggestions made it even cleaner! I've also added some Copied from statements! Thanks for that! Do you think you could review the modeling code a bit more thoroughly soon?

On it tomorrow!
ArthurZucker
left a comment
Overall looks good to me. Thanks for the huge effort!
}


class SeamlessM4TConfig(PretrainedConfig):
No problem thanks for explaining! 🤗
def _load_hf_config(model_type="medium"):
    if model_type == "medium":
OK, I was wondering about potential future models, but we'll see then!
Meta SeamlessM4T is made of 8 main components:
- speech_encoder (#1) and speech_encoder_frontend (#2)
- t2u_model (#3)
- text_encoder (#4) and text_encoder_frontend (#5)
- text_decoder (#6) [and text_decoder_frontend (#5) = equals to text_encoder_frontend]
- final_proj (#7)
- vocoder (#8)
frame_length=400,
hop_length=160,
fft_length=512,
power=2.0,
center=False,
preemphasis=0.97,
mel_filters=self.mel_filters,
log_mel="log",
mel_floor=1.192092955078125e-07,
remove_dc_offset=True,
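With `center=False`, the number of mel frames implied by the `frame_length`/`hop_length` values above follows the usual STFT counting formula (a sketch; the actual feature extractor may handle padding differently):

```python
def num_frames(num_samples, frame_length=400, hop_length=160):
    # Without centering/padding, a window starts every hop_length samples
    # for as long as a full frame_length window still fits in the signal.
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


# 1 second at 16 kHz with 25 ms windows (400 samples) and 10 ms hops (160 samples)
assert num_frames(16000) == 98
assert num_frames(399) == 0  # not even one full window fits
```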
src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py
    raise ValueError(
        "`input_ids`,`input_features`, `inputs_embeds` and `encoder_outputs` are all empty. Make sure at least one of them is not."
    )
elif input_features is not None:
Thanks for the last review @ArthurZucker!
* first raw commit * still POC * tentative convert script * almost working speech encoder conversion scripts * intermediate code for encoder/decoders * add modeling code * first version of speech encoder * make style * add new adapter layer architecture * add adapter block * add first tentative config * add working speech encoder conversion * base model convert works now * make style * remove unnecessary classes * remove unecessary functions * add modeling code speech encoder * rework logics * forward pass of sub components work * add modeling codes * some config modifs and modeling code modifs * save WIP * new edits * same output speech encoder * correct attention mask * correct attention mask * fix generation * new generation logics * erase comments * make style * fix typo * add some descriptions * new state * clean imports * add tests * make style * make beam search and num_return_sequences>1 works * correct edge case issue * correct SeamlessM4TConformerSamePadLayer copied from * replace ACT2FN relu by nn.relu * remove unecessary return variable * move back a class * change name conformer_attention_mask ->conv_attention_mask * better nit code * add some Copied from statements * small nits * small nit in dict.get * rename t2u model -> conditionalgeneration * ongoing refactoring of structure * update models architecture * remove SeamlessM4TMultiModal classes * add tests * adapt tests * some non-working code for vocoder * add seamlessM4T vocoder * remove buggy line * fix some hifigan related bugs * remove hifigan specifc config * change * add WIP tokenization * add seamlessM4T working tokenzier * update tokenization * add tentative feature extractor * Update converting script * update working FE * refactor input_values -> input_features * update FE * changes in generation, tokenizer and modeling * make style and add t2u_decoder_input_ids * add intermediate outputs for ToSpeech models * add vocoder to speech models * update valueerror * update FE with languages * 
add vocoder convert * update config docstrings and names * update generation code and configuration * remove todos and update config.pad_token_id to generation_config.pad_token_id * move block vocoder * remove unecessary code and uniformize tospeech code * add feature extractor import * make style and fix some copies from * correct consistency + make fix-copies * add processor code * remove comments * add fast tokenizer support * correct pad_token_id in M4TModel * correct config * update tests and codes + make style * make some suggested correstion - correct comments and change naming * rename some attributes * rename some attributes * remove unecessary sequential * remove option to use dur predictor * nit * refactor hifigan * replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config * add tests * change tgt_lang logic * update generation ToSpeech * add support import SeamlessM4TProcessor * fix generate * make tests * update integration tests, add option to only return text and update tokenizer fast * fix wrong function call * update import and convert script * update integration tests + update repo id * correct paths and add first test * update how new attention masks are computed * update tests * take first care of batching in vocoder code * add batching with the vocoder * add waveform lengths to model outputs * make style * add generate kwargs + forward kwargs of M4TModel * add docstrings forward methods * reformate docstrings * add docstrings t2u model * add another round of modeling docstrings + reformate speaker_id -> spkr_id * make style * fix check_repo * make style * add seamlessm4t to toctree * correct check_config_attributes * write config docstrings + some modifs * make style * add docstrings tokenizer * add docstrings to processor, fe and tokenizers * make style * write first version of model docs * fix FE + correct FE test * fix tokenizer + add correct integration tests * fix most tokenization tests * make style * 
correct most processor test * add generation tests and fix num_return_sequences > 1 * correct integration tests -still one left * make style * correct position embedding * change numbeams to 1 * refactor some modeling code and correct one test * make style * correct typo * refactor intermediate fnn * refactor feedforward conformer * make style * remove comments * make style * fix tokenizer tests * make style * correct processor tests * make style * correct S2TT integration * Apply suggestions from Sanchit code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * correct typo * replace torch.nn->nn + make style * change Output naming (waveforms -> waveform) and ordering * nit renaming and formating * remove return None when not necessary * refactor SeamlessM4TConformerFeedForward * nit typo * remove almost copied from comments * add a copied from comment and remove an unecessary dropout * remove inputs_embeds from speechencoder * remove backward compatibiliy function * reformate class docstrings for a few components * remove unecessary methods * split over 2 lines smthg hard to read * make style * replace two steps offset by one step as suggested * nice typo * move warnings * remove useless lines from processor * make generation non-standard test more robusts * remove torch.inference_mode from tests * split integration tests * enrich md * rename control_symbol_vocoder_offset->vocoder_offset * clean convert file * remove tgt_lang and src_lang from FE * change generate docstring of ToText models * update generate docstring of tospeech models * unify how to deal withtext_decoder_input_ids * add default spkr_id * unify tgt_lang for t2u_model * simplify tgt_lang verification * remove a todo * change config docstring * make style * simplify t2u_tgt_lang_id * make style * enrich/correct comments * enrich .md * correct typo in docstrings * add torchaudio dependency * update tokenizer * make style and fix copies * modify 
SeamlessM4TConverter with new tokenizer behaviour * make style * correct small typo docs * fix import * update docs and add requirement to tests * add convert_fairseq2_to_hf in utils/not_doctested.txt * update FE * fix imports and make style * remove torchaudio in FE test * add seamless_m4t.md to utils/not_doctested.txt * nits and change the way docstring dataset is loaded * move checkpoints from ylacombe/ to facebook/ orga * refactor warning/error to be in the 119 line width limit * round overly precised floats * add stereo audio behaviour * refactor .md and make style * enrich docs with more precised architecture description * readd undocumented models * make fix-copies * apply some suggestions * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * correct bug from previous commit * refactor a parameter allowing to clean the code + some small nits * clean tokenizer * make style and fix * make style * clean tokenizers arguments * add precisions for some tests * move docs from not_tested to slow * modify tokenizer according to last comments * add copied from statements in tests * correct convert script * correct parameter docstring style * correct tokenization * correct multi gpus * make style * clean modeling code * make style * add copied from statements * add copied statements * add support with ASR pipeline * remove file added inadvertently * fix docstrings seamlessM4TModel * add seamlessM4TConfig to OBJECTS_TO_IGNORE due of unconventional markdown * add seamlessm4t to assisted generation ignored models --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
What does this PR do?
Meta recently introduced Seamless M4T, a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T supports multiple audio and/or translation tasks, namely S2TT, S2ST, T2TT, T2ST, where the last T stands for translation.
In other words, this model seamlessly supports audio|text to translated audio|text.
SeamlessM4T weights are already available on the hub (large and medium) and the code is available on the seamless_communication git repo.
In terms of architecture, and after having discussed with @sanchit-gandhi, I came up with 4 different models for the 4 tasks, plus one model that can do every task.
I've been working on the integration for a couple of days already. At the moment, the conversion script is more or less ready and the different models can generate.
Here is a TODO of what's left to be done:
cc @sanchit-gandhi and @ArthurZucker !