
Integrate w2v-bert2-LoRA-adapter-MFA model#439

Merged
cdliang11 merged 3 commits into wenet-e2e:master from shangguanqituan:feature/w2v-integration
Dec 2, 2025

Conversation

@shangguanqituan
Collaborator

Overview

This pull request integrates the model proposed in the paper "Enhancing Speaker Verification with W2V-BERT 2.0 and Knowledge Distillation Guided Structured Pruning" into the wespeaker framework.

We have successfully implemented the full three-stage training pipeline for the w2v-bert2-lora-adapter-mfa model and ensured its compatibility with the existing wespeaker ecosystem. This not only introduces a powerful new model to the framework but also opens up new possibilities for future research and applications.

Key Features and Changes

  1. Model Integration:

    • Frontend: Added a w2vbert2 frontend in wespeaker/frontend, which incorporates LoRA (Low-Rank Adaptation) for efficient fine-tuning.
    • Model: Implemented an Adapter-MFA (Multi-Factor Attention) module in wespeaker/models. This module serves as the speaker model and is specifically designed to process the multi-layer hidden states from the w2vbert2 frontend.
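As a rough illustration of the kind of multi-layer aggregation described above (a hypothetical sketch, not the actual Adapter-MFA code; tensors are stood in for by flat lists of floats, and the function name is illustrative):

```python
import math

def aggregate_layers(all_hidden_states, layer_weights):
    # Softmax-normalize one scalar weight per Transformer layer, then
    # take the weighted sum of the per-layer hidden states.
    exps = [math.exp(w) for w in layer_weights]
    total = sum(exps)
    norm = [e / total for e in exps]
    dim = len(all_hidden_states[0])
    return [
        sum(norm[l] * all_hidden_states[l][i]
            for l in range(len(all_hidden_states)))
        for i in range(dim)
    ]
```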
  2. New Training Pipeline:

    • Added complete three-stage training configuration files (YAML) and a corresponding execution script, run_w2v.sh, to facilitate the reproduction of the original paper's training process.
  3. Framework Adaptations:

    • Dataflow Handling:
      • To meet the input requirements of the MFA module, the w2vbert2 frontend now returns all Transformer layer hidden states (all_hidden_states) as a tuple.
      • To handle this new return type (which differs from other frontends that return a last_hidden_state tensor), we have added conditional logic in the executor and extract modules to ensure smooth pipeline execution.
    • DistributedDataParallel (DDP) Configuration: Due to the increased complexity of gradient computation introduced by LoRA and MFA, we found it necessary to set find_unused_parameters=True for DistributedDataParallel in train.py to resolve gradient synchronization issues.
    • Learning Rate Schedulers: Added two new schedulers, WarmupLR_withStepDecay and WarmupCosineScheduler, to scheduler.py to meet the specific requirements for reproducing the paper's results.
    • ASP Compatibility: The ASP (Attentive Statistics Pooling) module has been slightly modified to be compatible with the new model's outputs, without affecting the functionality of existing models in the framework.
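The conditional dataflow handling described above can be sketched as follows (a simplified stand-in, not the actual executor code; the helper name is hypothetical and plain Python values stand in for tensors):

```python
def collect_hidden_states(frontend_out):
    # The w2vbert2 frontend returns a tuple of all layer hidden states,
    # while other frontends return a single last_hidden_state tensor.
    # Normalize both cases to a list of per-layer outputs.
    if isinstance(frontend_out, tuple):
        return list(frontend_out)
    return [frontend_out]
```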
  4. Pre-trained Model Compatibility:

    • To allow users to easily load the official pre-trained checkpoints (ckpt) from the paper, we have modified checkpoint.py.
    • This change addresses a classifier dimension mismatch (5994*3 vs. 5994) that occurs because the original paper's third training stage does not use speed_perturb data augmentation. The code can now intelligently slice the weights to the correct dimension, enabling successful loading of the official model.
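One plausible reading of the slicing logic (a hypothetical sketch; the real change lives in checkpoint.py and operates on PyTorch tensors, stood in for here by lists of rows):

```python
def match_classifier_rows(weight_rows, target_num_rows):
    # When the row counts differ by the speed_perturb factor of 3
    # (e.g. 5994 * 3 vs 5994), slice the larger weight down to the
    # expected size so the state dict can be loaded.
    if len(weight_rows) == target_num_rows * 3:
        return weight_rows[:target_num_rows]
    return weight_rows
```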

We believe this integration will greatly enrich the wespeaker model zoo and provide the community with a powerful new tool. We look forward to your feedback and review!
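The WarmupCosineScheduler mentioned under Framework Adaptations can be approximated as follows (a minimal sketch assuming linear warmup followed by cosine decay; the function name and signature are illustrative, not the actual scheduler.py API):

```python
import math

def warmup_cosine_lr(step, max_steps, warmup_steps, base_lr, final_lr=0.0):
    # Linear warmup from 0 up to base_lr, then cosine decay to final_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```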

@JiJiJiang
Collaborator

Please fix flake8 errors.

@shangguanqituan
Collaborator Author

Thanks for your feedback! I've just pushed a new commit that addresses all the flake8 style issues.

Collaborator

Perhaps you would prefer not to expose your local directories here

@wsstriving
Collaborator

Can you also update the results and the pretrained model pages?

@shangguanqituan
Collaborator Author

Okay, I will update later.

Collaborator

Why modify this file?

Collaborator

To keep the PR simple, irrelevant parts should not be modified.

Collaborator Author

I didn't intentionally modify those two files; it seems the formatting tool updated them automatically during the commit. I will submit a clean PR later.

@shangguanqituan
Collaborator Author

Update Summary:

  1. Code Reorganization: I have performed a clean commit to strictly limit changes to relevant files. This ensures that no unrelated files (e.g., whitespace changes in other scripts) are touched, addressing the previous feedback.
  2. Updated Results: I have updated the README.md with the latest experimental results, including:
  • Reproduction Results: Trained on VoxCeleb (from scratch).
  • Verification Results: Inference using the official checkpoint to verify correctness.
  3. Pretrained Models: I have uploaded the checkpoints to ModelScope and updated the model list in README.md.
  • Reproduced Models: Trained on VoxCeleb.
  • Official Models: Trained on VoxCeleb + VoxBlink.

Note on ONNX: I attempted to export the model to ONNX using wespeaker/bin/export_onnx.py, but it failed due to the complexity of the W2V-BERT architecture (specifically dynamic axes in MFA adapter layers). Therefore, I marked the Runtime Model column as -.

@@ -0,0 +1,273 @@
#!/bin/bash

# Copyright 2025 Your Name/Org (your_email@example.com)
Collaborator

Change to your name and email address

@@ -0,0 +1,388 @@
# Copyright (c) 2025 Your Name/Org (your_email@example.com)
Collaborator

Change to your name and email address

@@ -0,0 +1,126 @@
# Copyright (c) 2025 Your Name/Org
Collaborator

Change to your name and email address

@cdliang11
Collaborator

Good job!

@cdliang11
Collaborator

Please fix flake8 errors

@cdliang11 cdliang11 merged commit 7d7b707 into wenet-e2e:master Dec 2, 2025
4 checks passed
@cdliang11
Collaborator

Merged.

@wsstriving
Collaborator

@cdliang11 maybe we also want to support this model in the CLI mode? @shangguanqituan We also need to decide which checkpoint to upload in this mode, the one you trained or the one provided by the original paper

@cdliang11
Collaborator

> @cdliang11 maybe we also want to support this model in the CLI mode? @shangguanqituan We also need to decide which checkpoint to upload in this mode, the one you trained or the one provided by the original paper

Of course, supporting this model in CLI mode is feasible. @shangguanqituan, please select the model checkpoint to adopt.

@shangguanqituan
Collaborator Author

Let's use the checkpoint provided by the original paper — w2v_bert2_voxblink_official_LM.pth.
It’s listed in pretrained.md.
