Yuxiang Lin edited this page Sep 16, 2025 · 2 revisions

Welcome to the MER-Factory wiki! The following is the development roadmap for MER-Factory.

TODO LIST:

  1. Batch Inference: Implement efficient batch inference for Hugging Face models to enable high-throughput processing.
  2. Model API: Develop a lightweight API server for model access, which will decouple dependencies and simplify integration.
  3. Expanded Modality Support: Introduce feature extraction tools for new modalities, beginning with audio.
  4. End-to-End Tutorial: Create a comprehensive tutorial to guide users through a complete workflow.
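For the batch-inference item above, the core of high-throughput processing is grouping inputs into fixed-size batches before each forward pass. A minimal sketch of that chunking step (the helper name `batched` and its usage are illustrative, not part of MER-Factory's current API):

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size chunks so a model can process many inputs per forward pass."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

# Each chunk would then be passed to the model in one call, e.g. a
# Hugging Face pipeline constructed with a batch_size argument.
chunks = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
```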

MER-Factory Technical Development Roadmap

Overview

This document outlines the technical development strategy for MER-Factory, an open-source initiative dedicated to advancing the field of Affective Computing. Our long-term vision is to establish MER-Factory as the leading platform for streamlined research and development of multimodal emotion reasoning models. The short-term goals are focused on building a robust and reproducible foundation for this vision.

Long-term Goal:

To engineer a comprehensive, end-to-end automated platform for affective computing, covering the entire development lifecycle—from advanced task definition (e.g., empathy, personality traits) to state-of-the-art model training, fine-tuning, evaluation, and deployment. The ultimate aim is to significantly lower the barrier for creating and analyzing complex human-centered AI.

Short-term Goals:

  • Task Support: Stably support core emotion tasks (Emotion Recognition, Cause Extraction, Sarcasm Detection) with both single-modality and multi-modality processing capabilities.
  • Core Workflow: Establish and solidify the dataset construction and model training pipeline.
  • Evaluation System: Establish an initial automated evaluation and review process based primarily on LLM-as-a-Judge, supplemented by Human-in-the-Loop review.

I. Core Emotion-Related Tasks

The factory will support the following core tasks, with the flexibility to handle both single-modality (e.g., text-only) and multi-modality (e.g., text, audio, visual) inputs.

  • Emotion Recognition: Identify and classify emotions expressed in a given context.
  • Emotion Cause Extraction (ECE): Identify and extract the specific utterances or events that trigger a particular emotion.
  • Sarcasm Detection: Determine whether an utterance is sarcastic by analyzing content and context, including potential inconsistencies across different modalities.

II. Dataset Exportation

  • LLaMA-Factory Formats:
    • Alpaca format.
    • ShareGPT format (with multimodal support).
  • MS-Swift Formats:
    • Standard `messages`-key JSONL format.
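The record shapes for these export targets can be sketched as follows. Field names follow the Alpaca, ShareGPT, and `messages` conventions as commonly documented by LLaMA-Factory and MS-Swift; the sample content and file paths are purely illustrative:

```python
import json

# Alpaca format: instruction / input / output triples.
alpaca_record = {
    "instruction": "Describe the speaker's emotion.",
    "input": "Transcript: 'I can't believe this happened again.'",
    "output": "The speaker expresses frustration.",
}

# ShareGPT format: turn-based conversations, with optional multimodal fields.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "<image>\nWhat emotion does this face show?"},
        {"from": "gpt", "value": "The face shows surprise."},
    ],
    "images": ["samples/clip_001/frame_12.jpg"],  # illustrative path
}

# MS-Swift standard format: a `messages` key with role/content turns.
swift_record = {
    "messages": [
        {"role": "user", "content": "Is this utterance sarcastic?"},
        {"role": "assistant", "content": "Yes; the praise contradicts the context."},
    ]
}

# JSONL export: one JSON object serialized per line.
jsonl_line = json.dumps(swift_record, ensure_ascii=False)
```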

III. LLM Training Pipeline

Use the exported dataset to fine-tune models with:

  • LLaMA-Factory (Done)
  • MS-Swift (ModelScope Swift)
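For the LLaMA-Factory path, training is typically driven by a YAML config passed to `llamafactory-cli train`. A minimal sketch of such a config (key names follow LLaMA-Factory's published SFT examples; the model, dataset name, and paths are placeholders, not MER-Factory defaults):

```yaml
# Illustrative LLaMA-Factory SFT config; consult the LLaMA-Factory
# docs for the authoritative list of options.
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder model
stage: sft
do_train: true
finetuning_type: lora
dataset: mer_factory_export        # hypothetical dataset name registered in dataset_info.json
template: llama3
output_dir: saves/mer-sft-lora     # placeholder path
```

It would then be launched with `llamafactory-cli train <config>.yaml`.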

IV. Evaluation Pipeline

  • LLM-as-a-Judge: Utilize a powerful "judge" LLM to automatically assess the quality of generated data based on predefined criteria.
  • Human-in-the-Loop (HITL): Provide a lightweight review interface for manual auditing, correction, and approval of the generated data.
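The LLM-as-a-Judge step above reduces to two pieces: assembling a rubric-based prompt for the judge model, and parsing a machine-readable score out of its free-text reply. A minimal sketch, where the prompt wording, the 1-5 scale, and the `Score:` reply convention are all illustrative assumptions rather than a fixed MER-Factory protocol:

```python
import re

def build_judge_prompt(sample: str, criteria: str) -> str:
    """Assemble a rubric-based prompt for a judge LLM (wording illustrative)."""
    return (
        "You are a strict data-quality reviewer.\n"
        f"Criteria: {criteria}\n"
        f"Sample:\n{sample}\n"
        "Reply with 'Score: <1-5>' and a one-sentence justification."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return int(match.group(1))

# Example: the reply text here stands in for an actual judge-model response.
score = parse_score("Score: 4. The emotion label matches the transcript.")
```

Samples scoring below a chosen threshold could then be routed to the HITL review interface instead of being auto-approved.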