Inspiration

AI-Generated Content (AIGC) has significantly reshaped the approach to creation and innovation over the past year, and it is poised for further growth in the years ahead. However, training models for AIGC can be computationally expensive and contribute to CO2 emissions, particularly if the training data are of mediocre quality, leading to slower convergence and suboptimal outcomes. We are attempting to address this issue by selecting high-quality training data, enabling the model to converge more quickly.

What it does

Our approach to addressing this challenge is twofold: video filtering and recaptioning.

For training purposes, an ideal video should contain a sufficient amount of "motion" between frames, such as a person moving around, but the variation should not be overly dramatic, such as a sudden scene transition. Such outliers are filtered out first, and the remaining videos are recaptioned. On that second front, we enhance the dataset using a Large Multimodal Model (LMM): by feeding it the original captions along with sampled scene frames, we significantly improve the quality and detail of the generated captions.
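The filtering rule above amounts to a band-pass on a per-clip motion score: too little motion means a near-static clip, a spike means a scene cut. A minimal sketch — the score definition and thresholds here are illustrative, not the values we tuned:

```python
import numpy as np

def motion_scores(frames: np.ndarray) -> np.ndarray:
    """Mean absolute RGB difference between consecutive frames.

    frames: array of shape (T, H, W, 3), values in [0, 255].
    Returns T-1 per-transition scores.
    """
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    return diffs.mean(axis=(1, 2, 3))

def keep_clip(frames: np.ndarray, low: float = 2.0, high: float = 60.0) -> bool:
    """Keep a clip only if it has enough motion on average (> low)
    and no abrupt scene cut (every transition score < high)."""
    scores = motion_scores(frames)
    return bool(scores.mean() > low and scores.max() < high)
```

A static clip fails the lower bound, a clip with one hard cut fails the upper bound, and gradual motion passes both.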

How we built it

We developed our AI-Generated Content Dataset Filter through a combination of video analysis and machine learning techniques. Initially, we curated a diverse set of videos representing a wide range of motion and scene transitions. We then implemented algorithms to analyze each frame for motion, using metrics such as RGB pixel differences, optical flow, and cosine similarity of frame embeddings. To classify and filter the videos efficiently, we integrated an XGBoost classifier trained to identify videos that met our criteria based on the motion analysis.
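As an illustration, two of the metrics above — RGB pixel difference and cosine similarity between frame embeddings — can be computed like this (the raw frames stand in for real embeddings here; in the actual pipeline the embeddings came from an encoder and optical flow was computed separately):

```python
import numpy as np

def rgb_pixel_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute per-pixel RGB difference between two frames."""
    return float(np.abs(a.astype(np.float32) - b.astype(np.float32)).mean())

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two (flattened) embedding vectors."""
    u, v = u.ravel().astype(np.float64), v.ravel().astype(np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def frame_pair_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-transition feature vector of the kind fed to the classifier."""
    return np.array([rgb_pixel_diff(a, b), cosine_similarity(a, b)])
```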

For the recaptioning process, we leveraged Google's Gemini Pro Vision, a Large Multimodal Model, to generate detailed and accurate captions by feeding it the video frames alongside their original captions.
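The recaptioning call looks roughly like this, assuming the `google-generativeai` Python SDK; the prompt wording is our own and only a sketch, and `frames` is a list of PIL images sampled from the scene:

```python
def build_prompt(original_caption: str) -> str:
    """Ask the LMM to expand a short caption using the attached frames."""
    return (
        "Here are frames sampled from one video scene. "
        f"Its original caption is: {original_caption!r}. "
        "Rewrite the caption so it is detailed and faithful to the frames."
    )

def recaption(frames: list, original_caption: str) -> str:
    import google.generativeai as genai  # pip install google-generativeai
    model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([build_prompt(original_caption), *frames])
    return response.text
```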

To evaluate the results, we use the CLIP-ViT model to embed both the video frames and the caption into the same embedding space, where cosine similarity measures how well the caption matches the video.
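The score reduces to a cosine similarity between a video embedding (here, the mean of its frame embeddings) and the caption embedding. A minimal sketch that assumes the embeddings were already produced by the same CLIP-ViT model, since loading the model itself is out of scope here:

```python
import numpy as np

def clip_score(frame_embs: np.ndarray, caption_emb: np.ndarray) -> float:
    """Cosine similarity between the mean frame embedding and the caption
    embedding. Both inputs must come from the same CLIP model so that they
    live in the same embedding space."""
    video_emb = frame_embs.mean(axis=0)
    video_emb = video_emb / np.linalg.norm(video_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return float(video_emb @ caption_emb)
```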

Challenges we ran into

The retrieval and processing of the dataset posed a significant initial challenge, given the sheer volume of videos requiring download and segmentation. We overcame this by leveraging multiprocessing, streamlining the workflow from download to scene segmentation.
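The parallel download-and-segment workflow can be sketched with a process pool along these lines; the worker body is a stand-in for the real download and scene-segmentation code:

```python
import multiprocessing as mp

def process_video(url: str) -> str:
    """Stand-in worker: download one video, split it into scenes,
    and return a status string. Replace the body with real I/O."""
    return f"segmented {url}"

def process_all(urls: list[str], workers: int = 4) -> list[str]:
    """Fan the per-video work out across `workers` processes."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(process_video, urls)
```

`Pool.map` preserves input order, so results line up with the URL list.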

Tuning the motion detection parameters presented another hurdle; identifying the optimal balance to capture the desired level of motion without including abrupt scene transitions was complex. We addressed this by integrating human insight with machine learning, using an XGBoost classifier informed by human-labeled data.
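The human-in-the-loop step looks roughly like this: features from the motion analysis, labels from manual review. scikit-learn's `GradientBoostingClassifier` stands in here for the XGBoost model we actually used, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: feature = [mean motion, max motion]; label 1 = "keep".
X = rng.uniform(0.0, 100.0, size=(200, 2))
y = ((X[:, 0] > 10.0) & (X[:, 1] < 60.0)).astype(int)  # mimics human labels

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
```

The fitted model then replaces hand-tuned thresholds when filtering new videos.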

Another challenge was ensuring high-quality recaptioning, which requires understanding the context and nuances of a video scene. Running an open-source model locally was a viable option, but its poor results steered us toward a well-established model, Google's Gemini Pro Vision.

Accomplishments that we're proud of

  • Our XGBoost classifier for filtering high-quality videos achieves 85% accuracy.
  • The evaluation score (CLIP score) is consistently higher when using recaptioned and filtered data.

What we learned

  • Embedding videos and images
  • Optimization tricks (multithreading, PyTorch, Numpy optimization, etc.)
  • Video analysis techniques (frame-by-frame encoding using a variational autoencoder)
  • How to open an onigiri

What's next for AI-Generated Content Dataset Filter

  • Using a locally hosted Large Multimodal Model instead of relying on a third-party model (not possible within this timeframe due to hardware and time limitations)

GitHub Repository Link: https://github.com/SuperShyMLDA2024/filtrain
