This project enhances accessibility for visually impaired individuals through advanced AI. We fine-tuned the Mistral 7B model on the VizWiz dataset to build a robust, conversational visual question answering (VQA) system. VizWiz contains over 31,000 visual questions paired with images taken by blind users, and it presents real-world challenges such as poor image quality and unanswerable questions. By leveraging the CuMo architecture, our model surpasses LLaVA-based implementations on multimodal tasks, including image and video analysis. The fine-tuned model significantly improves the accuracy and relevance of responses to visual questions, giving visually impaired users a tool to learn about their surroundings independently.
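To illustrate the kind of preprocessing such a fine-tune involves, here is a minimal sketch (not our actual training code) of turning VizWiz-style records into instruction-tuning pairs. The field names follow the public VizWiz VQA annotation format (`question`, `answers`, `answerable`); the prompt template, the refusal string, and the sample record are illustrative assumptions:

```python
# Sketch only: map VizWiz-style annotations to (prompt, target) pairs
# for conversational fine-tuning. Field names follow the public VizWiz
# VQA format; the prompt template and refusal text are assumptions.
from collections import Counter

def majority_answer(answers):
    """Pick the most common answer string among the crowd answers."""
    counts = Counter(a["answer"].strip().lower() for a in answers)
    return counts.most_common(1)[0][0]

def to_chat_example(record):
    """Build one fine-tuning example from a VizWiz record.

    Unanswerable questions are kept, with an explicit refusal target,
    so the model learns to say when an image lacks the answer.
    """
    prompt = f"[IMAGE] Question: {record['question']}"
    if record.get("answerable", 1) == 0:
        target = "I can't tell from this image."  # assumed refusal text
    else:
        target = majority_answer(record["answers"])
    return {"prompt": prompt, "target": target}

# Hypothetical record in the VizWiz annotation shape:
sample = {
    "question": "What color is this shirt?",
    "answerable": 1,
    "answers": [{"answer": "blue"}, {"answer": "Blue"}, {"answer": "navy"}],
}
print(to_chat_example(sample)["target"])  # -> blue
```

Keeping the unanswerable questions in the training set, rather than filtering them out, is what lets the model decline gracefully instead of hallucinating an answer for an unusable photo.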
