Stack
- Used TensorFlow for model prototyping, training, and testing
- Used Numpy for its variety of mathematical functions to assist in loss functions and transformations
- Used OpenCV for its video handling as well as augmentations to create a diverse training dataset
- Used MediaPipe for hand keypoint estimation to create tensors to feed into models
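As a sketch of how the MediaPipe output becomes a model input: MediaPipe's hand tracker reports 21 landmarks per detected hand, and those coordinates can be flattened into a feature vector with NumPy. The exact flattening order (all three coordinates per landmark) is my assumption; the write-up only says keypoint tensors were fed into the model.

```python
import numpy as np

# MediaPipe's hand tracker reports 21 landmarks per detected hand,
# each with (x, y, z) coordinates.
NUM_LANDMARKS = 21

def landmarks_to_tensor(landmarks):
    """Flatten a list of 21 (x, y, z) landmark tuples into one feature vector."""
    arr = np.asarray(landmarks, dtype=np.float32)
    assert arr.shape == (NUM_LANDMARKS, 3), "expected 21 (x, y, z) landmarks"
    return arr.reshape(-1)  # shape (63,)

# Dummy landmarks standing in for a real MediaPipe detection result:
dummy = [(0.5, 0.5, 0.0)] * NUM_LANDMARKS
features = landmarks_to_tensor(dummy)
print(features.shape)  # (63,)
```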
Inspiration
I've often wondered how isolating it must feel to be deaf: unable to hear the world around you, or the voices of your loved ones. That's why I made SignVerse, a live translation system that converts hand gestures into text. Due to hardware limitations and the sheer difficulty of the task, I could not use the entire 2,000-phrase WLASL dataset; I instead settled for 8 common day-to-day phrases.
What it does
SignVerse uses a live camera feed to track hand movements across frames, sends that data through a classifier, and outputs one of 8 phrases: NO, YES, HELLO/GOODBYE, SORRY, THANK YOU, HOW ARE YOU, I AGREE, and I DISAGREE.
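The final step, picking a phrase from the classifier's output, can be sketched like this. The label order is my assumption, and a dummy probability vector stands in for the real model's softmax output.

```python
import numpy as np

# The eight target phrases, in an assumed order matching the model's output units.
PHRASES = ["NO", "YES", "HELLO/GOODBYE", "SORRY", "THANK YOU",
           "HOW ARE YOU", "I AGREE", "I DISAGREE"]

def decode_prediction(probs):
    """Pick the most likely phrase from an 8-way softmax output."""
    return PHRASES[int(np.argmax(np.asarray(probs)))]

# Dummy softmax output standing in for the real classifier:
print(decode_prediction([0.01, 0.02, 0.90, 0.01, 0.02, 0.02, 0.01, 0.01]))
# HELLO/GOODBYE
```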
How we built it
I recorded samples for 8 different phrases, with each video lasting about 2-3 seconds. I then normalized and preprocessed the data: I used MediaPipe to extract the keypoints (skeleton) of the hand on screen, and saved these to .npy files (NumPy's on-disk array format). This dramatically reduced the amount of data I trained on while also capturing more context; I saved about 21 values per frame rather than 2,764,800 raw pixel values. Then I created a neural network using the TensorFlow subclassing API (architecture details in the README) and trained it on my keypoints. Finally, I wrote a simple script to take camera input, run the preprocessing steps, and translate.
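The preprocessing and storage steps above can be sketched as follows. The specific normalization (centering each frame's keypoints on the wrist) and the array layout are my assumptions; the write-up only states that keypoints were normalized and saved to .npy files.

```python
import os
import tempfile
import numpy as np

def normalize_frames(frames):
    """Center each frame's keypoints on the wrist (landmark 0), making the
    features translation-invariant across the camera view."""
    frames = np.asarray(frames, dtype=np.float32)  # (num_frames, 21, 3)
    wrist = frames[:, :1, :]   # wrist landmark of each frame, kept broadcastable
    return frames - wrist      # subtract the wrist position from every landmark

# Dummy clip: 30 frames of 21 (x, y, z) keypoints, as MediaPipe would report.
clip = np.random.rand(30, 21, 3).astype(np.float32)
normalized = normalize_frames(clip)

# Save to .npy: far smaller than storing the raw video frames themselves.
path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, normalized)
print(np.load(path).shape)  # (30, 21, 3)
```

Storing only keypoints instead of pixels is what makes training feasible on modest hardware: the arrays are tiny, and the hand's pose is already isolated from the background.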
Challenges we ran into
My first attempt used the entire WLASL dataset, which consists of over 2,000 unique phrases. This came with extreme computational restrictions: a model for that task would need billions of parameters. It also caused undersampling, since each class had only a few samples, meaning the model got almost zero information per class. For the model to learn anything meaningful, it might require hours of training on a powerful GPU, plus dozens of GB of data. My second attempt was classifying individual letters (the SignAlpha dataset), and it worked extremely well. After contemplating, though, I realized this wouldn't work well in production, as people are never going to spell out everything they say. That's when I finally switched over to 8 concrete phrases.
Accomplishments that we're proud of
I'm proud of being able to efficiently change my approach to the problem, and of recognizing flaws and the reasons why things don't work. I'm happy that I was able to refactor my code with minimal difficulty, allowing for easy changes to my training and preprocessing.
What we learned
I learned about the scalability of models and data, and the importance of having clean, balanced data. I also learned about the tradeoff between speed and performance: a lower-complexity model is easier and faster to train and run, but may perform worse overall. On the flipside, a higher-complexity model is harder and slower to train and run, but will generally perform better (though not in all cases).
What's next for SignVerse
The next step for SignVerse would be to add more phrases, or possibly even to mix letters and phrases, combining my previous ideas.