The Idea
Self-timers on cameras are a huge pain. Balancing iPhones on your bookshelf, running over to your group, waiting awkwardly for 30 seconds because you have a poor conception of time... Rough.
We think it'd be cool to have a camera mount that can pivot and tilt to match your location. On top of that, voice commands will tell it when to take pictures and control fine adjustments. Then, the picture will be sent to your phone. Meet Nibl.
We'll sense with a camera and a microphone. Voice and image processing will be done on board the BeagleBone Blue, while servo actuation will be done from the mbed.
Baseline Goals:
- Activate on command “hit me”
- Identify target with a visual cue
- Save photo to local memory
Reach Goals:
- Display photos on phone screen
- Multiple voice commands for controlling movement
- Live tracking of the visual
Voice Recognition with MFCC
As for voice recognition, we decided to prototype what we'll finally implement on the MCU in Python on a laptop. Now, we probably don't want to do the voice recognition the ESE 224 way. That is:
- Take an FFT of an input signal
- Dot it with a whole lot of labeled sample FFTs
- Find the highest dot product, and assign the input signal that label.
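For reference, that ESE 224-style classifier can be sketched in a few lines of numpy. This is our illustration of the approach, not code from the course; the function name and the normalized-dot-product scoring are our assumptions about how it would typically be done.

```python
import numpy as np

def classify_naive(signal, labeled_ffts, labels):
    """Label a signal by dotting its FFT magnitude against stored labeled FFTs."""
    x = np.abs(np.fft.rfft(signal))
    x = x / np.linalg.norm(x)
    # Normalize each reference so the dot product measures shape, not loudness
    scores = [np.dot(x, ref / np.linalg.norm(ref)) for ref in labeled_ffts]
    # Assign the label of the best-matching reference
    return labels[int(np.argmax(scores))]
```

Note that every labeled reference FFT has to live in memory, which is exactly the storage problem called out below.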
The weaknesses of this method are:
- Very specific to one voice
- Requires a huge amount of data and comparing
- Not good at identifying a signal that's just noise.
We scoured the internet for ways to do light-weight voice recognition. What we settled on is called MFCC: Mel-Frequency Cepstral Coefficients. MFCC is a way to break down a signal into coefficients that represent how much of the energy of the signal resides in different frequency ranges (designed to mimic the human cochlea!). The best resource we found was from Practical Cryptography. The procedure is comically involved. Here are the Sparknotes:
- Pre-emphasis filter the input signal
- Multiply it by a cosine (Hamming) window
- Take the FFT of the windowed signal
- Calculate the power periodogram from the FFT
- Make a series of 26 triangular filters based on Mel frequencies (which correspond to the sensitivity of the human ear), then convert back to the normal frequency domain
- See how much energy lies in each of those triangular filters
- Take the log of all those energies
- Then take the DCT of all those log energies
- Just throw out the last 14
- Finally, we have just 12 numbers to represent our signal
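The steps above can be sketched with plain numpy for a single frame. This is a simplified illustration of our pipeline (no pre-emphasis step, a naive loop-based DCT); the function name and the 512-point FFT size are our assumptions, and the 8000 Hz rate, 26 filters, and 12 kept coefficients come from the writeup.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate=8000, n_filters=26, n_coeffs=12, nfft=512):
    # Window the frame (the "multiply it by a cosine" step)
    windowed = frame * np.hamming(len(frame))
    # FFT, then power periodogram
    spectrum = np.abs(np.fft.rfft(windowed, nfft))
    periodogram = (spectrum ** 2) / nfft
    # 26 triangular filters spaced evenly on the Mel scale, mapped to FFT bins
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Energy in each filter, then log (small epsilon avoids log(0))
    log_energies = np.log(fbank @ periodogram + 1e-10)
    # DCT of the log energies; keep only the first 12 coefficients
    n = np.arange(n_filters)
    dct = np.array([np.sum(log_energies * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                    for k in range(n_filters)])
    return dct[:n_coeffs]
```

Each frame of audio comes out as just 12 numbers, which is the whole point.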
Why go through all this trouble? I want to reemphasize the last point. If we sample at 8000 Hz and take a 1-second window, then our input signal is represented by 8000 numbers. MFCC reduces the dimensionality of this classification problem from 8000 to 12. Wow. Check out the different characteristic peaks in MFCC plots of two commands in the gallery.
Classification
Now that we have these 12 numbers, how do we identify what kind of command was recorded? We recorded a test set of 4 different sounds:
- "Hit me"
- "One"
- "Two"
- Just background noise
We then used a classification algorithm called quadratic discriminant analysis (QDA). If you're interested in QDA, see this resource from scikit-learn. To perform QDA, all we'd need to store on the microcontroller would be a mean vector and a 12x12 covariance matrix for each output class. Way better than storing dozens of FFT vectors of length 8000!
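To make the storage claim concrete, here's a minimal QDA classifier that works straight from stored means and covariances. The function names are ours, and the log-density form is the standard Gaussian discriminant; it's a sketch of the idea rather than our actual microcontroller code.

```python
import numpy as np

def qda_score(x, mean, cov, prior):
    """Log Gaussian density of x under one class, plus the log class prior."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff) + np.log(prior)

def qda_classify(x, means, covs, priors, labels):
    # Pick the class whose stored Gaussian explains x best
    scores = [qda_score(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return labels[int(np.argmax(scores))]
```

With 12-dimensional MFCC vectors, each class needs only 12 + 144 stored numbers.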
Results
What happened next was pretty remarkable. At around six in the morning, we finally ran our test. We didn't believe the results at first. We recorded a test set of 80 samples, 20 of each class (garbo refers to background noise). We used 3/4 of that for training and 1/4 for validation. Here's the printout.
On training data, the results are:
Correct hit mes, Missed hit mes: 17, 0
Correct ones, Missed ones: 16, 0
Correct twos, Missed twos: 13, 0
Correct garbo, Missed garbo: 14, 0
On test data, the results are:
Correct hit mes, Missed hit mes: 3, 0
Correct ones, Missed ones: 4, 0
Correct twos, Missed twos: 7, 0
Correct garbo, Missed garbo: 6, 0
QDA correctly identified every sample in our test set. Granted, these tests were in quiet conditions, done with one voice, and recorded close to the mic. Nevertheless, when we record an additional sample and have it classify on the spot, it guesses right for either of our voices! There's still work to do. But we're very pleased with this start. We think with a bigger test set recorded in more realistic conditions, QDA and MFCC will be a winning combination.
Image Processing
Jason and I had to lift our shirts up during the demo in order for the heat-seeking to even work at close range. Despite the sexy-factor, the teaching staff didn't love that. So, we are switching to processing camera images for a visual cue. We bought two bright, uniquely colored gift bags at CVS. We have nicknamed them Cosmo and Wanda. From some pictures of Cosmo and Wanda, we've figured out what ratios of RGB values identify the colors well. Using this information, we will process images to see how much Cosmo and Wanda they contain, and find their "center of mass."
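The center-of-mass idea can be sketched like this: threshold each pixel on RGB ratios, then average the coordinates of the matching pixels. The function name, the `ratio_test` interface, and the orange thresholds are all our illustration, not the actual ratios we measured for Cosmo and Wanda.

```python
import numpy as np

def find_color_center(image, ratio_test):
    """image: HxWx3 RGB array. Returns (x, y) center of matching pixels, or None."""
    rgb = image.astype(float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Boolean mask of pixels whose channel ratios match the target color
    mask = ratio_test(r, g, b)
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    # "Center of mass" = mean coordinate of all matching pixels
    return xs.mean(), ys.mean()

# Hypothetical ratio test for a bright orange bag (thresholds are made up)
is_orange = lambda r, g, b: (r > 1.5 * g) & (g > 1.5 * b) & (r > 100)
```

Using channel ratios rather than absolute values makes the test somewhat robust to overall brightness changes.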
The algorithm works pretty nicely! Check out Nibl locating Cosmo and Wanda in the gallery.