Having your iPhone respond to “Hey Siri” seems like such a simple thing, but it’s actually quite complicated. Recognizing this code phrase, and the person who said it, is critical for Apple’s speech-recognition system.
A post published today in Apple’s Machine Learning Journal describes many of the challenges developers overcame to make this work.
One of the complications is that recognizing “Hey Siri” has to happen on the iPhone or iPad. Most of Siri’s speech recognition is done by uploading the user’s words to a remote server, but that only begins after the “Hey Siri” phrase has been recognized by the phone. Apple’s commitment to privacy prevents the iPhone from sending everything it hears to a server.
Every iPhone and most Apple tablets since the iPhone 6s have had a low-power, always-on processor that continuously listens for the key phrase “Hey Siri”. That’s all this chip does. This voice-recognition processor runs a neural network, a computing structure loosely modeled on the layout of a living brain.
The Machine Learning Journal article covers only “Hey Siri” because the rest of Siri’s speech recognition is done on servers. That’s an entirely different process, and one with its own raft of problems. Still, Apple is on a hiring spree to fix them.
Why “Hey Siri”?
Apple picked its key phrase because it’s short and easy to say. The Siri voice-recognition system debuted on the iPhone 4S several years earlier, but required pressing the Home button to activate. According to Apple, many people started their requests with “Hey Siri” even before this phrase had a role.
The downside is that this key phrase resembles many other phrases, such as “are you serious?”. The iPhone’s dedicated processor also has to deal with all the other people chattering nearby, some of whom might be talking to their own iPhones.
According to today’s Machine Learning Journal article, the chip first picks the phrase “Hey Siri” out of what it hears, then it checks whether the phrase was said by the person it was trained to listen for.
The processor turns the audio into a 13-dimensional vector to recognize that someone said “Hey Siri”. It then converts the audio into a 442-dimensional vector to see whether the correct speaker uttered the key phrase.
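The two-stage idea can be sketched in a few lines of code. Only the vector sizes (13 and 442 dimensions) come from Apple’s article; everything else here — the cosine-similarity scoring, the thresholds, and the function names — is a hypothetical illustration, not Apple’s actual method.

```python
import numpy as np

PHRASE_DIM = 13      # per-frame acoustic vector size (from the article)
SPEAKER_DIM = 442    # speaker vector size (from the article)

# Hypothetical score cutoffs, chosen only for illustration.
PHRASE_THRESHOLD = 0.8
SPEAKER_THRESHOLD = 0.7

def cosine_similarity(a, b):
    """Similarity between two vectors, in the range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_hey_siri(frame_features, phrase_model):
    """Stage 1: score each 13-dimensional audio frame against a phrase
    model and decide whether 'Hey Siri' was spoken at all."""
    score = np.mean([cosine_similarity(f, phrase_model) for f in frame_features])
    return bool(score >= PHRASE_THRESHOLD)

def verify_speaker(utterance_vector, owner_profile):
    """Stage 2: compare the 442-dimensional utterance vector with the
    owner's stored voice profile."""
    return bool(cosine_similarity(utterance_vector, owner_profile) >= SPEAKER_THRESHOLD)
```

The key design point, which the article does describe, is that the cheap first stage runs constantly on the low-power chip, and the more expensive speaker check only runs once the phrase itself has been spotted.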
Apple posted the details of how it picks the all-important phrase out of the air in a Machine Learning Journal article in October. The newest post discusses how the neural chip learns to recognize its owner.
Training “Hey Siri”
Everyone remembers having to train their iPhone to recognize their voice by saying “Hey Siri” several times. This is called explicit enrollment.
What very few people realize is that the system continues to learn what their voice sounds like after the training session. This is because the session is almost always done under ideal conditions, while the iPhone has to learn to recognize “Hey Siri” amid all kinds of ambient noise. For some time after training officially ends, every use of “Hey Siri” helps the system learn more.
So try to avoid letting other people say “Hey Siri” near your iPhone while it’s still learning your voice.
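One simple way to picture this ongoing (“implicit”) enrollment is a profile that is the running average of every accepted utterance vector. Apple doesn’t publish its actual update rule, so this running-mean sketch is purely an assumption for illustration:

```python
import numpy as np

SPEAKER_DIM = 442  # speaker vector size from Apple's article

class SpeakerProfile:
    """Hypothetical sketch: the stored voice profile is the running mean
    of every utterance vector the system has accepted so far."""

    def __init__(self, enrollment_vectors):
        # Explicit enrollment: average the initial "Hey Siri" recordings
        # made during the setup session.
        self.profile = np.mean(enrollment_vectors, axis=0)
        self.count = len(enrollment_vectors)

    def update(self, utterance_vector):
        # Implicit enrollment: fold an accepted real-world utterance into
        # the running mean, so noisy everyday conditions are represented.
        self.count += 1
        self.profile += (utterance_vector - self.profile) / self.count
```

This also makes the warning above concrete: if someone else’s “Hey Siri” gets accepted during the learning period, their voice gets averaged into the owner’s profile.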
Apple set itself a difficult task when it decided to do voice recognition directly on a smartphone. But the alternative was to send recordings of everything said near the iPhone to a remote server to recognize the key phrase. Apple wasn’t going to turn its devices into spies.
Of course, that didn’t bother Amazon. That’s exactly how its Echo devices do all their speech recognition.