scarily… They don’t need to be this creepy, but even I’m a tad baffled by this.
Yesterday a few friends and I were at a pub quiz. Phones weren’t allowed, of course, so none were used.
It came down to a tie-break question between my team and another: “What is the run time of The Lord of the Rings: The Fellowship of the Ring, according to IMDb?”
We answered and went about our day. Today my friend from my team messaged me - top post on his “today feed” is an article published 23 hours ago…
Forgive the pointless red circle… I didn’t take the screenshot.
My friend isn’t a privacy-conscious person by any means, but he didn’t open IMDb or google anything to do with the franchise and hasn’t for many months prior. I’m aware it’s most likely an incredible coincidence, but when stuff like this happens I can easily understand why many people are convinced everyone’s doom brick is listening to them…


It’s a technique called Keyword Spotting (KWS). https://en.wikipedia.org/wiki/Keyword_spotting
This uses a tiny speech recognition model that’s trained on very specific words or phrases which are (usually) distinct from general conversation. Because the model is so small, it’s already very efficient even before optimization steps like quantization, and it needs very little computation to process the audio stream and detect whether the keyword has been spoken. Here’s a 2021 paper where a team of researchers optimized a KWS system to use just 251 µJ (about 0.00007 milliwatt-hours) per inference: https://arxiv.org/pdf/2111.04988
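For anyone who wants to check that unit conversion, here’s a quick Python sketch. The 251 µJ figure is the one quoted above; the rate of 4 inferences per second is purely my assumption for illustration, and the result only covers the model’s inference energy on the paper’s hardware, not the microphone or the rest of a phone.

```python
# 1 milliwatt-hour = 0.001 W * 3600 s = 3.6 joules
JOULES_PER_MILLIWATT_HOUR = 3.6
SECONDS_PER_DAY = 86_400


def microjoules_to_milliwatt_hours(microjoules: float) -> float:
    """Convert an energy in microjoules to milliwatt-hours."""
    joules = microjoules * 1e-6
    return joules / JOULES_PER_MILLIWATT_HOUR


def daily_energy_mwh(microjoules_per_inference: float,
                     inferences_per_second: float) -> float:
    """Energy in mWh for running inference non-stop for one day."""
    per_inference = microjoules_to_milliwatt_hours(microjoules_per_inference)
    return per_inference * inferences_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # Figure quoted in the comment: 251 uJ per inference.
    print(f"{microjoules_to_milliwatt_hours(251):.5f} mWh per inference")
    # ASSUMPTION: 4 inferences per second, chosen only as an example rate.
    print(f"{daily_energy_mwh(251, 4):.1f} mWh per day")
```

Even at that assumed rate the daily total comes out to a couple dozen milliwatt-hours, which is why always-on keyword spotting is cheap enough to leave running.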
The small size of the KWS model, required for the low power consumption, means it alone can’t be used to listen in on conversations: it outright doesn’t understand anything other than what it’s been trained to identify. This is also why you usually can’t set the keyword to just anything, and can only pick from a limited set of words or phrases.
This all means that if you’re ever given an option for completely custom wake phrases, you can be reasonably sure that device is running full speech recognition on everything it hears. This is where a smart TV or Amazon Alexa, which are plugged in, have a lot more freedom to listen as much as they want with as complex a model as they want. High-quality speech-to-text apps like FUTO Voice Input run locally on just about any modern smartphone, so something like a Roku TV can definitely do it.
I appreciate the links, but these are all about how to efficiently process an audio sample for a signal of choice.
My question is, how often is audio sampled from the vicinity to allow such processing to happen?
Given the near-immediate response of “Hey Google”, I would guess once or twice a second.
Your stumbling block seemed to be that you didn’t understand how it was possible, so I was trying to explain that, but I may have done a poor job of emphasizing why the technique I described matters. When you said this in a previous comment:
That was a misunderstanding of how the technology works. With a keyword spotter (KWS), which all smartphone assistants use to detect their activation phrases, they aren’t catching any “other fish” in the first place, so there’s nothing to put into “specific baskets”.
To borrow your analogy of catching fish, a full speech recognition model is like casting a large net and dragging it behind a ship, catching absolutely everything and identifying all the fish/words so you can do things with them. Relative to a KWS, it’s very energy intensive and catches everything. One is not likely to spend that amount of energy just to throw back most of the fish. Smart TVs, cars, Alexa, they can all potentially use this method continuously because the energy usage from constantly listening with a full model is not an issue. For those devices, your concern that they might put everything other than the keyword into different baskets is perfectly valid.
A smartphone, to save battery, will be using a KWS, which is like baiting a trap with pheromones only released by a specific species of fish. When those fish happen to swim nearby, they smell the pheromones and go into the trap. You check the trap periodically, and when you find the fish in there, you pull them out with a very small net. You’ve expended far less effort to catch only the fish you care about without catching anything else.
To use yet another analogy, a KWS is like a tourist in a foreign country who doesn’t know the local language and has gotten separated from their guide. They try to ask locals for help but can’t understand anything, until a local says the name of the tour group, which the tourist recognizes, and they’re able to follow that person back to their group. That’s exactly what a KWS system experiences: it hears complete nonsense and gibberish until the key phrase pops out of the noise, which it understands clearly.
This is what we mean when we say that yes, your phone is listening constantly for the keyword, but the part that’s listening cannot transcribe your conversations until you or someone says the keyword that wakes up the full assistant.
Yes, KWS systems generally keep a rolling buffer of audio a few seconds long, and scan it a few times a second to see if it contains the key phrase.
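Roughly, that loop looks something like the Python sketch below. It’s only an illustration: the buffer length, the sample rate, and the `detector` callback are all assumptions of mine, and a real KWS would run a small neural network where the toy detector is.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence

# ASSUMED values, for illustration only.
SAMPLE_RATE_HZ = 16_000   # samples per second
BUFFER_SECONDS = 2        # how much recent audio is kept


def run_keyword_spotter(
    chunks: Iterable[Sequence[float]],
    detector: Callable[[Sequence[float]], bool],
    sample_rate: int = SAMPLE_RATE_HZ,
    buffer_seconds: int = BUFFER_SECONDS,
) -> List[int]:
    """Feed audio chunks into a rolling buffer and scan after each one.

    Each chunk stands for the audio captured between two scans, so a few
    chunks per second means a few scans per second. Returns the indices
    of the chunks after which the detector fired.
    """
    # A deque with maxlen drops the oldest samples automatically, so the
    # buffer only ever holds the last `buffer_seconds` of audio.
    buffer: deque = deque(maxlen=sample_rate * buffer_seconds)
    hits: List[int] = []
    for index, chunk in enumerate(chunks):
        buffer.extend(chunk)
        # The detector only answers "keyword present or not"; it produces
        # no transcript of whatever else is in the buffer.
        if detector(buffer):
            hits.append(index)
    return hits


if __name__ == "__main__":
    # Toy stand-in for a real model: the "keyword" is the sample value 9.
    toy_detector = lambda audio: 9 in audio
    audio_chunks = [[0, 0], [0, 9], [0, 0], [0, 0], [0, 0], [0, 0]]
    print(run_keyword_spotter(audio_chunks, toy_detector,
                              sample_rate=4, buffer_seconds=2))
```

In the toy run the detector keeps firing while the keyword is still inside the buffer and stops once it has rolled out, so a real system would also need some debouncing before waking the full assistant.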
Boom.
Now given that the KWS cost isn’t high, imagine what other keywords it’s unofficially scanning for.