Hey Heather, it's me again.
Ahhh, nothing like being around virtual assistant-enabled devices to rile up your concerns about who’s listening. Personally, I don’t own any smart speakers, so I never really thought about how they worked until I recently found myself surrounded by a few. My conjecture was that the listening stays local, because it would be insane for the devices to stream everything 24/7. They probably only send audio to the cloud after hearing the wake word. Probably.
I have little faith in my conjectures regarding the implementation of even the most trivial stuff. I’ve been burnt before. “What? No, no, that small image there? Of course it’s not just resized with CSS. Who in their right mind would upload a huge file and use up all their bandwidth? Hahaha ha ha… huh? ohno”.
Anyway, I hopped on IRC and asked:
“does anyone have a link to a good/basic/introductory explanation as to how voice assistance (eg alexa) works?”
At the time, I didn’t realize that this question was much broader than the specific thing I actually wanted to know. This is how I found myself reading the Furby source code at 1am, trying to understand the assembly and admiring the code comments. In my defense, it was both (very) interesting and (somewhat) relevant.
I eventually snapped out of it and realized that what I wanted to know was: when would the device send the information to the cloud?
I found it unsettling to be aware that the device was always listening. Like all the time. All I wanted was the gist of how it worked, and details on when data was sent. Thankfully, Computerphile had a video on how a system like Alexa (probably) works. Huzzah!
The gist of it
Here are the broad strokes. Voice-controlled assistant devices capture audio as a waveform, then run it through Automatic Speech Recognition (ASR). The device then checks the waveform, pattern matching it against the programmed wake words. This is all done locally. If a wake word is detected, that pattern and what follows are sent to the cloud, where the wake word is checked again and the rest goes through Natural Language Processing (NLP). After that parsing step, there are likely a bunch of other processes that generate an action and a response. The response is turned into speech with Text-to-Speech (TTS) and sent back to the device.
It’s no more complicated than that.
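To make the "only after the wake word" part concrete, here is a minimal sketch of that local gate. It is a toy under loud assumptions: frames are plain strings instead of audio, `local_wake_detector` and `frames_sent_to_cloud` are names I made up, and the fixed `utterance_len` stands in for whatever a real device does to decide the request is over.

```python
from typing import Iterable, List

WAKE_WORD = "alexa"  # assumption: a real device matches audio patterns, not text


def local_wake_detector(frame: str) -> bool:
    """Hypothetical stand-in for the on-device pattern matcher."""
    return frame.lower() == WAKE_WORD


def frames_sent_to_cloud(frames: Iterable[str], utterance_len: int = 3) -> List[str]:
    """Return only the frames that would leave the device.

    Everything heard before a wake word is checked locally and dropped.
    """
    sent: List[str] = []
    remaining = 0  # frames still to stream after the most recent wake word
    for frame in frames:
        if remaining > 0:
            sent.append(frame)  # part of the request that follows the wake word
            remaining -= 1
        elif local_wake_detector(frame):
            sent.append(frame)  # the wake word itself is sent for the second check
            remaining = utterance_len
        # otherwise: the frame never leaves the device
    return sent


heard = ["private", "chat", "alexa", "what", "time", "is", "it", "more"]
print(frames_sent_to_cloud(heard, utterance_len=4))
```

In this toy version the chatter before "alexa" and the trailing "more" stay on the device, which is the behaviour I was hoping for.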
Alright, so we’ve got the gist, but what is happening between the wake word detection and the sending to the cloud? A friend had the brilliant idea to suggest I check if Alexa had an API for developers, and it does!
The interesting bit for me is that there’s a double check on the wake word. As mentioned, the engine on the device does the initial detection of the wake word. The audio is then sent to the cloud, which checks it again to rule out a false wake. If it is a false wake, the service will send a directive to stop the audio stream. I found this information on the Enable Wake Word Verification page so you can check it out there.
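Here is a rough sketch of that double check as I understand it. It is not the actual Alexa API: the directive name `STOP`, the helper names, and the idea that the verdict comes back right after the first frame are all assumptions made up for illustration.

```python
from typing import List

STOP = "stop_audio_stream"  # hypothetical directive name, not the real one
CONTINUE = "continue"


def cloud_verify(first_frame: str, wake_word: str = "alexa") -> str:
    """Hypothetical cloud-side second check on the wake word."""
    return CONTINUE if first_frame.lower() == wake_word else STOP


def stream_until_stopped(frames: List[str]) -> List[str]:
    """Return the frames uploaded before the cloud (maybe) says stop.

    Assumes the local detector has already fired on frames[0], possibly wrongly.
    """
    uploaded: List[str] = []
    for index, frame in enumerate(frames):
        uploaded.append(frame)
        # Assumption: the cloud's verdict arrives right after the first frame.
        if index == 0 and cloud_verify(frame) == STOP:
            break  # false wake: the device stops streaming
    return uploaded


print(stream_until_stopped(["alexia", "secret", "stuff"]))  # false wake
print(stream_until_stopped(["alexa", "play", "jazz"]))      # real wake
```

In the false wake case only the sound-alike word gets uploaded before the stream is cut, while the real request goes through in full.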
So, am I completely satisfied with this answer? No, of course not. I mean, it does reduce some tension knowing it’s not always paying attention. But what I’d like to do is spy on the devices themselves and see what’s actually being sent. Maybe there’d be a way to check the traffic on the router? I don’t know. Like I said, I don’t have a device to check when it’s talking to the ~cloud~.