
By now, most of us use the voice recognition on our devices regularly. We shout at Alexa to play a different song; we ask Siri on our phones to give us driving directions. The technology has evolved exponentially since Siri and Alexa arrived more than a decade ago—and that was before generative AI.
Thanks to generative AI, you can actually create a Siri-like app yourself… or integrate Siri-like features into your existing apps. Let’s look at what it takes to build a standalone voice recognition app. We’ll keep it simple: this app should:
- Listen to what you say and convert the speech to text
- Interpret what you want
- And either:
  - Come up with a command, as with a device that performs some task in response to spoken language (such as a request to play a song), or
  - Use AI to compose a human-sounding response and speak it back in a natural-sounding voice
Coding Libraries You’ll Need
We recommend building this app using Python, as it has the largest AI infrastructure of any language. Let’s break down each step and talk about what libraries might help.
Speech Recognition. We’ll treat this as a strict voice-to-text conversion; we’ll then feed the resulting text into the AI, rather than trying to convert the speech and interpret its meaning in a single step. A great Python library for the job is SpeechRecognition. Behind the scenes, it uses Google’s speech API and doesn’t require an API key. We recommend starting with it, and then exploring other options such as Vosk.
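As a minimal sketch, capturing one utterance with the SpeechRecognition library might look like the following. (The package is installed as `SpeechRecognition` but imported as `speech_recognition`; microphone access also requires PyAudio, so the import is deferred into the function here to keep the file loadable without those packages.)

```python
def transcribe_from_mic() -> str:
    """Capture one utterance from the default microphone and return it as text."""
    # Deferred import: pip install SpeechRecognition pyaudio
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Briefly sample ambient noise so quiet speech isn't drowned out
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    # recognize_google() calls Google's free web speech API -- no key needed
    return recognizer.recognize_google(audio)


if __name__ == "__main__":
    print("Say something...")
    print("You said:", transcribe_from_mic())
```

From here, the returned string is exactly what we’ll hand to the AI in the next step.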
A quick note on such libraries and tools: Today’s speech recognition, even when it isn’t processing the meaning of the input, still uses advanced AI to recognize many different languages and, within those languages, many accents. This wasn’t always the case; early tools on Android and iPhone were pretty bad.
Interpreting the text: After receiving the text version of whatever the user spoke, the next step is to determine the intent of what was just asked. For this step, there are many options, such as simply sending the text to ChatGPT through OpenAI’s API. However, you may need to include at least a little bit of context. For example, if you’re ultimately building an IoT device that controls a cat food dispenser through voice commands (yes, such things do exist already), simply sending the text “Time to feed the cat” won’t really have much meaning to ChatGPT.
(To be sure, I went to ChatGPT and simply typed “Time to feed the cat” and its response was, “Don’t keep that feline waiting! Go forth, and fulfill your sacred duty as Keeper of the Kibble.”)
Instead, you need to ask ChatGPT (or whatever AI you prefer) to provide meaning for the query. But even if you provide something longer for ChatGPT, such as “Suppose I have an automatic cat feeder and I'm adding in voice recognition. The human just said this to it (as transcribed by a voice-to-text library): ‘Time to feed the cat’ Can you help me interpret that and decide what I should do next?”, well, you’re still going to get a long-winded response that really isn’t particularly helpful. You need a way to interpret the meaning of the text and come up with a command to follow.
Think about how Alexa does it. If you say, “Alexa, play The Rolling Stones,” you don’t want to hear Alexa come back with a smart answer about the Rolling Stones (or a long-winded explanation for how an app might translate that command). You want a single command returned. That can still be done with ChatGPT by sending it something like the following:
Let's have you simply return a command. No explanation or discussion. Suppose I have an automatic cat feeder and I'm adding in voice recognition. The human just said something, and I want you to interpret it, and then give me a command. My choices for commands are:
EMPTY AND REFILL WATER BOWL
DISPENSE CAT FOOD
DISPOSE OF CAT FOOD
START AUTOMATIC CLEANER
Following is the human's request; please respond with one of the above commands:
"Time to feed the cat"
Notice that I framed it in a way that I can tack on the human request at the very end with a simple string concatenation; then I can send it off to ChatGPT. I sent the above, and it responded with:
DISPENSE CAT FOOD
This is where generative AI absolutely shines. To have some fun, I used the same as above but instead concatenated the following sentence: "I'm busy right now! It's Saturday morning, I'm vacuuming and scrubbing the floors, you know what you need to do too!" Its response:
START AUTOMATIC CLEANER
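The concatenation trick described above is easy to wrap in a small helper. The command list and prompt wording below mirror the cat-feeder example; the `ask_for_command` function sketches the OpenAI call itself, where the model name and client usage are assumptions you should check against the current openai SDK.

```python
COMMANDS = [
    "EMPTY AND REFILL WATER BOWL",
    "DISPENSE CAT FOOD",
    "DISPOSE OF CAT FOOD",
    "START AUTOMATIC CLEANER",
]

PROMPT_TEMPLATE = (
    "Let's have you simply return a command. No explanation or discussion. "
    "Suppose I have an automatic cat feeder and I'm adding in voice "
    "recognition. The human just said something, and I want you to "
    "interpret it, and then give me a command. My choices for commands "
    "are:\n{commands}\n"
    "Following is the human's request; please respond with one of the "
    "above commands:\n\"{request}\""
)


def build_command_prompt(request: str) -> str:
    """Tack the transcribed human request onto the fixed instructions."""
    return PROMPT_TEMPLATE.format(commands="\n".join(COMMANDS), request=request)


def ask_for_command(request: str) -> str:
    """Send the prompt to OpenAI and return the raw command string.

    Model name and client usage are illustrative; requires an
    OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI  # pip install openai

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_command_prompt(request)}],
    )
    return response.choices[0].message.content.strip()
```

Because the request is appended at the very end, swapping in any new utterance is a one-line change.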
Now it’s beyond the scope of our article here to show you how to build the next step, but we can offer a few pointers:
- For the interpretation step, you can easily use OpenAI’s API. However, you have additional options if you prefer to keep things local. You can use libraries such as spaCy, which is a bit more involved but potentially worth it, as it can identify parts of speech and extract what are called entities from the input. Another option is Rasa, which helps with “intent recognition.” And finally, you can use the Hugging Face Transformers library if you want to stick with generative AI but keep it local.
- You might be automating a device using something like an Arduino (which sounds like a fun project). In that case, because of limited horsepower, you’ll need to offload everything to external APIs, such as Google’s Speech to Text and OpenAI’s API. Then you’ll get back the command, and put it in a case or switch statement, followed by the actual hardware control such as turning on a motor, etc.
- Alternatively, you might have a laptop and just want a simple verbal response. We’ll explore that next.
- Or you might have a laptop and an app you’ve written the code for. You launch the app and ask it to perform some command (this might be a music app, for instance). In that case, your “command” might actually be a combination of commands. For example, the command might be a simple word like “PLAY” followed by a band or song name, such as “PLAY ROLLING STONES.” But can you then forward such a command to a music app? We’ll explore that last.
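Whichever path returns the command, the final hop is the case or switch statement mentioned above. A simple sketch, with hypothetical hardware hooks standing in for real motor control:

```python
def dispense_cat_food() -> str:
    # Hypothetical hardware hook -- on an Arduino-backed build, this is
    # where you'd signal the motor controller.
    return "dispensing"


def start_automatic_cleaner() -> str:
    # Another hypothetical hook for the cleaner motor.
    return "cleaning"


# Map the AI's exact command strings to handler functions; a dict keeps
# the mapping easy to extend as you add commands.
HANDLERS = {
    "DISPENSE CAT FOOD": dispense_cat_food,
    "START AUTOMATIC CLEANER": start_automatic_cleaner,
}


def dispatch(command: str) -> str:
    """Run the handler for a command, failing safely on anything off-script."""
    handler = HANDLERS.get(command.strip().upper())
    if handler is None:
        # The model occasionally returns something unexpected; don't crash.
        return "unknown command"
    return handler()
```

Normalizing with `strip().upper()` gives you a little slack if the model returns extra whitespace or different casing.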
Providing a Verbal Response
If you’re building an app that provides a verbal response, you likely wouldn’t ask the generative AI for a simple command; in this case, you really do want a longer response, one that possibly explains something.
In this case, you would still want to provide context. When your user asks, “What year did the Rolling Stones release their first album?” the last thing you want is for your user to be subjected to a long explanation on how you would build an app that would figure out what year the Rolling Stones released their first album. You might provide context such as this:
“I’ve built a question-and-answer system that focuses on musicians from the 20th and 21st centuries. My user asked the following question, which I’ve decoded through speech-to-text. Can you give me an answer in fun, excited language to the following question, and perhaps include a couple of fun snippets about the musician? Please keep it to under 30 sentences. Here is the question: What year did the Rolling Stones release their first album?”
After receiving the textual response, you’ll need to convert it to actual voice for playback. For this you’ll need a text-to-speech library. The easiest option is pyttsx3, which works offline. It includes several different voice options. (Tip: You can even let your users choose their preferred voice.) Other options include:
- Google Text-to-Speech: free, with pretty good voices
- ElevenLabs: ultra-realistic voices, but not free
- Amazon Polly: also realistic voices, and again, not free
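With pyttsx3, speaking the AI’s response takes only a few lines. A minimal sketch (the import is deferred so the file loads without pyttsx3 installed; the available voices vary by operating system):

```python
def speak(text: str, voice_index: int = 0) -> None:
    """Speak `text` aloud using pyttsx3's offline engine.

    `voice_index` selects from the voices installed on the machine,
    which is how you could let users pick their preferred voice.
    """
    import pyttsx3  # pip install pyttsx3

    engine = pyttsx3.init()
    voices = engine.getProperty("voices")
    if 0 <= voice_index < len(voices):
        engine.setProperty("voice", voices[voice_index].id)
    engine.say(text)
    engine.runAndWait()  # blocks until playback finishes
```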
Launching a Music (or Similar) App
To wrap this up, let’s just provide some general thoughts about launching other apps.
First, you’ll need to decide what types of apps your app can launch. It would be overambitious to suggest that it can launch any app on the user’s device. Building a single app that can handle requests as disparate as “Hey, transcribe the following text into a Word document” and “Play me Taylor Swift’s latest single” would be a serious undertaking.
Instead, stick to one app. A music app is ideal here: learn the app’s API and what you need to construct and send to it. That’s the key to going from the “command” step mentioned earlier to actually launching the app. A great starting point might be the Spotify API or even the Apple Music API. Build your app to convert the command to the necessary API call.
You could potentially have the generative AI build the API command for you; for example, instead of returning “PLAY Rolling Stones,” it would return an actual HTTP request. But I would advise against this. The AI might get it right 90 percent of the time, but it will get it wrong in the remaining cases, and you don’t want even a relatively small rate of failure.
Instead, be very specific on what it should return to you in the form of a simple command, which you can easily parse and interpret, and from there construct an API call. (You could, however, ask for the response to be in JSON format, with one member being a command such as PLAY or LOOKUP, and another member being “artist” and another being “song”.)
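That JSON variant might be parsed like this. The field names (`command`, `artist`, `song`) are just the ones suggested above, and the actual music-app API call is left as a comment, since it depends on which service you pick:

```python
import json


def parse_music_command(ai_response: str) -> dict:
    """Parse the AI's JSON reply and validate the field we rely on."""
    data = json.loads(ai_response)
    if data.get("command") not in {"PLAY", "LOOKUP"}:
        raise ValueError(f"unexpected command: {data.get('command')!r}")
    return {
        "command": data["command"],
        "artist": data.get("artist"),
        "song": data.get("song"),
    }


# Example reply the model might send back:
reply = '{"command": "PLAY", "artist": "Rolling Stones", "song": null}'
parsed = parse_music_command(reply)
# From here you'd construct the real API call -- for example, a Spotify
# search for parsed["artist"], followed by a play request.
```

Validating the `command` field up front is the safeguard discussed above: if the model goes off-script, you find out immediately instead of sending a malformed request to the music service.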
Conclusion
Thanks to generative AI combined with easy-to-use APIs, building a voice command app has become an incredibly easy, fun project that could look great in your portfolio.
Want to take it up a notch? Build a web-facing interface. You have plenty of options there as well, including deciding how much to move to the back end versus how much can work in the front end. If you can get a hiring manager obsessed with your app, you might boost your chances of landing a great new job.