Hi Alexa, what's next? Breaking the voice technology ceiling


Amazon's recent announcement that it would cut staff and budget from the Alexa division prompted many to call the voice assistant a "colossal failure." In its wake, there have been discussions that voice as an industry is stagnant (or, even worse, in decline).

I have to say that I disagree.

While it's true that voice technology has hit a ceiling in its use cases, that doesn't equate to stagnation. It just means that the current state of the technology has some limitations that are important to understand if we want it to evolve.

Put simply, today's technologies don't work the way human conversation does. To get there, three capabilities are needed:

1. Superior natural language understanding (NLU): Many good companies have conquered this aspect. The technology can understand what you say and recognize the usual ways people phrase a request. For example, if you say, "I'd like a burger with onions," it knows you want the onions on the burger, not in a separate bag.

2. Voice metadata extraction: Voice technology should be able to determine whether a speaker is happy or frustrated, how far they are from the mic, and who they are. It needs to recognize a voice well enough to know when you, rather than someone else, are speaking.

3. Overcoming crosstalk and unattached noise: The ability to understand speech even when other people are talking over it, and when there are noises (traffic, music, babble) that are not independently accessible to noise cancellation algorithms.

Some companies get the first two right. But these solutions are typically designed for a benign sound environment: a single speaker, with most background noise cancelled. In a typical public place with multiple noise sources, that assumption breaks down.

Reaching the "holy grail" of voice technology

It's also important to take a moment to explain what I mean by noise that can and cannot be canceled. Noise to which you have independent access (connected noise) can be cancelled. For example, cars equipped with voice control have independent electronic access (via a streaming service) to content played over the car's speakers.

This access ensures that the acoustic version of this content captured by the microphones can be cancelled using well-established algorithms. However, the system has no independent electronic access to what passengers in the car are saying. That's what I call unattached noise, and it can't be cancelled this way.
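To make the connected-noise case concrete, here is a minimal sketch of the kind of well-established algorithm referred to above: a normalized LMS (least mean squares) adaptive filter that subtracts an estimate of the played-back content from the microphone signal, using the electronically available reference. The function name, tap count and step size are illustrative, not from the article.

```python
import numpy as np

def lms_echo_cancel(mic, ref, taps=32, mu=0.5):
    """Remove 'connected' noise from the mic signal using the reference
    signal we have independent electronic access to (normalized LMS,
    a standard echo-cancellation building block)."""
    w = np.zeros(taps)           # adaptive filter weights
    out = np.zeros(len(mic))     # mic signal with the echo removed
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]         # most recent reference samples
        echo_est = w @ x                  # estimated echo at the mic
        e = mic[n] - echo_est             # residual: speech + other noise
        w += mu * e * x / (x @ x + 1e-8)  # normalized LMS weight update
        out[n] = e
    return out
```

The key point matches the article's distinction: the filter only works because `ref` is available electronically. Speech from other passengers never appears as a reference signal, so no such filter can be adapted to remove it.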

That's why the third ability...
