One of the main goals of AI system developers is a more natural human-computer interaction, one that truly understands natural language beyond simple words and commands. Thanks to deep neural networks, used in systems such as Google Voice Search and WaveNet, computers have become much better at understanding and generating speech.
Most current computerized voices can’t follow the flow of a conversation, which forces callers to adapt to the system when it should be the other way around. To address this, Google AI announced Google Duplex, a technology for conducting natural conversations to carry out specific “real world” tasks over the phone, such as scheduling an appointment.
(Audio sample: Duplex scheduling a hair salon appointment.)
(Audio sample: Duplex calling a restaurant.)
What is Google Duplex about?
The system lets people speak spontaneously, as they would with another person, without adapting their vocabulary or expressions. To make this possible, Duplex is constrained to closed domains, which are narrow enough to explore extensively. The system is deeply trained in each of these specific domains, so it can carry out natural conversations within them, but it cannot hold general conversations.
Because the system sounds so natural, it is very hard to tell that a fully automatic computer system is talking with a real business. For that reason, transparency is key to keeping the conversation comfortable for both parties: the intent of the call has to be made clear so that the business understands the context.
The main challenges
Handling natural conversations is difficult for a system, for several reasons:
- Natural language is hard to understand.
- Natural behavior is hard to model.
- Generating natural-sounding speech with the right intonation is difficult.
- People talk faster and less clearly in natural conversations.
- Phone calls may suffer from loud background noise and poor sound quality.
When we talk spontaneously, we use more complex sentences than we do when we speak to a machine. Correcting ourselves mid-sentence, being more verbose than necessary, and omitting words (relying on context instead) are all common in everyday conversation.
(Audio sample: an example of a complex statement.)
How does Google Duplex work?
The system’s effectiveness relies on advances in understanding, interacting, timing, and speaking, which together make the conversation sound natural. At its core, Google Duplex uses a recurrent neural network (RNN), built with TensorFlow Extended (TFX), designed to cope with the challenges above.
To achieve high precision, Google trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, together with inputs such as:
- Features from the audio.
- The history of the conversation.
- The parameters of the conversation (for example, the desired service for an appointment or the current time of day).
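The way these inputs can feed a recurrent network is sketched below in plain Python. This is a minimal illustration under loud assumptions: each turn is summarized as small numeric vectors for the ASR output, the audio features, and the encoded call parameters, which are concatenated into one input, while the recurrent hidden state stands in for the conversation history. The class `TinyRNN`, the vector sizes, and the random weights are all hypothetical; this is not Duplex’s actual architecture, and nothing here is trained.

```python
import math
import random
from typing import List

Vector = List[float]


def concat(*parts: Vector) -> Vector:
    """Join several feature vectors into a single input vector."""
    out: Vector = []
    for part in parts:
        out.extend(part)
    return out


class TinyRNN:
    """Minimal Elman-style cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    Hypothetical illustration only: weights are random and untrained.
    """

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                    for _ in range(hidden_size)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]
        self.b = [0.0] * hidden_size

    def step(self, x: Vector, h_prev: Vector) -> Vector:
        # h_prev carries the conversation history into the current turn.
        return [
            math.tanh(
                sum(w * xi for w, xi in zip(self.w_x[j], x))
                + sum(w * hk for w, hk in zip(self.w_h[j], h_prev))
                + self.b[j]
            )
            for j in range(self.hidden_size)
        ]


# Hypothetical per-turn inputs; sizes are arbitrary.
asr_output = [0.2, 0.9]         # stand-in for an encoding of the recognized words
audio_features = [0.1, -0.3]    # stand-in for acoustic features such as pitch or energy
call_params = [1.0, 0.0, 0.5]   # stand-in for encoded service type and time of day

rnn = TinyRNN(input_size=7, hidden_size=4)
h = [0.0] * rnn.hidden_size     # no history at the start of the call
for _ in range(3):              # three turns; real inputs would differ per turn
    x = concat(asr_output, audio_features, call_params)
    h = rnn.step(x, h)
print([round(v, 3) for v in h])
```

Because the hidden state `h` is passed back into every step, each new turn is interpreted in light of the earlier ones, which is what makes a recurrent network a natural fit for tracking conversation history.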
The key was training a separate understanding model for each task while leveraging the shared corpus across tasks. Google then used hyperparameter optimization from TFX to improve the model further.
Behind its Natural Sound
To create a voice that doesn’t sound computerized, Google combined a concatenative text-to-speech (TTS) engine with a synthesis TTS engine (using Tacotron and WaveNet) to control the intonation of the speech depending on the context.
Another aspect that makes it sound natural is the use of speech disfluencies such as “hmm”s and “uh”s. Just as people do in real life, the system adds these sounds to signal that it is still processing what it has just heard, which makes the conversation feel more familiar.
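A toy version of this idea: if preparing a reply is expected to take noticeably long, prefix it with a filler so the caller hears that the system is still working. The function name `with_disfluency` and the 300 ms threshold are assumptions for illustration, not Duplex’s published rule, and the sketch operates on text, whereas a voice system would have to do this in the audio it produces.

```python
import random
from typing import Optional

# Filler sounds mentioned in the article.
FILLERS = ("hmm", "uh")


def with_disfluency(reply: str, processing_ms: float,
                    threshold_ms: float = 300.0,
                    rng: Optional[random.Random] = None) -> str:
    """Prefix a filler when the reply takes noticeably long to prepare.

    The 300 ms threshold is an illustrative assumption.
    """
    if processing_ms <= threshold_ms:
        return reply                      # fast enough: answer directly
    rng = rng or random.Random()
    return f"{rng.choice(FILLERS)}, {reply}"


print(with_disfluency("we have a slot at 10 am.", processing_ms=120))
print(with_disfluency("we have a slot at 10 am.", processing_ms=650, rng=random.Random(1)))
```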
Latency also has to match people’s expectations. After saying something simple, such as “hello?”, people expect an instant response and are more sensitive to delays. The system therefore detects when low latency is required and, in those cases, uses faster, lower-confidence models (for example, for speech recognition or endpointing).
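The routing decision described above can be sketched as a small policy function. Everything here is an assumption for illustration: the two `Model` entries, their latency figures, the `INSTANT_REPLY_CUES` set, and the 100 ms budget. A production system would presumably detect low-latency situations with learned signals rather than a fixed list of phrases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    latency_ms: int   # typical time to produce a result (illustrative numbers)


# Hypothetical stand-ins for a fast, low-confidence model and a slower, more accurate one.
FAST = Model("fast_low_confidence", latency_ms=50)
ACCURATE = Model("slow_high_confidence", latency_ms=400)

# Assumed examples of utterances after which people expect an instant reply.
INSTANT_REPLY_CUES = {"hello?", "hi?", "are you there?"}


def needs_low_latency(utterance: str) -> bool:
    """Crude heuristic: short greetings and check-ins call for an instant response."""
    return utterance.strip().lower() in INSTANT_REPLY_CUES


def choose_model(utterance: str, budget_ms: int = 100) -> Model:
    """Use the fast model only when low latency is needed and the accurate one is too slow."""
    if needs_low_latency(utterance) and ACCURATE.latency_ms > budget_ms:
        return FAST
    return ACCURATE


print(choose_model("Hello?").name)                                 # fast_low_confidence
print(choose_model("Could we move it to Thursday instead?").name)  # slow_high_confidence
```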
In some cases, the system responds more hesitantly, just as a person would if they didn’t fully understand their counterpart. It also varies the latency according to the complexity of a sentence; for example, pausing a little before replying to a complex sentence can make the conversation feel more natural.
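Both behaviors can be pictured as one function that plans the pause before replying and softens the wording when understanding is uncertain. Word count is used here as a crude, assumed proxy for sentence complexity, and `plan_reply`, its constants, and the “Um, I think” phrasing are all hypothetical.

```python
from typing import Tuple


def plan_reply(reply: str, utterance: str, confidence: float,
               base_delay_ms: int = 100, per_word_ms: int = 15,
               max_delay_ms: int = 800,
               hesitant_below: float = 0.6) -> Tuple[int, str]:
    """Return (pause before speaking in ms, text to speak).

    Assumptions for illustration: word count approximates complexity, and
    low confidence is expressed by a hedging prefix.
    """
    words = len(utterance.split())
    # Longer, more complex utterances get a slightly longer pause, capped at max_delay_ms.
    delay = min(base_delay_ms + per_word_ms * words, max_delay_ms)
    if confidence < hesitant_below and reply:
        # Sound less certain, as a person would when unsure they understood.
        reply = f"Um, I think {reply[0].lower()}{reply[1:]}"
    return delay, reply


print(plan_reply("Seven works.", "Hello?", confidence=0.95))
# (115, 'Seven works.')
print(plan_reply("Eight works.", "Could we do eight instead of seven if that is easier?",
                 confidence=0.4))
# (265, 'Um, I think eight works.')
```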
Its Powerful System Operation
Google Duplex can carry out sophisticated conversations and completes the majority of its tasks fully autonomously, without human involvement. It also includes a self-monitoring capability that lets it recognize tasks it cannot complete on its own, such as scheduling an unusually complex appointment. In these cases, the system signals a human operator, who completes the task.
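A minimal sketch of such a self-monitoring check, assuming the system exposes a per-turn confidence score and a flag for requests it knows it cannot handle. The names `CallState`, `should_hand_off`, and `next_action`, and both thresholds, are hypothetical; Duplex’s real escalation criteria are not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallState:
    confidences: List[float] = field(default_factory=list)  # per-turn confidence scores
    unsupported_request: bool = False  # e.g. an unusually complex appointment


def should_hand_off(state: CallState, min_confidence: float = 0.5,
                    max_low_turns: int = 2) -> bool:
    """Escalate if the request is out of scope or confidence stays low for several turns.

    Both thresholds are illustrative assumptions.
    """
    if state.unsupported_request:
        return True
    recent = state.confidences[-max_low_turns:]
    return len(recent) == max_low_turns and all(c < min_confidence for c in recent)


def next_action(state: CallState) -> str:
    return "notify_human_operator" if should_hand_off(state) else "continue_autonomously"


print(next_action(CallState([0.9, 0.8])))                # continue_autonomously
print(next_action(CallState([0.9, 0.4, 0.3])))           # notify_human_operator
print(next_action(CallState(unsupported_request=True)))  # notify_human_operator
```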
To adapt the system to a different context, it has to be trained in new domains. For this, Google uses real-time supervised training: experienced operators monitor the system as it makes calls in the new domain and affect its behavior in real time as needed, until it performs at the desired quality level. At that point, supervision stops and the system makes the phone calls autonomously.
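The supervise-until-good-enough loop can be sketched as a rolling quality check over recent calls. The quality target, the window size, and the idea that an operator marks each call as meeting the bar or not are assumptions made for this illustration; the class `DomainRollout` is hypothetical.

```python
from collections import deque
from typing import Deque


class DomainRollout:
    """Track a new domain under real-time supervision until quality is high enough.

    quality_target and window are illustrative assumptions.
    """

    def __init__(self, quality_target: float = 0.95, window: int = 20) -> None:
        self.quality_target = quality_target
        # Most recent calls: True means the call met the operator's quality bar.
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.supervised = True

    def record_call(self, met_quality_bar: bool) -> None:
        """Log one call and stop supervision once the rolling success rate is high enough."""
        self.outcomes.append(met_quality_bar)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if self.supervised and window_full:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate >= self.quality_target:
                self.supervised = False  # from here on, calls run autonomously


rollout = DomainRollout(quality_target=0.9, window=10)
for ok in [True] * 5 + [False] + [True] * 4:
    rollout.record_call(ok)
print(rollout.supervised)  # False: 9 of the last 10 calls met the bar
```

Requiring a full window before judging keeps a lucky first few calls from ending supervision too early.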
What are the benefits of Google Duplex?
If your business relies on appointment bookings and is supported by Google Duplex, your customers will be able to book through the Google Assistant without you having to train employees or change your day-to-day practices. Duplex could also reduce no-shows by making appointments easy to cancel or reschedule.
Customers benefit too, because they often call companies to ask for information that isn’t available online. Google Duplex can call the business to ask common questions, such as its opening hours, and make the answers available to everyone through Google.
As a result, companies will receive fewer repetitive calls during the day. Best of all, businesses can keep operating the way they always have, with no learning curve or changes needed to get the most out of this tool!
(Audio sample: Duplex handling interruptions.)
Real-life tasks also become easier for users: they simply ask Google Assistant for what they need, and the call happens entirely in the background without any user involvement. Other important advantages include:
- Booking during off-hours or with limited connectivity.
- Helping address accessibility and language barriers.
- Allowing hearing-impaired users, or people who don’t speak the local language, to handle tasks over the phone.
There is no doubt that we are getting closer to interacting with technology as naturally and spontaneously as we interact with each other. Google Duplex is a clear example, making it possible to hold a natural conversation with a system in specific scenarios.
It represents an important improvement that makes talking to machines more comfortable for people.