What Is A Voice User Interface
A Voice User Interface (VUI) enables users to interact with a device or application using spoken
voice commands. VUIs give users complete, hands-free control of technology, often
without even having to look at the device.
A combination of Artificial Intelligence (AI) technologies is used to build VUIs, including
Automatic Speech Recognition, Named Entity Recognition, and Speech Synthesis, among others.
VUIs can be built into devices or embedded inside applications.
The backend infrastructure, including the AI technologies used to create the VUI’s speech
components, is often hosted in a public or private cloud where the user’s speech is processed. In
the cloud, AI components determine the intent of the user and return a response to the
device or application where the user is interacting with the VUI.
Well-known VUIs include Amazon Alexa, Apple Siri, Google Assistant, Samsung Bixby,
Yandex Alisa, and Microsoft Cortana. For the best user experience, VUIs are accompanied by
visuals from a Graphical User Interface and by sound effects. Each VUI today has
its own sound effects, which let users know when the VUI is active,
listening, processing speech, or responding. The benefits of VUIs include hands-free
accessibility, productivity, and a better customer experience, all of which will change how the
world interacts with artificial intelligence.
Audrey followed an input and output procedure similar to the one used today in
modern VUI devices. First, a speaker recited a digit or digits into a telephone, making sure to
pause for 350 milliseconds between each word. Next, Audrey listened to the speaker’s input
and used speech processing to sort the speech sounds and patterns in order to understand the input.
Audrey would then respond visibly by flashing a light, much as modern VUI devices do.
Although Audrey could distinguish the numbers, it could not understand
everyone’s voice or language style and could only respond reliably to a familiar speaker. Unlike
modern VUI devices, Audrey was simply not advanced enough
and needed a familiar speaker to maintain 97 percent digit recognition accuracy. With a few
other designated speakers, Audrey’s accuracy was 70-80 percent, and it was far lower with speakers
it was unfamiliar with. Why was Audrey created in the first place if manual push-button dialing
was cheaper and easier to work with? Recognized speech requires less bandwidth (fewer
frequencies for transmitting a signal) than the original sound waves in a telephone. It would also
be more practical for reducing the data traveling through wires and for future technology.
Tangora
After the creation of Audrey, the most significant voice technology advancement came in
1971, when the U.S. Department of Defense’s research team funded five years of a Speech
Understanding Research program. Its goal was to reach a minimum vocabulary of 1,000
words with the help of companies such as IBM. In the 1980s, IBM built a voice-activated
typewriter called Tangora. Tangora was capable of understanding and handling a 20,000-word
vocabulary. Today, voice-activated typing systems have evolved to be used in smartphones to
send a text or write a research paper in a matter of moments.
Over time, computer technology advanced, and VUI, Graphical User Interface (GUI), and User
Experience (UX) design were packed into a small device that fits in the palm of a hand. Even GUI
and UX are becoming old news due to the quick adoption of voice-only devices that no longer use
these features. Speech recognition technology went from understanding 9 numbers to millions of
phrases and words from any voice. This advancement was made possible by new speech
recognition software processes such as Automatic Speech Recognition, Named Entity
Recognition, and Speech Synthesis.
Technology used to create a VUI
A range of Artificial Intelligence technologies is used to create VUIs, including Automatic
Speech Recognition, Named Entity Recognition, and Speech Synthesis.
Named Entity Recognition (NER) assists Automatic Speech Recognition (ASR) in resolving words
as entities. On the basis of voice input alone, “New York City” is recognized as three separate
words: “new”, “York”, and “city”. NER then identifies this as a single location
and adjusts it to “New York City”. NER is highly contextual and needs additional input to
confidently determine entities. NER also relies on previous training, so it will sometimes not be
able to confidently determine an input’s entity.
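The word-merging step described above can be illustrated with a toy lookup table. This is only a minimal sketch under stated assumptions: production NER relies on trained models and surrounding context rather than a fixed list, and the names `GAZETTEER` and `merge_entities` are hypothetical, introduced here for illustration.

```python
# Toy illustration only: a hypothetical gazetteer (lookup table) of known entities.
# Real NER systems rely on trained models and surrounding context.
GAZETTEER = {
    ("new", "york", "city"): ("New York City", "LOCATION"),
    ("new", "york"): ("New York", "LOCATION"),
}
MAX_SPAN = max(len(key) for key in GAZETTEER)


def merge_entities(tokens):
    """Greedily merge the longest known token span into a single entity."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + span])
            if key in GAZETTEER:
                text, label = GAZETTEER[key]
                result.append((text, label))
                i += span
                break
        else:
            # No known entity starts here; keep the token unlabeled.
            result.append((tokens[i], None))
            i += 1
    return result


print(merge_entities(["flights", "to", "new", "York", "city"]))
```

Trying the longest span first is what lets “new York city” resolve to one location instead of stopping at “New York”.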
Speech Synthesis
Speech Synthesis produces an artificial human voice and speech from input text. The VUI does
the job in three stages: input, processing, and output. Speech Synthesis is simply a text-to-speech
(TTS) output in which a device reads aloud what was input, using a simulated voice
played through a loudspeaker.
These AI technologies analyze, learn, and mimic human speech patterns and can also adjust the
intonation, pitch, and cadence of speech. Intonation is the way a person’s voice rises or falls as
they speak. Factors that affect intonation are emotion, accent, and diction. Pitch is the tone of
voice, but it is not affected by emotion. Pitch is high or low and is best described as a squeaky
or deep voice. Cadence is the flow of the voice, which fluctuates in pitch as someone is speaking
or reading. For example, a public speaker will change their cadence by lowering their voice during a
declarative sentence to make an impact on their audience.
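One common way to request such adjustments from a speech synthesis engine is the W3C Speech Synthesis Markup Language (SSML), whose `prosody` element carries `pitch` and `rate` attributes. The sketch below builds a small SSML fragment with Python’s standard library. The helper name `build_ssml` and the sample values are assumptions made for illustration; a fully conforming SSML document also declares a version and namespace on `speak`, and support for individual attributes varies by engine.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, pitch="medium", rate="medium"):
    """Wrap text in an SSML <prosody> element (hypothetical helper)."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", {"pitch": pitch, "rate": rate})
    prosody.text = text
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(speak, encoding="unicode")


# A lower pitch and slower rate for a declarative sentence.
print(build_ssml("This changes everything.", pitch="low", rate="slow"))
```

Building the markup with an XML library rather than string concatenation means that characters such as `&` or `<` in the spoken text are escaped automatically.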
Once all of this information is stored and analyzed, these technologies use it to improve
themselves and the VUI through what is called machine learning. The cloud-hosted technologies then
determine the intent of the user and return a response through the application or device.
Intents & Entities
Voice commands consist of intents and entities. The intent is the objective of the voice
interaction, and there are two kinds: local intents and global intents. A local intent is
when the user is asked a question to which they respond “Yes” or “No”. A global intent is when
a user gives a more complex answer. When designing VUIs, the different ways a command can be
phrased need to be taken into consideration in order to recognize the intent and respond correctly.
Here is an example of two ways to ask for directions to a location: “Get directions to 1600
Pennsylvania Avenue” and “Take me to 1600 Pennsylvania Avenue”. Entities are variables within
intents. Think of them as the blanks to fill in a Mad Libs booklet, such as “Book a hotel in
{location} on {date}” or “Play {song}.”
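The fill-in-the-blank idea can be sketched with simple pattern matching. The example below is a minimal illustration, not how commercial assistants work (they generally use trained language-understanding models); the intent names, the `INTENT_TEMPLATES` table, and the helper functions are hypothetical. Each template is converted into a regular expression in which every {slot} becomes a named capture group.

```python
import re

# Hypothetical utterance templates; {slot} marks an entity to extract.
INTENT_TEMPLATES = {
    "GetDirections": ["get directions to {location}", "take me to {location}"],
    "BookHotel": ["book a hotel in {location} on {date}"],
    "PlaySong": ["play {song}"],
}


def template_to_regex(template):
    """Turn 'play {song}' into a regex with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", template)
    # Even indexes are literal text, odd indexes are slot names.
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$", re.IGNORECASE)


COMPILED = [
    (intent, template_to_regex(template))
    for intent, templates in INTENT_TEMPLATES.items()
    for template in templates
]


def parse(utterance):
    """Return (intent, entities) for the first matching template, else (None, {})."""
    for intent, regex in COMPILED:
        match = regex.match(utterance.strip())
        if match:
            return intent, match.groupdict()
    return None, {}


print(parse("Take me to 1600 Pennsylvania Avenue"))
print(parse("Book a hotel in Paris on Friday"))
```

Listing several templates under one intent is how the two phrasings of the directions request end up resolving to the same objective.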
VUI vs GUI
User Experience (UX) is the overall experience of an interface product, such as a website or
application, in terms of how aesthetically pleasing it is and how easy it is for users to
navigate. Together, VUI and GUI play a large role in UX design because they make up the product
that consumers interact with.
Voice User Interface
As explained earlier, a Voice User Interface (VUI) enables users to interact with a device or
application using spoken voice commands. VUIs give users complete, hands-free control of
technology, often without even having to look at the device.
Graphical User Interface (GUI)
A Graphical User Interface (GUI) is the graphical layout and design of a device. For example, the
screen display and apps on a smartphone or computer are a graphical user interface. A GUI can be
used to display visuals for a VUI, such as a graphic of sound waves when a voice assistant on a
smartphone responds to its user. A real-life example is how Google and Apple Siri
use VUI and GUI together.
Apple Siri VUI & GUI
Apple Siri is activated by saying “Hey Siri” (VUI) or by pressing down on the home button of the
Apple device. Users will know that Siri is active when Siri says “What can I help you with?”
through the speaker or shows it on the screen using the GUI. While a user speaks to Siri, a
colorful waveform animation moves in time with the speech. This also shows users that Siri is
actively listening and processing their question. When a user is quiet, Siri will prompt “Go
ahead, I’m listening…” If the user still does not respond, it will display “Some
things you can ask me:” on the screen with a few examples of what it can do, such as calling,
FaceTime, emailing, and more.
This GUI feature caters specifically to people who are new to Siri and are unsure what to
do. The Apple device will also display what the user has asked and Siri’s response on the screen
to show what is being understood from the interaction. Another feature of Apple Siri is the
customization of Siri’s gender, accent, and language.
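The way voice and screen cooperate when the user stays silent can be sketched as a small escalation routine. This is an illustrative sketch only and does not reflect Apple’s actual implementation; the function name `run_session`, the channel labels, and the “You said:” echo are assumptions, while the prompt wording is taken from the description above.

```python
def run_session(user_turns):
    """Return (channel, message) events; None in user_turns means silence."""
    events = [("voice", "What can I help you with?")]
    silent_turns = 0
    for utterance in user_turns:
        if utterance:
            # Echo the recognized request on screen so the user can verify it.
            events.append(("screen", f"You said: {utterance}"))
            break
        silent_turns += 1
        if silent_turns == 1:
            # First silence: a spoken reprompt.
            events.append(("voice", "Go ahead, I'm listening..."))
        else:
            # Second silence: fall back to on-screen suggestions.
            events.append(("screen", "Some things you can ask me:"))
            break
    return events


for channel, message in run_session([None, None]):
    print(f"{channel}: {message}")
```

Escalating from a spoken reprompt to visual suggestions gives new users a second, lower-pressure way to discover what the assistant can do.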
VUI vs Voice AI
The term Voice Artificial Intelligence (AI) is very commonly used alongside VUI. The two terms
are often confused as meaning the same thing because they are closely connected. VUI is all about
the voice user experience on a device. Voice AI is the term for speech recognition technologies.
The technologies that fall under the Voice AI umbrella are Automatic Speech Recognition, Named
Entity Recognition, and Speech Synthesis.
Voice command devices, also known as voice assistants, use VUI, and their responses can be
auditory, tactile, or visual. Devices can range from a small speaker to a blue light that blinks
in a car’s stereo when it hears a command. More common examples of voice command devices
are Siri on the iPhone, Alexa, and Google Home. These voice assistants are made to help people
with daily tasks. There are also device genres based on what the VUI is used for. The genre
influences how the interaction between the user and device is set up.
Improving Lives
With individualized experiences, VUI can lead society to a more accessible world and help provide
a better quality of life. VUI benefits users with disabilities, such as people who are visually
impaired or others who cannot adapt to visual UIs or keyboards. VUI is also becoming popular with
seniors who are new to technology. Aging affects abilities such as sensory perception, movement,
and memory, which makes VUI an alternative to hands-on assistance. With the assistance of VUI,
elders can communicate with loved ones and use devices without confusion and frustration.
VUI in Education
Educational strategies are constantly being updated for learners of all ages. VUI can
be a learning tool where classrooms interact with a voice assistant to create a new experience and
cater to all learning styles. Since VUI is very accessible, no training is required to use it,
which makes it easy to use with any audience.
Technology Innovation
As VUI grows, it will change the way that products are designed and create new job demand.
VUI design will become a key skill for designers due to the evolving user experience. User
Experience (UX) designers are trained in providing experiences for physical input and graphical
output. VUI design differs from this kind of UX design because its guidelines and principles are
different. This will encourage designers to focus more on VUI design. In 2019, it was estimated
that 111.8 million people in the US would use a voice assistant at least monthly, up 9.5% from the
previous year. As users rely on voice assistants more than ever, using them will eventually become
a habit, and voice will become the new device feature that everyone owns.
Once the habit has been formed, it will be easier for users to speak to a device than to
physically use it. This will create a high demand for designers knowledgeable in VUI and
contribute to the change in how devices are designed.
Lastly, another benefit of voice command devices is that they are not limited to what they
were originally programmed to do. Over time, the interaction between the user and the voice user
interface improves through machine learning, as discussed earlier. The user learns how to better
utilize the voice command device, and the device in turn learns how to work with its user.