When was the last time you interacted with a voice recognition system? Was it an automated phone menu, for the bank or the electric company? Was it an app, like Google Translate? A piece of smart home equipment, like Alexa or Siri? More to the point: how was that interaction? Did you feel calmer after using it? Did you feel truly understood?

If you reacted to these questions with a snort, or possibly an eyeroll, then Dr. Hynek Boril feels your pain. The researcher is deeply invested in building better voice recognition algorithms.

“The goal,” says Boril, an Assistant Professor of electrical engineering at UW-Platteville, “is to develop algorithms that will be able to reliably identify what language and dialect is being spoken, what is being said, who is speaking, their language proficiency, their emotional or mental state, and in some limited aspects their health condition, and to discern all that when dealing with speech uttered in natural environments with varying acoustics and background noise.”

Description of Research

Essentially, Boril wants speech recognition software that is as good at recognizing tone, inflection, mood and meaning as human speakers are. He’s set himself quite a challenge, especially considering the current limitations of the technology we all know… and don’t exactly love.

“Many readers probably have some experience with using Siri, Alexa, Google Translate or other voice-enabled systems and are likely aware of all the current limitations of these,” says Boril. “While setting a morning alarm clock is usually a breeze, we likely wouldn’t opt for using any of these voice systems for dictating a letter to our employer, or drafting a project proposal, due to the extensive voice recognition errors.”

“Moreover, things get worse when we move from a calm place to acoustically adverse environments like public transport, cafeterias or lecture halls, where our voices have to compete with background noise, voices of others, etc. My research primarily focuses on the design of algorithms that will eliminate or at least notably reduce the impact of these distracting factors on speech technology.”

To grasp the scope of Dr. Boril’s research, it’s important to understand all the information that words can carry, beyond their raw meaning. Boril explains: “Speech carries much more information than just the 'linguistic' content. The color of voice, intonation and emphasis put on different words all refine the actual meaning of what is being said and also provide cues to the mental and even physical state of the speaker. We sound different when we are emotional, nervous, distracted; when performing a physically demanding task or when we have a cold. We say things differently when addressing a child, a friend, a relative or a boss. Uncertainty, hesitation, and deception all leave traces in how we sound as well.”

The information conveyed by these cues is called “paralinguistic content,” and it is a key factor in how we communicate with each other. When we convert our speech to text, this paralinguistic content goes missing, and with it, important tonal clues. Anyone who has ever struggled to interpret a cryptic text from a date already knows this. “Paralinguistic content is difficult or even impossible to capture in written text, even when accompanied by an ensemble of smileys,” says Boril, “but [it] can be tracked by computer algorithms and leveraged in building automated empathic interactive agents that will be able to aid us in a variety of tasks.”

What kind of tasks? Boril has plenty of ideas:

  • Emergency phone lines with empathic interactive agents can assess urgency of the call […] and expedite high-priority calls to human operators.
  • Automatic assessment of pronunciation and accent is used in online tutoring apps to help children and second language learners.
  • Voice characteristics […] carry strong speaker-specific information and are routinely utilized in information access/security and forensics applications.
  • Intelligent in-vehicle automatic driving safety systems monitor for distraction or fatigue in the driver's behavior […] The driver's vocal interactions with the passengers or the car's infotainment system provide one strong modality in the monitoring process.

These are just a few examples of a nearly endless pool of applications relying on automated speech assessment and classification. With this many possible applications, it’s a wonder more people aren’t working on speech recognition and natural language processing. Or is it?

As Dr. Boril explains, it takes much more than a computer whiz with a laptop to tackle the challenge of building a better voice recognition system. “To better understand speech,” he says, “we need to study vast amounts of speech data. We need to have access to a large sample of speakers of the language of interest, ideally with balanced demographics (i.e., males versus females, a reasonable age distribution), speaking styles, rich phonetic content, large vocabulary used in the utterances, etc.”

“Leading companies in voice technologies use hundreds of thousands of hours of spontaneous speech recorded through their online services to train and refine their speech engines. Academic research typically has to scale the experiments down to fit the available computational resources, so we operate typically on hundreds to thousands of hours of data.”

To process that much raw data, you need more computational power than can be found in the average consumer laptop. A lot more. “Since both the engine training and evaluation are computationally exhaustive tasks, we use high performance clusters with multiple computational cores (CPU clusters) and Graphics Processing Unit (GPU) servers to parallelize the processes,” says Dr. Boril. “Thanks to the funding from the College of EMS at UW-Platteville, I was able to set up a computational GPU server in our ECE department, where students can learn how to design and run parallel experiments in the same fashion as seen in top academic and industry research labs.”

Besides using cutting-edge research technology, the Pioneers in Dr. Boril’s research lab will be gaining experience in an exciting and interconnected field. “Speech research is by nature multidisciplinary,” says Boril, “as it combines knowledge from phonetics, linguistics, social sciences, acoustics, digital signal processing, statistical modeling, machine learning and other fields. This makes it quite exciting as we get to collaborate with colleagues from different backgrounds and get exposure to different approaches and views. When coming up with an experimental design, we need some sort of a sanity check: a seemingly perfect design in the eyes of an electrical engineer may be found less than satisfactory when reviewed by a linguist. So […] we need to work together, making sure that what we try to pursue is meaningful in a broader sense.”

The collaboration Dr. Boril describes between engineers and linguists is an increasingly important feature of modern computer science, as smart devices fill our homes, cars, and lives. “With the increasing presence of technology in our daily lives, it becomes instrumental to be able to communicate with various electronic devices that surround us in an effective and natural way,” says Dr. Boril.

“Rather than having to master endless menus, buttons, icons, and on-screen hand gestures, it is much more natural for us to simply tell the device what we want to do - with 2-3 word commands being sufficient to describe a plethora of tasks.” It’s an ambitious goal, and one you might remember the next time you’re asking an automated phone menu to “speak to a human.” Hynek Boril hears your plea. He’s working as fast as he can.
