Machine-Labeled Data + Artificial Noise = Better Speech Recognition

March 30, 2019

Although deep neural networks have enabled accurate large-vocabulary speech recognition, training them requires thousands of hours of transcribed data, which is time-consuming and expensive to collect. So Amazon scientists have been investigating techniques that will let Alexa learn with minimal human involvement, techniques that fall in the categories of unsupervised and semi-supervised learning.

At this year’s International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I are presenting a semi-supervised-learning approach to improving speech recognition performance — especially in noisy environments, where existing systems can still struggle.

We first train a speech recognizer — the “teacher” model — on 800 hours of annotated data and use it to “softly” label another 7,200 hours of unannotated data. Then we artificially add noise to the same dataset and use that, together with the labels generated by the teacher model, to train a second speech recognizer — the “student” model. We hope to make the behavior of the student model in the noisy domain approach that of the teacher model in the clean domain, and thus improve the noise robustness of the speech recognition system.

Figure: The architecture of our teacher-student model. “Logits selection” refers to the selection of high-confidence senones.

On test data that we produced by simultaneously playing recorded speech and media sounds through loudspeakers and re-recording the combined acoustic signal, our system shows a 20% relative reduction in word error rate versus a system trained only on the clean, annotated data.

An automatic speech recognition system has three main components: an acoustic model, a pronunciation model, and a language model. The inputs to the acoustic model are short snippets of audio called frames. For every input frame, the output is thousands of probabilities. Each probability indicates the likelihood that the frame belongs to a low-level phonetic representation called a senone.

In training the student model, we keep only the highest-confidence senones from the teacher, which turns out to be quite an effective approach.

The outputs of the acoustic model pass to the pronunciation model, which converts senone sequences into possible words, and those pass to the language model, which encodes the probabilities of word sequences. All three components of the system work together to find the most likely word sequence given the audio input.

Both our teacher and student models are acoustic models, and we experiment with two criteria for optimizing them. With the first, the models are optimized to maximize accuracy on a frame-by-frame basis, at the level of the acoustic model. The other training criterion is sequence-discriminative: both the teacher and student models are further optimized to minimize error across sequences of outputs, at the levels of not only the acoustic model but the pronunciation model and language model as well.
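The frame-level criterion can be illustrated with a short sketch. It assumes the per-frame objective is a cross-entropy between the teacher's soft labels and the student's senone distribution, which is a common choice in teacher-student training but is not spelled out in this post. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis (senones)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def frame_level_loss(student_logits, teacher_probs):
    """Mean cross-entropy between teacher soft labels and student outputs.

    student_logits: (num_frames, num_senones) raw student scores
    teacher_probs:  (num_frames, num_senones) rows summing to 1
    Assumption: cross-entropy stands in for the post's frame-level criterion.
    """
    per_frame = -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)
    return per_frame.mean()
```

The loss is smallest when the student's distribution for each frame matches the teacher's, which is the behavior the frame-level stage is meant to encourage. Sequence-discriminative training would replace this per-frame average with an objective computed over whole utterances.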

We find that sequence training makes the teacher models more accurate on their own, independent of the student models’ performance. It also slightly increases the relative improvement offered by the student models.

To add noise to the training data, we used a collection of noise samples, most of which involved media playback — such as music or television audio — in the background. For each speech example in the training set, we randomly selected one to three noise samples to add to it. Those samples were processed to simulate closed-room acoustics, with the properties of the simulated room varying randomly from one training example to the next.
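A rough sketch of this kind of augmentation follows. Several details are assumptions made for illustration and do not come from the post: the signal-to-noise range, the toy exponentially decaying impulse response used as a stand-in for a room simulator, and all function and parameter names.

```python
import numpy as np

def toy_room_response(rng, length=400):
    """Hypothetical stand-in for a simulated room: decaying random echoes."""
    decay = rng.uniform(2.0, 8.0)            # random "room" property (assumed)
    t = np.arange(length) / length
    response = rng.standard_normal(length) * np.exp(-decay * t)
    response[0] = 1.0                        # direct path
    return response

def add_noise(speech, noise_bank, rng, snr_db_range=(0.0, 20.0)):
    """Mix one to three reverberated noise clips into a speech waveform."""
    max_noises = min(3, len(noise_bank))
    num_noises = int(rng.integers(1, max_noises + 1))   # 1, 2, or 3
    chosen = rng.choice(len(noise_bank), size=num_noises, replace=False)
    noisy = speech.astype(np.float64)        # astype returns a copy
    speech_power = np.mean(noisy ** 2)
    for idx in chosen:
        reverbed = np.convolve(noise_bank[idx], toy_room_response(rng))
        noise = np.resize(reverbed, speech.shape)  # loop or trim to length
        snr_db = rng.uniform(*snr_db_range)  # assumed SNR range
        target_power = speech_power / (10 ** (snr_db / 10))
        noise = noise * np.sqrt(target_power / np.mean(noise ** 2))
        noisy += noise
    return noisy
```

Because a fresh impulse response is drawn for every noise clip, the simulated acoustics vary from one training example to the next, as in the procedure described above. A production pipeline would more likely use a proper room simulator than this toy response.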

For every frame of audio data that passes to an acoustic model, most of the output probabilities are extremely low. So when we use the teacher’s output to train the student, we keep only the highest probabilities. We experimented with different numbers of target probabilities, from five to 40.
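Keeping only the highest-scoring senones per frame can be sketched as below. The function name and array layout are hypothetical. The selection is done on logits, following the figure caption's "logits selection"; since softmax preserves ordering, this picks the same senones as selecting on probabilities.

```python
import numpy as np

def select_top_k(teacher_logits, k=20):
    """Return indices and values of the k highest-scoring senones per frame.

    teacher_logits: (num_frames, num_senones)
    """
    # argpartition finds the k largest entries without a full sort.
    top_idx = np.argpartition(teacher_logits, -k, axis=-1)[:, -k:]
    top_vals = np.take_along_axis(teacher_logits, top_idx, axis=-1)
    return top_idx, top_vals
```

With thousands of senones per frame, storing only the selected indices and values also makes the teacher's labels for thousands of hours of audio far more compact than the full output vectors.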

Intriguingly, this modification by itself improved the performance of the student model relative to the teacher, even on clean test data. Training the student to ignore improbable hypotheses enabled it to devote more resources to distinguishing among probable ones.

In addition to limiting the number of target probabilities, we also applied a smoothing function to them, which evened them out somewhat, boosting the lows and trimming the highs. The degree of smoothing is defined by a quantity called temperature. We found that a temperature of 2, together with keeping the 20 top probabilities, yielded the best results.
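The effect of temperature can be seen in a small example. It assumes the smoothing function is the temperature-scaled softmax commonly used in knowledge distillation, which the post does not state explicitly. The function name and the three example logits are made up for illustration.

```python
import numpy as np

def soften(logits, temperature=2.0):
    """Softmax of logits / temperature over the last axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 2.0, 0.0])          # hypothetical teacher scores
sharp = soften(logits, temperature=1.0)     # ordinary softmax
smooth = soften(logits, temperature=2.0)    # flatter distribution
```

At a temperature of 2, the largest probability shrinks and the smallest grows relative to the ordinary softmax, while the ranking of the senones is unchanged. In the setup described above, such a function would be applied to the 20 retained scores for each frame.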

Apart from the data set produced by re-recording overlapping audio, we used two other data sets to test our system. One was a set of clean audio samples, and the other was a set of samples to which we’d added noise through the same procedure we used to create the training data.

Our best-performing student model was first optimized according to the per-frame output from the teacher model, using the entire 8,000 hours of data with noise added, then sequence-trained on the 800 hours of annotated data. Relative to a teacher model sequence-trained on 800 hours of hand-labeled clean data, it yielded a 10% decrease in error rate on the clean test data, a 29% decrease on the noisy test data, and a 20% decrease on the re-recorded noisy data.
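These percentages are relative reductions, not absolute differences in error rate. The helper below shows the arithmetic. The absolute word error rates in the example are hypothetical, since the post reports only relative figures.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative error-rate reduction, as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a teacher at 10.0% WER and a student at 8.0% WER
# differ by 2 points absolute, which is a 20% relative reduction.
example = relative_reduction(10.0, 8.0)
```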

Minhua Wu is an applied scientist in the Alexa Speech group.

Paper: “Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-Student Learning”

Acknowledgments: Ladislav Mosner, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, Björn Hoffmeister
