March 26, 2019

Brown Bag Talk: Neural Models of Text Normalization for Speech Applications

Wednesday, March 27, 1:30pm-2:30pm
Location: SBUS 025
NO RSVP necessary

Richard Sproat, Research Scientist, Google Research, New York

(Joint work with Ke Wu, Hao Zhang, Kyle Gorman, Felix Stahlberg, Xiaochang Peng and Brian Roark).

Abstract:

Speech applications such as text-to-speech (TTS) or automatic speech recognition (ASR) must not only know how to read ordinary words, but must also know how to read numbers, abbreviations, measure expressions, times, dates, and a whole range of other constructions that one frequently finds in written texts. The problem of dealing with such material is called text normalization. The traditional approach to this problem, and the one currently used in Google’s deployed TTS and ASR systems, involves large hand-constructed grammars, which are costly to develop and tricky to maintain. It would be nice if one could simply train a system from text paired with its verbalization.

I will present our work on applying neural sequence-to-sequence RNN models to the problem of text normalization. Given sufficient training data, such models can achieve very high accuracy, but also tend to produce the occasional error (reading “kB” as “hectare”, or misreading a long number such as “3,281”) that would be problematic in a real application. The most powerful method we have found to correct such errors is to use finite-state over-generating covering grammars at decoding time to guide the RNN away from “silly” readings. Such covering grammars can be learned from a very small amount of annotated data. The resulting system is thus a hybrid system, rather than a purely neural one, a purely neural approach being apparently impossible at present.
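As a rough illustration of the idea, and not the system described in the talk, the sketch below shows how an over-generating set of allowed readings could steer a model away from a "silly" output. Everything in it is an assumption made for the example: the `COVERING_GRAMMAR` table, the `pick_reading` helper, and the scores are invented, and the sketch filters an n-best list after the fact, whereas the abstract describes applying finite-state grammars during decoding.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical covering "grammar": for each written token, an over-generated
# set of readings it could plausibly have. Entries are invented for illustration;
# a real covering grammar would be a finite-state transducer, not a lookup table.
COVERING_GRAMMAR: Dict[str, Set[str]] = {
    "kB": {"kilobyte", "kilobytes", "k b"},
    "ha": {"hectare", "hectares", "h a"},
}


def pick_reading(token: str, hypotheses: List[Tuple[str, float]]) -> str:
    """Return the best-scoring hypothesis that the covering grammar allows.

    hypotheses holds (verbalization, log-probability) pairs from some model.
    If the grammar has no entry for the token, or allows none of the
    hypotheses, the model's unconstrained top choice is returned.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    allowed = COVERING_GRAMMAR.get(token)
    if allowed is not None:
        for text, _score in ranked:
            if text in allowed:
                return text
    return ranked[0][0]


# Made-up scores standing in for the output of a sequence-to-sequence model
# that wrongly prefers "hectare" for the token "kB".
nbest = [("hectare", -0.2), ("kilobyte", -0.9), ("kilobytes", -1.5)]
print(pick_reading("kB", nbest))
```

Here the model's top choice is rejected because it lies outside the allowed set for “kB”, and the next-best allowed reading is chosen. The grammar only needs to rule out implausible readings; the model still chooses among the plausible ones.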

Brief bio:

Richard Sproat is a Research Scientist at Google Research, New York. From 2009 to 2012 he was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University. Prior to going to OHSU, he was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He was also a full-time faculty member at the Beckman Institute. He still holds adjunct positions in Linguistics and ECE at UIUC.

Before joining the faculty at UIUC, he worked in the Information Systems and Analysis Research Department headed by Ken Church at AT&T Labs — Research, where he worked on Speech and Text Data Mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques. Before joining Ken’s department, he worked in the Human/Computer Interaction Research Department headed by Candy Kamm. His main project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC.