Advances in an artificial intelligence technique known as deep learning are helping companies like Microsoft and Nuance create powerful speech analytics and voice recognition software at a much quicker pace than ever before.
Multilingual Text to Speech in a Person's Own Voice
Deep learning is a system of pattern recognition modeled after the ability of the human brain to gather information and learn from it. Microsoft Chief Research Officer Richard Rashid gave a rousing demonstration of the technology at a recent conference in China.
Rashid lectured to an audience while a software system showed his words on a screen above his head. The speech recognition program then translated the words into Chinese, and even spoke the words in Rashid's own voice, according to the New York Times' John Markoff.
Rashid does not speak Chinese. How did Microsoft teach a computer to simulate a person's voice in a language that a person had never spoken? The easy way to do it would be simply to have Rashid actually speak a group of phrases in Chinese, record it, and then use a computer to extrapolate that pre recorded sound into whatever phrase needed to be translated.
That's a neat trick, but hardly practical. In order to teach a computer how to do something like this, breakthroughs in things like deep learning are required. Microsoft has a renowned research division, and while we don't know exactly how far it has gone into this particular branch of artificial intelligence technology, we did track down the secret to how Rashid made headlines in the Times last week.
Trajectory Tiling: How They Did It
Microsoft calls it trajectory tiling -- the entire process was laid out in a March 2012 presentation by Rashid at the TechFest conference, a celebration of 20 years of Microsoft research. It should come as no surprise then that Rashid has been in charge of this Microsoft segment during the entirety of those two decades. In fact, he was hired away from a Carnegie Mellon professor's job specifically to help start up Microsoft Research.
Increasingly, the virtual and the physical worlds are merging", Rashid said at TechFest 2012 in March. "Part of this is happening because we are giving computers the same senses we have. We are giving them the ability to see, we are giving them the ability to hear and to understand."
Dedicated to basic computer science research for the last 20 years, Rashid said he is confident his company is changing the kinds of applications it can build and how we interact with computers.
Making Voice Sausage
Frank Soong, principal researcher, Microsoft Research Asia did the trajectory tiling demonstration at TechFest 2012. He called it trained multilingual text to speech. Instead of starting with Rashid's own voice reciting various Chinese phrases like in the above example, a reference speaker is recorded instead.