That is, a Chinese voice is recorded first as a baseline. This recording is then used to construct the underlying parameter trajectory of the targeted translation: the fundamental frequency, the gain and the loudness. In this case the target language is Chinese, but the same method would work for whatever language is needed.
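
The article doesn't spell out exactly how Microsoft extracts those parameters, but a rough Python sketch of building such a trajectory (a fundamental-frequency track plus a gain/energy track) might look like the following, using the open-source librosa library. The file name, sample rate and frame sizes are placeholder assumptions, not values from Microsoft's system.

```python
import numpy as np
import librosa

# Load the baseline Chinese recording (file name and 16 kHz rate are placeholders).
y, sr = librosa.load("chinese_reference.wav", sr=16000)

hop = int(0.005 * sr)   # 5 ms hop, matching the tiny segments described later
frame = 1024            # ~64 ms analysis window (an assumption, not from the article)

# Fundamental frequency (F0) trajectory via the pYIN pitch tracker.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=75, fmax=400, sr=sr, frame_length=frame, hop_length=hop)

# Gain / loudness trajectory as per-frame RMS energy.
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

# One row per frame: [F0, energy] -- the "underlying parameter trajectory".
n = min(len(f0), len(rms))
trajectory = np.column_stack([np.nan_to_num(f0[:n]), rms[:n]])
print(trajectory.shape)  # (num_frames, 2)
```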

Once the baseline is established, the two voices (the reference speaker's and the target person's) are warped, or equalized, toward one another. In this case, the trajectory is warped towards the English speaker.
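
The warping step isn't described in detail either. One common way to equalize two voices, not necessarily the one Microsoft used, is to normalize the reference speaker's fundamental-frequency trajectory to the target speaker's statistics in the log domain; the numbers in the example below are made up for illustration.

```python
import numpy as np

def equalize_f0(ref_f0, target_stats):
    """Warp a reference F0 trajectory toward a target speaker.

    Mean-variance (Gaussian) normalization in the log-F0 domain, offered here
    only as an illustrative stand-in for whatever warping Microsoft applies.
    """
    voiced = ref_f0 > 0                      # ignore unvoiced (zero) frames
    log_f0 = np.log(ref_f0[voiced])
    ref_mean, ref_std = log_f0.mean(), log_f0.std()
    tgt_mean, tgt_std = target_stats         # measured from the English speaker

    warped = ref_f0.copy()
    warped[voiced] = np.exp((log_f0 - ref_mean) / ref_std * tgt_std + tgt_mean)
    return warped

# Example: shift a Chinese reference trajectory toward a hypothetical English
# speaker whose log-F0 mean and standard deviation are 4.7 and 0.25.
chinese_f0 = np.array([0.0, 180.0, 185.0, 190.0, 0.0, 175.0])
warped_f0 = equalize_f0(chinese_f0, (4.7, 0.25))
```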

Teaching a computer to speak translated text in a person's own voice.

This is where the voice sausage is made. The English-speaking person's voice database is broken into tiny pieces, in this case 5 milliseconds long. The engineers then gather all the pieces that are closest to the trajectory of the warped Chinese sentence. This sausage-like network is formed, and within that network they find the best concatenation of the whole sequence of tiles.
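
That sausage-like network is essentially a unit-selection lattice: a handful of candidate 5-millisecond pieces per frame, scored by how close they sit to the warped trajectory (a target cost) and by how smoothly neighboring pieces join (a concatenation cost), with a dynamic-programming search for the cheapest path through the lattice. The Euclidean distances and candidate count below are simple stand-ins for whatever costs Microsoft actually uses.

```python
import numpy as np

def select_units(target_traj, unit_db, k=10, join_weight=1.0):
    """Pick the best sequence of database units to match a target trajectory.

    target_traj: (T, D) warped parameter trajectory, one row per 5 ms frame.
    unit_db:     (N, D) features of every 5 ms unit in the English speaker's
                 voice database.
    Returns the index of the chosen database unit for each frame.
    """
    T = len(target_traj)

    # Target cost: for each frame keep the k closest database units (the "sausage").
    cand = []
    for t in range(T):
        d = np.linalg.norm(unit_db - target_traj[t], axis=1)
        idx = np.argsort(d)[:k]
        cand.append((idx, d[idx]))

    # Viterbi search over the lattice for the cheapest concatenation.
    best = cand[0][1].copy()                 # cumulative cost per candidate
    back = np.zeros((T, k), dtype=int)       # best predecessor per candidate
    for t in range(1, T):
        prev_idx, cur_idx, cur_cost = cand[t - 1][0], cand[t][0], cand[t][1]
        # Join cost: distance between the previous unit and the current one,
        # zero when the two pieces were adjacent in the original recording.
        join = np.linalg.norm(
            unit_db[prev_idx][:, None, :] - unit_db[cur_idx][None, :, :], axis=2)
        join[prev_idx[:, None] + 1 == cur_idx[None, :]] = 0.0
        total = best[:, None] + join_weight * join + cur_cost[None, :]
        back[t] = np.argmin(total, axis=0)
        best = np.min(total, axis=0)

    # Trace back the cheapest path through the lattice.
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return np.array([cand[t][0][p] for t, p in enumerate(path)])
```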

This can be used to form a Chinese-language sentence in the English-speaking person's own voice. More sentences can be built from there, and together they form training sentences for the Chinese text-to-speech program. With this program in hand, whenever the English-speaking person needs something translated, the computer can speak it in that person's voice.
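
Forming each sentence then comes down to gluing the chosen pieces back together. Here is a minimal overlap-add sketch, assuming 5-millisecond units at 16 kHz and a simple cross-fade at each join; the article doesn't say how Microsoft smooths the seams.

```python
import numpy as np

def concatenate_units(unit_waveforms, unit_len=80, overlap=16):
    """Overlap-add the chosen 5 ms waveform pieces into one sentence.

    unit_waveforms: list of 1-D arrays of raw samples for the selected units
    (80 samples = 5 ms at 16 kHz). The cross-fade here is just one simple way
    to hide the joins, not a description of Microsoft's smoothing.
    """
    step = unit_len - overlap
    out = np.zeros(step * len(unit_waveforms) + overlap)
    window = np.hanning(2 * overlap)
    fade_in, fade_out = window[:overlap], window[overlap:]
    for i, u in enumerate(unit_waveforms):
        u = u.astype(float)
        u[:overlap] *= fade_in
        u[-overlap:] *= fade_out
        out[i * step : i * step + unit_len] += u
    return out
```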

At TechFest 2012, however, there was an added step Rashid didn't show off at his now-famous lecture in China. Researchers built a virtual talking-head simulation of Craig Mundie, head of research and strategy (and Rashid's boss), and used his voice to show off the software.

Craig Mundie lip-syncing Chinese, all simulated with a computer.

So in this image, Mundie's voice and image are computer-generated, but very much based on real data from the man himself. None of this would be possible without deep learning. The catch is that the system is still far from perfect, a shortcoming voice recognition and interactive voice response (IVR) technology has had since the 1970s.

Speech Analytics in 2013

While Microsoft and Nuance, the company whose technology powers Apple's Siri digital assistant, are improving speech recognition and machine learning at an ever-faster pace, the enterprise still has to rely on a bevy of less sci-fi tools.

Aberdeen Group has a new speech analytics buyer's guide out in November, and it focuses on things like improving financial returns at call centers. The report found speech-to-text in particular is "especially helpful for contact centers serving a wide range of demographics as it helps them update their vocabulary with words and phrases used more widely by certain demographic groups."

We spoke to Aberdeen researcher Omer Minkara, who wrote the report, and he said revolutionizing customer interactions through voice recognition could take another 5-10 years.

"Despite the clear benefits of helping businesses personalize customer interactions, both consumers and businesses find many gaps and inaccuracies in real-life use cases of these tools," Minkara said in an email.

"This use case information needs to be compiled, analyzed and built into refinement of next generation speech recognition tools in order to improve the accuracy and timeliness of these tools within future customer interaction scenarios."