In a blog post, Google presents Translatotron, its new end-to-end speech-to-speech translation model. That the company has spent years refining its translation models is nothing new; that these models can now imitate people’s voices is.

Google points out that the main objective is to help people who speak different languages communicate with each other. To build this speech-to-speech system, they propose a single sequence-to-sequence model that moves away from cascade systems and, according to Google, improves on speed, on errors that compound from one stage to the next, and on the translation itself.

Imitating accents and pronunciation

Google tells us that Translatotron is based on an end-to-end model, an approach it considers to have advantages over traditional cascade systems. With it, they intend to demonstrate that speech can be translated from one language to another without an intermediate text representation in either language, something cascade systems do require.
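The structural difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: every function name below (`cascade_translate`, `direct_translate`, and the `toy_*` stubs) is hypothetical and stands in for a trained component; none of it comes from Google's code.

```python
from typing import Callable, List

Audio = List[float]  # toy stand-in for a waveform or spectrogram


def cascade_translate(
    audio: Audio,
    recognize: Callable[[Audio], str],
    translate_text: Callable[[str], str],
    synthesize: Callable[[str], Audio],
) -> Audio:
    """Three separate stages, with text as the intermediate representation."""
    source_text = recognize(audio)             # speech -> source-language text
    target_text = translate_text(source_text)  # source text -> target text
    return synthesize(target_text)             # target text -> speech


def direct_translate(audio: Audio, model: Callable[[Audio], Audio]) -> Audio:
    """A single model maps source speech straight to target speech."""
    return model(audio)


# Hypothetical stubs, present only so the sketch runs.
def toy_recognize(audio: Audio) -> str:
    return "hola"


def toy_translate_text(text: str) -> str:
    return {"hola": "hello"}.get(text, "<unknown>")


def toy_synthesize(text: str) -> Audio:
    return [float(ord(ch)) for ch in text]


def toy_direct_model(audio: Audio) -> Audio:
    return [float(ord(ch)) for ch in "hello"]


source_audio = [0.1, 0.2, 0.3]
cascade_output = cascade_translate(
    source_audio, toy_recognize, toy_translate_text, toy_synthesize
)
direct_output = direct_translate(source_audio, toy_direct_model)
```

In the cascade, a mistake in the first stage (for example, recognizing "ola" instead of "hola") is passed on to every later stage, which is the kind of compounding error the end-to-end design is meant to avoid.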

Google’s new tool takes source spectrograms and directly generates new spectrograms with the content translated into the target language. To do this, it uses a neural vocoder, which converts the output spectrogram into an audio waveform. It also uses a speaker encoder that preserves the characteristics of the voice being recorded.
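For readers unfamiliar with the term, a spectrogram is a sequence of short-time frequency snapshots of an audio signal. The following is a minimal sketch in plain Python, assuming a naive discrete Fourier transform over fixed-size frames with no windowing or mel scaling; the function name and parameters are illustrative, and real systems use far more refined features.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=8, hop=4):
    """Return one list of DFT magnitudes (bins 0..frame_size//2) per frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT for frequency bin k of this frame.
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            bins.append(abs(total))
        frames.append(bins)
    return frames


# A pure tone with exactly 2 cycles per 8-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(16)]
spec = magnitude_spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Here the energy of the tone concentrates in frequency bin 2 of every frame. A speech-to-speech model of the kind described works on sequences of such frames rather than on text.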

The main novelty of Translatotron is that it does not work as a cascade, and that it adds elements such as an encoder that retains the characteristics of the recorded speaker's voice.

When training Translatotron, Google uses a multitask objective that predicts the source and target transcripts while simultaneously generating the target spectrograms.
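A multitask objective of this kind can be pictured as one main loss on the generated spectrogram plus auxiliary losses on the two transcript predictions. The sketch below is an assumption made for illustration: the choice of mean squared error and cross-entropy, the `aux_weight` parameter, and all function names are hypothetical and are not taken from Google's description.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct symbol."""
    return -math.log(probs[target_index])


def mean_squared_error(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multitask_loss(pred_spec, true_spec,
                   src_probs, src_target,
                   tgt_probs, tgt_target,
                   aux_weight=1.0):
    """Main spectrogram loss plus weighted auxiliary transcript losses."""
    spec_loss = mean_squared_error(pred_spec, true_spec)
    aux_loss = (cross_entropy(src_probs, src_target)
                + cross_entropy(tgt_probs, tgt_target))
    return spec_loss + aux_weight * aux_loss


# Toy values: one spectrogram frame and one symbol per transcript.
loss = multitask_loss(
    pred_spec=[1.0, 2.0], true_spec=[1.0, 4.0],
    src_probs=[0.5, 0.5], src_target=0,
    tgt_probs=[0.25, 0.75], tgt_target=1,
)
spec_only = multitask_loss(
    pred_spec=[1.0, 2.0], true_spec=[1.0, 4.0],
    src_probs=[0.5, 0.5], src_target=0,
    tgt_probs=[0.25, 0.75], tgt_target=1,
    aux_weight=0.0,
)
```

Setting `aux_weight` to zero leaves only the spectrogram term, which shows how the transcript predictions act as extra training signals alongside the main output.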

In short, Google records the speaker's voice, preserves the characteristics of their speech, and generates an output spectrogram translated into the target language that keeps those characteristics.

Emulating natural language

Creating natural-sounding voice models has long been a Google obsession. We can see it in the way Google Assistant talks. Naturalness is the main quality Google relies on to set itself apart from other assistants and models.

Google itself admits that its results fall short of traditional cascade systems, but they demonstrate the viability of end-to-end speech systems, which was the main objective.

First, they show how translation works under a cascade model. We have a Spanish input, a reference translation, and the model's output translation. If we listen to the cascade model's translation, we hear the typical stilted, halting speech of older assistants.