Amazon researchers develop cutting-edge Base TTS text-to-speech model

Amazon.com Inc. researchers have developed a new text-to-speech model, Base TTS, that can pronounce words more naturally than earlier neural networks.

TechCrunch reported the project late Wednesday. The researchers detailed the architecture of Base TTS in an academic paper published on Monday. 

Besides generating more natural-sounding audio than its predecessors, the model is also the largest neural network in the category. The most advanced version of Base TTS features about 1 billion parameters, which are configuration settings that determine how an artificial intelligence processes data. In general, increasing an AI model’s parameter count expands the range of tasks it can perform.

Amazon’s researchers trained Base TTS on 100,000 hours’ worth of audio sourced from the public web. English-language recordings account for about 90% of the dataset. To streamline the training process, the researchers split the audio into small files that each included no more than 40 seconds of speech.
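The 40-second cap on training files can be illustrated with a minimal sketch. The helper below is hypothetical (the paper does not publish its preprocessing code) and assumes audio arrives as a flat list of samples at a known sample rate:

```python
def split_audio(samples, sample_rate, max_seconds=40):
    """Split a sequence of audio samples into chunks of at most
    max_seconds each, mirroring the <=40-second training files."""
    max_len = max_seconds * sample_rate
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

# A 100-second clip at 16 kHz becomes three files: 40 s, 40 s and 20 s.
clip = [0.0] * (100 * 16000)
chunks = split_audio(clip, 16000)
print([len(c) / 16000 for c in chunks])  # [40.0, 40.0, 20.0]
```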

“Echoing the widely-reported ‘emergent abilities’ of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences,” the researchers wrote in the paper detailing the system. 

At the architectural level, Base TTS comprises two separate AI models. The first turns text entered by the user into abstract mathematical representations dubbed speechcodes. The second neural network, in turn, transforms those mathematical representations into audio.

The first model is based on the Transformer architecture that underpins OpenAI’s GPT-4. Developed by Google LLC in 2017, the architecture allows neural networks to consider the context of a word when trying to determine its meaning. That feature enables Transformer-based neural networks to interpret input data more accurately than earlier algorithms.
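The context-awareness the Transformer provides comes from its attention mechanism, in which each word's representation is computed as a similarity-weighted average over the other words. A toy sketch of scaled dot-product attention, the core operation from the 2017 architecture, simplified to plain Python lists (all vectors here are illustrative):

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query's output is a weighted
    average of the values, weighted by query-key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# The query strongly matches the first key, so the output is
# dominated by the first value vector.
print(attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]],
                [[1.0, 0.0], [0.0, 1.0]]))
```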

The Transformer model in Base TTS turns text input by the user into speechcodes, mathematical representations that the other components of the system can more easily process. The model also performs two other tasks. According to the researchers, it compresses speechcodes to speed up processing and ensures that the audio Base TTS produces doesn’t include unnecessary elements such as background noise.

Once the speechcodes are ready, they move into the second AI model that underpins Base TTS. That model turns the data into spectrograms, visual representations of how a sound’s frequency content changes over time. Those representations can then be turned into AI-generated speech.
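The two-stage flow described above can be sketched as a pipeline of stand-in functions. The real components are large neural networks; the names, the dummy speechcodes and the 80-bin frame size below are purely illustrative:

```python
def text_to_speechcodes(text):
    """Stage 1 (stand-in for the Transformer model): map text to a
    sequence of discrete speechcodes, here one dummy code per word."""
    return [hash(word) % 1024 for word in text.split()]

def speechcodes_to_spectrogram(codes):
    """Stage 2 (stand-in for the second model): map each speechcode to
    one spectrogram frame, here a dummy frame of 80 frequency bins."""
    return [[float(code)] * 80 for code in codes]

def synthesize(text):
    """End-to-end: text -> speechcodes -> spectrogram. A final vocoder
    step would turn the spectrogram into an audible waveform."""
    return speechcodes_to_spectrogram(text_to_speechcodes(text))

spec = synthesize("hello world")
print(len(spec), len(spec[0]))  # 2 frames, 80 bins each
```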

Amazon’s researchers assessed Base TTS’s capabilities with the help of an expert linguist, as well as a perceptual listening test called MUSHRA. They determined that the model can read input text aloud in a more natural-sounding way than earlier models.

During the evaluation, Base TTS successfully pronounced the @ sign and other symbols along with paralinguistic sounds such as “shh.” It also managed to read aloud English-language sentences that contained foreign words and questions. According to Amazon, Base TTS completed the task even though it wasn’t specifically trained to process some of the sentence types included in the evaluation dataset. 
