How to select a TTS service
On several occasions we have had the opportunity to develop solutions that require a Text-to-Speech (TTS) conversion service. If your project needs this type of technology, you'll find a wide range of options that differ in quality, cost, and available voices.
With that in mind, this post offers some advice to consider when selecting a TTS provider.
Factors to consider
Quality
From our perspective, this is the most important factor. Ideally, the voice should sound as natural as possible, and the speed, pitch, and accent should be appropriate so that the recipient can understand the message clearly. The service must also read numerical values and dates correctly, as well as acronyms such as API or SIP and names like Microsoft or Google.
Bear in mind, however, that quality is very subjective and depends largely on the accent and type of voice.
Cost
Cost is clearly a very important factor: even if a service is of very high quality, excessively high costs can make the project financially unviable. In fact, in similar projects we have been asked to change the TTS engine because of cost. It is also the most difficult aspect to compare, since each provider charges differently. Some charge per transaction, while others charge a certain amount per thousand or per million characters. Many also offer a free tier in which a certain number of transactions or characters are free each month.
Some recommendations for comparing costs:
- Clearly define the average number of characters per transaction and the number of transactions you'll be performing, then price each option on that basis. For example, if you have a large number of short transactions, it might be better to choose a provider that charges per character rather than per transaction.
- Cache previously synthesized requests. The text to be played is often repeated, so caching significantly reduces costs. It also speeds up playback in real-time applications, since there is no need to request audio that has already been synthesized.
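The caching recommendation above can be sketched as follows. This is a minimal illustration, not a provider integration: `cached_synthesize`, the `synthesize` callable, and the on-disk layout are all hypothetical names chosen for the example, and the real provider call would be whatever client your chosen TTS service offers.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_synthesize(
    text: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical wrapper around the provider call
    cache_dir: Path,
) -> bytes:
    """Return audio for (text, voice), calling the provider only on a cache miss."""
    # Key on both voice and text so the same sentence in another voice is not reused.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.audio"
    if path.exists():
        return path.read_bytes()  # cache hit: no billable request
    audio = synthesize(text, voice)  # cache miss: billable request to the provider
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Caching pays off most for fixed prompts. For messages that mix fixed and variable parts, one option is to split the text into segments and cache each segment separately, at the cost of less natural transitions between them.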
Supported languages and variants
Depending on your goals, this point may be more or less relevant. When looking at the tools, you'll find that most offer excellent English support, while the quality and number of Spanish voices are often quite limited. In many cases this matters: if you want to build a transactional IVR for Latin America and the TTS provider only offers voices in Spanish from Spain, the accent will certainly be noticeable.
One of the best examples is Google's recently released TTS API, with impressive support for English and even a wide variety of voices, as well as the ability to generate audio using the WaveNet voice type, a technology that allows the voice to sound quite natural. However, support for Spanish is quite poor, being far surpassed by Amazon Polly, Nuance, or IBM Watson.
In summary…
When selecting a TTS provider, the idea is to conduct a cost-benefit analysis of each of the factors mentioned above. Therefore, for one of our projects, we undertook the task of comparing several TTS tools.
We synthesized the following text with each tool:
«Hello Mauricio. We are calling from Banco Americano to remind you that you currently have an outstanding balance of $258,870. We encourage you to settle your account before April 20, 2018, to avoid being reported to credit bureaus. For more information, please call us at 3148901850 or visit our website at www.bancoamericano.com. Thank you.»
The tools we compared were:
- Amazon Polly
- Microsoft Bing text to speech
- IBM Watson
- Nuance (better known as Loquendo)
- iSpeech
- Google Cloud TTS
- Festival (Free Alternative)
Regarding quality, as we've mentioned, this point is highly subjective and some people will prefer one service over another; however, from our perspective, the best are:
- Amazon Polly
- IBM Watson
- Microsoft Bing text to speech
- Nuance (better known as Loquendo)
- iSpeech
For the example project, we estimated that transactions average 500 characters, which let us normalize each provider's price to dollars per transaction. Polly, which charges $4 per million characters (assuming 5 million characters per month for the first twelve months), works out to $0.002 per transaction (500 × $4 / 1,000,000). Meanwhile, Nuance, which charges per transaction regardless of the number of characters, would cost us $0.008 per transaction.
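The normalization described above is simple arithmetic, sketched here under stated assumptions. The function names are hypothetical, and the flat fee used for the break-even example is a placeholder value, not a quoted price. Only the $4 per million characters rate and the 500-character average come from the text.

```python
def per_transaction_cost(price_per_million_chars: float, chars_per_transaction: int) -> float:
    """Convert a per-character price into dollars per transaction."""
    return price_per_million_chars * chars_per_transaction / 1_000_000


def break_even_chars(flat_fee_per_transaction: float, price_per_million_chars: float) -> float:
    """Transaction length at which per-character and flat pricing cost the same."""
    return flat_fee_per_transaction / price_per_million_chars * 1_000_000


# Per-character provider: $4 per million characters, 500 characters on average.
per_char_cost = per_transaction_cost(4.0, 500)

# Placeholder flat fee, for illustration only.
example_flat_fee = 0.01
threshold = break_even_chars(example_flat_fee, 4.0)

print(f"per-character provider: ${per_char_cost:.4f} per transaction")
print(f"flat pricing wins above about {threshold:.0f} characters per transaction")
```

Below the break-even length the per-character provider is cheaper, and above it the flat per-transaction fee wins, which is why the average transaction size needs to be defined before comparing providers.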
When a project involves playing audio in multiple languages and variants, the number of supported languages and voices should also be considered when selecting a provider. Nuance, Amazon, and Bing offer a wide range of options, with Nuance being the most diverse for Spanish.
Therefore, we created the following comparative table:
Conclusion
In our case, the decision we made for the project we presented as an example was to select Polly due to its cost-to-quality ratio. While Nuance, IBM, and Bing may offer slightly better quality, the cost of the service is very high compared to the marginal improvement achieved. When analyzing Google, we found it to be the best alternative if the voices are only in English, and it's also worth highlighting its new WaveNet voice feature (which elevates the quality above the others). Finally, if quality isn't a concern and you're unwilling to pay a penny for TTS, Festival is a free alternative that could be considered.
By: Jose Franco