On several occasions we have developed solutions that require a text-to-audio conversion service. This technology is known as TTS (Text To Speech). If your project needs this type of technology, you will find a large number of alternatives that differ in quality, cost and available voices.
With that in mind, we wrote this entry to offer some advice on selecting a TTS provider.
Factors to consider
Quality
From our point of view this is the most important factor. Ideally, the voice should sound as natural (non-robotic) as possible, and its speed, pitch and accent should be appropriate, so that the person receiving the message can understand it clearly. The service must also read numerical quantities, dates and acronyms such as API or SIP correctly, as well as names such as Microsoft or Google.
That said, this point is very subjective and depends largely on the accent and type of voice.
Cost
Cost is clearly a very important factor: even if the service is of very good quality, very high costs can make the project financially unviable. In our own experience with projects of this type, we have been asked to change the TTS engine because of cost. This is probably the most difficult point to compare, since each provider charges differently. Some charge per transaction, while others charge a certain amount per thousand or per million characters. Most of them also offer a free tier that includes a certain number of transactions or thousands of characters each month.
Some recommendations for estimating the cost are:
- Be clear about the size in characters and the number of transactions you expect, so you can compare the alternatives. For example, if you have a large number of short transactions (few characters each), it may be better to select a provider that charges per character rather than per transaction.
- Cache previously made requests. This significantly reduces costs, since the texts to be played are usually repeated often. It also reduces response time in real-time applications, since audio that has already been synthesized does not need to be requested again.
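The caching recommendation can be illustrated with a minimal sketch. It is not tied to any particular provider: `CachedSynthesizer` and `fake_provider` are hypothetical names introduced here, and the `synthesize(text, voice)` callable stands in for whatever billable call your provider's client exposes. The cache is kept in memory for simplicity; a real deployment would more likely store the audio on disk or in a shared store.

```python
import hashlib
from typing import Callable, Dict


class CachedSynthesizer:
    """Wraps any text-to-audio function with an in-memory cache (illustrative only)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        # `synthesize` is an assumed callable: (text, voice) -> audio bytes.
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Collapse whitespace so trivially different strings share one entry,
        # and include the voice so different voices never collide.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice}\x00{normalized}".encode("utf-8")).hexdigest()

    def get_audio(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        if key in self._cache:
            self.hits += 1          # no provider call, so no charge and no wait
            return self._cache[key]
        self.misses += 1            # first time this text is seen: pay once
        audio = self._synthesize(text, voice)
        self._cache[key] = audio
        return audio


def fake_provider(text: str, voice: str) -> bytes:
    # Hypothetical stand-in for a real, billable TTS request.
    return f"{voice}:{text}".encode("utf-8")


if __name__ == "__main__":
    tts = CachedSynthesizer(fake_provider)
    for _ in range(3):
        tts.get_audio("Thank you for calling.", "voice-a")
    print(f"provider calls: {tts.misses}, served from cache: {tts.hits}")
```

Only whole messages that repeat exactly benefit from this approach. Messages with personalized fields (a name, an amount) would need to be split into fixed and variable fragments before caching pays off, at the possible expense of less natural intonation.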
Supported languages and variants
Depending on what you want to achieve, this point may be more or less relevant. Most of the tools we looked at have excellent support for English, but the number and quality of Spanish variants are quite limited, and in many cases this matters. For example, if we want to build a transactional IVR in Latin America and the TTS provider only has voices for Spanish (Spain), the accent will certainly not go unnoticed.
One of the best examples is Google's recently released TTS API, which has impressive support for English, a wide variety of voices, and the ability to generate audio with WaveNet voices, a technology that makes the voice sound quite natural. However, its support for Spanish is quite poor, and it is widely surpassed by Amazon Polly, Nuance or IBM Watson.
In summary…
When selecting a TTS provider, the idea is to do a cost-benefit analysis across the factors set out above. For one of our projects we carried out such a comparison of several TTS tools.
The text used in this case was the following:
"Hello Mauricio. We are calling you from Banco Americano to remind you that as of today you have a debt of $258,870. We invite you to update yourself before 04/20/2018 and thus avoid being reported to the risk centers. For more information you can call the telephone number 3148901850 or visit our website www.bancoamericano.com. Thank you so much"
- Amazon Polly
- Microsoft Bing text to speech
- IBM Watson
- Nuance (better known as Loquendo)
- iSpeech
- Google Cloud TTS
- Festival (free alternative)
Regarding quality, as we have said, this point is very subjective, and some people will prefer one service over another. From our perspective, however, the best are:
- Amazon Polly
- IBM Watson
- Microsoft Bing text to speech
- Nuance (better known as Loquendo)
- iSpeech
Considering the needs of the example project, we found that transactions average 500 characters, which allowed us to normalize each provider's cost to dollars per transaction. For example, Polly, which charges $4 per million characters (with 5 million free characters per month during the first twelve months), would cost us $0.002 per transaction, while Nuance, which charges per transaction regardless of the number of characters, would cost us $0.008 per transaction.
When the project involves playing audio in several languages and variants, the number of supported languages and voices must also be taken into account when selecting the provider. Here, Nuance, Amazon and Bing offer a wide selection, with Nuance being the most diverse in the case of Spanish.
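The normalization to dollars per transaction is simple arithmetic, and a short sketch makes it easy to repeat with other providers or message sizes. The function names below are our own, the prices are the ones quoted in this post, and the free tier is deliberately ignored, so treat the output as a rough comparison rather than a billing estimate.

```python
def cost_per_transaction(price_per_million_chars: float, avg_chars: int) -> float:
    """Per-transaction cost for a provider that bills by character."""
    return price_per_million_chars * avg_chars / 1_000_000


def break_even_chars(price_per_transaction: float, price_per_million_chars: float) -> float:
    """Message length at which per-character and per-transaction billing cost the same."""
    return price_per_transaction / price_per_million_chars * 1_000_000


AVG_CHARS = 500                 # average transaction size in the example project
POLLY_PER_MILLION = 4.0         # dollars per million characters, as quoted above
NUANCE_PER_TRANSACTION = 0.008  # dollars per transaction, as quoted above

polly = cost_per_transaction(POLLY_PER_MILLION, AVG_CHARS)
threshold = break_even_chars(NUANCE_PER_TRANSACTION, POLLY_PER_MILLION)

if __name__ == "__main__":
    print(f"Polly:  ${polly:.3f} per transaction")
    print(f"Nuance: ${NUANCE_PER_TRANSACTION:.3f} per transaction")
    print(f"Break-even message length: about {threshold:.0f} characters")
```

With these prices, per-character billing stays cheaper until messages reach roughly 2,000 characters, which is why short, frequent messages tend to favor a provider that charges per character.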
Thus, we make the following comparative table:
Conclusion
In our case, for the project presented as an example, we decided to select Polly for its cost/quality ratio. Although Nuance, IBM and Bing may have slightly better quality, their cost is very high compared to the marginal quality gained. When analyzing Google, we found it to be the best alternative if the audio is only in English, and its new WaveNet voices are worth highlighting, since they raise the quality above the others. Finally, if quality is not an issue and you are not willing to pay a dime for TTS, Festival is a free alternative worth considering.