How Apple is making Siri sound more human

One of the most famous synthesised voices belongs to Stephen Hawking. Created a number of years ago, Professor Hawking's voice is instantly recognisable to millions of people around the world. But if you think about it, the technology used for creating his speech has advanced drastically over the last decade, yet his voice hasn't changed. Changing it would be to change who he is and how you expect him to sound when he talks.

It's different when it comes to personal voice assistants on smartphones though. We aren't as precious about keeping the same vocal sounds first developed over a decade ago. We are happy for the likes of Apple's Siri and Amazon's Alexa to change and sound better over time, to sound more human, to sound more genuine.

Apple has improved how its personal assistant sounds every year since Siri launched on the iPhone 4S in 2011. Its perpetual goal is to make her sound as human as possible by using a combination of new and old speech synthesis techniques. The end result is an assistant that sounds as lifelike as possible.

And in 2017, with the launch of iOS 11, Apple has made a significant breakthrough. Siri is getting a new voice.

Out goes the clunky voice of old, and in comes speech that is considerably more nuanced than in iOS 9 or 10. It's a voice that Apple hopes will endear Siri to people even more going forward.

"For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality and expressivity of Siri’s voice," explained Apple's Siri department.

After evaluating hundreds of candidates, the team chose the best one and had the winning voice actress record more than 20 hours of speech for processing.

That recorded speech was then sliced into its elementary components, such as half-phones. Those components are recombined according to the input text to create entirely new speech.
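
As a rough illustration of the slice-and-recombine idea, here is a toy sketch in Python. It is not Apple's code: the phone labels, the `INVENTORY` table and the function names are all invented for this example, and real audio is replaced by short lists of numbers. Each phone is split into a left and a right half, and an utterance is built by looking up one stored snippet per half-phone and joining them.

```python
# Toy sketch: every name and value here is hypothetical.
def to_half_phones(phones):
    """Split each phone label into a left half and a right half."""
    halves = []
    for p in phones:
        halves.append(p + "_L")
        halves.append(p + "_R")
    return halves

# Pretend inventory: each half-phone maps to a few audio samples.
INVENTORY = {
    "h_L": [0.0, 0.1], "h_R": [0.2, 0.1],
    "ay_L": [0.5, 0.7], "ay_R": [0.6, 0.3],
}

def synthesise(phones, inventory=INVENTORY):
    """Join the stored samples for each half-phone into one waveform."""
    waveform = []
    for unit in to_half_phones(phones):
        waveform.extend(inventory[unit])
    return waveform

# "hi" as two phones, giving four half-phone snippets joined end to end.
hi = synthesise(["h", "ay"])
```

A real system holds many recorded candidates for every half-phone rather than one, which is why choosing between them becomes the hard part.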

You can immediately hear the improvements the moment you say "Hey Siri":

https://clyp.it/z2fzurzo/widget

https://clyp.it/coemz2pb/widget

https://clyp.it/30hwyksa/widget

Although recording speech and then recombining the slices into new words had worked well in the past, Apple believed it could do better.

The approach Apple took wasn't easy. You have to select the appropriate "phone" segments and join them together. The acoustic characteristics of each "phone" depend on its neighbouring "phones" and the pattern and rhythm of speech, which often makes the speech units incompatible with each other.
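
To make the selection problem concrete, here is a minimal sketch of classic unit selection, assuming each candidate unit is described by just three made-up numbers (`pitch`, `start` and `end`). The two cost functions and the data are illustrative assumptions, not Apple's features. A target cost measures how well a candidate suits the sentence, a concatenation cost measures how badly two neighbours clash at the join, and a Viterbi search finds the cheapest path through all the candidates.

```python
# Toy unit selection: candidates, features and costs are illustrative assumptions.
def target_cost(unit, wanted_pitch):
    """How far a candidate's pitch is from what the sentence calls for."""
    return abs(unit["pitch"] - wanted_pitch)

def concat_cost(prev_unit, unit):
    """How badly two neighbouring units mismatch at the join."""
    return abs(prev_unit["end"] - unit["start"])

def select_units(candidates, wanted_pitches):
    """Viterbi search for the cheapest path through the candidate lists.

    candidates[i] is the list of recorded options for position i.
    Returns (total_cost, chosen_unit_names).
    """
    # best[j] = (cost so far, path of names) for paths ending in candidate j
    best = [(target_cost(u, wanted_pitches[0]), [u["name"]])
            for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            tc = target_cost(u, wanted_pitches[i])
            # Cheapest way to arrive at u from any candidate at position i - 1
            cost, path = min(
                ((prev_cost + concat_cost(prev_u, u), prev_path)
                 for (prev_cost, prev_path), prev_u
                 in zip(best, candidates[i - 1])),
                key=lambda item: item[0],
            )
            new_best.append((cost + tc, path + [u["name"]]))
        best = new_best
    return min(best, key=lambda item: item[0])

candidates = [
    [{"name": "a1", "pitch": 100, "start": 0, "end": 9},
     {"name": "a2", "pitch": 102, "start": 0, "end": 5}],
    [{"name": "b1", "pitch": 100, "start": 5, "end": 0},
     {"name": "b2", "pitch": 130, "start": 9, "end": 0}],
]
total, chosen = select_units(candidates, wanted_pitches=[100, 100])
# chosen is ["a2", "b1"] with a total cost of 2
```

In this example `a1` matches the wanted pitch exactly, yet the search picks `a2` because it joins cleanly with `b1`. That trade-off between a good individual match and a smooth join is the neighbour-dependence the paragraph above describes.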

It's why previous iterations of Siri sounded robotic at times.

To solve the problem, Apple turned to deep learning and created a system that can "accurately predict both target and concatenation" elements in the database of half-phones that it has access to.

"The benefit of this approach becomes more clear when we consider the nature of speech. Sometimes the speech features, such as formants, are rather stable and evolve slowly, such as in the case of vowels. Elsewhere, speech can change quite rapidly, such as in transitions between voiced and unvoiced speech sounds. To take this variability into account, the model needs to be able to adjust its parameters according to the aforementioned variability," the team explained.
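
One way to picture what the team is describing, offered here as an assumption rather than something the quote spells out, is a model that predicts not only the expected value of a speech feature but also how much that feature is allowed to vary. The sketch below scores a candidate with a Gaussian negative log-likelihood; the function name and the numbers are invented for illustration.

```python
import math

def gaussian_cost(observed, mean, variance):
    """Negative log-likelihood of `observed` under a Gaussian prediction."""
    return 0.5 * (math.log(2 * math.pi * variance)
                  + (observed - mean) ** 2 / variance)

# The same 50 Hz miss in a formant value, with two different predicted variances.
stable_vowel = gaussian_cost(observed=550.0, mean=500.0, variance=100.0)
fast_transition = gaussian_cost(observed=550.0, mean=500.0, variance=10000.0)
```

Where the model expects a stable sound and predicts a small variance, the 50 Hz miss is penalised heavily. Where it expects a rapid transition and predicts a large variance, the same miss costs much less, so the weighting adapts to the speech itself instead of being tuned by hand.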

The end result? Siri now not only sounds more human, but is also a marked improvement on former versions.

When you talk to Siri in iOS 11 (we've been playing with the Public Beta) there are noticeable differences in the way she is able to reply to your questions. She's no Scarlett Johansson in the film Her, but Siri's transition to that model has started.

You'll be able to hear the new Siri when it arrives with iOS 11, which is due out this month.