Why Voice Authentication Should Not Be Used to Secure Critical Assets


We know voices can be duplicated with almost trivial effort, but until now I’d never found a use case that made it worth actually trying. Much to my surprise, when I recently called banking giant HSBC on its phone banking line, I was prompted to register for voice authentication.

I duly obliged, not because I thought this would improve my experience, but because I genuinely had not come across an implementation that tried to secure a critical asset (like your money) using voice authentication.

First off, before you run off and attempt to log into everyone’s bank account with their duplicated voice, you need two pieces of preliminary information:

  1. One of: Account number (with sort code), 16-digit card number or customer ID number
  2. Your date of birth

The first identifier is the harder one to get and would require a targeted approach or some social engineering, but you only need one of the three. The second is almost trivial to find nowadays if the person has a social media presence: just look for the birthday cakes on their accounts and read the comments.

Once you have that information, you’re presented with the voice authentication prompt. To register, you need to repeat the phrase “my voice is my password” four times, which is enough for them to create a voice sample. That’s it.

When you authenticate, the phrase is the same. Yes, that means every single HSBC customer that has activated voice authentication uses the same authentication phrase “my voice is my password.” This makes it even easier.

Duplicating Someone’s Voice is Easy

So what does it take to duplicate someone’s voice? Well, it takes about five minutes. Find a public recording of them on any social media post. I took a recording of myself from a security conference, sampled one minute’s worth of audio and then used Lovo.ai or speechify.com to clone the voice (both are free to use). Type in the text you want the cloned voice to say – again, everyone has the same phrase.

Then you need to play it back into the phone conversation, which is literally just having the MP3 file play on your phone when it asks you to authenticate your voice. If you wanted to automate this, you could. For example, the Twilio API, even under its free tier, lets you play specific files after specific prompts programmatically. Skype can also inject audio files into calls using drag and drop. All of this is, of course, completely free.

But what is the potential damage if someone accesses your phone banking? According to HSBC, you can do the following: check your balance, make payments, pay bills, transfer money, set up standing orders, update your details and block your card or report it stolen. Quite a lot then.

And what did the bank have to say about all this? I contacted their customer team explaining my concerns around voice cloning and voice authentication and this was their response:

"There is nothing worry about, I understand that you are concerned if your voice can be imitated. However this is not as easy as it seems to be. From my end I can assure that there cannot be any security breaches on your account. If you are still very concerned, you can have a word with our telephone banking on how the voice passwords work. Please contact our Telephone Banking Helpdesk on 03457 404 404 and from overseas +44 1226 261 010, the team works from Monday to Sunday between 8 am to 8 pm, who should be able to look into this for you."

Quite contradictory and not very reassuring considering I’d just cloned my own voice and accessed my account in about five minutes. Could the implementation be improved?

The most obvious improvement is to remove the default phrase that every HSBC customer uses. For example, randomly generating a set of words for the caller to say would work, but you’d then need a larger voice sample to make sure their voice can be recognized across the whole spectrum of the English language. That would frustrate users, who would have to talk for a few minutes just to register.
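As a rough illustration of what a randomly generated phrase could look like, here is a minimal Python sketch. It is not how HSBC or any vendor does it; the word list, the function names and the four-word default are all assumptions made up for this example. It also shows how the number of possible phrases grows compared with a single fixed phrase.

```python
import secrets

# Hypothetical word list for illustration only; a real system would
# need a far larger vocabulary than these ten words.
WORDS = [
    "amber", "river", "candle", "orbit", "maple",
    "signal", "velvet", "harbor", "pepper", "lantern",
]


def make_challenge(n_words: int = 4, words: list[str] = WORDS) -> str:
    """Return a random phrase for the caller to repeat.

    secrets.choice draws from the operating system's secure random
    source, so the next phrase cannot be predicted from earlier ones.
    """
    return " ".join(secrets.choice(words) for _ in range(n_words))


def phrase_space(n_words: int = 4, vocabulary_size: int = len(WORDS)) -> int:
    """Count the distinct phrases possible (words may repeat)."""
    return vocabulary_size ** n_words


if __name__ == "__main__":
    print("Please say:", make_challenge())
    print("Possible phrases:", phrase_space())
```

With a fixed phrase there is exactly one thing an attacker has to prepare in advance. With random phrases, a pre-recorded clip stops being enough, although the phrase still has to be spoken by a voice the system accepts.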

This also wouldn’t defeat voice cloning since the cloned voices are generated using the same mechanism – longer sample equals better accuracy. Generating them on the fly also isn’t a barrier. For example, with Speechify there’s a short wait of around 10-15 seconds before whatever you’ve typed is generated in the cloned voice as audio. The prompt gives you a few tries to authenticate via voice so you have plenty of time to generate new cloned audio depending on what is being asked.

What’s the solution? Simply put, don’t use it. Even as one layer in multi-factor authentication, a voice is so trivial to duplicate that it shouldn’t be relied upon to secure anything, especially your money.
