There is a great deal of public concern about deepfakes, most of it centered on the ramifications of being able to quickly and easily face-swap videos. That concern is certainly well-founded, but it may be obscuring an even more immediate threat – deepfake audio. Voice-swapping has already been put to use in at least a handful of artificial intelligence (AI) cyber attacks on businesses, enabling attackers to gain access to corporate networks and convince employees to authorize a money transfer.
The future of AI cyber attacks
The primary use of deepfake audio is to enhance a very common type of attack – business email compromise (BEC).
A business email compromise attack usually begins with some sort of phishing to gain access to the company network and reconnoiter the payment systems. Once the attackers have identified the employees who are authorized to release payments and some of the regular transactions that occur, they impersonate a CEO or CFO to pass a false authorization for a payment to an entity made up to look like one of the company’s regular business partners.
Up until now, hackers have relied on forging and spoofing emails to commit BEC. The ability to use deepfake audio provides them with a powerful new tool to enhance this very popular form of malicious activity. Attackers usually rely on pressure to carry off the attack, playing the role of the executive harrying the finance employee. The ability to call these employees up on the phone and use the technology to impersonate senior leadership not only adds to the authenticity of the request, but also allows them to dial up the pressure.
How does deepfake audio work?
Deepfake audio is one of the most advanced new forms of AI cyber attack in that it relies on a machine learning algorithm to mimic the voice of the target. The AI uses a generative adversarial network (GAN), in which two neural networks constantly compete with each other: one creates a fake, the other tries to identify it as fake, and each learns from every new attempt.
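The adversarial loop described above can be illustrated with a deliberately tiny sketch. It is not a voice model: the "real" data are just numbers drawn around 4.0, the generator and discriminator are single-line formulas rather than deep networks, and every name and constant here (train_toy_gan, the learning rate, the step count) is an assumption chosen for illustration. A model this simple will not necessarily settle on a good imitation; the sketch only shows the alternation between a discriminator update and a generator update.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Train a one-dimensional toy GAN and return (a, b, w, c).

    Generator:      G(z) = a * z + b           (turns noise into a "fake")
    Discriminator:  D(x) = sigmoid(w * x + c)  (estimated chance x is real)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters (hypothetical starting values)
    w, c = 0.1, 0.0   # discriminator parameters (hypothetical starting values)

    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)   # stand-in for a genuine sample
        z = rng.gauss(0.0, 1.0)      # random noise fed to the generator
        fake = a * z + b

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0
        # by descending the loss -[log D(real) + log(1 - D(fake))].
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        grad_w = (d_real - 1.0) * real + d_fake * fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: push D(fake) toward 1 (fool the discriminator)
        # by descending the loss -log D(fake).
        d_fake = sigmoid(w * fake + c)
        grad_fake = (d_fake - 1.0) * w
        a -= lr * grad_fake * z
        b -= lr * grad_fake

    return a, b, w, c
```

Calling train_toy_gan() returns the four learned parameters. Real voice-cloning systems apply the same two-player idea to audio features with far larger networks and far more data.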
As with the fake videos, the attackers create a voice model by feeding the algorithm “training data”: all sorts of voice clips of the target, often collected from public sources like speeches, presentations, corporate videos and interviews.
However, deepfake audio is much more flexible than deepfake video at present. With deepfake video, the training model needs to be fed a base video to swap the target’s face onto. Deepfake audio needs no such base recording: once a robust enough voice profile is built, it can be used with specialized “text-to-speech” software to read any script the attacker writes.
It can take considerable time and resources to create a truly convincing deepfaked voice, something that may be cost-prohibitive to attackers. The most advanced tools can create a voice profile from 20 minutes of audio, but in most cases the process takes much longer and is very resource-intensive. Dr. Alexander Adam, data scientist at AI training lab Faculty, estimates that training a very convincing deepfake audio model costs thousands of dollars in computing resources. However, the attacks seen in the wild thus far have cleverly used background noise to mask imperfections, for example simulating someone calling from a spotty cellular phone connection or being in a busy area with a lot of traffic.
Symantec elaborated on the computing power and the voice resources needed to create a convincing deepfake, noting that the algorithm needs an adequate amount of speech samples that capture the speaker’s natural speech rhythms and intonations. That means that attackers need access to a large body of clear voice samples from the target to properly train the algorithm. It would be prudent for upper-level executives who have the authority to issue payments to review their available body of public audio to determine how much of a risk there is, and perhaps implement added verification requirements for those individuals. As this technique takes its place among the more common AI cyber attacks, the possibility that an attacker might engage a target in a phone or in-person conversation to obtain the voice data they need should also be considered.
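A review of an executive’s public audio could start as a simple tally, as in the sketch below. The PublicClip record, the function names and the sample labels are all hypothetical. The 20-minute yardstick is borrowed from the figure quoted earlier for the most advanced tools and is only a rough reference point, not a threshold below which someone is safe.

```python
from dataclasses import dataclass


@dataclass
class PublicClip:
    source: str      # hypothetical label, e.g. "conference keynote"
    seconds: float   # length of clear speech by the executive


# Assumption: the "20 minutes of audio" figure is used only as a rough
# yardstick for comparison, not as a proven safety limit.
YARDSTICK_SECONDS = 20 * 60


def total_public_audio(clips):
    """Sum the clear-speech seconds across all catalogued public clips."""
    return sum(clip.seconds for clip in clips)


def exceeds_yardstick(clips, yardstick=YARDSTICK_SECONDS):
    """Report whether the catalogued audio reaches the chosen yardstick."""
    return total_public_audio(clips) >= yardstick
```

An executive whose catalogued clips reach the yardstick would be a natural candidate for the added verification requirements mentioned above.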
To give you an idea of how far along the upper end of this AI-based software is, consider the Adobe presentation from 2016 in which actor/director Jordan Peele’s voice is deepfaked in real time. Peele later made a landmark deepfake video in 2018 in which he used AI technology to impersonate former president Barack Obama, complete with convincing fake audio. While deepfakes that impersonate senior company executives are still very rare, they have been used to compromise both private and public leadership. It is suspected that the early 2019 attempted coup in the small African nation of Gabon was triggered by an attempt to use a deepfake that backfired.
Symantec stated that they are working on analysis methods that could review the audio of a call and give the recipient a probability rating of how authentic it is. There are existing technological means to prevent these attacks, but at present they would be expensive to implement and are not yet ready to be adapted to address deepfake audio calls. One such possibility is to use a certification system for inter-organizational calls. Another is the use of blockchain technology with voice-over-IP (VoIP) calls to authenticate the caller.
In the interim, protection against these menacing new AI cyber attacks ties in with basic cyber security in handling all forms of BEC and invoicing fraud – the foundation is employee education. Many employees are not aware of what deepfake videos are, let alone the possibility that faked audio can be used to simulate a call from a superior. Simple education can motivate an employee to question an unusual payment or network access request.
In addition to training, fundamental BEC protection methods like filtering and authentication frameworks for email can help to cut these attacks off at the pass by preventing cyber criminals from phishing their way into the network. Standard payment protocols that require multi-factor authentication or a return call placed to the authorizing party also do a great deal to shut even the most advanced AI cyber attacks down.
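The payment rule described above can be written down as a small policy check. This is a minimal sketch under stated assumptions: the PaymentRequest fields, the directory lookup and the function names are hypothetical and do not reflect any particular payment system. The point it illustrates is that a convincing voice on the phone is never enough on its own, and that the return call goes to a number already on file rather than one supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class PaymentRequest:
    requester: str                     # who claims to authorize the payment
    amount: float
    channel: str                       # how the request arrived: "email", "phone", ...
    mfa_verified: bool = False         # multi-factor authentication succeeded
    callback_confirmed: bool = False   # confirmed on a return call to a number on file


def callback_number(directory, requester):
    """Return the number on file, or None; ignore any number the caller offers."""
    return directory.get(requester)


def may_release(request):
    """Allow release only after an independent verification step has succeeded."""
    return request.mfa_verified or request.callback_confirmed
```

Under this rule, an urgent phone request that sounds exactly like the CFO is still held until one of the two checks has passed.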