This Caller Does Not Exist: Using AI to Conduct Vishing Attacks

by Andre Rosario on August 7, 2024

As technology advances, threat actors increasingly leverage new innovations to conduct social engineering attacks such as vishing, which target the human element that technical controls often cannot fully protect.

What is Vishing?

Voice phishing, also known as vishing, is a type of social engineering attack that uses phone calls or audio messages as a delivery method. These calls deceive people into divulging personal, financial, or otherwise sensitive information. Vishing calls can also be used to convince people to carry out a malicious action, such as resetting the password of a user account or transferring money to a threat actor’s bank account.

Social engineering attacks are increasing year-over-year in complexity, sophistication, and frequency. Threat actors are looking for alternative attack paths as defensive technologies improve at detecting, mitigating, and stopping cyberattacks. These technological advancements are apparent in the prolific use of artificial intelligence and machine learning in defensive operations. However, AI is being used in offensive operations as well. In this blog post, GuidePoint’s Threat and Attack Simulation Team (TAS) will explore how threat actors can leverage AI-generated voices to social engineer their targets and provide guidance on how to protect yourself and your organization.

The following is a quick reference on the tools and platforms used in this post:

ElevenLabs – Used to create AI-generated voices using text-to-speech or speech-to-speech capabilities. Voices can also be cloned by uploading a one-minute video or soundbite.
Soundux – A free and open-source soundboard that can be used to play AI-generated speech during a phone call.
Voicemeeter – An audio mixer application that can create virtual sound cards and allow Soundux audio to be used in voice-over-internet protocol (VoIP) calls or virtual meetings. Background sounds can also be introduced to add legitimacy to these attacks.
Google Voice – A VoIP provider that can place phone calls from a computer.

Use Cases and Scenarios

Threat actors routinely use social engineering attacks to go after “high-value” targets. What makes a person a “high-value” target? It depends on the threat actor, but typically an attacker is going to target someone who has access to:

Sensitive financial information.
Intellectual property.
Internal servers or data centers.

No matter what job role a person has, they have access to information that threat actors want. This means that everyone is a potential target for social engineering attacks. The following scenarios are real-world examples of social engineering attacks where a threat actor uses AI-generated voices.

A threat actor discovers an interview featuring the CEO of a company. Using this audio, the CEO’s voice is cloned using AI. The threat actor then calls a subsidiary company to request a wire transfer and receives $243,000. (Source: Forbes)
A threat actor discovers valid credentials in a data breach. They clone the voice of an employee using AI and contact the company’s IT help desk. The threat actor convinces the IT help desk to reset the user’s multi-factor authentication profile. After gaining access to the internal network, the threat actor deploys ransomware causing nearly $100,000,000 in damages. (Source: CyberArk)

The best way to defend against these attacks is to educate ourselves on how threat actors operate and to become familiar with the tools, techniques, and procedures used to carry them out.

Why Use Text-To-Speech?

Real-time voice changing with AI-generated voices can be challenging, and the current technology is not yet as realistic or convincing as text-to-speech. However, given the pace of innovation in AI voice technology, that may change very soon!

Generating AI Voices

ElevenLabs offers free and paid subscription plans to create AI-generated audio. While the default voices provided by ElevenLabs work well, they sound more like a voiceover for a commercial than a user calling to reset their password.

ElevenLabs has a voice library that includes various voice models that can better handle natural speech. This can be accessed by navigating to the “Voices” tab, then to “Voice Library”.

Advanced filters can be applied to determine the type of voice, age, language, and accent used by clicking the filter button on the right side of the filter bar.

The voice you pick will be added under your “Voices” tab where you can generate audio from text.

Each voice can be modified to change the tone and overall sound of the generated audio. This is also where you can clone a voice using existing audio or video. This is a premium feature and requires a paid subscription, but any custom voices will be saved within the “VoiceLab” tab.

Using the Multilingual model, audio can be produced in multiple languages depending on the prompt. Various modifications can be made to change how the AI voice sounds.

Stability will determine how “natural” the voice sounds. A more variable stability setting will add intonation, inflection, and tone to different areas of the prompt.
A higher similarity will make the AI voice sound more robotic. Each voice model is unique, and some may require changes to the similarity setting percentage to sound more “natural”.
Some voice models will allow you to change the style exaggeration percentage. Think of this as increasing the emotions in a message. These can sometimes sound jarring, and with each model being unique, there will need to be minor changes to get the audio to sound perfect.

Adding Pauses, Stammering and Filler Words

To better mirror natural speaking, you can add special tags within the prompt that will make the AI voice pause or throw in some “uhhs” or “ahhs” between words.

Adding a break tag will make the voice pause. Keep in mind that the maximum pause is three seconds.

“This is an example of what my AI voice will say” <break time="2.5s" /> “After a short break I am back!”

Additional stammering or filler phrases can be added by using an ellipsis (…) between words.

… This is an example of what my AI voice will say … Let’s get to hacking!

Sometimes the audio might be too rushed, or there might be a pause in the wrong place. You can regenerate the speech and get a new audio file.

These audio files can also be slowed down using free audio editing software such as Audacity. After installing Audacity and opening the audio file, select the clip and click the “Effect” tab. Next, expand the “Pitch and Tempo” submenu. To slow the clip down without changing the pitch, select “Change Tempo”; the “Change Speed and Pitch” effect in the same submenu alters both the speed and the pitch.

Save the generated or edited files locally in a new project folder. This project folder will be used later by Soundux to play the AI-generated audio through a phone call. Name each file after its prompt, which will make it easier to find the sentence you want to play on the call.

Creating Prompts

Good storytellers have captivated people for millennia. Social engineering attacks are like stories: threat actors imitate the reality we see around us to deceive and trick people into acting. Threat actors can be thought of as theatrical actors, in the sense that developing a realistic attack scenario or pretext requires a performance.

Have you ever called an IT help desk or technical support about an annoying issue? Social engineering attacks follow the same conventions and conversational flow.

“Thank you for calling the Acme Co. help desk, this is Adam speaking.”

“Hey there Adam, this is James Doe. I’m hoping you can help me out here, I tried changing my password on my own and it didn’t work. Can you give me a temporary password so I can get back to work?”

These conversational flows can be pre-scripted. Writing out a response to every likely question or reaction allows a threat actor to have an AI-generated response ready for almost any situation.

Creating a Sound Board

Once the voice files are saved, they will be played using Soundux, a free and open-source soundboard project. A Soundux installer is available for download, or the executable can be compiled from the source code in the project’s GitHub repository.

Soundux may produce an error about the audio output. Because we are using a custom audio engine, this error can be ignored.

After installing the software, configure Soundux to use the audio project folder that was created earlier by clicking “Add Tab”. This will import the audio files and allow them to be played.

Afterward, click “Reload” to refresh the list of audio files. Click the title of a file to play it through your assigned output device.

Installing and Configuring the Audio Engine

Voicemeeter offers a free audio mixing engine that can create virtual sound ports on your machine. This means that you can route audio from a program like Soundux and use that audio as input to another program such as Google Voice or any other VoIP provider.

Voicemeeter Banana can be downloaded from the VB-Audio website.

Next, Voicemeeter and Soundux will be configured to work together. Within Soundux, change the “Output Device” to “Voicemeeter Input”.

This will force Soundux to output the AI-generated audio to Voicemeeter. The “Voicemeeter Input” value is enabled by default. To hear the AI-generated audio, “A1” must be enabled for the “Voicemeeter Input” column. Enabling “B1” will allow the audio to be played in a virtual port and used by other programs like a microphone. The following image explains each function within Voicemeeter.

Additional adjustments can be made such as increasing the bass of the audio by enabling the “EQ” in the virtual audio out column. The volume of the AI-generated voice can be lowered if needed using the fader gain slider in the same column.

Soundux should now output sound using Voicemeeter as an audio engine. The sound can be tuned and modified depending on the audio samples.

Conducting the Vishing Attack

VoIP providers such as Google Voice, Twilio, or Vonage can be used to make phone calls using a desktop computer. These aren’t the only platforms that can be used to carry out vishing attacks. Software such as Microsoft Teams, Zoom, or Cisco WebEx can be used to start audio calls. For example, a threat actor can impersonate an internal user, clone their voice, or create a voice within their demographic, and invite another person to an audio-only meeting.

For this example, Google Voice will be used to make phone calls.

After configuring a free phone number or transferring an existing phone number, Google Voice will need to be configured to use the virtual output of Voicemeeter as the microphone. These settings can be accessed by clicking the headset button shown below.

Configure the microphone within Google Voice and select “Voicemeeter Out B1”. Ensure “B1” is enabled under the “Voicemeeter Input” column. Next, set the ringing and speaker settings to “Voicemeeter AUX Input”. The following image highlights the associated areas in their respective colors.

Make a test call using Google Voice to test the attack and adjust the output of the AI-generated voice as needed. To adjust the volume of the AI voice, lower the fader gain in the first virtual output column as seen below.

Click each audio clip within Soundux that your AI voice will say during the conversation. The audio from the person on the other end of the call will be played through your headphones or speakers, allowing you to quickly reply as if you were talking on the phone.

Defending Against AI Vishing Attacks

As the use of AI increases, so will the abuse of AI. Threat actors will always update their tactics to match the latest advances in technology. There are some ways to protect your organization and protect yourself. The term “trust but verify” is at the core of defending against vishing attacks.

Educate users on the rise of AI-based social engineering attacks. Users need to scrutinize not only emails but also phone calls and meetings. Implement this education within your security awareness training.
Implement strong identity verification steps across help desks and support groups. Ensure that phone calls are verified with a form of multi-factor authentication (MFA).

What Does Strong Identity Verification Look Like?

The first step would be to move away from only requiring personal information as a verification measure. Threat actors often conduct reconnaissance on their victims ahead of an attack, looking for birthdates, addresses, pets, or the names of people’s children. All this information can be found on social media, public records, or public data leaks.

If a user calls the help desk requesting any activity that could change the details of an account, such as a password or a multi-factor authentication token, the principle of “trust but verify” should come into effect. One way to stop vishing attacks is to require the caller to verify their identity with something they physically possess.
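The rule described above can be sketched as a simple policy check. This is a minimal illustration, not a description of any particular help desk product: the action names, the `HelpDeskRequest` type, and the `may_proceed` function are all hypothetical and would need to be adapted to an organization's own ticketing workflow.

```python
from dataclasses import dataclass

# Hypothetical list of account-changing actions; each organization
# would define its own set.
SENSITIVE_ACTIONS = {"password_reset", "mfa_reset", "mfa_enroll", "email_change"}


@dataclass
class HelpDeskRequest:
    claimed_identity: str
    action: str
    # True only after the caller has proven possession of a registered
    # device (for example, an approved push notification or a device code).
    possession_factor_verified: bool = False


def may_proceed(request: HelpDeskRequest) -> bool:
    """Allow account-changing actions only after a possession check."""
    if request.action not in SENSITIVE_ACTIONS:
        return True
    return request.possession_factor_verified
```

Under this sketch, a convincing voice and correct personal details are never sufficient on their own: a request such as `HelpDeskRequest("James Doe", "password_reset")` is refused until the possession check has succeeded.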

If a user requests a password reset, the help desk can send a push notification or request a verification code from their device that handles multi-factor authentication. A remote threat actor would need to have physical access to the legitimate user’s device or carry out extensive attacks to gain access to their phone or physical authentication device.

“Trust but verify” can apply to your personal life as well. Threat actors don’t only target organizations and governments but increasingly target everyday people directly.

We all get phone calls from trusted sources, some of which may require sensitive information. A good measure is to call them back directly and request some form of authentication. This will rule out caller ID spoofing and get you one step closer to verifying the call. Perhaps they could send a push notification to your phone or send you an email to verify the call.
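The callback habit can be expressed as a small lookup. This is only a sketch: the directory contents, the phone numbers, and the `callback_number` function are hypothetical. The key design choice is that the number to call back comes from a source you already trust (a bank card, an official website, an internal directory) and never from the incoming call itself.

```python
from typing import Optional

# Hypothetical directory built from trusted sources, not from incoming calls.
TRUSTED_DIRECTORY = {
    "example bank": "+1-555-0100",
    "acme it help desk": "+1-555-0101",
}


def callback_number(claimed_organization: str, caller_id: str) -> Optional[str]:
    """Return the trusted number to call back, or None if there is none.

    The caller_id argument is deliberately ignored because caller ID
    can be spoofed and proves nothing about who is calling.
    """
    return TRUSTED_DIRECTORY.get(claimed_organization.strip().lower())
```

If the function returns `None`, there is no independently known number for the organization, and the call should be treated as unverified.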

No matter what, properly verifying communications can mitigate the risk of a successful social engineering attack. As the threat landscape changes and technology progresses, we all must have a sense of caution and “trust but verify” as a standard moving forward.