How do we use AI to generate new audio from scratch?
In my last article in this series on The AI Developer’s Toolkit, I introduced you to the three most popular AI tools for audio analysis. These tools allowed us to extract useful information from audio recordings.
However, there are many cases where we want to generate new audio from scratch. This set of tasks is referred to as audio synthesis.
In this article, I’ll introduce you to the three most popular AI tools for audio synthesis.
Speech synthesis (a.k.a. text-to-speech) allows a computer to communicate with us using spoken words. Essentially, it allows a computer to speak aloud using an artificial human voice.
For example, we can use speech synthesis to read text out loud. We provide the speech-synthesis model with a body of text as input. Then the model produces audio containing the spoken words as output.
Speech synthesis is useful anytime you need a computer to speak to a human using a natural-sounding voice. For example:
Voice synthesis allows us to synthesize a specific person’s voice using an existing audio recording in combination with edits to the text of the recording’s transcript. Essentially, it allows us to edit a person’s spoken words just like editing text in a word processor.
For example, we can use voice synthesis to change a mistake in a recorded sentence. We provide the voice-synthesis model with the original audio and an edited transcript as input. Then the model produces the updated audio containing the synthesized edits as output.
Voice synthesis is useful anytime you need to edit a specific person’s voice in an audio source. For example:
Speech translation allows us to convert spoken words in one language into spoken words in another language. Essentially, it allows us to translate speech in real time.
For example, we can use speech translation to translate spoken English into the French equivalent. We provide the speech-translation model with an audio recording saying “Hello World” as input. Then the model produces an audio recording saying “Bonjour le monde” as output.
Speech translation is useful anytime you need to translate one speaker’s language into a listener’s language. For example:
Beyond the three common examples we’ve seen, there are also a variety of other audio-synthesis tools. For example:
As we can see, audio-synthesis tools allow us to transform existing audio and generate new audio from scratch.
If you’d like to learn how to use all of the tools listed above, please watch my online course: The AI Developer’s Toolkit.
The future belongs to those who invest in AI today. Don’t get left behind!