June 15, 2021 Author: Matthew Renze

How do we use AI to generate new audio from scratch?

In my last article in this series on The AI Developer’s Toolkit, I introduced you to the three most popular AI tools for audio analysis. These tools allowed us to extract useful information from audio recordings.

However, there are many cases where we want to generate new audio from scratch. This set of tasks is referred to as audio synthesis.

In this article, I’ll introduce you to the three most popular AI tools for audio synthesis.

Speech Synthesis

Speech synthesis (also known as text-to-speech) allows a computer to communicate with us using spoken words. Essentially, it allows a computer to speak aloud using an artificial human voice.

For example, we can use speech synthesis to read text out loud. We provide the speech-synthesis model with a body of text as input. Then the model produces audio containing the spoken words as output.

Speech synthesis is useful anytime you need a computer to speak to a human using a natural-sounding voice. For example:

  • dictating responses to requests for hands-free applications
  • narrating the text contained in a document
  • creating more natural user interfaces
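To make the text-in / audio-out contract concrete, here's a toy sketch in Python. It maps each character to a short sine tone and writes a WAV file — it does not produce real speech. Actual text-to-speech systems use trained neural models (or vendor APIs) to predict speech waveforms from text; the character-to-frequency scheme below is purely a made-up illustration of the interface shape.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second

def text_to_tones(text: str, tone_seconds: float = 0.08) -> bytes:
    """Toy stand-in for a speech-synthesis model: text in, audio bytes out.

    Each character becomes a short sine tone (a hypothetical mapping);
    a real model would render natural-sounding speech instead.
    """
    frames = bytearray()
    for ch in text.lower():
        # Map the character to an audible frequency (arbitrary scheme).
        freq = 220.0 + (ord(ch) % 32) * 20.0
        n = int(SAMPLE_RATE * tone_seconds)
        for i in range(n):
            sample = int(12000 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
            frames += struct.pack("<h", sample)  # 16-bit little-endian PCM
    return bytes(frames)

# Write the "spoken" text to a playable WAV file.
audio = text_to_tones("Hello world")
with wave.open("hello.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(audio)
```

Swapping `text_to_tones` for a call into a real text-to-speech engine or cloud API gives you the same overall flow: a body of text goes in, audio containing the spoken words comes out.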

Voice Synthesis

Voice synthesis allows us to synthesize a specific person’s voice using an existing audio recording in combination with edits to the text of the recording’s transcript. Essentially, it allows us to edit a person’s spoken words just like editing text in a word processor.

For example, we can use voice synthesis to change a mistake in a recorded sentence. We provide the voice-synthesis model with the original audio and an edited transcript as input. Then the model produces the updated audio containing the synthesized edits as output.

Voice synthesis is useful anytime you need to edit a specific person’s voice in an audio source. For example:

  • editing mistakes in a recording from a podcast
  • eliminating the need for audio overdubs in films
  • creating natural-sounding AI voice agents
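The "edit speech like text" idea above can be sketched as a splicing operation: keep the original audio for words that didn't change, and call a voice-cloning model for words that did. Everything here is a hypothetical placeholder — the `segments` alignment, the `synthesize_word` callable, and the byte-string "audio" all stand in for what a real product would handle internally.

```python
def edit_recording(segments, edited_words, synthesize_word):
    """Splice word-aligned audio to match an edited transcript.

    `segments` maps each original word to its audio clip (bytes), and
    `synthesize_word` stands in for a voice-synthesis model that renders
    a new word in the speaker's own voice. Both are illustrative stubs.
    """
    audio = b""
    for word in edited_words:
        if word in segments:
            audio += segments[word]          # reuse the original clip
        else:
            audio += synthesize_word(word)   # synthesize the edited word
    return audio

# Example: fix "cat" -> "dog" in the recorded sentence "the cat sat".
segments = {"the": b"THE", "cat": b"CAT", "sat": b"SAT"}
fake_voice_model = lambda w: w.upper().encode()  # placeholder voice model
result = edit_recording(segments, ["the", "dog", "sat"], fake_voice_model)
print(result)  # b'THEDOGSAT'
```

The point of the sketch is the workflow: you edit the transcript, and only the changed words need to be synthesized — the rest of the recording stays untouched.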

Speech Translation

Speech translation allows us to convert spoken words in one language into spoken words in another language. Essentially, it allows us to translate speech in real-time.

For example, we can use speech translation to translate spoken English into the French equivalent. We provide the speech-translation model with an audio recording saying “Hello World” as input. Then the model produces an audio recording saying “Bonjour le monde” as output.

Speech translation is useful anytime you need to translate one speaker’s language into a listener’s language. For example:

  • translating conversations in real-time
  • translating a keynote for an international audience
  • automating audio overdubs in multiple languages for video
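One common way to build speech translation is as a cascade of the three stages we've now seen: speech recognition, text translation, and speech synthesis. The sketch below wires hypothetical stand-ins for those three models together; note that some modern systems are trained end-to-end and skip the intermediate text entirely.

```python
def translate_speech(audio, recognize, translate, synthesize):
    """Cascade pipeline: speech recognition -> text translation -> TTS.

    The three callables are placeholders for real models; only the
    composition (audio in, translated audio out) is the point here.
    """
    text = recognize(audio)       # e.g. "Hello World"
    translated = translate(text)  # e.g. "Bonjour le monde"
    return synthesize(translated)

# Hypothetical stand-ins for the three real models:
recognize = lambda audio: "Hello World"
translate = lambda text: {"Hello World": "Bonjour le monde"}.get(text, text)
synthesize = lambda text: f"<audio:{text}>"

print(translate_speech(b"...", recognize, translate, synthesize))
# -> <audio:Bonjour le monde>
```

In a production system each stage introduces latency, which is why real-time translation often streams partial results through the pipeline rather than waiting for the speaker to finish.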

Other Tools

Beyond the three common examples we’ve seen, there are also a variety of other audio-synthesis tools. For example:

  • Sound generation – which allows us to add realistic sound effects to silent videos
  • Music generation – which allows us to compose entirely new songs from scratch
  • Instrument translation – which can take a song performed by one instrument and recreate it using another instrument

As we can see, audio-synthesis tools allow us to transform existing audio and generate new audio from scratch.


If you’d like to learn how to use all of the tools listed above, please watch my online course: The AI Developer’s Toolkit.

The future belongs to those who invest in AI today. Don’t get left behind!

Start Now!
