• AIPressRoom
  • Posts
  • Encode, decode and transcode audio or video content with AI

Encode, decode and transcode audio or video content with AI

Google Cloud Speech-to-Text is a fully managed service that converts speech to text in real time. It can be used to transcribe audio and video files, create subtitles for videos, and build voice-activated applications.

The service supports a wide range of audio formats, including WAV, MP3, and AAC. It can also transcribe audio in a variety of languages, including English, Spanish, French, German, Japanese and many more

Google Cloud Speech-to-Text is easy to use. You can simply upload your audio file to the service, and it will automatically transcribe it into text. You can also use the service to transcribe live audio, such as a phone call or a meeting. Speech-to-Text samples are given here for V1 and V2.

Problem

However, what if the input audio encoding is not supported by the STT API?

Supported audio encodings are given at https://cloud.google.com/speech-to-text/docs/encoding

Fortunately, there are a number of third-party tools available to assist with encoding conversion.

FFmpeg is a popular multimedia framework for handling audio and video data. It can be used to encode, decode, transcode, and stream audio and video content. In this blog, we will demonstrate how to use FFmpeg in various scenarios to obtain the correct encoding for calling an STT API.

  1. Running it locally from the command line

  2. Invoking it through a Python program

  3. Building a container image with GCP buildpacks and FFmpeg

  4. Running from Vertex AI Workbench

Running it locally from the command line

Download ffmpeg software from – and install it.

Take a sample input audio source is encoded in “acelp.kelvin” which is not supported by STT API

To determine how the audio source is encoded, run ffmpeg -i input.wav output will shown the encoding