How to Handle Mixed Music and Speech

Depending on your media and your objectives, a recording may contain a mix of music and speech. Whether the music should be preserved or removed depends on the context. The music may be incidental, coming from a public announcement system in the background or from a car that drives by during recording. In other cases, the music may have been added deliberately, for example in an introduction that sets the energy or mood of a podcast.

There are a few Media APIs that can help you decide how to handle mixed music and speech given the objectives of your project.

  • The Audio Diagnose API can identify whether your media contains music and how much
  • The Enhance API can either preserve or attempt to reduce or remove music

Detecting Music

The Audio Diagnose API can be used for content classification. Start a job by calling POST /media/diagnose, then check the results to determine the percentage of the media that is attributed to music, speech, or silence.

Here is an example of the type of data returned:

"music": {
    "percentage": 34.8
},
"silence": {
    "percentage": 1.6,
    "at_beginning": 0,
    "at_end": 0,
    "num_sections": 54,
    "silent_channels": []
},
"speech": {
    "percentage": 94,
    "events": {
        "plosive": 3739,
        "sibilance": 675
    }
}
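
As a minimal sketch of how these percentages might drive a decision, assume the results have already been parsed into a dictionary shaped like the excerpt above. The function names and the 10 percent threshold are illustrative assumptions, not part of the API, and a real response may nest these keys deeper.

```python
import json

# Sample shaped like the excerpt above (trimmed to the percentages).
# Assumption: a real response may wrap these keys in additional objects.
SAMPLE = json.loads("""
{
    "music": {"percentage": 34.8},
    "silence": {"percentage": 1.6},
    "speech": {"percentage": 94}
}
""")


def music_share(result: dict) -> float:
    """Return the music percentage, or 0.0 when the key is absent."""
    return float(result.get("music", {}).get("percentage", 0.0))


def has_significant_music(result: dict, threshold: float = 10.0) -> bool:
    """Report whether music makes up at least `threshold` percent.

    The threshold is a hypothetical cutoff chosen for illustration;
    tune it to the objectives of your project.
    """
    return music_share(result) >= threshold


if __name__ == "__main__":
    print(music_share(SAMPLE), has_significant_music(SAMPLE))
```

A check like this could decide whether a file needs music-aware settings before it is enhanced.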

If you want to know more about the music, the Analyze API can identify additional details such as start time, BPM, instruments, genres, and keys.

Preserving Music

The Media Enhance API will sometimes treat music as noise or otherwise unintended content when it optimizes for speech. You can change this behavior by specifying the content type as podcast.

The body for the POST /media/enhance request would look like this:

"content": {
    "type": "podcast"
}

That setting will dial back the aggressiveness when isolating speech. If you know the content has a lot of music, the music content type may be more appropriate. Many of the other content types are intended to improve speech quality and may not handle music as you intend.
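
The choice between the podcast and music content types could be sketched as follows. This only builds the "content" portion of the request body; any other fields the request needs are omitted. The function name and the 50 percent cutoff are assumptions made for illustration.

```python
import json


def content_body(music_percentage: float, music_heavy_at: float = 50.0) -> dict:
    """Build the "content" portion of a POST /media/enhance body.

    Picks the "music" content type when music dominates the media and
    "podcast" otherwise. The cutoff is a hypothetical value, not an API default.
    """
    content_type = "music" if music_percentage >= music_heavy_at else "podcast"
    return {"content": {"type": content_type}}


if __name__ == "__main__":
    # 34.8 is the music percentage from the Audio Diagnose example.
    print(json.dumps(content_body(34.8), indent=4))
```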

You can also enable music detection directly. When audio.music.detection.enable is set to true, the API will detect music in order to avoid possible distortions.

The body for the POST /media/enhance request in that case would look like this:

"audio": {
    "music": {
        "detection": {
            "enable": true
        }
    }
}

You might choose music detection over a content type because each content type also adjusts other settings that you may have already tuned for the content you intend to process.
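
One way to add music detection to a request body that has already been tuned, without disturbing the existing settings, is sketched below. The helper name and the "example_setting" key are hypothetical placeholders; only the audio.music.detection.enable path comes from the request body shown above.

```python
import copy


def with_music_detection(body: dict) -> dict:
    """Return a copy of `body` with audio.music.detection.enable set to True.

    Settings already present in the body are left untouched, and the
    original dictionary is not modified.
    """
    updated = copy.deepcopy(body)
    audio = updated.setdefault("audio", {})
    music = audio.setdefault("music", {})
    detection = music.setdefault("detection", {})
    detection["enable"] = True
    return updated


if __name__ == "__main__":
    # "example_setting" stands in for whatever you have already tuned.
    tuned = {"audio": {"example_setting": {"placeholder": 1}}}
    print(with_music_detection(tuned))
```

Working on a deep copy keeps the tuned body reusable for jobs where music detection is not wanted.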