Riffusion's AI generates music from text using visual sonograms

An AI-generated image of musical notes exploding from a computer screen. (Credit: Ars Technica)

On Thursday, two tech enthusiasts released Riffusion, an AI model that generates music from text prompts by creating a visual representation of sound and converting it to audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, applying visual latent diffusion to sound processing in a novel way.

Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms, which store audio in a two-dimensional image. In a sonogram, the X axis represents time (the order in which frequencies are played, from left to right) and the Y axis represents the frequency of sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that particular moment.
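The time/frequency/amplitude layout described above is just a short-time Fourier transform rendered as a 2-D array. A minimal numpy sketch (not Riffusion's actual code; the sample rate and FFT parameters here are illustrative) shows how a pure 440 Hz tone becomes such an "image":

```python
import numpy as np

# One second of a 440 Hz (A4) tone at an 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Short-time Fourier transform: slice the audio into overlapping
# windowed frames and take the spectrum of each one.
n_fft, hop = 512, 128
win = np.hanning(n_fft)
frames = [audio[i:i + n_fft] * win
          for i in range(0, len(audio) - n_fft + 1, hop)]
spec = np.abs(np.array([np.fft.rfft(f) for f in frames]).T)  # shape: (freq, time)

# Columns (X axis) are time frames, rows (Y axis) are frequency bins,
# and each cell holds the amplitude at that frequency and moment.
peak_bin = int(np.argmax(spec.sum(axis=1)))
peak_hz = peak_bin * sr / n_fft  # nearest bin to the 440 Hz tone
```

A pure tone shows up as a single bright horizontal line in `spec`; real music fills the image with richer structure, which is what the diffusion model learns to generate.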

Since a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model on sample sonograms linked to descriptions of the sounds or musical genres they represented. With this knowledge, Riffusion can generate new music on the fly based on text prompts describing the type of music or sound you want to hear, like "jazz," "rock," or even the sound of typing on a keyboard.

After generating the sonogram image, Riffusion uses Torchaudio to convert the sonogram into sound, playing it back as audio.
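Turning an amplitude-only image back into audio requires reconstructing the phase information the sonogram discards; the standard technique for this is the Griffin-Lim algorithm, which Torchaudio provides as `torchaudio.transforms.GriffinLim`. A minimal numpy sketch of the idea (a simplified illustration, not Riffusion's actual pipeline):

```python
import numpy as np

N_FFT, HOP = 512, 128
WIN = np.hanning(N_FFT)

def stft(x):
    """Magnitude-and-phase spectrogram, shape (freq, time)."""
    frames = [x[i:i + N_FFT] * WIN
              for i in range(0, len(x) - N_FFT + 1, HOP)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def istft(S):
    """Overlap-add inverse of stft()."""
    n = HOP * (S.shape[1] - 1) + N_FFT
    out, norm = np.zeros(n), np.zeros(n)
    for t in range(S.shape[1]):
        out[t * HOP:t * HOP + N_FFT] += np.fft.irfft(S[:, t], n=N_FFT) * WIN
        norm[t * HOP:t * HOP + N_FFT] += WIN ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32):
    """Estimate a waveform whose spectrogram magnitude matches `mag`."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # start with random phase
    for _ in range(n_iter):
        audio = istft(mag * phase)                    # guess a waveform...
        phase = np.exp(1j * np.angle(stft(audio)))    # ...keep its phase, re-impose mag
    return istft(mag * phase)

# Demo: round-trip a 440 Hz tone through its magnitude-only spectrogram.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
mag = np.abs(stft(tone))      # the amplitude "image" of the sound
recovered = griffin_lim(mag)  # audio rebuilt without the original phase
```

Each iteration nudges the random phase toward one consistent with the fixed magnitudes, which is why the process can turn a generated sonogram image into playable audio.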

A sonogram represents time, frequency, and amplitude in a two-dimensional image. (Credit: Riffusion)

"This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on spectrogram images paired with text," the makers of Riffusion write on its explanation page. "It can generate infinite variations of a prompt by varying the seed. All of the same web UI and techniques like img2img, inpainting, negative prompts, and interpolation work by default."
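The interpolation the quote mentions is commonly implemented by spherically interpolating between the Gaussian latent noise of two seeds, which keeps every in-between point statistically similar to what the model expects. A minimal numpy sketch of that standard trick (not Riffusion's actual code; the latent size is a hypothetical example):

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between two flattened latent vectors."""
    a_dir = a / np.linalg.norm(a)
    b_dir = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_dir, b_dir), -1.0, 1.0))
    if omega < 1e-6:                    # nearly parallel: plain lerp is fine
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Two latent noise vectors from different seeds, and a midpoint between them.
rng = np.random.default_rng(42)
lat_a = rng.standard_normal(4 * 64 * 64)  # hypothetical latent size
lat_b = rng.standard_normal(4 * 64 * 64)
midpoint = slerp(lat_a, lat_b, 0.5)
```

Sweeping `t` from 0 to 1 and decoding each interpolated latent yields a sequence of sonograms that morph smoothly from one prompt or seed to another.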

Visitors to the Riffusion website can experience the AI model through an interactive web application that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while displaying the spectrogram continuously on the left side of the page.

A screenshot of the Riffusion website, which lets you type prompts and hear the resulting sonograms.

