Riffusion's AI generates music from text using visual sonograms
On Thursday, two tech enthusiasts released Riffusion, an AI model that generates music from text prompts by creating a visual representation of sound and converting it to audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, applying visual latent diffusion to sound processing in a novel way.
Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms, which store audio in a two-dimensional image. In a sonogram, the X axis represents time (the order in which frequencies are played, from left to right) and the Y axis represents the frequency of sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that particular moment.
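The image layout described above can be sketched in a few lines of NumPy (an illustrative example, not Riffusion's actual code): a short-time Fourier transform of a signal yields a 2-D array whose columns are time steps and whose rows are frequency bins, with each cell's magnitude playing the role of pixel brightness.

```python
import numpy as np

def sonogram(signal, win=256, hop=128):
    """Magnitude spectrogram: rows = frequency, columns = time."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    # The FFT of each windowed frame becomes one column of the image.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 440 Hz tone sampled at 8 kHz concentrates its energy in one row:
sr = 8000
t = np.arange(sr) / sr
img = sonogram(np.sin(2 * np.pi * 440 * t))
peak_row = img.mean(axis=1).argmax()
print(peak_row * sr / 256)  # brightest row maps to roughly 440 Hz
```

Each frequency bin spans sr/win = 31.25 Hz here, so the 440 Hz tone lands in row 14 of the image.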
Since a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model on sample sonograms linked to descriptions of the sounds or musical genres they represented. With this knowledge, Riffusion can generate new music on the fly from text prompts describing the type of music or sound you want to hear, such as "jazz," "rock," or even the sound of typing on a keyboard.
After generating the sonogram image, Riffusion uses Torchaudio to convert the sonogram back into sound, playing it back as audio.
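Going from a magnitude image back to a waveform requires reconstructing the phase information the image discards; the standard technique for this is the Griffin-Lim algorithm, which Torchaudio ships as `torchaudio.transforms.GriffinLim`. Below is a minimal NumPy sketch of the same idea, for illustration only (this is not Riffusion's code): iteratively alternate between enforcing the target magnitudes and re-estimating phase from the resulting signal.

```python
import numpy as np

def stft(x, win=256, hop=64):
    """Complex short-time Fourier transform (rows = frequency, cols = time)."""
    w = np.hanning(win)
    n = (len(x) - win) // hop + 1
    return np.array([np.fft.rfft(x[i * hop:i * hop + win] * w)
                     for i in range(n)]).T

def istft(S, win=256, hop=64):
    """Inverse STFT via windowed overlap-add."""
    w = np.hanning(win)
    x = np.zeros(win + (S.shape[1] - 1) * hop)
    norm = np.zeros_like(x)
    for i in range(S.shape[1]):
        x[i * hop:i * hop + win] += np.fft.irfft(S[:, i], n=win) * w
        norm[i * hop:i * hop + win] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, win=256, hop=64):
    """Recover a waveform whose spectrogram magnitude approximates `mag`."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        x = istft(mag * phase, win, hop)          # impose target magnitudes
        phase = np.exp(1j * np.angle(stft(x, win, hop)))  # keep only phase
    return istft(mag * phase, win, hop)
```

As a sanity check, feeding in the magnitude spectrogram of a pure 440 Hz tone yields a waveform whose spectrum peaks near 440 Hz again, even though the phase was thrown away.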
"This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on spectrogram images paired with text," the makers of Riffusion write on its explanation page. "It can generate infinite variations of a prompt by varying the seed. All of the same web UI and techniques like img2img, inpainting, negative prompts, and interpolation work by default."
Visitors to the Riffusion website can experience the AI model through an interactive web application that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time, displaying the scrolling spectrogram on the left side of the page.