Stability AI Introduces Stable Audio for Text-to-Audio Generation
Stability AI has launched Stable Audio, a new product that lets users generate short audio clips from text prompts. The company is best known for its Stable Diffusion text-to-image technology and continues to expand its offerings. In July, Stable Diffusion was updated with the SDXL base model, which improved its image composition. In August, Stability AI moved into code generation with StableCode.
The foundation of Stable Audio closely mirrors the AI methods Stable Diffusion uses for image creation. Specifically, Stable Audio employs a diffusion model trained on audio rather than images, which allows it to generate new audio clips.
Discussing the product's conception, Ed Newton-Rex, VP of Audio at Stability AI, mentioned, “Users simply articulate the desired music or audio through text, and our platform crafts it.” Newton-Rex, a seasoned entrepreneur in computer-generated music, founded Jukedeck in 2011, which TikTok acquired in 2019.
Despite Newton-Rex's background, Stable Audio's technology does not trace back to Jukedeck. Its origins lie in Harmonai, Stability AI’s music research lab, which was created by Zach Evans. Evans explained that the work involves taking the ideas behind image generation and applying them to audio.
Generating basic audio tracks with software is not a new idea, but Stability AI's approach differs from earlier ones. Past systems typically relied on symbolic generation using MIDI files, which often results in repetitive tunes. Stable Audio instead generates the audio itself, which lets it produce more refined music. The model was trained on 800,000 licensed tracks from the audio library AudioSparx.
However, Stable Audio won’t generate music reminiscent of specific artists like the Beatles. Musicians, Newton-Rex pointed out, tend to seek originality rather than mimicry.
The Stable Audio model has around 1.2 billion parameters, roughly on par with the initial release of the Stable Diffusion image model. The text model that interprets prompts was also built and trained by Stability AI, using a technique called Contrastive Language Audio Pretraining (CLAP). Accompanying the Stable Audio release is a prompt guide that helps users write text prompts that produce the audio they want.
Stable Audio is available through a free tier or a $12/month Pro plan. The free tier permits 20 generations per month, each up to 20 seconds long, while the Pro plan offers 500 generations per month with tracks up to 90 seconds long.
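To put the two tiers side by side, the short Python sketch below multiplies each tier's monthly generation count by its maximum track length. The `Tier` class and its field names are made up for this illustration and are not part of any Stability AI API; the numbers are the ones quoted above, and the result is an upper bound that assumes every generation runs to the maximum duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical container for the plan limits quoted in the article."""
    name: str
    monthly_price_usd: float
    generations_per_month: int
    max_seconds_per_track: int

    def max_audio_seconds(self) -> int:
        # Upper bound: every generation uses the full allowed duration.
        return self.generations_per_month * self.max_seconds_per_track


FREE = Tier("Free", 0.0, 20, 20)
PRO = Tier("Pro", 12.0, 500, 90)

for tier in (FREE, PRO):
    minutes = tier.max_audio_seconds() / 60
    print(f"{tier.name}: up to {minutes:.1f} minutes of audio per month")
```

Under these figures the free tier tops out at 400 seconds of audio per month and the Pro plan at 45,000 seconds, or 750 minutes.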
Newton-Rex emphasized the company's aim to make this groundbreaking technology accessible to all for exploration and experimentation.